48 releases
Version | Released |
---|---|
0.1.52 (new) | Jan 10, 2025 |
0.1.47 | Dec 10, 2024 |
0.1.40 | Nov 20, 2024 |
0.1.24 | Jul 22, 2024 |
0.1.0 | Dec 16, 2022 |
Cheminée
The chemistry search stack. Index chemical structures and include arbitrary extra properties. Fast to search and your callers don't need RDKit.
- RDKit provides the smarts
- Rust provides the type safety and relative ease of use
See rdkit-sys for more on how the bindings work.
Tested on Rust Stable 1.76 (rustup default 1.76).
Intended Functionality
Cheminée is intended to work via CLI (e.g. for diagnostics) and API endpoints. It currently supports CLI and API endpoints for "identity search" (i.e. exact structure match), "substructure search", and "superstructure search". We plan to create a Tanimoto-based "similarity search" endpoint in the near future.
Aside from structure searching, the API also supports endpoints for bulk standardization of SMILES, as well as molecular format conversions (e.g. smiles-to-molblock and molblock-to-smiles). Indexing endpoints (e.g. index creation, bulk indexing, compound deletion, index deletion) are executed using Tantivy.
The "basic search" (i.e. non-structure search) endpoint lets you search for compounds by RDKit chemical descriptors (e.g. "exactmw", "NumAtoms", etc.) or by any other metadata you included when indexing.
The API
To boot up the API server, you can run the following:
cargo run --color=always --release --package cheminee --bin cheminee -- rest-api-server
The simplest way to test out the API is by going to localhost:4001 in your browser and testing out the functionality of the different endpoints.
This repo is also set up to automatically update a Ruby gem, "cheminee-ruby", whenever a new release is published. The gem is particularly helpful for interacting with the API programmatically from Ruby.
See cheminee-ruby for more information on how the gem works.
The CLI
Assuming compounds are already present in an index path, you can search by:
Basic Search
For example:
cargo run -- basic-search -i "/tmp/cheminee/index0" -q "exactmw: [10 TO 10000] AND NumAtoms: [8 TO 100]" -l 10
Here, "i" refers to the index path, "q" refers to a composite query of chemical descriptor values and/or other indexed data types, and "l" refers to the number of desired results (defaulted to 1000).
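To make the query semantics concrete, here is a minimal Rust sketch of what a composite range query like the one above selects, assuming the bracketed ranges are inclusive on both ends (as in Lucene-style query syntax). The Compound struct, its fields, and basic_search are hypothetical names used only for illustration; this is not how Cheminée or Tantivy evaluate queries internally.

```rust
/// Hypothetical record holding two indexed descriptors (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Compound {
    pub smiles: String,
    pub exactmw: f64,
    pub num_atoms: u64,
}

/// Keeps compounds whose descriptors fall inside both inclusive ranges,
/// mirroring "exactmw: [lo TO hi] AND NumAtoms: [lo TO hi]", and returns
/// at most `limit` of them (the role played by the "l" argument).
pub fn basic_search(
    compounds: &[Compound],
    exactmw: (f64, f64),
    num_atoms: (u64, u64),
    limit: usize,
) -> Vec<&Compound> {
    compounds
        .iter()
        // First clause of the AND: the molecular weight range.
        .filter(|c| c.exactmw >= exactmw.0 && c.exactmw <= exactmw.1)
        // Second clause of the AND: the atom count range.
        .filter(|c| c.num_atoms >= num_atoms.0 && c.num_atoms <= num_atoms.1)
        .take(limit)
        .collect()
}
```

In this sketch a record is returned only if it satisfies both clauses, and the limit is applied after filtering, so a limit of 10 yields up to 10 matching compounds.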
Substructure Search
For example:
cargo run -- substructure-search -i /tmp/cheminee/index0 -s CCC -r 10 -t 10 -u true -e "exactmw: [20 TO 200]"
As in basic search, "i" refers to the index path. "s" refers to the query SMILES, "r" refers to the number of desired results (defaulted to 1000), "t" refers to the number of tautomers to be used for the query SMILES if applicable (defaulted to 0), "u" dictates whether to use indexed scaffolds to speed up the search (defaulted to "true"), and "e" refers to the "extra query", which is a composite query for chemical descriptors or other indexed data types, as in the basic search implementation.
Superstructure Search
For example:
cargo run -- superstructure-search -i /tmp/cheminee/index0 -s c1ccccc1CCc1ccccc1CC -r 10 -t 10 -u true -e "exactmw: [20 TO 200]"
The input arguments here are the same as used for substructure search.
Identity Search
For example:
cargo run -- identity-search -i /tmp/cheminee/index0 -s c1ccccc1CC -e "exactmw: [20 TO 200]" -u true
Note: identity search does not need a number of desired results (we assume you only want one), nor any tautomers (we look for the canonical tautomer of the query). The extra query above is also optional. It is useful when you have quasi-duplicate compound records (e.g. same molecule, but some differing metadata) and want a specific one; without it, Cheminée stops searching as soon as it finds an exact structure match, even if other duplicate structures are present in the database.
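The early-stop behaviour and the role of the extra query can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not Cheminée's implementation: Record, its fields, and identity_search are hypothetical, and the query is assumed to be canonicalized already (in Cheminée that work is done through RDKit).

```rust
/// Hypothetical indexed record (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub canonical_smiles: String,
    /// Example of extra metadata that can distinguish quasi-duplicates.
    pub source: String,
}

/// Returns the first record whose canonical SMILES equals the query and that
/// also satisfies the `extra` predicate. `find` stops at the first hit, so
/// any later duplicates are never inspected.
pub fn identity_search<'a>(
    records: &'a [Record],
    query_canonical_smiles: &str,
    extra: impl Fn(&Record) -> bool,
) -> Option<&'a Record> {
    records
        .iter()
        .find(|&r| r.canonical_smiles == query_canonical_smiles && extra(r))
}
```

Passing a predicate that always returns true models a search without an extra query: the first structural match wins. A predicate on the metadata, such as one checking the source field, models narrowing down to one of several duplicate structures.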
Similarity Search
For example:
cargo run -- similarity-search -i /tmp/cheminee/index0 -s c1ccccc1CC -r 10 -t 10 -p 0.1 -m 0.4 -e "exactmw: [20 TO 200]"
There are some additional terms for this (Tanimoto-based) similarity search. "p" denotes the database percentage to search whereas "m" denotes the minimum Tanimoto similarity score to consider for "similar" compounds. Note that in the case of a non-zero number of tautomers specified with term "t", Cheminée will attempt to use that number of tautomers for the search to maximize the chance of finding similar molecules.
Note also that this similarity search does not rely on brute-force searching. In other words, for a given query compound, we do NOT compute Tanimoto similarities against every compound in the database. Instead, we use a neural network encoder model to embed the query compound's Morgan fingerprint into a Gaussian-distributed latent space. We have also precomputed 10,000 cluster centroids from 3 million fingerprint embeddings in that latent space. Given the query compound's embedding, we rank the clusters by Euclidean distance from it and search the top "p" percent of those ranked clusters for similar compounds. These candidate compounds are then Tanimoto-scored against the query compound before output. The fingerprint-to-latent-space embedding and cluster assignment are provided by our cheminee-similarity-model crate dependency.
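The cluster-then-score flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in Cheminée or cheminee-similarity-model: the encoder is left out (the query embedding is taken as an input), "p" is treated as a fraction between 0 and 1 (an assumption), and Entry, closest_clusters, and similarity_search are hypothetical names.

```rust
/// Tanimoto similarity between two bit-packed fingerprints of equal length:
/// |A AND B| / |A OR B|. Two all-zero fingerprints score 0.0 by convention here.
pub fn tanimoto(a: &[u64], b: &[u64]) -> f32 {
    let (mut both, mut either) = (0u32, 0u32);
    for (x, y) in a.iter().zip(b.iter()) {
        both += (x & y).count_ones();
        either += (x | y).count_ones();
    }
    if either == 0 { 0.0 } else { both as f32 / either as f32 }
}

/// Ranks centroid indices by squared Euclidean distance to the query embedding
/// and keeps the closest `fraction` of them (at least one).
pub fn closest_clusters(query: &[f32], centroids: &[Vec<f32>], fraction: f32) -> Vec<usize> {
    let mut ranked: Vec<(usize, f32)> = centroids
        .iter()
        .enumerate()
        .map(|(i, c)| {
            let d2 = query.iter().zip(c).map(|(q, v)| (q - v).powi(2)).sum::<f32>();
            (i, d2)
        })
        .collect();
    ranked.sort_by(|l, r| l.1.total_cmp(&r.1));
    let keep = ((centroids.len() as f32 * fraction).ceil() as usize)
        .clamp(1, centroids.len().max(1));
    ranked.into_iter().take(keep).map(|(i, _)| i).collect()
}

/// Hypothetical indexed entry: its assigned cluster and its fingerprint.
pub struct Entry {
    pub id: usize,
    pub cluster: usize,
    pub fingerprint: Vec<u64>,
}

/// Scores only the entries that live in the closest clusters, then keeps
/// those at or above `min_score`, best score first.
pub fn similarity_search(
    query_fp: &[u64],
    query_embedding: &[f32],
    centroids: &[Vec<f32>],
    entries: &[Entry],
    fraction: f32,
    min_score: f32,
) -> Vec<(usize, f32)> {
    let clusters = closest_clusters(query_embedding, centroids, fraction);
    let mut hits: Vec<(usize, f32)> = entries
        .iter()
        .filter(|e| clusters.contains(&e.cluster))
        .map(|e| (e.id, tanimoto(query_fp, &e.fingerprint)))
        .filter(|(_, score)| *score >= min_score)
        .collect();
    hits.sort_by(|l, r| r.1.total_cmp(&l.1));
    hits
}
```

The trade-off in this sketch is the same one the "p" argument exposes: a smaller fraction means fewer Tanimoto comparisons, at the risk of missing similar compounds that were assigned to a cluster outside the searched set.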
Testing in Docker
Run the Cheminée image in a Docker container. Note: in this command, logging is turned off, so you won't be able to follow along during the SDF indexing step below. If you prefer to follow along, replace "RUST_LOG=off" with "RUST_LOG=info", but note that performance will be a bit slower:
docker run --rm -dt -e RUST_LOG=off -p 4001:4001 --name cheminee ghcr.io/rdkit-rs/cheminee:0.1.31
Exec into the container:
docker exec -it cheminee bash
Check out the CLI endpoints:
cheminee -h
Fetch some PubChem SDFs for testing. Each file has ~400K+ compounds; use <ctrl + c> to stop when you're happy with the number of files:
mkdir -p tmp/sdfs
cheminee fetch-pubchem -d tmp/sdfs
Create an index. We only have one schema at the moment (i.e. "descriptor_v1"):
cheminee create-index -i tmp/cheminee/index0 -n descriptor_v1 -s exactmw
Start indexing an SDF file. Note: Cheminée does a bulk write after every 1,000 compounds. So if you <ctrl + c> interrupt very soon after you start the indexing, you might end up with no indexed compounds. If you want to follow along and kill early for some simple testing, replace "RUST_LOG=off" with "RUST_LOG=info" in the docker run command above. Once you see a statement such as "10000 compounds processed so far" and you are happy with the number, then feel free to interrupt the indexing:
cheminee index-sdf -s tmp/sdfs/Compound_000000001_000500000.sdf.gz -i tmp/cheminee/index0
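The reason an early interrupt can leave the index empty is easier to see in code. Below is a minimal Rust sketch of batched writing, assuming a buffer that is flushed every batch_size items; index_in_batches and the write_batch callback are hypothetical stand-ins, and the final flush of a partial batch is an assumption rather than documented Cheminée behaviour.

```rust
/// Feeds items to `write_batch` in groups of `batch_size`, then flushes any
/// remainder at the end. Returns the total number of items written.
pub fn index_in_batches<T>(
    items: impl IntoIterator<Item = T>,
    batch_size: usize,
    mut write_batch: impl FnMut(&[T]),
) -> usize {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut buffer: Vec<T> = Vec::with_capacity(batch_size);
    let mut written = 0;
    for item in items {
        // Items sit in memory until the buffer is full; a process killed at
        // this point loses everything still in the buffer.
        buffer.push(item);
        if buffer.len() == batch_size {
            write_batch(&buffer);
            written += buffer.len();
            buffer.clear();
        }
    }
    // Flush the final, possibly partial, batch (assumed behaviour).
    if !buffer.is_empty() {
        write_batch(&buffer);
        written += buffer.len();
    }
    written
}
```

With a batch size of 1,000, nothing reaches the writer until the 1,000th compound has been processed, which matches the advice above to wait for a progress message before interrupting.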
Go to "localhost:4001" in your favorite browser to test out the API endpoints. Note: for this test case, use "index0" for the index fields. Alternatively, test out the CLI some more:
cheminee -h
cheminee <CLI action> -h
Cutting A New Release
cargo release patch --tag-prefix='' --execute