Chemical Similarity Search Tool

Search inputs

Query compound name

Molecular formula (optional)

Include atom-count features

Parsed from the formulas in the query and the dataset.

Molecular weight (g/mol)

logP

H-bond donors (HBD)

H-bond acceptors (HBA)

Topological polar surface area (TPSA)

Rotatable bonds

Aromatic rings

Similarity metric

Scaling

Missing values

Similarity threshold (0 to 1)

Top N results

Feature weights

Increase weight to emphasize a descriptor in the score.

Weight: Molecular weight

Weight: logP

Weight: HBD

Weight: HBA

Weight: TPSA

Weight: Rotatable bonds

Weight: Aromatic rings

Weight: Formula atom counts

Compound library (CSV)

Columns: name, formula, mw, logp, hbd, hba, tpsa, rotb, arom

Tip: Paste your own CSV lines. Headers are case-insensitive.
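As a rough illustration of what "case-insensitive headers" can mean in practice, the sketch below reads pasted CSV text with Python's standard csv module and lower-cases the header names before looking up the template columns. The helper name parse_library and the choice to treat blank or non-numeric cells as missing are assumptions for this example, not a description of the tool's internals.

```python
import csv
import io

# Numeric column names taken from the template above.
NUMERIC_COLUMNS = ["mw", "logp", "hbd", "hba", "tpsa", "rotb", "arom"]

def parse_library(csv_text):
    """Parse pasted CSV text into a list of compound dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    compounds = []
    for row in reader:
        # Lower-case and trim header names so "MW", "mw", and " Mw " all match.
        clean = {
            k.strip().lower(): (v or "").strip()
            for k, v in row.items()
            if k is not None
        }
        compound = {"name": clean.get("name", ""), "formula": clean.get("formula", "")}
        for col in NUMERIC_COLUMNS:
            try:
                compound[col] = float(clean.get(col, ""))
            except ValueError:
                # Assumption: blank or non-numeric cells count as missing values.
                compound[col] = None
        compounds.append(compound)
    return compounds

sample = """Name,Formula,MW,logP,HBD,HBA,TPSA,RotB,Arom
Caffeine,C8H10N4O2,194.19,-0.1,0,4,61.8,0,2
Ethanol,C2H6O,46.07,-0.3,1,1,20.2,0,0
"""
library = parse_library(sample)
print(library[0]["name"], library[0]["mw"], library[1]["logp"])
```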

Example data table

This sample matches the dataset template used by the tool.

Name	Formula	MW	logP	HBD	HBA	TPSA	RotB	Arom
Caffeine	C8H10N4O2	194.19	-0.1	0	4	61.8	0	2
Aspirin	C9H8O4	180.16	1.2	1	4	63.6	3	1
Acetaminophen	C8H9NO2	151.16	0.5	1	2	49.3	1	1
Ibuprofen	C13H18O2	206.28	3.5	1	2	37.3	4	1
Benzene	C6H6	78.11	2.1	0	0	0.0	0	1
Ethanol	C2H6O	46.07	-0.3	1	1	20.2	0	0

Formulas used

This tool converts each compound into a weighted feature vector. Features can include numeric descriptors and optional atom-count features parsed from the molecular formula.

Z-score scaling

x' = (x − μ) / σ

Computed per feature using the dataset values.
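A minimal sketch of per-feature z-score scaling follows, using the MW column from the example data table. It assumes the population standard deviation and maps a constant column to zeros; the tool may handle either case differently.

```python
from statistics import mean, pstdev

def zscore_column(values):
    """Scale one feature column to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # Assumption: a constant column carries no information, so map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# MW values from the example data table.
mw = [194.19, 180.16, 151.16, 206.28, 78.11, 46.07]
scaled = zscore_column(mw)
print([round(v, 2) for v in scaled])
```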

Cosine similarity

sim = (A · B) / (||A|| ||B||)

Good for comparing overall direction of descriptor patterns.

Tanimoto (continuous)

sim = (A · B) / (||A||² + ||B||² − A · B)

Popular for similarity scoring when features are non-negative.

The Euclidean option uses distance = √(Σ(Aᵢ − Bᵢ)²) and converts it to a similarity as 1/(1 + distance).
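The three formulas above can be written out directly. This is a plain-Python sketch with illustrative function names; returning 0 when a denominator is zero is an assumption made here to avoid division errors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0  # assumption: zero vector scores 0

def tanimoto_similarity(a, b):
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0  # assumption: 0 when undefined

def euclidean_similarity(a, b):
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)

# Two vectors pointing in the same direction, one twice as long as the other.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]
print(cosine_similarity(a, b))     # 1.0: direction is identical
print(tanimoto_similarity(a, b))   # 18 / 27, about 0.667
print(euclidean_similarity(a, b))  # 0.25: distance is 3
```

The example shows how the metrics disagree: cosine ignores the difference in magnitude, while Tanimoto and Euclidean both penalize it.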

How to use this calculator

Enter your query compound details. Add as many descriptors as you trust.
Paste a CSV library of compounds and keep the header row.
Choose a similarity metric, scaling method, and missing-value behavior.
Adjust weights to emphasize properties that matter for your search.
Set a threshold and Top N, then run the similarity search.
Download CSV or PDF from the results card.
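The threshold and Top N step can be pictured as a filter followed by a sort. The sketch below assumes the threshold is inclusive and uses made-up placeholder scores, not output from the tool; select_hits is a hypothetical name.

```python
def select_hits(scored, threshold=0.30, top_n=25):
    """Keep rows at or above the threshold, best first, limited to top_n."""
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]

# Placeholder scores for illustration only.
scored = [
    ("Aspirin", 0.81),
    ("Benzene", 0.12),
    ("Ibuprofen", 0.64),
    ("Ethanol", 0.28),
    ("Acetaminophen", 0.77),
]
hits = select_hits(scored, threshold=0.30, top_n=2)
print(hits)
```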

Why similarity search matters in chemistry

Similarity screening speeds early discovery by ranking close analogs. In a 5,000-compound library, narrowing to the top 50 candidates reduces the number of compounds sent for wet-lab follow-up by 99%. Researchers use similarity to find replacements for restricted solvents, identify scaffold hops, and cluster impurities around a known contaminant. This tool keeps the workflow lightweight: paste descriptors, score, and export.

Descriptors used by this tool

The calculator combines physicochemical descriptors with optional formula atom counts. Typical ranges are MW 40 to 800 g/mol, logP -1.0 to 6.0, TPSA 0 to 200 Å², HBD 0 to 8, HBA 0 to 12, rotatable bonds 0 to 15, and aromatic rings 0 to 6. When formulas are supplied, counts for C, H, N, O, S, P, F, Cl, Br, and I add compositional context.
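One way to turn a formula into atom-count features is a small regular expression over element symbols. The sketch below is an assumption about how such parsing could work; it handles simple formulas such as C8H10N4O2 and does not attempt parentheses, charges, or hydrates.

```python
import re

# Elements listed in the text above.
ELEMENTS = ["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"]

# An element symbol (capital letter, optional lowercase letter) and an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def atom_counts(formula):
    """Count atoms of the tracked elements; other symbols are ignored."""
    counts = {element: 0 for element in ELEMENTS}
    for symbol, number in TOKEN.findall(formula):
        if symbol in counts:
            counts[symbol] += int(number) if number else 1
    return counts

caffeine = atom_counts("C8H10N4O2")
print(caffeine)
```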

Scaling and weighting for fair comparisons

Raw descriptors can differ by orders of magnitude, so scaling matters. Z-score scaling centers each feature to mean 0 and standard deviation 1, preventing MW from dominating TPSA. Min-max scaling maps features to 0-1, which is useful for bounded inputs. Weights let you emphasize what matters; for example, set logP weight to 2.0 when permeability drives similarity.
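For comparison with z-scores, here is a short sketch of min-max scaling applied to the TPSA column of the example data table. Mapping a constant column to zeros is an assumption of this example.

```python
def minmax_column(values):
    """Map one feature column onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to zeros.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# TPSA values from the example data table.
tpsa = [61.8, 63.6, 49.3, 37.3, 0.0, 20.2]
scaled_tpsa = minmax_column(tpsa)
print([round(v, 3) for v in scaled_tpsa])
```

Unlike z-scores, min-max output is never negative, which matters for metrics that expect non-negative features.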

Choosing a similarity metric

Cosine similarity compares vector direction and is robust when magnitudes vary. Euclidean similarity converts distance to 1/(1+distance), giving intuitive decay as compounds diverge. Continuous Tanimoto is widely used for chemical fingerprints and behaves well when features are non-negative. If your features contain negative values after scaling, as z-scores do, continuous Tanimoto can fall below 0, so cosine is usually the safer choice.
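A two-feature example makes the non-negativity caveat concrete. For two vectors that point in exactly opposite directions, continuous Tanimoto gives -1/3 rather than a value between 0 and 1, while cosine gives -1. The function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tanimoto(a, b):
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Opposite z-score patterns: one compound is above the mean where the other is below.
a = [1.0, -1.0]
b = [-1.0, 1.0]
print(tanimoto(a, b))  # -2 / (2 + 2 + 2) = -1/3
print(cosine(a, b))    # about -1
```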

Interpreting the ranked output

The table reports similarity, distance, and the number of features used. A score near 1.0000 indicates a close match under the selected settings, while values below 0.3000 often signal weak resemblance for mixed descriptor sets. Check the features-used count when missing values exist; a high score from only two features can be misleading. Use the plot to spot score drop-offs and select a rational cutoff.

Good practices for real datasets

Curate inputs before scoring. Standardize units, remove salts, and keep consistent protonation states. For QSAR-style libraries, include at least 200 rows, and preferably 500 or more, so the scaling statistics are stable. Start with threshold 0.30 and Top N 25, then tighten as you validate hits. Always confirm candidates with structure-based methods or expert review. Document assumptions to ensure repeatable decisions. Keep versioned datasets, parameters, and exported reports for audits.

FAQs

1) What inputs are required for a reliable search?

Provide at least two numeric descriptors or a formula, plus a dataset with matching columns. More filled descriptors usually improve discrimination and reduce ties among similar candidates.

2) When should I include molecular formula features?

Enable formula features when your dataset contains consistent formulas and you want composition to influence ranking. It helps separate compounds with very different elemental makeups. It cannot distinguish isomers, which share the same atom counts, and it does not capture bond connectivity.

3) How do weights change the ranking?

Weights multiply each feature before scoring. Increasing a weight makes that descriptor contribute more to similarity, so compounds closer on that dimension rise in rank while others drop.
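The effect of a weight on ranking can be seen with two hypothetical candidates described by scaled [logP, TPSA] values. The numbers are invented for illustration, and Euclidean similarity is used only because its arithmetic is easy to follow.

```python
import math

def euclidean_similarity(a, b):
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)

def rank(query, candidates, weights):
    """Multiply every feature by its weight, then sort names by similarity."""
    wq = [w * x for w, x in zip(weights, query)]
    scored = []
    for name, vector in candidates.items():
        wv = [w * x for w, x in zip(weights, vector)]
        scored.append((euclidean_similarity(wq, wv), name))
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Hypothetical scaled descriptors in the order [logP, TPSA].
query = [0.0, 0.0]
candidates = {
    "A": [0.1, 0.9],  # close to the query in logP, far in TPSA
    "B": [0.8, 0.3],  # far in logP, close in TPSA
}
print(rank(query, candidates, [1.0, 1.0]))  # ['B', 'A']
print(rank(query, candidates, [2.0, 1.0]))  # ['A', 'B'] once logP counts double
```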

4) Why do some rows show fewer features used?

If missing values are ignored, any descriptor absent in the query or a row is skipped. The features-used count shows how many paired values actually contributed to the score.
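A sketch of the "ignore" behavior: only positions where both the query and the row have a value are scored, and the number of such pairs is returned alongside the score. Representing missing values as None and returning a score of 0 when no pairs remain are assumptions of this example.

```python
import math

def paired_features(query, row):
    """Keep only positions where both vectors have a value."""
    pairs = [(q, r) for q, r in zip(query, row) if q is not None and r is not None]
    return [q for q, _ in pairs], [r for _, r in pairs]

def similarity_ignoring_missing(query, row):
    """Return (Euclidean similarity, features used) over the paired values."""
    q, r = paired_features(query, row)
    if not q:
        return 0.0, 0  # assumption: nothing to compare scores 0
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(q, r)))
    return 1.0 / (1.0 + distance), len(q)

query = [0.5, None, 1.0, 0.0]
row = [0.5, 2.0, None, 0.0]
score, used = similarity_ignoring_missing(query, row)
print(score, used)  # 1.0 2
```

Here the score is a perfect 1.0, yet only two of the four features were compared, which is why the count deserves a look before trusting a high score.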

5) Which metric should I choose for descriptor data?

Start with cosine for mixed, scaled descriptors and use Tanimoto when features are non-negative and you want fingerprint-style behavior. Euclidean is useful when you prefer distance-based decay.

6) How should I set a practical similarity threshold?

Use the Plotly bar chart to find the first sharp score drop. Common starting thresholds are 0.30 to 0.50 for scaled descriptors, then adjust based on validation and domain expectations.
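The same visual check can be approximated numerically by looking for the first large gap between consecutive sorted scores. The function name, the min_gap value of 0.15, and the scores below are all hypothetical and chosen only to illustrate the idea.

```python
def first_sharp_drop(scores, min_gap=0.15):
    """Return the score just above the first gap of at least min_gap, or None."""
    ordered = sorted(scores, reverse=True)
    for higher, lower in zip(ordered, ordered[1:]):
        if higher - lower >= min_gap:
            return higher
    return None

# Placeholder scores for illustration only.
scores = [0.92, 0.88, 0.85, 0.52, 0.49, 0.31]
cutoff = first_sharp_drop(scores)
print(cutoff)  # 0.85: the next score, 0.52, is a drop of 0.33
```

Setting the threshold at or just below the returned value keeps everything above the drop.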