Find lookalike chemicals with flexible metrics and weights. Paste your dataset, then search instantly here. Download clean reports for meetings, audits, and research teams.
This sample matches the dataset template used by the tool.
| Name | Formula | MW | logP | HBD | HBA | TPSA | RotB | Arom |
|---|---|---|---|---|---|---|---|---|
| Caffeine | C8H10N4O2 | 194.19 | -0.1 | 0 | 4 | 61.8 | 0 | 2 |
| Aspirin | C9H8O4 | 180.16 | 1.2 | 1 | 4 | 63.6 | 3 | 1 |
| Acetaminophen | C8H9NO2 | 151.16 | 0.5 | 1 | 2 | 49.3 | 1 | 1 |
| Ibuprofen | C13H18O2 | 206.28 | 3.5 | 1 | 2 | 37.3 | 4 | 1 |
| Benzene | C6H6 | 78.11 | 2.1 | 0 | 0 | 0.0 | 0 | 1 |
| Ethanol | C2H6O | 46.07 | -0.3 | 1 | 1 | 20.2 | 0 | 0 |
This tool converts each compound into a weighted feature vector. Features can include numeric descriptors and optional atom-count features parsed from the molecular formula.
The Euclidean option uses distance = √(Σ(Aᵢ − Bᵢ)²) and converts it to a similarity of 1/(1 + distance).
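As a minimal sketch of the Euclidean formula above, the function below computes the distance and the 1/(1 + distance) similarity for two plain Python lists. The function name `euclidean_similarity` and the example vectors are illustrative assumptions, not part of the tool.

```python
import math

def euclidean_similarity(a, b):
    """Return (distance, similarity) for two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # distance = sqrt(sum((A_i - B_i)^2))
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # similarity = 1 / (1 + distance), so identical vectors score 1.0
    return distance, 1.0 / (1.0 + distance)

# Hypothetical feature vectors (assumed to be already scaled and weighted).
distance, similarity = euclidean_similarity([0.0, 3.0, 1.0], [4.0, 0.0, 1.0])
print(distance, similarity)  # distance is 5.0, similarity is 1/6
```

Identical vectors give a distance of 0 and a similarity of 1, and the similarity approaches 0 as the distance grows.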
Similarity screening speeds early discovery by ranking close analogs. In a 5,000-compound library, filtering to the top 50 candidates leaves 1% of the library for follow-up, a 99% reduction in the number of compounds to test. Researchers use similarity to find replacements for restricted solvents, identify scaffold hops, and cluster impurities around a known contaminant. This tool keeps the workflow lightweight: paste descriptors, score, and export.
The calculator combines physicochemical descriptors with optional formula atom counts. Typical ranges are MW 40 to 800 g/mol, logP -1.0 to 6.0, TPSA 0 to 200 Å², HBD 0 to 8, HBA 0 to 12, rotatable bonds 0 to 15, and aromatic rings 0 to 6. When formulas are supplied, counts for C, H, N, O, S, P, F, Cl, Br, and I add compositional context.
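One way to derive the atom-count features is sketched below. It assumes simple formulas without brackets, charges, or hydrates, and the tool's actual parser may behave differently. The names `ELEMENTS` and `atom_counts` are hypothetical.

```python
import re
from collections import Counter

# Element order follows the list in the text above.
ELEMENTS = ["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"]

def atom_counts(formula):
    """Count atoms in a flat formula such as 'C8H10N4O2'.

    Assumption: no parentheses, charges, or hydrate notation.
    Elements outside ELEMENTS are parsed but not returned.
    """
    counts = Counter()
    # An element symbol is one uppercase letter plus an optional lowercase
    # letter, followed by an optional count (a missing count means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return [counts.get(element, 0) for element in ELEMENTS]

print(atom_counts("C8H10N4O2"))  # caffeine
print(atom_counts("C2H6O"))      # ethanol
```

The resulting list can be appended to the numeric descriptors to form the full feature vector.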
Raw descriptors can differ by orders of magnitude, so scaling matters. Z-score scaling rescales each feature to mean 0 and standard deviation 1, which prevents MW from dominating TPSA. Min-max scaling maps each feature to the range 0 to 1, which is useful for bounded inputs. Weights let you emphasize what matters; for example, set the logP weight to 2.0 when permeability drives similarity.
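The two scaling options can be sketched per feature column as follows. This sketch assumes the population standard deviation for z-scores and maps constant columns to 0.0; the tool may make different choices. The MW values come from the sample table, and the function names are illustrative.

```python
import statistics

def zscore(values):
    """Scale one feature column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]

def minmax(values):
    """Scale one feature column to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

# MW column from the sample table (caffeine through ethanol).
mw = [194.19, 180.16, 151.16, 206.28, 78.11, 46.07]
print([round(z, 2) for z in zscore(mw)])
print([round(m, 2) for m in minmax(mw)])
```

Scaling statistics are computed over the whole dataset column, which is why small datasets give less stable results.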
Cosine similarity compares vector direction and is robust when magnitudes vary. Euclidean similarity converts distance to 1/(1+distance), giving intuitive decay as compounds diverge. Continuous Tanimoto is widely used for chemical fingerprints and behaves well when features are non-negative. If your dataset contains negatives after scaling, cosine often remains stable.
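For comparison, the sketch below implements cosine similarity and the commonly used continuous Tanimoto form, A·B / (|A|² + |B|² − A·B). The page does not state the exact Tanimoto formula the tool uses, so treat that form as an assumption. The function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Compare direction only: A.B / (|A| * |B|)."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0

def tanimoto_similarity(a, b):
    """Continuous Tanimoto: A.B / (|A|^2 + |B|^2 - A.B)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom > 0 else 0.0

# b points in the same direction as a but is twice as long.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]
print(cosine_similarity(a, b))    # 1.0, because magnitude is ignored
print(tanimoto_similarity(a, b))  # 18 / 27, because magnitude still counts
```

The example shows the practical difference: cosine treats proportional vectors as identical, while Tanimoto penalizes the difference in magnitude.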
The table reports similarity, distance, and the number of features used. A score near 1.0000 indicates a close match under the selected settings, while values below 0.3000 often signal weak resemblance for mixed descriptor sets. Check the features-used count when values are missing; a high score based on only two features can be misleading. Use the plot to spot score drop-offs and select a rational cutoff.
Curate inputs before scoring. Standardize units, remove salts, and keep consistent protonation states. For QSAR-style libraries, include at least 200-500 rows to estimate stable scaling statistics. Start with threshold 0.30 and Top N 25, then tighten as you validate hits. Always confirm candidates with structure-based methods or expert review. Document assumptions to ensure repeatable decisions. Keep versioned datasets, parameters, and exported reports for audits.
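The suggested starting settings (threshold 0.30, Top N 25) amount to a filter followed by a sort and a cut. The sketch below shows that logic; the function name `select_hits` and the scores are hypothetical illustration values, not output from the tool.

```python
def select_hits(scored, threshold=0.30, top_n=25):
    """Keep rows scoring at or above the threshold, best first, at most top_n.

    scored: list of (name, similarity) pairs.
    """
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]

# Hypothetical scores for illustration only.
scored = [("A", 0.91), ("B", 0.28), ("C", 0.47), ("D", 0.30)]
print(select_hits(scored))            # B falls below the threshold
print(select_hits(scored, top_n=2))   # only the two best rows remain
```

Recording the threshold and Top N alongside each exported report makes later audits easier to reproduce.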
Provide at least two numeric descriptors or a formula, plus a dataset with matching columns. More filled descriptors usually improve discrimination and reduce ties among similar candidates.
Enable formula features when your dataset contains consistent formulas and you want composition to influence ranking. It helps separate compounds with very different elemental makeups, but it does not capture bond connectivity, so isomers that share a formula receive identical atom-count features.
Weights multiply each feature before scoring. Increasing a weight makes that descriptor contribute more to similarity, so compounds closer on that dimension rise in rank while others drop.
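The effect of a weight on ranking can be seen in a small sketch. It assumes two already-scaled features (logP, TPSA), Euclidean distance, and made-up candidate values; the names `weighted` and `rank` are hypothetical.

```python
import math

def weighted(vector, weights):
    """Multiply each feature by its weight before scoring."""
    return [w * v for v, w in zip(vector, weights)]

def rank(query, candidates, weights):
    """Return candidate names ordered from closest to farthest."""
    q = weighted(query, weights)
    distances = {
        name: math.dist(q, weighted(vector, weights))
        for name, vector in candidates.items()
    }
    return sorted(distances, key=distances.get)

# Hypothetical scaled features in the order (logP, TPSA).
query = [0.0, 0.0]
candidates = {
    "A": [0.1, 1.0],  # close on logP, far on TPSA
    "B": [0.8, 0.2],  # far on logP, close on TPSA
}
print(rank(query, candidates, [1.0, 1.0]))  # B ranks first with equal weights
print(rank(query, candidates, [2.0, 1.0]))  # A ranks first when logP counts double
```

Doubling the logP weight stretches differences along that axis, so the candidate that matches the query on logP moves to the top.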
If missing values are ignored, any descriptor absent from either the query or a row is skipped. The features-used count shows how many paired values actually contributed to the score.
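A sketch of the skip-if-missing behavior is shown below, using `None` to mark a missing value and the Euclidean similarity 1/(1 + distance). How the tool represents missing values internally is not stated, so the representation and the function name are assumptions.

```python
import math

def similarity_ignoring_missing(query, row):
    """Return (similarity, features_used), skipping unpaired values.

    A feature contributes only when both the query and the row have a value.
    Returns (None, 0) when no feature is present on both sides.
    """
    pairs = [(q, r) for q, r in zip(query, row) if q is not None and r is not None]
    if not pairs:
        return None, 0
    distance = math.sqrt(sum((q - r) ** 2 for q, r in pairs))
    return 1.0 / (1.0 + distance), len(pairs)

# Hypothetical vectors: only the first and last features are present in both.
query = [1.0, None, 0.5, 2.0]
row = [1.0, 3.0, None, 2.0]
print(similarity_ignoring_missing(query, row))  # (1.0, 2)
```

The example returns a perfect score from only two shared features, which is the situation where the features-used count deserves a second look.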
Start with cosine for mixed, scaled descriptors and use Tanimoto when features are non-negative and you want fingerprint-style behavior. Euclidean is useful when you prefer distance-based decay.
Use the Plotly bar chart to find the first sharp score drop. Common starting thresholds are 0.30 to 0.50 for scaled descriptors, then adjust based on validation and domain expectations.
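Finding the first sharp drop by eye can also be approximated in code. The sketch below sorts the scores and reports the first gap between neighbors that reaches a chosen size. The `min_drop` parameter, its default of 0.15, and the scores are hypothetical and would need tuning for real data.

```python
def first_sharp_drop(scores, min_drop=0.15):
    """Return (rows_to_keep, cutoff_score) at the first gap >= min_drop.

    Scores are ranked from highest to lowest. If no gap is large enough,
    every row is kept. An empty input returns (0, None).
    """
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return 0, None
    for i in range(len(ranked) - 1):
        if ranked[i] - ranked[i + 1] >= min_drop:
            # Keep rows 0..i; the cutoff is the lowest score still kept.
            return i + 1, ranked[i]
    return len(ranked), ranked[-1]

# Hypothetical similarity scores for illustration only.
scores = [0.92, 0.88, 0.85, 0.52, 0.49, 0.20]
print(first_sharp_drop(scores))  # (3, 0.85): the gap from 0.85 to 0.52 is the first large one
```

A cutoff found this way is a starting point and should still be checked against validation results.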
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of the results. Please consult other sources as well.