Analyze sample similarity through distances and divergence measures. View histograms, cumulative curves, and overlap statistics. Make stronger distribution comparisons with transparent math and visuals.
This example shows two short samples with similar centers but small shape differences. Use it to test the tool quickly.
| Observation | Dataset A | Dataset B |
|---|---|---|
| 1 | 12 | 11 |
| 2 | 13 | 12 |
| 3 | 13 | 13 |
| 4 | 14 | 14 |
| 5 | 15 | 14 |
| 6 | 15 | 15 |
| 7 | 16 | 16 |
| 8 | 17 | 16 |
| 9 | 18 | 17 |
| 10 | 18 | 18 |
| 11 | 19 | 19 |
| 12 | 20 | 21 |
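The sample above can be loaded in a few lines of Python to confirm that the two datasets really do have similar centers (a minimal sketch; the variable names are illustrative, not from the tool itself):

```python
# Sample data from the table above: similar centers, slightly different shapes.
dataset_a = [12, 13, 13, 14, 15, 15, 16, 17, 18, 18, 19, 20]
dataset_b = [11, 12, 13, 14, 14, 15, 16, 16, 17, 18, 19, 21]

mean_a = sum(dataset_a) / len(dataset_a)  # ≈ 15.83
mean_b = sum(dataset_b) / len(dataset_b)  # = 15.5
print(mean_a, mean_b)
```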
1. Histogram probabilities: For each bin i, probability is pᵢ = cᵢ / n, where cᵢ is the bin count and n is sample size.
2. Overlap coefficient: OVL = Σ min(pᵢ, qᵢ). Higher values mean more shared probability mass.
3. Total variation distance: TVD = 0.5 × Σ |pᵢ − qᵢ|. Lower values show smaller differences.
4. Hellinger distance: H = (1/√2) × √[Σ(√pᵢ − √qᵢ)²]. Lower values indicate closer distributions.
5. Bhattacharyya coefficient: BC = Σ √(pᵢqᵢ). Higher values indicate stronger similarity.
6. Jensen-Shannon divergence: JSD = 0.5 KL(P‖M) + 0.5 KL(Q‖M), where M = (P + Q) / 2. With base-2 logarithms, JSD is bounded in [0, 1]; lower divergence is better.
7. Kolmogorov-Smirnov distance: D = max |F₁(x) − F₂(x)|. It tracks the largest gap between empirical cumulative distributions.
8. KS critical value: Dα = √[−0.5 ln(α / 2)] × √[(n₁ + n₂) / (n₁n₂)]. When D > Dα, the hypothesis of equal distributions is rejected at significance level α.
9. Overall similarity score: This page averages six bounded similarity components: overlap, Bhattacharyya, 1 − H, 1 − TVD, Jensen-Shannon similarity (1 − JSD), and 1 − D.
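The steps above can be sketched in plain Python. This is a minimal illustration, not the page's actual implementation; the choice of 5 equal-width bins over the combined range is an assumption:

```python
import math

dataset_a = [12, 13, 13, 14, 15, 15, 16, 17, 18, 18, 19, 20]
dataset_b = [11, 12, 13, 14, 14, 15, 16, 16, 17, 18, 19, 21]

def hist_probs(data, edges):
    """Bin counts normalized to probabilities (step 1)."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            # The last bin is closed on the right so the maximum is included.
            if edges[i] <= x < edges[i + 1] or (i == len(counts) - 1 and x == edges[-1]):
                counts[i] += 1
                break
    n = len(data)
    return [c / n for c in counts]

# Shared bin edges over the combined range (assumed: 5 equal-width bins).
lo, hi = min(dataset_a + dataset_b), max(dataset_a + dataset_b)
edges = [lo + (hi - lo) * i / 5 for i in range(6)]
p = hist_probs(dataset_a, edges)
q = hist_probs(dataset_b, edges)

ovl = sum(min(pi, qi) for pi, qi in zip(p, q))            # step 2: overlap
tvd = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))     # step 3: total variation
hell = math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                           for pi, qi in zip(p, q)))      # step 4: Hellinger
bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))      # step 5: Bhattacharyya

def kl(a, b):
    """KL divergence in bits; terms with zero probability contribute 0."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)                     # step 6, bounded in [0, 1]

def ecdf(data, x):
    return sum(1 for v in data if v <= x) / len(data)

ks = max(abs(ecdf(dataset_a, x) - ecdf(dataset_b, x))
         for x in sorted(set(dataset_a + dataset_b)))     # step 7: KS distance

# Step 9: mean of six bounded similarity components.
score = (ovl + bc + (1 - hell) + (1 - tvd) + (1 - jsd) + (1 - ks)) / 6
print(round(score, 3))
```

Note the built-in consistency checks these metrics allow: on the same binned probabilities, OVL always equals 1 − TVD, and H² always equals 1 − BC.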
The calculator compares two numeric datasets as full distributions, not just by their averages. It measures overlap, divergence, cumulative separation, and shape similarity using several complementary metrics.
No single metric captures every aspect of similarity. Some focus on shared mass, some on cumulative gaps, and others on divergence. Using several gives a more balanced reading.
A high overlap coefficient means the two histogram probability patterns share a large amount of mass across bins. It suggests stronger distribution resemblance.
The KS distance is the largest vertical gap between empirical cumulative distributions. Smaller values mean the cumulative behavior of both datasets is more alike.
Binning does affect the results: histogram-based metrics depend on how values are grouped. Very few bins can hide real differences, while too many can exaggerate noise. Automatic or moderate bin counts usually work well.
Use the overall score as a summary, not a replacement for inspection. Always read it together with the KS result, divergence values, and the comparison graph.
Datasets of different sizes can still be compared meaningfully: the calculator normalizes histogram counts to probabilities, provided both samples are representative.
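Normalization is what makes unequal sample sizes comparable. A tiny hypothetical example, where one sample is simply three copies of the other:

```python
# Two samples of different sizes but identical shape (hypothetical data).
small = [1, 1, 2, 2, 2, 3]   # n = 6
large = small * 3            # n = 18, same relative frequencies

def probs(data, values=(1, 2, 3)):
    # Counts normalized to probabilities, so sample size cancels out.
    return [data.count(v) / len(data) for v in values]

print(probs(small), probs(large))  # identical probability vectors
```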
Use a custom range when you want consistent external boundaries, such as comparing repeated experiments against a fixed scale. Otherwise, leave the range blank for automatic limits.
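Fixing the range can be sketched as follows (a hypothetical setup: the scale [10, 22] and the 6-bin count are assumed external choices, not values from the tool):

```python
# Fix the histogram range to a known measurement scale so repeated
# experiments always share the same bin edges.
run1 = [12, 13, 13, 14, 15, 15, 16, 17, 18, 18, 19, 20]
lo, hi, n_bins = 10, 22, 6  # assumed external scale

edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]
counts = [0] * n_bins
for x in run1:
    # Integer bin index; the maximum of the scale falls into the last bin.
    i = min((x - lo) * n_bins // (hi - lo), n_bins - 1)
    counts[i] += 1

print(edges)   # same edges for every run, regardless of each run's min/max
print(counts)
```

Because the edges come from the fixed scale rather than each sample's own min and max, histograms from different runs line up bin for bin.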
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.