Analyze sample similarity through distances and divergence measures. View histograms, cumulative curves, and overlap statistics. Make stronger distribution comparisons with transparent math and visuals.
This example shows two short samples with similar centers but small shape differences. Use it to test the tool quickly.
| Observation | Dataset A | Dataset B |
|---|---|---|
| 1 | 12 | 11 |
| 2 | 13 | 12 |
| 3 | 13 | 13 |
| 4 | 14 | 14 |
| 5 | 15 | 14 |
| 6 | 15 | 15 |
| 7 | 16 | 16 |
| 8 | 17 | 16 |
| 9 | 18 | 17 |
| 10 | 18 | 18 |
| 11 | 19 | 19 |
| 12 | 20 | 21 |
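The sample above can be loaded in a few lines of Python to confirm that the two datasets really do have similar centers (a minimal sketch; the variable names are illustrative, not from the tool itself):

```python
# Sample data from the table above: similar centers, slightly different shapes.
dataset_a = [12, 13, 13, 14, 15, 15, 16, 17, 18, 18, 19, 20]
dataset_b = [11, 12, 13, 14, 14, 15, 16, 16, 17, 18, 19, 21]

mean_a = sum(dataset_a) / len(dataset_a)  # ≈ 15.83
mean_b = sum(dataset_b) / len(dataset_b)  # = 15.5
print(mean_a, mean_b)
```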
1. Histogram probabilities: For each bin i, probability is pᵢ = cᵢ / n, where cᵢ is the bin count and n is sample size.
2. Overlap coefficient: OVL = Σ min(pᵢ, qᵢ). Higher values mean more shared probability mass.
3. Total variation distance: TVD = 0.5 × Σ |pᵢ − qᵢ|. Lower values show smaller differences.
4. Hellinger distance: H = (1/√2) × √[Σ(√pᵢ − √qᵢ)²]. Lower values indicate closer distributions.
5. Bhattacharyya coefficient: BC = Σ √(pᵢqᵢ). Higher values indicate stronger similarity.
6. Jensen-Shannon divergence: JSD = 0.5 KL(P‖M) + 0.5 KL(Q‖M), where M = (P + Q) / 2. With base-2 logarithms, JSD is bounded in [0, 1]; lower divergence is better.
7. Kolmogorov-Smirnov distance: D = max |F₁(x) − F₂(x)|. It tracks the largest gap between empirical cumulative distributions.
8. KS critical value: Dα = √[−0.5 ln(α / 2)] × √[(n₁ + n₂) / (n₁n₂)]. When D > Dα, the hypothesis of equal distributions is rejected at significance level α.
9. Overall similarity score: This page averages six bounded similarity components: overlap, Bhattacharyya, 1 − H, 1 − TVD, Jensen-Shannon similarity (1 − JSD), and 1 − D.
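The steps above can be sketched in plain Python. This is a minimal illustration, not the page's actual implementation; the choice of 5 equal-width bins over the combined range is an assumption:

```python
import math

dataset_a = [12, 13, 13, 14, 15, 15, 16, 17, 18, 18, 19, 20]
dataset_b = [11, 12, 13, 14, 14, 15, 16, 16, 17, 18, 19, 21]

def hist_probs(data, edges):
    """Bin counts normalized to probabilities (step 1)."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            # The last bin is closed on the right so the maximum is included.
            if edges[i] <= x < edges[i + 1] or (i == len(counts) - 1 and x == edges[-1]):
                counts[i] += 1
                break
    n = len(data)
    return [c / n for c in counts]

# Shared bin edges over the combined range (assumed: 5 equal-width bins).
lo, hi = min(dataset_a + dataset_b), max(dataset_a + dataset_b)
edges = [lo + (hi - lo) * i / 5 for i in range(6)]
p = hist_probs(dataset_a, edges)
q = hist_probs(dataset_b, edges)

ovl = sum(min(pi, qi) for pi, qi in zip(p, q))            # step 2: overlap
tvd = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))     # step 3: total variation
hell = math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                           for pi, qi in zip(p, q)))      # step 4: Hellinger
bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))      # step 5: Bhattacharyya

def kl(a, b):
    """KL divergence in bits; terms with zero probability contribute 0."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)                     # step 6, bounded in [0, 1]

def ecdf(data, x):
    return sum(1 for v in data if v <= x) / len(data)

ks = max(abs(ecdf(dataset_a, x) - ecdf(dataset_b, x))
         for x in sorted(set(dataset_a + dataset_b)))     # step 7: KS distance

# Step 9: mean of six bounded similarity components.
score = (ovl + bc + (1 - hell) + (1 - tvd) + (1 - jsd) + (1 - ks)) / 6
print(round(score, 3))
```

Note the built-in consistency checks these metrics allow: on the same binned probabilities, OVL always equals 1 − TVD, and H² always equals 1 − BC.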
The calculator compares two numeric datasets as full distributions, not just by their averages. It measures overlap, divergence, cumulative separation, and shape similarity using several complementary metrics.
No single metric captures every aspect of similarity. Some focus on shared mass, some on cumulative gaps, and others on divergence. Using several gives a more balanced reading.
A high overlap coefficient means the two histogram probability patterns share a large amount of mass across bins. It suggests stronger distribution resemblance.
The KS distance is the largest vertical gap between empirical cumulative distributions. Smaller values mean the cumulative behavior of both datasets is more alike.
Binning does affect the results: histogram-based metrics depend on how values are grouped. Very few bins can hide real differences, while too many can exaggerate noise. Automatic or moderate bin counts usually work well.
Use the overall score as a summary, not a replacement for inspection. Always read it together with the KS result, divergence values, and the comparison graph.
Datasets of different sizes can still be compared meaningfully: the calculator normalizes histogram counts to probabilities, provided both samples are representative.
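Normalization is what makes unequal sample sizes comparable. A tiny hypothetical example, where one sample is simply three copies of the other:

```python
# Two samples of different sizes but identical shape (hypothetical data).
small = [1, 1, 2, 2, 2, 3]   # n = 6
large = small * 3            # n = 18, same relative frequencies

def probs(data, values=(1, 2, 3)):
    # Counts normalized to probabilities, so sample size cancels out.
    return [data.count(v) / len(data) for v in values]

print(probs(small), probs(large))  # identical probability vectors
```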
Use a custom range when you want consistent external boundaries, such as comparing repeated experiments against a fixed scale. Otherwise, leave the range blank for automatic limits.
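Fixing the range can be sketched as follows (a hypothetical setup: the scale [10, 22] and the 6-bin count are assumed external choices, not values from the tool):

```python
# Fix the histogram range to a known measurement scale so repeated
# experiments always share the same bin edges.
run1 = [12, 13, 13, 14, 15, 15, 16, 17, 18, 18, 19, 20]
lo, hi, n_bins = 10, 22, 6  # assumed external scale

edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]
counts = [0] * n_bins
for x in run1:
    # Integer bin index; the maximum of the scale falls into the last bin.
    i = min((x - lo) * n_bins // (hi - lo), n_bins - 1)
    counts[i] += 1

print(edges)   # same edges for every run, regardless of each run's min/max
print(counts)
```

Because the edges come from the fixed scale rather than each sample's own min and max, histograms from different runs line up bin for bin.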
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.