Measure the overlap between two physics datasets with Jaccard similarity. Choose set or multiset handling, apply numeric tolerance where exact matching is too strict, and export results to CSV or a clean PDF print view.
The Jaccard similarity compares overlap between two collections:
J(A, B) = |A ∩ B| / |A ∪ B|
The distance form is:
D(A, B) = 1 − J(A, B)
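As a reference point, here is a minimal Python sketch of both formulas; the empty-set convention (two empty lists score 1.0) is an assumption, not a statement about the calculator's behavior:

```python
def jaccard_similarity(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| for two iterables, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets agree perfectly
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """D(A, B) = 1 − J(A, B)."""
    return 1.0 - jaccard_similarity(a, b)

print(jaccard_similarity({"Halpha", "Hbeta", "Hgamma"}, {"Halpha", "Hbeta"}))  # 2/3 ≈ 0.667
```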
In numeric mode, two values are considered matching when they agree within the selected tolerance, which can be absolute, relative, or both.
Example: comparing detected hydrogen spectral lines (nm) across two runs using numeric tolerance.
| Run | Observed values | Notes |
|---|---|---|
| A | 656.3, 486.1, 434.0, 410.2 | Balmer series peaks with measurement noise |
| B | 656.28, 486.13, 397.0, 410.25 | One extra line plus small peak shifts |
Suggested tolerance: 0.05 nm, which matches peaks within ±0.05 nm.
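A sketch of how such tolerance matching can be implemented; the greedy nearest-first pairing here is one reasonable strategy and is an assumption, not necessarily the calculator's exact algorithm:

```python
def tolerance_jaccard(a, b, abs_tol=0.05):
    """Jaccard similarity where numeric items match within an absolute tolerance.

    Greedy nearest-first pairing: each B value claims at most one unmatched
    A value, so a single peak is never counted twice.
    """
    unmatched = sorted(a)
    matched = 0
    for value in sorted(b):
        best = min(unmatched, key=lambda x: abs(x - value), default=None)
        # The tiny epsilon guards against floating-point error at the boundary
        # (e.g. 410.25 - 410.2 evaluates to slightly more than 0.05).
        if best is not None and abs(best - value) <= abs_tol + 1e-9:
            unmatched.remove(best)
            matched += 1
    union = len(a) + len(b) - matched
    return matched / union if union else 1.0

run_a = [656.3, 486.1, 434.0, 410.2]
run_b = [656.28, 486.13, 397.0, 410.25]
print(tolerance_jaccard(run_a, run_b))  # 3 matched pairs, union of 5 -> 0.6
```

The unmatched lines (434.0 in run A, 397.0 in run B) keep the union at five, so the score lands at 0.6 rather than 1.0.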
Many physics pipelines compare two lists of detected features. Common examples include spectral peak identities from two calibrations, activated detector channels across repeated triggers, segment labels in particle tracking, or “events of interest” selected by two filtering methods. Jaccard similarity gives a single, interpretable overlap score for these discrete outputs.
The classic Jaccard similarity is J = |A ∩ B| / |A ∪ B|, where A and B are sets of unique items. A value of 1 means perfect agreement, and 0 means no shared items. This calculator also reports intersection and union sizes to make the score auditable.
Experimental lists are rarely clean. Remove empty entries, normalize case for text labels, and choose a delimiter that matches your acquisition export. If you record repeated labels, decide whether duplicates should be ignored (set method) or preserved (multiset method). For sensor logs, consistent naming and unit conventions reduce false mismatches.
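A sketch of that preprocessing in Python; the option names are illustrative, not the calculator's exact settings:

```python
def preprocess(raw, delimiter=",", case_sensitive=False, keep_duplicates=False):
    """Split a raw export string, trim whitespace, drop empties, normalize case."""
    items = [s.strip() for s in raw.split(delimiter)]
    items = [s for s in items if s]            # blank entries would inflate the union
    if not case_sensitive:
        items = [s.lower() for s in items]     # "Halpha" and "halpha" become one item
    return list(items) if keep_duplicates else set(items)

print(preprocess("Halpha, hbeta, , HBETA"))  # {'halpha', 'hbeta'} (order may vary)
```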
Interpreting similarity scores is context dependent. For sparse feature lists, a Jaccard score of 0.70 can indicate strong agreement, while dense lists may require 0.90 or higher. When comparing classification outputs, pair Jaccard with the counts: two runs might share 7 features, but a union of 30 indicates meaningful disagreement despite the nonzero overlap.
When items are numeric (for example wavelengths, frequencies, or energy bins), exact matching is too strict. The numeric mode lets you apply an absolute tolerance (such as 0.05 nm) and an optional relative tolerance for proportional uncertainty. This makes similarity robust to instrument drift, rounding, and digitization effects.
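Python's standard library already expresses this combined test; a sketch, assuming the two tolerances act as alternatives (a match on either suffices), which may differ from the calculator's exact combination rule:

```python
import math

def values_match(x, y, abs_tol=0.05, rel_tol=1e-4):
    """Match if within abs_tol OR within rel_tol of the larger magnitude."""
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)

print(values_match(656.3, 656.28))   # True: 0.02 nm gap is inside the 0.05 absolute band
print(values_match(6563.0, 6563.4))  # True: 0.4 gap passes the 0.01% relative test
print(values_match(656.3, 656.4))    # False: fails both tests
```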
Sometimes repeats are meaningful, such as counting repeated particle IDs, repeated resonance detections, or repeated fault codes in a time window. The multiset method treats item counts as part of the comparison by matching repeated occurrences up to the minimum count in both lists. This can separate “same items” from “same rates.”
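A sketch with collections.Counter, whose intersection and union operators implement exactly this minimum-count matching:

```python
from collections import Counter

def multiset_jaccard(a, b):
    """Multiset Jaccard: repeats match up to the minimum count in both lists."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # per-item minimum count
    union = sum((ca | cb).values())         # per-item maximum count
    return intersection / union if union else 1.0

# Same unique particle IDs, different repetition rates
print(multiset_jaccard(["p1", "p1", "p2"], ["p1", "p2", "p2", "p2"]))  # 2/5 = 0.4
```

Set mode would score these lists at 1.0 because the unique IDs agree; multiset mode exposes the differing rates.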
For operational monitoring, define thresholds tied to your acceptance criteria. For example, you might require J ≥ 0.85 before merging two segmentation outputs, or flag J ≤ 0.50 as a possible calibration shift. Record the mode, delimiter, tolerances, and whether set or multiset matching was used.
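A sketch of such a rule, using the example thresholds above; your own cutoffs should come from your acceptance criteria:

```python
def classify_agreement(j, merge_at=0.85, flag_at=0.50):
    """Map a Jaccard score to an action under example thresholds."""
    if j >= merge_at:
        return "merge"   # outputs agree well enough to combine
    if j <= flag_at:
        return "flag"    # possible calibration shift; investigate
    return "review"      # intermediate agreement; check manually

print(classify_agreement(0.91), classify_agreement(0.42), classify_agreement(0.70))
# merge flag review
```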
To make comparisons reproducible, store the exact input lists and the preprocessing rules. If you use numeric tolerance, report the tolerance values and whether relative tolerance was enabled. If you use multiset matching, report count handling. This calculator’s detailed metrics and exports are designed to support lab notes and verification.
A Jaccard score of 0.60 means that 60% of the combined unique items are shared. Always confirm by checking intersection and union counts, because small lists can produce large score changes from a single added or missing item.
Use set mode when duplicates are not informative, such as unique spectral line IDs. Use multiset mode when repeated detections matter, such as repeated event codes or repeated particle identifiers in a time window.
Start with instrument resolution or expected drift. For wavelengths, a tolerance near your peak-fitting uncertainty is typical. Increase tolerance if the same physical feature appears with small systematic offsets across runs.
Convert the time-series into discrete features first, such as threshold crossings, detected peaks, or labeled segments. Jaccard similarity is best for comparing the presence or absence of discrete items, not raw amplitudes.
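One way to do that discretization, sketched with NumPy threshold crossings; the signals and threshold here are illustrative:

```python
import numpy as np

def rising_crossings(signal, threshold):
    """Sample indices where the signal first rises above the threshold."""
    above = signal > threshold
    return set(np.flatnonzero(~above[:-1] & above[1:]) + 1)

# Two noiseless traces with a ~2-sample phase shift yield slightly offset events;
# exact index matching would score 0, but numeric tolerance recovers the agreement.
t = np.linspace(0.0, 1.0, 1000)
events_a = rising_crossings(np.sin(2 * np.pi * 5 * t), 0.5)
events_b = rising_crossings(np.sin(2 * np.pi * 5 * (t - 0.002)), 0.5)
print(sorted(events_a), sorted(events_b))
```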
Text labels like “Halpha” and “halpha” are treated as different items when case sensitivity is enabled. Disable case sensitivity when your labels come from mixed sources, unless letter case has physical meaning in your naming convention.
Enable removal of empty items and trimming. This prevents blank entries from being counted as items, which can inflate the union size and reduce the similarity score artificially.
Exports let you attach computed metrics to reports and lab notebooks. The CSV is useful for spreadsheets and scripts, while the PDF print view provides a clean summary table for documentation and review.
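If you script around the exports, a sketch like this reproduces the CSV side for a lab notebook; the field names are illustrative, not the calculator's exact schema:

```python
import csv

# Illustrative metric names; record everything needed to reproduce the comparison.
metrics = {
    "jaccard": 0.6, "distance": 0.4,
    "intersection": 3, "union": 5,
    "mode": "numeric", "abs_tol": 0.05, "rel_tol": None, "method": "set",
}
with open("jaccard_metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=metrics.keys())
    writer.writeheader()
    writer.writerow(metrics)
```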