## Calculator Inputs
Use the controls below to compare intended sample allocation against observed experiment traffic.
## Example Data Table
This example reflects a three-way model experiment where the planned allocation was 50:25:25, but routing drift created a detectable mismatch.
| Variant | Expected Ratio | Observed Count | Expected Count | Observed Share |
|---|---|---|---|---|
| Control | 50% | 5,100 | 5,000 | 51.0% |
| Variant A | 25% | 2,400 | 2,500 | 24.0% |
| Variant B | 25% | 2,500 | 2,500 | 25.0% |
## Formula Used
The calculator applies a chi-square goodness-of-fit test to compare the observed variant counts against the counts implied by your target allocation.
1. Normalize expected ratios: each variant's expected share = its ratio / the sum of all ratios.
2. Expected count: expected count = total observed traffic × normalized expected share.
3. Chi-square statistic: χ² = Σ ((observed − expected)² / expected).
4. Degrees of freedom: df = number of variants − 1.
5. P-value: p = upper-tail probability of the chi-square distribution using χ² and df.
6. Effect size: Cohen's w = √(χ² / total observed traffic).
A small p-value means the observed traffic split would be unlikely under the intended routing plan, which signals a sample ratio mismatch (SRM).
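The six steps above can be sketched in Python. This is a stdlib-only illustration, not the calculator's actual implementation; `chi_square_sf` approximates the chi-square upper-tail probability with a standard incomplete-gamma series (a production version would more likely call `scipy.stats.chi2.sf`). The inputs match the example table.

```python
import math

def chi_square_sf(x, df, terms=200):
    # Upper-tail probability of the chi-square distribution, computed from the
    # regularized lower incomplete gamma series:
    #   P(a, x) = x^a e^{-x} / Gamma(a) * sum_{n>=0} x^n / (a (a+1) ... (a+n))
    # with a = df/2 and x = chi2/2. Adequate for moderate chi2 values.
    a = df / 2.0
    x = x / 2.0
    if x <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def srm_check(expected_ratios, observed_counts):
    total = sum(observed_counts)
    shares = [r / sum(expected_ratios) for r in expected_ratios]            # step 1
    expected = [total * s for s in shares]                                  # step 2
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected)) # step 3
    df = len(observed_counts) - 1                                           # step 4
    p = chi_square_sf(chi2, df)                                            # step 5
    w = math.sqrt(chi2 / total)                                            # step 6
    return chi2, df, p, w

# Example table: 50:25:25 plan, observed 5,100 / 2,400 / 2,500.
chi2, df, p, w = srm_check([50, 25, 25], [5100, 2400, 2500])
print(f"chi2={chi2:.2f} df={df} p={p:.4f} w={w:.4f}")
# -> chi2=6.00 df=2 p=0.0498 w=0.0245
```

At the default 0.05 significance level, the example data would just cross the SRM warning threshold.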
## How to Use This Calculator
- Enter the experiment name, test identifier, significance level, and minimum expected-count warning threshold.
- Select the number of variants included in the experiment or model-routing comparison.
- For each variant, provide a label, the intended traffic ratio, and the observed sample count.
- Click Check Sample Ratio Mismatch to calculate the chi-square statistic, p-value, deviation, and warning flags.
- Review the result section above the form to see whether traffic allocation is statistically consistent with the target design.
- Use the CSV or PDF export buttons to save the summary for QA notes, incident reviews, or experiment governance records.
## Frequently Asked Questions
1. What is sample ratio mismatch?
Sample ratio mismatch happens when observed experiment traffic differs from the intended split beyond random chance. It often signals randomization, routing, logging, or eligibility problems.
2. Why should machine learning teams care about SRM?
SRM can bias offline-to-online comparisons, skew lift estimates, and mislead model-selection decisions. Detecting it early protects experiment validity before outcome metrics are trusted.
3. What p-value is usually considered suspicious?
Many teams flag SRM when the p-value falls below 0.05, while stricter governance may use 0.01 or 0.001. Lower thresholds reduce false alarms but may miss smaller mismatches.
4. Can expected ratios be percentages or weights?
Yes. The calculator normalizes the expected ratios, so you can enter percentages, weights, or raw proportions as long as each value is positive.
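To illustrate the normalization, the same 50:25:25 plan can be entered as percentages, integer weights, or raw proportions; all three reduce to identical shares. A minimal sketch (`normalize` is a hypothetical helper, not part of the calculator):

```python
# Equivalent ways to express a 50:25:25 plan; all normalize to the same shares.
def normalize(ratios):
    total = sum(ratios)
    return [r / total for r in ratios]

assert normalize([50, 25, 25]) == normalize([2, 1, 1]) == normalize([0.5, 0.25, 0.25])
print(normalize([50, 25, 25]))  # [0.5, 0.25, 0.25]
```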
5. What causes SRM in practice?
Common causes include broken randomization, caching rules, asynchronous logging loss, bot filtering, user eligibility drift, traffic throttling, and release rollbacks that affect variants unequally.
6. What does Cohen's w add here?
Cohen's w summarizes practical imbalance size. The p-value shows statistical evidence, while w helps you judge whether the mismatch is trivial, moderate, or operationally meaningful.
7. Is the chi-square test reliable with small samples?
It becomes less reliable when expected counts are very small. The calculator warns about low expected cells so you can interpret the result with appropriate caution.
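The low-expected-cell warning can be sketched as a simple threshold check. This is an illustration, not the calculator's code; the threshold of 5 follows a common rule of thumb for chi-square validity, and your configured warning threshold may differ:

```python
# Hypothetical warning check: flag variants whose expected count falls below
# a threshold (5 is a common rule of thumb for chi-square tests).
def low_expected_cells(expected_ratios, total_traffic, threshold=5):
    shares = [r / sum(expected_ratios) for r in expected_ratios]
    expected = [total_traffic * s for s in shares]
    return [i for i, e in enumerate(expected) if e < threshold]

print(low_expected_cells([50, 25, 25], total_traffic=60))  # expected 30, 15, 15 -> []
print(low_expected_cells([98, 1, 1], total_traffic=200))   # expected 196, 2, 2 -> [1, 2]
```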
8. Should I analyze experiment outcomes after SRM appears?
Usually no. Investigate routing, event logging, eligibility logic, and exposure tracking first. Outcome analysis should wait until the allocation problem is understood or corrected.