Example data table
Use these values to validate outputs quickly.
| Scenario | Control visitors | Control conversions | Variant visitors | Variant conversions | Observed lift |
|---|---|---|---|---|---|
| Checkout button copy | 10,000 | 520 | 10,000 | 565 | ~ 8.65% |
| Pricing page layout | 25,000 | 1,250 | 25,000 | 1,320 | ~ 5.60% |
| Email subject line | 8,500 | 425 | 8,500 | 410 | ~ -3.53% |
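The lift column can be reproduced directly from the visitor and conversion counts. A minimal sketch (the function name `observed_lift` is illustrative, not part of the calculator):

```python
# Sanity-check the "Observed lift" column: lift is the relative change
# in conversion rate, (p2 - p1) / p1.
def observed_lift(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    return (p2 - p1) / p1

rows = [
    ("Checkout button copy", 520, 10_000, 565, 10_000),
    ("Pricing page layout", 1_250, 25_000, 1_320, 25_000),
    ("Email subject line", 425, 8_500, 410, 8_500),
]
for name, x1, n1, x2, n2 in rows:
    print(f"{name}: {observed_lift(x1, n1, x2, n2):+.2%}")
# Checkout button copy: +8.65%, Pricing page layout: +5.60%,
# Email subject line: -3.53%
```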
Formulas used
Conversion rates: p1 = x1/n1, p2 = x2/n2, difference d = p2 - p1.
Pooled proportion: p = (x1 + x2) / (n1 + n2).
Z-test: z = d / sqrt( p(1-p)(1/n1 + 1/n2) ).
P-value: two-sided = 2*(1 - Phi(|z|)). One-sided uses the chosen direction.
Confidence interval: d +/- zcrit*sqrt( p1(1-p1)/n1 + p2(1-p2)/n2 ).
Sample size estimate: standard two-proportion normal approximation with alpha and power targets.
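The formulas above can be sketched in a few lines of Python. This is an illustrative implementation of the two-proportion z-test, not the calculator's actual source; the 95% critical value is hard-coded for brevity.

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_proportion_test(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    d = p2 - p1                                     # difference in rates
    p = (x1 + x2) / (n1 + n2)                       # pooled proportion
    se_pooled = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = d / se_pooled
    p_value = 2 * (1 - phi(abs(z)))                 # two-sided
    # Unpooled standard error for the confidence interval, as above.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    zcrit = 1.959964                                # 95% confidence
    return z, p_value, (d - zcrit * se, d + zcrit * se)
```

Running it on the checkout example (520/10,000 vs. 565/10,000) gives z ≈ 1.40 and a two-sided p-value around 0.16, with a confidence interval that crosses zero.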
How to use this calculator
- Enter visitors and conversions for Control and Variant.
- Pick alpha and whether your test is one- or two-sided.
- Press Calculate to see results above the form.
- Review p-value, confidence interval, and lift together.
- Use Planning inputs to estimate required sample size.
- Download results with the CSV or PDF buttons.
Why two-proportion testing fits product experiments
Most A/B programs compare a binary outcome: convert or not. This calculator treats each variant as a proportion and tests whether the observed gap can occur under a no-difference assumption. For large samples, the normal approximation is accurate and produces a clear z-score and p-value.
Lift and absolute difference answer different questions
Lift expresses relative change, while the absolute difference shows added conversions per visitor. Moving from 5.20% to 5.65% is +0.45 percentage points, which can be forecast into incremental orders. Report both because stakeholders often overreact to lift when baselines are small.
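The distinction is plain arithmetic. A short sketch using the checkout rates from the table; the monthly traffic figure is a hypothetical assumption for illustration:

```python
p1, p2 = 0.0520, 0.0565            # control and variant conversion rates
abs_diff_pp = (p2 - p1) * 100      # absolute difference in percentage points
rel_lift = (p2 - p1) / p1          # relative lift (~8.65%)

monthly_visitors = 100_000         # hypothetical traffic, for illustration
extra_conversions = (p2 - p1) * monthly_visitors
```

A +0.45 percentage-point gap reads as a modest absolute change, yet the same numbers produce an 8.65% relative lift, which is why both should be reported together.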
Reading significance without overconfidence
Alpha is your tolerated false-positive rate, often 0.05. If the p-value is below alpha, the result is statistically significant for the chosen hypothesis. Significance is not a guarantee of future impact; it only indicates the data is unlikely under the null model given your stopping rule. If you repeatedly check midstream, the true false-positive rate rises unless you use sequential methods.
Confidence intervals quantify practical uncertainty
A confidence interval gives a range of plausible effects. This calculator reports an interval for the difference and also visualizes each conversion rate with Wilson intervals, which remain stable when rates are low. If the difference interval crosses zero, the evidence is compatible with both lift and decline. If the entire interval is above zero, you can communicate a minimum expected improvement at the chosen confidence level.
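The Wilson interval mentioned above has a standard closed form. A minimal sketch (again illustrative, not the calculator's source):

```python
import math

def wilson_interval(x, n, zcrit=1.959964):
    # Wilson score interval for a single proportion (95% by default).
    # Unlike the plain normal interval, it stays inside [0, 1] and
    # behaves well for low rates or small samples.
    p = x / n
    z2 = zcrit * zcrit
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = zcrit * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half
```

For the checkout control group (520/10,000), this yields roughly 4.78% to 5.65%, bracketing the observed 5.20% rate.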
Power planning prevents wasted traffic
Power is the probability of detecting your minimum detectable effect if it is real. Using baseline rate, MDE, alpha, and target power (commonly 80% or 90%), the calculator estimates required visitors per variant. Planning reduces inconclusive tests and clarifies how long an experiment must run. Smaller MDE targets are more expensive because required sample size grows roughly with the inverse square of the effect.
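The inverse-square cost of a smaller MDE follows from the standard two-proportion sample-size formula. A hedged sketch with hard-coded z-values for a two-sided alpha of 0.05 and 80% power (the function name and relative-MDE convention are assumptions for this example):

```python
import math

def sample_size_per_variant(baseline, mde_rel):
    # Normal-approximation sample size for a two-proportion test,
    # two-sided alpha = 0.05, power = 0.80, MDE given as a relative lift.
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = 1.959964                 # z for alpha = 0.05, two-sided
    z_power = 0.841621                 # z for 80% power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * var / (p2 - p1) ** 2
    return math.ceil(n)
```

At a 5% baseline, detecting a 10% relative lift needs roughly 31,000 visitors per variant; halving the MDE to 5% roughly quadruples that requirement.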
Operational guardrails for trustworthy decisions
Randomize consistently, keep users in one variant, and avoid mid-test changes. Track one primary metric and a small set of guardrails such as revenue per session, latency, refunds, or support contacts. Stop only when the planned sample is reached to limit peeking-driven false positives. After rollout, monitor the metric in production to confirm the effect holds across segments and time.
FAQs
1) When should I use a one-sided test?
Use one-sided only when you will act exclusively on improvement and you would not ship a variant that performs worse. Two-sided is safer for general product decisions.
2) What does it mean if the confidence interval crosses zero?
It means the data is compatible with both positive and negative true effects at the chosen confidence level. The result is inconclusive; you may need more traffic or a larger effect.
3) Why can lift look big while impact stays small?
Lift is relative to the baseline. A 10% lift on a 1% baseline equals only +0.10 percentage points. Always evaluate absolute difference and expected incremental conversions.
4) Does a small p-value guarantee long-term gains?
No. It indicates the observed effect is unlikely under the null model, not that the effect will persist. Check effect size, confidence interval, seasonality, and post-launch monitoring.
5) What is MDE and how should I choose it?
MDE is the smallest effect worth detecting. Choose it from business value, risk, and engineering cost. Smaller MDE requires larger samples, so pick a threshold that meaningfully changes outcomes.
6) Why is early peeking risky?
Repeatedly checking results increases false positives. Plan your sample size, run to completion, and then decide. If interim looks are required, use sequential or alpha-spending approaches.