Example data table
Use these values to validate outputs quickly.
| Scenario | Control visitors | Control conversions | Variant visitors | Variant conversions | Observed lift |
|---|---|---|---|---|---|
| Checkout button copy | 10,000 | 520 | 10,000 | 565 | ~ 8.65% |
| Pricing page layout | 25,000 | 1,250 | 25,000 | 1,320 | ~ 5.60% |
| Email subject line | 8,500 | 425 | 8,500 | 410 | ~ -3.53% |
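The lift column can be reproduced directly from the visitor and conversion counts. A minimal sketch (the function name `observed_lift` is illustrative, not part of the calculator):

```python
# Sanity-check the "Observed lift" column: lift is the relative change
# in conversion rate, (p2 - p1) / p1.
def observed_lift(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    return (p2 - p1) / p1

rows = [
    ("Checkout button copy", 520, 10_000, 565, 10_000),
    ("Pricing page layout", 1_250, 25_000, 1_320, 25_000),
    ("Email subject line", 425, 8_500, 410, 8_500),
]
for name, x1, n1, x2, n2 in rows:
    print(f"{name}: {observed_lift(x1, n1, x2, n2):+.2%}")
# Checkout button copy: +8.65%, Pricing page layout: +5.60%,
# Email subject line: -3.53%
```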
Formulas used
Conversion rates: p1 = x1/n1, p2 = x2/n2, difference d = p2 - p1.
Pooled proportion: p = (x1 + x2) / (n1 + n2).
Z-test: z = d / sqrt( p(1-p)(1/n1 + 1/n2) ).
P-value: two-sided = 2*(1 - Phi(|z|)). One-sided uses the chosen direction.
Confidence interval: d +/- zcrit*sqrt( p1(1-p1)/n1 + p2(1-p2)/n2 ).
Sample size estimate: standard two-proportion normal approximation with alpha and power targets.
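The formulas above can be sketched in a few lines of Python. This is an illustrative implementation of the two-proportion z-test, not the calculator's actual source; the 95% critical value is hard-coded for brevity.

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_proportion_test(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    d = p2 - p1                                     # difference in rates
    p = (x1 + x2) / (n1 + n2)                       # pooled proportion
    se_pooled = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = d / se_pooled
    p_value = 2 * (1 - phi(abs(z)))                 # two-sided
    # Unpooled standard error for the confidence interval, as above.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    zcrit = 1.959964                                # 95% confidence
    return z, p_value, (d - zcrit * se, d + zcrit * se)
```

Running it on the checkout example (520/10,000 vs. 565/10,000) gives z ≈ 1.40 and a two-sided p-value around 0.16, with a confidence interval that crosses zero.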
How to use this calculator
- Enter visitors and conversions for Control and Variant.
- Pick alpha and whether your test is one- or two-sided.
- Press Calculate to see results above the form.
- Review p-value, confidence interval, and lift together.
- Use Planning inputs to estimate required sample size.
- Download results with the CSV or PDF buttons.
Why two-proportion testing fits product experiments
Most A/B programs compare a binary outcome: convert or not. This calculator treats each variant as a proportion and tests whether the observed gap can occur under a no-difference assumption. For large samples, the normal approximation is accurate and produces a clear z-score and p-value.
Lift and absolute difference answer different questions
Lift expresses relative change, while the absolute difference shows added conversions per visitor. Moving from 5.20% to 5.65% is +0.45 percentage points, which can be forecast into incremental orders. Report both because stakeholders often overreact to lift when baselines are small.
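The distinction is plain arithmetic. A short sketch using the checkout rates from the table; the monthly traffic figure is a hypothetical assumption for illustration:

```python
p1, p2 = 0.0520, 0.0565            # control and variant conversion rates
abs_diff_pp = (p2 - p1) * 100      # absolute difference in percentage points
rel_lift = (p2 - p1) / p1          # relative lift (~8.65%)

monthly_visitors = 100_000         # hypothetical traffic, for illustration
extra_conversions = (p2 - p1) * monthly_visitors
```

A +0.45 percentage-point gap reads as a modest absolute change, yet the same numbers produce an 8.65% relative lift, which is why both should be reported together.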
Reading significance without overconfidence
Alpha is your tolerated false-positive rate, often 0.05. If the p-value is below alpha, the result is statistically significant for the chosen hypothesis. Significance is not a guarantee of future impact; it only indicates the data is unlikely under the null model given your stopping rule. If you repeatedly check midstream, the true false-positive rate rises unless you use sequential methods.
Confidence intervals quantify practical uncertainty
A confidence interval gives a range of plausible effects. This calculator reports an interval for the difference and also visualizes each conversion rate with Wilson intervals, which remain stable when rates are low. If the difference interval crosses zero, the evidence is compatible with both lift and decline. If the entire interval is above zero, you can communicate a minimum expected improvement at the chosen confidence level.
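The Wilson interval mentioned above has a standard closed form. A minimal sketch (again illustrative, not the calculator's source):

```python
import math

def wilson_interval(x, n, zcrit=1.959964):
    # Wilson score interval for a single proportion (95% by default).
    # Unlike the plain normal interval, it stays inside [0, 1] and
    # behaves well for low rates or small samples.
    p = x / n
    z2 = zcrit * zcrit
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = zcrit * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half
```

For the checkout control group (520/10,000), this yields roughly 4.78% to 5.65%, bracketing the observed 5.20% rate.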
Power planning prevents wasted traffic
Power is the probability of detecting your minimum detectable effect if it is real. Using baseline rate, MDE, alpha, and target power (commonly 80% or 90%), the calculator estimates required visitors per variant. Planning reduces inconclusive tests and clarifies how long an experiment must run. Smaller MDE targets are more expensive because required sample size grows roughly with the inverse square of the effect.
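The inverse-square cost of a smaller MDE follows from the standard two-proportion sample-size formula. A hedged sketch with hard-coded z-values for a two-sided alpha of 0.05 and 80% power (the function name and relative-MDE convention are assumptions for this example):

```python
import math

def sample_size_per_variant(baseline, mde_rel):
    # Normal-approximation sample size for a two-proportion test,
    # two-sided alpha = 0.05, power = 0.80, MDE given as a relative lift.
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = 1.959964                 # z for alpha = 0.05, two-sided
    z_power = 0.841621                 # z for 80% power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * var / (p2 - p1) ** 2
    return math.ceil(n)
```

At a 5% baseline, detecting a 10% relative lift needs roughly 31,000 visitors per variant; halving the MDE to 5% roughly quadruples that requirement.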
Operational guardrails for trustworthy decisions
Randomize consistently, keep users in one variant, and avoid mid-test changes. Track one primary metric and a small set of guardrails such as revenue per session, latency, refunds, or support contacts. Stop only when the planned sample is reached to limit peeking-driven false positives. After rollout, monitor the metric in production to confirm the effect holds across segments and time.
FAQs
1) When should I use a one-sided test?
Use one-sided only when you will act exclusively on improvement and you would not ship a variant that performs worse. Two-sided is safer for general product decisions.
2) What does it mean if the confidence interval crosses zero?
It means the data is compatible with both positive and negative true effects at the chosen confidence level. The result is inconclusive; you may need more traffic or a larger effect.
3) Why can lift look big while impact stays small?
Lift is relative to the baseline. A 10% lift on a 1% baseline equals only +0.10 percentage points. Always evaluate absolute difference and expected incremental conversions.
4) Does a small p-value guarantee long-term gains?
No. It indicates the observed effect is unlikely under the null model, not that the effect will persist. Check effect size, confidence interval, seasonality, and post-launch monitoring.
5) What is MDE and how should I choose it?
MDE is the smallest effect worth detecting. Choose it from business value, risk, and engineering cost. Smaller MDE requires larger samples, so pick a threshold that meaningfully changes outcomes.
6) Why is early peeking risky?
Repeatedly checking results increases false positives. Plan your sample size, run to completion, and then decide. If interim looks are required, use sequential or alpha-spending approaches.