A/B Test Calculator

Make smarter product decisions using reliable experiment math. Enter visitors and conversions for two variants. Get p-values, confidence intervals, and lift instantly on screen.

Inputs

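Control
Visitors n1 and conversions x1. Conversion rate p1 = x1 / n1
Variant
Visitors n2 and conversions x2. Conversion rate p2 = x2 / n2
Test options
Significance level (alpha). Common: 0.10, 0.05, 0.01
Planning (sample size)
Baseline conversion rate. Example: 0.05 means 5%.
Minimum detectable effect (MDE). Relative: 10 means +10% of baseline. Absolute: 1 means +1 percentage point.
Target power. Common: 0.8 or 0.9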
Quick checks
  • Keep runs independent and random.
  • Run until the planned sample size is reached.
  • Avoid peeking; it inflates false positives.
  • Track one primary metric per decision.
  • Use two-sided unless you truly need one-sided.

Example data table

Use these values to validate outputs quickly.

Scenario             | Control visitors | Control conversions | Variant visitors | Variant conversions | Observed lift
Checkout button copy | 10,000           | 520                 | 10,000           | 565                 | ~ 8.65%
Pricing page layout  | 25,000           | 1,250               | 25,000           | 1,320               | ~ 5.60%
Email subject line   | 8,500            | 425                 | 8,500            | 410                 | ~ -3.53%
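
The observed lift column can be reproduced by hand from the visitor and conversion counts. The short Python sketch below does this for the three example rows; the function name observed_lift is illustrative and not part of the calculator.

```python
# Rows from the example table: (scenario, n1, x1, n2, x2)
EXAMPLES = [
    ("Checkout button copy", 10_000, 520, 10_000, 565),
    ("Pricing page layout", 25_000, 1_250, 25_000, 1_320),
    ("Email subject line", 8_500, 425, 8_500, 410),
]


def observed_lift(n1, x1, n2, x2):
    """Relative lift of the variant rate over the control rate, in percent."""
    p1 = x1 / n1  # control conversion rate
    p2 = x2 / n2  # variant conversion rate
    return (p2 - p1) / p1 * 100


for name, n1, x1, n2, x2 in EXAMPLES:
    print(f"{name}: {observed_lift(n1, x1, n2, x2):.2f}%")
```

The printed values should match the table to two decimals.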

Formula used

Conversion rates: p1 = x1/n1, p2 = x2/n2, difference d = p2 - p1.

Pooled proportion: p = (x1 + x2) / (n1 + n2).

Z-test: z = d / sqrt( p(1-p)(1/n1 + 1/n2) ).

P-value: two-sided = 2*(1 - Phi(|z|)). One-sided uses the chosen direction.
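
As a rough illustration, the pooled z-test and p-value formulas above can be written in a few lines of Python using only the standard library. The function name and the "greater"/"less" labels for the one-sided directions are assumptions made for this sketch, not the calculator's own code.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(n1, x1, n2, x2, alternative="two-sided"):
    """Pooled two-proportion z-test. Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    d = p2 - p1
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = d / se  # undefined if there are no conversions at all (se == 0)
    phi = NormalDist().cdf  # standard normal CDF, Phi
    if alternative == "two-sided":
        p_value = 2 * (1 - phi(abs(z)))
    elif alternative == "greater":  # H1: variant rate > control rate
        p_value = 1 - phi(z)
    elif alternative == "less":  # H1: variant rate < control rate
        p_value = phi(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value


# "Checkout button copy" row from the example table
z, p = two_proportion_z_test(10_000, 520, 10_000, 565)
print(f"z = {z:.3f}, two-sided p = {p:.3f}")
```

For the checkout example this gives z of about 1.40 and a two-sided p-value of about 0.16, so an 8.65% observed lift is not significant at alpha = 0.05 with 10,000 visitors per variant.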

Confidence interval: d +/- zcrit*sqrt( p1(1-p1)/n1 + p2(1-p2)/n2 ).
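
The interval formula above uses the unpooled standard error, unlike the z-test. A minimal Python sketch, with an illustrative function name, might look like this:

```python
from math import sqrt
from statistics import NormalDist


def difference_confidence_interval(n1, x1, n2, x2, alpha=0.05):
    """Normal-approximation CI for d = p2 - p1 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    d = p2 - p1
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d - z_crit * se, d + z_crit * se


# "Checkout button copy" row from the example table
low, high = difference_confidence_interval(10_000, 520, 10_000, 565)
print(f"95% CI for the difference: [{low:+.4%}, {high:+.4%}]")
```

For the checkout example the interval runs from roughly -0.18 to +1.08 percentage points, so it crosses zero.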

Sample size estimate: standard two-proportion normal approximation with alpha and power targets.
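
The page does not spell out which variant of the normal-approximation formula it uses. The sketch below assumes one common textbook form for a two-sided test with equal group sizes, so its output may differ slightly from the calculator's. The function and parameter names are illustrative; the MDE convention follows the Planning inputs (relative 10 means +10% of baseline, absolute 1 means +1 percentage point).

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, mde, mde_type="relative", alpha=0.05, power=0.8):
    """Approximate visitors needed per variant (two-sided test, equal split)."""
    if mde_type == "relative":
        p2 = baseline * (1 + mde / 100)  # 10 means +10% of baseline
    elif mde_type == "absolute":
        p2 = baseline + mde / 100  # 1 means +1 percentage point
    else:
        raise ValueError("mde_type must be 'relative' or 'absolute'")
    p1 = baseline
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    # Assumed form: pooled variance under the null, unpooled under the alternative
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


print(sample_size_per_variant(0.05, 10))  # 5% baseline, +10% relative MDE
print(sample_size_per_variant(0.05, 5))   # halving the MDE roughly quadruples n
```

Under these assumptions a 5% baseline with a +10% relative MDE needs a little over 31,000 visitors per variant at alpha = 0.05 and 80% power.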

How to use this calculator

  1. Enter visitors and conversions for Control and Variant.
  2. Pick alpha and whether your test is one- or two-sided.
  3. Press Calculate to see results above the form.
  4. Review p-value, confidence interval, and lift together.
  5. Use Planning inputs to estimate required sample size.
  6. Download results using CSV or PDF buttons.

Why two-proportion testing fits product experiments

Most A/B programs compare a binary outcome: convert or not. This calculator treats each variant as a proportion and asks how likely a gap at least as large as the observed one would be if both variants truly had the same conversion rate. For large samples, the normal approximation is accurate and produces a clear z-score and p-value.

Lift and absolute difference answer different questions

Lift expresses relative change, while the absolute difference shows added conversions per visitor. Moving from 5.20% to 5.65% is +0.45 percentage points, which can be forecast into incremental orders. Report both because stakeholders often overreact to lift when baselines are small.
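
To make the distinction concrete, the sketch below reports both numbers for the 5.20% to 5.65% example and turns the absolute difference into incremental conversions. The traffic figure of 100,000 visitors is a hypothetical value chosen for illustration, and the function name is not part of the calculator.

```python
def summarize_effect(p_control, p_variant, visitors):
    """Return (relative lift in %, absolute difference in points, extra conversions)."""
    diff = p_variant - p_control
    lift_pct = diff / p_control * 100
    diff_points = diff * 100  # percentage points
    extra_conversions = diff * visitors
    return lift_pct, diff_points, extra_conversions


# 100_000 visitors is a hypothetical traffic volume
lift, points, extra = summarize_effect(0.0520, 0.0565, visitors=100_000)
print(f"lift {lift:.2f}%, difference {points:+.2f} points, about {extra:.0f} extra conversions")

# A 10% lift on a 1% baseline is only +0.10 percentage points
lift, points, extra = summarize_effect(0.010, 0.011, visitors=100_000)
print(f"lift {lift:.2f}%, difference {points:+.2f} points, about {extra:.0f} extra conversions")
```

The two cases have similar lift but very different incremental conversions at the same traffic, which is why the absolute difference belongs in the report.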

Reading significance without overconfidence

Alpha is your tolerated false-positive rate, often 0.05. If the p-value is below alpha, the result is statistically significant for the chosen hypothesis. Significance is not a guarantee of future impact; it only indicates the data is unlikely under the null model given your stopping rule. If you repeatedly check midstream, the true false-positive rate rises unless you use sequential methods.
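
The effect of repeated checking can be seen in a small A/A simulation, where both variants share the same true rate and every significant result is a false positive. This is an illustrative sketch only: the batch size, number of looks, number of simulated experiments, and seed are arbitrary choices, and exact rates vary with them.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(n1, x1, n2, x2, alpha):
    """Two-sided pooled z-test; True when the p-value is below alpha."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:  # no conversions (or all conversions) so far
        return False
    z = (x2 / n2 - x1 / n1) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_peeking(rate=0.05, batch=200, looks=10, experiments=500, alpha=0.05, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at the final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(experiments):
        n = x1 = x2 = 0
        flagged_at_any_look = False
        for _ in range(looks):
            # Both arms draw from the same true rate, so there is no real effect
            x1 += sum(rng.random() < rate for _ in range(batch))
            x2 += sum(rng.random() < rate for _ in range(batch))
            n += batch
            significant = is_significant(n, x1, n, x2, alpha)
            flagged_at_any_look = flagged_at_any_look or significant
        peeking_hits += flagged_at_any_look  # stopped at the first "win"
        fixed_hits += significant            # only the final look counts
    return peeking_hits / experiments, fixed_hits / experiments


peeking_rate, fixed_rate = simulate_peeking()
print(f"false positives with peeking: {peeking_rate:.1%}, at final look only: {fixed_rate:.1%}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the fixed-sample rate, and with ten looks it is typically well above the nominal alpha.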

Confidence intervals quantify practical uncertainty

A confidence interval gives a range of plausible effects. This calculator reports an interval for the difference and also visualizes each conversion rate with Wilson intervals, which remain stable when rates are low. If the difference interval crosses zero, the evidence is compatible with both lift and decline. If the entire interval is above zero, its lower bound is a conservative estimate of the improvement that you can communicate at the chosen confidence level.
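
For reference, the Wilson score interval for a single conversion rate can be sketched as follows. This is the standard formula without continuity correction; whether the calculator applies a correction is not stated, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(x, n, alpha=0.05):
    """Wilson score interval for a proportion x / n (no continuity correction)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(520, 10_000)
print(f"control rate 5.20%, 95% Wilson interval [{low:.2%}, {high:.2%}]")

# Still well behaved with zero conversions, where the simple +/- interval collapses
print(wilson_interval(0, 50))
```

With zero conversions the simple p +/- z*SE interval has zero width, while the Wilson interval still gives a non-trivial upper bound.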

Power planning prevents wasted traffic

Power is the probability of detecting your minimum detectable effect if it is real. Using baseline rate, MDE, alpha, and target power (commonly 80% or 90%), the calculator estimates required visitors per variant. Planning reduces inconclusive tests and clarifies how long an experiment must run. Smaller MDE targets are more expensive because required sample size grows roughly with the inverse square of the effect.
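
To see how power depends on traffic, the sketch below approximates the power of a two-sided two-proportion test for a given number of visitors per variant. It assumes a normal approximation with equal group sizes and ignores the negligible chance of rejecting in the wrong direction; the function name is illustrative and the calculator's exact method may differ.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(n_per_variant, p1, p2, alpha=0.05):
    """Approximate power to detect a true change from p1 to p2 (two-sided test)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_variant)           # under no difference
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)    # under the true effect
    return norm.cdf((abs(p2 - p1) - z_crit * se_null) / se_alt)


# 5% baseline, true variant rate 5.5% (a +10% relative effect)
for n in (5_000, 15_000, 31_234, 60_000):
    print(f"{n:>6} visitors per variant: power ~ {approximate_power(n, 0.05, 0.055):.2f}")
```

Under these assumptions power reaches about 0.80 at roughly 31,000 visitors per variant, and a test stopped at 5,000 per variant would miss a real +10% effect most of the time.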

Operational guardrails for trustworthy decisions

Randomize consistently, keep users in one variant, and avoid mid-test changes. Track one primary metric and a small set of guardrails such as revenue per session, latency, refunds, or support contacts. Stop only when the planned sample is reached to limit peeking-driven false positives. After rollout, monitor the metric in production to confirm the effect holds across segments and time.

FAQs

1) When should I use a one-sided test?

Use one-sided only when you will act exclusively on improvement and you would not ship a variant that performs worse. Two-sided is safer for general product decisions.

2) What does it mean if the confidence interval crosses zero?

It means the data is compatible with both positive and negative true effects at the chosen confidence level. The result is inconclusive: either the test needs more traffic, or the true effect is too small to detect at this sample size.

3) Why can lift look big while impact stays small?

Lift is relative to the baseline. A 10% lift on a 1% baseline equals only +0.10 percentage points. Always evaluate absolute difference and expected incremental conversions.

4) Does a small p-value guarantee long-term gains?

No. It indicates the observed effect is unlikely under the null model, not that the effect will persist. Check effect size, confidence interval, seasonality, and post-launch monitoring.

5) What is MDE and how should I choose it?

MDE is the smallest effect worth detecting. Choose it from business value, risk, and engineering cost. Smaller MDE requires larger samples, so pick a threshold that meaningfully changes outcomes.

6) Why is early peeking risky?

Repeatedly checking results increases false positives. Plan your sample size, run to completion, and then decide. If interim looks are required, use sequential or alpha-spending approaches.


Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.