What problem does this solve?
The one‑sample proportion z‑test answers a simple question: does an observed proportion from a sample,
denoted p̂, differ significantly from a hypothesized population proportion p₀?
Examples include “Is our conversion rate different from 50%?” or “Did more than 20% of users click the new feature?”
When the normal approximation is reasonable, the test statistic follows (approximately) a standard normal distribution,
enabling fast calculation of p‑values and critical values.
Core notation and assumptions
- Sample size: n independent trials with binary outcomes (success/failure).
- Observed successes: x; the sample proportion is p̂ = x / n.
- Null hypothesis: H₀: p = p₀. Alternatives: two‑sided, left‑tailed, or right‑tailed.
- Independence: observations are approximately independent (simple random sample or well‑designed process).
- Normal approximation conditions: a common rule of thumb is n·p₀ ≥ 5 and n·(1−p₀) ≥ 5. When these are violated or p̂ is extreme (near 0 or 1), prefer the exact binomial test.
The z‑statistic and p‑value
The test statistic is z = (p̂ − p₀) / SE₀, where SE₀ = √[ p₀(1−p₀) / n ] is the standard error computed under the null.
Under the null, and when the approximation is adequate, z is approximately standard normal.
Optional finite population correction (FPC) if sampling without replacement from a population of size N: multiply SE₀ by √[(N − n) / (N − 1)].
The p‑value depends on the tail of the alternative: two‑sided uses 2·min{Φ(z), 1−Φ(z)},
right‑tailed uses 1−Φ(z), and left‑tailed uses Φ(z). For extremely large |z|,
compute tails via survival functions (e.g., complementary error function) to avoid underflow.
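The recipe above can be sketched with only the Python standard library; the function name and signature here are illustrative, and the tail probabilities are computed via `math.erfc` (the complementary error function) rather than `1 − Φ(z)` to avoid underflow for large |z|, as suggested above:

```python
from math import erfc, sqrt

def norm_sf(z: float) -> float:
    """Standard normal survival function 1 - Phi(z), via erfc to avoid underflow."""
    return 0.5 * erfc(z / sqrt(2.0))

def proportion_z_test(x: int, n: int, p0: float, alternative: str = "two-sided"):
    """One-sample proportion z-test (normal approximation).

    alternative: "two-sided", "greater", or "less". Returns (z, p_value).
    """
    p_hat = x / n
    se0 = sqrt(p0 * (1 - p0) / n)        # standard error under H0
    z = (p_hat - p0) / se0
    if alternative == "two-sided":
        p = 2 * min(norm_sf(z), norm_sf(-z))   # 2 * min{Phi(z), 1 - Phi(z)}
    elif alternative == "greater":
        p = norm_sf(z)                          # 1 - Phi(z)
    else:
        p = norm_sf(-z)                         # Phi(z)
    return z, p
```

For moderately large z (say z = 10), computing `1 - NormalDist().cdf(z)` cancels to exactly 0.0 in floating point, while the erfc-based survival function still returns a meaningful tiny probability.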
Choosing the tail
| Alternative | Research question | p‑value expression |
|---|---|---|
| Two‑sided: Hₐ: p ≠ p₀ | Any departure from p₀ | 2 · min{ Φ(z), 1−Φ(z) } |
| Right‑tailed: Hₐ: p > p₀ | Is the proportion higher? | 1 − Φ(z) |
| Left‑tailed: Hₐ: p < p₀ | Is the proportion lower? | Φ(z) |
Confidence intervals for the true proportion
Confidence intervals (CIs) communicate estimation uncertainty. Several methods exist:
- Wald: p̂ ± z_{α/2} · √[ p̂(1−p̂)/n ]. Simple but unreliable for small n or extreme p̂.
- Wilson (score): More accurate coverage across a wide range; often the recommended default.
- Agresti–Coull: A quick improvement over Wald via “add‑z²” adjustments to x and n.
- Clopper–Pearson (exact): Inverts the binomial test; conservative but valid for any n and x.
Tip: For reporting, Wilson or Clopper–Pearson are good defaults. Avoid relying solely on Wald unless sample sizes are comfortably large and p̂ is not near 0 or 1.
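The Wilson interval is short enough to implement directly; this is a minimal sketch using the standard Wilson score formula (function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(x: int, n: int, conf: float = 0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # e.g. ~1.96 for 95%
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

Unlike Wald, the Wilson interval never escapes [0, 1] and behaves sensibly even when x = 0 or x = n.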
Continuity correction (Yates)
Because the binomial distribution is discrete and the normal distribution is continuous, a small
continuity correction can be applied to z when n is small. A common form shrinks the numerator toward zero by 0.5/n,
i.e., it uses |p̂ − p₀| − 0.5/n (floored at zero) in place of |p̂ − p₀|. It tends to make tests more conservative; many practitioners omit it for moderate or large samples.
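A sketch of the corrected statistic (the function name is illustrative; the correction shrinks the numerator toward zero and never past it):

```python
from math import sqrt

def z_with_continuity(x: int, n: int, p0: float) -> float:
    """z-statistic with Yates continuity correction: numerator moved toward zero by 0.5/n."""
    p_hat = x / n
    se0 = sqrt(p0 * (1 - p0) / n)
    diff = p_hat - p0
    corrected = max(abs(diff) - 0.5 / n, 0.0)   # shrink toward zero, floor at zero
    return (corrected if diff >= 0 else -corrected) / se0
```

For x = 56, n = 100, p₀ = 0.5, the corrected statistic is (0.06 − 0.005)/0.05 = 1.1 instead of 1.2, illustrating the conservative pull.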
Effect size: Cohen’s h
Reporting a statistically significant difference without an effect size can be misleading. Cohen’s h expresses
the magnitude of change on a variance‑stabilized (arcsine) scale, h = 2·arcsin(√p̂) − 2·arcsin(√p₀), facilitating comparison across studies.
Rules of thumb for |h|: 0.2 (small), 0.5 (medium), 0.8 (large).
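The arcsine transformation makes this a two‑line function (name is illustrative):

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: absolute difference of arcsine-transformed proportions."""
    return abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))
```

For example, `cohens_h(0.56, 0.50)` is about 0.12, a small effect by the rules of thumb above.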
When the normal approximation is dubious
Use the exact binomial test when n·p₀ or n·(1−p₀) is small (e.g., less than 5),
or when p̂ is exactly 0 or 1. The exact test computes the probability of outcomes as or more extreme
than x under the binomial(n, p₀) model. Two‑sided definitions vary (e.g., doubling the smaller tail,
“as‑or‑less‑likely,” or Blaker’s test), but conclusions are usually similar in practice.
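One of the two‑sided definitions mentioned above, doubling the smaller tail, can be sketched with only `math.comb`; the function names are illustrative, and other two‑sided conventions would give slightly different values:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes under Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_binom_p_two_sided(x: int, n: int, p0: float) -> float:
    """Two-sided exact binomial p-value via the 'double the smaller tail' rule."""
    lower = sum(binom_pmf(k, n, p0) for k in range(0, x + 1))   # P(X <= x)
    upper = sum(binom_pmf(k, n, p0) for k in range(x, n + 1))   # P(X >= x)
    return min(1.0, 2 * min(lower, upper))
```

For x = 56, n = 100, p₀ = 0.5 this gives roughly 0.27, a little larger than the normal‑approximation p‑value of 0.23, which is typical for the exact test at moderate n.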
Worked example
Suppose you tested a feature with n = 100 users, observed x = 56 conversions (p̂ = 0.56),
and wish to test H₀: p = 0.50 against a two‑sided alternative at α = 0.05.
- Compute the standard error under the null: SE₀ = √[ 0.5·0.5 / 100 ] = √(0.0025) = 0.05.
- Compute the test statistic: z = (0.56 − 0.50) / 0.05 = 1.20.
- Two‑sided p‑value: approximately 2·(1 − Φ(1.20)) ≈ 2·0.115 = 0.230.
- Decision: since 0.230 > 0.05, fail to reject H₀. The sample does not provide strong evidence that the conversion rate differs from 50%.
- 95% Wald CI for p (estimation, not testing): SÊ = √[ 0.56·0.44 / 100 ] ≈ 0.0496 and 0.56 ± 1.96·0.0496 ≈ [0.463, 0.657].
- Effect size: h ≈ |2·arcsin(√0.56) − 2·arcsin(√0.5)| ≈ 0.12 (a small effect).
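The arithmetic above can be checked with a short script (a sketch; variable names are illustrative):

```python
from math import asin, sqrt
from statistics import NormalDist

n, x, p0 = 100, 56, 0.50
p_hat = x / n                                    # 0.56
se0 = sqrt(p0 * (1 - p0) / n)                    # 0.05
z = (p_hat - p0) / se0                           # 1.20
p_value = 2 * (1 - NormalDist().cdf(z))          # about 0.230
se_hat = sqrt(p_hat * (1 - p_hat) / n)           # about 0.0496 (estimated SE, for the CI)
wald = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)   # about (0.463, 0.657)
h = abs(2 * asin(sqrt(p_hat)) - 2 * asin(sqrt(p0)))     # about 0.12
```

Note that the test uses SE₀ (under the null) while the Wald interval uses SÊ (from the data), which is why the two standard errors differ slightly.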
If your sample came from a finite population without replacement (say, drawing 100 items from a lot of N = 2000),
you could apply the FPC by multiplying SE₀ by √[(N − n)/(N − 1)]. This slightly narrows the standard error and
can affect z at high sampling fractions.
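The correction factor itself is a one‑liner (function name is illustrative):

```python
from math import sqrt

def fpc(N: int, n: int) -> float:
    """Finite population correction factor for sampling without replacement."""
    return sqrt((N - n) / (N - 1))

# Example: drawing n = 100 items from a lot of N = 2000
factor = fpc(2000, 100)   # about 0.975, so SE0 shrinks by roughly 2.5%
```

The factor approaches 1 as N grows relative to n, which is why the FPC is routinely ignored for large populations.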
Power and planning
Before collecting data, you can assess power: the probability of detecting a true difference
when the real proportion is p₁. For a given n, α, and tail, power increases as the
difference between p₁ and p₀ grows. Conversely, you can solve for the required sample size to
achieve a target power (e.g., 80%) for a practically meaningful effect. These calculations use the normal
approximation and should be checked with exact methods for small samples.
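A normal‑approximation power calculation for the two‑sided test can be sketched as follows (the function name is illustrative; it accounts for the different standard errors under H₀ and under the alternative):

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(n: int, p0: float, p1: float, alpha: float = 0.05) -> float:
    """Approximate power of the two-sided one-sample proportion z-test
    when the true proportion is p1 (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)     # e.g. ~1.96 at alpha = 0.05
    se0 = sqrt(p0 * (1 - p0) / n)            # SE under H0 (defines the critical region)
    se1 = sqrt(p1 * (1 - p1) / n)            # SE under the alternative
    delta = p1 - p0
    upper = 1 - norm.cdf((z_crit * se0 - delta) / se1)    # reject in the upper tail
    lower = norm.cdf((-z_crit * se0 - delta) / se1)       # reject in the lower tail
    return upper + lower
```

For the worked example (p₀ = 0.50, true p₁ = 0.56, n = 100) the approximate power is only about 0.22, which helps explain the non‑significant result; quadrupling n raises it substantially.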
Common pitfalls and good practice
- Sampling bias: Random, independent sampling matters. Non‑random samples undermine inference.
- P‑hacking and multiplicity: If you run multiple tests, adjust α (e.g., Bonferroni) or control the false discovery rate.
- Over‑reliance on Wald intervals: Prefer Wilson or exact methods when in doubt.
- Reporting only significance: Always include an effect size and a confidence interval.
- Choosing the tail after the fact: Pick your alternative based on the research question before seeing the data.
- Ignoring discreteness: For small n, discreteness matters—use exact tests and interpret carefully.
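The Bonferroni adjustment mentioned above is simple enough to state in code (a sketch; the function name is illustrative):

```python
def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test significance level under a Bonferroni correction for m tests."""
    return alpha / m

# Example: 5 tests at a family-wise alpha of 0.05
# are each judged at bonferroni_alpha(0.05, 5) = 0.01
```

Bonferroni controls the family‑wise error rate but is conservative; FDR‑controlling procedures trade some of that strictness for power when many tests are run.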
Summary
The one‑sample proportion z‑test provides a fast, interpretable way to compare an observed proportion to a benchmark.
Compute z using the null standard error (with optional FPC), choose the tail to match the research question,
and translate the result to a p‑value. Supplement testing with confidence intervals—ideally Wilson or exact—and an effect
size such as Cohen’s h. When normal approximations are shaky, fall back to the exact binomial test. For planning,
evaluate power and required sample sizes to ensure your study can detect practically meaningful effects.
Glossary: Φ is the standard normal CDF; SE is standard error; FPC is finite population correction.