Logistic Regression Sample Size Planning
Why sample size matters
Logistic regression is used when the outcome is binary: disease, success, default, failure, or any event with two states. An undersized sample makes odds ratio estimates unstable, widens confidence intervals, and degrades calibration. Planning the sample size before data collection starts reduces that risk.
What this calculator checks
This calculator combines two common planning views. The first estimates the number of records needed to detect a given odds ratio for a binary predictor, using the baseline event risk, the expected odds ratio, alpha, power, and exposure balance. The second applies the events-per-variable (EPV) rule, which asks whether the expected number of events can support the planned number of predictors.
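The first view can be sketched with a standard two-proportion approximation. This is an illustration only: the page does not state the calculator's exact formula, so the unpooled-variance Wald formula below is an assumption, and the function names (exposed_risk, n_for_odds_ratio) and default values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def exposed_risk(p0: float, odds_ratio: float) -> float:
    """Event risk in the exposed group implied by baseline risk p0 and the odds ratio."""
    odds1 = odds_ratio * p0 / (1.0 - p0)
    return odds1 / (1.0 + odds1)


def n_for_odds_ratio(p0, odds_ratio, alpha=0.05, power=0.80, prop_exposed=0.5):
    """Total records for a two-sided test of two proportions (unpooled variance).

    p0           -- event risk in the unexposed group
    prop_exposed -- share of records with the binary predictor equal to 1
    Assumed formula, not necessarily the one the calculator uses.
    """
    p1 = exposed_risk(p0, odds_ratio)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    # Variance of the risk difference per unit of total sample size.
    variance = p0 * (1.0 - p0) / (1.0 - prop_exposed) + p1 * (1.0 - p1) / prop_exposed
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2)


if __name__ == "__main__":
    print(n_for_odds_ratio(p0=0.20, odds_ratio=2.0))
```

Moving prop_exposed away from 0.5 or shrinking the odds ratio toward 1 raises the required total, which is why exposure balance appears among the inputs.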
Advanced assumptions
Logistic models often include correlated predictors. When the predictor of interest is correlated with the other covariates, the covariate R squared value inflates the power-based sample size. Clustered data or weighted surveys may need a design effect. Attrition also matters, because not every enrolled record becomes a complete case. The final sample is inflated for missing data so that the usable sample stays near the target.
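A minimal sketch of how these three adjustments are commonly chained, assuming the usual variance inflation factor 1 / (1 - R squared), a multiplicative design effect, and division by the expected retention rate. The function name inflate_sample and the order of the steps are illustrative assumptions, not the calculator's documented internals.

```python
from math import ceil


def inflate_sample(n_power, r_squared=0.0, design_effect=1.0, attrition=0.0):
    """Return (complete_case_n, enrolled_n) after the three adjustments.

    r_squared     -- squared multiple correlation of the predictor of
                     interest with the other covariates
    design_effect -- multiplier for clustering or survey weighting
    attrition     -- expected share of enrolled records lost to dropout
                     or missing data
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    if design_effect <= 0.0:
        raise ValueError("design_effect must be positive")
    if not 0.0 <= attrition < 1.0:
        raise ValueError("attrition must be in [0, 1)")
    # Usable records needed once correlation and clustering are accounted for.
    complete_case = ceil(n_power / (1.0 - r_squared) * design_effect)
    # Records to enroll so that the usable count survives expected losses.
    enrolled = ceil(complete_case / (1.0 - attrition))
    return complete_case, enrolled


if __name__ == "__main__":
    print(inflate_sample(338, r_squared=0.2, design_effect=1.5, attrition=0.1))
```

Dividing by (1 - attrition) rather than multiplying by (1 + attrition) keeps the expected number of complete cases at or above the target.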
Interpreting results
The complete-case sample is the number of usable records needed after losses. The enrolled sample is the larger number you should recruit or extract to reach it. Expected events show whether the model has enough outcome information. Expected non-events matter too: when either outcome is rare, the rarer one limits how well the model can separate risk groups.
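The event and non-event counts, and the events-per-variable ratio they imply, can be sketched as follows. The function name outcome_counts is hypothetical, and the threshold of 10 used in the example is only a commonly cited rule of thumb, treated here as an adjustable assumption. The ratio uses the smaller of the two outcome counts.

```python
def outcome_counts(complete_case_n, overall_event_rate, n_predictors):
    """Expected events, non-events, and events per variable (EPV).

    overall_event_rate -- expected share of complete cases with the event
    n_predictors       -- number of candidate model parameters
    """
    events = complete_case_n * overall_event_rate
    non_events = complete_case_n - events
    # The rarer outcome carries the limiting amount of information.
    limiting = min(events, non_events)
    return {
        "events": events,
        "non_events": non_events,
        "epv": limiting / n_predictors,
    }


if __name__ == "__main__":
    result = outcome_counts(complete_case_n=634, overall_event_rate=0.25, n_predictors=8)
    min_epv = 10  # assumed rule-of-thumb threshold; adjust to the study's standard
    print(result, "meets threshold:", result["epv"] >= min_epv)
```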
Good study practice
Use realistic values from pilot work, published studies, or registry summaries. Run sensitivity checks with smaller odds ratios and lower event rates. Document every assumption in your protocol. If the study is clinical, regulatory, or high cost, ask a statistician to review the design. This page gives a planning estimate, not a final guarantee.
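A sensitivity check of this kind can be sketched as a small grid over baseline risks and odds ratios. The sample size function is the same assumed two-proportion approximation (unpooled variance, two-sided test), not the calculator's confirmed formula, and the grid values and the names total_n and sensitivity_grid are placeholders to replace with the study's own scenarios.

```python
from itertools import product
from math import ceil
from statistics import NormalDist


def total_n(p0, odds_ratio, alpha=0.05, power=0.80, prop_exposed=0.5):
    """Total records under an assumed two-proportion approximation."""
    odds1 = odds_ratio * p0 / (1.0 - p0)
    p1 = odds1 / (1.0 + odds1)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    variance = p0 * (1.0 - p0) / (1.0 - prop_exposed) + p1 * (1.0 - p1) / prop_exposed
    return ceil(z ** 2 * variance / (p1 - p0) ** 2)


def sensitivity_grid(baseline_risks, odds_ratios):
    """One row (baseline risk, odds ratio, total n) per scenario."""
    return [(p0, oratio, total_n(p0, oratio))
            for p0, oratio in product(baseline_risks, odds_ratios)]


# Placeholder scenarios: the planning values first, then more pessimistic ones.
grid = sensitivity_grid([0.20, 0.15, 0.10], [2.0, 1.75, 1.5])
for p0, oratio, n in grid:
    print(f"baseline risk {p0:.2f}  odds ratio {oratio:.2f}  total n {n}")
```

Keeping the whole grid in the protocol, rather than only the most favorable row, records how sensitive the target is to each assumption.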
Using results with R
Researchers may later confirm the estimate with simulation in R. Simulations can include nonlinearity, interaction terms, unbalanced sampling, and planned exclusions. They can also test convergence rates. Use the calculator output as the first scenario. Then vary each assumption. Save the table of runs. A transparent range is more useful than one optimistic number. Conservative planning protects power and improves model reliability. It also helps reviewers understand why the target sample was chosen before data collection starts.
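The simulation idea can be sketched for the simplest case, a single binary predictor. The text suggests R; this sketch uses the Python standard library only, and the same loop ports directly to R with glm. It relies on the fact that, with one binary predictor, the logistic regression estimate is the log odds ratio of the 2x2 table and its Wald standard error is sqrt(1/a + 1/b + 1/c + 1/d). The function name simulate_power, the seed, and the number of replicates are illustrative assumptions. Samples with an empty cell are counted separately, as a stand-in for fits that would fail to converge.

```python
import random
from math import log, sqrt
from statistics import NormalDist


def simulate_power(n, p0, odds_ratio, alpha=0.05, prop_exposed=0.5,
                   n_sims=2000, seed=1):
    """Return (empirical power, share of samples with an empty 2x2 cell)."""
    rng = random.Random(seed)
    odds1 = odds_ratio * p0 / (1.0 - p0)
    p1 = odds1 / (1.0 + odds1)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    inestimable = 0
    for _sim in range(n_sims):
        # a, b: exposed events / non-events; c, d: unexposed events / non-events
        a = b = c = d = 0
        for _rec in range(n):
            exposed = rng.random() < prop_exposed
            event = rng.random() < (p1 if exposed else p0)
            if exposed:
                if event:
                    a += 1
                else:
                    b += 1
            elif event:
                c += 1
            else:
                d += 1
        if min(a, b, c, d) == 0:
            # Odds ratio not estimable; counted as a non-rejection.
            inestimable += 1
            continue
        z_stat = log(a * d / (b * c)) / sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / n_sims, inestimable / n_sims


if __name__ == "__main__":
    print(simulate_power(n=338, p0=0.20, odds_ratio=2.0))
```

Running the function once per scenario and saving the returned pairs gives the table of runs. Extending it to nonlinearity, interactions, or planned exclusions requires fitting a full model in each replicate, which this sketch does not attempt.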