## Calculator inputs
Use the statistical fields for confidence planning and the practical fields for workflow constraints.
## Example data table
| Scenario | Dataset Size (N) | Confidence | Margin of Error | Rare-Class Rate | Recommended Test Size | Validation | Training |
|---|---|---|---|---|---|---|---|
| Balanced classifier audit | 10,000 | 95% | 3% | 12% | 965 | 1,000 | 8,035 |
| Rare event detector | 2,500 | 95% | 5% | 8% | 500 | 250 | 1,750 |
| Large production benchmark | 120,000 | 99% | 2% | 4% | 4,024 | 12,000 | 103,976 |
## Formula used
The calculator first estimates the test sample size for a proportion using the base formula n0 = Z^2 * p * (1 - p) / e^2, then applies the finite population correction n = n0 / (1 + (n0 - 1) / N), multiplies by the design effect for clustered data, and finally checks rare-class coverage.
- Z is the standard score for your confidence level.
- p is the expected model success rate or target proportion.
- e is the acceptable margin of error.
- N is the total labeled dataset size.
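The sizing pipeline described above can be sketched in Python. Function and parameter names here are illustrative, not the calculator's actual internals; Z is derived two-sided from the standard normal distribution.

```python
from math import ceil
from statistics import NormalDist

def recommended_test_size(N, confidence, margin, p=0.5, deff=1.0,
                          rare_rate=None, min_rare=None):
    """Illustrative sketch of the four-step sizing pipeline."""
    # 1. Infinite-population sample size for a proportion.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided z-score
    n0 = z * z * p * (1 - p) / (margin ** 2)
    # 2. Finite population correction shrinks n when N is not huge.
    n = n0 / (1 + (n0 - 1) / N)
    # 3. Design effect inflates n for clustered / correlated records.
    n *= deff
    # 4. Rare-class coverage may enlarge the test set further.
    if rare_rate and min_rare:
        n = max(n, min_rare / rare_rate)
    return ceil(n)

# Balanced classifier audit row: 10,000 records, 95% confidence, 3% margin.
print(recommended_test_size(10_000, 0.95, 0.03))  # → 965
```

The result matches the first table row, where the 50% default proportion and finite population correction together yield 965.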
## How to use this calculator
- Enter the total labeled records available for training, validation, and testing.
- Select a confidence level and target margin of error for your evaluation.
- Set the expected proportion. Use 50% when uncertainty is highest.
- Apply a design effect above 1 when data are clustered or less independent.
- Reserve a validation share and minimum training share that match your workflow.
- Add rare-class rate and minimum rare examples when minority coverage matters.
- Submit the form and review the recommended test size, split shares, and achieved margin.
- Download the result as CSV or PDF for planning notes, model cards, or experiment logs.
## Frequently asked questions
1. Why not always use a 20% test split?
A fixed percentage ignores confidence goals, finite population effects, and class imbalance. This calculator sizes the test set from statistical targets first, then checks practical training constraints.
2. What does the expected proportion mean?
It represents the metric or class proportion you expect to measure, such as accuracy or positive prediction rate. A 50% value is conservative because it produces the largest required sample size.
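Why 50% is the conservative choice can be checked directly: the base term Z^2 * p * (1 - p) / e^2 peaks at p = 0.5. A minimal sketch, assuming 95% confidence and a 3% margin:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # 95% confidence, two-sided
e = 0.03
for p in (0.3, 0.5, 0.7):
    n0 = z * z * p * (1 - p) / e ** 2
    print(f"p = {p:.1f} -> base sample size {n0:.0f}")
```

The symmetric values 0.3 and 0.7 give identical, smaller requirements, while p = 0.5 maximizes the needed sample.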
3. When should I increase the design effect?
Raise it when records are correlated, clustered, duplicated, or drawn from repeated sessions. A higher design effect inflates the test size to compensate for lower effective independence.
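The inflation itself is a simple multiplier on the statistically sized test set. A sketch, starting from the 965-record recommendation in the first table row:

```python
from math import ceil

n = 965  # statistically sized test set, assuming independent records
for deff in (1.0, 1.5, 2.0):
    print(f"design effect {deff:.1f} -> inflated test size {ceil(n * deff)}")
```

A design effect of 2.0 roughly says each correlated record carries half the information of an independent one, so twice as many are needed.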
4. Why is finite population correction important?
When the full labeled dataset is not huge, sampling many records reduces uncertainty faster. Finite population correction lowers the needed sample size compared with infinite-population formulas.
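The effect is easy to see by holding the infinite-population size fixed and varying N. A sketch using the correction n = n0 / (1 + (n0 - 1) / N), with n0 = 1067 (95% confidence, 3% margin, p = 0.5):

```python
def fpc(n0, N):
    """Finite population correction applied to an infinite-population size n0."""
    return n0 / (1 + (n0 - 1) / N)

n0 = 1067  # 95% confidence, 3% margin, p = 0.5
for N in (2_000, 10_000, 1_000_000):
    print(f"N = {N:>9,} -> corrected test size {fpc(n0, N):.0f}")
```

The smaller the labeled pool, the larger the fraction each sampled record covers, and the fewer records the test set needs.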
5. How does rare class coverage affect the answer?
If you need a minimum number of minority examples, the calculator may enlarge the test set beyond the statistical minimum. This helps make per-class evaluation more stable and interpretable.
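The enlargement rule can be sketched as a floor on the minority count. In the rare event detector row, the statistical minimum after finite population correction is about 334; assuming a 40-example minority floor (a hypothetical value, not stated in the table), the 8% rare rate pushes the test set to 500:

```python
from math import ceil

def with_rare_coverage(n_stat, rare_rate, min_rare):
    """Enlarge the statistically sized test set until minority coverage is met."""
    expected = n_stat * rare_rate          # rare examples at the statistical minimum
    if expected >= min_rare:
        return n_stat                      # coverage already satisfied
    return ceil(min_rare / rare_rate)      # grow until the floor is reached

print(with_rare_coverage(334, 0.08, 40))  # → 500
```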
6. What if the recommendation is capped?
A cap means your training and validation requirements leave less room than the statistical recommendation requested. You may need a larger dataset, looser precision, or smaller reserved shares.
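The cap is just the room left after the reserved shares are set aside. A minimal sketch with hypothetical numbers (5,000 records, 10% validation, 80% minimum training):

```python
def capped_test_size(recommended, N, val_share, min_train_share):
    """Cap the statistical recommendation by the room the reserved splits leave."""
    cap = N - round(N * val_share) - round(N * min_train_share)
    return min(recommended, max(cap, 0))

print(capped_test_size(965, 5_000, 0.10, 0.80))  # cap = 500, so the test set is capped
```

When the cap binds, the achieved margin of error will be wider than the one you requested.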
7. Should validation come from the training data instead?
For many workflows, yes. This calculator treats validation as a separate reserved share because many teams plan dataset partitions explicitly before model iteration begins.
8. Can I use this for regression models?
Yes, as a planning approximation. Use an expected proportion or normalized success estimate as a proxy, then validate the split with domain-specific error analysis afterward.