Repeated K Fold Calculator

Test model stability across repeated folds quickly. Review scores, uncertainty, sample splits, and total runtime. Plan fairer experiments using transparent metrics and visual summaries.

Calculator Inputs

If you provide raw repeated fold scores, the calculator derives mean, standard deviation, standard error, and confidence interval from them.

Example Data Table

This example shows a small repeated evaluation log for a model tested with 3 folds and 2 repeats on 900 samples.

Repeat | Fold | Training Samples | Validation Samples | Accuracy | Fit Time (min)
1      | 1    | 600              | 300                | 0.842    | 2.7
1      | 2    | 600              | 300                | 0.851    | 2.6
1      | 3    | 600              | 300                | 0.838    | 2.5
2      | 1    | 600              | 300                | 0.847    | 2.7
2      | 2    | 600              | 300                | 0.856    | 2.8
2      | 3    | 600              | 300                | 0.844    | 2.6
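The statistics the calculator derives from raw scores can be sketched directly from the six accuracy values in the table above, using only the standard library (a 95% interval is assumed, so z = 1.96):

```python
import math

# Accuracy scores from the example table (3 folds x 2 repeats on 900 samples)
scores = [0.842, 0.851, 0.838, 0.847, 0.856, 0.844]

m = len(scores)
mean = sum(scores) / m
# Sample standard deviation: divide the squared deviations by (m - 1)
std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (m - 1))
# Standard error of the mean
se = std / math.sqrt(m)
# 95% confidence interval with z = 1.96
z = 1.96
ci = (mean - z * se, mean + z * se)

print(f"mean={mean:.4f} std={std:.4f} SE={se:.4f}")
print(f"95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

With these six values the mean lands near 0.846 with a fairly tight interval, which is what a stable model across repeats should look like.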

Formula Used

Total fits
total fits = k × repeats
Average validation size per fit
validation size = dataset size ÷ k
Average training size per fit
training size = dataset size − validation size
Mean score
mean = (sum of repeated fold scores) ÷ number of scores
Sample standard deviation
std = √[ Σ(score − mean)² ÷ (m − 1) ]
Standard error
SE = std ÷ √m
Confidence interval
CI = mean ± z × SE
Estimated runtime
runtime = total fits × (training time per fit + scoring time per fit)

When dataset size is not perfectly divisible by k, the calculator reports average fold sizes. Real folds may differ by one sample.
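One common convention (also used by scikit-learn's KFold) is to give the first `n mod k` folds one extra sample, so real folds differ by at most one. A minimal sketch for 1000 samples and 3 folds:

```python
# Actual fold sizes when n is not divisible by k:
# the first (n mod k) folds each receive one extra sample.
n, k = 1000, 3
base, extra = divmod(n, k)  # base size per fold, leftover samples
fold_sizes = [base + 1 if i < extra else base for i in range(k)]

print(fold_sizes)            # folds differ by at most one sample
print(sum(fold_sizes) == n)  # every sample is assigned exactly once
```

The calculator's reported average (n ÷ k ≈ 333.3 here) sits between the actual sizes of 334 and 333.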

How to Use This Calculator

  1. Enter the dataset size, number of folds, repeats, and class count.
  2. Choose the evaluation label and metric name you want to track.
  3. Either enter mean and standard deviation, or paste raw repeated fold scores.
  4. Add timing inputs to estimate total experiment runtime, then submit the form.

Frequently Asked Questions

1) What does repeated k fold measure?

It measures model performance stability by running k fold cross validation several times with different data shuffles. This reduces luck from a single split and gives a stronger estimate of generalization.
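The reshuffle-then-split procedure can be sketched in pure Python; the helper name `repeated_kfold_indices` and the fixed seed are illustrative choices, not part of the calculator:

```python
import random

def repeated_kfold_indices(n, k, repeats, seed=0):
    """Yield (repeat, fold, train_idx, val_idx) for repeated k fold CV.

    Each repeat reshuffles the sample indices before cutting them
    into k folds, so every repeat sees a different partition.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        base, extra = divmod(n, k)
        start = 0
        for f in range(k):
            size = base + (1 if f < extra else 0)
            val = idx[start:start + size]
            train = idx[:start] + idx[start + size:]
            yield r, f, train, val
            start += size

# 900 samples, 3 folds, 2 repeats -> 6 fits of 600 train / 300 validation,
# matching the example table above
for r, f, train, val in repeated_kfold_indices(900, 3, 2):
    print(r, f, len(train), len(val))
```

In practice a library routine such as scikit-learn's RepeatedKFold does the same job; the sketch just makes the shuffle-per-repeat logic explicit.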

2) Why repeat the folds?

Repeating the folds exposes the model to many train and validation arrangements. That usually lowers dependence on one favorable split and makes score uncertainty easier to quantify.

3) When should I paste raw score values?

Paste raw scores when you already have fold results from an experiment log. The calculator then derives the mean, standard deviation, standard error, and confidence interval directly from observed values.

4) What is a good number of folds?

Five or ten folds are common. Smaller datasets often benefit from higher k, but runtime grows because each additional fold means more model fits.

5) What does the confidence interval tell me?

It shows the uncertainty around the average validation score. A narrower interval suggests the repeated estimate is more precise and less sensitive to sample partitioning.

6) Does repeated k fold replace a final test set?

No. It improves model selection and performance estimation, but a clean untouched test set is still valuable for a final unbiased check after tuning.

7) Why does runtime increase quickly?

Because the total number of fits equals folds multiplied by repeats. A 10-fold setup repeated 8 times requires 80 separate training and scoring cycles.
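The multiplication is easy to verify with hypothetical per-fit timings (the 2.5 and 0.2 minute figures below are illustrative, not measured):

```python
# Runtime estimate for a 10-fold setup repeated 8 times
k, repeats = 10, 8
fit_minutes, score_minutes = 2.5, 0.2  # hypothetical per-fit timings

total_fits = k * repeats
runtime = total_fits * (fit_minutes + score_minutes)

print(f"total fits: {total_fits}, estimated runtime: {runtime:.1f} min")
```

Doubling either the fold count or the repeat count doubles the total fits, which is why runtime grows quickly.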

8) Can this work for metrics other than accuracy?

Yes. You can label the score as accuracy, F1, AUC, recall, precision, or any other metric, as long as the values represent comparable repeated fold results.

Related Calculators

stratified split, nested cross validation, train set size, cross validation split, k fold split, train validation split, blocked cross validation, bootstrap split, test set size

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.