Analyze fold composition, validation exposure, training workload, and confidence intervals using customizable class proportions and settings. Build reliable evaluation plans before running expensive models.
The table below shows three example datasets with class proportions commonly used to test repeated stratified evaluation behavior.
| Dataset | Total Samples | Classes | Class Split | Folds | Repeats | Mean Score | Std Dev |
|---|---|---|---|---|---|---|---|
| Customer Churn Model | 1200 | 3 | 50%, 30%, 20% | 5 | 10 | 0.842 | 0.031 |
| Fraud Screening Model | 5000 | 2 | 92%, 8% | 5 | 8 | 0.914 | 0.024 |
| Medical Triage Model | 2400 | 4 | 40%, 25%, 20%, 15% | 6 | 5 | 0.801 | 0.041 |
1. Total model fits
Total Fits = K Folds × Repeats
2. Validation fraction per fit
Validation Fraction = 1 ÷ K Folds
3. Training fraction per fit
Training Fraction = (K Folds − 1) ÷ K Folds
4. Approximate class count
Class Count = Total Samples × (Class Proportion ÷ 100)
5. Validation exposures across all repeats
Total Validation Exposures = Total Samples × Repeats
6. Training exposures across all repeats
Total Training Exposures = Total Samples × (K Folds − 1) × Repeats
7. Standard error of the mean score
SE = Score Standard Deviation ÷ √(Total Fits)
8. Confidence interval
Confidence Interval = Mean Score ± Z × SE
9. Runtime estimate
Runtime Minutes = Total Fits × Average Training Minutes per Fit
10. Stability score
Stability Score = (1 − |Standard Deviation ÷ Mean Score|) × 100
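The formulas above can be sketched in Python using the Customer Churn row from the example table (1200 samples, 5 folds, 10 repeats, mean 0.842, standard deviation 0.031). The Z value of 1.96 for a 95% interval and the 2-minute training time per fit are illustrative assumptions, since the page lets you choose both:

```python
import math

# Inputs taken from the Customer Churn example row
total_samples = 1200
k_folds = 5
repeats = 10
mean_score = 0.842
std_dev = 0.031
z = 1.96                     # assumed 95% confidence level
avg_minutes_per_fit = 2.0    # hypothetical training time per fit

total_fits = k_folds * repeats                          # 1. total model fits
val_fraction = 1 / k_folds                              # 2. validation fraction per fit
train_fraction = (k_folds - 1) / k_folds                # 3. training fraction per fit
total_val_exposures = total_samples * repeats           # 5. validation exposures
total_train_exposures = total_samples * (k_folds - 1) * repeats  # 6. training exposures
se = std_dev / math.sqrt(total_fits)                    # 7. standard error of the mean
ci_low = mean_score - z * se                            # 8. confidence interval bounds
ci_high = mean_score + z * se
runtime_minutes = total_fits * avg_minutes_per_fit      # 9. runtime estimate
stability = (1 - abs(std_dev / mean_score)) * 100       # 10. stability score

print(total_fits, round(se, 4), round(stability, 1))    # → 50 0.0044 96.3
```

With these inputs, 50 fits cost roughly 100 minutes, each sample is validated 10 times and trained on 40 times, and the 95% interval spans about ±0.009 around the mean.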
The calculator distributes each class across folds as evenly as possible. Extra samples are assigned one by one to the earliest folds.
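A minimal sketch of that distribution rule, using the 1200-sample, 50% / 30% / 20% example (which happens to split evenly across 5 folds) plus a hypothetical uneven count to show the remainder handling:

```python
def distribute_class(count: int, k_folds: int) -> list[int]:
    """Split one class's sample count across folds as evenly as possible,
    assigning leftover samples one by one to the earliest folds."""
    base, extra = divmod(count, k_folds)
    return [base + 1 if i < extra else base for i in range(k_folds)]

# Approximate class counts for 1200 samples at 50% / 30% / 20%
counts = [round(1200 * p / 100) for p in (50, 30, 20)]   # [600, 360, 240]
per_fold = [distribute_class(c, 5) for c in counts]
# All three divide evenly by 5: [120]*5, [72]*5, [48]*5

distribute_class(247, 5)   # hypothetical uneven count → [50, 50, 49, 49, 49]
```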
Enter the total dataset size and the number of classes first. Add class labels and class proportions as comma-separated values.
Choose the number of folds and repeats. Larger repeat counts improve stability estimates but increase total model fits and runtime.
Provide a score mean and standard deviation from previous experiments or pilot runs. These values drive the confidence interval and stability score.
Optionally enter a random seed and average training minutes per fit. The calculator uses them for planning reproducibility and compute time.
Press the calculate button. The result section appears below the header and above the form, showing fold composition, exposures, runtime, interval estimates, and charts.
Use the CSV button to export tables for documentation. Use the PDF button to save a print-friendly version of the page.
Repeated stratified k-fold cross validation measures model performance across many balanced train and validation splits. Stratification preserves class proportions, while repeats reduce dependence on any one random partition.
Stratification keeps minority classes represented in every fold. That makes validation more reliable, especially for fraud, diagnosis, churn, and other skewed classification tasks.
Use more repeats when scores vary strongly between splits or when datasets are small. Common settings are 5 to 10 repeats, but expensive models may need fewer.
Not directly. Repeated stratified k-fold is designed for classification because it preserves the label distribution. Standard repeated k-fold is usually better for regression tasks.
The stability score is a quick planning indicator derived from the relative score spread. Higher values suggest more consistent cross-validation results, though it should not replace full statistical analysis.
Fold sizes can differ slightly because class counts are whole numbers, so perfectly equal splits are not always possible. The calculator distributes leftover samples across folds as evenly as possible.
No. The confidence interval only summarizes uncertainty around the supplied mean score using repeated-split variability. External validation and careful experiment design are still essential.
If any class has fewer samples than the number of folds, repeated stratified k-fold becomes invalid, because each fold must receive at least one sample from every class. Reduce the number of folds or gather more data.
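That constraint can be checked up front with a small helper (a sketch; the class counts below are illustrative):

```python
def stratified_folds_feasible(class_counts: list[int], k_folds: int) -> bool:
    """Stratified k-fold requires every class to place at least one sample
    in each fold, i.e. each class count must be at least k_folds."""
    return all(count >= k_folds for count in class_counts)

stratified_folds_feasible([600, 360, 240], 5)   # True: every class covers 5 folds
stratified_folds_feasible([46, 4], 5)           # False: minority class of 4 < 5 folds
```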
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.