Map fold distributions and external test splits. Review reuse, leakage risk, and coverage before training. Choose smarter validation settings for dependable machine learning experiments.
This table shows how different settings change the number of training and validation rows per fit.
| Scenario | Total Samples | Holdout % | CV Pool | Method | Folds / Repeats | Train per Fit | Validation per Fit |
|---|---|---|---|---|---|---|---|
| Balanced tabular model | 1,000 | 20% | 800 | K-Fold | 5 / 1 | 640 | 160 |
| Imbalanced binary classifier | 2,400 | 10% | 2,160 | Stratified K-Fold | 6 / 1 | 1,800 | 360 |
| Variance reduction study | 1,200 | 15% | 1,020 | Repeated K-Fold | 5 / 3 | 816 | 204 |
| Ordered forecasting series | 600 | 0% | 600 | Time-Series Split | 4 / 1 | 120 to 480 | 120 |
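The fixed-fold rows in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch (the function name `split_sizes` is illustrative, not part of the calculator) that applies the holdout and fold settings from each row:

```python
def split_sizes(total, holdout_pct, k):
    """Per-fit train/validation row counts for a K-Fold plan with an external holdout."""
    test = round(total * holdout_pct / 100)  # external holdout rows
    pool = total - test                      # cross-validation pool
    val = pool // k                          # validation rows per fold
    train = pool - val                       # training rows per fold
    return train, val

print(split_sizes(1000, 20, 5))  # balanced tabular row -> (640, 160)
print(split_sizes(2400, 10, 6))  # imbalanced binary row -> (1800, 360)
print(split_sizes(1200, 15, 5))  # variance reduction row -> (816, 204)
```

The time-series row does not fit this helper because its training window grows per split, as the formulas below show.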
1) External holdout:
External Test Samples = round(Total Samples × Holdout % ÷ 100)
2) Cross-validation pool:
CV Pool = Total Samples − External Test Samples
3) Standard K-Fold validation size:
Validation per Fold ≈ CV Pool ÷ K
4) Standard K-Fold training size:
Training per Fold = CV Pool − Validation per Fold
5) Repeated K-Fold fits:
Total Fits = K × Repeats
6) Leave-One-Out:
Validation per Split = 1, Training per Split = CV Pool − 1
7) Time-series default test window:
Test Window = floor(CV Pool ÷ (Splits + 1))
8) Time-series growing train window:
Base Train Window = (Split Index × Test Window) + Remainder, where Remainder = CV Pool mod (Splits + 1)
9) Time-series gap adjustment:
Effective Train Window = Base Train Window − Gap Size
10) Stratified approximation:
Minority Fold Count ≈ Minority Samples ÷ K
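Formulas 7 through 9 can be combined into one routine. This sketch (the name `time_series_windows` is illustrative) returns the (train, validation) row counts for each ordered split:

```python
def time_series_windows(pool, splits, gap=0):
    """Growing-window sizes per split: formulas 7 (test window),
    8 (growing train window), and 9 (gap adjustment)."""
    test_window = pool // (splits + 1)               # 7) default test window
    remainder = pool - test_window * (splits + 1)    # leftover rows join the first train window
    windows = []
    for i in range(1, splits + 1):
        base_train = i * test_window + remainder     # 8) growing train window
        train = base_train - gap                     # 9) gap adjustment
        windows.append((train, test_window))
    return windows

print(time_series_windows(600, 4))
# the "Ordered forecasting series" row: train grows 120 -> 480, validation stays 120
```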
This calculator focuses on split sizing, repetition load, reuse intensity, and leakage-aware planning. It does not train models or estimate actual performance metrics.
Cross validation divides available modeling data into repeated train and validation subsets. It estimates generalization more reliably than one random split because every sample gets evaluated across multiple iterations.
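The rotation described above can be sketched in plain Python. This is an illustrative index generator, not the calculator's internal code; any remainder rows are spread across the first folds:

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists; every sample lands in
    exactly one validation fold across the k iterations."""
    fold_size, remainder = divmod(n, k)
    start = 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)  # spread remainder rows
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

With n=10 and k=5, each fold validates 2 samples and trains on the other 8.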
Use stratified splitting when your target classes are imbalanced. It keeps each fold closer to the overall class ratio, reducing unstable validation scores caused by missing minority examples.
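One way to approximate stratification is to assign each class's samples to folds in round-robin order, so every fold receives a near-equal share of each class. A minimal sketch (the helper name is illustrative):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds so class ratios stay roughly even."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # round-robin within each class
    return folds

# 8 majority and 4 minority samples, 4 folds: each fold keeps 1 minority row
print(stratified_folds([0] * 8 + [1] * 4, 4))
```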
An external test set stays untouched until final evaluation. Cross validation helps tune models, while the holdout test set provides a cleaner estimate of real-world performance.
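Carving off the holdout first, before any fold logic runs, is what keeps it untouched. A sketch of that order of operations (the function name and seed are illustrative):

```python
import random

def carve_holdout(n, holdout_pct, seed=0):
    """Shuffle once, set aside the external test rows first,
    and cross-validate only on the remaining pool."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * holdout_pct / 100)
    return indices[n_test:], indices[:n_test]  # (cv_pool, external_test)

pool, test = carve_holdout(1000, 20)
print(len(pool), len(test))  # 800 rows to cross-validate, 200 held out
```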
More folds usually lower bias but increase runtime. Five or ten folds are common because they balance stability, training size, and computational cost for many datasets.
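The runtime cost of extra folds is easy to quantify: k fits, each on roughly pool × (k−1)/k rows. A quick illustrative comparison on an 800-row pool:

```python
def total_training_rows(pool, k):
    """Total rows processed across all fits: k fits, each training on pool - pool//k rows."""
    return k * (pool - pool // k)

# Doubling the folds more than doubles the training work:
print(total_training_rows(800, 5))   # 3200 rows processed across 5 fits
print(total_training_rows(800, 10))  # 7200 rows processed across 10 fits
```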
Repeated k-fold reruns shuffled folds multiple times. It smooths random variation and can produce more stable average metrics, especially on modest datasets.
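Repetition simply reshuffles before each pass, yielding K × Repeats fits in total. A sketch assuming the pool divides evenly by k (names are illustrative):

```python
import random

def repeated_k_fold(n, k, repeats, seed=0):
    """Reshuffle before each repeat, then yield k (train, validation) fits
    per repeat: k * repeats fits in total (formula 5)."""
    rng = random.Random(seed)
    fold = n // k
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for i in range(k):
            val = order[i * fold:(i + 1) * fold]
            train = order[:i * fold] + order[(i + 1) * fold:]
            yield train, val

# The variance-reduction row above: 1,020-row pool, 5 folds x 3 repeats = 15 fits
fits = list(repeated_k_fold(1020, 5, 3))
print(len(fits))
```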
Time series data should preserve chronology. Use time-series splitting when future rows must never influence earlier training windows; otherwise, leakage can make results look unrealistically strong.
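Chronology is preserved by never shuffling: every training index precedes every validation index, and an optional gap drops the rows closest to the boundary. An illustrative sketch:

```python
def time_series_split(n, n_splits, gap=0):
    """Ordered splits over indices 0..n-1: training rows always come
    before validation rows, with an optional gap between them."""
    test_size = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = n - (n_splits - i + 1) * test_size - gap
        val_start = train_end + gap
        yield list(range(train_end)), list(range(val_start, val_start + test_size))

for train, val in time_series_split(600, 4):
    print(len(train), "train rows before", len(val), "validation rows")
```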
Group k-fold keeps related records together in the same fold. It is useful for user-level, patient-level, or device-level datasets where grouped leakage would inflate performance.
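One simple way to honor group boundaries is to assign whole groups greedily to the currently smallest fold. This is a sketch of the idea, not a production splitter (the greedy balancing heuristic is an assumption):

```python
from collections import defaultdict

def group_folds(groups, k):
    """Assign entire groups to folds so no group ever spans both
    a training and a validation subset."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    folds = [[] for _ in range(k)]
    # greedy balancing: largest groups first, each into the smallest fold so far
    for g in sorted(members, key=lambda g: -len(members[g])):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[g])
    return folds

# Two records per user; each user's rows stay together in one fold
print(group_folds(["a", "a", "b", "b", "c", "c", "d", "d"], 2))
```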
Small datasets can still use cross validation, but settings matter. Avoid too many folds when validation subsets become tiny, and watch minority class counts carefully.
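Those checks can be automated before committing to a fold count. The thresholds below (30 validation rows, 5 minority rows per fold) are illustrative assumptions, not fixed rules:

```python
def check_settings(pool, k, minority=None, min_val=30, min_minority=5):
    """Flag fold counts that leave validation subsets or per-fold
    minority counts too small to score reliably."""
    warnings = []
    if pool // k < min_val:
        warnings.append(f"validation folds have only {pool // k} rows")
    if minority is not None and minority / k < min_minority:
        warnings.append(f"only ~{minority / k:.1f} minority rows per fold")
    return warnings

# 120-row pool with 20 minority samples and 10 folds trips both warnings
print(check_settings(120, 10, minority=20))
```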
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of the results. Please consult other sources as well.