Nested Cross Validation Calculator

Calculator Inputs

Enter one score per outer fold for each of the three score lists: training, inner best (tuning), and outer validation. Each list must contain exactly as many values as the Outer Folds setting.

Problem Type

Metric Name

Optimization Direction

Total Samples

Outer Folds

Inner Folds

Confidence Z

Use 1.96 for an approximate 95% interval.

Decimals

Fold Sizes

Comma-separated, one size per outer fold, in outer fold order.

Outer Validation Scores

One numeric score per outer fold.

Inner Best Scores

Best inner search result for each outer loop.

Training Scores

Training score for the selected model in each fold.

Selected Hyperparameters

Enter one chosen setting per line, in outer fold order.

Example Data Table

Fold	Fold Size	Training Score	Inner Best Score	Outer Validation Score	Selected Hyperparameters
1	200	0.9080	0.8820	0.8420	max_depth=6, eta=0.05
2	200	0.9140	0.8910	0.8570	max_depth=5, eta=0.05
3	200	0.9110	0.8860	0.8510	max_depth=6, eta=0.04
4	200	0.9060	0.8790	0.8390	max_depth=5, eta=0.06
5	200	0.9180	0.8940	0.8640	max_depth=6, eta=0.05

Formula Used

Outer nested estimate
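\( \hat{\theta}_{NCV} = \frac{1}{K} \sum_{i=1}^{K} s_i \)
Here \( s_i \) is the outer validation score of fold \( i \) and \( K \) is the number of outer folds.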

Weighted outer estimate
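\( \hat{\theta}_{weighted} = \frac{\sum_{i=1}^{K} n_i s_i}{\sum_{i=1}^{K} n_i} \)
Here \( n_i \) is the size of outer fold \( i \). With equal fold sizes this reduces to the unweighted mean.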
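
As a rough illustration, the two estimates above can be computed from the example data table in a few lines of Python. The variable names are illustrative and not part of the calculator, and the unequal fold sizes in the second half are hypothetical values chosen only to show when the two estimates diverge.

```python
# Outer validation scores and fold sizes from the example data table.
outer_scores = [0.8420, 0.8570, 0.8510, 0.8390, 0.8640]
fold_sizes = [200, 200, 200, 200, 200]

# Unweighted nested estimate: plain mean over the K outer folds.
K = len(outer_scores)
theta_ncv = sum(outer_scores) / K

# Weighted estimate: each fold's score is weighted by its size n_i.
theta_weighted = (
    sum(n * s for n, s in zip(fold_sizes, outer_scores)) / sum(fold_sizes)
)

# Hypothetical unequal fold sizes (NOT from the table), for comparison only.
uneven_sizes = [100, 150, 200, 250, 300]
theta_uneven = (
    sum(n * s for n, s in zip(uneven_sizes, outer_scores)) / sum(uneven_sizes)
)

print(f"unweighted: {theta_ncv:.4f}")
print(f"weighted (equal sizes): {theta_weighted:.4f}")
print(f"weighted (hypothetical uneven sizes): {theta_uneven:.4f}")
```

With the table's equal fold sizes both estimates come out at about 0.8506. The weighted value only moves away from the plain mean when fold sizes differ.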

Sample variance and standard deviation
\( s^2 = \frac{\sum_{i=1}^{K} (s_i - \bar{s})^2}{K - 1} \), \( SD = \sqrt{s^2} \), where \( \bar{s} = \hat{\theta}_{NCV} \) is the unweighted outer mean.

Standard error and confidence interval
\( SE = \frac{SD}{\sqrt{K}} \), \( CI = \bar{s} \pm z \cdot SE \)
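
The spread and interval formulas above can be sketched with Python's standard library. This is a minimal example on the outer validation scores from the example data table, assuming z = 1.96; the variable names are illustrative.

```python
import math
import statistics

# Outer validation scores from the example data table.
outer_scores = [0.8420, 0.8570, 0.8510, 0.8390, 0.8640]
z = 1.96  # approximate 95% interval

K = len(outer_scores)
mean_outer = statistics.mean(outer_scores)

# statistics.stdev is the sample standard deviation (divides by K - 1).
sd = statistics.stdev(outer_scores)

# Standard error of the mean and the normal-approximation interval.
se = sd / math.sqrt(K)
ci_low = mean_outer - z * se
ci_high = mean_outer + z * se

print(f"mean={mean_outer:.4f} SD={sd:.4f} SE={se:.4f}")
print(f"CI=({ci_low:.4f}, {ci_high:.4f})")
```

For the example table this gives a mean near 0.8506 with a half-width of roughly 0.009. With only five folds the interval is a coarse approximation, not an exact coverage guarantee.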

Tuning optimism gap
For higher-better metrics: \( \bar{s}_{inner} - \bar{s}_{outer} \).
For lower-better metrics: \( \bar{s}_{outer} - \bar{s}_{inner} \).

Train-validation gap
For higher-better metrics: \( \bar{s}_{train} - \bar{s}_{outer} \).
For lower-better metrics: \( \bar{s}_{outer} - \bar{s}_{train} \).
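
Both gaps follow the same pattern, so one small helper can cover them. The sketch below uses the example data table; the function name `gap` and the `higher_is_better` flag are hypothetical and stand in for the Optimization Direction input.

```python
from statistics import mean

def gap(reference_scores, outer_scores, higher_is_better=True):
    """Mean gap between a reference score list and the outer scores.

    The sign is flipped for lower-better metrics so that a positive
    value always means the reference looked better than outer validation.
    """
    diff = mean(reference_scores) - mean(outer_scores)
    return diff if higher_is_better else -diff

# Scores from the example data table (higher is better).
train = [0.9080, 0.9140, 0.9110, 0.9060, 0.9180]
inner = [0.8820, 0.8910, 0.8860, 0.8790, 0.8940]
outer = [0.8420, 0.8570, 0.8510, 0.8390, 0.8640]

tuning_gap = gap(inner, outer)   # tuning optimism gap
train_gap = gap(train, outer)    # train-validation gap

print(f"tuning optimism gap: {tuning_gap:.4f}")
print(f"train-validation gap: {train_gap:.4f}")
```

For a lower-better metric such as RMSE, passing `higher_is_better=False` returns the outer mean minus the reference mean, matching the second line of each formula.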

How to Use This Calculator

Choose the problem type and metric that match your experiment.
Set the outer and inner fold counts from your workflow.
Paste one training, inner, and outer score for each outer fold.
Enter fold sizes to calculate the weighted estimate correctly.
Add one chosen hyperparameter setting per outer fold line.
Submit the form to view the summary, gaps, tables, and graph.
Use CSV or PDF export to save the current evaluation report.
Review the confidence interval and gaps before model selection.
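
If you prepare the inputs in a script before pasting them, a quick alignment check can catch a missing or extra value. The sketch below assumes score lists are comma-separated text and hyperparameters are one setting per line; the helper names are hypothetical and do not describe how the calculator itself parses input.

```python
def parse_scores(text, outer_folds):
    """Parse comma-separated numbers and check there is one per outer fold."""
    values = [float(part) for part in text.split(",") if part.strip()]
    if len(values) != outer_folds:
        raise ValueError(f"expected {outer_folds} values, got {len(values)}")
    return values

def parse_settings(text, outer_folds):
    """Split one-setting-per-line text and check the line count."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != outer_folds:
        raise ValueError(f"expected {outer_folds} lines, got {len(lines)}")
    return lines

outer_folds = 5
outer_scores = parse_scores("0.8420, 0.8570, 0.8510, 0.8390, 0.8640", outer_folds)
settings = parse_settings(
    "max_depth=6, eta=0.05\n"
    "max_depth=5, eta=0.05\n"
    "max_depth=6, eta=0.04\n"
    "max_depth=5, eta=0.06\n"
    "max_depth=6, eta=0.05\n",
    outer_folds,
)
print(len(outer_scores), len(settings))
```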

Frequently Asked Questions

1. What does nested cross validation measure?

It estimates model performance while separating tuning from final evaluation. The outer loop tests generalization, and the inner loop selects settings within each training partition.
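
The separation between the two loops can be illustrated with index bookkeeping alone. The following is a minimal sketch, assuming contiguous unshuffled folds and hypothetical helper names; a real workflow would typically shuffle or stratify and would fit a model inside each split.

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k contiguous folds of near-equal size."""
    base, extra = divmod(len(indices), k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

def nested_splits(n_samples, outer_k, inner_k):
    """Yield (outer_test, inner_splits) for each outer fold.

    Inner splits are built only from the outer training partition,
    so the outer test indices never take part in tuning.
    """
    all_idx = list(range(n_samples))
    for outer_test in k_fold_indices(all_idx, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in all_idx if i not in held_out]
        inner_splits = []
        for inner_val in k_fold_indices(outer_train, inner_k):
            val_set = set(inner_val)
            inner_train = [i for i in outer_train if i not in val_set]
            inner_splits.append((inner_train, inner_val))
        yield outer_test, inner_splits

splits = list(nested_splits(n_samples=20, outer_k=5, inner_k=3))

# Check that no outer test index leaks into any inner train or validation set.
leak_free = all(
    set(outer_test).isdisjoint(inner_train)
    and set(outer_test).isdisjoint(inner_val)
    for outer_test, inner_splits in splits
    for inner_train, inner_val in inner_splits
)
print(len(splits), leak_free)
```

Each outer fold then contributes one inner best score (from its inner splits) and one outer validation score (from its held-out indices), which are the per-fold values this calculator expects.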

2. Why is the outer mean important?

The mean outer validation score is the least optimistic of the three summaries in a nested workflow, because each outer fold is scored on data that played no part in inner tuning.

3. What is the tuning optimism gap?

It compares the average inner best score with the average outer score. A larger positive value means tuning looked better than final held-out performance.

4. Can I use regression metrics here?

Yes. Choose a regression metric such as RMSE, MAE, MSE, or R². Set the score direction correctly so the gaps are interpreted properly.

5. Why do I need one value per outer fold?

Each outer split contributes one final held-out score. Matching one score per fold preserves the correct nested summary and variability estimates.

6. What does a wide confidence interval mean?

It usually means fold outcomes vary a lot or you have few outer folds. Wider intervals suggest less certainty around the expected generalization score.

7. Should I rely on the best inner score?

Not for final reporting. Inner best scores help choose settings, but the outer results provide the fairer estimate of how the tuned pipeline should perform.

8. What does the train-validation gap show?

It shows how much better the selected models perform on training data than on outer validation data. Larger gaps can indicate stronger overfitting risk.