Calculator
Example data
| Scenario | Inputs | Outputs |
|---|---|---|
| Balanced model | R² = 0.82, Validation R² = 0.76, n = 250, p = 12, beta = 2, gamma = 5, blend = 0.6 | Adjusted R² ≈ 0.811; score shrinks for gap and complexity. |
| Overfit risk | R² = 0.92, Validation R² = 0.70, n = 180, p = 40, beta = 3, gamma = 6, blend = 0.7 | Strong gap and complexity penalties; score drops despite high training fit. |
| Lean model | SSE = 1200, SST = 6500, n = 300, p = 6, gamma = 4, beta = 2, blend = 0.5 | R² computed from errors; simplicity keeps penalties mild. |
Formula used
Generalization penalty uses the positive gap between training and validation: gap = max(0, R² − R²_val), G = 1 / (1 + beta · gap)
Complexity penalty uses the feature-to-sample ratio: C = 1 / (1 + gamma · p/n)
The final score is reported on a 0–100 scale after clamping.
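Putting the pieces together, here is a minimal Python sketch of the score. The page does not spell out exactly how the baseline and penalties are combined, so the final line (adjusted R² scaled by the blended penalty w·G + (1 − w)·C) is an assumption for illustration only:

```python
def adjusted_fit_score(r2, r2_val, n, p, beta=2.0, gamma=5.0, w=0.6):
    # Adjusted R² baseline (requires n > p + 1)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    # Generalization penalty G: only a positive train-validation gap counts
    gap = max(0.0, r2 - r2_val)
    g = 1 / (1 + beta * gap)
    # Complexity penalty C from the p/n ratio
    c = 1 / (1 + gamma * p / n)
    # ASSUMPTION: baseline scaled by the blended penalty, clamped to 0-100
    score = adj_r2 * (w * g + (1 - w) * c) * 100
    return max(0.0, min(100.0, score))

# Balanced-model example from the table above
print(round(adjusted_fit_score(0.82, 0.76, 250, 12), 1))  # 69.6 under this assumed combination
```

A different combination rule changes the number, but not the direction of each effect: a larger gap or a larger p/n always pulls the score down.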
How to use
- Choose an input mode: enter R², or compute it from SSE and SST.
- Enter n (observations) and p (predictors).
- Optionally enter validation R² to penalize overfitting risk.
- Tune beta and gamma to match your penalty preference.
- Pick blend weight to balance generalization versus simplicity.
- Press Submit to see results above the form.
- Use the download buttons to export CSV or PDF.
Fit beyond training R²
Training R² can look impressive while real performance lags. The adjusted fit score combines an adjusted R² baseline with penalties that reflect generalization and model size. When n is close to p, adjusted R² often drops sharply, warning that the apparent fit may be driven by degrees of freedom rather than signal. Using a 0–100 scale makes comparisons easier across experiments, feature sets, and time windows.
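The adjusted R² baseline can be checked directly. This sketch reproduces the balanced-model figure and shows how the statistic collapses when n is barely above p:

```python
def adjusted_r2(r2, n, p):
    # Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1); requires n > p + 1
    if n <= p + 1:
        raise ValueError("adjusted R² undefined: need n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.82, 250, 12), 3))  # 0.811, matching the balanced-model row
print(round(adjusted_r2(0.82, 15, 12), 3))   # -0.26: the apparent fit is mostly degrees of freedom
```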
Generalization gap signal
If you provide validation R², the calculator measures the positive gap max(0, R² − R²_val). A gap of 0.10 with beta = 2 yields a generalization penalty of 1/(1 + 2·0.10) = 0.833, reducing the score even when training fit stays high. This encourages selecting models that keep training and validation aligned, which is especially important under dataset shift, leakage risk, or aggressive feature engineering.
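The worked number above can be verified in a few lines; note that a validation score above training incurs no penalty, since only the positive gap counts:

```python
def generalization_penalty(r2_train, r2_val, beta):
    # Only a positive train-to-validation gap is penalized
    gap = max(0.0, r2_train - r2_val)
    return 1 / (1 + beta * gap)

print(round(generalization_penalty(0.80, 0.70, 2), 3))  # 0.833: gap of 0.10 with beta 2
print(round(generalization_penalty(0.70, 0.80, 2), 3))  # 1.0: validation above training, no penalty
```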
Complexity control with p/n
The complexity penalty uses p/n to represent how crowded the feature space is relative to data. With gamma 5, p=25 and n=250 gives C=1/(1+0.5)=0.667, while p=10 and n=250 gives C=1/(1+0.2)=0.833. This simple ratio approximates the intuition that larger models require more data to maintain stable estimates and avoid brittle coefficients.
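The two complexity examples can be reproduced the same way:

```python
def complexity_penalty(p, n, gamma):
    # Penalize crowded feature spaces via the p/n ratio
    return 1 / (1 + gamma * p / n)

print(round(complexity_penalty(25, 250, 5), 3))  # 0.667: 25 predictors on 250 rows
print(round(complexity_penalty(10, 250, 5), 3))  # 0.833: a leaner model on the same data
```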
Tuning blend, beta, gamma
Blend weight w sets the emphasis between generalization and simplicity. When w=0.7, the score reacts more to validation gaps; when w=0.3, it reacts more to p/n. Beta and gamma should match your risk tolerance: raise beta for production models where surprises are costly, and raise gamma for interpretable models where feature parsimony matters. Keep defaults consistent to track improvements fairly.
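The effect of the blend weight can be seen with two fixed penalty values. The combination w·G + (1 − w)·C is an assumption for illustration, not a formula stated on this page:

```python
# Sensitivity of the blended penalty to w, assuming the combination
# w*G + (1-w)*C (an illustrative assumption).
G = 0.6  # poor generalization: a large train-validation gap
C = 0.9  # lean model: small p/n
for w in (0.3, 0.7):
    blended = w * G + (1 - w) * C
    print(f"w={w}: blended penalty = {blended:.3f}")
# w=0.3 -> 0.810 (the gap matters less); w=0.7 -> 0.690 (the gap dominates)
```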
Reading the score in practice
Use the grade bands as a quick screen, then inspect components. Two models can share the same score for different reasons: one may have a strong adjusted R² but a large gap, while another may generalize well but be overly complex. Export results to document experiments, attach them to model cards, and compare runs alongside MAE, RMSE, or business KPIs for a complete view. Track the score over releases to detect silent regressions early.
FAQs
What does the adjusted fit score measure?
It rates model fit after accounting for sample size, feature count, and validation consistency, producing a 0–100 number that is easier to compare across experiments than raw R² alone.
When should I supply a validation R²?
Whenever you have cross‑validation or a holdout set. The score penalizes only positive train‑to‑validation gaps, helping you spot overfitting and select models that generalize.
Why is adjusted R² sometimes unavailable?
Adjusted R² requires n greater than p plus one. If the condition fails, the calculator falls back to training R² and notes that the adjusted statistic is undefined for that configuration.
What do beta and gamma control?
Beta strengthens the generalization penalty from validation gaps, while gamma strengthens the complexity penalty from p/n. Keep them consistent within a project to make scores comparable and interpret changes confidently.
Can I use this for classification models?
Yes, if you supply an R²‑like metric from a regression‑style evaluation, such as explained variance on probabilities. For pure classification, consider pairing this score with AUC, log loss, or calibration error.
What does a negative R² mean?
Negative R² means the model is worse than predicting the mean. If flooring is enabled, the score is clamped to zero, making the rating easier to interpret while still showing the underlying R² value.