Calculator inputs
Use independent mode for two separate model result distributions. Use paired mode for before-versus-after or matched experiments.
Example data table
This example compares two model F1 distributions across repeated evaluation runs.
| Metric | Model A | Model B | Approximate interpretation |
|---|---|---|---|
| Runs (n) | 12 | 12 | Balanced repeated experiments |
| Mean F1 | 0.842 | 0.793 | Model A performs better |
| Standard deviation | 0.031 | 0.028 | Spread is similar |
| Cohen's d | ≈ 1.65 | | Large practical separation |
| Hedges' g | ≈ 1.60 | | Bias-corrected large effect |
| Common language effect | ≈ 87.9% | | Model A likely exceeds Model B in most draws |
Formulas used
s_pooled = √[ ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2) ]
d = (x̄₁ − x̄₂) / s_pooled
J = 1 − 3 / (4·df − 1), then g = d × J
Δ = (x̄₁ − x̄₂) / s_reference
s_diff = √(s₁² + s₂² − 2·r·s₁·s₂)
dz = (x̄₁ − x̄₂) / s_diff
dav = (x̄₁ − x̄₂) / √[(s₁² + s₂²) / 2]
CLES = Φ(d / √2) for independent mode, or Φ(dz / √2) for paired mode
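The independent-mode formulas can be checked against the example table with a short Python sketch. The summary numbers are copied from the table, and Φ is built from `math.erf` so no extra packages are needed; small rounding differences from the displayed table values are expected.

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Summary statistics from the example table
n1, mean1, sd1 = 12, 0.842, 0.031  # Model A
n2, mean2, sd2 = 12, 0.793, 0.028  # Model B

# Pooled standard deviation and Cohen's d
df = n1 + n2 - 2
s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
d = (mean1 - mean2) / s_pooled

# Small-sample correction J gives Hedges' g
J = 1.0 - 3.0 / (4.0 * df - 1.0)
g = d * J

# Common language effect size for independent groups
cles = phi(d / math.sqrt(2.0))

print(f"d ≈ {d:.2f}, g ≈ {g:.2f}, CLES ≈ {cles:.1%}")
```

Running this reproduces the large effect from the table: d lands near 1.66, g near 1.60, and CLES near 87.9%.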
These formulas are useful for comparing model scores such as accuracy, F1, AUROC, precision, recall, calibration loss, or error values across repeated runs.
How to use this calculator
- Choose Independent groups when two result distributions come from separate experiments, models, or evaluation batches.
- Choose Paired samples when the same folds, same users, or same tasks were scored twice.
- Enter labels for both groups, then provide means, standard deviations, and sample sizes.
- For paired mode, supply the within-pair correlation so the spread of the paired differences (s_diff) can be estimated accurately.
- Pick the confidence level and the number of decimals you want displayed.
- Click Estimate Effect Size to show the results above the form.
- Review Hedges' g or bias-corrected gz as the main estimate, then inspect r, CLES, overlap, and the confidence interval.
- Use the CSV and PDF buttons to export the report for documentation, model comparison notes, or experiment reviews.
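As a sketch of the paired-mode arithmetic, the snippet below reuses the example means and standard deviations with a within-pair correlation of r = 0.6. That correlation is a hypothetical value chosen for illustration, not a number from the example table.

```python
import math

# Example summary statistics; r = 0.6 is a hypothetical correlation
# chosen for illustration, not part of the example table.
mean1, sd1 = 0.842, 0.031
mean2, sd2 = 0.793, 0.028
r = 0.6
n = 12  # number of matched pairs

# Spread of the paired differences
s_diff = math.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)

# Standardized mean difference for paired designs
dz = (mean1 - mean2) / s_diff

# Bias correction for paired designs uses df = n - 1 pairs
J = 1.0 - 3.0 / (4.0 * (n - 1) - 1.0)
gz = dz * J

print(f"s_diff ≈ {s_diff:.4f}, dz ≈ {dz:.2f}, gz ≈ {gz:.2f}")
```

Because a positive correlation shrinks s_diff, dz here comes out larger than the independent-groups d for the same means, which is why pairing usually yields a more precise comparison.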
FAQs
1) Why use effect size in machine learning?
Effect size shows how meaningful a model difference is, not only whether a statistical test flags significance. That helps you judge practical model gains across repeated runs, folds, or experiment settings.
2) When should I prefer Hedges' g over Cohen's d?
Use Hedges' g when samples are modest, because it corrects small-sample bias. For larger datasets, d and g will be very close, but g is often the safer value to report.
3) What does Glass's delta tell me?
Glass's delta standardizes the mean difference using one chosen group’s standard deviation. It is helpful when one group has unstable variance or when a baseline system should define the reference spread.
4) When is paired mode the right choice?
Use paired mode when both measurements come from matched items, such as identical folds, same users, same documents, or before-versus-after tuning. Pairing usually gives a more precise estimate than treating scores as independent.
5) How do I interpret common language effect?
Common language effect estimates how often one group is expected to outperform the other in random draws. A value near 50% means little separation, while much higher or lower values show clearer dominance.
6) Can I use this for loss or error metrics?
Yes. The calculator works with any continuous performance metric, including loss, RMSE, MAE, AUROC, or F1. Just remember that the sign reflects direction: for lower-is-better metrics such as loss or RMSE, a positive d means the first group scored higher, which is worse, so flip either the sign or the interpretation.
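One common convention for lower-is-better metrics, sketched below with made-up RMSE summaries (the numbers are illustrative, not from the example table), is to negate the metric before computing d so that a positive effect always means the first model improved:

```python
import math

# Hypothetical RMSE summaries (lower is better); illustrative values only
mean_a, sd_a, n_a = 0.215, 0.012, 10  # new model
mean_b, sd_b, n_b = 0.248, 0.014, 10  # baseline

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d with pooled standard deviation."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

# Raw d is negative because the new model's error is lower
d_raw = cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b)

# Negating the metric flips the sign, so positive d now means improvement
d_flipped = cohens_d(-mean_a, sd_a, n_a, -mean_b, sd_b, n_b)

print(f"raw d ≈ {d_raw:.2f}, flipped d ≈ {d_flipped:.2f}")
```

Either convention works; the important thing is to state which one the report uses.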
7) Does a large effect size guarantee deployment value?
No. A large standardized difference is helpful, but deployment also depends on cost, fairness, stability, latency, drift risk, and business constraints. Use effect size as one decision layer, not the only one.
8) What should I report in a model comparison summary?
A strong summary includes sample sizes, means, standard deviations, design type, Hedges' g or gz, confidence interval, common language effect, and a short interpretation tied to the machine learning objective.