Calculator inputs
Use independent mode for two separate model result distributions. Use paired mode for before-versus-after or matched experiments.
Example data table
This example compares two model F1 distributions across repeated evaluation runs.
| Metric | Model A | Model B | Approximate interpretation |
|---|---|---|---|
| Runs (n) | 12 | 12 | Balanced repeated experiments |
| Mean F1 | 0.842 | 0.793 | Model A performs better |
| Standard deviation | 0.031 | 0.028 | Spread is similar |
| Cohen's d | ≈ 1.65 | | Large practical separation |
| Hedges' g | ≈ 1.60 | | Bias-corrected large effect |
| Common language effect | ≈ 87.9% | | Model A likely exceeds Model B in most draws |
Formulas used
s_pooled = √[ ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2) ]
d = (x̄₁ − x̄₂) / s_pooled
J = 1 − 3 / (4·df − 1), then g = d × J
Δ = (x̄₁ − x̄₂) / s_reference
s_diff = √(s₁² + s₂² − 2·r·s₁·s₂)
dz = (x̄₁ − x̄₂) / s_diff
dav = (x̄₁ − x̄₂) / √[(s₁² + s₂²) / 2]
CLES = Φ(d / √2) for independent mode, or Φ(dz / √2) for paired mode
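The independent-mode formulas can be checked against the example table with a short Python sketch. The summary numbers are copied from the table, and Φ is built from `math.erf` so no extra packages are needed; small rounding differences from the displayed table values are expected.

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Summary statistics from the example table
n1, mean1, sd1 = 12, 0.842, 0.031  # Model A
n2, mean2, sd2 = 12, 0.793, 0.028  # Model B

# Pooled standard deviation and Cohen's d
df = n1 + n2 - 2
s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
d = (mean1 - mean2) / s_pooled

# Small-sample correction J gives Hedges' g
J = 1.0 - 3.0 / (4.0 * df - 1.0)
g = d * J

# Common language effect size for independent groups
cles = phi(d / math.sqrt(2.0))

print(f"d ≈ {d:.2f}, g ≈ {g:.2f}, CLES ≈ {cles:.1%}")
```

Running this reproduces the large effect from the table: d lands near 1.66, g near 1.60, and CLES near 87.9%.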
These formulas are useful for comparing model scores such as accuracy, F1, AUROC, precision, recall, calibration loss, or error values across repeated runs.
How to use this calculator
- Choose Independent groups when two result distributions come from separate experiments, models, or evaluation batches.
- Choose Paired samples when the same folds, same users, or same tasks were scored twice.
- Enter labels for both groups, then provide means, standard deviations, and sample sizes.
- For paired mode, supply the within-pair correlation so the spread of the paired differences (s_diff) can be estimated accurately.
- Pick the confidence level and the number of decimals you want displayed.
- Click Estimate Effect Size to show the results above the form.
- Review Hedges' g or bias-corrected gz as the main estimate, then inspect r, CLES, overlap, and the confidence interval.
- Use the CSV and PDF buttons to export the report for documentation, model comparison notes, or experiment reviews.
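As a sketch of the paired-mode arithmetic, the snippet below reuses the example means and standard deviations with a within-pair correlation of r = 0.6. That correlation is a hypothetical value chosen for illustration, not a number from the example table.

```python
import math

# Example summary statistics; r = 0.6 is a hypothetical correlation
# chosen for illustration, not part of the example table.
mean1, sd1 = 0.842, 0.031
mean2, sd2 = 0.793, 0.028
r = 0.6
n = 12  # number of matched pairs

# Spread of the paired differences
s_diff = math.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)

# Standardized mean difference for paired designs
dz = (mean1 - mean2) / s_diff

# Bias correction for paired designs uses df = n - 1 pairs
J = 1.0 - 3.0 / (4.0 * (n - 1) - 1.0)
gz = dz * J

print(f"s_diff ≈ {s_diff:.4f}, dz ≈ {dz:.2f}, gz ≈ {gz:.2f}")
```

Because a positive correlation shrinks s_diff, dz here comes out larger than the independent-groups d for the same means, which is why pairing usually yields a more precise comparison.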
FAQs
1) Why use effect size in machine learning?
Effect size shows how meaningful a model difference is, not only whether a statistical test flags significance. That helps you judge practical model gains across repeated runs, folds, or experiment settings.
2) When should I prefer Hedges' g over Cohen's d?
Use Hedges' g when samples are modest, because it corrects small-sample bias. For larger datasets, d and g will be very close, but g is often the safer value to report.
3) What does Glass's delta tell me?
Glass's delta standardizes the mean difference using one chosen group’s standard deviation. It is helpful when one group has unstable variance or when a baseline system should define the reference spread.
4) When is paired mode the right choice?
Use paired mode when both measurements come from matched items, such as identical folds, same users, same documents, or before-versus-after tuning. Pairing usually gives a more precise estimate than treating scores as independent.
5) How do I interpret common language effect?
Common language effect estimates how often one group is expected to outperform the other in random draws. A value near 50% means little separation, while much higher or lower values show clearer dominance.
6) Can I use this for loss or error metrics?
Yes. The calculator works with any continuous performance metric, including loss, RMSE, MAE, AUROC, or F1. Just remember that the sign reflects direction: for lower-is-better metrics such as loss or RMSE, a positive d means the first group scored higher, which is worse, so flip either the sign or the interpretation.
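One common convention for lower-is-better metrics, sketched below with made-up RMSE summaries (the numbers are illustrative, not from the example table), is to negate the metric before computing d so that a positive effect always means the first model improved:

```python
import math

# Hypothetical RMSE summaries (lower is better); illustrative values only
mean_a, sd_a, n_a = 0.215, 0.012, 10  # new model
mean_b, sd_b, n_b = 0.248, 0.014, 10  # baseline

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d with pooled standard deviation."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

# Raw d is negative because the new model's error is lower
d_raw = cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b)

# Negating the metric flips the sign, so positive d now means improvement
d_flipped = cohens_d(-mean_a, sd_a, n_a, -mean_b, sd_b, n_b)

print(f"raw d ≈ {d_raw:.2f}, flipped d ≈ {d_flipped:.2f}")
```

Either convention works; the important thing is to state which one the report uses.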
7) Does a large effect size guarantee deployment value?
No. A large standardized difference is helpful, but deployment also depends on cost, fairness, stability, latency, drift risk, and business constraints. Use effect size as one decision layer, not the only one.
8) What should I report in a model comparison summary?
A strong summary includes sample sizes, means, standard deviations, design type, Hedges' g or gz, confidence interval, common language effect, and a short interpretation tied to the machine learning objective.