Effect Size Estimator Calculator for AI & Machine Learning

Measure practical differences beyond raw p-values. Review pooled spread, small-sample bias correction, and robust alternatives. Turn experimental differences into interpretable model-comparison insights quickly.

Calculator inputs

Use independent mode for two separate model result distributions. Use paired mode for before-versus-after or matched experiments.

Example data table

This example compares two model F1 distributions across repeated evaluation runs.

Metric                   Model A   Model B   Approximate interpretation
Runs (n)                 12        12        Balanced repeated experiments
Mean F1                  0.842     0.793     Model A performs better
Standard deviation       0.031     0.028     Spread is similar
Cohen's d                ≈ 1.65              Large practical separation
Hedges' g                ≈ 1.60              Bias-corrected large effect
Common language effect   ≈ 87.9%             Model A likely exceeds Model B in most draws
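The table values can be reproduced with a short script. This is a minimal sketch of the independent-groups calculation using only the standard library; the run counts, means, and standard deviations are taken from the example table above.

```python
import math

def cohens_d_independent(m1, s1, n1, m2, s2, n2):
    """Cohen's d using the pooled standard deviation."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

def hedges_g(d, n1, n2):
    """Small-sample bias correction J applied to d."""
    df = n1 + n2 - 2
    j = 1 - 3 / (4 * df - 1)
    return d * j

def cles_independent(d):
    """Common language effect: P(random draw from group 1 exceeds group 2)."""
    # Standard normal CDF via math.erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    x = d / math.sqrt(2)
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

d = cohens_d_independent(0.842, 0.031, 12, 0.793, 0.028, 12)
g = hedges_g(d, 12, 12)
cles = cles_independent(d)
print(f"d={d:.3f} g={g:.3f} CLES={cles:.3f}")  # d=1.659 g=1.602 CLES=0.880
```

The small differences from the table (1.659 vs. ≈ 1.65) come only from rounding.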

Formula used

Independent pooled standard deviation:
spooled = √[ ((n₁ - 1)s₁² + (n₂ - 1)s₂²) / (n₁ + n₂ - 2) ]
Cohen's d for independent groups:
d = (x̄₁ - x̄₂) / spooled
Hedges' g bias correction:
J = 1 - 3 / (4·df - 1), then g = d × J
Glass's delta:
Δ = (x̄₁ - x̄₂) / sreference
Paired difference spread:
sdiff = √(s₁² + s₂² - 2·r·s₁·s₂)
Cohen's dz for paired data:
dz = (x̄₁ - x̄₂) / sdiff
Cohen's dav for paired data:
dav = (x̄₁ - x̄₂) / √[(s₁² + s₂²) / 2]
Common language effect:
CLES = Φ(d / √2) for independent mode, or Φ(dz / √2) for paired mode

These formulas are useful for comparing model scores such as accuracy, F1, AUROC, precision, recall, calibration loss, or error values across repeated runs.
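The paired-mode formulas can be sketched the same way. The means and standard deviations below reuse the example table, but the within-pair correlation of 0.6 is a hypothetical value chosen for illustration.

```python
import math

def paired_effect_sizes(m1, s1, m2, s2, r):
    """dz standardizes by the spread of paired differences;
    dav standardizes by the average of the two variances."""
    s_diff = math.sqrt(s1**2 + s2**2 - 2 * r * s1 * s2)
    dz = (m1 - m2) / s_diff
    dav = (m1 - m2) / math.sqrt((s1**2 + s2**2) / 2)
    return dz, dav

# Hypothetical paired run: same summary stats as the example, r = 0.6 assumed.
dz, dav = paired_effect_sizes(0.842, 0.031, 0.793, 0.028, 0.6)
print(f"dz={dz:.3f} dav={dav:.3f}")  # dz=1.848 dav=1.659
```

Note that a positive correlation shrinks the difference spread, so dz comes out larger than dav here; that is exactly why pairing usually sharpens the estimate.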

How to use this calculator

  1. Choose Independent groups when two result distributions come from separate experiments, models, or evaluation batches.
  2. Choose Paired samples when the same folds, same users, or same tasks were scored twice.
  3. Enter labels for both groups, then provide means, standard deviations, and sample sizes.
  4. For paired mode, supply the within-pair correlation to improve the difference spread estimate.
  5. Pick the confidence level and the number of decimals you want displayed.
  6. Click Estimate Effect Size to show the results above the form.
  7. Review Hedges' g or bias-corrected gz as the main estimate, then inspect r, CLES, overlap, and the confidence interval.
  8. Use the CSV and PDF buttons to export the report for documentation, model comparison notes, or experiment reviews.
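Step 7 mentions a confidence interval for the effect size. One common large-sample approximation for the variance of d is (n₁+n₂)/(n₁n₂) + d²/(2(n₁+n₂)); exact intervals use the noncentral t distribution instead, so treat this as a rough sketch rather than the calculator's exact method.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate large-sample CI for Cohen's d (normal-theory variance)."""
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. 1.96 for 95%
    half = z * math.sqrt(var_d)
    return d - half, d + half

# Using the example's d of about 1.659 with 12 runs per model.
lo, hi = d_confidence_interval(1.659, 12, 12)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")  # 95% CI: (0.731, 2.587)
```

A wide interval like this is typical with only 12 runs per model: the point estimate is large, but the precision is limited.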

FAQs

1) Why use effect size in machine learning?

Effect size shows how meaningful a model difference is, not only whether a statistical test flags significance. That helps you judge practical model gains across repeated runs, folds, or experiment settings.

2) When should I prefer Hedges' g over Cohen's d?

Use Hedges' g when samples are modest, because it corrects small-sample bias. For larger datasets, d and g will be very close, but g is often the safer value to report.

3) What does Glass's delta tell me?

Glass's delta standardizes the mean difference using one chosen group’s standard deviation. It is helpful when one group has unstable variance or when a baseline system should define the reference spread.

4) When is paired mode the right choice?

Use paired mode when both measurements come from matched items, such as identical folds, same users, same documents, or before-versus-after tuning. Pairing usually gives a more precise estimate than treating scores as independent.

5) How do I interpret common language effect?

Common language effect estimates how often one group is expected to outperform the other in random draws. A value near 50% means little separation, while much higher or lower values show clearer dominance.

6) Can I use this for loss or error metrics?

Yes. The calculator works with any continuous performance metric, including loss, RMSE, MAE, AUROC, or F1. Just remember that the sign reflects direction, so lower-is-better metrics may need careful interpretation.
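As a quick illustration of the sign issue, here is a sketch with made-up RMSE summaries: Model A has the lower error, so d comes out negative even though A is the better model.

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d with the pooled standard deviation."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

# Hypothetical RMSE runs: lower is better, so a NEGATIVE d favors Model A.
d_rmse = cohens_d(0.215, 0.012, 10, 0.248, 0.014, 10)
print(f"d={d_rmse:.3f}")  # d=-2.531 -> Model A has substantially lower error
```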

7) Does a large effect size guarantee deployment value?

No. A large standardized difference is helpful, but deployment also depends on cost, fairness, stability, latency, drift risk, and business constraints. Use effect size as one decision layer, not the only one.

8) What should I report in a model comparison summary?

A strong summary includes sample sizes, means, standard deviations, design type, Hedges' g or gz, confidence interval, common language effect, and a short interpretation tied to the machine learning objective.

Related Calculators

binomial test calculator · ab test sample size · bayesian ab test · ab test calculator · pooled variance test · ab test p value · ab test power · risk ratio significance · chi square ab test

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.