Quantify the evidence your predictive model's outputs provide. Compare scenarios using probabilities, log-likelihoods, and diagnostic checks. Make faster, clearer decisions with reliable likelihood scoring.
| Row | Observed (y) | Predicted probability (p) | Weight (w) |
|---|---|---|---|
| 1 | 1 | 0.72 | 1.0 |
| 2 | 0 | 0.30 | 1.0 |
| 3 | 1 | 0.55 | 0.8 |
| 4 | 0 | 0.12 | 1.2 |
| 5 | 1 | 0.91 | 1.0 |
Likelihood expresses how plausible the observed outcomes are under your model. In binary mode, each row contributes p when y is one and (1−p) when y is zero. In multiclass mode, each row uses the probability assigned to the true label. Multiplying row contributions yields the dataset likelihood, which is useful for comparing two models on the same cases. Higher likelihood indicates better fit when inputs and rows match across evaluation runs.
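Using the five rows in the table above, the binary-mode likelihood product can be sketched in a few lines of Python (the variable names are illustrative, not the calculator's internals):

```python
import math

# Rows from the table above: (observed y, predicted probability p)
rows = [(1, 0.72), (0, 0.30), (1, 0.55), (0, 0.12), (1, 0.91)]

def row_contribution(y, p):
    """Binary mode: a row contributes p when y = 1 and (1 - p) when y = 0."""
    return p if y == 1 else 1 - p

# Multiply row contributions to get the dataset likelihood
likelihood = math.prod(row_contribution(y, p) for y, p in rows)
# 0.72 * 0.70 * 0.55 * 0.88 * 0.91 ≈ 0.222
```

Scoring two models on these same five rows and comparing their likelihood values is exactly the comparison described above: the higher product indicates better fit.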
Raw likelihood products can become extremely small as rows grow, causing underflow and confusing comparisons. Log-likelihood avoids this by summing logarithms, making computation stable and interpretable. A convenient diagnostic is deviance, defined as minus two times the log-likelihood, where lower values indicate stronger support. Because zero probabilities break logs, this calculator clamps probabilities using epsilon before scoring. Use average log-likelihood to compare datasets with different total weights and to monitor numeric stability.
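A minimal sketch of log-likelihood with epsilon clamping and deviance, assuming an epsilon of 1e-15 (the calculator's actual clamp value may differ):

```python
import math

EPS = 1e-15  # assumed epsilon; the calculator's clamp value may differ

def clamp(p, eps=EPS):
    """Keep p strictly inside (0, 1) so log() stays finite."""
    return min(max(p, eps), 1 - eps)

def log_likelihood(rows):
    """Sum of log contributions; rows are (y, p) pairs in binary mode."""
    total = 0.0
    for y, p in rows:
        p = clamp(p)
        total += math.log(p if y == 1 else 1 - p)
    return total

rows = [(1, 0.72), (0, 0.30), (1, 0.55), (0, 0.12), (1, 0.91)]
ll = log_likelihood(rows)   # ≈ -1.505; sums logs instead of multiplying
deviance = -2 * ll          # ≈ 3.010; lower values indicate stronger support
```

Summing logs keeps the computation stable even for thousands of rows, where the raw product would underflow to zero.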
Event weights let you emphasize records that matter more, such as valuable customers, rare classes, or audited transactions. Mathematically, each weight scales that row’s log contribution, similar to repeating the observation. The calculator reports total weight, then normalizes key metrics by that total to produce comparable averages. Use weights to correct sampling schemes or align evaluation with business costs, but keep weights positive and consistent. Record weighting rules for reproducible reviews later.
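Weighting can be sketched by scaling each row's log contribution by its weight, using the table's values (again, illustrative names only):

```python
import math

# (y, p, w) rows from the table; each weight scales that row's log contribution
rows = [(1, 0.72, 1.0), (0, 0.30, 1.0), (1, 0.55, 0.8),
        (0, 0.12, 1.2), (1, 0.91, 1.0)]

total_weight = sum(w for _, _, w in rows)            # 5.0
weighted_ll = sum(w * math.log(p if y == 1 else 1 - p)
                  for y, p, w in rows)               # ≈ -1.411
avg_ll = weighted_ll / total_weight                  # ≈ -0.282, comparable average
```

A weight of 0.8 behaves like observing that row "0.8 times", and dividing by the total weight produces averages that stay comparable across differently weighted datasets.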
A single likelihood product is hard to compare across dataset sizes, so the calculator reports a geometric mean likelihood. This equals exp(average log-likelihood) and stays between zero and one. For reporting, it becomes a zero to one hundred score by multiplying by one hundred. Because it is averaged over total weight, the score remains comparable as you add rows. For strict, proper scoring, use log loss: it quickly highlights cases where the model assigns small probability to outcomes that actually occur.
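Putting the pieces together for the example table, the geometric mean likelihood, the 0–100 score, and the average log loss can be sketched as:

```python
import math

rows = [(1, 0.72, 1.0), (0, 0.30, 1.0), (1, 0.55, 0.8),
        (0, 0.12, 1.2), (1, 0.91, 1.0)]

total_w = sum(w for _, _, w in rows)
avg_ll = sum(w * math.log(p if y == 1 else 1 - p)
             for y, p, w in rows) / total_w

geo_mean = math.exp(avg_ll)   # geometric mean likelihood, stays in (0, 1]
score = 100 * geo_mean        # 0-100 reporting score, here ≈ 75.4
log_loss = -avg_ll            # strictly proper score; lower is better
```

Note that `score` and `log_loss` are two views of the same quantity: both are monotone transforms of the average log-likelihood, so they always rank models identically on a fixed dataset.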
In production, compute the score on a holdout or a rolling window to detect drift. Track log loss and average log-likelihood with business metrics, since a stable score can hide threshold issues. Watch for sudden drops in geometric mean likelihood, which often signal calibration problems or label shifts. Segment results by cohort, geography, or device to localize failures. When changes occur, retrain, recalibrate, and revalidate using the same weighting and epsilon settings.
The score is one hundred times the geometric mean likelihood, computed from the average log-likelihood. Higher values mean the model assigns higher probability to the observed outcomes on the same dataset.
Use binary mode for yes or no outcomes with a single predicted probability. Use multiclass mode when each row has a true class and the probability your model assigned to that class.
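In multiclass mode, each row's contribution is simply the probability the model assigned to the true class. A sketch with hypothetical three-class rows (the labels and probabilities are made up for illustration):

```python
import math

# Hypothetical rows: (true_class, {class: predicted probability})
rows = [("cat", {"cat": 0.6, "dog": 0.3, "bird": 0.1}),
        ("dog", {"cat": 0.2, "dog": 0.7, "bird": 0.1})]

# Each row contributes log of the probability assigned to the true label
log_likelihood = sum(math.log(probs[true]) for true, probs in rows)
```

Binary mode is the two-class special case: storing only p for the positive class and using (1 − p) for the negative class.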
If any probability is zero or one, logarithms become undefined. Epsilon clamps probabilities into a safe range so log-likelihood and log loss remain finite and comparable.
Weights scale each row’s contribution to log-likelihood. Larger weights make specific cases influence the averages more, which is useful for cost-sensitive evaluation or correcting sampled datasets.
Comparisons are most meaningful when both models are scored on the same rows and labels. If datasets differ, prefer average log loss and document any weighting or filtering differences.
For broad audiences, report the 0–100 score and show the trend over time. For technical reviews, include log loss, average log-likelihood, and deviance to explain changes and diagnose issues.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.