Quantify the evidence your predictive model's outputs provide. Compare scenarios using probabilities, log-likelihoods, and diagnostic checks. Make faster, clearer decisions with reliable likelihood scoring.
| Row | Observed (y) | Predicted probability (p) | Weight (w) |
|---|---|---|---|
| 1 | 1 | 0.72 | 1.0 |
| 2 | 0 | 0.30 | 1.0 |
| 3 | 1 | 0.55 | 0.8 |
| 4 | 0 | 0.12 | 1.2 |
| 5 | 1 | 0.91 | 1.0 |
Likelihood expresses how plausible the observed outcomes are under your model. In binary mode, each row contributes p when y is one and (1−p) when y is zero. In multiclass mode, each row uses the probability assigned to the true label. Multiplying row contributions yields the dataset likelihood, which is useful for comparing two models on the same cases. Higher likelihood indicates better fit when inputs and rows match across evaluation runs.
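Using the five rows in the table above, the binary-mode likelihood product can be sketched in a few lines of Python (the variable names are illustrative, not the calculator's internals):

```python
import math

# Rows from the table above: (observed y, predicted probability p)
rows = [(1, 0.72), (0, 0.30), (1, 0.55), (0, 0.12), (1, 0.91)]

def row_contribution(y, p):
    """Binary mode: a row contributes p when y = 1 and (1 - p) when y = 0."""
    return p if y == 1 else 1 - p

# Multiply row contributions to get the dataset likelihood
likelihood = math.prod(row_contribution(y, p) for y, p in rows)
# 0.72 * 0.70 * 0.55 * 0.88 * 0.91 ≈ 0.222
```

Scoring two models on these same five rows and comparing their likelihood values is exactly the comparison described above: the higher product indicates better fit.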
Raw likelihood products can become extremely small as rows grow, causing underflow and confusing comparisons. Log-likelihood avoids this by summing logarithms, making computation stable and interpretable. A convenient diagnostic is deviance, defined as minus two times the log-likelihood, where lower values indicate stronger support. Because zero probabilities break logs, this calculator clamps probabilities using epsilon before scoring. Use average log-likelihood to compare datasets with different total weights and to monitor numeric stability.
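A minimal sketch of log-likelihood with epsilon clamping and deviance, assuming an epsilon of 1e-15 (the calculator's actual clamp value may differ):

```python
import math

EPS = 1e-15  # assumed epsilon; the calculator's clamp value may differ

def clamp(p, eps=EPS):
    """Keep p strictly inside (0, 1) so log() stays finite."""
    return min(max(p, eps), 1 - eps)

def log_likelihood(rows):
    """Sum of log contributions; rows are (y, p) pairs in binary mode."""
    total = 0.0
    for y, p in rows:
        p = clamp(p)
        total += math.log(p if y == 1 else 1 - p)
    return total

rows = [(1, 0.72), (0, 0.30), (1, 0.55), (0, 0.12), (1, 0.91)]
ll = log_likelihood(rows)   # ≈ -1.505; sums logs instead of multiplying
deviance = -2 * ll          # ≈ 3.010; lower values indicate stronger support
```

Summing logs keeps the computation stable even for thousands of rows, where the raw product would underflow to zero.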
Event weights let you emphasize records that matter more, such as valuable customers, rare classes, or audited transactions. Mathematically, each weight scales that row’s log contribution, similar to repeating the observation. The calculator reports total weight, then normalizes key metrics by that total to produce comparable averages. Use weights to correct sampling schemes or align evaluation with business costs, but keep weights positive and consistent. Record weighting rules for reproducible reviews later.
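Weighting can be sketched by scaling each row's log contribution by its weight, using the table's values (again, illustrative names only):

```python
import math

# (y, p, w) rows from the table; each weight scales that row's log contribution
rows = [(1, 0.72, 1.0), (0, 0.30, 1.0), (1, 0.55, 0.8),
        (0, 0.12, 1.2), (1, 0.91, 1.0)]

total_weight = sum(w for _, _, w in rows)            # 5.0
weighted_ll = sum(w * math.log(p if y == 1 else 1 - p)
                  for y, p, w in rows)               # ≈ -1.411
avg_ll = weighted_ll / total_weight                  # ≈ -0.282, comparable average
```

A weight of 0.8 behaves like observing that row "0.8 times", and dividing by the total weight produces averages that stay comparable across differently weighted datasets.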
A single likelihood product is hard to compare across dataset sizes, so the calculator reports a geometric mean likelihood. This equals exp(average log-likelihood) and stays between zero and one. For reporting, it becomes a zero to one hundred score by multiplying by one hundred. Because it is averaged over total weight, the score remains comparable as you add rows. For strict, proper scoring, use log loss: it quickly highlights cases where the model assigns small probability to outcomes that actually occur.
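Putting the pieces together for the example table, the geometric mean likelihood, the 0–100 score, and the average log loss can be sketched as:

```python
import math

rows = [(1, 0.72, 1.0), (0, 0.30, 1.0), (1, 0.55, 0.8),
        (0, 0.12, 1.2), (1, 0.91, 1.0)]

total_w = sum(w for _, _, w in rows)
avg_ll = sum(w * math.log(p if y == 1 else 1 - p)
             for y, p, w in rows) / total_w

geo_mean = math.exp(avg_ll)   # geometric mean likelihood, stays in (0, 1]
score = 100 * geo_mean        # 0-100 reporting score, here ≈ 75.4
log_loss = -avg_ll            # strictly proper score; lower is better
```

Note that `score` and `log_loss` are two views of the same quantity: both are monotone transforms of the average log-likelihood, so they always rank models identically on a fixed dataset.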
In production, compute the score on a holdout or a rolling window to detect drift. Track log loss and average log-likelihood with business metrics, since a stable score can hide threshold issues. Watch for sudden drops in geometric mean likelihood, which often signal calibration problems or label shifts. Segment results by cohort, geography, or device to localize failures. When changes occur, retrain, recalibrate, and revalidate using the same weighting and epsilon settings.
The score is one hundred times the geometric mean likelihood, computed from the average log-likelihood. Higher values mean the model assigns higher probability to the observed outcomes on the same dataset.
Use binary mode for yes or no outcomes with a single predicted probability. Use multiclass mode when each row has a true class and the probability your model assigned to that class.
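In multiclass mode, each row's contribution is simply the probability the model assigned to the true class. A sketch with hypothetical three-class rows (the labels and probabilities are made up for illustration):

```python
import math

# Hypothetical rows: (true_class, {class: predicted probability})
rows = [("cat", {"cat": 0.6, "dog": 0.3, "bird": 0.1}),
        ("dog", {"cat": 0.2, "dog": 0.7, "bird": 0.1})]

# Each row contributes log of the probability assigned to the true label
log_likelihood = sum(math.log(probs[true]) for true, probs in rows)
```

Binary mode is the two-class special case: storing only p for the positive class and using (1 − p) for the negative class.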
If any probability is zero or one, logarithms become undefined. Epsilon clamps probabilities into a safe range so log-likelihood and log loss remain finite and comparable.
Weights scale each row’s contribution to log-likelihood. Larger weights make specific cases influence the averages more, which is useful for cost-sensitive evaluation or correcting sampled datasets.
Comparisons are most meaningful when both models are scored on the same rows and labels. If datasets differ, prefer average log loss and document any weighting or filtering differences.
For broad audiences, report the 0–100 score and show the trend over time. For technical reviews, include log loss, average log-likelihood, and deviance to explain changes and diagnose issues.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.