Analyze entire ranked lists for learning-to-rank workflows. This calculator compares relevance labels against model scores, then reports probability alignment, ranking quality, gain-based utility, and order agreement in one place.
Calculator Inputs
Example Data Table
| Item | True Relevance | Predicted Score | Meaning |
|---|---|---|---|
| Doc A | 3 | 1.92 | Highly relevant document with strong model confidence. |
| Doc B | 2 | 1.25 | Relevant document ranked below stronger alternatives. |
| Doc C | 3 | 1.71 | Highly relevant document with slightly lower score. |
| Doc D | 0 | 0.44 | Non-relevant item correctly assigned a weak score. |
| Doc E | 1 | 0.88 | Marginally relevant item with moderate predicted value. |
This example helps test the calculator and illustrates how relevance grades compare with model-generated ranking scores.
Formula Used
Listwise softmax with temperature τ:
P(i) = exp(s_i / τ) / Σ_j exp(s_j / τ)
Cross-entropy loss:
L = −Σ_i P_true(i) · ln(P_pred(i))
KL divergence:
KL = Σ_i P_true(i) · ln( P_true(i) / P_pred(i) )
Linear gain: gain_i = rel_i
Exponential gain: gain_i = 2^rel_i − 1
DCG@K = Σ gain_r / log₂(r + 1), for r = 1 to K
NDCG@K = DCG@K / IDCG@K
Spearman rank correlation:
ρ = 1 − 6 Σ d_i² / [ n(n² − 1) ]
The calculator turns both label grades and model scores into listwise probability distributions, then compares their agreement, gain-weighted utility, and ordering quality across the full list.
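As a minimal sketch (not the calculator's own implementation), the softmax, cross-entropy, and KL steps can be reproduced with the Doc A–E example data; the `softmax` helper and variable names below are illustrative:

```python
import math

# True relevance grades and predicted scores from the Doc A-E example table.
rel    = [3, 2, 3, 0, 1]
scores = [1.92, 1.25, 1.71, 0.44, 0.88]

def softmax(xs, tau=1.0):
    """P(i) = exp(x_i / tau) / sum_j exp(x_j / tau), with max-subtraction for stability."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

p_true = softmax([float(r) for r in rel])   # grades -> listwise distribution
p_pred = softmax(scores)                    # scores -> listwise distribution

# Cross-entropy: L = -sum_i P_true(i) * ln(P_pred(i))
cross_entropy = -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

# KL divergence: KL = sum_i P_true(i) * ln(P_true(i) / P_pred(i))
kl = sum(t * math.log(t / p) for t, p in zip(p_true, p_pred))

print(f"cross-entropy = {cross_entropy:.4f}")
print(f"KL divergence = {kl:.4f}")
```

Cross-entropy always exceeds the KL divergence by the entropy of the true distribution, which is why the two move together but are not interchangeable.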
How to Use This Calculator
- Enter item labels for each ranked candidate. Leave labels blank only if you want automatic names.
- Provide true relevance grades. These can be graded judgments such as 0 to 3.
- Paste predicted model scores in the same item order as the labels and relevance grades.
- Choose a temperature to control how sharply scores convert into list probabilities.
- Set Top-K for gain-based evaluation and choose linear or exponential gain scaling.
- Click the calculate button. Results appear above the form, below the page header.
- Review loss metrics, NDCG, overlap, correlation, and the item-level contribution table.
- Download the calculated output as CSV or PDF for reports, audits, or model comparisons.
Frequently Asked Questions
1) What does this calculator measure?
It measures how closely a predicted ranked list matches the ideal list. It reports listwise distribution loss, gain-based ranking quality, overlap at K, and rank agreement metrics for learning-to-rank evaluation.
2) Why use listwise evaluation instead of item-wise checking?
Listwise evaluation treats the entire ranked set as one structure. That is useful when model quality depends on relative ordering, not only on isolated item scores.
3) What is the purpose of temperature?
Temperature controls how sharply raw scores become probabilities. Lower temperature makes the highest scores dominate. Higher temperature spreads probability mass more evenly across items.
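The sharpening effect can be seen directly with a hypothetical softmax helper applied to the example scores at two temperatures:

```python
import math

def softmax(xs, tau):
    # P(i) = exp(x_i / tau) / sum_j exp(x_j / tau); tau is the temperature
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.92, 1.25, 1.71, 0.44, 0.88]   # example predicted scores

sharp = softmax(scores, tau=0.25)  # low temperature: top score dominates
even  = softmax(scores, tau=4.0)   # high temperature: mass spreads out

print(max(sharp))   # well above 0.5 for the highest-scoring item
print(max(even))    # much closer to the uniform value of 1/5
```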
4) When should I use exponential gain?
Exponential gain is common when higher relevance grades should receive much larger reward. It emphasizes ranking the best items near the top and aligns well with many information retrieval settings.
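A quick comparison of the two gain formulas on grades 0 through 3 shows how exponential gain magnifies the reward for the top grade:

```python
# Linear vs exponential gain for graded relevance.
grades = [0, 1, 2, 3]

linear = [g for g in grades]          # gain = rel
expo   = [2**g - 1 for g in grades]   # gain = 2^rel - 1

print(linear)  # [0, 1, 2, 3]
print(expo)    # [0, 1, 3, 7]
```

Under linear gain a grade-3 item is worth three grade-1 items; under exponential gain it is worth seven.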
5) What does NDCG@K tell me?
NDCG@K compares the predicted top positions against the best possible ordering up to K items. Values closer to 1 indicate better ranking quality near the top of the list.
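Assuming the standard log₂ discount and exponential gain, NDCG@K on the Doc A–E example can be sketched as follows; since the predicted scores already rank the items ideally, the result is exactly 1:

```python
import math

def dcg_at_k(gains_in_rank_order, k):
    """DCG@K = sum over r = 1..K of gain_r / log2(r + 1)."""
    return sum(g / math.log2(r + 1)
               for r, g in enumerate(gains_in_rank_order[:k], start=1))

rel    = [3, 2, 3, 0, 1]                  # Doc A-E true grades
scores = [1.92, 1.25, 1.71, 0.44, 0.88]   # predicted scores

# Rank items by predicted score (descending), then read off true gains.
order = sorted(range(len(rel)), key=lambda i: -scores[i])
gains = [2**rel[i] - 1 for i in order]             # exponential gain
ideal = sorted((2**r - 1 for r in rel), reverse=True)

k = 3
ndcg = dcg_at_k(gains, k) / dcg_at_k(ideal, k)
print(f"NDCG@{k} = {ndcg:.4f}")   # 1.0000 for this already-ideal ordering
```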
6) Why can cross-entropy improve while NDCG changes little?
Cross-entropy measures probability distribution alignment across the whole list. NDCG focuses more on ranking utility, especially near the top. The two metrics capture different aspects of ranking behavior.
7) Does this calculator handle ties?
Yes. Ties are resolved with stable ordering based on original input sequence. That keeps the calculator deterministic and makes repeated evaluations consistent.
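Stable tie-breaking can be illustrated with Python's `sorted`, which is itself stable: tied scores keep their original input order, so repeated runs give identical rankings. The item names here are illustrative:

```python
# Stable tie handling: items with equal scores keep their input order.
items  = ["Doc A", "Doc B", "Doc C"]
scores = [1.5, 1.5, 0.9]   # Doc A and Doc B are tied

order = sorted(range(len(items)), key=lambda i: -scores[i])
print([items[i] for i in order])   # ['Doc A', 'Doc B', 'Doc C']
```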
8) Can I use model logits instead of probabilities?
Yes. Raw logits or unbounded ranking scores are acceptable. The calculator converts them into listwise probabilities with softmax before computing list-based comparison metrics.