Analyze pairwise outcomes, margins, ties, and weighted preferences for recommendation experiments and ranking evaluation. Rank candidates using clear metrics, confidence bands, and visuals.
| Item A | Item B | Outcome | Weight | Margin | Meaning |
|---|---|---|---|---|---|
| Model A | Model B | a | 1.0 | 2 | Model A beats Model B by a moderate margin. |
| Model C | Model D | b | 1.3 | 1 | Model D beats Model C in a comparison weighted 1.3. |
| Model A | Model E | tie | 1.0 | 0 | Neither item wins; each receives the configured tie credit. |

This setup helps compare models, search results, recommendation candidates, prompts, labels, or ranked outputs when judgments come as pairwise preferences.
Enter the candidate names in the items box. Add one item per line.
Paste pairwise comparison rows using the format: item_a, item_b, outcome, weight, margin.
Select a scoring method. Hybrid blends Bradley-Terry strengths with direct head-to-head outcomes.
Choose tie credit to control how tied judgments affect ranking scores.
Set the confidence level for the win-rate interval, and set margin power to control how strongly victory margins influence the scores.
Press the calculate button. Results appear below the header and above the form.
Review the table, confidence bands, and chart. Then export CSV or PDF if needed.
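The row format described in the steps above (item_a, item_b, outcome, weight, margin) can be parsed with a short sketch like the one below. The `Comparison` dataclass and the `parse_rows` function are hypothetical names, and the accepted outcome values (`a`, `b`, `tie`) are inferred from the example table rather than from a documented specification.

```python
import csv
from dataclasses import dataclass
from io import StringIO


@dataclass
class Comparison:
    item_a: str
    item_b: str
    outcome: str   # assumed to be "a", "b", or "tie"
    weight: float
    margin: float


def parse_rows(text: str) -> list[Comparison]:
    """Parse comma-separated comparison rows, skipping blank lines."""
    rows = []
    for fields in csv.reader(StringIO(text), skipinitialspace=True):
        if not fields or all(not f.strip() for f in fields):
            continue
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {fields}")
        a, b, outcome, weight, margin = (f.strip() for f in fields)
        outcome = outcome.lower()
        if outcome not in {"a", "b", "tie"}:
            raise ValueError(f"unknown outcome: {outcome!r}")
        rows.append(Comparison(a, b, outcome, float(weight), float(margin)))
    return rows


sample = """Model A, Model B, a, 1.0, 2
Model C, Model D, b, 1.3, 1
Model A, Model E, tie, 1.0, 0"""

comparisons = parse_rows(sample)
for row in comparisons:
    print(row)
```

Rejecting malformed rows early, instead of silently skipping them, makes it easier to spot a misplaced comma before it distorts the ranking.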
The calculator estimates which items are preferred in head-to-head comparisons. This is useful for ranking models, recommendations, prompts, labels, or search results.
Use Bradley-Terry when you want a probabilistic ability estimate from pairwise outcomes. It is helpful when comparisons are sparse, repeated, or unevenly weighted.
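As a rough illustration of how Bradley-Terry strengths can be estimated, the sketch below uses the standard minorization-maximization (MM) update. It is not necessarily what the calculator runs internally. The `bradley_terry` function and the `wins` dictionary layout (keyed by `(winner, loser)` with weighted win totals) are assumptions made for this example.

```python
def bradley_terry(wins: dict[tuple[str, str], float],
                  items: list[str],
                  iters: int = 200) -> dict[str, float]:
    """Estimate Bradley-Terry strengths with the MM update.

    wins[(i, j)] is the (possibly weighted) number of times i beat j.
    Returned strengths are normalized to sum to 1.
    """
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0.0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0.0) + wins.get((j, i), 0.0)) / (p[i] + p[j])
                for j in items if j != i
            )
            # Items with no comparisons keep their previous strength.
            new_p[i] = total_wins / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        if norm > 0:
            p = {i: v / norm for i, v in new_p.items()}
    return p


example_wins = {
    ("A", "B"): 3.0, ("B", "A"): 1.0,
    ("B", "C"): 2.0, ("C", "B"): 2.0,
    ("A", "C"): 3.0, ("C", "A"): 1.0,
}
strengths = bradley_terry(example_wins, ["A", "B", "C"])
print(strengths)
```

Under this model, the estimated probability that item i beats item j is `p[i] / (p[i] + p[j])`. An item that never wins is driven to strength 0 by this update, so practical implementations often add a small prior or pseudo-count.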
Copeland counts direct head-to-head wins against opponents. It is simple, interpretable, and useful when you want a transparent scoreboard from observed comparisons.
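A minimal Copeland sketch follows, assuming the common variant in which each item scores +1 for every opponent it beats on total weighted wins and -1 for every opponent that beats it. The `copeland_scores` name and the `wins` layout (keyed by `(winner, loser)`) are hypothetical, and the calculator may use a different variant.

```python
from itertools import combinations


def copeland_scores(wins: dict[tuple[str, str], float],
                    items: list[str]) -> dict[str, int]:
    """Score each item as (opponents beaten) minus (opponents lost to)."""
    scores = {i: 0 for i in items}
    for a, b in combinations(items, 2):
        a_wins = wins.get((a, b), 0.0)
        b_wins = wins.get((b, a), 0.0)
        if a_wins > b_wins:
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:
            scores[b] += 1
            scores[a] -= 1
        # Equal totals (including no comparisons) leave both scores unchanged.
    return scores


example_wins = {
    ("A", "B"): 3.0, ("B", "A"): 1.0,
    ("B", "C"): 2.0, ("C", "B"): 2.0,
    ("A", "C"): 1.0, ("C", "A"): 2.0,
}
copeland = copeland_scores(example_wins, ["A", "B", "C"])
print(copeland)  # {'A': 0, 'B': -1, 'C': 1}
```

Because only the winner of each matchup counts, a 3-1 result and a 30-1 result contribute the same amount, which is what makes the scoreboard easy to read but insensitive to margin.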
Hybrid scoring balances model-based estimation with direct matchup evidence. It can be more stable than pure head-to-head counts while remaining easier to explain.
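The exact blending rule is not documented here, so the following is only one plausible sketch: rescale both score sets to the range 0 to 1 and take a weighted average. The `hybrid_scores` function, the `alpha` parameter, and the min-max rescaling are all assumptions for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; all-equal inputs map to 0.5."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.5 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_scores(model_scores: dict[str, float],
                  matchup_scores: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Blend model-based and head-to-head scores (alpha is hypothetical)."""
    m, h = min_max(model_scores), min_max(matchup_scores)
    return {k: alpha * m[k] + (1 - alpha) * h[k] for k in model_scores}


# Illustrative inputs only, not results from real data.
bt = {"A": 0.5, "B": 0.3, "C": 0.2}
head_to_head = {"A": 0, "B": -1, "C": 1}
hybrid = hybrid_scores(bt, head_to_head, alpha=0.5)
print(hybrid)
```

With `alpha=1.0` the result follows the model-based scores alone, and with `alpha=0.0` it follows the head-to-head scores alone.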
The tie credit setting controls the value assigned to tied outcomes. Zero ignores ties, 0.5 treats them as half-wins, and 1.0 gives both items full tie credit.
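To make the effect of tie credit concrete, the sketch below accumulates weighted credit per item. The `credit_totals` function and the tuple layout `(item_a, item_b, outcome, weight)` are hypothetical; how the calculator normalizes these totals afterwards is not specified here.

```python
from collections import defaultdict


def credit_totals(rows: list[tuple[str, str, str, float]],
                  tie_credit: float = 0.5) -> dict[str, float]:
    """Sum weighted credit per item; ties give tie_credit * weight to both."""
    totals: dict[str, float] = defaultdict(float)
    for item_a, item_b, outcome, weight in rows:
        # Touch both keys so items with no credit still appear with 0.0.
        totals[item_a] += 0.0
        totals[item_b] += 0.0
        if outcome == "a":
            totals[item_a] += weight
        elif outcome == "b":
            totals[item_b] += weight
        elif outcome == "tie":
            totals[item_a] += tie_credit * weight
            totals[item_b] += tie_credit * weight
    return dict(totals)


rows = [
    ("Model A", "Model B", "a", 1.0),
    ("Model C", "Model D", "b", 1.3),
    ("Model A", "Model E", "tie", 1.0),
]
half = credit_totals(rows, tie_credit=0.5)
none = credit_totals(rows, tie_credit=0.0)
print(half)
print(none)
```

With a tie credit of 0.5, Model A ends with 1.5 and Model E with 0.5. With a tie credit of 0, the tied row contributes nothing, so Model A stays at 1.0 and Model E at 0.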
Margin power increases the impact of larger victory margins. Higher values make strong wins influence the ranking more than narrow wins.
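One simple way to model this is to multiply each comparison's weight by the margin raised to the margin power. The formula below is an assumption, not the calculator's documented rule, and `effective_weight` is a hypothetical name. Margins below 1 are floored at 1 so that ties and zero-margin rows keep their base weight.

```python
def effective_weight(weight: float, margin: float, margin_power: float) -> float:
    """Scale a comparison's weight by margin ** margin_power (assumed rule)."""
    return weight * max(margin, 1.0) ** margin_power


# A margin of 2 under increasing margin power.
for power in (0.0, 1.0, 2.0):
    print(power, effective_weight(1.0, 2, power))
```

With a margin power of 0 every comparison keeps its base weight. At 1 a margin of 2 counts twice as much as a margin of 1, and at 2 it counts four times as much.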
The confidence interval gives an uncertainty range around the estimated weighted win rate. Wider intervals usually mean fewer comparisons or noisier evidence.
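The interval method is not stated here. As one reasonable sketch, the code below computes a Wilson score interval for a win rate and uses Kish's effective sample size to account for unequal weights. Both choices, and the names `effective_n` and `wilson_interval`, are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def effective_n(weights: list[float]) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    sq = sum(w * w for w in weights)
    return (sum(weights) ** 2) / sq if sq > 0 else 0.0


def wilson_interval(win_rate: float, n: float,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a proportion observed over n trials."""
    if n <= 0:
        return (0.0, 1.0)  # no evidence: the interval is uninformative
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    denom = 1 + z * z / n
    center = (win_rate + z * z / (2 * n)) / denom
    half = z * sqrt(win_rate * (1 - win_rate) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


n_eff = effective_n([1.0, 1.3, 1.0])
print(wilson_interval(0.5, 10))
print(wilson_interval(0.5, 100))
print(wilson_interval(0.5, n_eff))
```

The interval for 100 comparisons is noticeably narrower than the one for 10, which matches the point that wider intervals indicate less evidence.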
Yes. The calculator works well for preference data from human raters, offline evaluation sets, response battles, search relevance judging, and recommendation testing.