Pairwise Ranking Calculator

Calculator Input

Scoring method

Tie credit

Confidence level

Margin power

Multiplies each comparison's weight by max(|margin|, 1)^power.

Hybrid BT weight

Hybrid Copeland weight

Items

Enter one item per line. Items that appear only in comparison rows are added automatically.

Pairwise comparisons

Format: item_a,item_b,outcome,weight,margin. Outcome must be a, b, or tie.
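The row format above can be read with a few lines of Python. This is a minimal sketch, not the page's own parser: the `Comparison` type and the `parse_comparisons` function are hypothetical names, and the assumption is that fields are comma-separated with optional surrounding spaces.

```python
import csv
from typing import NamedTuple


class Comparison(NamedTuple):
    item_a: str
    item_b: str
    outcome: str   # "a", "b", or "tie"
    weight: float
    margin: float


def parse_comparisons(text: str) -> list[Comparison]:
    """Parse rows of the form item_a,item_b,outcome,weight,margin."""
    rows = []
    for fields in csv.reader(text.strip().splitlines()):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {fields}")
        a, b, outcome, weight, margin = (f.strip() for f in fields)
        outcome = outcome.lower()
        if outcome not in {"a", "b", "tie"}:
            raise ValueError(f"outcome must be a, b, or tie, got {outcome!r}")
        rows.append(Comparison(a, b, outcome, float(weight), float(margin)))
    return rows


sample = """Model A,Model B,a,1.0,2
Model C,Model D,b,1.3,1
Model A,Model E,tie,1.0,0"""
comparisons = parse_comparisons(sample)
```

Rejecting malformed rows early keeps bad outcomes or missing fields from silently skewing the scores.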

Example Data Table

Item A	Item B	Outcome	Weight	Margin	Meaning
Model A	Model B	a	1.0	2	Model A beats Model B with moderate margin.
Model C	Model D	b	1.3	1	Model D beats Model C in a weighted comparison.
Model A	Model E	tie	1.0	0	Both items receive tie treatment based on tie credit.

Formula Used

1. Effective comparison weight
Effective Weight = Base Weight × max(|Margin|, 1)^{Margin Power}
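As a small illustration of the formula, the sketch below computes the effective weight for a few rows. The function name `effective_weight` is an assumption made for this example.

```python
def effective_weight(base_weight: float, margin: float, margin_power: float) -> float:
    """Base weight scaled by max(|margin|, 1) raised to margin_power."""
    return base_weight * max(abs(margin), 1.0) ** margin_power


# A margin of 2 with power 1 doubles the weight.
print(effective_weight(1.0, 2, 1.0))   # 2.0
# Margins below 1 (including a tie's margin of 0) are floored at 1,
# so the base weight passes through unchanged.
print(effective_weight(1.0, 0, 1.5))   # 1.0
# With power 2, a margin of 3 counts nine times as much as a margin of 1.
print(effective_weight(1.0, -3, 2.0))  # 9.0
```

Setting margin power to 0 makes every factor equal to 1, so only the base weight matters.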

2. Weighted win rate
Win Rate = (Weighted Wins + Tie Credit × Weighted Ties) / Total Exposure
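The win-rate formula can be sketched as follows. The page does not define Total Exposure, so this example assumes it is the sum of effective weights over every comparison an item appears in. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def weighted_win_rates(comparisons, tie_credit=0.5):
    """comparisons: iterable of (item_a, item_b, outcome, effective_weight)."""
    wins = defaultdict(float)
    ties = defaultdict(float)
    exposure = defaultdict(float)  # assumed: total weight of all rows involving the item
    for a, b, outcome, w in comparisons:
        exposure[a] += w
        exposure[b] += w
        if outcome == "a":
            wins[a] += w
        elif outcome == "b":
            wins[b] += w
        else:  # tie: both items accumulate tie weight
            ties[a] += w
            ties[b] += w
    return {
        item: (wins[item] + tie_credit * ties[item]) / exposure[item]
        for item in exposure
    }


rates = weighted_win_rates([
    ("A", "B", "a", 2.0),
    ("A", "C", "tie", 1.0),
    ("B", "C", "b", 1.0),
])
# A: (2.0 + 0.5 * 1.0) / 3.0, B: 0.0 / 3.0, C: (1.0 + 0.5 * 1.0) / 2.0
```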

3. Bradley-Terry ability model
P(i beats j) = Ability_i / (Ability_i + Ability_j)
The page estimates abilities iteratively from observed weighted wins.
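The page does not say which iterative scheme it uses. One common choice is the minorization-maximization (Zermelo) update, sketched below as an assumption rather than as the page's actual implementation. The function name and the (winner, loser, weight) input layout are hypothetical, and ties are left out; one option would be to split a tie into half a win for each side.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200, tol=1e-9):
    """comparisons: iterable of (winner, loser, weight).

    Returns abilities normalized to sum to 1.
    """
    wins = defaultdict(float)         # total weighted wins per item
    pair_weight = defaultdict(float)  # total weight per unordered pair
    items = set()
    for winner, loser, w in comparisons:
        wins[winner] += w
        pair_weight[frozenset((winner, loser))] += w
        items.update((winner, loser))
    items = sorted(items)
    ability = {i: 1.0 / len(items) for i in items}
    for _ in range(iterations):
        new = {}
        for i in items:
            # Sum over opponents j of n_ij / (ability_i + ability_j)
            denom = sum(
                pair_weight[frozenset((i, j))] / (ability[i] + ability[j])
                for j in items
                if j != i and frozenset((i, j)) in pair_weight
            )
            new[i] = wins[i] / denom if denom > 0 else ability[i]
        total = sum(new.values())
        new = {i: v / total for i, v in new.items()}
        delta = max(abs(new[i] - ability[i]) for i in items)
        ability = new
        if delta < tol:
            break
    return ability


# A beats B with total weight 3, B beats A with total weight 1.
abilities = bradley_terry([("A", "B", 3.0), ("B", "A", 1.0)])
# abilities["A"] / (abilities["A"] + abilities["B"]) is 0.75,
# matching the observed weighted win share.
```

In this plain version, an item with no wins converges to an ability of 0. Implementations often add a small prior or pseudo-count to avoid that.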

4. Copeland score
For each opponent, an item gets 1 point for a head-to-head win, 0.5 for a tie, and 0 for a loss.
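A Copeland tally can be sketched as below. The example assumes a head-to-head result is decided by comparing the total weight each side won against the other, with equal totals counted as a tie. The page may decide matchups differently, and the function name is hypothetical.

```python
from collections import defaultdict


def copeland_scores(comparisons):
    """comparisons: iterable of (item_a, item_b, outcome, weight)."""
    won = defaultdict(float)      # won[(x, y)] = total weight x won against y
    opponents = defaultdict(set)  # every opponent each item has faced
    for a, b, outcome, w in comparisons:
        opponents[a].add(b)
        opponents[b].add(a)
        if outcome == "a":
            won[(a, b)] += w
        elif outcome == "b":
            won[(b, a)] += w
        # tie rows add nothing to either side
    scores = {}
    for item, opps in opponents.items():
        score = 0.0
        for opp in opps:
            mine, theirs = won[(item, opp)], won[(opp, item)]
            if mine > theirs:
                score += 1.0   # head-to-head win
            elif mine == theirs:
                score += 0.5   # head-to-head tie
        scores[item] = score
    return scores


scores = copeland_scores([
    ("A", "B", "a", 1.0),
    ("A", "B", "b", 1.0),   # A and B split their matchup
    ("A", "C", "a", 2.0),
    ("B", "C", "tie", 1.0),
])
# A: 0.5 (vs B) + 1.0 (vs C) = 1.5
# B: 0.5 (vs A) + 0.5 (vs C) = 1.0
# C: 0.0 (vs A) + 0.5 (vs B) = 0.5
```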

5. Hybrid score
Final Score = Normalized BT Score × BT Weight + Normalized Copeland Score × Copeland Weight
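The blend can be illustrated with a short sketch. The page does not state how scores are normalized, so min-max scaling to [0, 1] is assumed here. The helper names `min_max` and `hybrid_scores` and the input values are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; assumed normalization, not confirmed by the page."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.5 for k in scores}  # all equal: assign a neutral value
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_scores(bt, copeland, bt_weight=0.5, copeland_weight=0.5):
    """Weighted sum of normalized Bradley-Terry and Copeland scores."""
    nbt, ncp = min_max(bt), min_max(copeland)
    return {k: nbt[k] * bt_weight + ncp[k] * copeland_weight for k in bt}


hybrid = hybrid_scores(
    bt={"A": 0.5, "B": 0.3, "C": 0.2},
    copeland={"A": 1.5, "B": 1.0, "C": 0.5},
    bt_weight=0.6,
    copeland_weight=0.4,
)
# A tops both scales, so it scores 0.6 + 0.4 = 1.0; C is last on both and scores 0.
```

If the two weights sum to 1, the final score stays on the same [0, 1] scale as its inputs.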

6. Confidence interval for win rate
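CI = p ± z × √(p(1-p) / n)
Here p is the weighted win rate, z is the standard normal critical value for the selected confidence level, and n is the item's sample size.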
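The interval can be computed with the Python standard library, as in the sketch below. It assumes n is the number of comparisons an item took part in, and it clamps the bounds to [0, 1], which the formula itself does not do. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def win_rate_interval(p, n, confidence=0.95):
    """Normal-approximation interval p ± z * sqrt(p(1-p)/n)."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sqrt(p * (1.0 - p) / n)
    # Clamping is an addition for readability; a win rate cannot leave [0, 1].
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = win_rate_interval(p=0.6, n=50, confidence=0.95)
# roughly (0.464, 0.736)
```

The normal approximation is least reliable when n is small or p is close to 0 or 1, so intervals in those cases deserve extra caution.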

This setup helps compare models, search results, recommendation candidates, prompts, labels, or ranked outputs when judgments come as pairwise preferences.

How to Use This Calculator

Enter the candidate names in the items box. Add one item per line.

Paste pairwise comparison rows using the format: item_a, item_b, outcome, weight, margin.

Select a scoring method. Hybrid blends Bradley-Terry strengths with direct head-to-head outcomes.

Choose tie credit to control how tied judgments affect ranking scores.

Set the confidence level for the win-rate interval, and raise margin power if larger margins should count for more.

Press the calculate button. Results appear below the header and above the form.

Review the table, confidence bands, and chart. Then export CSV or PDF if needed.

Frequently Asked Questions

What does pairwise ranking measure?

It estimates which items are preferred when evaluated in head-to-head comparisons. This is useful for ranking models, recommendations, prompts, labels, or search results.

When should I use Bradley-Terry?

Use Bradley-Terry when you want a probabilistic ability estimate from pairwise outcomes. It is helpful when comparisons are sparse, repeated, or unevenly weighted.

What is the Copeland score?

Copeland counts direct head-to-head wins against opponents. It is simple, interpretable, and useful when you want a transparent scoreboard from observed comparisons.

Why use a hybrid method?

Hybrid scoring balances model-based estimation with direct matchup evidence. It can be more stable than pure head-to-head counts while remaining easier to explain than a purely model-based score.

How are ties handled?

The tie credit setting controls how much of a win each item receives for a tied outcome. Zero gives ties no credit, 0.5 treats them as half-wins, and 1.0 counts a tie as a full win for both items.

What does margin power do?

Margin power increases the impact of larger victory margins. Higher values make strong wins influence the ranking more than narrow wins.

What does the confidence interval show?

It gives an uncertainty range around the estimated weighted win rate. Wider intervals usually mean fewer comparisons or noisier evidence.

Can I use this for LLM evaluation?

Yes. It fits pairwise preference data from human raters, offline evaluation sets, response battles, search relevance judging, and recommendation testing.