Inter Rater Reliability Calculator

Turn messy ratings into clear reliability statistics fast. Supports two raters (percent agreement, Cohen’s κ, weighted κ) or three or more raters (Fleiss’ κ) across nominal or ordinal categories. See charts, download files, and share audit-ready outputs easily.

Pick the statistic that matches your study design.
Quadratic penalizes distant disagreements more strongly.
Use 0 to skip, or 500–2000 for a CI.
For ordinal weighting, order matters.
Accepts commas, tabs, or semicolons. If you have an item id, use: id,rater1,rater2.
Each row must have exactly k counts.
If omitted, default labels are used.
All lines must sum to the same number of raters (n ≥ 2).
Reset

Example data table

Two raters classify 10 items into three categories.

Item   Rater 1   Rater 2
1      Low       Low
2      Low       Medium
3      Medium    Medium
4      High      High
5      Medium    High
6      Low       Low
7      High      High
8      Medium    Medium
9      High      Medium
10     Low       Low

Formula used

  • Percent agreement: \(P_o = \frac{\sum_i O_{ii}}{N}\)
  • Cohen’s kappa: \(\kappa = \frac{P_o - P_e}{1 - P_e}\), where \(P_e = \sum_i (r_i/N)(c_i/N)\)
  • Weighted kappa: \(\kappa_w = \frac{P_{o,w} - P_{e,w}}{1 - P_{e,w}}\), with weights \(w_{ij}\) (linear or quadratic)
  • Fleiss’ kappa: \(\kappa = \frac{\bar P - P_e}{1 - P_e}\), where \(\bar P\) is mean item agreement and \(P_e = \sum_j p_j^2\)

Symbols: \(O_{ij}\) is the confusion count, \(N\) is items, \(r_i\) and \(c_i\) are row/column totals, and \(p_j\) is category prevalence.
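As a minimal sketch in plain Python (no external libraries; the function name is illustrative, not this tool's internals), the percent-agreement and Cohen’s κ formulas above can be applied directly to the example table:

```python
from collections import Counter

# Paired ratings from the example table: (rater 1, rater 2) per item
pairs = [("Low", "Low"), ("Low", "Medium"), ("Medium", "Medium"),
         ("High", "High"), ("Medium", "High"), ("Low", "Low"),
         ("High", "High"), ("Medium", "Medium"), ("High", "Medium"),
         ("Low", "Low")]

def cohen_kappa(pairs):
    n = len(pairs)
    # Observed agreement P_o: fraction of exact matches
    p_o = sum(a == b for a, b in pairs) / n
    # Chance agreement P_e from the marginal label frequencies
    r = Counter(a for a, _ in pairs)   # row totals (rater 1)
    c = Counter(b for _, b in pairs)   # column totals (rater 2)
    p_e = sum((r[k] / n) * (c[k] / n) for k in set(r) | set(c))
    return p_o, (p_o - p_e) / (1 - p_e)

p_o, kappa = cohen_kappa(pairs)
print(f"P_o = {p_o:.2f}, kappa = {kappa:.3f}")  # P_o = 0.70, kappa = 0.552
```

Note the gap: the raters match on 7 of 10 items (70%), but after removing the 33% chance baseline, κ comes out near 0.55.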

How to use this calculator

  1. Select a method that matches your rater setup.
  2. For two raters, paste one item per line as rater1,rater2.
  3. For ordinal ratings, provide ordered categories and choose weighting.
  4. For many raters, paste per-item category counts; each row must sum to the same rater count.
  5. Optionally set bootstrap iterations to get a 95% interval.
  6. Press Calculate, review plots, then export CSV or PDF.
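Step 5 (the bootstrap interval) can be sketched as follows; the function names and the choice of percent agreement as the resampled statistic are illustrative, not this tool's internals:

```python
import random

# Example data: (rater 1, rater 2) per item
pairs = [("Low", "Low"), ("Low", "Medium"), ("Medium", "Medium"),
         ("High", "High"), ("Medium", "High"), ("Low", "Low"),
         ("High", "High"), ("Medium", "Medium"), ("High", "Medium"),
         ("Low", "Low")]

def bootstrap_ci(pairs, stat, iters=1000, seed=0):
    """Percentile bootstrap: resample items with replacement, recompute
    the statistic, and take the 2.5th/97.5th percentiles."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(stat([pairs[rng.randrange(n)] for _ in range(n)])
                   for _ in range(iters))
    return stats[int(0.025 * iters)], stats[int(0.975 * iters) - 1]

percent_agreement = lambda ps: sum(a == b for a, b in ps) / len(ps)
lo, hi = bootstrap_ci(pairs, percent_agreement)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```

With only 10 items the interval is wide, which is exactly why the form suggests 500–2000 iterations on realistically sized datasets.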

Why agreement statistics outperform simple consistency checks

Two reviewers can match on 80 of 100 items, yet still disagree systematically on rare labels. Percent agreement reports 80%, but it ignores chance matching driven by category prevalence. Kappa-style coefficients correct for that baseline, helping you compare rating quality across projects, teams, and time periods.

Method selection by rater count and measurement scale

Use Cohen’s κ for two raters and nominal categories. If categories are ordered (for example: low, medium, high), weighted κ rewards near-misses with partial credit. For three or more raters rating the same items, Fleiss’ κ summarizes multi-rater agreement using per-item category counts.
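For the ordinal two-rater case, weighted κ can be sketched like this (plain Python; agreement weights \(w_{ij} = 1 - (|i-j|/(k-1))^p\) with p = 1 linear, p = 2 quadratic):

```python
def weighted_kappa(pairs, order, power=1):
    """Weighted kappa for two raters over ordered categories.
    power=1 gives linear weights, power=2 quadratic."""
    idx = {cat: i for i, cat in enumerate(order)}
    k, n = len(order), len(pairs)
    O = [[0] * k for _ in range(k)]               # confusion counts O_ij
    for a, b in pairs:
        O[idx[a]][idx[b]] += 1
    w = [[1 - (abs(i - j) / (k - 1)) ** power for j in range(k)]
         for i in range(k)]                       # agreement weights
    r = [sum(O[i]) for i in range(k)]             # row totals
    c = [sum(O[i][j] for i in range(k)) for j in range(k)]  # column totals
    p_ow = sum(w[i][j] * O[i][j] for i in range(k) for j in range(k)) / n
    p_ew = sum(w[i][j] * r[i] * c[j] for i in range(k) for j in range(k)) / n**2
    return (p_ow - p_ew) / (1 - p_ew)

pairs = [("Low", "Low"), ("Low", "Medium"), ("Medium", "Medium"),
         ("High", "High"), ("Medium", "High"), ("Low", "Low"),
         ("High", "High"), ("Medium", "Medium"), ("High", "Medium"),
         ("Low", "Low")]
order = ["Low", "Medium", "High"]
print(round(weighted_kappa(pairs, order, power=1), 3))  # 0.659 (linear)
print(round(weighted_kappa(pairs, order, power=2), 3))  # 0.769 (quadratic)
```

On the example data every disagreement is one step apart, so both weighted variants land above the unweighted κ of 0.55, with quadratic highest.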

Interpreting values with operational thresholds

Many teams treat κ ≈ 0.60 as “acceptable” for production labeling and κ ≥ 0.80 as “high confidence.” A κ of 0.40 may still be useful in exploratory studies if the task is ambiguous. This calculator also shows percent agreement so you can spot prevalence effects when agreement is high but κ remains modest. For a 4-category scale, a weighted κ of 0.75 can coexist with an unweighted κ of 0.55 when most errors are one step apart. Use the heatmap to confirm whether disagreements cluster on adjacent categories or jump across extremes; that pattern informs training and guideline edits. When feasible, review at least 25 items per category.

Typical dataset sizes and stability expectations

With 20 items, reliability can swing noticeably if just 2–3 labels flip. With 100 items, a single disagreement changes agreement by only 1%. As a practical rule, many audits target 50–200 jointly rated items per round, then re-check after guideline updates or model changes.
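The sensitivity claim is simple arithmetic: one flipped label moves percent agreement by exactly 1/N, which exact fractions make explicit:

```python
from fractions import Fraction

# Flipping one agreement to a disagreement shifts P_o by exactly 1/N
for n, agree in [(20, 16), (100, 80)]:
    shift = Fraction(agree, n) - Fraction(agree - 1, n)
    print(n, float(shift))  # 1/20 = 5% swing vs. 1/100 = 1% swing
```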

Quality control checks that prevent misleading κ

If one category dominates (for example 95% “No”), κ can drop even when raters agree often. Review the confusion matrix to see which labels drive mismatches, and verify both raters used all categories as intended. For ordinal work, compare linear vs quadratic weighting when distance matters.
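The prevalence effect can be reproduced in a few lines. The confusion counts below are a hypothetical 100-item screen where “No” dominates: raw agreement is 92%, yet κ stays modest because chance agreement is already very high:

```python
# Hypothetical confusion counts: rows = rater 1, columns = rater 2
O = [[90, 5],   # rater 1 "No":  90 agreed, 5 called "Yes" by rater 2
     [3, 2]]    # rater 1 "Yes": 3 mismatches, 2 agreed
n = sum(sum(row) for row in O)
p_o = (O[0][0] + O[1][1]) / n                       # 0.92 raw agreement
r = [sum(row) for row in O]                          # row totals [95, 5]
c = [O[0][j] + O[1][j] for j in range(2)]            # column totals [93, 7]
p_e = sum(r[i] * c[i] for i in range(2)) / n**2      # 0.887 chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))  # 0.292 despite 92% raw agreement
```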

Reporting-ready outputs for papers, audits, and dashboards

Good reporting includes the coefficient, item count, rater count, category set, and an uncertainty range. This tool provides a bootstrap 95% interval when requested, plus downloadable tables and plots. Pair the statistic with examples of disagreements so stakeholders can understand what “reliability” looks like in practice.

FAQs

What is the difference between percent agreement and kappa?

Percent agreement is the raw match rate. Kappa adjusts for agreement expected by chance from category prevalence. When one label dominates, agreement can be high while kappa stays moderate.

When should I use weighted kappa?

Use it when categories are ordered, such as severity levels. Weighted kappa gives partial credit to near disagreements, and penalizes distant disagreements more strongly than adjacent ones.

How many items do I need for a reliable estimate?

More is better. Small samples (under 30 items) can produce unstable values. Many teams audit 50–200 jointly rated items per round, depending on label complexity.

Why can kappa be negative?

A negative kappa means agreement is worse than what chance alone would predict. This often signals a guideline misunderstanding, swapped label definitions, or systematic bias between raters.
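Swapped label definitions produce the extreme case: perfect disagreement yields κ = −1. A toy binary example:

```python
from collections import Counter

# Two raters whose labels are exactly swapped on 10 binary items
pairs = [("Yes", "No")] * 5 + [("No", "Yes")] * 5
n = len(pairs)
p_o = sum(a == b for a, b in pairs) / n                       # 0.0
r = Counter(a for a, _ in pairs)
c = Counter(b for _, b in pairs)
p_e = sum((r[k] / n) * (c[k] / n) for k in set(r) | set(c))   # 0.5
kappa = (p_o - p_e) / (1 - p_e)
print(kappa)  # -1.0: agreement is as far below chance as possible
```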

Can Fleiss’ kappa handle missing ratings?

Standard Fleiss’ kappa assumes the same number of raters for every item. If some items have fewer ratings, consider rebalancing your data or using alternative statistics designed for missingness.
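A Fleiss’ κ sketch that enforces the equal-rater assumption up front (plain Python; the example counts are illustrative):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from per-item category counts.
    counts: one row per item; each row must sum to the same rater count n."""
    N = len(counts)
    n = sum(counts[0])
    assert all(sum(row) == n for row in counts), "every item needs n raters"
    k = len(counts[0])
    # Category prevalences p_j pooled over all N*n ratings
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # Mean per-item agreement P_bar: pairs of raters who matched, per item
    P_bar = sum(sum(x * (x - 1) for x in row) / (n * (n - 1))
                for row in counts) / N
    p_e = sum(pj ** 2 for pj in p)
    return (P_bar - p_e) / (1 - p_e)

# 4 items, 3 raters each, 2 categories (illustrative counts)
counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(counts), 3))  # 0.333
```

If any row sums differently, the guard fires; that is the point at which you would rebalance the data or switch statistics, as noted above.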

What does the bootstrap interval mean here?

It resamples items with replacement and recomputes the statistic many times. The 2.5th and 97.5th percentiles provide an empirical 95% uncertainty band around your estimate.

Related Calculators

Phi Coefficient Calculator

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.