Compute Cohen's kappa from customizable rating tables. Compare weighted methods, confidence bounds, and agreement metrics. Turn categorical judgments into clear reliability evidence in seconds.
Enter labels and counts for a square agreement table. Rows represent Rater A. Columns represent Rater B.
This sample shows three response categories from two raters evaluating the same 100 items.
| Rater A \ Rater B | Negative | Neutral | Positive | Row total |
|---|---|---|---|---|
| Negative | 35 | 3 | 2 | 40 |
| Neutral | 4 | 28 | 3 | 35 |
| Positive | 1 | 5 | 19 | 25 |
| Column total | 40 | 36 | 24 | 100 |
For this example, exact agreement is 82%, expected agreement is 34.6%, and unweighted kappa is approximately 0.7248, indicating substantial agreement.
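The statistics quoted above can be reproduced directly from the table. The following is a minimal sketch (plain Python, no libraries) that sums the diagonal for observed agreement and the products of matching marginals for expected agreement:

```python
# Reproduce the sample table's statistics (counts from the example above).
table = [
    [35, 3, 2],   # Rater A: Negative
    [4, 28, 3],   # Rater A: Neutral
    [1, 5, 19],   # Rater A: Positive
]
n = sum(sum(row) for row in table)                          # 100 items
p_o = sum(table[i][i] for i in range(3)) / n                # observed agreement
row_tot = [sum(row) for row in table]                       # [40, 35, 25]
col_tot = [sum(table[i][j] for i in range(3)) for j in range(3)]  # [40, 36, 24]
p_e = sum(row_tot[i] * col_tot[i] for i in range(3)) / n**2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, round(kappa, 4))   # 0.82 0.346 0.7248
```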
Kappa compares observed agreement with the agreement expected by chance from the marginal totals.
Observed agreement: P_o = (sum of diagonal counts) / N
Expected agreement: P_e = sum[(row total_i / N) × (column total_i / N)]
Cohen's kappa: κ = (P_o - P_e) / (1 - P_e)
Approximate standard error: SE ≈ sqrt( P_o(1-P_o) / [N(1-P_e)^2] )
95% confidence interval: κ ± 1.96 × SE
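Applying the standard-error and confidence-interval formulas to the sample table's values (P_o = 0.82, P_e = 0.346, N = 100) gives an interval of roughly 0.61 to 0.84. A short sketch of that calculation:

```python
import math

# Large-sample SE and 95% CI for the sample table's kappa.
p_o, p_e, n = 0.82, 0.346, 100
kappa = (p_o - p_e) / (1 - p_e)
se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
lo, hi = kappa - 1.96 * se, kappa + 1.96 * se
print(f"kappa = {kappa:.4f}, SE = {se:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```

Because the interval excludes zero, the observed agreement is clearly better than chance for this example.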
For weighted kappa, the calculator replaces exact agreement with a weighted agreement score:
Weighted observed agreement: P_o(w) = sum[w_ij × p_ij], where p_ij = n_ij / N is the proportion of items in cell (i, j)
Weighted expected agreement: P_e(w) = sum[w_ij × p_i+ × p_+j], where p_i+ and p_+j are the row and column marginal proportions
Weighted kappa: κ_w = (P_o(w) - P_e(w)) / (1 - P_e(w))
Linear and quadratic weights give partial credit to near agreements. Use weighted methods when categories follow a logical order.
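The weighted formulas above can be sketched as a small function. It uses the common agreement-weight schemes w_ij = 1 − |i − j|/(k − 1) (linear) and w_ij = 1 − [(i − j)/(k − 1)]² (quadratic), applied here to the sample table; treating the three categories as ordered is an assumption made for illustration:

```python
def weighted_kappa(table, scheme="linear"):
    """Weighted Cohen's kappa with linear or quadratic agreement weights."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        d = abs(i - j) / (k - 1)          # normalized category distance
        return 1 - d if scheme == "linear" else 1 - d ** 2

    p_ow = sum(w(i, j) * table[i][j] / n for i in range(k) for j in range(k))
    p_ew = sum(w(i, j) * row[i] * col[j] / n ** 2
               for i in range(k) for j in range(k))
    return (p_ow - p_ew) / (1 - p_ew)

table = [[35, 3, 2], [4, 28, 3], [1, 5, 19]]
print(round(weighted_kappa(table, "linear"), 4))     # 0.7529
print(round(weighted_kappa(table, "quadratic"), 4))  # 0.7826
```

Both weighted values exceed the unweighted 0.7248 because most disagreements in this table fall in adjacent categories, which the weights partially credit.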
Kappa measures how much two raters agree after removing the agreement expected by chance. It works best for categorical ratings and helps assess reliability beyond simple percent agreement.
Use weighted kappa when categories are ordered, such as severity levels or satisfaction scores. It gives partial credit when raters are close but not identical, making it more informative than unweighted kappa for ordinal data.
Percent agreement ignores chance agreement. When one category is very common, raters may agree often simply because both choose it frequently. Kappa adjusts for this imbalance and can therefore be lower.
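A quick numeric illustration of this effect, using hypothetical counts with one dominant category: raw agreement is 91%, yet kappa is low because both raters would agree almost as often by chance alone.

```python
# Hypothetical 2x2 table: both raters overwhelmingly choose category 1.
table = [[90, 5],
         [4, 1]]
n = 100
p_o = (90 + 1) / n                         # raw agreement: 0.91
p_e = (95 * 94 + 5 * 6) / n ** 2           # chance agreement: 0.896
kappa = (p_o - p_e) / (1 - p_e)
print(round(p_o, 2), round(kappa, 3))      # 0.91 0.135
```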
Kappa can range from -1 to 1. A value near 1 suggests very strong agreement, 0 suggests chance-level agreement, and negative values suggest systematic disagreement.
Yes. You can choose between two and six categories and enter a full square confusion matrix. That makes it suitable for many practical coding, labeling, diagnostic, or review workflows.
The interval shows a likely range for the true kappa under repeated sampling. Narrow intervals suggest more precision. This calculator reports an approximate large-sample confidence interval.
Yes. It is useful for comparing human annotators, model reviewers, or two coding systems. The matrix format matches many classification and labeling review tasks.
Enter counts of items assigned to each category combination. Each cell should hold the number of items that Rater A placed in the row category and Rater B placed in the column category.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.