Input Your Retrieval Outcomes
Example Data Table
Use this sample format for a small RAG evaluation batch.
| Query | R (relevant total) | k (top-k cutoff) | r (relevant retrieved) | Comment |
|---|---|---|---|---|
| Billing address change | 6 | 10 | 5 | One relevant chunk missed. |
| Refund timeframe | 4 | 8 | 3 | An irrelevant policy chunk outranked a relevant one. |
| Rate limit headers | 3 | 5 | 3 | Full coverage within top-k. |
Formula Used
- Per-query Recall: Recall = r / R, where R is the total number of ground-truth relevant contexts for the query and r is the number of relevant contexts among the top-k retrieved.
- Micro Recall (dataset-level): Σr / ΣR. Weights queries with larger R more heavily, so intents with more relevant contexts dominate the score.
- Macro Recall (balanced): average of per-query recall values where R > 0.
- Optional helpers: Precision = r / k and F1 = 2 · Precision · Recall / (Precision + Recall), to show the tradeoff when raising k.
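The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's implementation: the row tuples mirror the example table, and names like `micro_recall` and `macro_recall` are assumed for clarity.

```python
# Each row: (query, R = total relevant, k = cutoff, r = relevant retrieved).
# Values copied from the example table above.
rows = [
    ("Billing address change", 6, 10, 5),
    ("Refund timeframe",       4,  8, 3),
    ("Rate limit headers",     3,  5, 3),
]

def micro_recall(rows):
    """Dataset-level recall: sum of r over sum of R."""
    return sum(r for _, R, k, r in rows) / sum(R for _, R, k, r in rows)

def macro_recall(rows):
    """Unweighted mean of per-query recall, skipping rows where R == 0."""
    vals = [r / R for _, R, k, r in rows if R > 0]
    return sum(vals) / len(vals)

print(round(micro_recall(rows), 3))  # 11/13 ≈ 0.846
print(round(macro_recall(rows), 3))  # (5/6 + 3/4 + 1) / 3 ≈ 0.861
```

Note how micro recall (0.846) sits below macro recall (0.861) here: the query with the largest R also has the lowest per-query recall, and micro weighting lets it pull the average down.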
How to Use This Calculator
- Decide your retrieval cutoff k per query (top-k contexts).
- For each query, count R ground-truth relevant contexts.
- Run retrieval, label how many returned contexts are relevant (r).
- Click Calculate Context Recall to see micro and macro recall.
- Export CSV for tracking, or PDF for sharing with stakeholders.
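If you prefer to script the export step rather than use the button, the CSV layout can be reproduced with the standard library. The column names and the `export_csv` helper below are assumptions for illustration, not the calculator's actual output schema.

```python
import csv
import io

# Hypothetical rows in the same (query, R, k, r) shape as the example table.
rows = [
    ("Billing address change", 6, 10, 5),
    ("Refund timeframe",       4,  8, 3),
    ("Rate limit headers",     3,  5, 3),
]

def export_csv(rows):
    """Return a CSV string with per-query recall and precision columns."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["query", "R", "k", "r", "recall", "precision"])
    for query, R, k, r in rows:
        recall = r / R if R > 0 else ""      # recall is undefined when R == 0
        precision = r / k if k > 0 else ""
        writer.writerow([query, R, k, r, recall, precision])
    return buf.getvalue()

print(export_csv(rows))
```

Writing to an in-memory buffer keeps the sketch self-contained; swap `io.StringIO()` for `open("recall.csv", "w", newline="")` to produce a file for tracking.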
FAQs
1) What does context recall measure in retrieval-augmented systems?
It estimates how much of the needed ground-truth context your retriever returns within top-k. Higher recall usually improves answer grounding when the generator uses retrieved content.
2) Should I use micro or macro recall for reporting?
Use micro recall for overall user-weighted performance, and macro recall for fairness across intents. If rare queries matter, macro recall prevents them from being drowned out.
3) What if my ground truth has zero relevant contexts?
If R equals zero, recall is undefined for that row and excluded from macro averaging. Consider revising the evaluation set to include only queries requiring retrieval.
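The exclusion rule can be made explicit in code. A minimal sketch, assuming rows are (R, r) pairs and a hypothetical `macro_recall` helper:

```python
def macro_recall(rows):
    """Average per-query recall over rows where R > 0; R == 0 rows are skipped."""
    defined = [r / R for R, r in rows if R > 0]
    if not defined:
        raise ValueError("no rows with R > 0; macro recall is undefined")
    return sum(defined) / len(defined)

# The (0, 0) row contributes nothing to the average.
print(macro_recall([(4, 3), (0, 0), (2, 2)]))  # (0.75 + 1.0) / 2 = 0.875
```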
4) Can recall be high while answers are still wrong?
Yes. High recall only means relevant context was retrieved. The generator may ignore it, hallucinate, or misinterpret it. Track answer faithfulness and citation accuracy too.
5) How does changing k affect recall?
Increasing k often improves recall but may reduce precision by adding noise. Use the precision and F1 columns to observe the tradeoff while tuning k and reranking.
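The tradeoff is easy to see on a single ranked run. In this sketch, `ranked` is an invented list marking which retrieved chunks are relevant (True) in rank order, with R = 4 relevant chunks in total:

```python
# Hypothetical ranked retrieval run: True marks a relevant chunk.
ranked = [True, False, True, False, False, True, False, True, False, False]
R = 4  # total ground-truth relevant contexts for this query

def recall_precision_at_k(ranked, R, k):
    """Recall and precision using only the top-k ranked results."""
    r = sum(ranked[:k])
    return r / R, r / k

for k in (3, 5, 10):
    rec, prec = recall_precision_at_k(ranked, R, k)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    print(f"k={k:2d}  recall={rec:.2f}  precision={prec:.2f}  f1={f1:.2f}")
```

On this run, raising k from 3 to 10 lifts recall from 0.50 to 1.00 while precision falls from 0.67 to 0.40, which is exactly the noise-versus-coverage tradeoff to watch while tuning k and reranking.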
6) How should I count “relevant contexts” in practice?
Define a relevance rubric: exact policy clause, supporting paragraph, or canonical chunk. Keep chunking consistent across runs, and label with two reviewers when possible.
7) What are common causes of low context recall?
Poor chunking, weak embeddings, domain mismatch, missing metadata filters, or overly strict reranking. Also check query rewriting and synonym handling for specialized terminology.