Calculator Inputs
Enter prompt quality scores, runtime metrics, and custom weights. The result appears above this form after submission.
Example Data Table
| Prompt Profile | Test Cases | Pass Rate | Latency (ms) | Avg Tokens | Benchmark Score | Grade |
|---|---|---|---|---|---|---|
| Retrieval Assistant | 48 | 92% | 540 | 410 | 89.32 | Strong |
| JSON Extractor | 36 | 87% | 690 | 330 | 83.71 | Strong |
| Support Rewriter | 24 | 79% | 460 | 280 | 78.44 | Good |
| Agent Planner | 18 | 68% | 1280 | 820 | 66.90 | Improving |
Formula Used
Weighted Quality = Σ(criterion score × criterion weight) ÷ Σ(weights)
Operational Index = (Pass Rate × 0.50) + (Latency Index × 0.25) + (Token Efficiency Index × 0.25), with Pass Rate expressed on a 0–100 scale to match the index bands below.
Confidence Index = min(100, 40 + √(Test Cases) × 6)
Final Score = (Weighted Quality × 0.60) + (Operational Index × 0.30) + (Confidence Index × 0.10)
Latency Index Bands: ≤250 ms = 100, ≤500 ms = 95, ≤1000 ms = 85, ≤1500 ms = 75, ≤2500 ms = 65, ≤4000 ms = 50, above that = 35.
Token Efficiency Bands: ≤150 tokens = 100, ≤300 = 95, ≤600 = 85, ≤900 = 75, ≤1200 = 65, ≤1800 = 50, above that = 35.
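The formulas above can be sketched in Python. This is a minimal sketch, not the calculator's implementation; the criterion names passed in are whatever quality dimensions you score, and all sample values are illustrative.

```python
import math

# Upper bound (inclusive) -> index value, per the band tables above.
LATENCY_BANDS = [(250, 100), (500, 95), (1000, 85), (1500, 75), (2500, 65), (4000, 50)]
TOKEN_BANDS = [(150, 100), (300, 95), (600, 85), (900, 75), (1200, 65), (1800, 50)]

def band_index(value, bands, floor=35):
    """Return the index for the first band whose upper bound covers value."""
    for upper, score in bands:
        if value <= upper:
            return score
    return floor  # above the last band

def weighted_quality(scores, weights):
    """Weighted Quality = Σ(criterion score × criterion weight) ÷ Σ(weights)."""
    return sum(scores[k] * weights[k] for k in scores) / sum(weights.values())

def benchmark_score(scores, weights, pass_rate, latency_ms, avg_tokens, test_cases):
    """Blend quality, operational behavior, and test confidence into one score."""
    wq = weighted_quality(scores, weights)
    oi = (pass_rate * 0.50
          + band_index(latency_ms, LATENCY_BANDS) * 0.25
          + band_index(avg_tokens, TOKEN_BANDS) * 0.25)
    ci = min(100, 40 + math.sqrt(test_cases) * 6)
    return wq * 0.60 + oi * 0.30 + ci * 0.10
```

For example, with the Retrieval Assistant row's operational numbers (92% pass rate, 540 ms, 410 tokens, 48 test cases), both index bands resolve to 85 and the operational index comes out to 88.5; the final score then depends on the weighted quality of the criterion scores you entered.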
How to Use This Calculator
- Score the prompt across clarity, context, specificity, grounding, safety, format compliance, robustness, and efficiency on a 0–100 scale.
- Enter operational results such as pass rate, average latency, average output tokens, and number of benchmark test cases.
- Adjust the weights to reflect your evaluation priorities. Higher weights give those dimensions more influence in the weighted quality index.
- Click Calculate Benchmark Score. The score summary and Plotly graph will appear above the form and below the header.
- Use the export buttons to download the result as CSV or PDF for reports, experiments, and stakeholder reviews.
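If you want to reproduce the CSV export offline, a sketch with Python's standard `csv` module might look like this (the column names are assumed from the example table above, and the row values are a hypothetical result):

```python
import csv

# Hypothetical result row; columns mirror the example data table.
row = {
    "Prompt Profile": "Retrieval Assistant",
    "Pass Rate": "92%",
    "Latency (ms)": 540,
    "Benchmark Score": 89.32,
    "Grade": "Strong",
}

with open("benchmark_result.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=row.keys())
    writer.writeheader()  # one header row, then one row per result
    writer.writerow(row)
```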
FAQs
1. What does the benchmark score measure?
It estimates overall prompt performance by blending quality metrics, operational behavior, and test confidence. The score helps compare prompts using one consistent framework instead of isolated observations.
2. Why are latency and token counts included?
A prompt can be accurate yet costly or slow. Including latency and token usage helps balance quality with real deployment efficiency, especially for production systems and frequent workloads.
3. How should I choose the weights?
Start with balanced weights, then raise the dimensions that matter most for your use case. For example, safety may dominate healthcare tasks, while format compliance may matter more for structured automation.
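As a sketch of how raising one weight shifts the weighted quality index (the criterion scores and weights below are made up for illustration):

```python
def weighted_quality(scores, weights):
    # Weighted Quality = Σ(criterion score × criterion weight) ÷ Σ(weights)
    return sum(scores[k] * weights[k] for k in scores) / sum(weights.values())

scores = {"clarity": 90, "safety": 60}
balanced = weighted_quality(scores, {"clarity": 1, "safety": 1})      # 75.0
safety_heavy = weighted_quality(scores, {"clarity": 1, "safety": 3})  # 67.5
```

Tripling the safety weight pulls the index toward the weaker safety score, which is exactly the behavior you want when safety failures are costlier than clarity lapses.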
4. What is considered a good score?
Scores above 85 are usually strong enough for production review. Scores from 70 to 84 often suit pilots. Lower scores usually indicate missing context, weak structure, or unstable behavior.
5. Can this calculator compare different prompts?
Yes. Run each prompt through the same benchmark conditions, keep the weights consistent, and compare scores, grades, latency, and failure counts side by side.
6. Why does the confidence index depend on test cases?
More test cases usually produce more stable conclusions. A larger evaluation set reduces noise and makes the final score more trustworthy for ranking prompt candidates.
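The confidence formula from the Formula Used section can be evaluated at a few test-set sizes to see the diminishing returns (a minimal sketch):

```python
import math

def confidence_index(test_cases):
    # Confidence Index = min(100, 40 + √(test cases) × 6)
    return min(100, 40 + math.sqrt(test_cases) * 6)

confidence_index(4)    # 52.0 — a tiny suite earns little confidence
confidence_index(25)   # 70.0
confidence_index(100)  # 100  — the cap is reached at 100 test cases
```

The square root means each additional test case helps less than the last, so going from 4 to 25 cases buys far more confidence than going from 100 to 121.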
7. Does a high pass rate guarantee prompt quality?
No. A prompt can pass narrow tests while still being verbose, brittle, unsafe, or poorly grounded. That is why the calculator combines several dimensions into one score.
8. When should I recalibrate the scoring model?
Revisit weights and thresholds when your task type changes, new risks appear, model behavior shifts, or business priorities move from experimentation toward production reliability.