Inputs
Enter prompt testing results and evaluation signals. Use advanced options to match your workflow and reporting needs.
Example data table
| Run | Variant | Total | Success | Partial | Retries | Quality | Compliance | Latency (s) | Cost ($) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 100 | 72 | 10 | 25 | 4.1 | 4.3 | 2.8 | 0.0125 |
| 2 | B | 120 | 78 | 18 | 34 | 4.4 | 4.1 | 3.2 | 0.0110 |
| 3 | C | 80 | 58 | 8 | 14 | 3.8 | 4.6 | 2.4 | 0.0138 |
Formula used
- Basic Success Rate = success / total
- Weighted Success Rate = (success + partialWeight × partial) / total
- Retry Factor = 1 / (1 + retries / total)
- Quality Factor = quality / 5
- Compliance Factor = compliance / 5
- Cost Factor = 1 / (1 + cost / targetCost)
- Latency Factor = 1 / (1 + latency / targetLatency)
- Efficiency Factor = (Cost Factor + Latency Factor) / 2
Composite score is a weighted blend of these factors, normalized and scaled to 0–100. Presets simply load different weight profiles.
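The formulas above can be sketched in Python. The factor definitions mirror the list; the default partial weight, the target values, the weight profile, and the final normalization are illustrative assumptions, since the calculator's exact blend is not specified here.

```python
# Sketch of the scoring formulas. Factor definitions follow the list above;
# the defaults and the "balanced" weight profile are assumptions.
def composite_score(total, success, partial, retries,
                    quality, compliance, latency, cost,
                    partial_weight=0.5,          # assumed default
                    target_latency=3.0,          # assumed target (s)
                    target_cost=0.01,            # assumed target ($)
                    weights=None):
    weighted_success = (success + partial_weight * partial) / total
    retry_factor = 1 / (1 + retries / total)
    quality_factor = quality / 5
    compliance_factor = compliance / 5
    cost_factor = 1 / (1 + cost / target_cost)
    latency_factor = 1 / (1 + latency / target_latency)
    efficiency_factor = (cost_factor + latency_factor) / 2

    # Hypothetical "balanced" preset; presets would swap in other values.
    weights = weights or {"success": 0.40, "quality": 0.20,
                          "compliance": 0.15, "retries": 0.10,
                          "efficiency": 0.15}
    factors = {"success": weighted_success, "quality": quality_factor,
               "compliance": compliance_factor, "retries": retry_factor,
               "efficiency": efficiency_factor}
    blend = sum(weights[k] * factors[k] for k in weights)
    return round(100 * blend / sum(weights.values()), 1)

# Run 1 from the example table:
print(composite_score(100, 72, 10, 25, 4.1, 4.3, 2.8, 0.0125))  # 75.3
```

With these assumed weights, Run 1 scores 75.3; a safety-first preset that weights compliance more heavily would rank the same data differently.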
How to use this calculator
- Run a consistent test set for your prompt or agent flow.
- Count successes and partial successes using defined criteria.
- Record retries, average quality, compliance, latency, and cost.
- Select a scoring preset or set custom weights.
- Submit to view the score, grade, and confidence interval.
- Export CSV/PDF to share results and track improvements.
Defining success in prompt evaluation
A prompt is “successful” when the output is usable without manual rescue. Track three outcomes: success, partial success, and failure. In the calculator, partial successes are discounted (by default a partial counts as half a success), so 10 partials contribute like 5 full successes. This prevents inflated scores when outputs need heavy edits. Also note the compliance rating: an accurate answer that violates constraints should not be treated as a win.
Building a repeatable test set
Use a fixed set of tasks that mirrors real usage: intents, formats, and edge cases. Include short queries, long queries, and adversarial phrasing. Record total runs, then log retries per run, average latency, and average cost. For example, 120 runs with 78 successes and 18 partials yields an adjusted success rate of 72.5%, before other factors.
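The worked figure can be checked directly with the Weighted Success Rate formula, assuming the default partial weight of 0.5:

```python
# Adjusted success rate for the worked example:
# 120 runs, 78 successes, 18 partials, partials weighted at 0.5.
total, success, partial = 120, 78, 18
adjusted = (success + 0.5 * partial) / total
print(f"{adjusted:.1%}")  # 72.5%
```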
Interpreting the composite score
The score blends five signals: adjusted success, quality, compliance, retries, and efficiency. Quality and compliance are 0–5 ratings averaged across the test set. Retries are penalized using a diminishing factor, so moving from 0.2 to 0.1 retries per run matters more than moving from 1.2 to 1.1. Efficiency combines cost and latency against your targets, so cheaper and faster flows earn higher points. Weights can be preset or customized to match your risk tolerance and SLAs.
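The diminishing retry penalty follows from the Retry Factor formula, applied here to retries-per-run values:

```python
# Retry factor 1 / (1 + retries_per_run) flattens as retries grow,
# so the same 0.1-retry improvement is worth more near zero.
def retry_factor(retries_per_run):
    return 1 / (1 + retries_per_run)

low_gain = retry_factor(0.1) - retry_factor(0.2)   # gain near zero retries
high_gain = retry_factor(1.1) - retry_factor(1.2)  # same gain, higher base
print(round(low_gain, 3), round(high_gain, 3))     # 0.076 0.022
```

The same 0.1 reduction yields roughly 3.5× more score improvement when retries are already low, which is why first-attempt reliability dominates.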
Improving reliability with iteration
Treat the score as a diagnostic, not a verdict. If retries are high, tighten constraints, add tool-use checks, or reduce ambiguity with structured fields. If compliance is low, add explicit policy and refusal rules and validate formatting. When quality is low but success is high, add few-shot examples and clearer acceptance criteria. Small changes can be measured quickly: running the same test set weekly makes a 3–5 point gain meaningful, especially when the confidence interval narrows with larger samples.
Reporting results to stakeholders
Share outputs with context. Export CSV to compare versions by date, model, or prompt template. Use the PDF summary for reviews, including the Wilson confidence interval: a 72.5% rate on 120 runs might span roughly 63%–80%. If intervals overlap between versions, you may need more samples before declaring a winner. Document target latency and cost so efficiency improvements are visible.
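The quoted span can be reproduced with a standard Wilson score interval; the 95% level (z = 1.96) is an assumption, since the calculator's exact confidence level is not stated here.

```python
import math

# Wilson score interval for a proportion; assumed 95% level (z = 1.96).
def wilson_interval(p, n, z=1.96):
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(0.725, 120)
print(f"{lo:.1%} - {hi:.1%}")  # 63.9% - 79.7%
```

Doubling the sample to 240 runs at the same rate narrows the span to roughly 66%–78%, which is what makes version comparisons decisive.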
FAQs
1) What does “partial success” mean here?
A partial success is an output that becomes usable only after noticeable edits, reformatting, or filling in missing steps. The calculator discounts partials so you can compare prompts without overstating day-to-day usability.
2) How should I score quality and compliance?
Use a simple rubric from 0 to 5. Quality reflects correctness and completeness. Compliance reflects adherence to instructions, safety rules, and output format. Average across your test set for consistent comparisons.
3) Why does the calculator penalize retries?
Retries signal instability and add hidden cost and latency. The retry factor reduces the score as retries rise, helping you favor prompts that succeed on the first attempt.
4) What is the confidence interval used for?
It shows the uncertainty around your success rate given the sample size. Wider intervals mean less certainty. Running more tests tightens the interval and supports stronger decisions between prompt versions.
5) How do presets differ from custom weights?
Presets load common weight mixes for balanced, safety-first, speed-first, or cost-first goals. Custom weights let you align the score with your product priorities, SLAs, and risk profile.
6) Can I compare different models using this tool?
Yes. Keep the same test set and criteria, then run separate evaluations per model. Export results to track changes over time, and avoid drawing conclusions when confidence intervals overlap heavily.