Inputs
Enter prompt testing results and evaluation signals. Use advanced options to match your workflow and reporting needs.
Example data table
| Run | Variant | Total | Success | Partial | Retries | Quality | Compliance | Latency (s) | Cost ($) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 100 | 72 | 10 | 25 | 4.1 | 4.3 | 2.8 | 0.0125 |
| 2 | B | 120 | 78 | 18 | 34 | 4.4 | 4.1 | 3.2 | 0.0110 |
| 3 | C | 80 | 58 | 8 | 14 | 3.8 | 4.6 | 2.4 | 0.0138 |
Formula used
- Basic Success Rate = success / total
- Weighted Success Rate = (success + partialWeight × partial) / total
- Retry Factor = 1 / (1 + retries / total)
- Quality Factor = quality / 5
- Compliance Factor = compliance / 5
- Cost Factor = 1 / (1 + cost / targetCost)
- Latency Factor = 1 / (1 + latency / targetLatency)
- Efficiency Factor = (Cost Factor + Latency Factor) / 2
Composite score is a weighted blend of these factors, normalized and scaled to 0–100. Presets simply load different weight profiles.
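The formulas above can be sketched in Python. The factor definitions mirror the list; the default partial weight, the target values, the weight profile, and the final normalization are illustrative assumptions, since the calculator's exact blend is not specified here.

```python
# Sketch of the scoring formulas. Factor definitions follow the list above;
# the defaults and the "balanced" weight profile are assumptions.
def composite_score(total, success, partial, retries,
                    quality, compliance, latency, cost,
                    partial_weight=0.5,          # assumed default
                    target_latency=3.0,          # assumed target (s)
                    target_cost=0.01,            # assumed target ($)
                    weights=None):
    weighted_success = (success + partial_weight * partial) / total
    retry_factor = 1 / (1 + retries / total)
    quality_factor = quality / 5
    compliance_factor = compliance / 5
    cost_factor = 1 / (1 + cost / target_cost)
    latency_factor = 1 / (1 + latency / target_latency)
    efficiency_factor = (cost_factor + latency_factor) / 2

    # Hypothetical "balanced" preset; presets would swap in other values.
    weights = weights or {"success": 0.40, "quality": 0.20,
                          "compliance": 0.15, "retries": 0.10,
                          "efficiency": 0.15}
    factors = {"success": weighted_success, "quality": quality_factor,
               "compliance": compliance_factor, "retries": retry_factor,
               "efficiency": efficiency_factor}
    blend = sum(weights[k] * factors[k] for k in weights)
    return round(100 * blend / sum(weights.values()), 1)

# Run 1 from the example table:
print(composite_score(100, 72, 10, 25, 4.1, 4.3, 2.8, 0.0125))  # 75.3
```

With these assumed weights, Run 1 scores 75.3; a safety-first preset that weights compliance more heavily would rank the same data differently.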
How to use this calculator
- Run a consistent test set for your prompt or agent flow.
- Count successes and partial successes using defined criteria.
- Record retries, average quality, compliance, latency, and cost.
- Select a scoring preset or set custom weights.
- Submit to view the score, grade, and confidence interval.
- Export CSV/PDF to share results and track improvements.
Defining success in prompt evaluation
A prompt is “successful” when the output is usable without manual rescue. Track three outcomes: success, partial success, and failure. In the calculator, partial successes are discounted (by default a partial counts as half a success), so 10 partials contribute like 5 full successes. This prevents inflated scores when outputs need heavy edits. Also note the compliance rating: an accurate answer that violates constraints should not be treated as a win.
Building a repeatable test set
Use a fixed set of tasks that mirrors real usage: intents, formats, and edge cases. Include short queries, long queries, and adversarial phrasing. Record total runs, then log retries per run, average latency, and average cost. For example, 120 runs with 78 successes and 18 partials yields an adjusted success rate of 72.5%, before other factors.
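The worked figure can be checked directly with the Weighted Success Rate formula, assuming the default partial weight of 0.5:

```python
# Adjusted success rate for the worked example:
# 120 runs, 78 successes, 18 partials, partials weighted at 0.5.
total, success, partial = 120, 78, 18
adjusted = (success + 0.5 * partial) / total
print(f"{adjusted:.1%}")  # 72.5%
```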
Interpreting the composite score
The score blends five signals: adjusted success, quality, compliance, retries, and efficiency. Quality and compliance are 0–5 ratings averaged across the test set. Retries are penalized using a diminishing factor, so moving from 0.2 to 0.1 retries per run matters more than moving from 1.2 to 1.1. Efficiency combines cost and latency against your targets, so cheaper and faster flows earn higher points. Weights can be preset or customized to match your risk tolerance and SLAs.
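The diminishing retry penalty follows from the Retry Factor formula, applied here to retries-per-run values:

```python
# Retry factor 1 / (1 + retries_per_run) flattens as retries grow,
# so the same 0.1-retry improvement is worth more near zero.
def retry_factor(retries_per_run):
    return 1 / (1 + retries_per_run)

low_gain = retry_factor(0.1) - retry_factor(0.2)   # gain near zero retries
high_gain = retry_factor(1.1) - retry_factor(1.2)  # same gain, higher base
print(round(low_gain, 3), round(high_gain, 3))     # 0.076 0.022
```

The same 0.1 reduction yields roughly 3.5× more score improvement when retries are already low, which is why first-attempt reliability dominates.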
Improving reliability with iteration
Treat the score as a diagnostic, not a verdict. If retries are high, tighten constraints, add tool-use checks, or reduce ambiguity with structured fields. If compliance is low, add explicit policy and refusal rules and validate formatting. When quality is low but success is high, add few-shot examples and clearer acceptance criteria. Small changes can be measured quickly: running the same test set weekly makes a 3–5 point gain meaningful, especially when the confidence interval narrows with larger samples.
Reporting results to stakeholders
Share outputs with context. Export CSV to compare versions by date, model, or prompt template. Use the PDF summary for reviews, including the Wilson confidence interval: a 72.5% rate on 120 runs might span roughly 63%–80%. If intervals overlap between versions, you may need more samples before declaring a winner. Document target latency and cost so efficiency improvements are visible.
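The quoted span can be reproduced with a standard Wilson score interval; the 95% level (z = 1.96) is an assumption, since the calculator's exact confidence level is not stated here.

```python
import math

# Wilson score interval for a proportion; assumed 95% level (z = 1.96).
def wilson_interval(p, n, z=1.96):
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(0.725, 120)
print(f"{lo:.1%} - {hi:.1%}")  # 63.9% - 79.7%
```

Doubling the sample to 240 runs at the same rate narrows the span to roughly 66%–78%, which is what makes version comparisons decisive.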
FAQs
1) What does “partial success” mean here?
A partial success is an output that becomes usable only after noticeable edits, reformatting, or filling in missing steps. The calculator discounts partials so you can compare prompts without overstating day-to-day usability.
2) How should I score quality and compliance?
Use a simple rubric from 0 to 5. Quality reflects correctness and completeness. Compliance reflects adherence to instructions, safety rules, and output format. Average across your test set for consistent comparisons.
3) Why does the calculator penalize retries?
Retries signal instability and add hidden cost and latency. The retry factor reduces the score as retries rise, helping you favor prompts that succeed on the first attempt.
4) What is the confidence interval used for?
It shows the uncertainty around your success rate given the sample size. Wider intervals mean less certainty. Running more tests tightens the interval and supports stronger decisions between prompt versions.
5) How do presets differ from custom weights?
Presets load common weight mixes for balanced, safety-first, speed-first, or cost-first goals. Custom weights let you align the score with your product priorities, SLAs, and risk profile.
6) Can I compare different models using this tool?
Yes. Keep the same test set and criteria, then run separate evaluations per model. Export results to track changes over time, and avoid drawing conclusions when confidence intervals overlap heavily.