Audit your prompts with confidence. Separate hard errors from soft quality breakdowns, set targets, export results, and reduce failures week over week.
| Model | Prompts | Hard | Soft | Retries | Recovered | Eff. Rate* |
|---|---|---|---|---|---|---|
| Model A | 300 | 6 | 10 | 40 | 28 | 1.67% |
| Model B | 260 | 7 | 8 | 30 | 18 | 1.54% |
| Model C | 210 | 5 | 7 | 22 | 15 | 1.67% |
| Model D | 180 | 3 | 6 | 18 | 12 | 1.67% |
| Model E | 50 | 1 | 4 | 10 | 7 | 4.00% |

\*Effective rate: weighted failure rate per prompt, with the soft weight and retry recovery applied (see below).
Track hard failures that stop completion, then log soft issues that finish but fail quality checks. For example, label timeouts, API errors, and blocked calls as hard. Treat refusals, invalid JSON, missing fields, or hallucinated citations as soft. Record the window length and keep the definitions constant so week-over-week comparisons remain meaningful. Keeping these categories separate improves root-cause analysis and makes trend charts stable across releases.
This calculator converts issues into a single effective rate using a soft weight between 0 and 1. If soft issues are expensive to detect and remediate, raise the weight toward 1. If post-processing fixes most soft issues, set it closer to 0.25–0.50. Normalize by attempts so teams can compare runs with different traffic volumes, and optionally include retries when you want an “all attempts” view of overall reliability.
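A minimal sketch of that weighting (the function and parameter names are illustrative, not the calculator's actual internals):

```python
def effective_rate(hard, soft, attempts, soft_weight=0.5):
    """Weighted failure rate: hard failures count fully,
    soft issues count at soft_weight (between 0 and 1)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return (hard + soft_weight * soft) / attempts

# 6 hard + 10 soft issues at weight 0.5, over 300 attempts
rate = effective_rate(6, 10, 300)
print(f"{rate:.2%}")
```

Raising `soft_weight` toward 1 makes soft issues count like hard failures; lowering it toward 0.25 reflects soft issues that post-processing usually repairs.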
Retries can hide user pain while increasing cost and latency. Enter retries attempted and successes after retry to estimate how much reliability depends on recovery. Subtracting recovered items from hard failures reflects a system that eventually completes, but you should still monitor retry volume. If you provide the window length, the calculator estimates attempts per day and expected weighted failures per day, which helps capacity planning and support forecasting.
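Under the same assumptions, the retry adjustment and per-day estimates might look like this (names are illustrative; the window scaling simply divides by the window length in days):

```python
def retry_adjusted(hard, recovered, soft, attempts,
                   soft_weight=0.5, window_days=None):
    """Subtract retry-recovered items from hard failures, weight
    soft issues, and optionally scale totals to a per-day view."""
    net_hard = max(hard - recovered, 0)  # recovery can't go below zero
    weighted = net_hard + soft_weight * soft
    result = {"rate": weighted / attempts}
    if window_days:
        result["attempts_per_day"] = attempts / window_days
        result["weighted_failures_per_day"] = weighted / window_days
    return result

stats = retry_adjusted(hard=7, recovered=3, soft=8,
                       attempts=260, window_days=7)
```

Keeping `recovered` as a separate input, rather than folding it into `hard` upstream, preserves the retry volume you still want to monitor for cost and latency.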
Operational teams commonly define a target effective rate and monitor a pass/fail gap. A practical banding is: under 1% excellent, 1–3% good, 3–5% watch, above 5% risk. Also track failures per 1,000 attempts for simple dashboards. Exported CSV and PDF snapshots support change reviews, incident retrospectives, and weekly quality reporting with consistent definitions and timestamps.
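The banding and the per-1,000 conversion above can be expressed directly; the thresholds below mirror the text, and the function names are illustrative:

```python
def band(effective_rate):
    """Map an effective failure rate (fraction) to the suggested band."""
    pct = effective_rate * 100
    if pct < 1:
        return "excellent"
    if pct <= 3:
        return "good"
    if pct <= 5:
        return "watch"
    return "risk"

def per_thousand(effective_rate):
    """Weighted failures per 1,000 attempts, for simple dashboards."""
    return effective_rate * 1000

print(band(0.025), per_thousand(0.025))  # good 25.0
```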
When sample sizes are small, point estimates can swing. The calculator uses a Wilson interval on hard failures to show uncertainty at 90%, 95%, or 99% confidence. A narrow interval suggests stable reliability; a wide interval suggests you need more volume, better test coverage, or controlled rollouts before changing model settings broadly. After each change, rerun the same evaluation set and compare effective rate, confidence bounds, and target gap.
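A sketch of the Wilson score interval on hard failures (the z-values are the standard two-sided critical values for 90%, 95%, and 99% confidence; this illustrates the method, not the calculator's exact code):

```python
import math

def wilson_interval(failures, attempts, z=1.96):
    """Wilson score interval for a binomial proportion.
    z = 1.645 (90%), 1.96 (95%), 2.576 (99%)."""
    if attempts == 0:
        return (0.0, 1.0)
    p = failures / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / attempts + z**2 / (4 * attempts**2))
    return (max(center - half, 0.0), min(center + half, 1.0))

# 95% bounds around a 2% observed hard-failure rate (6 of 300)
lo, hi = wilson_interval(6, 300)
```

At this sample size the interval spans roughly 1% to 4%, which is why a single week of low volume rarely justifies broad settings changes.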
Hard failures are outcomes where the system cannot complete the task, such as API errors, timeouts, crashes, blocked tool calls, or invalid authentication. They usually require retries or manual intervention.
Start at 0.50 when soft issues cause rework. Move toward 1.00 if unusable responses create tickets or refunds. Move toward 0.25 if downstream validation, rewriting, or human review reliably repairs most soft issues with low cost.
Include retries when you want reliability per attempt and to reflect extra cost and latency. Exclude retries when you want per-prompt experience based on first attempts only. Keep the setting consistent for trend comparisons.
Subtracting recovered items shows a system that eventually completes after retry, which matches some SLAs. Still track retry volume separately, because recovery can increase spend and delay responses even when completion succeeds.
It converts the effective failure rate into an intuitive scale for dashboards. For example, 2.5% equals 25 weighted failures per 1,000 attempts, which is easy to compare across teams, models, or product surfaces.
It is a range that likely contains the true hard failure rate, given your sample size and chosen confidence. Wide bounds mean uncertainty is high; collect more data or tighten testing before shipping aggressive changes.
Important Note: All calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of their results. Please consult other sources as well.