Audit your prompts with confidence. Separate hard errors from soft quality breakdowns, set targets, export results, and reduce failures week over week.
| Model | Prompts | Hard | Soft | Retries | Recovered | Eff. Rate* |
|---|---|---|---|---|---|---|
| Model A | 300 | 6 | 10 | 40 | 28 | 1.67% |
| Model B | 260 | 7 | 8 | 30 | 18 | 1.54% |
| Model C | 210 | 5 | 7 | 22 | 15 | 1.67% |
| Model D | 180 | 3 | 6 | 18 | 12 | 1.67% |
| Model E | 50 | 1 | 4 | 10 | 7 | 4.00% |

\*Effective rate: weighted failure rate per prompt, with the soft weight and retry recovery applied (see below).
Track hard failures that stop completion, then log soft issues that finish but fail quality checks. For example, label timeouts, API errors, and blocked calls as hard. Treat refusals, invalid JSON, missing fields, or hallucinated citations as soft. Record the window length and keep the definitions constant so week-over-week comparisons remain meaningful. Keeping these categories separate improves root-cause analysis and makes trend charts stable across releases.
This calculator converts issues into a single effective rate using a soft weight between 0 and 1. If soft issues are expensive to detect and remediate, raise the weight toward 1. If post-processing fixes most soft issues, set it closer to 0.25–0.50. Normalize by attempts so teams can compare runs with different traffic volumes, and optionally include retries when you want an “all attempts” view of overall reliability.
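A minimal sketch of that weighting (the function and parameter names are illustrative, not the calculator's actual internals):

```python
def effective_rate(hard, soft, attempts, soft_weight=0.5):
    """Weighted failure rate: hard failures count fully,
    soft issues count at soft_weight (between 0 and 1)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return (hard + soft_weight * soft) / attempts

# 6 hard + 10 soft issues at weight 0.5, over 300 attempts
rate = effective_rate(6, 10, 300)
print(f"{rate:.2%}")
```

Raising `soft_weight` toward 1 makes soft issues count like hard failures; lowering it toward 0.25 reflects soft issues that post-processing usually repairs.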
Retries can hide user pain while increasing cost and latency. Enter retries attempted and successes after retry to estimate how much reliability depends on recovery. Subtracting recovered items from hard failures reflects a system that eventually completes, but you should still monitor retry volume. If you provide the window length, the calculator estimates attempts per day and expected weighted failures per day, which helps capacity planning and support forecasting.
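Under the same assumptions, the retry adjustment and per-day estimates might look like this (names are illustrative; the window scaling simply divides by the window length in days):

```python
def retry_adjusted(hard, recovered, soft, attempts,
                   soft_weight=0.5, window_days=None):
    """Subtract retry-recovered items from hard failures, weight
    soft issues, and optionally scale totals to a per-day view."""
    net_hard = max(hard - recovered, 0)  # recovery can't go below zero
    weighted = net_hard + soft_weight * soft
    result = {"rate": weighted / attempts}
    if window_days:
        result["attempts_per_day"] = attempts / window_days
        result["weighted_failures_per_day"] = weighted / window_days
    return result

stats = retry_adjusted(hard=7, recovered=3, soft=8,
                       attempts=260, window_days=7)
```

Keeping `recovered` as a separate input, rather than folding it into `hard` upstream, preserves the retry volume you still want to monitor for cost and latency.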
Operational teams commonly define a target effective rate and monitor a pass/fail gap. A practical banding is: under 1% excellent, 1–3% good, 3–5% watch, above 5% risk. Also track failures per 1,000 attempts for simple dashboards. Exported CSV and PDF snapshots support change reviews, incident retrospectives, and weekly quality reporting with consistent definitions and timestamps.
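The banding and the per-1,000 conversion above can be expressed directly; the thresholds below mirror the text, and the function names are illustrative:

```python
def band(effective_rate):
    """Map an effective failure rate (fraction) to the suggested band."""
    pct = effective_rate * 100
    if pct < 1:
        return "excellent"
    if pct <= 3:
        return "good"
    if pct <= 5:
        return "watch"
    return "risk"

def per_thousand(effective_rate):
    """Weighted failures per 1,000 attempts, for simple dashboards."""
    return effective_rate * 1000

print(band(0.025), per_thousand(0.025))  # good 25.0
```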
When sample sizes are small, point estimates can swing. The calculator uses a Wilson interval on hard failures to show uncertainty at 90%, 95%, or 99% confidence. A narrow interval suggests stable reliability; a wide interval suggests you need more volume, better test coverage, or controlled rollouts before changing model settings broadly. After each change, rerun the same evaluation set and compare effective rate, confidence bounds, and target gap.
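A sketch of the Wilson score interval on hard failures (the z-values are the standard two-sided critical values for 90%, 95%, and 99% confidence; this illustrates the method, not the calculator's exact code):

```python
import math

def wilson_interval(failures, attempts, z=1.96):
    """Wilson score interval for a binomial proportion.
    z = 1.645 (90%), 1.96 (95%), 2.576 (99%)."""
    if attempts == 0:
        return (0.0, 1.0)
    p = failures / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / attempts + z**2 / (4 * attempts**2))
    return (max(center - half, 0.0), min(center + half, 1.0))

# 95% bounds around a 2% observed hard-failure rate (6 of 300)
lo, hi = wilson_interval(6, 300)
```

At this sample size the interval spans roughly 1% to 4%, which is why a single week of low volume rarely justifies broad settings changes.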
Hard failures are outcomes where the system cannot complete the task, such as API errors, timeouts, crashes, blocked tool calls, or invalid authentication. They usually require retries or manual intervention.
Start at 0.50 when soft issues cause rework. Move toward 1.00 if unusable responses create tickets or refunds. Move toward 0.25 if downstream validation, rewriting, or human review reliably repairs most soft issues with low cost.
Include retries when you want reliability per attempt and to reflect extra cost and latency. Exclude retries when you want per-prompt experience based on first attempts only. Keep the setting consistent for trend comparisons.
Subtracting recovered items shows a system that eventually completes after retry, which matches some SLAs. Still track retry volume separately, because recovery can increase spend and delay responses even when completion succeeds.
It converts the effective failure rate into an intuitive scale for dashboards. For example, 2.5% equals 25 weighted failures per 1,000 attempts, which is easy to compare across teams, models, or product surfaces.
It is a range that likely contains the true hard failure rate, given your sample size and chosen confidence. Wide bounds mean uncertainty is high; collect more data or tighten testing before shipping aggressive changes.
Important Note: All calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of their results. Please consult other sources as well.