Calculator Inputs
Rate each signal from 0 (none) to 5 (strong). Optionally adjust weights to match your policy, domain, and risk tolerance.
Example Data Table
These sample rows show how different prompt patterns can change the risk score. Values are illustrative and should be validated in your environment.
| Scenario | Score | Band | Reviewer Note |
|---|---|---|---|
| Hiring screen prompt | 2.5 | Moderate | Mentions age and gender without job relevance |
| Customer support prompt | 1.0 | Low | Neutral wording; minimal group cues |
| Credit offer prompt | 4.2 | High | Targets a demographic group; implies unequal treatment |
Formula Used
Each signal is scored from 0 to 5 and normalized to 0–1, the weights are normalized to sum to 1, and the weighted sum of normalized ratings is scaled to a 0–100 score.
The score estimates prompt-level bias risk, not model bias. Use it with evaluation outputs, policy review, and domain constraints.
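The combination described above can be sketched in a few lines; the signal names and weight values below are illustrative placeholders, not the calculator's actual defaults.

```python
# Weighted bias-risk score: ratings 0-5 are normalized to 0-1,
# weights are normalized to sum to 1, and the result is scaled to 0-100.
def bias_risk_score(ratings, weights):
    """ratings and weights are dicts keyed by the same signal names."""
    total_weight = sum(weights.values())
    normalized = {name: w / total_weight for name, w in weights.items()}
    return 100 * sum(normalized[name] * (ratings[name] / 5) for name in ratings)

# Three illustrative signals only; the real calculator uses eight.
ratings = {"protected_attributes": 4, "leading_language": 2, "toxicity": 1}
weights = {"protected_attributes": 3, "leading_language": 2, "toxicity": 1}
print(round(bias_risk_score(ratings, weights), 1))  # → 56.7
```

Because the weights are renormalized inside the function, reviewers can enter them on any convenient scale (1–5, percentages) without changing the result.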
How to Use This Calculator
- Paste or summarize the prompt context in your notes.
- Rate each signal from 0 to 5 with examples.
- Adjust weights to match your policy and domain.
- Submit to generate score, band, and recommendations.
- Export CSV or PDF for audits and peer review.
- Re-run after mitigations and compare saved results.
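The CSV export step above can be sketched with the standard library; the field names here are an assumed flat layout, not the calculator's actual schema.

```python
import csv

# Minimal audit-export sketch: one row per scored prompt.
# Field names are illustrative, not the calculator's real export schema.
def export_results(path, rows):
    fields = ["scenario", "score", "band", "reviewer_note"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

export_results("bias_review.csv", [
    {"scenario": "Hiring screen prompt", "score": 2.5,
     "band": "Moderate", "reviewer_note": "Mentions age and gender without job relevance"},
])
```

Keeping the export flat makes it easy to diff two review rounds in a spreadsheet or version-control system.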
Bias risk scoring in prompt review
Bias risk scoring complements model evaluation by turning prompt characteristics into measurable signals. This calculator captures eight drivers of disparate or harmful outputs and converts them into a comparable 0–100 score for reviews and audits. It fits early design checks, procurement reviews, and post-incident retrospectives. Teams can run it during prompt authoring to surface risk before any user interaction, reducing costly rework and limiting downstream exposure.
Signal selection and consistent scaling
Protected attribute mentions, demographic targeting, and exclusionary wording are treated as high-impact factors because they can steer content toward unequal treatment. Leading language and unverified claims increase the chance of confident but skewed answers. Each signal is rated 0–5 to reflect intensity and frequency, making scoring repeatable across reviewers and teams. Using half-step scoring supports nuance when prompts contain mixed intent or partial context, improving calibration in reviews.
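Half-step scoring can be enforced with a small helper; this is a sketch assuming reviewers may enter free-form numbers that need clamping and snapping.

```python
def snap_rating(value):
    """Clamp a reviewer rating to [0, 5] and snap it to the nearest half step."""
    clamped = max(0.0, min(5.0, value))
    return round(clamped * 2) / 2

print(snap_rating(3.3))   # → 3.5
print(snap_rating(6.1))   # → 5.0
```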
Weighting aligned to domain policy
Weights translate organizational policy into a numeric profile. Hiring, lending, and health domains may assign higher weight to demographic targeting and discriminatory intent, while customer support may prioritize toxicity and leading language to reduce harmful escalation. When policies change, updating weights preserves comparability without rebuilding the rubric. Weight sensitivity analysis is useful: adjust one weight at a time to see which controls most influence the final score, then document rationale.
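The one-at-a-time sensitivity check described above can be sketched as follows; the perturbation size, signal names, and weight values are illustrative assumptions.

```python
# Score helper (same combination rule as the calculator's formula).
def bias_risk_score(ratings, weights):
    total = sum(weights.values())
    return 100 * sum((w / total) * (ratings[name] / 5) for name, w in weights.items())

def weight_sensitivity(ratings, weights, delta=0.5):
    """Raise each weight by delta, one at a time, and report the score change."""
    base = bias_risk_score(ratings, weights)
    effects = {}
    for name in weights:
        perturbed = dict(weights)
        perturbed[name] += delta
        effects[name] = bias_risk_score(ratings, perturbed) - base
    return effects

ratings = {"demographic_targeting": 4, "leading_language": 2, "toxicity": 1}
weights = {"demographic_targeting": 3, "leading_language": 2, "toxicity": 1}
# Print signals from most to least influential on the final score.
for name, change in sorted(weight_sensitivity(ratings, weights).items(),
                           key=lambda kv: -abs(kv[1])):
    print(f"{name}: {change:+.2f}")
```

Note that because weights are renormalized, increasing the weight of a low-rated signal can lower the overall score; documenting these effects alongside the rationale makes weight changes auditable.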
Interpreting bands and confidence
Low and Moderate bands indicate prompts that are mostly neutral but still benefit from counterfactual testing. High and Critical bands suggest the prompt may encode differential treatment or stereotyping and should trigger deeper review. Confidence rises when evidence includes prompt variants, example outputs, and clearly logged assumptions. A low confidence flag is a cue to gather more evidence, not a reason to dismiss the measured risk.
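Mapping a 0–100 score to a band is a simple threshold lookup; the cut-offs below are assumed for illustration, so substitute your program's published thresholds.

```python
# Map a 0-100 score to a review band. The thresholds here are assumed
# for illustration, not the calculator's actual cut-offs.
def risk_band(score):
    if score < 25:
        return "Low"
    if score < 50:
        return "Moderate"
    if score < 75:
        return "High"
    return "Critical"

print(risk_band(18))   # → Low
print(risk_band(62))   # → High
```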
Mitigation workflow and traceability
Use recommendations to rewrite prompts with neutral constraints, evidence requests, and inclusive wording. Add explicit fairness instructions, avoid irrelevant demographic cues, and require uncertainty when data is missing. Re-run after edits and store exports with change logs. Comparing runs quantifies risk reduction and supports accountable governance reporting. For production systems, pair scoring with A/B tests, bias benchmarks, and incident metrics so the rubric stays aligned with real-world outcomes.
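Comparing a pre- and post-mitigation run can be sketched as a per-scenario delta report; the record layout and example scores are illustrative.

```python
def compare_runs(before, after):
    """Report the score change per scenario across two saved runs."""
    report = {}
    for scenario, old_score in before.items():
        new_score = after.get(scenario)
        if new_score is not None:
            report[scenario] = new_score - old_score
    return report

# Illustrative 0-100 scores from two hypothetical review rounds.
before = {"Credit offer prompt": 84.0, "Hiring screen prompt": 50.0}
after = {"Credit offer prompt": 41.0, "Hiring screen prompt": 22.0}
for scenario, delta in compare_runs(before, after).items():
    print(f"{scenario}: {delta:+.1f}")
```

Negative deltas quantify risk reduction per scenario, which gives governance reports a concrete before/after figure rather than a narrative claim.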
FAQs
1) What does the risk score represent?
It estimates prompt-level bias risk by combining weighted signal ratings into a 0–100 value. It does not measure model bias directly; use it alongside output testing and policy review.
2) How should I choose weights?
Start from your policy priorities and domain risk. Increase weights for signals with higher regulatory or reputational impact. Keep weights stable within a program, and record the rationale when changing them.
3) Why can two reviewers score differently?
Prompts are contextual. Differences usually come from missing context, unclear intent, or uneven evidence. Improve alignment by sharing examples, defining thresholds for 0–5 ratings, and documenting assumptions in notes.
4) What is evidence quality used for?
Evidence quality produces a confidence percentage. Higher confidence indicates the score is backed by variant testing, logged outputs, and clear reasoning. Low confidence suggests collecting more evidence before acting.
5) How can I reduce a High or Critical score?
Remove demographic targeting, replace stereotypes with neutral language, and add fairness constraints. Ask for sources, allow uncertainty, and test counterfactual variants. Re-score after edits and keep exports with change logs.
6) When should I export CSV or PDF?
Export after each review milestone, such as pre-release, policy sign-off, or mitigation completion. Attach exports to tickets or audit folders so reviewers can trace decisions and compare improvements across versions.