Calculator Inputs
Example Data Table
| Baseline prompt | Current prompt | Expected drift interpretation |
|---|---|---|
| Summarize the text in three bullets. | Summarize the text in three bullets and list risks. | Moderate drift due to added constraint and task. |
| Classify sentiment as positive, neutral, or negative. | Detect sentiment and explain the decision briefly. | Moderate drift; intent expands to explanations. |
| Extract name, email, and phone from the message. | Extract entities and redact sensitive values. | High drift; new policy behavior changes outputs. |
Formula Used
Drift = 0.30·(1 − Jaccard) + 0.40·(1 − Cosine) + 0.30·KL′
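In code, the blend is a three-term weighted sum, assuming the Jaccard and Cosine inputs are similarities in [0, 1] and KL′ is already normalized to [0, 1]. The function name here is illustrative, not the tool's API:

```python
def drift_score(jaccard: float, cosine: float, kl_norm: float) -> float:
    """Blend of lexical, semantic, and distributional drift components, each in [0, 1]."""
    return 0.30 * (1 - jaccard) + 0.40 * (1 - cosine) + 0.30 * kl_norm

# Identical prompts: full similarity, zero divergence, zero drift.
print(drift_score(1.0, 1.0, 0.0))  # 0.0
```

Because the weights sum to 1 and each component lies in [0, 1], the composite score is also bounded to [0, 1], which is what makes fixed thresholds like 0.35 and 0.55 meaningful.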
How to Use This Calculator
- Paste your stable baseline prompt on the left.
- Paste the updated or observed prompt on the right.
- Adjust thresholds to match your release risk tolerance.
- Optionally paste multiple rows into Batch CSV for monitoring.
- Click Detect drift to generate metrics and status.
- Download CSV for audits, or PDF for sharing reviews.
Operational Drift Baselines
Teams commonly treat a drift score under 0.35 as a safe band for routine edits, while 0.35–0.55 signals review. Above 0.55 often correlates with changed task scope, policy language, or new output formats. In audits, a 0.10–0.20 jump is frequently tied to added constraints like “always cite” or “never reveal.” Log scores with version IDs and deploy dates for repeatable comparisons. For production, review any change that pushes drift above 0.55 and re-run evaluations on edge cases and safety prompts before shipping.
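The bands above can be sketched as a small gating helper. The default thresholds match the values in this section; the function name and status strings are placeholders, not the tool's output:

```python
def drift_status(score: float, moderate: float = 0.35, high: float = 0.55) -> str:
    """Map a drift score to a release-review band; thresholds are tunable."""
    if score >= high:
        return "high: re-run evaluations on edge cases and safety prompts"
    if score >= moderate:
        return "moderate: review before release"
    return "safe: routine edit"

print(drift_status(0.28))  # safe: routine edit
print(drift_status(0.61))  # high: re-run evaluations on edge cases and safety prompts
```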
Metric Coverage Across Edits
Jaccard highlights new or removed tokens, reacting strongly to added requirements like “return JSON.” Cosine similarity stabilizes longer prompts by weighting frequent terms, and it often stays above 0.80 when intent is unchanged. When a prompt gains a new section, Jaccard can drop while Cosine stays steady, indicating structural growth more than intent replacement.
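Both metrics can be sketched over simple word tokens. The regex tokenizer here is an assumption, not the tool's actual tokenizer, and the example reuses the first row of the data table above:

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def jaccard(a: str, b: str) -> float:
    """Set overlap of unique tokens: reacts strongly to added or removed terms."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def cosine(a: str, b: str) -> float:
    """Cosine over token-count vectors: weights frequent terms, more stable on long prompts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

base = "Summarize the text in three bullets."
new = "Summarize the text in three bullets and list risks."
print(round(jaccard(base, new), 2), round(cosine(base, new), 2))  # 0.67 0.82
```

Note the split the section describes: the three added tokens pull Jaccard down to 0.67, while Cosine stays above 0.80 because the original intent is fully preserved.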
Distribution Shift With KL
KL divergence captures changes in the token distribution between baseline and current prompts. Add-α smoothing prevents spikes when new tokens appear, which matters most for short prompts. An α between 0.3 and 1.0 usually balances sensitivity and stability. Values above 0.50 often appear after adding domain vocabulary or compliance wording. The tool normalizes KL to KL′ so dashboards remain comparable.
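A sketch of add-α smoothed KL over token frequencies follows. The exact mapping the tool uses to produce KL′ is not specified here, so KL/(1 + KL), one common way to squash an unbounded divergence into [0, 1), is shown as an assumption:

```python
import math
import re
from collections import Counter

def smoothed_kl(base: str, current: str, alpha: float = 0.5) -> float:
    """KL(P_base || P_current) over token frequencies with add-alpha smoothing."""
    ta = re.findall(r"\w+", base.lower())
    tb = re.findall(r"\w+", current.lower())
    vocab = set(ta) | set(tb)
    ca, cb = Counter(ta), Counter(tb)
    # Smoothing gives every vocabulary token nonzero mass, so the log ratio stays finite.
    pa = {t: (ca[t] + alpha) / (len(ta) + alpha * len(vocab)) for t in vocab}
    pb = {t: (cb[t] + alpha) / (len(tb) + alpha * len(vocab)) for t in vocab}
    return sum(pa[t] * math.log(pa[t] / pb[t]) for t in vocab)

def kl_normalized(kl: float) -> float:
    """One common squashing of KL into [0, 1); the tool's exact mapping may differ."""
    return kl / (1.0 + kl)

print(kl_normalized(smoothed_kl("extract name and email", "extract name and email")))  # 0.0
```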
Composite Score Interpretation
The calculator blends (1−Jaccard), (1−Cosine), and normalized KL′. The 0.40 Cosine weight prioritizes semantic continuity, while overlap and divergence terms detect structure changes and instruction creep. High drift with strong Cosine often points to formatting or policy additions. Low Cosine typically means the task definition changed.
Batch Monitoring For Releases
Batch mode supports weekly audits and release checklists. Track mean drift and the maximum row drift. If mean drift rises by 0.10 or the maximum crosses your high threshold, schedule regression tests. A 20-row batch with mean 0.28 but max 0.62 suggests one risky variant, not a systemic change. Plotly trend lines help spot gradual drift.
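The mean/max check can be sketched as a small summary function. The field names and the optional previous-mean comparison are assumptions for illustration:

```python
def batch_summary(scores, high=0.55, mean_jump=0.10, prev_mean=None):
    """Summarize a batch of row-level drift scores and flag regression-test triggers."""
    mean = sum(scores) / len(scores)
    worst = max(scores)
    # Trigger on either rule from the text: a max above the high threshold,
    # or a mean that jumped by 0.10 versus the previous batch.
    flag = worst >= high or (prev_mean is not None and mean - prev_mean >= mean_jump)
    return {"mean": mean, "max": worst, "needs_regression_tests": flag}

# One risky variant in an otherwise routine batch, as in the 20-row example.
result = batch_summary([0.25, 0.30, 0.62, 0.21])
print(result["needs_regression_tests"])  # True
```

Here the mean stays below the moderate band, but the single 0.62 row crosses the high threshold, so the batch is flagged for targeted regression tests rather than treated as systemic drift.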
Governance And Audit Exports
CSV export supports approvals and reproducibility, while PDF summaries speed reviews. Record baseline, current prompt, thresholds, and drift values next to ticket numbers. Over time, logs enable “drift budgets” by feature and make incident investigations faster when outputs shift unexpectedly.
FAQs
What does “prompt drift” mean here?
It is a measurable change between a baseline prompt and a newer prompt. The tool flags lexical overlap loss, vector similarity loss, and distribution divergence as a single score.
Why use multiple metrics instead of one?
One metric can miss changes. Jaccard catches token edits, Cosine captures broader similarity, and KL detects distribution shifts. The blend reduces false confidence from any single view.
How should I choose thresholds?
Start with 0.35 for moderate and 0.55 for high drift. Tune using historical releases: set thresholds where regressions or output format changes first appear in your tests.
Does stopword removal help?
Often, yes, for short prompts. Removing common words reduces noise so that added constraints and domain terms dominate the comparison. For long policy prompts, keep stopwords to preserve structure.
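Stopword filtering can be sketched as a pre-processing step before the set comparison. The stopword list here is a small illustrative sample, not the tool's actual list:

```python
import re

# Illustrative sample only; real stopword lists are much longer.
STOPWORDS = {"the", "in", "a", "an", "and", "of", "to", "as", "or"}

def content_tokens(text: str, drop_stopwords: bool = True) -> set[str]:
    """Unique lowercase tokens, optionally with common function words removed."""
    toks = set(re.findall(r"\w+", text.lower()))
    return toks - STOPWORDS if drop_stopwords else toks

print(sorted(content_tokens("Summarize the text in three bullets.")))
# ['bullets', 'summarize', 'text', 'three']
```

With "the" and "in" removed, the remaining content words carry the comparison, which is why filtering sharpens Jaccard on short prompts.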
Can I use this for multilingual prompts?
Yes, for basic token comparisons. The tokenizer is Unicode-aware, but it does not apply language-specific stemming. For best results, compare prompts written in the same language and style.
Is this a replacement for evaluation?
No. Drift scores guide review and gating. Always validate with test suites, safety checks, and representative user examples, especially when thresholds indicate moderate or high drift.