Estimator inputs
Example data table
| Scenario | Requests | Prompt tokens | Completion tokens | Reuse rate | Estimated savings |
|---|---|---|---|---|---|
| Support bot, stable policy header | 50,000 | 1,600 | 420 | 60% | Lower prompt spend, same output spend |
| RAG Q&A, shared retrieval template | 20,000 | 1,200 | 280 | 45% | Moderate savings, depends on cache hit rate |
| Agent workflow, reusable tool instructions | 10,000 | 2,300 | 650 | 35% | Savings grow with repeated runs |
Formula used
The estimator assumes token reuse applies to prompt tokens only, while completion tokens remain unchanged. A safety margin is applied to both prompt and completion.
prompt_eff = prompt_tokens × (1 + overhead%)
completion_eff = completion_tokens × (1 + overhead%)
baseline_cost = (N × prompt_eff / 1e6) × input_price + (N × completion_eff / 1e6) × output_price
cached_prompt = (N × prompt_eff) × reuse_rate%
fresh_prompt = (N × prompt_eff) × (1 − reuse_rate%)
output_cost = (N × completion_eff / 1e6) × output_price
reuse_cost = (fresh_prompt / 1e6) × input_price + (cached_prompt / 1e6) × cached_input_price + output_cost
savings = baseline_cost − reuse_cost
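The formula above can be sketched as a small Python function. This is an illustrative implementation, not the calculator's actual code; the function and parameter names are ours, and the output price of 10.00 in the example is an assumed placeholder (it cancels out of the savings figure):

```python
def estimate_savings(n_requests, prompt_tokens, completion_tokens,
                     reuse_rate, input_price, cached_input_price,
                     output_price, overhead=0.0):
    """Estimate savings from prompt-token reuse.

    Prices are per million tokens; reuse_rate and overhead are fractions.
    Reuse applies to prompt tokens only; completions are billed normally.
    """
    prompt_eff = prompt_tokens * (1 + overhead)
    completion_eff = completion_tokens * (1 + overhead)

    total_prompt = n_requests * prompt_eff
    output_cost = (n_requests * completion_eff / 1e6) * output_price

    baseline_cost = (total_prompt / 1e6) * input_price + output_cost

    fresh_prompt = total_prompt * (1 - reuse_rate)
    cached_prompt = total_prompt * reuse_rate
    reuse_cost = ((fresh_prompt / 1e6) * input_price
                  + (cached_prompt / 1e6) * cached_input_price
                  + output_cost)

    return baseline_cost - reuse_cost


# 10,000 requests, 1,200-token prompts, 70% reuse, input 3.00 / cached 1.50
print(round(estimate_savings(10_000, 1_200, 280, 0.70, 3.00, 1.50, 10.00), 2))
# → 12.6
```

Because output_cost appears in both baseline_cost and reuse_cost, savings depend only on the prompt side, which matches the assumption stated above.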
How to use this calculator
- Estimate average prompt and completion tokens per request from logs.
- Choose a realistic reuse rate based on stable prompt segments.
- Enter your pricing for input, cached input, and output tokens.
- Add an overhead margin if usage fluctuates across requests.
- Submit to see savings above the form, then export results.
Token reuse as a measurable spend lever
Token reuse reduces billed prompt tokens when repeated instructions, policies, or tool schemas stay identical. In a 10,000‑request month, a 1,200‑token prompt produces about 12.0 million input tokens before overhead. If 55% of that prompt is cached, the estimator shifts roughly 6.6 million tokens to the cached tier. At 70% reuse, 12.0 million prompt tokens become 8.4 million cached and 3.6 million fresh. With input at 3.00 and cached input at 1.50 per million tokens, each cached million saves 1.50, so the 70% case cuts prompt spend by about 12.6 currency units. This matters most when prompts exceed outputs, such as long policy headers or tool manifests in high-volume customer support systems.
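The token split in that paragraph is quick to reproduce. A minimal sketch using the same figures (10,000 requests, 1,200-token prompts, input 3.00 / cached 1.50 per million):

```python
requests, prompt_tokens = 10_000, 1_200
input_price, cached_price = 3.00, 1.50   # per million tokens

total = requests * prompt_tokens          # 12,000,000 prompt tokens/month
for reuse in (0.55, 0.70):
    cached = total * reuse
    fresh = total - cached
    saved = (cached / 1e6) * (input_price - cached_price)
    print(f"{reuse:.0%}: {cached/1e6:.1f}M cached, "
          f"{fresh/1e6:.1f}M fresh, saves {saved:.2f}")
# 55%: 6.6M cached, 5.4M fresh, saves 9.90
# 70%: 8.4M cached, 3.6M fresh, saves 12.60
```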
Deriving reuse rate from real traffic
Start with logs and isolate stable segments: system rules, formatting templates, and shared retrieval headers. A practical method is sampling 200–500 requests, computing the repeated portion, then averaging the share. When prompts vary by user text, reuse rate often stays below 30%; with standardized workflows, 50–80% is common.
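One rough way to run that sampling is below. This is a toy sketch under simplifying assumptions: logged prompts are plain strings, the stable segment is a known fixed prefix, and character share stands in for token share; the function name and toy data are invented for illustration:

```python
import random

def estimate_reuse_rate(prompts, stable_header, sample_size=300):
    """Approximate reuse rate as the stable header's share of each prompt,
    averaged over a random sample of logged requests."""
    sample = random.sample(prompts, min(sample_size, len(prompts)))
    shares = [len(stable_header) / len(p)
              for p in sample if p.startswith(stable_header)]
    return sum(shares) / len(shares) if shares else 0.0

# Toy logs: a fixed 600-character policy header plus short variable user text
header = "SYSTEM POLICY: " + "x" * 585
logs = [header + f"\nUser: question {i}" for i in range(1000)]
print(round(estimate_reuse_rate(logs, header), 2))
# → 0.97
```

With long user messages or heavy retrieval inserts, the same measurement drops quickly, which matches the sub-30% figure above for user-dominated traffic.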
Blended prompt pricing and sensitivity checks
The calculator reports a blended prompt price per million tokens after reuse. For example, standard input at 3.00 and cached input at 1.50 blends to 2.25 per million at 50% reuse, the midpoint of the two tiers. Try a sensitivity sweep: reuse at 30%, 50%, and 70% while keeping outputs fixed. You will see savings scale linearly with cached prompt tokens, not with completion tokens.
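The sweep takes a few lines; the prices are the example figures from this section:

```python
input_price, cached_price = 3.00, 1.50   # per million prompt tokens

for reuse in (0.30, 0.50, 0.70):
    # Blended price: weighted average of fresh and cached tiers
    blended = (1 - reuse) * input_price + reuse * cached_price
    print(f"reuse {reuse:.0%}: blended prompt price {blended:.2f}/M")
# reuse 30%: 2.55   reuse 50%: 2.25   reuse 70%: 1.95
```

Each 10-point increase in reuse lowers the blended price by the same 0.15, which is the linear behavior described above.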
Overhead margin for variance and burstiness
Overhead accounts for longer prompts during edge cases, retries, or expanded tool arguments. A 5% margin turns a 1,200‑token prompt into 1,260 effective tokens and keeps forecasts from under‑budgeting. For noisy agent workloads, 10–15% is safer, especially when multi‑step reasoning or extra retrieval chunks appear.
Actions that increase reuse without harming quality
Move stable rules into a single header, keep tool instructions constant, and version templates deliberately. Use short, consistent system messages and avoid injecting dynamic timestamps into cached segments. If retrieval is needed, cache the query plan and keep only the document excerpts variable. These changes can raise reuse by 10–25 points and improve cost predictability.
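A sketch of that prompt layout follows. The header text, tool names, and helper are all hypothetical; the point is the structure: one versioned, byte-identical header first, all variable content after it:

```python
# Hypothetical stable header: versioned, no timestamps or per-user data
STABLE_HEADER = (
    "You are a support assistant. Follow policy template v3.\n"
    "Tools: search_kb(query), create_ticket(summary)\n"
    "Answer in the standard response format.\n"
)

def build_prompt(user_text, excerpts):
    """Cacheable header first; retrieval excerpts and user text last."""
    variable = "\n".join(excerpts) + "\n\nUser: " + user_text
    return STABLE_HEADER + variable

p = build_prompt("How do I reset my password?", ["[doc 12] Reset steps"])
assert p.startswith(STABLE_HEADER)  # identical prefix across requests
```

Only the prefix needs to stay identical; edits to the excerpts or user text do not invalidate the cached segment.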
FAQs
1) What does “prompt reuse rate” represent?
It is the percentage of prompt tokens served from a cache because the prompt segment is identical. The estimator applies reuse only to prompt tokens, not completions.
2) Why are completion tokens not discounted here?
Most caching approaches target repeated input. Outputs depend on user intent and model variation, so they are billed normally. This keeps forecasts conservative and comparable.
3) How do I estimate prompt and completion tokens?
Use usage fields from your provider logs or SDK responses. Average over a representative sample, then add a small overhead margin to cover spikes, retries, and longer tool arguments.
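A minimal averaging sketch, assuming usage records shaped like the `prompt_tokens` / `completion_tokens` fields most provider SDKs report (the function name and sample data are illustrative):

```python
def average_usage(records, overhead=0.05):
    """Average logged token counts and inflate by an overhead margin."""
    n = len(records)
    avg_prompt = sum(r["prompt_tokens"] for r in records) / n
    avg_completion = sum(r["completion_tokens"] for r in records) / n
    return avg_prompt * (1 + overhead), avg_completion * (1 + overhead)

sample = [{"prompt_tokens": 1_150, "completion_tokens": 260},
          {"prompt_tokens": 1_250, "completion_tokens": 300}]
print(average_usage(sample))  # averages 1,200 and 280, plus a 5% margin
```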
4) What if cached input costs the same as standard input?
Set cached input price equal to input price. Savings will approach zero, and the calculator becomes a token budget planner that still helps size volumes and per‑request costs.
5) Can reuse exceed 80% in production?
Yes, when system rules, templates, and tool schemas are stable and user text is short. Heavy retrieval inserts, large user messages, or frequent prompt edits typically reduce reuse.
6) What is the fastest way to raise reuse safely?
Standardize instruction blocks and keep them unchanged across requests. Avoid dynamic content inside reusable headers, and version templates so only intentional changes invalidate cached segments.