Token Cost Breakdown Calculator

Formula Used

1) Split prompt tokens into cached and fresh:

cached_prompt = prompt_tokens × cache_hit%
fresh_prompt = prompt_tokens − cached_prompt
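
Step 1 can be sketched in Python as below. This is a minimal illustration, not the calculator's actual code; the function name split_prompt_tokens and the choice to pass the cache hit rate as a 0-100 percentage are assumptions made for the example.

```python
def split_prompt_tokens(prompt_tokens, cache_hit_pct):
    """Return (cached_prompt, fresh_prompt) for one request.

    cache_hit_pct is assumed to be a percentage between 0 and 100.
    """
    if not 0 <= cache_hit_pct <= 100:
        raise ValueError("cache_hit_pct must be between 0 and 100")
    cached_prompt = prompt_tokens * cache_hit_pct / 100
    fresh_prompt = prompt_tokens - cached_prompt
    return cached_prompt, fresh_prompt


# Token count and cache hit rate taken from the "Support chatbot batch" row.
cached, fresh = split_prompt_tokens(1800, 30)
print(cached, fresh)
```

With 1,800 prompt tokens and a 30% cache hit rate, 540 tokens are treated as cached and 1,260 as fresh.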

2) Compute base costs per request (rates are per 1,000,000 tokens):

fresh_prompt_cost = (fresh_prompt / 1,000,000) × prompt_rate
cached_prompt_cost = (cached_prompt / 1,000,000) × prompt_rate × (1 − cache_discount%)
output_cost = (output_tokens / 1,000,000) × output_rate
base_cost = fresh_prompt_cost + cached_prompt_cost + output_cost
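
The base cost in step 2 can be sketched the same way. The function name base_cost_per_request is hypothetical, and the 50% cache discount in the usage line is an assumed value, since the example table does not list one.

```python
def base_cost_per_request(prompt_tokens, output_tokens, prompt_rate, output_rate,
                          cache_hit_pct=0.0, cache_discount_pct=0.0):
    """Base cost of one request; rates are per 1,000,000 tokens."""
    # Step 1: split the prompt into cached and fresh tokens.
    cached_prompt = prompt_tokens * cache_hit_pct / 100
    fresh_prompt = prompt_tokens - cached_prompt

    # Step 2: price each part; cached tokens get the cache discount.
    fresh_prompt_cost = fresh_prompt / 1_000_000 * prompt_rate
    cached_prompt_cost = (cached_prompt / 1_000_000 * prompt_rate
                          * (1 - cache_discount_pct / 100))
    output_cost = output_tokens / 1_000_000 * output_rate
    return fresh_prompt_cost + cached_prompt_cost + output_cost


# "Support chatbot batch" row, with an ASSUMED 50% cache discount.
base = base_cost_per_request(1800, 650, 3.00, 15.00,
                             cache_hit_pct=30, cache_discount_pct=50)
print(f"{base:.5f}")
```

Under that assumption the three parts are 0.00378 (fresh prompt), 0.00081 (cached prompt), and 0.00975 (output), for a base cost of 0.01434 per request.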

3) Add retry usage, apply batch savings, then overhead:

retry_cost = base_cost × retry%
subtotal = base_cost + retry_cost
batch_savings = subtotal × batch_discount%
after_batch = subtotal − batch_savings
overhead_cost = after_batch × overhead%
total_per_request = after_batch + overhead_cost
total_cost = total_per_request × requests
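
Taken together, steps 1 to 3 can be sketched as a single function. This is an illustrative reading of the formula above, not the calculator's source; the function name, the dictionary keys, and the percentages used in the usage lines (50% cache discount, 5% retry, 10% batch discount, 8% overhead) are assumptions chosen for the example.

```python
def total_cost(prompt_tokens, output_tokens, requests, prompt_rate, output_rate,
               cache_hit_pct=0.0, cache_discount_pct=0.0, retry_pct=0.0,
               batch_discount_pct=0.0, overhead_pct=0.0):
    """Apply steps 1-3; rates are per 1,000,000 tokens, percentages are 0-100."""
    # Step 1: cached vs fresh prompt tokens.
    cached_prompt = prompt_tokens * cache_hit_pct / 100
    fresh_prompt = prompt_tokens - cached_prompt

    # Step 2: base cost per request.
    fresh_prompt_cost = fresh_prompt / 1_000_000 * prompt_rate
    cached_prompt_cost = (cached_prompt / 1_000_000 * prompt_rate
                          * (1 - cache_discount_pct / 100))
    output_cost = output_tokens / 1_000_000 * output_rate
    base_cost = fresh_prompt_cost + cached_prompt_cost + output_cost

    # Step 3: retries first, then batch savings, then overhead.
    retry_cost = base_cost * retry_pct / 100
    subtotal = base_cost + retry_cost
    batch_savings = subtotal * batch_discount_pct / 100
    after_batch = subtotal - batch_savings
    overhead_cost = after_batch * overhead_pct / 100
    total_per_request = after_batch + overhead_cost

    return {
        "base_cost": base_cost,
        "total_per_request": total_per_request,
        "total_cost": total_per_request * requests,
    }


# "Support chatbot batch" row plus ASSUMED discount, retry, batch, and overhead values.
result = total_cost(1800, 650, 10_000, 3.00, 15.00,
                    cache_hit_pct=30, cache_discount_pct=50,
                    retry_pct=5, batch_discount_pct=10, overhead_pct=8)
print(f"{result['total_cost']:.2f}")
```

The order matters: overhead is applied to the amount after batch savings, so changing the batch discount also changes the overhead charge.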

How to Use This Calculator

Enter prompt and output tokens per request from your logs.
Set pricing rates for input and output tokens.
Add cache hit rate if you reuse templates or contexts.
Use retry percentage to cover failures and tool calls.
Apply batch discount if your provider offers reductions.
Add operational overhead to include platform-level costs.
Click Calculate to view the breakdown above the form.
Use the CSV/PDF buttons to export the latest result.

Example Data Table

Scenario	Prompt Tokens	Output Tokens	Requests	Prompt Rate / 1M	Output Rate / 1M	Cache Hit	Total Cost (est.)
Support chatbot batch	1,800	650	10,000	$3.00	$15.00	30%	$??.??
RAG search answers	3,200	900	2,500	$2.50	$10.00	15%	$??.??
Evaluation runs	900	150	50,000	$1.00	$4.00	60%	$??.??

Tip: Run your own inputs to replace the placeholder totals.

Why a token-level cost ledger matters

Token costs become predictable when you track them like any other compute bill. This calculator converts per‑request token usage into a ledger you can audit by prompt, cached context, and output. Teams can compare experiments, production traffic, and evaluation runs using the same unit economics. The per‑request view highlights where a small template change increases total spend at scale, even when quality looks unchanged, and it gives you a baseline when request volume jumps unexpectedly overnight.

Separating fresh and cached prompt spend

Prompt tokens are not all equal when caching is available. Reused system prompts, tool schemas, and stable reference context can be discounted or billed differently. By splitting prompt tokens into cached and fresh portions, you estimate how much spend is tied to dynamic user content versus reusable scaffolding. That clarity supports decisions like trimming long instructions, moving static text into cacheable blocks, and reusing retrieval results across turns, which can also cut latency.

Output volatility and guardrails

Output tokens often drive variance because completions expand with reasoning depth, tool logs, citations, and structured formats. The breakdown isolates output cost so you can set caps, apply truncation rules, or switch response templates. Monitoring output per request alongside cost per request helps you spot regressions after prompt edits or model upgrades. Add a retry percentage to cover re-asks, tool failures, and timeout recoveries without masking the underlying unit economics.

Batching, retries, and overhead

Batch discounts can materially lower subtotal cost when workloads are flexible and latency is less critical. Apply a batch reduction to the token subtotal after retries, then add overhead for routing, monitoring, storage, and observability. Overhead is not a token charge, but it is real money paid to run production systems. Keeping it explicit avoids underpricing internal services and improves forecasting when traffic spikes or when you launch new features globally.

Budgeting signals you can reuse

Once you know total cost, derive actionable ratios: cost per request, cost per 1,000 users, and cost per successful task. Compare scenarios by adjusting rates and token counts to model tradeoffs between quality and budget. The CSV and PDF exports support reviews for finance, engineering, and vendors. Use the example table as a starting point, then replace the placeholders with figures from your production logs to validate estimates before committing to long contracts.
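
The three ratios can be derived from a total as in the sketch below. The function name budgeting_ratios and the inputs users and successful_tasks are hypothetical; the calculator itself does not collect them, so they would come from your own analytics. The figures in the usage lines are made up for illustration.

```python
def budgeting_ratios(total_cost, requests, users, successful_tasks):
    """Derive reusable unit-cost ratios from a total cost figure."""
    if min(requests, users, successful_tasks) <= 0:
        raise ValueError("requests, users, and successful_tasks must be positive")
    return {
        "cost_per_request": total_cost / requests,
        "cost_per_1000_users": total_cost / users * 1000,
        "cost_per_successful_task": total_cost / successful_tasks,
    }


# Illustrative numbers only: $150 total, 10,000 requests, 2,500 users,
# and 9,375 tasks that completed successfully.
ratios = budgeting_ratios(total_cost=150.0, requests=10_000,
                          users=2_500, successful_tasks=9_375)
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
```

Cost per successful task is higher than cost per request whenever some requests fail, which is why it is worth tracking separately.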

FAQs

What token rates should I enter?

Enter your provider’s input and output prices expressed per 1,000,000 tokens. If you have per‑1K pricing, multiply by 1,000. Keep currency consistent across fields so totals and exports remain comparable.
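
The per‑1K conversion is a single multiplication, sketched here with a hypothetical helper name and an illustrative rate.

```python
def per_1k_to_per_1m(rate_per_1k):
    """Convert a price per 1,000 tokens into a price per 1,000,000 tokens."""
    return rate_per_1k * 1000


# Illustrative rate: 0.003 per 1K tokens corresponds to about 3.00 per 1M tokens.
rate_per_1m = per_1k_to_per_1m(0.003)
print(f"{rate_per_1m:.2f}")
```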

What does cache hit rate represent?

It is the share of prompt tokens served from reusable context, templates, or stored system blocks. Higher cache hit rates shift spend from fresh prompt tokens to discounted cached tokens, lowering total cost without changing outputs.

How should I estimate retry percentage?

Use logs to compute extra requests caused by failures, re-prompts, tool errors, or timeouts. If 3 out of 100 requests repeat once, start with 3%. Revisit after incident fixes or model changes.
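
A retry percentage can be estimated from two log counts, as in this sketch. The function and parameter names are assumptions; it treats each repeat as one extra request of the same size as the original.

```python
def retry_percentage(original_requests, repeated_requests):
    """Extra requests caused by retries, as a percentage of original requests."""
    if original_requests <= 0:
        raise ValueError("original_requests must be positive")
    return 100 * repeated_requests / original_requests


# The FAQ example: 3 out of 100 requests repeat once.
retry_pct = retry_percentage(100, 3)
print(retry_pct)
```

If some requests repeat more than once, count every repeat, so 3 requests that each retry twice contribute 6 repeated requests.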

When is batch discount appropriate?

Apply it when your workload can be queued or processed asynchronously and your provider offers a reduced rate for batch jobs. Leave it at 0% for interactive chat flows where latency matters.

Why include operational overhead?

Token charges ignore supporting costs like gateways, vector stores, monitoring, and incident response. Overhead percentage helps you price internal APIs realistically and prevents budget surprises as usage grows.

How do exports work in this file?

CSV and PDF export the most recent calculation stored in your session. Run a calculation first, then click Export CSV or Export PDF to download a snapshot of inputs, per‑request breakdown, and totals.