| Scenario | Requests/day | Input tokens | Output tokens | Cache hit | Token buffer | Notes |
|---|---|---|---|---|---|---|
| Support assistant | 2,500 | 850 + 120 overhead | 350 | 35% | 8% | Typical RAG prompts, moderate reuse. |
| Document summarization | 400 | 7,500 + 250 overhead | 1,200 | 10% | 12% | Long inputs, higher variance per file. |
| Agentic workflow | 900 | 1,600 + 300 overhead | 650 | 25% | 15% | Tool calls increase overhead and retries. |
effective_requests = base_requests × (1 + retry_rate%)
input_tokens_per_request = avg_input_tokens + overhead_tokens
billable_input_tokens = effective_requests × input_tokens_per_request × (1 + token_buffer%)
billable_output_tokens = effective_requests × avg_output_tokens × (1 + token_buffer%)
cached_input_tokens = billable_input_tokens × cache_hit%
standard_input_tokens = billable_input_tokens − cached_input_tokens
input_cost = (standard_input_tokens ÷ 1,000,000) × input_price
+ (cached_input_tokens ÷ 1,000,000) × cached_input_price
output_cost = (billable_output_tokens ÷ 1,000,000) × output_price
subtotal = input_cost + output_cost
after_discount = subtotal × (1 − batch_discount%)
total = after_discount × (1 + contingency%)
display_total = total × exchange_rate
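The steps above can be sketched as one Python function. The name `llm_cost` and the default values are illustrative, and the prices in the usage example are placeholders rather than any provider's actual rates:

```python
def llm_cost(
    base_requests: float,
    retry_rate: float,         # 0.02 means 2% of requests are retried
    avg_input_tokens: float,
    overhead_tokens: float,    # system prompts, tools, formatting
    avg_output_tokens: float,
    token_buffer: float,       # variance cushion, e.g. 0.08 for 8%
    cache_hit: float,          # share of billable input served from cache
    input_price: float,        # USD per 1M standard input tokens
    cached_input_price: float, # USD per 1M cached input tokens
    output_price: float,       # USD per 1M output tokens
    batch_discount: float = 0.0,
    contingency: float = 0.0,
    exchange_rate: float = 1.0,
) -> float:
    """Return display_total, following the formula steps in order."""
    effective_requests = base_requests * (1 + retry_rate)
    input_per_request = avg_input_tokens + overhead_tokens
    billable_input = effective_requests * input_per_request * (1 + token_buffer)
    billable_output = effective_requests * avg_output_tokens * (1 + token_buffer)
    cached_input = billable_input * cache_hit
    standard_input = billable_input - cached_input
    input_cost = (standard_input / 1_000_000) * input_price \
               + (cached_input / 1_000_000) * cached_input_price
    output_cost = (billable_output / 1_000_000) * output_price
    subtotal = input_cost + output_cost
    after_discount = subtotal * (1 - batch_discount)
    total = after_discount * (1 + contingency)
    return total * exchange_rate
```

With every rate set to zero and exactly 1M input tokens, the result collapses to the input price, which is a quick sanity check when wiring the model into a spreadsheet.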
- Measure tokens using logs or a token counter for your prompts and responses.
- Set volumes (requests/day and days) to match your forecast period.
- Enter token prices from your provider and plan type.
- Model real-world effects like retries, overhead, and caching.
- Add buffers for variance, then apply discounts and contingency.
- Export a CSV for finance, or a PDF for approvals.
Token drivers and measurement
Accurate forecasting starts with measuring tokens, not characters. Capture median and p95 input and output tokens per request from logs, then separate user text from retrieved context and guardrail overhead. A 10% increase in retrieval size can raise input tokens more than a comparable rise in request volume, and it is easier to miss. Track tool calls, system prompts, and formatting templates because they often add stable overhead that scales linearly with traffic.
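A minimal sketch of pulling median and p95 from logged per-request token counts, using only the standard library (the function name is illustrative):

```python
from statistics import median, quantiles

def token_stats(token_counts: list[int]) -> dict[str, float]:
    """Median and p95 of per-request token counts sampled from logs."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile
    p95 = quantiles(token_counts, n=100)[94]
    return {"median": median(token_counts), "p95": p95}
```

Run this separately for input, output, and overhead tokens so each driver gets its own median/p95 pair.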
Pricing inputs and cached rates
Providers usually bill input and output at different rates, so the split matters. Enter prices per million tokens and, when available, a discounted cached input rate. Cache hit rate should reflect repeated prefixes such as policies, instructions, or shared conversation state. If caching is uncertain, model conservative and optimistic scenarios to bracket risk, then revisit after a week of production telemetry.
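One way to reason about caching is to fold the hit rate into a single blended per-million input price. This helper (name is illustrative) assumes the cache hit rate applies uniformly across billable input tokens:

```python
def blended_input_price(input_price: float,
                        cached_price: float,
                        cache_hit: float) -> float:
    """Effective per-1M input price given a cache hit rate (0.0-1.0)."""
    return (1 - cache_hit) * input_price + cache_hit * cached_price
```

With hypothetical prices of $3.00 standard and $0.30 cached, a 35% hit rate yields a blended $2.055 per million input tokens.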
Throughput forecasting for budgets
Requests per day should be tied to product metrics: active users, sessions, and features that trigger calls. Use days in period to align with billing cycles and planned launches. For bursty workloads, consider using a higher daily average during peak campaigns. When you forecast growth, update both volume and token averages because prompt complexity often increases as capabilities expand.
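Tying requests/day to product metrics can be sketched as a simple driver model; the parameter names are assumptions about what your analytics exposes, not a fixed schema:

```python
def daily_requests(active_users: float,
                   sessions_per_user: float,
                   calls_per_session: float,
                   peak_multiplier: float = 1.0) -> float:
    """Forecast requests/day from product drivers; raise the
    multiplier above 1.0 for campaign or launch periods."""
    return active_users * sessions_per_user * calls_per_session * peak_multiplier
```

Keeping the drivers separate makes it easy to update volume and token averages independently as features expand.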
Buffers, retries, and risk controls
Retry rate captures hidden cost from timeouts, rate limits, and user re-asks. Even a 2% retry rate becomes meaningful at scale. Token buffer protects you from variance in long documents, multilingual content, and atypical agent loops. Add a separate contingency margin for budgeting approvals; it supports procurement planning and avoids mid-quarter surprises when product usage shifts. Use dashboards to validate assumptions and to quickly catch drift after releases.
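In this model, retries, token buffer, and contingency compound multiplicatively rather than adding, which is why modest percentages stack up. A quick sketch with illustrative values:

```python
def risk_multiplier(retry_rate: float,
                    token_buffer: float,
                    contingency: float) -> float:
    """Combined multiplier applied on top of the base token estimate."""
    return (1 + retry_rate) * (1 + token_buffer) * (1 + contingency)

# 2% retries, 8% buffer, 10% contingency compound to ~21% over base
m = risk_multiplier(0.02, 0.08, 0.10)  # ≈ 1.212
```

Seeing the combined factor helps when a stakeholder asks why the budget exceeds the raw token math.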
Reporting for stakeholder alignment
Finance teams respond well to unit economics. Use cost per request, cost per 1K tokens, and daily burn to compare models and features. Share the token breakdown to explain why caching or prompt refactors reduce spend. Export CSV for spreadsheets and a PDF for review packets. Keep a versioned snapshot of assumptions so engineering and product can iterate responsibly. For teams managing multiple applications, maintain a small cost registry that lists model, feature owner, target metrics, and monthly cap, then review it in planning meetings to adjust assumptions.
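The unit metrics above are straightforward to derive from a period total; the numbers in the test are hypothetical monthly figures, not benchmarks:

```python
def unit_economics(total_cost: float,
                   requests: int,
                   total_tokens: int,
                   days: int) -> dict[str, float]:
    """Cost per request, cost per 1K tokens, and daily burn for a period."""
    return {
        "cost_per_request": total_cost / requests,
        "cost_per_1k_tokens": total_cost / total_tokens * 1000,
        "daily_burn": total_cost / days,
    }
```

Computing all three from the same inputs keeps model-to-model comparisons consistent in the cost registry.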
1) What token numbers should I use?
Start with real logs. Use the median for typical traffic and the 95th percentile for stress testing. Separate input, output, and overhead tokens so you can reduce cost by targeting the biggest driver.
2) How do I estimate cache hit rate?
Look for repeated prompt prefixes and stable conversation scaffolding. If you do not have telemetry yet, test 10%, 30%, and 50% scenarios. Replace assumptions after collecting a few days of production traces.
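The 10%/30%/50% bracketing can be scripted so the scenarios stay in sync as prices change; token volumes and prices below are placeholders:

```python
def cache_scenarios(input_tokens_millions: float,
                    input_price: float,
                    cached_price: float,
                    hit_rates: tuple[float, ...] = (0.10, 0.30, 0.50)) -> dict[float, float]:
    """Input cost under each assumed cache hit rate."""
    return {
        h: input_tokens_millions * ((1 - h) * input_price + h * cached_price)
        for h in hit_rates
    }
```

Once production traces arrive, replace the assumed rates with measured prefix reuse and retire the bracket.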
3) Why include overhead tokens?
System prompts, safety policies, formatting, and tool routing can be a consistent share of every request. Ignoring overhead can understate spend and hide optimization opportunities like shorter templates or fewer tool calls.
4) How should I model retries?
Use incident history and rate-limit behavior. Count both automatic retries and user resubmissions. A small retry rate can add substantial cost at high volume, so treat it as a reliability and budgeting metric.
5) What does token buffer represent?
It is a cushion for variability in prompts, retrieved context, and long outputs. Buffers reduce the chance of missing budget targets when inputs change, languages vary, or new features increase prompt length.
6) Can this calculator compare two models?
Yes. Run one model’s pricing and token assumptions, export results, then repeat for the alternative. Compare cost per request and cost per 1K tokens to evaluate tradeoffs alongside quality and latency.
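A minimal way to run that comparison, assuming hypothetical per-million prices for two candidate models (the figures are not real quotes):

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Per-request cost for one model's pricing (prices per 1M tokens)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Same token assumptions, two hypothetical price points:
model_a = cost_per_request(1000, 350, 3.0, 15.0)   # 0.00825
model_b = cost_per_request(1000, 350, 0.5, 2.0)    # 0.00120
```

Pair the cost delta with quality and latency measurements before deciding; the cheaper model is only a win if it meets the bar.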