Calculator Inputs
Enter workload and pricing assumptions to estimate usable token throughput, request capacity, and operating cost for AI inference traffic.
Example Data Table
Use this sample to see how concurrency, latency, and token counts change throughput and cost outcomes.
| Scenario | Input Tokens | Output Tokens | Latency (s) | Concurrency | Total Tokens/sec | Cost per Active Hour |
|---|---|---|---|---|---|---|
| Chat Support | 800 | 350 | 1.20 | 10 | 6,998.86 | $47.37 |
| Code Assistant | 1,200 | 600 | 1.80 | 12 | 9,090.91 | $122.73 |
| Document Summaries | 2,400 | 450 | 2.60 | 16 | 10,238.77 | $98.89 |
Formula Used
These formulas estimate planning throughput under steady-state assumptions. Real systems can vary because of queuing, caching, batching, network delays, and model warm-up time.
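The formulas themselves are not reproduced in this section, so the following is only a minimal sketch of one plausible steady-state model, not the calculator's confirmed method. It assumes that each concurrency slot completes one request per effective latency, that overhead is a percentage added to latency, and that utilization and success rate scale the result linearly. The function name `estimate_throughput` and all parameter names are hypothetical.

```python
def estimate_throughput(input_tokens, output_tokens, latency_s, concurrency,
                        utilization=1.0, success_rate=1.0, overhead_pct=0.0):
    """Return (requests_per_sec, tokens_per_sec) under a simple steady-state model.

    Assumed model (hypothetical, not confirmed by the calculator):
      effective latency = latency_s * (1 + overhead_pct / 100)
      requests/sec      = concurrency / effective latency * utilization * success_rate
      tokens/sec        = requests/sec * (input_tokens + output_tokens)
    """
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    effective_latency = latency_s * (1 + overhead_pct / 100)
    requests_per_sec = concurrency / effective_latency * utilization * success_rate
    tokens_per_sec = requests_per_sec * (input_tokens + output_tokens)
    return requests_per_sec, tokens_per_sec


# Chat Support row from the sample table, with no utilization,
# success-rate, or overhead adjustment applied.
rps, tps = estimate_throughput(800, 350, 1.20, 10)
print(f"{rps:.2f} requests/sec, {tps:.2f} tokens/sec")
```

With no adjustments applied, this sketch gives an upper bound above the 6,998.86 tokens/sec shown for Chat Support in the sample table. The table value presumably reflects utilization, success-rate, and overhead settings that are not listed here, so the sketch is not expected to reproduce it.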
How to Use This Calculator
- Enter average prompt and completion token counts for one request.
- Add the typical end-to-end latency for that request shape.
- Set expected concurrency, success rate, and practical utilization.
- Include any streaming or transport overhead that slows delivery.
- Choose active minutes per hour to reflect real operating time.
- Enter input and output prices per million tokens.
- Press Calculate Throughput to show results above the form.
- Use the CSV or PDF buttons to export the results.
FAQs
1. What does token throughput mean?
Token throughput is the number of tokens your system can process or generate over time. It helps estimate serving capacity, scaling needs, and operating cost.
2. Why are input and output tokens separated?
Many AI platforms price prompt and completion tokens differently. Separating them improves cost planning and shows how generation-heavy workloads change economics.
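As a rough illustration of why the split matters, the sketch below prices two requests with the same total token count but opposite input and output mixes. The prices are made-up placeholders, not rates from any real platform, and `request_cost` is a hypothetical helper.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Dollar cost of one request when prompt and completion are priced separately."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Hypothetical prices: $1.00 per million input tokens, $4.00 per million output tokens.
prompt_heavy = request_cost(2_400, 450, 1.00, 4.00)
generation_heavy = request_cost(450, 2_400, 1.00, 4.00)

print(f"prompt-heavy:     ${prompt_heavy:.6f} per request")
print(f"generation-heavy: ${generation_heavy:.6f} per request")
```

Both requests use 2,850 tokens, yet under these assumed prices the generation-heavy one costs more than twice as much, which a single blended token price would hide.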
3. Why does latency reduce throughput?
Longer latency keeps each concurrency slot occupied for more time. That lowers the number of requests your available concurrency can complete every second.
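The inverse view can make this concrete: to hold a fixed completion rate, the number of in-flight requests has to grow in proportion to latency. The sketch below assumes a steady state with no queuing (the relationship usually called Little's law); `required_concurrency` and the target rate are hypothetical.

```python
import math


def required_concurrency(target_requests_per_sec, latency_s):
    """In-flight requests needed to sustain a target rate at a given latency.

    Steady-state assumption: in-flight requests = completion rate * latency.
    The product is rounded before ceil() so floating-point noise does not
    add a spurious extra slot.
    """
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return math.ceil(round(target_requests_per_sec * latency_s, 9))


target = 8  # hypothetical target: 8 completed requests per second
for latency_s in (0.5, 1.0, 2.0):
    slots = required_concurrency(target, latency_s)
    print(f"latency {latency_s:.1f}s -> {slots} concurrent requests needed")
```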
4. What is utilization in this model?
Utilization represents how much of theoretical peak capacity you actually sustain. It accounts for idle periods, uneven traffic, and operational inefficiencies.
5. Should I use average or peak token counts?
Start with realistic averages for routine planning. For safety limits, also test heavier scenarios using higher token counts and latency values.
6. Does this calculator include batching effects?
Not directly. You can approximate batching by adjusting concurrency, latency, and utilization to reflect the net performance you observe.
7. Why use active minutes per hour?
Some systems do not run at steady full load for every minute. Active minutes help convert peak throughput into a more realistic hourly estimate.
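A minimal sketch of that conversion, assuming the system delivers its per-second throughput only during active minutes and nothing otherwise. Whether the calculator scales in exactly this way is an assumption, and `tokens_per_hour` is a hypothetical helper.

```python
def tokens_per_hour(tokens_per_sec, active_minutes_per_hour):
    """Scale a per-second rate to an hourly total using only the active minutes."""
    if not 0 <= active_minutes_per_hour <= 60:
        raise ValueError("active_minutes_per_hour must be between 0 and 60")
    return tokens_per_sec * active_minutes_per_hour * 60


# Per-second rate taken from the Chat Support row of the sample table.
rate = 6_998.86
full_hour = tokens_per_hour(rate, 60)
partial_hour = tokens_per_hour(rate, 45)  # 45 active minutes is an illustrative value

print(f"60 active minutes: {full_hour:,.0f} tokens/hour")
print(f"45 active minutes: {partial_hour:,.0f} tokens/hour")
```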
8. Can this calculator replace production benchmarking?
No. It is a planning tool. Validate important decisions with real traffic tests, monitoring data, queue behavior, and infrastructure limits.