Token Throughput Calculator

Calculator Inputs

Enter workload and pricing assumptions to estimate usable token throughput, request capacity, and operating cost for AI inference traffic.

Input Tokens per Request

Prompt tokens consumed by each request.

Output Tokens per Request

Generated completion tokens per request.

Average Latency (seconds)

Base end-to-end response latency, before streaming overhead is applied.

Concurrent Requests

Parallel in-flight requests handled together.

Success Rate (%)

Percentage of requests completed successfully.

Utilization (%)

Share of peak capacity used in practice.

Streaming Overhead (%)

Extra latency added by streaming and transport overhead, as a percentage of base latency.

Active Minutes per Hour

Minutes in each operating hour during which the system is under load.

Input Price per 1M Tokens

Cost applied to prompt tokens.

Output Price per 1M Tokens

Cost applied to generated tokens.

Example Data Table

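These sample scenarios show how concurrency, latency, and token counts affect throughput and cost. The two result columns also depend on inputs not listed in the table: success rate, utilization, streaming overhead, active minutes, and prices.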

Scenario	Input Tokens	Output Tokens	Latency (s)	Concurrency	Total Tokens/sec	Cost per Active Hour
Chat Support	800	350	1.20	10	6,998.86	$47.37
Code Assistant	1,200	600	1.80	12	9,090.91	$122.73
Document Summaries	2,400	450	2.60	16	10,238.77	$98.89

Formula Used

Total tokens per request = Input tokens + Output tokens

Effective latency = Average latency × (1 + Streaming overhead ÷ 100)

Raw requests per second = Concurrent requests ÷ Effective latency

Effective requests per second = Raw requests per second × (Success rate ÷ 100) × (Utilization ÷ 100)

Total tokens per second = Total tokens per request × Effective requests per second

Total tokens per active hour = Total tokens per second × Active minutes per hour × 60

Cost per request = (Input tokens ÷ 1,000,000 × Input price) + (Output tokens ÷ 1,000,000 × Output price)

These formulas give a planning estimate under steady-state assumptions. Real systems can deviate because of queuing, caching, batching, network delays, and model warmup time.
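The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the Workload class, the estimate function, the field names, and every value in the example are assumptions made up for this sketch. Percent inputs are divided by 100 before use.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical container mirroring the calculator inputs."""
    input_tokens: float            # prompt tokens per request
    output_tokens: float           # completion tokens per request
    latency_s: float               # base end-to-end latency in seconds
    concurrency: float             # parallel in-flight requests
    success_rate_pct: float        # percent, 0 to 100
    utilization_pct: float         # percent, 0 to 100
    streaming_overhead_pct: float  # extra latency as a percent of base latency
    active_minutes: float          # loaded minutes per operating hour
    input_price_per_m: float       # price per 1M prompt tokens
    output_price_per_m: float      # price per 1M completion tokens


def estimate(w: Workload) -> dict:
    """Apply the listed formulas in order and return each intermediate value."""
    total_tokens_per_request = w.input_tokens + w.output_tokens
    effective_latency_s = w.latency_s * (1 + w.streaming_overhead_pct / 100)
    raw_rps = w.concurrency / effective_latency_s
    effective_rps = raw_rps * (w.success_rate_pct / 100) * (w.utilization_pct / 100)
    tokens_per_second = total_tokens_per_request * effective_rps
    tokens_per_active_hour = tokens_per_second * w.active_minutes * 60
    cost_per_request = (
        w.input_tokens / 1_000_000 * w.input_price_per_m
        + w.output_tokens / 1_000_000 * w.output_price_per_m
    )
    return {
        "total_tokens_per_request": total_tokens_per_request,
        "effective_latency_s": effective_latency_s,
        "raw_rps": raw_rps,
        "effective_rps": effective_rps,
        "tokens_per_second": tokens_per_second,
        "tokens_per_active_hour": tokens_per_active_hour,
        "cost_per_request": cost_per_request,
    }


# Made-up example inputs, not taken from the sample table.
example = Workload(
    input_tokens=1000, output_tokens=500, latency_s=2.0, concurrency=10,
    success_rate_pct=95, utilization_pct=80, streaming_overhead_pct=25,
    active_minutes=45, input_price_per_m=1.0, output_price_per_m=2.0,
)
for name, value in estimate(example).items():
    print(f"{name}: {value:,.4f}")
```

With these made-up inputs, effective latency is 2.5 s, raw throughput is 4 requests per second, effective throughput is 3.04 requests per second, and total throughput is about 4,560 tokens per second.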

How to Use This Calculator

Enter average prompt and completion token counts for one request.
Add the typical end-to-end latency for that request shape.
Set expected concurrency, success rate, and practical utilization.
Include any streaming or transport overhead that slows delivery.
Choose active minutes per hour to reflect real operating time.
Enter input and output prices per million tokens.
Press Calculate Throughput to show results above the form.
Use the CSV or PDF buttons to export the results.

FAQs

1. What does token throughput mean?

Token throughput is the number of tokens your system can process or generate over time. It helps estimate serving capacity, scaling needs, and operating cost.

2. Why are input and output tokens separated?

Many AI platforms price prompt and completion tokens differently. Separating them improves cost planning and shows how generation-heavy workloads change economics.
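As a rough illustration, the sketch below prices two requests with the same total token count but opposite prompt and completion mixes. The function name cost_per_request and both prices are hypothetical values chosen for the example, not real platform rates.

```python
def cost_per_request(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request when prompt and completion tokens are priced separately."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical prices per 1M tokens, assumed only for this example.
INPUT_PRICE = 1.00
OUTPUT_PRICE = 3.00

# Both requests use 1,800 tokens in total, split differently.
prompt_heavy = cost_per_request(1500, 300, INPUT_PRICE, OUTPUT_PRICE)
generation_heavy = cost_per_request(300, 1500, INPUT_PRICE, OUTPUT_PRICE)

print(f"prompt-heavy:     ${prompt_heavy:.4f} per request")
print(f"generation-heavy: ${generation_heavy:.4f} per request")
```

Under these assumed prices the generation-heavy request costs twice as much as the prompt-heavy one, even though the total token count is identical.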

3. Why does latency reduce throughput?

Longer latency keeps each in-flight request slot occupied for more time, so the same concurrency completes fewer requests per second.
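A small sketch of this relationship, using the raw requests per second formula from this page. The function name raw_requests_per_second, the concurrency of 10, and the latency values are assumptions for illustration.

```python
def raw_requests_per_second(concurrency, latency_s, streaming_overhead_pct=0.0):
    """Raw requests per second = concurrency / effective latency."""
    effective_latency_s = latency_s * (1 + streaming_overhead_pct / 100)
    return concurrency / effective_latency_s


CONCURRENCY = 10  # assumed value for the example

# Doubling latency halves the request rate at fixed concurrency.
for latency_s in (1.0, 2.0, 4.0):
    rps = raw_requests_per_second(CONCURRENCY, latency_s)
    print(f"latency {latency_s:.1f} s -> {rps:.2f} requests/s")
```

At a fixed concurrency of 10, latencies of 1, 2, and 4 seconds give 10, 5, and 2.5 raw requests per second.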

4. What is utilization in this model?

Utilization represents how much of theoretical peak capacity you actually sustain. It accounts for idle periods, uneven traffic, and operational inefficiencies.

5. Should I use average or peak token counts?

Start with realistic averages for routine planning. For safety limits, also test heavier scenarios using higher token counts and latency values.

6. Does this calculator include batching effects?

Not directly. You can approximate batching by adjusting concurrency, latency, and utilization to reflect the net performance you observe.

7. Why use active minutes per hour?

Some systems do not run at steady full load for every minute. Active minutes help convert peak throughput into a more realistic hourly estimate.
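The conversion can be sketched as follows. The function name tokens_per_active_hour, the range check, and the 5,000 tokens per second figure are assumptions for illustration only.

```python
def tokens_per_active_hour(tokens_per_second, active_minutes):
    """Total tokens per active hour = tokens per second x active minutes x 60."""
    # Assumed sanity check: an hour has at most 60 loaded minutes.
    if not 0 <= active_minutes <= 60:
        raise ValueError("active_minutes must be between 0 and 60")
    return tokens_per_second * active_minutes * 60


TOKENS_PER_SECOND = 5000  # assumed throughput for the example

for minutes in (60, 45, 30):
    total = tokens_per_active_hour(TOKENS_PER_SECOND, minutes)
    print(f"{minutes} active minutes -> {total:,} tokens per hour")
```

At an assumed 5,000 tokens per second, a fully loaded hour yields 18,000,000 tokens, while 45 active minutes yields 13,500,000.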

8. Can this calculator replace production benchmarking?

No. It is a planning tool. Validate important decisions with real traffic tests, monitoring data, queue behavior, and infrastructure limits.