Know where time goes in every request. Adjust tokens, throughput, and RTT to compare paths. Export results, share assumptions, and improve product reliability today.
| Scenario | Input tokens | Output tokens | Decode tps | RTT (ms) | Queue (ms) |
|---|---|---|---|---|---|
| Chat assistant, streaming | 900 | 300 | 85 | 90 | 80 |
| Long document summary | 4000 | 800 | 70 | 110 | 120 |
| Small request, low RTT | 500 | 200 | 95 | 35 | 40 |
| High contention burst | 1200 | 450 | 85 | 90 | 250 |
| Tool-assisted run | 900 | 350 | 80 | 90 | 110 |
prefill_time_s = effective_input_tokens / prefill_tps
decode_time_s = output_tokens / decode_tps
TTFT_ms = queue_ms + client_ms + handshake_ms + RTT_ms + provider_ms + prefill_time_s × 1000
Total_ms (streaming) = TTFT_ms + decode_time_s × 1000 + tools_ms + images_ms + packaging_ms + tail_RTT_ms
Buffered_ms = Total_ms × (1 + safety_pct / 100)
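The formulas above can be sketched as a small Python function. The prefill throughput and the fixed overheads (client, handshake, provider) are assumed illustrative values, not measured defaults; replace them with numbers from your own traces.

```python
def estimate_latency_ms(input_tokens, output_tokens, prefill_tps, decode_tps,
                        queue_ms, rtt_ms, client_ms=10, handshake_ms=50,
                        provider_ms=30, cache_hit_rate=0.0, safety_pct=20):
    """Sketch of the estimator formulas; all overheads are in milliseconds."""
    effective_input = input_tokens * (1 - cache_hit_rate)
    prefill_ms = effective_input / prefill_tps * 1000
    decode_ms = output_tokens / decode_tps * 1000
    ttft_ms = queue_ms + client_ms + handshake_ms + rtt_ms + provider_ms + prefill_ms
    # Tools, images, and packaging omitted here; tail RTT modeled as one extra RTT
    total_ms = ttft_ms + decode_ms + rtt_ms
    buffered_ms = total_ms * (1 + safety_pct / 100)
    return ttft_ms, total_ms, buffered_ms

# Chat assistant scenario from the table (prefill_tps=2000 is an assumption)
ttft, total, buffered = estimate_latency_ms(900, 300, prefill_tps=2000,
                                            decode_tps=85, queue_ms=80, rtt_ms=90)
```

With these assumed overheads, decode time (300 tokens at 85 tps, about 3.5 s) dominates the total, while TTFT stays near 0.7 s thanks to streaming.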
Prompt latency is the sum of waiting, transfer, and compute. Queue time reflects load and scheduling. Network time is driven by RTT and handshakes. Compute splits into prefill, which processes input tokens, and decode, which produces output tokens. Streaming changes perception by exposing the first token earlier. For chat UIs, users often notice TTFT above 300–500 ms, even when totals remain acceptable. Measure TTFT and total separately; they guide different optimization efforts.
Compute scales roughly linearly with token counts. Prefill time depends on effective input tokens after caching and the prefill throughput. Decode time depends on output tokens and decode throughput. If your workload is retrieval heavy, prefill dominates. If generations are long, decode dominates and small throughput changes matter. For example, doubling decode throughput halves decode time, but does not change queue or network costs.
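The prefill/decode split can be checked numerically. Using the long-document-summary row from the table (the 2000-tps prefill figure is an assumption for illustration):

```python
def compute_times_s(input_tokens, output_tokens, prefill_tps, decode_tps):
    # prefill_time = effective input tokens / prefill throughput
    # decode_time  = output tokens / decode throughput
    return input_tokens / prefill_tps, output_tokens / decode_tps

# Long document summary: 4000 input tokens, 800 output tokens
p, d = compute_times_s(4000, 800, prefill_tps=2000, decode_tps=70)
# Doubling decode throughput halves decode time but leaves prefill unchanged
p2, d2 = compute_times_s(4000, 800, prefill_tps=2000, decode_tps=140)
```

Here decode dominates (roughly 11 s versus 2 s of prefill), so throughput improvements on the decode side pay off first.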
Many applications add tool calls, image processing, and response packaging. Each tool call can introduce external latency and additional network hops, so model them separately. Image inputs add parsing and embedding overhead, and may increase tokenization and safety checks. Packaging covers formatting, policy evaluation, and transport framing. In this estimator, these costs are modeled as fixed milliseconds per item for planning and sensitivity analysis.
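The fixed-milliseconds-per-item model described above reduces to a simple sum; the per-item costs below are hypothetical inputs you would measure yourself:

```python
def overhead_ms(tool_calls=0, ms_per_tool=0.0, images=0, ms_per_image=0.0,
                packaging_ms=0.0):
    """Fixed milliseconds per item: tools and images scale by count,
    packaging is a flat per-response cost."""
    return tool_calls * ms_per_tool + images * ms_per_image + packaging_ms

# Tool-assisted run: two dependent tool calls at 150 ms each plus 20 ms packaging
extra = overhead_ms(tool_calls=2, ms_per_tool=150, packaging_ms=20)  # 320 ms
```

Because dependent tool calls sit on the critical path, the counts multiply rather than overlap; only independent calls can be parallelized away.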
Concurrency often increases queue delay and provider overhead. During bursts, contention can inflate both even if average compute is stable. Tail latency is typically higher than the median, so SLAs should target p95 or p99. Use the safety buffer to add headroom and the p95 multiplier to approximate tail behavior. If your p95 is drifting, reducing max output tokens can be the quickest stabilizer.
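The safety buffer and p95 multiplier mentioned above compose as simple scalings; the default percentages here are assumptions, not recommendations:

```python
def with_headroom(total_ms, safety_pct=20, p95_multiplier=1.5):
    """Apply the safety buffer and a rough p95 multiplier to a median estimate."""
    buffered = total_ms * (1 + safety_pct / 100)
    p95_estimate = total_ms * p95_multiplier
    return buffered, p95_estimate

# 2000 ms median estimate -> 2400 ms buffered, 3000 ms approximate p95
buffered, p95 = with_headroom(2000)
```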
Start with measured RTT and realistic token counts from logs. Calibrate throughputs using provider benchmarks or internal tracing. Then simulate changes: shorter prompts, smaller max tokens, faster regions, or caching. Export CSV for reviews and keep PDFs with assumptions. Revisit inputs after model upgrades or traffic shifts. Over time, compare predicted totals to observed traces and adjust overhead knobs until errors stay within a small margin.
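The final calibration step, comparing predicted totals to observed traces, can be automated with a relative-error check; the 15% margin here is an assumed threshold, not a standard:

```python
def relative_error(predicted_ms, observed_ms):
    """Fractional gap between the estimator's prediction and a measured trace."""
    return abs(predicted_ms - observed_ms) / observed_ms

# Flag a scenario whose prediction drifts more than an assumed 15% margin
drifted = relative_error(4300, 5100) > 0.15
```

When a scenario drifts, adjust the overhead knobs (queue, provider, packaging) rather than the throughput figures first, since those are usually the least certain inputs.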
TTFT is the time until the first token is available. It strongly influences perceived speed in chat and streaming UIs, and is most affected by queue delay, RTT, handshake cost, provider overhead, and input token prefill time.
Use provider benchmarks, internal tracing, or load tests. Prefill throughput is higher for small contexts and can drop with long prompts. Decode throughput depends on model size, sampling settings, and server load.
Enable streaming when your interface can render partial output and users benefit from early feedback. Streaming reduces perceived latency even if total time is similar, but may slightly increase tail latency due to chunking and client rendering.
Measure each tool call end to end, including network, retries, and backend processing. Enter the average per call, then test worst case by increasing the per-call value or count. Treat dependent tools as additive in the critical path.
Caching reduces effective input work by reusing previously processed context. A higher hit rate lowers prefill time and TTFT, but it will not reduce RTT, queue delay, or decode time. Validate hit rates with logs before relying on them.
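The cache effect enters the estimator only through the effective input token count, a minimal sketch:

```python
def effective_input_tokens(input_tokens, cache_hit_rate):
    # Cached prefix tokens skip prefill compute entirely;
    # RTT, queue delay, and decode time are unaffected.
    return input_tokens * (1 - cache_hit_rate)

# A 4000-token prompt with a validated 60% hit rate leaves 1600 tokens of prefill work
remaining = effective_input_tokens(4000, 0.6)
```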
Limit max output tokens, reduce prompt length, and route traffic to a closer region. If queue delay is dominant, add capacity or throttle bursts. For tool-heavy flows, parallelize independent calls and cache stable tool responses where possible.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.