Know where time goes in every request. Adjust tokens, throughput, and RTT to compare paths. Export results, share assumptions, and improve product reliability today.
| Scenario | Input tokens | Output tokens | Decode tps | RTT (ms) | Queue (ms) |
|---|---|---|---|---|---|
| Chat assistant, streaming | 900 | 300 | 85 | 90 | 80 |
| Long document summary | 4000 | 800 | 70 | 110 | 120 |
| Small request, low RTT | 500 | 200 | 95 | 35 | 40 |
| High contention burst | 1200 | 450 | 85 | 90 | 250 |
| Tool-assisted run | 900 | 350 | 80 | 90 | 110 |
prefill_time_s = effective_input_tokens / prefill_tps
decode_time_s = output_tokens / decode_tps
TTFT_ms = queue_ms + client_ms + handshake_ms + RTT_ms + provider_ms + prefill_time_s × 1000
Total_ms (streaming) = TTFT_ms + decode_time_s × 1000 + tools_ms + images_ms + packaging_ms + tail_RTT_ms
Buffered_ms = Total_ms × (1 + safety_pct / 100)
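The formulas above can be sketched as a small Python function. The prefill throughput and the fixed overheads (client, handshake, provider) are assumed illustrative values, not measured defaults; replace them with numbers from your own traces.

```python
def estimate_latency_ms(input_tokens, output_tokens, prefill_tps, decode_tps,
                        queue_ms, rtt_ms, client_ms=10, handshake_ms=50,
                        provider_ms=30, cache_hit_rate=0.0, safety_pct=20):
    """Sketch of the estimator formulas; all overheads are in milliseconds."""
    effective_input = input_tokens * (1 - cache_hit_rate)
    prefill_ms = effective_input / prefill_tps * 1000
    decode_ms = output_tokens / decode_tps * 1000
    ttft_ms = queue_ms + client_ms + handshake_ms + rtt_ms + provider_ms + prefill_ms
    # Tools, images, and packaging omitted here; tail RTT modeled as one extra RTT
    total_ms = ttft_ms + decode_ms + rtt_ms
    buffered_ms = total_ms * (1 + safety_pct / 100)
    return ttft_ms, total_ms, buffered_ms

# Chat assistant scenario from the table (prefill_tps=2000 is an assumption)
ttft, total, buffered = estimate_latency_ms(900, 300, prefill_tps=2000,
                                            decode_tps=85, queue_ms=80, rtt_ms=90)
```

With these assumed overheads, decode time (300 tokens at 85 tps, about 3.5 s) dominates the total, while TTFT stays near 0.7 s thanks to streaming.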
Prompt latency is the sum of waiting, transfer, and compute. Queue time reflects load and scheduling. Network time is driven by RTT and handshakes. Compute splits into prefill, which processes input tokens, and decode, which produces output tokens. Streaming changes perception by exposing the first token earlier. For chat UIs, users often notice TTFT above 300–500 ms, even when totals remain acceptable. Measure TTFT and total separately; they guide different optimization efforts.
Compute scales roughly linearly with token counts. Prefill time depends on effective input tokens after caching and the prefill throughput. Decode time depends on output tokens and decode throughput. If your workload is retrieval heavy, prefill dominates. If generations are long, decode dominates and small throughput changes matter. For example, doubling decode throughput halves decode time, but does not change queue or network costs.
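The prefill/decode split can be checked numerically. Using the long-document-summary row from the table (the 2000-tps prefill figure is an assumption for illustration):

```python
def compute_times_s(input_tokens, output_tokens, prefill_tps, decode_tps):
    # prefill_time = effective input tokens / prefill throughput
    # decode_time  = output tokens / decode throughput
    return input_tokens / prefill_tps, output_tokens / decode_tps

# Long document summary: 4000 input tokens, 800 output tokens
p, d = compute_times_s(4000, 800, prefill_tps=2000, decode_tps=70)
# Doubling decode throughput halves decode time but leaves prefill unchanged
p2, d2 = compute_times_s(4000, 800, prefill_tps=2000, decode_tps=140)
```

Here decode dominates (roughly 11 s versus 2 s of prefill), so throughput improvements on the decode side pay off first.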
Many applications add tool calls, image processing, and response packaging. Each tool call can introduce external latency and additional network hops, so model them separately. Image inputs add parsing and embedding overhead, and may increase tokenization and safety checks. Packaging covers formatting, policy evaluation, and transport framing. In this estimator, these costs are modeled as fixed milliseconds per item for planning and sensitivity analysis.
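The fixed-milliseconds-per-item model described above reduces to a simple sum; the per-item costs below are hypothetical inputs you would measure yourself:

```python
def overhead_ms(tool_calls=0, ms_per_tool=0.0, images=0, ms_per_image=0.0,
                packaging_ms=0.0):
    """Fixed milliseconds per item: tools and images scale by count,
    packaging is a flat per-response cost."""
    return tool_calls * ms_per_tool + images * ms_per_image + packaging_ms

# Tool-assisted run: two dependent tool calls at 150 ms each plus 20 ms packaging
extra = overhead_ms(tool_calls=2, ms_per_tool=150, packaging_ms=20)  # 320 ms
```

Because dependent tool calls sit on the critical path, the counts multiply rather than overlap; only independent calls can be parallelized away.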
Concurrency often increases queue delay and provider overhead. During bursts, contention can inflate both even if average compute is stable. Tail latency is typically higher than the median, so SLAs should target p95 or p99. Use the safety buffer to add headroom and the p95 multiplier to approximate tail behavior. If your p95 is drifting, reducing max output tokens can be the quickest stabilizer.
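The safety buffer and p95 multiplier mentioned above compose as simple scalings; the default percentages here are assumptions, not recommendations:

```python
def with_headroom(total_ms, safety_pct=20, p95_multiplier=1.5):
    """Apply the safety buffer and a rough p95 multiplier to a median estimate."""
    buffered = total_ms * (1 + safety_pct / 100)
    p95_estimate = total_ms * p95_multiplier
    return buffered, p95_estimate

# 2000 ms median estimate -> 2400 ms buffered, 3000 ms approximate p95
buffered, p95 = with_headroom(2000)
```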
Start with measured RTT and realistic token counts from logs. Calibrate throughputs using provider benchmarks or internal tracing. Then simulate changes: shorter prompts, smaller max tokens, faster regions, or caching. Export CSV for reviews and keep PDFs with assumptions. Revisit inputs after model upgrades or traffic shifts. Over time, compare predicted totals to observed traces and adjust overhead knobs until errors stay within a small margin.
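The final calibration step, comparing predicted totals to observed traces, can be automated with a relative-error check; the 15% margin here is an assumed threshold, not a standard:

```python
def relative_error(predicted_ms, observed_ms):
    """Fractional gap between the estimator's prediction and a measured trace."""
    return abs(predicted_ms - observed_ms) / observed_ms

# Flag a scenario whose prediction drifts more than an assumed 15% margin
drifted = relative_error(4300, 5100) > 0.15
```

When a scenario drifts, adjust the overhead knobs (queue, provider, packaging) rather than the throughput figures first, since those are usually the least certain inputs.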
TTFT is the time until the first token is available. It strongly influences perceived speed in chat and streaming UIs, and is most affected by queue delay, RTT, handshake cost, provider overhead, and input token prefill time.
Use provider benchmarks, internal tracing, or load tests. Prefill throughput is higher for small contexts and can drop with long prompts. Decode throughput depends on model size, sampling settings, and server load.
Enable streaming when your interface can render partial output and users benefit from early feedback. Streaming reduces perceived latency even if total time is similar, but may slightly increase tail latency due to chunking and client rendering.
Measure each tool call end to end, including network, retries, and backend processing. Enter the average per call, then test worst case by increasing the per-call value or count. Treat dependent tools as additive in the critical path.
Caching reduces effective input work by reusing previously processed context. A higher hit rate lowers prefill time and TTFT, but it will not reduce RTT, queue delay, or decode time. Validate hit rates with logs before relying on them.
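The cache effect enters the estimator only through the effective input token count, a minimal sketch:

```python
def effective_input_tokens(input_tokens, cache_hit_rate):
    # Cached prefix tokens skip prefill compute entirely;
    # RTT, queue delay, and decode time are unaffected.
    return input_tokens * (1 - cache_hit_rate)

# A 4000-token prompt with a validated 60% hit rate leaves 1600 tokens of prefill work
remaining = effective_input_tokens(4000, 0.6)
```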
Limit max output tokens, reduce prompt length, and route traffic to a closer region. If queue delay is dominant, add capacity or throttle bursts. For tool-heavy flows, parallelize independent calls and cache stable tool responses where possible.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.