Calculator Inputs
Formula Used
compute_ms = (flops_per_token × total_tokens) ÷ (device_tflops × 1e12 × util_pct/100) × 1000 ÷ batch_size
P50_ms = queue_ms + rtt_ms + service_ms
P95_ms = P50_ms × (1 + jitter_pct/100)
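The formulas above can be sketched directly in Python; the function and parameter names below mirror the calculator's inputs but are otherwise illustrative.

```python
def compute_ms(flops_per_token, total_tokens, device_tflops, util_pct, batch_size):
    """Compute time per request in milliseconds (FLOPs mode)."""
    effective_flops = device_tflops * 1e12 * (util_pct / 100)
    return flops_per_token * total_tokens / effective_flops * 1000 / batch_size

def latency_percentiles(queue_ms, rtt_ms, service_ms, jitter_pct):
    """Return (P50, P95) using the simple jitter model above."""
    p50 = queue_ms + rtt_ms + service_ms
    p95 = p50 * (1 + jitter_pct / 100)
    return p50, p95
```

For example, 1.4e10 FLOPs/token over 100 tokens on a 100 TFLOPS device at 50% utilization gives `compute_ms(...) = 28 ms` at batch size 1.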
How to Use This Calculator
- Pick an estimate mode that matches your available measurements.
- Enter tokens, utilization, and realistic overhead timings.
- Fill network RTT, payload size, and bandwidth values.
- Press Submit to view results above the form.
- Compare multiple scenarios and export CSV or PDF reports.
Example Data Table
| Scenario | Tokens (in/out) | RTT (ms) | Throughput (tokens/s) | P50 (ms) | P95 (ms) |
|---|---|---|---|---|---|
| Edge GPU, low RTT | 256 / 64 | 8 | 520 | ~44 | ~51 |
| Cloud GPU, moderate RTT | 512 / 128 | 20 | 450 | ~78 | ~90 |
| CPU-only, high RTT | 512 / 128 | 45 | 75 | ~320 | ~368 |
The example values are illustrative and depend on overhead, payload, and utilization.
Why Inference Latency Matters in Production
Inference latency is the delay between a request and a usable response. Many interactive systems target 50–150 ms at P50, while real-time loops may need under 30 ms. When latency grows, abandonment and retries increase, raising load and cost. This calculator quantifies where time is spent, helping you set practical objectives and track improvements across model, runtime, and network changes.
Breaking Down Service Time with Measurable Inputs
Service time sums pre-processing, compute, post-processing, framework overhead, and transfer time. Pre- and post-processing often add 5–20 ms each from tokenization, validation, logging, and formatting. Transfer time depends on payload size and bandwidth; a 64 KB payload over 50 Mbps adds about 10 ms. Transfer time excludes RTT but includes serialization; TLS and JSON encoding can add 1–5 ms, especially on small CPUs. Enter realistic values to replace “mystery latency” with an actionable breakdown.
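The transfer-time relationship is easy to check numerically; this sketch (function name is illustrative) reproduces the 64 KB over 50 Mbps figure:

```python
def transfer_ms(payload_kb, bandwidth_mbps):
    """Wire time for a payload in milliseconds, excluding RTT."""
    bits = payload_kb * 1024 * 8          # KB -> bits
    return bits / (bandwidth_mbps * 1e6) * 1000

# 64 KB over 50 Mbps comes out to roughly 10.5 ms
```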
Interpreting Throughput Versus FLOPs Estimates
Throughput mode uses tokens per second and is best when you have benchmarks from the same runtime. FLOPs mode helps earlier, using parameters and device compute. The calculator approximates FLOPs per token as 2 × parameters and scales by utilization. Because precision, memory bandwidth, and batching change efficiency, treat FLOPs results as sizing guidance, then switch to measured throughput.
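The FLOPs-mode approximation can be written out as a quick sizing helper; the function name and the 7B/100 TFLOPS example below are hypothetical, not taken from any specific benchmark.

```python
def flops_mode_tokens_per_s(params, device_tflops, util_pct):
    """Approximate decode throughput from parameter count alone."""
    flops_per_token = 2 * params                       # ~2 FLOPs per parameter per token
    effective_flops = device_tflops * 1e12 * util_pct / 100
    return effective_flops / flops_per_token

# A 7B-parameter model on a 100 TFLOPS device at 40% utilization:
# 40e12 / 14e9 ≈ 2857 tokens/s. Treat this as an upper bound for sizing;
# memory bandwidth usually dominates decode, so measured throughput is lower.
```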
Queueing, Concurrency, and Capacity Planning
Queue time is waiting before execution and usually spikes near saturation. The capacity estimate divides 1000 by service time and multiplies by concurrency to approximate requests per second per instance. If arrival rate exceeds capacity, queueing grows nonlinearly; even a 10% traffic spike can double wait time on an instance. If P50 looks healthy but queue time dominates, adding replicas or reducing per-request work can beat micro-optimizing kernels. Use this view to keep utilization in a safer 60–80% band.
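The capacity rule of thumb above can be expressed as two small helpers (names are illustrative), including a utilization check against the suggested 60–80% band:

```python
def capacity_rps(service_ms, concurrency):
    """Approximate steady-state requests/s for one instance."""
    return 1000 / service_ms * concurrency

def utilization_pct(arrival_rps, service_ms, concurrency):
    """Offered load as a percentage of estimated capacity."""
    return arrival_rps / capacity_rps(service_ms, concurrency) * 100

# 100 ms service time with concurrency 4 -> ~40 req/s per instance;
# 28 req/s of arrivals would put that instance at 70% utilization.
```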
Using Scenario Comparison to Reduce Tail Latency
Tail latency shapes perceived quality because slow outliers are remembered. This calculator estimates P95 by applying a jitter percentage to P50; raise it for variable scheduling or shared networks. Compare scenarios: lower RTT with edge deployment, smaller payloads with compression, or fewer output tokens with strict limits. Export CSV snapshots and keep a baseline to verify improvements persist.
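A scenario comparison with CSV export can be sketched as below; the scenario fields match the calculator's formula inputs, while the function name and file layout are assumptions of this sketch.

```python
import csv

def compare_scenarios(scenarios, path="scenarios.csv"):
    """scenarios: list of dicts with name, queue_ms, rtt_ms, service_ms, jitter_pct.
    Writes a CSV snapshot and returns the computed rows."""
    rows = []
    for s in scenarios:
        p50 = s["queue_ms"] + s["rtt_ms"] + s["service_ms"]
        p95 = p50 * (1 + s["jitter_pct"] / 100)
        rows.append({"name": s["name"],
                     "p50_ms": round(p50, 1),
                     "p95_ms": round(p95, 1)})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "p50_ms", "p95_ms"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Keeping the baseline scenario in the same CSV makes regressions easy to spot in later snapshots.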
FAQs
What value should I enter for throughput?
Use a measured tokens-per-second value from your serving stack under realistic batch and sequence lengths. Run for several minutes, discard the warmup period, and report the median. If you only have GPU kernel benchmarks, start with a conservative number and refine after profiling end-to-end.
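The discard-warmup-then-take-the-median step can be sketched like this (function name and warmup window are illustrative):

```python
from statistics import median

def measured_throughput(samples_tok_per_s, warmup=10):
    """Drop the first `warmup` samples, return the median tokens/s."""
    steady = samples_tok_per_s[warmup:]
    if not steady:
        raise ValueError("need more samples than the warmup window")
    return median(steady)
```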
Why does batch size mainly affect FLOPs mode?
Throughput mode already captures batching in the measured tokens per second. FLOPs mode is a sizing approximation, so batch size is used to amortize compute across a batch. Real systems still trade latency for throughput as batching increases.
How do I estimate payload size in KB?
Sum request and response bodies after serialization, including headers if they are significant. For JSON, measure the byte length of typical prompts and responses. Compression lowers transfer time but may add CPU overhead; reflect that in pre or post time.
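For JSON bodies, the byte-length measurement can be done in a few lines; the function name is illustrative.

```python
import json

def payload_kb(request_obj, response_obj):
    """Serialized size of request + response bodies, in KB."""
    size = (len(json.dumps(request_obj).encode("utf-8"))
            + len(json.dumps(response_obj).encode("utf-8")))
    return size / 1024
```

Measure typical prompts and responses rather than worst cases, and re-measure if you enable compression, since the on-wire size then differs from the serialized size.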
Does the calculator include cold starts?
Not by default. Add cold-start costs to framework overhead or queue time, and test separate scenarios for warm and cold traffic. For serverless or autoscaling, track how often cold starts occur and design limits around their tail impact.
How can I reduce P95 more than P50?
Attack variability: lower queueing by keeping utilization below saturation, pin CPU cores for preprocessing, and reduce network jitter by co-locating services. Cap output tokens to limit long generations. Then re-check jitter percentage to see the improvement.
Is the capacity estimate exact for my cluster?
It is a directional estimate for one instance under steady load. Real capacity depends on scheduling, memory limits, request mix, and autoscaling policies. Use it to compare configurations, then validate with load tests and production telemetry.