Inference Latency Calculator

Tune inputs to reflect your model and hardware. See the latency breakdown instantly after every submission. Download reports, validate assumptions, and reduce slow requests today.

Calculator Inputs

  - Estimate mode: Choose the model you can measure most reliably.
  - Input tokens: Prompt tokens processed before generation begins.
  - Output tokens: Tokens generated for the response.
  - Batch size: Used to amortize compute in FLOPs mode.
  - Concurrency: Parallel lanes used for capacity estimation.
  - Utilization (%): Applied as effective throughput/compute scaling.
  - Pre-processing (ms): Tokenization, validation, routing.
  - Post-processing (ms): Decoding, formatting, logging.
  - Queue time (ms): Waiting time before service begins.
  - Framework overhead (ms): Runtime overhead and scheduler costs.
  - Network RTT (ms): Round-trip delay between client and service.
  - Payload size (KB): Request + response payload over the network.
  - Bandwidth (Mbps): Used to estimate transfer time from payload size.
  - Jitter (%): P95 = P50 × (1 + jitter).
  - Throughput (tokens/s): Effective tokens per second at your chosen setup.
  - Model parameters (B): Approximate parameter count in billions.
  - Device compute (TFLOPs): Sustained compute for your precision and runtime.

Tip: Use measured throughput when available. Use FLOPs mode for early sizing.

Formula Used

Transfer time
transfer_ms = payload_kb × 1024 × 8 ÷ (bandwidth_mbps × 1e6) × 1000
Throughput mode compute
compute_ms = (tokens_in + tokens_out) ÷ (throughput_tps × util_pct/100) × 1000
FLOPs mode compute (rough)
flops_per_token ≈ 2 × params_b × 1e9
compute_ms = (flops_per_token × total_tokens) ÷ (device_tflops × 1e12 × util_pct/100) × 1000 ÷ batch_size
This is a sizing approximation and can differ from measured runtimes.
End-to-end latency
service_ms = pre_ms + compute_ms + post_ms + over_ms + transfer_ms
P50_ms = queue_ms + rtt_ms + service_ms
P95_ms = P50_ms × (1 + jitter_pct/100)
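The throughput-mode formulas above can be sketched as a small script. Variable names mirror the formulas; the function name and the example input values are illustrative, not measurements.

```python
def latency_ms(tokens_in, tokens_out, throughput_tps, util_pct,
               pre_ms, post_ms, over_ms, queue_ms, rtt_ms,
               payload_kb, bandwidth_mbps, jitter_pct):
    """Throughput-mode P50/P95 estimate using the formulas above."""
    # Network transfer from payload size and bandwidth
    transfer_ms = payload_kb * 1024 * 8 / (bandwidth_mbps * 1e6) * 1000
    # Compute time from measured tokens/s scaled by utilization
    compute_ms = (tokens_in + tokens_out) / (throughput_tps * util_pct / 100) * 1000
    service_ms = pre_ms + compute_ms + post_ms + over_ms + transfer_ms
    p50_ms = queue_ms + rtt_ms + service_ms
    p95_ms = p50_ms * (1 + jitter_pct / 100)
    return p50_ms, p95_ms

# Illustrative inputs only; substitute your own measurements.
p50, p95 = latency_ms(tokens_in=512, tokens_out=128, throughput_tps=450,
                      util_pct=100, pre_ms=5, post_ms=5, over_ms=2,
                      queue_ms=1, rtt_ms=20, payload_kb=32,
                      bandwidth_mbps=100, jitter_pct=15)
```

Changing one input at a time is a quick way to see which term dominates the estimate.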

How to Use This Calculator

  1. Pick an estimate mode that matches your available measurements.
  2. Enter tokens, utilization, and realistic overhead timings.
  3. Fill network RTT, payload size, and bandwidth values.
  4. Press Submit to view results above the form.
  5. Compare multiple scenarios and export CSV or PDF reports.

Example Data Table

Scenario                | Tokens (in/out) | RTT (ms) | Throughput (tokens/s) | P50 (ms) | P95 (ms)
Edge GPU, low RTT       | 256 / 64        | 8        | 520                   | ~44      | ~51
Cloud GPU, moderate RTT | 512 / 128       | 20       | 450                   | ~78      | ~90
CPU-only, high RTT      | 512 / 128       | 45       | 75                    | ~320     | ~368

The example values are illustrative and depend on overhead, payload, and utilization.

Why Inference Latency Matters in Production

Inference latency is the delay between a request and a usable response. Many interactive systems target 50–150 ms at P50, while real-time loops may need under 30 ms. When latency grows, abandonment and retries increase, raising load and cost. This calculator quantifies where time is spent, helping you set practical objectives and track improvements across model, runtime, and network changes.

Breaking Down Service Time with Measurable Inputs

Service time sums pre-processing, compute, post-processing, framework overhead, and transfer time. Pre and post steps often add 5–20 ms each from tokenization, validation, logging, and formatting. Transfer time depends on payload size and bandwidth; a 64 KB payload over 50 Mbps adds about 10 ms. Network transfer excludes RTT, but includes serialization; TLS and JSON encoding can add 1–5 ms, especially on small CPUs. Enter realistic values to replace “mystery latency” with an actionable breakdown.
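The 64 KB over 50 Mbps figure above follows directly from the transfer formula:

```python
payload_kb, bandwidth_mbps = 64, 50
# bits transferred / link rate, converted to milliseconds
transfer_ms = payload_kb * 1024 * 8 / (bandwidth_mbps * 1e6) * 1000
print(round(transfer_ms, 1))  # prints 10.5
```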

Interpreting Throughput Versus FLOPs Estimates

Throughput mode uses tokens per second and is best when you have benchmarks from the same runtime. FLOPs mode helps earlier, using parameters and device compute. The calculator approximates FLOPs per token as 2 × parameters and scales by utilization. Because precision, memory bandwidth, and batching change efficiency, treat FLOPs results as sizing guidance, then switch to measured throughput.
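FLOPs-mode sizing can be sketched as follows; the function name and the example model/device numbers (a 7B model on 40 sustained TFLOPs) are assumptions for illustration.

```python
def flops_compute_ms(tokens_in, tokens_out, params_b, device_tflops,
                     util_pct=100, batch_size=1):
    """Rough sizing: ~2 FLOPs per parameter per generated/processed token."""
    flops_per_token = 2 * params_b * 1e9
    total_tokens = tokens_in + tokens_out
    return (flops_per_token * total_tokens) \
        / (device_tflops * 1e12 * util_pct / 100) * 1000 / batch_size

# 7B-parameter model, 640 total tokens, 40 sustained TFLOPs (illustrative)
estimate = flops_compute_ms(512, 128, params_b=7, device_tflops=40)  # 224.0 ms
```

Because memory bandwidth and batching dominate real decode speed, treat this as an upper-level sizing number, not a prediction.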

Queueing, Concurrency, and Capacity Planning

Queue time is waiting before execution and usually spikes near saturation. The capacity estimate divides 1000 by service time and multiplies by concurrency to approximate requests per second per instance. If arrival rate exceeds capacity, queueing grows nonlinearly; even a 10% traffic spike can double wait time on an instance. If P50 looks healthy but queue time dominates, adding replicas or reducing per-request work can beat micro-optimizing kernels. Use this view to keep utilization in a safer 60–80% band.
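The per-instance capacity view described above, as a minimal sketch (the 70% target is one point inside the 60–80% band, not a rule):

```python
def capacity_rps(service_ms, concurrency):
    """Approximate requests/s for one instance: 1000 / service time × lanes."""
    return 1000 / service_ms * concurrency

peak = capacity_rps(50, 4)        # 80.0 requests/s at full utilization
target = 0.7 * peak               # plan arrivals near ~70% of peak
```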

Using Scenario Comparison to Reduce Tail Latency

Tail latency shapes perceived quality because slow outliers are remembered. This calculator estimates P95 by applying a jitter percentage to P50; raise it for variable scheduling or shared networks. Compare scenarios: lower RTT with edge deployment, smaller payloads with compression, or fewer output tokens with strict limits. Export CSV snapshots and keep a baseline to verify improvements persist.
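A scenario comparison like the example table can be scripted once you have P50 estimates; the scenario names and the shared 15% jitter below are assumptions chosen to reproduce the table's P95 column.

```python
import csv, io

scenarios = [  # (name, p50_ms, jitter_pct) -- illustrative values
    ("edge-low-rtt", 44, 15),
    ("cloud-moderate-rtt", 78, 15),
    ("cpu-high-rtt", 320, 15),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["scenario", "p50_ms", "p95_ms"])
for name, p50, jitter in scenarios:
    # P95 = P50 × (1 + jitter/100), per the formula section
    writer.writerow([name, p50, round(p50 * (1 + jitter / 100), 1)])
print(buf.getvalue())
```

Keeping a baseline CSV next to each new export makes regressions easy to spot.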

FAQs

What value should I enter for throughput?

Use a measured tokens-per-second value from your serving stack under realistic batch and sequence lengths. Run for several minutes, discard warmup samples, and report the median. If you only have GPU kernel benchmarks, start with a conservative number and refine after profiling end-to-end.
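One way to turn raw per-window token counts into the median throughput described above (the sample counts and window length are hypothetical):

```python
from statistics import median

def measured_tps(token_counts, window_s, warmup=3):
    """Median tokens/s across sampling windows, discarding warmup windows."""
    rates = [n / window_s for n in token_counts[warmup:]]
    return median(rates)

# Ten one-second windows of generated-token counts; first three are warmup.
tps = measured_tps([120, 300, 410, 455, 460, 448, 452, 458, 450, 449],
                   window_s=1.0)  # 452.0 tokens/s
```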

Why does batch size mainly affect FLOPs mode?

Throughput mode already captures batching in the measured tokens per second. FLOPs mode is a sizing approximation, so batch size is used to amortize compute across a batch. Real systems still trade latency for throughput as batching increases.

How do I estimate payload size in KB?

Sum request and response bodies after serialization, including headers if they are significant. For JSON, measure the byte length of typical prompts and responses. Compression lowers transfer time but may add CPU overhead; reflect that in pre or post time.
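Measuring serialized byte length is straightforward; the request and response bodies below are hypothetical placeholders for your actual payloads.

```python
import json

request = {"prompt": "Summarize the quarterly report in three bullets.",
           "max_tokens": 128}
response = {"text": "Revenue grew; costs held flat; margin improved.",
            "usage": {"input": 12, "output": 64}}

# Bytes on the wire after JSON serialization (headers not included here)
payload_bytes = (len(json.dumps(request).encode())
                 + len(json.dumps(response).encode()))
payload_kb = payload_bytes / 1024
```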

Does the calculator include cold starts?

Not by default. Add cold-start costs to framework overhead or queue time, and test separate scenarios for warm and cold traffic. For serverless or autoscaling, track how often cold starts occur and design limits around their tail impact.

How can I reduce P95 more than P50?

Attack variability: lower queueing by keeping utilization below saturation, pin CPU cores for preprocessing, and reduce network jitter by co-locating services. Cap output tokens to limit long generations. Then re-check jitter percentage to see the improvement.

Is the capacity estimate exact for my cluster?

It is a directional estimate for one instance under steady load. Real capacity depends on scheduling, memory limits, request mix, and autoscaling policies. Use it to compare configurations, then validate with load tests and production telemetry.

Related Calculators

Model Training Time · Learning Rate Finder · Parameter Count Calculator · Dataset Split Calculator · Epoch Time Estimator · Cloud GPU Cost · Throughput Calculator · Memory Footprint Calculator · Latency Budget Planner · Model Compression Ratio

Important Note: All the calculators listed on this site are for educational purposes only and we do not guarantee the accuracy of results. Please consult other sources as well.