Calculator Inputs
Formula Used
compute_ms = (flops_per_token × total_tokens) ÷ (device_tflops × 1e12 × util_pct/100) × 1000 ÷ batch_size
P50_ms = queue_ms + rtt_ms + service_ms
P95_ms = P50_ms × (1 + jitter_pct/100)
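The formulas above can be sketched directly in Python; the function and parameter names below mirror the calculator's inputs but are otherwise illustrative.

```python
def compute_ms(flops_per_token, total_tokens, device_tflops, util_pct, batch_size):
    """Compute time per request in milliseconds (FLOPs mode)."""
    effective_flops = device_tflops * 1e12 * (util_pct / 100)
    return flops_per_token * total_tokens / effective_flops * 1000 / batch_size

def latency_percentiles(queue_ms, rtt_ms, service_ms, jitter_pct):
    """Return (P50, P95) using the simple jitter model above."""
    p50 = queue_ms + rtt_ms + service_ms
    p95 = p50 * (1 + jitter_pct / 100)
    return p50, p95
```

For example, 1.4e10 FLOPs/token over 100 tokens on a 100 TFLOPS device at 50% utilization gives `compute_ms(...) = 28 ms` at batch size 1.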
How to Use This Calculator
- Pick an estimate mode that matches your available measurements.
- Enter tokens, utilization, and realistic overhead timings.
- Fill network RTT, payload size, and bandwidth values.
- Press Submit to view results above the form.
- Compare multiple scenarios and export CSV or PDF reports.
Example Data Table
| Scenario | Tokens (in/out) | RTT (ms) | Throughput (tokens/s) | P50 (ms) | P95 (ms) |
|---|---|---|---|---|---|
| Edge GPU, low RTT | 256 / 64 | 8 | 520 | ~44 | ~51 |
| Cloud GPU, moderate RTT | 512 / 128 | 20 | 450 | ~78 | ~90 |
| CPU-only, high RTT | 512 / 128 | 45 | 75 | ~320 | ~368 |
The example values are illustrative and depend on overhead, payload, and utilization.
Why Inference Latency Matters in Production
Inference latency is the delay between a request and a usable response. Many interactive systems target 50–150 ms at P50, while real-time loops may need under 30 ms. When latency grows, abandonment and retries increase, raising load and cost. This calculator quantifies where time is spent, helping you set practical objectives and track improvements across model, runtime, and network changes.
Breaking Down Service Time with Measurable Inputs
Service time sums pre-processing, compute, post-processing, framework overhead, and transfer time. Pre- and post-processing often add 5–20 ms each from tokenization, validation, logging, and formatting. Transfer time depends on payload size and bandwidth; a 64 KB payload over 50 Mbps adds about 10 ms. Transfer time excludes RTT but includes serialization; TLS and JSON encoding can add 1–5 ms, especially on small CPUs. Enter realistic values to replace “mystery latency” with an actionable breakdown.
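The transfer-time relationship is easy to check numerically; this sketch (function name is illustrative) reproduces the 64 KB over 50 Mbps figure:

```python
def transfer_ms(payload_kb, bandwidth_mbps):
    """Wire time for a payload in milliseconds, excluding RTT."""
    bits = payload_kb * 1024 * 8          # KB -> bits
    return bits / (bandwidth_mbps * 1e6) * 1000

# 64 KB over 50 Mbps comes out to roughly 10.5 ms
```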
Interpreting Throughput Versus FLOPs Estimates
Throughput mode uses tokens per second and is best when you have benchmarks from the same runtime. FLOPs mode helps earlier, using parameters and device compute. The calculator approximates FLOPs per token as 2 × parameters and scales by utilization. Because precision, memory bandwidth, and batching change efficiency, treat FLOPs results as sizing guidance, then switch to measured throughput.
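The FLOPs-mode approximation can be written out as a quick sizing helper; the function name and the 7B/100 TFLOPS example below are hypothetical, not taken from any specific benchmark.

```python
def flops_mode_tokens_per_s(params, device_tflops, util_pct):
    """Approximate decode throughput from parameter count alone."""
    flops_per_token = 2 * params                       # ~2 FLOPs per parameter per token
    effective_flops = device_tflops * 1e12 * util_pct / 100
    return effective_flops / flops_per_token

# A 7B-parameter model on a 100 TFLOPS device at 40% utilization:
# 40e12 / 14e9 ≈ 2857 tokens/s. Treat this as an upper bound for sizing;
# memory bandwidth usually dominates decode, so measured throughput is lower.
```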
Queueing, Concurrency, and Capacity Planning
Queue time is waiting before execution and usually spikes near saturation. The capacity estimate divides 1000 by service time and multiplies by concurrency to approximate requests per second per instance. If arrival rate exceeds capacity, queueing grows nonlinearly; even a 10% traffic spike can double wait time on an instance. If P50 looks healthy but queue time dominates, adding replicas or reducing per-request work can beat micro-optimizing kernels. Use this view to keep utilization in a safer 60–80% band.
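The capacity rule of thumb above can be expressed as two small helpers (names are illustrative), including a utilization check against the suggested 60–80% band:

```python
def capacity_rps(service_ms, concurrency):
    """Approximate steady-state requests/s for one instance."""
    return 1000 / service_ms * concurrency

def utilization_pct(arrival_rps, service_ms, concurrency):
    """Offered load as a percentage of estimated capacity."""
    return arrival_rps / capacity_rps(service_ms, concurrency) * 100

# 100 ms service time with concurrency 4 -> ~40 req/s per instance;
# 28 req/s of arrivals would put that instance at 70% utilization.
```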
Using Scenario Comparison to Reduce Tail Latency
Tail latency shapes perceived quality because slow outliers are remembered. This calculator estimates P95 by applying a jitter percentage to P50; raise it for variable scheduling or shared networks. Compare scenarios: lower RTT with edge deployment, smaller payloads with compression, or fewer output tokens with strict limits. Export CSV snapshots and keep a baseline to verify improvements persist.
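A scenario comparison with CSV export can be sketched as below; the scenario fields match the calculator's formula inputs, while the function name and file layout are assumptions of this sketch.

```python
import csv

def compare_scenarios(scenarios, path="scenarios.csv"):
    """scenarios: list of dicts with name, queue_ms, rtt_ms, service_ms, jitter_pct.
    Writes a CSV snapshot and returns the computed rows."""
    rows = []
    for s in scenarios:
        p50 = s["queue_ms"] + s["rtt_ms"] + s["service_ms"]
        p95 = p50 * (1 + s["jitter_pct"] / 100)
        rows.append({"name": s["name"],
                     "p50_ms": round(p50, 1),
                     "p95_ms": round(p95, 1)})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "p50_ms", "p95_ms"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Keeping the baseline scenario in the same CSV makes regressions easy to spot in later snapshots.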
FAQs
What value should I enter for throughput?
Use a measured tokens-per-second value from your serving stack under realistic batch and sequence lengths. Run for several minutes, discard the warmup period, and report the median. If you only have GPU kernel benchmarks, start with a conservative number and refine after profiling end-to-end.
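The discard-warmup-then-take-the-median step can be sketched like this (function name and warmup window are illustrative):

```python
from statistics import median

def measured_throughput(samples_tok_per_s, warmup=10):
    """Drop the first `warmup` samples, return the median tokens/s."""
    steady = samples_tok_per_s[warmup:]
    if not steady:
        raise ValueError("need more samples than the warmup window")
    return median(steady)
```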
Why does batch size mainly affect FLOPs mode?
Throughput mode already captures batching in the measured tokens per second. FLOPs mode is a sizing approximation, so batch size is used to amortize compute across a batch. Real systems still trade latency for throughput as batching increases.
How do I estimate payload size in KB?
Sum request and response bodies after serialization, including headers if they are significant. For JSON, measure the byte length of typical prompts and responses. Compression lowers transfer time but may add CPU overhead; reflect that in pre or post time.
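For JSON bodies, the byte-length measurement can be done in a few lines; the function name is illustrative.

```python
import json

def payload_kb(request_obj, response_obj):
    """Serialized size of request + response bodies, in KB."""
    size = (len(json.dumps(request_obj).encode("utf-8"))
            + len(json.dumps(response_obj).encode("utf-8")))
    return size / 1024
```

Measure typical prompts and responses rather than worst cases, and re-measure if you enable compression, since the on-wire size then differs from the serialized size.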
Does the calculator include cold starts?
Not by default. Add cold-start costs to framework overhead or queue time, and test separate scenarios for warm and cold traffic. For serverless or autoscaling, track how often cold starts occur and design limits around their tail impact.
How can I reduce P95 more than P50?
Attack variability: lower queueing by keeping utilization below saturation, pin CPU cores for preprocessing, and reduce network jitter by co-locating services. Cap output tokens to limit long generations. Then re-check jitter percentage to see the improvement.
Is the capacity estimate exact for my cluster?
It is a directional estimate for one instance under steady load. Real capacity depends on scheduling, memory limits, request mix, and autoscaling policies. Use it to compare configurations, then validate with load tests and production telemetry.