Example Data
| Scenario | Target RPS | Payload (KB) | Service Time (ms) | Cores | Bandwidth (Mbps) | Safety (%) | Typical Output |
|---|---|---|---|---|---|---|---|
| Public API, moderate payloads | 1200 | 8 | 35 | 4 | 200 | 20 | 2–4 instances, CPU-bound risk |
| High payload, network sensitive | 800 | 40 | 25 | 8 | 150 | 25 | 3–6 instances, bandwidth bound |
| Low latency, cache heavy | 2500 | 4 | 18 | 8 | 300 | 15 | 2–3 instances, concurrency tuned |
Use the “Load Example” button to prefill realistic inputs and generate results.
Formulas Used
- Retry load multiplier: L = 1 + (retry% / 100)
- Effective service time: S = processingMs × (1 − 0.5 × cacheHit% / 100)
- Little’s Law concurrency: C = (targetRps × L) × (S / 1000)
- CPU-limited RPS per instance: R_cpu = (cores × 1000) / cpuMs × (1 − safety% / 100)
- Bandwidth-limited RPS per instance: payloadEffKB = (payloadKB × (1 + overhead% / 100)) / compressionRatio; R_bw = (bandwidthMbps × 1024) / (payloadEffKB × 8) × (1 − safety% / 100)
- Memory-limited concurrency: C_mem = (memGB × 1024 × (1 − safety% / 100)) / memPerReqMB
- Memory-limited RPS: R_mem = C_mem / (S / 1000)
- Per-instance sustainable RPS: R_inst = min(R_cpu, R_bw, R_mem, R_conn), where R_conn is the RPS ceiling implied by any per-instance connection limit
- Required instances: N = ceil((targetRps × L) / R_inst)
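The formulas above can be sketched as a single planner function. The parameter names mirror the calculator's fields, and `r_conn` stands in for whatever per-instance connection cap you configure; the example values are illustrative, not the tool's actual code.

```python
import math

def plan_capacity(target_rps, retry_pct, processing_ms, cache_hit_pct,
                  cores, cpu_ms, bandwidth_mbps, payload_kb, overhead_pct,
                  compression_ratio, mem_gb, mem_per_req_mb, r_conn, safety_pct):
    """Sketch of the sizing formulas; names and defaults are illustrative."""
    load_mult = 1 + retry_pct / 100                        # retry load multiplier L
    service_ms = processing_ms * (1 - 0.5 * cache_hit_pct / 100)  # effective S
    concurrency = (target_rps * load_mult) * (service_ms / 1000)  # Little's Law C

    headroom = 1 - safety_pct / 100
    r_cpu = (cores * 1000) / cpu_ms * headroom             # CPU-limited RPS
    payload_eff_kb = payload_kb * (1 + overhead_pct / 100) / compression_ratio
    r_bw = (bandwidth_mbps * 1024) / (payload_eff_kb * 8) * headroom
    c_mem = (mem_gb * 1024 * headroom) / mem_per_req_mb    # memory-limited concurrency
    r_mem = c_mem / (service_ms / 1000)                    # memory-limited RPS

    r_inst = min(r_cpu, r_bw, r_mem, r_conn)               # per-instance sustainable RPS
    instances = math.ceil((target_rps * load_mult) / r_inst)
    return {"concurrency": concurrency, "r_cpu": r_cpu, "r_bw": r_bw,
            "r_mem": r_mem, "r_inst": r_inst, "instances": instances}

# Example: roughly the "Public API, moderate payloads" row, with assumed
# CPU time (5 ms), overhead (10%), memory (8 GB, 4 MB/req), and 5000-conn cap.
plan = plan_capacity(target_rps=1200, retry_pct=3, processing_ms=35,
                     cache_hit_pct=0, cores=4, cpu_ms=5, bandwidth_mbps=200,
                     payload_kb=8, overhead_pct=10, compression_ratio=1.0,
                     mem_gb=8, mem_per_req_mb=4, r_conn=5000, safety_pct=20)
```

With these assumed inputs the CPU limit (640 RPS per instance) is the binding constraint, so the plan lands at 2 instances, consistent with the table's "CPU-bound risk" note.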
How to Use This Calculator
- Enter your target RPS, payload size, and average service time.
- Add retries, cache hit rate, and protocol overhead for realism.
- Set per-instance limits: cores, CPU per request, bandwidth, and memory.
- Choose a safety margin to preserve headroom during variability.
- Click Submit and review the bottleneck, concurrency, and instance count.
- Export the plan as CSV or PDF for reviews and runbooks.
Operational Brief
Workload Modeling and Traffic Shape
Throughput planning starts with a clear request-rate target, plus realistic retry and burst behavior. This calculator converts your steady RPS goal into an effective required RPS by applying a retry multiplier. Use production logs to separate baseline traffic from flash spikes, and consider diurnal patterns. If your API serves mixed endpoints, run the planner per critical route and weight results by route share.
Concurrency and Latency Guardrails
Concurrency is the hidden driver of queueing and tail latency. Using Little’s Law, required in-flight requests equal effective RPS multiplied by effective service time. When the latency target is close to service time, even modest utilization pushes P95 upward. Keep headroom so that GC pauses, lock contention, and noisy neighbors do not turn short stalls into cascading retries.
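The Little's Law step can be checked by hand. A minimal sketch, using the first table row's target RPS and service time plus an assumed 3% retry rate:

```python
def required_concurrency(target_rps, retry_pct, service_ms):
    """In-flight requests needed: effective RPS x effective service time (s)."""
    effective_rps = target_rps * (1 + retry_pct / 100)
    return effective_rps * (service_ms / 1000)

# 1200 RPS with 3% retries and 35 ms service time needs ~43 requests in flight
concurrency = required_concurrency(1200, 3, 35)
```

If your thread pool or connection pool is sized below this number, requests queue and tail latency climbs even though average utilization looks healthy.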
CPU Capacity and Instance Sizing
CPU capacity is estimated from cores and CPU time per request. A smaller CPU-per-request number usually comes from faster code paths, reduced serialization, fewer allocations, and efficient database access. Because cache hits often bypass heavy work, the planner reduces effective service and CPU time as cache improves. Validate the assumed CPU time with profiling under load, not with idle benchmarks.
Network and Payload Efficiency
Bandwidth becomes the limiting factor when payloads grow. The tool models protocol overhead and compression, producing an effective payload per call. Reducing payload size, enabling keep-alive, and trimming headers can lift sustainable RPS without more instances. If compression is aggressive, verify that added CPU cost does not simply move the bottleneck from network to compute.
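The bandwidth ceiling follows directly from the effective payload formula. A small sketch, with the second table row's bandwidth, payload, and safety margin plus assumed overhead (10%) and compression ratio (2.0):

```python
def bandwidth_limited_rps(bandwidth_mbps, payload_kb, overhead_pct,
                          compression_ratio, safety_pct):
    """RPS ceiling from the network link: capacity over effective bits per call."""
    payload_eff_kb = payload_kb * (1 + overhead_pct / 100) / compression_ratio
    link_kbps = bandwidth_mbps * 1024          # link capacity in kilobits/s
    return link_kbps / (payload_eff_kb * 8) * (1 - safety_pct / 100)

# 150 Mbps link, 40 KB payload, 10% overhead, 2:1 compression, 25% safety
r_bw = bandwidth_limited_rps(150, 40, 10, 2.0, 25)
```

Doubling the compression ratio doubles this ceiling, which is why the surrounding text warns that aggressive compression can simply shift the bottleneck onto CPU.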
Operational Headroom and Validation
The safety margin reserves capacity for traffic variance, deployment events, and incident response. Pick higher margins for multi-tenant clusters, batch jobs, unstable dependencies, and regional failover drills. After sizing, validate with staged load tests: confirm saturation points, watch error rates, and measure P95/P99. Re-run the planner when you change payloads, caching, timeouts, or autoscaling policy.
FAQs
What does “effective required RPS” mean?
It is your target request rate adjusted for retries. If retries average 3%, the effective load becomes target RPS × 1.03, which better reflects real traffic hitting the service.
How should I estimate CPU time per request?
Use profiling from representative load tests. Measure CPU per request on a warm system with realistic caches and dependencies, then enter the average. Re-check after major code or library upgrades.
Why does the tool compute concurrency?
Concurrency estimates in-flight requests needed to sustain your rate. High concurrency drives queueing, memory pressure, and tail latency. Planning for concurrency helps you size thread pools, connection limits, and memory safely.
When is bandwidth the bottleneck?
Bandwidth limits dominate when payloads are large or responses stream data. Compare bandwidth-limited RPS to CPU-limited RPS. If bandwidth-limited RPS is lower, reduce payload size or protocol overhead, or increase per-instance network capacity.
How do I pick a safety margin?
Start with 15–25% for steady workloads and strong observability. Increase it for bursty traffic, noisy multi-tenant clusters, frequent deployments, or dependency instability. Safety margin is cheaper than downtime.
Do I still need load testing after this?
Yes. This is a planning model, not a substitute for testing. Validate saturation points, error rates, and P95/P99 under staged traffic. Use results to refine service time, CPU, and retry assumptions.