Example Data
| Scenario | Target RPS | Payload (KB) | Service Time (ms) | Cores | Bandwidth (Mbps) | Safety (%) | Typical Output |
|---|---|---|---|---|---|---|---|
| Public API, moderate payloads | 1200 | 8 | 35 | 4 | 200 | 20 | 2–4 instances, CPU-bound risk |
| High payload, network sensitive | 800 | 40 | 25 | 8 | 150 | 25 | 3–6 instances, bandwidth bound |
| Low latency, cache heavy | 2500 | 4 | 18 | 8 | 300 | 15 | 2–3 instances, concurrency tuned |
Use the “Load Example” button to prefill realistic inputs and generate results.
Formulas Used
- Retry load multiplier: L = 1 + (retry% / 100)
- Effective service time: S = processingMs × (1 − 0.5 × cacheHit% / 100)
- Little’s Law concurrency: C = (targetRps × L) × (S / 1000)
- CPU-limited RPS per instance: R_cpu = (cores × 1000) / cpuMs × (1 − safety% / 100)
- Bandwidth-limited RPS per instance: payloadEffKB = (payloadKB × (1 + overhead% / 100)) / compressionRatio; R_bw = (bandwidthMbps × 1024) / (payloadEffKB × 8) × (1 − safety% / 100)
- Memory-limited concurrency: C_mem = (memGB × 1024 × (1 − safety% / 100)) / memPerReqMB
- Memory-limited RPS: R_mem = C_mem / (S / 1000)
- Per-instance sustainable RPS: R_inst = min(R_cpu, R_bw, R_mem, R_conn), where R_conn is the RPS ceiling implied by any per-instance connection limit
- Required instances: N = ceil((targetRps × L) / R_inst)
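The formulas above can be sketched as a single planner function. The parameter names mirror the calculator's fields, and `r_conn` stands in for whatever per-instance connection cap you configure; the example values are illustrative, not the tool's actual code.

```python
import math

def plan_capacity(target_rps, retry_pct, processing_ms, cache_hit_pct,
                  cores, cpu_ms, bandwidth_mbps, payload_kb, overhead_pct,
                  compression_ratio, mem_gb, mem_per_req_mb, r_conn, safety_pct):
    """Sketch of the sizing formulas; names and defaults are illustrative."""
    load_mult = 1 + retry_pct / 100                        # retry load multiplier L
    service_ms = processing_ms * (1 - 0.5 * cache_hit_pct / 100)  # effective S
    concurrency = (target_rps * load_mult) * (service_ms / 1000)  # Little's Law C

    headroom = 1 - safety_pct / 100
    r_cpu = (cores * 1000) / cpu_ms * headroom             # CPU-limited RPS
    payload_eff_kb = payload_kb * (1 + overhead_pct / 100) / compression_ratio
    r_bw = (bandwidth_mbps * 1024) / (payload_eff_kb * 8) * headroom
    c_mem = (mem_gb * 1024 * headroom) / mem_per_req_mb    # memory-limited concurrency
    r_mem = c_mem / (service_ms / 1000)                    # memory-limited RPS

    r_inst = min(r_cpu, r_bw, r_mem, r_conn)               # per-instance sustainable RPS
    instances = math.ceil((target_rps * load_mult) / r_inst)
    return {"concurrency": concurrency, "r_cpu": r_cpu, "r_bw": r_bw,
            "r_mem": r_mem, "r_inst": r_inst, "instances": instances}

# Example: roughly the "Public API, moderate payloads" row, with assumed
# CPU time (5 ms), overhead (10%), memory (8 GB, 4 MB/req), and 5000-conn cap.
plan = plan_capacity(target_rps=1200, retry_pct=3, processing_ms=35,
                     cache_hit_pct=0, cores=4, cpu_ms=5, bandwidth_mbps=200,
                     payload_kb=8, overhead_pct=10, compression_ratio=1.0,
                     mem_gb=8, mem_per_req_mb=4, r_conn=5000, safety_pct=20)
```

With these assumed inputs the CPU limit (640 RPS per instance) is the binding constraint, so the plan lands at 2 instances, consistent with the table's "CPU-bound risk" note.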
How to Use This Calculator
- Enter your target RPS, payload size, and average service time.
- Add retries, cache hit rate, and protocol overhead for realism.
- Set per-instance limits: cores, CPU per request, bandwidth, and memory.
- Choose a safety margin to preserve headroom during variability.
- Click Submit and review the bottleneck, concurrency, and instance count.
- Export the plan as CSV or PDF for reviews and runbooks.
Operational Brief
Workload Modeling and Traffic Shape
Throughput planning starts with a clear request-rate target, plus realistic retry and burst behavior. This calculator converts your steady RPS goal into an effective required RPS by applying a retry multiplier. Use production logs to separate baseline traffic from flash spikes, and consider diurnal patterns. If your API serves mixed endpoints, run the planner per critical route and weight results by route share.
Concurrency and Latency Guardrails
Concurrency is the hidden driver of queueing and tail latency. Using Little’s Law, required in-flight requests equal effective RPS multiplied by effective service time. When the latency target is close to service time, even modest utilization pushes P95 upward. Keep headroom so that GC pauses, lock contention, and noisy neighbors do not turn short stalls into cascading retries.
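The Little's Law step can be checked by hand. A minimal sketch, using the first table row's target RPS and service time plus an assumed 3% retry rate:

```python
def required_concurrency(target_rps, retry_pct, service_ms):
    """In-flight requests needed: effective RPS x effective service time (s)."""
    effective_rps = target_rps * (1 + retry_pct / 100)
    return effective_rps * (service_ms / 1000)

# 1200 RPS with 3% retries and 35 ms service time needs ~43 requests in flight
concurrency = required_concurrency(1200, 3, 35)
```

If your thread pool or connection pool is sized below this number, requests queue and tail latency climbs even though average utilization looks healthy.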
CPU Capacity and Instance Sizing
CPU capacity is estimated from cores and CPU time per request. A smaller CPU-per-request number usually comes from faster code paths, reduced serialization, fewer allocations, and efficient database access. Because cache hits often bypass heavy work, the planner reduces effective service and CPU time as cache improves. Validate the assumed CPU time with profiling under load, not with idle benchmarks.
Network and Payload Efficiency
Bandwidth becomes the limiting factor when payloads grow. The tool models protocol overhead and compression, producing an effective payload per call. Reducing payload size, enabling keep-alive, and trimming headers can lift sustainable RPS without more instances. If compression is aggressive, verify that added CPU cost does not simply move the bottleneck from network to compute.
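The bandwidth ceiling follows directly from the effective payload formula. A small sketch, with the second table row's bandwidth, payload, and safety margin plus assumed overhead (10%) and compression ratio (2.0):

```python
def bandwidth_limited_rps(bandwidth_mbps, payload_kb, overhead_pct,
                          compression_ratio, safety_pct):
    """RPS ceiling from the network link: capacity over effective bits per call."""
    payload_eff_kb = payload_kb * (1 + overhead_pct / 100) / compression_ratio
    link_kbps = bandwidth_mbps * 1024          # link capacity in kilobits/s
    return link_kbps / (payload_eff_kb * 8) * (1 - safety_pct / 100)

# 150 Mbps link, 40 KB payload, 10% overhead, 2:1 compression, 25% safety
r_bw = bandwidth_limited_rps(150, 40, 10, 2.0, 25)
```

Doubling the compression ratio doubles this ceiling, which is why the surrounding text warns that aggressive compression can simply shift the bottleneck onto CPU.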
Operational Headroom and Validation
The safety margin reserves capacity for traffic variance, deployment events, and incident response. Pick higher margins for multi-tenant clusters, batch jobs, unstable dependencies, and regional failover drills. After sizing, validate with staged load tests: confirm saturation points, watch error rates, and measure P95/P99. Re-run the planner when you change payloads, caching, timeouts, or autoscaling policy.
FAQs
What does “effective required RPS” mean?
It is your target request rate adjusted for retries. If retries average 3%, the effective load becomes target RPS × 1.03, which better reflects real traffic hitting the service.
How should I estimate CPU time per request?
Use profiling from representative load tests. Measure CPU per request on a warm system with realistic caches and dependencies, then enter the average. Re-check after major code or library upgrades.
Why does the tool compute concurrency?
Concurrency estimates in-flight requests needed to sustain your rate. High concurrency drives queueing, memory pressure, and tail latency. Planning for concurrency helps you size thread pools, connection limits, and memory safely.
When is bandwidth the bottleneck?
Bandwidth limits dominate when payloads are large or responses stream data. Compare bandwidth-limited RPS to CPU-limited RPS. If bandwidth-limited RPS is lower, reduce payload size or protocol overhead, or increase per-instance network capacity.
How do I pick a safety margin?
Start with 15–25% for steady workloads and strong observability. Increase it for bursty traffic, noisy multi-tenant clusters, frequent deployments, or dependency instability. Safety margin is cheaper than downtime.
Do I still need load testing after this?
Yes. This is a planning model, not a substitute for testing. Validate saturation points, error rates, and P95/P99 under staged traffic. Use results to refine service time, CPU, and retry assumptions.