Planner Inputs
Example Data Table
Use this sample scenario to validate expected behavior.
| Scenario | Average rps | Peak rps | Capacity rps/instance | Target util | Buffer | Min | Max | Recommended range |
|---|---|---|---|---|---|---|---|---|
| API service launch week | 200 | 800 | 120 | 65% | 20% | 2 | 30 | 4 to 13 |
The example applies the buffer to traffic, limits effective capacity by the utilization target, and rounds the result up to whole instances.
Formula Used
The planner uses a simple, deployment-friendly sizing model: required instances = ceil(traffic × (1 + buffer) ÷ (per-instance capacity × target utilization)), computed for both average and peak traffic and clamped to the configured min and max bounds.
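A minimal sketch of this sizing model in Python (the function name and signature are illustrative, not the planner's actual API), reproducing the example row above:

```python
import math

def recommended_instances(rps: float, capacity_rps: float,
                          target_util: float, buffer: float,
                          min_inst: int, max_inst: int) -> int:
    """Instances = ceil(buffered traffic / usable per-instance capacity),
    clamped to the configured min/max bounds."""
    planned_load = rps * (1 + buffer)       # e.g. 800 * 1.2 = 960 rps
    usable = capacity_rps * target_util     # e.g. 120 * 0.65 = 78 rps
    needed = math.ceil(planned_load / usable)
    return max(min_inst, min(max_inst, needed))

# Example row: avg 200 rps -> 4 instances, peak 800 rps -> 13 instances.
low = recommended_instances(200, 120, 0.65, 0.20, 2, 30)
high = recommended_instances(800, 120, 0.65, 0.20, 2, 30)
print(low, high)  # 4 13
```

Running the same function on average and peak traffic yields the "4 to 13" recommended range shown in the table.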
How to Use This Calculator
- Measure throughput per instance at your latency target to fill “Per-instance capacity”.
- Enter average and sustained peak traffic based on logs, forecasts, or load tests.
- Choose utilization and buffer to reflect jitter, retries, and noisy neighbors.
- Set min and max bounds from availability targets and budget constraints.
- Review the recommended range and check the headroom warning for peak limits.
- Export CSV or PDF to share plans and keep deployment notes.
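The export step can also be reproduced outside the tool; a sketch that writes one scenario row to CSV (the column layout here is an assumption for illustration, not the tool's actual export schema):

```python
import csv
import io

# Hypothetical column layout mirroring the example table above.
FIELDS = ["scenario", "avg_rps", "peak_rps", "capacity_rps",
          "target_util", "buffer", "min", "max", "recommended_range"]

row = {"scenario": "API service launch week", "avg_rps": 200,
       "peak_rps": 800, "capacity_rps": 120, "target_util": 0.65,
       "buffer": 0.20, "min": 2, "max": 30, "recommended_range": "4 to 13"}

# Write to an in-memory buffer; swap in open("plan.csv", "w") for a file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```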
Workload inputs and measurement
Use average requests per second for steady demand and sustained peak for stress windows. Measure per-instance capacity from load tests at your latency SLO, not single-thread microbenchmarks. For example, a sustainable 120 rps that meets your p95 latency target is a better planning number than 180 rps that breaches your error budget. Record the test duration, payload mix, and cache state so later comparisons stay valid.
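One way to turn load-test results into a per-instance capacity number is to pick the highest tested rate whose p95 latency still meets the SLO; a sketch (the sample results and the 100 ms threshold are made up for illustration):

```python
# Hypothetical load-test results: offered rate (rps) -> observed p95 latency (ms).
results = {60: 42, 90: 55, 120: 78, 150: 130, 180: 240}
SLO_P95_MS = 100

# Per-instance capacity = highest rate that still meets the latency SLO.
capacity = max(rate for rate, p95 in results.items() if p95 <= SLO_P95_MS)
print(capacity)  # 120
```

Note that 180 rps is achievable in this sample, but at 240 ms p95 it violates the SLO, so 120 rps is the number to plan with.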
Utilization targets and headroom
Target utilization converts raw capacity into usable capacity. With a 65% target, an instance rated at 120 rps delivers 78 rps of planned capacity, leaving room for retries, GC pauses, and noisy neighbors. Raising the target reduces cost but amplifies tail latency and queueing risk. Keep a consistent target across services so fleet sizing remains comparable.
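The queueing risk of a high utilization target can be illustrated with an M/M/1 approximation (an assumption for illustration only; real services are not M/M/1), where mean time in system is W = S / (1 − ρ) for service time S and utilization ρ:

```python
def mm1_time_in_system_ms(service_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho).
    Grows without bound as utilization approaches 1."""
    return service_ms / (1 - utilization)

# With a 10 ms service time, latency roughly doubles from 80% to 90%,
# and doubles again from 90% to 95% utilization.
for util in (0.65, 0.80, 0.90, 0.95):
    print(util, round(mm1_time_in_system_ms(10.0, util), 1))
```

This is why a 65% target is a common compromise: the cost of the unused headroom buys a steep reduction in queueing delay.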
Safety buffer and burst behavior
The buffer adds an uncertainty margin on top of measured traffic. A 20% buffer turns an 800 rps peak into 960 rps of planned load. Buffers are helpful when traffic is spiky, when upstream retries are common, or when instance performance varies. If you already have strong rate limiting and even load distribution, you can lower the buffer, but validate with incident data.
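The buffer's effect on fleet size can be checked directly with the example numbers (the helper below is a sketch, not the planner's API):

```python
import math

def instances_for(peak_rps: float, capacity_rps: float = 120,
                  util: float = 0.65, buffer: float = 0.20) -> int:
    # Usable capacity per instance after applying the utilization target.
    usable = capacity_rps * util
    return math.ceil(peak_rps * (1 + buffer) / usable)

print(instances_for(800, buffer=0.20))  # 960 rps planned -> 13 instances
print(instances_for(800, buffer=0.10))  # 880 rps planned -> 12 instances
```

Here dropping the buffer from 20% to 10% saves one instance at peak, which is the kind of trade to validate against incident history before committing.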
Cooldowns and step policies
Cooldown values prevent rapid oscillation when metrics lag. A short scale‑out cooldown reacts fast, while a longer scale‑in cooldown avoids flapping after a burst. Step sizes define how aggressively capacity changes. Adding two instances per action can clear backlog quickly, but may overshoot if demand falls. Pair steps with alarms on error rate and saturation.
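A sketch of a scaling decision that combines step sizes with separate scale-out and scale-in cooldowns (the function, its parameters, and the cooldown values are illustrative assumptions, not the planner's defaults):

```python
def scaling_action(now: float, desired: int, current: int,
                   last_out: float, last_in: float,
                   out_cooldown: float = 60, in_cooldown: float = 300,
                   step: int = 2) -> int:
    """Return a capacity delta: positive to scale out, negative to scale in,
    zero when within a cooldown or already at the desired count."""
    if desired > current and now - last_out >= out_cooldown:
        return min(step, desired - current)    # scale out, capped by step size
    if desired < current and now - last_in >= in_cooldown:
        return -min(step, current - desired)   # scale in, capped by step size
    return 0                                   # hold

# A burst just ended: desired count dropped from 8 to 4, but the longer
# scale-in cooldown blocks the change until 300 s have passed.
print(scaling_action(now=100, desired=4, current=8, last_out=0, last_in=0))  # 0
print(scaling_action(now=400, desired=4, current=8, last_out=0, last_in=0))  # -2
```

The asymmetric cooldowns implement the advice above: react quickly on the way up, and shrink cautiously on the way down to avoid flapping.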
Growth projection and capacity limits
The projection table compounds growth to show when peak demand outgrows your maximum. At 10% weekly growth, an 800 rps peak becomes about 1,417 rps by week six. The planner recomputes required instances using the same utilization and buffer, which exposes when quotas or budgets will break. Use this view to schedule scaling tests and procurement.
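The projection can be reproduced by compounding the growth rate and re-applying the sizing model each week (a sketch with the example's numbers; the helper name is illustrative):

```python
import math

def projected_peaks(peak_rps: float, weekly_growth: float, weeks: int,
                    capacity_rps: float = 120, util: float = 0.65,
                    buffer: float = 0.20):
    """Yield (week, projected peak rps, required instances) rows."""
    usable = capacity_rps * util
    rows = []
    for week in range(1, weeks + 1):
        projected = peak_rps * (1 + weekly_growth) ** week   # compound growth
        needed = math.ceil(projected * (1 + buffer) / usable)
        rows.append((week, round(projected), needed))
    return rows

for week, rps, needed in projected_peaks(800, 0.10, 6):
    print(week, rps, needed)
# Week 6: ~1417 rps peak -> 22 instances (buffered ~1701 rps / 78 rps usable)
```

Comparing the week-six requirement against the configured maximum of 30 shows this scenario still fits, but at 10% weekly growth it would not fit for long.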
Operational review checklist
Review results before deployment. Confirm the recommended minimum covers warm capacity and availability zones. Confirm the recommended maximum stays within account limits and that the peak requirement does not exceed it. Cross-check CPU sizing when CPU dominates request cost. Finally, test scaling behavior in staging with realistic traffic ramps and cooldowns. Document assumptions, share exports with stakeholders, and revisit the numbers after each major release.
FAQs
What is “per-instance capacity” and how do I measure it?
Run a load test that matches real endpoints, payload sizes, and cache behavior. Increase traffic until p95 latency or error rate breaches your SLO, then record the sustainable requests per second for one instance.
Why does the calculator use utilization instead of 100% capacity?
Operating below saturation reduces queueing delay and absorbs jitter from retries, GC pauses, and uneven load balancing. A utilization target converts theoretical throughput into planned throughput that is safer during bursts.
How should I choose the safety buffer percentage?
Start with 10–30%. Use higher buffers for spiky traffic, variable instance performance, or heavy retry amplification. Use lower buffers if you have strong rate limiting, stable workloads, and proven autoscaling responsiveness.
What do cooldown settings change in practice?
Cooldowns limit how often scaling actions occur. Short scale-out cooldowns react faster to rising demand, while longer scale-in cooldowns prevent oscillation after transient spikes and metric delays.
Why does the projection sometimes exceed my maximum instances?
The projection applies your growth rate to peak traffic and recomputes required instances. If the required count is above your maximum, you need a higher quota, better per-instance capacity, or revised targets.
How do I use the CSV and PDF exports?
Export after each scenario run to share assumptions, ranges, and projections with teams. Keep exports with deployment notes so you can compare planned versus observed scaling during incidents and releases.