## Inputs

### Example dataset
| Scenario | Records/s | Record size (KB) | Workers | Network (Mbps) | Sink cap (MB/s) | Est. throughput (MB/s) | Bottleneck |
|---|---|---|---|---|---|---|---|
| Batch ETL | 12,000 | 2.4 | 8 | 2,000 | 130 | 28.1 | Transform |
| High ingest | 40,000 | 1.8 | 10 | 2,500 | 220 | 55.0 | Ingest |
| Write constrained | 18,000 | 3.0 | 12 | 5,000 | 90 | 30.7 | Sink |
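Applying the source-throughput conversion (records/s × record KB ÷ 1024) to the rows above gives the source-side rates. Note that Est. throughput also reflects inputs not shown as table columns (per-worker capacity, overhead, utilization, compression), which is why it can sit below the source rate. A quick sketch, with the row values taken from the table:

```python
# Source throughput (MB/s) = records/s × record KB ÷ 1024
rows = {
    "Batch ETL": (12_000, 2.4),
    "High ingest": (40_000, 1.8),
    "Write constrained": (18_000, 3.0),
}
for name, (records_per_s, record_kb) in rows.items():
    mbs = records_per_s * record_kb / 1024
    print(f"{name}: {mbs:.1f} MB/s at the source")
```

For "Batch ETL" this yields about 28.1 MB/s at the source; for "High ingest", about 70.3 MB/s, well above the estimated 55.0 MB/s the pipeline can actually sustain.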
## Formulas used
1) Source throughput (MB/s) = records/s × recordKB ÷ 1024
2) Network throughput (MB/s) = (Mbps ÷ 8) × utilization (as a fraction)
3) Ingest limit = min(sourceMB/s, networkMB/s)
4) Transform limit = workers × perWorkerMB/s × (1 − overhead fraction)
5) Output size factor = transformMultiplier ÷ compressionRatio
6) Sink limit (MB/s) = min(writeCapMB/s, IOPS × blockKB ÷ 1024, networkMB/s)
7) Overall input throughput = min(ingestLimit, transformLimit, sinkLimit ÷ outputSizeFactor)
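Taken together, the seven steps can be sketched as a small Python function. The parameter names and the example numbers are illustrative assumptions, not the calculator's actual field names; utilization and overhead are passed as fractions:

```python
def estimate_throughput(records_per_s, record_kb, net_mbps, utilization,
                        workers, per_worker_mbs, overhead,
                        transform_multiplier, compression_ratio,
                        write_cap_mbs, iops, block_kb):
    source = records_per_s * record_kb / 1024                   # step 1
    network = (net_mbps / 8) * utilization                      # step 2
    ingest = min(source, network)                               # step 3
    transform = workers * per_worker_mbs * (1 - overhead)       # step 4
    out_factor = transform_multiplier / compression_ratio       # step 5
    sink = min(write_cap_mbs, iops * block_kb / 1024, network)  # step 6
    # Step 7: the smallest limit wins and names the bottleneck.
    limits = {"Ingest": ingest, "Transform": transform,
              "Sink": sink / out_factor}
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck

# Assumed inputs: 12,000 rec/s × 2.4 KB, 2,000 Mbps at 70% utilization,
# 8 workers × 5 MB/s with 12% overhead, no expansion or compression,
# sink capped at 130 MB/s or 5,000 IOPS × 64 KB blocks.
rate, stage = estimate_throughput(12_000, 2.4, 2_000, 0.70,
                                  8, 5.0, 0.12, 1.0, 1.0,
                                  130, 5_000, 64)
# Ingest-bound at roughly 28.1 MB/s with these inputs.
```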
## How to use this calculator
- Enter observed records per second and typical record size.
- Set compression and transform multiplier from sample jobs.
- Fill workers and per-worker capacity using benchmarks.
- Add overhead for retries, coordination, and background work.
- Provide realistic bandwidth and safe utilization percentages.
- Estimate sink limits with write cap, IOPS, and block size.
- Press Submit to view estimated throughput, the bottleneck stage, and SLA readiness.
## Operational meaning of throughput
Throughput is the sustained input rate your pipeline can accept while keeping queues stable. This calculator translates records per second and average payload size into megabytes per second, then constrains that flow by ingest, compute, and sink limits. The highest number is not the goal; the goal is a stable rate that survives bursts, retries, and noisy neighbors without missing delivery commitments.
## Ingest and network constraints
Ingest is bounded by both what the source produces and what the shared network path can carry. Bandwidth is converted from megabits per second to megabytes per second, then reduced by your safe utilization percentage. If ingest is the bottleneck, increasing workers will not help. Typical fixes include raising link capacity, reducing payload size, increasing compression, or smoothing bursts with buffering and backpressure-aware batching.
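The bandwidth conversion is a one-liner; a minimal sketch, with utilization as a fraction and the link size as an illustrative assumption:

```python
def network_mbs(link_mbps, utilization):
    # Megabits/s -> megabytes/s, derated to the safe utilization level.
    return (link_mbps / 8) * utilization

# A 2,000 Mbps link held at 70% utilization yields a 175 MB/s budget,
# not the 250 MB/s the raw link speed suggests.
budget = network_mbs(2_000, 0.70)
```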
## Compute parallelism and overhead
Transform capacity scales with workers, but only up to the point where coordination and runtime overhead dominate. The overhead input represents retries, serialization, garbage collection, checkpointing, and logging. A 12% overhead means only 88% of theoretical capacity is usable. When compute is limiting, validate per‑worker benchmarks using realistic schemas and joins, then scale horizontally or simplify transformations to reduce per‑record cost.
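The transform limit follows directly from the formula above; the worker count and per-worker rate here are assumed benchmark values, not recommendations:

```python
def transform_limit(workers, per_worker_mbs, overhead):
    # Overhead (retries, GC, checkpointing, logging) as a fraction.
    return workers * per_worker_mbs * (1 - overhead)

# 10 workers benchmarked at 5 MB/s each, minus 12% overhead,
# leaves about 44 MB/s of usable transform capacity.
usable = transform_limit(10, 5.0, 0.12)
```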
## Sink performance and output sizing
Storage writes are limited by sequential throughput, IOPS, and sometimes the same network path. The calculator converts IOPS into an effective megabytes-per-second cap using your block size. Output sizing matters: transform multiplier and compression ratio combine into an output size factor. If output is larger than input, the sink backpressures the entire pipeline sooner. Improve sink results by enlarging write batches, partitioning wisely, and choosing formats that compress well.
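The IOPS conversion and the output size factor can be sketched as follows; the 4,000 IOPS / 32 KB figures are illustrative assumptions:

```python
def sink_limit(write_cap_mbs, iops, block_kb, network_mbs):
    iops_cap = iops * block_kb / 1024  # IOPS × block size -> MB/s
    return min(write_cap_mbs, iops_cap, network_mbs)

def output_size_factor(transform_multiplier, compression_ratio):
    # Above 1.0: the sink writes more bytes than were ingested.
    return transform_multiplier / compression_ratio

# 4,000 IOPS at 32 KB blocks caps writes at 125 MB/s even though the
# sequential write cap (200) and network budget (250) are both higher.
cap = sink_limit(200, 4_000, 32, 250)
# A 1.4x transform expansion compressed 2:1 gives a 0.7 output factor,
# so the sink effectively supports more input than its raw write rate.
factor = output_size_factor(1.4, 2.0)
```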
## Batch windows and SLA readiness
Batch windows translate a sustained rate into per-window volume and an estimated time to process one window’s incoming data. If processing time exceeds your SLA, latency accumulates even if average throughput looks acceptable. To meet SLAs, shorten windows for faster feedback, increase capacity at the bottleneck stage, or reduce output size. Use served percentage as a quick indicator of backlog risk during peaks. Document your assumptions, then validate with monitoring so modeled capacity matches real runtimes during deployment, scaling, and incident response.
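The window arithmetic is simple enough to check by hand; a sketch with assumed rates and window lengths:

```python
def window_check(throughput_mbs, incoming_mbs, window_s, sla_s):
    """Volume arriving in one window and the time needed to process it."""
    volume_mb = incoming_mbs * window_s
    process_s = volume_mb / throughput_mbs
    return volume_mb, process_s, process_s <= sla_s

# 60 MB/s arriving over a 300 s window is 18,000 MB; processed at a
# sustained 45 MB/s that takes 400 s, which misses a 360 s SLA even
# though average throughput looks close to the incoming rate.
volume, process_s, meets_sla = window_check(45.0, 60.0, 300, 360)
```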
## FAQs
1) What does “output size factor” represent?
It combines transform expansion or reduction with compression. A factor above 1.0 means the sink must write more bytes than were ingested, which lowers sustainable input throughput.
2) Why can throughput be lower than the source rate?
If ingest, compute, or the sink cannot keep up, the pipeline backpressures and queues grow. Sustained throughput is the stable rate without unbounded backlog.
3) How should I estimate per-worker capacity?
Run a representative job with production-like data and measure steady-state processed MB/s per worker. Avoid short tests that ignore warmup, caching, and retries.
4) What overhead percentage is reasonable?
Start with 10–20% for mature workloads. Increase it if you see frequent retries, heavy logging, encryption, or strict checkpointing. Decrease only after observing stable runs.
5) How do I decide on a batch window?
Choose the smallest window that still yields efficient writes and acceptable compute utilization. Smaller windows reduce latency but may increase overhead and IOPS pressure.
6) Which bottleneck should I fix first?
Fix the smallest limiter reported. Improving non-bottleneck stages rarely changes end-to-end throughput. Recalculate after each change because bottlenecks can shift.