# Calculator
## Example data table
These sample rows show how inputs influence throughput and ETA.
| Scenario | Total records | Batch size | Workers | Cycle time (s) | Utilization | Throughput (records/s) | ETA |
|---|---|---|---|---|---|---|---|
| Balanced | 1,000,000 | 1,000 | 8 | 3.90 | 85% | ~1,744 | ~9m 33s |
| I/O bound | 1,000,000 | 1,000 | 8 | 5.20 | 75% | ~1,154 | ~14m 26s |
| Higher parallelism | 1,000,000 | 2,000 | 16 | 4.40 | 85% | ~6,182 | ~2m 42s |
## Formula used
This calculator uses steady-state throughput with utilization derating.
- Workers = nodes × workers per node
- Base batch time = measured batch duration (in “Known batch duration” mode), or (batch size × per-record ms ÷ 1000) + fixed setup (in “Per-record time” mode)
- Expected retry time = (failure rate ÷ 100) × retry penalty
- Cycle time = base batch time + overhead + I/O wait + expected retry time
- Batches/s = (workers ÷ cycle time) × (utilization ÷ 100)
- Records/s = batches/s × batch size
- ETA = warmup + (total records ÷ records/s)
- MB/s = (records/s × avg record KB) ÷ 1024
- Workers needed = ceil((target records/s × cycle time) ÷ (batch size × utilization ÷ 100))
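Taken together, the formulas above can be sketched as a small Python function. The function and parameter names are illustrative, not the calculator's internals:

```python
def estimate(total_records, batch_size, nodes, workers_per_node,
             batch_duration_s, overhead_s=0.0, io_wait_s=0.0,
             failure_rate_pct=0.0, retry_penalty_s=0.0,
             utilization_pct=100.0, warmup_s=0.0):
    """Steady-state records/s and ETA with utilization derating."""
    workers = nodes * workers_per_node
    expected_retry_s = (failure_rate_pct / 100) * retry_penalty_s
    cycle_time_s = batch_duration_s + overhead_s + io_wait_s + expected_retry_s
    batches_per_s = (workers / cycle_time_s) * (utilization_pct / 100)
    records_per_s = batches_per_s * batch_size
    eta_s = warmup_s + total_records / records_per_s
    return records_per_s, eta_s

# Reproduce the "Balanced" row: 8 workers, 1,000-record batches,
# 3.90 s cycle time, 85% utilization.
rps, eta = estimate(1_000_000, 1_000, 1, 8, 3.90, utilization_pct=85)
print(round(rps), "records/s")                # 1744 records/s
print(f"{int(eta // 60)}m {int(eta % 60)}s")  # 9m 33s
```

Here 3.90 s is passed as the full cycle time, so overhead and I/O wait are left at zero; in practice you would pass the measured base batch time and supply the other components separately.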
## How to use this calculator
- Choose an estimation mode based on your measurement quality.
- Enter total records and an expected batch size.
- Set nodes and workers per node for parallel throughput.
- Adjust overhead, I/O wait, failure rate, and retry penalty.
- Tune utilization to match real-world contention and idle time.
- Press “Calculate throughput” to see results.
- Download CSV or PDF if you want a shareable report.
## FAQs
1) What does throughput mean in batch processing?
Throughput is how many records your pipeline completes per unit time. It depends on batch size, parallel workers, and the effective cycle time that includes overhead, I/O waits, and retries.
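As a rough illustration, using the numbers from the Balanced row above:

```python
# 8 workers each finish a 1,000-record batch every 3.9 s, at 85% utilization.
workers, batch_size, cycle_time_s, utilization = 8, 1_000, 3.9, 0.85
records_per_s = (workers / cycle_time_s) * utilization * batch_size
print(round(records_per_s))  # 1744
```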
2) Why include utilization instead of assuming 100%?
Real systems idle due to scheduling gaps, skew, resource contention, throttling, and dependencies. Utilization lets you derate theoretical capacity to better match observed behavior.
3) How should I estimate I/O wait?
Use monitoring data or logs to approximate time spent waiting on storage and network. If you cannot measure it directly, start with a small value and increase until predicted throughput aligns with production metrics.
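If you cannot measure I/O wait directly, one way to seed the guess is to invert the throughput formula against observed production throughput. All the numbers below are hypothetical placeholders:

```python
# Hypothetical measurements; substitute your own.
workers, utilization, batch_size = 8, 0.85, 1_000
base_batch_s, overhead_s, expected_retry_s = 3.2, 0.3, 0.1
observed_records_per_s = 1_500  # from production metrics

# records/s = workers * utilization * batch_size / cycle time,
# so solve for cycle time and attribute the unexplained remainder to I/O wait.
implied_cycle_s = workers * utilization * batch_size / observed_records_per_s
implied_io_wait_s = implied_cycle_s - (base_batch_s + overhead_s + expected_retry_s)
print(round(implied_io_wait_s, 2))  # 0.93
```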
4) How are failures modeled here?
Failures are modeled as an expected retry penalty per batch: failure rate × retry penalty. This captures average rework and delay, but it will not reflect rare cascading incidents or prolonged outages.
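For example, a 2% batch failure rate with a 30-second average retry penalty adds 0.6 s to every batch's expected cycle time:

```python
failure_rate_pct = 2.0    # 2% of batches fail and are retried
retry_penalty_s = 30.0    # average extra delay per failed batch
expected_retry_s = (failure_rate_pct / 100) * retry_penalty_s
print(expected_retry_s)   # 0.6
```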
5) Which mode should I choose?
Choose “Known batch duration” if you can measure average batch wall time reliably. Choose “Per-record time” when you have microbenchmarks or profiler data and a stable fixed setup cost.
6) Why can increasing batch size reduce throughput?
Larger batches can increase memory pressure, serialization cost, and I/O bursts. That can raise cycle time and reduce effective utilization, even if you process more records per batch.
7) How do I use the target throughput field?
Enter a desired records-per-second value. The calculator estimates how many total workers you need given your current batch size, cycle time, and utilization setting, then shows that count in the results.
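The workers-needed formula can be checked by hand; the inputs here are illustrative:

```python
from math import ceil

target_records_per_s = 5_000
cycle_time_s = 3.9
batch_size = 1_000
utilization = 0.85  # as a fraction, i.e. utilization % ÷ 100

workers_needed = ceil(target_records_per_s * cycle_time_s
                      / (batch_size * utilization))
print(workers_needed)  # 23
```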
8) Are the “waves” estimate and ETA exact?
They are approximations intended for planning. Real runtimes vary with skew, autoscaling, queue dynamics, and shared services. Use the CSV or PDF outputs to document assumptions and refine them over time.