Calculator
Fill in what you know. Use “Measured” for best accuracy.
How to use this calculator
- Enter model size, sequence length, and batch settings.
- Set cluster size and parallelism degrees.
- Choose “Measured” if you know step time.
- If estimating, adjust utilization and network settings.
- Add tokens to estimate total training duration.
- Use exports to share a repeatable report.
Formula used
This estimator uses a token-centric compute model and a ring all-reduce approximation for data-parallel gradient synchronization.
total_gpus = nodes × gpus_per_node
data_parallel (DP) = floor(total_gpus / (TP × PP))
global_batch = micro_batch × grad_accum × DP
tokens_per_step = global_batch × seq_len
flops_per_step = flops_per_token × params × tokens_per_step × extra_flops_mult
peak_flops_total = total_gpus × gpu_peak_tflops × 1e12 × utilization
compute_time = flops_per_step / peak_flops_total
grad_bytes_total = params × grad_bytes_per_param
ring_factor = 2 × (DP - 1) / DP
comm_time = ring_factor × grad_bytes_total / (bandwidth × allreduce_eff) + latency
base_step_time = measured_step_time OR (compute_time + comm_time + misc_overhead)
step_time = base_step_time / (1 - pipeline_bubble)
tokens_per_second = tokens_per_step / step_time
steps_total = ceil(total_training_tokens / tokens_per_step)
train_wall_time = steps_total × step_time
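The formulas above can be chained into one function. The following is a minimal sketch, not the calculator's actual source: the function name `estimate`, the default values, and the choice of bytes per second for `bandwidth` are assumptions made for illustration.

```python
import math

def estimate(*, params, seq_len, micro_batch, grad_accum, nodes, gpus_per_node,
             tp=1, pp=1,
             flops_per_token=6.0,        # assumption: common forward+backward rule of thumb
             extra_flops_mult=1.0,
             gpu_peak_tflops=312.0,      # assumption: illustrative peak, replace with your GPU's
             utilization=0.40,
             grad_bytes_per_param=2.0,
             bandwidth=25e9,             # assumption: bytes per second (200 Gbps / 8)
             allreduce_eff=0.70,
             latency=0.0, misc_overhead=0.0, pipeline_bubble=0.0,
             measured_step_time=None, total_training_tokens=None):
    """Apply the listed formulas in order; times are in seconds."""
    total_gpus = nodes * gpus_per_node
    dp = total_gpus // (tp * pp)
    if dp < 1:
        raise ValueError("TP x PP exceeds the number of GPUs")
    if not 0.0 <= pipeline_bubble < 1.0:
        raise ValueError("pipeline_bubble must be in [0, 1)")

    global_batch = micro_batch * grad_accum * dp
    tokens_per_step = global_batch * seq_len

    # Compute side.
    flops_per_step = flops_per_token * params * tokens_per_step * extra_flops_mult
    peak_flops_total = total_gpus * gpu_peak_tflops * 1e12 * utilization
    compute_time = flops_per_step / peak_flops_total

    # Communication side (ring all-reduce approximation).
    grad_bytes_total = params * grad_bytes_per_param
    ring_factor = 2 * (dp - 1) / dp
    comm_time = ring_factor * grad_bytes_total / (bandwidth * allreduce_eff) + latency

    # Measured step time, when given, replaces the modeled terms.
    if measured_step_time is not None:
        base_step_time = measured_step_time
    else:
        base_step_time = compute_time + comm_time + misc_overhead
    step_time = base_step_time / (1 - pipeline_bubble)

    result = {
        "total_gpus": total_gpus,
        "dp": dp,
        "global_batch": global_batch,
        "tokens_per_step": tokens_per_step,
        "compute_time": compute_time,
        "comm_time": comm_time,
        "step_time": step_time,
        "tokens_per_second": tokens_per_step / step_time,
    }
    if total_training_tokens is not None:
        steps_total = math.ceil(total_training_tokens / tokens_per_step)
        result["steps_total"] = steps_total
        result["train_wall_time"] = steps_total * step_time
    return result

# Measured mode: 16 GPUs, global batch 256, 0.68 s per step.
# The 2 x 8 node split and the 4 x 4 batch factors are hypothetical.
r = estimate(params=13e9, seq_len=2048, micro_batch=4, grad_accum=4,
             nodes=2, gpus_per_node=8, measured_step_time=0.68)
print(r["dp"], r["tokens_per_step"], round(r["tokens_per_second"]))
```

Keyword-only arguments keep the many numeric inputs from being passed in the wrong order.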
Example data
These are illustrative scenarios using the same estimator.
| Scenario | Model (B) | GPUs | DP | Seq | Global batch | Step (ms) | Tokens/s | Tokens/day | Train days |
|---|---|---|---|---|---|---|---|---|---|
| Single node, mid-size model | 7 | 8 | 8 | 2,048 | 128 | 14,199 | 18,462 | 1,595,111,723 | 62.7 |
| Multi-node, larger model | 70 | 64 | 4 | 4,096 | 64 | 35,132 | 7,462 | 644,696,687 | 1,551.1 |
| Measured step-time mode | 13 | 16 | 16 | 2,048 | 256 | 680 | 771,012 | 66,615,416,471 | 3.8 |
Use the CSV export for spreadsheet analysis, and the PDF export for sharing a snapshot.
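As a quick consistency check, the Tokens/s column can be recomputed from the Global batch, Seq, and Step (ms) columns. This is a small sketch; the helper name `tokens_per_second` is an assumption, and the inputs are copied from the table.

```python
def tokens_per_second(global_batch, seq_len, step_ms):
    """Tokens processed per second for one optimizer step of step_ms milliseconds."""
    return global_batch * seq_len / (step_ms / 1000.0)

# (scenario, global batch, sequence length, step time in ms) from the table.
rows = [
    ("Single node, mid-size model", 128, 2048, 14_199),
    ("Multi-node, larger model", 64, 4096, 35_132),
    ("Measured step-time mode", 256, 2048, 680),
]

for name, global_batch, seq_len, step_ms in rows:
    tps = tokens_per_second(global_batch, seq_len, step_ms)
    print(f"{name}: {tps:,.0f} tokens/s")
```

Tokens/day recomputed this way can differ slightly from the table, because the table's step times are rounded to whole milliseconds.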
Model scale and token targets
Throughput planning starts by connecting model size to token budgets. At a fixed token target and fixed tokens per step, a 7B run and a 70B run take the same number of optimizer steps; the 70B run needs roughly ten times the FLOPs per step, so each step is slower on the same hardware. When target tokens are unknown, dataset tokens multiplied by epochs gives a practical proxy. Converting tokens to steps is the bridge that makes training calendars comparable across teams.
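The token-to-step conversion follows directly from `steps_total = ceil(total_training_tokens / tokens_per_step)`. A minimal sketch, with hypothetical helper names and a hypothetical dataset size:

```python
import math

def target_tokens(dataset_tokens, epochs):
    """Proxy for the token target when only dataset size and epochs are known."""
    return dataset_tokens * epochs

def steps_for_tokens(total_tokens, global_batch, seq_len):
    """Optimizer steps needed to consume total_tokens."""
    tokens_per_step = global_batch * seq_len
    return math.ceil(total_tokens / tokens_per_step)

# 100B tokens at global batch 128 and sequence length 2048.
steps = steps_for_tokens(100_000_000_000, 128, 2048)
print(steps)  # 381470

# Hypothetical dataset: 25B tokens seen for 4 epochs gives the same target.
same_steps = steps_for_tokens(target_tokens(25_000_000_000, 4), 128, 2048)
```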
Batching, sequence length, and parallel degrees
Tokens per step are driven by global batch and sequence length; sequence lengths of 2048 or 4096 tokens are common. Global batch equals micro-batch per data-parallel replica times gradient accumulation times the data-parallel degree. Increasing accumulation raises tokens per step without raising activation memory, unlike a larger micro-batch. Tensor and pipeline degrees consume GPUs, reduce data-parallelism, and can increase pipeline bubble overhead when stages are imbalanced.
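The effect of parallel degrees on batch size can be sketched as below. The function name `layout` is an assumption, and the TP, PP, micro-batch, and accumulation values in the example are hypothetical choices that happen to give DP = 4 and a global batch of 64 on 64 GPUs.

```python
def layout(nodes, gpus_per_node, tp, pp, micro_batch, grad_accum, seq_len):
    """Derive data-parallel degree, global batch, and tokens per step."""
    total_gpus = nodes * gpus_per_node
    dp = total_gpus // (tp * pp)  # GPUs left over after model splitting
    if dp < 1:
        raise ValueError("TP x PP exceeds the number of GPUs")
    global_batch = micro_batch * grad_accum * dp
    return {
        "total_gpus": total_gpus,
        "dp": dp,
        "global_batch": global_batch,
        "tokens_per_step": global_batch * seq_len,
    }

# 8 nodes x 8 GPUs with TP=8 and PP=2 leaves DP=4.
split = layout(8, 8, tp=8, pp=2, micro_batch=1, grad_accum=16, seq_len=4096)

# The same cluster without model splitting has DP=64.
flat = layout(8, 8, tp=1, pp=1, micro_batch=1, grad_accum=16, seq_len=4096)
print(split["dp"], split["global_batch"], flat["dp"], flat["global_batch"])
```

With micro-batch and accumulation held fixed, raising TP × PP by 16 cuts the global batch by 16, which is why gradient accumulation is often raised alongside model splitting.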
Compute, utilization, and step time realism
The compute model scales with parameters, tokens per step, and a FLOPs-per-token constant, then applies a utilization factor. Sustained utilization commonly lands between 25% and 55%, depending on kernels, precision, and input pipeline health. The extra FLOPs multiplier, typically 1.05 to 1.30, covers checkpointing and framework overhead. If you have logs, measured step time is the most reliable input.
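The compute term can be worked through in isolation. This sketch assumes a FLOPs-per-token constant of 6 (a common rule of thumb for forward plus backward passes, not necessarily the calculator's default) and a hypothetical GPU with a 312 TFLOPS peak.

```python
def compute_time_s(params, tokens_per_step, total_gpus, gpu_peak_tflops,
                   utilization, flops_per_token=6.0, extra_flops_mult=1.0):
    """Seconds of pure compute per optimizer step."""
    flops_per_step = flops_per_token * params * tokens_per_step * extra_flops_mult
    # Sustained cluster throughput: peak scaled by the utilization factor.
    sustained_flops = total_gpus * gpu_peak_tflops * 1e12 * utilization
    return flops_per_step / sustained_flops

# 7B parameters, 262,144 tokens per step, 8 GPUs, 40% utilization,
# 1.15 extra-FLOPs multiplier (all hardware values are assumptions).
t40 = compute_time_s(7e9, 262_144, 8, 312.0, 0.40, extra_flops_mult=1.15)
t55 = compute_time_s(7e9, 262_144, 8, 312.0, 0.55, extra_flops_mult=1.15)
print(f"{t40:.2f} s at 40%, {t55:.2f} s at 55%")
```

Under these assumptions the step needs roughly 12.7 s of compute at 40% utilization, and compute time scales inversely with utilization.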
Network effects and gradient synchronization
For data-parallel training, gradient synchronization can be the limiting factor once compute is efficient. The estimator uses a ring all-reduce approximation based on gradient size, effective bandwidth, protocol efficiency, and a fixed latency term. Interconnects rated at 100 to 400 Gbps often deliver 55% to 80% of their nominal bandwidth under contention. For BF16 or FP16, gradient bytes per parameter are frequently near 2.
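The communication term can be sketched the same way. The function name is hypothetical, and the sketch assumes the link rate is given in Gbps and converted to bytes per second before use; how the calculator handles units internally is not stated.

```python
def ring_allreduce_time_s(params, dp, link_gbps, allreduce_eff,
                          grad_bytes_per_param=2.0, latency_s=0.0):
    """Approximate seconds per step spent synchronizing gradients."""
    grad_bytes_total = params * grad_bytes_per_param
    ring_factor = 2 * (dp - 1) / dp           # 0 when dp == 1, approaches 2 as dp grows
    bandwidth_bytes = link_gbps * 1e9 / 8     # assumption: Gbps to bytes per second
    return ring_factor * grad_bytes_total / (bandwidth_bytes * allreduce_eff) + latency_s

# 7B parameters in BF16, DP=8, 200 Gbps link at 70% efficiency.
t_comm = ring_allreduce_time_s(7e9, dp=8, link_gbps=200, allreduce_eff=0.70)
print(f"{t_comm:.2f} s")  # 1.40 s
```

Because the ring factor is bounded by 2, adding replicas barely changes this term; gradient size and effective bandwidth dominate it.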
Cost, energy, and capacity decisions
After wall time is estimated, cost and energy become direct multiplications that support approvals. Multiply GPU-hour rate by total GPUs and training hours for compute spend. Multiply average GPU watts, often 300 to 700, by GPUs and hours, divide by 1,000 to get kWh, then apply electricity pricing. Compare scenarios: higher utilization or better networking can shorten runs without adding hardware.
Use tokens per day to validate nightly capacity, and track compute versus communication time to decide whether kernel tuning or interconnect upgrades should come first.
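The cost and energy arithmetic can be sketched as follows. The function name and all rates in the example are hypothetical placeholders, not quoted prices.

```python
def cost_and_energy(train_hours, total_gpus, gpu_hour_rate,
                    avg_gpu_watts, price_per_kwh):
    """Compute spend and energy use for a run of train_hours wall-clock hours."""
    gpu_hours = total_gpus * train_hours
    compute_cost = gpu_hours * gpu_hour_rate
    energy_kwh = avg_gpu_watts * gpu_hours / 1000.0  # watt-hours to kWh
    return {
        "gpu_hours": gpu_hours,
        "compute_cost": compute_cost,
        "energy_kwh": energy_kwh,
        "energy_cost": energy_kwh * price_per_kwh,
    }

# Hypothetical run: 100 hours on 16 GPUs at 2.00 per GPU-hour,
# 500 W average draw, 0.10 per kWh.
budget = cost_and_energy(100, 16, 2.00, 500, 0.10)
print(budget)
```

This counts GPU power only; host, cooling, and networking power are outside the formula described above.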
FAQs
What does the calculator estimate?
It estimates tokens per second, tokens per day, step time, and optional total wall time from your token target. It also computes rough compute cost and energy use when you provide rates and power draw.
When should I use measured step time?
Use it when you have reliable logs for optimizer step duration. Measured mode bypasses compute and network modeling, so it captures framework overhead, input stalls, and real scaling behavior.
How do I set utilization realistically?
Start with 30% to 45% for new stacks. Increase it after profiling shows stable kernels and data feeding. If communication time is large, higher utilization alone will not improve throughput.
What is pipeline bubble percentage?
It is an efficiency penalty that accounts for idle time between pipeline stages. Higher bubbles raise effective step time. Keep it low with balanced stage placement, enough micro-batches, and consistent activation checkpointing.
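The penalty follows `step_time = base_step_time / (1 - pipeline_bubble)`. A small sketch with a hypothetical helper name, taking the bubble as a fraction rather than a percentage:

```python
def apply_pipeline_bubble(base_step_time, pipeline_bubble):
    """Inflate step time by the fraction of each step lost to pipeline idle time."""
    if not 0.0 <= pipeline_bubble < 1.0:
        raise ValueError("pipeline_bubble must be in [0, 1)")
    return base_step_time / (1 - pipeline_bubble)

# A 10 s step with a 20% bubble takes 12.5 s, a 25% slowdown.
slowed = apply_pipeline_bubble(10.0, 0.20)
print(f"{slowed:.2f} s")
```

The slowdown is larger than the bubble percentage itself, because the idle fraction is measured against the final step time rather than the base time.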
Why does throughput drop when I increase TP or PP?
TP and PP consume GPUs for model splitting, which reduces the data-parallel degree. Lower DP means fewer samples processed per step, and additional synchronization or bubbles can increase step time.
How are CSV and PDF reports produced?
After you submit inputs, the page stores results for export. CSV contains inputs, derived values, and outputs for spreadsheets. PDF provides a one-page summary for sharing during reviews and capacity planning.