Training Throughput Estimator

Turn specs into realistic training speed forecasts. Include compute, communication, cost, energy, and checkpoints details. Adjust assumptions, export reports, and share with teammates easily.

Calculator

Fill what you know. Use “Measured” for best accuracy.

Workload

tokens-based

Example: 7, 13, 70.
Tokens per sample per step.
Per GPU per forward pass.
Micro-steps per optimizer step.
Used with epochs if target tokens is empty.
Optional; can be fractional.
If set, overrides dataset × epochs.

Parallelism

DP · TP · PP

Shards model compute per step.
Stages for pipelining.
Simple efficiency penalty for bubbles.

Hardware

cluster

Compute nodes in the cluster.
Total GPUs = nodes × GPUs/node.
Peak for chosen precision.
Typical sustained fraction.

Step time

estimate or measured

Measured mode ignores compute/comm modeling.
From logs: optimizer step duration.

Modeling knobs

advanced

Default 6×params×tokens estimate.
Checkpointing, fused ops, overhead.
Dataloader, eval, host sync, etc.

Communication

all-reduce

Effective bandwidth for DP sync path.
Accounts for contention and protocol.
Often 2 bytes for BF16/FP16.
Approx fixed per-step latency term.

Cost & energy

optional

Your internal or cloud rate.
Average draw under training load.
Used for energy cost estimate.

Memory check

sanity

For a rough feasibility check.
Weights + optimizer + grads (approx).
1=no sharding; DP for full sharding.
Rough non-state memory per GPU.

How to use this calculator

  1. Enter model size, sequence length, and batch settings.
  2. Set cluster size and parallelism degrees.
  3. Choose “Measured” if you know step time.
  4. If estimating, adjust utilization and network settings.
  5. Add tokens to estimate total training duration.
  6. Use exports to share a repeatable report.

Formula used

This estimator uses a token-centric compute model and a ring all-reduce approximation for data-parallel gradient synchronization.

total_gpus           = nodes × gpus_per_node
data_parallel (DP)   = floor(total_gpus / (TP × PP))

global_batch         = micro_batch × grad_accum × DP
tokens_per_step      = global_batch × seq_len

flops_per_step       = flops_per_token × params × tokens_per_step × extra_flops_mult
peak_flops_total     = total_gpus × gpu_peak_tflops × 1e12 × utilization
compute_time         = flops_per_step / peak_flops_total

grad_bytes_total     = params × grad_bytes_per_param
ring_factor          = 2 × (DP - 1) / DP
comm_time            = ring_factor × grad_bytes_total / (bandwidth × allreduce_eff) + latency

base_step_time       = measured_step_time OR (compute_time + comm_time + misc_overhead)
step_time            = base_step_time / (1 - pipeline_bubble)

tokens_per_second    = tokens_per_step / step_time
steps_total          = ceil(total_training_tokens / tokens_per_step)
train_wall_time      = steps_total × step_time

Example data

These are illustrative scenarios using the same estimator.

Scenario Model (B) GPUs DP Seq Global batch Step (ms) Tokens/s Tokens/day Train days
Single node, mid-size model 7 8 8 2,048 128 14,199 18,462 1,595,111,723 62.7
Multi-node, larger model 70 64 4 4,096 64 35,132 7,462 644,696,687 1,551.1
Measured step-time mode 13 16 16 2,048 256 680 771,012 66,615,416,471 3.8

Use the CSV export for spreadsheet analysis, and the PDF export for sharing a snapshot.

Model scale and token targets

Throughput planning starts by connecting model size to token budgets. A 7B parameter run at 100B tokens implies far fewer optimizer steps than a 70B run at the same token target. When target tokens are unknown, dataset tokens multiplied by epochs gives a practical proxy. Converting tokens to steps is the bridge that makes training calendars comparable across teams.

Batching, sequence length, and parallel degrees

Tokens per step are driven by global batch and sequence length, often 2048 or 4096 tokens. Global batch equals micro-batch per GPU times gradient accumulation times the data-parallel degree. Increasing accumulation raises tokens per step without increasing activation memory as sharply as micro-batch. Tensor and pipeline degrees consume GPUs, reduce data-parallelism, and can increase pipeline bubble overhead when stages are imbalanced.

Compute, utilization, and step time realism

The compute model scales with parameters, tokens per step, and a FLOPs-per-token constant, then applies a utilization factor. Sustained utilization commonly lands between 25% and 55%, depending on kernels, precision, and input pipeline health. The extra FLOPs multiplier, typically 1.05 to 1.30, covers checkpointing and framework overhead. If you have logs, measured step time is the most reliable input.

Network effects and gradient synchronization

For data-parallel training, gradient synchronization can be the limiting factor once compute is efficient. The estimator uses a ring all-reduce approximation based on gradient size, effective bandwidth, protocol efficiency, and a fixed latency term. Interconnect rates of 100 to 400 Gbps often yield 55% to 80% effective efficiency under contention. For BF16 or FP16, gradient bytes per parameter are frequently near 2.

Cost, energy, and capacity decisions

After wall time is estimated, cost and energy become direct multiplications that support approvals. Multiply GPU-hour rate by total GPUs and training hours for compute spend. Multiply average GPU watts, often 300 to 700, by GPUs and hours to estimate kWh, then apply electricity pricing. Compare scenarios: higher utilization or better networking can shorten runs without adding hardware.

Use tokens per day to validate nightly capacity, and track compute versus comm time to prioritize kernel tuning or interconnect upgrades first safely.

FAQs

What does the calculator estimate?

It estimates tokens per second, tokens per day, step time, and optional total wall time from your token target. It also computes rough compute cost and energy use when you provide rates and power draw.

When should I use measured step time?

Use it when you have reliable logs for optimizer step duration. Measured mode bypasses compute and network modeling, so it captures framework overhead, input stalls, and real scaling behavior.

How do I set utilization realistically?

Start with 30% to 45% for new stacks. Increase it after profiling shows stable kernels and data feeding. If communication time is large, higher utilization alone will not improve throughput.

What is pipeline bubble percentage?

It is an efficiency penalty that accounts for idle time between pipeline stages. Higher bubbles raise effective step time. Keep it low with balanced stage placement, enough micro-batches, and consistent activation checkpointing.

Why does throughput drop when I increase TP or PP?

TP and PP consume GPUs for model splitting, which reduces the data-parallel degree. Lower DP means fewer samples processed per step, and additional synchronization or bubbles can increase step time.

How are CSV and PDF reports produced?

After you submit inputs, the page stores results for export. CSV contains inputs, derived values, and outputs for spreadsheets. PDF provides a one-page summary for sharing during reviews and capacity planning.

Related Calculators

LLM Fine-Tuning CostModel Training CostFine-Tune Budget EstimatorDataset Size EstimatorTraining Data SizeGPU Cost CalculatorCloud Training CostFine-Tuning Price EstimatorEpoch Cost CalculatorToken Volume Estimator

Important Note: All the Calculators listed in this site are for educational purpose only and we do not guarentee the accuracy of results. Please do consult with other sources as well.