Training Throughput Estimator

Calculator

Fill what you know. Use “Measured” for best accuracy.

Workload

tokens-based

Model parameters (B)

Example: 7, 13, 70.

Sequence length (tokens)

Tokens per sample per step.

Micro-batch per GPU

Per GPU per forward pass.

Gradient accumulation

Micro-steps per optimizer step.

Dataset tokens (B)

Used with epochs if target tokens is empty.

Epochs

Optional; can be fractional.

Target training tokens (B)

If set, overrides dataset × epochs.

Parallelism

DP · TP · PP

Tensor parallel degree (TP)

Shards model compute per step.

Pipeline stages (PP)

Stages for pipelining.

Pipeline bubble (%)

Simple efficiency penalty for bubbles.

Hardware

cluster

Nodes

Compute nodes in the cluster.

GPUs per node

Total GPUs = nodes × GPUs/node.

GPU peak TFLOPs

Peak for chosen precision.

Utilization (%)

Typical sustained fraction.

Step time

estimate or measured

Mode

Measured mode ignores compute/comm modeling.

Measured step time (ms)

From logs: optimizer step duration.

Modeling knobs

advanced

FLOPs per token

Default 6×params×tokens estimate.

Extra FLOPs multiplier

Checkpointing, fused ops, overhead.

Misc overhead per step (ms)

Dataloader, eval, host sync, etc.

Communication

all-reduce

Interconnect bandwidth (Gbps)

Effective bandwidth for DP sync path.

All-reduce efficiency (%)

Accounts for contention and protocol.

Gradient bytes per parameter

Often 2 bytes for BF16/FP16.

All-reduce latency (ms)

Approx fixed per-step latency term.

Cost & energy

optional

Cost per GPU-hour

Your internal or cloud rate.

GPU power (watts)

Average draw under training load.

Electricity rate (per kWh)

Used for energy cost estimate.

Memory check

sanity

GPU memory (GB)

For a rough feasibility check.

State bytes per parameter

Weights + optimizer + grads (approx).

Sharding factor

1=no sharding; DP for full sharding.

Activation memory (GB)

Rough non-state memory per GPU.

How to use this calculator

Enter model size, sequence length, and batch settings.
Set cluster size and parallelism degrees.
Choose “Measured” if you know step time.
If estimating, adjust utilization and network settings.
Add tokens to estimate total training duration.
Use exports to share a repeatable report.

Formula used

This estimator uses a token-centric compute model and a ring all-reduce approximation for data-parallel gradient synchronization.

total_gpus           = nodes × gpus_per_node
data_parallel (DP)   = floor(total_gpus / (TP × PP))

global_batch         = micro_batch × grad_accum × DP
tokens_per_step      = global_batch × seq_len

flops_per_step       = flops_per_token × params × tokens_per_step × extra_flops_mult
peak_flops_total     = total_gpus × gpu_peak_tflops × 1e12 × utilization
compute_time         = flops_per_step / peak_flops_total

grad_bytes_total     = params × grad_bytes_per_param
ring_factor          = 2 × (DP - 1) / DP
comm_time            = ring_factor × grad_bytes_total / (bandwidth × allreduce_eff) + latency

base_step_time       = measured_step_time OR (compute_time + comm_time + misc_overhead)
step_time            = base_step_time / (1 - pipeline_bubble)

tokens_per_second    = tokens_per_step / step_time
steps_total          = ceil(total_training_tokens / tokens_per_step)
train_wall_time      = steps_total × step_time

Example data

These are illustrative scenarios using the same estimator.

Scenario	Model (B)	GPUs	DP	Seq	Global batch	Step (ms)	Tokens/s	Tokens/day	Train days
Single node, mid-size model	7	8	8	2,048	128	14,199	18,462	1,595,111,723	62.7
Multi-node, larger model	70	64	4	4,096	64	35,132	7,462	644,696,687	1,551.1
Measured step-time mode	13	16	16	2,048	256	680	771,012	66,615,416,471	3.8

Use the CSV export for spreadsheet analysis, and the PDF export for sharing a snapshot.

Model scale and token targets

Throughput planning starts by connecting model size to token budgets. A 7B parameter run at 100B tokens implies far fewer optimizer steps than a 70B run at the same token target. When target tokens are unknown, dataset tokens multiplied by epochs gives a practical proxy. Converting tokens to steps is the bridge that makes training calendars comparable across teams.

Batching, sequence length, and parallel degrees

Tokens per step are driven by global batch and sequence length, often 2048 or 4096 tokens. Global batch equals micro-batch per GPU times gradient accumulation times the data-parallel degree. Increasing accumulation raises tokens per step without increasing activation memory as sharply as micro-batch. Tensor and pipeline degrees consume GPUs, reduce data-parallelism, and can increase pipeline bubble overhead when stages are imbalanced.

Compute, utilization, and step time realism

The compute model scales with parameters, tokens per step, and a FLOPs-per-token constant, then applies a utilization factor. Sustained utilization commonly lands between 25% and 55%, depending on kernels, precision, and input pipeline health. The extra FLOPs multiplier, typically 1.05 to 1.30, covers checkpointing and framework overhead. If you have logs, measured step time is the most reliable input.

Network effects and gradient synchronization

For data-parallel training, gradient synchronization can be the limiting factor once compute is efficient. The estimator uses a ring all-reduce approximation based on gradient size, effective bandwidth, protocol efficiency, and a fixed latency term. Interconnect rates of 100 to 400 Gbps often yield 55% to 80% effective efficiency under contention. For BF16 or FP16, gradient bytes per parameter are frequently near 2.

Cost, energy, and capacity decisions

After wall time is estimated, cost and energy become direct multiplications that support approvals. Multiply GPU-hour rate by total GPUs and training hours for compute spend. Multiply average GPU watts, often 300 to 700, by GPUs and hours to estimate kWh, then apply electricity pricing. Compare scenarios: higher utilization or better networking can shorten runs without adding hardware.

Use tokens per day to validate nightly capacity, and track compute versus comm time to prioritize kernel tuning or interconnect upgrades first safely.

FAQs

What does the calculator estimate?

It estimates tokens per second, tokens per day, step time, and optional total wall time from your token target. It also computes rough compute cost and energy use when you provide rates and power draw.

When should I use measured step time?

Use it when you have reliable logs for optimizer step duration. Measured mode bypasses compute and network modeling, so it captures framework overhead, input stalls, and real scaling behavior.

How do I set utilization realistically?

Start with 30% to 45% for new stacks. Increase it after profiling shows stable kernels and data feeding. If communication time is large, higher utilization alone will not improve throughput.

What is pipeline bubble percentage?

It is an efficiency penalty that accounts for idle time between pipeline stages. Higher bubbles raise effective step time. Keep it low with balanced stage placement, enough micro-batches, and consistent activation checkpointing.

Why does throughput drop when I increase TP or PP?

TP and PP consume GPUs for model splitting, which reduces the data-parallel degree. Lower DP means fewer samples processed per step, and additional synchronization or bubbles can increase step time.

How are CSV and PDF reports produced?

After you submit inputs, the page stores results for export. CSV contains inputs, derived values, and outputs for spreadsheets. PDF provides a one-page summary for sharing during reviews and capacity planning.