Calculator inputs
The input grid below adapts to large, medium, and mobile screens: the page stays single-column, while the form fields reflow to match screen width.
Plotly graph
The chart compares linear, square-root, and selected scaling across batch sizes. The marker highlights the current effective batch and recommended learning rate.
Example data table
| Scenario | Base LR | Reference Batch | Per-Device Batch | Devices | Accumulation | Effective Global Batch | Rule | Safety % | Recommended LR |
|---|---|---|---|---|---|---|---|---|---|
| Baseline image training | 0.0010 | 256 | 32 | 4 | 2 | 256 | Linear | 90 | 0.00090000 |
| Larger distributed run | 0.0010 | 256 | 64 | 8 | 1 | 512 | Linear | 90 | 0.00180000 |
| Transformer fine-tuning | 0.0005 | 1024 | 16 | 8 | 8 | 1024 | Square-root | 95 | 0.00047500 |
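The Recommended LR column can be reproduced with a short sketch (the function and rule names here are illustrative, not the page's internals):

```python
def recommended_lr(base_lr, reference_batch, per_device_batch, devices,
                   accumulation, rule, safety_percent):
    """Compute the Recommended LR column from the table above."""
    effective = per_device_batch * devices * accumulation
    ratio = effective / reference_batch
    exponent = 1.0 if rule == "linear" else 0.5  # square-root rule
    return base_lr * (ratio ** exponent) * (safety_percent / 100)

# Baseline image training: 0.0010 * (256/256)^1   * 0.90 ≈ 0.00090
print(recommended_lr(0.0010, 256, 32, 4, 2, "linear", 90))
# Larger distributed run: 0.0010 * (512/256)^1   * 0.90 ≈ 0.00180
print(recommended_lr(0.0010, 256, 64, 8, 1, "linear", 90))
# Transformer fine-tuning: 0.0005 * (1024/1024)^0.5 * 0.95 ≈ 0.000475
print(recommended_lr(0.0005, 1024, 16, 8, 8, "sqrt", 95))
```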
Formula used
1. Effective global batch
effective_global_batch = per_device_batch × devices × gradient_accumulation
2. Batch scaling ratio
batch_ratio = effective_global_batch ÷ reference_global_batch
3. Scaled learning rate (exponent = 1 for linear, 0.5 for square-root, or a custom value)
scaled_lr = base_lr × (batch_ratio ^ exponent)
4. Recommended learning rate
recommended_lr = scaled_lr × (safety_percent ÷ 100)
5. Steps and warmup
steps_per_epoch = ceil(dataset_size ÷ effective_global_batch)
total_update_steps = steps_per_epoch × epochs
warmup_steps = total_update_steps × warmup_percent ÷ 100
6. Memory-fit estimate
estimated_max_per_device_batch = floor(usable_memory_mb ÷ sample_memory_mb)
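The formulas above can be combined into one sketch; the function signature and field names are illustrative assumptions, not the calculator's own code:

```python
import math

def plan_run(base_lr, reference_batch, per_device_batch, devices,
             accumulation, exponent, safety_percent,
             dataset_size, epochs, warmup_percent,
             usable_memory_mb, sample_memory_mb):
    # 1. Effective global batch
    effective = per_device_batch * devices * accumulation
    # 2. Batch scaling ratio
    ratio = effective / reference_batch
    # 3. Scaled learning rate (exponent: 1 = linear, 0.5 = square-root)
    scaled_lr = base_lr * ratio ** exponent
    # 4. Recommended learning rate after the safety factor
    recommended = scaled_lr * safety_percent / 100
    # 5. Steps and warmup
    steps_per_epoch = math.ceil(dataset_size / effective)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_percent / 100)
    # 6. Memory-fit estimate (rough: ignores activations and optimizer state)
    max_per_device_batch = usable_memory_mb // sample_memory_mb
    return {
        "effective_global_batch": effective,
        "recommended_lr": recommended,
        "steps_per_epoch": steps_per_epoch,
        "total_update_steps": total_steps,
        "warmup_steps": warmup_steps,
        "max_per_device_batch": max_per_device_batch,
    }

# Larger distributed run from the table, with assumed dataset and memory numbers
plan = plan_run(0.0010, 256, 64, 8, 1, 1.0, 90,
                dataset_size=1_000_000, epochs=3, warmup_percent=5,
                usable_memory_mb=16_000, sample_memory_mb=50)
print(plan)
```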
How to use this calculator
- Enter your known baseline learning rate and the reference global batch where that rate already worked.
- Add your current per-device batch, number of devices, and gradient accumulation to compute the effective global batch.
- Choose a scaling rule. Linear is aggressive, square-root is safer, and custom supports intermediate behavior.
- Set epochs, dataset size, and warmup percent to estimate update counts and warmup steps.
- Enter memory values to get a rough per-device batch fit estimate.
- Press Calculate to place the results above the form, review the suggested range, and compare it on the graph.
- Use the CSV or PDF export buttons to save the current scenario for training notes or experiment tracking.
FAQs
1. Does a bigger batch always justify a bigger learning rate?
Not always. Larger batches reduce gradient noise, but optimizer behavior, augmentation strength, normalization layers, and training objectives can still limit safe increases. Treat scaling rules as starting estimates, not guarantees.
2. When should I use linear scaling?
Linear scaling is common when batch changes are modest, optimization is stable, and the reference setup is trusted. It is often used with SGD or large-batch methods, especially when warmup is included.
3. When is square-root scaling safer?
Square-root scaling is safer when the optimizer is sensitive, the model is already near instability, or batch size jumps are large. It grows learning rate more slowly and often reduces divergence risk.
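To see why the square-root rule is gentler, compare both rules across batch ratios (a base rate of 1e-3 is assumed purely for illustration):

```python
base_lr = 1e-3
for ratio in (1, 2, 4, 8, 16):
    linear = base_lr * ratio          # grows with the ratio itself
    sqrt_rule = base_lr * ratio ** 0.5  # grows with its square root
    print(f"ratio {ratio:>2}: linear {linear:.4f}  sqrt {sqrt_rule:.4f}")
```

At a 16× batch increase, linear scaling multiplies the rate by 16 while the square-root rule multiplies it by only 4, which is why the latter is the safer default for large jumps.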
4. Why include gradient accumulation here?
Gradient accumulation increases the effective global batch without requiring a larger per-device microbatch. That changes update frequency and learning-rate scaling, so it should be included in the batch calculation.
5. Are AdamW and SGD equally tolerant of scaling?
No. SGD often tolerates broader scaling with proper momentum and warmup. AdamW can scale well too, but its stable range is usually narrower, so a smaller search band is sensible.
6. Why is a safety factor useful?
A safety factor dampens the raw scaled estimate before training starts. It helps when memory estimates are rough, data distributions changed, or the new run differs from the reference configuration.
7. How accurate is the memory estimate?
It is only a rough planning estimate. Real usage depends on activations, sequence length, optimizer states, mixed precision, checkpointing, and framework overhead. Always confirm with a small dry run.
8. What should I validate after choosing a rate?
Check early loss, gradient explosions, NaNs, validation accuracy, update smoothness, and throughput. A short sweep around the recommended range usually reveals whether the new setup is too aggressive or too conservative.
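A minimal version of such a sweep might look like this; the ×0.5 to ×2 band, five points, and geometric spacing are assumptions for illustration, not something the calculator emits:

```python
def sweep_band(recommended_lr, points=5, low=0.5, high=2.0):
    """Geometrically spaced trial rates around the recommendation."""
    step = (high / low) ** (1 / (points - 1))
    return [recommended_lr * low * step ** i for i in range(points)]

# Sweep around the 0.0018 recommendation from the example table
for lr in sweep_band(0.0018):
    print(f"{lr:.6f}")
```

Geometric spacing covers the band evenly in log space, which matches how learning-rate sensitivity usually behaves.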