Batch Size Learning Rate Calculator

Plan global batch, steps, and scaled learning rates. Review memory fit, optimizer hints, and schedules. Export results and compare ranges before costly training runs.

Calculator inputs

The input grid below adapts to large, medium, and mobile screens. The page stays single-column, while the form fields adjust to screen width.


Plotly graph

The chart compares linear, square-root, and selected scaling across batch sizes. The marker highlights the current effective batch and recommended learning rate.
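The data behind the chart can be reproduced with a few lines. This is an illustrative sketch, not the page's actual plotting code; the names and batch values are assumptions.

```python
# Scaling curves behind the comparison chart (illustrative values).
base_lr = 1e-3          # assumed baseline learning rate
reference_batch = 256   # assumed reference global batch

def scaled_lr(batch, exponent):
    """Scale base_lr by (batch / reference_batch) ** exponent."""
    return base_lr * (batch / reference_batch) ** exponent

batches = [64, 128, 256, 512, 1024]
linear_curve = [scaled_lr(b, 1.0) for b in batches]  # linear rule, exponent 1.0
sqrt_curve = [scaled_lr(b, 0.5) for b in batches]    # square-root rule, exponent 0.5

for b, lin, sq in zip(batches, linear_curve, sqrt_curve):
    print(f"batch={b:5d}  linear={lin:.6f}  sqrt={sq:.6f}")
```

Both curves pass through the base rate at the reference batch; the linear curve then climbs twice as fast as the square-root curve at a 4x batch increase.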

Example data table

Scenario                | Base LR | Reference Batch | Per-Device Batch | Devices | Accumulation | Effective Global Batch | Rule        | Safety % | Recommended LR
Baseline image training | 0.0010  | 256             | 32               | 4       | 2            | 256                    | Linear      | 90       | 0.00090000
Larger distributed run  | 0.0010  | 256             | 64               | 8       | 1            | 512                    | Linear      | 90       | 0.00180000
Transformer fine-tuning | 0.0005  | 1024            | 16               | 8       | 8            | 1024                   | Square-root | 95       | 0.00047500

Formula used

1. Effective global batch

effective_global_batch = per_device_batch × devices × gradient_accumulation

2. Batch scaling ratio

batch_ratio = effective_global_batch ÷ reference_global_batch

3. Scaled learning rate

scaled_lr = base_lr × (batch_ratio ^ exponent)

4. Recommended learning rate

recommended_lr = scaled_lr × (safety_percent ÷ 100)

5. Steps and warmup

steps_per_epoch = ceil(dataset_size ÷ effective_global_batch)
total_update_steps = steps_per_epoch × epochs
warmup_steps = total_update_steps × warmup_percent ÷ 100

6. Memory-fit estimate

estimated_max_per_device_batch = floor(usable_memory_mb ÷ sample_memory_mb)

Linear scaling uses exponent 1.0, square-root scaling uses 0.5, and the custom rule uses your chosen exponent.
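The scaling formulas above can be combined into a short script. This is a minimal sketch of the same arithmetic, not the calculator's actual implementation; the function name and signature are illustrative. The example call mirrors the baseline row of the table.

```python
# Exponents for the built-in rules; a custom rule supplies its own exponent.
EXPONENTS = {"linear": 1.0, "square-root": 0.5}

def recommended_lr(base_lr, reference_batch, per_device_batch, devices,
                   accumulation, rule="linear", safety_percent=100.0,
                   custom_exponent=None):
    """Apply formulas 1-4: effective batch, ratio, scaled LR, safety factor."""
    effective = per_device_batch * devices * accumulation      # formula 1
    ratio = effective / reference_batch                        # formula 2
    exponent = custom_exponent if custom_exponent is not None else EXPONENTS[rule]
    scaled = base_lr * ratio ** exponent                       # formula 3
    return scaled * safety_percent / 100.0                     # formula 4

# Baseline image training row: 32 x 4 x 2 = 256, ratio 1.0, 90% safety
print(recommended_lr(0.0010, 256, 32, 4, 2, "linear", 90))  # close to 0.0009
```

The same call reproduces the other table rows: the larger distributed run (ratio 2, linear) lands near 0.0018, and the square-root fine-tuning row near 0.000475.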

How to use this calculator

  1. Enter your known baseline learning rate and the reference global batch where that rate already worked.
  2. Add your current per-device batch, number of devices, and gradient accumulation to compute the effective global batch.
  3. Choose a scaling rule. Linear is aggressive, square-root is safer, and custom supports intermediate behavior.
  4. Set epochs, dataset size, and warmup percent to estimate update counts and warmup steps.
  5. Enter memory values to get a rough per-device batch fit estimate.
  6. Press Calculate to place the results above the form, review the suggested range, and compare it on the graph.
  7. Use the CSV or PDF export buttons to save the current scenario for training notes or experiment tracking.
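Steps 4 and 5 above can also be sketched in code. The function names, the rounding of warmup steps, and the memory numbers below are assumptions for illustration, not the page's exact behavior.

```python
import math

def plan_schedule(dataset_size, effective_global_batch, epochs, warmup_percent):
    """Steps per epoch, total updates, and warmup steps (formula 5)."""
    steps_per_epoch = math.ceil(dataset_size / effective_global_batch)
    total_update_steps = steps_per_epoch * epochs
    # Assumed rounding; the page's formula leaves warmup_steps fractional.
    warmup_steps = round(total_update_steps * warmup_percent / 100)
    return steps_per_epoch, total_update_steps, warmup_steps

def max_per_device_batch(usable_memory_mb, sample_memory_mb):
    """Rough memory-fit estimate (formula 6); confirm with a dry run."""
    return math.floor(usable_memory_mb / sample_memory_mb)

# Hypothetical run: 1.28M samples, global batch 256, 90 epochs, 5% warmup
print(plan_schedule(1_280_000, 256, 90, 5))   # (5000, 450000, 22500)
print(max_per_device_batch(14_000, 350))      # 40
```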

FAQs

1. Does a bigger batch always justify a bigger learning rate?

Not always. Larger batches reduce gradient noise, but optimizer behavior, augmentation strength, normalization layers, and training objectives can still limit safe increases. Treat scaling rules as starting estimates, not guarantees.

2. When should I use linear scaling?

Linear scaling is common when batch changes are modest, optimization is stable, and the reference setup is trusted. It is often used with SGD or large-batch methods, especially when warmup is included.

3. When is square-root scaling safer?

Square-root scaling is safer when the optimizer is sensitive, the model is already near instability, or batch size jumps are large. It grows learning rate more slowly and often reduces divergence risk.
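A quick numeric comparison makes the slower growth concrete. The values here are hypothetical: a 4x batch jump from an assumed base rate of 0.001.

```python
base_lr = 1e-3   # assumed baseline rate
ratio = 4.0      # e.g. batch grown from 256 to 1024

linear_lr = base_lr * ratio ** 1.0   # 4x the base rate
sqrt_lr = base_lr * ratio ** 0.5     # only 2x the base rate
print(linear_lr, sqrt_lr)
```

The same 4x batch increase asks for a 4x rate under linear scaling but only 2x under square-root scaling, which is why the latter is the more conservative default for large jumps.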

4. Why include gradient accumulation here?

Gradient accumulation increases the effective global batch without requiring a larger per-device microbatch. That changes update frequency and learning-rate scaling, so it should be included in the batch calculation.
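The equivalence is easy to check with formula 1. The two configurations below are hypothetical examples, not recommendations.

```python
def effective_batch(per_device, devices, accumulation):
    """Formula 1: effective global batch."""
    return per_device * devices * accumulation

# Two configs with the same effective global batch: accumulation trades
# per-device memory for more forward passes per optimizer update.
no_accum = effective_batch(64, 8, 1)    # large microbatch, no accumulation
with_accum = effective_batch(16, 8, 4)  # quarter microbatch, 4-step accumulation
assert no_accum == with_accum == 512    # identical input to the scaling rule
```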

5. Are AdamW and SGD equally tolerant of scaling?

No. SGD often tolerates broader scaling with proper momentum and warmup. AdamW can scale well too, but its stable range is usually narrower, so a smaller search band is sensible.

6. Why is a safety factor useful?

A safety factor dampens the raw scaled estimate before training starts. It helps when memory estimates are rough, data distributions changed, or the new run differs from the reference configuration.

7. How accurate is the memory estimate?

It is only a rough planning estimate. Real usage depends on activations, sequence length, optimizer states, mixed precision, checkpointing, and framework overhead. Always confirm with a small dry run.

8. What should I validate after choosing a rate?

Check early loss, gradient explosions, NaNs, validation accuracy, update smoothness, and throughput. A short sweep around the recommended range usually reveals whether the new setup is too aggressive or too conservative.

Related Calculators

Effective Learning Rate · Polynomial Decay Rate · Inverse Time Decay

Important Note: All the calculators on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.