Batch Size Learning Rate Calculator

Plan global batch, steps, and scaled learning rates. Review memory fit, optimizer hints, and schedules. Export results and compare ranges before costly training runs.

Calculator inputs

The input grid below adapts to large, medium, and mobile screens. The page stays single-column, while the form fields adjust to screen width.


Plotly graph

The chart compares linear, square-root, and selected scaling across batch sizes. The marker highlights the current effective batch and recommended learning rate.
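The data behind the chart can be reproduced with a few lines. This is an illustrative sketch, not the page's actual plotting code; the names and batch values are assumptions.

```python
# Scaling curves behind the comparison chart (illustrative values).
base_lr = 1e-3          # assumed baseline learning rate
reference_batch = 256   # assumed reference global batch

def scaled_lr(batch, exponent):
    """Scale base_lr by (batch / reference_batch) ** exponent."""
    return base_lr * (batch / reference_batch) ** exponent

batches = [64, 128, 256, 512, 1024]
linear_curve = [scaled_lr(b, 1.0) for b in batches]  # linear rule, exponent 1.0
sqrt_curve = [scaled_lr(b, 0.5) for b in batches]    # square-root rule, exponent 0.5

for b, lin, sq in zip(batches, linear_curve, sqrt_curve):
    print(f"batch={b:5d}  linear={lin:.6f}  sqrt={sq:.6f}")
```

Both curves pass through the base rate at the reference batch; the linear curve then climbs twice as fast as the square-root curve at a 4x batch increase.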

Example data table

Scenario                | Base LR | Reference Batch | Per-Device Batch | Devices | Accumulation | Effective Global Batch | Rule        | Safety % | Recommended LR
Baseline image training | 0.0010  | 256             | 32               | 4       | 2            | 256                    | Linear      | 90       | 0.00090000
Larger distributed run  | 0.0010  | 256             | 64               | 8       | 1            | 512                    | Linear      | 90       | 0.00180000
Transformer fine-tuning | 0.0005  | 1024            | 16               | 8       | 8            | 1024                   | Square-root | 95       | 0.00047500

Formula used

1. Effective global batch

effective_global_batch = per_device_batch × devices × gradient_accumulation

2. Batch scaling ratio

batch_ratio = effective_global_batch ÷ reference_global_batch

3. Scaled learning rate

scaled_lr = base_lr × (batch_ratio ^ exponent)

4. Recommended learning rate

recommended_lr = scaled_lr × (safety_percent ÷ 100)

5. Steps and warmup

steps_per_epoch = ceil(dataset_size ÷ effective_global_batch)
total_update_steps = steps_per_epoch × epochs
warmup_steps = total_update_steps × warmup_percent ÷ 100

6. Memory-fit estimate

estimated_max_per_device_batch = floor(usable_memory_mb ÷ sample_memory_mb)

Linear scaling uses exponent 1.0, square-root scaling uses 0.5, and the custom rule uses your chosen exponent.
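The scaling formulas above can be combined into a short script. This is a minimal sketch of the same arithmetic, not the calculator's actual implementation; the function name and signature are illustrative. The example call mirrors the baseline row of the table.

```python
# Exponents for the built-in rules; a custom rule supplies its own exponent.
EXPONENTS = {"linear": 1.0, "square-root": 0.5}

def recommended_lr(base_lr, reference_batch, per_device_batch, devices,
                   accumulation, rule="linear", safety_percent=100.0,
                   custom_exponent=None):
    """Apply formulas 1-4: effective batch, ratio, scaled LR, safety factor."""
    effective = per_device_batch * devices * accumulation      # formula 1
    ratio = effective / reference_batch                        # formula 2
    exponent = custom_exponent if custom_exponent is not None else EXPONENTS[rule]
    scaled = base_lr * ratio ** exponent                       # formula 3
    return scaled * safety_percent / 100.0                     # formula 4

# Baseline image training row: 32 x 4 x 2 = 256, ratio 1.0, 90% safety
print(recommended_lr(0.0010, 256, 32, 4, 2, "linear", 90))  # close to 0.0009
```

The same call reproduces the other table rows: the larger distributed run (ratio 2, linear) lands near 0.0018, and the square-root fine-tuning row near 0.000475.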

How to use this calculator

  1. Enter your known baseline learning rate and the reference global batch where that rate already worked.
  2. Add your current per-device batch, number of devices, and gradient accumulation to compute the effective global batch.
  3. Choose a scaling rule. Linear is aggressive, square-root is safer, and custom supports intermediate behavior.
  4. Set epochs, dataset size, and warmup percent to estimate update counts and warmup steps.
  5. Enter memory values to get a rough per-device batch fit estimate.
  6. Press Calculate to place the results above the form, review the suggested range, and compare it on the graph.
  7. Use the CSV or PDF export buttons to save the current scenario for training notes or experiment tracking.
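Steps 4 and 5 above can also be sketched in code. The function names, the rounding of warmup steps, and the memory numbers below are assumptions for illustration, not the page's exact behavior.

```python
import math

def plan_schedule(dataset_size, effective_global_batch, epochs, warmup_percent):
    """Steps per epoch, total updates, and warmup steps (formula 5)."""
    steps_per_epoch = math.ceil(dataset_size / effective_global_batch)
    total_update_steps = steps_per_epoch * epochs
    # Assumed rounding; the page's formula leaves warmup_steps fractional.
    warmup_steps = round(total_update_steps * warmup_percent / 100)
    return steps_per_epoch, total_update_steps, warmup_steps

def max_per_device_batch(usable_memory_mb, sample_memory_mb):
    """Rough memory-fit estimate (formula 6); confirm with a dry run."""
    return math.floor(usable_memory_mb / sample_memory_mb)

# Hypothetical run: 1.28M samples, global batch 256, 90 epochs, 5% warmup
print(plan_schedule(1_280_000, 256, 90, 5))   # (5000, 450000, 22500)
print(max_per_device_batch(14_000, 350))      # 40
```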

FAQs

1. Does a bigger batch always justify a bigger learning rate?

Not always. Larger batches reduce gradient noise, but optimizer behavior, augmentation strength, normalization layers, and training objectives can still limit safe increases. Treat scaling rules as starting estimates, not guarantees.

2. When should I use linear scaling?

Linear scaling is common when batch changes are modest, optimization is stable, and the reference setup is trusted. It is often used with SGD or large-batch methods, especially when warmup is included.

3. When is square-root scaling safer?

Square-root scaling is safer when the optimizer is sensitive, the model is already near instability, or batch size jumps are large. It grows learning rate more slowly and often reduces divergence risk.
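A quick numeric comparison makes the slower growth concrete. The values here are hypothetical: a 4x batch jump from an assumed base rate of 0.001.

```python
base_lr = 1e-3   # assumed baseline rate
ratio = 4.0      # e.g. batch grown from 256 to 1024

linear_lr = base_lr * ratio ** 1.0   # 4x the base rate
sqrt_lr = base_lr * ratio ** 0.5     # only 2x the base rate
print(linear_lr, sqrt_lr)
```

The same 4x batch increase asks for a 4x rate under linear scaling but only 2x under square-root scaling, which is why the latter is the more conservative default for large jumps.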

4. Why include gradient accumulation here?

Gradient accumulation increases the effective global batch without requiring a larger per-device microbatch. That changes update frequency and learning-rate scaling, so it should be included in the batch calculation.
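The equivalence is easy to check with formula 1. The two configurations below are hypothetical examples, not recommendations.

```python
def effective_batch(per_device, devices, accumulation):
    """Formula 1: effective global batch."""
    return per_device * devices * accumulation

# Two configs with the same effective global batch: accumulation trades
# per-device memory for more forward passes per optimizer update.
no_accum = effective_batch(64, 8, 1)    # large microbatch, no accumulation
with_accum = effective_batch(16, 8, 4)  # quarter microbatch, 4-step accumulation
assert no_accum == with_accum == 512    # identical input to the scaling rule
```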

5. Are AdamW and SGD equally tolerant of scaling?

No. SGD often tolerates broader scaling with proper momentum and warmup. AdamW can scale well too, but its stable range is usually narrower, so a smaller search band is sensible.

6. Why is a safety factor useful?

A safety factor dampens the raw scaled estimate before training starts. It helps when memory estimates are rough, data distributions changed, or the new run differs from the reference configuration.

7. How accurate is the memory estimate?

It is only a rough planning estimate. Real usage depends on activations, sequence length, optimizer states, mixed precision, checkpointing, and framework overhead. Always confirm with a small dry run.

8. What should I validate after choosing a rate?

Check early loss, gradient explosions, NaNs, validation accuracy, update smoothness, and throughput. A short sweep around the recommended range usually reveals whether the new setup is too aggressive or too conservative.

Related Calculators

Effective Learning Rate · Polynomial Decay Rate · Inverse Time Decay

Important Note: All the calculators on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.