Calculator inputs
Example data table
| Scenario | Global Batch | Scaling Rule | Schedule | Optimizer | Effective LR |
|---|---|---|---|---|---|
| Transformer fine-tune | 512 | Linear | Cosine | AdamW | 0.00072981 |
| Vision pretraining | 1,024 | Sqrt | Linear | AdamW | 0.00253825 |
| CNN with SGD | 512 | Linear | Step | SGD | 0.20000000 |
These examples show how global batch, scheduler choice, and optimizer correction can shift the final effective rate.
Formula used
1) Global batch size
Global Batch = Micro Batch Size × Gradient Accumulation Steps × Devices
2) Batch scaling multiplier
Linear: (Global Batch ÷ Reference Batch)
Square root: (Global Batch ÷ Reference Batch)^0.5
Custom: (Global Batch ÷ Reference Batch)^(Custom Exponent)
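Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration of the formulas above, not the calculator's actual code. The function names, the rule labels ("linear", "sqrt", "custom"), and the sample numbers are assumptions chosen for the example.

```python
def global_batch(micro_batch: int, accum_steps: int, devices: int) -> int:
    """Global Batch = Micro Batch Size x Gradient Accumulation Steps x Devices."""
    return micro_batch * accum_steps * devices


def batch_scale(global_bs: int, reference_bs: int,
                rule: str = "linear", exponent: float = 1.0) -> float:
    """Batch scaling multiplier for the chosen rule (rule names are assumed)."""
    ratio = global_bs / reference_bs
    if rule == "linear":
        return ratio
    if rule == "sqrt":
        return ratio ** 0.5
    if rule == "custom":
        return ratio ** exponent
    raise ValueError(f"unknown scaling rule: {rule!r}")


# Hypothetical setup: 32 samples per device, 4 accumulation steps, 8 devices.
gb = global_batch(32, 4, 8)
print(gb)                              # 1024
print(batch_scale(gb, 256, "linear"))  # 4.0
print(batch_scale(gb, 256, "sqrt"))    # 2.0
```

With a reference batch of 256, the same global batch of 1,024 gives a 4× multiplier under linear scaling but only 2× under square-root scaling, which is why the choice of rule matters for large distributed runs.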
3) Warmup factor
Warmup Factor = Current Step ÷ Warmup Steps, while Current Step ≤ Warmup Steps. Otherwise it becomes 1.
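A small Python sketch of this rule follows. The function name is hypothetical, and treating a warmup length of zero as "no warmup" is an assumption added to avoid division by zero; the text above does not specify that case.

```python
def warmup_factor(current_step: int, warmup_steps: int) -> float:
    """Linear ramp from 0 to 1 over the warmup period, then constant 1."""
    # Assumption: zero (or negative) warmup steps means warmup is disabled.
    if warmup_steps <= 0 or current_step >= warmup_steps:
        return 1.0
    return current_step / warmup_steps


print(warmup_factor(250, 1000))   # 0.25
print(warmup_factor(1000, 1000))  # 1.0
print(warmup_factor(5000, 1000))  # 1.0
```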
4) Schedule factor
The schedule factor depends on the selected decay method. Cosine and linear use normalized progress after warmup. Exponential and step schedules use gamma and step size.
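The exact expressions are not spelled out here, so the sketch below uses common textbook forms of each schedule as an assumption. The function name, parameter names, defaults, and the way the `floor` value is blended into the cosine and linear schedules are all illustrative choices and may differ from what the calculator does. In particular, the exponential and step variants here count from step 0 rather than from the end of warmup.

```python
import math


def schedule_factor(method: str, step: int, total_steps: int,
                    warmup_steps: int = 0, gamma: float = 0.1,
                    step_size: int = 1000, floor: float = 0.0) -> float:
    """Return the decay multiplier for the chosen schedule (assumed forms)."""
    # Normalized progress through the post-warmup phase, clamped to [0, 1].
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min(max((step - warmup_steps) / decay_steps, 0.0), 1.0)

    if method == "cosine":
        shape = 0.5 * (1.0 + math.cos(math.pi * progress))
        return floor + (1.0 - floor) * shape
    if method == "linear":
        return floor + (1.0 - floor) * (1.0 - progress)
    if method == "exponential":
        # Smooth decay: multiply by gamma once per step_size steps.
        return gamma ** (step / step_size)
    if method == "step":
        # Staircase decay: drop by gamma at every multiple of step_size.
        return gamma ** (step // step_size)
    raise ValueError(f"unknown schedule: {method!r}")


# Halfway through the post-warmup phase of a 1,000-step run with 100 warmup steps.
print(round(schedule_factor("cosine", 550, 1000, warmup_steps=100), 4))  # 0.5
print(round(schedule_factor("linear", 550, 1000, warmup_steps=100), 4))  # 0.5
```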
5) Optimizer multiplier
For Adam-family optimizers with bias correction:
Optimizer Multiplier = √(1 − β₂^t) ÷ (1 − β₁^t), where t is the current optimizer step (starting at 1).
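The multiplier can be computed directly, as in the sketch below. The function name is hypothetical, and the defaults β1 = 0.9 and β2 = 0.999 are the values commonly used with Adam, assumed here for illustration. For optimizers without bias correction, such as plain SGD, the multiplier would presumably be 1.

```python
import math


def adam_bias_multiplier(step: int, beta1: float = 0.9,
                         beta2: float = 0.999) -> float:
    """sqrt(1 - beta2**t) / (1 - beta1**t) for optimizer step t >= 1."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return math.sqrt(1.0 - beta2 ** step) / (1.0 - beta1 ** step)


print(round(adam_bias_multiplier(1), 4))  # 0.3162
for t in (10, 100, 1000, 100000):
    # The multiplier approaches 1 as both beta**t terms vanish.
    print(t, round(adam_bias_multiplier(t), 4))
```

With these defaults the multiplier starts well below 1 and approaches 1 as training proceeds, so it mainly affects the earliest steps.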
6) Final effective learning rate
Effective LR = Base LR × Batch Scale × Warmup Factor × Schedule Factor × Optimizer Multiplier
In practice, teams define “effective learning rate” slightly differently. This calculator uses a transparent, training-oriented formulation that includes distributed batch growth and optional optimizer correction.
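As a sketch of how the five factors combine, the snippet below multiplies them exactly as in step 6. The function name and the input values are hypothetical and are not taken from the example table.

```python
def effective_lr(base_lr: float, batch_scale: float, warmup_factor: float,
                 schedule_factor: float,
                 optimizer_multiplier: float = 1.0) -> float:
    """Effective LR = Base LR x Batch Scale x Warmup x Schedule x Optimizer."""
    return (base_lr * batch_scale * warmup_factor
            * schedule_factor * optimizer_multiplier)


# Hypothetical SGD run: base LR 0.1, global batch twice the reference batch
# (linear scaling), warmup finished, schedule halfway decayed, and no
# optimizer correction (multiplier left at 1.0).
lr = effective_lr(base_lr=0.1, batch_scale=2.0,
                  warmup_factor=1.0, schedule_factor=0.5)
print(lr)  # 0.1
```

Keeping each factor as a separate argument makes it easy to see which term is responsible for a change in the final value.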
How to use this calculator
- Enter the base learning rate chosen for your reference batch.
- Provide micro batch size, accumulation steps, and device count.
- Select the reference batch used to justify the original base rate.
- Choose a scaling rule that matches your training strategy.
- Set current step, total steps, warmup steps, and scheduler parameters.
- Select the optimizer family. For Adam-style runs, keep bias correction enabled if you want the early-step correction reflected in the result.
- Press the calculate button to show the result above the form.
- Review the summary table, compare the chart, then export CSV or PDF.
FAQs
1. What does effective learning rate mean here?
It is the learning rate after applying batch scaling, warmup, scheduler decay, and optional Adam-family bias correction. This makes the reported value closer to the update magnitude seen during actual training.
2. Why does global batch size matter?
When the total batch size increases, many training recipes raise the learning rate to keep optimization dynamics comparable to the reference run. The calculator lets you inspect that relationship using linear, square-root, or custom scaling.
3. When should I use linear scaling?
Linear scaling is common when batch size changes substantially and the model remains stable under larger updates. It is often used in distributed training, especially with strong warmup and careful monitoring.
4. Why include warmup in the calculation?
Warmup reduces instability at early steps by ramping the learning rate gradually. Without it, large effective updates can cause divergence, especially in transformer training and large distributed runs.
5. What is the optimizer multiplier?
For Adam, AdamW, and Nadam, bias correction changes early-step behavior. The multiplier approximates that effect, so your displayed effective rate better reflects the update scale during startup.
6. Should I compare scheduled LR or final effective LR?
Use scheduled LR when comparing scheduler behavior alone. Use final effective LR when you want a broader view that also includes batch scaling and optimizer correction.
7. Does this replace validation-based tuning?
No. It provides a structured estimate, not a guarantee of convergence quality. Final learning rate choices should still be validated with loss curves, gradient stability, and downstream performance.
8. Which settings are most sensitive?
Base learning rate, global batch size, scaling rule, and warmup length usually drive the largest changes. Scheduler floor and optimizer correction mainly refine how updates behave across training.