Calculator Inputs
The page uses a single vertical flow; the calculator inputs adapt to three columns on large screens, two on medium screens, and one on mobile.
Example Data Table
Use this sample set to test different training plans and compare how power, warmup, and total decay length influence convergence.
| Run | Initial Rate | End Rate | Power | Warmup Steps | Decay Steps | Batch Size | Epochs |
|---|---|---|---|---|---|---|---|
| Transformer Base | 0.001000 | 0.000050 | 2.0 | 400 | 6000 | 64 | 10 |
| Vision Fine-Tune | 0.000500 | 0.000010 | 1.5 | 200 | 2500 | 32 | 8 |
| LoRA Adaptation | 0.000300 | 0.000020 | 1.0 | 100 | 1800 | 16 | 6 |
| Classifier Refresh | 0.002000 | 0.000100 | 3.0 | 150 | 1200 | 128 | 5 |
| Sequence Tagger | 0.000800 | 0.000030 | 2.5 | 300 | 4200 | 48 | 12 |
Formula Used
Interpretation: The schedule starts at a warmup rate, rises linearly to the initial rate, then decays toward the end rate following a polynomial curve. A power above 1 front-loads the reduction, dropping quickly at first and flattening into a long low-rate tail, while a power below 1 keeps the rate higher for longer and saves the steepest drop for the final steps.
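The curve described above can be sketched as follows. This is an illustrative implementation, not the calculator's internal code; the function and parameter names (including the optional `warmup_start` floor for the ramp) are assumptions.

```python
def polynomial_decay_lr(step, init_rate, end_rate, power,
                        warmup_steps, decay_steps, warmup_start=0.0):
    """Linear warmup to init_rate, then polynomial decay toward end_rate."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from the warmup starting rate up to the initial rate.
        return warmup_start + (init_rate - warmup_start) * step / warmup_steps
    # Clamp so the rate holds at end_rate once the decay window is exhausted.
    t = min(step - warmup_steps, decay_steps)
    remaining = 1.0 - t / decay_steps
    return end_rate + (init_rate - end_rate) * remaining ** power
```

With the Transformer Base row from the table (0.001 → 0.00005, power 2.0, warmup 400, decay 6000), this sketch yields 0.001 at step 400 and reaches the 0.00005 floor at step 6400.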
Cycling option: When enabled, the effective decay window expands to the next multiple of the base decay length once the post-warmup step exceeds it. This creates a longer tail and can raise the rate again after crossing a completed decay boundary.
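As a sketch, the cycling behavior amounts to growing the decay window to the next multiple of the base decay length; this mirrors the common `cycle=True` convention in polynomial-decay implementations, and the calculator's exact handling (including the warmup ramp, assumed here to start from 0) may differ.

```python
import math

def polynomial_decay_lr_cycled(step, init_rate, end_rate, power,
                               warmup_steps, decay_steps):
    """Polynomial decay whose window expands to the next multiple of decay_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return init_rate * step / warmup_steps  # assumes warmup ramps from 0
    t = step - warmup_steps
    # Smallest multiple of decay_steps that covers the post-warmup step.
    window = decay_steps * max(1, math.ceil(t / decay_steps))
    remaining = 1.0 - t / window
    return end_rate + (init_rate - end_rate) * remaining ** power
```

Just past a completed window (for example, one step after `decay_steps` with no warmup), the window doubles, the relative position falls to roughly 0.5, and the rate jumps back above the floor.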
How to Use This Calculator
- Enter the initial learning rate, end learning rate, and polynomial power.
- Set the current optimizer step you want to evaluate.
- Choose manual decay steps or derive them from dataset size, batch size, epochs, and gradient accumulation.
- Add warmup values if your training plan ramps the rate before decay.
- Enable cycling only when you intentionally want an extended decay window.
- Submit the form to view the calculated rate, milestone table, and Plotly chart.
- Use CSV to export sampled schedule points or PDF to save the report.
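For the derived-steps option above, the total number of optimizer updates can be estimated roughly as below. This is a sketch assuming ceiling rounding at each stage; the calculator's exact rounding may differ.

```python
import math

def derived_decay_steps(dataset_size, batch_size, epochs, grad_accum=1):
    """Estimate total optimizer updates from the training plan."""
    # One pass over the data, rounded up to whole batches.
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    # Weights update once per accumulated group of batches.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs
```

For example, 10,000 samples at batch size 32 over 8 epochs gives 313 × 8 = 2,504 updates; with an accumulation factor of 4 this drops to 79 × 8 = 632.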
FAQs
1) What does polynomial decay control in training?
It controls how the learning rate decreases over time. Instead of dropping linearly, the curve follows a power term, which changes how aggressively optimization slows near the end of training.
2) When should I use a larger power value?
Use a larger power when you want a fast reduction early in training followed by a long, gentle tail near the end rate; the extended low-rate tail gives the model more steps to fine-tune near convergence. If you instead want the rate to stay higher for longer and fall sharply late, choose a power below 1.
3) Why include warmup steps?
Warmup reduces instability at the beginning of training. It gradually moves the rate from a lower starting value to the main rate, which is especially useful for transformers and large-batch runs.
4) What is the difference between manual and derived decay steps?
Manual steps let you define the schedule directly. Derived steps estimate total optimizer updates from dataset size, batch size, epochs, and gradient accumulation for a training-plan-based schedule.
5) Does the calculator support gradient accumulation?
Yes. Derived step planning divides the number of batches per epoch by the accumulation factor, so the estimated number of optimizer updates better reflects how often weights actually change.
6) Why can cycling raise the learning rate again?
Cycling expands the effective decay window to the next multiple after the schedule length is exceeded. That resets the relative position inside a larger window and can move the rate above the floor again.
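A worked numeric example of that boundary crossing, using illustrative values (power 1, no warmup, base decay length 1000) and the window-expansion convention described in the Formula Used section:

```python
import math

init, end, power, decay = 1e-3, 5e-5, 1.0, 1000  # illustrative settings, no warmup

def cycled_lr(step):
    # Expand the window to the smallest multiple of `decay` covering `step`.
    window = decay * max(1, math.ceil(step / decay))
    return end + (init - end) * (1.0 - step / window) ** power

print(cycled_lr(1000))  # at the boundary: the 5e-05 floor
print(cycled_lr(1001))  # window doubles to 2000: rate jumps to ~0.000525
```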
7) What should the end learning rate be?
It depends on your model and optimizer. A smaller end rate usually supports gentler convergence, while a higher floor can preserve adaptation if you still expect useful learning late in training.
8) What do the CSV and PDF exports include?
The CSV export contains sampled schedule points for plotting or analysis. The PDF export summarizes the chosen settings, current result, average sampled rate, and milestone values for reporting.