Polynomial Decay Rate Calculator

Plan smooth learning-rate decay for modern training workloads. Review schedules, milestones, and projections, then export the results. Balance exploration and stability at every optimization step.

Calculator Inputs

This page uses a single vertical flow, while the calculator inputs adapt to three columns on large screens, two on medium screens, and one on mobile.

Initial learning rate: starting rate before decay begins.
End learning rate: floor value after polynomial decay finishes.
Power: values above 1 steepen the early drop and flatten the curve near the floor; values below 1 hold the rate higher before a sharper late drop.
Current step: step used to evaluate the current rate.
Decay steps: used when the step source is manual.
Step source: use the dataset settings below to derive total optimizer updates.
Warmup steps: linear warmup before polynomial decay starts.
Warmup start rate: learning rate at step zero during warmup.
Cycling: extends the decay window when the post-warmup step exceeds it.
Dataset size: total training examples available.
Batch size: examples processed in each batch.
Epochs: training passes used for derived step planning.
Gradient accumulation: optimizer updates happen after this many batches.
Sample count: controls schedule detail in the graph and CSV export.

Example Data Table

Use this sample set to test different training plans and compare how power, warmup, and total decay length influence convergence.

| Run | Initial Rate | End Rate | Power | Warmup Steps | Decay Steps | Batch Size | Epochs |
|---|---|---|---|---|---|---|---|
| Transformer Base | 0.001000 | 0.000050 | 2.0 | 400 | 6000 | 64 | 10 |
| Vision Fine-Tune | 0.000500 | 0.000010 | 1.5 | 200 | 2500 | 32 | 8 |
| LoRA Adaptation | 0.000300 | 0.000020 | 1.0 | 100 | 1800 | 16 | 6 |
| Classifier Refresh | 0.002000 | 0.000100 | 3.0 | 150 | 1200 | 128 | 5 |
| Sequence Tagger | 0.000800 | 0.000030 | 2.5 | 300 | 4200 | 48 | 12 |

Formula Used

Warmup:

LR(step) = warmup_start_lr + (initial_lr - warmup_start_lr) × (step / warmup_steps)

Polynomial decay after warmup:

LR(step) = end_lr + (initial_lr - end_lr) × (1 - clipped_step / effective_decay_steps)^power

Where:

post_warmup_step = max(0, current_step - warmup_steps)
clipped_step = min(post_warmup_step, effective_decay_steps)

Interpretation: The schedule starts at a warmup rate, rises linearly to the initial rate, then decays toward the end rate following a polynomial curve. Because the decay term is (1 - progress)^power, a power above 1 drops the rate quickly after warmup and flattens as it approaches the floor, while a power below 1 holds the rate higher for longer and concentrates the steepest drop near the end.
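The warmup and decay formulas above can be sketched as a single Python function. Parameter names follow the formulas on this page; the calculator's internal implementation may differ, and cycling is omitted here (the effective decay window is just the base decay length):

```python
def poly_decay_lr(step, initial_lr, end_lr, power,
                  warmup_steps, decay_steps, warmup_start_lr=0.0):
    """Piecewise schedule: linear warmup, then polynomial decay to end_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from warmup_start_lr up to initial_lr.
        return warmup_start_lr + (initial_lr - warmup_start_lr) * (step / warmup_steps)
    post_warmup_step = max(0, step - warmup_steps)
    # Clip so the rate stays at the floor once decay finishes.
    clipped_step = min(post_warmup_step, decay_steps)
    return end_lr + (initial_lr - end_lr) * (1 - clipped_step / decay_steps) ** power
```

With the Transformer Base row from the table (initial 0.001, end 0.00005, power 2.0, 400 warmup steps, 6000 decay steps), the function returns the initial rate at step 400 and the floor at any step past 6400.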

Cycling option: When enabled, the effective decay window expands to the next multiple of the base decay length once the post-warmup step exceeds it. This creates a longer tail and can raise the rate again after crossing a completed decay boundary.
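The window expansion described above can be sketched as follows; this is an assumed reading of the cycling rule (round the window up to the next multiple of the base decay length), not the tool's verbatim code:

```python
import math

def effective_decay_steps(post_warmup_step, base_decay_steps):
    """Expand the decay window to the next multiple of the base length
    once the post-warmup step passes it (cycling enabled)."""
    if post_warmup_step <= base_decay_steps:
        return base_decay_steps
    cycles = math.ceil(post_warmup_step / base_decay_steps)
    return cycles * base_decay_steps
```

For example, with a base length of 6000, step 7000 lands in a 12000-step window, so the relative decay position resets to 7000/12000 and the computed rate rises above the floor again.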

How to Use This Calculator

  1. Enter the initial learning rate, end learning rate, and polynomial power.
  2. Set the current optimizer step you want to evaluate.
  3. Choose manual decay steps or derive them from dataset size, batch size, epochs, and gradient accumulation.
  4. Add warmup values if your training plan ramps the rate before decay.
  5. Enable cycling only when you intentionally want an extended decay window.
  6. Submit the form to view the calculated rate, milestone table, and Plotly chart.
  7. Use CSV to export sampled schedule points or PDF to save the report.

FAQs

1) What does polynomial decay control in training?

It controls how the learning rate decreases over time. Instead of dropping linearly, the curve follows a power term, which changes how aggressively optimization slows near the end of training.

2) When should I use a larger power value?

Use a larger power when you want the rate to fall quickly after warmup and then flatten, so most training runs near the end rate. If you instead want the model to keep exploring at a higher rate before a sharp late drop, choose a power below 1.
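A quick numeric check of how the power term shapes the curve (a hypothetical helper, not part of the calculator): the fraction of the initial-to-end span still applied at a given decay progress is (1 - progress)^power.

```python
def remaining_fraction(progress, power):
    """Fraction of the (initial_lr - end_lr) span still applied
    at a given decay progress in [0, 1]."""
    return (1.0 - progress) ** power

# Halfway through decay, power 2.0 has already shed most of the span,
# while power 0.5 still keeps about 71% of it.
high_power = remaining_fraction(0.5, 2.0)   # 0.25
low_power = remaining_fraction(0.5, 0.5)    # ~0.707
```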

3) Why include warmup steps?

Warmup reduces instability at the beginning of training. It gradually moves the rate from a lower starting value to the main rate, which is especially useful for transformers and large-batch runs.

4) What is the difference between manual and derived decay steps?

Manual steps let you define the schedule directly. Derived steps estimate total optimizer updates from dataset size, batch size, epochs, and gradient accumulation for a training-plan-based schedule.

5) Does the calculator support gradient accumulation?

Yes. Derived step planning divides effective batches by the accumulation factor, so the estimated number of optimizer updates better reflects how often weights actually change.
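The derived-step estimate in this answer can be sketched as below; the exact rounding the calculator applies is an assumption here:

```python
import math

def derived_decay_steps(dataset_size, batch_size, epochs, grad_accum=1):
    """Estimate total optimizer updates from a training plan.
    Weights update once per grad_accum batches."""
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs
```

For example, 10,000 examples at batch size 64 give 157 batches per epoch; with 4-step accumulation that is 40 optimizer updates per epoch, or 400 over 10 epochs.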

6) Why can cycling raise the learning rate again?

Cycling expands the effective decay window to the next multiple after the schedule length is exceeded. That resets the relative position inside a larger window and can move the rate above the floor again.

7) What should the end learning rate be?

It depends on your model and optimizer. A smaller end rate usually supports gentler convergence, while a higher floor can preserve adaptation if you still expect useful learning late in training.

8) What do the CSV and PDF exports include?

The CSV export contains sampled schedule points for plotting or analysis. The PDF export summarizes the chosen settings, current result, average sampled rate, and milestone values for reporting.
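A minimal sketch of how sampled schedule points could be serialized; the column names and number formatting here are assumptions, not the tool's exact output, and `lr_fn` stands in for whatever schedule function is in use:

```python
import csv
import io

def export_schedule_csv(sample_steps, lr_fn):
    """Write (step, learning_rate) rows for the sampled schedule points."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["step", "learning_rate"])
    for step in sample_steps:
        writer.writerow([step, f"{lr_fn(step):.8f}"])
    return buf.getvalue()
```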

Related Calculators

Effective Learning Rate
Batch Size Learning Rate
Inverse Time Decay

Important Note: All calculators on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.