Calculator Inputs
The page uses a single vertical flow; the calculator inputs adapt to three columns on large screens, two on medium screens, and one on mobile.
Example Data Table
Use this sample set to test different training plans and compare how power, warmup, and total decay length influence convergence.
| Run | Initial Rate | End Rate | Power | Warmup Steps | Decay Steps | Batch Size | Epochs |
|---|---|---|---|---|---|---|---|
| Transformer Base | 0.001000 | 0.000050 | 2.0 | 400 | 6000 | 64 | 10 |
| Vision Fine-Tune | 0.000500 | 0.000010 | 1.5 | 200 | 2500 | 32 | 8 |
| LoRA Adaptation | 0.000300 | 0.000020 | 1.0 | 100 | 1800 | 16 | 6 |
| Classifier Refresh | 0.002000 | 0.000100 | 3.0 | 150 | 1200 | 128 | 5 |
| Sequence Tagger | 0.000800 | 0.000030 | 2.5 | 300 | 4200 | 48 | 12 |
Formula Used
Interpretation: The schedule starts at a warmup rate, rises linearly to the initial rate, then decays toward the end rate following a polynomial curve. A power above 1 front-loads the reduction, dropping quickly at first and flattening into a long low-rate tail, while a power below 1 keeps the rate higher for longer and saves the steepest drop for the final steps.
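The curve described above can be sketched as follows. This is an illustrative implementation, not the calculator's internal code; the function and parameter names (including the optional `warmup_start` floor for the ramp) are assumptions.

```python
def polynomial_decay_lr(step, init_rate, end_rate, power,
                        warmup_steps, decay_steps, warmup_start=0.0):
    """Linear warmup to init_rate, then polynomial decay toward end_rate."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from the warmup starting rate up to the initial rate.
        return warmup_start + (init_rate - warmup_start) * step / warmup_steps
    # Clamp so the rate holds at end_rate once the decay window is exhausted.
    t = min(step - warmup_steps, decay_steps)
    remaining = 1.0 - t / decay_steps
    return end_rate + (init_rate - end_rate) * remaining ** power
```

With the Transformer Base row from the table (0.001 → 0.00005, power 2.0, warmup 400, decay 6000), this sketch yields 0.001 at step 400 and reaches the 0.00005 floor at step 6400.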
Cycling option: When enabled, the effective decay window expands to the next multiple of the base decay length once the post-warmup step exceeds it. This creates a longer tail and can raise the rate again after crossing a completed decay boundary.
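As a sketch, the cycling behavior amounts to growing the decay window to the next multiple of the base decay length; this mirrors the common `cycle=True` convention in polynomial-decay implementations, and the calculator's exact handling (including the warmup ramp, assumed here to start from 0) may differ.

```python
import math

def polynomial_decay_lr_cycled(step, init_rate, end_rate, power,
                               warmup_steps, decay_steps):
    """Polynomial decay whose window expands to the next multiple of decay_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return init_rate * step / warmup_steps  # assumes warmup ramps from 0
    t = step - warmup_steps
    # Smallest multiple of decay_steps that covers the post-warmup step.
    window = decay_steps * max(1, math.ceil(t / decay_steps))
    remaining = 1.0 - t / window
    return end_rate + (init_rate - end_rate) * remaining ** power
```

Just past a completed window (for example, one step after `decay_steps` with no warmup), the window doubles, the relative position falls to roughly 0.5, and the rate jumps back above the floor.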
How to Use This Calculator
- Enter the initial learning rate, end learning rate, and polynomial power.
- Set the current optimizer step you want to evaluate.
- Choose manual decay steps or derive them from dataset size, batch size, epochs, and gradient accumulation.
- Add warmup values if your training plan ramps the rate before decay.
- Enable cycling only when you intentionally want an extended decay window.
- Submit the form to view the calculated rate, milestone table, and Plotly chart.
- Use CSV to export sampled schedule points or PDF to save the report.
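For the derived-steps option above, the total number of optimizer updates can be estimated roughly as below. This is a sketch assuming ceiling rounding at each stage; the calculator's exact rounding may differ.

```python
import math

def derived_decay_steps(dataset_size, batch_size, epochs, grad_accum=1):
    """Estimate total optimizer updates from the training plan."""
    # One pass over the data, rounded up to whole batches.
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    # Weights update once per accumulated group of batches.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs
```

For example, 10,000 samples at batch size 32 over 8 epochs gives 313 × 8 = 2,504 updates; with an accumulation factor of 4 this drops to 79 × 8 = 632.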
FAQs
1) What does polynomial decay control in training?
It controls how the learning rate decreases over time. Instead of dropping linearly, the curve follows a power term, which changes how aggressively optimization slows near the end of training.
2) When should I use a larger power value?
Use a larger power when you want a fast reduction early in training followed by a long, gentle tail near the end rate; the extended low-rate tail gives the model more steps to fine-tune near convergence. If you instead want the rate to stay higher for longer and fall sharply late, choose a power below 1.
3) Why include warmup steps?
Warmup reduces instability at the beginning of training. It gradually moves the rate from a lower starting value to the main rate, which is especially useful for transformers and large-batch runs.
4) What is the difference between manual and derived decay steps?
Manual steps let you define the schedule directly. Derived steps estimate total optimizer updates from dataset size, batch size, epochs, and gradient accumulation for a training-plan-based schedule.
5) Does the calculator support gradient accumulation?
Yes. Derived step planning divides the number of batches per epoch by the accumulation factor, so the estimated number of optimizer updates better reflects how often weights actually change.
6) Why can cycling raise the learning rate again?
Cycling expands the effective decay window to the next multiple after the schedule length is exceeded. That resets the relative position inside a larger window and can move the rate above the floor again.
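A worked numeric example of that boundary crossing, using illustrative values (power 1, no warmup, base decay length 1000) and the window-expansion convention described in the Formula Used section:

```python
import math

init, end, power, decay = 1e-3, 5e-5, 1.0, 1000  # illustrative settings, no warmup

def cycled_lr(step):
    # Expand the window to the smallest multiple of `decay` covering `step`.
    window = decay * max(1, math.ceil(step / decay))
    return end + (init - end) * (1.0 - step / window) ** power

print(cycled_lr(1000))  # at the boundary: the 5e-05 floor
print(cycled_lr(1001))  # window doubles to 2000: rate jumps to ~0.000525
```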
7) What should the end learning rate be?
It depends on your model and optimizer. A smaller end rate usually supports gentler convergence, while a higher floor can preserve adaptation if you still expect useful learning late in training.
8) What do the CSV and PDF exports include?
The CSV export contains sampled schedule points for plotting or analysis. The PDF export summarizes the chosen settings, current result, average sampled rate, and milestone values for reporting.