Generalized Advantage Estimator Calculator

Inputs

Trajectory and settings

Separate numbers with commas, spaces, semicolons, or new lines.

Rewards r_t (required)

Length T. Example: 1, 0, 0, 1

Values V_t (required)

Same length as rewards.

Next values V_t+1 (optional)

Leave blank to derive V_t+1 by shifting V_t one step and appending the bootstrap value.

Done flags (optional)

Use 0/1 (or true/false). If left empty, only the last step is treated as terminal.

Discount γ

Typical: 0.95–0.999.

Trace decay λ

Typical: 0.90–0.97.

Bootstrap value V_T (value estimate after the last step)

Used only if V_t+1 is empty.

Advantage clip min (optional)

Leave blank for no min clipping.

Advantage clip max (optional)

Leave blank for no max clipping.

Decimal precision

Affects display and exports.

Normalize advantages (mean 0, std 1)

Results appear above this form after submission.
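The optional clip bounds described above could be applied as in the following sketch. The function name `clip_advantages` and the use of `None` for a blank field are illustrative assumptions; how the calculator itself applies the bounds, including whether clipping happens before or after normalization, is not stated on this page.

```python
def clip_advantages(advantages, clip_min=None, clip_max=None):
    """Clamp each advantage to [clip_min, clip_max]; None means no bound."""
    clipped = []
    for a in advantages:
        if clip_min is not None:
            a = max(a, clip_min)  # enforce the lower bound
        if clip_max is not None:
            a = min(a, clip_max)  # enforce the upper bound
        clipped.append(a)
    return clipped


# Hypothetical usage: only an upper bound is set, as if "clip min" were blank.
print(clip_advantages([-3.0, 0.5, 4.0], clip_max=2.0))
```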

Example data table

This example matches the prefilled inputs. It demonstrates how deltas accumulate into advantages.

t	r_t	V_t	V_t+1	done	δ_t	A_t (raw)	Return target
0	1	0.5	0.4	0	0.896	2.943374	3.443374
1	0	0.4	0.3	0	-0.103	2.176900	2.576900
2	0	0.3	0.2	0	-0.102	2.424136	2.724136
3	1	0.2	0.1	0	0.899	2.685950	2.885950
4	2	0.1	0.0	1	1.900	1.900000	2.000000

The table uses γ=0.99 and λ=0.95, and each column is computed backward from the last step.
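As a check on the last three rows, the arithmetic can be written out by hand. Here γλ = 0.99 × 0.95 = 0.9405, and each advantage adds the current residual to γλ times the advantage of the following step (nothing is carried into step 4 because done=1):

```latex
\begin{aligned}
\delta_4 &= 2 + 0.99 \cdot 0.0 \cdot (1 - 1) - 0.1 = 1.9,
  & A_4 &= \delta_4 = 1.9 \\
\delta_3 &= 1 + 0.99 \cdot 0.1 - 0.2 = 0.899,
  & A_3 &= 0.899 + 0.9405 \cdot 1.9 = 2.68595 \\
\delta_2 &= 0 + 0.99 \cdot 0.2 - 0.3 = -0.102,
  & A_2 &= -0.102 + 0.9405 \cdot 2.68595 \approx 2.424136
\end{aligned}
```

Adding V_t to each advantage gives the return targets in the last column, for example 2.68595 + 0.2 = 2.88595 for step 3.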

Formula used

Generalized Advantage Estimation uses temporal-difference residuals:

δ_t = r_t + γ V(s_{t+1}) (1 - done_t) - V(s_t)
A_t = ∑_{l=0}^{T-1-t} (γλ)^l δ_{t+l}
Return target: R_t = A_t + V(s_t)

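The calculator computes A_t by iterating backward from the last step, which within an episode is equivalent to the recursion A_t = δ_t + γλ (1 - done_t) A_{t+1}; the accumulation resets wherever done=1. Optional normalization rescales the advantages to mean 0 and standard deviation 1, which can help keep policy updates stable.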
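The backward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration under the formulas in this section, not the calculator's actual source; the function name `compute_gae` and its argument names are assumptions chosen for the example.

```python
def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Return (deltas, advantages, return_targets) from one backward pass."""
    n = len(rewards)
    deltas = [0.0] * n
    advantages = [0.0] * n
    running = 0.0  # advantage carried back from step t+1
    for t in reversed(range(n)):
        not_done = 1.0 - float(dones[t])
        # TD residual; the next-state value is ignored on terminal steps.
        deltas[t] = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Multiplying by not_done resets the trace at episode boundaries.
        running = deltas[t] + gamma * lam * not_done * running
        advantages[t] = running
    return_targets = [a + v for a, v in zip(advantages, values)]
    return deltas, advantages, return_targets


# Inputs taken from the example data table.
rewards = [1, 0, 0, 1, 2]
values = [0.5, 0.4, 0.3, 0.2, 0.1]
next_values = [0.4, 0.3, 0.2, 0.1, 0.0]
dones = [0, 0, 0, 0, 1]

deltas, advantages, return_targets = compute_gae(rewards, values, next_values, dones)
for t, (d, a, r) in enumerate(zip(deltas, advantages, return_targets)):
    print(f"{t}\t{d:.3f}\t{a:.6f}\t{r:.6f}")
```

Run on the example inputs, the loop should reproduce the δ_t, A_t (raw), and Return target columns of the table up to rounding.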

How to use this calculator

Paste rewards and value estimates with the same length.
Optionally add done flags (0/1) for episode boundaries.
If you have V_t+1 already, paste it; otherwise leave it blank.
Set γ and λ based on your environment and bias/variance needs.
Click Calculate GAE to view the table above the form.
Use the download buttons to export CSV or a one-page PDF report.

FAQs

1) What is a generalized advantage estimate?

It is a smoothed advantage signal that blends multi-step returns using γ and λ. It reduces variance compared with pure Monte Carlo while keeping bias manageable for policy optimization.

2) How do γ and λ change the result?

Higher γ values weight distant rewards more heavily. Higher λ values use longer traces, which reduces bias but increases variance. Lower λ moves toward the one-step TD residual (at λ=0, A_t = δ_t), which has lower variance but more bias from errors in the value estimates.
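The two extremes of λ can be checked numerically. The sketch below, which reuses the example table inputs and a hypothetical helper named `gae`, shows that λ=0 returns the TD residuals themselves and that λ=1 returns the discounted sum of rewards minus V_t. The second identity relies on V_t+1 being the shifted V_t and on the episode ending at the last step, as in the example.

```python
def gae(rewards, values, next_values, dones, gamma, lam):
    """Backward GAE pass; returns (deltas, advantages)."""
    deltas, advantages, running = [], [], 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_values[t] * mask - values[t]
        running = delta + gamma * lam * mask * running
        deltas.append(delta)
        advantages.append(running)
    return deltas[::-1], advantages[::-1]


rewards = [1, 0, 0, 1, 2]
values = [0.5, 0.4, 0.3, 0.2, 0.1]
next_values = values[1:] + [0.0]
dones = [0, 0, 0, 0, 1]
gamma = 0.99

# lambda = 0: each advantage is just the one-step TD residual.
deltas, adv_td = gae(rewards, values, next_values, dones, gamma, lam=0.0)

# lambda = 1: each advantage is the discounted return minus V_t.
_, adv_mc = gae(rewards, values, next_values, dones, gamma, lam=1.0)
discounted, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    discounted.append(g)
discounted.reverse()
mc_minus_v = [g - v for g, v in zip(discounted, values)]

print(adv_td)
print(adv_mc)
print(mc_minus_v)
```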

3) Do I need V_t+1 values?

Not always. If you provide only V_t, the calculator shifts it to estimate V_t+1 and uses the bootstrap value for the final step. Supplying explicit V_t+1 is useful with truncated rollouts.
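The shift described in this answer amounts to dropping the first value and appending the bootstrap value. A minimal sketch, with `shift_values` as a hypothetical helper name:

```python
def shift_values(values, bootstrap_value=0.0):
    """Build V_t+1 from V_t: drop the first entry, append the bootstrap value."""
    return list(values[1:]) + [bootstrap_value]


values = [0.5, 0.4, 0.3, 0.2, 0.1]
next_values = shift_values(values, bootstrap_value=0.0)
print(next_values)  # [0.4, 0.3, 0.2, 0.1, 0.0]
```

With these inputs the result matches the V_t+1 column of the example table. When several episodes are concatenated, the shifted value at a terminal step belongs to the next episode's first state, but the (1 - done_t) factor in δ_t removes it from the calculation.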

4) What does the done flag control?

It marks episode termination at time t. When done=1, the next-state value is ignored and the backward accumulation is reset. This prevents advantages from leaking across episode boundaries.

5) Should I normalize advantages?

Often yes for policy gradients. Normalization keeps update magnitudes consistent across batches and can improve stability. This tool keeps returns based on raw advantages, while also showing normalized values for training use.
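Normalization to mean 0 and standard deviation 1 can be sketched as below. The population standard deviation and the small `eps` guard against division by zero are assumptions of this example; the page does not say which variant the calculator uses. The return targets are built from the raw advantages, as the answer above describes.

```python
def normalize(advantages, eps=1e-8):
    """Rescale advantages to mean 0 and (population) standard deviation 1."""
    n = len(advantages)
    mean = sum(advantages) / n
    variance = sum((a - mean) ** 2 for a in advantages) / n
    std = variance ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]


# Raw advantages and values from the example data table.
raw = [2.943374, 2.176900, 2.424136, 2.685950, 1.900000]
values = [0.5, 0.4, 0.3, 0.2, 0.1]

normalized = normalize(raw)                              # for the policy update
return_targets = [a + v for a, v in zip(raw, values)]    # still from raw values

print(normalized)
print(return_targets)
```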

6) What is the return target used for?

Return targets train the value function. A common choice is R_t = A_t + V_t, which shares the bootstrapping assumptions of the advantage estimate.
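One common way to use the targets, assumed here for illustration, is a mean squared error between the value predictions and R_t. The helper name `value_loss` is hypothetical, and the numbers come from the example table.

```python
def value_loss(predicted_values, return_targets):
    """Mean squared error between value predictions and return targets."""
    errors = [(r - v) ** 2 for v, r in zip(predicted_values, return_targets)]
    return sum(errors) / len(errors)


values = [0.5, 0.4, 0.3, 0.2, 0.1]
advantages = [2.943374, 2.176900, 2.424136, 2.685950, 1.900000]
return_targets = [a + v for a, v in zip(advantages, values)]

loss = value_loss(values, return_targets)
print(loss)
```

Because R_t - V_t = A_t for the values that produced the targets, the loss at that point equals the mean of A_t squared. During training the targets are normally held fixed while the value predictions change.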

7) Can I mix multiple episodes in one batch?

Yes. Provide done flags at each boundary. The calculator will reset the trace where done=1, producing correct per-episode advantages even when sequences are concatenated.
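The effect of the reset can be checked by comparing a concatenated batch against the same episodes processed separately. The sketch below uses a hypothetical `gae` helper and made-up two-step and three-step episodes; it illustrates the masking idea rather than the calculator's own code.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Backward GAE pass that resets the trace wherever done is 1."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_values[t] * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
    return advantages


# Episode A (2 steps) and episode B (3 steps), each ending with done=1.
ep_a = dict(rewards=[1.0, 0.0], values=[0.5, 0.4],
            next_values=[0.4, 0.0], dones=[0, 1])
ep_b = dict(rewards=[0.0, 1.0, 2.0], values=[0.3, 0.2, 0.1],
            next_values=[0.2, 0.1, 0.0], dones=[0, 0, 1])

# Concatenate field by field into one batch.
batch = {key: ep_a[key] + ep_b[key] for key in ep_a}

separate = gae(**ep_a) + gae(**ep_b)
combined = gae(**batch)
print(separate)
print(combined)
```

Both lists should agree, since the done flag at the end of episode A stops anything from episode B from flowing backward into it.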

8) What input formats are accepted?

Commas, spaces, semicolons, and new lines work. Scientific notation like 1e-3 is supported. For done flags, use 0/1 or true/false.
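A parser that accepts these formats could look like the following sketch. The function names and the choice to raise an error on unrecognized done flags are assumptions for illustration; the calculator's own parsing rules may differ in edge cases.

```python
import re

# Split on any run of commas, semicolons, or whitespace (including new lines).
_SEPARATORS = re.compile(r"[,;\s]+")


def parse_numbers(text):
    """Parse a delimited string into floats; scientific notation is accepted."""
    tokens = [tok for tok in _SEPARATORS.split(text.strip()) if tok]
    return [float(tok) for tok in tokens]


def parse_done_flags(text):
    """Parse 0/1 or true/false (case-insensitive) into integers 0 or 1."""
    mapping = {"0": 0, "1": 1, "false": 0, "true": 1}
    tokens = [tok for tok in _SEPARATORS.split(text.strip()) if tok]
    flags = []
    for tok in tokens:
        key = tok.lower()
        if key not in mapping:
            raise ValueError(f"unrecognized done flag: {tok!r}")
        flags.append(mapping[key])
    return flags


print(parse_numbers("1, 0; 0\n1e-3"))
print(parse_done_flags("0 1,true;False"))
```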
