| t | $r_t$ | $V_t$ | $V_{t+1}$ | done | $\delta_t$ | $A_t$ (raw) | Return target |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.5 | 0.4 | 0 | 0.896 | 2.943374 | 3.443374 |
| 1 | 0 | 0.4 | 0.3 | 0 | -0.103 | 2.176900 | 2.576900 |
| 2 | 0 | 0.3 | 0.2 | 0 | -0.102 | 2.424136 | 2.724136 |
| 3 | 1 | 0.2 | 0.1 | 0 | 0.899 | 2.685950 | 2.885950 |
| 4 | 2 | 0.1 | 0.0 | 1 | 1.900 | 1.900000 | 2.000000 |
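
Generalized Advantage Estimation (GAE) builds on temporal-difference residuals:

$$\delta_t = r_t + \gamma\,(1 - \text{done}_t)\,V(s_{t+1}) - V(s_t)$$

The advantage is the exponentially weighted sum of those residuals:

$$A_t = \sum_{l=0}^{T-1-t} (\gamma\lambda)^l\,\delta_{t+l}$$

Return target:

$$R_t = A_t + V(s_t)$$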
The calculator computes $A_t$ by iterating backward, resetting accumulation when done=1. Optional normalization keeps policy updates stable.
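
The backward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name `compute_gae`, its parameters, and the default `bootstrap_value` are all assumptions made for the example. With γ = 0.99 and λ = 0.95 (values inferred from the numbers in the table, not stated on the page) it reproduces the table rows to six decimals.

```python
from typing import List, Optional, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    next_values: Optional[Sequence[float]] = None,
    dones: Optional[Sequence[int]] = None,
    gamma: float = 0.99,
    lam: float = 0.95,
    bootstrap_value: float = 0.0,  # hypothetical default for the final step
) -> Tuple[List[float], List[float], List[float]]:
    """Return (deltas, raw advantages, return targets) for one batch."""
    n = len(rewards)
    if len(values) != n:
        raise ValueError("rewards and values must have the same length")
    if dones is None:
        dones = [0] * n
    if next_values is None:
        # Shift V_t left by one step; the last entry uses the bootstrap value.
        next_values = list(values[1:]) + [bootstrap_value]

    deltas = [0.0] * n
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        not_done = 1.0 - float(dones[t])
        # TD residual; the next-state value is ignored when the episode ends at t.
        deltas[t] = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Backward accumulation; the mask resets the trace at done=1.
        running = deltas[t] + gamma * lam * not_done * running
        advantages[t] = running

    returns = [a + v for a, v in zip(advantages, values)]
    return deltas, advantages, returns
```

Iterating backward makes each step a single multiply-add on the running sum, so the whole batch costs one pass instead of a nested sum per time step.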
- Paste rewards and value estimates with the same length.
- Optionally add done flags (0/1) for episode boundaries.
- If you have $V_{t+1}$ already, paste it; otherwise leave it blank.
- Set γ and λ based on your environment and bias/variance needs.
- Click Calculate GAE to view the table above the form.
- Use the download buttons to export CSV or a one-page PDF report.
1) What is a generalized advantage estimate?
It is a smoothed advantage signal that blends multi-step returns using γ and λ. It reduces variance compared with pure Monte Carlo while keeping bias manageable for policy optimization.
2) How do γ and λ change the result?
Higher γ values weight distant rewards more heavily. Higher λ values use longer traces, reducing bias but increasing variance. Lower λ approaches one-step TD, which has lower variance and is often more stable, but leans more heavily on the value estimates and so inherits their bias.
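
The two extremes of λ make the trade-off concrete. The following is a short worked derivation from the sum formula, assuming no episode termination between $t$ and the end of the rollout at step $T$:

$$\lambda = 0:\quad A_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\lambda = 1:\quad A_t = \sum_{l=0}^{T-1-t} \gamma^l\,\delta_{t+l} = \sum_{l=0}^{T-1-t} \gamma^l\, r_{t+l} + \gamma^{T-t}\,V(s_T) - V(s_t)$$

With λ = 0 the estimate depends on a single reward and two value estimates. With λ = 1 the intermediate value terms telescope away, leaving the discounted reward sum plus one bootstrap term.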
3) Do I need $V_{t+1}$ values?
Not always. If you provide only $V_t$, the calculator shifts it to estimate $V_{t+1}$ and uses the bootstrap value for the final step. Supplying explicit $V_{t+1}$ is useful with truncated rollouts.
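
The shift can be pictured as follows. This is a sketch under assumptions: `shift_values` and `bootstrap_value` are hypothetical names, and the default of 0.0 for the final step is chosen only for illustration.

```python
from typing import List, Sequence


def shift_values(values: Sequence[float], bootstrap_value: float = 0.0) -> List[float]:
    """Build V_{t+1} from V_t by shifting left one step.

    The last position has no successor in the batch, so it takes
    `bootstrap_value` (hypothetical parameter; 0.0 suits a terminal state).
    """
    return list(values[1:]) + [bootstrap_value]


# Applied to the V_t column of the table at the top of the page.
next_values = shift_values([0.5, 0.4, 0.3, 0.2, 0.1])
```

A plain shift takes the next row's value even when that row belongs to a different rollout. At a truncation boundary without done=1, that borrowed value would enter the residual, which is why pasting explicit next-state values helps there.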
4) What does the done flag control?
It marks episode termination at time t. When done=1, the next-state value is ignored and the backward accumulation is reset. This prevents advantages from leaking across episode boundaries.
5) Should I normalize advantages?
Often yes for policy gradients. Normalization keeps update magnitudes consistent across batches and can improve stability. This tool keeps returns based on raw advantages, while also showing normalized values for training use.
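
A common form of normalization subtracts the batch mean and divides by the batch standard deviation. The sketch below assumes that form; the function name, the population (not sample) standard deviation, and the `eps` guard are assumptions, since the page does not say which variant the tool uses.

```python
import math
from typing import List, Sequence


def normalize_advantages(advantages: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    n = len(advantages)
    mean = sum(advantages) / n
    variance = sum((a - mean) ** 2 for a in advantages) / n  # population variance
    std = math.sqrt(variance)
    # eps avoids division by zero when every advantage in the batch is equal.
    return [(a - mean) / (std + eps) for a in advantages]


raw = [2.943374, 2.176900, 2.424136, 2.685950, 1.900000]  # A_t (raw) column
normalized = normalize_advantages(raw)
```

Return targets are still computed from the raw advantages, so normalization affects only the policy update, not the value-function targets.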
6) What is the return target used for?
Return targets train the value function. A common choice is $R_t = A_t + V_t$, which uses the same bootstrapping assumptions as the advantage estimate.
7) Can I mix multiple episodes in one batch?
Yes. Provide done flags at each boundary. The calculator will reset the trace where done=1, producing correct per-episode advantages even when sequences are concatenated.
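
A quick way to see the reset at work is to compare a concatenated batch against the same episodes processed one at a time. The example below is a sketch with made-up rewards and values; `gae_advantages` is a hypothetical helper written for this illustration.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[int],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Raw GAE advantages; V_{t+1} comes from shifting `values`, 0.0 at the end."""
    next_values = list(values[1:]) + [0.0]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])  # 0.0 at an episode boundary
        delta = rewards[t] + gamma * next_values[t] * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
    return advantages


# Two short, made-up episodes, each ending with done=1.
ep_a = {"rewards": [1.0, 0.0], "values": [0.5, 0.4], "dones": [0, 1]}
ep_b = {"rewards": [0.0, 2.0, 1.0], "values": [0.7, 0.6, 0.2], "dones": [0, 0, 1]}

separate = gae_advantages(**ep_a) + gae_advantages(**ep_b)
joined = gae_advantages(
    ep_a["rewards"] + ep_b["rewards"],
    ep_a["values"] + ep_b["values"],
    ep_a["dones"] + ep_b["dones"],
)
```

In the joined batch the shifted value at the boundary is the first value of the second episode, but the done mask zeroes both it and the running sum, so `joined` and `separate` agree.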
8) What input formats are accepted?
Commas, spaces, semicolons, and new lines work. Scientific notation like 1e-3 is supported. For done flags, use 0/1 or true/false.
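
Input handling along these lines could look like the sketch below. It is an assumption about how such parsing might work, not the tool's code: the function names are hypothetical, and treating `true`/`false` case-insensitively is a choice made for this example.

```python
import re
from typing import List

# Commas, semicolons, and any whitespace (including new lines) separate tokens.
_SEPARATORS = re.compile(r"[,;\s]+")


def _tokens(text: str) -> List[str]:
    return [tok for tok in _SEPARATORS.split(text.strip()) if tok]


def parse_numbers(text: str) -> List[float]:
    """Parse rewards or values; float() accepts scientific notation such as 1e-3."""
    return [float(tok) for tok in _tokens(text)]


def parse_done_flags(text: str) -> List[int]:
    """Parse done flags given as 0/1 or true/false."""
    mapping = {"0": 0, "1": 1, "false": 0, "true": 1}
    flags = []
    for tok in _tokens(text):
        key = tok.lower()
        if key not in mapping:
            raise ValueError(f"unrecognized done flag: {tok!r}")
        flags.append(mapping[key])
    return flags
```

Rejecting unknown tokens instead of silently mapping them to 0 keeps a typo in the done column from merging two episodes.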