Generalized Advantage Estimator Calculator

A Generalized Advantage Estimator built for clear reinforcement-learning insights. Paste your rollouts, set gamma and lambda, and get deltas, advantages, and return targets, then download a report instantly.

Inputs
Trajectory and settings
Separate numbers with commas, spaces, or new lines.
Length T. Example: 1, 0, 0, 1
Same length as rewards.
Leave blank to auto-shift Vt and use bootstrap.
Use 0/1 (or true/false). Empty → last step terminal.
Typical: 0.95–0.999.
Typical: 0.90–0.97.
Used only if Vt+1 is empty.
Leave blank for no min clipping.
Leave blank for no max clipping.
Affects display and exports.
Results appear above this form after submission.
Example data table
This example matches the prefilled inputs. It demonstrates how deltas accumulate into advantages.
| t | rt | Vt  | Vt+1 | done | δt     | At (raw) | Return target |
|---|----|-----|------|------|--------|----------|---------------|
| 0 | 1  | 0.5 | 0.40 | 0    | 0.896  | 2.943374 | 3.443374      |
| 1 | 0  | 0.4 | 0.30 | 0    | -0.103 | 2.176900 | 2.576900      |
| 2 | 0  | 0.3 | 0.20 | 0    | -0.102 | 2.424136 | 2.724136      |
| 3 | 1  | 0.2 | 0.10 | 0    | 0.899  | 2.685950 | 2.885950      |
| 4 | 2  | 0.1 | 0.00 | 1    | 1.900  | 1.900000 | 2.000000      |
Using γ=0.99 and λ=0.95, computed backwards from the last step.
Formula used

Generalized Advantage Estimation uses temporal-difference residuals:

δt = rt + γ V(st+1) (1 - donet) - V(st)
At = Σ_{l=0}^{T−1−t} (γλ)^l δ_{t+l}
Return target: Rt = At + V(st)

The calculator computes At by iterating backward, resetting accumulation when done=1. Optional normalization keeps policy updates stable.
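The backward pass described above can be sketched in Python. This is an illustrative implementation, not the calculator's actual code; the function and variable names are assumptions. Run on the prefilled example, it reproduces the table's first row:

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Backward GAE pass; the running trace resets at episode boundaries (done=1)."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]  # 0 when the episode ends at step t
        delta = rewards[t] + gamma * next_values[t] * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]  # Rt = At + V(st)
    return advantages, returns

# Prefilled example (gamma=0.99, lambda=0.95):
adv, ret = gae([1, 0, 0, 1, 2],
               [0.5, 0.4, 0.3, 0.2, 0.1],
               [0.4, 0.3, 0.2, 0.1, 0.0],
               [0, 0, 0, 0, 1])
print(round(adv[0], 6), round(ret[0], 6))  # → 2.943374 3.443374
```

Because the loop runs backward, each step's advantage already contains the discounted sum of all later deltas, so the whole batch is computed in a single O(T) pass.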

How to use this calculator
  1. Paste rewards and value estimates with the same length.
  2. Optionally add done flags (0/1) for episode boundaries.
  3. If you have Vt+1 already, paste it; otherwise leave it blank.
  4. Set γ and λ based on your environment and bias/variance needs.
  5. Click Calculate GAE to view the table above the form.
  6. Use the download buttons to export CSV or a one-page PDF report.
FAQs

1) What is a generalized advantage estimate?

It is a smoothed advantage signal that blends multi-step returns using γ and λ. It reduces variance compared with pure Monte Carlo while keeping bias manageable for policy optimization.

2) How do γ and λ change the result?

Higher γ values weigh distant rewards more. Higher λ values use longer traces, reducing bias but increasing variance. Lower λ approaches one-step TD, which is lower variance and often more stable, but more biased.

3) Do I need Vt+1 values?

Not always. If you provide only Vt, the calculator shifts it to estimate Vt+1 and uses the bootstrap value for the final step. Supplying explicit Vt+1 is useful with truncated rollouts.
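The auto-shift described here can be sketched as follows; `shift_values` is an illustrative name, and the bootstrap value stands in for V at the step after the rollout ends:

```python
def shift_values(values, bootstrap=0.0):
    """Build V(t+1) from V(t) by shifting left; the final step uses a bootstrap value."""
    return values[1:] + [bootstrap]

print(shift_values([0.5, 0.4, 0.3], bootstrap=0.25))  # → [0.4, 0.3, 0.25]
```

With a truncated rollout, supplying the true V(s_{T}) as the bootstrap (rather than 0) keeps the final delta from being biased downward.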

4) What does the done flag control?

It marks episode termination at time t. When done=1, the next-state value is ignored and the backward accumulation is reset. This prevents advantages from leaking across episode boundaries.

5) Should I normalize advantages?

Often yes for policy gradients. Normalization keeps update magnitudes consistent across batches and can improve stability. This tool keeps returns based on raw advantages, while also showing normalized values for training use.
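The normalization step mentioned here is typically a plain standardization over the batch; a minimal sketch (names are illustrative):

```python
def normalize(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit std; eps guards against zero variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    std = (sum((a - mean) ** 2 for a in advantages) / n) ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]

normed = normalize([2.943374, 2.176900, 2.424136, 2.685950, 1.900000])
# After normalization the batch has mean ~0 and std ~1.
```

Note the normalized values feed the policy-gradient update only; the return targets for the value function still use the raw advantages, as stated above.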

6) What is the return target used for?

Return targets train the value function. A common choice is Rt=At+Vt, which matches the same bootstrapping assumptions used in the advantage estimate.

7) Can I mix multiple episodes in one batch?

Yes. Provide done flags at each boundary. The calculator will reset the trace where done=1, producing correct per-episode advantages even when sequences are concatenated.
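The boundary reset can be seen in a tiny self-contained example: two concatenated episodes with identical rewards and zero values, where the done flag at t=1 stops step 2's advantage from leaking into step 1 (function and variable names are illustrative):

```python
def gae_advantages(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Backward GAE trace; the mask zeroes both bootstrap and trace at done=1."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # 0 at an episode boundary
        delta = rewards[t] + gamma * next_values[t] * mask - values[t]
        running = delta + gamma * lam * mask * running
        adv[t] = running
    return adv

# done=1 at t=1 ends episode one; t=2 belongs to a separate episode.
adv = gae_advantages([1, 1, 1], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0, 1, 1])
print(adv)  # → [1.9405, 1.0, 1.0]
```

Without the reset, step 0's advantage would also include step 2's delta, mixing returns across episodes.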

8) What input formats are accepted?

Commas, spaces, semicolons, and new lines work. Scientific notation like 1e-3 is supported. For done flags, use 0/1 or true/false.
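A parser covering these formats is a one-liner with a regular expression split; `parse_numbers` is an illustrative name, not the calculator's actual code:

```python
import re

def parse_numbers(text):
    """Split on commas, semicolons, and any whitespace (including new lines)."""
    tokens = re.split(r"[,;\s]+", text.strip())
    return [float(tok) for tok in tokens if tok]

print(parse_numbers("1, 0; 2.5\n1e-3"))  # → [1.0, 0.0, 2.5, 0.001]
```

Scientific notation comes for free because Python's `float()` accepts it; done flags written as true/false would need an extra mapping to 1/0 before conversion.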

Related Calculators

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of the results. Please consult other sources as well.