Calculator Inputs
Use state-value mode when your future estimates are state values V(s'). Use q-value mode when each future estimate already represents the best next-action value.
Plotly Graph
The chart shows how each transition contributes to the discounted future term of the Bellman update.
Example Data Table
| State | Action | Reward | γ | Next States | Probabilities | Future Estimates | Bellman Target |
|---|---|---|---|---|---|---|---|
| S0 | A0 | 5.00 | 0.90 | S1, S2, S3 | 0.50, 0.30, 0.20 | 12.00, 8.00, 4.00 | 13.2800 |
| S4 | A1 | 2.50 | 0.80 | S5, S6 | 0.70, 0.30 | 9.00, 3.00 | 8.2600 |
| S7 | A2 | -1.00 | 0.95 | S8, S9, S10 | 0.20, 0.50, 0.30 | 15.00, 10.00, 6.00 | 8.3100 |
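Each target in the table follows from reward + γ × Σ(probability × future estimate). A minimal sketch that recomputes all three rows (the tuple layout here is just an illustration, not the calculator's internal format):

```python
# Recompute the Bellman target for each table row:
# target = reward + gamma * sum(p * future_estimate)
rows = [
    (5.00, 0.90, [0.50, 0.30, 0.20], [12.00, 8.00, 4.00]),
    (2.50, 0.80, [0.70, 0.30], [9.00, 3.00]),
    (-1.00, 0.95, [0.20, 0.50, 0.30], [15.00, 10.00, 6.00]),
]
for reward, gamma, probs, estimates in rows:
    expected = sum(p * v for p, v in zip(probs, estimates))
    target = reward + gamma * expected
    print(f"{target:.4f}")  # 13.2800, 8.2600, 8.3100
```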
Formula Used
State-value update:
V(s) = R(s) + γ Σ_{s'} P(s'|s,a) × V(s')
Action-value update:
Q(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) × max_{a'} Q(s',a')
Expected future term: Σ P × future estimate
The calculator first multiplies each future estimate by its transition probability. It adds those weighted values to get the expected future value. Then it multiplies that expectation by the discount factor γ. Finally, it adds the immediate reward to produce the Bellman target.
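The four steps above can be sketched as a small function (the name `bellman_target` is hypothetical, not taken from the calculator's code):

```python
def bellman_target(reward, gamma, probs, estimates):
    # 1) weight each future estimate by its transition probability
    weighted = [p * v for p, v in zip(probs, estimates)]
    # 2) sum the weighted values to get the expected future value
    expected = sum(weighted)
    # 3) discount the expectation by gamma
    discounted = gamma * expected
    # 4) add the immediate reward to produce the Bellman target
    return reward + discounted

print(bellman_target(5.00, 0.90, [0.50, 0.30, 0.20], [12.00, 8.00, 4.00]))  # ≈ 13.28
```

In q-value mode the same arithmetic applies; the only difference is that each estimate is interpreted as the best next-action value rather than a state value.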
How to Use This Calculator
- Choose State Value when entering future state values V(s').
- Choose Action Value when future inputs represent best next-action values.
- Enter the immediate reward and the discount factor γ between 0 and 1.
- Select how many transitions you want to include, from 1 to 6.
- For each transition, enter the next state label, probability, and future estimate.
- Make sure all probabilities sum exactly to 1.0000.
- Click Calculate Bellman Target to show results above the form.
- Use the CSV and PDF buttons to save the computed output.
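The probability-sum check in the steps above can be sketched as follows; the tolerance value is an assumption, since the page does not state how strictly "exactly 1.0000" is enforced:

```python
def probs_valid(probs, tol=1e-4):
    # Transition probabilities must form a full distribution over next states
    return abs(sum(probs) - 1.0) <= tol

print(probs_valid([0.50, 0.30, 0.20]))  # True
print(probs_valid([0.50, 0.30]))        # False: mass is missing
```

A small tolerance is useful in practice because floating-point entry (e.g. thirds) rarely sums to exactly 1.0.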
Frequently Asked Questions
1. What does the Bellman equation measure?
It measures the current state or action value as immediate reward plus discounted expected future value. It is a core recursion behind reinforcement learning methods.
2. When should I use state-value mode?
Use state-value mode when your future inputs are state values V(s'). This fits policy evaluation, value iteration, and many Markov decision process examples.
3. When should I use q-value mode?
Use q-value mode when each future estimate already reflects the best available next action value. That setup matches many action-value learning interpretations.
4. Why must probabilities sum to one?
Transition probabilities represent a full distribution over possible next states. If they do not sum to one, the weighted sum is no longer a valid expectation, and the future term is over- or under-weighted.
5. What does the discount factor γ control?
γ controls how strongly future outcomes affect the present estimate. Higher values favor long-term returns, while lower values emphasize immediate reward.
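This effect is easy to see numerically. Sweeping γ over the first table row (an illustration, not part of the calculator):

```python
# How gamma scales the future term for the first example row
reward = 5.00
probs, estimates = [0.50, 0.30, 0.20], [12.00, 8.00, 4.00]
expected = sum(p * v for p, v in zip(probs, estimates))  # expected future value = 9.2
for gamma in (0.0, 0.5, 0.9, 0.99):
    print(gamma, round(reward + gamma * expected, 4))
```

At γ = 0 the target collapses to the immediate reward (5.0); as γ approaches 1, the 9.2 expected future value dominates the result.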
6. Can rewards be negative?
Yes. Negative rewards can represent penalties, costs, or undesirable outcomes. The calculator handles positive, zero, and negative rewards directly.
7. Why do I see discounted contribution values?
They show how much each transition adds to the discounted future term. This helps explain which next states influence the Bellman target most.
8. Is this useful for teaching reinforcement learning?
Yes. It turns recursive updates into visible steps, tables, and charts. That makes Bellman reasoning easier to inspect, compare, and explain.