Solver Inputs
Enter a finite-state decision problem and solve it with value iteration.
Example Data Table
This example shows a compact two-state, two-action model for practice and validation.
| State | Action | Immediate Reward | Transition Probabilities to [Low Demand, High Demand] |
|---|---|---|---|
| Low Demand | Hold | 5 | 0.70, 0.30 |
| Low Demand | Expand | 9 | 0.55, 0.45 |
| High Demand | Hold | 4 | 0.40, 0.60 |
| High Demand | Expand | 12 | 0.20, 0.80 |
Formula Used
The calculator applies the Bellman optimality update for each state:
V_{k+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V_k(s') ]
For cost minimization, the calculator replaces the max operator with min.
Here, R(s,a) is the immediate reward or cost, γ is the discount factor, and P(s'|s,a) is the probability of moving from state s to next state s' after taking action a.
Value iteration repeats the update until the largest change between consecutive value vectors falls below the chosen tolerance, or until the maximum iteration limit is reached.
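The update and stopping rule above can be sketched in a few lines of NumPy, using the example model from the table. The discount factor, tolerance, and iteration limit shown here are illustrative choices, not the calculator's defaults:

```python
import numpy as np

# Example model from the table above (state order: [Low Demand, High Demand]).
# Rewards: one row per state, one column per action [Hold, Expand].
R = np.array([[5.0, 9.0],
              [4.0, 12.0]])

# One state-by-state transition matrix per action; P[a][s, s'] = P(s' | s, a).
P = np.array([
    [[0.70, 0.30],   # Hold
     [0.40, 0.60]],
    [[0.55, 0.45],   # Expand
     [0.20, 0.80]],
])

def value_iteration(R, P, gamma=0.9, tol=1e-6, max_iter=1000):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = R(s, a) + gamma * sum over s' of P(s'|s,a) * V(s')
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        # Stop when the largest change between successive value vectors is small.
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

V, policy = value_iteration(R, P)
```

With γ = 0.9, both states prefer Expand, and High Demand ends up with the larger value.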
How to Use This Calculator
- Enter the state names and the possible actions.
- Choose whether you want to maximize rewards or minimize costs.
- Set the discount factor, tolerance, and maximum iteration count.
- Type the reward matrix with one row per state and one column per action.
- Enter one transition matrix for each action. Each matrix is state-by-state: row s, column s' holds the probability of moving from state s to state s' under that action.
- Submit the form to view values, optimal policy, Q-values, iteration history, and downloadable exports.
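Taken together, the inputs described above can be arranged like this (a hypothetical layout sketch; the field names are illustrative, not the calculator's actual form fields):

```python
inputs = {
    "states": ["Low Demand", "High Demand"],
    "actions": ["Hold", "Expand"],
    "objective": "max",        # or "min" for cost minimization
    "gamma": 0.9,
    "tolerance": 1e-6,
    "max_iterations": 1000,
    # Rewards: one row per state, one column per action.
    "rewards": [[5, 9], [4, 12]],
    # Transitions: one state-by-state matrix per action.
    "transitions": {
        "Hold":   [[0.70, 0.30], [0.40, 0.60]],
        "Expand": [[0.55, 0.45], [0.20, 0.80]],
    },
}
```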
FAQs
1. What does this solver calculate?
It computes the value function for each state, identifies the best action under the chosen objective, and shows Q-values and convergence history from value iteration.
2. What is the discount factor?
The discount factor weights future outcomes relative to immediate outcomes. Larger values place more emphasis on long-run consequences, while smaller values favor short-term results.
3. Why must each transition row sum to one?
Each row represents a full probability distribution over next states for a specific current state and action. Total probability must equal one for the model to be valid.
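This constraint is easy to check programmatically before solving. A minimal sketch, where `atol` is an assumed tolerance for rounding error:

```python
import numpy as np

def validate_transitions(P, atol=1e-8):
    """Return True if every row of every action's transition matrix sums to 1."""
    P = np.asarray(P, dtype=float)
    row_sums = P.sum(axis=-1)
    return np.allclose(row_sums, 1.0, atol=atol)
```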
4. What is the difference between max and min objectives?
Max chooses actions with the highest expected discounted reward. Min chooses actions with the lowest expected discounted cost, which is useful for planning and control problems.
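The two objectives differ only in the operator applied to the Q-values; everything else in the update is identical. A minimal sketch of a single Bellman update supporting both:

```python
import numpy as np

def bellman_update(V, R, P, gamma, objective="max"):
    """One value-iteration step; P is a stack of per-action transition matrices."""
    # Q[s, a] = R(s, a) + gamma * expected next-state value under action a.
    Q = R + gamma * np.stack([Pa @ V for Pa in P], axis=1)
    return Q.max(axis=1) if objective == "max" else Q.min(axis=1)
```

Starting from V = 0 on the example model, "max" returns the best immediate rewards per state and "min" returns the worst.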
5. What does tolerance control?
Tolerance defines the stopping threshold. When the maximum change between two successive value vectors becomes smaller than this number, the algorithm stops.
6. Can I use negative rewards or costs?
Yes. Negative rewards, penalties, and mixed reward structures are allowed, provided the matrices remain numeric and the transition model is valid.
7. When should I enable automatic row normalization?
Enable it when your transition rows are slightly off because of rounding or manual entry. It rescales each row so the probabilities sum correctly.
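Normalization simply divides each row by its own sum, preserving the relative proportions. A minimal sketch:

```python
import numpy as np

def normalize_rows(P):
    """Rescale each transition row so its probabilities sum to exactly 1."""
    P = np.asarray(P, dtype=float)
    return P / P.sum(axis=-1, keepdims=True)
```

For example, a row entered as 0.69, 0.30 would be rescaled to roughly 0.697, 0.303.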
8. Is this the same as reinforcement learning?
Not exactly. This tool solves a known model with explicit rewards and transitions. Reinforcement learning typically estimates good policies from sampled experience instead.