Solver Inputs
Enter a finite-state decision problem and solve it with value iteration.
Example Data Table
This example shows a compact two-state, two-action model for practice and validation.
| State | Action | Immediate Reward | Transition Probabilities to [Low Demand, High Demand] |
|---|---|---|---|
| Low Demand | Hold | 5 | 0.70, 0.30 |
| Low Demand | Expand | 9 | 0.55, 0.45 |
| High Demand | Hold | 4 | 0.40, 0.60 |
| High Demand | Expand | 12 | 0.20, 0.80 |
Formula Used
The calculator applies the Bellman optimality update for each state:
V_{k+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V_k(s') ]
For cost minimization, the calculator replaces the max operator with min.
Here, R(s,a) is the immediate reward or cost, γ is the discount factor, and P(s'|s,a) is the probability of moving from state s to next state s' after taking action a.
Value iteration repeats the update until the largest change between consecutive value vectors falls below the chosen tolerance, or until the maximum iteration limit is reached.
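The update and stopping rule above can be sketched in a few lines of NumPy, using the example model from the table. The discount factor, tolerance, and iteration limit shown here are illustrative choices, not the calculator's defaults:

```python
import numpy as np

# Example model from the table above (state order: [Low Demand, High Demand]).
# Rewards: one row per state, one column per action [Hold, Expand].
R = np.array([[5.0, 9.0],
              [4.0, 12.0]])

# One state-by-state transition matrix per action; P[a][s, s'] = P(s' | s, a).
P = np.array([
    [[0.70, 0.30],   # Hold
     [0.40, 0.60]],
    [[0.55, 0.45],   # Expand
     [0.20, 0.80]],
])

def value_iteration(R, P, gamma=0.9, tol=1e-6, max_iter=1000):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = R(s, a) + gamma * sum over s' of P(s'|s,a) * V(s')
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        # Stop when the largest change between successive value vectors is small.
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

V, policy = value_iteration(R, P)
```

With γ = 0.9, both states prefer Expand, and High Demand ends up with the larger value.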
How to Use This Calculator
- Enter the state names and the possible actions.
- Choose whether you want to maximize rewards or minimize costs.
- Set the discount factor, tolerance, and maximum iteration count.
- Type the reward matrix with one row per state and one column per action.
- Enter one transition matrix for each action. Each matrix is state-by-state: row s, column s' holds the probability of moving from state s to state s' under that action.
- Submit the form to view values, optimal policy, Q-values, iteration history, and downloadable exports.
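Taken together, the inputs described above can be arranged like this (a hypothetical layout sketch; the field names are illustrative, not the calculator's actual form fields):

```python
inputs = {
    "states": ["Low Demand", "High Demand"],
    "actions": ["Hold", "Expand"],
    "objective": "max",        # or "min" for cost minimization
    "gamma": 0.9,
    "tolerance": 1e-6,
    "max_iterations": 1000,
    # Rewards: one row per state, one column per action.
    "rewards": [[5, 9], [4, 12]],
    # Transitions: one state-by-state matrix per action.
    "transitions": {
        "Hold":   [[0.70, 0.30], [0.40, 0.60]],
        "Expand": [[0.55, 0.45], [0.20, 0.80]],
    },
}
```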
FAQs
1. What does this solver calculate?
It computes the value function for each state, identifies the best action under the chosen objective, and shows Q-values and convergence history from value iteration.
2. What is the discount factor?
The discount factor weights future outcomes relative to immediate outcomes. Larger values place more emphasis on long-run consequences, while smaller values favor short-term results.
3. Why must each transition row sum to one?
Each row represents a full probability distribution over next states for a specific current state and action. Total probability must equal one for the model to be valid.
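This constraint is easy to check programmatically before solving. A minimal sketch, where `atol` is an assumed tolerance for rounding error:

```python
import numpy as np

def validate_transitions(P, atol=1e-8):
    """Return True if every row of every action's transition matrix sums to 1."""
    P = np.asarray(P, dtype=float)
    row_sums = P.sum(axis=-1)
    return np.allclose(row_sums, 1.0, atol=atol)
```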
4. What is the difference between max and min objectives?
Max chooses actions with the highest expected discounted reward. Min chooses actions with the lowest expected discounted cost, which is useful for planning and control problems.
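The two objectives differ only in the operator applied to the Q-values; everything else in the update is identical. A minimal sketch of a single Bellman update supporting both:

```python
import numpy as np

def bellman_update(V, R, P, gamma, objective="max"):
    """One value-iteration step; P is a stack of per-action transition matrices."""
    # Q[s, a] = R(s, a) + gamma * expected next-state value under action a.
    Q = R + gamma * np.stack([Pa @ V for Pa in P], axis=1)
    return Q.max(axis=1) if objective == "max" else Q.min(axis=1)
```

Starting from V = 0 on the example model, "max" returns the best immediate rewards per state and "min" returns the worst.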
5. What does tolerance control?
Tolerance defines the stopping threshold. When the maximum change between two successive value vectors becomes smaller than this number, the algorithm stops.
6. Can I use negative rewards or costs?
Yes. Negative rewards, penalties, and mixed reward structures are allowed, provided the matrices remain numeric and the transition model is valid.
7. When should I enable automatic row normalization?
Enable it when your transition rows are slightly off because of rounding or manual entry. It rescales each row so the probabilities sum correctly.
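Normalization simply divides each row by its own sum, preserving the relative proportions. A minimal sketch:

```python
import numpy as np

def normalize_rows(P):
    """Rescale each transition row so its probabilities sum to exactly 1."""
    P = np.asarray(P, dtype=float)
    return P / P.sum(axis=-1, keepdims=True)
```

For example, a row entered as 0.69, 0.30 would be rescaled to roughly 0.697, 0.303.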
8. Is this the same as reinforcement learning?
Not exactly. This tool solves a known model with explicit rewards and transitions. Reinforcement learning typically estimates good policies from sampled experience instead.