Model uncertainty across states, actions, rewards, and outcomes. Test transitions quickly, and find the optimal policy, state values, and decisions for planning tasks.
Enter three states, three actions, their rewards, and transition probabilities. The transition probabilities in each state-action row should sum to one.
This example matches the default inputs and models maintenance decisions across three operating states.
| Current State | Action | Reward | P(Healthy) | P(Warning) | P(Critical) |
|---|---|---|---|---|---|
| Healthy | Hold | 8 | 0.80 | 0.15 | 0.05 |
| Healthy | Repair | 6 | 0.88 | 0.10 | 0.02 |
| Healthy | Replace | 3 | 0.95 | 0.04 | 0.01 |
| Warning | Hold | 4 | 0.20 | 0.55 | 0.25 |
| Warning | Repair | 5 | 0.55 | 0.35 | 0.10 |
| Warning | Replace | 2 | 0.90 | 0.08 | 0.02 |
| Critical | Hold | -6 | 0.05 | 0.25 | 0.70 |
| Critical | Repair | -1 | 0.35 | 0.45 | 0.20 |
| Critical | Replace | 1 | 0.92 | 0.06 | 0.02 |
Bellman optimality update:
V_{k+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V_k(s') ]
Action value:
Q(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')
Stopping rule:
Δ = max_s |V_{k+1}(s) - V_k(s)|. Stop when Δ < tolerance.
Here, R(s,a) is the immediate reward, γ ∈ [0, 1) is the discount factor, and P(s'|s,a) is the probability of moving to next state s' after taking action a in state s.
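The update rules above can be sketched in Python on the example table. The discount factor (γ = 0.9 here) and the tolerance are illustrative assumptions, since the text does not fix them; the calculator's own defaults may differ.

```python
# Value iteration for the three-state maintenance MDP from the table above.
# Assumed parameters (not specified in the text): gamma = 0.9, tol = 1e-6.

STATES = ["Healthy", "Warning", "Critical"]
ACTIONS = ["Hold", "Repair", "Replace"]

# Each entry: (reward, transition probabilities to [Healthy, Warning, Critical])
MODEL = {
    ("Healthy", "Hold"):     (8,  [0.80, 0.15, 0.05]),
    ("Healthy", "Repair"):   (6,  [0.88, 0.10, 0.02]),
    ("Healthy", "Replace"):  (3,  [0.95, 0.04, 0.01]),
    ("Warning", "Hold"):     (4,  [0.20, 0.55, 0.25]),
    ("Warning", "Repair"):   (5,  [0.55, 0.35, 0.10]),
    ("Warning", "Replace"):  (2,  [0.90, 0.08, 0.02]),
    ("Critical", "Hold"):    (-6, [0.05, 0.25, 0.70]),
    ("Critical", "Repair"):  (-1, [0.35, 0.45, 0.20]),
    ("Critical", "Replace"): (1,  [0.92, 0.06, 0.02]),
}

def q_value(s, a, V, gamma):
    """Action value: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    r, probs = MODEL[(s, a)]
    return r + gamma * sum(p * V[s2] for p, s2 in zip(probs, STATES))

def value_iteration(gamma=0.9, tol=1e-6, max_iters=10_000):
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        # Bellman optimality update for every state
        V_new = {s: max(q_value(s, a, V, gamma) for a in ACTIONS) for s in STATES}
        # Stopping rule: largest change across states below tolerance
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < tol:
            break
    # Greedy policy: the action with the largest Q(s,a) in each state
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V, gamma)) for s in STATES}
    return V, policy

V, policy = value_iteration()
print(policy)
```

With these assumed settings the solver keeps the machine running while it is healthy, repairs on a warning, and replaces once it is critical.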
The calculator solves a discounted Markov decision process with three states and three actions. It applies value iteration to estimate the state values, compares the action values Q(s,a) in each state, and returns the optimal policy together with the associated state values.
Each state-action row represents a complete probability distribution over next states. A valid distribution must sum to one; otherwise the expected future value Σ_{s'} P(s'|s,a) V(s') is not well defined.
The discount factor determines how strongly future value affects current decisions. A value near one emphasizes long-term outcomes, while smaller values emphasize immediate rewards.
Value iteration is an iterative dynamic programming method. It updates each state value using the best available action until the change between iterations becomes very small.
Q(s,a) is the action value. It combines the immediate reward with the discounted expected value of future states reachable from that action.
Normalization prevents invalid rows from breaking the model. When a row does not sum to one, the calculator rescales its probabilities and records a visible note.
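The rescaling step might look like the following sketch. The function name and tolerance are illustrative, not the calculator's actual code, and a row summing to zero cannot be rescaled at all:

```python
# Rescale a transition row so its probabilities sum to one,
# flagging the row so a visible note can be recorded.
def normalize_row(probs, tol=1e-9):
    total = sum(probs)
    if abs(total - 1.0) <= tol:
        return probs, False            # already a valid distribution
    if total <= 0:
        raise ValueError("row sum must be positive to normalize")
    return [p / total for p in probs], True  # rescaled; note should be shown

# A row that sums to 0.9 gets rescaled to a valid distribution.
row, was_rescaled = normalize_row([0.5, 0.3, 0.1])
```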
Rewards can be negative: they can represent costs, losses, downtime, penalties, or risk exposure. The solver still chooses the action with the highest total discounted value.
Use a smaller tolerance when you need tighter convergence, and increase the iteration limit when values change slowly between sweeps, especially when the discount factor is close to one.
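A minimal toy, not the calculator's model, shows why a discount factor near one needs more sweeps: with a single self-looping state the update is V_{k+1} = r + γ·V_k, so the change between sweeps shrinks by a factor of γ each iteration.

```python
# Count sweeps until the per-iteration change drops below the tolerance
# for a single self-looping state with reward r (illustrative toy example).
def iterations_to_converge(gamma, r=1.0, tol=1e-6, max_iters=100_000):
    V = 0.0
    for k in range(1, max_iters + 1):
        V_new = r + gamma * V        # one-state Bellman update
        if abs(V_new - V) < tol:     # stopping rule
            return k
        V = V_new
    return max_iters

fast = iterations_to_converge(0.5)    # converges in a few dozen sweeps
slow = iterations_to_converge(0.99)   # needs far more sweeps
```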
Important Note: All the Calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.