Markov Decision Process Calculator

Model uncertainty across states, actions, rewards, and outcomes. Test transitions quickly, and find optimal policies, state values, and decisions for planning tasks.

Calculator Inputs

Enter three states, three actions, rewards, and transition probabilities. Each state-action row should sum to one.

Inputs for Healthy

Inputs for Warning

Inputs for Critical

Example Data Table

This example matches the default inputs and models maintenance decisions across three operating states.

Current State | Action  | Reward | P(Healthy) | P(Warning) | P(Critical)
Healthy       | Hold    |      8 |       0.80 |       0.15 |        0.05
Healthy       | Repair  |      6 |       0.88 |       0.10 |        0.02
Healthy       | Replace |      3 |       0.95 |       0.04 |        0.01
Warning       | Hold    |      4 |       0.20 |       0.55 |        0.25
Warning       | Repair  |      5 |       0.55 |       0.35 |        0.10
Warning       | Replace |      2 |       0.90 |       0.08 |        0.02
Critical      | Hold    |     -6 |       0.05 |       0.25 |        0.70
Critical      | Repair  |     -1 |       0.35 |       0.45 |        0.20
Critical      | Replace |      1 |       0.92 |       0.06 |        0.02

Formula Used

Bellman optimality update:

V_{k+1}(s) = max_a [ R(s,a) + γ Σ P(s'|s,a) V_k(s') ]

Action value:

Q(s,a) = R(s,a) + γ Σ P(s'|s,a) V(s')

Stopping rule:

Δ = max_s |V_{k+1}(s) - V_k(s)|. Stop when Δ < tolerance.

Here, R(s,a) is the immediate reward, γ is the discount factor, and P(s'|s,a) is the transition probability to the next state.

The calculator applies value iteration, identifies the largest action value in each state, and returns the optimal policy with the associated state values.
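The value iteration described above can be sketched in Python using the example data table. The discount factor of 0.9, the tolerance, and the iteration cap are assumed defaults for illustration, not values taken from the calculator itself.

```python
# Value iteration sketch for the 3-state maintenance example above.
# R[s][a] is the immediate reward; P[s][a] is the transition row
# over (Healthy, Warning, Critical), each summing to one.

STATES = ["Healthy", "Warning", "Critical"]
ACTIONS = ["Hold", "Repair", "Replace"]

R = {
    "Healthy":  {"Hold": 8,  "Repair": 6,  "Replace": 3},
    "Warning":  {"Hold": 4,  "Repair": 5,  "Replace": 2},
    "Critical": {"Hold": -6, "Repair": -1, "Replace": 1},
}
P = {
    "Healthy":  {"Hold":    [0.80, 0.15, 0.05],
                 "Repair":  [0.88, 0.10, 0.02],
                 "Replace": [0.95, 0.04, 0.01]},
    "Warning":  {"Hold":    [0.20, 0.55, 0.25],
                 "Repair":  [0.55, 0.35, 0.10],
                 "Replace": [0.90, 0.08, 0.02]},
    "Critical": {"Hold":    [0.05, 0.25, 0.70],
                 "Repair":  [0.35, 0.45, 0.20],
                 "Replace": [0.92, 0.06, 0.02]},
}

def q_value(s, a, V, gamma):
    """Q(s,a) = R(s,a) + gamma * sum over s' of P(s'|s,a) * V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in zip(P[s][a], STATES))

def value_iteration(gamma=0.9, tol=1e-6, max_iter=1000):
    # gamma, tol, and max_iter are illustrative defaults (assumptions).
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iter):
        # Bellman optimality update: V_{k+1}(s) = max_a Q_k(s,a)
        V_new = {s: max(q_value(s, a, V, gamma) for a in ACTIONS)
                 for s in STATES}
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < tol:           # stopping rule from the formula section
            break
    # Greedy policy: pick the action with the largest Q value in each state.
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V, gamma))
              for s in STATES}
    return V, policy
```

With a discount factor of 0.9 this sketch converges to the policy Hold in Healthy, Repair in Warning, and Replace in Critical, mirroring the kind of output the calculator reports.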

How to Use This Calculator

  1. Enter labels for three states and three actions.
  2. Set the discount factor, tolerance, maximum iterations, and optional initial state values.
  3. Provide one immediate reward for every state-action pair.
  4. Enter transition probabilities from each state-action pair to the three next states.
  5. Press Calculate MDP to run value iteration.
  6. Review the optimal action for each state, the final values, and the action evaluation table.
  7. Use the Plotly graph to inspect convergence across iterations.
  8. Download the result as CSV or PDF when needed.

FAQs

1) What does this calculator solve?

It solves a discounted Markov decision process with three states and three actions. It estimates state values, compares all actions, and reports the best policy.

2) Why must probabilities sum to one?

Each state-action row represents a complete probability distribution over next states. A valid distribution must total one, or expected future values become mathematically inconsistent.
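A row can be checked for validity with a short helper; the small tolerance accounts for floating-point rounding (the function name is illustrative):

```python
def row_is_valid(row, tol=1e-9):
    # A valid transition row is non-negative and sums to one
    # (within a small floating-point tolerance).
    return all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
```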

3) What does the discount factor control?

The discount factor determines how strongly future value affects current decisions. A value near one emphasizes long-term outcomes, while smaller values emphasize immediate rewards.

4) What is value iteration?

Value iteration is an iterative dynamic programming method. It updates each state value using the best available action until the change between iterations becomes very small.

5) What does Q(s,a) mean here?

Q(s,a) is the action value. It combines the immediate reward with the discounted expected value of future states reachable from that action.

6) Why are some rows normalized automatically?

Normalization prevents invalid rows from breaking the model. When a row does not sum to one, the calculator rescales its probabilities and records a visible note.
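The rescaling step can be sketched as follows; this is an assumed reconstruction of the behavior described, not the calculator's actual code:

```python
def normalize_row(row, tol=1e-9):
    # Rescale a transition row so it sums to one; report whether
    # an adjustment was made so the caller can surface a note.
    total = sum(row)
    if total <= 0:
        raise ValueError("row probabilities must have a positive sum")
    adjusted = abs(total - 1.0) > tol
    return [p / total for p in row], adjusted
```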

7) Can I use negative rewards?

Yes. Negative rewards can represent costs, losses, downtime, penalties, or risk exposure. The solver still chooses the action with the highest total discounted value.

8) When should I change tolerance or iterations?

Use a smaller tolerance when you need tighter convergence. Increase iterations if the model changes slowly, especially when the discount factor is close to one.

Related Calculators

Markov Chain Simulator
Markov Chain Probability Calculator
Stochastic Matrix Calculator

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.