Model uncertainty across states, actions, rewards, and outcomes. Test transitions quickly, and find the optimal policy, state values, and decisions for planning tasks.
Enter three states, three actions, their rewards, and transition probabilities. The transition probabilities in each state-action row should sum to one.
This example matches the default inputs and models maintenance decisions across three operating states.
| Current State | Action | Reward | P(Healthy) | P(Warning) | P(Critical) |
|---|---|---|---|---|---|
| Healthy | Hold | 8 | 0.80 | 0.15 | 0.05 |
| Healthy | Repair | 6 | 0.88 | 0.10 | 0.02 |
| Healthy | Replace | 3 | 0.95 | 0.04 | 0.01 |
| Warning | Hold | 4 | 0.20 | 0.55 | 0.25 |
| Warning | Repair | 5 | 0.55 | 0.35 | 0.10 |
| Warning | Replace | 2 | 0.90 | 0.08 | 0.02 |
| Critical | Hold | -6 | 0.05 | 0.25 | 0.70 |
| Critical | Repair | -1 | 0.35 | 0.45 | 0.20 |
| Critical | Replace | 1 | 0.92 | 0.06 | 0.02 |
Bellman optimality update:
V_{k+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V_k(s') ]
Action value:
Q(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')
Stopping rule:
Δ = max_s |V_{k+1}(s) - V_k(s)|. Stop when Δ < tolerance.
Here, R(s,a) is the immediate reward, γ ∈ [0, 1) is the discount factor, and P(s'|s,a) is the probability of moving to next state s' after taking action a in state s.
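The update rules above can be sketched in Python on the example table. The discount factor (γ = 0.9 here) and the tolerance are illustrative assumptions, since the text does not fix them; the calculator's own defaults may differ.

```python
# Value iteration for the three-state maintenance MDP from the table above.
# Assumed parameters (not specified in the text): gamma = 0.9, tol = 1e-6.

STATES = ["Healthy", "Warning", "Critical"]
ACTIONS = ["Hold", "Repair", "Replace"]

# Each entry: (reward, transition probabilities to [Healthy, Warning, Critical])
MODEL = {
    ("Healthy", "Hold"):     (8,  [0.80, 0.15, 0.05]),
    ("Healthy", "Repair"):   (6,  [0.88, 0.10, 0.02]),
    ("Healthy", "Replace"):  (3,  [0.95, 0.04, 0.01]),
    ("Warning", "Hold"):     (4,  [0.20, 0.55, 0.25]),
    ("Warning", "Repair"):   (5,  [0.55, 0.35, 0.10]),
    ("Warning", "Replace"):  (2,  [0.90, 0.08, 0.02]),
    ("Critical", "Hold"):    (-6, [0.05, 0.25, 0.70]),
    ("Critical", "Repair"):  (-1, [0.35, 0.45, 0.20]),
    ("Critical", "Replace"): (1,  [0.92, 0.06, 0.02]),
}

def q_value(s, a, V, gamma):
    """Action value: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    r, probs = MODEL[(s, a)]
    return r + gamma * sum(p * V[s2] for p, s2 in zip(probs, STATES))

def value_iteration(gamma=0.9, tol=1e-6, max_iters=10_000):
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        # Bellman optimality update for every state
        V_new = {s: max(q_value(s, a, V, gamma) for a in ACTIONS) for s in STATES}
        # Stopping rule: largest change across states below tolerance
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < tol:
            break
    # Greedy policy: the action with the largest Q(s,a) in each state
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V, gamma)) for s in STATES}
    return V, policy

V, policy = value_iteration()
print(policy)
```

With these assumed settings the solver keeps the machine running while it is healthy, repairs on a warning, and replaces once it is critical.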
The calculator solves a discounted Markov decision process with three states and three actions. It applies value iteration to estimate the state values, compares the action values Q(s,a) in each state, and returns the optimal policy together with the associated state values.
Each state-action row represents a complete probability distribution over next states. A valid distribution must sum to one; otherwise the expected future value Σ_{s'} P(s'|s,a) V(s') is not well defined.
The discount factor determines how strongly future value affects current decisions. A value near one emphasizes long-term outcomes, while smaller values emphasize immediate rewards.
Value iteration is an iterative dynamic programming method. It updates each state value using the best available action until the change between iterations becomes very small.
Q(s,a) is the action value. It combines the immediate reward with the discounted expected value of future states reachable from that action.
Normalization prevents invalid rows from breaking the model. When a row does not sum to one, the calculator rescales its probabilities and records a visible note.
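The rescaling step might look like the following sketch. The function name and tolerance are illustrative, not the calculator's actual code, and a row summing to zero cannot be rescaled at all:

```python
# Rescale a transition row so its probabilities sum to one,
# flagging the row so a visible note can be recorded.
def normalize_row(probs, tol=1e-9):
    total = sum(probs)
    if abs(total - 1.0) <= tol:
        return probs, False            # already a valid distribution
    if total <= 0:
        raise ValueError("row sum must be positive to normalize")
    return [p / total for p in probs], True  # rescaled; note should be shown

# A row that sums to 0.9 gets rescaled to a valid distribution.
row, was_rescaled = normalize_row([0.5, 0.3, 0.1])
```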
Rewards can be negative: they can represent costs, losses, downtime, penalties, or risk exposure. The solver still chooses the action with the highest total discounted value.
Use a smaller tolerance when you need tighter convergence, and increase the iteration limit when values change slowly between sweeps, especially when the discount factor is close to one.
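A minimal toy, not the calculator's model, shows why a discount factor near one needs more sweeps: with a single self-looping state the update is V_{k+1} = r + γ·V_k, so the change between sweeps shrinks by a factor of γ each iteration.

```python
# Count sweeps until the per-iteration change drops below the tolerance
# for a single self-looping state with reward r (illustrative toy example).
def iterations_to_converge(gamma, r=1.0, tol=1e-6, max_iters=100_000):
    V = 0.0
    for k in range(1, max_iters + 1):
        V_new = r + gamma * V        # one-state Bellman update
        if abs(V_new - V) < tol:     # stopping rule
            return k
        V = V_new
    return max_iters

fast = iterations_to_converge(0.5)    # converges in a few dozen sweeps
slow = iterations_to_converge(0.99)   # needs far more sweeps
```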
Important Note: All the Calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.