Contextual Bandit Calculator

Model contextual choices with interpretable reward and uncertainty inputs. Compare LinUCB, epsilon-greedy, and softmax policies, and pick smarter actions by balancing data, confidence, and exploration.

Calculator Inputs

Configure context, policy, and arm parameters


Policy settings

Context features

These values represent the current request, user, environment, or session features.

Quick guidance

  • Use LinUCB when uncertainty should drive exploration.
  • Use epsilon-greedy for simple controlled exploration.
  • Use softmax for smoother probability-based allocation.
  • Actual rewards are optional but useful for regret analysis.
  • Weight signs can be positive or negative.

Arm A parameters

Arm B parameters

Arm C parameters

Example Data

Sample contextual bandit dataset

This example mirrors the default values in the calculator so you can test the workflow quickly.

| Arm | Bias | Uncertainty | Actual Reward | Weight 1 | Weight 2 | Weight 3 | Weight 4 |
| Recommendation Model A | 0.12 | 0.18 | 0.74 | 0.42 | 0.28 | 0.15 | 0.10 |
| Recommendation Model B | 0.09 | 0.12 | 0.68 | 0.34 | 0.36 | 0.18 | 0.07 |
| Recommendation Model C | 0.15 | 0.16 | 0.79 | 0.30 | 0.22 | 0.31 | 0.16 |

Default context: Feature 1 = 0.80, Feature 2 = 0.50, Feature 3 = 0.30, Feature 4 = 0.20.
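As a quick check, the estimated-reward formula can be applied to the example rows above. This is an illustrative sketch, not the calculator's implementation; the biases, weights, and context values are copied from the example table and the default context.

```python
# Sketch: reproduce the "Estimated reward" step for the example arms.
ARMS = {
    "Recommendation Model A": (0.12, [0.42, 0.28, 0.15, 0.10]),
    "Recommendation Model B": (0.09, [0.34, 0.36, 0.18, 0.07]),
    "Recommendation Model C": (0.15, [0.30, 0.22, 0.31, 0.16]),
}
CONTEXT = [0.80, 0.50, 0.30, 0.20]  # default context from above

def estimated_reward(bias, weights, context):
    """r̂(a) = bias(a) + Σ weight(a, i) × context(i)."""
    return bias + sum(w * x for w, x in zip(weights, context))

for name, (bias, weights) in ARMS.items():
    print(f"{name}: {estimated_reward(bias, weights, CONTEXT):.3f}")
# Model A scores 0.661, Model B 0.610, Model C 0.625
```

On the default context, Model A has the highest estimated reward, even though Model C carries the largest bias.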

Formula Used

Core equations behind the calculator

Estimated reward:
r̂(a) = bias(a) + Σ [weight(a,i) × context(i)]
LinUCB score:
score(a) = r̂(a) + α × uncertainty(a)
Epsilon-greedy share:
best arm share = 1 − ε + ε / K, each other arm's share = ε / K
Softmax probability:
P(a) = exp(r̂(a) / τ) ÷ Σ exp(r̂(j) / τ)
Expected cumulative reward:
total reward = selected estimated reward × planned decisions
Actual regret:
regret = oracle actual reward − selected actual reward
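The equations above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's code: alpha, epsilon, tau, and the planned-decisions horizon are assumed values, while the reward and uncertainty numbers reuse the example data.

```python
import math

def linucb_score(r_hat, uncertainty, alpha):
    # score(a) = r̂(a) + α × uncertainty(a)
    return r_hat + alpha * uncertainty

def epsilon_greedy_shares(r_hats, epsilon):
    # Best arm keeps 1 − ε + ε/K; every other arm gets ε/K.
    k = len(r_hats)
    best = max(range(k), key=lambda i: r_hats[i])
    return [1 - epsilon + epsilon / k if i == best else epsilon / k
            for i in range(k)]

def softmax_probs(r_hats, tau):
    # P(a) = exp(r̂(a)/τ) ÷ Σ exp(r̂(j)/τ)
    exps = [math.exp(r / tau) for r in r_hats]
    total = sum(exps)
    return [e / total for e in exps]

r_hats = [0.661, 0.610, 0.625]       # estimated rewards per arm (example data)
uncertainties = [0.18, 0.12, 0.16]
scores = [linucb_score(r, u, alpha=1.0) for r, u in zip(r_hats, uncertainties)]
shares = epsilon_greedy_shares(r_hats, epsilon=0.2)
probs = softmax_probs(r_hats, tau=0.1)

# Expected cumulative reward and actual regret for the selected arm:
planned = 1000                       # assumed decision horizon
total_reward = max(r_hats) * planned
actual = [0.74, 0.68, 0.79]          # actual rewards from the example table
regret = max(actual) - actual[0]     # oracle (arm C, 0.79) minus selected (arm A, 0.74)
```

Note that the arm with the best estimate (arm A) is not the oracle arm here, which is exactly what the regret term measures.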

How To Use

Steps for running the calculator correctly

  1. Choose the policy you want to evaluate: LinUCB, epsilon-greedy, or softmax.
  2. Enter exploration settings such as alpha, epsilon, temperature, and the number of planned decisions.
  3. Add the live context values that describe the current user, session, or environment.
  4. For each arm, supply a label, bias, uncertainty estimate, and four feature weights.
  5. Optionally enter actual rewards for all arms if you want the calculator to compute regret.
  6. Press submit to view the result summary above the form, then export the output as CSV or PDF.
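The steps above can be strung together into a small end-to-end run. This is a hypothetical sketch: the arm parameters are the example defaults, alpha and the planned-decision count are assumed, and the CSV column layout is an illustration rather than the calculator's actual export format.

```python
import csv
import io

# Step 4: per-arm label, bias, uncertainty, and feature weights (example data).
arms = {
    "Model A": {"bias": 0.12, "unc": 0.18, "w": [0.42, 0.28, 0.15, 0.10]},
    "Model B": {"bias": 0.09, "unc": 0.12, "w": [0.34, 0.36, 0.18, 0.07]},
    "Model C": {"bias": 0.15, "unc": 0.16, "w": [0.30, 0.22, 0.31, 0.16]},
}
context = [0.80, 0.50, 0.30, 0.20]   # step 3: live context values
alpha, planned = 1.0, 1000           # step 2: assumed exploration settings

# Steps 1 and 6: score each arm under LinUCB and collect result rows.
rows = []
for name, a in arms.items():
    r_hat = a["bias"] + sum(w * x for w, x in zip(a["w"], context))
    rows.append({"arm": name, "r_hat": round(r_hat, 3),
                 "linucb": round(r_hat + alpha * a["unc"], 3)})

best = max(rows, key=lambda r: r["linucb"])
best["expected_total"] = round(best["r_hat"] * planned, 1)

# Step 6: export the summary as CSV (in-memory here for illustration).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["arm", "r_hat", "linucb", "expected_total"])
writer.writeheader()
writer.writerows(rows)
```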

FAQs

Common questions about contextual bandit modeling

1. What does this calculator estimate?

It estimates contextual reward scores for several actions, compares exploration policies, and highlights the arm most likely to perform best under the chosen decision rule.

2. When should I use LinUCB?

Use LinUCB when you want uncertainty-aware exploration. It adds a confidence bonus, so arms with less data can still receive traffic when their upside remains plausible.

3. What does epsilon control?

Epsilon controls random exploration. Higher epsilon sends more traffic away from the current best estimated arm and spreads opportunities across competing actions.
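A quick numeric example of this rule, assuming K = 3 arms and ε = 0.2:

```python
# Epsilon-greedy allocation: best arm keeps 1 − ε + ε/K, others get ε/K each.
def shares(epsilon, k):
    best = 1 - epsilon + epsilon / k
    other = epsilon / k
    return best, other

best, other = shares(0.2, 3)
print(round(best, 4), round(other, 4))  # 0.8667 0.0667
```

Raising ε shifts more of the best arm's share toward the other arms while the shares always sum to 1.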

4. What does temperature mean in softmax?

Temperature controls how sharply probabilities react to reward differences. Lower values concentrate traffic on the best arm, while higher values distribute traffic more evenly.
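A small demonstration of this effect, using two assumed reward estimates and two assumed temperatures:

```python
import math

def softmax(r_hats, tau):
    # P(a) = exp(r̂(a)/τ) ÷ Σ exp(r̂(j)/τ)
    exps = [math.exp(r / tau) for r in r_hats]
    total = sum(exps)
    return [e / total for e in exps]

r_hats = [0.66, 0.61]             # assumed estimates, 0.05 apart
low = softmax(r_hats, tau=0.05)   # sharp: the better arm dominates
high = softmax(r_hats, tau=1.0)   # smooth: traffic split almost evenly
```

With τ = 0.05 the better arm receives roughly 73% of traffic; with τ = 1.0 it receives barely over 51%, even though the reward gap is identical.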

5. Why are actual rewards optional?

Actual rewards are useful for regret analysis, but not required for score estimation. You can still compare predicted performance using model weights, context, and uncertainty.

6. Can I use negative feature weights?

Yes. Negative weights simply mean a feature lowers the estimated reward for that arm. This is common when a context signal predicts weaker response.

7. What does expected regret show?

Expected regret measures the estimated reward lost by not picking the best estimated arm every time across the planned decision horizon.
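A one-line worked example with assumed numbers (a 0.051 estimate gap over 1,000 planned decisions):

```python
# Expected regret of always playing an arm whose estimate trails the best
# estimated arm; all three inputs are assumed, not calculator defaults.
best_r_hat = 0.661
chosen_r_hat = 0.610
planned_decisions = 1000

expected_regret = (best_r_hat - chosen_r_hat) * planned_decisions
print(round(expected_regret, 1))  # 51.0
```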

8. How should I choose the planned decisions value?

Set it to the number of impressions, sessions, or allocation rounds you expect. This lets the calculator scale single-step estimates into campaign-level expectations.

Related Calculators

  • Cosine similarity
  • Pairwise ranking
  • NDCG score
  • Novelty score
  • ALS factorization
  • Churn reduction
  • Bandit regret
  • Serendipity score
  • Exploration rate
  • User similarity

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of the results. Please consult other sources as well.