Calculator Inputs
Example Data Table
| Index | Pᵢ | Qᵢ | Pᵢ·ln(Pᵢ/Qᵢ) |
|---|---|---|---|
| 1 | 0.40 | 0.30 | 0.1151 |
| 2 | 0.35 | 0.40 | -0.0467 |
| 3 | 0.25 | 0.30 | -0.0456 |
| Total | 1.00 | 1.00 | D(P‖Q) ≈ 0.0228 |
Formula Used
Relative entropy between distributions P and Q is:

D(P‖Q) = Σᵢ Pᵢ · log(Pᵢ / Qᵢ)
- log base e gives nats; base 2 gives bits.
- If Pᵢ = 0, that term contributes 0 (by convention, since x·log x → 0 as x → 0).
- If Qᵢ = 0 while Pᵢ > 0, the divergence is infinite.
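The formula and the two zero-handling rules above can be sketched in a few lines of Python (a minimal illustration with a hypothetical function name, not the calculator's actual implementation):

```python
import math

def kl_divergence(p, q, base=math.e):
    """D(P||Q) = sum of p_i * log(p_i / q_i) over indices where p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a term with p_i = 0 contributes 0 by convention
        if qi == 0:
            return math.inf  # Q rules out an event that P can produce
        total += pi * math.log(pi / qi, base)
    return total

# The example table above: 0.1151 - 0.0467 - 0.0456 ≈ 0.0228 nats
print(kl_divergence([0.40, 0.35, 0.25], [0.30, 0.40, 0.30]))
```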
How to Use This Calculator
- Paste your P and Q lists; both must have the same length.
- Choose your delimiter and preferred log base.
- Enable normalization if inputs are counts or weights.
- Optional: add epsilon smoothing to reduce zero issues.
- Click Calculate to view the full per-term breakdown.
- Use Download CSV or Download PDF for sharing.
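Behind the first two steps, parsing delimited text into numbers is straightforward; a rough sketch (the helper name is hypothetical, not the calculator's code):

```python
def parse_list(text, delimiter=","):
    """Split delimited text and convert each non-empty token to a float."""
    return [float(tok) for tok in text.split(delimiter) if tok.strip()]

print(parse_list("40, 35, 25"))  # [40.0, 35.0, 25.0]
```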
Why Relative Entropy Matters
Relative entropy, also called KL divergence, quantifies how one probability model differs from another. In practice, it measures extra coding cost when Q is used to represent events generated by P. A value of 0 means identical distributions; larger values indicate stronger mismatch. It is directional, so D(P‖Q) generally differs from D(Q‖P).
Input Quality and Normalization
This calculator accepts probabilities or raw weights. When weights are supplied, enabling normalization converts each list into a valid distribution that sums to 1. For example, counts 40, 35, 25 become 0.40, 0.35, 0.25. If totals are very small or inconsistent, normalization prevents scale from distorting the comparison.
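The normalization step described above amounts to dividing each weight by the total; a short sketch (assumed helper, not the calculator's exact code):

```python
def normalize(weights):
    """Scale nonnegative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]

print(normalize([40, 35, 25]))  # [0.4, 0.35, 0.25]
```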
Choosing the Log Base
The log base controls units. Natural log reports nats, base 2 reports bits, and base 10 reports hartleys. If you are evaluating compression or coding efficiency, bits are common. For statistical modeling and likelihood work, nats are often preferred. Changing base rescales results by a constant factor, preserving rankings.
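Because changing the base only rescales the result by a constant, converting between units is a single division; for example, nats to bits (illustrative helper):

```python
import math

def nats_to_bits(d_nats):
    """Divide by ln 2 to convert a divergence from nats to bits."""
    return d_nats / math.log(2)

# One nat is about 1.4427 bits; ln 2 nats is exactly 1 bit.
print(nats_to_bits(math.log(2)))  # 1.0
```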
Handling Zeros with Smoothing
If any Qᵢ equals 0 while Pᵢ > 0, the divergence becomes infinite because Q assigns zero probability to an event that P can produce. To avoid this, you can apply epsilon smoothing: add a small ε to every entry before normalization. Typical values range from 1e-6 to 1e-3, depending on sample size.
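Epsilon smoothing as described can be sketched like this (hypothetical helper; assumes renormalization after the bump so the result is still a distribution):

```python
def smooth(dist, eps=1e-6):
    """Add eps to every entry, then rescale so the result sums to 1."""
    bumped = [x + eps for x in dist]
    total = sum(bumped)
    return [x / total for x in bumped]

q = smooth([0.7, 0.3, 0.0])
assert all(x > 0 for x in q)       # no zeros remain
assert abs(sum(q) - 1.0) < 1e-9    # still a valid distribution
```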
Interpreting the Per‑Term Breakdown
The table shows Pᵢ/Qᵢ, the log ratio, and the contribution Pᵢ·log(Pᵢ/Qᵢ) for each index. Positive terms occur where P assigns higher probability than Q; negative terms occur where Q is higher. The Plotly graph visualizes P and Q side by side and overlays term contributions to highlight the dominant mismatches. In A/B tests, report KL in bits per symbol; a drop from 0.050 to 0.020 bits can indicate a meaningful calibration improvement. For monitoring, track the weekly median and 95th percentile of the overall divergence, and promptly investigate categories with the largest per-term spikes.
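The per-term breakdown in the example table can be reproduced with a short loop (a sketch, not the calculator's code; values in nats, and assuming no zero entries):

```python
import math

def kl_terms(p, q):
    """Return (ratio, log_ratio, contribution) for each index, in nats."""
    return [(pi / qi, math.log(pi / qi), pi * math.log(pi / qi))
            for pi, qi in zip(p, q)]

for i, (ratio, log_ratio, term) in enumerate(
        kl_terms([0.40, 0.35, 0.25], [0.30, 0.40, 0.30]), start=1):
    print(f"{i}: ratio={ratio:.4f}  log={log_ratio:.4f}  term={term:.4f}")
```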
Related Metrics for Reporting
Alongside KL divergence, the calculator reports entropy H(P) and cross‑entropy H(P,Q). Cross‑entropy equals H(P) plus KL divergence, linking model mismatch to expected code length. Optionally, Jensen–Shannon divergence is provided; it is symmetric and finite when distributions are well‑defined, making it useful for dashboards and comparisons across many segments.
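The identity H(P,Q) = H(P) + D(P‖Q) and the symmetry of Jensen–Shannon divergence can both be checked numerically (a sketch with assumed helper names, natural-log units):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # KL of each distribution against the average M = (P + Q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.40, 0.35, 0.25], [0.30, 0.40, 0.30]
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
assert abs(js_divergence(p, q) - js_divergence(q, p)) < 1e-12  # symmetric
```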
FAQs
What is relative entropy in simple terms?
It measures how inefficient it is to use distribution Q when the data actually follows P. Zero means the distributions match; larger values mean Q deviates more from P.
Can KL divergence be negative?
No. The total divergence is always zero or positive, although individual per-index terms can be negative when Q assigns higher probability than P at that index.
Why does the result become infinite?
If Q assigns zero probability to an event that P assigns a positive probability, the ratio pᵢ/qᵢ diverges. Smoothing or clipping can prevent division by zero for practical comparisons.
Should I normalize my inputs?
Yes when you enter counts, scores, or weights. Normalization converts them into probabilities that sum to one, making the divergence comparable across datasets and time windows.
Which log base should I choose?
Use base 2 for bits in coding and compression contexts, natural base for statistical modeling, and base 10 when reporting in decimal units. Changing base rescales values but does not change comparisons.
What is Jensen–Shannon divergence used for?
It is a symmetric, bounded alternative built from KL divergence against the average distribution. It is often preferred for clustering, similarity searches, and dashboards because it behaves well with noisy data.