Calculator
Example Data Table
| Row | Category | Target | Label Code | Frequency Ratio |
|---|---|---|---|---|
| 1 | Red | 12 | 0 | 0.40 |
| 2 | Blue | 18 | 1 | 0.30 |
| 3 | Green | 9 | 2 | 0.20 |
| 4 | Blue | 20 | 1 | 0.30 |
| 5 | Yellow | 7 | 3 | 0.10 |
Formula Used
Label Encoding
Assign each unique category a numeric index. If the ordered unique set is U = (u₀, u₁, …), then e(uᵢ) = start + i, so the first category in U receives the code start.
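The label encoding rule can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the function name `label_encode` is hypothetical, and ordering by first appearance is an assumption (the calculator lets you choose the ordering).

```python
def label_encode(categories, start=0):
    """Map each unique category to start + i, in order of first appearance (assumed ordering)."""
    mapping = {}
    for c in categories:
        if c not in mapping:
            # len(mapping) is the zero-based position of this new category
            mapping[c] = start + len(mapping)
    codes = [mapping[c] for c in categories]
    return mapping, codes


# Categories from the example data table
categories = ["Red", "Blue", "Green", "Blue", "Yellow"]
mapping, codes = label_encode(categories, start=0)
print(mapping)  # {'Red': 0, 'Blue': 1, 'Green': 2, 'Yellow': 3}
print(codes)    # [0, 1, 2, 1, 3]
```

With `start=0` the codes match the Label Code column in the example table.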
One Hot Encoding
Create one binary column per category. For column j, xᵢⱼ = 1 when cᵢ = uⱼ, otherwise xᵢⱼ = 0.
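As a rough sketch of the one hot rule, the following Python builds one binary column per unique category. The function name `one_hot_encode` and the first-appearance column order are assumptions for illustration; the `drop_first` flag mirrors the drop-first option by omitting the first category's column.

```python
def one_hot_encode(categories, drop_first=False):
    """Return (column names, rows of 0/1 values), one column per unique category."""
    uniques = list(dict.fromkeys(categories))  # unique values in first-appearance order
    columns = uniques[1:] if drop_first else uniques
    rows = [[1 if c == u else 0 for u in columns] for c in categories]
    return columns, rows


categories = ["Red", "Blue", "Green", "Blue", "Yellow"]
columns, rows = one_hot_encode(categories)
print(columns)  # ['Red', 'Blue', 'Green', 'Yellow']
print(rows[1])  # [0, 1, 0, 0]  (row 2 is Blue)

# With drop_first, the first category is represented by an all-zero row
reduced_columns, reduced_rows = one_hot_encode(categories, drop_first=True)
print(reduced_columns)  # ['Blue', 'Green', 'Yellow']
print(reduced_rows[0])  # [0, 0, 0]
```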
Frequency Encoding
Replace each category with how often it occurs. Count mode uses n(c), the number of rows in category c. Ratio mode uses f(c) = n(c) / N, where N is the total number of rows used.
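Both frequency modes can be sketched with `collections.Counter`. The function name `frequency_map` is hypothetical, and the ten-row sample below is made-up data chosen so the ratios come out to 0.40, 0.30, 0.20, and 0.10.

```python
from collections import Counter


def frequency_map(categories, mode="ratio"):
    """Return {category: n(c)} in count mode or {category: n(c) / N} in ratio mode."""
    counts = Counter(categories)
    total = len(categories)
    if mode == "count":
        return dict(counts)
    if mode == "ratio":
        return {c: n / total for c, n in counts.items()}
    raise ValueError("mode must be 'count' or 'ratio'")


# Hypothetical ten-row sample: 4 Red, 3 Blue, 2 Green, 1 Yellow
sample = ["Red"] * 4 + ["Blue"] * 3 + ["Green"] * 2 + ["Yellow"]
count_map = frequency_map(sample, mode="count")
ratio_map = frequency_map(sample, mode="ratio")
print(count_map)  # {'Red': 4, 'Blue': 3, 'Green': 2, 'Yellow': 1}
print(ratio_map)  # {'Red': 0.4, 'Blue': 0.3, 'Green': 0.2, 'Yellow': 0.1}

# Encode each row by looking up its category
encoded = [ratio_map[c] for c in sample]
```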
Ordinal Encoding
Map categories to ranked scores. If a category is unmapped, this tool applies the fallback code you provide.
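A minimal sketch of ordinal encoding with a fallback might look like this. The ranking dictionary, the category names, and the fallback value of -1 are all hypothetical examples, not values taken from the tool.

```python
def ordinal_encode(categories, ranking, fallback=-1):
    """Look up each category's rank; unmapped categories get the fallback code."""
    return [ranking.get(c, fallback) for c in categories]


# Hypothetical ranking supplied by the user
ranking = {"Low": 1, "Medium": 2, "High": 3}
values = ["Medium", "High", "Unknown", "Low"]
ordinal_codes = ordinal_encode(values, ranking, fallback=-1)
print(ordinal_codes)  # [2, 3, -1, 1]
```

Choosing a fallback outside the range of real ranks, such as -1 here, keeps unmapped categories distinguishable from ranked ones.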
Target Mean Encoding
This tool uses smoothing to reduce overfitting:
TE(c) = (Σy(c) + αμ) / (n(c) + α), where Σy(c) is the sum of target values over the rows in category c, μ is the global target mean, n(c) is the category count, and α is the smoothing value.
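The smoothed formula can be written directly in Python. This is a sketch under stated assumptions: the function name `target_mean_encode` is hypothetical, α = 1 is an arbitrary choice, and the inputs are the Category and Target columns from the example table.

```python
from collections import defaultdict


def target_mean_encode(categories, targets, alpha=1.0):
    """Return {category: (sum_y + alpha * mu) / (n + alpha)} for each category."""
    if not categories or len(categories) != len(targets):
        raise ValueError("categories and targets must be non-empty and the same length")
    mu = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)         # sum of targets per category
    counts = defaultdict(int)         # n(c) per category
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + alpha * mu) / (counts[c] + alpha) for c in counts}


categories = ["Red", "Blue", "Green", "Blue", "Yellow"]
targets = [12, 18, 9, 20, 7]
te = target_mean_encode(categories, targets, alpha=1.0)
for category, value in te.items():
    print(f"{category}: {value:.2f}")
# Red: 12.60
# Blue: 17.07
# Green: 11.10
# Yellow: 10.10
```

Here μ = 66 / 5 = 13.2, so Blue, with two rows summing to 38, gets (38 + 13.2) / (2 + 1) ≈ 17.07 instead of its raw mean of 19.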
Entropy and Cardinality
Entropy is H = −Σ p(c) log₂ p(c), where p(c) = n(c) / N is the share of rows in category c. Cardinality ratio is the number of unique categories divided by the number of rows used.
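Both metrics follow directly from the category counts. The sketch below uses a hypothetical function name, `entropy_and_cardinality`, and the five categories from the example table.

```python
import math
from collections import Counter


def entropy_and_cardinality(categories):
    """Return (Shannon entropy in bits, unique categories / total rows)."""
    counts = Counter(categories)
    total = len(categories)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    cardinality_ratio = len(counts) / total
    return entropy, cardinality_ratio


categories = ["Red", "Blue", "Green", "Blue", "Yellow"]
entropy, cardinality_ratio = entropy_and_cardinality(categories)
print(round(entropy, 4))   # 1.9219
print(cardinality_ratio)   # 0.8
```

Entropy reaches its maximum, log₂ of the number of unique categories, when every category is equally common; four equally common categories give exactly 2 bits.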
How to Use This Calculator
- Enter category values, one per line or comma-separated.
- Add target values only when you want target mean encoding.
- Choose the encoding method that matches your modeling need.
- Set ordering, smoothing, fallback code, and drop-first options if needed.
- Press Submit to generate the mapping table, row output, metrics, and Plotly chart.
- Use the CSV or PDF buttons to download your processed result.
FAQs
1. What input format does this tool accept?
You can paste categories one per line or as a comma-separated list. Target values follow the same pattern. For target mean encoding, the number of target values must match the number of category rows.
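One way to read this kind of input is to split on commas and line breaks, then trim whitespace. The sketch below is an assumption about how such parsing could work, not the tool's actual parser; `parse_values` and `parse_targets` are hypothetical names.

```python
import re


def parse_values(text):
    """Split pasted text on commas or line breaks and trim each item."""
    text = text.strip()
    if not text:
        return []
    return [item.strip() for item in re.split(r",|\r?\n", text)]


def parse_targets(text):
    """Parse target values with the same rules, converting each to float."""
    return [float(item) for item in parse_values(text)]


categories = parse_values("Red, Blue, Green\nBlue\nYellow")
targets = parse_targets("12, 18, 9, 20, 7")
print(categories)  # ['Red', 'Blue', 'Green', 'Blue', 'Yellow']
print(targets)     # [12.0, 18.0, 9.0, 20.0, 7.0]

# Target mean encoding needs one target per category row
if len(categories) != len(targets):
    raise ValueError("category and target row counts must match")
```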
2. When should I use label encoding?
Use label encoding for tree-based models, compact storage, or quick prototypes. Avoid it when the model may wrongly treat category codes as meaningful numeric distances.
3. Why does one hot encoding create many columns?
Each unique category becomes its own binary feature. High-cardinality inputs therefore expand quickly, which may increase memory usage and slow training on wide datasets.
4. What does frequency encoding preserve?
Frequency encoding preserves how common each category is within the dataset. It produces a single column, but categories that occur equally often receive the same value, so it does not preserve identity as clearly as one hot encoding.
5. When is ordinal encoding risky?
Ordinal encoding is risky when categories have no real ranking. A false order can introduce bias, because models may interpret higher codes as stronger or larger values.
6. Why is smoothing important in target encoding?
Smoothing pulls small category averages toward the global mean. This reduces instability, especially when rare categories would otherwise receive extreme values from just one or two rows.
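The pull toward the global mean is easy to see with a single-row category. The numbers below are hypothetical: one row with a target of 100 in a dataset whose global mean is 10, evaluated with the TE(c) formula at a few smoothing values.

```python
def smoothed_mean(sum_y, n, mu, alpha):
    """TE(c) = (sum_y + alpha * mu) / (n + alpha)."""
    return (sum_y + alpha * mu) / (n + alpha)


# Hypothetical rare category: one row, target 100, global mean 10
no_smoothing = smoothed_mean(100.0, 1, 10.0, alpha=0.0)
light = smoothed_mean(100.0, 1, 10.0, alpha=1.0)
strong = smoothed_mean(100.0, 1, 10.0, alpha=9.0)
print(no_smoothing)  # 100.0
print(light)         # 55.0
print(strong)        # 19.0
```

As α grows, the encoded value moves from the category's raw mean toward the global mean, so a single extreme row has less influence.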
7. How are blank categories handled?
Blank rows can either be dropped or converted into a literal missing category. This lets you test how models behave when absent labels become a distinct signal.
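The two behaviors can be sketched as follows. The function name `handle_blanks`, the mode names, and the `"Missing"` label are hypothetical choices for illustration; the tool's own label and option names may differ.

```python
def handle_blanks(categories, mode="drop", missing_label="Missing"):
    """Either drop blank entries or replace them with a literal missing label."""
    if mode == "drop":
        return [c for c in categories if c.strip()]
    if mode == "label":
        return [c if c.strip() else missing_label for c in categories]
    raise ValueError("mode must be 'drop' or 'label'")


raw = ["Red", "", "Blue", "   "]
dropped = handle_blanks(raw, mode="drop")
labeled = handle_blanks(raw, mode="label")
print(dropped)  # ['Red', 'Blue']
print(labeled)  # ['Red', 'Missing', 'Blue', 'Missing']
```

When rows are dropped and target values are in use, the matching target rows would need to be dropped as well so the row counts stay aligned.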
8. Can I export the encoded results?
Yes. After calculation, the page provides CSV export, PDF export, and a print option for reports, documentation, validation, or dataset review.