Calculator Input
Use direct values or extract a single CSV column for counting.
Example Data Table
This example mirrors a small label distribution audit often used before feature encoding or vocabulary filtering.
| Row | Raw Value | Normalized Value | Counted? |
|---|---|---|---|
| 1 | Cat | cat | Yes |
| 2 | cat | cat | Yes |
| 3 | DOG | dog | Yes |
| 4 | dog! | dog | Yes |
| 5 | Bird | bird | Yes |
| 6 | (blank) | (blank) | No, if blanks ignored |
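The table's counting logic can be sketched in a few lines of Python. This is an illustration, not the calculator's actual implementation; it assumes trimming, lowercasing, punctuation removal, and blank skipping are all enabled:

```python
from collections import Counter
import string

raw = ["Cat", "cat", "DOG", "dog!", "Bird", ""]

# Normalize: trim whitespace, lowercase, strip punctuation; skip blanks
punct = str.maketrans("", "", string.punctuation)
normalized = []
for v in raw:
    v = v.strip().lower().translate(punct)
    if v:
        normalized.append(v)

counts = Counter(normalized)
print(dict(counts))  # {'cat': 2, 'dog': 2, 'bird': 1}
print(len(counts))   # 3 unique values out of 5 processed
```

With the blank row skipped, N = 5 processed values collapse to U = 3 unique categories, matching the table.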
Formula Used
Unique Count: U = number of distinct normalized values
Duplicate Entries: D = N - U, where N is the total number of processed values.
Uniqueness Rate: (U / N) × 100
Duplicate Rate: (D / N) × 100
Share of a Value: (count of value / N) × 100
Shannon Entropy: H = -Σ p(i) log₂ p(i), where p(i) is the share of value i (its count divided by N)
Gini Impurity: G = 1 - Σ p(i)²
Effective Cardinality: 2^H. This estimates how many equally likely categories would produce the same entropy.
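All of the formulas above can be computed from a single frequency table. A minimal sketch in Python, using the normalized values from the example data:

```python
import math
from collections import Counter

values = ["cat", "cat", "dog", "dog", "bird"]  # normalized, blanks removed
counts = Counter(values)

N = len(values)                  # total processed values: 5
U = len(counts)                  # unique count: 3
D = N - U                        # duplicate entries: 2
uniqueness_rate = U / N * 100    # 60.0
duplicate_rate = D / N * 100     # 40.0

shares = [c / N for c in counts.values()]        # p(i) for each value
H = -sum(p * math.log2(p) for p in shares)       # Shannon entropy ≈ 1.52
G = 1 - sum(p * p for p in shares)               # Gini impurity = 0.64
effective_cardinality = 2 ** H                   # ≈ 2.87 equivalent categories
```

The effective cardinality (≈ 2.87) sits below the unique count of 3 because the distribution is not perfectly even.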
How to Use This Calculator
- Select Direct values for pasted lists or CSV column for structured data.
- Choose the delimiter or CSV settings that match your dataset.
- Enable preprocessing options such as trimming, lowercasing, blank removal, or punctuation cleanup.
- Paste your data into the text area.
- Set the chart size and visible table row limit.
- Click Count Unique Values to generate metrics, a distribution table, and the Plotly chart.
- Use the export buttons to save the results as CSV or PDF.
Why This Matters In AI & Machine Learning
Unique value counting helps measure vocabulary size, categorical cardinality, class balance, and data cleanliness. It can reveal exploding token spaces, inconsistent labels, sparse categories, and preprocessing issues before encoding, embedding, clustering, or model training. This makes the calculator useful for feature engineering, NLP preparation, and dataset auditing.
FAQs
1. What does a unique value counter measure?
It counts how many distinct items exist after optional preprocessing. It also reports duplicates, frequency share, entropy, and related distribution metrics for better dataset inspection.
2. Why are preprocessing options important?
They prevent misleading counts. For example, “Dog”, “dog”, and “dog!” may represent one category. Normalization reduces accidental fragmentation in labels or tokens.
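A rough sketch of how such normalization merges those variants, assuming lowercasing and punctuation stripping are applied:

```python
import string

variants = ["Dog", "dog", "dog!"]
punct = str.maketrans("", "", string.punctuation)
normalized = {v.strip().lower().translate(punct) for v in variants}
print(normalized)  # {'dog'}: three raw strings collapse to one category
```

Without normalization, the same data would report three unique values instead of one.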
3. What is Shannon entropy here?
Entropy summarizes how evenly values are distributed. A higher score means counts are spread more evenly. A lower score means a few values dominate.
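As a quick illustration (not part of the calculator itself), compare an even distribution against one dominated by a single value:

```python
import math

def entropy(counts):
    # Shannon entropy in bits over a list of raw counts
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

print(entropy([25, 25, 25, 25]))  # 2.0: perfectly even across 4 values
print(entropy([97, 1, 1, 1]))     # ≈ 0.24: one value dominates
```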
4. What does Gini impurity tell me?
Gini impurity measures category mixing. It becomes larger when values are more evenly spread and smaller when one or two values dominate the dataset.
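The same even-versus-skewed comparison, sketched for Gini impurity:

```python
def gini(counts):
    # Gini impurity over a list of raw counts
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([25, 25, 25, 25]))  # 0.75: evenly mixed
print(gini([97, 1, 1, 1]))     # ≈ 0.059: one value dominates
```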
5. Can I use this for categorical features?
Yes. It is useful for label sets, encoded classes, raw category fields, vocabulary audits, and checking high-cardinality features before modeling.
6. What is effective cardinality?
It converts entropy into an intuitive number of equally likely categories. It helps compare how concentrated two different distributions really are.
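For example, a sketch comparing two distributions with the same number of distinct values but different concentration (the counts are illustrative assumptions):

```python
import math

def effective_cardinality(counts):
    # 2^H: the number of equally likely categories with the same entropy
    n = sum(counts)
    h = -sum(c / n * math.log2(c / n) for c in counts if c)
    return 2 ** h

print(effective_cardinality([10, 10, 10, 10]))  # 4.0: fully even, behaves like 4 categories
print(effective_cardinality([70, 10, 10, 10]))  # ≈ 2.56: concentrated, behaves like fewer
```

Both inputs have 4 distinct values, but the concentrated one behaves more like a 2-to-3-category distribution.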
7. Why might my unique count seem low?
Lower counts often happen when case folding, trimming, punctuation removal, or blank skipping merges noisy variants into cleaner standardized values.
8. Can I export the results for reporting?
Yes. The page includes CSV export for the frequency table and PDF export for the summary and displayed distribution results.