Calculator Input
Use direct values or extract a single CSV column for counting.
Example Data Table
This example mirrors a small label distribution audit often used before feature encoding or vocabulary filtering.
| Row | Raw Value | Normalized Value | Counted? |
|---|---|---|---|
| 1 | Cat | cat | Yes |
| 2 | cat | cat | Yes |
| 3 | DOG | dog | Yes |
| 4 | dog! | dog | Yes |
| 5 | Bird | bird | Yes |
| 6 | (blank) | (blank) | No, if blanks ignored |
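The table's counting logic can be sketched in a few lines of Python. This is an illustration, not the calculator's actual implementation; it assumes trimming, lowercasing, punctuation removal, and blank skipping are all enabled:

```python
from collections import Counter
import string

raw = ["Cat", "cat", "DOG", "dog!", "Bird", ""]

# Normalize: trim whitespace, lowercase, strip punctuation; skip blanks
punct = str.maketrans("", "", string.punctuation)
normalized = []
for v in raw:
    v = v.strip().lower().translate(punct)
    if v:
        normalized.append(v)

counts = Counter(normalized)
print(dict(counts))  # {'cat': 2, 'dog': 2, 'bird': 1}
print(len(counts))   # 3 unique values out of 5 processed
```

With the blank row skipped, N = 5 processed values collapse to U = 3 unique categories, matching the table.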
Formula Used
Unique Count: U = number of distinct normalized values
Duplicate Entries: D = N - U, where N is the total number of processed values.
Uniqueness Rate: (U / N) × 100
Duplicate Rate: (D / N) × 100
Share of a Value: (count of value / N) × 100
Shannon Entropy: H = -Σ p(i) log₂ p(i), where p(i) is the share of value i (its count divided by N)
Gini Impurity: G = 1 - Σ p(i)²
Effective Cardinality: 2^H. This estimates how many equally likely categories would produce the same entropy.
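All of the formulas above can be computed from a single frequency table. A minimal sketch in Python, using the normalized values from the example data:

```python
import math
from collections import Counter

values = ["cat", "cat", "dog", "dog", "bird"]  # normalized, blanks removed
counts = Counter(values)

N = len(values)                  # total processed values: 5
U = len(counts)                  # unique count: 3
D = N - U                        # duplicate entries: 2
uniqueness_rate = U / N * 100    # 60.0
duplicate_rate = D / N * 100     # 40.0

shares = [c / N for c in counts.values()]        # p(i) for each value
H = -sum(p * math.log2(p) for p in shares)       # Shannon entropy ≈ 1.52
G = 1 - sum(p * p for p in shares)               # Gini impurity = 0.64
effective_cardinality = 2 ** H                   # ≈ 2.87 equivalent categories
```

The effective cardinality (≈ 2.87) sits below the unique count of 3 because the distribution is not perfectly even.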
How to Use This Calculator
- Select Direct values for pasted lists or CSV column for structured data.
- Choose the delimiter or CSV settings that match your dataset.
- Enable preprocessing options such as trimming, lowercasing, blank removal, or punctuation cleanup.
- Paste your data into the text area.
- Set the chart size and visible table row limit.
- Click Count Unique Values to generate metrics, a distribution table, and the Plotly chart.
- Use the export buttons to save the results as CSV or PDF.
Why This Matters In AI & Machine Learning
Unique value counting helps measure vocabulary size, categorical cardinality, class balance, and data cleanliness. It can reveal exploding token spaces, inconsistent labels, sparse categories, and preprocessing issues before encoding, embedding, clustering, or model training. This makes the calculator useful for feature engineering, NLP preparation, and dataset auditing.
FAQs
1. What does a unique value counter measure?
It counts how many distinct items exist after optional preprocessing. It also reports duplicates, frequency share, entropy, and related distribution metrics for better dataset inspection.
2. Why are preprocessing options important?
They prevent misleading counts. For example, “Dog”, “dog”, and “dog!” may represent one category. Normalization reduces accidental fragmentation in labels or tokens.
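A rough sketch of how such normalization merges those variants, assuming lowercasing and punctuation stripping are applied:

```python
import string

variants = ["Dog", "dog", "dog!"]
punct = str.maketrans("", "", string.punctuation)
normalized = {v.strip().lower().translate(punct) for v in variants}
print(normalized)  # {'dog'}: three raw strings collapse to one category
```

Without normalization, the same data would report three unique values instead of one.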
3. What is Shannon entropy here?
Entropy summarizes how evenly values are distributed. A higher score means counts are spread more evenly. A lower score means a few values dominate.
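As a quick illustration (not part of the calculator itself), compare an even distribution against one dominated by a single value:

```python
import math

def entropy(counts):
    # Shannon entropy in bits over a list of raw counts
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

print(entropy([25, 25, 25, 25]))  # 2.0: perfectly even across 4 values
print(entropy([97, 1, 1, 1]))     # ≈ 0.24: one value dominates
```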
4. What does Gini impurity tell me?
Gini impurity measures category mixing. It becomes larger when values are more evenly spread and smaller when one or two values dominate the dataset.
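The same even-versus-skewed comparison, sketched for Gini impurity:

```python
def gini(counts):
    # Gini impurity over a list of raw counts
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([25, 25, 25, 25]))  # 0.75: evenly mixed
print(gini([97, 1, 1, 1]))     # ≈ 0.059: one value dominates
```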
5. Can I use this for categorical features?
Yes. It is useful for label sets, encoded classes, raw category fields, vocabulary audits, and checking high-cardinality features before modeling.
6. What is effective cardinality?
It converts entropy into an intuitive number of equally likely categories. It helps compare how concentrated two different distributions really are.
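For example, a sketch comparing two distributions with the same number of distinct values but different concentration (the counts are illustrative assumptions):

```python
import math

def effective_cardinality(counts):
    # 2^H: the number of equally likely categories with the same entropy
    n = sum(counts)
    h = -sum(c / n * math.log2(c / n) for c in counts if c)
    return 2 ** h

print(effective_cardinality([10, 10, 10, 10]))  # 4.0: fully even, behaves like 4 categories
print(effective_cardinality([70, 10, 10, 10]))  # ≈ 2.56: concentrated, behaves like fewer
```

Both inputs have 4 distinct values, but the concentrated one behaves more like a 2-to-3-category distribution.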
7. Why might my unique count seem low?
Lower counts often happen when case folding, trimming, punctuation removal, or blank skipping merges noisy variants into cleaner standardized values.
8. Can I export the results for reporting?
Yes. The page includes CSV export for the frequency table and PDF export for the summary and displayed distribution results.