Sample CSV format
| id | x1 | x2 | cluster |
|---|---|---|---|
| 1 | 1.0 | 1.2 | A |
| 4 | 4.0 | 4.1 | B |
| 7 | 8.0 | 7.9 | C |
Include numeric feature columns plus one cluster label column.
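The same rows as plain CSV text (comma-delimited, with a header row):

```csv
id,x1,x2,cluster
1,1.0,1.2,A
4,4.0,4.1,B
7,8.0,7.9,C
```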
Cluster validity indices
- Silhouette (mean): the average of the per-point scores s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in the same cluster and b(i) is the smallest mean distance to another cluster. Higher is better.
- Davies–Bouldin: DB = (1/k) Σ_i max_{j≠i} (S_i + S_j) / M_{ij}, where S_i is the mean distance of cluster i's points to its centroid and M_{ij} is the distance between centroids i and j. Lower is better.
- Calinski–Harabasz: CH = (B/(k−1)) / (W/(n−k)), with between-cluster dispersion B and within-cluster dispersion W computed from squared distances. Higher is better.
- Dunn: D = min intercluster distance / max intracluster diameter. Higher values indicate well-separated compact clusters.
- WCSS (within-cluster sum of squares): Σ_i Σ_{x∈C_i} ||x − μ_i||² measures compactness; lower values mean tighter clusters, so read it alongside a separation metric.
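As a rough sketch of how these indices can be computed (not this tool's own code), scikit-learn covers silhouette, Davies–Bouldin, and Calinski–Harabasz directly, while Dunn and WCSS are short enough to write by hand. The file name and column names below follow the sample above and are otherwise assumptions.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

df = pd.read_csv("clusters.csv")                 # hypothetical file in the sample layout
X = df[["x1", "x2"]].to_numpy()                  # numeric feature columns
labels = df["cluster"].to_numpy()                # cluster label column

print("Silhouette (mean):", silhouette_score(X, labels))
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))

# Dunn: smallest between-cluster point distance / largest within-cluster diameter.
groups = [X[labels == c] for c in np.unique(labels)]
min_between = min(cdist(a, b).min()
                  for i, a in enumerate(groups) for b in groups[i + 1:])
max_diameter = max(cdist(g, g).max() for g in groups)
print("Dunn:", min_between / max_diameter)

# WCSS: squared distances of points to their own cluster centroid, summed.
print("WCSS:", sum(((g - g.mean(axis=0)) ** 2).sum() for g in groups))
```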
Steps
- Prepare a CSV containing numeric features and a cluster label column.
- Select the delimiter and indicate whether the first row is a header.
- Set the cluster label column name (or 1-based index).
- Optionally standardize features for fair distance comparisons.
- Choose which indices to compute, then press Compute indices.
- Review the results shown above the form, then export CSV or PDF.
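Behind the first four steps, the preparation amounts to something like the following sketch (the file name, delimiter, and label column here are placeholders, not the tool's defaults):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv", sep=";", header=0)  # chosen delimiter; first row is the header
label_col = "cluster"                            # or by 1-based index: df.columns[idx - 1]
labels = df[label_col]
# Keep only numeric feature columns; drop identifier columns (e.g. "id") as well
# if they should not count as features.
features = df.drop(columns=[label_col]).select_dtypes("number")

# Optional standardization so each feature contributes comparably to distances.
X = StandardScaler().fit_transform(features)
```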
Frequently asked questions
1) Which index should I trust most?
Use several together. Silhouette rewards separation and cohesion, Davies–Bouldin penalizes overlap, and Calinski–Harabasz highlights strong between-cluster spread. Agreement across indices is the safest signal.
2) Why does standardization change the score?
Distance-based indices are sensitive to feature scale. Z-scoring prevents one large-scale feature from dominating distances, often improving comparability across variables and clusters.
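To see the effect on your own data, a quick before/after comparison (assuming X and labels were loaded as in the earlier sketch) looks like this:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

raw = silhouette_score(X, labels)                                     # original scales
scaled = silhouette_score(StandardScaler().fit_transform(X), labels)  # z-scored features
print(f"silhouette raw={raw:.3f}  standardized={scaled:.3f}")
```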
3) Can I use non-numeric columns?
Non-numeric columns are ignored for features. Keep one label column for cluster assignment, and ensure the remaining feature columns are numeric for correct calculations.
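A small check along those lines, assuming the DataFrame and label column from the earlier sketches:

```python
numeric = df.drop(columns=["cluster"]).select_dtypes("number")
ignored = [c for c in df.columns if c not in numeric.columns and c != "cluster"]
if ignored:
    print("Ignored non-numeric columns:", ignored)   # e.g. free-text or ID strings
X = numeric.to_numpy()
```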
4) What does a negative silhouette mean?
It suggests many points are closer, on average, to another cluster than their own. This can indicate poor clustering, wrong distance metric, or features needing scaling or transformation.
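Per-point silhouette values make this easy to inspect (again assuming X and labels from the earlier sketch):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

s = silhouette_samples(X, labels)          # one silhouette value per point
negative = np.flatnonzero(s < 0)           # points closer on average to another cluster
print(f"{negative.size} of {s.size} points have a negative silhouette")
```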
5) Why is Davies–Bouldin lower-is-better?
It compares within-cluster scatter against separation between centroids. Lower values mean tighter clusters and larger separation relative to scatter, indicating clearer cluster structure.
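Written out from the formula above (a didactic sketch, not an optimized implementation), the ratio makes that trade-off explicit:

```python
import numpy as np

clusters = np.unique(labels)
centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
# S_i: mean distance of cluster i's points to its centroid (within-cluster scatter).
S = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
              for i, c in enumerate(clusters)])
worst = [max((S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(len(clusters)) if j != i)
         for i in range(len(clusters))]
print("Davies-Bouldin:", np.mean(worst))   # small scatter and large separation drive this down
```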
6) Will this handle large datasets?
Yes, but some metrics can be heavy because they compare many pairs. For large inputs, the tool may use sampling for speed and will note it in the results area.
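If you reproduce the numbers offline, scikit-learn's silhouette_score can score a random subsample instead of every pair, which is one common way to keep the cost manageable (how this tool samples is not specified here):

```python
from sklearn.metrics import silhouette_score

# sample_size limits the silhouette computation to a random subset of points.
approx = silhouette_score(X, labels, sample_size=10_000, random_state=0)
print("approximate silhouette:", approx)
```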
7) My CH score is huge. Is that normal?
It can be large when clusters are very separated or when within-cluster dispersion is small. Compare CH across different k values on the same dataset rather than across unrelated datasets.
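A typical way to use it, sketched with k-means as a stand-in for whatever clustering produced your labels:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

for k in range(2, 9):                       # candidate numbers of clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  CH={calinski_harabasz_score(X, labels_k):.1f}")
```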