Sample CSV format
| id | x1 | x2 | cluster |
|---|---|---|---|
| 1 | 1.0 | 1.2 | A |
| 4 | 4.0 | 4.1 | B |
| 7 | 8.0 | 7.9 | C |
Include numeric feature columns plus one cluster label column.
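The same rows as plain CSV text (comma-delimited, with a header row):

```csv
id,x1,x2,cluster
1,1.0,1.2,A
4,4.0,4.1,B
7,8.0,7.9,C
```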
Cluster validity indices
- Silhouette (mean): the average of the per-point scores s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in the same cluster and b(i) is the smallest mean distance to another cluster. Higher is better.
- Davies–Bouldin: DB = (1/k) Σ_i max_{j≠i} (S_i + S_j) / M_{ij}, where S_i is the mean distance of cluster i's points to its centroid and M_{ij} is the distance between centroids i and j. Lower is better.
- Calinski–Harabasz: CH = (B/(k−1)) / (W/(n−k)), with between-cluster dispersion B and within-cluster dispersion W computed from squared distances. Higher is better.
- Dunn: D = min intercluster distance / max intracluster diameter. Higher values indicate well-separated compact clusters.
- WCSS (within-cluster sum of squares): Σ_i Σ_{x∈C_i} ||x − μ_i||² measures compactness; lower values mean tighter clusters, so read it alongside a separation metric.
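As a rough sketch of how these indices can be computed (not this tool's own code), scikit-learn covers silhouette, Davies–Bouldin, and Calinski–Harabasz directly, while Dunn and WCSS are short enough to write by hand. The file name and column names below follow the sample above and are otherwise assumptions.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

df = pd.read_csv("clusters.csv")                 # hypothetical file in the sample layout
X = df[["x1", "x2"]].to_numpy()                  # numeric feature columns
labels = df["cluster"].to_numpy()                # cluster label column

print("Silhouette (mean):", silhouette_score(X, labels))
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))

# Dunn: smallest between-cluster point distance / largest within-cluster diameter.
groups = [X[labels == c] for c in np.unique(labels)]
min_between = min(cdist(a, b).min()
                  for i, a in enumerate(groups) for b in groups[i + 1:])
max_diameter = max(cdist(g, g).max() for g in groups)
print("Dunn:", min_between / max_diameter)

# WCSS: squared distances of points to their own cluster centroid, summed.
print("WCSS:", sum(((g - g.mean(axis=0)) ** 2).sum() for g in groups))
```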
Steps
- Prepare a CSV containing numeric features and a cluster label column.
- Select the delimiter and indicate whether the first row is a header.
- Set the cluster label column name (or 1-based index).
- Optionally standardize features for fair distance comparisons.
- Choose which indices to compute, then press Compute indices.
- Review the results shown above the form, then export CSV or PDF.
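Behind the first four steps, the preparation amounts to something like the following sketch (the file name, delimiter, and label column here are placeholders, not the tool's defaults):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv", sep=";", header=0)  # chosen delimiter; first row is the header
label_col = "cluster"                            # or by 1-based index: df.columns[idx - 1]
labels = df[label_col]
# Keep only numeric feature columns; drop identifier columns (e.g. "id") as well
# if they should not count as features.
features = df.drop(columns=[label_col]).select_dtypes("number")

# Optional standardization so each feature contributes comparably to distances.
X = StandardScaler().fit_transform(features)
```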
Frequently asked questions
1) Which index should I trust most?
Use several together. Silhouette rewards separation and cohesion, Davies–Bouldin penalizes overlap, and Calinski–Harabasz highlights strong between-cluster spread. Agreement across indices is the safest signal.
2) Why does standardization change the score?
Distance-based indices are sensitive to feature scale. Z-scoring prevents one large-scale feature from dominating distances, often improving comparability across variables and clusters.
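To see the effect on your own data, a quick before/after comparison (assuming X and labels were loaded as in the earlier sketch) looks like this:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

raw = silhouette_score(X, labels)                                     # original scales
scaled = silhouette_score(StandardScaler().fit_transform(X), labels)  # z-scored features
print(f"silhouette raw={raw:.3f}  standardized={scaled:.3f}")
```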
3) Can I use non-numeric columns?
Non-numeric columns are ignored for features. Keep one label column for cluster assignment, and ensure the remaining feature columns are numeric for correct calculations.
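A small check along those lines, assuming the DataFrame and label column from the earlier sketches:

```python
numeric = df.drop(columns=["cluster"]).select_dtypes("number")
ignored = [c for c in df.columns if c not in numeric.columns and c != "cluster"]
if ignored:
    print("Ignored non-numeric columns:", ignored)   # e.g. free-text or ID strings
X = numeric.to_numpy()
```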
4) What does a negative silhouette mean?
It suggests many points are closer, on average, to another cluster than their own. This can indicate poor clustering, wrong distance metric, or features needing scaling or transformation.
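Per-point silhouette values make this easy to inspect (again assuming X and labels from the earlier sketch):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

s = silhouette_samples(X, labels)          # one silhouette value per point
negative = np.flatnonzero(s < 0)           # points closer on average to another cluster
print(f"{negative.size} of {s.size} points have a negative silhouette")
```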
5) Why is Davies–Bouldin lower-is-better?
It compares within-cluster scatter against separation between centroids. Lower values mean tighter clusters and larger separation relative to scatter, indicating clearer cluster structure.
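Written out from the formula above (a didactic sketch, not an optimized implementation), the ratio makes that trade-off explicit:

```python
import numpy as np

clusters = np.unique(labels)
centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
# S_i: mean distance of cluster i's points to its centroid (within-cluster scatter).
S = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
              for i, c in enumerate(clusters)])
worst = [max((S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(len(clusters)) if j != i)
         for i in range(len(clusters))]
print("Davies-Bouldin:", np.mean(worst))   # small scatter and large separation drive this down
```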
6) Will this handle large datasets?
Yes, but some metrics can be heavy because they compare many pairs. For large inputs, the tool may use sampling for speed and will note it in the results area.
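If you reproduce the numbers offline, scikit-learn's silhouette_score can score a random subsample instead of every pair, which is one common way to keep the cost manageable (how this tool samples is not specified here):

```python
from sklearn.metrics import silhouette_score

# sample_size limits the silhouette computation to a random subset of points.
approx = silhouette_score(X, labels, sample_size=10_000, random_state=0)
print("approximate silhouette:", approx)
```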
7) My CH score is huge. Is that normal?
It can be large when clusters are very separated or when within-cluster dispersion is small. Compare CH across different k values on the same dataset rather than across unrelated datasets.
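A typical way to use it, sketched with k-means as a stand-in for whatever clustering produced your labels:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

for k in range(2, 9):                       # candidate numbers of clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  CH={calinski_harabasz_score(X, labels_k):.1f}")
```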