Example Data Table
This sample shows how a labeled numeric dataset should look before analysis.
| Segment | Feature_A | Feature_B | Feature_C |
|---|---|---|---|
| A1 | 2.1 | 1.9 | 2.3 |
| A2 | 2.4 | 2.2 | 2.0 |
| A3 | 1.8 | 2.0 | 2.2 |
| B1 | 7.4 | 7.1 | 6.9 |
| B2 | 7.8 | 7.5 | 7.2 |
| B3 | 7.1 | 7.3 | 7.0 |
| C1 | 4.5 | 8.2 | 5.9 |
| C2 | 4.9 | 8.5 | 6.2 |
| C3 | 4.3 | 7.9 | 5.7 |
Formula Used
1. Standardization
When standardization is enabled, each value is converted using:
z = (x - mean) / standard deviation
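As a rough illustration of the formula above, the sketch below standardizes each column of a small dataset. The function name `standardize` is hypothetical, and it assumes the population standard deviation; the tool may divide by n - 1 instead.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Convert every column to z-scores: z = (x - mean) / standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation (divide by n), not the sample form.
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread, so its z-scores are set to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Rows A1, B1, and C1 from the example table.
sample = [[2.1, 1.9, 2.3], [7.4, 7.1, 6.9], [4.5, 8.2, 5.9]]
z_rows = standardize(sample)
for row in z_rows:
    print([round(z, 3) for z in row])
```

After this step every column has a mean of zero and a standard deviation of one, so no single feature dominates the distance calculation.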
2. Distance Calculation
Euclidean distance:
d = √( Σ(xi - ci)² )
Manhattan distance:
d = Σ|xi - ci|
Chebyshev distance:
d = max|xi - ci|
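The three distance rules can be sketched in a few lines. The function names are illustrative and not taken from the tool; `x` is an observation and `c` is a centroid of the same length.

```python
import math

def euclidean(x, c):
    # Square root of the summed squared differences.
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def manhattan(x, c):
    # Sum of absolute differences.
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

def chebyshev(x, c):
    # Largest absolute difference across features.
    return max(abs(xi - ci) for xi, ci in zip(x, c))

a1 = [2.1, 1.9, 2.3]  # row A1 from the example table
b1 = [7.4, 7.1, 6.9]  # row B1 from the example table
print(round(euclidean(a1, b1), 2))  # about 8.73
print(round(manhattan(a1, b1), 2))  # about 15.1
print(round(chebyshev(a1, b1), 2))  # about 5.3
```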
3. Centroid Update
For Euclidean distance, each centroid component is the mean of all assigned observations:
centroid_j = Σxj / n
For Manhattan and Chebyshev modes, the tool instead uses a median-based update for stability, since a median is less sensitive to outliers than a mean.
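A minimal sketch of the two update rules follows. It assumes that "median-based" means a component-wise median, which the text does not spell out, and the name `update_centroid` is hypothetical.

```python
from statistics import mean, median

def update_centroid(members, metric="euclidean"):
    """Return the new centroid for the observations assigned to one cluster."""
    columns = list(zip(*members))
    if metric == "euclidean":
        # centroid_j = sum of x_j / n
        return [mean(col) for col in columns]
    # Assumption: Manhattan and Chebyshev modes take the median of each feature.
    return [median(col) for col in columns]

# Rows A1, A2, and A3 from the example table.
group_a = [[2.1, 1.9, 2.3], [2.4, 2.2, 2.0], [1.8, 2.0, 2.2]]
print(update_centroid(group_a, "euclidean"))
print(update_centroid(group_a, "manhattan"))
```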
4. Total Cluster Error
Euclidean mode sums the squared distance from each observation to its assigned centroid, which estimates within-cluster compactness. The other modes sum the selected distance, unsquared, from each observation to its assigned centroid.
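The error rule above might be computed as in the following sketch. The function and argument names are illustrative; `labels[i]` is assumed to hold the index of the centroid assigned to `rows[i]`.

```python
def total_cluster_error(rows, labels, centroids, metric="euclidean"):
    """Sum each observation's distance contribution to its assigned centroid."""
    total = 0.0
    for row, label in zip(rows, labels):
        diffs = [abs(x - c) for x, c in zip(row, centroids[label])]
        if metric == "euclidean":
            total += sum(d * d for d in diffs)  # squared Euclidean distance
        elif metric == "manhattan":
            total += sum(diffs)
        elif metric == "chebyshev":
            total += max(diffs)
        else:
            raise ValueError(f"unknown metric: {metric}")
    return total

rows = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]]
labels = [0, 0, 1]
centroids = [[1.0, 1.0], [10.0, 10.0]]
print(total_cluster_error(rows, labels, centroids, "euclidean"))  # 4.0
print(total_cluster_error(rows, labels, centroids, "chebyshev"))  # 2.0
```

Because Euclidean mode squares its distances, error totals are comparable only between runs that use the same distance rule.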
5. Silhouette Score
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Here, a(i) is the average distance from observation i to the other members of its own cluster, and b(i) is the smallest average distance from observation i to the members of any other cluster.
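The sketch below computes s(i) for every observation under two assumptions that the text does not confirm: distances are Euclidean, and an observation that is alone in its cluster scores 0. The function name is hypothetical.

```python
import math

def silhouette_scores(rows, labels, distance=math.dist):
    """Return s(i) = (b(i) - a(i)) / max(a(i), b(i)) for each observation."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # assumed convention for singletons or one cluster
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(distance(rows[i], rows[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(distance(rows[i], rows[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
scores = silhouette_scores(points, [0, 0, 1, 1])
print(round(sum(scores) / len(scores), 3))  # mean silhouette, about 0.9
```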
How to Use This Calculator
- Paste your dataset into the input area.
- Keep row labels in the first column if you want named observations.
- Set the number of clusters you want to test.
- Choose a distance rule that matches your analysis goal.
- Enable standardization when features have very different scales.
- Increase random starts for a more stable best solution.
- Click Run Cluster Analysis to generate assignments, centroids, and the graph.
- Download the results as CSV or PDF after the analysis appears.
FAQs
1. What kind of data should I enter?
Use rows of numeric observations. The first column may hold labels, and the first row may hold headers. Avoid text inside feature columns because clustering calculations require numeric values.
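To make the expected layout concrete, here is a hedged sketch of how such input could be parsed. It assumes comma-separated text; the delimiters the tool actually accepts are not stated, and `parse_dataset` is a hypothetical helper.

```python
def _is_number(cell):
    try:
        float(cell)
        return True
    except ValueError:
        return False

def parse_dataset(text, delimiter=","):
    """Split pasted text into an optional header, row labels, and numeric rows."""
    rows = [
        [cell.strip() for cell in line.split(delimiter)]
        for line in text.strip().splitlines()
        if line.strip()
    ]
    # Treat the first row as headers if any cell past the first is non-numeric.
    has_header = any(not _is_number(c) for c in rows[0][1:])
    header = rows.pop(0) if has_header else None
    # Treat the first column as labels if any of its cells is non-numeric.
    has_labels = any(not _is_number(r[0]) for r in rows)

    labels, data = [], []
    for n, r in enumerate(rows, start=1):
        labels.append(r[0] if has_labels else f"row{n}")
        cells = r[1:] if has_labels else r
        # float() raises ValueError if a feature column contains text.
        data.append([float(c) for c in cells])
    return header, labels, data

sample_text = """Segment,Feature_A,Feature_B,Feature_C
A1,2.1,1.9,2.3
B1,7.4,7.1,6.9"""
header, labels, data = parse_dataset(sample_text)
print(header, labels, data)
```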
2. When should I standardize my data?
Standardize when one feature has much larger values than the others. This prevents large-scale variables from dominating the distance calculations and often produces more balanced groups.
3. How do I choose the number of clusters?
Start with a reasonable guess based on domain knowledge. Compare total error, silhouette score, and visual separation across several values of k to find a practical balance.
4. What does the silhouette score mean?
The score ranges from -1 to 1, and a higher value usually indicates better separation and tighter grouping. Values near 1 are strong, values around 0 are mixed, and negative values suggest overlap or poor assignments.
5. Why does the tool use multiple random starts?
Different starting centroids can lead to different solutions. Multiple starts reduce the risk of keeping a weak local result and improve the chance of finding a better cluster arrangement.
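The idea can be sketched as a Euclidean k-means run that is repeated from several random initial centroids, keeping the run with the lowest total error. This is an illustrative outline, not the tool's own code; the function names, the `seed` parameter, and the choice of initial centroids (random rows of the data) are all assumptions.

```python
import math
import random
from statistics import mean

def kmeans_once(rows, k, rng, max_iter=100):
    """One k-means run from k randomly chosen rows; returns (error, labels, centroids)."""
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each row to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [mean(col) for col in zip(*members)]
    error = sum(math.dist(row, centroids[lab]) ** 2 for row, lab in zip(rows, labels))
    return error, labels, centroids

def best_of_starts(rows, k, starts=10, seed=0):
    """Repeat kmeans_once and keep the run with the lowest total error."""
    rng = random.Random(seed)
    return min((kmeans_once(rows, k, rng) for _ in range(starts)), key=lambda r: r[0])

# The nine rows of the example table (A1-A3, B1-B3, C1-C3).
table = [
    [2.1, 1.9, 2.3], [2.4, 2.2, 2.0], [1.8, 2.0, 2.2],
    [7.4, 7.1, 6.9], [7.8, 7.5, 7.2], [7.1, 7.3, 7.0],
    [4.5, 8.2, 5.9], [4.9, 8.5, 6.2], [4.3, 7.9, 5.7],
]
error, labels, centroids = best_of_starts(table, k=3, starts=25)
print(labels, round(error, 2))
```

A single start that happens to pick two rows from the same segment can settle on a poorer grouping. Taking the minimum over many starts makes that outcome much less likely to be the one reported.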
6. What does total cluster error show?
It measures how tightly observations sit around their assigned centroids. Lower values usually indicate more compact groups, though they should be interpreted together with silhouette score and domain logic.
7. Why is my graph based on only two features?
A two-dimensional plot is easier to read on a webpage. When your dataset contains more than two features, the chart displays the first two feature columns, while the clustering still uses all of them.
8. Can I use this for market, customer, or survey segmentation?
Yes. This tool is useful for many segmentation tasks involving numeric variables, including customer profiles, product behavior, biological measurements, quality control, and statistical grouping exercises.