Example Data Table
| Point | X | Y | Expected Pattern |
| --- | --- | --- | --- |
| A | 2 | 3 | Lower left group |
| B | 2.5 | 4 | Lower left group |
| C | 1.8 | 2.7 | Lower left group |
| D | 8 | 7 | Middle group |
| E | 8.5 | 8 | Middle group |
| F | 9 | 7.5 | Middle group |
| G | 14 | 2 | Right group |
| H | 14.5 | 2.2 | Right group |
Formulas Used
The direct clustering method starts with every point as its own cluster, then repeatedly merges the two nearest clusters until a stop rule is reached.
Euclidean Distance
d(a,b) = sqrt(Σ wi(ai - bi)²)
Manhattan Distance
d(a,b) = Σ wi |ai - bi|
Chebyshev Distance
d(a,b) = max(wi |ai - bi|)
Cosine Distance
d(a,b) = 1 - (a · b) / (||a|| ||b||)
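The four distance formulas above can be written directly in Python. This is a minimal sketch; the function names and the optional weight vector `w` are illustrative, not the calculator's actual code.

```python
import math

def euclidean(a, b, w=None):
    # sqrt(sum of wi * (ai - bi)^2); weights default to 1.
    w = w or [1.0] * len(a)
    return math.sqrt(sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b)))

def manhattan(a, b, w=None):
    # Sum of weighted absolute coordinate differences.
    w = w or [1.0] * len(a)
    return sum(wi * abs(ai - bi) for wi, ai, bi in zip(w, a, b))

def chebyshev(a, b, w=None):
    # Largest weighted coordinate gap.
    w = w or [1.0] * len(a)
    return max(wi * abs(ai - bi) for wi, ai, bi in zip(w, a, b))

def cosine(a, b):
    # 1 minus the cosine of the angle between the two vectors.
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return 1 - dot / (na * nb)

print(euclidean((0, 0), (3, 4)))  # 5.0
```

Note how the same pair of points gives different distances under each metric: (0,0) to (3,4) is 5 under Euclidean, 7 under Manhattan, and 4 under Chebyshev.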
Centroid
Centroid = sum of point coordinates divided by cluster size, computed per dimension.
Within Cluster Sum of Squares
WSS = Σ distance(point, cluster centroid)²
Silhouette Style Score
s = (b - a) / max(a, b), where a is the average distance to the other points in the same cluster and b is the lowest average distance to any other cluster.
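The centroid, WSS, and silhouette formulas can be checked with a short script. This is a sketch under the unweighted Euclidean metric; the helper names are hypothetical.

```python
import math

def dist(a, b):
    # Plain Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    # Per-dimension mean of the cluster's coordinates.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def wss(points):
    # Sum of squared distances from each point to the cluster centroid.
    c = centroid(points)
    return sum(dist(p, c) ** 2 for p in points)

def silhouette(point, own_cluster, other_clusters):
    # a: average distance to the other points in the same cluster.
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    # b: lowest average distance to any other cluster.
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)
```

For a point sitting tight inside its own cluster and far from the others, a is small, b is large, and the score approaches 1.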
How to Use This Calculator
- Enter one point per line.
- Use comma separated coordinates.
- Add labels before coordinates if needed.
- Select a distance metric.
- Choose a linkage rule.
- Select threshold mode or target cluster mode.
- Use normalization when variables have different scales.
- Press the calculate button.
- Review cluster assignments, summaries, and merge history.
- Download results as CSV or PDF to keep a record.
Understanding Direct Clustering
Direct clustering is a practical way to group points without guessing labels first. It starts with each point as its own cluster, then joins the closest groups according to a selected distance rule. The process continues until a threshold is reached or a target cluster count is met. This calculator follows that approach and lets you test how the same data behaves under different metrics, weights, and linkage choices.
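The merge loop described above can be sketched in a few lines of Python. This version fixes single linkage and Euclidean distance for brevity; the function names are illustrative, not the calculator's actual code.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_pair_distance(c1, c2):
    # Single linkage: cluster distance = closest pair of points.
    return min(euclidean(p, q) for p in c1 for q in c2)

def direct_cluster(points, threshold):
    # Start with each point as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: nearest_pair_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        if nearest_pair_distance(clusters[i], clusters[j]) > threshold:
            break  # stop rule: the nearest clusters are too far apart
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

# Points A-H from the example data table above.
pts = [(2, 3), (2.5, 4), (1.8, 2.7), (8, 7), (8.5, 8),
       (9, 7.5), (14, 2), (14.5, 2.2)]
groups = direct_cluster(pts, 3.0)
print(len(groups))  # 3 clusters, matching the expected pattern
```

With a threshold of 3, the three expected groups emerge because every within-group distance is under about 1.5 while the gaps between groups exceed 7.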
Why Distance Choices Matter
Distance controls the meaning of similarity. Euclidean distance works well for ordinary geometric data. Manhattan distance suits grid movement or step based measurements. Chebyshev distance focuses on the largest coordinate gap. Cosine distance compares direction, so it is useful when scale is less important than pattern. A weight field lets one variable count more than another. Normalization can reduce unfair influence from large scale columns.
Reading the Results
The output gives a cluster label for every point. It also shows centroids, within cluster spread, diameter, and a silhouette style score. A low spread means points in the same group sit close together. A smaller diameter means the widest internal gap is controlled. The silhouette score compares internal closeness with separation from other clusters. Positive values usually show better grouping. Negative values suggest that a point may fit another cluster more naturally.
Good Analysis Habits
Start with a moderate threshold, then adjust it gradually. Watch the merge history: it reveals which groups joined first. If many unrelated points merge early, try normalization or another distance rule. If clusters stay too small, raise the threshold; if clusters become too broad, lower it. For a fixed number of groups, use the target count mode. It is easier when a report requires a set number of segments.
Use Cases
Direct clustering helps with classroom data, market segments, coordinate grouping, experiment readings, and feature based comparisons. It is also helpful for checking manual labels. The export buttons make results easy to save: use the CSV file for spreadsheets and the PDF file for a quick report. Remember that clustering suggests structure; it does not prove causation. Review context before making decisions. Document assumptions, compare several settings, and keep original measurements beside normalized values for later auditing.
FAQs
What is a direct clustering algorithm?
It is a grouping method that starts from raw points and directly forms clusters using distances. This version begins with single-point clusters, then merges the closest groups until the selected stop rule is reached.
Can I enter more than two dimensions?
Yes. Each row can contain two or more coordinate values. All valid rows must use the same number of dimensions for the calculation to work correctly.
Which distance metric should I use?
Use Euclidean for geometric distance. Use Manhattan for grid style movement. Use Chebyshev when the largest coordinate gap matters. Use cosine when direction matters more than scale.
What does linkage mean?
Linkage decides how distance between two clusters is measured. Single uses the nearest pair. Complete uses the farthest pair. Average uses all pair distances. Centroid uses cluster centers.
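The four linkage rules can be expressed as small functions over two clusters. A sketch under the unweighted Euclidean metric; the names are hypothetical.

```python
import math

def d(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single(c1, c2):
    # Nearest pair of points across the two clusters.
    return min(d(p, q) for p in c1 for q in c2)

def complete(c1, c2):
    # Farthest pair of points across the two clusters.
    return max(d(p, q) for p in c1 for q in c2)

def average(c1, c2):
    # Mean distance over all cross-cluster pairs.
    return sum(d(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    # Distance between the two cluster centers.
    m1 = tuple(sum(col) / len(c1) for col in zip(*c1))
    m2 = tuple(sum(col) / len(c2) for col in zip(*c2))
    return d(m1, m2)
```

Single linkage tends to chain clusters together; complete linkage keeps clusters compact; average and centroid sit between the two.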
What does the threshold do?
The threshold stops merging when the closest cluster distance is greater than your limit. A higher threshold usually creates fewer, larger clusters.
What is target cluster count mode?
This mode keeps merging the nearest clusters until the chosen number of clusters remains. It is useful when a fixed number of groups is required.
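Target count mode swaps the threshold test for a count test: keep merging while more than k clusters remain. A minimal single-linkage sketch with illustrative names.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_to_k(points, k):
    # Merge the nearest clusters (single linkage) until k clusters remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(p, q)
                               for p in clusters[ij[0]] for q in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(2, 3), (2.5, 4), (1.8, 2.7), (8, 7), (8.5, 8),
       (9, 7.5), (14, 2), (14.5, 2.2)]
print(len(cluster_to_k(pts, 3)))  # exactly 3 clusters, regardless of spread
```

Unlike threshold mode, this always returns exactly k groups, even when the data would naturally split into more or fewer.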
Why use normalization?
Normalization helps when variables use different scales. For example, income may dominate age. Z score and min max options reduce that imbalance.
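Both normalization options are one-liners per column. A sketch with hypothetical names, using the population standard deviation for the z score.

```python
def z_score(column):
    # (x - mean) / standard deviation, per column.
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]

def min_max(column):
    # Rescale the column to the [0, 1] range.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

incomes = [30000, 45000, 90000]
print(min_max(incomes))  # [0.0, 0.25, 1.0]
```

After either transform, an income column and an age column contribute on comparable scales, so neither dominates the distance calculation.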
What does silhouette score show?
It compares how close a point is to its own cluster against other clusters. Higher positive values usually mean cleaner grouping. Negative values may show weak assignment.