| Height_cm | Weight_kg | StudyHours | Satisfaction |
|---|---|---|---|
| 170 | 65 | 12 | 7 |
| 165 | 60 | 10 | 6 |
| 180 | 78 | 15 | 8 |
| 175 | 72 | 14 | 7 |
| 160 | 55 | 9 | 5 |
| 185 | 82 | 16 | 9 |
| 172 | 68 | 13 | 7 |
| 168 | 63 | 11 | 6 |
| 178 | 75 | 15 | 8 |
| 162 | 58 | 10 | 5 |
- Centering: for each column j, x̂ij = xij − μj.
- Optional scaling: zij = (xij − μj) / sj, or robust scaling (x − median) / IQR.
- Covariance matrix: S = (1/(n−1)) · XᵀX, using the centered/scaled matrix X.
- Eigen decomposition: S vk = λk vk, with eigenvalues λk sorted in descending order.
- Projection (scores): keep the first K eigenvectors as the columns of VK, then T = X · VK.
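The steps above can be sketched in NumPy using the sample table (a minimal illustration; variable names are mine, not the tool's):

```python
import numpy as np

# Sample rows from the table above (Height_cm, Weight_kg, StudyHours, Satisfaction).
X_raw = np.array([
    [170, 65, 12, 7], [165, 60, 10, 6], [180, 78, 15, 8],
    [175, 72, 14, 7], [160, 55,  9, 5], [185, 82, 16, 9],
    [172, 68, 13, 7], [168, 63, 11, 6], [178, 75, 15, 8],
    [162, 58, 10, 5],
], dtype=float)
n = X_raw.shape[0]

# Centering and z-score scaling (ddof=1 matches the 1/(n-1) covariance).
mu = X_raw.mean(axis=0)
s = X_raw.std(axis=0, ddof=1)
X = (X_raw - mu) / s

# Covariance matrix S = X^T X / (n - 1) of the scaled data.
S = X.T @ X / (n - 1)

# Eigen decomposition; eigh is the right choice for the symmetric S.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]            # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first K eigenvectors to get the scores T = X · VK.
K = 2
T = X @ eigvecs[:, :K]
print(eigvals / eigvals.sum())               # explained variance ratios
```

With z-scoring, the trace of S equals the number of variables, so the eigenvalues sum to 4 here.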
- Paste your dataset into the input box.
- Enable “First row is header” if needed.
- Pick scaling: Z-score is usually best.
- Choose missing-value handling: impute or drop rows.
- Set target variance to get a suggested PC count.
- Click Calculate to view variance, scores, and loadings.
- Use Download buttons to export CSV and PDF.
Variance capture in practical datasets
PCA summarizes correlated variables into orthogonal components ranked by variance. In many business or lab tables, the first component explains 50–80% when measures move together, and PC2 often adds another 10–25%. This tool lists eigenvalues, explained percent, and cumulative percent for the first ten PCs, making it easy to justify how much information is retained after compression. Scores are the projected coordinates used for scatterplots, clustering, or regression. Export CSV to reuse in dashboards, and keep the PDF summary for audit-ready documentation across teams and projects.
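The explained and cumulative percentages follow directly from the eigenvalues. A small sketch (the eigenvalues here are illustrative, not from a specific dataset):

```python
import numpy as np

# Hypothetical eigenvalues of a 4-variable correlation matrix, sorted descending.
eigvals = np.array([2.8, 0.7, 0.3, 0.2])

explained_pct = 100 * eigvals / eigvals.sum()
cumulative_pct = np.cumsum(explained_pct)

for k, (e, c) in enumerate(zip(explained_pct, cumulative_pct), start=1):
    print(f"PC{k}: {e:5.1f}% explained, {c:5.1f}% cumulative")
```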
Scaling choices and their numeric impact
Scaling changes the covariance structure. Z‑score scaling standardizes each variable to unit variance, so centimeters, kilograms, and hours contribute comparably. Center‑only PCA can be dominated by high‑magnitude features: a single variable with a standard deviation ten times larger than the rest contributes roughly a hundred times more variance and can pull PC1 almost entirely toward itself. Robust scaling uses the median and IQR, reducing outlier leverage when a few extreme rows inflate the variance.
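This effect is easy to demonstrate on synthetic data (a sketch with two uncorrelated variables, one with ten times the spread of the other):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two uncorrelated variables; the second has ~10x the standard deviation.
a = rng.normal(0, 1, 500)
b = rng.normal(0, 10, 500)
X = np.column_stack([a, b])

def top_eigvec(M):
    """Eigenvector of the covariance matrix with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(np.cov(M, rowvar=False))
    return vecs[:, np.argmax(vals)]

# Center-only: PC1 aligns almost entirely with the high-variance column.
v_center = top_eigvec(X - X.mean(axis=0))

# Z-score: both variables have unit variance, so neither dominates by scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
v_z = top_eigvec(Z)

print(np.abs(v_center))   # second component near 1.0
print(np.abs(v_z))        # weight spread across both variables
```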
Interpreting loadings for feature influence
Loadings connect components back to original variables. Large absolute loadings indicate stronger influence on that component’s direction, while signs show whether variables move together or oppose. For example, if Height and Weight load positively on PC1 but StudyHours loads negatively, PC1 can be interpreted as a “body size versus effort” axis. Use consistent units and domain logic before naming a component.
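A sketch of ranking variables by loading magnitude; the PC1 values below are hypothetical placeholders chosen to match the "body size versus effort" example, not computed results:

```python
import numpy as np

# Hypothetical PC1 loadings for (Height, Weight, StudyHours, Satisfaction).
variables = ["Height_cm", "Weight_kg", "StudyHours", "Satisfaction"]
pc1 = np.array([0.52, 0.54, -0.45, 0.49])   # illustrative values only

# Rank variables by absolute loading to see which drive PC1, keeping signs.
order = np.argsort(-np.abs(pc1))
for i in order:
    sign = "+" if pc1[i] >= 0 else "-"
    print(f"{variables[i]:>12}: {sign}{abs(pc1[i]):.2f}")
```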
Choosing components with target thresholds
Component selection balances compression and interpretability. A practical rule is to keep the smallest K where cumulative explained variance exceeds 80–95%, depending on downstream risk. The calculator converts your target percent into a suggested K, but you can override it to test scenarios. If you drop rows with missing values, report how many rows were removed; if you impute column means, note the smoothing this introduces.
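The target-to-K conversion can be sketched in a few lines (my own helper, assuming eigenvalues sorted descending):

```python
import numpy as np

def suggest_k(eigvals, target_pct=90.0):
    """Smallest K whose cumulative explained variance meets the target percent."""
    ratios = np.asarray(eigvals, dtype=float)
    cumulative = 100 * np.cumsum(ratios / ratios.sum())
    # Small tolerance guards against floating-point round-off at the boundary.
    return int(np.argmax(cumulative >= target_pct - 1e-9)) + 1

eigvals = [2.8, 0.7, 0.3, 0.2]   # hypothetical, cumulative 70 / 87.5 / 95 / 100%
print(suggest_k(eigvals, 80.0))  # → 2
print(suggest_k(eigvals, 95.0))  # → 3
```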
Quality checks with reconstruction error
Projection quality is not only variance; it is also fidelity. The tool estimates reconstruction RMSE in scaled space after projecting onto K components and back. Lower RMSE indicates better preservation of structure. If RMSE stays high even with several PCs, your data may be weakly correlated, noisy, or too small. As a rule of thumb, aim for at least 5–10 rows per variable for steadier covariance estimates.
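Reconstruction RMSE in scaled space can be sketched as follows (synthetic data with three latent factors, so RMSE should drop sharply by K = 3):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data: 3 latent factors mixed into 6 variables, plus noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 6))
X_raw = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Z-score, then PCA via eigen decomposition of the covariance matrix.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
vecs = vecs[:, np.argsort(vals)[::-1]]

def rmse_for(K):
    """Project onto K components, reconstruct, and measure the residual."""
    V = vecs[:, :K]
    X_hat = X @ V @ V.T
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

for K in range(1, 7):
    print(K, round(rmse_for(K), 4))
```

Plotting RMSE against K makes the diminishing-returns point easy to spot.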
Paste numeric rows using commas, semicolons, tabs, or pipes. A header row is optional. Non-numeric tokens become missing values. Keep at least two columns and two rows after cleaning.
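A minimal sketch of this kind of tolerant parsing (my own implementation, not the tool's code):

```python
import re

def parse_rows(text, header=False):
    """Split on commas, semicolons, tabs, or pipes; non-numeric tokens become None."""
    rows = []
    for line in text.strip().splitlines():
        tokens = [t.strip() for t in re.split(r"[,;\t|]", line) if t.strip()]
        row = []
        for tok in tokens:
            try:
                row.append(float(tok))
            except ValueError:
                row.append(None)   # missing value
        if row:
            rows.append(row)
    return rows[1:] if header else rows

data = parse_rows("170, 65, 12\n165; 60; n/a\n180|78|15")
print(data)  # → [[170.0, 65.0, 12.0], [165.0, 60.0, None], [180.0, 78.0, 15.0]]
```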
Use Z-score when variables have different units or ranges. Use center-only when variables share units and similar variance. Use robust scaling when outliers or heavy tails can distort variance.
Signs indicate direction: a negative loading means that variable decreases as the component score increases, relative to others. Flipping all signs gives the same solution, so focus on relative signs and magnitudes.
Keep the smallest K that reaches your target cumulative variance, commonly 80–95%. Visualization often needs 2–3 PCs, while modeling may benefit from more. Confirm with stability and interpretability.
RMSE summarizes reconstruction error after projecting to K components and reconstructing in scaled space. Lower values mean less information loss. Compare RMSE across different K values to find diminishing returns.
Project new rows by applying the same centering and scaling used for training, then multiplying by the saved eigenvectors. This page doesn’t persist parameters, so export loadings and reuse them in your workflow.
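Since the page doesn't persist parameters, reprojection with exported values can be sketched like this (the means, scales, and loadings below are hypothetical placeholders, not real exports):

```python
import numpy as np

# Parameters saved from the original fit: column means, scales, and eigenvectors.
mu = np.array([171.5, 67.6, 12.5])       # hypothetical column means
s = np.array([8.1, 8.9, 2.5])            # hypothetical column std devs
V_K = np.array([[0.58, -0.55],           # hypothetical loadings (3 vars x 2 PCs)
                [0.58,  0.77],
                [0.57, -0.32]])

def project(new_rows):
    """Apply the saved centering/scaling, then multiply by the saved eigenvectors."""
    Z = (np.asarray(new_rows, dtype=float) - mu) / s
    return Z @ V_K

scores = project([[175, 72, 14]])
print(scores.shape)   # one row, K=2 score columns
```

Reusing the training-time mu and s is the key point: recomputing them on new rows would place the new scores in a different coordinate system.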