| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | 2.5 | 2.4 | 1.2 | 0.7 |
| 2 | 0.5 | 0.7 | 0.3 | 0.2 |
| 3 | 2.2 | 2.9 | 1.0 | 0.6 |
| 4 | 1.9 | 2.2 | 0.9 | 0.5 |
| 5 | 3.1 | 3.0 | 1.4 | 0.8 |
Tip: Try correlation mode if features have different units.
- Preprocess: optionally center and scale each feature. Centering: x' = x − μ. Z-score scaling: x'' = (x − μ) / σ.
- Train covariance: compute C = (1/(n−1)) XᵀX on the centered training rows. In correlation mode, z-score scaling is applied first.
- Eigen decomposition: solve C v = λ v. Eigenvalues λ give the variances along components; eigenvectors v are the loadings.
- Explained variance: rᵢ = λᵢ / Σⱼ λⱼ, cumulative Rₖ = Σᵢ≤ₖ rᵢ.
- Scores and reconstruction: Z = X Vₖ and X̂ = Z Vₖᵀ; the error metric is the mean squared error over all entries (see the NumPy sketch after this list).
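The sketch below stitches these steps together in covariance mode on the example table above. It is a minimal NumPy illustration, not the calculator's internal code; the helper names pca_fit and reconstruction_mse are assumptions for this example.

```python
import numpy as np

def pca_fit(X, k):
    """Fit PCA in covariance mode on training rows X (samples x features)."""
    mu = X.mean(axis=0)                        # feature means used for centering
    Xc = X - mu                                # centered training data
    C = Xc.T @ Xc / (len(X) - 1)               # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)       # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]          # sort components by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()           # explained-variance ratios r_i
    return mu, eigvecs[:, :k], ratios          # top-k loadings V_k

def reconstruction_mse(X, mu, Vk):
    """Project rows onto V_k and measure mean squared reconstruction error."""
    Xc = X - mu
    Z = Xc @ Vk                                # scores Z = X V_k
    X_hat = Z @ Vk.T                           # reconstruction in centered space
    return float(np.mean((Xc - X_hat) ** 2))

# Example dataset from the table above (features A-D).
X = np.array([[2.5, 2.4, 1.2, 0.7],
              [0.5, 0.7, 0.3, 0.2],
              [2.2, 2.9, 1.0, 0.6],
              [1.9, 2.2, 0.9, 0.5],
              [3.1, 3.0, 1.4, 0.8]])
mu, Vk, ratios = pca_fit(X, k=1)
print(ratios.round(4), reconstruction_mse(X, mu, Vk))
```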
- Paste your dataset with samples in rows and features in columns.
- Select delimiter and whether the first row is a header.
- Choose missing-value handling, then choose centering and scaling.
- Set train split, shuffle preference, and a seed for repeatability.
- Pick auto variance target or set a manual component count.
- Click Run PCA to compute components and validation errors.
- Use the export buttons to save CSV or PDF outputs.
Preprocessing choices that shape principal components
This calculator lets you center, scale, and impute before training. Centering subtracts feature means so components describe variation rather than absolute levels. Z-score scaling standardizes units, preventing high-variance measurements from dominating. Mean and median imputation keep row counts stable, while dropping rows with missing values preserves the raw data when missingness is rare.
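As a rough illustration of these choices, the snippet below applies column-wise mean or median imputation followed by optional centering and z-score scaling. The preprocess helper is an assumed name for this sketch, not the tool's implementation.

```python
import numpy as np

def preprocess(X, impute="mean", center=True, scale=False):
    """Column-wise imputation, then optional centering and z-score scaling."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if missing.any():
            fill = np.nanmean(X[:, j]) if impute == "mean" else np.nanmedian(X[:, j])
            X[missing, j] = fill               # replace NaN cells column by column
    if center:
        X = X - X.mean(axis=0)                 # x' = x - mu
    if scale:
        std = X.std(axis=0, ddof=1)
        std[std == 0] = 1.0                    # guard against constant columns
        X = X / std                            # x'' = (x - mu) / sigma
    return X

# One missing cell (NaN) handled by mean imputation before centering and scaling.
X_raw = np.array([[2.5, 2.4], [0.5, np.nan], [2.2, 2.9]])
print(preprocess(X_raw, impute="mean", center=True, scale=True))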
Covariance versus correlation training matrices
Covariance mode preserves original scale after your selected preprocessing, which is useful when units are already comparable. Correlation mode forces z‑score scaling and then trains on a scale‑invariant matrix, highlighting relationships rather than magnitudes. Use correlation when columns mix currencies, counts, and percentages, or when sensors report in different units.
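A quick way to see the difference is to compare the two training matrices directly. The snippet below uses NumPy's cov and corrcoef on the example data and is only a sketch of the idea, not the calculator's code path.

```python
import numpy as np

X = np.array([[2.5, 2.4, 1.2, 0.7],
              [0.5, 0.7, 0.3, 0.2],
              [2.2, 2.9, 1.0, 0.6],
              [1.9, 2.2, 0.9, 0.5],
              [3.1, 3.0, 1.4, 0.8]])

C_cov = np.cov(X, rowvar=False)       # covariance mode: keeps original units
C_corr = np.corrcoef(X, rowvar=False) # correlation mode: z-scored, scale-invariant

print(np.diag(C_cov).round(3))        # per-feature variances differ
print(np.diag(C_corr).round(3))       # all ones after standardization
```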
Component selection driven by explained variance
Eigenvalues quantify variance captured by each component. Explained variance is computed as λᵢ divided by the sum of all eigenvalues, and cumulative variance adds these ratios in order. Automatic selection picks the smallest k meeting your retention target, such as 0.95, balancing compression with information preservation. Manual k is ideal for fixed downstream models or dashboard constraints.
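A minimal sketch of that selection rule, assuming the eigenvalues are already computed; choose_k is an illustrative helper name, not the calculator's function.

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Smallest k whose cumulative explained variance reaches the target."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, len(eigvals))        # clamp in case of floating-point round-off

print(choose_k([4.0, 0.8, 0.15, 0.05], target=0.95))  # -> 2 (0.96 cumulative)
```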
Training split and reconstruction quality checks
The tool fits components on the training split and evaluates reconstruction on both training and test rows. Scores are Z = X Vₖ, and the reconstructed matrix is X̂ = Z Vₖᵀ in the transformed space. Mean squared error summarizes the average squared difference per cell; lower values indicate that k components capture the structure consistently. A higher test MSE than train MSE suggests overfitting or unstable preprocessing.
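This check can be reproduced in a few lines of NumPy: fit components on a seeded training split, then compare train and test reconstruction MSE. The split size, seed, and k below are arbitrary choices for the example data, not the tool's defaults.

```python
import numpy as np

X = np.array([[2.5, 2.4, 1.2, 0.7],
              [0.5, 0.7, 0.3, 0.2],
              [2.2, 2.9, 1.0, 0.6],
              [1.9, 2.2, 0.9, 0.5],
              [3.1, 3.0, 1.4, 0.8]])

rng = np.random.default_rng(42)            # fixed seed -> repeatable shuffle
idx = rng.permutation(len(X))
train, test = idx[:4], idx[4:]             # roughly 80/20 split of the example rows

mu = X[train].mean(axis=0)                 # statistics come from training rows only
Xc = X[train] - mu
C = Xc.T @ Xc / (len(train) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
Vk = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # keep k = 2 components

def split_mse(rows):
    Zc = (X[rows] - mu) @ Vk               # scores with training mean and loadings
    return float(np.mean(((X[rows] - mu) - Zc @ Vk.T) ** 2))

print(split_mse(train), split_mse(test))   # compare train vs test reconstruction MSE
```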
Loadings, repeatability, and exportable outputs
Loadings are the eigenvector weights that connect original features to each component. Large absolute loadings identify influential variables, while signs can flip without changing meaning, so compare magnitudes. Row shuffling plus a fixed seed makes splits reproducible for audits and team reviews. CSV export captures variance tables, loadings, and score previews, and the PDF report provides a concise training summary for stakeholders. Because the eigensolver is optimized for smaller feature counts, the calculator limits processed columns by default; raise the cap only when necessary and expect longer runtime as the number of features grows. For stability, prefer at least three rows per feature and remove constant columns before running training.
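When reading loadings by magnitude, it can help to rank features per component. The loadings matrix below is hypothetical and only demonstrates the bookkeeping, not output from the calculator.

```python
import numpy as np

features = ["A", "B", "C", "D"]
# Hypothetical loadings matrix (features x components), purely for illustration.
Vk = np.array([[0.52, -0.30],
               [0.58,  0.75],
               [0.47, -0.45],
               [0.41, -0.38]])

for c in range(Vk.shape[1]):
    order = np.argsort(-np.abs(Vk[:, c]))          # rank features by |loading|
    top = ", ".join(f"{features[i]} ({Vk[i, c]:+.2f})" for i in order[:2])
    print(f"PC{c + 1} driven mostly by: {top}")
```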
FAQs
What is the difference between covariance and correlation mode?
Covariance reflects variance in the current scale after preprocessing. Correlation standardizes features and emphasizes relationships. Choose correlation when units differ widely; choose covariance when units are comparable and you want magnitude to matter.
How does the variance target choose the number of components?
The tool sorts components by eigenvalue, then accumulates explained variance until it meets your target, such as 0.95. It selects the smallest k that reaches the threshold, keeping the model compact while retaining information.
When should I enable centering and z-score scaling?
Centering is recommended for most datasets because PCA assumes zero-mean features. Enable z-score scaling when variables use different units or ranges, so no single feature dominates the first components purely by magnitude.
How are missing values treated during training?
Blank cells and NA/NaN entries are treated as missing. You can impute by column mean or median, or drop any row containing missing values. Imputation keeps more data, while dropping avoids assumptions when missingness is rare.
What does reconstruction MSE tell me?
Reconstruction MSE measures the average squared difference between the preprocessed data and its reconstruction from k components. Low train and test MSE indicate stable components that generalize. A large gap suggests overfitting or unstable preprocessing.
Why can component loadings have flipped signs?
Eigenvectors are defined up to a sign, so multiplying a component by −1 yields an equivalent solution. Interpret loadings by magnitude and relative pattern across features. Use the same seed and settings for consistent comparisons.
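One common convention, sketched below, is to flip each component so its dominant loading is positive before comparing runs. This is a presentation choice made for this example, not necessarily something the calculator applies.

```python
import numpy as np

def align_signs(V):
    """Flip each component so its largest-magnitude loading is positive."""
    dominant = np.argmax(np.abs(V), axis=0)          # dominant feature per component
    signs = np.sign(V[dominant, np.arange(V.shape[1])])
    signs[signs == 0] = 1.0                          # leave all-zero columns untouched
    return V * signs

V = np.array([[-0.7, 0.2], [-0.7, 0.9]])
print(align_signs(V))   # first column flipped; the spanned subspace is unchanged
```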