| ID | F1 | F2 | F3 | F4 |
|---|---|---|---|---|
| A | 2.5 | 2.4 | 1.2 | 0.7 |
| B | 0.5 | 0.7 | 0.3 | 0.2 |
| C | 2.2 | 2.9 | 1.1 | 0.9 |
| D | 1.9 | 2.2 | 0.8 | 0.6 |
| E | 3.1 | 3.0 | 1.5 | 1.1 |
| F | 1.1 | 1.3 | 0.5 | 0.4 |
- Center / scale: Z = (X - μ) / σ (scaling is optional).
- Covariance / correlation matrix: S = (1/(n-1)) · ZᵀZ.
- Eigen decomposition: S v = λ v, sorted by decreasing λ.
- Scores (reduced features): T = Z Vₖ, where the columns of Vₖ are the top eigenvectors (a code sketch of these steps follows the list below).
- Paste your dataset or upload a CSV file.
- Set delimiter, header, and whether the first column is an ID.
- Pick covariance or correlation, then choose scaling.
- Select components by K or by explained variance target.
- Click Compute PCA to view results above.
- Use the download buttons to export scores and a report.
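The computation steps listed above can be written out in a few lines. The sketch below is a minimal NumPy illustration using the sample table from this page; the variable names (`mu`, `sigma`, `Z`, `S`, `T`) are chosen for illustration and are not part of the tool itself.

```python
import numpy as np

# Sample data from the table above (rows A–F, features F1–F4).
X = np.array([
    [2.5, 2.4, 1.2, 0.7],  # A
    [0.5, 0.7, 0.3, 0.2],  # B
    [2.2, 2.9, 1.1, 0.9],  # C
    [1.9, 2.2, 0.8, 0.6],  # D
    [3.1, 3.0, 1.5, 1.1],  # E
    [1.1, 1.3, 0.5, 0.4],  # F
])

# 1. Center and (optionally) scale: Z = (X - mu) / sigma
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)
Z = (X - mu) / sigma        # drop the division by sigma for a covariance basis

# 2. Covariance/correlation matrix: S = Z^T Z / (n - 1)
n = Z.shape[0]
S = (Z.T @ Z) / (n - 1)

# 3. Eigendecomposition, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)   # eigh because S is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Scores for the top k components: T = Z V_k
k = 2
T = Z @ eigvecs[:, :k]
print(T.shape)                         # (6, 2): n rows, k reduced features
```

With the division by `sigma`, `S` is the correlation matrix of `X`; without it, `S` is the covariance matrix.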
Why PCA is used for feature reduction
PCA compresses correlated numeric variables into a smaller set of orthogonal components that retain most of the variation. For an input matrix with n rows and p features, the reduced score matrix has n rows and k columns, where k is typically far smaller than p. This reduces model training time, limits multicollinearity, and improves stability when features overlap, which is especially valuable in large-scale workflows.
Centering, scaling, and basis choice
Centering subtracts each feature mean so the first component represents variation rather than offsets. Standardization (z-scoring) divides by the feature standard deviation and is recommended when units differ. A covariance basis preserves the original units after centering, while a correlation basis is equivalent to standardizing and then using covariance. In practice, the correlation basis keeps high-variance features from dominating the first components.
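As a quick check of that equivalence, the sketch below (assuming NumPy; the random matrix is only a stand-in for any numeric dataset) shows that the covariance matrix of z-scored data matches the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                 # stand-in for any numeric dataset

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_z = np.cov(Z, rowvar=False)           # covariance of the standardized data
corr_of_x = np.corrcoef(X, rowvar=False)     # correlation of the original data
print(np.allclose(cov_of_z, corr_of_x))      # True
```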
Explained variance and interpretability
Eigenvalues quantify how much variance each component captures. The explained variance ratio is λᵢ / Σⱼ λⱼ, and the cumulative ratio shows how quickly information concentrates. If the first few components explain a large share (for example, 80–95%), the data likely lies near a lower-dimensional subspace. Loadings (eigenvector weights) indicate which original features drive each component, supporting interpretation.
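A minimal sketch of both quantities, assuming the eigenvalues are already sorted in decreasing order (the values here are illustrative, not taken from the sample table):

```python
import numpy as np

eigvals = np.array([3.1, 0.6, 0.2, 0.1])   # illustrative, sorted decreasing
ratio = eigvals / eigvals.sum()            # lambda_i / sum_j lambda_j
cumulative = np.cumsum(ratio)
print(ratio)                               # variance share per component
print(cumulative)                          # cumulative explained variance
```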
Choosing k with a variance target
Two common rules are selecting a fixed k or stopping when the cumulative explained variance reaches a target. Higher targets preserve more information but return more components. For forecasting or classification, start with 90% and compare performance versus 95% to quantify the tradeoff. When you export scores, keep the same preprocessing settings so new data projects consistently.
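The variance-target rule fits in a few lines; the eigenvalues below are illustrative and `target` is whatever threshold you choose:

```python
import numpy as np

eigvals = np.array([3.1, 0.6, 0.2, 0.1])          # illustrative eigenvalues
cumulative = np.cumsum(eigvals / eigvals.sum())   # cumulative explained variance

target = 0.90
k = int(np.searchsorted(cumulative, target) + 1)  # smallest k reaching the target
print(k)                                          # 2 components reach 90% here
```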
Operational checks and practical limits
The covariance/correlation matrix is p × p, so memory and runtime grow with the number of features. A simple diagnostic is to review the eigenvalue drop-off: a steep decline suggests strong redundancy. Also confirm missing-value handling, because dropping rows changes n while imputation changes feature moments. Use the reconstruction idea (Z ≈ T Vₖᵀ) to judge how much structure is retained. When p is large relative to n, keep k at most n − 1, because centered data has at most n − 1 nonzero eigenvalues and further components add no variance. Check for outliers, since extreme values can rotate components and materially inflate variance estimates.
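A sketch of the reconstruction check, reusing the names from the earlier sketch (`Z`, `eigvecs`, and `k` are assumed to come from that computation):

```python
import numpy as np

def reconstruction_error(Z, eigvecs, k):
    """Fraction of total variation lost when keeping only k components."""
    Vk = eigvecs[:, :k]
    T = Z @ Vk            # scores
    Z_hat = T @ Vk.T      # back-projection: Z ≈ T Vk^T
    return np.sum((Z - Z_hat) ** 2) / np.sum(Z ** 2)
```

On centered data this fraction equals one minus the cumulative explained variance ratio at k, so the two diagnostics should agree.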
1) Does PCA work with categorical variables?
PCA requires numeric inputs. Convert categories with a suitable encoding (for example, one-hot encoding), then consider scaling so the encoded columns do not dominate the variance.
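For example, a hedged one-hot encoding sketch with pandas (the column names are made up):

```python
import pandas as pd

# Hypothetical frame: one numeric column, one categorical column.
df = pd.DataFrame({
    "size": [1.2, 0.7, 0.9],
    "color": ["red", "blue", "red"],
})

encoded = pd.get_dummies(df, columns=["color"], dtype=float)
print(list(encoded.columns))   # ['size', 'color_blue', 'color_red']
```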
2) Should I standardize my features?
Standardize when features use different units or ranges. If all features share comparable scales, centering alone can be sufficient.
3) What is the difference between covariance and correlation?
Covariance reflects variance in original units after centering. Correlation is scale-free and effectively standardizes features, preventing high-variance variables from dominating.
4) How many components should I keep?
Keep a k that meets your variance target and preserves model accuracy. Common starting targets are 0.90 or 0.95, then validate downstream performance.
5) Why did my row count change after computing PCA?
If you selected “Drop rows with missing,” any row containing a missing value is removed before PCA. Choose mean imputation to retain all rows.
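A small sketch of the two options, assuming pandas; `df` stands in for the numeric input table with possible missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in the feature columns.
df = pd.DataFrame({"F1": [2.5, np.nan, 2.2], "F2": [2.4, 0.7, np.nan]})

dropped = df.dropna()              # "Drop rows with missing": n shrinks to 1 here
imputed = df.fillna(df.mean())     # mean imputation: all 3 rows kept, moments shift
print(len(dropped), len(imputed))  # 1 3
```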
6) Can I use these components for new incoming data?
Yes. Apply the same means and standard deviations used here, then multiply by the saved loading vectors. Consistent preprocessing is essential for comparable scores.
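A minimal sketch of that projection; `mu`, `sigma`, and `Vk` stand for the exported means, standard deviations, and top-k loading vectors, and the function name is only for illustration:

```python
import numpy as np

def project_new_data(X_new, mu, sigma, Vk):
    """Project new rows using the stored preprocessing and loadings.

    mu, sigma : feature means and standard deviations saved from the fit
    Vk        : p x k matrix whose columns are the saved loading vectors
    """
    Z_new = (np.asarray(X_new) - mu) / sigma   # same centering/scaling as the fit
    return Z_new @ Vk                          # scores in the same component space
```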