Example Data Table
| Col 1 | Col 2 | Col 3 |
|---|---|---|
| 2.5 | 2.4 | 1.2 |
| 0.5 | 0.7 | 0.3 |
| 2.2 | 2.9 | 1.9 |
| 1.9 | 2.2 | 1.4 |
| 3.1 | 3.0 | 2.1 |
Formula Used
Whitening transforms a centered dataset X so its covariance becomes (approximately) the identity matrix.
- Centering: Xc = X − 1μᵀ, where μ is the per-column mean.
- Covariance: C = (1/(n−ddof)) · Xcᵀ Xc.
- Regularization: Cε = C + εI.
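- Cholesky whitening: compute Cε⁻¹ = L Lᵀ with L lower triangular, set A = L, then Xw = Xc A, so Cov(Xw) = Aᵀ C A ≈ I. Equivalently, W = Lᵀ whitens a single centered column vector: z = W(x − μ).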
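The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the calculator's actual implementation: the function name `cholesky_whiten` and its return values are hypothetical. Because samples are stored as rows, the sketch right-multiplies by L itself (Lᵀ is the matrix that would act on column vectors).

```python
import numpy as np

def cholesky_whiten(X, ddof=1, eps=1e-6):
    """Center X (rows = samples) and whiten it with a Cholesky factor of inv(C + eps*I)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu = X.mean(axis=0)                    # per-column mean
    Xc = X - mu                            # centering: Xc = X - 1 mu^T
    C = (Xc.T @ Xc) / (n - ddof)           # covariance with divisor n - ddof
    C_eps = C + eps * np.eye(d)            # regularization: C_eps = C + eps*I
    L = np.linalg.cholesky(np.linalg.inv(C_eps))  # inv(C_eps) = L L^T, L lower triangular
    A = L                                  # right-multiplier for row-wise samples
    return Xc @ A, A, mu

# The example table from this page.
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.9],
    [1.9, 2.2, 1.4],
    [3.1, 3.0, 2.1],
])

Xw, A, mu = cholesky_whiten(X, ddof=1, eps=1e-6)
cov_after = (Xw.T @ Xw) / (X.shape[0] - 1)
print(np.round(cov_after, 3))  # close to the 3x3 identity
```

The small gap between `cov_after` and the identity comes from epsilon: the whitening matrix is built from Cε, while the covariance being checked is C.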
How to Use This Calculator
- Paste your numeric dataset into the textarea; each line is one row.
- Choose a method. Use Cholesky whitening for decorrelation.
- Pick ddof, the value subtracted from n in the covariance divisor. Use ddof = 1 (divisor n−1) for typical sample estimates.
- If results look unstable, increase epsilon slightly (example: 1e-5).
- Click Whiten Data. Results appear above the form.
- Use CSV/PDF buttons to export the whitened dataset.
Why Whitening Matters in Modeling
Data whitening converts correlated features into a space where directions are orthogonal and variances are comparable. This reduces dominance by high-variance inputs, stabilizes gradient-based optimization, and improves numerical behavior in downstream matrix operations such as inversion, PCA, or Mahalanobis distance.
In practice, whitening can make clustering and distance metrics reflect structure rather than raw units, improving comparability across features.
Choosing Centering and Scaling
Centering removes the mean so covariance reflects only joint variation, not offsets. Pre-standardization is useful when units differ greatly, for example mixing counts, temperatures, and monetary values. When standardization is enabled, each column is shifted to zero mean and scaled by its standard deviation before whitening.
If you already centered upstream, you can disable centering, but keep it enabled when unsure. Standardization can prevent extreme scales from destabilizing covariance estimates. If columns contain constant values, remove them first, because near-zero variance can break scaling, inflate noise, and distort covariance interpretation in practice.
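The preprocessing described here might look like the sketch below. The helper name `preprocess`, its parameters, and the tolerance `tol` are assumptions made for illustration; the calculator's own handling of constant columns is not documented on this page.

```python
import numpy as np

def preprocess(X, center=True, standardize=False, ddof=1, tol=1e-12):
    """Drop (near-)constant columns, then optionally center and standardize."""
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0, ddof=ddof)          # per-column standard deviation
    keep = sd > tol                        # columns with usable variance
    X = X[:, keep]
    if center or standardize:              # standardizing implies centering
        X = X - X.mean(axis=0)
    if standardize:
        X = X / sd[keep]
    return X, keep

# Hypothetical mixed-unit data: a count, a temperature, and a constant flag.
raw = np.array([
    [120.0, 21.5, 1.0],
    [ 95.0, 19.0, 1.0],
    [143.0, 23.1, 1.0],
    [110.0, 20.2, 1.0],
])

Z, keep = preprocess(raw, standardize=True)
print(keep)                    # which columns survived the variance check
print(Z.std(axis=0, ddof=1))   # each remaining column has unit standard deviation
```

Dropping the constant column before scaling avoids a division by zero, which is the failure mode the paragraph above warns about.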
Regularization and Numerical Stability
Real datasets often produce near-singular covariance matrices, especially with many variables, collinearity, or small sample sizes. The epsilon option adds a small value to the diagonal, yielding Cε = C + εI. This makes inversion safer and helps Cholesky decomposition succeed without distorting structure excessively.
Start with 1e-6 and increase only when you see errors or large residual correlations after whitening. Too much epsilon under-whitens the data: output variances fall below 1, most noticeably along low-variance directions.
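To see why the diagonal term helps, the sketch below builds a deliberately collinear dataset (an assumption for illustration: the third column is the sum of the first two) and compares the smallest eigenvalue of C with that of Cε. Whether the unregularized factorization actually fails depends on rounding, so the code reports the outcome instead of assuming it.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(20, 2))
X = np.column_stack([base, base[:, 0] + base[:, 1]])  # third column is collinear

Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / (X.shape[0] - 1)     # singular up to rounding error

eps = 1e-6
C_eps = C + eps * np.eye(C.shape[0])   # every eigenvalue is shifted up by eps

print("smallest eigenvalue of C:    ", np.linalg.eigvalsh(C).min())
print("smallest eigenvalue of C_eps:", np.linalg.eigvalsh(C_eps).min())

try:
    np.linalg.cholesky(C)
    print("Cholesky of C happened to succeed (rounding can hide exact singularity)")
except np.linalg.LinAlgError:
    print("Cholesky of C failed: matrix is not positive definite")

L = np.linalg.cholesky(C_eps)          # the regularized matrix is positive definite
```

Adding εI raises every eigenvalue by exactly ε, so the smallest eigenvalue of Cε is at least about ε even when C itself is rank-deficient.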
Interpreting Covariance Diagnostics
The tool reports covariance before and after transformation using the chosen ddof. After successful whitening, diagonal terms should approach 1 and off-diagonals should approach 0. Deviations indicate insufficient data, noise, poor scaling, or a need for slightly higher epsilon.
Using n−1 is common for sample covariance, while dividing by n shrinks every estimate by the factor (n−1)/n, which matters mainly for small n. Use one convention consistently across your workflow.
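A diagnostic of this kind can be sketched as below, using the example table from this page. The helper `identity_deviation` is a hypothetical name, and epsilon is omitted here so that the only effect on display is the ddof convention. Whitening with ddof = 1 and then checking with ddof = 0 leaves the diagonal at (n−1)/n instead of 1, which is why mixing conventions looks like a whitening failure.

```python
import numpy as np

def identity_deviation(Xw, ddof):
    """Return (max |diagonal - 1|, max |off-diagonal|) for the covariance of Xw."""
    C = np.cov(Xw, rowvar=False, ddof=ddof)
    diag_err = np.max(np.abs(np.diag(C) - 1.0))
    off_err = np.max(np.abs(C - np.diag(np.diag(C))))
    return diag_err, off_err

X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.9],
    [1.9, 2.2, 1.4],
    [3.1, 3.0, 2.1],
])
n = X.shape[0]

# Whiten using the ddof=1 covariance (no epsilon in this sketch).
Xc = X - X.mean(axis=0)
C = np.cov(X, rowvar=False, ddof=1)
Xw = Xc @ np.linalg.cholesky(np.linalg.inv(C))

diag1, off1 = identity_deviation(Xw, ddof=1)  # same convention: both near 0
diag0, off0 = identity_deviation(Xw, ddof=0)  # mixed convention: diagonal off by 1/n
print(diag1, off1)
print(diag0, off0)
```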
Operational Export and Reproducibility
Use the CSV download for immediate ingestion into analysis notebooks, pipelines, or modeling scripts. The PDF export provides a shareable preview for review and audit. For reproducibility, record your settings: method, centering, pre-standardization, ddof, epsilon, and numeric precision, then rerun with identical inputs.
For deployment, fit the transform on training data and apply it unchanged to new observations, avoiding leakage and preserving comparability.
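One way to keep the fitted transform separate from the data it is applied to is a small fit/transform object, sketched below. The class name `CholeskyWhitener` and its attributes are hypothetical, and the training and new data are synthetic; the point is only that the mean and matrix come from the training set and are reused unchanged.

```python
import numpy as np

class CholeskyWhitener:
    """Fit a Cholesky whitening transform once, then reuse it on new rows."""

    def __init__(self, ddof=1, eps=1e-6):
        self.ddof = ddof
        self.eps = eps

    def fit(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        n, d = X_train.shape
        self.mean_ = X_train.mean(axis=0)              # training mean only
        Xc = X_train - self.mean_
        C_eps = (Xc.T @ Xc) / (n - self.ddof) + self.eps * np.eye(d)
        self.A_ = np.linalg.cholesky(np.linalg.inv(C_eps))
        return self

    def transform(self, X):
        # New observations are shifted by the stored training mean, not their own.
        return (np.asarray(X, dtype=float) - self.mean_) @ self.A_

# Synthetic correlated data, for illustration only.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.2], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
offset = np.array([10.0, -5.0, 3.0])
X_train = rng.normal(size=(200, 3)) @ mix + offset
X_new = rng.normal(size=(5, 3)) @ mix + offset

whitener = CholeskyWhitener(ddof=1, eps=1e-6).fit(X_train)
Z_train = whitener.transform(X_train)
Z_new = whitener.transform(X_new)
```

Storing `mean_`, `A_`, `ddof`, and `eps` alongside the model is enough to reproduce the transform later. The covariance of `Z_new` will generally not be exactly the identity, since the transform was fitted on a different sample.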
FAQs
What is data whitening?
Whitening transforms your variables so the covariance matrix becomes close to the identity matrix, meaning features are decorrelated and have comparable variance.
Should I always enable centering?
Yes, in most cases. Centering removes mean offsets so covariance measures variation properly. Disable it only when the data was already centered upstream, for example with a training-set mean that must not be recomputed.
What does epsilon do?
Epsilon adds a small value to the covariance diagonal to stabilize inversion and Cholesky decomposition. Increase it slightly if the tool reports singular or non‑positive‑definite matrix errors.
What is ddof and which option should I pick?
ddof sets the covariance divisor, n − ddof. Choose ddof = 1 (divisor n−1) for sample statistics and ddof = 0 (divisor n) for population statistics. Consistency across your pipeline is usually more important than the exact choice.
Why doesn’t my covariance after whitening look exactly like the identity?
Finite samples, noise, rounding, and regularization can leave small residual correlations. Improve results by adding more rows, standardizing scales, or increasing epsilon modestly.
Can I apply the whitening transform to new data later?
Yes. Compute the transform using training data, store the settings and matrix, then apply the same transform to future observations to keep feature space consistent.