Sanitization Calculator
Enter issue counts, toggle cleanup rules, and estimate how much your dataset improves before feature engineering, model training, or deployment.
Formulas Used
1) Total issue instances
Total Issues = Missing + Duplicates + Outliers + Invalid Formats + Sensitive Values + Whitespace + Case Issues + HTML Noise + Encoding Errors
2) Remaining issues after sanitization
Remaining Issues = Original Issues - Automated Fixes - Manual Review Fixes (calculated per issue category)
3) Weighted sanitization coverage
Coverage % = (Weighted Fixed Issues / Weighted Total Issues) × 100
4) Issue density
Issue Density = (Issue Instances / Total Records) × 100
5) Quality score
Quality Score = 100 - ((Weighted Remaining Issues / Total Records) × 10)
6) Readiness score
Readiness Score = (Quality Score × 0.70) + (Coverage % × 0.20) + (Rule Depth × 0.10) - Critical Penalties
Severity weighting makes sensitive values and invalid formats count more than minor whitespace or casing issues, which reflects common machine learning governance and preprocessing priorities.
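The formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight values in `ASSUMED_WEIGHTS`, the function names, and the way `rule_depth` and `critical_penalties` are passed in are hypothetical, since this page does not list the calculator's actual weights or define those two inputs.

```python
# Assumed severity weights, for illustration only. Sensitive values and
# invalid formats are weighted highest, whitespace and case issues lowest.
ASSUMED_WEIGHTS = {
    "missing": 2.0,
    "duplicates": 1.5,
    "outliers": 1.5,
    "invalid_formats": 3.0,
    "sensitive_values": 4.0,
    "whitespace": 0.5,
    "case_issues": 0.5,
    "html_noise": 1.0,
    "encoding_errors": 1.0,
}


def weighted_sum(counts):
    """Sum issue counts, each scaled by its assumed severity weight."""
    return sum(ASSUMED_WEIGHTS[name] * count for name, count in counts.items())


def score_dataset(total_records, original, remaining,
                  rule_depth=0.0, critical_penalties=0.0):
    """Apply formulas 1 and 3 to 6 to per-category issue counts.

    `original` and `remaining` map category names to counts before and
    after sanitization. `rule_depth` and `critical_penalties` are taken
    as given because their derivation is not described on this page.
    """
    total_issues = sum(original.values())                      # formula 1
    w_total = weighted_sum(original)
    w_remaining = weighted_sum(remaining)
    w_fixed = w_total - w_remaining
    # Assumption: a dataset with no issues counts as fully covered.
    coverage = 100.0 * w_fixed / w_total if w_total else 100.0  # formula 3
    density = 100.0 * total_issues / total_records              # formula 4
    quality = 100.0 - (w_remaining / total_records) * 10.0      # formula 5
    readiness = (quality * 0.70 + coverage * 0.20
                 + rule_depth * 0.10 - critical_penalties)      # formula 6
    return {
        "total_issues": total_issues,
        "coverage": coverage,
        "density": density,
        "quality": quality,
        "readiness": readiness,
    }
```

With 1,000 records, original counts of 100 missing, 50 sensitive, and 200 whitespace issues, remaining counts of 10, 5, and 0, and a rule depth of 50, these assumed weights give a coverage of 92%, a density of 35 issues per 100 records, a quality score of about 99.6, and a readiness score of about 93.12.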
How to Use This Tool
- Enter a dataset name and the total number of records.
- Add counts for each issue category you discovered during profiling.
- Enable the sanitization rules your workflow will apply.
- Set the manual review fix rate to estimate analyst cleanup after automation.
- Click Sanitize Dataset to generate summary metrics and the issue reduction graph.
- Review readiness, coverage, and remaining risk before model training.
- Download the CSV or PDF report for audits, stakeholder sharing, or pipeline documentation.
Example Data Table
| Record ID | Original Value | Detected Issue | Applied Rule | Sanitized Value |
|---|---|---|---|---|
| 101 | john.doe@email.com | Sensitive identifier | Mask personal identifiers | j***.***@email.com |
| 102 | `"  NEW YORK  "` | Whitespace issue | Trim spaces | NEW YORK |
| 103 | `<b>gold_plan</b>` | HTML noise | Remove HTML noise | gold_plan |
| 104 | female | Case inconsistency | Normalize letter case | Female |
| 105 | 2026/31/03 | Invalid format | Standardize formats | 2026-03-31 |
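The rules in the table could be approximated as below. This is a minimal sketch, not the tool's implementation: the function names are hypothetical, the HTML rule is a naive regex rather than a real parser, the case rule assumes capitalized output, and the date rule assumes the input is always year/day/month.

```python
import re
from datetime import datetime


def mask_email(value):
    """Keep the first character and the domain; mask the rest of the local part."""
    local, sep, domain = value.partition("@")
    if not local or not sep or not domain:
        return value  # not an email-like value, leave unchanged
    # Replace each run of non-dot characters after the first character with ***.
    masked_rest = re.sub(r"[^.]+", "***", local[1:])
    return f"{local[0]}{masked_rest}@{domain}"


def trim_spaces(value):
    """Strip leading and trailing whitespace and collapse internal runs."""
    return " ".join(value.split())


def strip_html(value):
    """Remove anything that looks like a tag (naive; assumes simple markup)."""
    return re.sub(r"<[^>]+>", "", value)


def normalize_case(value):
    """Capitalize the first letter and lowercase the rest (assumed convention)."""
    return value.strip().capitalize()


def standardize_date(value):
    """Convert an assumed YYYY/DD/MM input to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, "%Y/%d/%m").strftime("%Y-%m-%d")
```

Applied to the table's original values, these functions reproduce the sanitized column, for example `mask_email("john.doe@email.com")` returns `j***.***@email.com`.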
Frequently Asked Questions
1) What does this data sanitization tool measure?
It estimates how automated cleanup and manual review reduce common dataset problems, then converts that reduction into quality, readiness, and risk indicators.
2) Why are some issues weighted more heavily?
Sensitive values and invalid formats usually create larger privacy, compliance, and model integrity problems. The weighting system reflects that higher operational impact.
3) Is this tool useful before model training?
Yes. It helps teams estimate whether a dataset is clean enough for feature engineering, validation, training, testing, and production scoring pipelines.
4) Does the tool replace actual profiling software?
No. It is a planning and reporting calculator. You should still run proper profiling, validation, lineage, and governance checks in your real workflow.
5) What is issue density?
Issue density shows the number of issue instances per 100 records. Lower density generally means cleaner inputs for machine learning tasks.
6) What does the manual review fix rate represent?
It estimates the share of issues left unresolved by automated rules that analysts can still correct. This is useful for semi-automated cleanup pipelines.
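One way to read this, shown as a rough sketch: the rate is treated as a fraction between 0 and 1 and applied only to the issues that automation left behind. The function name and the rounding choice are assumptions, not the calculator's documented behavior.

```python
def remaining_after_cleanup(original, automated_fixes, manual_fix_rate):
    """Return (manual_fixes, remaining) for one issue category.

    Assumes manual_fix_rate is a fraction in [0, 1] applied to the issues
    still unresolved after automated rules have run.
    """
    if not 0.0 <= manual_fix_rate <= 1.0:
        raise ValueError("manual_fix_rate must be between 0 and 1")
    after_automation = max(original - automated_fixes, 0)
    manual_fixes = round(after_automation * manual_fix_rate)
    remaining = after_automation - manual_fixes
    return manual_fixes, remaining
```

For instance, 200 original issues with 150 automated fixes and a 40% manual review rate leave 50 issues after automation, of which 20 are fixed manually and 30 remain.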
7) Can I use the report for governance reviews?
Yes. The CSV and PDF exports help document assumptions, estimated issue reduction, and dataset readiness for audits or internal review discussions.
8) Why can the risk level stay high after cleanup?
Risk remains high when critical issues, especially unresolved sensitive data or serious formatting failures, still affect the dataset after sanitization.