Calculator Inputs
Enter observed dataset quality metrics and optional scoring weights. Results appear above this form after submission.
Example Data Table
Use this sample to understand how real audit inputs might look before running the calculator.
| Dataset | Records | Missing % | Duplicate % | Label Error % | Consent % | Leakage Flags |
|---|---|---|---|---|---|---|
| Customer Intent Training Set | 50,000 | 3.50% | 1.20% | 2.50% | 96.00% | 0 |
| Fraud Screening Validation Set | 18,500 | 7.80% | 2.90% | 4.10% | 89.00% | 1 |
| Support Ticket Classification Set | 72,300 | 2.10% | 0.70% | 1.60% | 98.50% | 0 |
Formula Used
Completeness Score = 100 − (Missing % × 1.25) − (max(Freshness Days − 7, 0) × 0.35) − (Schema Mismatches × 2.2)
Consistency Score = 100 − (Duplicate % × 1.5) − (Outlier % × 0.9) − (Schema Mismatches × 1.4)
Labeling Score = 100 − (Label Error % × 2.2) − (max(Class Ratio − 1, 0) × 8)
Privacy Score = Consent % − (PII Columns × 6)
Governance Score = (Documentation % × 0.75) + (Consent % × 0.10) + (Coverage Confidence × 0.15) − (Schema Mismatches × 0.8)
Readiness Score = 100 − (Leakage Flags × 14) − (Label Error % × 0.9) − (Duplicate % × 0.4) − (max(Class Ratio − 1, 0) × 4) − (max(Freshness Days − 14, 0) × 0.2)
Overall Audit Score = Sum of (Category Score × Category Weight) ÷ Sum of Weights
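The formulas above can be sketched in Python. This is an illustrative implementation, not the calculator's actual source: the metric key names are invented for the example, and clamping each category score to the 0–100 range is an added assumption the formulas themselves do not state.

```python
def audit_scores(m):
    """Compute the six category scores from a metrics dict.

    Keys mirror the formula inputs above (illustrative names):
    percentages are 0-100, counts and ratios are raw values.
    """
    clamp = lambda x: max(0.0, min(100.0, x))  # assumed 0-100 bound
    over = lambda value, threshold: max(value - threshold, 0)  # max(v - t, 0) terms
    return {
        "completeness": clamp(100 - m["missing_pct"] * 1.25
                              - over(m["freshness_days"], 7) * 0.35
                              - m["schema_mismatches"] * 2.2),
        "consistency": clamp(100 - m["duplicate_pct"] * 1.5
                             - m["outlier_pct"] * 0.9
                             - m["schema_mismatches"] * 1.4),
        "labeling": clamp(100 - m["label_error_pct"] * 2.2
                          - over(m["class_ratio"], 1) * 8),
        "privacy": clamp(m["consent_pct"] - m["pii_columns"] * 6),
        "governance": clamp(m["documentation_pct"] * 0.75
                            + m["consent_pct"] * 0.10
                            + m["coverage_confidence"] * 0.15
                            - m["schema_mismatches"] * 0.8),
        "readiness": clamp(100 - m["leakage_flags"] * 14
                           - m["label_error_pct"] * 0.9
                           - m["duplicate_pct"] * 0.4
                           - over(m["class_ratio"], 1) * 4
                           - over(m["freshness_days"], 14) * 0.2),
    }

def overall_score(scores, weights):
    """Weighted average: sum(score x weight) / sum(weights)."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total
```

A perfectly clean dataset (no missing values, no mismatches, full consent and documentation, balanced classes) scores 100 in every category, so the weighted overall is also 100 regardless of the weights chosen.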
How to Use This Calculator
- Collect recent audit metrics from profiling reports, labeling reviews, governance logs, and privacy checks.
- Enter the observed percentages, counts, ratio values, and freshness timing into the calculator fields.
- Adjust category weights when your project prioritizes privacy, labeling quality, or deployment readiness differently.
- Submit the form to generate the weighted score, checklist pass rate, category graph, and remediation guidance.
- Download the CSV for structured records or the PDF for a visual summary to share with teams.
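For structured records like the CSV download mentioned above, a minimal offline sketch using Python's standard library looks like this. The column names are illustrative, not the calculator's actual export schema.

```python
import csv

def export_report(path, dataset_name, scores, overall):
    """Write one structured row per category score plus the overall score."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["dataset", "category", "score"])  # assumed header
        for category, score in scores.items():
            writer.writerow([dataset_name, category, round(score, 2)])
        writer.writerow([dataset_name, "overall", round(overall, 2)])
```

One row per category keeps the file easy to load into a spreadsheet or database and to compare across repeated audits.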
FAQs
1. What does the overall audit score represent?
It combines completeness, consistency, labeling, privacy, governance, and readiness into one weighted score. Higher values usually indicate cleaner, safer, and better-documented data for machine learning work.
2. Why is class imbalance included?
Class imbalance can distort model learning, weaken minority class predictions, and create misleading performance metrics. Tracking the ratio helps identify when rebalancing or stratified sampling is needed.
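As a sketch, the class ratio can be computed as the majority-class count divided by the minority-class count, a common convention; confirm it matches how your labeling report defines the ratio before entering it.

```python
from collections import Counter

def class_ratio(labels):
    """Majority-to-minority class ratio; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```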
3. What counts as data leakage here?
Leakage includes future information, target proxies, duplicate entities across splits, or any field that lets the model indirectly peek at answers during training or validation.
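One of these leakage sources, duplicate entities across splits, is easy to screen for when each record carries an entity identifier. A minimal sketch (the ID field is an assumption about your data):

```python
def split_overlap(train_ids, validation_ids):
    """Entity IDs present in both splits -- a common leakage source."""
    return set(train_ids) & set(validation_ids)
```

A non-empty result is a candidate leakage flag worth investigating before counting it in the calculator input.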
4. Can this be used before vendor dataset approval?
Yes. It works well as a structured screening step before accepting external data, refreshing labels, approving training sources, or moving a dataset into model development.
5. Does this replace privacy or legal review?
No. It is an operational scoring aid, not a legal opinion. Sensitive or regulated projects still need formal privacy, security, and governance review from the right teams.
6. How often should audits be repeated?
Run a new audit after major schema changes, dataset refreshes, vendor handoffs, policy updates, or before important model retraining cycles and deployment decisions.
7. Can I customize the scoring logic?
Yes. The weight inputs already let you change category importance. You can also edit the threshold rules and score multipliers directly in the calculator's source to match your policy.
8. What is the difference between CSV and PDF exports?
The CSV stores structured report data for spreadsheets or databases. The PDF captures the visible summary and chart for presentations, reviews, or audit documentation.