Profile a Dataset
How to Use This Tool
- Select Upload or Paste as your data source.
- Set delimiter and text qualifier to match your CSV.
- Choose whether the first row contains headers.
- Optionally limit rows to profile for faster results.
- Press Generate Profiling Report to see insights.
- Use downloads to share results with your team.
Formulas Used
- Missing % = (missing cells ÷ total cells) × 100
- Mean = (Σx) ÷ n
- Median = middle value after sorting (or average of two middles)
- Sample Std Dev = √(Σ(x−mean)² ÷ (n−1))
- Quartiles: Q1 = median of the lower half, Q3 = median of the upper half
- IQR = Q3 − Q1
- Outlier bounds = [Q1 − k·IQR, Q3 + k·IQR]
- Pearson r = cov(x,y) ÷ (sd(x)·sd(y))
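As a rough illustration, the single-column formulas above can be reproduced with Python's standard library. This is only a sketch, not the tool's own code: the function names `sample_std` and `quartiles` are made up for this example, and it assumes that with an odd number of values the middle value is left out of both halves when computing quartiles. The numbers are the `spend` column from the example dataset.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd count, the middle value belongs to neither half.
    Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return statistics.median(lower), statistics.median(upper)

spend = [120.50, 0.0, 75.00, 300.10, 980.00]

mean_spend = sum(spend) / len(spend)       # Mean = (sum of x) / n
median_spend = statistics.median(spend)    # middle value after sorting
std_spend = sample_std(spend)              # divides by n - 1, not n
q1, q3 = quartiles(spend)
iqr = q3 - q1                              # IQR = Q3 - Q1

print(mean_spend, median_spend, std_spend, q1, q3, iqr)
```

Other quartile conventions exist (for example, including the middle value in both halves), so results from other software may differ slightly on small columns.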
Example Dataset
| customer_id | age | country | spend | churned | signup_date |
|---|---|---|---|---|---|
| 1001 | 29 | PK | 120.50 | false | 2025-11-02 |
| 1002 | 34 | AE | 0 | true | 2025-11-05 |
| 1003 |  | PK | 75.00 | false | 2025-11-09 |
| 1004 | 22 | SA | 300.10 | false | 2025-11-12 |
| 1005 | 46 | PK | 980.00 | true | 2025-12-01 |
customer_id,age,country,spend,churned,signup_date
1001,29,PK,120.50,false,2025-11-02
1002,34,AE,0,true,2025-11-05
1003,,PK,75.00,false,2025-11-09
1004,22,SA,300.10,false,2025-11-12
1005,46,PK,980.00,true,2025-12-01
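To see how the Missing % formula applies to this example, here is a small sketch that parses the CSV with Python's `csv` module and counts blank cells. The function name `missing_report` and its parameters are hypothetical, and only blanks are treated as missing here.

```python
import csv
import io

CSV_TEXT = """\
customer_id,age,country,spend,churned,signup_date
1001,29,PK,120.50,false,2025-11-02
1002,34,AE,0,true,2025-11-05
1003,,PK,75.00,false,2025-11-09
1004,22,SA,300.10,false,2025-11-12
1005,46,PK,980.00,true,2025-12-01
"""

def missing_report(text, delimiter=",", quotechar='"'):
    """Return (missing % per column, overall missing %), counting blank cells only."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    rows = list(reader)
    counts = {name: 0 for name in reader.fieldnames}
    for row in rows:
        for name in reader.fieldnames:
            # A short row yields None for absent fields; treat that as blank too.
            if not (row[name] or "").strip():
                counts[name] += 1
    per_column = {name: 100.0 * c / len(rows) for name, c in counts.items()}
    total_cells = len(rows) * len(reader.fieldnames)
    overall = 100.0 * sum(counts.values()) / total_cells
    return per_column, overall

per_column, overall = missing_report(CSV_TEXT)
print(per_column["age"])   # one blank age out of five rows
print(overall)             # one blank cell out of thirty
```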
FAQs
1) What does data profiling mean?
Data profiling summarizes structure and quality. It measures missingness, uniqueness, value ranges, and frequent categories. It helps you spot issues before modeling.
2) How is the column type inferred?
The tool checks only non-missing values. If every value parses as a number, the column becomes Numeric. If most values parse as dates, it becomes Date/Time. Otherwise, uniqueness and value length separate Categorical from Text.
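The decision order described above might look roughly like the following sketch. The thresholds are invented for illustration (`date_share`, `max_unique_ratio`, `max_avg_len`), only ISO-style dates are recognized, and the `"Unknown"` label for an all-missing column is a placeholder; the tool's real cut-offs and date formats are not documented here.

```python
from datetime import date

def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def _is_date(value):
    # Assumption: only ISO dates such as 2025-11-02 count as dates.
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

def infer_type(values, date_share=0.8, max_unique_ratio=0.8, max_avg_len=30):
    """Guess a column type from its non-missing values (hypothetical thresholds)."""
    present = [v.strip() for v in values if v.strip()]
    if not present:
        return "Unknown"  # placeholder label for a column with no values
    if all(_is_number(v) for v in present):
        return "Numeric"
    if sum(_is_date(v) for v in present) / len(present) >= date_share:
        return "Date/Time"
    unique_ratio = len(set(present)) / len(present)
    avg_len = sum(len(v) for v in present) / len(present)
    if unique_ratio <= max_unique_ratio and avg_len <= max_avg_len:
        return "Categorical"
    return "Text"

print(infer_type(["29", "34", "", "22", "46"]))         # age, with one blank
print(infer_type(["PK", "AE", "PK", "SA", "PK"]))       # country
print(infer_type(["2025-11-02", "2025-11-05"]))         # signup_date
```

On very small samples the uniqueness ratio is naturally high, so a real column of categories can look like Text until more rows are profiled.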
3) What counts as missing?
Blanks can be treated as missing. Optional tokens like NA, null, and NaN can also be treated as missing. You can toggle both behaviors in the options.
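The two toggles can be pictured as a small predicate like the one below. The name `is_missing` and its parameters are hypothetical, and matching tokens case-insensitively is an assumption rather than documented behavior.

```python
def is_missing(value, blanks_missing=True, tokens_missing=True,
               tokens=("NA", "null", "NaN")):
    """Decide whether a raw cell counts as missing under the two toggles."""
    text = value.strip()
    if text == "":
        return blanks_missing
    # Assumption: token comparison ignores case, so "na" matches "NA".
    if tokens_missing and text.lower() in {t.lower() for t in tokens}:
        return True
    return False

cells = ["", "NA", "null", "0", "PK"]
print([is_missing(c) for c in cells])
print([is_missing(c, tokens_missing=False) for c in cells])
```

Note that `0` is a real value, not a missing one, under either setting.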
4) How are outliers detected?
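Outliers use the IQR rule: a value is flagged when it falls below Q1 − k·IQR or above Q3 + k·IQR. A lower k flags more points; k = 1.5 is the conventional choice. The rule works well for roughly symmetric numeric distributions but can over-flag heavily skewed ones.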
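To show how k changes what gets flagged, here is a minimal sketch applied to the `spend` values from the example dataset. The helper names are hypothetical, the default `k=1.5` is the conventional value rather than a confirmed tool default, and quartiles are taken as medians of the lower and upper halves with the middle value excluded when the count is odd.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    low, high = iqr_bounds(values, k)
    return [x for x in values if x < low or x > high]

spend = [120.50, 0.0, 75.00, 300.10, 980.00]

wide = find_outliers(spend, k=1.5)    # wider bounds
narrow = find_outliers(spend, k=0.5)  # tighter bounds flag more points

print(wide)    # → []
print(narrow)  # → [980.0]
```

With only five rows the quartiles are coarse, so treat flags on tiny samples as hints rather than verdicts.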
5) Are correlations always reliable?
Correlations summarize linear relationships and ignore non-linear patterns. They also change with outliers and missing data. Use them as a quick signal, then validate with plots or domain knowledge.
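Two quick experiments make these caveats concrete. The sketch below implements Pearson r as cov(x, y) ÷ (sd(x)·sd(y)) with sample statistics; the function name `pearson_r` and the toy data are made up for illustration.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson r = sample covariance / (sample sd of x * sample sd of y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

# 1) A perfect but non-linear relationship: y = x squared on symmetric x.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
r_curve = pearson_r(xs, ys)

# 2) A mostly increasing trend, then the same data plus one extreme point.
xs_clean = [1, 2, 3, 4, 5]
ys_clean = [2, 1, 4, 3, 5]
r_clean = pearson_r(xs_clean, ys_clean)
r_outlier = pearson_r(xs_clean + [6], ys_clean + [-20])

print(r_curve)    # y depends entirely on x, yet r is zero
print(r_clean)    # clearly positive
print(r_outlier)  # a single point turns the sign negative
```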
6) Why limit the number of rows?
Profiling large files can be slow on shared hosting. Capping the number of rows keeps results responsive while still revealing common issues. For final checks, set max rows to zero so that every row is profiled.
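A row cap of this kind can be sketched with `itertools.islice`. This example assumes that the cap keeps the first rows of the file (not a random sample) and that a max rows value of zero means no limit; `read_rows` is a hypothetical helper, not the tool's code.

```python
import csv
import io
from itertools import islice

CSV_TEXT = """\
customer_id,age,country
1001,29,PK
1002,34,AE
1003,,PK
1004,22,SA
1005,46,PK
"""

def read_rows(text, max_rows=0):
    """Return (header, rows), reading at most max_rows data rows.

    Assumption: max_rows <= 0 means "no limit".
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    limit = max_rows if max_rows > 0 else None
    return header, list(islice(reader, limit))

header, first_two = read_rows(CSV_TEXT, max_rows=2)
_, everything = read_rows(CSV_TEXT, max_rows=0)

print(len(first_two))   # → 2
print(len(everything))  # → 5
```

If a file is sorted (for example by date), its first rows may not be representative, which is one more reason to run the final check without a limit.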
7) Can I use this for machine learning features?
Yes. The report helps you choose encodings, scaling, and missing-value handling. It can highlight leakage risks using correlations and suspiciously high uniqueness.
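As one way to act on the uniqueness signal, the sketch below computes a uniqueness ratio (distinct values ÷ non-missing values) and lists columns that look like identifiers. The threshold `ID_LIKE_THRESHOLD` and the helper name are hypothetical choices for this example.

```python
def uniqueness_ratio(values):
    """Distinct non-missing values divided by non-missing values."""
    present = [v for v in values if v.strip()]
    return len(set(present)) / len(present) if present else 0.0

columns = {
    "customer_id": ["1001", "1002", "1003", "1004", "1005"],
    "country": ["PK", "AE", "PK", "SA", "PK"],
    "churned": ["false", "true", "false", "false", "true"],
}

ID_LIKE_THRESHOLD = 0.95  # hypothetical cut-off for "suspiciously unique"

id_like = [name for name, vals in columns.items()
           if uniqueness_ratio(vals) >= ID_LIKE_THRESHOLD]

print(id_like)  # → ['customer_id']
```

Columns flagged this way are usually identifiers and are rarely useful as model features. Continuous measurements can also be nearly all unique, so check the column type before dropping anything.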
8) Does this tool modify my dataset?
No. It only reads and summarizes values. Any cleaning or transformation should be done in your pipeline after reviewing the report.