Missing Data Correlation Calculator

Calculator Inputs

Enter two aligned numeric series. Use commas or new lines. Leave blanks, or use listed missing tokens, wherever values are unavailable. Avoid thousands separators inside numbers.

First variable name

Second variable name

Complete-case correlation method

Decimal places

Missing tokens

Example: NA, N/A, NULL, NaN, -

Feature_A values

Feature_B values
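
The input rules above can be sketched in a few lines of Python. This is an illustration only, not the calculator's actual code: the function name parse_series, the DEFAULT_TOKENS tuple, and the case-insensitive token matching are assumptions made for the example.

```python
import re

# Tokens copied from the example list above. Matching them without regard
# to case is an assumption of this sketch.
DEFAULT_TOKENS = ("NA", "N/A", "NULL", "NaN", "-")

def parse_series(text, missing_tokens=DEFAULT_TOKENS):
    """Split on commas or new lines; return floats, with None for missing entries."""
    tokens = {t.strip().lower() for t in missing_tokens}
    values = []
    for raw in re.split(r"[,\n]", text):
        item = raw.strip()
        if item == "" or item.lower() in tokens:
            values.append(None)          # blank or listed token counts as missing
        else:
            values.append(float(item))   # ValueError if the entry is not numeric
    return values

print(parse_series("12, 15, NA, 18, , 20, -"))
# [12.0, 15.0, None, 18.0, None, 20.0, None]
```

The sketch also shows why thousands separators cause trouble: "1,200" is split at the comma into two values, 1 and 200, which shifts every later row out of alignment. A trailing line break would likewise be read as one more blank entry here.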

Example Data Table

This sample table mirrors the default demo values already loaded in the form.

Observation	Feature_A	Feature_B	Status
Obs 1	12	5	Complete
Obs 2	15	7	Complete
Obs 3	NA	8	Feature_A missing
Obs 4	18	NA	Feature_B missing
Obs 5	20	11	Complete
Obs 6	22	12	Complete
Obs 7	—	13	Feature_A missing
Obs 8	25	—	Feature_B missing
Obs 9	27	16	Complete
Obs 10	29	18	Complete
Obs 11	31	19	Complete
Obs 12	34	21	Complete

Formula Used

1) Missingness indicator

For each observation and each variable, define an indicator: M = 1 if the value is missing, otherwise M = 0.
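
As a small illustration, the indicator can be built for the two columns of the example table. Representing a missing entry as None and the helper name missing_indicator are choices made for this sketch.

```python
# Columns from the example table; None marks a missing entry.
feature_a = [12, 15, None, 18, 20, 22, None, 25, 27, 29, 31, 34]
feature_b = [5, 7, 8, None, 11, 12, 13, None, 16, 18, 19, 21]

def missing_indicator(values):
    """Return M = 1 where a value is missing, otherwise M = 0."""
    return [1 if v is None else 0 for v in values]

m_a = missing_indicator(feature_a)
m_b = missing_indicator(feature_b)
print(m_a)  # [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(m_b)  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```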

2) Missingness phi coefficient

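The calculator measures missingness correlation using the phi coefficient from the 2×2 table of observed and missing states. Phi ranges from -1 to 1. It is undefined when either variable is entirely observed or entirely missing, because the denominator is then zero.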

phi = (a*d - b*c) / sqrt((a+b)(c+d)(a+c)(b+d))

a = both variables observed
b = first observed, second missing
c = first missing, second observed
d = both variables missing
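
The phi formula can be checked against the example table with a short sketch. The function name missingness_phi is hypothetical, and returning None for a zero denominator is an assumption about how an undefined result might be reported.

```python
import math

def missingness_phi(m_x, m_y):
    """Phi coefficient of two 0/1 missingness indicators (1 = missing)."""
    pairs = list(zip(m_x, m_y))
    a = pairs.count((0, 0))  # both variables observed
    b = pairs.count((0, 1))  # first observed, second missing
    c = pairs.count((1, 0))  # first missing, second observed
    d = pairs.count((1, 1))  # both variables missing
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return None  # assumed convention: phi is undefined here
    return (a * d - b * c) / denom

# Indicators for Feature_A and Feature_B in the example table.
m_a = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
m_b = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
print(missingness_phi(m_a, m_b))  # -0.2
```

For the example table the counts are a = 8, b = 2, c = 2, d = 0, so phi = (0 - 4) / sqrt(10 * 2 * 10 * 2) = -4 / 20 = -0.2.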

3) Complete-case correlation

Only rows where both variables are present are used for the value correlation.

Pearson r = sum[(xi - x̄)(yi - ȳ)] / sqrt(sum[(xi - x̄)^2] * sum[(yi - ȳ)^2])

Spearman rho = Pearson correlation of ranked complete-case values
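
A minimal sketch of both formulas on the example table follows. The helper names are hypothetical, and giving tied values the average of their ranks is an assumption; the page does not say how the calculator handles ties.

```python
import math

def complete_cases(xs, ys):
    """Keep only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def pearson(xs, ys):
    """Pearson r; assumes at least two pairs and non-constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

feature_a = [12, 15, None, 18, 20, 22, None, 25, 27, 29, 31, 34]
feature_b = [5, 7, 8, None, 11, 12, 13, None, 16, 18, 19, 21]

x, y = complete_cases(feature_a, feature_b)
r = pearson(x, y)
rho = pearson(average_ranks(x), average_ranks(y))
print(len(x), r, rho)
```

On the example table this uses 8 complete pairs. Pearson r comes out at roughly 0.999, and Spearman rho is 1 because both series increase together on every complete row.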

4) Retention and overlap metrics

Usable pair retention rate = complete pairs / total observations = a / (a + b + c + d)

Missingness overlap Jaccard = both missing / (first missing + second missing - both missing) = d / (b + c + d)
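
Both metrics can be sketched directly from the two series. The function name retention_and_jaccard is hypothetical, and returning None when a denominator is zero is an assumption made for this example.

```python
def retention_and_jaccard(xs, ys):
    """Return (usable pair retention rate, missingness overlap Jaccard)."""
    total = len(xs)
    complete = sum(1 for x, y in zip(xs, ys) if x is not None and y is not None)
    x_missing = sum(1 for x in xs if x is None)
    y_missing = sum(1 for y in ys if y is None)
    both_missing = sum(1 for x, y in zip(xs, ys) if x is None and y is None)
    union = x_missing + y_missing - both_missing  # rows with at least one gap
    retention = complete / total if total else None
    jaccard = both_missing / union if union else None
    return retention, jaccard

feature_a = [12, 15, None, 18, 20, 22, None, 25, 27, 29, 31, 34]
feature_b = [5, 7, 8, None, 11, 12, 13, None, 16, 18, 19, 21]

retention, jaccard = retention_and_jaccard(feature_a, feature_b)
print(round(retention, 4), jaccard)  # 0.6667 0.0
```

In the example table 8 of 12 rows are complete, and no row is missing both values, so the overlap is 0 out of 4 rows with at least one gap.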

How to Use This Calculator

1) Enter short names for your two variables.
2) Paste aligned values into both text areas.
3) Keep row order identical across both variables.
4) List custom missing tokens if your dataset uses them.
5) Choose Pearson or Spearman for complete-case correlation.
6) Click the calculate button to generate metrics and graphs.
7) Review retention rate before trusting complete-case estimates.
8) Export the summary with the CSV or PDF buttons.

8 FAQs

1) What does missing data correlation measure?

It measures whether the absence of one variable tends to occur with the absence of another. This helps reveal structured missingness that may distort complete-case analysis, imputation, or downstream modeling decisions.

2) Why calculate phi for missingness?

Phi is appropriate for two binary indicators, such as missing versus observed. It quantifies whether missingness patterns move together, move apart, or show little direct relationship across aligned observations.

3) When should I choose Spearman instead of Pearson?

Choose Spearman when the complete-case values are ordinal, when the relationship is monotonic but not linear, or when outliers strongly affect the data. Choose Pearson when you want to measure linear association between roughly continuous, well-behaved complete-case values.

4) Why is my value correlation unavailable?

The complete-case correlation is unavailable when there are too few usable pairs or when one variable has zero variation across the complete cases. In both situations the denominator of the correlation formula is zero, so the ratio is undefined.
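
The two failure cases can be illustrated with a guarded version of the Pearson formula. The name safe_pearson is hypothetical, and treating fewer than two pairs as "too few" is an assumption; the page does not state the calculator's exact minimum.

```python
import math

def safe_pearson(xs, ys):
    """Pearson r on complete pairs, or None when it cannot be computed."""
    n = len(xs)
    if n < 2:
        return None  # assumed threshold: zero or one pair has no spread
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant variable makes the denominator zero
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

print(safe_pearson([4.0], [9.0]))          # None (only one usable pair)
print(safe_pearson([3, 3, 3], [1, 2, 3]))  # None (first variable is constant)
print(safe_pearson([1, 2, 3], [2, 4, 6]))  # 1.0
```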

5) What does a negative missingness phi mean?

A negative phi means the two missingness indicators move in opposite directions. In practice, one variable tends to be observed when the other is missing, so the gaps rarely fall on the same rows.
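
To make the sign concrete, the sketch below evaluates the phi formula on three hypothetical 2×2 tables. The counts are invented for illustration, and the function name phi_from_counts is an assumption.

```python
import math

def phi_from_counts(a, b, c, d):
    """a = both observed, b = only second missing,
    c = only first missing, d = both missing."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else None

print(phi_from_counts(2, 0, 0, 2))  # 1.0   gaps always fall on the same rows
print(phi_from_counts(6, 2, 2, 2))  # 0.25  gaps overlap somewhat
print(phi_from_counts(0, 2, 2, 0))  # -1.0  gaps never fall on the same rows
```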

6) Is a high complete-case correlation always trustworthy?

No. A strong complete-case correlation can still be misleading if retention is low or missingness is highly structured. Always read the missingness metrics and sample retention alongside the value correlation.

7) Can this calculator help before imputation?

Yes. It helps identify whether two variables share missingness structure before you choose imputation methods, feature filtering rules, or pairwise versus complete-case strategies in preprocessing workflows.

8) Does this replace full missing data diagnostics?

No. It is a focused diagnostic for two aligned variables. Broader workflows may still need missingness maps, mechanism testing, multiple imputation checks, and model-specific sensitivity analysis.