Feature Dependence Calculator

Calculator Input

Feature A name
Feature B name
Bins for information scores
Feature A type
Feature B type
Feature A values
Feature B values

Input tips

Keep rows aligned. Row one in Feature A pairs with row one in Feature B.

Use numeric values for correlations. Use labels for category tests.
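The row alignment tip can be sketched in a few lines of Python. This is only an illustration, not the page's own code. The names parse_column and pair_rows are hypothetical. The sketch assumes one value per line and drops any row where either value is blank.

```python
def parse_column(text):
    """Split pasted text into one trimmed value per line."""
    return [line.strip() for line in text.splitlines()]


def pair_rows(text_a, text_b):
    """Pair row i of Feature A with row i of Feature B.

    Raises when the two columns differ in length. Dropping rows with a
    blank value is an assumption made for this sketch.
    """
    col_a = parse_column(text_a)
    col_b = parse_column(text_b)
    if len(col_a) != len(col_b):
        raise ValueError(f"row counts differ: {len(col_a)} vs {len(col_b)}")
    return [(a, b) for a, b in zip(col_a, col_b) if a != "" and b != ""]


# Sample columns in the same one-value-per-line layout.
pairs = pair_rows("12\n18\n24\n30\n40", "22\n29\n36\n44\n55")
```

A blank line keeps its position, so the rows after it stay aligned with their partners.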

Example Data Table

Use this small sample to test numeric dependence.

Row	Training Hours	Accuracy Score	Expected Pattern
1	12	22	Lower input, lower score
2	18	29	Rising score
3	24	36	Clear movement
4	30	44	Strong pattern
5	40	55	High input, high score

Formula Used

Pearson correlation

r = cov(X,Y) / (sd(X) × sd(Y))

It measures straight-line (linear) dependence between two numeric features. Values range from -1 to 1.
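A minimal Python sketch of this formula follows. It uses the sample rows from the example table. The function name pearson_r is an assumption for illustration, and the page may compute the score differently.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: cov(X, Y) / (sd(X) * sd(Y))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least two rows")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n-1) factors in cov and sd cancel, so raw sums are enough.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return sxy / math.sqrt(sxx * syy)


training_hours = [12, 18, 24, 30, 40]
accuracy_score = [22, 29, 36, 44, 55]
r = pearson_r(training_hours, accuracy_score)
print(f"r = {r:.4f}")
```

For the sample rows the result is just below 1, which matches the strong rising pattern in the table.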

Spearman correlation

rho = Pearson correlation of ranked X and ranked Y.

It measures monotonic dependence after replacing values with ranks.
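The rank step can be sketched as below. Tied values share the average of their positions, which is a common convention and an assumption here. The helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n. Tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation of the ranked values."""
    return pearson_r(average_ranks(x), average_ranks(y))


# A monotonic but curved relationship: y is x cubed.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
rho = spearman_rho(x, y)
```

Here rho equals 1 because the order of y follows the order of x exactly, even though the relationship is not a straight line.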

Mutual information

MI = Σ p(x,y) log( p(x,y) / (p(x)p(y)) )

Numeric values are binned before this score is estimated.
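The binning and the sum can be sketched as follows. Equal-width bins and the natural logarithm are assumptions. The page does not state its binning rule or log base, and the function names are hypothetical.

```python
import math
from collections import Counter


def equal_width_bins(values, bins):
    """Map each numeric value to a bin index from 0 to bins - 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)
    width = (high - low) / bins
    # The maximum would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), bins - 1) for v in values]


def mutual_information(x_labels, y_labels):
    """MI = sum of p(x,y) * log(p(x,y) / (p(x) p(y))), in nats."""
    n = len(x_labels)
    joint = Counter(zip(x_labels, y_labels))
    count_x = Counter(x_labels)
    count_y = Counter(y_labels)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        mi += (count / n) * math.log(count * n / (count_x[a] * count_y[b]))
    return mi


hours_bins = equal_width_bins([12, 18, 24, 30, 40], bins=2)
score_bins = equal_width_bins([22, 29, 36, 44, 55], bins=2)
mi = mutual_information(hours_bins, score_bins)
```

Only pairs that actually occur enter the sum, so empty cells never reach the logarithm. The score changes with the bin count, so record the setting used.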

Cramer’s V and eta

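V = sqrt(χ² / (n × min(r-1, c-1))). Here n is the number of paired rows, and r and c are the numbers of categories in each feature.

Eta² = SS between / SS total. Eta is the square root of Eta².

Cramer’s V handles pairs of categorical features. Eta handles mixed pairs, with one categorical and one numeric feature.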
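Both formulas can be sketched in plain Python. This version applies no bias or continuity correction, which is an assumption. The function names cramers_v and eta_squared are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cramers_v(a, b):
    """Cramer's V from the chi-square statistic of the contingency table."""
    n = len(a)
    joint = Counter(zip(a, b))
    row_totals = Counter(a)
    col_totals = Counter(b)
    chi2 = 0.0
    for row_label, row_count in row_totals.items():
        for col_label, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((row_label, col_label), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals) - 1, len(col_totals) - 1)
    if k == 0:
        raise ValueError("each feature needs at least two categories")
    return math.sqrt(chi2 / (n * k))


def eta_squared(categories, values):
    """Eta squared = SS between / SS total for a categorical-numeric pair."""
    groups = defaultdict(list)
    for label, value in zip(categories, values):
        groups[label].append(value)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("eta squared is undefined for a constant numeric feature")
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


v = cramers_v(["low", "low", "high", "high"], ["no", "no", "yes", "yes"])
eta2 = eta_squared(["a", "a", "b", "b"], [1, 1, 3, 3])
eta = math.sqrt(eta2)
```

Both toy inputs are perfectly dependent, so V and eta come out as 1. Real data with sparse categories will give noisier values.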

How to Use This Calculator

Enter the first feature values in the left text area.
Enter the paired second feature values in the next text area.
Select numeric, categorical, or auto detection for each feature.
Set bins for mutual information when numeric data is included.
Press calculate and read the summary above the form.
Use the graph, metrics, and downloads for feature selection notes.
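The auto detection option in the steps above could work roughly like this. The rule shown is an assumption, not the page's documented behavior. The name detect_type is hypothetical.

```python
def is_number(text):
    """Return True when the text parses as a float."""
    try:
        # Note: float() also accepts strings such as "nan" and "inf".
        float(text)
        return True
    except ValueError:
        return False


def detect_type(values):
    """Return "numeric" if every value parses as a float, else "categorical"."""
    if values and all(is_number(v) for v in values):
        return "numeric"
    return "categorical"


type_a = detect_type(["12", "18.5", "-3"])
type_b = detect_type(["low", "medium", "high"])
```

A rule like this treats integer codes such as region IDs as numeric. Choose the categorical type by hand in that case.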

Feature Dependence in Learning

Feature dependence describes how two inputs move together. It can be linear, ranked, grouped, or nonlinear. Strong dependence is not always bad. It can show a useful signal. Yet it can also create duplicate information. Duplicate signals may slow training. They may also make explanations unstable.

Why Dependence Matters

Machine learning models use patterns. When two columns carry the same pattern, the model may overvalue that pattern. Linear models can show unstable coefficients with inflated standard errors. Tree models may split on either feature arbitrarily, which spreads importance across both. Distance-based models, such as k-nearest neighbors, may count one concept twice. Checking dependence helps you reduce this risk before training.

Reading the Main Scores

Pearson correlation measures straight-line movement. Spearman correlation measures monotonic rank movement. Mutual information can detect wider relationships, including nonlinear ones. Cramer’s V helps when both features are categorical. Correlation ratio (eta) helps when one feature is categorical and the other is numeric. No single score explains everything. Use the graph and table beside the scores.
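The pairing of scores with data types can be written as a small lookup. This is a sketch of the rule described above. The function name and the score labels are hypothetical, and reporting mutual information for every pair is an assumption.

```python
def scores_for(type_a, type_b):
    """List the scores that fit a pair of feature types."""
    kinds = {type_a, type_b}
    if not kinds <= {"numeric", "categorical"}:
        raise ValueError("types must be 'numeric' or 'categorical'")
    if kinds == {"numeric"}:
        return ["pearson", "spearman", "mutual_information"]
    if kinds == {"categorical"}:
        return ["cramers_v", "mutual_information"]
    # One numeric and one categorical feature.
    return ["correlation_ratio", "mutual_information"]


mixed = scores_for("categorical", "numeric")
```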

Choosing a Practical Cutoff

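A value near zero usually means weak dependence. A value near one means strong dependence. Pearson and Spearman can also be negative, so judge them by absolute value. Raw mutual information is not capped at one, so compare it across pairs rather than against a fixed cutoff. Many teams review pairs above 0.70. Pairs above 0.90 often need action. The right cutoff depends on domain value, model type, and sample size. A rare but important feature may stay even if it is dependent.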
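The two cutoffs can be applied with a short helper. The labels and the function name are hypothetical. Using the absolute value is an assumption that suits signed correlations.

```python
def review_level(score, review_cutoff=0.70, action_cutoff=0.90):
    """Label a dependence score using two adjustable cutoffs."""
    strength = abs(score)
    if strength > action_cutoff:
        return "action"
    if strength > review_cutoff:
        return "review"
    return "ok"


levels = [review_level(s) for s in (0.30, -0.75, 0.95)]
```

The cutoffs are parameters, so they can be changed to fit the domain, model type, and sample size.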

Next Steps After Analysis

When dependence is high, compare business meaning first. Keep the feature that is cleaner, cheaper, or easier to explain. You can also combine features into a ratio, score, or index. For linear models, variance inflation factors (VIF) can guide removal. For nonlinear models, test performance with and without the feature. Always validate changes with a holdout set. This keeps decisions tied to model quality, not just statistics.
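Variance inflation for feature j is 1 / (1 - R²), where R² comes from regressing feature j on the other features. With exactly two features, R² is the squared Pearson correlation. The sketch below covers only that two-feature case. The function name is hypothetical.

```python
def vif_two_features(r):
    """VIF when a model has exactly two predictors with Pearson correlation r."""
    r_squared = r * r
    if r_squared >= 1.0:
        # Perfectly dependent features give unbounded variance inflation.
        return float("inf")
    return 1.0 / (1.0 - r_squared)


for r in (0.70, 0.90, 0.95):
    print(f"r = {r:.2f} -> VIF = {vif_two_features(r):.2f}")
```

Under this two-feature assumption, r = 0.90 gives a VIF of about 5.3 and r = 0.95 gives about 10.3. With more features, each VIF needs a full regression on all the others.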

Using This Page

Paste paired values in the two boxes. Select the best data type. Increase the bin count when numeric values spread over a wide range. Press calculate. Review the summary, plot, and download files. Save the report for feature selection notes. Repeat this process for important feature pairs.

Document assumptions clearly. Note missing values. Record bin choices. Recheck dependence after encoding, scaling, or feature engineering. New transformations can change relationships.

FAQs

What does feature dependence mean?

It means two features share a pattern. They may move together, rank together, group together, or share information in a nonlinear way.

Is high dependence always bad?

No. It may show a real signal. It becomes risky when two features repeat the same information and reduce model clarity.

Which score should I trust most?

Use the score that matches your data type. Use Pearson for linear numeric data, Spearman for monotonic numeric data, Cramer’s V for categories, and eta for mixed pairs.

Why is mutual information binned?

The page estimates information from frequency tables. Numeric values need ranges first, so bins convert continuous values into countable groups.

What VIF value is concerning?

Many analysts review features with a variance inflation factor (VIF) above 5. Values above 10 often suggest strong multicollinearity in linear modeling workflows.

Can I paste categorical labels?

Yes. Choose categorical or auto detect. Labels like low, medium, high, region names, or product groups can be compared.

How many rows should I use?

Use as many valid paired rows as possible. Small samples can give unstable dependence scores, especially for categories.
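One way to see this instability is a bootstrap: resample the paired rows many times and look at the spread of the score. This is not a feature of the page. It is an added illustration, and the function names, resample count, and seed are assumptions.

```python
import math
import random


def pearson_or_none(x, y):
    """Pearson r, or None when either resampled column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_range(x, y, resamples=1000, seed=0):
    """Return the 2.5th and 97.5th percentile of r over resampled row pairs."""
    rng = random.Random(seed)
    n = len(x)
    scores = []
    for _ in range(resamples):
        # Draw row indices with replacement so pairs stay aligned.
        rows = [rng.randrange(n) for _ in range(n)]
        r = pearson_or_none([x[i] for i in rows], [y[i] for i in rows])
        if r is not None:
            scores.append(r)
    if not scores:
        raise ValueError("no usable resamples")
    scores.sort()
    low = scores[int(0.025 * (len(scores) - 1))]
    high = scores[int(0.975 * (len(scores) - 1))]
    return low, high


x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
low, high = bootstrap_range(x, y)
```

A wide gap between low and high suggests the score would move a lot with a few different rows.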

Should I delete a dependent feature?

Not automatically. Compare model performance, explainability, data cost, leakage risk, and business meaning before removing any feature.