Feature Selection Basics

Lesson 36/37 | Study Time: 40 Min

Course: Enroll in Python for Data Analysis Course Online Today

Once a dataset has been cleaned and prepared, the next important question is: "Which variables should actually be used in the analysis or model?"

Not every column in a dataset is useful. Some features carry rich, meaningful information. Others are redundant, irrelevant, or even harmful to your analysis — adding noise, increasing complexity, and reducing the accuracy of any model built on top of them.

Feature selection is the process of identifying and retaining only the most relevant variables while discarding those that do not contribute meaningfully to understanding or predicting the target outcome.

It is a critical step that sits between data cleaning and modeling, and doing it well leads to simpler, faster, and more accurate analytical results.

Why Feature Selection Matters

Using every available column in a dataset might seem like the safest approach, more information should mean better results, right? In practice, the opposite is often true. Including irrelevant or redundant features creates a range of problems that directly undermine analysis quality.

1. Overfitting: Models trained on too many irrelevant features memorize noise rather than learning genuine patterns, performing well on training data but poorly on new data.

2. Increased Complexity: More features mean slower computation, harder interpretation, and more difficult maintenance.

3. Multicollinearity: Highly correlated features provide duplicate information and destabilize model coefficients.

4. Reduced Interpretability: A model built on 5 well-chosen features is far easier to explain to stakeholders than one built on 50 features.

The goal of feature selection is not to throw away data carelessly, it is to keep what genuinely matters and remove what does not.

Types of Features to Look For and Remove

Before applying formal selection techniques, a practical understanding of what kinds of features typically need to be removed helps build analytical judgment.

Method 1 — Removing Low Variance Features

A feature where nearly all values are identical across the dataset contributes almost no useful information. For example, a column where 99% of values are "Yes" tells you almost nothing that differentiates one record from another.

This shows how many unique values exist in each column. Columns with only 1 or 2 unique values across thousands of rows are strong candidates for removal.

Method 2 — Correlation-Based Selection

Highly correlated features carry redundant information. When two features have a correlation above 0.85 or 0.90, keeping both adds no new information but increases complexity. The standard approach is to identify the pair and remove one of them.

Programmatically identifying and removing highly correlated features:

This systematically identifies and removes one feature from each highly correlated pair, retaining the information while eliminating redundancy.

Method 3 — Univariate Selection Using Statistical Tests

Statistical tests measure the relationship between each individual feature and the target variable. Features with a strong statistical relationship with the target are selected — those with a weak relationship are discarded.

Scikit-learn's SelectKBest applies a chosen statistical test to each feature and selects the top K features with the strongest relationship to the target.

from sklearn.feature_selection import SelectKBest, f_classif

import pandas as pd

X = df.drop(columns=['target'])

y = df['target']

selector = SelectKBest(score_func=f_classif, k=5)

selector.fit(X, y)

# View feature scores

feature_scores = pd.DataFrame({'Feature': X.columns, 'Score': selector.scores_})

print(feature_scores.sort_values('Score', ascending=False))

# Get selected feature names

selected_features = X.columns[selector.get_support()]

print(selected_features)

Choosing the Right Statistical Test

Method 4 — Feature Importance from Tree-Based Models

Tree-based models like Random Forest naturally rank features by how much each one contributes to splitting decisions, providing a direct, data-driven measure of feature importance. This is one of the most reliable feature selection methods in practice.

Features with very low importance scores contribute little to the model and are candidates for removal. A common approach is to retain the top features that collectively account for 90–95% of total importance.

Method 5 — Dropping Irrelevant Columns Manually

Domain knowledge and common sense are just as powerful as statistical techniques. Columns like customer IDs, row numbers, timestamps with no analytical value, or fields that clearly have no relationship to the problem at hand should simply be dropped directly.

This is always the first pass of feature selection — before any statistical method is applied, remove what is obviously irrelevant based on understanding the data and the problem.

Previous Lesson Next Lesson

Blake Turner

Product Designer

Profile

Class Sessions

1- What is Data Analysis 2- Importance of Data in Decision Making 3- Overview of Python for Data Analysis 4- Setting up Python Environment (Anaconda / Jupyter Notebook) 5- Introduction to Jupyter Notebook Interface 6- Variables, Data Types, and Operators 7- Conditional Statements and Loops 8- Functions and Modules 9- Working with Lists, Tuples, Sets, and Dictionaries 10- Introduction to NumPy Arrays 11- Array Creation and Operations 12- Indexing, Slicing, and Reshaping 13- Mathematical and Statistical Functions 14- Introduction to Series and DataFrames 15- Reading and Writing Data (CSV, Excel) 16- Data Inspection (.head(), .info(), .describe()) 17- Data Selection and Filtering (.loc[], .iloc[]) 18- Handling Missing Values 19- Data Cleaning and Transformation 20- Importance of Data Visualization 21- Introduction to Matplotlib 22- Creating Line, Bar, Pie, and Histogram Charts 23- Customizing Plots (Titles, Labels, Styles) 24- Introduction to Seaborn (Basic Plots and Styling) 25- Understanding Data Distributions 26- Identifying Patterns and Trends 27- Detecting Outliers 28- Correlation and Relationships Between Variables 29- Summarizing Insights 30- Descriptive Statistics (Mean, Median, Mode) 31- Variance and Standard Deviation 32- Probability Basics 33- Correlation Concepts 34- Loading Real Datasets 35- Data Cleaning Workflow 36- Feature Selection Basics 37- Case Study: End-to-End Data Analysis