Feature selection aims to find the most informative features in a dataset. By reducing the number of features, we can simplify the model, improve its generalization ability, and reduce the computational cost. There are three main types of feature selection methods: filter methods (such as univariate selection), wrapper methods (such as recursive feature elimination), and embedded methods (such as SelectFromModel).
Univariate feature selection works by selecting the best features based on univariate statistical tests. Scikit-learn provides several such tests, including the chi-squared test for non-negative features (typically counts or categorical encodings) and the F-test for numerical features.
Recursive Feature Elimination (RFE) is a wrapper method that works by recursively removing features and building a model on the remaining features. It ranks the features based on their importance and eliminates the least important features at each iteration.
SelectFromModel is an embedded method that selects features based on the importance weights assigned by a machine learning model. For example, a linear model assigns a coefficient to each feature, and features whose absolute coefficients fall below a chosen threshold can be removed.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Apply SelectKBest to extract the top 10 features (chi2 requires non-negative values)
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(data.feature_names)
# Concatenate the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(10, 'Score'))
In this example, we use the SelectKBest class to select the top 10 features based on the chi-squared test.
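The F-test mentioned above follows the same pattern. Below is a minimal sketch, reusing X, y, data, and SelectKBest from the example above; the only change is swapping the score function to f_classif, which computes an ANOVA F-value and, unlike chi2, does not require non-negative features.
from sklearn.feature_selection import f_classif
# Swap the score function; everything else stays the same
selector = SelectKBest(score_func=f_classif, k=10)
X_top = selector.fit_transform(X, y)
print(data.feature_names[selector.get_support()])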
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Create a logistic regression model; raise max_iter so the solver
# converges on this unscaled data (the default of 100 iterations is too few)
model = LogisticRegression(max_iter=10000)
# Create the RFE model and select 10 features
rfe = RFE(model, n_features_to_select=10)
fit = rfe.fit(X, y)
# Print the selected features
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Here, we use RFE to select the top 10 features for a logistic regression model.
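Since support_ is simply a boolean mask over the columns, we can recover the names of the surviving features directly. A short follow-up, reusing data and fit from the example above:
# Map the boolean mask back to human-readable feature names
print("Selected feature names: %s" % data.feature_names[fit.support_])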
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Create a random forest classifier
model = RandomForestClassifier()
model.fit(X, y)
# Use SelectFromModel to select features; by default, features whose
# importance falls below the mean importance are dropped
selector = SelectFromModel(model, prefit=True)
X_new = selector.transform(X)
print("Original number of features: %d" % X.shape[1])
print("New number of features: %d" % X_new.shape[1])
In this example, we use SelectFromModel to select features based on the importance weights assigned by a random forest classifier.
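The coefficient-based variant described earlier works the same way. Below is a brief sketch on the same dataset, using an L1-penalized logistic regression so that weak coefficients are driven to exactly zero; the C value and the scaling step are illustrative choices, not requirements.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X, y = data.data, data.target
# The L1 penalty zeroes out weak coefficients; the liblinear solver supports it
sparse_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(sparse_model)
X_new = selector.fit_transform(StandardScaler().fit_transform(X), y)
print("Features kept: %d of %d" % (X_new.shape[1], X.shape[1]))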
Advanced feature selection techniques in Scikit-learn can significantly improve the performance of machine learning models. By understanding the core concepts, typical usage scenarios, and the different techniques available in Scikit-learn, you can effectively select the most relevant features for your projects. However, it is important to be aware of common pitfalls and to follow best practices so that the feature selection process remains robust and generalizable.
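One concrete best practice: if you select features on the full dataset before cross-validating, information from the test folds leaks into the selector and inflates your scores. A common safeguard is to put the selector inside a Pipeline so it is refit on each training fold. A minimal sketch, where k=10 and the logistic regression model are illustrative choices:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
data = load_breast_cancer()
X, y = data.data, data.target
# The selector is fit inside each training fold, so no test-fold information leaks in
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=10000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy: %.3f" % scores.mean())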