Summary
Feature selection, the removal of irrelevant or less important features that contribute little to the target variable, is one of the most important aspects of machine learning. Irrelevant or partially relevant features can degrade model performance and add substantial computational cost. Feature selection and data cleaning should be the first steps of model design. This notebook discusses different approaches to reducing the number of features. Feature selection is applied to supervised learning by removing irrelevant variables that have no impact on the target.
Python functions and data files needed to run this notebook are available via this link.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression
from scipy.stats import zscore
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from functions import *  # import the functions required to run this notebook
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
import warnings
warnings.filterwarnings('ignore')
The process of reducing the number of features used to develop a predictive model is called feature selection. It leads to:
There are two main types of feature selection techniques: supervised and unsupervised. Supervised methods span a wide range and may be divided into filter, wrapper, and intrinsic methods. For unsupervised learning, where there is no target, significant features can only be selected through dimensionality reduction.
Feature Selection for Supervised Learning removes irrelevant variables that have no impact on the target:
Filter: Select features by applying statistical measures that score each input feature on its relationship with the target.
Wrapper: Search for well-performing subsets of features. Recursive Feature Elimination (RFE) is an example of a wrapper method.
Intrinsic: Some algorithms, such as decision-tree-based algorithms, perform feature selection automatically during training. Retrieved from Brownlee, Jason.
Feature Selection for Unsupervised Learning consists only of dimensionality reduction, including PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbor Embedding), LLE (Locally Linear Embedding) and LDA (Linear Discriminant Analysis). See my Github for more information about dimensionality reduction.
In this notebook, only feature selection for Supervised Learning is presented.
Mobile Price Classification Data
The data for this work was downloaded from Data. The definitions of the variables in the dataset are:
df = pd.read_csv('./Data/train.csv')
df
The first and simplest way to select features is the standardized covariance between each feature and the target. The standardized covariance, also called the correlation coefficient $\large \rho_{XY}$, lies between -1 and +1.
$\large \rho_{XY}=\frac{C_{XY}}{\sqrt{\sigma_{X}^{2}\sigma_{Y}^{2}}}$ where $C_{XY}$ is the covariance between $X$ and $Y$, and $\sigma_{X}^{2}$ and $\sigma_{Y}^{2}$ are the variances of $X$ and $Y$. The covariance measures how the two variables vary together.
If the absolute value of $\rho_{XY}$ is close to 1 ($\rho_{XY} \approx -1$ or $1$), the two variables are strongly linearly correlated. If $\rho_{XY}$ is close to zero ($\rho_{XY} \approx 0$), the two variables may be independent. However, $\rho_{XY} \approx 0$ does not necessarily imply independence between the two variables (see Figure below). Therefore, the correlation coefficient cannot select features that have a non-linear relationship with the target.
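The limitation described above can be checked numerically. The sketch below is only an illustration on made-up data (the names x_demo and y_demo are not part of the notebook's dataset): y is completely determined by x through a quadratic relationship, yet the correlation coefficient computed from the formula above comes out at essentially zero.

```python
import numpy as np

# Illustrative data, not from the notebook: inputs symmetric around zero.
x_demo = np.linspace(-1.0, 1.0, 201)
y_demo = x_demo ** 2  # y depends entirely on x, but not linearly

# Correlation coefficient: covariance divided by the square root
# of the product of the two variances (population versions, ddof=0).
cov_xy = np.mean((x_demo - x_demo.mean()) * (y_demo - y_demo.mean()))
rho = cov_xy / np.sqrt(x_demo.var() * y_demo.var())

print(f"correlation coefficient: {rho:.6f}")
```

A feature like x_demo would be discarded by a correlation filter even though it predicts the target perfectly, which is why the correlation coefficient should not be the only criterion.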
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
corr=df.corr()
corr=corr['price_range'].drop(['price_range'])
coefs=corr.values
features_colums=list(corr.index)
Correlation_plot.corr_bar(coefs,clmns=features_colums,select=False,yfontsize=6.0,title=f'Linear Correlation with price_range',
ymax_vert_lin=30,xlim = [-0.5, 0.95])
ram is the feature most highly correlated with price range, followed by battery power, width and pixel height.
The scikit-learn library implements most of the useful correlation-based statistical measures, as listed below:
f_regression()
The F value ($\large \frac{\sigma_{1}^{2}}{\sigma_{2}^{2}}$) in regression is the result of a test whose null hypothesis is that all of the regression coefficients are equal to zero, in other words that the model has no predictive capability. The F-test compares the fitted model with the intercept-only model (zero predictor variables) and decides whether the added coefficients improved the fit. A significant result means that the coefficients included in the model improved its fit. Read the p-value first: if it is small (less than the chosen alpha level), the null hypothesis can be rejected. In scikit-learn, f_regression() applies this test to each feature separately and returns one F statistic and one p-value per feature.
x=df.drop(columns='price_range',axis=1)
y=df['price_range']
f_stat,p_values=f_regression(x,y)
sorted_zipped_lists = sorted(zip(list(p_values),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(importance, title=f'Feature Selection by F Value in Regression (f_regression())',
ylabel='p-value',clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(perent=False,fontsizelable=6,xshift=-0.15,axt=ax1,
yshift=0.05,ylim=[0,1.5], xlim=[-0.5,20.], y_rot=90)
loc=list(importance).index(max(importance[importance<0.2]))  # position of the largest p-value below 0.2
plt.axvspan(-1,loc, facecolor='g', alpha=0.2,label='Selected Features')
plt.legend(loc=1,fontsize=7)
plt.show()
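To connect f_regression() with the correlation coefficient discussed earlier, the following sketch recomputes its F statistic by hand. It uses synthetic data from make_regression (X_demo and y_demo are illustrative names, not the notebook's data) and assumes the default centering of f_regression(), for which the statistic of each feature is r²/(1 − r²)·(n − 2), with r the Pearson correlation between that feature and the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Synthetic regression data used only for this illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, n_informative=2,
                                 noise=10.0, random_state=0)

f_stat_demo, p_values_demo = f_regression(X_demo, y_demo)

# Pearson correlation of each feature with the target.
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])

# Univariate F statistic with n - 2 degrees of freedom.
n = X_demo.shape[0]
f_manual = r ** 2 / (1.0 - r ** 2) * (n - 2)

print(np.allclose(f_stat_demo, f_manual))
```

Because the statistic is a monotonic function of r², ranking features by f_regression() gives the same order as ranking them by absolute linear correlation with the target.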
f_classif()
ANOVA stands for Analysis of Variance. It applies an F-test to compare the means of different groups by comparing the variance between groups with the variance within groups.
f_stat,p_values=f_classif(x,y)
sorted_zipped_lists = sorted(zip(list(p_values),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(importance, title=f'Feature Selection by ANOVA (f_classif())',
ylabel='p-value',clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(perent=False,fontsizelable=6,xshift=-0.15,axt=ax1,
yshift=0.05,ylim=[0,1.5], xlim=[-0.5,20.], y_rot=90)
loc=list(importance).index(max(importance[importance<0.2]))  # position of the largest p-value below 0.2
plt.axvspan(-1,loc, facecolor='g', alpha=0.2,label='Selected Features')
plt.legend(loc=1,fontsize=7)
plt.show()
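As a sanity check on what f_classif() computes, the sketch below compares it with a one-way ANOVA run feature by feature using scipy.stats.f_oneway, where the groups are the samples of each class. The data are synthetic (make_classification) and the *_demo names are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic three-class data used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)

f_stat_demo, p_values_demo = f_classif(X_demo, y_demo)

# One-way ANOVA per feature: one group of values for each class label.
classes = np.unique(y_demo)
f_manual = np.array([
    f_oneway(*[X_demo[y_demo == c, j] for c in classes]).statistic
    for j in range(X_demo.shape[1])
])

print(np.allclose(f_stat_demo, f_manual))
```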
chi2()
It calculates the chi-squared statistic between each non-negative feature and the class.
f_stat,p_values=chi2(x,y)
sorted_zipped_lists = sorted(zip(list(p_values),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(importance, title=f'Feature Selection by Chi-Squared (chi2())',
ylabel='p-value',clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(perent=False,fontsizelable=6,xshift=-0.15,axt=ax1,
yshift=0.05,ylim=[0,1.5], xlim=[-0.5,20.], y_rot=90)
loc=list(importance).index(max(importance[importance<0.2]))  # position of the largest p-value below 0.2
plt.axvspan(-1,loc, facecolor='g', alpha=0.2,label='Selected Features')
plt.legend(loc=1,fontsize=7)
plt.show()
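The non-negativity requirement of chi2() matters in practice. The sketch below, on random illustrative data (X_demo and y_demo are not the notebook's data), shows that chi2() rejects input containing negative values and that one possible workaround is to rescale the features to [0, 1] first. Rescaling is an assumption made here for illustration; the test is primarily meant for counts, frequencies or boolean features.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))      # standard normal draws include negative values
y_demo = rng.integers(0, 2, size=100)   # random binary class labels

# chi2() raises ValueError when any feature value is negative.
try:
    chi2(X_demo, y_demo)
    raised = False
except ValueError:
    raised = True
print("negative input rejected:", raised)

# After rescaling every feature to [0, 1], the test can be computed.
X_scaled = MinMaxScaler().fit_transform(X_demo)
chi2_stat_demo, p_values_demo = chi2(X_scaled, y_demo)
print(p_values_demo.shape)
```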
SelectKBest removes all but the k highest-scoring features. The scoring functions described above are used to score and select the features.
# k is number of highest scoring features
X_new = SelectKBest(score_func=f_classif, k=2).fit_transform(x, y)
X_new
The score_func depends on the task: f_regression for regression, and f_classif or chi2 for classification.
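fit_transform() returns a plain array, so the names of the selected columns are lost. A minimal sketch of how to recover them with get_support() is shown below; the DataFrame and its feat_* column names are made up for the illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data wrapped in a DataFrame with hypothetical column names.
X_arr, y_demo = make_classification(n_samples=200, n_features=6, n_informative=2,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected = list(X_demo.columns[selector.get_support()])
print("selected columns:", selected)
```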
RFE is a common technique for feature selection because it is easy to configure and use, and it is effective at selecting the features most relevant for predicting the target. RFE is a wrapper-type feature selection method, meaning that a separate machine learning algorithm is given and used at the core of the method. This contrasts with filter-based feature selection, which scores each feature and selects those with the largest scores (Brownlee, Jason).
RFE uses an external estimator to assign weights to features (e.g., the coefficients of a linear model) and selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through a specific attribute or a callable. Then, the least important features are eliminated from the current set. This procedure is repeated recursively on the reduced set until the desired number of features is reached.
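The elimination loop just described can be written out by hand. The sketch below is a simplified illustration, not scikit-learn's implementation: it assumes a binary problem, uses LogisticRegression as the external estimator, takes the absolute coefficient as the importance, and drops one feature per iteration. The data and all variable names are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                     n_redundant=0, random_state=0)

estimator = LogisticRegression(max_iter=1000)
remaining = list(range(X_demo.shape[1]))  # indices of features still in play
n_features_to_select = 3

while len(remaining) > n_features_to_select:
    # Refit a fresh copy of the estimator on the current feature subset.
    model = clone(estimator).fit(X_demo[:, remaining], y_demo)
    # Binary problem: coef_ has shape (1, n_remaining), one weight per feature.
    importance = np.abs(model.coef_).ravel()
    # Eliminate the least important feature and repeat.
    del remaining[int(np.argmin(importance))]

print("kept feature indices:", remaining)
```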
The example below from scikit-learn shows how to select features with RFE using a random forest (an ensemble of decision trees) as the estimator.
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=3, step=1)
# fit the model
rfe.fit(x, y)
# transform the data
x_reduced = rfe.transform(x)
x_reduced
selector = rfe.fit(x, y)
rank=selector.ranking_
sorted_zipped_lists = sorted(zip(list(rank),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(importance, title=f'Feature Selection by Recursive Feature Elimination (RFE)',
ylabel='Rank',clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(perent=False,xshift=-0.15,axt=ax1,
yshift=0.05,ylim=[0,20], xlim=[-0.5,20.], y_rot=90, reverse=False, fontsizelable=False)
plt.show()
The optimum number of features can be found by evaluating the model with repeated stratified k-fold cross-validation for each number of features.
font = {'size' : 9}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 120, facecolor='w', edgecolor='k')
names=[]
results=[]
for ir in range(2, 11):
    names.append(str(ir)+' Features')
    rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=ir, step=1)
    # fit the model
    rfe.fit(x, y)
    model = RandomForestClassifier()
    pipeline = Pipeline(steps=[('select',rfe),('model',model)])
    # evaluate model
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    results.append(n_scores)
# Compare Algorithms
plt.ylim((0.6,0.95))
plt.boxplot(results,meanline=True, labels=names)
plt.xticks(rotation=90)
plt.xlabel('Number of Features',fontsize=12)
plt.ylabel('Accuracy',fontsize=12)
ax1.grid(linewidth='0.1')
ax1.xaxis.grid(color='k', linestyle='-', linewidth=0.2)
plt.title('Box Plot to Compare the Performance (Accuracy) \n for Different Number of Significant Features',fontsize=12)
plt.show()
From the figure above, accuracy does not change significantly beyond 4 features, so 4 features should be optimal.
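As an alternative to looping over the number of features by hand, scikit-learn also provides RFECV, which runs RFE inside cross-validation and keeps the feature count with the best score. This is a different tool from the loop used in this notebook, shown here only as a sketch on synthetic data with LogisticRegression as an assumed estimator; the result it gives on the mobile price data is not evaluated here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     n_redundant=0, random_state=0)

rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
              step=1,                          # remove one feature per iteration
              cv=StratifiedKFold(n_splits=5),
              scoring='accuracy',
              min_features_to_select=2)
rfecv.fit(X_demo, y_demo)

print("number of features chosen by cross-validation:", rfecv.n_features_)
print("mask of selected features:", rfecv.support_)
```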
SelectFromModel can be used with any estimator that assigns an importance to each feature through a specific attribute (such as coef_ or feature_importances_). Features whose importance falls below the provided threshold parameter are considered unimportant and eliminated. There are built-in heuristics for finding a threshold using a string argument https://scikit-learn.org/stable/modules/feature_selection.html.
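The string heuristics for the threshold can be sketched as follows. The example uses synthetic data and an ExtraTreesClassifier chosen for illustration; "mean" keeps features whose importance is at least the mean importance, and a scaled form such as "1.25*mean" makes the cut stricter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=0)
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Keep features whose importance is at least the mean importance.
sel_mean = SelectFromModel(forest, prefit=True, threshold='mean')
# A scaled heuristic raises the bar to 1.25 times the mean importance.
sel_strict = SelectFromModel(forest, prefit=True, threshold='1.25*mean')

n_mean = sel_mean.transform(X_demo).shape[1]
n_strict = sel_strict.transform(X_demo).shape[1]
print("kept with 'mean':", n_mean, "| kept with '1.25*mean':", n_strict)
```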
Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. Features with zero coefficients can therefore be removed to reduce the dimensionality of the data before using another classifier. Lasso is used for regression, and LogisticRegression and LinearSVC for classification.
Lasso, or L1 regularization, is a regularization technique for Linear Regression: it adds a regularization term to the cost function, which is the sum of the absolute values of the weights:
$MSE(\mathbf{w})+\alpha \sum_{i=1}^{n}|\mathbf{w}_{i}|$
Lasso completely removes the least important features by setting their weights to zero. So, feature selection is applied automatically by Lasso regularization.
print('all features:', x.shape)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(x, y)
model = SelectFromModel(lsvc, prefit=True)
x_new = model.transform(x)
print('Selected features:', x_new.shape)
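The sparsity produced by the L1 penalty can be observed directly on a regression problem. The sketch below fits Lasso on synthetic data in which only 3 of 10 features carry signal; the value alpha=1.0 and the standardization step are assumptions made for the illustration, and the *_demo names are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 10 features, only 3 of them informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)
X_std = StandardScaler().fit_transform(X_demo)  # put all features on the same scale

lasso = Lasso(alpha=1.0).fit(X_std, y_demo)

# Features whose weight was not shrunk to exactly zero.
kept = np.flatnonzero(lasso.coef_)
print("non-zero coefficients at feature indices:", kept)
print("features removed by the L1 penalty:", X_std.shape[1] - kept.size)
```

Increasing alpha strengthens the penalty and drives more weights to zero, while alpha close to zero approaches ordinary linear regression with no selection.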
Tree-based estimators can be used to compute impurity-based feature importances, which in turn can be used to eliminate irrelevant features.
exttrs = ExtraTreesClassifier(n_estimators=100)
exttrs = exttrs.fit(x, y)
sorted_zipped_lists = sorted(zip(list(exttrs.feature_importances_),list(x.columns)), reverse=True)
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(importance, title=f'Feature Selection by Tree-Based Feature Importance',
ylabel='Percentage of Importance',clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(perent=True,xshift=-0.15,axt=ax1,
yshift=0.015,ylim=[0,0.50], xlim=[-0.5,20], y_rot=90, reverse=True, fontsizelable=6)
plt.show()
print('all features:', x.shape)
model = SelectFromModel(exttrs, prefit=True)
x_new = model.transform(x)
print('Selected features:', x_new.shape)