Summary

Feature selection, which means removing irrelevant or less important features that contribute little to the target variable, is one of the most important aspects of machine learning. Irrelevant or partially relevant features can degrade model performance and add substantial computational cost. Feature selection and data cleaning should be among the first steps of model design. This notebook discusses different approaches to reducing the number of features. Feature selection is applied here to Supervised Learning, by removing variables that have little or no impact on the target.

Python functions and data files needed to run this notebook are available via this link.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression
from scipy.stats import zscore
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from functions import *  # import the helper functions required to run this notebook
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
import warnings
warnings.filterwarnings('ignore')

Introduction

The process of reducing the number of features used to develop a predictive model is called feature selection. Its benefits are:

  1. Mitigate Overfitting. Fewer redundant features means less chance of making decisions based on noise.
  2. Reduce Computational Cost. Algorithms train faster when they are fed fewer features.
  3. Improve Performance. Model performance improves when misleading data is removed.

There are two main types of feature selection techniques: supervised and unsupervised. Supervised methods cover a wide range and may be divided into wrapper, filter and intrinsic. For unsupervised learning there is no target, so significant features can only be obtained through dimensionality reduction.

Feature Selection for Supervised Learning removes irrelevant variables that have no impact on the target:

  • Filter: Select features by applying statistical measures that score each input feature based on its relationship with the target.

  • Wrapper: Look for well-performing subsets of features. Recursive Feature Elimination (RFE) is an example of a wrapper.

  • Intrinsic: Some algorithms, such as Decision Tree-based algorithms, automatically perform feature selection during training. Retrieved from Brownlee, Jason.

  • Multicollinearity: A condition where a predictor variable (independent variable) correlates with another predictor is called multicollinearity. It is a problem because it undermines the statistical significance of an independent variable, reduces the precision of the estimated coefficients, and affects interpretability. Therefore, it should be removed before applying Supervised Learning. See my Github for more information about multicollinearity.
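
As a rough illustration of how multicollinearity can be detected, the sketch below flags feature pairs whose absolute pairwise correlation exceeds a cutoff. The synthetic columns a, b and c, and the 0.9 cutoff, are assumptions made for this example and are not taken from the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): b is almost a linear copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.05, size=500),
    "c": rng.normal(size=500),
})

corr_abs = demo.corr().abs()
threshold = 0.9  # hypothetical cutoff; tune it for your data

# Collect each pair once (upper triangle only) when its correlation is too high.
cols = list(corr_abs.columns)
collinear_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr_abs.iloc[i, j] > threshold
]
print(collinear_pairs)
```

One feature from each flagged pair is then a candidate for removal.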

Feature Selection for Unsupervised Learning relies only on dimensionality reduction, including PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbor Embedding) and LLE (Locally Linear Embedding). LDA (Linear Discriminant Analysis) is also a dimensionality reduction technique, although it makes use of class labels. See my Github for more information about dimensionality reduction.

In this notebook, only feature selection for Supervised Learning is presented.

Mobile Price Classification Data

The data for this work is downloaded from Data. The definitions of the variables in the dataset are:

  • battery_power: Total energy a battery can store in one time measured in mAh
  • blue: Has Bluetooth or not
  • clock_speed: the speed at which microprocessor executes instructions
  • dual_sim: Has dual sim support or not
  • fc: Front Camera megapixels
  • four_g: Has 4G or not
  • int_memory: Internal Memory in Gigabytes
  • m_dep: Mobile Depth in cm
  • mobile_wt: Weight of mobile phone
  • n_cores: Number of cores of the processor
  • pc: Primary Camera megapixels
  • px_height: Pixel Resolution Height
  • px_width: Pixel Resolution Width
  • ram: Random Access Memory in MegaBytes
  • sc_h: Screen Height of mobile in cm
  • sc_w: Screen Width of mobile in cm
  • talk_time: the longest time that a single battery charge will last when you are
  • three_g: Has 3G or not
  • touch_screen: Has touch screen or not
  • wifi: Has wifi or not
  • price_range: This is the target variable with a value of 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost).
In [2]:
df = pd.read_csv('./Data/train.csv')
df
Out[2]:
      battery_power  blue  clock_speed  dual_sim   fc  four_g  int_memory  m_dep  mobile_wt  n_cores  ...  px_height  px_width   ram  sc_h  sc_w  talk_time  three_g  touch_screen  wifi  price_range
   0            842     0          2.2         0    1       0           7    0.6        188        2  ...         20       756  2549     9     7         19        0             0     1            1
   1           1021     1          0.5         1    0       1          53    0.7        136        3  ...        905      1988  2631    17     3          7        1             1     0            2
   2            563     1          0.5         1    2       1          41    0.9        145        5  ...       1263      1716  2603    11     2          9        1             1     0            2
   3            615     1          2.5         0    0       0          10    0.8        131        6  ...       1216      1786  2769    16     8         11        1             0     0            2
   4           1821     1          1.2         0   13       1          44    0.6        141        2  ...       1208      1212  1411     8     2         15        1             1     0            1
 ...            ...   ...          ...       ...  ...     ...         ...    ...        ...      ...  ...        ...       ...   ...   ...   ...        ...      ...           ...   ...          ...
1995            794     1          0.5         1    0       1           2    0.8        106        6  ...       1222      1890   668    13     4         19        1             1     0            0
1996           1965     1          2.6         1    0       0          39    0.2        187        4  ...        915      1965  2032    11    10         16        1             1     1            2
1997           1911     0          0.9         1    1       1          36    0.7        108        8  ...        868      1632  3057     9     1          5        1             1     0            3
1998           1512     0          0.9         0    4       1          46    0.1        145        5  ...        336       670   869    18    10         19        1             1     1            0
1999            510     1          2.0         1    5       1          45    0.9        168        6  ...        483       754  3919    19     4          2        1             1     1            3

2000 rows × 21 columns

Filter

The first and simplest way to select features is the standardized covariance between each feature and the target. The standardized covariance, also called the correlation coefficient $\large \rho_{XY}$, lies between -1 and +1.

$\large \rho_{XY}=\frac{C_{XY}}{\sqrt{\sigma_{X}^{2}\sigma_{Y}^{2}}}$ where $C_{XY}$ is the covariance between $X$ and $Y$, and $\sigma_{X}^{2}$ and $\sigma_{Y}^{2}$ are the variances of the $X$ and $Y$ variables. The covariance measures how two variables vary together.

If the absolute value of $\rho_{XY}$ is close to 1 ($\rho_{XY} \approx -1$ or $1$), the two variables are strongly linearly correlated. If $\rho_{XY}$ is close to zero ($\rho_{XY} \approx 0$), the two variables may be independent. However, $\rho_{XY} \approx 0$ does not necessarily imply independence between the two variables (see Figure below). Therefore, the correlation coefficient cannot select features that have a non-linear relationship with the target.
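
The limitation described above can be seen in a small sketch. The arrays x_demo and y_demo are made up for illustration: y_demo is fully determined by x_demo, yet the correlation coefficient is essentially zero because the relationship is symmetric and non-linear.

```python
import numpy as np

# y depends on x exactly, but through a symmetric non-linear function.
x_demo = np.linspace(-1.0, 1.0, 201)
y_demo = x_demo ** 2

# Pearson correlation: covariance divided by the product of standard deviations.
rho = np.corrcoef(x_demo, y_demo)[0, 1]
print(f"rho = {rho:.3f}")
```

A filter based only on this coefficient would discard x_demo even though it predicts y_demo perfectly.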

In [3]:
font = {'size'   : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')

corr=df.corr()
corr=corr['price_range'].drop(['price_range'])
coefs=corr.values
features_colums=list(corr.index)
Correlation_plot.corr_bar(coefs,clmns=features_colums,select=False,yfontsize=6.0,title=f'Linear Correlation with price_range',
                          ymax_vert_lin=30,xlim = [-0.5, 0.95])

ram is the feature most highly correlated with price_range, followed by battery_power, px_width and px_height.

The scikit-learn library provides implementations of the most useful correlation-based statistical measures, as shown below:

F Value in Regression: f_regression()

The F value ($\large \frac{\sigma_{1}^{2}}{\sigma_{2}^{2}}$) in regression is the result of a test where the null hypothesis is that all of the regression coefficients are equal to zero. In other words, the model has no predictive capability. Basically, the F-test compares your model with a model that has zero predictor variables (the intercept-only model), and decides whether your added coefficients improved the model. If you get a significant result, then the coefficients you included improved the model's fit. To decide, look at the p-value: if it is small (less than your alpha level), you can reject the null hypothesis. f_regression() runs this test separately for each feature, using a univariate linear regression of the target on that feature.
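
To make the link between correlation and the F value concrete, the sketch below recomputes the statistic by hand on synthetic data and compares it with f_regression(). It assumes the default centering of f_regression(), under which the statistic for a single feature is $F = \frac{r^{2}}{1-r^{2}}(n-2)$ with $r$ the Pearson correlation; X_demo and y_demo are made-up arrays.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 3))
# Assumption: only the first column drives the target.
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n_samples)

f_sklearn, p_sklearn = f_regression(X_demo, y_demo)

# Manual version: Pearson r per feature, then F = r^2 / (1 - r^2) * (n - 2).
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])])
f_manual = r ** 2 / (1.0 - r ** 2) * (n_samples - 2)

print(np.round(f_sklearn, 2))
print(np.round(f_manual, 2))
```

Because the score is a monotonic function of $r^{2}$, ranking features by this F value gives the same order as ranking them by absolute linear correlation.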

In [4]:
x=df.drop(columns='price_range')
y=df['price_range']
In [5]:
f_stat,p_values=f_regression(x,y)
sorted_zipped_lists = sorted(zip(list(p_values),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
In [6]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

_=prfrmnce_plot(importance, title=f'Feature Selection by F Value in Regression (f_regression())', 
            ylabel='p-value',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=False,fontsizelable=6,xshift=-0.15,axt=ax1,
            yshift=0.05,ylim=[0,1.5], xlim=[-0.5,20.], y_rot=90)
# index of the largest p-value that is still below 0.2
loc=list(importance).index(max(importance[importance<0.2]))
plt.axvspan(-1,loc, facecolor='g', alpha=0.2,label='Selected Features')
plt.legend(loc=1,fontsize=7)
plt.show()

ANOVA: f_classif()

ANOVA stands for Analysis of Variance. It applies an F-test to check whether the means (or averages) of different groups differ, by comparing the variance between groups with the variance within groups.
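
As a sanity check on what f_classif() computes, the sketch below runs a one-way ANOVA per feature with scipy.stats.f_oneway and compares the F values. The three-class synthetic data is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 50)                  # three classes, 50 samples each
X_demo = np.column_stack([
    y_demo + rng.normal(size=150),                 # class means differ
    rng.normal(size=150),                          # same mean in every class
])

f_sklearn, p_sklearn = f_classif(X_demo, y_demo)

# One-way ANOVA per feature: split the column by class and compare group means.
f_scipy = np.array([
    f_oneway(*[X_demo[y_demo == k, j] for k in np.unique(y_demo)]).statistic
    for j in range(X_demo.shape[1])
])
print(np.round(f_sklearn, 2), np.round(f_scipy, 2))
```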

In [7]:
f_stat,p_values=f_classif(x,y)
sorted_zipped_lists = sorted(zip(list(p_values),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
In [8]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

_=prfrmnce_plot(importance, title=f'Feature Selection by ANOVA (f_classif())', 
            ylabel='p-value',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=False,fontsizelable=6,xshift=-0.15,axt=ax1,
            yshift=0.05,ylim=[0,1.5], xlim=[-0.5,20.], y_rot=90)
# index of the largest p-value that is still below 0.2
loc=list(importance).index(max(importance[importance<0.2]))
plt.axvspan(-1,loc, facecolor='g', alpha=0.2,label='Selected Features')
plt.legend(loc=1,fontsize=7)
plt.show()

Chi-Squared: chi2()

chi2() calculates the chi-squared statistic between each non-negative feature and the class.
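
The non-negativity requirement can be illustrated with a small sketch on synthetic count data (an assumption, not the notebook's dataset). It also shows that chi2() raises a ValueError when a feature contains negative values, so such features would need to be shifted or rescaled first.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 4, size=400)              # four classes
X_demo = np.column_stack([
    10 * y_demo + rng.poisson(5, size=400),        # counts that grow with the class
    rng.poisson(5, size=400),                      # counts unrelated to the class
])

chi2_stat, p_values = chi2(X_demo, y_demo)
print(np.round(chi2_stat, 1), p_values)

# Negative inputs are rejected, so shift or rescale such features first.
try:
    chi2(X_demo - 100, y_demo)
    negative_rejected = False
except ValueError:
    negative_rejected = True
print(negative_rejected)
```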

In [9]:
f_stat,p_values=chi2(x,y)
sorted_zipped_lists = sorted(zip(list(p_values),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
In [10]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

_=prfrmnce_plot(importance, title=f'Feature Selection by Chi-Squared (chi2())', 
            ylabel='p-value',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=False,fontsizelable=6,xshift=-0.15,axt=ax1,
            yshift=0.05,ylim=[0,1.5], xlim=[-0.5,20.], y_rot=90)
# index of the largest p-value that is still below 0.2
loc=list(importance).index(max(importance[importance<0.2]))
plt.axvspan(-1,loc, facecolor='g', alpha=0.2,label='Selected Features')
plt.legend(loc=1,fontsize=7)
plt.show()

SelectKBest

This approach removes all but the k highest-scoring features. Any of the scoring functions mentioned before can be used to score the features.

In [11]:
# k is number of highest scoring features
X_new = SelectKBest(score_func=f_classif, k=2).fit_transform(x, y)
X_new
Out[11]:
array([[ 842., 2549.],
       [1021., 2631.],
       [ 563., 2603.],
       ...,
       [1911., 3057.],
       [1512.,  869.],
       [ 510., 3919.]])

The score_func options for regression and classification are:

  • For regression: f_regression, mutual_info_regression
  • For classification: chi2, f_classif, mutual_info_classif
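
Since fit_transform() returns a bare array, it can be useful to recover the names of the selected columns. A minimal sketch, assuming a small synthetic DataFrame with hypothetical column names, uses get_support() for that:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2, 3], 75)
X_demo = pd.DataFrame({
    "strong": 2.0 * y_demo + rng.normal(size=300),   # strongly tied to the class
    "weak": 0.5 * y_demo + rng.normal(size=300),     # weakly tied to the class
    "noise_1": rng.normal(size=300),
    "noise_2": rng.normal(size=300),
})

selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
selected = list(X_demo.columns[selector.get_support()])
print(selected)
```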

Wrapper

Recursive Feature Elimination (RFE)

RFE is a common technique for feature selection because it is easy to configure and use, and it is effective at selecting the features most relevant for predicting the target. RFE is a wrapper-type feature selection method, which means that a different machine learning algorithm is given and used at the core of the method. This is in contrast to filter-based feature selection, which scores each feature and selects those features with the largest scores (Brownlee, Jason).

RFE uses an external estimator that assigns weights to features (e.g., the coefficients of a linear model). Its goal is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained through a specific attribute or a callable. Then, the least important features are eliminated from the current set of features. That procedure is recursively repeated on the reduced feature set until the desired number of features is reached.
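
The recursive procedure described above can be sketched by hand. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: it uses LogisticRegression as the external estimator, squared coefficients as the importance, removes one feature per round, and runs on synthetic data where only columns 0 and 1 matter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
# Assumption: only columns 0 and 1 determine the class.
y_demo = (2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] > 0).astype(int)

n_features_to_select = 2
remaining = list(range(X_demo.shape[1]))   # indices of features still in play

while len(remaining) > n_features_to_select:
    estimator = LogisticRegression().fit(X_demo[:, remaining], y_demo)
    # Importance of each remaining feature: squared coefficient, summed over rows of coef_.
    importance = (estimator.coef_ ** 2).sum(axis=0)
    # Drop the single least important feature (one feature per round).
    del remaining[int(np.argmin(importance))]

print(remaining)
```

Each round refits the estimator on the reduced set, which is why RFE costs more than a filter method that scores every feature once.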

The example below, adapted from scikit-learn, shows how to select features by RFE with a random forest (an ensemble of decision trees) as the estimator.

In [12]:
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=3, step=1)
# fit the model
rfe.fit(x, y)
# transform the data
x_reduced = rfe.transform(x)
x_reduced
Out[12]:
array([[ 842.,   20., 2549.],
       [1021.,  905., 2631.],
       [ 563., 1263., 2603.],
       ...,
       [1911.,  868., 3057.],
       [1512.,  336.,  869.],
       [ 510.,  483., 3919.]])
In [13]:
selector = rfe.fit(x, y)
rank=selector.ranking_
sorted_zipped_lists = sorted(zip(list(rank),list(x.columns)))
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
In [14]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

_=prfrmnce_plot(importance, title=f'Feature Selection by Recursive Feature Elimination (RFE)', 
            ylabel='Rank',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=False,xshift=-0.15,axt=ax1,
            yshift=0.05,ylim=[0,20], xlim=[-0.5,20.], y_rot=90, reverse=False, fontsizelable=False)
plt.show()

Optimum Number of Features

The optimum number of features can be found by evaluating the model with repeated stratified k-fold cross-validation for each candidate number of features.

In [15]:
font = {'size'   : 9}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 120, facecolor='w', edgecolor='k')
names=[]
results=[]
for ir in range(2, 11):
    names.append(str(ir)+' Features')
    rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=ir, step=1)
    
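    # no separate rfe.fit() is needed here: cross_val_score clones the pipeline
    # and refits RFE on the training folds of every split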
    model = RandomForestClassifier()
    pipeline = Pipeline(steps=[('select',rfe),('model',model)])
    
    # evaluate model
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    results.append(n_scores)

# Compare Algorithms
plt.ylim((0.6,0.95))
plt.boxplot(results,meanline=True, labels=names)
plt.xticks(rotation=90)
plt.xlabel('Number of Features',fontsize=12)
plt.ylabel('Accuracy',fontsize=12)
ax1.grid(linewidth='0.1')
ax1.xaxis.grid(color='k', linestyle='-', linewidth=0.2)
plt.title('Box Plot to Compare the Performance (Accuracy) \n for Different Number of Significant Features',fontsize=12)
plt.show()    

From the figure above, accuracy does not change significantly beyond 4 features. So, 4 features should be optimum.
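
As an alternative to the manual loop, scikit-learn also provides RFECV, which runs RFE inside cross-validation and picks the number of features itself. The sketch below is only an illustration: it uses synthetic data from make_classification and a single decision tree to keep it fast, both of which are assumptions rather than the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

rfecv = RFECV(
    estimator=DecisionTreeClassifier(random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
    min_features_to_select=2,
)
rfecv.fit(X_demo, y_demo)

print("Number of features chosen by cross-validation:", rfecv.n_features_)
print("Mask of selected features:", rfecv.support_)
```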

Intrinsic

SelectFromModel can be used with any estimator that assigns an importance to each feature through a specific attribute (such as coef_ or feature_importances_). Features are considered unimportant and removed if their importance values are below the provided threshold parameter. Besides a numerical threshold, there are built-in heuristics for finding a threshold using a string argument, such as "mean" or "median" (https://scikit-learn.org/stable/modules/feature_selection.html).
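
A short sketch of the string-based threshold follows. It assumes a random forest on synthetic data with six features, of which only column 0 is informative; with threshold="median", the features whose importance is at or above the median importance are kept.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 150)
X_demo = rng.normal(size=(300, 6))
X_demo[:, 0] += 2.0 * y_demo          # assumption: only column 0 is informative

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# "median" keeps the features whose importance is at or above the median importance.
selector = SelectFromModel(forest, threshold="median").fit(X_demo, y_demo)
X_selected = selector.transform(X_demo)

print(selector.threshold_)
print(selector.get_support())
print(X_selected.shape)
```

A relative threshold like this always keeps roughly half of the features, so it says nothing about whether the retained ones are actually useful.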

L1-based Feature Selection

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. Therefore, features with zero coefficients can be removed to reduce the dimensionality of the data before using it with another classifier. Lasso is used for regression, and LogisticRegression and LinearSVC are used for classification.

Lasso, or L1, is a regularization technique for Linear Regression: it adds a regularization term to the cost function, which is the sum of the absolute values of the weights:

$MSE(\mathbf{w})+\alpha \sum_{i=1}^{n}|w_{i}|$

Lasso tends to completely remove the weights of the least important features (the values of those weights are set to zero). So, feature selection is automatically applied by Lasso Regularization.
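
The zeroing effect can be observed directly on a regression problem. The sketch below is an illustration on synthetic data where the target depends on columns 0 and 1 only; alpha=0.5 is a hypothetical value chosen for the example. Note that scikit-learn's Lasso scales the squared-error term by $\frac{1}{2n}$, so its alpha is not on exactly the same scale as in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
# Assumption: the target depends on columns 0 and 1 only.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

# alpha is the weight of the L1 term; 0.5 is a hypothetical value for this sketch.
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

print(np.round(lasso.coef_, 3))
kept = np.flatnonzero(lasso.coef_)   # indices of the non-zero weights
print(kept)
```

The surviving weights are also shrunk towards zero, which is the price paid for the sparsity.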

In [16]:
print('all features:', x.shape)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(x, y)
model = SelectFromModel(lsvc, prefit=True)
x_new = model.transform(x)
print('Selected features:', x_new.shape)
all features: (2000, 20)
Selected features: (2000, 13)

Tree-based Feature Selection

Tree-based estimators can be used to compute impurity-based feature importances, which in turn can be used to eliminate irrelevant features.

In [17]:
exttrs = ExtraTreesClassifier(n_estimators=100)
exttrs = exttrs.fit(x, y)
sorted_zipped_lists = sorted(zip(list(exttrs.feature_importances_),list(x.columns)), reverse=True)
importance=np.array([ii[0] for ii in sorted_zipped_lists])
features_colums=[ii[1] for ii in sorted_zipped_lists]
In [18]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

_=prfrmnce_plot(importance, title=f'Feature Selection by Tree-based Feature Importance (ExtraTreesClassifier)', 
            ylabel='Percentage of Importance',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,xshift=-0.15,axt=ax1,
            yshift=0.015,ylim=[0,0.50], xlim=[-0.5,20], y_rot=90, reverse=True, fontsizelable=6)
plt.show()
In [19]:
print('all features:', x.shape)
model = SelectFromModel(exttrs, prefit=True)
x_new = model.transform(x)
print('Selected features:', x_new.shape)
all features: (2000, 20)
Selected features: (2000, 2)
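
Only a few features survive here because no threshold was passed. For an estimator without an L1 penalty, SelectFromModel falls back to the mean importance, and since impurity-based importances sum to 1, that mean equals 1 divided by the number of features. The sketch below checks this on synthetic data (an assumption; the selector is fitted directly instead of using prefit=True).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 150)
X_demo = rng.normal(size=(300, 10))
X_demo[:, 0] += 3.0 * y_demo          # assumption: column 0 dominates

selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
selector.fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
print(importances.sum())              # impurity-based importances are normalised
print(selector.threshold_)            # default threshold for this estimator
print(selector.get_support().sum(), "of", X_demo.shape[1], "features kept")
```

Passing an explicit threshold (a number, or a string such as "median") makes the selection less dependent on how strongly one or two features dominate.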