Summary

Feature importance means assigning scores to input features using a predictive model; the scores signify the relative importance of the features for making a prediction. In this notebook, Coefficient of Determination, Predictive Power Score, Linear Regression, Decision Tree, Random Forest, Gradient Boosting (XGBoost, CatBoost and LightGBM) are applied to quantify the importance of input features to predict targets. Finally, majority vote technique is considered to integrate all these predictive algorithms to find the most and least important features. Both Regression and Classification problems are applied in this study.

Python functions and data files to run this notebook are in my Github page.

import pandas as pd
import numpy as np
import time
import matplotlib 
import pylab as plt
from scipy.stats import zscore
from functions import* # import require functions to run this notebook
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBRegressor
from xgboost import XGBClassifier
from catboost import Pool, CatBoostRegressor
from catboost import Pool, CatBoostClassifier
from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
import ppscore as pps
import warnings
warnings.filterwarnings('ignore')

1 Introduction
2 Coefficient of Determination
3 Predictive Power Score
4 Predictive Models for Feature Importance
- 4.1 Regression (Linear & Logistic)
- 4.2 Tree-based algorithms
  - 4.2.1 Random Forest
  - 4.2.2 Gradient Boosting
5 Majority Vote
6 Synthetic Data
- 6.1 Majority Vote
7 Classification
- 7.1 Majority Vote

Introduction¶

Feature Importance refers to techniques of assigning a score (usually from 0 to 100%) for all the input features for a given predictive model. The “importance” of each feature is quantified by the score: higher score denotes a feature have higher effect on the model being used to predict a certain target.

Feature importance is useful due to the below reasons:

Understanding of Data

Similar to correlation matrix, feature importance can present the relationship between features and target. While correlation matrix can only detect the linear correlation, feature importance can also quantify the non-linear correlation between features and target via predictive algorithms.

Improving Model

During model training, the calculated scores can be used to reduce dimensionality of data, removing irrelevant features that have low score. This not only leads to improve the performance of predictive algorithm but also reduces computational cost.

Interpreting Model

By fitting a predictive model on the dataset, the importance of feature as scores (0 to 100%) are calculated. These scores provide insight into that specific model by quantifying the most important and least important feature to the model for making a prediction. This can lead to model interpretation.

Retrieved from Brownlee, Jason.

There are several techniques for quantifying the importance of features for classification and regression problems. These approaches are applied in this notebook. Finally, a majority vote technique will be applied to get the features with most and least votes from different techniques.

Coefficient of Determination¶

The correlation coefficient also termed as $R^{2}$ is the basic approach to quantify the importance of features for a target. In statistics, $R^{2}$ determines and assesses the ability of a statistical model to explain and predict future outcomes. In other words, if we have dependent variable y and independent variable x in a model, then $R^{2}$ helps in determining the variation in y by variation x. It is one of the key output of regression analysis and is used when we want to predict future or testing some models with related information. $R^{2}$ lies between 0 and 1 and higher the value of $R^{2}$, better will be the prediction and strength of the model. $R^{2}$ is basically a square of a correlation coefficient (𝜌).

There are multiple Formulas to calculate the coefficient of determination. It can be simply calculated from correlation coefficient:

Coefficient of Determination=(Correlation Coefficient)^2

The correlation coefficient (𝜌) is the basic approach to quantify the importance of features for a target. The 𝜌 is calculated to represent the linear correlation between features and target. The covariance between X (feature) and Y (target) (cov(X,Y)) is calculated divided by standard deviation of $\sigma_{X}$ and $\sigma_{Y}$.

$\Large 𝜌=\frac{Cov_{X,Y}}{\sigma_X\sigma_Y}=\frac{\frac{\sum[({X-\bar{X}})\times({Y-\bar{Y}}^{})]}{n}}{\sqrt{\frac{\sum({X-\bar{X}})^2}{n}}\times \sqrt{\frac{\sum({Y-\bar{Y}})^2}{n}}}=\frac{\sum[({X-\bar{X}})\times({Y-\bar{Y}})]}{\sqrt{\sum[({X-\bar{X}})^{2}\times({Y-\bar{Y}})^{2}]}}$

Where:

$X$: Data set X

$Y$: Data set Y

$\bar{X}$: Mean of Data set X

$\bar{Y}$: Mean of Data set Y

The 𝜌 is always between -1 and +1. Negative correlation means by increasing one feature, another feature will probably decrease. The opposite is true for positive correlation: increasing one feature probably lead to increase another feature. If the absolute value of 𝜌 is close to 1, it implies two features are perfectly correlated. If it is close to 0, two features could be independent. However, 𝜌 ≈ 0 does not necessarily imply independence since features can be non-linearly correlated. See Figure below as an schematic illustration for different linear correlations between X and Y variables:

Unlike the Pearson correlation coefficient, the coefficient of determination measures how well the predicted values match (and not just follow) the observed values. It depends on the distance between the points and the 1:1 line (and not the best-fit line) as shown above. Closer the data to the 1:1 line, higher the coefficient of determination. So, $R^{2}$ is only applied when when we have actual Y and predicted Y. $R^{2}$ can be calculated by correlation coefficient between actual $y_{a}$ and predicted $y_{p}$, and using Regression outputs as below:

$\Large R^{2}=𝜌^{2}=(\frac{\sum[({y_{a}-\bar{y_{a}}})\times({y_{p}-\bar{y_{p}}})]}{\sqrt{\sum[({y_{a}-\bar{y_{a}}})^{2}\times({y_{p}-\bar{y_{p}}})^{2}]}})^{2}$

see here for proof.

Coefficient of Determination ($R^{2}$) = Explained Variation / Variance of data

Coefficient of Determination $ \Large R^{2} = 1-\frac{SS_{res}}{SS_{tot}}$

Where:

$SS_{tot}$ – Total Sum of Squares, proportion to variance = $\large \sum({y_{a}-\bar{y_{a}}})^{2}$
$SS_{res}$ – Residual Sum of Squares = $\large \sum({y_{a}-y_{p}})^{2}$

In the best case, the modeled values exactly match the observed values, which results in $SS_{res}=0$ and $R^{2}=1$. Models that have worse predictions than this baseline will have a negative $R^{2}$.

There are two drawbacks with $R^{2}$ for feature importance quantification:

$R^{2}$ ≈ 0 does not necessarily imply independence since features can be non-linearly correlated. The non-linear relationship can be achieved from powerful predictive models.
Multivariate contribution of features for a target cannot be achieved from $R^{2}$. Predictive models can also quantify contribution of each feature to predict a target compared with other features.

Predictive Power Score¶

Predictive Power Score or PPS is a kind of score that is asymmetric and data-type agnostic and helps in identifying linear or non-linear relationships between two columns of a particular dataset. The value of PPS lies between 0 (no predictive power) and 1 (highest predictive power).

For Regression, the Mean Absolute Error (MAE) evaluation metric is used. First calculate MAE for the naive model, and then using this score, we generate the desired MAE for predictive power score. Here the score lies between 0 and 1 but as this score tells us about the error component, the lower it is, the better it will be. The mathematical formula used for calculating the MAE is mentioned below.

PPS = 1 – (MAE_model / MAE_naive)

For classification, the weighted F1 score is used as an evaluation metric. This score is basically a weighted average of precision and recall. In predictive power score, first the F1 score is calculated for the naive model (the model that always predicts the most common class) and after this, the actual F1 score for the predictive power score is calculated.

The F1 score lies between the range of 0 to 1. The higher the better it is.

Following is the mathematical formula used in this case:

PPS = (F1_model – F1_naive) / (1 – F1_naive)

By default, the predictive power score method uses a Decision Tree for calculating the metrics. There are many reasons for choosing the Decision Tree algorithm. Firstly, the Decision Tree can find out any sort of non-linear bivariate relationships. Decision Tree is applicable in numerous cases and also it requires very little data preprocessing. Furthermore, Decision Tree is able to handle outliers very well and rarely overfits, thus making it highly robust.

The main disadvantage of PPS, similar to correlation coefficient, it cannot measure multivariate contribution of features for a target because it only calculate the bivariate relationship between each feature and target at the time.

Predictive Models for Feature Importance¶

Regression (Linear & Logistic)¶

Regression is as supervised learning considered as a baseline prediction. Regression is divided into linear regression and logistic regression:

Linear Regression algorithm defines a linear relationship between independent and dependent variables. It uses a linear equation to identify the line of best fit (straight line) for a problem, thereby enabling the visualization and prediction of the output of the dependent variables. Here is mathematical equation for linear regression: $ y=w_{0}+w_{1}\times x_{1}+w_{2}\times x_{2}+,..., w_{n}\times x_{n}$
where $x_{1}$ to $x_{n}$ are input features and $w_{1}$ to $w_{n}$ are assigned weights for each feature to predict target $y$. The higher(absolute value), the higher importance of feature is to predict a target. $w$ can be positive or negative. Bias term $w_{0}$ is a variable or function independent of the input but which affects the output.

Logistic Regression model is similar to linear regression but it is applied to estimate the probability of a particular class. Logistic Regression calculates a weighted sum of the input features (plus a bias term) and use sigmoid function to estimate probability of each class. See the equation below: $ z=w_{0}+w_{1}\times x_{1}+w_{2}\times x_{2}+,..., w_{n}\times x_{n}$
$ y=\frac{1}{1+e^{-z}} (Sigmoid)$

The major limitation of Regression is the assumption of linearity between input features and target. Therefore, it cannot be applied for complex and non-linear correlation between features and target.

Tree-based algorithms¶

Decision Tree is a simple but still powerful algorithm for non-linear and complex data. Decision Tree is the fundamental component of Random Forest and Gradient Boosting. It is used a flowchart-based structure in which each node denotes a test. It starts with root node. Each branch represents the result of a test (child node), and each leaf node represents a class label. It is measured how each feature decrease the impurity of the split, the feature with highest decrease is selected for internal node. Therefore, it can be applied for feature importance. For each node, gini attribute measures the impurity: a node is “pure” (gini=0) if we have all the training instances belong to the same class. The gini reduces from Root Node to Leaf Node, so the predicted class is based on Leaf Node.

Decision Trees is trained with “growing” trees or greedy algorithm. The idea is really quite simple: the algorithm first splits the training set in two subsets using a single feature k and a threshold tk (e.g., “n_suspend_avg ≤ 0.398”). How does it choose k and tk? It searches for the pair (k, tk) that produces the purest subsets (weighted by their size). Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets and so on, recursively. It stops recursing once it reaches the maximum depth, or if it cannot find a split that will reduce impurity.

However, the result of one Decision Tree could not be as reliable as using many Decision Trees. Random Forest and Boosting Techniques, which apply aggregated results of multiple Decision Tree, likely lead to higher performance.

Random Forest¶

Random Forest is among the most versatile and reliable machine learning algorithms. The RF randomly creates and merges multiple decision trees and predicts a class that gets the most votes among all trees for each instance. For example, we can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, we just obtain the predictions of all individual trees, then predict the class that gets the most votes. Despite its simplicity, RF is one of the most powerful ML algorithms available today. Feature importance for random forest is calculated based on gini. For each feature we can collect how on average it decreases the impurity. The average over all trees in the forest is the measure of the feature importance.

Gradient Boosting¶

Gradient Boosting is categorized as advanced machine learning algorithm. It is an ensemble learning applied to integrate some weak learners (e.g. Decision Tree) into some strong learner. It's called gradient boosting because it uses a gradient descent algorithm to minimize the loss. Boosting methods generally work by sequentially train predictors, and for each training, it tries to improve its predecessor. It fits new predictor to the residual or errors from the previous predictor. Then the prediction is combined with previous trees to make the final prediction.

Three most important Python libraries for gradient boosted trees algorithm are XGBoost, LightGBM and CatBoost. These techniques applies gradient boosting using decision tree algorithm mentioned above. XGBoost and LightGBM are capable of automatically handle missing values (imputation is not required for these techniques). In tree algorithms, branch directions for missing values are learned during training.

Feature importance calculation for gradient boosting is similar to random forest. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance. The feature importances are then averaged across all of the the decision trees within the model.

Performance Measurement

The most common approach for assessment of classification is accuracy, which is calculated by number of true predicted over total number of data. However, accuracy alone may not be practical for performance measurement of classifiers, especially in cease of skewed datasets. Accuracy should be considered along with other metrics. Confusion matrix is a much better way to evaluate the performance of a classifier. The general idea is to consider the number of times instances of negative class are misclassified as positive class and vice versa. Three more metrics Sensitivity, Precision and Specificity can be calculated from confusion matrix as well as Accuracy.

The receiver operating characteristic (ROC) curve is another common tool used to measure performance. The ROC curve plots the true positive rate (Sensitivity) against the false positive rate (1-Specificity). Every point on the ROC curve represents a chosen cut-off even though it cannot be seen. For more information and details see ROC. The most common way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.

Majority Vote¶

Since each model gives different score as feature importance, we can use majority vote to get the most important features aggregated from all techniques:

Feature importance analysis is run for 7 predictive algorithms.

1.Coefficient of Determination
2.Predictive Power Score
3.LinearRegression
4.Random Forest
5.XGBoost 
6.CatBoost 
7.LightGBM

For each analysis, top most important features are selected; for example top two features We may not select more because the scores will be very low or close to zero. Finally, vote each feature for two most important features of 7 runs and select the features with high votes. See schematic illustration below:

Synthetic Data¶

predictor_name= ['Coefficient of Determination','Predictive Power Score','LinearRegression', 
                 'Random Forest', 'XGBoost', 'CatBoost', 'LightGBM']
df_most_important=len(predictor_name)*[None]

First, feature importance is applied for synthetic data: 6 features, Var1 to Var6 with Gaussian, Triangular and Uniform distribution are generated to have linear and non-linear correlation with target for 100000 data.

# Generate synthetic data
df=quadrac_swiss_roll(noise=2,nsim=100000,time_rand1=1,seed=42,a1=20,b1=8,a2=50,b2=8,swiss_roll=True)
features_colums=df.columns
# Satndardize data
for i in df.columns:
    df[i] = zscore(df[i])

Figure below shows cross plot matrix between Var1 to Var6 features and the target. Var1 and Var 2 are non-linearly correlated with the target. Var3 has strong positive correlation (0.68), and Var4 and Var5 are negative correlation with the target.

font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 10), dpi= 100, facecolor='w', edgecolor='k')
colors_map = plt.cm.get_cmap('jet')

colors = colors_map(np.linspace(0,0.8,len(features_colums)))

for ir in range(len(features_colums)):
    ax1=plt.subplot(4,2,ir+1) 
    val=df[features_colums[ir]]

    EDA_plot.histplt(val,bins=20,title=f'{features_colums[ir]}',xlabl=None,days=False,
             ylabl=None,xlimt=None,ylimt=(0,0.3)
             ,axt=ax1,nsplit=5,scale=1.02,loc=2,font=10.5,color=colors[len(features_colums)-ir-1])
plt.subplots_adjust(hspace=0.4)
plt.subplots_adjust(wspace=0.3)
fig.suptitle(f'Histogram of Variables for Synthetic Data', fontsize=16,y=0.96)
plt.show()

font = {'size'   :9 }
plt.rc('font', **font)

fig=plt.figure(figsize=(10, 9), dpi= 100, facecolor='w', edgecolor='k')
# Satndardize data
for i in features_colums:
    df[i] = zscore(df[i])
cplotmatrix(df,font=font,alpha=0.008,marker='g.',missin_rep=False)
plt.show()

Correlation Coefficient

font = {'size'   : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')

corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
features_colums=list(corr.index)
Correlation_plot.corr_bar(coefs,clmns=features_colums,select=False,yfontsize=6.0,title=f'Linear Correlation with Target',
                          ymax_vert_lin=30,xlim = [-0.5, 0.7])

Coefficient of Determination

ir=0
corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
coefs= np.array([i**2 for i in coefs])

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

df_most_important[ir]=prfrmnce_plot(coefs, title=f'Feature Importance by Coefficient of Determination', 
            ylabel='Coefficient of Determination',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,1], xlim=[-0.5,5.5], y_rot=0)

Predictive Power Score

ir=1

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

matrix =pps.matrix(df)

# Calculate importance
importance = list(matrix['ppscore'][(matrix['y']=="Target") & (matrix['ppscore']!=1)])
importance=importance/np.sum(importance)

features_colums=list(matrix['x'][(matrix['y']=="Target") & (matrix['ppscore']!=1)])

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by Predictive Power Score', 
            ylabel='Predictive Power Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,1], xlim=[-0.5,5.5], y_rot=0)

# Tarining set
X=df.drop(['Target'],axis=1)
# Target
y=df['Target']
# Columns
features_colums=list(X.columns)

Linear Regression

As mentioned before, linear algorithms apply prediction by weighted sum of the input values (plus a bias term) via fitting a model. Some extensions of this algorithm add regularization, such as elastic net and ridge regression.

The weighted sum of input values lead to a set of coefficients. These coefficients can be used to represent the feature importance. The higher the weight, the higher the importance is.

font = {'size'   : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')

# define the model
model = LinearRegression()

# fit the model
model.fit(X, y)

Correlation_plot.corr_bar(model.coef_,clmns=features_colums,select=False,yfontsize=6.0,
                          title=f'Feature Importance by Logistic Regression',
                          ymax_vert_lin=30,xlim = [-0.5, 0.7])

ir=2

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
# Calculate importance
importance = abs(model.coef_)
importance=importance/sum(importance)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by Logistic Regression', 
            ylabel='Logistic Regression Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,1], xlim=[-0.5,5.5], y_rot=0)

Decision Tree

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = DecisionTreeRegressor()

# fit the model
model.fit(X, y)

# Calculate importance
importance = model.feature_importances_

_=prfrmnce_plot(importance, title=f'Feature Importance by Decision Tree', 
            ylabel='Decision Tree Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.01,ylim=[0,0.65], xlim=[-0.5,5.5], y_rot=0)

# Flowchart of Decision Tree for Prediction. Training features (X) and target (y) are shown in the figure.
np.random.seed(42) 
tree_cl = tree.DecisionTreeRegressor(max_depth=3)
tree_cl.fit(X, y)
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (8,8), dpi=200)
out = tree.plot_tree(tree_cl,fontsize=5,filled=True, proportion =True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('red')
        arrow.set_linewidth(1)       
#
txt=''
for i in range(len(features_colums)):
    txt+='X'+'['+str(i)+']='+features_colums[i]+'\n'
txt+='Target'
plt.text(0.01,0.8, 'X and y Variables', fontsize=9,bbox=dict(facecolor='white', alpha=0.0))
plt.text(0.01,0.65, txt, fontsize=5.5,bbox=dict(facecolor='white', alpha=0.2))
fig.suptitle('Decision Tree Flowchart to Predict Target', fontsize=10,y=0.86)
plt.show()

Random Forest

ir=3

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = RandomForestRegressor()

# fit the model
model.fit(X, y)

# Calculate importance
importance = abs(model.feature_importances_)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by Random Forest', 
            ylabel='Random Forest Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.65], xlim=[-0.5,5.5], y_rot=0)

XGBoost

ir=4

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = XGBRegressor()

# fit the model
model.fit(X, y)

# Calculate importance
importance = model.feature_importances_

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by XGBoost', 
            ylabel='XGBoost Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.65], xlim=[-0.5,5.5], y_rot=0)

CatBoost

ir=5

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = CatBoostRegressor()

# fit the model
train_data = X
train_label = y
train_pool = Pool(train_data, train_label)
model.fit(train_pool,verbose=0)

# Calculate importance
importance = model.get_feature_importance()
# Calculate importance
importance=importance/sum(importance)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by CatBoost', 
            ylabel='CatBoost Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.65], xlim=[-0.5,5.5], y_rot=0)

LightGBM

ir=6

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = LGBMRegressor(random_state=32)

# fit the model
model.fit(X, y)

# Calculate importance
importance=importance/sum(importance)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by LightGBM', 
            ylabel='LightGBM Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.65], xlim=[-0.5,5.5], y_rot=0)

Majority Vote¶

Majority vote is applied by selecting top two features of each algorithm.

no_rank=2
tmp=[]
all_features=list(np.unique(np.ravel([tmp+list(df_most_important[i]['Features']) for i in range(len(df_most_important))])))

dic_list=[]
aa=[]
tmp_value=[]
for i in range(len(predictor_name)):
    dic_={}
    tmp_val=[1 if ii in list(df_most_important[i]['Features'][:no_rank]) else 0 for ii in all_features]
    for ii in range(len(all_features)):
        dic_[all_features[ii]] = tmp_val[ii]
    dic_list.append(dic_)
#    
majo_vot=[]
for i in all_features:
    val=0
    for j in range(len(dic_list)):
        val=dic_list[j][i]+val
    majo_vot.append(val)       

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

#importance=list(dic_m.values())/sum(list(dic_m.values()))

df_most_important_soft_voting= prfrmnce_plot(np.array(majo_vot), 
    title=f'Majority Vote by Selecting Top {no_rank} Important Feature Importance of each Algorithm', 
    ylabel='Majority Votes'
    ,clmns=all_features,titlefontsize=8.2,
    xfontsize=7, yfontsize=8).bargraph(perent=False, select=30,
    fontsizelable=10,xshift=-0.05,yshift=0.2,xlim=[-0.5,5.5],ylim=[0,8], y_rot=0, graph_float=False)
plt.show()

The most important feature from majority vote is Var3 with 5 votes followed by Var1 with 4 votes and Var2 with 3 votes.

Classification¶

predictor_name= ['Coefficient of Determination','Predictive Power Score','LinearRegression', 
                 'Random Forest', 'XGBoost', 'CatBoost', 'LightGBM']
df_most_important_class=len(predictor_name)*[None]

Mobile Price Classification Data

The data for feature importance of classification is downloaded from Data. The definition of the variables in the dataset are:

battery_power: Total energy a battery can store in one time measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last when you are
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0 (low cost), 1(medium cost), 2(high cost) and 3(very high cost).

train = pd.read_csv('./Data/train.csv')

# Tarining set
X=train.drop(['price_range'],axis=1)

# Target
y=train['price_range']
# Columns
features_colums=list(X.columns)

X

Figure below shows cross plot matrix between Var1 to Var6 features and the target. Var1 and Var 2 are non-linearly correlated with the target. Var3 has strong positive correlation (0.68), and Var4 and Var5 are negative correlation with the target.

font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(20, 20), dpi= 100, facecolor='w', edgecolor='k')
colors_map = plt.cm.get_cmap('jet')

colors = colors_map(np.linspace(0,0.8,len(features_colums)))

for ir in range(len(features_colums)):
    ax1=plt.subplot(6, 4,ir+1) 
    val=train[features_colums[ir]]

    EDA_plot.histplt(val,bins=20,title=f'{features_colums[ir]}',xlabl=None,days=False,
             ylabl=None,xlimt=None,ylimt=(0,0.2)
             ,axt=ax1,nsplit=5,scale=1.02,loc=2,font=10.5,color=colors[len(features_colums)-ir-1])
plt.subplots_adjust(hspace=0.4)
plt.subplots_adjust(wspace=0.3)
fig.suptitle(f'Histogram of Variables for Mobile Price Data', fontsize=16,y=0.96)
plt.show()

font = {'size'   : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')

corr=train.corr()
corr=corr['price_range'].drop(['price_range'])
coefs=corr.values
features_colums=list(corr.index)
Correlation_plot.corr_bar(coefs,clmns=features_colums,select=False,yfontsize=6.0,
                          title=f'Linear Correlation with price_range',
                          ymax_vert_lin=30,xlim = [-0.5, 0.7])

Coefficient of Determination

ir=0
corr=train.corr()
corr=corr['price_range'].drop(['price_range'])
coefs=corr.values
coefs= np.array([i**2 for i in coefs])

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

df_most_important[ir]=prfrmnce_plot(coefs, title=f'Feature Importance by Coefficient of Determination for price_range', 
            ylabel='Coefficient of Determination',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,1], xlim=[-0.5,20.], y_rot=90)

Predictive Power Score

ir=1

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

matrix =pps.matrix(train)

# Calculate importance
importance = list(matrix['ppscore'][(matrix['y']=="price_range") & (matrix['ppscore']!=1)])
importance=importance/np.sum(importance)

features_colums=list(matrix['x'][(matrix['y']=="price_range") & (matrix['ppscore']!=1)])

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by Predictive Power Score for price_range', 
            ylabel='Predictive Power Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.06,ylim=[0,1.2], xlim=[-0.5,20.], y_rot=90)

# Tarining set
X=train.drop(['price_range'],axis=1)

# Satndardize data
for i in X.columns:
    X[i] = zscore(X[i])
    
# Target
y=train['price_range']
# Columns
features_colums=list(X.columns)

Logistic Regression

font = {'size'   : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')

# define the model
model = LogisticRegression(multi_class="multinomial", random_state=42)

# fit the model
model.fit(X, y)

# Calculate importance
importance = model.coef_[3]
importance=importance/sum(importance)

Correlation_plot.corr_bar(importance,clmns=features_colums,select=False,yfontsize=6.0,
                          title=f'Feature Importance by Logistic Regression for price_range',
                          ymax_vert_lin=30,xlim = [-0.5, 0.7])

ir=2

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by Logistic Regression for price_range', 
            ylabel='Logistic Regression Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.8], xlim=[-0.5,20.], y_rot=90)

Decision Tree

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = DecisionTreeRegressor()

# fit the model
model.fit(X, y)

# Calculate importance
importance = model.feature_importances_
importance=importance/sum(importance)

_=prfrmnce_plot(importance, title=f'Feature Importance by Decision Tree for price_range', 
            ylabel='Decision Tree Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,1], xlim=[-0.5,20.], y_rot=90)

# Flowchart of Decision Tree for Prediction. Training features (X) and target (y) are shown in the figure.
np.random.seed(42) 
tree_cl = tree.DecisionTreeClassifier(max_depth=3)
tree_cl.fit(X, y)
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (8,8), dpi=200)
out = tree.plot_tree(tree_cl,fontsize=5, class_names=['0','1','2','3'],filled=True, proportion =True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('red')
        arrow.set_linewidth(1)       
#
txt=''
for i in range(len(features_colums)):
    txt+='X'+'['+str(i)+']='+features_colums[i]+'\n'
txt+='Target=price_range'
plt.text(0.01,0.99, 'X and y Variables', fontsize=9,bbox=dict(facecolor='white', alpha=0.0))
plt.text(0.01,0.65, txt, fontsize=5.5,bbox=dict(facecolor='white', alpha=0.2))
fig.suptitle('Decision Tree Flowchart to Predict price_range', fontsize=10,y=0.86)
plt.show()

Random Forest

ir=3

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = RandomForestClassifier()

# fit the model
model.fit(X, y)

# Calculate importance
importance = model.feature_importances_
importance=importance/sum(importance)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by Random Forest for price_range', 
            ylabel='Random Forest Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.8], xlim=[-0.5,20.], y_rot=90)

XGBoost

ir=4

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = XGBClassifier()

# fit the model
model.fit(X, y)

# Calculate importance
importance = model.feature_importances_
importance=importance/sum(importance)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by XGBoost for price_range', 
            ylabel='XGBoost Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.8], xlim=[-0.5,20.], y_rot=90)

CatBoost

ir=5

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = CatBoostClassifier()

# fit the model
train_data = X
train_label = y
train_pool = Pool(train_data, train_label)
model.fit(train_pool,verbose=0)

# Calculate importance
importance = model.get_feature_importance()
importance=importance/sum(importance)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by CatBoost for price_range', 
            ylabel='CatBoost Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,0.8], xlim=[-0.5,20.], y_rot=90)

LightGBM

ir=6

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

# define the model
model = LGBMRegressor(random_state=32)

# fit the model
model.fit(X, y)

# Calculate importance
importance = model.feature_importances_
importance=importance/sum(importance)

df_most_important[ir]=prfrmnce_plot(importance, title=f'Feature Importance by LightGBM for price_range', 
            ylabel='LightGBM Score',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=7,xshift=-0.25,axt=ax1,
            yshift=0.008,ylim=[0,0.3], xlim=[-0.5,20.], y_rot=90)

Majority Vote¶

Majority vote is applied by selecting top three features of each algorithm.

no_rank=3
tmp=[]
all_features=list(np.unique(np.ravel([tmp+list(df_most_important[i]['Features']) for i in range(len(df_most_important))])))

dic_list=[]
aa=[]
tmp_value=[]
for i in range(len(predictor_name)):
    dic_={}
    tmp_val=[1 if ii in list(df_most_important[i]['Features'][:no_rank]) else 0 for ii in all_features]
    for ii in range(len(all_features)):
        dic_[all_features[ii]] = tmp_val[ii]
    dic_list.append(dic_)
#    
majo_vot=[]
for i in all_features:
    val=0
    for j in range(len(dic_list)):
        val=dic_list[j][i]+val
    majo_vot.append(val)       

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

#importance=list(dic_m.values())/sum(list(dic_m.values()))

df_most_important_soft_voting= prfrmnce_plot(np.array(majo_vot), 
    title=f'Majority Vote by Selecting Top {no_rank} Important Feature Importance of each Algorithm', 
    ylabel='Majority Votes'
    ,clmns=all_features,titlefontsize=8.2,
    xfontsize=7, yfontsize=8).bargraph(perent=False, select=30,axt=ax1,
    fontsizelable=10,xshift=-0.18,yshift=0.2,xlim=[-0.5,20],ylim=[0,8], y_rot=0, graph_float=False)
plt.show()

The most important feature from majority vote is ram with 6 votes followed by battery_power with 5 votes and px_height with 3 votes.

	battery_power	blue	clock_speed	dual_sim	fc	four_g	int_memory	m_dep	mobile_wt	n_cores	pc	px_height	px_width	ram	sc_h	sc_w	talk_time	three_g	touch_screen	wifi
0	842	0	2.2	0	1	0	7	0.6	188	2	2	20	756	2549	9	7	19	0	0	1
1	1021	1	0.5	1	0	1	53	0.7	136	3	6	905	1988	2631	17	3	7	1	1	0
2	563	1	0.5	1	2	1	41	0.9	145	5	6	1263	1716	2603	11	2	9	1	1	0
3	615	1	2.5	0	0	0	10	0.8	131	6	9	1216	1786	2769	16	8	11	1	0	0
4	1821	1	1.2	0	13	1	44	0.6	141	2	14	1208	1212	1411	8	2	15	1	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1995	794	1	0.5	1	0	1	2	0.8	106	6	14	1222	1890	668	13	4	19	1	1	0
1996	1965	1	2.6	1	0	0	39	0.2	187	4	3	915	1965	2032	11	10	16	1	1	1
1997	1911	0	0.9	1	1	1	36	0.7	108	8	3	868	1632	3057	9	1	5	1	1	0
1998	1512	0	0.9	0	4	1	46	0.1	145	5	5	336	670	869	18	10	19	1	1	1
1999	510	1	2.0	1	5	1	45	0.9	168	6	16	483	754	3919	19	4	2	1	1	1

Feature Importance
© Mehdi Rezvandehy

Table of Contents

Introduction¶

Coefficient of Determination¶

Predictive Power Score¶

Predictive Models for Feature Importance¶

Regression (Linear & Logistic)¶

Tree-based algorithms¶

Random Forest¶

Gradient Boosting¶

Majority Vote¶

Synthetic Data¶

Majority Vote¶

Classification¶

Majority Vote¶

Feature Importance© Mehdi Rezvandehy

Table of Contents

Introduction¶

Coefficient of Determination¶

Predictive Power Score¶

Predictive Models for Feature Importance¶

Regression (Linear & Logistic)¶

Tree-based algorithms¶

Random Forest¶

Gradient Boosting¶

Majority Vote¶

Synthetic Data¶

Majority Vote¶

Classification¶

Majority Vote¶

Feature Importance
© Mehdi Rezvandehy