Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation. Multicollinearity is a problem because it undermines the statistical significance of an independent variable, reduces the precision of the estimated coefficients, and harms interpretability. In this notebook, the p-values of the regression coefficients and the Variance Inflation Factor (VIF) are applied to detect and remove multicollinearity. Real and synthetic examples are presented.
Python functions and data files needed to run this notebook are available via this link.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter  # used below to format the CDF axis as percentages
import pandas as pd
from IPython.display import Image
from pandas import DataFrame
from sklearn import linear_model
import statsmodels.api as sm
from scipy.stats import zscore
from functions import *  # import the required functions to run this notebook
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import random
import math
import warnings
warnings.filterwarnings('ignore')
Multicollinearity is a condition where a predictor (independent variable) correlates with another predictor, for example when there is high correlation between two of the selected features. Although multicollinearity may not affect the model’s predictive performance, it does affect interpretability, especially for linear and logistic regression. If we do not remove the multicollinearity, we will never know how much each variable really contributes to the result. Recall that one of the factors that affects the standard error of a partial regression coefficient in the regression equation is multicollinearity. The standard error represents the average distance that the observed values fall from the regression line. An independent variable that is very highly correlated with one or more other independent variables will have a relatively large standard error. This implies that the partial regression coefficient is unstable and will vary greatly from one sample to the next, which undermines the statistical significance of that independent variable: the larger the standard error of a regression coefficient, the less likely it is that the coefficient will be statistically significant. There are two types of multicollinearity:
Perfect multicollinearity occurs whenever one independent variable is a perfect linear function of one or more of the other independent variables in a regression equation. This type of multicollinearity is easy to detect.
Extreme multicollinearity is often more troublesome than perfect multicollinearity because it frequently goes undetected. Extreme multicollinearity occurs whenever an independent variable is very highly, but not perfectly, correlated with one or more other independent variables.
Moderate multicollinearity may not be problematic. However, extreme multicollinearity is a problem because it increases the variance of the coefficient estimates and makes the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable and difficult to interpret. Whenever two independent variables are very highly correlated with one another, neither of them is likely to be statistically significant, even though both may be highly correlated with the dependent variable.
Multicollinearity saps the statistical power of the analysis, can cause the coefficients to switch signs, and makes it more difficult to specify the correct model.
Here is an example to explain multicollinearity. The equation below shows the coefficients $\beta_{1}$ to $\beta_{4}$ for 4 independent variables (predictors) $X_{1}$ to $X_{4}$ in a linear or logistic regression (for logistic regression, $Y$ stands for the log-odds of the target):
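$\large Y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}+\beta_{4}X_{4}$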
If variables $X_{1}$ to $X_{4}$ are independent from each other, changing the magnitude of one feature will not cause any other independent variable to change its value, or only by a negligible amount. For example, increasing $X_{1}$ will increase Y but have little impact on $X_{2}$ to $X_{4}$.
However, in the case of high correlation between variables $X_{3}$ and $X_{4}$, we obtain unreliable results. If the magnitude of $X_{3}$ is increased (shown in red) to observe the change in Y (red), there will also be a significant change in the value of $X_{4}$ (orange). As a result, the change that we observe in Y is due to the change in both $X_{3}$ (red) and $X_{4}$ (orange). The resulting change (blue) is greater than the actual change (orange).
Since we are trying to estimate the coefficient corresponding to $X_{3}$, the contribution of variable $X_{4}$ causes the coefficient to be unstable (overestimated or underestimated). The regression algorithm optimizes $\beta_{1}$ to $\beta_{4}$ to minimize the error between the predictions and the actual values. The optimization may give a very high value to $\beta_{3}$ (overestimated) and a very low value to $\beta_{4}$ (underestimated), and these can easily switch with a small change in the model or a different random subset of the data. Since $\beta_{3}$ and $\beta_{4}$ are unstable, this will mislead our interpretations. See Medium for more information.
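To make this concrete, here is a minimal numerical sketch (made-up data, not the Energy dataset used later) that fits the same response twice with statsmodels OLS: once with an independent second predictor and once with a near-duplicate of the first predictor. The collinear design inflates the standard errors and makes the two coefficients split the effect unpredictably:
import numpy as np
import statsmodels.api as sm
np.random.seed(0)
n = 200
x1 = np.random.normal(size=n)
x2_indep  = np.random.normal(size=n)                 # independent of x1
x2_collin = x1 + 0.05*np.random.normal(size=n)       # almost a duplicate of x1
y = 2*x1 + np.random.normal(size=n)
# Fit the same response with an independent vs. a highly collinear second predictor
for name, x2 in [('independent x2', x2_indep), ('collinear x2', x2_collin)]:
    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()
    print(name, '-> coefficients:', np.round(fit.params[1:], 2),
          '| std errors:', np.round(fit.bse[1:], 2))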
You can find many real datasets in the UCI Machine Learning Repository that are already processed and cleaned, ready to feed Machine Learning algorithms. The Energy Efficiency Data Set is useful for this course. Energy analysis is performed for 768 simulated building shapes with respect to 8 features, including Wall Area, Overall Height, Glazing Area, Orientation, etc., to predict Heating Load. The work was published by Tsanas and Xifara (2012) in the Energy and Buildings journal. In this notebook, we apply multicollinearity analysis with Heating Load (Binary Classes) as the target.
To understand multicollinearity, we can compute the correlation matrix between predictors. See the figure below for the correlation matrix between the features and the target (Heating Load).
df = pd.read_csv('./Data/Building_Heating_Load.csv',na_values=['NA','?',' '])
np.random.seed(32)
df=df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
df=df[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area',
'Overall Height', 'Orientation', 'Glazing Area',
'Glazing Area Distribution', 'Binary Classes','Multi-Classes']]
df['Binary Classes'] = np.where(df['Binary Classes']=='Low Level',0,1)
corr = df.drop(['Binary Classes', 'Multi-Classes'], axis=1)
for icolm in corr.columns:
    corr[icolm] = zscore(corr[icolm])
corr.info()
# Correlation matrix
font = {'size' : 16}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(10, 10), dpi= 80, facecolor='w', edgecolor='k')
corr_array=Correlation_plot.corr_mat(corr,corr_val_font=12,
title=f'Correlation Matrix between {len(df.columns)} Features'
,titlefontsize=14,xy_title=[1,-3.5],xyfontsize = 12,vlim=[-1, 1],axt=ax)
font = {'size' :14 }
plt.rc('font', **font)
fig=plt.figure(figsize=(25, 20), dpi= 100, facecolor='w', edgecolor='k')
# Make cross plot matrix
cplotmatrix(corr,alpha=0.05,marker='ro',missin_rep=False, font=font)
From the correlation matrix above, some features are collinear. A histogram and CDF of the correlations between features can be calculated to quantify the percentage of highly correlated pairs. Since the correlation matrix is symmetric, we only need the upper or lower triangular part, and we remove the 1s on the diagonal.
df
input_features=df.drop(['Binary Classes','Multi-Classes'], axis=1)
corr_coeff=input_features.corr()
corr_coeff
# Get only the lower triangular part, excluding the diagonal elements which have correlation 1
corr_coeff=corr_coeff.where(~np.triu(np.ones(corr_coeff.shape)).astype(bool))
corr_coeff_=corr_coeff.stack().reset_index()
corr_coeff_.columns = ['Row','Column','Value']
font = {'size' : 9}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1)
corr_array=corr_coeff_['Value']
EDA_plot.histplt (np.array(corr_array),bins=20,title='Histogram of Correlation Coefficients between Features',xlabl='Correlation',
ylabl='Percentage',xlimt=(-1,1),ylimt=(0,0.45),axt=ax1,
scale=1.15,loc=1,font=10,color='#8DC63F')
ax2=plt.subplot(2,1,2)
data_var=[None]
data_var[0]=[abs(i) for i in corr_array]
title_cdf=f'CDF of Absolute Correlation Coefficients between Features'
EDA_plot.CDF_plot(data_var,1,label='Correlation',colors='r',xlabel=None,title=title_cdf,
ylabel='Cumulative Probability',bins=100,xlim=(0,1),ylim=(00.3,1),linewidth=2.5,loc=4,
axt=ax2,x_ftze=12,y_ftze=13,tit_ftze=12,leg_ftze=9)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
# 90th percentile
per=0.9
Perctile=np.quantile(data_var[0], per, axis=0, keepdims=True)[0]
txt= '90th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.01, txt, color='b',
fontsize=10)
plt.subplots_adjust(hspace=0.4)
The CDF (bottom figure) shows that 10% of the feature pairs have an absolute correlation greater than 0.9. Features involved in correlations above 0.9 could probably be removed from the data set. However, this is a naive approach; more investigation should be done to find and remove multicollinearity.
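As a quick illustration of that naive screening (it is not applied here), the feature pairs with absolute correlation above 0.9 can be listed directly from the stacked table corr_coeff_ built above; one feature from each such pair would be the candidate for removal. This is only a sketch of the idea:
# Naive screening: list feature pairs whose absolute correlation exceeds 0.9
high_corr_pairs = corr_coeff_[corr_coeff_['Value'].abs() > 0.9]
print(high_corr_pairs)
# One feature of each highly correlated pair is a removal candidate
print(high_corr_pairs['Column'].unique())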
sklearn.linear_model.LogisticRegression does not calculate the p-value and uncertainty interval for each coefficient. statsmodels automatically calculates the standard error, z value, p-value, and the 2.5% and 97.5% confidence bounds for each coefficient. Features with high p-values may be removed from the selected features. The section below presents the p-values and uncertainty intervals of the logistic regression coefficients for predicting the target.
features_colums=df.drop(['Binary Classes','Multi-Classes'], axis=1).columns
X_train_std=df.drop(['Binary Classes','Multi-Classes'], axis=1).values
y_train=df['Binary Classes']
import statsmodels.api as sm
logit_model=sm.Logit(y_train, X_train_std)
result=logit_model.fit()
print(result.summary())
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
Correlation_plot.corr_bar(result.params.values,clmns=features_colums,select=False,yfontsize=6.0,xlim = [-14, 5],
title=f'Feature Importance by Logistic Regression',ymax_vert_lin=30)
The figure below shows the p-values and confidence intervals of the logistic regression coefficients used to predict the target.
con_int=result.conf_int()
con_int.columns=['p2.5','p97.5']
font = {'size' :6 }
plt.rc('font', **font)
fig=plt.figure(figsize=(6, 5), dpi= 200, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1)
_=prfrmnce_plot(np.array(result.pvalues),
title=f'p-value for Logistic Regression Coefficient of Features to Predict Heating Load',
ylabel='p-value',
clmns=features_colums,titlefontsize=7.5,
xfontsize=6, yfontsize=6).bargraph(fontsizelable=6.,
xshift=-0.12,axt=ax1,yshift=0.04,ylim=[0,1.3],y_rot=90)
ax2=plt.subplot(2,1,2)
plt.plot(np.array(features_colums),np.array(con_int['p2.5']),'b--',label='P 2.5%',lw=1)
plt.plot(np.array(features_colums),np.array(con_int['p97.5']), 'r',label='P 97.5%',lw=1)
plt.title('P 2.5% and P 97.5% for Logistic Regression Coefficient of Features',fontsize=7)
plt.ylabel('Confidence Interval',fontsize=6)
plt.legend(loc=1,fontsize=7,markerscale=1.1)
ax2.grid(axis='both',linewidth='0.1')
ax2.set_xticklabels(features_colums, rotation=90,fontsize=6,y=0.02)
plt.subplots_adjust(wspace=0.9)
plt.subplots_adjust(hspace=1.4)
plt.show()
The features with high p-values, such as Surface Area, Wall Area, and Roof Area, suffer from multicollinearity and should probably be excluded from the data set. However, more investigation should be done before removing features.
Lasso or L1 Regularization is a regularization technique for Linear Regression that can be used for feature selection. We investigate removing multicollinearity with Lasso. It adds a regularization term $\large\alpha \sum_{i=1}^{n} \left| w_{i} \right|$ to the cost function. Since the cost function should be minimized, the higher α, the stronger the regularization. If α is high, all weights receive values close to zero. If α=0, no regularization is applied.
$\large MSE(w)+\alpha \sum_{i=1}^{n} \left| w_{i} \right|$
Lasso completely removes the weights of the least important features (their weights are set to zero). So, feature selection can be applied automatically by Lasso regularization. The figures below show Lasso applied with different α values.
alpha=0.001
clf = linear_model.Lasso(alpha=alpha)
clf.fit(X_train_std, y_train)
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
Correlation_plot.corr_bar(clf.coef_,clmns=features_colums,select=False,yfontsize=6.0,xlim= [-1, 1],
title=f'Feature Importance by Lasso Regression for alpha={alpha}',ymax_vert_lin=30)
alpha=0.01
clf = linear_model.Lasso(alpha=alpha)
clf.fit(X_train_std, y_train)
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
Correlation_plot.corr_bar(clf.coef_,clmns=features_colums,select=False,yfontsize=6.0,xlim= [-0.2, 0.2],
title=f'Feature Importance by Lasso Regression for alpha={alpha}',ymax_vert_lin=30)
alpha=0.1
clf = linear_model.Lasso(alpha=alpha)
clf.fit(X_train_std, y_train)
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
Correlation_plot.corr_bar(clf.coef_,clmns=features_colums,select=False,yfontsize=6.0,xlim= [-0.2, 0.2],
title=f'Feature Importance by Lasso Regression for alpha={alpha}',ymax_vert_lin=30)
As can be seen, Lasso cannot remove multicollinearity.
df = pd.read_csv('./Data/Building_Heating_Load.csv',na_values=['NA','?',' '])
np.random.seed(32)
df=df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
df=df[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area',
'Overall Height', 'Orientation', 'Glazing Area',
'Glazing Area Distribution', 'Multi-Classes']]
from scipy.stats import zscore
for icolm in df.drop('Multi-Classes', axis=1).columns:
    df[icolm] = zscore(df[icolm])
df.info()
So far we have applied a binary classifier that distinguishes between two classes; you should apply multiclass classifiers (multinomial classifiers) if you want to distinguish between more than two classes. Some Machine Learning algorithms are strictly designed for binary classification (e.g. Support Vector Machines), while other classifiers (e.g. Random Forest) can handle multiple classes directly. However, there are also approaches that let you perform multiclass classification with multiple binary classifiers.
For example, the Energy Efficiency dataset also has a Multi-Classes column for Heating Load: Level 1 to Level 4:
df_multi=df.copy()
df_multi[:5]
# Percentage of each Class
counts=df_multi['Multi-Classes'].value_counts()
counts
counts/len(df['Multi-Classes'])
Now use Scikit-learn's OrdinalEncoder to convert the Level 1 to Level 4 classes to the numbers 0 to 3.
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
Multi_Classes_encoded = ordinal_encoder.fit_transform(df_multi[['Multi-Classes']])
Multi_Classes_encoded[0:5]
df_multi['Multi-Classes']=Multi_Classes_encoded
df_multi[0:5]
Let's see the percentage of each class:
counts=df_multi['Multi-Classes'].value_counts()
counts
counts/len(df['Multi-Classes'])
One way to create a system that classifies the Energy Efficiency dataset into 4 classes (0 to 3) is to train 4 binary classifiers, one for each class (a 0-detector, a 1-detector, a 2-detector, and a 3-detector). For example, to train the detector for class 3, you assign class 3 as 1 and all other classes as 0. This process is repeated for every class. Then, for each instance, you select the class whose detector outputs the highest score. This approach is called one-versus-all (OvA).
Training a binary classifier for every pair of classes is another strategy: one to distinguish classes 0 and 1, another for 0 and 2, another for 1 and 2, and so on. This approach is called one-versus-one (OvO). You need to train $\frac{n\times (n-1)}{2}$ classifiers for n classes. For the Energy Efficiency problem, we would train 6 binary classifiers. Imagine that for 10 classes you would need 45 classifiers to decide the winning class for each instance! That is a lot of work. However, each OvO classifier only needs to be trained on the part of the training set containing the two classes it must distinguish. This is the main advantage of OvO for large data sets, since it is faster to train many classifiers on small data sets than a few classifiers on one large training set.
Scikit-Learn automatically runs OvA (except for SVM classifiers, which use OvO); it detects when you try to use a binary classifier for a multiclass classification task (Aurélien Géron, 2019).
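The strategy can also be forced explicitly with scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers. Below is a minimal sketch assuming the encoded df_multi from the cells above; the split, the SGDClassifier defaults, and the local names (X_, y_, ova_clf, ovo_clf) are just illustrative choices for this sketch:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X_ = df_multi.drop('Multi-Classes', axis=1)
y_ = df_multi['Multi-Classes']
X_tr, X_te, y_tr, y_te = train_test_split(X_, y_, test_size=0.2, random_state=42, stratify=y_)
# One-versus-rest (OvA): one binary SGD classifier per class
ova_clf = make_pipeline(StandardScaler(), OneVsRestClassifier(SGDClassifier(random_state=42)))
ova_clf.fit(X_tr, y_tr)
print('OvA accuracy:', round(ova_clf.score(X_te, y_te), 3))
# One-versus-one (OvO): one binary SGD classifier per pair of classes
ovo_clf = make_pipeline(StandardScaler(), OneVsOneClassifier(SGDClassifier(random_state=42)))
ovo_clf.fit(X_tr, y_tr)
print('OvO accuracy:', round(ovo_clf.score(X_te, y_te), 3))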
Let’s use SGDClassifier, after removing the unnecessary columns 'Heating Load' and 'Binary Classes' from the data.
# Shuffle data
np.random.seed(32)
df_multi=df_multi.reindex(np.random.permutation(df_multi.index))
df_multi.reset_index(inplace=True, drop=True)
df_multi[0:10]
Divide the data into training and test sets using the StratifiedShuffleSplit function to make sure we have representative training and test sets with the same percentage of each class.
from sklearn.model_selection import StratifiedShuffleSplit
# Training and Test
spt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in spt.split(df_multi, df_multi['Multi-Classes']):
    train_set_strat = df_multi.loc[train_idx]
    test_set_strat = df_multi.loc[test_idx]
Plot the histogram of classes for the training and test sets to make sure we have the correct ratio of classes.
font = {'size' : 10}
plt.rc('font', **font)
fig = plt.subplots(figsize=(10, 3), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
train_set_strat['Multi-Classes'].hist(bins=15)
plt.title("Training Set")
plt.xlabel("Multi-Classes")
plt.ylabel("Frequency")
ax2=plt.subplot(1,2,2)
test_set_strat['Multi-Classes'].hist(bins=15)
plt.title("Test Set")
plt.xlabel("Multi-Classes")
plt.ylabel("Frequency")
plt.show()
Remove "Multi-Classes" column from data set and have it as separate column:
# Note that drop() creates a copy and does not affect train_set_strat
X_train = train_set_strat.drop("Multi-Classes", axis=1)
y_train = train_set_strat["Multi-Classes"].values
Next, standardize data with StandardScaler as we did before:
scaler = StandardScaler()
X_train=X_train.copy()
X_train_Std=scaler.fit_transform(X_train)
# Training Data
X_train_Std[0:5]
import statsmodels.api as sm
# One-versus-all detector for class 2: assign class 2 as 1 and all other classes as 0
y_train_cls2 = np.where(y_train==2,1,0)
logit_model=sm.Logit(y_train_cls2,X_train_Std)
result=logit_model.fit()
print(result.summary())
con_int = result.conf_int()
p_2_5 = [i[0] for i in con_int]
p_97_5 = [i[1] for i in con_int]
font = {'size' :6 }
plt.rc('font', **font)
fig=plt.figure(figsize=(6, 5), dpi= 200, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1)
_=prfrmnce_plot(np.array(result.pvalues),
title=f'p-value for Logistic Regression Coefficient of Features to Predict Heating Load',
ylabel='p-value',
clmns=features_colums,titlefontsize=7.5,
xfontsize=6, yfontsize=6).bargraph(fontsizelable=6.,
xshift=-0.12,axt=ax1,yshift=0.04,ylim=[0,1.3],y_rot=90)
ax2=plt.subplot(2,1,2)
plt.plot(np.array(features_colums),p_2_5,'b--',label='P 2.5%',lw=1)
plt.plot(np.array(features_colums),p_97_5, 'r',label='P 97.5%',lw=1)
plt.title('P 2.5% and P 97.5% for Logistic Regression Coefficient of Features',fontsize=7)
plt.ylabel('Confidence Interval',fontsize=6)
plt.legend(loc=1,fontsize=7,markerscale=1.1)
ax2.grid(axis='both',linewidth='0.1')
ax2.set_xticklabels(features_colums, rotation=90,fontsize=6,y=0.02)
plt.subplots_adjust(wspace=0.9)
plt.subplots_adjust(hspace=1.4)
plt.show()
The best way to identify multicollinearity is to calculate the Variance Inflation Factor (VIF) corresponding to every independent variable in the data set. The VIF tells us how well an independent variable can be predicted from the other independent variables.
The VIF is a number that indicates whether a variable suffers from multicollinearity, and it also represents how much the variance of its coefficient is inflated because of linear dependence with other variables. The example below shows how the VIF works. Suppose we have 5 independent variables. To calculate the VIF of variable Var 1, this variable is isolated and considered as the target, and all the other variables are treated as predictors to train a regression model and calculate the corresponding $R^{2}$ and $VIF$. This process is repeated for all independent variables.
$\LARGE R^{2}=1-\frac{MSE(model)}{MSE(baseline)} \,\,\,\,\, \Rightarrow \,\,\,\,\, VIF=\frac{1}{1-R^{2}}$
The VIF value starts from 1 and has no upper limit. It is suggested to keep features with VIF<5. However, we should not remove all features with VIF>5 at once: we should first remove the feature(s) with the highest VIF, then recalculate the VIF. This process is repeated until reasonable VIF values are obtained. See Medium and Correlation and Collinearity for more information.
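Before using statsmodels' variance_inflation_factor below, here is a minimal sketch of the definition itself: regress each feature on the remaining features with scikit-learn and apply $VIF=1/(1-R^{2})$. It assumes the standardized df from the cells above (the helper variable X_vif is just for this sketch); because LinearRegression fits an intercept while variance_inflation_factor does not add one, the numbers may differ slightly from the statsmodels results that follow.
from sklearn.linear_model import LinearRegression
# VIF by definition: regress each feature on all the other features
X_vif = df.drop(['Multi-Classes'], axis=1)
for col in X_vif.columns:
    others = X_vif.drop(col, axis=1)
    r2 = LinearRegression().fit(others, X_vif[col]).score(others, X_vif[col])
    vif = np.inf if r2 >= 1 else 1.0/(1.0 - r2)
    print(f'{col}: R2={r2:.3f}, VIF={vif:.1f}')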
df
vif_feature=df.drop(['Multi-Classes'], axis=1)
%%time
vif_info = pd.DataFrame()
vif_info['VIF'] = [variance_inflation_factor(vif_feature.values, i) for i in range(vif_feature.values.shape[1])]
vif_info['Column'] = vif_feature.columns
vif_info.sort_values('VIF', ascending=False)
vif_info[vif_info.VIF>150].sort_values('VIF', ascending=False)
The top three features with very high (essentially infinite) VIF are Surface Area, Wall Area, and Roof Area. We can remove these features and calculate the VIF again:
%%time
# Calculate VIF after removing the three features with very high VIF
colmns=list(vif_info[vif_info.VIF>150].sort_values('VIF', ascending=False)['Column'])
vif_feature_new=vif_feature.drop(colmns,axis=1)
vif_info_new = pd.DataFrame()
vif_info_new['VIF'] = [variance_inflation_factor(vif_feature_new.values, i) for i in range(vif_feature_new.shape[1])]
vif_info_new['Column'] = vif_feature_new.columns
vif_info_new.sort_values('VIF', ascending=False)
Removing the top three features with very high VIF removes the multicollinearity. Note that we should NOT drop all features with high VIF at once; features should be removed sequentially, as explained above.
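This sequential procedure can be written as a simple loop: repeatedly compute the VIFs and drop the single worst feature until every remaining VIF is below a chosen threshold. Below is a minimal sketch; the helper name drop_high_vif, the threshold of 5, and the reuse of vif_feature from above are illustrative assumptions.
def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the feature with the highest VIF until all VIFs <= threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                         index=X.columns)
        worst = vifs.idxmax()
        if vifs[worst] <= threshold or X.shape[1] == 1:
            return X, vifs.sort_values(ascending=False)
        print(f'Dropping {worst} (VIF={vifs[worst]:.1f})')
        X = X.drop(worst, axis=1)
X_reduced, final_vifs = drop_high_vif(vif_feature, threshold=5.0)
final_vifs
Next, we build a synthetic example to illustrate how multicollinearity can flip the sign of a regression coefficient.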
import scipy.linalg
from scipy.stats import norm
import scipy.stats
# Make a synthetic correlation matrix for 4 properties
corr=np.zeros((4,4))
corr[0,0]=corr[1,1]=corr[2,2]=corr[3,3]=1.0
#
corr[0,1]=corr[1,0]=0.75
corr[0,2]=corr[2,0]=-0.4
corr[0,3]=corr[3,0]=0.75
#
corr[1,2]=corr[2,1]=0.12
corr[1,3]=corr[3,1]=0.40
#
corr[2,3]=corr[3,2]=-0.54
# Make synthetic data for simulation
nsim=10000 # number of simulation
val=[]
mu=5; sigma=1
# Normal distribution
val.append(np.random.normal(mu, sigma, nsim))
# Lognormal distribution
val.append(np.random.lognormal(0.58, 0.5, nsim))
# Triangular distribution
val.append(np.random.triangular(0,2.5, 5, nsim))
# Triangular distribution
val.append(np.random.triangular(0,0.5, 5, nsim))
# Order the random distributions of variable to respect the correlation between them
np.random.seed(42)
matrix=corr
L=scipy.linalg.cholesky(matrix, lower=True, overwrite_a=True)
mu=0; sigma=1; nvar=len(matrix)
w=np.zeros((nvar,nsim))
# Draw one standard normal sample vector per variable
for i in range(nvar):
    w[i,:] = np.random.normal(mu, sigma, nsim)
Sim_R_val=[]
N_var=[]
for i in range(nsim):
    tmp = np.matmul(L, w[:,i])
    N_var.append(tmp)
N_var=np.array(N_var).transpose()
#
Sim_R_val=[]
for i1 in range(nvar):
    R_tmp=np.zeros(len(val[i1]))
    tmp=np.sort(N_var[i1])
    tmp_1=np.sort(val[i1])
    for i2 in range(nsim):
        index1=np.where(N_var[i1]==tmp[i2])[0]
        index2=np.where(val[i1]==tmp_1[i2])[0]
        R_tmp[index1[0]]=val[i1][index2[0]]
    Sim_R_val.append(R_tmp)
#
columns=['Feature ' +str(i+1) for i in range(3)]+['Target']
df=pd.DataFrame({columns[0]:np.array(Sim_R_val[0]),columns[1]:np.array(Sim_R_val[1]),
columns[2]:np.array(Sim_R_val[2]),columns[3]:np.array(Sim_R_val[3])},columns=columns)
font = {'size' :14 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 9), dpi= 80, facecolor='w', edgecolor='k')
# Make cross plot matrix
cplotmatrix(df,alpha=0.05,marker='ro',missin_rep=False, font=font)
# Correlation matrix
font = {'size' : 13}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 6), dpi= 80, facecolor='w', edgecolor='k')
corr_array=Correlation_plot.corr_mat(df,
title=f'Correlation Matrix',corr_val_font=14,
titlefontsize=14,xy_title=[0.5,-1.8],xyfontsize = 12,vlim=[-1, 1],axt=ax)
target='Target'
scaler = StandardScaler()
features_colums=df.drop([target],axis=1).columns
X_train_std=scaler.fit_transform(df[features_colums])
y_train=df[target]
ols_model=sm.OLS(y_train,X_train_std)
result=ols_model.fit()
print(result.summary())
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
Correlation_plot.corr_bar(result.params.values,clmns=features_colums,select=False,yfontsize=6.0,xlim = [-0.8, 1.2],
title=f'Feature Importance by Linear Regression for Target',ymax_vert_lin=30)
# Plot the importance of features
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(np.array(result.pvalues), title='p-value for Linear Regression Coefficient of Features',
ylabel='p-value',
clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(fontsizelable=7,axt=ax1,
xshift=-0.1,yshift=0.00009,ylim=[0,0.005],y_rot=0)
Feature 2 is positively correlated with the Target, but the linear regression gives it a negative coefficient. This happens because of multicollinearity.
The variance inflation factor confirms this:
vif_feature=df.drop([target],axis=1)
vif_info = pd.DataFrame()
vif_info['VIF'] = [variance_inflation_factor(vif_feature.values, i) for i in range(vif_feature.shape[1])]
vif_info['Column'] = vif_feature.columns
vif_info.sort_values('VIF', ascending=False)
The synthetic example below shows the correlation between 4 features and a target.
# Make a synthetic correlation matrix for 4 features and a target (5 properties)
corr=np.zeros((5,5))
corr[0,0]=corr[1,1]=corr[2,2]=corr[3,3]=corr[4,4]=1.0
corr[0,1]=corr[1,0]=0.05
corr[0,2]=corr[2,0]=0.043
corr[0,3]=corr[3,0]=0.071
corr[0,4]=corr[4,0]=0.36
#
corr[1,2]=corr[2,1]=0.69
corr[1,3]=corr[3,1]=0.34
corr[1,4]=corr[4,1]=-0.42
#
corr[2,3]=corr[3,2]=0.74
corr[2,4]=corr[4,2]=-0.64
#
corr[3,4]=corr[4,3]=-0.77
# Make synthetic data for simulation
nsim=10000 # number of simulation
val=[]
mu=5; sigma=1
val.append(np.random.normal(mu, sigma, nsim))
mu=90; sigma=16
val.append(np.random.normal(mu, sigma, nsim))
mu=60; sigma=10
val.append(np.random.normal(mu, sigma, nsim))
mu=30; sigma=4
val.append(np.random.normal(mu, sigma, nsim))
mu=10; sigma=2
val.append(np.random.normal(mu, sigma, nsim))
# Order the random distributions of variable to respect the correlation between them
np.random.seed(42)
matrix=corr
L=scipy.linalg.cholesky(matrix, lower=True, overwrite_a=True)
mu=0; sigma=1; nvar=len(matrix)
w=np.zeros((nvar,nsim))
# Draw one standard normal sample vector per variable
for i in range(nvar):
    w[i,:] = np.random.normal(mu, sigma, nsim)
Sim_R_val=[]
N_var=[]
for i in range(nsim):
    tmp = np.matmul(L, w[:,i])
    N_var.append(tmp)
N_var=np.array(N_var).transpose()
#
Sim_R_val=[]
for i1 in range(nvar):
    R_tmp=np.zeros(len(val[i1]))
    tmp=np.sort(N_var[i1])
    tmp_1=np.sort(val[i1])
    for i2 in range(nsim):
        index1=np.where(N_var[i1]==tmp[i2])[0]
        index2=np.where(val[i1]==tmp_1[i2])[0]
        R_tmp[index1[0]]=val[i1][index2[0]]
    Sim_R_val.append(R_tmp)
#
columns=['Feature ' +str(i+1) for i in range(4)]+['Target']
df=pd.DataFrame({columns[0]:np.array(Sim_R_val[0]),columns[1]:np.array(Sim_R_val[1]),
columns[2]:np.array(Sim_R_val[2]),columns[3]:np.array(Sim_R_val[3]),
columns[4]:np.array(Sim_R_val[4])},columns=columns)
font = {'size' :11 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 9), dpi= 80, facecolor='w', edgecolor='k')
# Make cross plot matrix
cplotmatrix(df,alpha=0.05,marker='ro',missin_rep=False,font=font)
# Correlation matrix
font = {'size' : 13}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 6), dpi= 110, facecolor='w', edgecolor='k')
corr_array=Correlation_plot.corr_mat(df,
title=f'Correlation Matrix',corr_val_font=14,
titlefontsize=14,xy_title=[0.5,-1.8],xyfontsize = 12,vlim=[-1, 1],axt=ax)
target='Target'
scaler = StandardScaler()
features_colums=df.drop([target],axis=1).columns
X_train_std=scaler.fit_transform(df[features_colums])
y_train=df[target]
ols_model=sm.OLS(y_train,X_train_std)
result=ols_model.fit()
print(result.summary())
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
Correlation_plot.corr_bar(result.params.values,clmns=features_colums,select=False,yfontsize=6.0,xlim = [-1.6, 0.9],
title=f'Feature Importance by Linear Regression for Target',ymax_vert_lin=30)
# Plot the importance of features
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(np.array(result.pvalues), title='p-value for Linear Regression Coefficient of Features',
ylabel='p-value',
clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(fontsizelable=7,axt=ax1,
xshift=-0.1,yshift=0.009,ylim=[0,0.65],y_rot=0)
Feature 3 is negatively correlated with the Target, but the linear regression gives it a positive coefficient. This happens because of multicollinearity. Moreover, Feature 3 gets a high p-value, which makes it appear insignificant. This is not a problem for prediction, but it misleads interpretation.
The variance inflation factor confirms this:
vif_feature=df.drop([target],axis=1)
vif_info = pd.DataFrame()
vif_info['VIF'] = [variance_inflation_factor(vif_feature.values, i) for i in range(vif_feature.shape[1])]
vif_info['Column'] = vif_feature.columns
vif_info.sort_values('VIF', ascending=False)