Real datasets for Machine Learning applications can be found in the UCI Machine Learning Repository. These datasets have already been processed and cleaned, so they are ready to feed to Machine Learning algorithms. The Energy Efficiency Data Set is used in this lecture. Energy analysis was performed for 768 simulated building shapes with respect to 8 features, including Wall Area, Overall Height, Glazing Area, and Orientation, to predict Heating Load and Cooling Load. The work was published by Tsanas and Xifara (2012) in the journal Energy and Buildings. The dataset can be used for both regression and classification. In this lecture, we apply binary classification to Heating Load, which is the amount of heating that a building needs in order to maintain the indoor temperature at established levels. I added two columns to the dataset in which Heating Load is divided into binary classes and multiple classes. Let's look at the dataset.
import pandas as pd
df = pd.read_csv('./Data/Building_Heating_Load.csv',na_values=['NA','?',' '])
df[0:5]
The info() function shows that there are no missing values in the dataset.
df.info()
The value_counts() function gives the number of instances of each class in the dataset.
counts=df['Binary Classes'].value_counts()
counts/len(df['Binary Classes'])
gb_mean=df.groupby('Binary Classes')['Heating Load'].mean()
gb_mean
df.describe()
# Correlation is only defined for numeric columns, so the text label columns are excluded
df.select_dtypes('number').corr()
import matplotlib
import matplotlib.pyplot as plt
def corr_bar(df,title):
    """Plot the correlation of each numeric attribute with the last numeric column"""
    corr=df.select_dtypes('number').corr()
    Colms_sh=list(corr.columns)
    coefs=corr.values[:,-1][:-1]
    names=Colms_sh[:-1]
    r_ = pd.DataFrame( { 'coef': coefs, 'positive': coefs>=0 }, index = names )
    r_ = r_.sort_values(by=['coef'])
    r_['coef'].plot(kind='barh', color=r_['positive'].map({True: 'b', False: 'r'}))
    plt.xlabel('Correlation Coefficient',fontsize=6)
    plt.vlines(x=0,ymin=-0.5, ymax=10, color = 'k',linewidth=0.8,linestyle="dashed")
    plt.title(title)
    plt.show()
#
font = {'size' : 5}
matplotlib.rc('font', **font)
# plt.subplots() returns the figure first, then the axes
fig,ax1 = plt.subplots(figsize=(2.8, 3), dpi= 200, facecolor='w', edgecolor='k')
# Plot correlations of attributes with the last column
corr_bar(df,title='Correlation with Heating Load')
For now, we simplify the problem and work with binary classification, so the Heating Load and Multi-Classes columns should be removed.
df_binary=df.copy()
df_binary.drop(['Heating Load','Multi-Classes'], axis=1, inplace=True)
df_binary[0:5]
We should convert the text labels Low Level and High Level to numbers. Since there are only two categories, we can simply convert Low Level to 0 and High Level to 1 using the Pandas replace() function.
df_binary['Binary Classes']=df_binary['Binary Classes'].replace('Low Level', 0)
df_binary['Binary Classes']=df_binary['Binary Classes'].replace('High Level', 1)
df_binary[0:5]
Shuffle the data to avoid any bias or pattern in the split datasets. Some learning algorithms are very sensitive to the order of the training data, and they perform poorly if they get many similar instances in a row. Shuffling makes sure that this does not occur.
import numpy as np
np.random.seed(32)
df_binary=df_binary.reindex(np.random.permutation(df_binary.index))
df_binary.reset_index(inplace=True, drop=True)
df_binary[0:10]
Make a histogram of all features:
import matplotlib
import matplotlib.pyplot as plt
df_binary.hist(bins=15, layout=(3, 3), figsize=(15,10))
plt.show()
Wait! You should always set aside a test set before looking at the dataset closely. We use stratified sampling based on the "Binary Classes" column with Scikit-Learn's StratifiedShuffleSplit class, so that the training and test sets have the same proportion of each class.
from sklearn.model_selection import StratifiedShuffleSplit
# Training and Test
spt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in spt.split(df_binary, df_binary['Binary Classes']):
    train_set_strat = df_binary.loc[train_idx]
    test_set_strat = df_binary.loc[test_idx]
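The effect of stratified sampling can be checked on a small synthetic table. The sketch below is only an illustration: the DataFrame `toy`, its column names, and the 80/20 class ratio are assumptions made for this example, not the lecture data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the lecture data: 80% class 0, 20% class 1 (assumed ratio)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=500),
    "label": np.repeat([0, 1], [400, 100]),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["label"]))

# split() yields positional indices, so iloc is the safe indexer here
toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Fraction of class 1 in each subset; both should stay close to 0.2
train_ratio = toy_train["label"].mean()
test_ratio = toy_test["label"].mean()
print(train_ratio, test_ratio)
```

If the two printed ratios are close to the ratio in the full table, the split is representative of both classes.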
Plot a histogram of 'Binary Classes' for each set to make sure the training and test sets have the same ratio of classes.
font = {'size' : 10}
matplotlib.rc('font', **font)
fig = plt.figure(figsize=(10, 3), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
train_set_strat['Binary Classes'].hist(bins=15)
plt.title("Training Set")
plt.xlabel("Binary Classes")
plt.ylabel("Frequency")
ax2=plt.subplot(1,2,2)
test_set_strat['Binary Classes'].hist(bins=15)
plt.title("Test Set")
plt.xlabel("Binary Classes")
plt.ylabel("Frequency")
plt.show()
Now remove the Binary Classes column from the features and keep a copy of it as the target.
# Note that drop() creates a copy and does not affect train_set_strat
X_train = train_set_strat.drop("Binary Classes", axis=1)
y_train = train_set_strat["Binary Classes"].values
Next, standardize the training data. Be careful: you should not scale/standardize categorical variables or the target.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train=X_train.copy()
X_train_Std=scaler.fit_transform(X_train)
# Training Data
X_train_Std[0:5]
# Target values
y_train[0:5]
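One detail worth keeping in mind is that the scaler should learn its mean and standard deviation from the training data only, and the same fitted scaler is then reused for any later data. The following is a minimal sketch on random numbers; the arrays `X_tr` and `X_te` and the name `scaler_demo` are hypothetical and only stand in for training and test features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical training features
X_te = rng.normal(loc=50.0, scale=10.0, size=(50, 3))   # hypothetical test features

scaler_demo = StandardScaler()
X_tr_std = scaler_demo.fit_transform(X_tr)  # learn mean/std from the training data only
X_te_std = scaler_demo.transform(X_te)      # reuse the training mean/std, no refitting

# Training columns now have mean 0 and standard deviation 1
print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
```

The test columns will not have exactly zero mean and unit variance, which is expected because they are scaled with the training statistics.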
Okay, the data is ready to feed to a Machine Learning algorithm. Let's choose a simple classifier and train it. The Stochastic Gradient Descent (SGD) classifier is a good place to start. This classifier handles very large datasets efficiently, since SGD deals with training instances one at a time rather than with all of them at each iteration. We will talk about SGD in detail in the next lecture. The following code applies SGD with Scikit-Learn's SGDClassifier class to the prepared dataset:
from sklearn.linear_model import SGDClassifier
# Call SGD classifier
sgd_clf = SGDClassifier(random_state=42)
# Train SGD classifier
sgd_clf.fit(X_train_Std, y_train)
# Apply prediction for the first 3 training instances
sgd_clf.predict(X_train_Std[:3])
# Let's look at the real values (classes)
y_train[:3]
The model correctly predicts the first and second instances as 0 and 1, but incorrectly predicts the last instance as 1. Now, let's see how we can evaluate the performance of the model.
In the last lecture we discussed how to evaluate a regressor (regression for estimation of oil production) using the Root Mean Square Error. However, evaluating a classifier is often significantly more challenging than evaluating a regressor, so we will spend more time on this topic. There are several performance measures available.
As discussed in the last lecture, a good way to evaluate a model is to use cross-validation. We use the cross_val_score() function to evaluate the SGDClassifier model using K-fold cross-validation with 4 folds. Do you remember K-fold cross-validation from the previous lecture? It means splitting the training set into K folds (here 4), then evaluating each fold using a model trained on the remaining folds: 1 fold for validation and 3 folds for training. The accuracy is then calculated for each fold. Accuracy is the number of correct predictions divided by the total number of instances.
from sklearn.model_selection import cross_val_score
Accuracies=cross_val_score(sgd_clf,X_train_Std,y_train, cv=4, scoring="accuracy")
Accuracies
np.mean(Accuracies)
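To make the K-fold procedure concrete, the loop below reproduces by hand what cross_val_score() does for a classifier when cv=4 (for classifiers, an integer cv uses stratified folds without shuffling). It is a sketch on synthetic data generated with make_classification, which is an assumption for this example; the names `X_demo`, `y_demo`, and `manual_scores` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary data standing in for the standardized training set
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
clf = SGDClassifier(random_state=42)

skf = StratifiedKFold(n_splits=4)
manual_scores = []
for tr_idx, val_idx in skf.split(X_demo, y_demo):
    fold_clf = clone(clf)                          # fresh, untrained copy for each fold
    fold_clf.fit(X_demo[tr_idx], y_demo[tr_idx])   # train on 3 folds
    y_pred = fold_clf.predict(X_demo[val_idx])     # predict the held-out fold
    manual_scores.append(np.mean(y_pred == y_demo[val_idx]))  # fold accuracy

auto_scores = cross_val_score(clf, X_demo, y_demo, cv=4, scoring="accuracy")
print(manual_scores)
print(auto_scores)
```

Both printouts should list the same four accuracies, one per validation fold.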
Wow! Judging by the accuracy (the ratio of correct predictions), the classifier seems to have done a great job. Before we get too excited, let's apply a random classifier. It is always a good idea to apply a dummy classifier to compare with trained Machine Learning models.
from sklearn.dummy import DummyClassifier
import warnings
warnings.filterwarnings('ignore')
# Apply a random classifier that guesses classes according to their training frequencies
dmy_clf = DummyClassifier(strategy='stratified', random_state=42)
dmy_clf.fit(X_train_Std,y_train)
Accuracies=cross_val_score(dmy_clf,X_train_Std,y_train, cv=4, scoring="accuracy")
Accuracies
np.mean(Accuracies)
Okay, applying Machine Learning has improved the average accuracy from 68% for the random classifier to 87% for SGD (a relative improvement of about 28%). However, this improvement may not be satisfying. You can still do much better. The random classifier reaches an average accuracy of 68% simply because about 80% of the instances are class 0, so guessing class 0 most of the time is right most of the time. This leads to high accuracy in general, so accuracy is generally not the preferred performance measure for classifiers, especially for skewed datasets (when one class is much more frequent than the others).
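The 68% baseline can be checked with a few lines of arithmetic. The sketch below assumes a class-0 share of exactly 80%, which is the approximate figure quoted above, and a random classifier that guesses each class with its own frequency; the variable names are illustrative.

```python
p0 = 0.80        # approximate share of class 0 quoted in the text (assumed exact here)
p1 = 1.0 - p0    # share of class 1

# Random guessing in proportion to class frequency:
# correct when guess and truth are both 0, or both 1
acc_random = p0 * p0 + p1 * p1

# Always predicting the majority class is correct whenever the truth is class 0
acc_majority = max(p0, p1)

# Relative improvement of SGD (87%) over the random baseline (68%)
rel_gain = (0.87 - 0.68) / 0.68

print(round(acc_random, 2), round(acc_majority, 2), round(rel_gain, 2))
```

Under these assumptions the random baseline is 0.68, always guessing class 0 would already give 0.80, and the relative gain of SGD over the random baseline is about 0.28.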
A confusion matrix is a much better way to evaluate the performance of a classifier. The general idea is to count the number of times that, for example, instances of class 0 are misclassified as class 1 and vice versa. In a confusion matrix, each column represents a predicted class, while each row represents an actual class. A set of predictions is required to calculate the confusion matrix. K-fold cross-validation (the cross_val_predict() function) is applied to get clean predictions, meaning that each prediction is made by a model that never saw that instance during training. Like cross_val_score(), cross_val_predict() performs K-fold cross-validation, but it returns the predictions made on each validation fold instead of the evaluation scores:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf,X_train_Std,y_train, cv=4)
Now the confusion matrix can be calculated using the confusion_matrix() function. See the code below:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, y_train_pred)
Let's make a nice plot of the confusion matrix using the function below:
def Conf_Matrix(predictor,x_train,y_train,perfect,sdt,axt=None):
    '''Plot confusion matrix (perfect=1: ideal classifier, sdt=1: normalize each row)'''
    ax1 = axt or plt.axes()
    y_train_pred = cross_val_predict(predictor,x_train,y_train, cv=4)
    # For the "perfect" plot, use the true labels as predictions
    if(perfect==1): y_train_pred=y_train
    conf_mx=confusion_matrix(y_train, y_train_pred)
    ii=0
    if(len(conf_mx)<4):
        # Binary case: label the four cells as TN, FP, FN, TP
        im =ax1.matshow(conf_mx, cmap='jet', interpolation='nearest')
        x=['Predicted\nNegative', 'Predicted\nPositive']; y=['Actual\nNegative', 'Actual\nPositive']
        for (i, j), z in np.ndenumerate(conf_mx):
            if(ii==0): al='TN= '
            if(ii==1): al='FP= '
            if(ii==2): al='FN= '
            if(ii==3): al='TP= '
            ax1.text(j, i, al+'{:0.0f}'.format(z), ha='center', va='center', fontweight='bold',fontsize=8.5)
            ii=ii+1
        ax1.set_xticks(np.arange(len(x)))
        ax1.set_xticklabels(x,fontsize=6.5,y=0.97, rotation='horizontal')
        ax1.set_yticks(np.arange(len(y)))
        ax1.set_yticklabels(y,fontsize=6.5,x=0.035, rotation='horizontal')
    else:
        # Multiclass case: optionally divide each row by the number of actual instances
        if(sdt==1):
            row_sums = conf_mx.sum(axis=1, keepdims=True)
            norm_confmx = conf_mx / row_sums
        else:
            norm_confmx=conf_mx
        im =ax1.matshow(norm_confmx, cmap='jet', interpolation='nearest')
        for (i, j), z in np.ndenumerate(norm_confmx):
            if(sdt==1): ax1.text(j, i, '{:0.2f}'.format(z), ha='center', va='center', fontweight='bold')
            else: ax1.text(j, i, '{:0.0f}'.format(z), ha='center', va='center', fontweight='bold')
    cbar =plt.colorbar(im,shrink=0.3,orientation='vertical')
train_set_strat["Binary Classes"].value_counts()
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_train_pred)
font = {'size' : 6}
matplotlib.rc('font', **font)
fig = plt.figure(figsize=(4.5, 4.5), dpi= 190, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
Conf_Matrix(sgd_clf,X_train_Std,y_train,perfect=0,sdt=0,axt=ax1)
The first row of the confusion matrix represents the actual negative class (class 0): 468 negative instances were correctly classified (True Negatives (TN)=468), while 23 negative instances were wrongly classified as positive (False Positives (FP)=23). The second row represents the actual positive class (class 1): 53 positive instances were wrongly classified as negative (False Negatives (FN)=53), while 70 were correctly classified as positive (True Positives (TP)=70).
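As a quick consistency check, the four counts quoted above can be put back into a 2x2 array and unpacked in the order TN, FP, FN, TP. This is a small sketch that uses only those quoted counts; the array name `conf_mx_demo` is illustrative.

```python
import numpy as np

# Counts quoted in the text (rows: actual class, columns: predicted class)
conf_mx_demo = np.array([[468, 23],
                         [53, 70]])

# Row-major flattening of a binary confusion matrix gives TN, FP, FN, TP
tn, fp, fn, tp = conf_mx_demo.ravel()

# Accuracy is the share of diagonal (correct) predictions
accuracy = (tp + tn) / conf_mx_demo.sum()
print(tn, fp, fn, tp, round(accuracy, 3))
```

The resulting accuracy of about 0.876 agrees with the average cross-validation accuracy of roughly 87% reported earlier.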
A perfect classifier should have only true positives and true negatives, so its confusion matrix has nonzero values only on the diagonal (see figure):
font = {'size' : 6}
matplotlib.rc('font', **font)
fig = plt.figure(figsize=(4.5, 4.5), dpi= 190, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
Conf_Matrix(sgd_clf,X_train_Std,y_train,perfect=1,sdt=0,axt=ax1)
plt.show()
Although the confusion matrix provides a lot of information, sometimes a more concise metric is preferred:
Precision is calculated by $\frac{TP}{(TP+FP)}$, where TP is the number of true positives and FP is the number of false positives. For example, for a medical test, precision answers the following question: how many of those we labeled as diabetic are actually diabetic?
A naive way to get perfect precision is to make one single positive prediction and make sure that it is correct (precision = 1/1 = 100%). However, this is not a useful way to measure performance, since the classifier ignores all but one positive instance. For the medical test example, if 1 person is correctly labeled as diabetic (TP=1) and nobody is wrongly labeled as diabetic (FP=0), we have precision = 1/1 = 100%. But wait! What about the people who are diabetic but were wrongly labeled as negative (FN)? For this reason, precision is typically used along with another metric named recall, also called sensitivity, calculated by $\frac{TP}{(TP+FN)}$, where TP is the number of true positives and FN is the number of false negatives. Again, for the medical test example: of all the people who are diabetic, how many did we correctly predict?
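Both formulas can be evaluated by hand. The sketch below plugs in the counts from the confusion matrix described earlier (TP=70, FP=23, FN=53); treating those as the final counts is an assumption of this example, and the variable names are illustrative.

```python
# Counts taken from the confusion matrix described earlier in the text
tp, fp, fn = 70, 23, 53

# Precision: of all instances predicted positive, the share that is truly positive
precision_manual = tp / (tp + fp)

# Recall (sensitivity): of all truly positive instances, the share that was found
recall_manual = tp / (tp + fn)

print(round(precision_manual, 3), round(recall_manual, 3))
```

With these counts, precision is about 0.753 and recall about 0.569, both clearly below the accuracy.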
Although precision and sensitivity can be calculated manually, Scikit-Learn provides functions for them. Let's calculate these two metrics for the Energy Efficiency dataset:
from sklearn.metrics import precision_score, recall_score
precision=precision_score(y_train, y_train_pred) # Precision = TP / (TP + FP)
print('Precision= ',precision)
recall=recall_score(y_train, y_train_pred) # Recall = TP / (TP + FN)
print('Recall (sensitivity)= ',recall)
The calculated precision and recall are not satisfying and are lower than the accuracy, so accuracy alone is not a reliable performance measure. For this example, precision means how many of the instances predicted as class 1 (positive) are actually class 1. Recall, or sensitivity, means how many of all class 1 instances in the dataset are correctly predicted as class 1.
The harmonic mean of precision and sensitivity can be used as another metric, called the F1 score:
$\large F_{1}=\frac{2}{\frac{1}{Precision}+\frac{1}{Sensitivity}}$
To calculate the F1 score, simply call Scikit-Learn's f1_score() function:
from sklearn.metrics import f1_score
f1_score(y_train, y_train_pred)
The F1 score gives much more weight to low values, so a classifier only achieves a high F1 score if both precision and sensitivity are high. The F1 score therefore favors classifiers with similar precision and recall. However, this is not always what you want: in some contexts precision matters most, and in other contexts recall is the first priority. For example, if you are training a classifier to detect videos that are safe for kids, you may prefer a classifier that rejects many good videos (low sensitivity: high False Negative) but keeps only safe ones (high precision: very low False Positive), rather than a classifier with high sensitivity that lets a few inappropriate videos show up in your product.
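The way the harmonic mean penalizes a low value can be seen with two made-up precision/recall pairs. The helper `f1_from` and the numbers below are illustrative assumptions, not results from the dataset.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both must be > 0)."""
    return 2.0 / (1.0 / precision + 1.0 / recall)

# Balanced case: both metrics equal, so F1 equals that common value
balanced = f1_from(0.75, 0.75)

# Skewed case: very high precision but low recall
skewed = f1_from(0.99, 0.30)

# The arithmetic mean of the skewed pair, for comparison
arith = (0.99 + 0.30) / 2

print(round(balanced, 3), round(skewed, 3), round(arith, 3))
```

For the skewed pair the F1 score is about 0.46, well below the arithmetic mean of 0.645, because the harmonic mean is pulled toward the smaller of the two values.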
On the other hand, if you are training a classifier to detect shoplifters in surveillance images, it is probably fine if your classifier has only 30% precision (high False Positive) as long as it has 99% sensitivity (very low False Negative). The security guards will get some false alerts, for which they can apologize, but almost all shoplifters will get caught (Aurélien Géron, 2019).
So, whether high precision or high sensitivity is required depends on the project's expectations. You should tune the Machine Learning parameters, or select an algorithm, to achieve a high value of the required metric.
So far we applied a binary classifier, which distinguishes between two classes. You should apply multiclass classifiers (multinomial classifiers) if you want to distinguish between more than two classes. Some Machine Learning algorithms are strictly designed for binary classification (Support Vector Machine); other classifiers (Random Forest) can handle multiple classes directly. However, there are different approaches that let you perform multiclass classification using multiple binary classifiers.
For example, the Energy Efficiency dataset also has a Multi-Classes column for Heating Load, with values from Level 1 to Level 4:
df_multi=df.copy()
df_multi[:5]
# Percentage of each Class
counts=df_multi['Multi-Classes'].value_counts()
counts
counts/len(df['Multi-Classes'])
# Mean of Heating Load for each Class
gb_mean=df_multi.groupby('Multi-Classes')['Heating Load'].mean()
gb_mean
Now use Scikit-Learn's OrdinalEncoder class to convert the classes Level 1 to Level 4 to the numbers 0 to 3.
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
Multi_Classes_encoded = ordinal_encoder.fit_transform(df_multi[['Multi-Classes']])
Multi_Classes_encoded[0:5]
df_multi['Multi-Classes']=Multi_Classes_encoded
df_multi[0:5]
Let's see the percentage of each class number:
counts=df_multi['Multi-Classes'].value_counts()
counts
counts/len(df['Multi-Classes'])
One way to create a system that can classify the Energy Efficiency dataset into 4 classes (from 0 to 3) is to train 4 binary classifiers, one for each class (a 0-detector, a 1-detector, a 2-detector, and a 3-detector). For example, to build the detector for class 3, you assign class 3 as 1 and all other classes as 0. This process is repeated for every class. To classify an instance, you get the decision score from each classifier and select the class whose classifier outputs the highest score. This approach is called one-versus-all (OvA).
Training a binary classifier for every pair of classes is another strategy: one to distinguish classes 0 and 1, another to distinguish classes 0 and 2, another for classes 1 and 2, and so on. This approach is called one-versus-one (OvO). You need to train $ \frac{n\times (n-1)}{2}$ classifiers for n classes. For the Energy Efficiency problem, we should train 6 binary classifiers. With 10 classes, you would need 45 classifiers, and to classify an instance you would run all of them to see which class is predicted most often! This is a lot of work. However, each OvO classifier only needs to be trained on the part of the training set that belongs to the two classes it distinguishes. This is the main advantage of OvO for large datasets, since it is faster to train many classifiers on small training sets than to train a few classifiers on a large training set.
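The number of OvO classifiers can be verified by simply enumerating the class pairs. This is a small sketch; the helper name `ovo_pairs` is made up for this example.

```python
from itertools import combinations

def ovo_pairs(n_classes):
    """List every unordered pair of class labels 0..n_classes-1."""
    return list(combinations(range(n_classes), 2))

pairs_4 = ovo_pairs(4)    # the four classes of the Energy Efficiency problem
pairs_10 = ovo_pairs(10)  # a 10-class problem for comparison

print(len(pairs_4), pairs_4)
print(len(pairs_10))
```

The counts match the formula n(n-1)/2: 6 pairs for 4 classes and 45 pairs for 10 classes.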
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO) (Aurélien Géron, 2019).
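If you want to choose the strategy yourself, Scikit-Learn also provides the OneVsRestClassifier and OneVsOneClassifier wrappers in sklearn.multiclass. The sketch below uses synthetic 4-class data from make_classification as a stand-in for the lecture data; the data and the names `X_demo`, `y_demo`, `ova`, and `ovo` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class data standing in for the standardized training set
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=5,
                                     n_classes=4, random_state=0)

# One binary classifier per class (OvA, also called one-versus-rest)
ova = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X_demo, y_demo)

# One binary classifier per pair of classes (OvO)
ovo = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X_demo, y_demo)

print(len(ova.estimators_), len(ovo.estimators_))
```

The fitted wrappers hold 4 and 6 underlying binary classifiers respectively, which matches the counts discussed above.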
Let's use SGDClassifier. First we should remove the unnecessary columns 'Heating Load' and 'Binary Classes' from the data.
df_multi.drop(['Heating Load','Binary Classes'], axis=1, inplace=True)
df_multi[0:5]
# Shuffle data
np.random.seed(32)
df_multi=df_multi.reindex(np.random.permutation(df_multi.index))
df_multi.reset_index(inplace=True, drop=True)
df_multi[0:10]
Divide the data into training and test sets using StratifiedShuffleSplit to make sure we have representative training and test sets with the same percentage of each class.
# Training and Test
spt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in spt.split(df_multi, df_multi['Multi-Classes']):
    train_set_strat = df_multi.loc[train_idx]
    test_set_strat = df_multi.loc[test_idx]
Plot histograms of the classes for the training and test sets to make sure we have the correct ratio of classes.
font = {'size' : 10}
matplotlib.rc('font', **font)
fig = plt.figure(figsize=(10, 3), dpi= 100, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
train_set_strat['Multi-Classes'].hist(bins=15)
plt.title("Training Set")
plt.xlabel("Multi-Classes")
plt.ylabel("Frequency")
ax2=plt.subplot(1,2,2)
test_set_strat['Multi-Classes'].hist(bins=15)
plt.title("Test Set")
plt.xlabel("Multi-Classes")
plt.ylabel("Frequency")
plt.show()
Remove the "Multi-Classes" column from the features and keep it as a separate target array:
# Note that drop() creates a copy and does not affect train_set_strat
X_train = train_set_strat.drop("Multi-Classes", axis=1)
y_train = train_set_strat["Multi-Classes"].values
Next, standardize the data with StandardScaler as we did before:
scaler = StandardScaler()
X_train=X_train.copy()
X_train_Std=scaler.fit_transform(X_train)
# Training Data
X_train_Std[0:5]
Now the data is ready for multiclass classification. Let’s use SGDClassifier:
# Call SGD classifier
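# The logistic loss is named 'log_loss' in current scikit-learn ('log' in versions before 1.1)
sgd_clf = SGDClassifier(random_state=42,loss='log_loss')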
# Train SGD classifier
sgd_clf.fit(X_train_Std, y_train)
# Apply prediction for 10 data instances of training
sgd_clf.predict(X_train_Std[:10])
# Let's look at the real values (classes)
y_train[:10]
This seems so simple! The code trains the SGDClassifier on the training set using the target classes from 0 to 3. Under the hood, Scikit-Learn trains 4 binary classifiers (OvA), gets the decision score of each one for an instance, and selects the class with the highest score.
The following code calculates the score for each class from 0 to 3. The class with the highest score is selected.
scores = sgd_clf.decision_function(X_train_Std[:1])
scores
print ('Predicted class: ', np.argmax(scores))
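The link between decision scores and predictions can be checked for several instances at once. The sketch below uses synthetic 4-class data from make_classification as a stand-in for the lecture data (an assumption of this example) and maps the column index of the highest score back to a label through the fitted classifier's classes_ attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic 4-class data standing in for the standardized training set
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=5,
                                     n_classes=4, random_state=0)
clf = SGDClassifier(random_state=42).fit(X_demo, y_demo)

# One row per instance, one column of decision scores per class
scores_demo = clf.decision_function(X_demo[:5])

# Column index of the highest score, mapped back to the class label
by_argmax = clf.classes_[np.argmax(scores_demo, axis=1)]
by_predict = clf.predict(X_demo[:5])

print(scores_demo.shape)
print(by_argmax, by_predict)
```

Using classes_ for the mapping matters when the labels are not simply 0, 1, 2, ...; with labels 0 to 3, as here, the column index and the class label coincide.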
The predict_proba() function returns the probability that the classifier assigns to each class for each instance. SGDClassifier must be trained with a probabilistic loss, such as the log loss used above, to predict class probabilities.
sgd_clf.predict_proba(X_train_Std[:1])
It is time to evaluate this classifier. As always, you can use cross-validation to get clean predictions. First we run cross-validation to calculate the accuracy. The following code performs cross-validation with 4 folds.
Accuracies=cross_val_score(sgd_clf,X_train_Std,y_train, cv=4, scoring="accuracy")
Accuracies
If you applied a random classifier, you would get around 27% accuracy, so this is not a bad score at all, but you may be able to get better performance.
Accuracy=cross_val_score(dmy_clf,X_train_Std,y_train, cv=4, scoring="accuracy")
Accuracy
We applied the confusion matrix to binary classification. It can also be used for error analysis of multiclass classification. You first need to make clean predictions using the cross_val_predict() function, then call the confusion_matrix() function to calculate the confusion matrix as we did earlier:
y_train_pred = cross_val_predict(sgd_clf,X_train_Std,y_train, cv=4)
confmx=confusion_matrix(y_train, y_train_pred)
confmx
It is better to look at the confusion matrix plot:
font = {'size' : 9}
matplotlib.rc('font', **font)
fig = plt.figure(figsize=(6.5, 6.5), dpi= 160, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
Conf_Matrix(sgd_clf,X_train_Std,y_train,perfect=0,sdt=0,axt=ax1)
plt.ylabel("Actual",fontsize=11)
plt.title("Predicted",y=1.1,fontsize=11)
plt.show()
This confusion matrix does not look very good, since many instances are misclassified. Class 0 has the highest accuracy, since its off-diagonal cells are dark blue (zero or near-zero values). Let's see what a perfect classifier looks like:
font = {'size' : 9}
matplotlib.rc('font', **font)
fig = plt.figure(figsize=(6.5, 6.5), dpi= 160, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
Conf_Matrix(sgd_clf,X_train_Std,y_train,perfect=1,sdt=0,axt=ax1)
plt.ylabel("Actual",fontsize=11)
plt.title("Predicted",y=1.1,fontsize=11)
plt.show()
A perfect classifier should have zeros on all off-diagonal elements.
Let's make an error plot. Each value in the confusion matrix should be divided by the number of instances in the corresponding actual class (the row sum):
row_sums = confmx.sum(axis=1, keepdims=True)
norm_confmx = confmx / row_sums
norm_confmx
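The row normalization can be illustrated on a small made-up matrix. The counts in `toy_confmx` below are invented for this sketch and are not the lecture results; the last step, zeroing the diagonal so that only the errors remain, is an optional extra that the plotting function in this lecture does not perform.

```python
import numpy as np

# Made-up 4-class confusion matrix (rows: actual class, columns: predicted class)
toy_confmx = np.array([[50, 4, 0, 0],
                       [6, 30, 10, 2],
                       [0, 15, 20, 9],
                       [0, 1, 8, 40]])

# Divide every row by the number of actual instances of that class
row_sums_demo = toy_confmx.sum(axis=1, keepdims=True)
norm_demo = toy_confmx / row_sums_demo

# Optional: keep only the error rates by setting the diagonal to zero
errors_only = norm_demo.copy()
np.fill_diagonal(errors_only, 0)

print(norm_demo.round(2))
print(errors_only.round(2))
```

After normalization each row sums to 1, so the entries can be read as the fraction of each actual class that was assigned to each predicted class, independent of how many instances the class has.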
font = {'size' : 9}
matplotlib.rc('font', **font)
fig = plt.figure(figsize=(6.5, 6.5), dpi= 160, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
Conf_Matrix(sgd_clf,X_train_Std,y_train,perfect=0,sdt=1,axt=ax1)
plt.ylabel("Actual",fontsize=11)
plt.title("Predicted",y=1.1,fontsize=11)
plt.show()
Now the plot clearly shows the kinds of errors the classifier makes. Class 2 (Level 3) is predicted worst: 38% of its instances are misclassified as Class 1 and 23% as Class 3. Class 0 is predicted best: only 7% of its instances are misclassified as Class 1. Note that the confusion matrix is not necessarily symmetrical.