Introduction
In this lecture, we will discuss an example of Machine Learning project from start to end. Assume that you recently hired as data scientist in a company. We will discuss what main steps you should go through to launch a Machine Learning project. We use the oil production dataset for 10000 hydrocarbon wells. The data file is WPD_10000.csv. The data is not original and manipulated so it should NOT be used for any other purposes than learning. The target is oil production prediction using well properties.
Big Picture¶
The first task you are asked as a data scientist is to build a model for oil prediction prediction using known information of hydrocarbon wells. Your model should learn from this data and be able to predict the oil production for those wells that oil production is unknown. See Figure below for location map of oil production $(10^{3}m^{3}/month)$ for 10000 hydrocarbon wells in Alberta.
import matplotlib
import pylab as plt
import numpy as np
import pandas as pd
font = {'size' : 6.5}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(3.35,4.2), dpi= 200, facecolor='w', edgecolor='k')
df = pd.read_csv("WPD_10000.csv",na_values=['NA', '?','']) # Read data file
x=df['X Coordinate']; y=df['Y Coordinate']
var=df['OIL Prod. (e3m3/month)']
plt.scatter(x, y,c=var,s=(var)**(0.5),cmap='jet',edgecolors='k',linewidths=0.5,vmin=0, vmax=800)
plt.title('10000 Hydrocarbon Wells in Alberta' ,fontsize='8')
plt.xlabel('Easting (km)',fontsize='8'); plt.ylabel('Northing (km)',fontsize='8')
plt.colorbar(shrink=0.6,label='Oil Production (e3m3/month)',orientation='vertical')
Name=["" for x in range(6)]
Name[0:6]='Calgary','Fort McMurray','Edmonton','Rainbow',\
'Grande Prairie','Lloydminster'
xn=[365.3,557.2,426.8,91.8,84.3]
yn=[231.6,858.9,517.0,1087.2,713.4]
plt.plot(xn,yn,'*y',markersize=6,linewidth=1)
for i in range(5):
plt.text(xn[i], (yn[i]-18), Name[i],fontsize=5.2,color='y', fontweight='bold',style='oblique', ha='center',
va='top', wrap=True,bbox=dict(facecolor='k', alpha=1,pad=0))
plt.xlim((0, 650))
plt.ylim((0, 1200))
plt.show()
Problem Definition¶
Building a model is probably not the final goal. First, you should ask your boss what exactly is the business objective. How this model can be beneficial for the company? By answering these questions you define the problem, and decide what algorithms to use and how evaluate the performance of your model.
The next question is what the current solution looks like. This can give you a reference (baseline) performance, and insights on how to tackle the problem as well. Performance of your model should be higher than the reference performance. Your boss answers that the oil production are currently predicted manually by experts. The problem is this approach is very expensive and time-consuming, and their predictions are not great; when they can find out the actual oil production, they often realize that the estimates were off by more than 50%. That is the main reason why the company thinks that it would be useful to train a model to predict oil production given properties of wells.
Now you can start designing your system. First, you need to know: is it supervised, unsupervised? Obviously, this is a supervised learning task since you are given labeled training examples (each instance has the expected output which is oil production). Is it a classification task, a regression task, or something else? The answer is it is a typical regression task, since you are asked to predict a value; this is a multiple regression problem since the system will use multiple features (X Coordinate, Y Coordinate, Measured Depth (m)...) to make a prediction (OIL Prod.). It is also a univariate regression problem since a single value is predicted for each well. If we were trying to predict multiple values for each well, it would be a multivariate regression problem.
Performance Measurement¶
The next step is to select a performance measure. All the algorithms in machine learning rely on minimizing a function called cost and loss functions. A cost function is a measure of how bad the predictive model is to predict the actual outcome. We will talk about minimizing a cost function. A typical performance measurement (cost function) for regression problems is the Mean Square Error (MSE). See Equation below:
$MSE=\frac{1}{n}\sum_{i=1}^{n}(y_p^{i}-y_a^{i})^{2}$ where $y_p$ are $y_a$ predicted and actual values, respectively. See Figure below for an example of MSE.
font = {'size' : 12}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(7, 5), dpi= 80, facecolor='w', edgecolor='k')
it=[];MSE=[]
no=0
for i in range(-50,50):
y_i=np.linspace(0.10+i/2.0,0.40+i/2.0,100) # Range of predicted values
y_a=np.linspace(0.20,0.30,100) # Range of actual values values
no+=1
it.append(no)
MSE.append(np.mean((y_i-y_a)**2))
plt.plot(it,MSE,linewidth=4)
plt.title('Mean Square Error (MSE) Cost Function' ,fontsize='16')
plt.xlabel('Number of Iteration',fontsize='14') ; plt.ylabel('MSE',fontsize='14')
plt.axis([1, 100, 0, 620])
plt.show()
MSE is usually preferred for performance measurement of most regression tasks. However, you may want to use another function to measure performance in some contexts. If you have a lot of outlier, you should consider Mean Absolute Error (MAE). MAE is the average of absolute differences between the predicted and actual values. See Equation below:
$MAE=\frac{1}{n}\sum_{i=1}^{n}\left | y_p^{i}-y_a^{i} \right |$ where $y_p$ are $y_a$ predicted and actual values, respectively. See Figure below for an example of MAE.
font = {'size' : 12}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(7, 5), dpi= 80, facecolor='w', edgecolor='k')
it=[];MAE=[]
no=0
for i in range(-50,50):
y_i=np.linspace(0.10+i/2.0,0.40+i/2.0,100) # Range of predicted values
y_a=np.linspace(0.20,0.30,100) # Range of actual values values
no+=1
it.append(no)
MAE.append(np.mean(abs(y_i-y_a)))
plt.plot(it,MAE,linewidth=4)
plt.title('Mean Absolute Error (MAE) Cost Function' ,fontsize='16')
plt.xlabel('Number of Iteration',fontsize='14') ; plt.ylabel('MAE',fontsize='14')
plt.axis([1, 100, 0, 25])
plt.show()
Although the squared error is easier to solve, using the absolute error is more robust to outliers. The goal of machine learning model is to find the point that minimizes cost function. Both functions reach the minimum when the prediction is very close to the actual value. Since MSE squares the error ($E=(y_p-y_a)^{2}$), the value of error ($E$) increases a lot if $E$ > 1. If there is an outlier in data set, $E^{2}$ >> $|E|$. This model with MSE cost function gives more weight to outliers than a model with MAE loss. MAE is useful if the training data is corrupted with outliers: for example, erroneously receiving unrealistically large negative/positive values in training data, but not in testing data. If the outliers are important for business and should be detected, then MSE should be used. The predictions using MSE loss function get skewed towards outliers while MAE go towards median value.
Get the Data and Preprocessing¶
It is time to look at your data to gain some insights. You will need to install a number of Python modules discussed before: Pandas, Jupyter, NumPy, Pandas, Matplotlib, Scikit-Learn..
Shuffling and Look at the Data Structure
The first step after reading data in Python is to Shuffle data set to avoid any element of bias/patterns as discussed before.
import pandas as pd
from numpy import random
df= pd.read_csv("WPD_10000.csv",na_values=['NA', '?','']) # Read data file
np.random.seed(42)
Oil_Prod = df.reindex(np.random.permutation(df.index))
Oil_Prod.reset_index(inplace=True, drop=True) # Reset index
Oil_Prod[0:5] # Display top five rows
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 593.6 | 601.0 | 678.0 | True. | 48.1 | 177.8 | 25.3 | 28.0 | Shale-Sand | 0.176 | 10.03 | 62.18 |
1 | 610.6 | 420.0 | 648.0 | False. | 35.7 | 139.7 | 20.8 | 27.0 | Shale-Sand | 0.144 | 2.69 | 41.08 |
2 | 446.5 | 318.0 | 1535.0 | False. | NaN | 139.7 | NaN | 45.0 | Shale-Sand | 0.062 | 0.97 | 9.55 |
3 | 406.8 | 123.0 | 1808.0 | True. | 35.7 | 139.7 | 25.3 | 41.0 | Shale-Sand | 0.184 | 6.04 | 156.27 |
4 | 630.9 | 518.0 | 523.0 | True. | 48.1 | 177.8 | 29.8 | 16.0 | Shale-Sand | 0.236 | 22.73 | 172.81 |
info() method is efficient to get a quick description of the data, number of rows, and type of each attribute and number of non-null values.
Oil_Prod.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 10000 non-null float64 1 Y Coordinate 10000 non-null float64 2 Measured Depth (m) 10000 non-null float64 3 Deviation (True/False) 10000 non-null object 4 Surface-Casing Weight (kg/m) 5936 non-null float64 5 Production-Casing Size (mm) 8932 non-null float64 6 Production-Casing Weight (kg/m) 5367 non-null float64 7 Bore. Temp. (degC) 8273 non-null float64 8 Prod. Formation 10000 non-null object 9 Porosity (fraction) 10000 non-null float64 10 Permeability (darcy) 10000 non-null float64 11 OIL Prod. (e3m3/month) 10000 non-null float64 dtypes: float64(10), object(2) memory usage: 937.6+ KB
The describe() function is used to present a summary of the numerical attributes. The null values are ignored. P25% is 25th percentile (or 1st quartile), P50% is the median, and P75% is 75th percentile (or 3rd quartile).
Oil_Prod.describe()
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 5936.000000 | 8932.000000 | 5367.00000 | 8273.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 443.814170 | 504.138000 | 1352.348210 | 44.826685 | 151.120634 | 25.88392 | 39.490632 | 0.150812 | 5.504281 | 76.606887 |
std | 172.454779 | 252.490962 | 752.082155 | 14.434801 | 25.074731 | 7.28519 | 17.888880 | 0.076726 | 9.847084 | 121.125589 |
min | 14.000000 | 1.000000 | 324.000000 | 9.500000 | 13.900000 | 6.80000 | -1.000000 | 0.000000 | 0.050000 | 0.210000 |
25% | 325.350000 | 369.000000 | 707.475000 | 35.700000 | 139.700000 | 20.80000 | 26.000000 | 0.098000 | 1.280000 | 11.460000 |
50% | 499.100000 | 479.000000 | 1127.800000 | 48.100000 | 139.700000 | 25.30000 | 33.000000 | 0.150000 | 2.790000 | 36.640000 |
75% | 595.900000 | 599.000000 | 1755.700000 | 53.600000 | 177.800000 | 34.20000 | 52.000000 | 0.202000 | 6.090000 | 89.077500 |
max | 649.100000 | 1197.000000 | 6363.000000 | 536.000000 | 298.400000 | 177.80000 | 174.000000 | 0.491000 | 236.630000 | 991.940000 |
Another quick way to get a feel of the type of data is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). You can either plot this one attribute at a time, or you can call the hist() method on the whole dataset, and it will plot a histogram for each numerical attribute. See histograms of some features below:
columns=['X Coordinate','Y Coordinate','Measured Depth (m)','Surface-Casing Weight (kg/m)',
'Production-Casing Size (mm)','Bore. Temp. (degC)','Porosity (fraction)',
'Permeability (darcy)','OIL Prod. (e3m3/month)']
Oil_Prod[columns].hist(bins=15, layout=(3, 3), figsize=(15,10))
plt.show()
You can make a much nicer histogram by the code below:
def histplt (val,bins,title,xlabl,xlimt,axt=None):
""" Function for histogram plotting"""
from matplotlib.offsetbox import AnchoredText
ax1 = axt or plt.axes()
val=np.array(val)
plt.hist(val, bins=bins,ec='black')
n=len(val)
Mean=np.mean(val)
SD=np.sqrt(np.var(val))
Max=np.amax(val)
Min=np.amin(val)
txt='n=%.0f \n $\mu$=%.1f \n $\sigma$=%.1f \n Max=%.1f \n Min=%.1f'
anchored_text = AnchoredText(txt %(n,Mean,SD,Max,Min), loc=1)
ax1.add_artist(anchored_text)
plt.title(title,fontsize=font['size']*1.25)
plt.xlabel(xlabl,fontsize=font['size'])
plt.ylabel('Frequency',fontsize=font['size'])
plt.xlim(xlimt)
ax1.grid(linewidth='0.35')
font = {'size' : 10}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(10, 3), dpi= 150, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
val=Oil_Prod['OIL Prod. (e3m3/month)']
histplt(val,bins=100,title='Histogram of Oil Production',xlabl='OIL Prod. (e3m3/month)',xlimt=(0,200),axt=ax1)
ax2=plt.subplot(1,2,2)
val=Oil_Prod['Measured Depth (m)']
histplt(val,bins=30,title='Histogram of Measured Depth',xlabl='Measured Depth (m)',xlimt=(0,4000),axt=ax2)
Remove Outliers¶
We discussed in previous lecture that outliers are usually extreme high or low values that are different from pattern of majority data. A straightforward approach to detect outliers are several (n) standard deviations from the mean $m\pm n\sigma$ ($m$=mean, $\sigma$=standard deviation). For example, an outlier can be bigger than $m+ 3\sigma $ or less than $m- 3\sigma $. The following function can be used to apply this approach.
def outlier_remove(df, n,name):
"""Delete rows for a specified column where values are out of +/- n*sd standard deviations
df : Pandas dataframe
n : n in the equation 𝑚±𝑛𝜎
name: Column name
"""
mean=df[name].mean() # Calclute mean of column
sd=df[name].std() # Calclute standard deviation of column
drop_r = df.index[(mean -n * sd> df[name]) | (mean+n * sd< df[name])]
df.drop(drop_r, axis=0, inplace=True)
df.reset_index(inplace=True, drop=True) # Reset index
# Drop outliers in last column 'OIL Prod. (e3m3/month)'
outlier_remove(Oil_Prod, n=3,name='OIL Prod. (e3m3/month)')
Oil_Prod.describe()
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|
count | 9729.000000 | 9729.000000 | 9729.000000 | 5780.000000 | 8689.000000 | 5227.000000 | 8041.000000 | 9729.000000 | 9729.000000 | 9729.000000 |
mean | 443.581622 | 504.076986 | 1352.742183 | 44.824931 | 151.089239 | 25.875474 | 39.508021 | 0.146484 | 4.672725 | 60.927749 |
std | 172.272131 | 252.674401 | 751.454887 | 14.499362 | 25.112591 | 7.298808 | 17.879509 | 0.072788 | 6.344632 | 72.884754 |
min | 14.000000 | 1.000000 | 324.000000 | 9.500000 | 13.900000 | 6.800000 | -1.000000 | 0.000000 | 0.050000 | 0.210000 |
25% | 324.900000 | 369.000000 | 708.300000 | 35.700000 | 139.700000 | 20.800000 | 26.000000 | 0.095000 | 1.250000 | 10.940000 |
50% | 498.500000 | 479.000000 | 1130.000000 | 48.100000 | 139.700000 | 25.300000 | 33.000000 | 0.148000 | 2.670000 | 34.720000 |
75% | 595.900000 | 599.000000 | 1754.000000 | 53.600000 | 177.800000 | 34.200000 | 52.000000 | 0.197000 | 5.670000 | 82.210000 |
max | 649.100000 | 1197.000000 | 6363.000000 | 536.000000 | 298.400000 | 177.800000 | 174.000000 | 0.384000 | 195.520000 | 434.690000 |
You should separate the attributes (features or predictors) and targets (labels) since you should not apply the same transformation to the predictors and the target values:
Set Aside Test Set¶
At this step you should set aside part of data as test set. and you should not look at test set until the end of project for only final evaluation of your predictive model. It might seem a little strange to voluntarily take some part of your data and not even have a glance. But your brain is an amazing pattern detection: if you look at the test set, you may come across interesting patterns in the test data that persuade you to choose a particular Machine Learning model. This leads to a very optimistic estimate, your model cannot generalize well. This is called data Snooping Bias. It is quite simple to theoretically create a test set, typically 20% of the dataset (Aurélien Géron, 2019).
There are different approaches to take some instances. We discussed how to use Pandas to do this. However, Scikit-Learn provides functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split. If you do not want your data split changes for each run, you should set the random generator seed.
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(Oil_Prod, test_size=0.2, random_state=100)
This is a purely random sampling. If you have a very large dataset, this approach should be totally fine. However, if not, you may face a significant sampling bias. For example when a survey company wants to contact 1000 people, they do not randomly call 1000 people; they do their best to ensure 10000 people are representative of the whole population (Aurélien Géron, 2019); if the population of a city has 52% female and 48% male citizen, a reliable survey should preserve this ratio in sampling. This is called stratified sampling. If purely random sampling was used, the survey results would be significantly biased.
For a classification task, you should split your data in a way of having the same number of data for each class. For example, if you have two classes in your target: class 1 has 630 instances and class 2 has 370 instances, then you should set aside test set to have the same number of classes. If you are dealing with a regression task, you may divide your target into a few categories and then split the data to respect the number of instances for each category. The code below divide "OIL Prod_cat" to 4 categories. To limit number of categories, "OIL Prod. (e3m3/month)" is divided by 20 and rounding up using Ceil.
Oil_Prod["OIL Prod_cat"]=np.ceil(Oil_Prod["OIL Prod. (e3m3/month)"] / 20)
Oil_Prod["OIL Prod_cat"].where(Oil_Prod["OIL Prod_cat"] < 4, 4, inplace=True)
Oil_Prod[0:4]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | OIL Prod_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 593.6 | 601.0 | 678.0 | True. | 48.1 | 177.8 | 25.3 | 28.0 | Shale-Sand | 0.176 | 10.03 | 62.18 | 4.0 |
1 | 610.6 | 420.0 | 648.0 | False. | 35.7 | 139.7 | 20.8 | 27.0 | Shale-Sand | 0.144 | 2.69 | 41.08 | 3.0 |
2 | 446.5 | 318.0 | 1535.0 | False. | NaN | 139.7 | NaN | 45.0 | Shale-Sand | 0.062 | 0.97 | 9.55 | 1.0 |
3 | 406.8 | 123.0 | 1808.0 | True. | 35.7 | 139.7 | 25.3 | 41.0 | Shale-Sand | 0.184 | 6.04 | 156.27 | 4.0 |
# Plot histogram of categories
font = {'size' : 14}
matplotlib.rc('font', **font)
histplt(Oil_Prod["OIL Prod_cat"],bins=15,title='Number of Instances per Class ',xlabl='Categories',
xlimt=(0.68,4.3),axt=ax1)
Now you can do stratified sampling based on the "OIL Prod_cat" category. you can use Scikit-Learn’s StratifiedShuffleSplit function:
from sklearn.model_selection import StratifiedShuffleSplit
spt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=100)
for train_idx, test_idx in spt.split(Oil_Prod, Oil_Prod["OIL Prod_cat"]):
train_set_strat = Oil_Prod.loc[train_idx]
test_set_strat = Oil_Prod.loc[test_idx]
test_set_strat[0:4]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | OIL Prod_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2279 | 520.6 | 120.0 | 977.8 | False. | NaN | 114.3 | NaN | 33.0 | Shale-Sand | 0.000 | 0.13 | 1.28 | 1.0 |
8952 | 617.2 | 618.0 | 539.0 | True. | NaN | 177.8 | 34.2 | NaN | Shale-Sand | 0.105 | 6.06 | 65.96 | 4.0 |
3262 | 81.5 | 782.0 | 3071.0 | True. | 53.6 | NaN | NaN | NaN | Shale-Sand | 0.148 | 1.81 | 33.16 | 2.0 |
9013 | 343.3 | 537.0 | 1408.2 | False. | NaN | 139.7 | NaN | 41.0 | Shale-Sand | 0.189 | 1.70 | 42.88 | 3.0 |
# Plot histogram of test set for categories
font = {'size' : 14}
matplotlib.rc('font', **font)
histplt(test_set_strat["OIL Prod_cat"],bins=15,title='Test_Number of Instances',
xlabl='Categories',xlimt=(0.68,4.3),axt=ax1)
Now you should remove the OIL Prod_cat attribute in order to retrieve the original state of data.
for dataset in (train_set_strat, test_set_strat):
dataset.drop("OIL Prod_cat", axis=1, inplace=True)
dataset.reset_index(inplace=True, drop=True) # Reset index
train_set_strat[0:4]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 590.1 | 509.0 | 750.0 | True. | 53.6 | 177.8 | 34.2 | 19.0 | Shale-Sand | 0.226 | 9.15 | 189.00 |
1 | 541.2 | 424.0 | 842.0 | False. | 35.7 | 139.7 | 20.8 | 32.0 | Shale-Sand | 0.093 | 1.04 | 6.92 |
2 | 602.4 | 604.0 | 573.0 | True. | 53.6 | 177.8 | 34.2 | 10.0 | Shale-Sand | 0.236 | 6.91 | 161.58 |
3 | 624.1 | 603.0 | 532.3 | True. | 48.1 | 177.8 | 34.2 | 27.0 | Shale-Sand | 0.183 | 2.97 | 25.60 |
It is often ignored to set aside some data as test set. However, this is a critical step for Machine Learning project since we may end up a very optimistic predictive model. Now we should move on to the stage of exploring the data.
Visualize the Data¶
We have only taken a quick look at the data. Now we should go deeper in data visualization. At this step, we should make sure to work only on the training set and do not touch the test set achieved in the last section. If the training set is very large, you may sample an exploration set to make a very fast and easy manipulations. This data set is small that we can work directly on the full set. It is a good idea to have a copy of your training set to avoid modifying the original data.
Oil_Prod = train_set_strat.copy()
Oil_Prod.describe()
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|
count | 7783.000000 | 7783.000000 | 7783.000000 | 4643.000000 | 6946.000000 | 4207.000000 | 6426.000000 | 7783.000000 | 7783.000000 | 7783.000000 |
mean | 444.127175 | 506.218168 | 1352.445034 | 44.874435 | 151.232623 | 25.898312 | 39.534547 | 0.146516 | 4.665215 | 60.996269 |
std | 172.390816 | 252.382250 | 752.076742 | 15.309212 | 24.971628 | 7.345240 | 17.940545 | 0.072873 | 6.480627 | 73.135294 |
min | 14.000000 | 1.000000 | 324.000000 | 9.500000 | 13.900000 | 9.500000 | 0.000000 | 0.000000 | 0.050000 | 0.210000 |
25% | 325.200000 | 370.000000 | 706.400000 | 35.700000 | 139.700000 | 20.800000 | 26.000000 | 0.095000 | 1.250000 | 10.950000 |
50% | 498.700000 | 480.000000 | 1135.000000 | 48.100000 | 139.700000 | 25.300000 | 33.000000 | 0.148000 | 2.680000 | 34.810000 |
75% | 596.100000 | 599.000000 | 1763.000000 | 53.600000 | 177.800000 | 34.200000 | 52.000000 | 0.197000 | 5.660000 | 82.225000 |
max | 648.100000 | 1195.000000 | 5825.000000 | 536.000000 | 298.400000 | 177.800000 | 174.000000 | 0.384000 | 195.520000 | 434.690000 |
Location Map¶
Due to having "X Coordinate" and "Y Coordinate", it is a good idea to have a location map of oil production to visualize the data.
# Change alpha for a better visualization and highlighting high-density areas
Oil_Prod.plot(kind="scatter", x="X Coordinate", y="Y Coordinate",fontsize=14,figsize=(7,9),alpha=0.15)
plt.xlabel('X Coordinate',fontsize=14)
plt.ylabel('Y Coordinate',fontsize=14)
plt.show()
You can make a nicer location map using matplotlib:
font = {'size' : 6.5}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(3.35,4.2), dpi= 200, facecolor='w', edgecolor='k')
x=Oil_Prod['X Coordinate']; y=Oil_Prod['Y Coordinate']
var=Oil_Prod['OIL Prod. (e3m3/month)']
plt.scatter(x, y,c=var,s=(var)**(0.55),cmap='jet',edgecolors='k',linewidths=0.5,vmin=0, vmax=400)
plt.title('Training set: '+str(len(var))+' Hydrocarbon Wells in Alberta' ,fontsize='7')
plt.xlabel('Easting (km)',fontsize='7'); plt.ylabel('Northing (km)',fontsize='7')
plt.colorbar(shrink=0.6,label='Oil Production (e3m3/month)',orientation='vertical')
Name=["" for x in range(6)]
Name[0:6]='Calgary','Fort McMurray','Edmonton','Rainbow',\
'Grande Prairie','Lloydminster'
xn=[365.3,557.2,426.8,91.8,84.3]
yn=[231.6,858.9,517.0,1087.2,713.4]
plt.plot(xn,yn,'*y',markersize=6,linewidth=1)
for i in range(5):
plt.text(xn[i], (yn[i]-18), Name[i],fontsize=5.2,color='y', fontweight='bold',style='oblique', ha='center',
va='top', wrap=True,bbox=dict(facecolor='k', alpha=1,pad=0))
plt.xlim((0, 650))
plt.ylim((0, 1200))
plt.show()
Calculate Correlations¶
You can compute the standard covariance between variables called correlation coefficient (Pearson’s r) between every pair of attributes. The correlation coefficient $\large \rho_{XY}$ is between -1 and +1.
$\large \rho_{XY}=\frac{C_{XY}}{\sqrt{\sigma_{X}^{2}\sigma_{Y}^{2}}}$ where $C_{XY}$ is the covariance between $XY$ and $\sigma_{X}^{2}$ and $\sigma_{Y}^{2}$ are variances for X and Y variables. The covariance is the variance between two variable. See Variance for more information.
$\large m_{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\,\,\,\,\,\,\,\,\,\,\,,\,\,\,\,\,\,\,\,\, \sigma_{X} ^{2}=\frac{1}{n}\sum_{i=1}^{n}\left ( X_{i}-m_{X} \right )^2\,\,\,\,\,\,\,\,\,\,\,,\,\,\,\,\,\,\,\,\, C_{XY}=\frac{1}{n}\sum_{i=1}^{n}\left ( X_{i}-m_{X} \right )\left ( Y_{i}-m_{Y} \right )$
where $m_{X}$ is the mean of $X$ and $m_{Y}$ is the mean of $Y$. If absolute value of $ \rho_{XY}$ is close to 1 ($ \rho_{XY}$ ≈ −1 𝑜𝑟 1), it implies that two variables are perfectly correlated. If the $ \rho_{XY}$ is close to zero ($\rho_{XY}$ ≈ 0), two variables may be independent. However, $\rho_{XY}$ ≈ 0 does not necessarily imply independence between the two variables. See Figure below for various plots of correlation coefficients between their horizontal and vertical axes.
Image retrieved from Wikipedia.
As you can see, the correlation coefficient only measures linear correlations. For example if X increases, Y generally goes up and vice versa. All the plots of the bottom row have $\rho_{XY}=0$ although they are not obviously independent: they have non-linear relationships.
corr() function calculates correlation coefficient between variables. First we need to convert text number. We will discuss this in more details.
Oil_Prod['Deviation (True/False)']=Oil_Prod['Deviation (True/False)'].replace('False.', 0) # Replace False with 0
Oil_Prod['Deviation (True/False)']=Oil_Prod['Deviation (True/False)'].replace('True.', 1) # Replace True with 1
#
Oil_Prod['Prod. Formation']=Oil_Prod['Prod. Formation'].replace('Sand', 0) # Replace Sand with 0
Oil_Prod['Prod. Formation']=Oil_Prod['Prod. Formation'].replace('Shale-Sand', 1) # Replace Shale-Sand with 1
Oil_Prod['Prod. Formation']=Oil_Prod['Prod. Formation'].replace('Shale', 2) # Shale True with 2
Oil_Prod
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 590.1 | 509.0 | 750.0 | 1 | 53.6 | 177.8 | 34.2 | 19.0 | 1 | 0.226 | 9.15 | 189.00 |
1 | 541.2 | 424.0 | 842.0 | 0 | 35.7 | 139.7 | 20.8 | 32.0 | 1 | 0.093 | 1.04 | 6.92 |
2 | 602.4 | 604.0 | 573.0 | 1 | 53.6 | 177.8 | 34.2 | 10.0 | 1 | 0.236 | 6.91 | 161.58 |
3 | 624.1 | 603.0 | 532.3 | 1 | 48.1 | 177.8 | 34.2 | 27.0 | 1 | 0.183 | 2.97 | 25.60 |
4 | 600.9 | 539.0 | 664.0 | 1 | 48.1 | 177.8 | 34.2 | 27.0 | 1 | 0.223 | 1.92 | 36.01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7778 | 518.5 | 400.0 | 986.0 | 0 | NaN | 139.7 | NaN | 31.0 | 1 | 0.165 | 11.57 | 57.94 |
7779 | 180.4 | 779.0 | 2564.0 | 1 | 53.6 | NaN | NaN | 21.0 | 1 | 0.084 | 0.51 | 2.17 |
7780 | 551.5 | 101.0 | 916.0 | 1 | 35.7 | 139.7 | 20.8 | 36.0 | 1 | 0.136 | 3.92 | 76.10 |
7781 | 625.4 | 506.0 | 636.0 | 0 | 48.1 | 177.8 | 25.3 | 24.0 | 1 | 0.148 | 2.34 | 65.82 |
7782 | 299.8 | 419.0 | 1984.2 | 0 | NaN | 114.3 | NaN | 57.0 | 1 | 0.160 | 5.43 | 107.75 |
7783 rows × 12 columns
corr_matrix=Oil_Prod.corr()
corr_matrix
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
X Coordinate | 1.000000 | -0.566847 | -0.694257 | 0.103841 | 0.029685 | 0.290090 | 0.177544 | -0.775748 | -0.013626 | -0.000171 | -0.007081 | 0.004895 |
Y Coordinate | -0.566847 | 1.000000 | 0.133506 | 0.133495 | 0.245511 | 0.304114 | 0.301934 | 0.360676 | -0.001061 | 0.010373 | 0.016840 | 0.007630 |
Measured Depth (m) | -0.694257 | 0.133506 | 1.000000 | 0.017123 | 0.029313 | -0.373264 | -0.217995 | 0.829445 | 0.010794 | -0.005485 | -0.000998 | -0.011757 |
Deviation (True/False) | 0.103841 | 0.133495 | 0.017123 | 1.000000 | 0.153415 | 0.212666 | 0.231320 | -0.123615 | -0.129802 | 0.259245 | 0.250028 | 0.374134 |
Surface-Casing Weight (kg/m) | 0.029685 | 0.245511 | 0.029313 | 0.153415 | 1.000000 | 0.503135 | 0.475470 | -0.006927 | 0.013684 | -0.002416 | -0.003613 | -0.000564 |
Production-Casing Size (mm) | 0.290090 | 0.304114 | -0.373264 | 0.212666 | 0.503135 | 1.000000 | 0.778854 | -0.262123 | -0.001636 | 0.010308 | 0.006990 | 0.008008 |
Production-Casing Weight (kg/m) | 0.177544 | 0.301934 | -0.217995 | 0.231320 | 0.475470 | 0.778854 | 1.000000 | -0.161192 | -0.022611 | 0.025301 | 0.015933 | 0.014181 |
Bore. Temp. (degC) | -0.775748 | 0.360676 | 0.829445 | -0.123615 | -0.006927 | -0.262123 | -0.161192 | 1.000000 | 0.019675 | -0.005322 | 0.010361 | -0.002941 |
Prod. Formation | -0.013626 | -0.001061 | 0.010794 | -0.129802 | 0.013684 | -0.001636 | -0.022611 | 0.019675 | 1.000000 | -0.468861 | -0.350375 | -0.495952 |
Porosity (fraction) | -0.000171 | 0.010373 | -0.005485 | 0.259245 | -0.002416 | 0.010308 | 0.025301 | -0.005322 | -0.468861 | 1.000000 | 0.512293 | 0.709663 |
Permeability (darcy) | -0.007081 | 0.016840 | -0.000998 | 0.250028 | -0.003613 | 0.006990 | 0.015933 | 0.010361 | -0.350375 | 0.512293 | 1.000000 | 0.673459 |
OIL Prod. (e3m3/month) | 0.004895 | 0.007630 | -0.011757 | 0.374134 | -0.000564 | 0.008008 | 0.014181 | -0.002941 | -0.495952 | 0.709663 | 0.673459 | 1.000000 |
# sort correlation of attributes with respect to target "OIL Prod. (e3m3/month)"
corr_matrix["OIL Prod. (e3m3/month)"].sort_values(ascending=False)
OIL Prod. (e3m3/month) 1.000000 Porosity (fraction) 0.709663 Permeability (darcy) 0.673459 Deviation (True/False) 0.374134 Production-Casing Weight (kg/m) 0.014181 Production-Casing Size (mm) 0.008008 Y Coordinate 0.007630 X Coordinate 0.004895 Surface-Casing Weight (kg/m) -0.000564 Bore. Temp. (degC) -0.002941 Measured Depth (m) -0.011757 Prod. Formation -0.495952 Name: OIL Prod. (e3m3/month), dtype: float64
There is a very high correlation for porosity and permeability. You can make a nice plot as correlation bar using the function below:
def corr_bar(df):
"""Plot correlation bar with the pair of atrribute with last column"""
corr=df.corr()
Colms_sh=list(list(corr.columns))
coefs=corr.values[:,-1][:-1]
names=Colms_sh[:-1]
r_ = pd.DataFrame( { 'coef': coefs, 'positive': coefs>=0 }, index = names )
r_ = r_.sort_values(by=['coef'])
r_['coef'].plot(kind='barh', color=r_['positive'].map({True: 'b', False: 'r'}))
plt.xlabel('Correlation Coefficient',fontsize=6)
plt.vlines(x=0,ymin=-0.5, ymax=10, color = 'k',linewidth=0.8,linestyle="dashed")
plt.show()
font = {'size' : 5}
matplotlib.rc('font', **font)
ax1 = plt.subplots(figsize=(2.8, 3), dpi= 200, facecolor='w', edgecolor='k')
# Plot correlations of attributes with the last column
corr_bar(Oil_Prod)
font = {'size' : 10}
matplotlib.rc('font', **font)
from pandas.plotting import scatter_matrix
scatter_matrix(Oil_Prod[columns[0:3]],figsize=(12, 8),alpha=0.1)
plt.show()
You can make a nicer crossplot using matplotlib:
def CrossPlot (x,y,xlabl,ylabl,axt=None):
'''Cross plot between two variables'''
from matplotlib.offsetbox import AnchoredText
ax1 = axt or plt.axes()
x=np.array(x)
y=np.array(y)
n_x=len(x)
Mean_x=np.mean(x)
SD_x=np.sqrt(np.var(x))
n_y=len(y)
Mean_y=np.mean(y)
SD_y=np.sqrt(np.var(y))
corr=np.corrcoef(x,y)
plt.plot(x,y,'ro',markersize=2,alpha=0.08)
txt='$\\rho_{XY}$=%.3f \n $n$=%.0f \n $\mu_{X}$=%.1f \n $\sigma_{X}$=%.1f \n '
txt+=' $\mu_{Y}$=%.1f \n $\sigma_{Y}$=%.1f'
anchored_text = AnchoredText(txt %(corr[1,0], n_x,Mean_x,SD_x,Mean_y,SD_y), loc=1)
ax1.add_artist(anchored_text)
plt.xlabel(xlabl,fontsize=font['size'])
plt.ylabel(ylabl,fontsize=font['size'])
font = {'size' :7 }
matplotlib.rc('font', **font)
fig=plt.figure(figsize=(8, 2.65), dpi= 120, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
CrossPlot (x=Oil_Prod['X Coordinate'],y=Oil_Prod['Measured Depth (m)'],xlabl='X Coordinate',
ylabl='Measured Depth (m)',axt=ax1)
ax2 = plt.subplot(1,2,2)
CrossPlot (x=Oil_Prod['Porosity (fraction)'],y=Oil_Prod['Permeability (darcy)'],xlabl='Porosity',
ylabl='Permeability',axt=ax2)
Data Preparation for Machine Learning Algorithms¶
After looking at the data, it’s time to prepare the data for your Machine Learning algorithms. You need to automate this preparation by writing some functions (instead of doing manually) because of several reasons:
Apply the transformations easily on any dataset (for example a new dataset).
Build a library of transformation functions for future projects.
# Note that drop() creates a copy and does not affect train_set_strat
Oil_Prod = train_set_strat.drop("OIL Prod. (e3m3/month)", axis=1)
Oil_Prod_Target = train_set_strat["OIL Prod. (e3m3/month)"].copy()
Imputation¶
Most Machine Learning algorithms cannot work with missing values, so a few functions should be written to take care of them. For example the attributes "Surface-Casing Weight (kg/m)", "Production-Casing Size (mm)", "Production-Casing Weight (kg/m)", "Bore. Temp. (degC)" have a lot of missing value. As we talked in previous lectures, you have three options to take care of missing values: 1- Get rid of the whole attributes (not recommended), 2- Rows that have missing values (may not be efficient, 3- Recommended to impute missing values with statistical value retrieved from the attribute (the median, the mean, etc.). We discussed how to do these with Pandas in previous lectures. Here, we apply Scikit-Learn to do this which is much easier. First, create a SimpleImputer instance, specifying that attribute’s missing values are imputed with the median of that attribute:
from sklearn.impute import SimpleImputer
imput_mdn = SimpleImputer(strategy="median")
Imputation may only be applied for numerical attributes, so need to have a copy of the data without the text attribute including "Deviation (True/False)" and "Prod. Formation":
Oil_Prod_num = Oil_Prod.drop(["Deviation (True/False)","Prod. Formation"], axis=1)
Oil_Prod_num[0:10]
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (darcy) | |
---|---|---|---|---|---|---|---|---|---|
0 | 590.1 | 509.0 | 750.0 | 53.6 | 177.8 | 34.2 | 19.0 | 0.226 | 9.15 |
1 | 541.2 | 424.0 | 842.0 | 35.7 | 139.7 | 20.8 | 32.0 | 0.093 | 1.04 |
2 | 602.4 | 604.0 | 573.0 | 53.6 | 177.8 | 34.2 | 10.0 | 0.236 | 6.91 |
3 | 624.1 | 603.0 | 532.3 | 48.1 | 177.8 | 34.2 | 27.0 | 0.183 | 2.97 |
4 | 600.9 | 539.0 | 664.0 | 48.1 | 177.8 | 34.2 | 27.0 | 0.223 | 1.92 |
5 | 449.7 | 325.0 | 1411.2 | NaN | 139.7 | NaN | 43.0 | 0.318 | 11.20 |
6 | 316.8 | 464.0 | 1478.9 | NaN | 114.3 | NaN | 41.0 | 0.107 | 0.88 |
7 | 126.3 | 1109.0 | 1689.0 | 53.6 | NaN | NaN | 65.0 | 0.186 | 9.11 |
8 | 450.9 | 542.0 | 951.6 | NaN | 177.8 | NaN | 31.0 | 0.067 | 1.45 |
9 | 609.5 | 530.0 | 561.0 | NaN | 139.7 | 20.8 | 28.0 | 0.135 | 1.24 |
Now you can calculate the median of each attribute by fitting imputer to the training data using the fit() method. The results are stored in its statistics_
imput_mdn.fit(Oil_Prod_num)
imput_mdn.statistics_
array([4.987e+02, 4.800e+02, 1.135e+03, 4.810e+01, 1.397e+02, 2.530e+01, 3.300e+01, 1.480e-01, 2.680e+00])
We trained an imputer; this “trained” imputer can be applied to transform the training set: missing values of each attribute can be replaced by the learned medians.
im = imput_mdn.transform(Oil_Prod_num)
The result of imputation is a NumPy array for each feature. We need to put it back into a Pandas DataFrame:
Oil_Prod_im = pd.DataFrame(im, columns=Oil_Prod_num.columns)
Oil_Prod_im
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (darcy) | |
---|---|---|---|---|---|---|---|---|---|
0 | 590.1 | 509.0 | 750.0 | 53.6 | 177.8 | 34.2 | 19.0 | 0.226 | 9.15 |
1 | 541.2 | 424.0 | 842.0 | 35.7 | 139.7 | 20.8 | 32.0 | 0.093 | 1.04 |
2 | 602.4 | 604.0 | 573.0 | 53.6 | 177.8 | 34.2 | 10.0 | 0.236 | 6.91 |
3 | 624.1 | 603.0 | 532.3 | 48.1 | 177.8 | 34.2 | 27.0 | 0.183 | 2.97 |
4 | 600.9 | 539.0 | 664.0 | 48.1 | 177.8 | 34.2 | 27.0 | 0.223 | 1.92 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7778 | 518.5 | 400.0 | 986.0 | 48.1 | 139.7 | 25.3 | 31.0 | 0.165 | 11.57 |
7779 | 180.4 | 779.0 | 2564.0 | 53.6 | 139.7 | 25.3 | 21.0 | 0.084 | 0.51 |
7780 | 551.5 | 101.0 | 916.0 | 35.7 | 139.7 | 20.8 | 36.0 | 0.136 | 3.92 |
7781 | 625.4 | 506.0 | 636.0 | 48.1 | 177.8 | 25.3 | 24.0 | 0.148 | 2.34 |
7782 | 299.8 | 419.0 | 1984.2 | 48.1 | 114.3 | 25.3 | 57.0 | 0.160 | 5.43 |
7783 rows × 9 columns
Deal with Text and Categorical Attributes¶
Most Machine Learning algorithms work with numbers anyway, so all texts and categorical variables should be efficiently converted to numbers before feeding Machine learning algorithms. For example "Deviation (True/False)" and "Prod. Formation" have texts. We discussed how to do this with Pandas in previous lectures. Here, we apply Scikit-Learn library.
Categorical Feature 1 (Prod. Formation)
Oil_Prod_cat = Oil_Prod[["Prod. Formation"]]
Oil_Prod_cat[0:10]
Prod. Formation | |
---|---|
0 | Shale-Sand |
1 | Shale-Sand |
2 | Shale-Sand |
3 | Shale-Sand |
4 | Shale-Sand |
5 | Sand |
6 | Shale-Sand |
7 | Shale-Sand |
8 | Shale-Sand |
9 | Shale-Sand |
Oil_Prod['Prod. Formation'].value_counts()
Shale-Sand 7197 Shale 401 Sand 185 Name: Prod. Formation, dtype: int64
gb_mean=df.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].mean()
gb_mean
Prod. Formation Sand 504.771160 Shale 0.478220 Shale-Sand 57.049353 Name: OIL Prod. (e3m3/month), dtype: float64
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
Oil_Prod_cat_encoded = ordinal_encoder.fit_transform(Oil_Prod_cat)
Oil_Prod_cat_encoded[:10]
array([[2.], [2.], [2.], [2.], [2.], [0.], [2.], [2.], [2.], [2.]])
A number from 0 to 2 is assigned to each category: Sand=0, Shale=1 and Shale-Sand=2. One challenge with this approach is that Machine Learning algorithms assume that two nearby (for example 0 and 1) values are more similar than two distant values (0 to 2): for example, categories such as “Bad”, “Average”, “Good”, “Excellent”). However, this may not be correct for all cases. To fix this issue as it was discussed before, one binary attribute per category is created: one attribute equal to 1 when the category is Sand (and 0 otherwise), another attribute equal to 1 when the category is Shale (and 0 otherwise), and finally attribute equal to 1 when the category is Shale-Sand (and 0 otherwise). This process should be applied for all categories. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a function to convert categorical values into one-hot vectors (Aurélien Géron, 2019).
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
Oil_Prod_cat_1hot = cat_encoder.fit_transform(Oil_Prod_cat)
Oil_Prod_cat_1hot
<7783x3 sparse matrix of type '<class 'numpy.float64'>' with 7783 stored elements in Compressed Sparse Row format>
Instead of a NumPy array, the output is a SciPy sparse matrix which is useful When dealing with many categories. The matrix is full of zeros except for a single 1 per row. It would be very wasteful to use tons of memory to store zeros instead a sparse matrix can stores only non‐zero elements. You can can convert a sparse matrixit to a NumPy array by toarray() function.
cat_hot_toarray_1=Oil_Prod_cat_1hot.toarray()
cat_hot_toarray_1[0:10]
array([[0., 0., 1.], [0., 0., 1.], [0., 0., 1.], [0., 0., 1.], [0., 0., 1.], [1., 0., 0.], [0., 0., 1.], [0., 0., 1.], [0., 0., 1.], [0., 0., 1.]])
You can see the list of categories:
cat_encoder.categories_
[array(['Sand', 'Shale', 'Shale-Sand'], dtype=object)]
Categorical Feature 2 (Deviation (True/False))
We have another categorical variable "Deviation (True/False)"; the same approach should be applied.
Oil_Prod_cat = Oil_Prod[["Deviation (True/False)"]]
Oil_Prod_cat[0:10]
Deviation (True/False) | |
---|---|
0 | True. |
1 | False. |
2 | True. |
3 | True. |
4 | True. |
5 | True. |
6 | False. |
7 | False. |
8 | True. |
9 | True. |
Oil_Prod_cat_2hot = cat_encoder.fit_transform(Oil_Prod_cat)
cat_hot_toarray_2=Oil_Prod_cat_2hot.toarray()
cat_hot_toarray_2[0:10]
array([[0., 1.], [1., 0.], [0., 1.], [0., 1.], [0., 1.], [0., 1.], [1., 0.], [1., 0.], [0., 1.], [0., 1.]])
cat_encoder.categories_
[array(['False.', 'True.'], dtype=object)]
Scale Feature Values¶
If the input attributes have very different scales, Machine Learning algorithms don’t perform well; so it is very important to transform the data and make them all scale consistent. We applied imputed data from Imputation Section. Let's look at the range of data for each attribute again:
Oil_Prod_im.describe()
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (darcy) | |
---|---|---|---|---|---|---|---|---|---|
count | 7783.000000 | 7783.000000 | 7783.000000 | 7783.000000 | 7783.000000 | 7783.000000 | 7783.000000 | 7783.000000 | 7783.000000 |
mean | 444.127175 | 506.218168 | 1352.445034 | 46.175768 | 149.992381 | 25.623410 | 38.395220 | 0.146516 | 4.665215 |
std | 172.390816 | 252.382250 | 752.076742 | 11.929308 | 23.859568 | 5.408242 | 16.488948 | 0.072873 | 6.480627 |
min | 14.000000 | 1.000000 | 324.000000 | 9.500000 | 13.900000 | 9.500000 | 0.000000 | 0.000000 | 0.050000 |
25% | 325.200000 | 370.000000 | 706.400000 | 35.700000 | 139.700000 | 23.100000 | 28.000000 | 0.095000 | 1.250000 |
50% | 498.700000 | 480.000000 | 1135.000000 | 48.100000 | 139.700000 | 25.300000 | 33.000000 | 0.148000 | 2.680000 |
75% | 596.100000 | 599.000000 | 1763.000000 | 48.100000 | 177.800000 | 25.300000 | 48.000000 | 0.197000 | 5.660000 |
max | 648.100000 | 1195.000000 | 5825.000000 | 536.000000 | 298.400000 | 177.800000 | 174.000000 | 0.384000 | 195.520000 |
Standardization and min-max scaling are two common approaches to get all attributes to have the same scale. Min-max scaling is simply shifting and rescaling data value to be in range of 0 to 1. This can be done by subtracting the min value to have a shifted values and then dividing by the max of shifted values. See simple example below:
# Simple example for Min-max scaling
X=[1800,48,6,910,90,47,45]
X_shift=[i-min(X) for i in X]
X_MinMax=[i/max(X_shift) for i in X_shift]
X_MinMax
[1.0, 0.023411371237458192, 0.0, 0.5039018952062431, 0.046822742474916385, 0.022853957636566332, 0.021739130434782608]
Scikit-Learn has a transformer called MinMaxScaler for Min-max scaling. See the code below how to do min-max scaling for all features with Scikit-Learn.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
Oil_Prod_im_MMS=Oil_Prod_im.copy()
scaler.fit_transform(Oil_Prod_im_MMS)
array([[0.90853178, 0.42546064, 0.07744047, ..., 0.1091954 , 0.58854167, 0.04655446], [0.8314146 , 0.35427136, 0.0941647 , ..., 0.18390805, 0.2421875 , 0.00506472], [0.92792935, 0.50502513, 0.0452645 , ..., 0.05747126, 0.61458333, 0.0350949 ], ..., [0.8476581 , 0.08375209, 0.1076168 , ..., 0.20689655, 0.35416667, 0.01979843], [0.96420123, 0.42294807, 0.05671696, ..., 0.13793103, 0.38541667, 0.01171535], [0.45071755, 0.35008375, 0.30179967, ..., 0.32758621, 0.41666667, 0.02752341]])
As it was discussed before, standardization is different: the data values should be first subtracted the mean value; this leads to mean=0, and then the data are divided by the standard deviation so that the standardized values have variance 1; standardization does not force values to a specific range (unlike min-max scaling,). This might be problematic for some algorithms. However, standardization is much less affected by outliers. See the code below as a simple example:
# Simple example for StandardScaler
X=[1800,48,6,910,90,47,45]
X_shift=[i-np.mean(X) for i in X]
X_Standardiation=[i/np.sqrt(np.var(X_shift)) for i in X_shift]
X_Standardiation
[2.164089802415627, -0.585070891268364, -0.6509754284457199, 0.7675412765145129, -0.5191663540910081, -0.5866400469154439, -0.5897783582096037]
For standardization, Scikit-Learn has a transformer called StandardScaler. See the code below for standardization of all features:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Oil_Prod_im_MMS=Oil_Prod_im.copy()
Oil_Prod_im_Std=scaler.fit_transform(Oil_Prod_im_MMS)
Oil_Prod_im_Std
array([[ 8.46809634e-01, 1.10230055e-02, -8.01093435e-01, ..., -1.17633133e+00, 1.09078282e+00, 6.92073885e-01], [ 5.63133608e-01, -3.25789354e-01, -6.78757633e-01, ..., -3.87873813e-01, -7.34421973e-01, -5.59428578e-01], [ 9.18163727e-01, 3.87460349e-01, -1.03645688e+00, ..., -1.72218654e+00, 1.22801627e+00, 3.46406127e-01], ..., [ 6.22885409e-01, -1.60567632e+00, -5.80357098e-01, ..., -1.45271500e-01, -1.44318166e-01, -1.14998604e-01], [ 1.05159008e+00, -8.64489554e-04, -9.52683450e-01, ..., -8.73078439e-01, 2.03619659e-02, -3.58817826e-01], [-8.37262974e-01, -3.45601846e-01, 8.40067935e-01, ..., 1.12839064e+00, 1.85042098e-01, 1.18018501e-01]])
As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset.
Concatenate Scaled and One-hot Encoded Feature¶
We applied imputation to predict the missing values (Oil_Prod_im) and then applied standardization for imputed data (Oil_Prod_im_Std). We also converted two categorical variable ("Prod. Formation" and "Deviation (True/False)") to numbers using one-hot encoding (cat_hot_toarray_1, cat_hot_toarray_2). Now, these variables should be concatenated using Numpy concatenate() function to use as training set.
Oil_Prod_prepared=np.concatenate((Oil_Prod_im_Std,cat_hot_toarray_1, cat_hot_toarray_2), axis=1)
Oil_Prod_prepared[0:4]
array([[ 0.84680963, 0.01102301, -0.80109343, 0.62239229, 1.16554525, 1.58593883, -1.17633133, 1.09078282, 0.69207389, 0. , 0. , 1. , 0. , 1. ], [ 0.56313361, -0.32578935, -0.67875763, -0.87821027, -0.43140103, -0.89192012, -0.38787381, -0.73442197, -0.55942858, 0. , 0. , 1. , 1. , 0. ], [ 0.91816373, 0.38746035, -1.03645688, 0.62239229, 1.16554525, 1.58593883, -1.72218654, 1.22801627, 0.34640613, 0. , 0. , 1. , 0. , 1. ], [ 1.04404859, 0.38349785, -1.09057717, 0.16131329, 1.16554525, 1.58593883, -0.6911267 , 0.50067902, -0.26159877, 0. , 0. , 1. , 0. , 1. ]])
Model Training and Evaluation¶
Finally, you defined the problem, explored data and divided into training set and test set, and you applied transformation to feed Machine Learning algorithms more reliable data. Now you are ready to choose and train a Machine Learning algorithms. Let's train a very simple Linear Regression Model using Scikit-Learn.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(Oil_Prod_prepared, Oil_Prod_Target)
LinearRegression()
some_data_prepared=Oil_Prod_prepared[-5:]
some_data_Target=Oil_Prod_Target[-5:]
print("Predictioned Oil Production (e3m3/month) \n:", lin_reg.predict(some_data_prepared))
Predictioned Oil Production (e3m3/month) : [74.48560289 19.2169574 68.73388829 34.80117781 57.24405241]
print("Actuals Oil Production (e3m3/month): \n", list(some_data_Target))
Actuals Oil Production (e3m3/month): [57.94, 2.17, 76.1, 65.82, 107.75]
The predictions are not accurate. We can calculate Root Mean Square Error (RMSE) between the predicted and actual values for the entire training set using Scikit-Learn’s mean_squared_error function:
from sklearn.metrics import mean_squared_error
Oil_Prod_predictions = lin_reg.predict(Oil_Prod_prepared)
mse_lin = mean_squared_error(Oil_Prod_Target, Oil_Prod_predictions)
rmse_lin= np.sqrt(mse_lin)
rmse_lin
35.22975341708548
This prediction is better than nothing; however, it should not be a great score: most oil production ("OIL Prod. (e3m3/month") values are between 11 to 89 (P25% to P75%) so a prediction error of 35 may not be efficient. This is a case of underfitting since the model is not powerful enough to make a good prediction. To resolve underfitting, you can choose a more powerful model, or select better features to feed Machine Learning algorithms, or to reduce the constraints on the model (this is not the case here sine we have not applied any constraints). You could try to add more features for example a well has many properties that can have an impact for oil production. But what if we try a more complex model.
We can apply a regression model of Decision Tree technique (Decision Trees will be discussed in more detail in next lectures). Decision Trees is one of powerful models for finding complex nonlinear relationships in the data.
from sklearn.tree import DecisionTreeRegressor
dtree_reg = DecisionTreeRegressor(random_state=40)
dtree_reg.fit(Oil_Prod_prepared, Oil_Prod_Target)
DecisionTreeRegressor(random_state=40)
some_data_prepared=Oil_Prod_prepared[-5:]
some_data_Target=Oil_Prod_Target[-5:]
print("Predictioned Oil Production (e3m3/month): \n", dtree_reg.predict(some_data_prepared))
Predictioned Oil Production (e3m3/month): [ 57.94 2.17 76.1 65.82 107.75]
print("Actuals Oil Production (e3m3/month): \n", list(some_data_Target))
Actuals Oil Production (e3m3/month): [57.94, 2.17, 76.1, 65.82, 107.75]
Oil_Prod_predictions = dtree_reg.predict(Oil_Prod_prepared)
tree_mse = mean_squared_error(Oil_Prod_Target, Oil_Prod_predictions)
tree_rmse = np.round(np.sqrt(tree_mse),5)
tree_rmse
0.0
Wow! This is an amazing model. Zero error! it perfectly predicted all training instances! But wait, this may be a clear example of very bad overfitting the training data. How can we make sure that it is overfitting? As we discussed before, the test set should be untouched until a reliable model is achieved. What you can do to void overfitting or underfitting?
Divide the Training Data to Smaller Training and Validation Sets¶
You can assign part of the training data for a smaller training set and a validation set. See Figure below. You should train your model with the smaller training set and validate it with validation. This process should be repeated until you achieve a satisfying model that neither overfitting nor underfitting. Finally you should only test your satisfied model to quantify the error of your model for never-before-seen dataset. That is it. Avoid temptation to tune your model again to have better performance on test set.
You can use again Scikit-Learn train_test_split function to split the training set into a smaller training set and a validation set. But first you need to convert the prepared data format into Pandas DataFrame:
Feature_list=['Feature '+ str(i+1) for i in range(len(Oil_Prod_prepared[0]))] # Make a list for columns (Features)
Oil_Prod_pd = pd.DataFrame(Oil_Prod_prepared, columns=Feature_list)
Oil_Prod_p_target=pd.concat([Oil_Prod_pd, Oil_Prod_Target], axis=1)
Oil_Prod_p_target[0:5]
Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Feature 6 | Feature 7 | Feature 8 | Feature 9 | Feature 10 | Feature 11 | Feature 12 | Feature 13 | Feature 14 | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.846810 | 0.011023 | -0.801093 | 0.622392 | 1.165545 | 1.585939 | -1.176331 | 1.090783 | 0.692074 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 189.00 |
1 | 0.563134 | -0.325789 | -0.678758 | -0.878210 | -0.431401 | -0.891920 | -0.387874 | -0.734422 | -0.559429 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 6.92 |
2 | 0.918164 | 0.387460 | -1.036457 | 0.622392 | 1.165545 | 1.585939 | -1.722187 | 1.228016 | 0.346406 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 161.58 |
3 | 1.044049 | 0.383498 | -1.090577 | 0.161313 | 1.165545 | 1.585939 | -0.691127 | 0.500679 | -0.261599 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 25.60 |
4 | 0.909462 | 0.129898 | -0.915451 | 0.161313 | 1.165545 | 1.585939 | -0.691127 | 1.049613 | -0.423631 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 36.01 |
Then we split the training set to as 85% smaller training set and 15% validation set.
# Split the training set to 15% varidation set and smaller traning set
Smaller_training, Validation = train_test_split(Oil_Prod_p_target, test_size=0.15, random_state=40)
You should remove target columns ("OIL Prod. (e3m3/month)") from the smaller training and validation set and have them as separate data.
# Assign target values for retitived data.
Smaller_Target=Smaller_training["OIL Prod. (e3m3/month)"]
Validation_Target=Validation["OIL Prod. (e3m3/month)"]
# Remove "OIL Prod. (e3m3/month)" that is the target from list
Smaller_training = Smaller_training.drop("OIL Prod. (e3m3/month)", axis=1)
Validation = Validation.drop("OIL Prod. (e3m3/month)", axis=1)
Both small training and validation set are fed to Machine Learning. First train Decision Tree model with the small training and then test it with validation set.
# Apply Decision Tree for smaller training set
dtree_reg.fit(Smaller_training, Smaller_Target)
Oil_Prod_predictions = dtree_reg.predict(Smaller_training)
tree_mse = mean_squared_error(Smaller_Target, Oil_Prod_predictions)
tree_rmse = np.round(np.sqrt(tree_mse),5)
tree_rmse
0.0
Again Decision Tree has perfectly predict all values of smaller training set. Now the trained Decision Tree model is applied to predict validation set to see if this is overfitting or not. Notice the Decision Tree model has not seen validation set before.
# Apply trained Decision Tree for validation set
Oil_Prod_predictions = dtree_reg.predict(Validation)
tree_mse = mean_squared_error(Validation_Target, Oil_Prod_predictions)
tree_rmse = np.round(np.sqrt(tree_mse),5)
tree_rmse
42.8212
Yes. It is proved that the Decision Tree model is overfitting since it gives very bad performance on the unseen data set (Validation), worse than linear regression.
Reliable Evaluation with K-Fold Cross-Validation¶
K-fold cross-validation is a reliable technique to evaluate your model; it applies a number of folds to provide each segment of data a chance to serve as both the validation and training set. First data are divided into number of folds, for example 6 folds in Figure below; then a model is trained with 5 folds and 1 folds is used for validation: every time a different fold is considered for evaluation. The result is 6 predictions using different data for trainings and validations.
The mean $m_{R}$ and variance $\sigma_{R}^{2}$ of the Root Mean Square Error ($RMSE$) for each set can be calculated to quantify uncertainty in predictive model.
$\sigma_{R} ^{2}=\frac{1}{n}\sum_{i=1}^{n}\left (R_{i}-m_{R} \right )^2$
then you can use a confidence interval to tell how confident your prediction (see Figure below).
- 68.0% of the predictions fall within 1 standard deviation of $\large m_{R} \pm \sigma_{R}$
- 95.0% of the predictions fall within 2 standard deviation of $\large m_{R} \pm 2\sigma_{R}$
- 99.7% of the predictions fall within 3 standard deviation of $\large m_{R} \pm 3\sigma_{R}$
see here for more information.
You need larger fold (say bigger than 20) to have a reliable uncertainty assessment. This approach is much better than simply divide the training set into smaller training set and validation set. The only problem is using cross-validation with high fold for large data set can be computationally expensive.
Scikit-Learn has K-fold cross-validation function. The following code randomly splits the training set into 20 folds, then it trains and evaluates the Decision Tree model 20 times, by selecting a different fold for evaluation every time and training on the other 19 folds.
Scikit-Learn’s cross-validation feature assumes greater is better rather than a cost function: lower is better. The scoring function gives negative values for mean square error. So we need to multiply scores by -1 before calculating the square root. The following code shows cross validation for 50 folds for Linear Regression and Decision Tree models.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lin_reg, Oil_Prod_prepared, Oil_Prod_Target,scoring="neg_mean_squared_error", cv=50)
linreg_rmse_scores = np.sqrt(-scores)
scores = cross_val_score(dtree_reg, Oil_Prod_prepared, Oil_Prod_Target,scoring="neg_mean_squared_error", cv=50)
dtree_rmse_scores = np.sqrt(-scores)
The Figure below shows histogram of RMSE for Linear Regression and Decision Tree models
font = {'size' : 9}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(10, 3), dpi= 150, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
val=linreg_rmse_scores
histplt(val,bins=16,title='Histogram of RMSE_Linear Regression',xlabl=' RMSE ',xlimt=(24,58),axt=ax1)
ax2=plt.subplot(1,2,2)
val=dtree_rmse_scores
histplt(val,bins=14,title='Histogram of RMSE_Decision Tree Model',xlabl=' RMSE ',xlimt=(24,58),axt=ax2)
The cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation). The histograms shows that 68.0% of RMSE for Linear Regression model falls within 35.2$\pm$ 4.2 and but for Decision Tree 68.0% of RMSE falls within 43.0$\pm$ 4.0; so Decision Tree performs worse than the Linear Regression model! If you just used one validation set, you would not have this information. However, cross-validation comes at the cost of training the model several times that may not possible for all data set. Cross validation will be very computationally expensive for large data set.
Try Complex Models to Improve Performance¶
We should try several models to see which one gives better prediction. We have not discussed the models yet but let's try the Random Forest. We will discuss it in details in the coming lectures; here we just want to apply a more complex model to see how the results change. Random Forests work by training many Decision Trees using a random subsets of the features, then aggregating the predictions (averaging). See the code below to apply K-fold cross validation using Random Forest.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
scores = cross_val_score(forest_reg, Oil_Prod_prepared, Oil_Prod_Target,scoring="neg_mean_squared_error", cv=50)
forest_reg_scores = np.sqrt(-scores)
#forest_reg_scores
font = {'size' : 9}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(9, 3), dpi= 120,facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
val=forest_reg_scores
histplt(val,bins=9,title='Histogram of RMSE_Random Forest',xlabl=' RMSE ',xlimt=(24,58),axt=ax1)
Yes, Random Forest looks promising although we only used default hyperparameters. The histograms of cross-validation shows that 68.0% of RMSE for Random Forestn model falls within 30.8$\pm$ 2.5 which has both lower mean and standard deviation compared with Linear Regression and Decision Tree; we will explain how to fine-tune your model using Scikit-Learn to get higher performance. So, if we apply other complex models, we can get higher performance.
Try to Remove Irrelevant Features or Add New Features¶
Another way to improve the performance of your model is to remove irrelevant features or add new features. For example, we might want to remove the columns "Surface-Casing Weight (kg/m)" "Production-Casing Size (mm)", "Production-Casing Weight (kg/m)" since they may not have that much impact on oil production. See the code below:
Oil_Prod_prepared_new=np.delete(Oil_Prod_prepared, [4,5,6], axis=1)
scores = cross_val_score(lin_reg, Oil_Prod_prepared_new, Oil_Prod_Target,scoring="neg_mean_squared_error", cv=50)
linreg_rmse_scores = np.sqrt(-scores)
scores = cross_val_score(dtree_reg, Oil_Prod_prepared_new, Oil_Prod_Target,scoring="neg_mean_squared_error", cv=50)
dtree_rmse_scores = np.sqrt(-scores)
font = {'size' : 9}
matplotlib.rc('font', **font)
fig = plt.subplots(figsize=(10, 3), dpi= 150, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
val=linreg_rmse_scores
histplt(val,bins=16,title='Histogram of RMSE_Linear Regression',xlabl=' RMSE ',xlimt=(24,58),axt=ax1)
ax2=plt.subplot(1,2,2)
val=dtree_rmse_scores
histplt(val,bins=14,title='Histogram of RMSE_Decision Tree Model',xlabl=' RMSE ',xlimt=(24,58),axt=ax2)
Removing those features did not improve performance of Linear Regression and Decision Tree. We can try removing other features to see if any improvement achieved or not. Moreover, other features can be added to the dataset to improve performance.
Evaluate Final Model on the Test Set¶
By working on training set and tweaking the models, you should achieve a final model that you are satisfied with its performance. The last step is to evaluate the final model on the test set that your model has never before seen this dataset. You do not need to do anything special about this process, only get the features and labels from your test data set; you should apply the same data preparation and transformation for test set. Notice you only need to call transform() not fit_transform. It means, for example, missing values in test set should be imputed using the information from training set. This process should be applied for other transformation such as standardization.
# Remove target
Oil_Prod_test = test_set_strat.drop("OIL Prod. (e3m3/month)", axis=1)
Oil_Prod_Target_test = test_set_strat["OIL Prod. (e3m3/month)"].copy()
# Imputation
Oil_Prod_num = Oil_Prod_test.drop(["Deviation (True/False)","Prod. Formation"], axis=1)
im = imput_mdn.transform(Oil_Prod_num)
Oil_Prod_im = pd.DataFrame(im, columns=Oil_Prod_num.columns)
# Deal with text
# Categorical Feature 1: "Prod. Formation"
Oil_Prod_cat = Oil_Prod_test[["Prod. Formation"]]
cat_encoder = OneHotEncoder()
Oil_Prod_cat_1hot = cat_encoder.fit_transform(Oil_Prod_cat)
cat_hot_toarray_1=Oil_Prod_cat_1hot.toarray()
# Categorical Feature 2 (Deviation (True/False))
Oil_Prod_cat = Oil_Prod_test[["Deviation (True/False)"]]
Oil_Prod_cat_2hot = cat_encoder.fit_transform(Oil_Prod_cat)
cat_hot_toarray_2=Oil_Prod_cat_2hot.toarray()
# Standardization
Oil_Prod_im_MMS=Oil_Prod_im.copy()
Oil_Prod_im_Std=scaler.fit_transform(Oil_Prod_im_MMS)
Oil_Prod_im_Std
# Concatenate Scaled and One-hot Encoded Feature
Oil_Prod_prepared_test=np.concatenate((Oil_Prod_im_Std,cat_hot_toarray_1, cat_hot_toarray_2), axis=1)
Lets apply linear regression for test data set and calculate RMSE:
Oil_Prod_predictions = lin_reg.predict(Oil_Prod_prepared_test)
mse_lin = mean_squared_error(Oil_Prod_Target_test, Oil_Prod_predictions)
rmse_lin= np.sqrt(mse_lin)
rmse_lin
34.038783940477366
Most of the time, this estimation may be sufficient as the generalization error. You might want to have an idea of how precise this estimate is. For example, you may want to know what is the range of generalized error for a 95% confidence interval:in other words, what is uncertainty in the mean of your error. But how we can calculate the distribution of the mean error since we have only one mean of error computed from data. It can be calculated based on Central Limit Theorem that says for a large enough sample size n, the distribution of the sampling mean will approach a normal distribution.. Ok that is good. We now know the the distribution of the sampling mean is Gaussian (normal) based on Central Limit Theorem. The mean of the sampling distribution is the mean of the population. Therefore, if a population has a mean $\mu$, then the mean of the sampling distribution is also $\mu_{s}$: $\mu=\mu_{s}$. The variance of the sampled mean is calculated by:
$\large\sigma^{2}_{\mu_{s}}=\frac{\sigma^{2}}{n_{ind}}$ where $\sigma^{2}$ is the variance of data and $n_{ind}$ is number of independent data.
For example, 95% of the mean falls within:
$\large \mu\pm 2 \sqrt{\frac{\sigma^{2}}{n_{ind}}}$
So we can calculate the uncertainty in the mean of error for 95% confidence interval using the equation above:
squared_errors = (Oil_Prod_Target_test-Oil_Prod_predictions)**2
m=np.mean(squared_errors) # mean of squared error
vr= np.var(squared_errors) # Variance of squared error
sd_m=np.sqrt(vr/(len(Oil_Prod_predictions)))
print('Upper limit of the mean: ',np.round(np.sqrt(m+2*sd_m),1)) # 95% confidence interval
print('Lower limit of the mean: ',np.round(np.sqrt(m-2*sd_m),1)) # 95% confidence interval
Upper limit of the mean: 35.5 Lower limit of the mean: 32.5
You can also apply scipy package to simply calculate the range of error for any given confidence interval.
from scipy import stats
confidence = 0.95
squared_errors = (Oil_Prod_Target_test- Oil_Prod_predictions) ** 2
interval=stats.t.interval(confidence, len(squared_errors),loc=squared_errors.mean(),
scale=stats.sem(squared_errors))
print('Upper limit of the mean: ',np.round(np.sqrt(interval)[1],1)) # 95% confidence interval
print('Lower limit of the mean: ',np.round(np.sqrt(interval)[0],1)) # 95% confidence interval
Upper limit of the mean: 35.5 Lower limit of the mean: 32.5
You may achieve lower performance on test set than what you calculate by cross-validation since you probably apply a lot of hyperparameter tuning: this leads your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on never-before seen dataset. For this example, it was not the case since we have not tuned hyperparameters. But, even if you see this in your project, you must resist the temptation to change and tweak the hyperparameters in order to get higher performance on the test set since the improvements may not be applied for new data. Each time you tweak the parameter, the so called information leak will happen and you will end-up an model that seems very promising but performs very bad on new data set.
- Home
-
- Prediction of Movie Genre by Fine-tunning GPT
- Fine-tunning BERT for Fake News Detection
- Covid Tweet Classification by Fine-tunning BART
- Semantic Search Using BERT
- Abstractive Semantic Search by OpenAI Embedding
- Fine-tunning GPT for Style Completion
- Extractive Question-Answering by BERT
- Fine-tunning T5 Model for Abstract Title Prediction
- Image Captioning by Fine-tunning ViT
- Build Serverless ChatGPT API
- Statistical Analysis in Python
- Clustering Algorithms
- Customer Segmentation
- Time Series Forecasting
- PySpark Fundamentals for Big Data
- Predict Customer Churn
- Classification with Imbalanced Classes
- Feature Importance
- Feature Selection
- Text Similarity Measurement
- Dimensionality Reduction
- Prediction of Methane Leakage
- Imputation by LU Simulation
- Histogram Uncertainty
- Delustering to Improve Preferential Sampling
- Uncertainty in Spatial Correlation
-
- Machine Learning Overview
- Python and Pandas
- Main Steps of Machine Learning
- Classification
- Model Training
- Support Vector Machines
- Decision Trees
- Ensemble Learning & Random Forests
- Artificial Neural Network
- Deep Neural Network (DNN)
- Unsupervised Learning
- Multicollinearity
- Introduction to Git
- Introduction to R
- SQL Basic to Advanced Level
- Develop Python Package
- Introduction to BERT LLM
- Exploratory Data Analysis
- Object Oriented Programming in Python
- Natural Language Processing
- Convolutional Neural Network
- Publications