ENPE 519.11/ENCH 619.78 Machine Learning – Energy Systems at University of Calgary

Pandas Introduction

Pandas is an open-source Python package that provides high-performance, easy-to-use methods for data manipulation and analysis. It is based on the dataframe concept from the R programming language. Pandas can serve as the primary tool for data processing, data manipulation, and data cleaning.

We will use a dataset of oil production from 1000 hydrocarbon wells in Alberta (WPD.csv). The target is to predict oil production from well properties including location, depth, deviation, porosity, permeability, and so on. The well properties were retrieved from geoSCOUT, but the well locations have been changed and some key properties manipulated for confidentiality reasons and to prevent use for other purposes. The data were retrieved and generated for this course to practice data processing and apply Machine Learning. The predictions should NOT be used in any publication because the original data have been modified.

Look at the Data

In [1]:
# Reading data in Pandas

import pandas as pd # import Pandas

df = pd.read_csv("WPD.csv") # Read data file
#df.head()
df[0:5] # Display the first 5 rows
Out[1]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False NaN 114.3 NaN 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 NaN Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 NaN Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 NaN Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 NaN NaN 32.0 Sand 0.158 5.25 17.53

The columns attribute gives the names of the columns in the data file.

In [2]:
column=list(df.columns)
print(column)
['X Coordinate', 'Y Coordinate', 'Measured Depth (m)', 'Deviation (True/False)', 'Surface-Casing Weight (kg/m)', 'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)', 'Bore. Temp. (degC)', 'Prod. Formation', 'Porosity (fraction)', 'Permeability (Darcy)', 'OIL Prod. (e3m3/month)']

The info() method shows each column's data type (float, int, bool, or object/string) together with its non-null count, which reveals where values are missing.

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   X Coordinate                     1000 non-null   float64
 1   Y Coordinate                     1000 non-null   int64  
 2   Measured Depth (m)               1000 non-null   float64
 3   Deviation (True/False)           1000 non-null   bool   
 4   Surface-Casing Weight (kg/m)     593 non-null    float64
 5   Production-Casing Size (mm)      913 non-null    float64
 6   Production-Casing Weight (kg/m)  547 non-null    float64
 7   Bore. Temp. (degC)               801 non-null    float64
 8   Prod. Formation                  1000 non-null   object 
 9   Porosity (fraction)              1000 non-null   float64
 10  Permeability (Darcy)             1000 non-null   float64
 11  OIL Prod. (e3m3/month)           1000 non-null   float64
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 87.0+ KB
In [4]:
df
Out[4]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False NaN 114.3 NaN 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 NaN Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 NaN Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 NaN Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 NaN NaN 32.0 Sand 0.158 5.25 17.53
... ... ... ... ... ... ... ... ... ... ... ... ...
995 92.5 767 1466.0 False 35.7 139.7 20.8 51.0 Sand 0.230 6.26 8.74
996 548.3 475 749.8 False NaN 114.3 NaN 28.0 Shale 0.017 0.48 0.00
997 593.4 456 683.7 False NaN 177.8 NaN 26.0 Shale-Sand 0.332 49.75 832.45
998 540.9 328 1346.0 True 25.3 114.3 14.1 NaN Shale 0.111 0.59 0.00
999 617.8 377 844.0 False 35.7 139.7 20.8 27.0 Shale-Sand 0.292 9.95 153.43

1000 rows × 12 columns


The describe() method gives summary statistics (count, mean, standard deviation, quartiles, minimum, and maximum) for each column whose data type is numeric.

In [5]:
df.describe()
Out[5]:
X Coordinate Y Coordinate Measured Depth (m) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
count 1000.000000 1000.000000 1000.000000 593.000000 913.000000 547.000000 801.000000 1000.000000 1000.000000 1000.000000
mean 432.893400 489.749000 1364.379300 42.906745 142.495290 23.260512 40.774032 0.141840 5.269140 60.003280
std 172.206682 243.547912 855.633567 25.877438 25.953205 7.137579 21.256825 0.074501 8.665687 234.814659
min 12.300000 36.000000 276.000000 14.100000 13.700000 10.800000 5.000000 0.000000 0.050000 0.000000
25% 325.425000 339.750000 704.000000 35.700000 114.300000 17.300000 27.000000 0.089750 1.040000 0.000000
50% 487.250000 480.000000 1066.000000 35.700000 139.700000 20.800000 33.000000 0.141000 2.480000 4.715000
75% 581.675000 596.000000 1790.000000 48.100000 177.800000 25.300000 53.000000 0.192000 5.752500 33.117500
max 648.800000 1172.000000 6363.000000 595.000000 219.100000 53.600000 255.000000 0.400000 94.510000 3425.530000

Data Cleaning

Missing (NaN) Values

Ideally, every row and column would hold a valid value. In reality, missing values are one of the main challenges in machine learning. Several approaches are used in practice to deal with them.
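
Before choosing an approach, it helps to count the missing values in each column. A minimal sketch using isna() on the dataframe loaded above:

# Count NaN entries per column; a count of 0 means no missing values
print(df.isna().sum())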

Option 1: Drop Column

One naive approach is to remove the columns that contain missing values with the drop() method.

In [6]:
Select_column=column[3:8] # Select columns 3 to 7 (the four columns with missing values, plus 'Deviation (True/False)')
Select_column
Out[6]:
['Deviation (True/False)',
 'Surface-Casing Weight (kg/m)',
 'Production-Casing Size (mm)',
 'Production-Casing Weight (kg/m)',
 'Bore. Temp. (degC)']
In [7]:
df_drop_c=df.copy() # Copy to avoid modifying the original data
df_drop_c.drop(Select_column,axis=1, inplace=True)  # Drop selected 5 columns
df_drop_c
Out[7]:
X Coordinate Y Coordinate Measured Depth (m) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 Sand 0.158 5.25 17.53
... ... ... ... ... ... ... ...
995 92.5 767 1466.0 Sand 0.230 6.26 8.74
996 548.3 475 749.8 Shale 0.017 0.48 0.00
997 593.4 456 683.7 Shale-Sand 0.332 49.75 832.45
998 540.9 328 1346.0 Shale 0.111 0.59 0.00
999 617.8 377 844.0 Shale-Sand 0.292 9.95 153.43

1000 rows × 7 columns

In [8]:
aa=df_drop_c.drop(1,axis=0) # Drop the row with index 1; a new dataframe is returned and df_drop_c is unchanged
aa
Out[8]:
X Coordinate Y Coordinate Measured Depth (m) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 Shale-Sand 0.224 12.25 108.75
2 37.7 788 3984.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 Sand 0.158 5.25 17.53
5 127.5 1125 1560.6 Shale 0.000 0.10 0.00
... ... ... ... ... ... ... ...
995 92.5 767 1466.0 Sand 0.230 6.26 8.74
996 548.3 475 749.8 Shale 0.017 0.48 0.00
997 593.4 456 683.7 Shale-Sand 0.332 49.75 832.45
998 540.9 328 1346.0 Shale 0.111 0.59 0.00
999 617.8 377 844.0 Shale-Sand 0.292 9.95 153.43

999 rows × 7 columns

In [9]:
df_drop_c
Out[9]:
X Coordinate Y Coordinate Measured Depth (m) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 Sand 0.158 5.25 17.53
... ... ... ... ... ... ... ...
995 92.5 767 1466.0 Sand 0.230 6.26 8.74
996 548.3 475 749.8 Shale 0.017 0.48 0.00
997 593.4 456 683.7 Shale-Sand 0.332 49.75 832.45
998 540.9 328 1346.0 Shale 0.111 0.59 0.00
999 617.8 377 844.0 Shale-Sand 0.292 9.95 153.43

1000 rows × 7 columns

In [10]:
df_drop_c.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   X Coordinate            1000 non-null   float64
 1   Y Coordinate            1000 non-null   int64  
 2   Measured Depth (m)      1000 non-null   float64
 3   Prod. Formation         1000 non-null   object 
 4   Porosity (fraction)     1000 non-null   float64
 5   Permeability (Darcy)    1000 non-null   float64
 6   OIL Prod. (e3m3/month)  1000 non-null   float64
dtypes: float64(5), int64(1), object(1)
memory usage: 54.8+ KB

This approach is not reliable, since all of the information in the five dropped columns is removed from the data set, including the many values that were not missing.
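
For completeness, pandas can also drop every column containing at least one missing value automatically, without listing the columns by hand. A sketch (on this data set it would keep 8 of the 12 columns, since 4 columns contain NaNs):

# dropna(axis=1) drops any column that contains at least one NaN
df_drop_auto = df.dropna(axis=1)
print(df_drop_auto.columns.tolist())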

Option 2: Drop Row

Another approach is to remove the rows that contain missing values with the dropna() method.

In [11]:
df_drop_=df.copy() # Copy to avoid modifying the original data
df_drop_r=df_drop_.dropna() # Drop rows with missing values
df_drop_r
Out[11]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
6 636.9 462 713.0 True 48.1 177.8 25.3 27.0 Shale 0.128 0.70 0.00
7 556.2 441 761.0 False 35.7 139.7 20.8 28.0 Sand 0.055 0.36 0.00
8 572.0 521 587.0 False 25.3 114.3 14.1 15.0 Shale 0.033 0.46 0.00
10 366.2 385 1801.0 False 35.7 139.7 25.3 52.0 Shale-Sand 0.147 3.15 5.80
14 648.7 275 920.0 False 35.7 139.7 20.8 30.0 Shale 0.104 0.38 0.00
... ... ... ... ... ... ... ... ... ... ... ... ...
989 44.1 787 1895.0 False 35.7 139.7 23.1 66.0 Shale 0.144 11.73 22.01
990 487.2 81 1033.0 True 35.7 139.7 20.8 30.0 Shale 0.144 0.99 0.00
991 607.4 526 668.0 True 48.1 177.8 25.3 14.0 Shale-Sand 0.000 0.45 0.00
995 92.5 767 1466.0 False 35.7 139.7 20.8 51.0 Sand 0.230 6.26 8.74
999 617.8 377 844.0 False 35.7 139.7 20.8 27.0 Shale-Sand 0.292 9.95 153.43

426 rows × 12 columns

In [12]:
df_drop_r.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 426 entries, 6 to 999
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   X Coordinate                     426 non-null    float64
 1   Y Coordinate                     426 non-null    int64  
 2   Measured Depth (m)               426 non-null    float64
 3   Deviation (True/False)           426 non-null    bool   
 4   Surface-Casing Weight (kg/m)     426 non-null    float64
 5   Production-Casing Size (mm)      426 non-null    float64
 6   Production-Casing Weight (kg/m)  426 non-null    float64
 7   Bore. Temp. (degC)               426 non-null    float64
 8   Prod. Formation                  426 non-null    object 
 9   Porosity (fraction)              426 non-null    float64
 10  Permeability (Darcy)             426 non-null    float64
 11  OIL Prod. (e3m3/month)           426 non-null    float64
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 40.4+ KB

This approach is also not recommended, because the number of rows has dropped from 1000 to 426.
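
A middle ground is dropna() with its thresh argument, which keeps only rows that have at least a given number of non-missing values. A sketch (df_drop_partial is an illustrative name):

# Keep only rows where at least 11 of the 12 values are present
df_drop_partial = df.dropna(thresh=11)
print(len(df_drop_partial))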

Option 3: Replace with Median (mean)

A common practice is to replace (impute) missing values with the median (or mean) of the column. The median is the middle value of a sorted list: 50% of the values are smaller and 50% are larger. The following code replaces all missing values in the five selected columns with the median of each column.

In [13]:
df_im=df.copy()

# Get the median of each selected column and replace its missing values
for col in Select_column:
    P50 = df[col].median()               # Calculate the median (P50) of the column
    df_im[col] = df_im[col].fillna(P50)  # Replace NaN with the median

df_im
Out[13]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
... ... ... ... ... ... ... ... ... ... ... ... ...
995 92.5 767 1466.0 False 35.7 139.7 20.8 51.0 Sand 0.230 6.26 8.74
996 548.3 475 749.8 False 35.7 114.3 20.8 28.0 Shale 0.017 0.48 0.00
997 593.4 456 683.7 False 35.7 177.8 20.8 26.0 Shale-Sand 0.332 49.75 832.45
998 540.9 328 1346.0 True 25.3 114.3 14.1 33.0 Shale 0.111 0.59 0.00
999 617.8 377 844.0 False 35.7 139.7 20.8 27.0 Shale-Sand 0.292 9.95 153.43

1000 rows × 12 columns

In [14]:
df_im.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   X Coordinate                     1000 non-null   float64
 1   Y Coordinate                     1000 non-null   int64  
 2   Measured Depth (m)               1000 non-null   float64
 3   Deviation (True/False)           1000 non-null   bool   
 4   Surface-Casing Weight (kg/m)     1000 non-null   float64
 5   Production-Casing Size (mm)      1000 non-null   float64
 6   Production-Casing Weight (kg/m)  1000 non-null   float64
 7   Bore. Temp. (degC)               1000 non-null   float64
 8   Prod. Formation                  1000 non-null   object 
 9   Porosity (fraction)              1000 non-null   float64
 10  Permeability (Darcy)             1000 non-null   float64
 11  OIL Prod. (e3m3/month)           1000 non-null   float64
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 87.0+ KB
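
The loop above can also be written as a single vectorized call, because fillna() accepts a Series of per-column medians. A sketch restricted to the numeric columns (equivalent to the loop for this data set):

# Fill NaNs in all numeric columns with each column's median in one call
num_cols = df.select_dtypes(include='number').columns
df_im2 = df.copy()
df_im2[num_cols] = df_im2[num_cols].fillna(df[num_cols].median())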

Remove Outliers

Outliers are extremely high or low values that deviate from the pattern of the majority of the data. Outliers should be removed from the dataset (taking care not to mistake genuine novelty for an outlier). A straightforward detection approach flags values more than $n$ standard deviations from the mean, i.e., outside $m \pm n\sigma$ ($m$ = mean, $\sigma$ = standard deviation). For example, with $n=3$, a value is an outlier if it is greater than $m + 3\sigma$ or less than $m - 3\sigma$. The following function applies this approach.

In [15]:
def outlier_remove(df, n, name):
    """Delete rows for a specified column where values lie outside m +/- n*sigma

    df  : Pandas dataframe
    n   : n in the equation m ± n*sigma
    name: Column name
    """
    mean = df[name].mean() # Calculate mean of the column
    sd = df[name].std()    # Calculate standard deviation of the column
    drop_r = df.index[(mean - n * sd > df[name]) | (mean + n * sd < df[name])] # Find rows outside m ± n*sigma
    df.drop(drop_r, axis=0, inplace=True)   # Drop those rows
    df.reset_index(inplace=True, drop=True) # Reset the index
In [16]:
# Drop outliers in last column 'OIL Prod. (e3m3/month)'
df_im_out=df_im.copy()
outlier_remove(df_im_out,n=3,name='OIL Prod. (e3m3/month)')
df_im_out
Out[16]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
... ... ... ... ... ... ... ... ... ... ... ... ...
976 535.5 788 1744.0 True 71.4 139.7 20.8 28.0 Sand 0.185 2.79 1.22
977 92.5 767 1466.0 False 35.7 139.7 20.8 51.0 Sand 0.230 6.26 8.74
978 548.3 475 749.8 False 35.7 114.3 20.8 28.0 Shale 0.017 0.48 0.00
979 540.9 328 1346.0 True 25.3 114.3 14.1 33.0 Shale 0.111 0.59 0.00
980 617.8 377 844.0 False 35.7 139.7 20.8 27.0 Shale-Sand 0.292 9.95 153.43

981 rows × 12 columns
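
A common alternative to the mean/standard-deviation rule is the interquartile-range (IQR) fence, which is more robust because the quartiles are barely affected by the outliers themselves. A minimal sketch:

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
col = df_im['OIL Prod. (e3m3/month)']
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
keep = (col >= q1 - 1.5 * iqr) & (col <= q3 + 1.5 * iqr)
df_iqr = df_im[keep].reset_index(drop=True)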

Concatenation

New dataframes can be built by concatenating rows or columns. Below, a new dataframe is created by concatenating the two columns 'Measured Depth (m)' and 'OIL Prod. (e3m3/month)' with the concat() function.

In [17]:
col_1= df_im['Measured Depth (m)']
col_2 = df_im['OIL Prod. (e3m3/month)']
result = pd.concat([col_1, col_2], axis=1)
result
Out[17]:
Measured Depth (m) OIL Prod. (e3m3/month)
0 2145.8 108.75
1 1638.0 0.33
2 3984.0 21.87
3 3809.0 46.30
4 2335.0 17.53
... ... ...
995 1466.0 8.74
996 749.8 0.00
997 683.7 832.45
998 1346.0 0.00
999 844.0 153.43

1000 rows × 2 columns

Rows can be concatenated with the concat() function as well. The code below concatenates the first 4 rows of the imputed dataset with rows 996 to 998 (note that the slice df_im[-4:-1] excludes the final row).

In [18]:
row_1= df_im[0:4]  # Retrieve first 4 rows
row_2 = df_im[-4:-1]  # Retrieve rows 996 to 998 (the final row is excluded by the slice)
result = pd.concat([row_1, row_2], axis=0) # Concatenate rows
result
Out[18]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
996 548.3 475 749.8 False 35.7 114.3 20.8 28.0 Shale 0.017 0.48 0.00
997 593.4 456 683.7 False 35.7 177.8 20.8 26.0 Shale-Sand 0.332 49.75 832.45
998 540.9 328 1346.0 True 25.3 114.3 14.1 33.0 Shale 0.111 0.59 0.00

Saving a Dataframe as CSV

A dataframe can be saved as a CSV file with the to_csv() method, as the following code shows.

In [19]:
filename="myfile.csv"
df_im.to_csv(filename, index=False) # index = False not writing row numbers
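
To confirm the file was written correctly, it can be read back with read_csv(). A quick sketch (filename is the variable defined above):

df_check = pd.read_csv(filename)  # Read the saved file back
print(len(df_check))              # Expect 1000 rows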

Shuffling, Grouping and Sorting

Shuffling Dataset

A familiar example of shuffling is a card game: at the end of each round the cards are collected and shuffled so that they are distributed randomly and each player receives cards by chance. In Machine Learning, the data set must be split into training and validation sets, so shuffling is very important to avoid any bias or hidden patterns in the split data.

If the same shuffle is needed every time the code is run, a random seed can be set. See the code below.

In [20]:
import numpy as np

# np.random.seed(82)  # Uncomment for a reproducible shuffle
df_shfl = df_im.reindex(np.random.permutation(df_im.index)) # Shuffle the rows of df_im
df_shfl.reset_index(inplace=True, drop=True) # Reset index
df_shfl[0:5]
Out[20]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 451.7 541 970.2 False 35.7 177.8 20.8 31.0 Sand 0.097 3.87 0.0
1 388.5 639 832.1 False 35.7 139.7 20.8 26.0 Shale 0.000 0.48 0.0
2 270.8 461 3999.0 True 35.7 114.3 22.5 33.0 Shale 0.125 0.76 0.0
3 290.8 480 1695.3 False 35.7 139.7 20.8 33.0 Sand 0.048 1.15 0.0
4 132.0 1104 276.0 False 25.3 114.3 14.1 25.0 Shale 0.142 1.16 0.0
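
An equivalent and arguably more idiomatic shuffle uses sample() with frac=1, where random_state plays the role of the seed. A sketch:

# Shuffle all rows (frac=1) reproducibly, then reset the index
df_shfl2 = df_im.sample(frac=1, random_state=82).reset_index(drop=True)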

Grouping Data Set

Grouping is used to summarize data. The following code groups by the feature 'Prod. Formation' and computes the mean and the sum of the target variable 'OIL Prod. (e3m3/month)' for each category.

In [21]:
gb_mean=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].mean()
gb_mean
Out[21]:
Prod. Formation
Sand          64.015467
Shale         59.502943
Shale-Sand    54.482824
Name: OIL Prod. (e3m3/month), dtype: float64
In [22]:
gb_sum=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].sum()
gb_sum
Out[22]:
Prod. Formation
Sand          19204.64
Shale         31536.56
Shale-Sand     9262.08
Name: OIL Prod. (e3m3/month), dtype: float64
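
Several statistics can be computed in one pass with agg(). A sketch that reproduces both results above and adds a count per category:

# Mean, sum and count of the target for each formation category
gb_stats = df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].agg(['mean', 'sum', 'count'])
print(gb_stats)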

Sorting Data Set

The sort_values() function sorts a dataframe by one or more columns. The code below sorts the wells by 'Measured Depth (m)' in ascending order.

In [23]:
df_sort = df_im.sort_values(by='Measured Depth (m)', ascending=True)
#df_sort.reset_index(inplace=True, drop=True) # Reset index
df_sort
Out[23]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
90 132.0 1104 276.0 False 25.3 114.3 14.1 25.0 Shale 0.142 1.16 0.00
537 72.9 784 295.5 False 25.3 114.3 14.1 22.0 Shale 0.000 0.66 0.00
403 591.3 791 297.0 False 25.3 114.3 14.1 26.0 Shale 0.142 13.37 72.26
507 621.7 593 321.0 True 53.6 177.8 34.2 33.0 Shale 0.091 1.72 0.00
105 631.2 583 343.2 False 56.5 139.7 20.8 33.0 Sand 0.168 3.19 19.23
... ... ... ... ... ... ... ... ... ... ... ... ...
273 179.8 628 5066.0 True 53.6 139.7 20.8 33.0 Shale 0.114 3.23 0.00
331 110.2 677 5116.0 True 35.7 139.7 25.3 33.0 Sand 0.000 0.13 0.00
951 48.6 718 5226.0 True 59.5 139.7 20.8 33.0 Shale 0.249 8.97 83.09
754 188.4 619 5790.0 True 53.6 114.3 22.5 33.0 Shale 0.194 4.03 0.39
570 343.2 383 6363.0 True 60.3 139.7 38.7 33.0 Sand 0.239 3.19 12.26

1000 rows × 12 columns
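
sort_values() also accepts several columns and per-column directions. A sketch that sorts by formation first and then by depth, deepest well first, within each formation:

df_sort2 = df_im.sort_values(by=['Prod. Formation', 'Measured Depth (m)'],
                             ascending=[True, False])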

Feature Engineering

Feature engineering is the process of extracting new features from raw data via data-mining techniques. The new features can enhance the performance of machine learning algorithms. New features can also be calculated from existing fields. For the WPD.csv data set, we can multiply "Surface-Casing Weight (kg/m)" by "Production-Casing Weight (kg/m)" to create a new feature.

In [24]:
df_im_c=df_im.copy() # Keep an unmodified backup copy; the insert below changes df_im itself
df_im.insert(1, 'New Column', (df_im['Surface-Casing Weight (kg/m)']*df_im['Production-Casing Weight (kg/m)']).astype(int)) # Insert the product as an integer column at position 1
df_im
Out[24]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 742 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 437 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 803 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 721 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 1114 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 92.5 742 767 1466.0 False 35.7 139.7 20.8 51.0 Sand 0.230 6.26 8.74
996 548.3 742 475 749.8 False 35.7 114.3 20.8 28.0 Shale 0.017 0.48 0.00
997 593.4 742 456 683.7 False 35.7 177.8 20.8 26.0 Shale-Sand 0.332 49.75 832.45
998 540.9 356 328 1346.0 True 25.3 114.3 14.1 33.0 Shale 0.111 0.59 0.00
999 617.8 742 377 844.0 False 35.7 139.7 20.8 27.0 Shale-Sand 0.292 9.95 153.43

1000 rows × 13 columns

Standardize (Normalize) Values

Most Machine Learning algorithms require the input to be standardized so that values of different variables can be compared reliably. For example, if variable one ranges from 3 to 1000 and variable two ranges from 1 to 3, the two cannot be compared directly: the highest value of variable two (3) equals the lowest value of variable one. This inconsistency in scale leads to poor performance in many Machine Learning algorithms. So, it is highly recommended to standardize (normalize) the data to a common mean ($\mu$) and standard deviation ($\sigma$) before feeding it to a Machine Learning algorithm. One very common standardization (normalization) is the Z-score:

$\Large z = \frac{x - \mu}{\sigma} $

where $\mu = \frac{1}{n} \sum_{i=1}^n x_i$, $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$, and $n$ is the number of data points. This ensures that each variable has $\mu=0$ and $\sigma=1$.

This standardization should be applied to all features except the target (label). For the WPD.csv data set, the target is "OIL Prod. (e3m3/month)", which should not be changed.

The following Python code replaces 'Measured Depth (m)' with its z-score. The same can be done for every variable.

In [25]:
from scipy.stats import zscore

mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("Before standardization: Mean= ",int(mean), ", variance= ",int(variance))

df_im['Measured Depth (m)'] = zscore(df_im['Measured Depth (m)'])

mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("After standardization: Mean= ",int(mean), ", variance= ",int(variance))
Before standardization: Mean=  1364 , variance=  732108
After standardization: Mean=  0 , variance=  1
In [26]:
df_im[0:5]
Out[26]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 742 372 0.913723 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 437 551 0.319947 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 803 788 3.063147 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 721 311 2.858518 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 1114 727 1.134956 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
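
The same z-score can be computed with pandas alone. One subtlety: scipy's zscore uses the population standard deviation (ddof=0), while the pandas std() method defaults to ddof=1, so ddof must be set explicitly for the results to match. A sketch on another column:

col = df_im['Porosity (fraction)']
z = (col - col.mean()) / col.std(ddof=0)  # ddof=0 matches scipy.stats.zscore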

Categorical Values

The input data for Machine Learning must be entirely numeric, which means any text has to be converted to numbers. In the WPD.csv data set, the columns 'Deviation (True/False)' and 'Prod. Formation' hold categorical values. The value_counts() method shows which categories exist and how many rows belong to each.

In [27]:
df_c=df_im.copy()
df_c['Deviation (True/False)'].value_counts()
Out[27]:
False    652
True     348
Name: Deviation (True/False), dtype: int64
In [28]:
df_c['Prod. Formation'].value_counts()
Out[28]:
Shale         530
Sand          300
Shale-Sand    170
Name: Prod. Formation, dtype: int64

We can replace each category with an integer, starting from 0.

In [29]:
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(False, 0) # Replace False with 0
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(True, 1)  # Replace True with 1
df_c[0:5]
Out[29]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 742 372 0.913723 0.0 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 437 551 0.319947 1.0 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 803 788 3.063147 1.0 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 721 311 2.858518 1.0 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 1114 727 1.134956 1.0 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
In [30]:
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale', 0)       # Replace Shale with 0
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Sand', 1)        # Replace Sand with 1
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale-Sand', 2)  # Replace Shale-Sand with 2
df_c[0:5]
Out[30]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 742 372 0.913723 0.0 35.7 114.3 20.8 53.0 2 0.224 12.25 108.75
1 435.7 437 551 0.319947 1.0 25.3 114.3 17.3 33.0 1 0.118 4.11 0.33
2 37.7 803 788 3.063147 1.0 35.7 114.3 22.5 33.0 0 0.121 1.99 21.87
3 346.7 721 311 2.858518 1.0 41.7 114.3 17.3 33.0 1 0.170 5.84 46.30
4 254.6 1114 727 1.134956 1.0 53.6 139.7 20.8 32.0 1 0.158 5.25 17.53
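
The three replace() calls above can be collapsed into a single map() with a dictionary. A sketch on a fresh copy (values absent from the dictionary would become NaN):

df_c2 = df_im.copy()
formation_codes = {'Shale': 0, 'Sand': 1, 'Shale-Sand': 2}
df_c2['Prod. Formation'] = df_c2['Prod. Formation'].map(formation_codes)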

One challenge with this representation is that Machine Learning algorithms treat two nearby values as more similar than two distant values. That may be fine in some cases (ordered categories such as “bad”, “average”, “good”), but it is not always true. A common solution is to create one binary attribute per category. This is called one-hot encoding, because exactly one attribute equals 1 (hot) while the others are 0 (cold).

In [31]:
One_hot = pd.get_dummies(['a','b'],prefix='Deviation')
print(One_hot)
   Deviation_a  Deviation_b
0            1            0
1            0            1

These dummies should be merged back into the data frame.

In [32]:
One_hot = pd.get_dummies(df_im['Deviation (True/False)'],prefix='Deviation')
print(One_hot[0:10]) # Just show the first 10
   Deviation_False  Deviation_True
0                1               0
1                0               1
2                0               1
3                0               1
4                0               1
5                1               0
6                0               1
7                1               0
8                1               0
9                0               1
In [33]:
df_one_hot = pd.concat([df_im,One_hot],axis=1)
df_one_hot[0:5]
Out[33]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month) Deviation_False Deviation_True
0 352.6 742 372 0.913723 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75 1 0
1 435.7 437 551 0.319947 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33 0 1
2 37.7 803 788 3.063147 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87 0 1
3 346.7 721 311 2.858518 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30 0 1
4 254.6 1114 727 1.134956 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53 0 1

The same should be done for the 'Prod. Formation' column.

In [34]:
One_hot = pd.get_dummies(df_im['Prod. Formation'],prefix='Formation')
df_one_hot = pd.concat([df_one_hot,One_hot],axis=1)
df_one_hot[0:5]
Out[34]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month) Deviation_False Deviation_True Formation_Sand Formation_Shale Formation_Shale-Sand
0 352.6 742 372 0.913723 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75 1 0 0 0 1
1 435.7 437 551 0.319947 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33 0 1 1 0 0
2 37.7 803 788 3.063147 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87 0 1 0 1 0
3 346.7 721 311 2.858518 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30 0 1 1 0 0
4 254.6 1114 727 1.134956 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53 0 1 1 0 0

Finally, the original columns 'Deviation (True/False)' and 'Prod. Formation' should be removed.

In [35]:
df_one_hot.drop(['Deviation (True/False)','Prod. Formation'], axis=1, inplace=True)
#df_one_hot
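
The encoding, merging, and dropping steps can also be collapsed into one call: get_dummies() accepts a whole dataframe, encodes only the listed columns, and drops the originals automatically. A sketch:

df_one_hot2 = pd.get_dummies(df_im,
                             columns=['Deviation (True/False)', 'Prod. Formation'],
                             prefix=['Deviation', 'Formation'])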

Split Data to Training and Validation

Machine Learning models should be evaluated on data they have never seen before. Models usually perform well on the data they were trained on, but the real aim of training is high performance on completely new data. The data are therefore split into a training set and a validation set according to some ratio; a common choice is 80% training and 20% validation. The model learns from the training data and is evaluated on the validation data. The image below shows how a model is trained on 80% of the data and then validated against the remaining 20%. Based on the performance on the validation set, the machine learning algorithm is modified, and the process repeats until the performance is satisfactory. This repetition can lead to information leak: the model may perform well on the validation set but not on new data. For this reason, a portion of the data is sometimes held out for testing. We will discuss this in the next lectures.

 

[Figure: the data set is split into 80% training and 20% validation]

 

The following code splits the data into approximately 80% training and 20% validation sets.

In [36]:
import numpy as np

df = df.reindex(np.random.permutation(df.index)) # Shuffle the data
df.reset_index(inplace=True, drop=True)          # Reset the index
MASK = np.random.rand(len(df)) < 0.8             # Boolean mask: True for about 80% of the rows
df_train = pd.DataFrame(df[MASK])                # ~80% of the rows for training
df_validation = pd.DataFrame(df[~MASK])          # Remaining ~20% for validation

print("Number of Training: "+ str(len(df_train)))
print("Number of Validation: "+  str(len(df_validation)))
Number of Training: 825
Number of Validation: 175
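
Note that the random mask gives only an approximate 80/20 split (825/175 here). If scikit-learn is available, its train_test_split() shuffles and splits to an exact ratio in one call. A sketch:

from sklearn.model_selection import train_test_split

df_train, df_validation = train_test_split(df, test_size=0.2, random_state=42)
print(len(df_train), len(df_validation))  # 800 200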