Summary

This notebook presents different approaches to exploratory data analysis and basic statistical analysis. Python functions are written to produce clear data visualizations, including histograms, CDFs, swarm plots, box plots, violin plots, kernel density estimation, cross plots and correlation. Probability distributions, including the binomial PMF and the Poisson, Normal and Exponential distributions, are then discussed. Finally, data manipulation and data analysis with Pandas are presented.

Python functions and data files needed to run this notebook are available via this link.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib 
import pylab as plt
from functions import *  # import required functions to run this notebook
from matplotlib import gridspec  # used below for the GridSpec subplot layouts
import warnings
warnings.filterwarnings('ignore')

Data Visualization

The data are the 2008 US swing state election results, downloaded from https://www.kaggle.com/datasets/aman1py/swing-states?resource=download

In [2]:
df_swing=pd.read_csv('./Data/2008_swing_states.csv')
df_swing[['state','county','dem_share']][:5]
Out[2]:
state county dem_share
0 PA Erie County 60.08
1 PA Bradford County 40.64
2 PA Tioga County 36.07
3 PA McKean County 41.21
4 PA Potter County 31.04

Histogram

In [3]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 7), dpi= 100, facecolor='w', edgecolor='k')

gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(gs[0, :2], ) 
KPI='dem_share'
val=df_swing[df_swing['state']=='OH'][KPI]
EDA_plot.histplt (val,bins=20,title='OH State',xlabl=KPI,
         ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax1,
         scale=1.15,loc=1,font=10,color='#8DC63F')

ax2=plt.subplot(gs[0, 2:]) 
KPI='dem_share'
val=df_swing[df_swing['state']=='PA'][KPI]
EDA_plot.histplt (val,bins=20,title='PA State',xlabl=KPI,
         ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax2,
         scale=1.15,loc=1,font=10,color='#025174')

ax3=plt.subplot(gs[1, 1:3]) 
val=df_swing[df_swing['state']=='FL'][KPI]
EDA_plot.histplt (val,bins=20,title='FL State',xlabl=KPI,
         ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax3
         ,scale=1.15,loc=1,font=10,color='#323539')

plt.subplots_adjust(wspace=0.9)
plt.subplots_adjust(hspace=0.45)
fig.suptitle('2008 US Swing State Election_Histogram of dem_share for each State', fontsize=16,y=1.02)

plt.show()

Bee Swarm plot

In [4]:
font = {'size'   : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 100, facecolor='w', edgecolor='k')


sns.set_theme(style="whitegrid")
ax = sns.swarmplot(x="state", y="dem_share", data=df_swing)
df_means = df_swing.groupby("state")["dem_share"].agg("mean").reset_index()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
sns.scatterplot(x="state", y="dem_share", marker='X', color='black',
                s=100, zorder=3, ax=ax, legend=True, data=df_means)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.show()

Box Plot

The center of a box plot is the 50th percentile (median); the upper limit of the box is the 75th percentile and the lower limit is the 25th percentile. The distance between them is called the IQR (Interquartile Range). The whiskers extend to the rest of the data, up to 1.5 IQR beyond the box.
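The quartiles and IQR behind the box plot can be checked directly with NumPy; a minimal sketch, using the dem_share values of the OH state from the dataframe loaded above:

q1, q2, q3 = np.percentile(df_swing[df_swing['state'] == 'OH']['dem_share'], [25, 50, 75])
iqr = q3 - q1  # interquartile range (height of the box)
print(f"Q1={q1:.1f}, median={q2:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print(f"whiskers reach at most [{q1 - 1.5*iqr:.1f}, {q3 + 1.5*iqr:.1f}]")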

In [5]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 100, facecolor='w', edgecolor='k')


sns.set_theme(style="whitegrid")
ax = sns.swarmplot(x="state", y="dem_share", data=df_swing)
ax = sns.boxplot(x="state", y="dem_share", data=df_swing)
df_means = df_swing.groupby("state")["dem_share"].agg("mean").reset_index()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
sns.scatterplot(x="state", y="dem_share", marker='X', color='black',
                s=100, zorder=3, ax=ax, legend=True, data=df_means)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.show()

Violin Plot

In [6]:
#plt.style.available
plt.style.use('classic')
In [7]:
KPI='dem_share'
val=[]
val.append(np.array(df_swing[df_swing['state']=='PA'][KPI]))
val.append(np.array(df_swing[df_swing['state']=='OH'][KPI]))
val.append(np.array(df_swing[df_swing['state']=='FL'][KPI]))
#
Position=[100,500,1000]     

font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 5), dpi= 130, facecolor='w', edgecolor='k')

v1 = plt.violinplot(val, points=100, positions=Position, widths=300,
               showmeans=True,showextrema=True, showmedians=False,bw_method =0.15)
for b in v1['bodies']:
    m = np.mean(b.get_paths()[0].vertices[:, 0])
    b.get_paths()[0].vertices[:, 0] = np.clip(b.get_paths()[0].vertices[:, 0], -np.inf, m)
    b.set_color('darkred')
    b.set_edgecolor('k')
    b.set_linewidth(1)  
    b.set_alpha(0.9)
#plt.axis('off')
plt.ylabel('dem_share',fontsize=12)
    
plt.xticks([100,500,1000],['PA','OH','FL'],rotation=0)
plt.title("Violin Plot",fontweight="bold",fontsize=16)
plt.show()

Kernel Density Estimation

In [8]:
font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 130, facecolor='w', edgecolor='k')

ax1=plt.subplot(2,1,1)
label=['OH State','PA State','FL State']

data_var=len(label)*[None]
KPI='dem_share'
data_var[0]=np.array(df_swing[df_swing['state']=='OH'][KPI])
data_var[1]=np.array(df_swing[df_swing['state']=='PA'][KPI])
data_var[2]=np.array(df_swing[df_swing['state']=='FL'][KPI])

nvar=3
xs = np.linspace(0,100,500)
colors = ['r','b','g']
title='2008 US Swing State Election_Kernel Density Estimation for dem_share for each State'
EDA_plot.KDE(xs,data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Probability',xlim=(0,100),ylim=(0,0.01),
    LAMBA=0.5,linewidth=2.5,loc=0,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)

Empirical CDF

In [9]:
font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 130, facecolor='w', edgecolor='k')

label=['OH State','PA State','FL State']

data_var=len(label)*[None]
KPI='dem_share'
data_var[0]=np.array(df_swing[df_swing['state']=='OH'][KPI])
data_var[1]=np.array(df_swing[df_swing['state']=='PA'][KPI])
data_var[2]=np.array(df_swing[df_swing['state']=='FL'][KPI])

ax1=plt.subplot(2,1,1)
title=f'2008 US Swing State Election_Empirical CDF_ for dem_share for each State'
EDA_plot.CDF_plot(data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Cumulative Probability',bins=100,
    xlim=(0,100),ylim=(0,1),linewidth=2.5,loc=0,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)

plt.show()

Percentile

In [10]:
font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')

label=['FL State']
KPI='dem_share'
data_var=np.array(df_swing[df_swing['state']=='FL'][KPI])

ax1=plt.subplot(2,1,1)
title=f'2008 US Swing State Election_Empirical CDF_for dem_share for FL State'
EDA_plot.CDF_plot(data_var,1,label,colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=100,
    xlim=(0,90),ylim=(0,1),linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)

# 25th percentile
per=0.25
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '25th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
         fontsize=10)

# 50th percentile
per=0.5
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '50th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
         fontsize=10)

# 75th percentile
per=0.75
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '75th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
         fontsize=10)

plt.show()

Cross plot

In [11]:
font = {'size'   : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(7, 5), dpi= 100, facecolor='w', edgecolor='k')


x=df_swing['total_votes']
y=df_swing['dem_share']
EDA_plot.CrossPlot(x,y,title='cross plot: total votes vs. dem_vote',
          xlabl='total votes',ylabl='percent of dem_vote',loc=1,xlimt=(np.nanmin(x),np.nanmax(x)),
          ylimt=(np.nanmin(y),np.nanmax(y)),axt=ax1,scale=1.2,alpha=0.9,markersize=6,marker='bo',font=font)

Correlation

Synthetic data with 6 features, Var1 to Var6, drawn from Gaussian, triangular and uniform distributions, are generated to have linear and non-linear correlations with a target, for 100,000 data points.

In [12]:
from scipy.stats import zscore  # used to standardize the columns

# Generate synthetic data
df=quadrac_swiss_roll(noise=2,nsim=100000,time_rand1=1,seed=42,a1=20,b1=8,a2=50,b2=8,swiss_roll=True)
features_colums=df.columns
# Standardize data
for i in df.columns:
    df[i] = zscore(df[i])

The figures below show histograms of the variables and the cross-plot matrix between the Var1 to Var6 features and the target. Var1 and Var2 are non-linearly correlated with the target, Var3 has a strong positive correlation (0.68), and Var4 and Var5 are negatively correlated with the target.

In [13]:
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 10), dpi= 100, facecolor='w', edgecolor='k')
colors_map = plt.cm.get_cmap('jet')

colors = colors_map(np.linspace(0,0.8,len(features_colums)))

for ir in range(len(features_colums)):
    ax1=plt.subplot(4,2,ir+1) 
    val=df[features_colums[ir]]

    EDA_plot.histplt(val,bins=20,title=f'{features_colums[ir]}',xlabl=None,days=False,
             ylabl=None,xlimt=None,ylimt=(0,0.3)
             ,axt=ax1,nsplit=5,scale=1.02,loc=2,font=10.5,color=colors[len(features_colums)-ir-1])
plt.subplots_adjust(hspace=0.4)
plt.subplots_adjust(wspace=0.3)
fig.suptitle(f'Histogram of Variables for Synthetic Data', fontsize=16,y=0.96)
plt.show()
In [14]:
font = {'size'   :9 }
plt.rc('font', **font)

fig=plt.figure(figsize=(10, 9), dpi= 100, facecolor='w', edgecolor='k')
# Standardize data
for i in features_colums:
    df[i] = zscore(df[i])
cplotmatrix(df,font=font,alpha=0.008,marker='g.',missin_rep=False)
plt.show()

Correlation Coefficient

In [15]:
font = {'size'   : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')

corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
features_colums=list(corr.index)
Correlation_plot.corr_bar(coefs,clmns=features_colums,select=False,yfontsize=6.0,title=f'Linear Correlation with Target',
                          ymax_vert_lin=30,xlim = [-0.5, 0.7])

Coefficient of Determination

The coefficient of determination, also termed $R^{2}$, is a basic approach to quantify the importance of features for a target. In statistics, $R^{2}$ assesses the ability of a statistical model to explain and predict future outcomes. In other words, if we have a dependent variable y and an independent variable x in a model, then $R^{2}$ helps determine how much of the variation in y is explained by the variation in x. It is one of the key outputs of regression analysis and is used when we want to predict future outcomes or test a model against related information. $R^{2}$ lies between 0 and 1: the higher the value of $R^{2}$, the better the prediction and the stronger the model. $R^{2}$ is basically the square of the correlation coefficient ($\rho$).

There are multiple formulas to calculate the coefficient of determination. It can be calculated simply from the correlation coefficient:

$\large \text{Coefficient of Determination} = (\text{Correlation Coefficient})^{2}$, i.e. $R^{2}=\rho^{2}$
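As a quick numerical check (a sketch on synthetic x and y, not the notebook's data), the $R^{2}$ of a simple least-squares line fit equals the squared Pearson correlation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0*x + rng.normal(scale=1.5, size=1000)       # linearly related with noise

rho = np.corrcoef(x, y)[0, 1]                      # Pearson correlation coefficient

slope, intercept = np.polyfit(x, y, 1)             # simple linear fit with intercept
resid = y - (slope*x + intercept)
r2 = 1 - resid.var()/y.var()                       # coefficient of determination

print(f"rho^2 = {rho**2:.4f},  R^2 = {r2:.4f}")    # the two values agree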

In [16]:
ir=0
corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
coefs= np.array([i**2 for i in coefs])

# Plot the importance of features
font = {'size'   : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')

_=prfrmnce_plot(coefs, title=f'Coefficient of Determination', 
            ylabel='Coefficient of Determination',clmns=features_colums,titlefontsize=9, 
            xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
            yshift=0.02,ylim=[0,1], xlim=[-0.5,5.5], y_rot=0)

Probability Distributions

  • Hacker statistics
    • Use simulated repeated measurements to compute probabilities
    • Determine how to simulate the data
    • Simulate many, many times
    • The probability is approximately the fraction of trials with the outcome of interest (see the short simulation sketch after this list)
  • Bernoulli trial
    • An experiment that has two options, "success" (True) and "failure" (False)
  • Probability Mass Function (PMF)
    • The set of probabilities of discrete outcomes
  • Probability Distribution
    • A mathematical description of outcomes
  • Binomial distribution
    • The number r of successes in n Bernoulli trials with probability p of success is Binomially distributed. For example, the number of heads in 4 coin flips with probability 0.5 of heads is Binomially distributed.
  • Probability Density Function (PDF)
    • Continuous variables: quantities that can take any value, not just discrete values
    • Continuous analog to the PMF (probability mass function)
    • Mathematical description of the relative likelihood of observing a value of a continuous variable
    • For the PDF of the normal distribution: the area under the PDF over an interval is the probability of observing a value in that interval
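A minimal hacker-statistics sketch of the recipe above (the example is an assumption: estimating the probability of 4 heads in 4 fair coin flips by simulation):

import numpy as np

rng = np.random.default_rng(42)
n_trials = 10000
n_success = 0
for _ in range(n_trials):
    heads = rng.binomial(n=1, p=0.5, size=4).sum()   # 4 Bernoulli trials (coin flips)
    if heads == 4:
        n_success += 1
print('Estimated P(4 heads in 4 flips):', n_success/n_trials)   # exact answer is 0.5**4 = 0.0625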

The Binomial PMF

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question.

The probability mass function of the binomial distribution is

$\LARGE P(N)=\binom{n}{N}\,p^{N}(1-p)^{\,n-N}$

where n is the number of trials, p is the probability of success, and N is the number of successes.
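The formula can be checked against a library implementation; a small sketch with assumed values n=4 and p=0.5, comparing the expression above with scipy.stats.binom:

from math import comb
from scipy.stats import binom

n, p = 4, 0.5
for N in range(n + 1):
    manual = comb(n, N) * p**N * (1 - p)**(n - N)    # formula above
    print(N, round(manual, 4), round(binom.pmf(N, n, p), 4))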

In [17]:
Samples=np.random.binomial(n=60,p=0.1,size=10000)
font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Binomial PMF, n=60, p=0.1, size=10000'

ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=800,title=title,xlabl=None,
         ylabl='Percentage',xlimt=(0,15),ylimt=(0,0.2),axt=ax1,
         scale=1.15,loc=1,font=10,color='g')

ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Binomial ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
    xlim=(0,15),ylim=(0,1),linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=16,leg_ftze=9)

Poisson processes and the Poisson Distribution

The timing of the next event is completely independent of when the previous event happened. Examples are:

  • Natural births in a given hospital
  • Hits on a website during a given hour
  • Meteor strikes
  • Molecular collisions in a gas
  • Aviation incidents
  • Buses in Poissonville
  • The number r of arrivals of a Poisson process in a given time interval, with an average rate of λ arrivals per interval, is Poisson distributed.
  • The number r of hits on a website in one hour, with an average hit rate of 6 hits per hour, is Poisson distributed.

The Poisson distribution is the limit of the Binomial distribution for a low probability of success and a large number of Bernoulli trials (an experiment that has two options, "success" (True) and "failure" (False)).
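A quick illustration of this limit (the parameters are assumptions): a Binomial with many trials and a small success probability is close to a Poisson with mean $\mu = np$.

from scipy.stats import binom, poisson

n, p = 1000, 0.006                      # many trials, low probability of success
mu = n * p
for k in [0, 2, 4, 6, 8, 10]:
    print(k, round(binom.pmf(k, n, p), 4), round(poisson.pmf(k, mu), 4))   # nearly identical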

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time (or space) if these events occur with a known constant mean rate:

$\LARGE P(n)=\frac{\mu ^{n}e^{-\mu}}{n!}$

where $P(n)$ is the predicted probability of $n$ occurrences and $\mu$ is the mean number of occurrences in a fixed interval of time.

  • Small example

A small company receives, on average, 12 customers per day ($\mu=12$). What is the probability that the company will receive exactly 8 ($n=8$) customers in one day?

$\large P(n=8)=\frac{12 ^{8}e^{-12}}{8!}=0.0655$
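This value can be verified directly from the formula, or with scipy.stats.poisson (a small sketch):

from math import exp, factorial
from scipy.stats import poisson

mu, n = 12, 8
manual = mu**n * exp(-mu) / factorial(n)
print(round(manual, 4), round(poisson.pmf(n, mu), 4))   # both give ~0.0655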

In [18]:
# Draw samples from a Poisson distribution.
Samples=np.random.poisson(lam=60,size=10000)
# lam: Expected number of events occurring in a fixed-time interval,
font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Poisson PMF, Expected Number of Events in Fixed Interval=60, size=10000'

ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=1000,title=title,xlabl=None,
         ylabl='Percentage',xlimt=(0,100),ylimt=None,axt=ax1,
         scale=1.15,loc=1,font=9,color='g')
#
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Poisson ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
    xlim=(0,100),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)

Normal Distribution

In [19]:
# Draw samples from a Normal distribution.
Samples=np.random.normal(loc=3, scale=1.0, size=10000)

font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Normal Distribution_loc=3, scale=1.0, size=10000'

ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=50,title=title,xlabl=None,
         ylabl='Percentage',xlimt=(0,6),ylimt=None,axt=ax1,
         scale=1.15,loc=1,font=10,color='g')


ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Normal ecdf',colors='g',xlabel=None,title='ecdf of '+title,ylabel='Cumulative Probability',bins=10000,
    xlim=(0,6),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=9)

The Exponential Distribution

The waiting time between arrivals of a Poisson process is Exponentially distributed.
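A small simulation sketch of this relationship (the rate of 6 arrivals per interval is an assumption): drawing Exponential waiting times, accumulating them into arrival times, and counting arrivals per unit interval gives counts whose mean and variance are both close to the rate, as expected for a Poisson distribution.

import numpy as np

rng = np.random.default_rng(0)
lam = 6.0                                              # average arrivals per interval
waits = rng.exponential(scale=1/lam, size=100_000)     # Exponential inter-arrival times
arrivals = np.cumsum(waits)                            # arrival times of the process

counts, _ = np.histogram(arrivals, bins=np.arange(0, arrivals[-1]))
print('mean count per interval:', counts.mean())       # ~ lam
print('variance of counts:     ', counts.var())        # ~ lam (a Poisson property)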

In [20]:
# Draw samples from an Exponential distribution.
Samples=np.random.exponential(scale=1.0,size=10000)
font = {'size'   : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Exponential Distribution_scale=1.0, size=10000'

ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=100,title=title,xlabl=None,
         ylabl='Percentage',xlimt=(0,10),ylimt=None,axt=ax1,
         scale=1.15,loc=1,font=10,color='g')


ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Exponential ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
    xlim=(0,10),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=9)

Data Manipulation with Pandas

Pandas is an open-source Python package for data manipulation and analysis, with high-performance and easy-to-use methods. It is based on the dataframe concept found in the R programming language. Pandas can be used as the primary tool for data processing, data manipulation and data cleaning. Pandas is built on top of NumPy, and its plotting functionality relies on Matplotlib.

Look at Data

In [21]:
# Reading data in Pandas
import pandas as pd
df = pd.read_csv("./Data/WPD.csv") # Read data file
#df.head()
df[0:5] # Display data
Out[21]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False NaN 114.3 NaN 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 NaN Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 NaN Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 NaN Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 NaN NaN 32.0 Sand 0.158 5.25 17.53
In [22]:
# shape of data
df.shape
Out[22]:
(1000, 12)

The columns attribute gives the names of the columns in the data file.

In [23]:
column=list(df.columns)
print(column)
['X Coordinate', 'Y Coordinate', 'Measured Depth (m)', 'Deviation (True/False)', 'Surface-Casing Weight (kg/m)', 'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)', 'Bore. Temp. (degC)', 'Prod. Formation', 'Porosity (fraction)', 'Permeability (Darcy)', 'OIL Prod. (e3m3/month)']

The info() method gives information about each column: its data type (float, object/string, bool) and the number of non-null values, which reveals missing values.

In [24]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   X Coordinate                     1000 non-null   float64
 1   Y Coordinate                     1000 non-null   int64  
 2   Measured Depth (m)               1000 non-null   float64
 3   Deviation (True/False)           1000 non-null   bool   
 4   Surface-Casing Weight (kg/m)     593 non-null    float64
 5   Production-Casing Size (mm)      913 non-null    float64
 6   Production-Casing Weight (kg/m)  547 non-null    float64
 7   Bore. Temp. (degC)               801 non-null    float64
 8   Prod. Formation                  1000 non-null   object 
 9   Porosity (fraction)              1000 non-null   float64
 10  Permeability (Darcy)             1000 non-null   float64
 11  OIL Prod. (e3m3/month)           1000 non-null   float64
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 87.0+ KB

The describe() method gives summary statistics for each column whose data type is not string.

In [25]:
df.describe()
Out[25]:
X Coordinate Y Coordinate Measured Depth (m) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
count 1000.000000 1000.000000 1000.000000 593.000000 913.000000 547.000000 801.000000 1000.000000 1000.000000 1000.000000
mean 432.893400 489.749000 1364.379300 42.906745 142.495290 23.260512 40.774032 0.141840 5.269140 60.003280
std 172.206682 243.547912 855.633567 25.877438 25.953205 7.137579 21.256825 0.074501 8.665687 234.814659
min 12.300000 36.000000 276.000000 14.100000 13.700000 10.800000 5.000000 0.000000 0.050000 0.000000
25% 325.425000 339.750000 704.000000 35.700000 114.300000 17.300000 27.000000 0.089750 1.040000 0.000000
50% 487.250000 480.000000 1066.000000 35.700000 139.700000 20.800000 33.000000 0.141000 2.480000 4.715000
75% 581.675000 596.000000 1790.000000 48.100000 177.800000 25.300000 53.000000 0.192000 5.752500 33.117500
max 648.800000 1172.000000 6363.000000 595.000000 219.100000 53.600000 255.000000 0.400000 94.510000 3425.530000
In [26]:
# Components of DataFrame
df.values
Out[26]:
array([[352.6, 372, 2145.8, ..., 0.224, 12.25, 108.75],
       [435.7, 551, 1638.0, ..., 0.118, 4.11, 0.33],
       [37.7, 788, 3984.0, ..., 0.121, 1.99, 21.87],
       ...,
       [593.4, 456, 683.7, ..., 0.332, 49.75, 832.45],
       [540.9, 328, 1346.0, ..., 0.111, 0.59, 0.0],
       [617.8, 377, 844.0, ..., 0.292, 9.95, 153.43]], dtype=object)
In [27]:
# index of DataFrame
df.index
Out[27]:
RangeIndex(start=0, stop=1000, step=1)

Data Cleaning

Missing (NaN) Values

It is ideal to have a valid value for every row and column. In reality, missing values are one of the main challenges faced in machine learning. There are several approaches in practice to deal with missing values.
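Before choosing an approach, it is useful to count the missing values per column; a one-line check (not in the original notebook, using the dataframe loaded above):

df.isna().sum()   # number of NaN values in each column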

Option 1: Drop Column

One naive approach is to remove columns that have missing values with the drop() function.

In [28]:
Select_column=column[3:8] # Select columns that have missing values
Select_column
Out[28]:
['Deviation (True/False)',
 'Surface-Casing Weight (kg/m)',
 'Production-Casing Size (mm)',
 'Production-Casing Weight (kg/m)',
 'Bore. Temp. (degC)']
In [29]:
df_drop_c=df.copy() # Copy to avoid modifying the original data
df_drop_c.drop(Select_column,axis=1, inplace=True)  # Drop selected 5 columns
df_drop_c[:5]
Out[29]:
X Coordinate Y Coordinate Measured Depth (m) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 Sand 0.158 5.25 17.53
In [30]:
df_drop_c.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   X Coordinate            1000 non-null   float64
 1   Y Coordinate            1000 non-null   int64  
 2   Measured Depth (m)      1000 non-null   float64
 3   Prod. Formation         1000 non-null   object 
 4   Porosity (fraction)     1000 non-null   float64
 5   Permeability (Darcy)    1000 non-null   float64
 6   OIL Prod. (e3m3/month)  1000 non-null   float64
dtypes: float64(5), int64(1), object(1)
memory usage: 54.8+ KB

This approach is not reliable, since all the valid (non-missing) information in those 5 columns has been removed from the data set.

Option 2: Drop Row

Another approach is to remove rows that have missing values with the dropna() function.

In [31]:
df_drop_=df.copy() # Copy to avoid modifying the original data
df_drop_r=df_drop_.dropna() # Drop rows with missing values
df_drop_r[:5]
Out[31]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
6 636.9 462 713.0 True 48.1 177.8 25.3 27.0 Shale 0.128 0.70 0.0
7 556.2 441 761.0 False 35.7 139.7 20.8 28.0 Sand 0.055 0.36 0.0
8 572.0 521 587.0 False 25.3 114.3 14.1 15.0 Shale 0.033 0.46 0.0
10 366.2 385 1801.0 False 35.7 139.7 25.3 52.0 Shale-Sand 0.147 3.15 5.8
14 648.7 275 920.0 False 35.7 139.7 20.8 30.0 Shale 0.104 0.38 0.0
In [32]:
df_drop_r.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 426 entries, 6 to 999
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   X Coordinate                     426 non-null    float64
 1   Y Coordinate                     426 non-null    int64  
 2   Measured Depth (m)               426 non-null    float64
 3   Deviation (True/False)           426 non-null    bool   
 4   Surface-Casing Weight (kg/m)     426 non-null    float64
 5   Production-Casing Size (mm)      426 non-null    float64
 6   Production-Casing Weight (kg/m)  426 non-null    float64
 7   Bore. Temp. (degC)               426 non-null    float64
 8   Prod. Formation                  426 non-null    object 
 9   Porosity (fraction)              426 non-null    float64
 10  Permeability (Darcy)             426 non-null    float64
 11  OIL Prod. (e3m3/month)           426 non-null    float64
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 40.4+ KB

This approach is also not recommended, because the number of rows has decreased from 1000 to 426.

Option 3: Replace with Median (mean)

A common practice is to replace missing values with the median (or mean) value of that column. The median is the middle value of a sorted list: 50% of the values are smaller and 50% are larger. The following code replaces all missing values in the 5 selected columns with the median of each column.

In [33]:
df_im=df.copy()

# Get the median of each variable and replace the missing values with median
for i in range(len(Select_column)):
    P50 = df[Select_column[i]].median() # Calculate median of each column
    df_im[Select_column[i]] = df_im[Select_column[i]].fillna(P50)   # replace NaN with median(P50)
    
df_im[:5]
Out[33]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
In [34]:
df_im.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   X Coordinate                     1000 non-null   float64
 1   Y Coordinate                     1000 non-null   int64  
 2   Measured Depth (m)               1000 non-null   float64
 3   Deviation (True/False)           1000 non-null   bool   
 4   Surface-Casing Weight (kg/m)     1000 non-null   float64
 5   Production-Casing Size (mm)      1000 non-null   float64
 6   Production-Casing Weight (kg/m)  1000 non-null   float64
 7   Bore. Temp. (degC)               1000 non-null   float64
 8   Prod. Formation                  1000 non-null   object 
 9   Porosity (fraction)              1000 non-null   float64
 10  Permeability (Darcy)             1000 non-null   float64
 11  OIL Prod. (e3m3/month)           1000 non-null   float64
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 87.0+ KB
  • Dropping duplicate names

    df.drop_duplicates(subset="name")

    df.drop_duplicates(subset=["name1","name2"])

Remove Outliers

Outliers are usually extreme high or low values that differ from the pattern of the majority of the data. Outliers should be removed from the dataset (being careful not to treat novelty as outliers). A straightforward approach to detect outliers is to flag values more than several (n) standard deviations from the mean, $m\pm n\sigma$ ($m$=mean, $\sigma$=standard deviation). For example, an outlier can be bigger than $m+3\sigma$ or smaller than $m-3\sigma$. The following function can be used to apply this approach.

In [35]:
def outlier_remove(df, n,name):
    """Delete rows for a specified column where values are out of +/- n*sd standard deviations
    
    df  : Pandas dataframe
    n   : n in the equation 𝑚±𝑛𝜎
    name: Column name
    """
    mean=df[name].mean() # Calculate mean of column
    sd=df[name].std()  # Calculate standard deviation of column
    drop_r = df.index[(mean -n * sd> df[name]) | (mean+n * sd< df[name])] # Find data that are not within 𝑚±𝑛𝜎
    df.drop(drop_r, axis=0, inplace=True) # Drop data
    df.reset_index(inplace=True, drop=True) # Reset index
In [36]:
# Drop outliers in last column 'OIL Prod. (e3m3/month)'
df_im_out=df_im.copy()
outlier_remove(df_im_out,n=3,name='OIL Prod. (e3m3/month)')
df_im_out[:5]
Out[36]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53

Concatenation

A new dataframe can be built by concatenating rows or columns. Below, a new dataframe is created by concatenating two columns, 'Measured Depth (m)' and 'OIL Prod. (e3m3/month)', using the concat() function.

In [37]:
col_1= df_im['Measured Depth (m)']
col_2 = df_im['OIL Prod. (e3m3/month)']
result = pd.concat([col_1, col_2], axis=1)
result[:5]
Out[37]:
Measured Depth (m) OIL Prod. (e3m3/month)
0 2145.8 108.75
1 1638.0 0.33
2 3984.0 21.87
3 3809.0 46.30
4 2335.0 17.53

Rows can also be concatenated with the concat() function. The code below concatenates the first 4 rows with 3 rows taken from the end of the imputed dataset.

In [38]:
row_1= df_im[0:4]  # Retrieve first 4 rows
row_2 = df_im[-4:-1]  # Retrieve 3 rows from the end (the very last row is excluded)
result = pd.concat([row_1, row_2], axis=0) # Concatenate rows
result[:5]
Out[38]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
996 548.3 475 749.8 False 35.7 114.3 20.8 28.0 Shale 0.017 0.48 0.00

Saving a Dataframe as CSV

You can simply save a dataframe as CSV with the following code, which writes the imputed dataframe to a new file.

In [39]:
filename="myfile.csv"
df_im.to_csv(filename, index=False) # index = False not writing row numbers

Shuffling, Grouping and Sorting

Shuffling Dataset

An example of shuffling is a game of cards: all the cards are collected at the end of each round of play and shuffled to ensure that cards are distributed randomly and each player receives cards based on chance. In Machine Learning, the data set should be split into training and validation sets, so shuffling is very important to avoid any element of bias or pattern in the split datasets.

If the same shuffle is required every time the code is run, a random seed can be used (see the commented np.random.seed line in the code below).

In [40]:
import numpy as np

#np.random.seed(82) 
df_shfl = df_im.reindex(np.random.permutation(df.index)) 
df_shfl.reset_index(inplace=True, drop=True) # Reset index
df_shfl[0:5]
Out[40]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 498.9 59 1010.0 False 53.6 139.7 20.8 26.0 Shale-Sand 0.100 0.66 0.00
1 320.3 468 1436.8 False 35.7 139.7 20.8 41.0 Sand 0.134 3.50 2.02
2 298.7 674 3065.0 True 53.6 139.7 20.8 33.0 Sand 0.338 42.45 1760.45
3 569.8 205 989.1 False 35.7 114.3 20.8 33.0 Shale 0.018 0.91 0.00
4 522.3 213 1050.0 False 35.7 139.7 23.1 35.0 Shale 0.157 9.48 37.06

Sorting Data Set

In [41]:
df_sort = df_im.sort_values(by='Measured Depth (m)', ascending=True)
#df_sort.reset_index(inplace=True, drop=True) # Reset index
df_sort[:5]
Out[41]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
90 132.0 1104 276.0 False 25.3 114.3 14.1 25.0 Shale 0.142 1.16 0.00
537 72.9 784 295.5 False 25.3 114.3 14.1 22.0 Shale 0.000 0.66 0.00
403 591.3 791 297.0 False 25.3 114.3 14.1 26.0 Shale 0.142 13.37 72.26
507 621.7 593 321.0 True 53.6 177.8 34.2 33.0 Shale 0.091 1.72 0.00
105 631.2 583 343.2 False 56.5 139.7 20.8 33.0 Sand 0.168 3.19 19.23

We can sort values by multiple variables:

In [42]:
df_sort = df_im.sort_values(['Measured Depth (m)','Porosity (fraction)'], ascending=[True, False])
df_sort[:5]
Out[42]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
90 132.0 1104 276.0 False 25.3 114.3 14.1 25.0 Shale 0.142 1.16 0.00
537 72.9 784 295.5 False 25.3 114.3 14.1 22.0 Shale 0.000 0.66 0.00
403 591.3 791 297.0 False 25.3 114.3 14.1 26.0 Shale 0.142 13.37 72.26
507 621.7 593 321.0 True 53.6 177.8 34.2 33.0 Shale 0.091 1.72 0.00
105 631.2 583 343.2 False 56.5 139.7 20.8 33.0 Sand 0.168 3.19 19.23

Subsetting based on multiple conditions:

In [43]:
is_lab   = df["Measured Depth (m)"]     == 321.0
is_brown = df["Deviation (True/False)"] == True
df[is_lab & is_brown]
Out[43]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
507 621.7 593 321.0 True 53.6 177.8 34.2 NaN Shale 0.091 1.72 0.0

Subsetting using .isin()

In [44]:
df_sort[df_sort['Prod. Formation'].isin(['Sand'])][:5]
Out[44]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
105 631.2 583 343.2 False 56.5 139.7 20.8 33.0 Sand 0.168 3.19 19.23
918 607.5 592 409.0 True 48.1 177.8 29.7 21.0 Sand 0.050 0.66 0.00
92 589.3 596 436.0 False 48.1 177.8 25.3 18.0 Sand 0.166 11.82 17.42
793 571.8 598 445.0 False 53.6 177.8 34.2 18.0 Sand 0.084 1.01 0.00
560 621.3 595 455.0 True 48.1 177.8 29.8 27.0 Sand 0.262 8.87 121.89

Summary statistics

Grouping is applied to summarize data. The methods below calculate summary statistics for each column or each row; by default they operate on columns (a short sketch after the list shows both):

  • mean()
  • median()
  • mode()
  • min()
  • max()
  • var()
  • std()
  • sum()
  • quantile()
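For example, a small sketch on the imputed dataframe from the cells above (axis=1 computes the statistic across each row instead of down each column):

print(df_im['Porosity (fraction)'].mean())                                      # one column
print(df_im[['Porosity (fraction)', 'Permeability (Darcy)']].median())          # several columns
print(df_im[['Porosity (fraction)', 'Permeability (Darcy)']].mean(axis=1)[:5])  # row-wise mean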

The following code groups by the categorical feature 'Prod. Formation' and computes summary statistics of the target variable 'OIL Prod. (e3m3/month)' for each category.

In [45]:
gb_mean=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].mean()
gb_mean
Out[45]:
Prod. Formation
Sand          64.015467
Shale         59.502943
Shale-Sand    54.482824
Name: OIL Prod. (e3m3/month), dtype: float64
In [46]:
gb_sum=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].sum()
gb_sum
Out[46]:
Prod. Formation
Sand          19204.64
Shale         31536.56
Shale-Sand     9262.08
Name: OIL Prod. (e3m3/month), dtype: float64

.agg() Method

The .agg() method is used for computing custom summary statistics (df['column'].agg(function)).

In [47]:
def pct90(clm):
    return clm.quantile(0.9)

df_im['OIL Prod. (e3m3/month)'].agg(pct90)
Out[47]:
105.87400000000005
In [48]:
df_im[['OIL Prod. (e3m3/month)','Measured Depth (m)']].agg(pct90)
Out[48]:
OIL Prod. (e3m3/month)     105.874
Measured Depth (m)        2396.400
dtype: float64
In [49]:
def pct30(clm):
    return clm.quantile(0.3)

df_im['OIL Prod. (e3m3/month)'].agg([pct30,pct90])
Out[49]:
pct30      0.000
pct90    105.874
Name: OIL Prod. (e3m3/month), dtype: float64

Cumulative sum

In [50]:
df_im['OIL Prod. (e3m3/month)'].iloc[:5]
Out[50]:
0    108.75
1      0.33
2     21.87
3     46.30
4     17.53
Name: OIL Prod. (e3m3/month), dtype: float64
In [51]:
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cumsum()
Out[51]:
0    108.75
1    109.08
2    130.95
3    177.25
4    194.78
Name: OIL Prod. (e3m3/month), dtype: float64

There are other cumulative statistics:

  • cummax()
  • cummin()
  • cumprod()
In [52]:
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cummax()
Out[52]:
0    108.75
1    108.75
2    108.75
3    108.75
4    108.75
Name: OIL Prod. (e3m3/month), dtype: float64
In [53]:
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cummin()
Out[53]:
0    108.75
1      0.33
2      0.33
3      0.33
4      0.33
Name: OIL Prod. (e3m3/month), dtype: float64
In [54]:
df_im['OIL Prod. (e3m3/month)'].value_counts()
Out[54]:
0.00      379
0.01       14
0.05       12
0.02        8
22.01       5
         ... 
37.68       1
17.15       1
553.28      1
27.46       1
153.43      1
Name: OIL Prod. (e3m3/month), Length: 521, dtype: int64

groupby

In [55]:
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].agg([min,max,np.median])
Out[55]:
min max median
Deviation (True/False)
False 276.0 3458.3 1025.0
True 321.0 6363.0 1354.5
In [56]:
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].mean()
Out[56]:
Deviation (True/False)
False    1214.885276
True     1644.465805
Name: Measured Depth (m), dtype: float64

Pivot tables

In [57]:
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].mean()
Out[57]:
Deviation (True/False)
False    1214.885276
True     1644.465805
Name: Measured Depth (m), dtype: float64
In [58]:
df_im.pivot_table(aggfunc='mean', values='Measured Depth (m)', index='Deviation (True/False)')
Out[58]:
Measured Depth (m)
Deviation (True/False)
False 1214.885276
True 1644.465805

The values argument is the column you want to summarize and index is the column you want to group by.

In [59]:
df_im.pivot_table(aggfunc=[np.median,np.std], values='Measured Depth (m)', index='Deviation (True/False)')
Out[59]:
median std
Measured Depth (m) Measured Depth (m)
Deviation (True/False)
False 1025.0 612.634963
True 1354.5 1132.657827

Index

In [60]:
df_im.columns
Out[60]:
Index(['X Coordinate', 'Y Coordinate', 'Measured Depth (m)',
       'Deviation (True/False)', 'Surface-Casing Weight (kg/m)',
       'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)',
       'Bore. Temp. (degC)', 'Prod. Formation', 'Porosity (fraction)',
       'Permeability (Darcy)', 'OIL Prod. (e3m3/month)'],
      dtype='object')
In [61]:
df_im.index
Out[61]:
RangeIndex(start=0, stop=1000, step=1)

We can move a column from the body of the dataframe to the index:

In [62]:
df_im[:5].set_index("X Coordinate")
Out[62]:
Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
X Coordinate
352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53

To undo what we just did, we can use reset_index():

In [63]:
df_im[:5].reset_index(drop=True)
Out[63]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53

Indexing can make subsetting simpler. With the default index, subsetting by a column value uses isin():

In [64]:
df_im[df_im['Prod. Formation'].isin(['Shale', 'Sand'])][:5]
Out[64]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
5 127.5 1125 1560.6 False 35.7 177.8 20.8 63.0 Shale 0.000 0.10 0.00

This is fairly verbose; setting 'Prod. Formation' as the index (the commented line in the next cell) would allow the same subset with a short .loc call. With the default RangeIndex left in place, .loc slices rows by label:

In [65]:
#df_im.set_index("Prod. Formation",inplace=True)
df_im[:5]
Out[65]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 0.170 5.84 46.30
4 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 0.158 5.25 17.53
In [66]:
df.loc[0: 3]
Out[66]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation Porosity (fraction) Permeability (Darcy) OIL Prod. (e3m3/month)
0 352.6 372 2145.8 False NaN 114.3 NaN 53.0 Shale-Sand 0.224 12.25 108.75
1 435.7 551 1638.0 True 25.3 114.3 17.3 NaN Sand 0.118 4.11 0.33
2 37.7 788 3984.0 True 35.7 114.3 22.5 NaN Shale 0.121 1.99 21.87
3 346.7 311 3809.0 True 41.7 114.3 17.3 NaN Sand 0.170 5.84 46.30

Multi-level indexes

In [67]:
df_im.set_index(["Porosity (fraction)","Permeability (Darcy)"],inplace=True)
In [68]:
df_im[:5]
Out[68]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 108.75
0.118 4.11 435.7 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.33
0.121 1.99 37.7 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 21.87
0.170 5.84 346.7 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 46.30
0.158 5.25 254.6 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 17.53
In [69]:
df_im.loc[[(0.224,12.25), (0.017,0.48)]]
Out[69]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 108.75
0.017 0.48 548.3 475 749.8 False 35.7 114.3 20.8 28.0 Shale 0.00

Sorting by index values

In [70]:
df_im.sort_index()[:5]
Out[70]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.0 0.05 596.4 126 368.8 False 35.7 177.8 20.8 33.0 Shale 0.0
0.07 598.2 513 621.0 True 48.1 177.8 29.8 25.0 Sand 0.0
0.07 372.6 382 1914.0 False 35.7 139.7 23.1 46.0 Shale 0.0
0.10 127.5 1125 1560.6 False 35.7 177.8 20.8 63.0 Shale 0.0
0.10 427.0 555 1495.0 True 25.3 114.3 17.3 33.0 Shale-Sand 0.0
In [71]:
df_im.sort_index(level=['Porosity (fraction)','Permeability (Darcy)'], ascending=[True, True])[:5]
Out[71]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.0 0.05 596.4 126 368.8 False 35.7 177.8 20.8 33.0 Shale 0.0
0.07 598.2 513 621.0 True 48.1 177.8 29.8 25.0 Sand 0.0
0.07 372.6 382 1914.0 False 35.7 139.7 23.1 46.0 Shale 0.0
0.10 127.5 1125 1560.6 False 35.7 177.8 20.8 63.0 Shale 0.0
0.10 427.0 555 1495.0 True 25.3 114.3 17.3 33.0 Shale-Sand 0.0

Slicing by .loc and .iloc

With a multi-level index, .loc with a single label matches the outer index level only; to select on inner levels as well, pass (outer, inner) tuples, as in the example above.

We can pass two arguments to loc to also select columns (: means return everything):

In [72]:
df_im.loc[:,"X Coordinate":"Deviation (True/False)"][:5]
Out[72]:
X Coordinate Y Coordinate Measured Depth (m) Deviation (True/False)
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 372 2145.8 False
0.118 4.11 435.7 551 1638.0 True
0.121 1.99 37.7 788 3984.0 True
0.170 5.84 346.7 311 3809.0 True
0.158 5.25 254.6 727 2335.0 True

The iloc method slices rows and columns by integer position:

In [73]:
df_im.iloc[1:5,2:6]
Out[73]:
Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm)
Porosity (fraction) Permeability (Darcy)
0.118 4.11 1638.0 True 25.3 114.3
0.121 1.99 3984.0 True 35.7 114.3
0.170 5.84 3809.0 True 41.7 114.3
0.158 5.25 2335.0 True 53.6 139.7

Unlike loc, the end of the slice is not included with iloc (the stop row and column are excluded).

Feature Engineering

Feature engineering is the process of extracting new features from raw data via data mining techniques. The new features can be used to enhance the performance of machine learning algorithms. New features can also be calculated from other fields. For the WPD.csv data set, we can combine "Surface-Casing Weight (kg/m)" and "Production-Casing Weight (kg/m)" (here, their product) into a new column.

In [74]:
df_im_c=df_im.copy()
df_im.insert(1, 'New Column', (df_im['Surface-Casing Weight (kg/m)']*df_im['Production-Casing Weight (kg/m)']).astype(int))
df_im
Out[74]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 742 372 2145.8 False 35.7 114.3 20.8 53.0 Shale-Sand 108.75
0.118 4.11 435.7 437 551 1638.0 True 25.3 114.3 17.3 33.0 Sand 0.33
0.121 1.99 37.7 803 788 3984.0 True 35.7 114.3 22.5 33.0 Shale 21.87
0.170 5.84 346.7 721 311 3809.0 True 41.7 114.3 17.3 33.0 Sand 46.30
0.158 5.25 254.6 1114 727 2335.0 True 53.6 139.7 20.8 32.0 Sand 17.53
... ... ... ... ... ... ... ... ... ... ... ... ...
0.230 6.26 92.5 742 767 1466.0 False 35.7 139.7 20.8 51.0 Sand 8.74
0.017 0.48 548.3 742 475 749.8 False 35.7 114.3 20.8 28.0 Shale 0.00
0.332 49.75 593.4 742 456 683.7 False 35.7 177.8 20.8 26.0 Shale-Sand 832.45
0.111 0.59 540.9 356 328 1346.0 True 25.3 114.3 14.1 33.0 Shale 0.00
0.292 9.95 617.8 742 377 844.0 False 35.7 139.7 20.8 27.0 Shale-Sand 153.43

1000 rows × 11 columns

Standardize (Normalize) Values

Most Machine Learning algorithms require the input to be standardized so that values of different variables can be compared reliably. For example, if variable one has a range of 3 to 1000 and variable two has a range of 1 to 3, we cannot make a good comparison, since the highest value of variable two (3) equals the lowest value of variable one. This inconsistency in scale leads to low performance in Machine Learning algorithms. So, it is highly recommended to standardize (normalize) your data so that all variables have the same mean ($\mu$) and standard deviation ($\sigma$) before feeding them to Machine Learning algorithms. One very common Machine Learning standardization (normalization) is the Z-Score:

$\Large z = \frac{x - \mu}{\sigma} $

where $\mu = \frac{1}{n} \sum_{i=1}^n x_i$, $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$, and $n$ is the number of data points. This ensures that each variable has $\mu=0$ and $\sigma=1$.

This standardization should be applied to all features except the target (label). For the WPD.csv data set, the target is "OIL Prod. (e3m3/month)", which should not be changed.

The following Python code replaces 'Measured Depth (m)' with its z-score. The same can be done for all variables.

In [75]:
from scipy.stats import zscore

mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("Before standardization: Mean= ",int(mean), ", variance= ",int(variance))

df_im['Measured Depth (m)'] = zscore(df_im['Measured Depth (m)'])

mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("After standardization: Mean= ",int(mean), ", variance= ",int(variance))
Before standardization: Mean=  1364 , variance=  732108
After standardization: Mean=  0 , variance=  1
In [76]:
df_im[0:5]
Out[76]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 742 372 0.913723 False 35.7 114.3 20.8 53.0 Shale-Sand 108.75
0.118 4.11 435.7 437 551 0.319947 True 25.3 114.3 17.3 33.0 Sand 0.33
0.121 1.99 37.7 803 788 3.063147 True 35.7 114.3 22.5 33.0 Shale 21.87
0.170 5.84 346.7 721 311 2.858518 True 41.7 114.3 17.3 33.0 Sand 46.30
0.158 5.25 254.6 1114 727 1.134956 True 53.6 139.7 20.8 32.0 Sand 17.53

Categorical Values

The input data for Machine Learning should be completely numeric, which means any text should be converted to numbers. In the data set WPD.csv, the columns 'Deviation (True/False)' and 'Prod. Formation' have categorical values. Use the value_counts() method to find out which categories exist and how many rows belong to each category.

In [77]:
df_c=df_im.copy()
df_c['Deviation (True/False)'].value_counts()
Out[77]:
False    652
True     348
Name: Deviation (True/False), dtype: int64
In [78]:
df_c['Prod. Formation'].value_counts()
Out[78]:
Shale         530
Sand          300
Shale-Sand    170
Name: Prod. Formation, dtype: int64

We can replace each category with a number starting from 0:

In [79]:
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(False, 0) # Replace False with 0
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(True, 1)  # Replace True with 1
df_c[0:5]
Out[79]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 742 372 0.913723 0 35.7 114.3 20.8 53.0 Shale-Sand 108.75
0.118 4.11 435.7 437 551 0.319947 1 25.3 114.3 17.3 33.0 Sand 0.33
0.121 1.99 37.7 803 788 3.063147 1 35.7 114.3 22.5 33.0 Shale 21.87
0.170 5.84 346.7 721 311 2.858518 1 41.7 114.3 17.3 33.0 Sand 46.30
0.158 5.25 254.6 1114 727 1.134956 1 53.6 139.7 20.8 32.0 Sand 17.53
In [80]:
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale', 0)       # Replace Shale with 0
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Sand', 1)        # Replace Sand with 1
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale-Sand', 2)  # Replace Shale-Sand with 2
df_c[0:5]
Out[80]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month)
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 742 372 0.913723 0 35.7 114.3 20.8 53.0 2 108.75
0.118 4.11 435.7 437 551 0.319947 1 25.3 114.3 17.3 33.0 1 0.33
0.121 1.99 37.7 803 788 3.063147 1 35.7 114.3 22.5 33.0 0 21.87
0.170 5.84 346.7 721 311 2.858518 1 41.7 114.3 17.3 33.0 1 46.30
0.158 5.25 254.6 1114 727 1.134956 1 53.6 139.7 20.8 32.0 1 17.53

One challenge with this representation is that Machine Learning algorithms assume two nearby values are more similar than two distant values. This may be fine in some cases (an ordered category such as “bad”, “average”, “good”), but it is not always true. A common solution is to have one binary attribute per category. This is called one-hot encoding, because only one attribute is equal to 1 (hot) while the others are 0 (cold).

In [81]:
One_hot = pd.get_dummies(['a','b'],prefix='Deviation')
print(One_hot)
   Deviation_a  Deviation_b
0            1            0
1            0            1

These dummies should be merged back into the data frame.

In [82]:
One_hot = pd.get_dummies(df_im['Deviation (True/False)'],prefix='Deviation')
print(One_hot[0:10]) # Just show the first 10
                                          Deviation_False  Deviation_True
Porosity (fraction) Permeability (Darcy)                                 
0.224               12.25                               1               0
0.118               4.11                                0               1
0.121               1.99                                0               1
0.170               5.84                                0               1
0.158               5.25                                0               1
0.000               0.10                                1               0
0.128               0.70                                0               1
0.055               0.36                                1               0
0.033               0.46                                1               0
0.224               9.95                                0               1
In [83]:
df_one_hot = pd.concat([df_im,One_hot],axis=1)
df_one_hot[0:5]
Out[83]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month) Deviation_False Deviation_True
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 742 372 0.913723 False 35.7 114.3 20.8 53.0 Shale-Sand 108.75 1 0
0.118 4.11 435.7 437 551 0.319947 True 25.3 114.3 17.3 33.0 Sand 0.33 0 1
0.121 1.99 37.7 803 788 3.063147 True 35.7 114.3 22.5 33.0 Shale 21.87 0 1
0.170 5.84 346.7 721 311 2.858518 True 41.7 114.3 17.3 33.0 Sand 46.30 0 1
0.158 5.25 254.6 1114 727 1.134956 True 53.6 139.7 20.8 32.0 Sand 17.53 0 1

The same should be done for the 'Prod. Formation' column.

In [84]:
One_hot = pd.get_dummies(df_im['Prod. Formation'],prefix='Formation')
df_one_hot = pd.concat([df_one_hot,One_hot],axis=1)
df_one_hot[0:5]
Out[84]:
X Coordinate New Column Y Coordinate Measured Depth (m) Deviation (True/False) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) Prod. Formation OIL Prod. (e3m3/month) Deviation_False Deviation_True Formation_Sand Formation_Shale Formation_Shale-Sand
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 742 372 0.913723 False 35.7 114.3 20.8 53.0 Shale-Sand 108.75 1 0 0 0 1
0.118 4.11 435.7 437 551 0.319947 True 25.3 114.3 17.3 33.0 Sand 0.33 0 1 1 0 0
0.121 1.99 37.7 803 788 3.063147 True 35.7 114.3 22.5 33.0 Shale 21.87 0 1 0 1 0
0.170 5.84 346.7 721 311 2.858518 True 41.7 114.3 17.3 33.0 Sand 46.30 0 1 1 0 0
0.158 5.25 254.6 1114 727 1.134956 True 53.6 139.7 20.8 32.0 Sand 17.53 0 1 1 0 0

Finally, the original columns 'Deviation (True/False)' and 'Prod. Formation' should be removed.

In [85]:
df_one_hot.drop(['Deviation (True/False)','Prod. Formation'], axis=1, inplace=True)
df_one_hot
Out[85]:
X Coordinate New Column Y Coordinate Measured Depth (m) Surface-Casing Weight (kg/m) Production-Casing Size (mm) Production-Casing Weight (kg/m) Bore. Temp. (degC) OIL Prod. (e3m3/month) Deviation_False Deviation_True Formation_Sand Formation_Shale Formation_Shale-Sand
Porosity (fraction) Permeability (Darcy)
0.224 12.25 352.6 742 372 0.913723 35.7 114.3 20.8 53.0 108.75 1 0 0 0 1
0.118 4.11 435.7 437 551 0.319947 25.3 114.3 17.3 33.0 0.33 0 1 1 0 0
0.121 1.99 37.7 803 788 3.063147 35.7 114.3 22.5 33.0 21.87 0 1 0 1 0
0.170 5.84 346.7 721 311 2.858518 41.7 114.3 17.3 33.0 46.30 0 1 1 0 0
0.158 5.25 254.6 1114 727 1.134956 53.6 139.7 20.8 32.0 17.53 0 1 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
0.230 6.26 92.5 742 767 0.118826 35.7 139.7 20.8 51.0 8.74 1 0 1 0 0
0.017 0.48 548.3 742 475 -0.718633 35.7 114.3 20.8 28.0 0.00 1 0 0 1 0
0.332 49.75 593.4 742 456 -0.795925 35.7 177.8 20.8 26.0 832.45 1 0 0 0 1
0.111 0.59 540.9 356 328 -0.021491 25.3 114.3 14.1 33.0 0.00 0 1 0 1 0
0.292 9.95 617.8 742 377 -0.608484 35.7 139.7 20.8 27.0 153.43 1 0 0 0 1

1000 rows × 14 columns