Summary
This notebook presents approaches for exploratory data analysis and basic statistical analysis. Python functions are written to produce clear data visualizations, including histograms, CDFs, swarm plots, box plots, violin plots, kernel density estimation, cross plots and correlation. Probability distributions including the binomial PMF and the Poisson, Normal and Exponential distributions are discussed. Finally, data manipulation and data analysis with Pandas are presented.
Python functions and data files needed to run this notebook are available via this link.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import pylab as plt
from matplotlib import gridspec      # needed for gridspec.GridSpec below
from scipy.stats import zscore       # used later for standardizing data
from functions import *             # import required functions to run this notebook
import warnings
warnings.filterwarnings('ignore')
The data are the 2008 US swing state election results, downloaded from https://www.kaggle.com/datasets/aman1py/swing-states?resource=download
df_swing=pd.read_csv('./Data/2008_swing_states.csv')
df_swing[['state','county','dem_share']][:5]
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 7), dpi= 100, facecolor='w', edgecolor='k')
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(gs[0, :2], )
KPI='dem_share'
val=df_swing[df_swing['state']=='OH'][KPI]
EDA_plot.histplt (val,bins=20,title='OH State',xlabl=KPI,
ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax1,
scale=1.15,loc=1,font=10,color='#8DC63F')
ax2=plt.subplot(gs[0, 2:])
KPI='dem_share'
val=df_swing[df_swing['state']=='PA'][KPI]
EDA_plot.histplt (val,bins=20,title='PA State',xlabl=KPI,
ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax2,
scale=1.15,loc=1,font=10,color='#025174')
ax3=plt.subplot(gs[1, 1:3])
val=df_swing[df_swing['state']=='FL'][KPI]
EDA_plot.histplt (val,bins=20,title='FL State',xlabl=KPI,
ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax3
,scale=1.15,loc=1,font=10,color='#323539')
plt.subplots_adjust(wspace=0.9)
plt.subplots_adjust(hspace=0.45)
fig.suptitle('2008 US Swing State Election_Histogram of dem_share for each State', fontsize=16,y=1.02)
plt.show()
font = {'size' : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 100, facecolor='w', edgecolor='k')
sns.set_theme(style="whitegrid")
ax = sns.swarmplot(x="state", y="dem_share", data=df_swing)
df_means = df_swing.groupby("state")["dem_share"].agg("mean").reset_index()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
sns.scatterplot(x="state", y="dem_share", marker='X', color='black',
s=100, zorder=3, ax=ax, legend=True, data=df_means)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.show()
The center line of a box plot is the 50th percentile (median); the box extends from the 25th percentile to the 75th percentile, a range called the IQR (interquartile range). The whiskers extend to the rest of the data, up to 1.5 IQR beyond the box.
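As a quick check of these numbers, the quartiles and IQR behind the box plot can be computed directly with NumPy (a minimal sketch using the df_swing data loaded above; the OH filter is illustrative):
# Quartiles and IQR of dem_share for one state (OH, as in the plots above)
val = df_swing[df_swing['state'] == 'OH']['dem_share']
q1, q2, q3 = np.percentile(val, [25, 50, 75])
iqr = q3 - q1
print(f'25th = {q1:.1f}, median = {q2:.1f}, 75th = {q3:.1f}, IQR = {iqr:.1f}')
print(f'whisker limits: {q1 - 1.5*iqr:.1f} to {q3 + 1.5*iqr:.1f}')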
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 100, facecolor='w', edgecolor='k')
sns.set_theme(style="whitegrid")
ax = sns.swarmplot(x="state", y="dem_share", data=df_swing)
ax = sns.boxplot(x="state", y="dem_share", data=df_swing)
df_means = df_swing.groupby("state")["dem_share"].agg("mean").reset_index()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
sns.scatterplot(x="state", y="dem_share", marker='X', color='black',
s=100, zorder=3, ax=ax, legend=True, data=df_means)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.show()
#plt.style.available
plt.style.use('classic')
KPI='dem_share'
val=[]
val.append(np.array(df_swing[df_swing['state']=='PA'][KPI]))
val.append(np.array(df_swing[df_swing['state']=='OH'][KPI]))
val.append(np.array(df_swing[df_swing['state']=='FL'][KPI]))
#
Position=[100,500,1000]
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 5), dpi= 130, facecolor='w', edgecolor='k')
v1 = plt.violinplot(val, points=100, positions=Position, widths=300,
showmeans=True,showextrema=True, showmedians=False,bw_method =0.15)
for b in v1['bodies']:
    m = np.mean(b.get_paths()[0].vertices[:, 0])
    b.get_paths()[0].vertices[:, 0] = np.clip(b.get_paths()[0].vertices[:, 0], -np.inf, m)
    b.set_color('darkred')
    b.set_edgecolor('k')
    b.set_linewidth(1)
    b.set_alpha(0.9)
#plt.axis('off')
plt.ylabel('dem_share',fontsize=12)
plt.xticks([100,500,1000],['PA','OH','FL'],rotation=0)
plt.title("Violin Plot",fontweight="bold",fontsize=16)
plt.show()
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 130, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1)
label=['OH State','PA State','FL State']
data_var=len(label)*[None]
KPI='dem_share'
data_var[0]=np.array(df_swing[df_swing['state']=='OH'][KPI])
data_var[1]=np.array(df_swing[df_swing['state']=='PA'][KPI])
data_var[2]=np.array(df_swing[df_swing['state']=='FL'][KPI])
nvar=3
xs = np.linspace(0,100,500)
colors = ['r','b','g']
title='2008 US Swing State Election_Kernel Density Estimation for dem_share for each State'
EDA_plot.KDE(xs,data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Probability',xlim=(0,100),ylim=(0,0.01),
LAMBA=0.5,linewidth=2.5,loc=0,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 130, facecolor='w', edgecolor='k')
label=['OH State','PA State','FL State']
data_var=len(label)*[None]
KPI='dem_share'
data_var[0]=np.array(df_swing[df_swing['state']=='OH'][KPI])
data_var[1]=np.array(df_swing[df_swing['state']=='PA'][KPI])
data_var[2]=np.array(df_swing[df_swing['state']=='FL'][KPI])
ax1=plt.subplot(2,1,1)
title='2008 US Swing State Election_Empirical CDF for dem_share for each State'
EDA_plot.CDF_plot(data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Cumulative Probability',bins=100,
xlim=(0,100),ylim=(0,1),linewidth=2.5,loc=0,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
plt.show()
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
label=['FL State']
KPI='dem_share'
data_var=np.array(df_swing[df_swing['state']=='FL'][KPI])
ax1=plt.subplot(2,1,1)
title='2008 US Swing State Election_Empirical CDF for dem_share for FL State'
EDA_plot.CDF_plot(data_var,1,label,colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=100,
xlim=(0,90),ylim=(0,1),linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
# 25th percentile
per=0.25
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '25th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
fontsize=10)
# 50th percentile
per=0.5
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '50th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
fontsize=10)
# 75th percentile
per=0.75
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '75th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
fontsize=10)
plt.show()
font = {'size' : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(7, 5), dpi= 100, facecolor='w', edgecolor='k')
x=df_swing['total_votes']
y=df_swing['dem_share']
EDA_plot.CrossPlot(x,y,title='cross plot: total votes vs. dem_vote',
xlabl='total votes',ylabl='percent of dem_vote',loc=1,xlimt=(np.nanmin(x),np.nanmax(x)),
ylimt=(np.nanmin(y),np.nanmax(y)),axt=ax1,scale=1.2,alpha=0.9,markersize=6,marker='bo',font=font)
Synthetic data with six features (Var1 to Var6), drawn from Gaussian, triangular and uniform distributions, are generated to have linear and non-linear correlations with a target, for 100,000 samples.
# Generate synthetic data
df=quadrac_swiss_roll(noise=2,nsim=100000,time_rand1=1,seed=42,a1=20,b1=8,a2=50,b2=8,swiss_roll=True)
features_colums=df.columns
# Standardize data
for i in df.columns:
    df[i] = zscore(df[i])
The figures below show the histogram of each variable and the cross-plot matrix between the features Var1 to Var6 and the target. Var1 and Var2 are non-linearly correlated with the target, Var3 has a strong positive correlation (0.68), and Var4 and Var5 are negatively correlated with the target.
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 10), dpi= 100, facecolor='w', edgecolor='k')
colors_map = plt.cm.get_cmap('jet')
colors = colors_map(np.linspace(0,0.8,len(features_colums)))
for ir in range(len(features_colums)):
    ax1 = plt.subplot(4, 2, ir+1)
    val = df[features_colums[ir]]
    EDA_plot.histplt(val, bins=20, title=f'{features_colums[ir]}', xlabl=None, days=False,
                     ylabl=None, xlimt=None, ylimt=(0,0.3),
                     axt=ax1, nsplit=5, scale=1.02, loc=2, font=10.5, color=colors[len(features_colums)-ir-1])
plt.subplots_adjust(hspace=0.4)
plt.subplots_adjust(wspace=0.3)
fig.suptitle(f'Histogram of Variables for Synthetic Data', fontsize=16,y=0.96)
plt.show()
font = {'size' :9 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 9), dpi= 100, facecolor='w', edgecolor='k')
# Standardize data
for i in features_colums:
    df[i] = zscore(df[i])
cplotmatrix(df,font=font,alpha=0.008,marker='g.',missin_rep=False)
plt.show()
Correlation Coefficient
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
features_colums=list(corr.index)
Correlation_plot.corr_bar(coefs,clmns=features_colums,select=False,yfontsize=6.0,title=f'Linear Correlation with Target',
ymax_vert_lin=30,xlim = [-0.5, 0.7])
Coefficient of Determination
The coefficient of determination, termed $R^{2}$, is a basic approach to quantify the importance of features for a target. In statistics, $R^{2}$ assesses the ability of a statistical model to explain and predict future outcomes. In other words, if we have a dependent variable y and an independent variable x in a model, then $R^{2}$ measures how much of the variation in y is explained by the variation in x. It is one of the key outputs of regression analysis and is used when we want to predict the future or test a model against related information. $R^{2}$ lies between 0 and 1, and the higher the value of $R^{2}$, the better the prediction and the stronger the model. $R^{2}$ is simply the square of the correlation coefficient ($\rho$).
There are multiple formulas to calculate the coefficient of determination. It can be calculated simply from the correlation coefficient:
$\large R^{2}=\rho^{2}$
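As a quick numerical check (a minimal sketch; the column names 'Var3' and 'Target' are assumed from the synthetic data above), the squared Pearson correlation gives the coefficient of determination:
# rho from np.corrcoef, R^2 as its square ('Var3' and 'Target' assumed from the synthetic data)
rho = np.corrcoef(df['Var3'], df['Target'])[0, 1]  # Pearson correlation coefficient
print(f'rho = {rho:.3f}, R^2 = {rho**2:.3f}')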
ir=0
corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
coefs= np.array([i**2 for i in coefs])
# Plot the importance of features
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(coefs, title=f'Coefficient of Determination',
ylabel='Coefficient of Determination',clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
yshift=0.02,ylim=[0,1], xlim=[-0.5,5.5], y_rot=0)
Binomial distribution
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question.
The probability mass function for the binomial distribution is
$\LARGE P(N)=\binom{n}{N}p^{N}(1-p)^{n-N}$
where n is the number of trials, p is the probability of success, and N is the number of successes.
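The theoretical PMF can be evaluated directly with scipy.stats (a small sketch; the parameters match the sampling cell below):
from scipy.stats import binom
# Binomial PMF: probability of N successes in n=60 trials with success probability p=0.1
n, p = 60, 0.1
for N in range(11):
    print(f'P(N={N:2d}) = {binom.pmf(N, n, p):.4f}')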
Samples=np.random.binomial(n=60,p=0.1,size=10000)
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Binomial PMF, n=60, p=0.1, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=800,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,15),ylimt=(0,0.2),axt=ax1,
scale=1.15,loc=1,font=10,color='g')
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Binomial ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,15),ylim=(0,1),linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=16,leg_ftze=9)
In a Poisson process, the timing of the next event is completely independent of when the previous event happened.
The Poisson distribution is the limit of the binomial distribution for a low probability of success and a large number of Bernoulli trials (a Bernoulli trial is an experiment with two outcomes: "success" (True) and "failure" (False)).
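This limiting behaviour can be checked numerically (a minimal sketch, keeping the mean $\mu=np$ fixed while n grows and p shrinks):
from scipy.stats import binom, poisson
mu = 6                                  # fixed mean n*p
for n in [20, 200, 2000]:
    p = mu / n                          # probability of success shrinks as n grows
    diff = max(abs(binom.pmf(k, n, p) - poisson.pmf(k, mu)) for k in range(30))
    print(f'n={n:5d}, p={p:.4f}, max |Binomial - Poisson| = {diff:.5f}')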
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time (or space) if these events occur with a known constant mean rate:
$\LARGE P(n)=\frac{\mu ^{n}e^{-\mu}}{n!}$
where $P(n)$ is the predicted probability of $n$ occurrences and $\mu$ is the mean number of occurrences in a fixed interval of time.
A small company receives, on average, 12 customers per day ($\mu=12$). What is the probability that the company will receive exactly 8 ($n=8$) customers in one day?
$\large P(n=8)=\frac{12 ^{8}e^{-12}}{8!}=0.0655$
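The same probability can be obtained with scipy.stats.poisson, as a quick check of the hand calculation:
from scipy.stats import poisson
# P(n=8) for a Poisson distribution with mean mu=12
print(poisson.pmf(8, mu=12))  # ~0.0655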
# Draw samples from a Poisson distribution.
Samples=np.random.poisson(lam=60,size=10000)
# lam: Expected number of events occurring in a fixed-time interval,
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Poisson PMF, Expected Number of Events in Fixed Interval=60, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=1000,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,100),ylimt=None,axt=ax1,
scale=1.15,loc=1,font=9,color='g')
#
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Poisson ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,100),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
# Draw samples from a Normal distribution.
Samples=np.random.normal(loc=3, scale=1.0, size=10000)
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Normal Distribution_loc=3, scale=1.0, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=50,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,6),ylimt=None,axt=ax1,
scale=1.15,loc=1,font=10,color='g')
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Normal ecdf',colors='g',xlabel=None,title='ecdf of '+title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,6),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=9)
The waiting time between arrivals of a Poisson process is Exponentially distributed.
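This can be illustrated with a short simulation (a minimal sketch): simulate a Poisson process with rate 1 per unit time and inspect the gaps between consecutive events, which should be approximately Exponential with mean 1.
# Simulate a Poisson process with rate 1 on [0, T]: draw the number of events from a
# Poisson distribution, place them uniformly on [0, T], and look at the waiting times.
rng = np.random.default_rng(42)
T = 10000
n_events = rng.poisson(lam=T)                        # expected number of events = rate * T
arrival_times = np.sort(rng.uniform(0, T, n_events))
waiting_times = np.diff(arrival_times)               # gaps between consecutive arrivals
print('mean waiting time =', waiting_times.mean())   # close to 1 for an Exponential(1)
print('std of waiting time =', waiting_times.std())  # also close to 1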
# Draw samples from an Exponential distribution.
Samples=np.random.exponential(scale=1.0,size=10000)
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Exponential Distribution_scale=1.0, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=100,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,10),ylimt=None,axt=ax1,
scale=1.15,loc=1,font=10,color='g')
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Exponential ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,10),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=9)
Pandas is an open-source Python package for data manipulation and analysis with high-performance, easy-to-use methods. It is based on the dataframe concept from the R programming language. Pandas can be used as the primary tool for data processing, data manipulation and data cleaning, and it is built on top of NumPy and Matplotlib.
# Reading data in Pandas
import pandas as pd
df = pd.read_csv("./Data/WPD.csv") # Read data file
#df.head()
df[0:5] # Display data
# shape of data
df.shape
The columns attribute gives the names of the columns in the data file.
column=list(df.columns)
print(column)
The info() method gives information about each column: its data type (string or float) and the count of non-null values, from which the number of missing values can be seen.
df.info()
The describe() method gives summary statistics for each column whose data type is not string.
df.describe()
# Components of DataFrame
df.values
# index of DataFrame
df.index
Ideally, every row and column would have a valid value. In reality, missing values are one of the main challenges in machine learning. There are several approaches in practice to deal with missing values.
Option 1: Drop Column
One naive approach is to remove columns that have missing values with the drop() function.
Select_column=column[3:8] # Select the 5 columns that have missing values
Select_column
df_drop_c=df.copy() # Copy to avoid modifying the original data
df_drop_c.drop(Select_column,axis=1, inplace=True) # Drop selected 5 columns
df_drop_c[:5]
df_drop_c.info()
This approach is not reliable, since all the information in those 5 columns, not just the missing values, has been removed from the data set.
Option 2: Drop Row
Another approach is to remove rows that have missing values with the dropna() function.
df_drop_=df.copy() # Copy to avoid modifying the original data
df_drop_r=df_drop_.dropna() # Drop rows with missing values
df_drop_r[:5]
df_drop_r.info()
This approach is also not recommended because the number of rows has decreased from 1000 to 426.
Option 3: Replace with Median (mean)
A common practice is to replace missing values with the median (or mean) value of that column. The median is the middle value of a sorted list: 50% of the values are smaller and 50% are larger. See Median. The following code replaces all missing values in the 5 columns with the median of each column.
df_im=df.copy()
# Get the median of each variable and replace the missing values with median
for i in range(len(Select_column)):
    P50 = df[Select_column[i]].median()                            # Calculate median of each column
    df_im[Select_column[i]] = df_im[Select_column[i]].fillna(P50)  # Replace NaN with median (P50)
df_im[:5]
df_im.info()
Dropping duplicate names
df.drop_duplicates(subset="name")
df.drop_duplicates(subset=["name1","name2"])
Outliers are extremely high or low values that deviate from the pattern of the majority of the data. Outliers should be removed from the dataset (being careful not to treat genuine novelty as an outlier). A straightforward approach to detect outliers is to flag values more than several (n) standard deviations from the mean, $m\pm n\sigma$ ($m$ = mean, $\sigma$ = standard deviation). For example, an outlier can be defined as a value greater than $m+3\sigma$ or less than $m-3\sigma$. The following function applies this approach.
def outlier_remove(df, n, name):
    """Delete rows for a specified column where values are out of +/- n*sd standard deviations
    df  : Pandas dataframe
    n   : n in the equation 𝑚±𝑛𝜎
    name: Column name
    """
    mean = df[name].mean()  # Calculate mean of column
    sd = df[name].std()     # Calculate standard deviation of column
    drop_r = df.index[(mean - n * sd > df[name]) | (mean + n * sd < df[name])]  # Find data that are not within 𝑚±𝑛𝜎
    df.drop(drop_r, axis=0, inplace=True)    # Drop those rows
    df.reset_index(inplace=True, drop=True)  # Reset index
# Drop outliers in last column 'OIL Prod. (e3m3/month)'
df_im_out=df_im.copy()
outlier_remove(df_im_out,n=3,name='OIL Prod. (e3m3/month)')
df_im_out[:5]
A new data frame can be built by concatenating rows or columns. Here, a new dataframe is created by concatenating the two columns 'Measured Depth (m)' and 'OIL Prod. (e3m3/month)' using the concat() function.
col_1= df_im['Measured Depth (m)']
col_2 = df_im['OIL Prod. (e3m3/month)']
result = pd.concat([col_1, col_2], axis=1)
result[:5]
Rows can also be concatenated with the concat() function. The code below concatenates the first 4 rows and 3 rows from the end of the imputed dataset.
row_1= df_im[0:4] # Retrieve first 4 rows
row_2 = df_im[-4:-1] # Retrieve 3 rows from the end (the very last row is excluded)
result = pd.concat([row_1, row_2], axis=0) # Concatenate rows
result[:5]
A dataframe can simply be saved as a CSV file with the following code.
filename="myfile.csv"
df_im.to_csv(filename, index=False) # index = False not writing row numbers
Shuffling Dataset
An example of shuffling is a game of cards: all the cards are collected at the end of each round and shuffled so that they are distributed randomly and each player receives cards by chance. In machine learning, the data set is split into training and validation sets, so shuffling is very important to avoid any bias or pattern in the split datasets.
If the same shuffle is required every time the code is run, a random seed can be used. See the code below.
import numpy as np
#np.random.seed(82)
df_shfl = df_im.reindex(np.random.permutation(df_im.index))
df_shfl.reset_index(inplace=True, drop=True) # Reset index
df_shfl[0:5]
Sorting Data Set
df_sort = df_im.sort_values(by='Measured Depth (m)', ascending=True)
#df_sort.reset_index(inplace=True, drop=True) # Reset index
df_sort[:5]
We can sort values by multiple variables:
df_sort = df_im.sort_values(['Measured Depth (m)','Porosity (fraction)'], ascending=[True, False])
df_sort[:5]
Subsetting based on multiple conditions:
is_lab = df["Measured Depth (m)"] == 321.0
is_brown = df["Deviation (True/False)"] == True
df[is_lab & is_brown]
Subsetting using .isin()
df_sort[df_sort['Prod. Formation'].isin(['Sand'])][:5]
Grouping is applied to summarize data. The methods below calculate summary statistics for each column or each row; by default they operate on columns:
mean()
median()
mode()
min()
max()
var()
std()
sum()
quantile()
The following code performs grouping, generating summary statistics of the target variable 'OIL Prod. (e3m3/month)' for each category of the feature 'Prod. Formation'.
gb_mean=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].mean()
gb_mean
gb_sum=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].sum()
gb_sum
.agg() Method
The .agg() method is used for computing custom summary statistics: df['column'].agg(function).
def pct90(clm):
    return clm.quantile(0.9)
df_im['OIL Prod. (e3m3/month)'].agg(pct90)
df_im[['OIL Prod. (e3m3/month)','Measured Depth (m)']].agg(pct90)
def pct30(clm):
    return clm.quantile(0.3)
df_im['OIL Prod. (e3m3/month)'].agg([pct30,pct90])
df_im['OIL Prod. (e3m3/month)'].iloc[:5]
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cumsum()
There are other cumulative statistics:
cummax()
cummin()
cumprod()
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cummax()
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cummin()
df_im['OIL Prod. (e3m3/month)'].value_counts()
groupby
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].agg([min,max,np.median])
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].mean()
df_im.pivot_table(aggfunc='mean', values='Measured Depth (m)', index='Deviation (True/False)')
The values argument is the column you want to summarize and index is the column you want to group by.
df_im.pivot_table(aggfunc=[np.median,np.std], values='Measured Depth (m)', index='Deviation (True/False)')
df_im.columns
df_im.index
We can move a column from the body of the dataframe to the index:
df_im[:5].set_index("X Coordinate")
To undo what we just did, we can use reset_index():
df_im[:5].reset_index(drop=True)
Indexing makes subsetting simpler:
df_im[df_im['Prod. Formation'].isin(['Shale', 'Sand'])][:5]
This is a long expression; setting an index allows simpler label-based subsetting with .loc, as shown below:
#df_im.set_index("Prod. Formation",inplace=True)
df_im[:5]
df.loc[0: 3]
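For example (a minimal sketch on a copy, assuming 'Prod. Formation' is set as the index as in the commented line above), subsetting by formation labels becomes a simple .loc lookup:
# Set 'Prod. Formation' as the index on a copy, then subset by label with .loc
df_idx = df_im.set_index('Prod. Formation')
df_idx.loc[['Shale', 'Sand']][:5]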
Multi-level indexes
df_im.set_index(["Porosity (fraction)","Permeability (Darcy)"],inplace=True)
df_im[:5]
df_im.loc[[(0.224,12.25), (0.017,0.48)]]
Sorting by index values
df_im.sort_index()[:5]
df_im.sort_index(level=['Porosity (fraction)','Permeability (Darcy)'], ascending=[True, True])[:5]
.loc and .iloc
With a multi-level index, loc cannot select on the inner level alone; it only takes outer-level labels (or full index tuples). We can also pass a second argument to loc to select columns (: means return everything).
df_im.loc[:,"X Coordinate":"Deviation (True/False)"][:5]
The iloc method slices rows and columns by integer position:
df_im.iloc[1:5,2:6]
Unlike loc, the last row and column are not included with iloc.
Feature engineering is the process of extracting new features from raw data via data mining techniques. The new features can be used to enhance the performance of machine learning algorithms, and they can also be calculated from other fields. For the data set WPD.csv, we can combine 'Surface-Casing Weight (kg/m)' and 'Production-Casing Weight (kg/m)' into a new column.
df_im_c=df_im.copy()
df_im.insert(1, 'New Column', (df_im['Surface-Casing Weight (kg/m)']*df_im['Production-Casing Weight (kg/m)']).astype(int))
df_im
Most machine learning algorithms require the input to be standardized so that values of different variables can be compared reliably. For example, if variable one ranges from 3 to 1000 and variable two ranges from 1 to 3, the two cannot be compared directly, since the highest value of variable two (3) is the lowest value of variable one. This inconsistency in scale leads to poor performance in machine learning algorithms, so it is highly recommended to standardize (normalize) the data to the same mean ($\mu$) and standard deviation ($\sigma$) before feeding it to a machine learning algorithm. A very common standardization is the Z-score:
$\Large z = \frac{x - \mu}{\sigma} $
where $\mu = \frac{1}{n} \sum_{i=1}^n x_i$, $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$, and $n$ is the number of data points. This ensures that each variable has $\mu=0$ and $\sigma=1$.
This standardization should be applied to all features except the target (label). For the WPD.csv data set, the target is "OIL Prod. (e3m3/month)" and should not be changed.
The following Python code replaces 'Measured Depth (m)' with its z-score. The same can be done for all variables.
from scipy.stats import zscore
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("Before standardization: Mean= ",int(mean), ", variance= ",int(variance))
df_im['Measured Depth (m)'] = zscore(df_im['Measured Depth (m)'])
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("After standardization: Mean= ",int(mean), ", variance= ",int(variance))
df_im[0:5]
The input data for machine learning should be completely numeric, which means any text must be converted to numbers. In the data set WPD.csv, the columns 'Deviation (True/False)' and 'Prod. Formation' have categorical values. Use the value_counts() method to find out which categories exist and how many rows belong to each category.
df_c=df_im.copy()
df_c['Deviation (True/False)'].value_counts()
df_c['Prod. Formation'].value_counts()
We can replace each category with a number, starting from 0:
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(False, 0) # Replace False with 0
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(True, 1) # Replace True with 1
df_c[0:5]
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale', 0) # Replace Shale with 0
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Sand', 1) # Replace Sand with 1
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale-Sand', 2) # Replace Shale-Sand with 2
df_c[0:5]
One challenge with this representation is that machine learning algorithms consider two nearby values to be more similar than two distant values. This may be fine in some cases (ordered categories such as "bad", "average", "good"), but it is not always true. A common solution is to create one binary attribute per category. This is called one-hot encoding, because only one attribute is equal to 1 (hot) while the others are 0 (cold).
One_hot = pd.get_dummies(['a','b'],prefix='Deviation')
print(One_hot)
These dummies should be merged back into the data frame.
One_hot = pd.get_dummies(df_im['Deviation (True/False)'],prefix='Deviation')
print(One_hot[0:10]) # Just show the first 10
df_one_hot = pd.concat([df_im,One_hot],axis=1)
df_one_hot[0:5]
The same should be done for the 'Prod. Formation' column.
One_hot = pd.get_dummies(df_im['Prod. Formation'],prefix='Formation')
df_one_hot = pd.concat([df_one_hot,One_hot],axis=1)
df_one_hot[0:5]
Finally, the original columns 'Deviation (True/False)' and 'Prod. Formation' should be removed.
df_one_hot.drop(['Deviation (True/False)','Prod. Formation'], axis=1, inplace=True)
df_one_hot