Summary
This notebook presents different approaches to exploratory data analysis and basic statistical analysis. Python functions are written to produce clear data visualizations, including histograms, CDFs, swarm plots, box plots, violin plots, kernel density estimation, cross plots, and correlation. Probability distributions including the binomial PMF and the Poisson, Normal, and Exponential distributions are discussed. Finally, data manipulation and data analysis with Pandas are presented.
Python functions and data files required to run this notebook are available on my GitHub page.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import gridspec  # needed for the GridSpec layouts used below
import matplotlib.pyplot as plt
from functions import *         # import required functions to run this notebook
import warnings
warnings.filterwarnings('ignore')
Data Visualization¶
The data is 2008 US swing state election results downloaded from https://www.kaggle.com/datasets/aman1py/swing-states?resource=download
df_swing=pd.read_csv('./Data/2008_swing_states.csv')
df_swing[['state','county','dem_share']][:5]
state | county | dem_share | |
---|---|---|---|
0 | PA | Erie County | 60.08 |
1 | PA | Bradford County | 40.64 |
2 | PA | Tioga County | 36.07 |
3 | PA | McKean County | 41.21 |
4 | PA | Potter County | 31.04 |
Histogram¶
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 7), dpi= 100, facecolor='w', edgecolor='k')
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(gs[0, :2], )
KPI='dem_share'
val=df_swing[df_swing['state']=='OH'][KPI]
EDA_plot.histplt (val,bins=20,title='OH State',xlabl=KPI,
ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax1,
scale=1.15,loc=1,font=10,color='#8DC63F')
ax2=plt.subplot(gs[0, 2:])
KPI='dem_share'
val=df_swing[df_swing['state']=='PA'][KPI]
EDA_plot.histplt (val,bins=20,title='PA State',xlabl=KPI,
ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax2,
scale=1.15,loc=1,font=10,color='#025174')
ax3=plt.subplot(gs[1, 1:3])
val=df_swing[df_swing['state']=='FL'][KPI]
EDA_plot.histplt (val,bins=20,title='FL State',xlabl=KPI,
ylabl='Percentage',xlimt=(0,100),ylimt=(0,0.2),axt=ax3
,scale=1.15,loc=1,font=10,color='#323539')
plt.subplots_adjust(wspace=0.9)
plt.subplots_adjust(hspace=0.45)
fig.suptitle('2008 US Swing State Election_Histogram of dem_share for each State', fontsize=16,y=1.02)
plt.show()
Bee Swarm plot¶
font = {'size' : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 100, facecolor='w', edgecolor='k')
sns.set_theme(style="whitegrid")
ax = sns.swarmplot(x="state", y="dem_share", data=df_swing)
df_means = df_swing.groupby("state")["dem_share"].agg("mean").reset_index()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
sns.scatterplot(x="state", y="dem_share", marker='X', color='black',
s=100, zorder=3, ax=ax, legend=True, data=df_means)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.show()
Box Plot¶
The center line of a box plot is the 50th percentile (median). The top of the box is the 75th percentile and the bottom is the 25th percentile; the distance between them is called the IQR (interquartile range). The whiskers extend to the rest of the data up to 1.5 IQR beyond the box, and points beyond that are drawn as outliers.
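Before plotting, the box-plot statistics can be checked numerically; the short sketch below (which assumes the df_swing dataframe loaded above) computes the quartiles and IQR of dem_share for Ohio.
# Quartiles and IQR of dem_share for Ohio (uses df_swing loaded above)
val_oh = df_swing[df_swing['state']=='OH']['dem_share']
q25, q50, q75 = np.percentile(val_oh, [25, 50, 75])
iqr = q75 - q25
print(f'25th={q25:.1f}, median={q50:.1f}, 75th={q75:.1f}, IQR={iqr:.1f}')
print(f'whiskers reach at most to {q25-1.5*iqr:.1f} and {q75+1.5*iqr:.1f}')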
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 100, facecolor='w', edgecolor='k')
sns.set_theme(style="whitegrid")
ax = sns.swarmplot(x="state", y="dem_share", data=df_swing)
ax = sns.boxplot(x="state", y="dem_share", data=df_swing)
df_means = df_swing.groupby("state")["dem_share"].agg("mean").reset_index()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
sns.scatterplot(x="state", y="dem_share", marker='X', color='black',
s=100, zorder=3, ax=ax, legend=True, data=df_means)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.show()
Violin Plot¶
#plt.style.available
plt.style.use('classic')
KPI='dem_share'
val=[]
val.append(np.array(df_swing[df_swing['state']=='PA'][KPI]))
val.append(np.array(df_swing[df_swing['state']=='OH'][KPI]))
val.append(np.array(df_swing[df_swing['state']=='FL'][KPI]))
#
Position=[100,500,1000]
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 5), dpi= 130, facecolor='w', edgecolor='k')
v1 = plt.violinplot(val, points=100, positions=Position, widths=300,
showmeans=True,showextrema=True, showmedians=False,bw_method =0.15)
for b in v1['bodies']:
    # keep only the left half of each violin body and set its style
    m = np.mean(b.get_paths()[0].vertices[:, 0])
    b.get_paths()[0].vertices[:, 0] = np.clip(b.get_paths()[0].vertices[:, 0], -np.inf, m)
    b.set_color('darkred')
    b.set_edgecolor('k')
    b.set_linewidth(1)
    b.set_alpha(0.9)
#plt.axis('off')
plt.ylabel('dem_share',fontsize=12)
plt.xticks([100,500,1000],['PA','OH','FL'],rotation=0)
plt.title("Violin Plot",fontweight="bold",fontsize=16)
plt.show()
Kernel Density Estimation¶
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 130, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1)
label=['OH State','PA State','FL State']
data_var=len(label)*[None]
KPI='dem_share'
data_var[0]=np.array(df_swing[df_swing['state']=='OH'][KPI])
data_var[1]=np.array(df_swing[df_swing['state']=='PA'][KPI])
data_var[2]=np.array(df_swing[df_swing['state']=='FL'][KPI])
nvar=3
xs = np.linspace(0,100,500)
colors = ['r','b','g']
title='2008 US Swing State Election_Kernel Density Estimation for dem_share for each State'
EDA_plot.KDE(xs,data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Probability',xlim=(0,100),ylim=(0,0.01),
LAMBA=0.5,linewidth=2.5,loc=0,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
Empirical CDF¶
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 130, facecolor='w', edgecolor='k')
label=['OH State','PA State','FL State']
data_var=len(label)*[None]
KPI='dem_share'
data_var[0]=np.array(df_swing[df_swing['state']=='OH'][KPI])
data_var[1]=np.array(df_swing[df_swing['state']=='PA'][KPI])
data_var[2]=np.array(df_swing[df_swing['state']=='FL'][KPI])
ax1=plt.subplot(2,1,1)
title=f'2008 US Swing State Election_Empirical CDF for dem_share for each State'
EDA_plot.CDF_plot(data_var,nvar,label,colors,xlabel=None,title=title,ylabel='Cumulative Probability',bins=100,
xlim=(0,100),ylim=(0,1),linewidth=2.5,loc=0,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
plt.show()
Percentile¶
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
label=['FL State']
KPI='dem_share'
data_var=np.array(df_swing[df_swing['state']=='FL'][KPI])
ax1=plt.subplot(2,1,1)
title=f'2008 US Swing State Election_Empirical CDF for dem_share for FL State'
EDA_plot.CDF_plot(data_var,1,label,colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=100,
xlim=(0,90),ylim=(0,1),linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
# 25th percentile
per=0.25
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '25th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
fontsize=10)
# 50th percentile
per=0.5
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '50th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
fontsize=10)
# 75th percentile
per=0.75
Perctile=np.quantile(data_var, per, axis=0, keepdims=True)[0]
txt= '75th percentile= '
txt+=f'{np.round(Perctile,1)}'
plt.plot([0, Perctile], [per, per], 'k--', linewidth=1.2)
plt.plot([Perctile, Perctile], [0, per], 'k--', linewidth=1.2)
plt.text(Perctile/6, per+0.03, txt, color='b',
fontsize=10)
plt.show()
Cross plot¶
font = {'size' : 12}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(7, 5), dpi= 100, facecolor='w', edgecolor='k')
x=df_swing['total_votes']
y=df_swing['dem_share']
EDA_plot.CrossPlot(x,y,title='cross plot: total votes vs. dem_vote',
xlabl='total votes',ylabl='percent of dem_vote',loc=1,xlimt=(np.nanmin(x),np.nanmax(x)),
ylimt=(np.nanmin(y),np.nanmax(y)),axt=ax1,scale=1.2,alpha=0.9,markersize=6,marker='bo',font=font)
Correlation¶
Synthetic data with six features (Var1 to Var6) following Gaussian, triangular, and uniform distributions are generated to have linear and non-linear correlations with a target, for 100,000 data points.
from scipy.stats import zscore  # zscore is used for standardization below

# Generate synthetic data
df=quadrac_swiss_roll(noise=2,nsim=100000,time_rand1=1,seed=42,a1=20,b1=8,a2=50,b2=8,swiss_roll=True)
features_colums=df.columns
# Standardize data
for i in df.columns:
    df[i] = zscore(df[i])
The figures below show the histograms of the variables and the cross-plot matrix between features Var1 to Var6 and the target. Var1 and Var2 are non-linearly correlated with the target, Var3 has a strong positive correlation (0.68), and Var4 and Var5 are negatively correlated with the target.
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 10), dpi= 100, facecolor='w', edgecolor='k')
colors_map = plt.cm.get_cmap('jet')
colors = colors_map(np.linspace(0,0.8,len(features_colums)))
for ir in range(len(features_colums)):
    ax1=plt.subplot(4,2,ir+1)
    val=df[features_colums[ir]]
    EDA_plot.histplt(val,bins=20,title=f'{features_colums[ir]}',xlabl=None,days=False,
                     ylabl=None,xlimt=None,ylimt=(0,0.3),
                     axt=ax1,nsplit=5,scale=1.02,loc=2,font=10.5,color=colors[len(features_colums)-ir-1])
plt.subplots_adjust(hspace=0.4)
plt.subplots_adjust(wspace=0.3)
fig.suptitle(f'Histogram of Variables for Synthetic Data', fontsize=16,y=0.96)
plt.show()
font = {'size' :9 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 9), dpi= 100, facecolor='w', edgecolor='k')
# Standardize data (already standardized above; re-applying zscore leaves the values unchanged)
for i in features_colums:
    df[i] = zscore(df[i])
cplotmatrix(df,font=font,alpha=0.008,marker='g.',missin_rep=False)
plt.show()
Correlation Coefficient
font = {'size' : 6}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(3.5, 4), dpi= 200, facecolor='w', edgecolor='k')
corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
features_colums=list(corr.index)
Correlation_plot.corr_bar(coefs,clmns=features_colums,select=False,yfontsize=6.0,title=f'Linear Correlation with Target',
ymax_vert_lin=30,xlim = [-0.5, 0.7])
Coefficient of Determination
The coefficient of determination, termed $R^{2}$, is a basic approach to quantify the importance of features for a target. In statistics, $R^{2}$ assesses the ability of a statistical model to explain and predict outcomes. In other words, if we have a dependent variable y and an independent variable x in a model, $R^{2}$ measures how much of the variation in y is explained by the variation in x. It is one of the key outputs of regression analysis and is used when we want to predict future outcomes or test a model against related information. $R^{2}$ lies between 0 and 1; the higher the value of $R^{2}$, the better the prediction and the stronger the model. $R^{2}$ is simply the square of the correlation coefficient ($\rho$).
There are multiple formulas to calculate the coefficient of determination. It can be computed directly from the correlation coefficient:
$\large R^{2}=\rho^{2}$
ir=0
corr=df.corr()
corr=corr['Target'].drop(['Target'])
coefs=corr.values
coefs= np.array([i**2 for i in coefs])
# Plot the importance of features
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6, 3), dpi= 180, facecolor='w', edgecolor='k')
_=prfrmnce_plot(coefs, title=f'Coefficient of Determination',
ylabel='Coefficient of Determination',clmns=features_colums,titlefontsize=9,
xfontsize=7, yfontsize=8).bargraph(perent=True,fontsizelable=8,xshift=-0.25,axt=ax1,
yshift=0.02,ylim=[0,1], xlim=[-0.5,5.5], y_rot=0)
Probability Distributions¶
- Hacker statistics
- Use simulated repeated measurements to compute probabilities
- Determine how to simulate the data
- Simulate many, many times
- The probability is approximately the fraction of trials with the outcome of interest
- Bernoulli trial
- An experiment that has two options: "success" (True) and "failure" (False)
- Probability Mass Function (PMF)
- The set of probabilities of discrete outcomes
- Probability Distribution
- A mathematical description of outcomes
Binomial distribution
- The number r of successes in n Bernoulli trials with probability p of success is Binomially distributed. For example, the number of heads in 4 coin flips with probability 0.5 of heads is Binomially distributed.
- Probability Density Function (PDF)
- Continuous variables: quantities that can take any value, not just discrete values
- The continuous analog of the PMF (probability mass function)
- A mathematical description of the relative likelihood of observing a value of a continuous variable
- For the PDF of the normal distribution, the area under the PDF over an interval gives the probability of observing a value in that interval (see the sketch below)
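As a quick illustration of this last point, the sketch below (using scipy.stats, with the interval [-1, 1] chosen only for illustration) compares the analytic area under a standard normal PDF with the fraction obtained by simulation.
from scipy.stats import norm

# Area under the standard normal PDF between -1 and 1, i.e. P(-1 < X < 1)
area = norm.cdf(1) - norm.cdf(-1)

# Approximate the same probability by simulation (hacker statistics)
samples = np.random.normal(loc=0, scale=1, size=100000)
frac = np.mean((samples > -1) & (samples < 1))
print(f'analytic area = {area:.4f}, simulated fraction = {frac:.4f}')  # both are close to 0.683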
The Binomial PMF¶
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question.
The probability mass function of the binomial distribution is
$\LARGE P(N)=\binom{n}{N}p^{N}(1-p)^{n-N}$
where $n$ is the number of trials, $p$ is the probability of success, and $N$ is the number of successes.
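The analytic PMF can also be evaluated directly with scipy.stats.binom and compared with the simulated histogram in the next cell; a minimal sketch, assuming the same n=60 and p=0.1 used below.
from scipy.stats import binom

n, p = 60, 0.1
for N in range(13):
    # P(N successes in n trials) = C(n, N) * p**N * (1-p)**(n-N)
    print(f'P(N={N:2d}) = {binom.pmf(N, n, p):.4f}')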
Samples=np.random.binomial(n=60,p=0.1,size=10000)
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Binomial PMF, n=60, p=0.1, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=800,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,15),ylimt=(0,0.2),axt=ax1,
scale=1.15,loc=1,font=10,color='g')
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Binomial ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,15),ylim=(0,1),linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=16,leg_ftze=9)
Poisson processes and the Poisson Distribution¶
The timing of the next event is completely independent of when the previous event happened. Examples are:
- Natural births in a given hospital
- Hit on a website during a given hour
- meteor strikes
- Molecular collisions in a gas
- Aviation incidents
- Buses in Poissonville
- The number r of arrivals of a Poisson process in a given time interval, with an average rate of $\lambda$ arrivals per interval, is Poisson distributed.
- The number r of hits on a website in one hour, with an average hit rate of 6 hits per hour, is Poisson distributed.
The Poisson distribution is the limit of the Binomial distribution for a low probability of success and a large number of Bernoulli trials (an experiment that has two options, "success" (True) and "failure" (False)), as illustrated in the sketch below.
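A minimal sketch of this limit: a Binomial distribution with a large n and a small p is compared with a Poisson distribution having the same mean $\mu=np$ (the values n=1000 and p=0.006 are chosen only for illustration).
from scipy.stats import binom, poisson

n, p = 1000, 0.006  # many Bernoulli trials, low probability of success
mu = n*p            # matching Poisson mean (= 6)
for k in [0, 2, 4, 6, 8, 10]:
    print(f'k={k:2d}  Binomial={binom.pmf(k, n, p):.4f}  Poisson={poisson.pmf(k, mu):.4f}')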
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time (or space) if these events occur with a known constant mean rate:
$\LARGE P(n)=\frac{\mu ^{n}e^{-\mu}}{n!}$
where $P(n)$ is the predicted probability of $n$ occurrences and $\mu$ is the mean number of occurrences in a fixed interval of time.
- Small example
A small company, on average, receives 12 customers per day ($\mu=12$). What is the probability that the company will receive exactly 8 ($n=8$) customers in one day?
$\large P(n=8)=\frac{12 ^{8}e^{-12}}{8!}=0.0655$
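The hand calculation above can be checked with scipy.stats.poisson (a short sketch):
from scipy.stats import poisson

# P(n=8 customers | mu=12 customers per day)
print(round(poisson.pmf(8, 12), 4))  # 0.0655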
# Draw samples from a Poisson distribution.
Samples=np.random.poisson(lam=60,size=10000)
# lam: Expected number of events occurring in a fixed-time interval,
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Poisson PMF, Expected Number of Events in Fixed Interval=60, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=1000,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,100),ylimt=None,axt=ax1,
scale=1.15,loc=1,font=9,color='g')
#
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Poisson ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,100),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=12,leg_ftze=9)
Normal Distribution¶
# Draw samples from a Normal distribution.
Samples=np.random.normal(loc=3, scale=1.0, size=10000)
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Normal Distribution_loc=3, scale=1.0, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=50,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,6),ylimt=None,axt=ax1,
scale=1.15,loc=1,font=10,color='g')
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Normal ecdf',colors='g',xlabel=None,title='ecdf of '+title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,6),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=9)
The Exponential Distribution¶
The waiting time between arrivals of a Poisson process is Exponentially distributed.
# Draw samples from an Exponential distribution.
Samples=np.random.exponential(scale=1.0,size=10000)
font = {'size' : 11}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 8), dpi= 130, facecolor='w', edgecolor='k')
title='Exponential Distribution_scale=1.0, size=10000'
ax1=plt.subplot(2,1,1)
EDA_plot.histplt (Samples,bins=100,title=title,xlabl=None,
ylabl='Percentage',xlimt=(0,10),ylimt=None,axt=ax1,
scale=1.15,loc=1,font=10,color='g')
ax1=plt.subplot(2,1,2)
EDA_plot.CDF_plot(Samples,1,label='Exponential ecdf',colors='g',xlabel=None,title=title,ylabel='Cumulative Probability',bins=10000,
xlim=(0,10),ylim=None,linewidth=2.5,loc=4,axt=ax1,x_ftze=12,y_ftze=12,tit_ftze=14,leg_ftze=9)
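To illustrate the statement at the start of this section, the sketch below simulates a Poisson process (an arbitrary rate of 6 arrivals per unit time over a long window) and checks that the waiting times between successive arrivals have the mean and standard deviation of an Exponential distribution, both equal to 1/rate.
# Waiting times between arrivals of a simulated Poisson process
rate, T = 6, 1000                                           # average arrivals per unit time, total time window
n_events = np.random.poisson(lam=rate*T)                    # number of events in [0, T]
arrival_times = np.sort(np.random.uniform(0, T, n_events))  # event times of the Poisson process
waits = np.diff(arrival_times)                              # waiting times between successive events
print(f'mean wait = {waits.mean():.4f}, std = {waits.std():.4f} (theory: 1/rate = {1/rate:.4f})')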
Data Manipulation with Pandas¶
Pandas is an open-source Python package for data manipulation and analysis with high-performance, easy-to-use methods. It is based on the dataframe concept from the R programming language. Pandas can be used as the primary tool for data processing, data manipulation, and data cleaning. Pandas is built on top of NumPy and Matplotlib.
Look at Data¶
# Reading data in Pandas
import pandas as pd
df = pd.read_csv("./Data/WPD.csv") # Read data file
#df.head()
df[0:5] # Display data
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | NaN | 114.3 | NaN | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | NaN | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | NaN | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | NaN | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | NaN | NaN | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
# shape of data
df.shape
(1000, 12)
The columns attribute gives the names of the columns in your data file.
column=list(df.columns)
print(column)
['X Coordinate', 'Y Coordinate', 'Measured Depth (m)', 'Deviation (True/False)', 'Surface-Casing Weight (kg/m)', 'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)', 'Bore. Temp. (degC)', 'Prod. Formation', 'Porosity (fraction)', 'Permeability (Darcy)', 'OIL Prod. (e3m3/month)']
The info() method shows each column's data type (string, float, boolean) and the number of non-null values, which reveals missing values.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 1000 non-null float64 1 Y Coordinate 1000 non-null int64 2 Measured Depth (m) 1000 non-null float64 3 Deviation (True/False) 1000 non-null bool 4 Surface-Casing Weight (kg/m) 593 non-null float64 5 Production-Casing Size (mm) 913 non-null float64 6 Production-Casing Weight (kg/m) 547 non-null float64 7 Bore. Temp. (degC) 801 non-null float64 8 Prod. Formation 1000 non-null object 9 Porosity (fraction) 1000 non-null float64 10 Permeability (Darcy) 1000 non-null float64 11 OIL Prod. (e3m3/month) 1000 non-null float64 dtypes: bool(1), float64(9), int64(1), object(1) memory usage: 87.0+ KB
The describe() method gives summary statistics for each column whose data type is not string.
df.describe()
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 593.000000 | 913.000000 | 547.000000 | 801.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 432.893400 | 489.749000 | 1364.379300 | 42.906745 | 142.495290 | 23.260512 | 40.774032 | 0.141840 | 5.269140 | 60.003280 |
std | 172.206682 | 243.547912 | 855.633567 | 25.877438 | 25.953205 | 7.137579 | 21.256825 | 0.074501 | 8.665687 | 234.814659 |
min | 12.300000 | 36.000000 | 276.000000 | 14.100000 | 13.700000 | 10.800000 | 5.000000 | 0.000000 | 0.050000 | 0.000000 |
25% | 325.425000 | 339.750000 | 704.000000 | 35.700000 | 114.300000 | 17.300000 | 27.000000 | 0.089750 | 1.040000 | 0.000000 |
50% | 487.250000 | 480.000000 | 1066.000000 | 35.700000 | 139.700000 | 20.800000 | 33.000000 | 0.141000 | 2.480000 | 4.715000 |
75% | 581.675000 | 596.000000 | 1790.000000 | 48.100000 | 177.800000 | 25.300000 | 53.000000 | 0.192000 | 5.752500 | 33.117500 |
max | 648.800000 | 1172.000000 | 6363.000000 | 595.000000 | 219.100000 | 53.600000 | 255.000000 | 0.400000 | 94.510000 | 3425.530000 |
# Components of DataFrame
df.values
array([[352.6, 372, 2145.8, ..., 0.224, 12.25, 108.75], [435.7, 551, 1638.0, ..., 0.118, 4.11, 0.33], [37.7, 788, 3984.0, ..., 0.121, 1.99, 21.87], ..., [593.4, 456, 683.7, ..., 0.332, 49.75, 832.45], [540.9, 328, 1346.0, ..., 0.111, 0.59, 0.0], [617.8, 377, 844.0, ..., 0.292, 9.95, 153.43]], dtype=object)
# index of DataFrame
df.index
RangeIndex(start=0, stop=1000, step=1)
Data Cleaning¶
Missing (NaN) Values¶
It is ideal to have a valid value for every row and column. In reality, missing values are one of the main challenges in machine learning. There are several approaches in practice for dealing with missing values.
Option 1: Drop Column
One naive approach is to remove the columns that have missing values with the drop() function.
Select_column=column[3:8] # Select columns that have missing values
Select_column
['Deviation (True/False)', 'Surface-Casing Weight (kg/m)', 'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)', 'Bore. Temp. (degC)']
df_drop_c=df.copy() # Copy to avoid modifying the original data
df_drop_c.drop(Select_column,axis=1, inplace=True) # Drop selected 5 columns
df_drop_c[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | Sand | 0.158 | 5.25 | 17.53 |
df_drop_c.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 1000 non-null float64 1 Y Coordinate 1000 non-null int64 2 Measured Depth (m) 1000 non-null float64 3 Prod. Formation 1000 non-null object 4 Porosity (fraction) 1000 non-null float64 5 Permeability (Darcy) 1000 non-null float64 6 OIL Prod. (e3m3/month) 1000 non-null float64 dtypes: float64(5), int64(1), object(1) memory usage: 54.8+ KB
This approach is not reliable since all the information in those five columns, including their non-missing values, has been removed from the data set.
Option 2: Drop Row
Another approach is to remove the rows that have missing values with the dropna() function.
df_drop_=df.copy() # Copy to avoid modifying the original data
df_drop_r=df_drop_.dropna() # Drop rows with missing values
df_drop_r[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 636.9 | 462 | 713.0 | True | 48.1 | 177.8 | 25.3 | 27.0 | Shale | 0.128 | 0.70 | 0.0 |
7 | 556.2 | 441 | 761.0 | False | 35.7 | 139.7 | 20.8 | 28.0 | Sand | 0.055 | 0.36 | 0.0 |
8 | 572.0 | 521 | 587.0 | False | 25.3 | 114.3 | 14.1 | 15.0 | Shale | 0.033 | 0.46 | 0.0 |
10 | 366.2 | 385 | 1801.0 | False | 35.7 | 139.7 | 25.3 | 52.0 | Shale-Sand | 0.147 | 3.15 | 5.8 |
14 | 648.7 | 275 | 920.0 | False | 35.7 | 139.7 | 20.8 | 30.0 | Shale | 0.104 | 0.38 | 0.0 |
df_drop_r.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 426 entries, 6 to 999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 426 non-null float64 1 Y Coordinate 426 non-null int64 2 Measured Depth (m) 426 non-null float64 3 Deviation (True/False) 426 non-null bool 4 Surface-Casing Weight (kg/m) 426 non-null float64 5 Production-Casing Size (mm) 426 non-null float64 6 Production-Casing Weight (kg/m) 426 non-null float64 7 Bore. Temp. (degC) 426 non-null float64 8 Prod. Formation 426 non-null object 9 Porosity (fraction) 426 non-null float64 10 Permeability (Darcy) 426 non-null float64 11 OIL Prod. (e3m3/month) 426 non-null float64 dtypes: bool(1), float64(9), int64(1), object(1) memory usage: 40.4+ KB
This approach is also not recommended because the number of rows has decreased from 1000 to 426.
Option 3: Replace with Median (Mean)
A common practice is to replace missing values with the median (or mean) of that column. The median is the middle value of a sorted list: 50% of the values are smaller and 50% are larger. The following code replaces all missing values in the 5 selected columns with the median of each column.
df_im=df.copy()
# Get the median of each variable and replace the missing values with median
for i in range(len(Select_column)):
    P50 = df[Select_column[i]].median()                           # Calculate median of each column
    df_im[Select_column[i]] = df_im[Select_column[i]].fillna(P50) # Replace NaN with median (P50)
df_im[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
df_im.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 1000 non-null float64 1 Y Coordinate 1000 non-null int64 2 Measured Depth (m) 1000 non-null float64 3 Deviation (True/False) 1000 non-null bool 4 Surface-Casing Weight (kg/m) 1000 non-null float64 5 Production-Casing Size (mm) 1000 non-null float64 6 Production-Casing Weight (kg/m) 1000 non-null float64 7 Bore. Temp. (degC) 1000 non-null float64 8 Prod. Formation 1000 non-null object 9 Porosity (fraction) 1000 non-null float64 10 Permeability (Darcy) 1000 non-null float64 11 OIL Prod. (e3m3/month) 1000 non-null float64 dtypes: bool(1), float64(9), int64(1), object(1) memory usage: 87.0+ KB
Dropping duplicate rows
df.drop_duplicates(subset="name")             # drop rows duplicated in one column (generic column name shown)
df.drop_duplicates(subset=["name1","name2"])  # drop rows duplicated in a combination of columns
Remove Outliers¶
Outliers are usually extremely high or low values that deviate from the pattern of the majority of the data. Outliers should be removed from the dataset (we should be careful not to treat novelty as an outlier). A straightforward approach is to flag values more than several (n) standard deviations from the mean, $m\pm n\sigma$ ($m$=mean, $\sigma$=standard deviation). For example, an outlier can be bigger than $m+3\sigma$ or smaller than $m-3\sigma$. The following function applies this approach.
def outlier_remove(df, n, name):
    """Delete rows for a specified column where values are out of +/- n*sd standard deviations
    df  : Pandas dataframe
    n   : n in the equation 𝑚±𝑛𝜎
    name: column name
    """
    mean=df[name].mean()  # Calculate mean of column
    sd=df[name].std()     # Calculate standard deviation of column
    drop_r = df.index[(mean - n * sd > df[name]) | (mean + n * sd < df[name])]  # Find data that are not within 𝑚±𝑛𝜎
    df.drop(drop_r, axis=0, inplace=True)   # Drop those rows
    df.reset_index(inplace=True, drop=True) # Reset index
# Drop outliers in last column 'OIL Prod. (e3m3/month)'
df_im_out=df_im.copy()
outlier_remove(df_im_out,n=3,name='OIL Prod. (e3m3/month)')
df_im_out[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
Concatenation¶
New dataframes can be built by concatenating rows or columns. Below, a new dataframe is created by concatenating two columns, 'Measured Depth (m)' and 'OIL Prod. (e3m3/month)', using the concat() function.
col_1= df_im['Measured Depth (m)']
col_2 = df_im['OIL Prod. (e3m3/month)']
result = pd.concat([col_1, col_2], axis=1)
result[:5]
Measured Depth (m) | OIL Prod. (e3m3/month) | |
---|---|---|
0 | 2145.8 | 108.75 |
1 | 1638.0 | 0.33 |
2 | 3984.0 | 21.87 |
3 | 3809.0 | 46.30 |
4 | 2335.0 | 17.53 |
Rows can also be concatenated with the concat() function. The code below concatenates the first 4 rows with 3 rows from near the end of the imputed dataset.
row_1= df_im[0:4] # Retrieve first 4 rows
row_2 = df_im[-4:-1] # Retrieve 3 rows from the end (the very last row is excluded)
result = pd.concat([row_1, row_2], axis=0) # Concatenate rows
result[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
996 | 548.3 | 475 | 749.8 | False | 35.7 | 114.3 | 20.8 | 28.0 | Shale | 0.017 | 0.48 | 0.00 |
Saving a Dataframe as CSV¶
You can simply save a dataframe as CSV with the following code.
filename="myfile.csv"
df_im.to_csv(filename, index=False) # index = False not writing row numbers
Shuffling, Grouping and Sorting¶
Shuffling Dataset
An example of shuffling is a game of cards: all the cards are collected at the end of each round and shuffled so that they are distributed randomly and each player receives cards by chance. In machine learning, the data set is split into training and validation sets, so shuffling is very important to avoid any bias or pattern in the split datasets.
If the same shuffle is required every time the code is run, a random seed can be used. See the code below (the np.random.seed call is commented out).
import numpy as np
#np.random.seed(82)
df_shfl = df_im.reindex(np.random.permutation(df.index))
df_shfl.reset_index(inplace=True, drop=True) # Reset index
df_shfl[0:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 498.9 | 59 | 1010.0 | False | 53.6 | 139.7 | 20.8 | 26.0 | Shale-Sand | 0.100 | 0.66 | 0.00 |
1 | 320.3 | 468 | 1436.8 | False | 35.7 | 139.7 | 20.8 | 41.0 | Sand | 0.134 | 3.50 | 2.02 |
2 | 298.7 | 674 | 3065.0 | True | 53.6 | 139.7 | 20.8 | 33.0 | Sand | 0.338 | 42.45 | 1760.45 |
3 | 569.8 | 205 | 989.1 | False | 35.7 | 114.3 | 20.8 | 33.0 | Shale | 0.018 | 0.91 | 0.00 |
4 | 522.3 | 213 | 1050.0 | False | 35.7 | 139.7 | 23.1 | 35.0 | Shale | 0.157 | 9.48 | 37.06 |
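An alternative, arguably more idiomatic, way to shuffle reproducibly is pandas' own sample() method; a minimal sketch (random_state=42 is an arbitrary seed):
df_shfl2 = df_im.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle all rows with a fixed seed
df_shfl2[0:5]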
Sorting Data Set
df_sort = df_im.sort_values(by='Measured Depth (m)', ascending=True)
#df_sort.reset_index(inplace=True, drop=True) # Reset index
df_sort[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | 132.0 | 1104 | 276.0 | False | 25.3 | 114.3 | 14.1 | 25.0 | Shale | 0.142 | 1.16 | 0.00 |
537 | 72.9 | 784 | 295.5 | False | 25.3 | 114.3 | 14.1 | 22.0 | Shale | 0.000 | 0.66 | 0.00 |
403 | 591.3 | 791 | 297.0 | False | 25.3 | 114.3 | 14.1 | 26.0 | Shale | 0.142 | 13.37 | 72.26 |
507 | 621.7 | 593 | 321.0 | True | 53.6 | 177.8 | 34.2 | 33.0 | Shale | 0.091 | 1.72 | 0.00 |
105 | 631.2 | 583 | 343.2 | False | 56.5 | 139.7 | 20.8 | 33.0 | Sand | 0.168 | 3.19 | 19.23 |
We can sort by multiple columns:
df_sort = df_im.sort_values(['Measured Depth (m)','Porosity (fraction)'], ascending=[True, False])
df_sort[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | 132.0 | 1104 | 276.0 | False | 25.3 | 114.3 | 14.1 | 25.0 | Shale | 0.142 | 1.16 | 0.00 |
537 | 72.9 | 784 | 295.5 | False | 25.3 | 114.3 | 14.1 | 22.0 | Shale | 0.000 | 0.66 | 0.00 |
403 | 591.3 | 791 | 297.0 | False | 25.3 | 114.3 | 14.1 | 26.0 | Shale | 0.142 | 13.37 | 72.26 |
507 | 621.7 | 593 | 321.0 | True | 53.6 | 177.8 | 34.2 | 33.0 | Shale | 0.091 | 1.72 | 0.00 |
105 | 631.2 | 583 | 343.2 | False | 56.5 | 139.7 | 20.8 | 33.0 | Sand | 0.168 | 3.19 | 19.23 |
Subsetting based on multiple conditions:
is_depth = df["Measured Depth (m)"] == 321.0
is_deviated = df["Deviation (True/False)"] == True
df[is_depth & is_deviated]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
507 | 621.7 | 593 | 321.0 | True | 53.6 | 177.8 | 34.2 | NaN | Shale | 0.091 | 1.72 | 0.0 |
Subsetting using .isin()
df_sort[df_sort['Prod. Formation'].isin(['Sand'])][:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
105 | 631.2 | 583 | 343.2 | False | 56.5 | 139.7 | 20.8 | 33.0 | Sand | 0.168 | 3.19 | 19.23 |
918 | 607.5 | 592 | 409.0 | True | 48.1 | 177.8 | 29.7 | 21.0 | Sand | 0.050 | 0.66 | 0.00 |
92 | 589.3 | 596 | 436.0 | False | 48.1 | 177.8 | 25.3 | 18.0 | Sand | 0.166 | 11.82 | 17.42 |
793 | 571.8 | 598 | 445.0 | False | 53.6 | 177.8 | 34.2 | 18.0 | Sand | 0.084 | 1.01 | 0.00 |
560 | 621.3 | 595 | 455.0 | True | 48.1 | 177.8 | 29.8 | 27.0 | Sand | 0.262 | 8.87 | 121.89 |
Summary statistics¶
Grouping is applied to summarize data. The methods below calculate summary statistics for each column or each row; by default they operate on columns (a short sketch follows the list):
mean()
median()
mode()
min()
max()
var()
std()
sum()
quantile()
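A minimal sketch of a few of these methods, applied column-wise to the imputed dataframe:
# By default these methods compute statistics per column
print(df_im['Porosity (fraction)'].mean())
print(df_im['Measured Depth (m)'].median())
print(df_im[['Porosity (fraction)','Measured Depth (m)']].max())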
The following code performs grouping, generating summary statistics of the target variable 'OIL Prod. (e3m3/month)' for each category of the feature 'Prod. Formation'.
gb_mean=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].mean()
gb_mean
Prod. Formation Sand 64.015467 Shale 59.502943 Shale-Sand 54.482824 Name: OIL Prod. (e3m3/month), dtype: float64
gb_sum=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].sum()
gb_sum
Prod. Formation Sand 19204.64 Shale 31536.56 Shale-Sand 9262.08 Name: OIL Prod. (e3m3/month), dtype: float64
.agg() Method¶
The .agg() method is used for computing custom summary statistics (df['column'].agg(function)).
def pct90(clm):
    return clm.quantile(0.9)
df_im['OIL Prod. (e3m3/month)'].agg(pct90)
105.87400000000005
df_im[['OIL Prod. (e3m3/month)','Measured Depth (m)']].agg(pct90)
OIL Prod. (e3m3/month) 105.874 Measured Depth (m) 2396.400 dtype: float64
def pct30(clm):
    return clm.quantile(0.3)
df_im['OIL Prod. (e3m3/month)'].agg([pct30,pct90])
pct30 0.000 pct90 105.874 Name: OIL Prod. (e3m3/month), dtype: float64
Cumulative sum¶
df_im['OIL Prod. (e3m3/month)'].iloc[:5]
0 108.75 1 0.33 2 21.87 3 46.30 4 17.53 Name: OIL Prod. (e3m3/month), dtype: float64
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cumsum()
0 108.75 1 109.08 2 130.95 3 177.25 4 194.78 Name: OIL Prod. (e3m3/month), dtype: float64
There are other cumulative statistics:
cummax()
cummin()
cumprod()
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cummax()
0 108.75 1 108.75 2 108.75 3 108.75 4 108.75 Name: OIL Prod. (e3m3/month), dtype: float64
df_im['OIL Prod. (e3m3/month)'].iloc[:5].cummin()
0 108.75 1 0.33 2 0.33 3 0.33 4 0.33 Name: OIL Prod. (e3m3/month), dtype: float64
df_im['OIL Prod. (e3m3/month)'].value_counts()
0.00 379 0.01 14 0.05 12 0.02 8 22.01 5 ... 37.68 1 17.15 1 553.28 1 27.46 1 153.43 1 Name: OIL Prod. (e3m3/month), Length: 521, dtype: int64
groupby¶
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].agg([min,max,np.median])
min | max | median | |
---|---|---|---|
Deviation (True/False) | |||
False | 276.0 | 3458.3 | 1025.0 |
True | 321.0 | 6363.0 | 1354.5 |
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].mean()
Deviation (True/False) False 1214.885276 True 1644.465805 Name: Measured Depth (m), dtype: float64
Pivot tables¶
df_im.groupby('Deviation (True/False)')['Measured Depth (m)'].mean()
Deviation (True/False) False 1214.885276 True 1644.465805 Name: Measured Depth (m), dtype: float64
df_im.pivot_table(aggfunc='mean', values='Measured Depth (m)', index='Deviation (True/False)')
Measured Depth (m) | |
---|---|
Deviation (True/False) | |
False | 1214.885276 |
True | 1644.465805 |
The values argument is the column you want to summarize and index is the column you want to group by.
df_im.pivot_table(aggfunc=[np.median,np.std], values='Measured Depth (m)', index='Deviation (True/False)')
median | std | |
---|---|---|
Measured Depth (m) | Measured Depth (m) | |
Deviation (True/False) | ||
False | 1025.0 | 612.634963 |
True | 1354.5 | 1132.657827 |
Index¶
df_im.columns
Index(['X Coordinate', 'Y Coordinate', 'Measured Depth (m)', 'Deviation (True/False)', 'Surface-Casing Weight (kg/m)', 'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)', 'Bore. Temp. (degC)', 'Prod. Formation', 'Porosity (fraction)', 'Permeability (Darcy)', 'OIL Prod. (e3m3/month)'], dtype='object')
df_im.index
RangeIndex(start=0, stop=1000, step=1)
We can move a column from the body of the dataframe to the index:
df_im[:5].set_index("X Coordinate")
Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|
X Coordinate | |||||||||||
352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
To undo what we just did, we can use reset_index():
df_im[:5].reset_index(drop=True)
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
Indexing makes subsetting simpler:
df_im[df_im['Prod. Formation'].isin(['Shale', 'Sand'])][:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
5 | 127.5 | 1125 | 1560.6 | False | 35.7 | 177.8 | 20.8 | 63.0 | Shale | 0.000 | 0.10 | 0.00 |
This expression is long; if 'Prod. Formation' were set as the index, the subsetting would be shorter (the set_index call below is commented out to keep the default index):
#df_im.set_index("Prod. Formation",inplace=True)
df_im[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
df.loc[0: 3]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | NaN | 114.3 | NaN | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | NaN | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | NaN | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | NaN | Sand | 0.170 | 5.84 | 46.30 |
Multi-level indexes
df_im.set_index(["Porosity (fraction)","Permeability (Darcy)"],inplace=True)
df_im[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||||||||
0.224 | 12.25 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 108.75 |
0.118 | 4.11 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.33 |
0.121 | 1.99 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 21.87 |
0.170 | 5.84 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 46.30 |
0.158 | 5.25 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 17.53 |
df_im.loc[[(0.224,12.25), (0.017,0.48)]]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||||||||
0.224 | 12.25 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 108.75 |
0.017 | 0.48 | 548.3 | 475 | 749.8 | False | 35.7 | 114.3 | 20.8 | 28.0 | Shale | 0.00 |
Sorting by index values
df_im.sort_index()[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||||||||
0.0 | 0.05 | 596.4 | 126 | 368.8 | False | 35.7 | 177.8 | 20.8 | 33.0 | Shale | 0.0 |
0.07 | 598.2 | 513 | 621.0 | True | 48.1 | 177.8 | 29.8 | 25.0 | Sand | 0.0 | |
0.07 | 372.6 | 382 | 1914.0 | False | 35.7 | 139.7 | 23.1 | 46.0 | Shale | 0.0 | |
0.10 | 127.5 | 1125 | 1560.6 | False | 35.7 | 177.8 | 20.8 | 63.0 | Shale | 0.0 | |
0.10 | 427.0 | 555 | 1495.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Shale-Sand | 0.0 |
df_im.sort_index(level=['Porosity (fraction)','Permeability (Darcy)'], ascending=[True, True])[:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||||||||
0.0 | 0.05 | 596.4 | 126 | 368.8 | False | 35.7 | 177.8 | 20.8 | 33.0 | Shale | 0.0 |
0.07 | 598.2 | 513 | 621.0 | True | 48.1 | 177.8 | 29.8 | 25.0 | Sand | 0.0 | |
0.07 | 372.6 | 382 | 1914.0 | False | 35.7 | 139.7 | 23.1 | 46.0 | Shale | 0.0 | |
0.10 | 127.5 | 1125 | 1560.6 | False | 35.7 | 177.8 | 20.8 | 63.0 | Shale | 0.0 | |
0.10 | 427.0 | 555 | 1495.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Shale-Sand | 0.0 |
Slicing by .loc and .iloc¶
With a multi-level index, loc slices on the outer index level; it does not slice the inner level on its own. We can pass two arguments to loc to include columns (: means return everything).
df_im.loc[:,"X Coordinate":"Deviation (True/False)"][:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | ||
---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||
0.224 | 12.25 | 352.6 | 372 | 2145.8 | False |
0.118 | 4.11 | 435.7 | 551 | 1638.0 | True |
0.121 | 1.99 | 37.7 | 788 | 3984.0 | True |
0.170 | 5.84 | 346.7 | 311 | 3809.0 | True |
0.158 | 5.25 | 254.6 | 727 | 2335.0 | True |
The iloc method slices rows and columns by integer position:
df_im.iloc[1:5,2:6]
Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | ||
---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||
0.118 | 4.11 | 1638.0 | True | 25.3 | 114.3 |
0.121 | 1.99 | 3984.0 | True | 35.7 | 114.3 |
0.170 | 5.84 | 3809.0 | True | 41.7 | 114.3 |
0.158 | 5.25 | 2335.0 | True | 53.6 | 139.7 |
Unlike loc, the last row and column are not included in iloc: the end of an iloc slice is exclusive, while a loc slice includes the end label (see the sketch below).
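A minimal sketch of the difference on a tiny dataframe constructed here only for illustration:
# loc slicing is label-based and includes the end label;
# iloc slicing is position-based and excludes the end position
tiny = pd.DataFrame({'a': [10, 20, 30, 40]}, index=['w', 'x', 'y', 'z'])
print(tiny.loc['w':'y'])  # rows w, x and y
print(tiny.iloc[0:2])     # rows w and x only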
Feature Engineering¶
Feature engineering is the process of extracting new features from raw data via data mining techniques. The new features can be used to enhance the performance of machine learning algorithms, and they can also be calculated from other fields. For the data set WPD.csv, we can combine "Surface-Casing Weight (kg/m)" and "Production-Casing Weight (kg/m)" (multiplied together in the code below) as a new feature.
df_im_c=df_im.copy()
df_im.insert(1, 'New Column', (df_im['Surface-Casing Weight (kg/m)']*df_im['Production-Casing Weight (kg/m)']).astype(int))
df_im
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | |||||||||||
0.224 | 12.25 | 352.6 | 742 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 108.75 |
0.118 | 4.11 | 435.7 | 437 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.33 |
0.121 | 1.99 | 37.7 | 803 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 21.87 |
0.170 | 5.84 | 346.7 | 721 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 46.30 |
0.158 | 5.25 | 254.6 | 1114 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 17.53 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
0.230 | 6.26 | 92.5 | 742 | 767 | 1466.0 | False | 35.7 | 139.7 | 20.8 | 51.0 | Sand | 8.74 |
0.017 | 0.48 | 548.3 | 742 | 475 | 749.8 | False | 35.7 | 114.3 | 20.8 | 28.0 | Shale | 0.00 |
0.332 | 49.75 | 593.4 | 742 | 456 | 683.7 | False | 35.7 | 177.8 | 20.8 | 26.0 | Shale-Sand | 832.45 |
0.111 | 0.59 | 540.9 | 356 | 328 | 1346.0 | True | 25.3 | 114.3 | 14.1 | 33.0 | Shale | 0.00 |
0.292 | 9.95 | 617.8 | 742 | 377 | 844.0 | False | 35.7 | 139.7 | 20.8 | 27.0 | Shale-Sand | 153.43 |
1000 rows × 11 columns
Standardize (Normalize) Values¶
Most machine learning algorithms require the input to be standardized so that values of different variables can be compared reliably. For example, if variable one has a range of 3 to 1000 and variable two has a range of 1 to 3, the two cannot be compared directly since the highest value of variable two (3) equals the lowest value of variable one. This inconsistency in data values leads to low performance of machine learning algorithms, so it is highly recommended to standardize (normalize) your data to a common mean ($\mu$) and standard deviation ($\sigma$) before feeding it to machine learning algorithms. A very common standardization (normalization) is the Z-Score:
$\Large z = \frac{x - \mu}{\sigma} $
where $\mu = \frac{1}{n} \sum_{i=1}^n x_i$, $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$, and $n$ is the number of data points. This ensures that each variable has $\mu=0$ and $\sigma=1$.
This standardization should be applied for all features except the target (label). For WPD.csv data set, the target is "OIL Prod. (e3m3/month)" that should not be changed.
The following Python code replaces the 'Measured Depth (m)' with a z-score. This can be done for all variables.
from scipy.stats import zscore
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("Before standardization: Mean= ",int(mean), ", variance= ",int(variance))
df_im['Measured Depth (m)'] = zscore(df_im['Measured Depth (m)'])
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("After standardization: Mean= ",int(mean), ", variance= ",int(variance))
Before standardization: Mean= 1364 , variance= 732108 After standardization: Mean= 0 , variance= 1
df_im[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | |||||||||||
0.224 | 12.25 | 352.6 | 742 | 372 | 0.913723 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 108.75 |
0.118 | 4.11 | 435.7 | 437 | 551 | 0.319947 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.33 |
0.121 | 1.99 | 37.7 | 803 | 788 | 3.063147 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 21.87 |
0.170 | 5.84 | 346.7 | 721 | 311 | 2.858518 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 46.30 |
0.158 | 5.25 | 254.6 | 1114 | 727 | 1.134956 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 17.53 |
Categorical Values¶
The input data for machine learning should be completely numeric, which means any text should be converted to numbers. In the data set WPD.csv, the columns 'Deviation (True/False)' and 'Prod. Formation' have categorical values. Use the value_counts() function to find out which categories exist and how many rows belong to each category.
df_c=df_im.copy()
df_c['Deviation (True/False)'].value_counts()
False 652 True 348 Name: Deviation (True/False), dtype: int64
df_c['Prod. Formation'].value_counts()
Shale 530 Sand 300 Shale-Sand 170 Name: Prod. Formation, dtype: int64
We can replace each category with a number starting from 0:
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(False, 0) # Replace False with 0
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(True, 1) # Replace True with 1
df_c[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | |||||||||||
0.224 | 12.25 | 352.6 | 742 | 372 | 0.913723 | 0 | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 108.75 |
0.118 | 4.11 | 435.7 | 437 | 551 | 0.319947 | 1 | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.33 |
0.121 | 1.99 | 37.7 | 803 | 788 | 3.063147 | 1 | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 21.87 |
0.170 | 5.84 | 346.7 | 721 | 311 | 2.858518 | 1 | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 46.30 |
0.158 | 5.25 | 254.6 | 1114 | 727 | 1.134956 | 1 | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 17.53 |
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale', 0) # Replace Shale with 0
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Sand', 1) # Replace Sand with 1
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale-Sand', 2) # Replace Shale-Sand with 2
df_c[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | |||||||||||
0.224 | 12.25 | 352.6 | 742 | 372 | 0.913723 | 0 | 35.7 | 114.3 | 20.8 | 53.0 | 2 | 108.75 |
0.118 | 4.11 | 435.7 | 437 | 551 | 0.319947 | 1 | 25.3 | 114.3 | 17.3 | 33.0 | 1 | 0.33 |
0.121 | 1.99 | 37.7 | 803 | 788 | 3.063147 | 1 | 35.7 | 114.3 | 22.5 | 33.0 | 0 | 21.87 |
0.170 | 5.84 | 346.7 | 721 | 311 | 2.858518 | 1 | 41.7 | 114.3 | 17.3 | 33.0 | 1 | 46.30 |
0.158 | 5.25 | 254.6 | 1114 | 727 | 1.134956 | 1 | 53.6 | 139.7 | 20.8 | 32.0 | 1 | 17.53 |
One challenge with this representation is that machine learning algorithms assume two nearby values are more similar than two distant values. This may be fine for some cases (ordered categories such as "bad", "average", "good"), but it is not always true. A common solution is to create one binary attribute per category. This is called one-hot encoding, because only one attribute is equal to 1 (hot) while the others are 0 (cold).
One_hot = pd.get_dummies(['a','b'],prefix='Deviation')
print(One_hot)
Deviation_a Deviation_b 0 1 0 1 0 1
These dummies should be merged back into the data frame.
One_hot = pd.get_dummies(df_im['Deviation (True/False)'],prefix='Deviation')
print(One_hot[0:10]) # Just show the first 10
Deviation_False Deviation_True Porosity (fraction) Permeability (Darcy) 0.224 12.25 1 0 0.118 4.11 0 1 0.121 1.99 0 1 0.170 5.84 0 1 0.158 5.25 0 1 0.000 0.10 1 0 0.128 0.70 0 1 0.055 0.36 1 0 0.033 0.46 1 0 0.224 9.95 0 1
df_one_hot = pd.concat([df_im,One_hot],axis=1)
df_one_hot[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | Deviation_False | Deviation_True | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | |||||||||||||
0.224 | 12.25 | 352.6 | 742 | 372 | 0.913723 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 108.75 | 1 | 0 |
0.118 | 4.11 | 435.7 | 437 | 551 | 0.319947 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.33 | 0 | 1 |
0.121 | 1.99 | 37.7 | 803 | 788 | 3.063147 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 21.87 | 0 | 1 |
0.170 | 5.84 | 346.7 | 721 | 311 | 2.858518 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 46.30 | 0 | 1 |
0.158 | 5.25 | 254.6 | 1114 | 727 | 1.134956 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 17.53 | 0 | 1 |
The same should be done for the 'Prod. Formation' column.
One_hot = pd.get_dummies(df_im['Prod. Formation'],prefix='Formation')
df_one_hot = pd.concat([df_one_hot,One_hot],axis=1)
df_one_hot[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | OIL Prod. (e3m3/month) | Deviation_False | Deviation_True | Formation_Sand | Formation_Shale | Formation_Shale-Sand | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||||||||||||||
0.224 | 12.25 | 352.6 | 742 | 372 | 0.913723 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 108.75 | 1 | 0 | 0 | 0 | 1 |
0.118 | 4.11 | 435.7 | 437 | 551 | 0.319947 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.33 | 0 | 1 | 1 | 0 | 0 |
0.121 | 1.99 | 37.7 | 803 | 788 | 3.063147 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 21.87 | 0 | 1 | 0 | 1 | 0 |
0.170 | 5.84 | 346.7 | 721 | 311 | 2.858518 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 46.30 | 0 | 1 | 1 | 0 | 0 |
0.158 | 5.25 | 254.6 | 1114 | 727 | 1.134956 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 17.53 | 0 | 1 | 1 | 0 | 0 |
Finally, the original columns 'Deviation (True/False)' and 'Prod. Formation' should be removed.
df_one_hot.drop(['Deviation (True/False)','Prod. Formation'], axis=1, inplace=True)
df_one_hot
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | OIL Prod. (e3m3/month) | Deviation_False | Deviation_True | Formation_Sand | Formation_Shale | Formation_Shale-Sand | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Porosity (fraction) | Permeability (Darcy) | ||||||||||||||
0.224 | 12.25 | 352.6 | 742 | 372 | 0.913723 | 35.7 | 114.3 | 20.8 | 53.0 | 108.75 | 1 | 0 | 0 | 0 | 1 |
0.118 | 4.11 | 435.7 | 437 | 551 | 0.319947 | 25.3 | 114.3 | 17.3 | 33.0 | 0.33 | 0 | 1 | 1 | 0 | 0 |
0.121 | 1.99 | 37.7 | 803 | 788 | 3.063147 | 35.7 | 114.3 | 22.5 | 33.0 | 21.87 | 0 | 1 | 0 | 1 | 0 |
0.170 | 5.84 | 346.7 | 721 | 311 | 2.858518 | 41.7 | 114.3 | 17.3 | 33.0 | 46.30 | 0 | 1 | 1 | 0 | 0 |
0.158 | 5.25 | 254.6 | 1114 | 727 | 1.134956 | 53.6 | 139.7 | 20.8 | 32.0 | 17.53 | 0 | 1 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
0.230 | 6.26 | 92.5 | 742 | 767 | 0.118826 | 35.7 | 139.7 | 20.8 | 51.0 | 8.74 | 1 | 0 | 1 | 0 | 0 |
0.017 | 0.48 | 548.3 | 742 | 475 | -0.718633 | 35.7 | 114.3 | 20.8 | 28.0 | 0.00 | 1 | 0 | 0 | 1 | 0 |
0.332 | 49.75 | 593.4 | 742 | 456 | -0.795925 | 35.7 | 177.8 | 20.8 | 26.0 | 832.45 | 1 | 0 | 0 | 0 | 1 |
0.111 | 0.59 | 540.9 | 356 | 328 | -0.021491 | 25.3 | 114.3 | 14.1 | 33.0 | 0.00 | 0 | 1 | 0 | 1 | 0 |
0.292 | 9.95 | 617.8 | 742 | 377 | -0.608484 | 35.7 | 139.7 | 20.8 | 27.0 | 153.43 | 1 | 0 | 0 | 0 | 1 |
1000 rows × 14 columns