Pandas is an open-source Python package for data manipulation and analysis with high-performance, easy-to-use methods. It is based on the dataframe concept from the R programming language. Pandas can be used as the primary tool for data processing, data manipulation, and data cleaning.
We will use a dataset of oil production for 1000 hydrocarbon wells in Alberta (WPD.csv). The target is to predict oil production from well properties including location, depth, deviation, porosity, permeability, and so on. The well properties were retrieved from geoSCOUT, but the well locations have been changed and some key properties have been manipulated for confidentiality reasons and to prevent use for other purposes. This data set was generated for this course to learn data processing and apply Machine Learning. The predictions should NOT be used in any publication because the original data has been modified.
# Reading data in Pandas
import pandas as pd # import Pandas
df = pd.read_csv("WPD.csv") # Read data file
#df.head()
df[0:5] # Display data
The columns attribute gives the names of the columns in the data file.
column=list(df.columns)
print(column)
The info() function reports the data type of each column (string/object, float, etc.) and the number of non-null entries, which reveals missing values.
df.info()
df
The describe() function gives summary statistics for each column whose data type is not string.
df.describe()
Ideally, every row and column would have a valid value. In reality, missing values are one of the main challenges in machine learning. There are several approaches in practice to deal with missing values, and a quick check of how many values are missing in each column helps choose among them (see the check below).
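A simple way to count missing values per column, using the df loaded above:
df.isnull().sum() # Count missing (NaN) values in each column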
Option 1: Drop Column
One naive approach is to remove the columns that have missing values with the drop() function.
Select_column=column[3:8] # Select the 5 columns that have missing values
Select_column
df_drop_c=df.copy() # Copy to avoid modifying the original data
df_drop_c.drop(Select_column,axis=1, inplace=True) # Drop selected 5 columns
df_drop_c
aa=df_drop_c.drop(1,axis=0) # Drop the row with index 1 (returns a new dataframe)
aa
df_drop_c
df_drop_c.info()
This approach is not reliable, since all of the valid (non-missing) values in the 5 dropped columns are removed from the data set along with the missing ones.
Option 2: Drop Row
Another approach is to remove the rows that have missing values with the dropna() function.
df_drop_=df.copy() # Copy to avoid modifying the original data
df_drop_r=df_drop_.dropna() # Drop rows with missing values
df_drop_r
df_drop_r.info()
This approach is also not recommended because the number of rows has decreased from 1000 to 426.
Option 3: Replace with Median (mean)
A common practice is to replace missing values with the median (or mean) value of that column. The median is the middle value of a sorted list: 50% of the values are smaller and 50% are bigger. The following code replaces all missing values in the 5 selected columns with the median of each column.
df_im=df.copy()
# Get the median of each variable and replace the missing values with median
for i in range(len(Select_column)):
    P50 = df[Select_column[i]].median()                            # Calculate the median of each column
    df_im[Select_column[i]] = df_im[Select_column[i]].fillna(P50)  # Replace NaN with the median (P50)
df_im
df_im.info()
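The same imputation can also be done in one step without a loop; a minimal alternative sketch, assuming the same Select_column list defined above:
df_im_alt = df.copy()                                                 # Fresh copy of the original data
medians = df[Select_column].median()                                  # Series of per-column medians
df_im_alt[Select_column] = df_im_alt[Select_column].fillna(medians)   # Fill NaN column-by-column
df_im_alt.info()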
Outliers are extremely high or low values that deviate from the pattern of the majority of the data. Outliers should be removed from the data set (being careful not to treat novelty as an outlier). A straightforward approach to detecting outliers is to flag values more than several ($n$) standard deviations from the mean, $m\pm n\sigma$ ($m$ = mean, $\sigma$ = standard deviation). For example, an outlier can be defined as a value bigger than $m+3\sigma$ or smaller than $m-3\sigma$. The following function applies this approach.
def outlier_remove(df, n, name):
    """Delete rows where the values of a specified column are more than n standard deviations from the mean
    df   : Pandas dataframe
    n    : n in the equation m ± nσ
    name : Column name
    """
    mean = df[name].mean()  # Calculate the mean of the column
    sd = df[name].std()     # Calculate the standard deviation of the column
    drop_r = df.index[(mean - n * sd > df[name]) | (mean + n * sd < df[name])]  # Find data outside m ± nσ
    df.drop(drop_r, axis=0, inplace=True)    # Drop those rows
    df.reset_index(inplace=True, drop=True)  # Reset index
# Drop outliers in last column 'OIL Prod. (e3m3/month)'
df_im_out=df_im.copy()
outlier_remove(df_im_out,n=3,name='OIL Prod. (e3m3/month)')
df_im_out
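To confirm the effect of the filter, compare the number of rows before and after:
print("Rows before:", len(df_im), " Rows after:", len(df_im_out), " Outliers removed:", len(df_im) - len(df_im_out))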
A new data frame can be built by concatenating rows or columns. Below, a new dataframe is created by concatenating the two columns 'Measured Depth (m)' and 'OIL Prod. (e3m3/month)' using the concat() function.
col_1= df_im['Measured Depth (m)']
col_2 = df_im['OIL Prod. (e3m3/month)']
result = pd.concat([col_1, col_2], axis=1)
result
Rows can also be concatenated with the concat() function. The code below concatenates the first 4 rows and the last 3 rows of the imputed dataset.
row_1= df_im[0:4] # Retrieve first 4 rows
row_2 = df_im[-3:] # Retrieve last 3 rows
result = pd.concat([row_1, row_2], axis=0) # Concatenate rows
result
You can save a dataframe as a CSV file with the following code.
filename="myfile.csv"
df_im.to_csv(filename, index=False) # index = False not writing row numbers
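To verify that the file was written correctly, it can be read back with read_csv(), using the filename defined above:
df_check = pd.read_csv(filename) # Read the saved file back
df_check[0:5]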
Shuffling Dataset
An everyday example of shuffling is a card game: at the end of each round, the cards are collected and shuffled so that they are distributed randomly and each player receives cards by chance. In Machine Learning, the data set should be split into training and validation sets, so shuffling is very important to avoid any bias or patterns carrying over into the split data sets.
If the same shuffle is needed every time the code is run, a random seed can be used. See the code below.
import numpy as np
#np.random.seed(82) # Uncomment to get the same shuffle on every run
df_shfl = df_im.reindex(np.random.permutation(df_im.index)) # Shuffle the rows of df_im
df_shfl.reset_index(inplace=True, drop=True) # Reset index
df_shfl[0:5]
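An equivalent one-line shuffle with a fixed seed uses the built-in sample() method; a minimal sketch:
df_shfl2 = df_im.sample(frac=1, random_state=82).reset_index(drop=True) # Shuffle all rows reproducibly
df_shfl2[0:5]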
Grouping Data Set
Grouping is applied to summarize data. The following codes performs grouping; generating the mean and sum of each category for feature 'Prod. Formation' in target variable 'OIL Prod. (e3m3/month)'.
gb_mean=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].mean()
gb_mean
gb_sum=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].sum()
gb_sum
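Several statistics can also be computed in a single call with agg(); a sketch computing the count, mean, and sum of production per formation:
gb_stats = df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].agg(['count', 'mean', 'sum'])
gb_stats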
Sorting Data Set
The sort_values() function sorts a dataframe by one or more columns. The code below sorts the data by 'Measured Depth (m)' in ascending order.
df_sort = df_im.sort_values(by='Measured Depth (m)', ascending=True)
#df_sort.reset_index(inplace=True, drop=True) # Reset index
df_sort
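sort_values() also accepts several columns and per-column directions; a sketch sorting by formation and, within each formation, by depth from deepest to shallowest:
df_sort2 = df_im.sort_values(by=['Prod. Formation', 'Measured Depth (m)'], ascending=[True, False])
df_sort2[0:5]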
Feature engineering is the process of extracting new features from raw data via data mining techniques. The new features can be used to enhance the performance of machine learning algorithms. New features can also be calculated from other fields. For the WPD.csv data set, we can multiply "Surface-Casing Weight (kg/m)" by "Production-Casing Weight (kg/m)" and insert the product as a new column.
df_im_c=df_im.copy() # Keep a copy of df_im before inserting the new column
df_im.insert(1, 'New Column', (df_im['Surface-Casing Weight (kg/m)']*df_im['Production-Casing Weight (kg/m)']).astype(int))
df_im
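Other derived features can be constructed the same way; for example, a purely illustrative (hypothetical) ratio of the two casing weights, built from the unmodified copy df_im_c:
df_fe = df_im_c.copy()  # Start from the copy made before inserting 'New Column'
df_fe['Casing Weight Ratio'] = df_fe['Surface-Casing Weight (kg/m)'] / df_fe['Production-Casing Weight (kg/m)'] # Hypothetical feature
df_fe[0:5]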
Most Machine Learning algorithms require the input to be standardized so that values of different variables can be compared reliably. For example, if variable one has a range of 3 to 1000 and variable two has a range of 1 to 3, the two cannot be compared directly, since the highest value of variable two (3) is the lowest value of variable one. This inconsistency in data values leads to low performance in Machine Learning algorithms. So, it is highly recommended to standardize (normalize) your data to have the same mean ($\mu$) and standard deviation ($\sigma$) before feeding it to Machine Learning algorithms. One very common standardization (normalization) in Machine Learning is the Z-Score:
$\Large z = \frac{x - \mu}{\sigma} $
where $ \mu = \frac{1}{n} \sum_{i=1}^n x_i$, $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$, and $n$ is the number of data points. This ensures that each variable has $\mu=0$ and $\sigma=1$.
This standardization should be applied to all features except the target (label). For the WPD.csv data set, the target is "OIL Prod. (e3m3/month)", which should not be changed.
The following Python code replaces 'Measured Depth (m)' with its z-score. This can be done for all variables.
from scipy.stats import zscore
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("Before standardization: Mean= ",int(mean), ", variance= ",int(variance))
df_im['Measured Depth (m)'] = zscore(df_im['Measured Depth (m)'])
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("After standardization: Mean= ",int(mean), ", variance= ",int(variance))
df_im[0:5]
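The same transformation can be applied to every numeric feature except the target in one step; a sketch using the zscore function imported above, assuming only the numeric feature columns should be standardized and the target is left unchanged:
target = 'OIL Prod. (e3m3/month)'
num_cols = [c for c in df_im.select_dtypes(include='number').columns if c != target] # Numeric feature columns only
df_z = df_im.copy()
df_z[num_cols] = df_z[num_cols].apply(zscore) # Standardize each feature column
df_z[0:5]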
The input data for Machine Learning should be completely numeric, which means any text must be converted to numbers. In the WPD.csv data set, the columns 'Deviation (True/False)' and 'Prod. Formation' have categorical values. Use the value_counts() function to find out what categories exist and how many rows belong to each category.
df_c=df_im.copy()
df_c['Deviation (True/False)'].value_counts()
df_c['Prod. Formation'].value_counts()
We can replace each category with an integer starting from 0.
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(False, 0) # Replace False with 0
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(True, 1) # Replace True with 1
df_c[0:5]
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale', 0) # Replace Shale with 0
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Sand', 1) # Replace Sand with 1
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale-Sand', 2) # Replace Shale-Sand with 2
df_c[0:5]
One challenge with this representation is that Machine Learning algorithms assume two nearby values are more similar than two distant values. This may be fine in some cases (ordered categories such as "bad", "average", "good"), but it is not always true. A common solution is to create one binary attribute per category. This is called one-hot encoding, because only one attribute is equal to 1 (hot) while the others are 0 (cold).
One_hot = pd.get_dummies(['a','b'],prefix='Deviation') # Toy example: one-hot encode the two categories 'a' and 'b'
print(One_hot)
For the real data, the dummy columns should be merged back into the data frame.
One_hot = pd.get_dummies(df_im['Deviation (True/False)'],prefix='Deviation')
print(One_hot[0:10]) # Just show the first 10
df_one_hot = pd.concat([df_im,One_hot],axis=1)
df_one_hot[0:5]
The same should be done for the 'Prod. Formation' column.
One_hot = pd.get_dummies(df_im['Prod. Formation'],prefix='Formation')
df_one_hot = pd.concat([df_one_hot,One_hot],axis=1)
df_one_hot[0:5]
Finally, the original columns 'Deviation (True/False)' and 'Prod. Formation' should be removed.
df_one_hot.drop(['Deviation (True/False)','Prod. Formation'], axis=1, inplace=True)
#df_one_hot
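Both categorical columns can also be encoded in a single call by passing them to get_dummies() via the columns argument, which drops the original columns automatically; an equivalent sketch:
df_one_hot_alt = pd.get_dummies(df_im, columns=['Deviation (True/False)', 'Prod. Formation'],
                                prefix=['Deviation', 'Formation'])
df_one_hot_alt[0:5]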
Machine Learning models should be evaluated on data they have never seen before. Models usually perform well on the data they were trained on, but the real goal of training is to perform well on completely new data. So, the data are split into a training set and a validation set: the model learns from the training data and is evaluated on the validation data, with the data divided according to some ratio. A common ratio is 80% training and 20% validation: the model is trained on 80% of the data and then validated against the remaining 20%. Based on the performance on the validation set, the Machine Learning algorithm is modified, and this process is repeated until the model's performance is satisfactory. This repetition can also lead to information leakage: the model may perform well on the validation set but not on new data. So, sometimes a portion of the data is held out for testing. We will talk about this in the next lectures.
The following code splits the data into 80% training and 20% validation set.
import numpy as np
df = df.reindex(np.random.permutation(df.index)) # Shuffle the data
df.reset_index(inplace=True, drop=True) # Reset index
MASK = np.random.rand(len(df)) < 0.8  # Boolean mask: True for roughly 80% of rows
df_train = pd.DataFrame(df[MASK])     # ~80% of rows for training
df_validation = pd.DataFrame(df[~MASK])  # Remaining ~20% for validation
print("Number of Training: "+ str(len(df_train)))
print("Number of Validation: "+ str(len(df_validation)))