Introduction
Python is a high-level, general-purpose programming language. It was created by Guido van Rossum in 1989 as a scripting language for administrative tasks and named after the television show Monty Python's Flying Circus; it was first released in 1991. A growing community of Python developers has since evolved it into a well-supported programming language. Python 3.0, released in 2008, was a significant revision of the language. Python has become a common programming language for machine learning and the primary language for TensorFlow. The Python Software Foundation (PSF), a non-profit organization, manages Python development. See the Python website at www.python.org
Installing Python¶
The Anaconda distribution of Python is recommended since it already contains many data science packages. Anaconda is free (though the download is large and takes time) and can be installed on work, school, or personal computers. Anaconda comes bundled with about 600 pre-installed packages. Download the latest version of Python 3 (as of August 2020, 3.8) from Anaconda.com/downloads and install it on your computer (just keep pressing Next).
Installing Jupyter and Required Modules¶
The Jupyter notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Jupyter notebooks allow Python and Markdown to coexist. We can also write LaTeX equations in a Jupyter notebook:
$ f'(x) = \lim_{a\to0} \frac{f(x+a) - f(x)}{a}. $
Jupyter notebooks will be used frequently in this course. Jupyter should be installed into your base Python environment. Go to the Windows Start menu, type Anaconda Prompt (Anaconda3), right-click on it and select Run as administrator to launch the anaconda console (see Figure below). Check whether pip (a package manager for Python packages, or modules) is installed by typing this command in the anaconda console and pressing enter: pip3 --version
Make sure that you have a recent version of pip installed. To upgrade the pip module, type this command in the anaconda console and press enter: pip3 install --upgrade pip
Now install and upgrade all required modules and their dependencies by typing this command in the anaconda console: pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
It may take a few minutes to complete. Try to import every module to check installation:
python -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"
There should be no errors and no output in the console after pressing enter.
Launch Jupyter Notebook¶
Install jupyter_nbextensions_configurator by typing this command in the anaconda console and press enter:
pip install jupyter_nbextensions_configurator
Type jupyter notebook in the anaconda console and press enter to open an empty workspace. Now create a new Python notebook by clicking on the New button and selecting the appropriate Python version (see numbers 1 and 2 in the Figure below).
A new notebook file called Untitled.ipynb is created in your workspace; this also starts a Jupyter Python kernel to run the notebook, and the notebook opens in a new tab. The notebook can be renamed to “Script” (this automatically renames the file to Script.ipynb) by clicking Untitled and typing the new name (see number 3 in the Figure below). A notebook is a list of cells. Each cell can include executable code or formatted text. When first launched, the notebook contains only one empty code cell, labeled In [1]: (see Figure below). Now type print("Hello world!") in the cell (see number 4 in the Figure below), and click on the play button (see number 5 in the Figure below) or press Shift-Enter. This sends the current cell to the notebook’s Python kernel, which runs it and returns the output. The result is displayed below the cell, and a new cell is automatically created since the end of the notebook has been reached (Aurélien Géron, 2019).
Launch Jupyter Notebook in a Specific Folder¶
If you want to launch a new or previously saved notebook in a specific directory, go to the Windows Start menu and type Anaconda Prompt (Anaconda3) to launch the anaconda console; you can pin Anaconda Prompt to the taskbar (see Figure below).
Then, if you want to change the drive, type the drive name followed by a colon. For example, to go from drive C to F, type F: (see Figure below). To go to a folder, type cd followed by the folder's name. For example, the Figure below shows how to launch a new notebook in drive F, folder Python.
Introduction to Python¶
Print Function, Strings and Variables¶
We start with printing "Hello World", as in most programming tutorials. The code below passes a constant string containing the text "Hello World" to the print function: a string is a series of characters surrounded by single or double quotes.
print('Hello World')
To explain what you are coding to readers of your code, you can include comments. In Python, the # symbol marks the start of a comment; the rest of the line is ignored when the program is run.
# When the program is run, this part has no effect.
print("Hello World") # print the string "Hello World"
Python also allows numbers as literal constants. An integer and a floating-point number are differentiated by a decimal point. For example, the value 88 is an integer; 88.0 is a floating-point number without a fractional part; 88.6 is a floating-point number with a fractional part. See the code below.
print(88)
print(88.0)
print(88.6)
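You can confirm the distinction with the built-in type() function:
print(type(88)) # <class 'int'>
print(type(88.0)) # <class 'float'>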
We have seen string values and numeric literals. Constants do not change when your program runs. We can also have variables that change as the program runs: variables are used to store values, and a variable's name references its value. The code below assigns an integer value to a variable "x" and a string value to another variable "y".
x = 3
y = "three"
print(x)
print(y)
3 three
You can rewrite the code above to reduce the length of your code using the semicolon ;.
x = 3; y = "three"; print(x); print(y) # using ";" to reduce number of lines in your code
3 three
The variables can change as you run your code.
x = 10; print(x)
x = x + 20; print(x) # add 20 to the variable with the assigned value 10
10 30
Strings and variables can be concatenated (combined) in Python. Variables holding numbers must be converted to strings with the str() function.
x = 'This course is '
y = 'Machine Learning – Energy Systems'
print(x+y)
This course is Machine Learning – Energy Systems
x = 'The value of y variable is '
y = 8
print(x+str(y))
The value of y variable is 8
x = 7; y = 10
print('x+y= '+str(x+y))
x+y= 17
Although Python has many other ways to print numbers, we will use the str() function in this course. The following code presents other ways of printing numbers:
x = 25
print('The value of x is ' + str(x)) # This is preferred in this course.
print(f'The value of x is {x}')
print('The value of x is {}'.format(x))
print('The value of x is %d' % (x))
The value of x is 25 The value of x is 25 The value of x is 25 The value of x is 25
Lists, Tuples and Dictionaries¶
Similar to most modern programming languages, Python contains built-in data structures. In this section, each of these collection types will be discussed in more detail.
Lists and Tuples¶
A Python list stores a series of items in a particular order. Many programming languages include a data collection called an array. There is no built-in array type in Python; however, a list can easily be used in place of an array. Arrays in most programming languages require a maximum length to be assigned ahead of time, a restriction that leads to array-overrun (overflow) bugs. A Python list does not require a declared length, and the program can dynamically alter the size of a list. Lists allow programmers to store sets of information in one place, whether a few items or millions of items. Lists are one of Python's most powerful features.
Tuples are similar to lists. The only difference is that the items in a tuple can't be modified: a tuple is immutable, which means the program cannot change it. It is possible to get by as a programmer using only lists and ignoring tuples. One advantage of tuples over lists is that tuples may be a little faster to iterate over than lists.
To make a list, use square brackets [], and use commas , to separate individual items in the list.
Lists = ['a', 'b', 'c', 'd']
print(Lists)
['a', 'b', 'c', 'd']
Individual elements in a list are accessed by index. The index of the first element is 0, the index of the second element is 1, and so forth. Negative indices refer to the items at the end of the list. To get a specific element, write the name of the list and then the index of the element in square brackets.
$\begin{array}{lcccc} \text{List} = & [\,\text{'a'}, & \text{'b'}, & \text{'c'}, & \text{'d'}\,] \\ \text{Index:} & 0 & 1 & 2 & 3 \\ \text{Index:} & -4 & -3 & -2 & -1 \end{array}$
Lists[-1] # Get last element
'd'
Lists[0] # Get first element
'a'
You can change individual elements in a list by referring to the index of the item you want to change.
Lists[0] = 'a_changed' # Change first element to a_changed
Lists[2] = 'c_changed' # Change third element to c_changed
print(Lists)
['a_changed', 'b', 'c_changed', 'd']
The items in a tuple can't be modified; attempting to do so raises a TypeError. Tuples are enclosed with parentheses ().
Tuples = ('a', 'b', 'c', 'd')
print(Tuples)
('a', 'b', 'c', 'd')
#Tuples[0] = 'a_changed' # Try to change first element of tuples but it gives TypeError
Tuples do not allow the program to add any element after definition. This is possible with lists: you can add elements to the end (using the append() function) or at any position you like (using the insert() function) in a list.
# Adding an element to the end of the list
Lists.append('added_element')
print(Lists)
['a_changed', 'b', 'c_changed', 'd', 'added_element']
# Making a list with an empty list
Lists = []
Lists.append('First')
Lists.append('Second')
Lists.append('Third')
print(Lists)
['First', 'Second', 'Third']
The programmer must specify an index for the insert function.
# Insert
Lists = ['a', 'b', 'c',]
Lists.insert(2, 'a2')
print(Lists)
['a', 'b', 'a2', 'c']
Elements can be removed by the value of the item (using the remove() function), or by their position in a list (using the del statement). Python removes only the first matching item if you remove an item by its value.
Lists.remove('a2') # Remove by value
print(Lists)
['a', 'b', 'c']
del Lists[0] # Remove at index
print(Lists)
['b', 'c']
If a list contains numerical data, there are a number of simple statistics you can apply.
ages_years = [92, 51, 78, 12, 63, 82, 65, 51, 4, 35]
youngest = min(ages_years) # Find the minimum value
print('The youngest age is '+str(youngest))
#
oldest = max(ages_years) # Find the maximum value
print('The oldest age is '+str(oldest))
#
total_years = sum(ages_years)
print('The total years are '+str(total_years))
The youngest age is 4 The oldest age is 92 The total years are 533
You can work on a portion of a list, which is called a slice. To slice a list, start with the index of the first item you want, then add a colon and the index after the last item you want. To start at the beginning of the list, leave off the first index. To slice through the end of the list, leave off the last index.
# Get the three items from index 1 to 3
lists = ['a', 'b', 'c', 'd', 'e', 'f']
middle_three = lists[1:4]
print(middle_three)
['b', 'c', 'd']
# Get the first four items
first_four = lists[:4]
print(first_four)
['a', 'b', 'c', 'd']
# Get the last three items
last_three = lists[-3:]
print(last_three)
['d', 'e', 'f']
You can find the index of a value in a list using the index() function, which returns the lowest matching index in the list:
index=lists.index('a')
print(index)
0
If the value is not in the list, it raises a ValueError.
#index=lists.index('k')
# What lowest index means
lists_1 = ['a', 'b', 'a', 'd']
index=lists_1.index('a')
print(index)
0
Dictionaries¶
Python provides dictionaries that allow you to connect pieces of related information. Each piece of information in a dictionary is stored as a key-value pair. When a key is provided, Python returns the value associated with that key.
To make a dictionary, use curly braces {}. Use colons ":" to connect keys and values, and commas "," to separate individual key-value pairs.
Dic = {'name': 'Tom', 'midterm': 87,'Final':95 ,'grade':'A'}
print(Dic['name'])
print(Dic['Final'])
Tom 95
# Get the value of a key with get()
print(Dic.get('name'))
print(Dic.get('Final'))
Tom 95
# All of the keys of dictionary
print(Dic.keys())
dict_keys(['name', 'midterm', 'Final', 'grade'])
# All of the values of dictionary
print(Dic.values())
dict_values(['Tom', 87, 95, 'A'])
We can have a list for each key in a dictionary. The following code shows a list of five values for each key.
name=['Tom', 'Liam', 'Jared', 'Alex', 'John']
grade=['A','A+','A-','B+','A']
Dic_l = {'name': name,'grade':grade}
i=0
print('Name= ',Dic_l['name'][i],', grade= ',Dic_l['grade'][i])
Name= Tom , grade= A
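To print every student rather than just the first, we can loop over the indices (a small extension of the cell above):
for i in range(len(Dic_l['name'])):
    print('Name= ',Dic_l['name'][i],', grade= ',Dic_l['grade'][i])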
Arrays¶
As mentioned, Python lists serve the purpose of arrays. However, lists can be very slow to process for large data sets. Arrays are used very frequently in data science, where speed and resources are very important. NumPy is a Python library for working with arrays. NumPy is faster than lists because NumPy arrays are stored in one contiguous place in memory, unlike lists, so they can be processed and manipulated very efficiently.
First you need to install NumPy and import it into your applications with the import keyword. You can create an alias with the as keyword while importing:
import numpy as np
list_=[1, 2, 3, 4, 5, 6, 7, 8]
arr = np.array(list_)
print(arr)
[1 2 3 4 5 6 7 8]
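To see the speed difference in practice, here is a minimal timing sketch (the exact numbers vary by machine):
import timeit
import numpy as np
lst = list(range(1000000))
arr = np.arange(1000000)
print('list sum :', timeit.timeit(lambda: sum(lst), number=10)) # Python loops over every element
print('numpy sum:', timeit.timeit(lambda: arr.sum(), number=10)) # vectorized over contiguous memory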
Dimensions¶
0-D Arrays (Scalars)
0-D Arrays are Scalars. Each value in an array is a 0-D array.
arr = np.array(82)
print(arr)
82
1-D Arrays (Vectors)
A uni-dimensional or 1-D array is an array that has 0-D arrays (values) as its elements.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(arr)
[1 2 3 4 5 6 7 8]
2-D Arrays (Matrices)
A 2-D array is an array that has 1-D arrays as its elements. 2-D arrays are often used to represent matrices or second-order tensors.
arr = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 8, 10, 11, 12]]) # This is 2-D matrix that has two rows and 6 columns
print(arr)
[[ 1 2 3 4 5 6] [ 7 8 8 10 11 12]]
3-D Arrays
A 3-D array is an array that has 2-D arrays (matrices) as its elements. 3-D arrays are often used to represent third-order tensors.
import numpy as np
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(arr)
[[[1 2] [3 4]] [[5 6] [7 8]]]
The ndim attribute tells how many dimensions the array has.
arr.ndim
3
The shape of an array can be obtained from the shape attribute, which returns a tuple giving the number of elements along each dimension.
arr.shape
(2, 2, 2)
Slicing 1-D arrays is similar to slicing lists:
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-1])
7
Slicing 2-D arrays:
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0, :2])
[1 2]
We can also reshape an array from one dimension to another using the reshape() function:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(arr.shape)
new_arr = arr.reshape(4, 2)
print(new_arr.shape)
(8,) (4, 2)
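As a convenience, one of the dimensions passed to reshape can be -1 and NumPy infers it from the array's size; a short sketch:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(arr.reshape(2, -1).shape) # (2, 4): the second dimension is inferred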
Searching a Value in Arrays
You can find all indices of a value in an array with np.where():
arr = np.array([3, 5, 3, 2, 3, 8, 7, 9])
idx = np.where(arr==3)
print(idx)
(array([0, 2, 4], dtype=int64),)
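np.where() returns indices; if you want the matching values themselves, boolean masking is a common companion (a minimal sketch using the same array):
print(arr[arr == 3]) # [3 3 3]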
Generate Random Numbers¶
Random means that the data cannot be predicted logically. The code below generates a random integer from 10 (inclusive) to 50 (exclusive) with NumPy. First you need to import NumPy's random module.
from numpy import random
random.seed(89)
rand = random.randint(10,50)
print(rand)
29
The random module's rand() method returns a random float between 0 and 1.
#random.seed(45)
rand = random.rand()
print(rand)
0.38783079576445467
The following code generates a 1-D array containing 100 random integers from 10 to 50:
rand = random.randint(10,50, size=(100))
print(rand)
[32 47 18 28 43 36 49 40 34 10 33 32 16 32 43 48 17 32 41 29 27 39 27 15 17 41 39 26 37 23 33 23 39 43 14 40 38 22 19 28 34 22 22 25 13 37 19 20 10 12 33 23 45 11 25 46 44 38 30 25 37 43 47 40 46 36 45 25 45 49 10 39 25 45 31 25 49 19 34 30 36 31 30 25 43 31 34 39 13 25 26 33 29 39 20 47 46 30 34 44]
The following code generates a 2-D array with 2 rows, each containing 4 random integers from 10 to 50:
rand = random.randint(10,50, size=(2,4))
print(rand)
[[25 18 26 14] [31 27 40 48]]
The code below generates a 1-D array containing 5 random floats:
rand = random.rand(5)
print(rand)
[0.99920205 0.23730174 0.89028427 0.93641836 0.75591492]
A 2-D array of floats can also be generated:
rand = random.rand(2, 3)*20
print(rand)
[[19.45507821 13.80481058 0.86738014] [ 1.59536119 13.60122857 5.71961611]]
If Statements and Loops¶
If statements are used to test for particular conditions and respond appropriately. The conditional tests are:
not equal: x != 82
equals: x == 82
greater than: x > 82
greater than or equal to: x >= 82
less than: x < 82
less than or equal to: x <= 82
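Conditional tests can also be combined with and/or, and comparisons can be chained; a short sketch (the values are illustrative):
x = 82
print(x >= 82 and x % 2 == 0) # True: both tests pass
print(x < 0 or x > 100) # False: neither test passes
print(10 < x <= 100) # True: chained comparison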
If statements require appropriate indentation to define blocks of code that execute together. A block usually begins after a colon and includes any lines at the same level of indentation. Unlike many other programming languages, Python uses whitespace to define blocks of code. One of the main annoyances for new Python programmers is managing this whitespace.
# If-elif-else statements
val = 20
if val<20:
    print('The variable is less than 20')
elif val>20:
    print('The variable is greater than 20')
else:
    print('The variable is equal to 20')
The variable is equal to 20
To loop over a range of numbers in Python, the range function can be used. The following code shows a for loop with the range function: range(6) loops over 0 through 5, while range(1, 6) loops over 1 through 5.
for number in range(6):
    print(number)
0 1 2 3 4 5
for number in range(1, 6):
    print('Iteration '+str(number)) # looped printing of string and number.
Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5
# Use loop, range and if to generate a list of square numbers for even numbers
sqs= []
for x in range(1, 11):
    if (x%2==0): # A number is even if division by 2 gives a remainder of 0
        sq= x**2
        sqs.append(sq)
print(sqs)
[4, 16, 36, 64, 100]
# Looping over a list
add=0
for x in sqs:
    add+=x
    print("Add "+str(x)+", summing number so far= "+str(add) )
Add 4, summing number so far= 4 Add 16, summing number so far= 20 Add 36, summing number so far= 56 Add 64, summing number so far= 120 Add 100, summing number so far= 220
We can use a Python list comprehension to generate a list; it is equivalent to a for loop.
number = [x for x in range(1, 6)]
print(number)
[1, 2, 3, 4, 5]
# Use comprehension to generate a list of square numbers for even numbers
sqs = [x**2 for x in range(1, 11) if (x%2==0)]
print(sqs)
[4, 16, 36, 64, 100]
Zip and Enumerate¶
Two lists can be combined into a single list by the zip command.
To see the results of the zip function, we can convert the returned zip object into a list. As shown below, the zip function returns pairs of items as tuples, with the order in the two lists maintained.
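For example (a minimal illustration):
x = [12,18,25,33,37]
y = [1,2,3,4,5]
print(list(zip(x,y))) # [(12, 1), (18, 2), (25, 3), (33, 4), (37, 5)]
We can also iterate over the zipped pairs directly: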
x = [12,18,25,33,37]
y = [1,2,3,4,5]
for x,y in zip(x,y):
    print('Print x - y= ',x-y)
Print x - y= 11 Print x - y= 16 Print x - y= 22 Print x - y= 29 Print x - y= 32
You can use the enumerate command to get the index location of each element.
x = [7, 50, 75, 99, 32, 49, 67, 58]
for i, j in enumerate(x):
    print('index= '+str(i)+', value= '+str(j))
index= 0, value= 7 index= 1, value= 50 index= 2, value= 75 index= 3, value= 99 index= 4, value= 32 index= 5, value= 49 index= 6, value= 67 index= 7, value= 58
The following program changes every element of the list less than 50 to 0, and also builds a filtered list. The enumerate command makes it possible to know which element index the loop is currently on, so the value at that index can be changed.
x = [7, 50, 75, 99, 32, 49, 67, 58]
new_list=[]
for i, j in enumerate(x):
    if j<50:
        x[i] = 0
    else:
        new_list.append(x[i])
print('List for less than 50 to 0= ',x)
print('Filter list for less than 50= ',new_list)
List for less than 50 to 0= [0, 50, 75, 99, 0, 0, 67, 58] Filter list for less than 50= [50, 75, 99, 67, 58]
The enumerate command can be used to build a dictionary. This approach is commonly used to build an index to column names.
colmn = ['col1','col2', 'col3', 'col4']
Dic = {key:value+1 for (value,key) in enumerate(colmn)}
print(Dic)
{'col1': 1, 'col2': 2, 'col3': 3, 'col4': 4}
We saw that the index() function only returns the lowest index, but how can we get all indices of a value in a list? We can use enumerate and a comprehension:
list_=[1,3,1,9,45,85,1,69,3,4,2,1,9,3,7,2]
# Get lowest index
idx=list_.index(1)
print('Lowest index= ',idx)
# Get all indices using enumerate and comprehension
inds = [idx for idx, n in enumerate(list_) if n == 1]
print('All indices= ',inds)
Lowest index= 0 All indices= [0, 2, 6, 11]
Reading and Writing a File¶
To read from a file, your code should open the file and then read the contents. It is possible to read the entire contents of the file at once, or read the file line by line.
# Reading an entire file at once
filename = './Data/Wells.txt'
with open(filename) as f_cont:
    contents = f_cont.read()
print(contents)
Information of some energy wells in Alberta Easting(km) Northing(km) Depth(m) Spudded Month BH Temp(degC) Geo. Formation OIL Prod(m3) 310 832 1518 467 41 Dgrn_wash 46581 570 204 883 303 30 Kglauc_ss 697 588 511 821 33 NA Kdina 220 522 167 515.1 523 220 Kbelly_rv 0 608 132 369 253 NA Kmilk_rv 0 521 167 1080 191 34 Ksunburst 8479 503 370 1152 298 35 Kellrslie 14 573 896 990 486 30 NA NA 248 461 2350 183 NA NA NA 610 141 473 493 20 Kmilk_rv 0
When reading line by line, each line has a newline character at the end. The rstrip() function eliminates the extra blank lines.
# Reading line by line
with open(filename) as f_cont:
    for line in f_cont:
        print(line.rstrip())
Information of some energy wells in Alberta Easting(km) Northing(km) Depth(m) Spudded Month BH Temp(degC) Geo. Formation OIL Prod(m3) 310 832 1518 467 41 Dgrn_wash 46581 570 204 883 303 30 Kglauc_ss 697 588 511 821 33 NA Kdina 220 522 167 515.1 523 220 Kbelly_rv 0 608 132 369 253 NA Kmilk_rv 0 521 167 1080 191 34 Ksunburst 8479 503 370 1152 298 35 Kellrslie 14 573 896 990 486 30 NA NA 248 461 2350 183 NA NA NA 610 141 473 493 20 Kmilk_rv 0
The advantage of reading line by line is that you can filter or manipulate the data while reading. This is very efficient for reading very large datasets. For example, the following code skips two lines.
# Reading line by line
with open(filename) as f_cont:
    for i in range(2):
        next(f_cont) # Skip two lines
    for line in f_cont:
        print(line.rstrip())
310 832 1518 467 41 Dgrn_wash 46581 570 204 883 303 30 Kglauc_ss 697 588 511 821 33 NA Kdina 220 522 167 515.1 523 220 Kbelly_rv 0 608 132 369 253 NA Kmilk_rv 0 521 167 1080 191 34 Ksunburst 8479 503 370 1152 298 35 Kellrslie 14 573 896 990 486 30 NA NA 248 461 2350 183 NA NA NA 610 141 473 493 20 Kmilk_rv 0
To get a specific column, you should split the line with the split() function. The following code shows how to get the depths of the wells (column 3). Note that Python reads the data as strings; if you want to do any math on the data, you should convert it to a number with the float() function.
# Reading line by line
Depth_km=[]
with open(filename) as f_cont:
    for i in range(2): next(f_cont) # Skip two lines
    for line in f_cont:
        p = line.split() # Split line
        Depth_m = float(p[2]) # Get depth value which is index 2 (column 3) and convert to number
        Depth_km.append(Depth_m/1000)
print('Depth in km',Depth_km)
Depth in km [1.518, 0.883, 0.821, 0.5151, 0.369, 1.08, 1.152, 0.99, 2.35, 0.473]
Now if you want to write a list to a file, you should pass the 'w' argument to open(), which tells Python to write to the file. Be careful; this will erase the contents of the file if it already exists. Passing the 'a' argument tells Python you want to append to the end of an existing file.
filename = 'Output.txt'
with open(filename, 'w') as f:
    for i in range(len(Depth_km)):
        if(i==0):
            f.write('Depth of wells in km\n') # Make a header for your output file
        f.write(str(Depth_km[i])+'\n')
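As a quick check (assuming Output.txt was written as above), the file can be read back:
with open('Output.txt') as f:
    print(f.read())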
You can also import an Excel file in Python, but first you should import xlrd. The following code shows how to read data from a sheet of the Excel file "Wells.xlsx".
import xlrd
book = xlrd.open_workbook('./Data/Wells.xlsx')
# Select excel sheet starts with 0
sheet_num=0
sheet = book.sheet_by_index(sheet_num)
# Get number of rows and columns
rows = sheet.nrows
columns = sheet.ncols
print('Rows=',rows,'Columns=',columns)
# Read row and column starts with 0
row=2
column=0
val=sheet.cell_value(row, column)
print('The value of row '+str(row+1)+ ' and column '+str(column+1)+' is',val)
Rows= 12 Columns= 7 The value of row 3 and column 1 is 310.6626
# Read the entire selected column 2
column=1
for i in range(rows-1):
    print(sheet.cell_value(i+1, column) )
Northing(km) 832.7453 204.4554 511.89459999999997 167.9459 132.09210000000002 167.159 370.2815 89.6165 461.6372 141.0724
# Read 4th row of sheet 2
current_row=3
for i in range(sheet.ncols):
    if(i==0): print('Reading row '+str(current_row+1))
    print(sheet.cell_value(current_row,i))
Reading row 4 570.3494000000001 204.4554 883.0 303.0 30.0 Kglauc_ss 697.0
Reading and writing data, data analysis, and data manipulation and processing are easier with Pandas, an open-source library for Python. We will discuss it in the coming classes.
Functions and Lambdas¶
Functions and lambdas are more advanced ways to structure processing. Here, these techniques are introduced with some examples and will be discussed more in later classes.
Functions¶
Functions are named blocks of code designed to do one specific job. A function only runs when it is called. Functions are used in all programming languages. The main aim of writing functions is to avoid long code (avoid repetition) and to have clean, well-organized code. A function definition starts with def followed by the function name, parentheses (), and a colon :.
# A simple function
def greet():
"""Greeting.""" # Give a little information about the fucntion
print("Hello World!")
greet()
Hello World!
# Passing an argument
def greet(text):
"""Display greeting."""
print(text)
mytext= 'Hi, how are you doing!'
greet(mytext)
Hi, how are you doing!
# Making default values for parameters
def pizza_order(type_='chicken',topping='Mushroom',size='Large'):
"""Order pizza."""
print("Type of pizza: " + type_)
print("Topping : " + topping)
print("Size: " + size)
#pizza_order()
pizza_order(type_='beef',topping='olives')
Type of pizza: beef Topping : olives Size: Large
# Returning a value
def multiply_numbers(x, y):
    """Multiply two input parameters."""
    value=x*y
    return value
multiply_numbers(20,10)
200
Filter¶
The built-in filter() function filters data based on a rule, given as a function that returns True or False.
def positive(x):
    return x>0
mylist=[-10,4,-2,36,6,-8,12]
filtered=list(filter(positive, mylist))
print(filtered)
[4, 36, 6, 12]
Lambda¶
It is tedious to write a named function for a task as simple as finding values greater than 0. A lambda function, which is an unnamed (anonymous) function, is more concise.
mylist=[-10,4,-2,36,6,-8,12]
filtered=list(filter(lambda i: i>0, mylist))
print(filtered)
[4, 36, 6, 12]
Data visualization¶
One advantage of Python compared with other programming languages is its capability for excellent data visualization. The matplotlib package is extremely flexible for making interesting representations of the data. You should install matplotlib first (see previous lectures), then import it before use. Here are a few examples of data visualization.
# Make a graph with Plot() function
import matplotlib.pyplot as plt
x = [-30,-20, -10, 0, 10, 20, 30]
x_squared = [900, 400, 100, 0, 100, 400, 900]
plt.plot(x, x_squared)
plt.show()
# Make a graph with Plot() function
x = list(range(-30,30))
x_squared = [i**2 for i in x] # use comprehension to square your values
plt.plot(x, x_squared)
plt.show()
# Make a graph with Scatter() function
x = list(range(-30,30))
x_squared = [i**2 for i in x]
plt.scatter(x, x_squared)
plt.title("Plot for Squared Numbers", fontsize=15) # Title of your plot
plt.xlabel("Value", fontsize=12) # Label of your x axis
plt.ylabel("Squared Value", fontsize=12) # Label of your y axis
plt.show()
# Make a location map with scatter()
x_ = [50,220,986,830,680,357,773,935,400]
y_ = [269,789,30,99,698,947,875,624,400]
value_ = [0.368,0.051,0.224,0.198,0.287,0.329,0.094,0.187,0.33]
plt.scatter(x_, y_,c=value_,s=105,cmap='jet', edgecolor='k')
plt.colorbar()
plt.xlim((0, 1000))
plt.ylim((0, 1000))
plt.title("Location Map", fontsize=18) # Title of your plot
plt.xlabel("Easting (m)", fontsize=14) # Label of your x axis
plt.ylabel("Northing (m)", fontsize=14) # Label of your y axis
plt.savefig('Figure.png', dpi=200, bbox_inches='tight') # Save the plot
plt.show()
# Make histogram
plt.hist(x_squared, bins=8,ec='black')
plt.title("Histogram ", fontsize=15) # Title of your plot
plt.xlabel("Squared Value", fontsize=12) # Label of your x axis
plt.ylabel("Frequency", fontsize=12) # Label of your y axis
plt.show()
You can include as many individual graphs as you like in one figure.
fig = plt.subplots(figsize=(10,3), dpi= 200, facecolor='w')
ax1=plt.subplot(1,2,1)
plt.scatter(x_, y_,c=value_,s=105,cmap='jet', edgecolor='k')
plt.colorbar()
plt.title("Location Map", fontsize=15)
plt.xlabel("Easting (m)", fontsize=12)
plt.ylabel("Northing (m)", fontsize=12)
plt.xlim((0, 1000))
plt.ylim((0, 1000))
ax2=plt.subplot(1,2,2)
plt.hist(x_squared, bins=15,ec='black')
plt.title("Histogram ", fontsize=15)
plt.xlabel("Squared Value", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.savefig('Subplot.png', dpi=200, bbox_inches='tight') # Save the plot
plt.show()
Introduction to Pandas¶
Pandas is an open-source Python package for data manipulation and analysis with high-performance and easy-to-use methods. It is based on the dataframe concept in the R programming language. Pandas can be used as the primary tool for data processing, data manipulation, and data cleaning.
We will use a dataset for oil production of 1000 hydrocarbon wells in Alberta (WPD.csv). The target is oil production prediction using well properties including location, depth, deviation, porosity, permeability, and so on. The well properties were retrieved from geoSCOUT, but the locations of the wells have been changed and some key properties manipulated for confidentiality reasons. This data was retrieved and generated for this course to learn data processing and apply Machine Learning. The predictions should NOT be used in any publication because the original data has been modified.
Look at Data¶
# Reading data in Pandas
import pandas as pd # import Pandas
df = pd.read_csv("./Data/WPD.csv") # Read data file
#df.head()
df[0:5] # Display data
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | NaN | 114.3 | NaN | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | NaN | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | NaN | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | NaN | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | NaN | NaN | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
The columns attribute gives the names of the columns in your data file.
column=list(df.columns)
print(column)
['X Coordinate', 'Y Coordinate', 'Measured Depth (m)', 'Deviation (True/False)', 'Surface-Casing Weight (kg/m)', 'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)', 'Bore. Temp. (degC)', 'Prod. Formation', 'Porosity (fraction)', 'Permeability (Darcy)', 'OIL Prod. (e3m3/month)']
The info() function gives some information about the data types (string, float) and null (missing) values.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 1000 non-null float64 1 Y Coordinate 1000 non-null int64 2 Measured Depth (m) 1000 non-null float64 3 Deviation (True/False) 1000 non-null bool 4 Surface-Casing Weight (kg/m) 593 non-null float64 5 Production-Casing Size (mm) 913 non-null float64 6 Production-Casing Weight (kg/m) 547 non-null float64 7 Bore. Temp. (degC) 801 non-null float64 8 Prod. Formation 1000 non-null object 9 Porosity (fraction) 1000 non-null float64 10 Permeability (Darcy) 1000 non-null float64 11 OIL Prod. (e3m3/month) 1000 non-null float64 dtypes: bool(1), float64(9), int64(1), object(1) memory usage: 87.0+ KB
The describe() function gives some statistical analysis for each column whose data type is not string.
df.describe()
X Coordinate | Y Coordinate | Measured Depth (m) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 593.000000 | 913.000000 | 547.000000 | 801.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 432.893400 | 489.749000 | 1364.379300 | 42.906745 | 142.495290 | 23.260512 | 40.774032 | 0.141840 | 5.269140 | 60.003280 |
std | 172.206682 | 243.547912 | 855.633567 | 25.877438 | 25.953205 | 7.137579 | 21.256825 | 0.074501 | 8.665687 | 234.814659 |
min | 12.300000 | 36.000000 | 276.000000 | 14.100000 | 13.700000 | 10.800000 | 5.000000 | 0.000000 | 0.050000 | 0.000000 |
25% | 325.425000 | 339.750000 | 704.000000 | 35.700000 | 114.300000 | 17.300000 | 27.000000 | 0.089750 | 1.040000 | 0.000000 |
50% | 487.250000 | 480.000000 | 1066.000000 | 35.700000 | 139.700000 | 20.800000 | 33.000000 | 0.141000 | 2.480000 | 4.715000 |
75% | 581.675000 | 596.000000 | 1790.000000 | 48.100000 | 177.800000 | 25.300000 | 53.000000 | 0.192000 | 5.752500 | 33.117500 |
max | 648.800000 | 1172.000000 | 6363.000000 | 595.000000 | 219.100000 | 53.600000 | 255.000000 | 0.400000 | 94.510000 | 3425.530000 |
Data Cleaning¶
Missing (NaN) Values¶
It is ideal to have a valid value for every row and column. In reality, missing values are one of the main challenges faced in machine learning. There are several approaches in practice to deal with missing values.
Option 1: Drop Column
One naive approach is to remove columns that have missing values with the drop() function.
Select_column=column[3:8] # Select columns that have missing values
Select_column
['Deviation (True/False)', 'Surface-Casing Weight (kg/m)', 'Production-Casing Size (mm)', 'Production-Casing Weight (kg/m)', 'Bore. Temp. (degC)']
df_drop_c=df.copy() # Copy to avoid modifying the original data
df_drop_c.drop(Select_column,axis=1, inplace=True) # Drop selected 5 columns
df_drop_c
X Coordinate | Y Coordinate | Measured Depth (m) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | Sand | 0.158 | 5.25 | 17.53 |
... | ... | ... | ... | ... | ... | ... | ... |
995 | 92.5 | 767 | 1466.0 | Sand | 0.230 | 6.26 | 8.74 |
996 | 548.3 | 475 | 749.8 | Shale | 0.017 | 0.48 | 0.00 |
997 | 593.4 | 456 | 683.7 | Shale-Sand | 0.332 | 49.75 | 832.45 |
998 | 540.9 | 328 | 1346.0 | Shale | 0.111 | 0.59 | 0.00 |
999 | 617.8 | 377 | 844.0 | Shale-Sand | 0.292 | 9.95 | 153.43 |
1000 rows × 7 columns
aa=df_drop_c.drop(1,axis=0)
aa
X Coordinate | Y Coordinate | Measured Depth (m) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | Shale-Sand | 0.224 | 12.25 | 108.75 |
2 | 37.7 | 788 | 3984.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | Sand | 0.158 | 5.25 | 17.53 |
5 | 127.5 | 1125 | 1560.6 | Shale | 0.000 | 0.10 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... |
995 | 92.5 | 767 | 1466.0 | Sand | 0.230 | 6.26 | 8.74 |
996 | 548.3 | 475 | 749.8 | Shale | 0.017 | 0.48 | 0.00 |
997 | 593.4 | 456 | 683.7 | Shale-Sand | 0.332 | 49.75 | 832.45 |
998 | 540.9 | 328 | 1346.0 | Shale | 0.111 | 0.59 | 0.00 |
999 | 617.8 | 377 | 844.0 | Shale-Sand | 0.292 | 9.95 | 153.43 |
999 rows × 7 columns
df_drop_c
X Coordinate | Y Coordinate | Measured Depth (m) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | Sand | 0.158 | 5.25 | 17.53 |
... | ... | ... | ... | ... | ... | ... | ... |
995 | 92.5 | 767 | 1466.0 | Sand | 0.230 | 6.26 | 8.74 |
996 | 548.3 | 475 | 749.8 | Shale | 0.017 | 0.48 | 0.00 |
997 | 593.4 | 456 | 683.7 | Shale-Sand | 0.332 | 49.75 | 832.45 |
998 | 540.9 | 328 | 1346.0 | Shale | 0.111 | 0.59 | 0.00 |
999 | 617.8 | 377 | 844.0 | Shale-Sand | 0.292 | 9.95 | 153.43 |
1000 rows × 7 columns
df_drop_c.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 1000 non-null float64 1 Y Coordinate 1000 non-null int64 2 Measured Depth (m) 1000 non-null float64 3 Prod. Formation 1000 non-null object 4 Porosity (fraction) 1000 non-null float64 5 Permeability (Darcy) 1000 non-null float64 6 OIL Prod. (e3m3/month) 1000 non-null float64 dtypes: float64(5), int64(1), object(1) memory usage: 54.8+ KB
This approach is not reliable since all the valid (non-missing) information in those 5 columns has been removed from the data set.
Option 2: Drop Row
Another approach is to remove rows that have missing values with the dropna() function.
df_drop_=df.copy() # Copy to avoid modifying the original data
df_drop_r=df_drop_.dropna() # Drop rows with missing values
df_drop_r
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 636.9 | 462 | 713.0 | True | 48.1 | 177.8 | 25.3 | 27.0 | Shale | 0.128 | 0.70 | 0.00 |
7 | 556.2 | 441 | 761.0 | False | 35.7 | 139.7 | 20.8 | 28.0 | Sand | 0.055 | 0.36 | 0.00 |
8 | 572.0 | 521 | 587.0 | False | 25.3 | 114.3 | 14.1 | 15.0 | Shale | 0.033 | 0.46 | 0.00 |
10 | 366.2 | 385 | 1801.0 | False | 35.7 | 139.7 | 25.3 | 52.0 | Shale-Sand | 0.147 | 3.15 | 5.80 |
14 | 648.7 | 275 | 920.0 | False | 35.7 | 139.7 | 20.8 | 30.0 | Shale | 0.104 | 0.38 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
989 | 44.1 | 787 | 1895.0 | False | 35.7 | 139.7 | 23.1 | 66.0 | Shale | 0.144 | 11.73 | 22.01 |
990 | 487.2 | 81 | 1033.0 | True | 35.7 | 139.7 | 20.8 | 30.0 | Shale | 0.144 | 0.99 | 0.00 |
991 | 607.4 | 526 | 668.0 | True | 48.1 | 177.8 | 25.3 | 14.0 | Shale-Sand | 0.000 | 0.45 | 0.00 |
995 | 92.5 | 767 | 1466.0 | False | 35.7 | 139.7 | 20.8 | 51.0 | Sand | 0.230 | 6.26 | 8.74 |
999 | 617.8 | 377 | 844.0 | False | 35.7 | 139.7 | 20.8 | 27.0 | Shale-Sand | 0.292 | 9.95 | 153.43 |
426 rows × 12 columns
df_drop_r.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 426 entries, 6 to 999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 426 non-null float64 1 Y Coordinate 426 non-null int64 2 Measured Depth (m) 426 non-null float64 3 Deviation (True/False) 426 non-null bool 4 Surface-Casing Weight (kg/m) 426 non-null float64 5 Production-Casing Size (mm) 426 non-null float64 6 Production-Casing Weight (kg/m) 426 non-null float64 7 Bore. Temp. (degC) 426 non-null float64 8 Prod. Formation 426 non-null object 9 Porosity (fraction) 426 non-null float64 10 Permeability (Darcy) 426 non-null float64 11 OIL Prod. (e3m3/month) 426 non-null float64 dtypes: bool(1), float64(9), int64(1), object(1) memory usage: 40.4+ KB
This approach is also not recommended because the number of data points (rows) has decreased from 1000 to 426.
Option 3: Replace with Median (mean)
A common practice is to replace missing values with the median (or mean) value for that column. The median is the middle value of a sorted list: 50% of values are smaller and 50% are bigger. See Median. The following code replaces all missing values in the 5 columns with the median value of each column.
df_im=df.copy()
# Get the median of each variable and replace the missing values with median
for i in range(len(Select_column)):
    P50 = df[Select_column[i]].median() # Calculate median of each column
    df_im[Select_column[i]] = df_im[Select_column[i]].fillna(P50) # replace NaN with median(P50)
df_im
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 92.5 | 767 | 1466.0 | False | 35.7 | 139.7 | 20.8 | 51.0 | Sand | 0.230 | 6.26 | 8.74 |
996 | 548.3 | 475 | 749.8 | False | 35.7 | 114.3 | 20.8 | 28.0 | Shale | 0.017 | 0.48 | 0.00 |
997 | 593.4 | 456 | 683.7 | False | 35.7 | 177.8 | 20.8 | 26.0 | Shale-Sand | 0.332 | 49.75 | 832.45 |
998 | 540.9 | 328 | 1346.0 | True | 25.3 | 114.3 | 14.1 | 33.0 | Shale | 0.111 | 0.59 | 0.00 |
999 | 617.8 | 377 | 844.0 | False | 35.7 | 139.7 | 20.8 | 27.0 | Shale-Sand | 0.292 | 9.95 | 153.43 |
1000 rows × 12 columns
df_im.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X Coordinate 1000 non-null float64 1 Y Coordinate 1000 non-null int64 2 Measured Depth (m) 1000 non-null float64 3 Deviation (True/False) 1000 non-null bool 4 Surface-Casing Weight (kg/m) 1000 non-null float64 5 Production-Casing Size (mm) 1000 non-null float64 6 Production-Casing Weight (kg/m) 1000 non-null float64 7 Bore. Temp. (degC) 1000 non-null float64 8 Prod. Formation 1000 non-null object 9 Porosity (fraction) 1000 non-null float64 10 Permeability (Darcy) 1000 non-null float64 11 OIL Prod. (e3m3/month) 1000 non-null float64 dtypes: bool(1), float64(9), int64(1), object(1) memory usage: 87.0+ KB
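Since the mean was mentioned as an alternative, here is a sketch of the same imputation using the column mean (same Select_column as above; df_mean is an illustrative name):
df_mean = df.copy()
for col in Select_column:
    df_mean[col] = df_mean[col].fillna(df[col].mean()) # replace NaN with the column mean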
Remove Outliers¶
Outliers are usually extremely high or low values that differ from the pattern of the majority of the data. Outliers should be removed from the dataset (we should be careful not to treat novelty as outliers). A straightforward approach to detecting outliers is flagging values several (n) standard deviations from the mean, $m\pm n\sigma$ ($m$=mean, $\sigma$=standard deviation). For example, an outlier can be bigger than $m+3\sigma$ or smaller than $m-3\sigma$. The following function can be used to apply this approach.
def outlier_remove(df, n, name):
    """Delete rows for a specified column where values are beyond +/- n standard deviations from the mean
    df  : Pandas dataframe
    n   : n in the equation m±nσ
    name: Column name
    """
    mean = df[name].mean() # Calculate mean of column
    sd = df[name].std() # Calculate standard deviation of column
    drop_r = df.index[(mean - n*sd > df[name]) | (mean + n*sd < df[name])] # Find data that are not within m±nσ
    df.drop(drop_r, axis=0, inplace=True) # Drop data
    df.reset_index(inplace=True, drop=True) # Reset index
# Drop outliers in last column 'OIL Prod. (e3m3/month)'
df_im_out=df_im.copy()
outlier_remove(df_im_out,n=3,name='OIL Prod. (e3m3/month)')
df_im_out
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
976 | 535.5 | 788 | 1744.0 | True | 71.4 | 139.7 | 20.8 | 28.0 | Sand | 0.185 | 2.79 | 1.22 |
977 | 92.5 | 767 | 1466.0 | False | 35.7 | 139.7 | 20.8 | 51.0 | Sand | 0.230 | 6.26 | 8.74 |
978 | 548.3 | 475 | 749.8 | False | 35.7 | 114.3 | 20.8 | 28.0 | Shale | 0.017 | 0.48 | 0.00 |
979 | 540.9 | 328 | 1346.0 | True | 25.3 | 114.3 | 14.1 | 33.0 | Shale | 0.111 | 0.59 | 0.00 |
980 | 617.8 | 377 | 844.0 | False | 35.7 | 139.7 | 20.8 | 27.0 | Shale-Sand | 0.292 | 9.95 | 153.43 |
981 rows × 12 columns
Concatenation¶
A new data frame can be built by concatenating rows or columns. Below, a new dataframe is created by concatenating two columns, 'Measured Depth (m)' and 'OIL Prod. (e3m3/month)', using the concat() function.
col_1= df_im['Measured Depth (m)']
col_2 = df_im['OIL Prod. (e3m3/month)']
result = pd.concat([col_1, col_2], axis=1)
result
Measured Depth (m) | OIL Prod. (e3m3/month) | |
---|---|---|
0 | 2145.8 | 108.75 |
1 | 1638.0 | 0.33 |
2 | 3984.0 | 21.87 |
3 | 3809.0 | 46.30 |
4 | 2335.0 | 17.53 |
... | ... | ... |
995 | 1466.0 | 8.74 |
996 | 749.8 | 0.00 |
997 | 683.7 | 832.45 |
998 | 1346.0 | 0.00 |
999 | 844.0 | 153.43 |
1000 rows × 2 columns
Rows can also be concatenated together using the concat() function. The code below concatenates the first 4 rows and rows 996 to 998 of the previous dataset after imputation.
row_1= df_im[0:4] # Retrieve first 4 rows
row_2 = df_im[-4:-1] # Retrieve rows 996 to 998
result = pd.concat([row_1, row_2], axis=0) # Concatenate rows
result
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
996 | 548.3 | 475 | 749.8 | False | 35.7 | 114.3 | 20.8 | 28.0 | Shale | 0.017 | 0.48 | 0.00 |
997 | 593.4 | 456 | 683.7 | False | 35.7 | 177.8 | 20.8 | 26.0 | Shale-Sand | 0.332 | 49.75 | 832.45 |
998 | 540.9 | 328 | 1346.0 | True | 25.3 | 114.3 | 14.1 | 33.0 | Shale | 0.111 | 0.59 | 0.00 |
Saving a Dataframe as CSV¶
You can simply save a dataframe as a CSV file with the to_csv() function, as shown in the code below.
filename="myfile.csv"
df_im.to_csv(filename, index=False) # index = False not writing row numbers
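To reload the saved file later (a quick check, assuming myfile.csv from above):
df_reloaded = pd.read_csv(filename)
print(df_reloaded.shape)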
Shuffling, Grouping and Sorting¶
Shuffling Dataset
An example of shuffling is a game of cards: all the cards are collected at the end of each round of play and shuffled to ensure that cards are distributed randomly, so each player receives cards based on chance. In Machine Learning, the data set should be split into training and validation datasets, so shuffling is very important to avoid any element of bias/patterns in the split datasets.
If the same shuffling of the data set is required across several runs of the code, a random seed can be used. See the code below.
import numpy as np
#np.random.seed(82)
df_shfl = df_im.reindex(np.random.permutation(df.index))
df_shfl.reset_index(inplace=True, drop=True) # Reset index
df_shfl[0:5]
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 590.6 | 595 | 615.0 | True | 48.1 | 177.8 | 25.3 | 10.0 | Shale-Sand | 0.150 | 3.95 | 21.82 |
1 | 578.6 | 388 | 898.6 | False | 35.7 | 114.3 | 20.8 | 34.0 | Shale | 0.143 | 3.67 | 0.00 |
2 | 124.8 | 1093 | 1720.9 | False | 35.7 | 177.8 | 20.8 | 73.0 | Shale-Sand | 0.159 | 1.48 | 5.85 |
3 | 613.6 | 458 | 875.5 | True | 35.7 | 139.7 | 20.8 | 29.0 | Shale | 0.020 | 0.53 | 0.00 |
4 | 100.4 | 1075 | 1925.0 | True | 53.6 | 139.7 | 20.8 | 65.0 | Sand | 0.245 | 47.24 | 434.80 |
Grouping Data Set
Grouping is applied to summarize data. The following code performs grouping, generating the mean and sum of the target variable 'OIL Prod. (e3m3/month)' for each category of the feature 'Prod. Formation'.
gb_mean=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].mean()
gb_mean
Prod. Formation Sand 64.015467 Shale 59.502943 Shale-Sand 54.482824 Name: OIL Prod. (e3m3/month), dtype: float64
gb_sum=df_im.groupby('Prod. Formation')['OIL Prod. (e3m3/month)'].sum()
gb_sum
Prod. Formation Sand 19204.64 Shale 31536.56 Shale-Sand 9262.08 Name: OIL Prod. (e3m3/month), dtype: float64
Sorting Data Set
The sort_values() function sorts the dataframe by a column; ascending=True sorts from smallest to largest.
df_sort = df_im.sort_values(by='Measured Depth (m)', ascending=True)
#df_sort.reset_index(inplace=True, drop=True) # Reset index
df_sort
X Coordinate | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | 132.0 | 1104 | 276.0 | False | 25.3 | 114.3 | 14.1 | 25.0 | Shale | 0.142 | 1.16 | 0.00 |
537 | 72.9 | 784 | 295.5 | False | 25.3 | 114.3 | 14.1 | 22.0 | Shale | 0.000 | 0.66 | 0.00 |
403 | 591.3 | 791 | 297.0 | False | 25.3 | 114.3 | 14.1 | 26.0 | Shale | 0.142 | 13.37 | 72.26 |
507 | 621.7 | 593 | 321.0 | True | 53.6 | 177.8 | 34.2 | 33.0 | Shale | 0.091 | 1.72 | 0.00 |
105 | 631.2 | 583 | 343.2 | False | 56.5 | 139.7 | 20.8 | 33.0 | Sand | 0.168 | 3.19 | 19.23 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
273 | 179.8 | 628 | 5066.0 | True | 53.6 | 139.7 | 20.8 | 33.0 | Shale | 0.114 | 3.23 | 0.00 |
331 | 110.2 | 677 | 5116.0 | True | 35.7 | 139.7 | 25.3 | 33.0 | Sand | 0.000 | 0.13 | 0.00 |
951 | 48.6 | 718 | 5226.0 | True | 59.5 | 139.7 | 20.8 | 33.0 | Shale | 0.249 | 8.97 | 83.09 |
754 | 188.4 | 619 | 5790.0 | True | 53.6 | 114.3 | 22.5 | 33.0 | Shale | 0.194 | 4.03 | 0.39 |
570 | 343.2 | 383 | 6363.0 | True | 60.3 | 139.7 | 38.7 | 33.0 | Sand | 0.239 | 3.19 | 12.26 |
1000 rows × 12 columns
Feature Engineering¶
Feature engineering is the process of extracting new features from raw data via data mining techniques. The new features can be used to enhance the performance of machine learning algorithms. New features can also be calculated from other fields. For the data set WPD.csv, we can multiply "Surface-Casing Weight (kg/m)" and "Production-Casing Weight (kg/m)" to create a new column.
df_im_c=df_im.copy()
df_im.insert(1, 'New Column', (df_im['Surface-Casing Weight (kg/m)']*df_im['Production-Casing Weight (kg/m)']).astype(int))
df_im
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 742 | 372 | 2145.8 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 437 | 551 | 1638.0 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 803 | 788 | 3984.0 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 721 | 311 | 3809.0 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 1114 | 727 | 2335.0 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 92.5 | 742 | 767 | 1466.0 | False | 35.7 | 139.7 | 20.8 | 51.0 | Sand | 0.230 | 6.26 | 8.74 |
996 | 548.3 | 742 | 475 | 749.8 | False | 35.7 | 114.3 | 20.8 | 28.0 | Shale | 0.017 | 0.48 | 0.00 |
997 | 593.4 | 742 | 456 | 683.7 | False | 35.7 | 177.8 | 20.8 | 26.0 | Shale-Sand | 0.332 | 49.75 | 832.45 |
998 | 540.9 | 356 | 328 | 1346.0 | True | 25.3 | 114.3 | 14.1 | 33.0 | Shale | 0.111 | 0.59 | 0.00 |
999 | 617.8 | 742 | 377 | 844.0 | False | 35.7 | 139.7 | 20.8 | 27.0 | Shale-Sand | 0.292 | 9.95 | 153.43 |
1000 rows × 13 columns
Standardize (Normalize) Values¶
Most Machine Learning algorithms require the input to be standardized in order to have a reliable comparison between values of different variables. For example, if variable one has a range of 3 to 1000 and variable two has a range of 1 to 3, we cannot make a good comparison since the highest value of variable two (3) is the lowest value of variable one. This inconsistency in data values leads to low performance in Machine Learning algorithms. So, it is highly recommended to standardize (normalize) your data to have the same mean ($\mu$) and standard deviation ($\sigma$) before feeding Machine Learning algorithms. One very common standardization (normalization) in Machine Learning is the Z-Score:
$\Large z = \frac{x - \mu}{\sigma} $
where $ \mu = \frac{1}{n} \sum_{i=1}^n x_i$, $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$, and $n$ is the number of data points. This ensures that each variable has $\mu=0$ and $\sigma=1$.
This standardization should be applied to all features except the target (label). For the WPD.csv data set, the target is "OIL Prod. (e3m3/month)", which should not be changed.
The following Python code replaces the 'Measured Depth (m)' with a z-score. This can be done for all variables.
from scipy.stats import zscore
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("Before standardization: Mean= ",int(mean), ", variance= ",int(variance))
df_im['Measured Depth (m)'] = zscore(df_im['Measured Depth (m)'])
mean=df_im['Measured Depth (m)'].mean()
variance=df_im['Measured Depth (m)'].var()
print("After standardization: Mean= ",int(mean), ", variance= ",int(variance))
Before standardization: Mean= 1364 , variance= 732108 After standardization: Mean= 0 , variance= 1
df_im[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 742 | 372 | 0.913723 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 437 | 551 | 0.319947 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 803 | 788 | 3.063147 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 721 | 311 | 2.858518 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 1114 | 727 | 1.134956 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
Categorical Values¶
The input data for Machine Learning should be completely numeric, which means any text should be converted to numbers. In the data set WPD.csv, the columns 'Deviation (True/False)' and 'Prod. Formation' have categorical values. Use the value_counts() function to find out which categories exist and how many rows belong to each category.
df_c=df_im.copy()
df_c['Deviation (True/False)'].value_counts()
False    652
True     348
Name: Deviation (True/False), dtype: int64
df_c['Prod. Formation'].value_counts()
Shale         530
Sand          300
Shale-Sand    170
Name: Prod. Formation, dtype: int64
We can replace each category with a number, starting from 0.
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(False, 0) # Replace False with 0
df_c['Deviation (True/False)']=df_c['Deviation (True/False)'].replace(True, 1) # Replace True with 1
df_c[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 742 | 372 | 0.913723 | 0.0 | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 437 | 551 | 0.319947 | 1.0 | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 803 | 788 | 3.063147 | 1.0 | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 721 | 311 | 2.858518 | 1.0 | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 1114 | 727 | 1.134956 | 1.0 | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 |
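Since 'Deviation (True/False)' holds boolean values, the two replace calls above could also be written as a single cast; a minimal sketch of the same step:
df_c['Deviation (True/False)'] = df_c['Deviation (True/False)'].astype(int)  # False -> 0, True -> 1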
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale', 0) # Replace Shale with 0
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Sand', 1) # Replace Sand with 1
df_c['Prod. Formation']=df_c['Prod. Formation'].replace('Shale-Sand', 2) # Replace Shale-Sand with 2
df_c[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 742 | 372 | 0.913723 | 0.0 | 35.7 | 114.3 | 20.8 | 53.0 | 2 | 0.224 | 12.25 | 108.75 |
1 | 435.7 | 437 | 551 | 0.319947 | 1.0 | 25.3 | 114.3 | 17.3 | 33.0 | 1 | 0.118 | 4.11 | 0.33 |
2 | 37.7 | 803 | 788 | 3.063147 | 1.0 | 35.7 | 114.3 | 22.5 | 33.0 | 0 | 0.121 | 1.99 | 21.87 |
3 | 346.7 | 721 | 311 | 2.858518 | 1.0 | 41.7 | 114.3 | 17.3 | 33.0 | 1 | 0.170 | 5.84 | 46.30 |
4 | 254.6 | 1114 | 727 | 1.134956 | 1.0 | 53.6 | 139.7 | 20.8 | 32.0 | 1 | 0.158 | 5.25 | 17.53 |
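Equivalently, the three replace calls can be collapsed into a single call with a mapping dictionary; a minimal sketch of the same step:
df_c['Prod. Formation'] = df_c['Prod. Formation'].replace({'Shale': 0, 'Sand': 1, 'Shale-Sand': 2})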
One challenge with this representation is that Machine Learning algorithms assume two nearby values are more similar than two distant values. This may be fine in some cases (ordered categories such as “bad”, “average”, “good”), but it is not always true. A common solution is to create one binary attribute per category. This is called one-hot encoding, because only one attribute is equal to 1 (hot) while the others are 0 (cold).
One_hot = pd.get_dummies(['a','b'],prefix='Deviation')
print(One_hot)
   Deviation_a  Deviation_b
0            1            0
1            0            1
These dummies should be merged back into the data frame.
One_hot = pd.get_dummies(df_im['Deviation (True/False)'],prefix='Deviation')
print(One_hot[0:10]) # Just show the first 10
   Deviation_False  Deviation_True
0                1               0
1                0               1
2                0               1
3                0               1
4                0               1
5                1               0
6                0               1
7                1               0
8                1               0
9                0               1
df_one_hot = pd.concat([df_im,One_hot],axis=1)
df_one_hot[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | Deviation_False | Deviation_True | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 742 | 372 | 0.913723 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 | 1 | 0 |
1 | 435.7 | 437 | 551 | 0.319947 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 | 0 | 1 |
2 | 37.7 | 803 | 788 | 3.063147 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 | 0 | 1 |
3 | 346.7 | 721 | 311 | 2.858518 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 | 0 | 1 |
4 | 254.6 | 1114 | 727 | 1.134956 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 | 0 | 1 |
The same should be done for the 'Prod. Formation' column.
One_hot = pd.get_dummies(df_im['Prod. Formation'],prefix='Formation')
df_one_hot = pd.concat([df_one_hot,One_hot],axis=1)
df_one_hot[0:5]
X Coordinate | New Column | Y Coordinate | Measured Depth (m) | Deviation (True/False) | Surface-Casing Weight (kg/m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Bore. Temp. (degC) | Prod. Formation | Porosity (fraction) | Permeability (Darcy) | OIL Prod. (e3m3/month) | Deviation_False | Deviation_True | Formation_Sand | Formation_Shale | Formation_Shale-Sand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 352.6 | 742 | 372 | 0.913723 | False | 35.7 | 114.3 | 20.8 | 53.0 | Shale-Sand | 0.224 | 12.25 | 108.75 | 1 | 0 | 0 | 0 | 1 |
1 | 435.7 | 437 | 551 | 0.319947 | True | 25.3 | 114.3 | 17.3 | 33.0 | Sand | 0.118 | 4.11 | 0.33 | 0 | 1 | 1 | 0 | 0 |
2 | 37.7 | 803 | 788 | 3.063147 | True | 35.7 | 114.3 | 22.5 | 33.0 | Shale | 0.121 | 1.99 | 21.87 | 0 | 1 | 0 | 1 | 0 |
3 | 346.7 | 721 | 311 | 2.858518 | True | 41.7 | 114.3 | 17.3 | 33.0 | Sand | 0.170 | 5.84 | 46.30 | 0 | 1 | 1 | 0 | 0 |
4 | 254.6 | 1114 | 727 | 1.134956 | True | 53.6 | 139.7 | 20.8 | 32.0 | Sand | 0.158 | 5.25 | 17.53 | 0 | 1 | 1 | 0 | 0 |
Finally, the original columns 'Deviation (True/False)' and 'Prod. Formation' should be removed.
df_one_hot.drop(['Deviation (True/False)','Prod. Formation'], axis=1, inplace=True)  # Drop the original categorical columns
#df_one_hot  # uncomment to display the result
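Note that the encoding, merging, and dropping can also be done in a single step by passing the categorical columns directly to get_dummies; a minimal sketch, assuming the same df_im as above:
df_one_hot = pd.get_dummies(df_im,
                            columns=['Deviation (True/False)', 'Prod. Formation'],
                            prefix=['Deviation', 'Formation'])  # encodes both columns and drops the originals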
Split Data to Training and Validation¶
Machine Learning models should be evaluated on their predictions for never-before-seen data. Models usually perform well on the data they were trained on, but the real aim of training is high performance on completely new data. So, the data are split into a training set and a validation set: the model learns from the training data and is evaluated on the validation data. The data are divided according to some ratio; a common ratio is 80% training and 20% validation. The image below shows how a model is trained on 80% of the data and then validated against the remaining 20%. Based on the performance feedback on the validation set, the Machine Learning algorithm is modified, and this process is repeated until the model's performance is satisfactory. This repetition can also lead to information leakage: the model may perform well on the validation set but not on new data. So, sometimes a portion of the data is held out for testing. We will talk about this in the next lectures.
The following code splits the data into an 80% training set and a 20% validation set.
import numpy as np

df = df.reindex(np.random.permutation(df.index))  # Shuffle the rows
df.reset_index(inplace=True, drop=True)           # Reset the index after shuffling
MASK = np.random.rand(len(df)) < 0.8              # True for roughly 80% of the rows
df_train = pd.DataFrame(df[MASK])                 # ~80% of the rows for training
df_validation = pd.DataFrame(df[~MASK])           # remaining ~20% for validation
print("Number of Training: "+ str(len(df_train)))
print("Number of Validation: "+ str(len(df_validation)))
Number of Training: 790
Number of Validation: 210
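Note that the random-mask approach above gives only an approximately 80/20 split, and the counts change from run to run. scikit-learn (installed earlier) provides train_test_split, which produces an exact split; a minimal sketch of the equivalent step:
from sklearn.model_selection import train_test_split

# Exact 80/20 split; random_state fixes the shuffle so the result is reproducible
df_train, df_validation = train_test_split(df, test_size=0.2, random_state=42)
print("Number of Training: "+ str(len(df_train)))
print("Number of Validation: "+ str(len(df_validation)))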