Summary
This work was published by Mehdi Rezvandehy and Bernhard Mayer in Data-Centric Engineering, Cambridge University Press: https://doi.org/10.1017/dce.2023.9.
The exploitation of hydrocarbon reservoirs may lead to contamination of soils and shallow water resources, and to greenhouse gas emissions, in cases where wells are not perfectly sealed. Fluids such as methane or CO2 may in some cases migrate towards the groundwater zone and the atmosphere through and along hydrocarbon wells. Several well properties, such as age, depth, production/injection history, well density, and deviation, among others, may influence the likelihood of fluid leakage. In this study, machine learning approaches were applied to predict serious leakage, with uncertainty quantification, for wells in Alberta, Canada that have not been field tested. An improved imputation technique was developed based on Cholesky factorization of the correlation matrix between features. Next, a wide range of predictive algorithms and various performance metrics were considered to achieve the most reliable classifier. However, a class distribution highly skewed towards the negative class (non-serious leakage) forces predictive models to unrealistically underestimate the minority class (serious leakage). To address this issue, a combination of oversampling and undersampling was applied to adjust the class distribution. This notebook presents the Python implementation for predicting fluid leakage from hydrocarbon wells.
Python functions and data files required to run this notebook are available on my GitHub page.
import sys
import os
import math
import random
import statistics
import pickle
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnchoredText
from matplotlib.ticker import PercentFormatter
from scipy.stats import mode
import utm
import tensorflow as tf
from tensorflow import keras
# scikit-learn utilities used throughout the notebook
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, GridSearchCV, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from functions import *  # import required functions (plotting, imputation, and sampling helpers) to run this notebook
Table of Contents
- 1 Introduction
- 2 Raw Data: Well Properties and Gas Migration Status
- 3 Data Processing
- 4 Imputation
- 5 Get Final Training Set
- 6 Get Final Validation and Test Set
- 7 Over- and Undersampling for the Training Set Only
- 8 Dummy (i.e., Random) Classifier
- 9 Stochastic Gradient Descent
- 10 Logistic Regression
- 11 Support Vector Machine
- 12 Adaptive Boosting with Decision Trees
- 13 Random Forest
- 14 Neural Network
- 15 Area Under Curve (AUC)
- 16 Performance of the Models on Training Set
- 17 Random Forest: Sampling before and within Cross validation
- 18 Model after Changing Ratio
- 19 Test set
- 20 Prediction of Serious Leakage for 1000 Random Wells without Field Test
Introduction¶
Exploitation of oil and gas reservoirs has raised public concerns regarding potential contamination of soils and shallow groundwater and increases in greenhouse gas emissions. In cases where hydrocarbon wells are not properly sealed, gases such as methane and/or sequestered CO2 can migrate outside the surface casing of wellbores towards shallow aquifers, soils and the atmosphere, or travel inside the wellbore and be vented to the atmosphere via surface casing vent flows (SCVF), causing environmental concerns. Monitoring for such gas migration and surface casing vent flows is required by regulators to detect serious leakage associated with existing oil and gas wells and to prioritize the remediation of the leakiest wells.
The Alberta Energy Regulator (AER) in Alberta, Canada, requires such field tests for energy wells within the province. The AER applies two field tests to identify fluid migration after a well has been completed to produce hydrocarbons or to inject fluids: 1) Surface casing vent flow (SCVF) is the flow of gas (methane, CO2, etc.) out of the casing annulus or surface casing; SCVF is often referred to as internal migration. 2) Gas migration (GM) is the flow of any gas that is detectable at the surface outside of the outermost casing string; GM is often referred to as seepage or external migration.
The test for SCVF can be a simple bubble test, where a small hose attached to the surface-casing vent is immersed in a container filled with water. Observing bubbles indicates that the well has SCVF, and further testing is required to measure pressure and flow rate. Wells with positive SCVF are considered serious in the province of Alberta under one or several of the following conditions: a) gas-flow rates higher than 300 m3/d, b) stabilized pressure > 9.8 kPa/m, c) saline water, d) liquid hydrocarbons, and e) hydrogen sulphide (H2S) flow.
Monitoring for GM in the vicinity of hydrocarbon wells is generally more challenging and is often approached with measurements using gas detectors near the soil surface. GM in soils occurs when gas migrates outside of the cemented surface casing. The gas source is frequently natural gas from geological formations in the intermediate zone (e.g., below the base of groundwater protection but above the targeted hydrocarbon reservoir), with migration often facilitated by poor surface-casing cement. Monitoring for GM is traditionally accomplished by inserting gas measurement probes into the soil to a depth of at least 50 cm in a test pattern radiating out from the wellbore, and subsequently measuring gas concentrations and flow rates. A GM is serious if there is a high flow rate, a public safety hazard, or off-lease environmental damage such as groundwater contamination (see AER Interim Directive ID 2003-01 (2003) for more information). Wells with positive SCVF/GM are classified as non-serious if none of the conditions for the serious category are met. In Alberta, repair of serious SCVF/GM leakage is required within 90 days; otherwise, repair is deferred to abandonment.
Efficient and cost-effective testing of all hydrocarbon-producing wells is a major challenge in areas with large numbers of production or injection wells. The AER requires testing of all wells only within a small specific area in central and eastern Alberta, and only for wells completed since 1995. There are many wells in other parts of Alberta, including abandoned and orphaned wells, for which no SCVF/GM tests have been conducted. The objective of this work was to reliably train a model to predict the probability of serious fluid leakage using oil and gas wells for which SCVF and GM measurements are available (Figure 1) and subsequently make predictions for wells for which SCVF/GM tests have not been conducted.
Raw Data: Well Properties and Gas Migration Status¶
# Reading Data
df_xy = pd.read_csv("./Data/Final_Final.csv",na_values=['NA','?',' '])
drop_colmn=['X Coordinate','Y Coordinate'] # x and y coordinate is used only for plotting
#Shuffle data to avoid any artifact
np.random.seed(42)
df_xy=df_xy.reindex(np.random.permutation(df_xy.index))
df_xy.reset_index(inplace=True, drop=True)
# Rename some columns to be more descriptive.
df_xy=df_xy.rename(columns={"MD (All Wells) (m)":"Measured Depth (m)",
"Surf-Hole Easting (NAD83)": "X Coordinate","Surf-Hole Northing (NAD83)": "Y Coordinate",
"Rm": "Rheology Modifier (RM)","BH Temp. (degC)": "Borehole Temperature (degC)",
'Cumulative WTR Prod. (m3)':'Cumulative WATER Prod. (m3)',
'Prod./Inject. Frmtn': 'Prod./Inject. Formation', 'Well Status Abrv': 'Well Status',
'No_well':'Well Density (n/10km X 10km)', 'Surface Abandonment Date': 'Surface Abandonment Month',
'Total Production Hrs':'Total Production Hours', 'Class': 'AER Classification'})
#
df_xy=df_xy[['Deviated Hole (T/F)','Horizontal Hole (T/F)','Surface-Casing Depth (m)','Surface-Casing Size (mm)'
,'Surface-Casing Weight (kg/m)','Production-Casing Depth (m)','Production-Casing Size (mm)'
,'Production-Casing Weight (kg/m)','Production-Casing Grade','Surface-Casing Grade'
,'Measured Depth (m)','Borehole Temperature (degC)','Prod./Inject. Formation','Well Status',
'Month Well Spudded','Well Density (n/10km X 10km)','Surface Abandonment Type','Surface Abandonment Month',
'Cumulative GAS Prod. (e3m3)','Cumulative OIL Prod. (m3)','Cumulative WATER Prod. (m3)',
'Total Production Month','X Coordinate','Y Coordinate','AER Classification']]
# Keep two copies of the data: one with x and y coordinates (used only for plotting) and one without
df=df_xy.copy()
drop_colmn=['X Coordinate','Y Coordinate']
df.drop(drop_colmn, axis=1, inplace=True)
df
Deviated Hole (T/F) | Horizontal Hole (T/F) | Surface-Casing Depth (m) | Surface-Casing Size (mm) | Surface-Casing Weight (kg/m) | Production-Casing Depth (m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Production-Casing Grade | Surface-Casing Grade | Measured Depth (m) | Borehole Temperature (degC) | Prod./Inject. Formation | Well Status | Month Well Spudded | Well Density (n/10km X 10km) | Surface Abandonment Type | Surface Abandonment Month | Cumulative GAS Prod. (e3m3) | Cumulative OIL Prod. (m3) | Cumulative WATER Prod. (m3) | Total Production Month | AER Classification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | F | 186.0 | 219.1 | 35.7 | 422.0 | 114.3 | 14.1 | J055 | J055 | 901.0 | 33.0 | Kmilk_rv | FG | 447.0 | 51 | NaN | NaN | 35312.0 | 0.0 | 9.0 | 395.0 | Non Serious |
1 | F | F | 104.0 | 273.0 | NaN | 607.0 | 177.8 | NaN | NaN | NaN | 607.0 | 23.0 | Kcummings | ABO | 506.0 | 75 | NaN | NaN | 282.0 | 19307.0 | 7831.0 | 330.0 | Non Serious |
2 | F | F | 326.0 | 219.1 | 35.7 | 1850.0 | 114.3 | 17.3 | J055 | J055 | 1855.0 | 61.0 | NaN | AZN | 447.0 | 9 | Plate | 290.0 | NaN | NaN | NaN | NaN | Serious |
3 | F | F | 144.0 | 177.8 | NaN | 609.0 | 114.3 | NaN | NaN | NaN | 609.0 | 26.0 | NaN | AZN | 496.0 | 15 | Plate | 143.0 | NaN | NaN | NaN | NaN | Non Serious |
4 | F | T | 136.8 | 339.7 | 90.8 | 632.0 | 219.1 | 53.7 | L080 | K055 | 1594.0 | NaN | Kclwtr_ss | BOC | 247.0 | 45 | NaN | NaN | NaN | NaN | NaN | NaN | Serious |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
29961 | T | F | 214.0 | 219.1 | 35.7 | 996.0 | 139.7 | 20.8 | OOJ055 | OOJ055 | 996.0 | 29.0 | Ksparky | WI | 281.0 | 14 | NaN | NaN | NaN | NaN | NaN | NaN | Non Serious |
29962 | F | T | 614.0 | 244.5 | NaN | NaN | NaN | NaN | NaN | NaN | 3780.0 | NaN | Kcard_ss | PO | 102.0 | 10 | NaN | NaN | 5217.0 | 6270.0 | 233.0 | 55.0 | Non Serious |
29963 | F | F | 283.7 | 219.1 | 35.7 | 1472.8 | 139.7 | 20.8 | K055 | K055 | 1474.6 | 36.0 | Kcard_ss | SO | 460.0 | 12 | NaN | NaN | 11255.0 | 19008.0 | 2.0 | 366.0 | Non Serious |
29964 | F | F | 205.5 | 219.1 | 35.7 | 1236.0 | 139.7 | 20.8 | J055 | J055 | 1236.0 | NaN | Kglauc_ss | AGZ | 76.0 | 4 | Cement And Plate | 116.0 | 73.0 | 0.0 | 13.0 | 3.0 | Non Serious |
29965 | T | F | 172.0 | 244.5 | 53.6 | 630.6 | 177.8 | 34.2 | J055 | J055 | 630.6 | 25.0 | NaN | CSD | 120.0 | 35 | NaN | NaN | NaN | NaN | NaN | NaN | Serious |
29966 rows × 23 columns
Data Processing¶
# Remove 'Considered Non Serious' from 'AER Classification'
df_xy=df_xy[df_xy['AER Classification']!='Considered Non Serious']
df=df[df['AER Classification']!='Considered Non Serious']
df_xy['AER Classification']=df_xy['AER Classification'].replace('Non Serious', 0)
df_xy['AER Classification']=df_xy['AER Classification'].replace('Serious', 1)
#df_xy['AER Classification']=df_xy['AER Classification'].replace('Considered Non Serious', 1)
df_xy.reset_index(inplace=True, drop=True)
#
df['AER Classification']=df['AER Classification'].replace('Non Serious', 0)
df['AER Classification']=df['AER Classification'].replace('Serious', 1)
#df['AER Classification']=df['AER Classification'].replace('Considered Non Serious', 1)
df.reset_index(inplace=True, drop=True)
Location Map¶
#Plot for Serious
x_1=list((df_xy["X Coordinate"][df_xy["AER Classification"]==1])/1000)
y_1=list((df_xy["Y Coordinate"][df_xy["AER Classification"]==1])/1000)
count_1=len(df_xy[df_xy["AER Classification"]==1])
#Plot for Non Serious
x_2=list((df_xy["X Coordinate"][df_xy["AER Classification"]==0])/1000)
y_2=list((df_xy["Y Coordinate"][df_xy["AER Classification"]==0])/1000)
count_0=len(df_xy[df_xy["AER Classification"]==0])
# Plot location map and pie chart
loc_pie_plot(x_1, y_1, x_2, y_2, count_0, count_1,title='AER Classification:\n'+ str(len(df_xy))+
" SCVF/GM Wells ",s1=6,s2=18,alpha1=1,alpha2=0.25,save='./Figures/Fig1.png')
# 10% of data for Test
x=df_xy
y=df_xy['AER Classification']
# Test set
split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
for train_index, test_index in split.split(x,y):
train_val_set_strat = x.loc[train_index]
test_set_strat = x.loc[test_index]
df_main = x.loc[train_index]
test_set_strat.reset_index(inplace=True, drop=True)
# Split the remaining data into training (80%) and validation (20%) sets
train_val_set_strat.reset_index(inplace=True, drop=True)
x=train_val_set_strat
y=train_val_set_strat['AER Classification']
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=32)
for train_index, test_index in split.split(x,y):
train_set_strat = x.loc[train_index]
vali_set_strat = x.loc[test_index]
train_set_strat.reset_index(inplace=True, drop=True)
df_xy=train_set_strat
vali_set_strat.reset_index(inplace=True, drop=True)
Training Set¶
#Plot for Serious
ind_Serious=[index for index, value in enumerate(df_xy['AER Classification']) if value == 1]
ind_Non_Serious=[index for index, value in enumerate(df_xy['AER Classification']) if value == 0]
x_1=list((df_xy["X Coordinate"].iloc[ind_Serious])/1000)
y_1=list((df_xy["Y Coordinate"].iloc[ind_Serious])/1000)
count_1_taining=len(df_xy["AER Classification"].iloc[ind_Serious])
#Plot for Non Serious
x_2=list((df_xy["X Coordinate"].iloc[ind_Non_Serious])/1000)
y_2=list((df_xy["Y Coordinate"].iloc[ind_Non_Serious])/1000)
count_0_taining=len(df_xy["AER Classification"].iloc[ind_Non_Serious])
# Plot location map and pie chart
loc_pie_plot(x_1, y_1, x_2, y_2, count_0_taining, count_1_taining,title= str(len(df_xy))+
" SCVF/GM Wells ",s1=6,s2=18,alpha1=1,alpha2=0.25,save='./Figures/Fig1_taining.png',startangle=180)
Validation Set¶
#Plot for Serious
ind_Serious=[index for index, value in enumerate(vali_set_strat['AER Classification']) if value == 1]
ind_Non_Serious=[index for index, value in enumerate(vali_set_strat['AER Classification']) if value == 0]
x_1=list((vali_set_strat["X Coordinate"].iloc[ind_Serious])/1000)
y_1=list((vali_set_strat["Y Coordinate"].iloc[ind_Serious])/1000)
count_1_valid=len(vali_set_strat["AER Classification"].iloc[ind_Serious])
#Plot for Non Serious
x_2=list((vali_set_strat["X Coordinate"].iloc[ind_Non_Serious])/1000)
y_2=list((vali_set_strat["Y Coordinate"].iloc[ind_Non_Serious])/1000)
count_0_valid=len(vali_set_strat["AER Classification"].iloc[ind_Non_Serious])
# Plot location map and pie chart
loc_pie_plot(x_1, y_1, x_2, y_2, count_0_valid, count_1_valid,title= str(len(vali_set_strat))+
" SCVF/GM Wells ",s1=6,s2=18,alpha1=1,alpha2=0.25,save='./Figures/Fig1_valid.png',startangle=180)
Test Set¶
#Plot for Serious
ind_Serious=[index for index, value in enumerate(test_set_strat['AER Classification']) if value == 1]
ind_Non_Serious=[index for index, value in enumerate(test_set_strat['AER Classification']) if value == 0]
x_1=list((test_set_strat["X Coordinate"].iloc[ind_Serious])/1000)
y_1=list((test_set_strat["Y Coordinate"].iloc[ind_Serious])/1000)
count_1=len(test_set_strat["AER Classification"].iloc[ind_Serious])
#Plot for Non Serious
x_2=list((test_set_strat["X Coordinate"].iloc[ind_Non_Serious])/1000)
y_2=list((test_set_strat["Y Coordinate"].iloc[ind_Non_Serious])/1000)
count_0=len(test_set_strat["AER Classification"].iloc[ind_Non_Serious])
# Plot location map and pie chart
loc_pie_plot(x_1, y_1, x_2, y_2, count_0, count_1,title= str(len(test_set_strat))+
" SCVF/GM Wells ",s1=9,s2=25,alpha1=1,alpha2=0.25,save='./Figures/Fig1_testset.png',startangle=180)
Target Encoding¶
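Categorical features (e.g., production/injection formation, casing grades, well status) are converted to numbers with smoothed-mean target encoding using the smoothed_mean helper imported from functions.py. For intuition, a minimal sketch of such an encoder is shown below; it assumes the common smoothing formula in which weight acts as a pseudo-count that pulls rare categories towards the global mean, and it may differ in detail from the actual helper.
def smoothed_mean_sketch(df_1, cat, target, weight=5):
    """Sketch of smoothed-mean (target) encoding; NOT the exact smoothed_mean from functions.py."""
    global_mean = df_1[target].mean()
    agg = df_1.groupby(cat)[target].agg(['count', 'mean'])
    # Blend each category mean with the global mean; 'weight' acts as a pseudo-count
    smooth = (agg['count']*agg['mean'] + weight*global_mean)/(agg['count'] + weight)
    mapping = smooth.to_dict()
    # Return the encoded column and the mapping (reused later for the validation and test sets)
    return df_1[cat].map(mapping), mapping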
## Assign weight
# Target Encoding
WEIGHT = 5
#df_xy['Well Status'],dic_TE_1 = smoothed_mean(df_1=df_xy, cat='Well Status', target='AER Classification', weight=WEIGHT)
#
df_xy['Prod./Inject. Formation'].fillna('NA',inplace=True)
df_xy['Prod./Inject. Formation'],dic_TE_2= smoothed_mean(df_1=df_xy, cat='Prod./Inject. Formation',
target='AER Classification', weight=WEIGHT)
#
df_xy['Production-Casing Grade'],dic_TE_3 = smoothed_mean(df_1=df_xy, cat='Production-Casing Grade',
target='AER Classification', weight=WEIGHT)
#
df_xy['Surface-Casing Grade'],dic_TE_4 = smoothed_mean(df_1=df_xy, cat='Surface-Casing Grade',
target='AER Classification', weight=WEIGHT)
#
df_xy['Surface Abandonment Type'].fillna('NA',inplace=True)
df_xy['Surface Abandonment Type'],dic_TE_5 = smoothed_mean(df_1=df_xy, cat='Surface Abandonment Type',
target='AER Classification', weight=WEIGHT)
#
df_xy['Deviated Hole (T/F)'].fillna('NA',inplace=True)
df_xy['Deviated Hole (T/F)'],dic_TE_6=smoothed_mean(df_1=df_xy, cat='Deviated Hole (T/F)',
target='AER Classification', weight=WEIGHT)
#
df_xy['Horizontal Hole (T/F)'].fillna('NA',inplace=True)
df_xy['Horizontal Hole (T/F)'],dic_TE_7=smoothed_mean(df_1=df_xy, cat='Horizontal Hole (T/F)',
target='AER Classification', weight=WEIGHT)
#
df_xy['Well Status'].fillna('NA',inplace=True)
df_xy['Well Status'],dic_TE_8= smoothed_mean(df_1=df_xy, cat='Well Status',
target='AER Classification', weight=WEIGHT)
# Replace real missing values with 0
df_xy['Surface Abandonment Month'].fillna(0,inplace=True)
df_xy
Deviated Hole (T/F) | Horizontal Hole (T/F) | Surface-Casing Depth (m) | Surface-Casing Size (mm) | Surface-Casing Weight (kg/m) | Production-Casing Depth (m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Production-Casing Grade | Surface-Casing Grade | Measured Depth (m) | Borehole Temperature (degC) | Prod./Inject. Formation | Well Status | Month Well Spudded | Well Density (n/10km X 10km) | Surface Abandonment Type | Surface Abandonment Month | Cumulative GAS Prod. (e3m3) | Cumulative OIL Prod. (m3) | Cumulative WATER Prod. (m3) | Total Production Month | X Coordinate | Y Coordinate | AER Classification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.129647 | 0.166704 | 107.0 | 244.5 | 48.1 | 619.0 | 177.8 | 34.2 | 0.187893 | 0.137679 | 619.0 | 10.0 | 0.170461 | 0.089447 | 209.0 | 41 | 0.134169 | 0.0 | 355.0 | 12969.0 | 33976.0 | 41.0 | 586810.0 | 512623.7 | 0 |
1 | 0.173492 | 0.166704 | 252.0 | 219.1 | NaN | 1628.0 | 139.7 | NaN | NaN | NaN | 1628.0 | 53.0 | 0.169473 | 0.172106 | 508.0 | 28 | 0.134169 | 0.0 | 5927.0 | 20363.0 | 1202.0 | 278.0 | 72489.3 | 776757.1 | 0 |
2 | 0.173492 | 0.166704 | 218.0 | 219.1 | NaN | 2159.0 | 114.3 | NaN | NaN | NaN | 2163.0 | 79.0 | 0.117536 | 0.138271 | 521.0 | 15 | 0.134169 | 0.0 | 430251.0 | 0.0 | 7101.0 | 455.0 | 348133.3 | 416882.6 | 1 |
3 | 0.129647 | 0.166704 | 185.0 | 219.1 | 35.7 | 1200.0 | 139.7 | 20.8 | 0.131978 | 0.132078 | 1200.0 | 34.0 | 0.127741 | 0.124544 | 317.0 | 17 | 0.134169 | 0.0 | 2574.0 | 281.0 | 1957.0 | 95.0 | 470936.8 | 105185.2 | 0 |
4 | 0.173492 | 0.166704 | 188.1 | 273.0 | NaN | 1082.0 | 139.7 | NaN | NaN | NaN | 1239.0 | 23.0 | 0.127741 | 0.244136 | 772.0 | 17 | 0.198462 | 369.0 | 14741.0 | 0.0 | 1709.0 | 372.0 | 482039.9 | 422675.4 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19725 | 0.173492 | 0.166704 | 138.4 | 219.1 | NaN | 1403.6 | 114.3 | NaN | NaN | NaN | 1403.9 | 44.0 | 0.257481 | 0.158095 | 759.0 | 18 | 0.134169 | 0.0 | 3991.0 | 37524.0 | 24770.0 | 297.0 | 313010.6 | 482981.8 | 0 |
19726 | 0.129647 | 0.166704 | 266.0 | 219.1 | 35.7 | 1021.0 | 139.7 | 23.1 | 0.131978 | 0.132078 | 1022.0 | NaN | 0.154149 | 0.090592 | 158.0 | 8 | 0.134169 | 0.0 | NaN | NaN | NaN | NaN | 509540.1 | 66495.9 | 0 |
19727 | 0.173492 | 0.166704 | 187.8 | 177.8 | NaN | 649.2 | 114.3 | NaN | NaN | NaN | 806.5 | 41.0 | 0.165947 | 0.155173 | 584.0 | 23 | 0.134169 | 0.0 | 5586.0 | 0.0 | 259.0 | 300.0 | 517401.3 | 490091.4 | 0 |
19728 | 0.173492 | 0.166704 | 399.0 | 339.7 | NaN | 2317.7 | 177.8 | NaN | NaN | NaN | 2338.0 | NaN | 0.156535 | 0.172106 | 790.0 | 34 | 0.134169 | 0.0 | 11307.0 | 29659.0 | 124154.0 | 173.0 | 185721.3 | 675266.9 | 0 |
19729 | 0.129647 | 0.166704 | 200.0 | 244.5 | 53.6 | 2202.0 | 177.8 | 34.2 | 0.240164 | 0.132078 | 2214.0 | NaN | 0.250195 | 0.229537 | 352.0 | 32 | 0.198462 | 207.0 | 6262.0 | 44336.0 | 94147.0 | 93.0 | 420888.6 | 360712.0 | 1 |
19730 rows × 25 columns
# Drop x and y coordinates since they are only used for plotting. Hence, we keep two datasets:
# one with x and y, and another without x and y
df=df_xy.copy()
drop_colmn=['X Coordinate','Y Coordinate'] # x and y coordinate is used only for plotting
df.drop(drop_colmn, axis=1, inplace=True)
df
Deviated Hole (T/F) | Horizontal Hole (T/F) | Surface-Casing Depth (m) | Surface-Casing Size (mm) | Surface-Casing Weight (kg/m) | Production-Casing Depth (m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Production-Casing Grade | Surface-Casing Grade | Measured Depth (m) | Borehole Temperature (degC) | Prod./Inject. Formation | Well Status | Month Well Spudded | Well Density (n/10km X 10km) | Surface Abandonment Type | Surface Abandonment Month | Cumulative GAS Prod. (e3m3) | Cumulative OIL Prod. (m3) | Cumulative WATER Prod. (m3) | Total Production Month | AER Classification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.129647 | 0.166704 | 107.0 | 244.5 | 48.1 | 619.0 | 177.8 | 34.2 | 0.187893 | 0.137679 | 619.0 | 10.0 | 0.170461 | 0.089447 | 209.0 | 41 | 0.134169 | 0.0 | 355.0 | 12969.0 | 33976.0 | 41.0 | 0 |
1 | 0.173492 | 0.166704 | 252.0 | 219.1 | NaN | 1628.0 | 139.7 | NaN | NaN | NaN | 1628.0 | 53.0 | 0.169473 | 0.172106 | 508.0 | 28 | 0.134169 | 0.0 | 5927.0 | 20363.0 | 1202.0 | 278.0 | 0 |
2 | 0.173492 | 0.166704 | 218.0 | 219.1 | NaN | 2159.0 | 114.3 | NaN | NaN | NaN | 2163.0 | 79.0 | 0.117536 | 0.138271 | 521.0 | 15 | 0.134169 | 0.0 | 430251.0 | 0.0 | 7101.0 | 455.0 | 1 |
3 | 0.129647 | 0.166704 | 185.0 | 219.1 | 35.7 | 1200.0 | 139.7 | 20.8 | 0.131978 | 0.132078 | 1200.0 | 34.0 | 0.127741 | 0.124544 | 317.0 | 17 | 0.134169 | 0.0 | 2574.0 | 281.0 | 1957.0 | 95.0 | 0 |
4 | 0.173492 | 0.166704 | 188.1 | 273.0 | NaN | 1082.0 | 139.7 | NaN | NaN | NaN | 1239.0 | 23.0 | 0.127741 | 0.244136 | 772.0 | 17 | 0.198462 | 369.0 | 14741.0 | 0.0 | 1709.0 | 372.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19725 | 0.173492 | 0.166704 | 138.4 | 219.1 | NaN | 1403.6 | 114.3 | NaN | NaN | NaN | 1403.9 | 44.0 | 0.257481 | 0.158095 | 759.0 | 18 | 0.134169 | 0.0 | 3991.0 | 37524.0 | 24770.0 | 297.0 | 0 |
19726 | 0.129647 | 0.166704 | 266.0 | 219.1 | 35.7 | 1021.0 | 139.7 | 23.1 | 0.131978 | 0.132078 | 1022.0 | NaN | 0.154149 | 0.090592 | 158.0 | 8 | 0.134169 | 0.0 | NaN | NaN | NaN | NaN | 0 |
19727 | 0.173492 | 0.166704 | 187.8 | 177.8 | NaN | 649.2 | 114.3 | NaN | NaN | NaN | 806.5 | 41.0 | 0.165947 | 0.155173 | 584.0 | 23 | 0.134169 | 0.0 | 5586.0 | 0.0 | 259.0 | 300.0 | 0 |
19728 | 0.173492 | 0.166704 | 399.0 | 339.7 | NaN | 2317.7 | 177.8 | NaN | NaN | NaN | 2338.0 | NaN | 0.156535 | 0.172106 | 790.0 | 34 | 0.134169 | 0.0 | 11307.0 | 29659.0 | 124154.0 | 173.0 | 0 |
19729 | 0.129647 | 0.166704 | 200.0 | 244.5 | 53.6 | 2202.0 | 177.8 | 34.2 | 0.240164 | 0.132078 | 2214.0 | NaN | 0.250195 | 0.229537 | 352.0 | 32 | 0.198462 | 207.0 | 6262.0 | 44336.0 | 94147.0 | 93.0 | 1 |
19730 rows × 23 columns
Number of Missing Values of Raw Data¶
#Calculate the number and percentage of missing values for each feature
df_=df.copy()
df_.drop(['AER Classification'], axis=1, inplace=True)
Colms=df_.columns
missing_count=df_.isnull().sum().sort_values()            # number of missing values per feature (ascending)
sort_non_nan_=missing_count.values                        # sorted number of missing values
sort_array=(missing_count/len(df_)*100).values            # sorted percentage of missing values
sort_colms=list(missing_count.index)                      # feature names in the same (ascending) order
sort_non_nan__=sort_colms
#Plot barchart with two vertical axis
font = {'size' : 10}
matplotlib.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 200, facecolor='w', edgecolor='k')
ax2 = ax1.twinx() # Create another axes that shares the same x-axis
index1 = np.arange(len(sort_colms))
index2 = np.arange(len(sort_non_nan__))
ax1.bar(index1, sort_array,color='b')
ax2.bar(index2, sort_non_nan_,color='b')
plt.title('Total Wells= '+str(count_0_taining+count_1_taining)+': Non-serious='+
str(count_0_taining)+', Serious='+str(count_1_taining),fontsize=12)
#ax1.set_xlabel('Features (Well Properties)',fontsize='10.5')
ax1.set_ylabel('Percentage of Missing Data (%)',fontsize='9.5')
ax2.set_ylabel('Number of Missing Data',fontsize='9.5')
ax1.set_xticks(np.arange(len(Colms)))
ax1.set_xticklabels(sort_colms,fontsize=8, rotation=90)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.4) # horizontal lines
fig.savefig('./Figures/Fig2.pdf', dpi=500, bbox_inches='tight')
plt.show()
# Fraction of missing values in each row
clm_all=df.columns
miss_per=np.zeros((len(df.index)))+np.nan
for i in df.index:
miss_per[i]=len(df.iloc[i][df.iloc[i].isnull()])/len(clm_all)
Correlation between features¶
# Calculate correlation between features
corr=df.corr()
Colms_sh=list(Colms)
# Plot correlation of all features with the target 'AER Classification'
font = {'size' : 5}
matplotlib.rc('font', **font)
corr_arr=corr.values[:,-1]
fig, ax1 = plt.subplots(figsize=(3, 3), dpi= 200, facecolor='w', edgecolor='k')
coefs=corr_arr[0:-1]; names=Colms_sh
r_ = pd.DataFrame( { 'coef': coefs, 'positive': coefs>=0 }, index = names )
r_ = r_.sort_values(by=['coef'])
r_['coef'].plot(kind='barh', color=r_['positive'].map({True: 'b', False: 'r'}))
ax1.set_xlabel('Correlation Coefficient',fontsize='7')
plt.xlim(-0.1,0.25)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2) # horizontal lines
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2) # horizontal lines
plt.show()
Since the target 'AER Classification' takes only the values 0 and 1, the linear correlation between the target and the features could be misleading: features may be correlated with the target non-linearly. Hence, we also consider the correlation between features; a variable may be non-linearly correlated with the target yet linearly correlated with other features. The correlation matrix between all features is computed below.
# Calculate the correlation matrix between features and the percentage of missing values for each pair
colmn=list(df_xy.columns)
corr=np.zeros((len(colmn),len(colmn)))
corr_nxy=np.zeros((len(colmn)-2,len(colmn)-2))
Per_NaN=np.zeros((len(colmn)-2,len(colmn)-2))
xy_dic_no_im={}
# we do not need x and y, these features are only used for location map plot
ll=0; itest=0
for l in range(len(colmn)):
tmpx=[] ;tmpy=[]
if(colmn[l]!='X Coordinate' and colmn[l]!='Y Coordinate'):
ll=ll+1; ll1=0
for l1 in range(len(colmn)):
if(colmn[l1]!='Y Coordinate' and colmn[l1]!='X Coordinate'):
ll1=ll1+1
New_pd=pd.concat([df_xy[[colmn[l]]],df_xy[[colmn[l1]]]],axis=1)
New_pd_=(New_pd.dropna(subset=[colmn[l],colmn[l1]])).to_numpy()
xy_dic_no_im[l,l1]=New_pd_[:,0],New_pd_[:,1]
tt=np.corrcoef(New_pd_[:,0],New_pd_[:,1])
#
if(l==l1): corr[l,l1]=tt[0,1]
else: corr[l,l1]=tt[0,1]
#
if(ll>=ll1 and colmn[l]!='X Coordinate' and colmn[l]!='Y Coordinate'
and colmn[l1]!='Y Coordinate' and colmn[l1]!='X Coordinate'):
if(ll==ll1): corr_tmp=tt[0,1]+0.05
else: corr_tmp=tt[0,1]
corr_nxy[ll1-1,ll-1]=np.nan
corr_nxy[ll-1,ll1-1]=corr_tmp
Per_NaN[ll1-1,ll-1]=100-(len(New_pd_[:,0])/len(df_xy))*100
Per_NaN[ll-1,ll1-1]=np.nan
Correlation Matrix before Imputation¶
font = {'size' : 11}
matplotlib.rc('font', **font)
fig, ax=plt.subplots(figsize=(8, 6), dpi= 200, facecolor='w', edgecolor='k')
# Matrix of correlation between features excluding x and y
try:
colmn.remove('Y Coordinate')
colmn.remove('X Coordinate')
except ValueError :
pass
#
im =ax.matshow(corr_nxy, cmap='jet', interpolation='nearest',vmin=-0.6, vmax=0.6)
#
cbaxes = fig.add_axes([0.32, 0.04, 0.38, 0.030])
fig.colorbar(im,cax=cbaxes,shrink=0.5,label='Correlation Coefficient',orientation='horizontal')
ax.set_xticks(np.arange(len(corr_nxy)))
ax.set_xticklabels(colmn,fontsize=6, rotation='vertical')
ax.set_yticks(np.arange(len(corr_nxy)))
ax.set_yticklabels(colmn,fontsize=6)
ax.grid(color='k', linestyle='-', linewidth=0.1)
# Missing matrix between features excluding x and y
im =ax.matshow(Per_NaN, cmap='Greys', interpolation='nearest',vmin=0, vmax=60)
cbaxe_ = fig.add_axes([0.83, 0.23, 0.023, 0.48])
fig.colorbar(im,cax=cbaxe_,shrink=0.7,label='Percentage of Missing Data \n for each Bivariate')
ax.set_xticks(np.arange(len(Per_NaN)))
ax.set_xticklabels(colmn,fontsize=9, rotation='vertical')
ax.set_yticks(np.arange(len(Per_NaN)))
ax.set_yticklabels(colmn,fontsize=9,rotation=45)
ax.grid(color='k', linestyle='-', linewidth=0.1)
#ax.title(' Matrix of Correlation and Missing \n Data before Imputation', fontsize=10,y=1.38)
#plt.text(-3,1055,' Matrix of Correlation and Missing Data before Imputation',
# fontsize=10,bbox=dict(facecolor='white', alpha=0.2))
fig.savefig('./Figures/Fig5.png', dpi=500, bbox_inches='tight')
plt.show()
Since there are missing values in the dataset, the correlation coefficients are computed after removing missing values pairwise for each pair of features. One triangle of the matrix shows the correlation between features; because the matrix is symmetric, the other triangle is used instead to show the percentage of missing data for each feature pair. There are very high linear correlations between some features.
Imputation¶
Removing all rows with missing values discards valuable information and can greatly reduce the performance of machine learning prediction. A naive imputation approach is to replace missing values with the mean or median of each variable; however, this does not account for uncertainty in the data. Other imputation techniques exist, but they are computationally expensive and time-consuming for large datasets.
We can apply a bootstrapping technique to quantify the uncertainty in the mean of each variable and then replace a missing value by randomly drawing from the distribution of the mean of that variable. This process can be repeated to impute all missing values. However, this approach does not respect the correlation between features, since each draw is taken independently from the distribution of the mean.
We propose conditional LU (Cholesky) simulation for imputation: missing values are replaced by draws that honour the uncertainty in the mean while respecting the correlation between features after imputation.
LU Conditional Simulation¶
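Imputation is performed by the cond_LU_Sim helper imported from functions.py. For intuition only, a simplified sketch of the underlying idea is given below: draw correlated Gaussian values through the Cholesky (LU) factor of the feature correlation matrix and use them to fill the missing entries. This sketch is unconditional and much cruder than the actual implementation (which, among other things, can condition on a reference dataset via ref_df).
def lu_impute_sketch(df_num, seed=900):
    """Simplified sketch of correlation-respecting imputation; NOT the cond_LU_Sim helper used below."""
    rng = np.random.default_rng(seed)
    mu, sd = df_num.mean(), df_num.std()
    corr_mat = df_num.corr().to_numpy()                                # pairwise-complete correlations
    L = np.linalg.cholesky(corr_mat + 1e-6*np.eye(corr_mat.shape[0]))  # small jitter for stability
    vals = df_num.to_numpy(dtype=float)
    for i in range(vals.shape[0]):
        z = L @ rng.standard_normal(corr_mat.shape[0])                 # correlated standard-normal draw
        sim = mu.to_numpy() + sd.to_numpy()*z                          # back-transform to original units
        mask = np.isnan(vals[i])
        vals[i, mask] = sim[mask]                                      # fill only the missing entries
    return pd.DataFrame(vals, columns=df_num.columns, index=df_num.index)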
%%time
df_xy_im=cond_LU_Sim(df_xy,seed=900)
Final=df_xy_im.values
Wall time: 33.5 s
# Calculate correlation after imputation
colm_=df_xy_im.columns
corr___=np.zeros((len(colm_)-2,len(colm_)-2))
Per_NaN_=np.zeros((len(colm_)-2,len(colm_)-2))
ll=0
for l in range(len(colm_)):
if(colm_[l]!='X Coordinate' and colm_[l]!='Y Coordinate'):
ll=ll+1; ll1=0
for l1 in range(len(colm_)):
if(colm_[l]!='X Coordinate' and colm_[l]!='Y Coordinate'
and colm_[l1]!='X Coordinate' and colm_[l1]!='Y Coordinate'):
ll1=ll1+1
tt=np.corrcoef(Final[:,l],Final[:,l1])
corr___[ll1-1,ll-1]=np.nan
corr___[ll-1,ll1-1]=tt[0,1]
no_nan=np.where(Final[:,l]==np.nan)
Per_NaN_[ll1-1,ll-1]=(len(no_nan)/len(Final[:,l]))*100
Per_NaN_[ll-1,ll1-1]=np.nan
Correlation Matrix after Imputation¶
# Plot correlation matrix after imputation
font = {'size' : 11}
matplotlib.rc('font', **font)
fig, ax=plt.subplots(figsize=(8, 6), dpi= 200, facecolor='w', edgecolor='k')
# Matrix of correlation between features excluding x and y
try:
colmn.remove('Y Coordinate')
colmn.remove('X Coordinate')
except ValueError :
pass
#
im =ax.matshow(corr___, cmap='jet', interpolation='nearest',vmin=-0.6, vmax=0.6)
#
cbaxes = fig.add_axes([0.32, 0.04, 0.38, 0.030])
fig.colorbar(im,cax=cbaxes,shrink=0.5,label='Correlation Coefficient',orientation='horizontal')
ax.set_xticks(np.arange(len(corr___)))
ax.set_xticklabels(colmn,fontsize=6, rotation='vertical')
ax.set_yticks(np.arange(len(corr___)))
ax.set_yticklabels(colmn,fontsize=6)
ax.grid(color='k', linestyle='-', linewidth=0.1)
# Missing matrix between features excluding x and y
im =ax.matshow(Per_NaN_, cmap='Greys', interpolation='nearest',vmin=0, vmax=60)
cbaxe_ = fig.add_axes([0.83, 0.23, 0.023, 0.48])
fig.colorbar(im,cax=cbaxe_,shrink=0.7,label='Percentage of Missing Data \n for each Bivariate')
ax.set_xticks(np.arange(len(Per_NaN)))
ax.set_xticklabels(colmn,fontsize=9, rotation='vertical')
ax.set_yticks(np.arange(len(Per_NaN)))
ax.set_yticklabels(colmn,fontsize=9,rotation=45)
ax.grid(color='k', linestyle='-', linewidth=0.1)
plt.text(-750,50,r"$\bf {Not\,Missing}$",fontsize=17)
fig.savefig('./Figures/Fig6.png', dpi=500, bbox_inches='tight')
plt.show()
font = {'size' :8.2 }
#
matplotlib.rc('font', **font)
fig=plt.figure(figsize=(7.3, 5.3), dpi= 200, facecolor='w', edgecolor='k')
markersize=4
col_1=2; col_2=5
ax1 = plt.subplot(2,2,1)
x_=xy_dic_no_im[col_1,col_2][0]; y_=xy_dic_no_im[col_1,col_2][1]
CrossPlot(x_,y_,0,None,None,title='Before Imputation',xlabl=colm_[col_1],
ylabl=colm_[col_2],loc=4,xlimt=(0,1100),ylimt=(0,6500),markersize=markersize,axt=ax1,scale=1.1,alpha=0.2,loc_1=2)
ax1 = plt.subplot(2,2,2)
df_xy_=df_xy.copy()
co1=colm_[col_1]
co2=colm_[col_2]
ind_na=df_xy.loc[(df_xy.isnull()[co1]) | (df_xy.isnull()[co2])].index
x__1=(df_xy_im[co1].iloc[ind_na]).to_numpy()
x__2=(df_xy_im[co2].iloc[ind_na]).to_numpy()
#ind_=df_xy.loc[~(df_xy.isnull()[co1]) & ~(df_xy.isnull()[co2])].index
x_=Final[:,col_1]; y_=Final[:,col_2]
x__nna=xy_dic_no_im[col_1,col_2][0]; y__nna=xy_dic_no_im[col_1,col_2][1]
CrossPlot(x_,y_,1,x__1,x__2,title='After Imputation',xlabl=colm_[col_1],
ylabl=colm_[col_2],loc=4,xlimt=(0,1100),
ylimt=(0,6500),markersize=markersize,x__nna=x__nna,y__nna=y__nna,axt=ax1,scale=1.1,alpha=0.2)
fig.tight_layout(pad=1.5)
plt.subplots_adjust(wspace=0.35)
fig.savefig('./Figures/Fig7.png', dpi=500, bbox_inches='tight')
Get Final Training Set¶
# Separate 'AER Classification' from the dataset as the target, and remove
# 'X Coordinate' and 'Y Coordinate' since they are used only for plotting
df_xy_im.drop(['X Coordinate','Y Coordinate'], axis=1, inplace=True)
#
class_=list(df_xy_im['AER Classification'])
df_xy_im.drop(['AER Classification'], axis=1, inplace=True)
columns_=list(df_xy_im.columns)
columns_RF=list(df_xy_im.columns)
x=df_xy_im
# Convert training and target to numpy
x= x.to_numpy()
y = np.int_(class_)
x_train,y_train=x,y
# Fraction of missing values in each row (passed as r3 to the over/undersampling below)
clm_all=train_set_strat.columns
miss_per=np.zeros((len(train_set_strat.index)))+np.nan
for i in train_set_strat.index:
miss_per[i]=len(train_set_strat.iloc[i][train_set_strat.iloc[i].isnull()])/len(clm_all)
# Standardize Test, Train and Validation Data
scaler = StandardScaler()
scaler.fit(x_train)
# Standardize Train data
std_x_train = scaler.transform(x_train)
# Make empty arrays for training metrics
Train_Accu=np.zeros((7))+np.nan
Train_Pre=np.zeros((7))+np.nan
Train_Rec=np.zeros((7))+np.nan
Train_Spe=np.zeros((7))+np.nan
# Make empty arrays for validation metrics
Valid_Accu=np.zeros((7))+np.nan
Valid_Pre=np.zeros((7))+np.nan
Valid_Rec=np.zeros((7))+np.nan
Valid_Spe=np.zeros((7))+np.nan
Get Final Validation and Test Set¶
Target Encoding with Statistics from Training¶
#vali_set_strat['Well Status']=vali_set_strat['Well Status'].map(dic_TE_1)
#
vali_set_strat['Prod./Inject. Formation'].fillna('NA',inplace=True)
vali_set_strat['Prod./Inject. Formation']=vali_set_strat['Prod./Inject. Formation'].map(dic_TE_2)
#
vali_set_strat['Production-Casing Grade']=vali_set_strat['Production-Casing Grade'].map(dic_TE_3)
#
vali_set_strat['Surface-Casing Grade']=vali_set_strat['Surface-Casing Grade'].map(dic_TE_4)
#
vali_set_strat['Surface Abandonment Type'].fillna('NA',inplace=True)
vali_set_strat['Surface Abandonment Type']=vali_set_strat['Surface Abandonment Type'].map(dic_TE_5)
#
vali_set_strat['Deviated Hole (T/F)'].fillna('NA',inplace=True)
vali_set_strat['Deviated Hole (T/F)']=vali_set_strat['Deviated Hole (T/F)'].map(dic_TE_6)
#
vali_set_strat['Horizontal Hole (T/F)'].fillna('NA',inplace=True)
vali_set_strat['Horizontal Hole (T/F)']=vali_set_strat['Horizontal Hole (T/F)'].map(dic_TE_7)
#
vali_set_strat['Well Status'].fillna('NA',inplace=True)
vali_set_strat['Well Status']= vali_set_strat['Well Status'].map(dic_TE_8)
#
vali_set_strat['Surface Abandonment Month'].fillna(0,inplace=True)
#test_set_strat['Well Status']=test_set_strat['Well Status'].map(dic_TE_1)
#
test_set_strat['Prod./Inject. Formation'].fillna('NA',inplace=True)
test_set_strat['Prod./Inject. Formation']=test_set_strat['Prod./Inject. Formation'].map(dic_TE_2)
#
test_set_strat['Production-Casing Grade']=test_set_strat['Production-Casing Grade'].map(dic_TE_3)
#
test_set_strat['Surface-Casing Grade']=test_set_strat['Surface-Casing Grade'].map(dic_TE_4)
#
test_set_strat['Surface Abandonment Type'].fillna('NA',inplace=True)
test_set_strat['Surface Abandonment Type']=test_set_strat['Surface Abandonment Type'].map(dic_TE_5)
#
test_set_strat['Deviated Hole (T/F)'].fillna('NA',inplace=True)
test_set_strat['Deviated Hole (T/F)']=test_set_strat['Deviated Hole (T/F)'].map(dic_TE_6)
#
test_set_strat['Horizontal Hole (T/F)'].fillna('NA',inplace=True)
test_set_strat['Horizontal Hole (T/F)']=test_set_strat['Horizontal Hole (T/F)'].map(dic_TE_7)
#
test_set_strat['Well Status'].fillna('NA',inplace=True)
test_set_strat['Well Status']= test_set_strat['Well Status'].map(dic_TE_8)
#
test_set_strat['Surface Abandonment Month'].fillna(0,inplace=True)
Imputation with Statistics from Training¶
%%time
# Imputation of validation and test set
vali_set_im=cond_LU_Sim(vali_set_strat,ref_df=df_xy,seed=78)
test_set_im=cond_LU_Sim(test_set_strat,ref_df=df_xy,seed=39)
Wall time: 16.6 s
# Separate 'AER Classification' from the dataset as the target, and remove
# 'X Coordinate' and 'Y Coordinate' since they are used only for plotting
vali_set_im.drop(['X Coordinate','Y Coordinate'], axis=1, inplace=True)
test_set_im.drop(['X Coordinate','Y Coordinate'], axis=1, inplace=True)
#
class_vali=list(vali_set_im['AER Classification'])
class_test=list(test_set_im['AER Classification'])
# Convert training and target to numpy
x_valid,y_valid=vali_set_im.drop(['AER Classification'], axis=1).to_numpy(),np.int_(class_vali)
#
x_test,y_test=test_set_im.drop(['AER Classification'], axis=1).to_numpy(),np.int_(class_test)
# Standardize Validation and test set using statistics from training data
std_x_valid = scaler.transform(x_valid)
std_x_test = scaler.transform(x_test)
Over- and Undersampling for the Training Set Only¶
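Resampling is applied only to the training set so that the validation and test sets keep the original (imbalanced) class distribution. The Over_Under_Sampling helper from functions.py combines oversampling of the minority class (serious) with undersampling of the majority class (non-serious) to reach a target minority-to-majority ratio; it also takes the feature correlation matrix (corr) and the per-row fraction of missing values (r3) into account. A bare-bones sketch of the resampling idea, using plain random resampling and ignoring those extra arguments, is:
def over_under_sample_sketch(x, y, target_ratio, over_frac=0.5, seed=34):
    """Sketch of combined over/undersampling; NOT the Over_Under_Sampling helper used below."""
    rng = np.random.default_rng(seed)
    idx_min = np.where(y == 1)[0]                                    # minority class (serious)
    idx_maj = np.where(y == 0)[0]                                    # majority class (non-serious)
    # Oversample the minority class by 'over_frac', then undersample the majority class
    # so that the final minority:majority ratio is approximately 'target_ratio'
    n_min = int(len(idx_min)*(1 + over_frac))
    n_maj = min(int(n_min/target_ratio), len(idx_maj))
    new_min = rng.choice(idx_min, size=n_min, replace=True)
    new_maj = rng.choice(idx_maj, size=n_maj, replace=False)
    idx = rng.permutation(np.concatenate([new_min, new_maj]))
    return x[idx], y[idx]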
x=std_x_train
y=y_train
corr=df_xy_im.corr().to_numpy()
%%time
# Combined undersampling and oversampling over a range of minority:majority ratios
r1_raw=pd.Series(y).value_counts()[1]/pd.Series(y).value_counts()[0]  # raw minority:majority ratio
ra_1=np.arange(0.01,0.45,0.02)+r1_raw
ra_2=np.arange(0.01,0.45,0.02)+r1_raw
x_over_under,y_over_under,ratio=[],[],[]
for ira in range(len(ra_1)):
x_temp,y_temp=Over_Under_Sampling(x,y,ind=2,r1=ra_1[ira],r2=ra_1[ira],corr=corr,seed=34,mr=True,r3=miss_per)
x_over_under.append(x_temp)
y_over_under.append(y_temp)
r_temp=pd.Series(y_temp).value_counts()[1]/pd.Series(y_temp).value_counts()[0]
ratio.append(r_temp)
# Empty arrays for over- and undersampling ratios
Train_Accu_ov_un=np.zeros((len(ra_1),6))+np.nan
Train_Pre_ov_un=np.zeros((len(ra_1),6))+np.nan
Train_Rec_ov_un=np.zeros((len(ra_1),6))+np.nan
Train_Spe_ov_un=np.zeros((len(ra_1),6))+np.nan
Train_AUC_ov_un=np.zeros((len(ra_1),6))+np.nan
# Empty arrays for cross-validation with resampling inside the folds
Train_Accu_ov_un_during=np.zeros((len(ra_1),6))+np.nan
Train_Pre_ov_un_during=np.zeros((len(ra_1),6))+np.nan
Train_Rec_ov_un_during=np.zeros((len(ra_1),6))+np.nan
Train_Spe_ov_un_during=np.zeros((len(ra_1),6))+np.nan
Train_AUC_ov_un_during=np.zeros((len(ra_1),6))+np.nan
# Make empty arrays for validation
valid_Accu_ov_un=np.zeros((len(ra_1),6))+np.nan
valid_Pre_ov_un=np.zeros((len(ra_1),6))+np.nan
valid_Rec_ov_un=np.zeros((len(ra_1),6))+np.nan
valid_Spe_ov_un=np.zeros((len(ra_1),6))+np.nan
valid_AUC_ov_un=np.zeros((len(ra_1),6))+np.nan
valid_prob=np.zeros((7))+np.nan
# Make empty arrays for test
test_Accu_ov_un=np.zeros((len(ra_1),6))+np.nan
test_Pre_ov_un=np.zeros((len(ra_1),6))+np.nan
test_Rec_ov_un=np.zeros((len(ra_1),6))+np.nan
test_Spe_ov_un=np.zeros((len(ra_1),6))+np.nan
test_AUC_ov_un=np.zeros((len(ra_1),6))+np.nan
test_prob=np.zeros((7))+np.nan
Dummy (i.e., Random) Classifier¶
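A dummy classifier that assigns labels at random according to the training-class frequencies serves as the baseline. Here and in the following sections, the Conf_Matrix helper from functions.py plots a confusion matrix and returns accuracy, precision, recall (sensitivity), and specificity. A minimal sketch of those metrics, assuming class probabilities (or hard labels) thresholded at 0.5, is:
def conf_metrics_sketch(y_true, y_pred_prob, threshold=0.5):
    """Sketch of the metrics returned by Conf_Matrix (plotting omitted); not the exact helper."""
    from sklearn.metrics import confusion_matrix
    p = y_pred_prob[:, -1] if y_pred_prob.ndim > 1 else y_pred_prob  # last column = positive class
    y_pred = (p >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn)/(tp + tn + fp + fn)
    precision = tp/(tp + fp) if (tp + fp) else 0.0
    recall = tp/(tp + fn) if (tp + fn) else 0.0                      # sensitivity
    specificity = tn/(tn + fp) if (tn + fp) else 0.0
    return accuracy, precision, recall, specificity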
i=0
# Train Model
dmy_clf = DummyClassifier(strategy="stratified",random_state=12)
dmy_clf.fit(std_x_train,y_train)
predictor=dmy_clf
# predict training and validation set
y_train_pred_dmy_clf=cross_val_predict(predictor,std_x_train,y_train, cv=3)
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
j=0
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_dmy_clf.reshape(-1,1),axt=ax1, title='Dummy Classifier')
Train_Accu[j]=acr
Train_Pre[j]=prec
Train_Rec[j]=reca
Train_Spe[j]=spec
ax1=plt.subplot(1,2,2)
y_valid_pred_dmy_clf=dmy_clf.predict_proba(std_x_valid)
j=0
acr, prec, reca, spec=Conf_Matrix(y_valid,y_valid_pred_dmy_clf,axt=ax1, title='Dummy Classifier')
Valid_Accu[j]=acr
Valid_Pre[j]=prec
Valid_Rec[j]=reca
Valid_Spe[j]=spec
Stochastic Gradient Descent¶
#sgd = SGDClassifier(loss='log',random_state=42)
#eta0=list(np.linspace(0.05,0.5, 10))
#params = {'penalty': ['l1','l2','elasticnet', 'none'],'eta0': eta0, 'early_stopping':[False,True]}
#sgd_search_cv = GridSearchCV(sgd, params, cv=3,scoring="recall")
#sgd_search_cv.fit(std_x_train,y_train)
#
## Optimum values for the hyperparameters are:
#sgd_search_cv.best_params_
print('Results from GridSearchCV:', {'early_stopping': True, 'eta0': 0.05, 'penalty': 'none'})
Results from GridSearchCV: {'early_stopping': True, 'eta0': 0.05, 'penalty': 'none'}
# Train Model
sgd= SGDClassifier(loss='log',early_stopping=True, eta0= 0.05, penalty='none',random_state=42)
predictor=sgd
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_dmy_clf.reshape(-1,1),axt=ax1,
title='a) Dummy Classifier',t_fontsize=7.5,t_y=1.23,
x_fontsize=6,y_fontsize=6)
ax1=plt.subplot(1,2,2)
y_train_pred_sgd=cross_val_predict(sgd,std_x_train,y_train, cv=3, method='predict_proba')
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_sgd,axt=ax1,
title='b) Stochastic Gradient Descent',t_fontsize=7.5,t_y=1.23,
x_fontsize=6,y_fontsize=6)
plt.subplots_adjust(wspace=0.3)
plt.savefig('./Figures/Fig8-1.png', dpi=500, bbox_inches='tight')
# Train Model
sgd= SGDClassifier(loss='log',early_stopping=True, eta0= 0.05, penalty='none',random_state=42)
predictor=sgd
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
j=1
# predict training and validation set
y_train_pred_sgd=cross_val_predict(sgd,std_x_train,y_train, cv=3, method='predict_proba')
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_sgd,axt=ax1,
title='Stochastic Gradient Descent (train)',t_fontsize=7.5,t_y=1.23,
x_fontsize=6,y_fontsize=6)
Train_Accu[j]=acr
Train_Pre[j]=prec
Train_Rec[j]=reca
Train_Spe[j]=spec
ax1=plt.subplot(1,2,2)
sgd.fit(std_x_train,y_train)
y_valid_pred_sgd=sgd.predict_proba(std_x_valid)
acr, prec, reca, spec=Conf_Matrix(y_valid,y_valid_pred_sgd,axt=ax1, title='Stochastic Gradient Descent (validation)')
Valid_Accu[j]=acr
Valid_Pre[j]=prec
Valid_Rec[j]=reca
Valid_Spe[j]=spec
plt.subplots_adjust(wspace=0.3)
#plt.savefig('./Figures/Fig8-1.png', dpi=500, bbox_inches='tight')
Logistic Regression¶
#log_reg = LogisticRegression(solver="lbfgs",random_state=42)
#log_reg.fit(std_x_train,y_train)
#
#params = {'penalty': ['l1','l2','elasticnet', 'none'], 'max_iter':[5,10,20,30,50],'C':[0.1,0.5,1,10,20,50]}
#lr_search_cv = GridSearchCV(log_reg, params, cv=5,scoring="recall")
#lr_search_cv.fit(std_x_train,y_train)
#
## Optimum values for the hyperparameters are:
#lr_search_cv.best_params_
print('Results from GridSearchCV:', {'C': 0.1, 'max_iter': 30, 'penalty': 'none'})
Results from GridSearchCV: {'C': 0.1, 'max_iter': 30, 'penalty': 'none'}
# Train Model
log_reg = LogisticRegression(max_iter=30, penalty='none',C=0.1,random_state=42)
log_reg.fit(std_x_train,y_train)
predictor=log_reg
# predict training and validation set
y_train_pred_log_reg=cross_val_predict(predictor,std_x_train,y_train, cv=3, method='predict_proba')
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
j=2
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_log_reg,axt=ax1, title='Logistic Regression')
Train_Accu[j]=acr
Train_Pre[j]=prec
Train_Rec[j]=reca
Train_Spe[j]=spec
ax1=plt.subplot(1,2,2)
log_reg.fit(std_x_train,y_train)
y_valid_pred_log_reg=log_reg.predict_proba(std_x_valid)
acr, prec, reca, spec=Conf_Matrix(y_valid,y_valid_pred_log_reg,axt=ax1, title='Logistic Regression (validation)')
Valid_Accu[j]=acr
Valid_Pre[j]=prec
Valid_Rec[j]=reca
Valid_Spe[j]=spec
plt.subplots_adjust(wspace=0.3)
# Train Model
log_reg = LogisticRegression(max_iter=30, penalty='none',C=0.1,random_state=42)
predictor=log_reg
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_dmy_clf.reshape(-1,1),axt=ax1,
title='a) Random Classifier',t_fontsize=7.5,t_y=1.23,
x_fontsize=6,y_fontsize=6)
ax1=plt.subplot(1,2,2)
y_train_pred_log_reg=cross_val_predict(log_reg,std_x_train,y_train, cv=3, method='predict_proba')
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_log_reg,axt=ax1,
title='b) Logistic Regression',t_fontsize=7.5,t_y=1.23,
x_fontsize=6,y_fontsize=6)
plt.subplots_adjust(wspace=0.3)
plt.savefig('./Figures/Fig9.png', dpi=500, bbox_inches='tight')
Support Vector Machine¶
# Train Model
svm = SVC(kernel='rbf',C=2,probability=True,random_state=42)
svm.fit(std_x_train,y_train)
predictor=svm
# predict training and validation set
y_train_pred_svm=cross_val_predict(predictor,std_x_train,y_train, cv=3, method='predict_proba')
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
j=3
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_svm,axt=ax1, title='Support Vector Machine')
Train_Accu[j]=acr
Train_Pre[j]=prec
Train_Rec[j]=reca
Train_Spe[j]=spec
ax1=plt.subplot(1,2,2)
svm.fit(std_x_train,y_train)
y_valid_pred_svm=svm.predict_proba(std_x_valid)
acr, prec, reca, spec=Conf_Matrix(y_valid,y_valid_pred_svm,axt=ax1, title='Support Vector Machine (validation)')
Valid_Accu[j]=acr
Valid_Pre[j]=prec
Valid_Rec[j]=reca
Valid_Spe[j]=spec
Adaptive Boosting with Decision Trees¶
#np.random.seed(42)
#tree =DecisionTreeClassifier()
#ada_tree = AdaBoostClassifier(tree ,random_state=42)
#params = {'n_estimators': [20,40,50,100,],'learning_rate': [0.05,0.25,0.5,0.7],
# 'algorithm':['SAMME', 'SAMME.R']}
#ada_tree_search_cv = GridSearchCV(ada_tree, params, cv=3,scoring="recall")
#ada_tree_search_cv.fit(std_x_train,y_train)
#
## Optimum values for the hyperparameters are:
#ada_tree_search_cv.best_params_
print('Results from GridSearchCV:', {'algorithm': 'SAMME', 'learning_rate': 0.05, 'n_estimators': 20})
Results from GridSearchCV: {'algorithm': 'SAMME', 'learning_rate': 0.05, 'n_estimators': 20}
tree =DecisionTreeClassifier()
ada_tree = AdaBoostClassifier(tree,n_estimators=30, algorithm= 'SAMME', learning_rate= 0.2, random_state=102)
ada_tree.fit(std_x_train,y_train)
predictor=ada_tree
# predict training and validation set
y_train_pred_ada_tree=cross_val_predict(predictor,std_x_train,y_train, cv=3, method='predict_proba')
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
j=4
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_ada_tree,axt=ax1, title='Adaptive Boosting (Training)')
Train_Accu[j]=acr
Train_Pre[j]=prec
Train_Rec[j]=reca
Train_Spe[j]=spec
ax1=plt.subplot(1,2,2)
ada_tree.fit(std_x_train,y_train)
y_valid_pred_ada_tree=ada_tree.predict_proba(std_x_valid)
acr, prec, reca, spec=Conf_Matrix(y_valid,y_valid_pred_ada_tree,axt=ax1, title='Adaptive Boosting (validation)')
Valid_Accu[j]=acr
Valid_Pre[j]=prec
Valid_Rec[j]=reca
Valid_Spe[j]=spec
Random Forest¶
# Random Forest
rnf = RandomForestClassifier( random_state=42)
rnf.fit(std_x_train,y_train)
predictor=rnf
pickle.dump(rnf, open(f'./Trained_models/rf_no_ratio.sav', 'wb'))
# predict training and validation set
y_train_pred_rnf=cross_val_predict(predictor,std_x_train,y_train, cv=5, method='predict_proba')
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
j=5
acr, prec, reca, spec=Conf_Matrix(y_train,y_train_pred_rnf,axt=ax1, title='Random Forest (Training)')
Train_Accu[j]=acr
Train_Pre[j]=prec
Train_Rec[j]=reca
Train_Spe[j]=spec
ax1=plt.subplot(1,2,2)
rnf.fit(std_x_train,y_train)
y_valid_pred_rnf=rnf.predict_proba(std_x_valid)
acr, prec, reca, spec=Conf_Matrix(y_valid,y_valid_pred_rnf,axt=ax1, title='Random Forest (validation)')
Valid_Accu[j]=acr
Valid_Pre[j]=prec
Valid_Rec[j]=reca
Valid_Spe[j]=spec
Importance of Features¶
list_imp=[]
nsim=100
for i in range(nsim):
rnf = RandomForestClassifier()
rnf.fit(std_x_train,y_train)
feature_importances = rnf.feature_importances_
tmp=sorted(zip(feature_importances, columns_RF), reverse=True)
list_imp.append(tmp)
# Calculate mean and variance of importance features for all runs
mean_=[] ; sd_=[]
col_=[]; val_=[]
for j in range(nsim):
tmp=list_imp[j]
tmp_val=[]; tmp_col=[]
for k in range(len(columns_RF)):
tmp_val.append(tmp[k][0])
tmp_col.append(tmp[k][1])
col_.append(tmp_col)
val_.append(tmp_val)
for k in range(len(columns_RF)):
tmp=[]
for i in range(nsim):
ind=col_[i].index(columns_RF[k])
tmp.append(val_[i][ind])
mean_.append(np.mean(tmp))
sd_.append(np.sqrt(np.var(tmp)))
# Sort features based on the percentage of importance
sort_array_mean=np.sort(mean_, axis=0)
sort_array_mean=sort_array_mean[::-1]
sort_colms=[]
sort_sd_=[]
for i in range(len(sort_array_mean)):
ind=mean_.index(sort_array_mean[i])
sort_colms.append(columns_RF[ind])
sort_sd_.append(sd_[ind])
#
font = {'size' : 8}
matplotlib.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 200, facecolor='w', edgecolor='k')
index1 = np.arange(len(sort_colms))
ax1.bar(index1, sort_array_mean,yerr=sort_sd_, align='center', alpha=1, ecolor='black', capsize=4,color='b')
#ax1.set_xlabel('Well Properties',fontsize='9.5')
plt.title('Importance of Well Properties to Predict Fluid Leakage for 100 Realizations of Random Forest',fontsize=9.5)
ax1.set_ylabel('Percentage of Importance (%)',fontsize='9.5')
ax1.set_xticks(np.arange(len(columns_RF)))
ax1.set_xticklabels(sort_colms,fontsize=8, rotation=90)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.xaxis.grid(color='r', linestyle='--', linewidth=0.4)
plt.savefig('./Figures/Fig12_Feature_Importance.pdf', dpi=500, bbox_inches='tight')
plt.show()
Under & Oversampling¶
def Ove_Und_During_Cross(x,y,model_tmp,ind,r1,r2,corr,seed=34,cv=3,
NN=False,batch_size=32,verbose=0,epochs=10):
"""
Implement K-fold cross-validation with over/undersampling applied only to the training folds
"""
from sklearn.base import clone
# Cross-validate
# Use StratifiedKFold for classification
kf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
oos_pred = []
y_real=[]
model_=[]
# StratifiedKFold requires y to stratify the folds
ir=1
for train, test in kf.split(x,y):
x_trn = x[train]
y_trn = y[train]
#
x_tst = x[test]
y_tst = y[test]
x_over_under,y_over_under=Over_Under_Sampling(x_trn,y_trn,ind,r1
,r2,corr,seed+ir,r3=None)
# Fit Model for over undersample data
if (NN==True):
model=DNN (init_mode='uniform', activation= 'relu',dropout_rate= False, neurons= 50 )
model.fit(x_over_under,y_over_under,batch_size=batch_size,verbose=verbose,epochs=epochs)
model_.append(model)
else:
model_tmp.random_state=seed+ir
model=clone(model_tmp)
model.fit(x_over_under,y_over_under)
model_.append(model)
# Predict on test fold
pred = list(np.ravel(np.array(model.predict(x_tst), dtype='float')))
oos_pred.append(pred)
y_real.append(y_tst)
ir+=1
oos_pred=np.concatenate(oos_pred).ravel()
y_real=np.concatenate(y_real).ravel()
return oos_pred,y_real,model_
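Resampling inside each cross-validation fold, as done in Ove_Und_During_Cross, matters: if over/undersampling is applied before the data are split into folds, duplicated minority samples can appear in both the training and test folds, leaking information and producing overly optimistic performance estimates. This is exactly the comparison made in the 'Sampling before and within Cross validation' results below.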
%%time
j=4
rnf = RandomForestClassifier(n_estimators=100, max_depth= 10, min_samples_split= 4, bootstrap= True, random_state=42)
predictor=rnf
AUC_t=[]
model__ratio=[]
for i in range(len(ra_1)):
# Cross Validation after Over-undersampling
y_train_pred=cross_val_predict(predictor,x_over_under[i], y_over_under[i], cv=5, method='predict_proba')
acr, prec, reca, spec=Conf_Matrix(y_over_under[i],y_train_pred,axt=ax1,plot=False)
Train_Accu_ov_un[i,j]=acr
Train_Pre_ov_un[i,j]=prec
Train_Rec_ov_un[i,j]=reca
Train_Spe_ov_un[i,j]=spec
fpr, tpr, thresold = roc_curve(y_over_under[i], y_train_pred[:,1])
Train_AUC_ov_un[i,j]=auc(fpr, tpr)
# Cross Validation during Over-undersampling
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,predictor,ind=2,r1=ra_1[i],r2=ra_2[i],
corr=corr,seed=34,cv=5)
model__ratio.append(model__)
acr, prec, reca, spec=Conf_Matrix(y_real,y_train_pred.reshape(-1,1),axt=ax1,plot=False)
Train_Accu_ov_un_during[i,j]=acr
Train_Pre_ov_un_during[i,j]=prec
Train_Rec_ov_un_during[i,j]=reca
Train_Spe_ov_un_during[i,j]=spec
fpr, tpr, thresold = roc_curve(y_real, y_train_pred)
Train_AUC_ov_un_during[i,j]=auc(fpr, tpr)
Wall time: 38min 23s
Neural Network¶
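The networks below are built with the DNN helper imported from functions.py, which returns a compiled Keras model for a given weight initializer, activation, dropout rate, and number of neurons. A minimal sketch of such a builder is shown here; the exact architecture in functions.py may differ.
def dnn_sketch(init_mode='uniform', activation='relu', dropout_rate=False, neurons=50, optimizer='Adam'):
    """Sketch of a Keras binary-classification model builder; NOT the exact DNN from functions.py."""
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(neurons, activation=activation, kernel_initializer=init_mode,
                                 input_shape=(std_x_train.shape[1],)))
    if dropout_rate:
        model.add(keras.layers.Dropout(dropout_rate))
    model.add(keras.layers.Dense(neurons, activation=activation, kernel_initializer=init_mode))
    model.add(keras.layers.Dense(1, activation='sigmoid'))           # probability of serious leakage
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model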
Optimum Hyperparameters¶
######## This cell may take a couple of hours to find optimum hyperparameters #######
#
## define the grid search parameters: different optimizers, initialization, and number of neurons
#param_grid = {'optimizer': ['RMSprop', 'Adagrad', 'Adam'],
# 'init_mode' : ['uniform', 'lecun_uniform','he_normal', 'he_uniform'],'neurons' : [20, 50, 100]}
#
## Run Keras Classifier
#model_NN = KerasClassifier(build_fn=DNN)
#model_NN. _estimator_type = "classifier"
#
## Apply Scikit Learn GridSearchCV
#grid = GridSearchCV(estimator=model_NN, param_grid=param_grid, cv=3,scoring='recall')
#
## Early stopping to avoid overfitting
#monitor= keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=1e-3,patience=5, verbose=0, mode='auto')
#grid_result = grid.fit(std_x_train,y_train,batch_size=32,validation_data=
# (std_x_Validation,y_Validation),callbacks=[monitor],
# verbose=0,epochs=1000)
#
## Define best hyperparameters
#predictor=grid_result.best_estimator_
#
## Summarize results
#print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
#means = grid_result.cv_results_['mean_test_score']
#stds = grid_result.cv_results_['std_test_score']
#params = grid_result.cv_results_['params']
#for mean, stdev, param in zip(means, stds, params):
# print("%f (%f) with: %r" % (mean, stdev, param))
#
print('Results from GridSearchCV:', {'init_mode': 'uniform', 'neurons': 50, 'optimizer': 'Adam'})
Results from GridSearchCV: {'init_mode': 'uniform', 'neurons': 50, 'optimizer': 'Adam'}
## Divid training data into smaller training set and validation set for early stopping.
#split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
#x=std_x_train
#y=y_train
#for train_index, test_index in split.split(x,y):
# Smaller_x, Smaller_y = x[train_index],y[train_index]
# Validation_x, Validation_y = x[test_index],y[test_index]
%%time
# Call function with the fine-tuned number of neurons
model_NN=DNN (init_mode='uniform', activation= 'relu',dropout_rate= False, neurons= 20 )
# Early stopping to avoid overfitting
monitor= keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=1e-5,patience=3, mode='auto')
history=model_NN.fit(std_x_train,y_train,batch_size=64,validation_data=
(std_x_valid,y_valid),callbacks=[monitor],verbose=0,epochs=1000)
Wall time: 4.66 s
# Summary of parameters and plotting
plot_NN(history, ylim_1=[0.33, 0.47], ylim_2=[0.11, 0.827])
Bootstrapping (Neural Network)¶
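The boots_epoch helper estimates a robust number of training epochs by repeatedly resplitting the training data, fitting with early stopping, and aggregating the epoch at which training stops. A rough sketch of that idea is given below; this is assumed behaviour (the actual helper takes a fitted model instance and may work differently), and the build_model argument is a hypothetical model factory such as lambda: DNN(...).
def boots_epoch_sketch(x, y, build_model, n_splits=50, min_delta=1e-5, patience=10, seed=42):
    """Sketch of bootstrapping the optimum number of epochs; NOT the boots_epoch helper used below."""
    stopped = []
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    for trn, val in splitter.split(x, y):
        model = build_model()                                        # fresh model for each resample
        monitor = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=min_delta,
                                                patience=patience, restore_best_weights=True)
        hist = model.fit(x[trn], y[trn], validation_data=(x[val], y[val]),
                         callbacks=[monitor], epochs=1000, verbose=0)
        stopped.append(len(hist.history['loss']))                    # epoch at which training stopped
    return int(np.mean(stopped))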
# Find the optimum number of epochs by bootstrapping
x=std_x_train
y=y_train
model_NN=DNN (init_mode='uniform', activation= 'relu',dropout_rate= False, neurons= 20 )
epoches=boots_epoch(x, y,model_NN,min_delta=1e-5,patience=10,n_splits=50)
print('Optimum number of epochs: ',epoches)
Optimum number of epochs: 10
# Apply K-Fold cross validation for Neural Network
model_NN=DNN (init_mode='uniform', activation= 'relu',dropout_rate= False, neurons= 20 )
# Training set
x=std_x_train
y=y_train
y_train_pred_DNN,y_real_NN=DNN_K_Fold_Cross(x,y,model_NN,epochs=epoches)
pred=model_NN.predict(std_x_train)
fpr, tpr, thresold = roc_curve(y_real_NN,y_train_pred_DNN)
auc(fpr, tpr)
0.727571240944521
font = {'size' : 6}
plt.rc('font', **font)
fig = plt.subplots(figsize=(5, 5), dpi= 250, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
j=6
acr, prec, reca, spec=Conf_Matrix(y_real_NN,y_train_pred_DNN.reshape(-1,1),axt=ax1, title='Training Set')
Train_Accu[j]=acr
Train_Pre[j]=prec
Train_Rec[j]=reca
Train_Spe[j]=spec
ax1=plt.subplot(1,2,2)
model_NN.fit(std_x_train,y_train)
y_valid_pred_model_NN=model_NN.predict_proba(std_x_valid)
acr, prec, reca, spec=Conf_Matrix(y_valid,y_valid_pred_model_NN,axt=ax1, title='Neural Network (Validation Set)')
Valid_Accu[j]=acr
Valid_Pre[j]=prec
Valid_Rec[j]=reca
Valid_Spe[j]=spec
plt.subplots_adjust(wspace=0.3)
617/617 [==============================] - 0s 740us/step - loss: 0.3830 - accuracy: 0.8474
Area Under Curve (AUC)¶
# Make empty arrays for the test set
Test_Accu=np.zeros((7))+np.nan
Test_Pre=np.zeros((7))+np.nan
Test_Rec=np.zeros((7))+np.nan
Test_Spe=np.zeros((7))+np.nan
################## Test set ##################
dmy_clf = DummyClassifier(strategy="stratified",random_state=12)
dmy_clf.fit(std_x_train,y_train)
y_test_pred_dmy_clf=dmy_clf.predict_proba(std_x_test)
j=0
acr, prec, reca, spec=Conf_Matrix(y_test,y_test_pred_dmy_clf,axt=None, plot=None)
Test_Accu[j]=acr
Test_Pre[j]=prec
Test_Rec[j]=reca
Test_Spe[j]=spec
#
sgd= SGDClassifier(loss='log',early_stopping=True, eta0= 0.05, penalty='none',random_state=42)
sgd.fit(std_x_train,y_train)
y_test_pred_sgd=sgd.predict_proba(std_x_test)
j=1
acr, prec, reca, spec=Conf_Matrix(y_test,y_test_pred_sgd,axt=None, plot=None)
Test_Accu[j]=acr
Test_Pre[j]=prec
Test_Rec[j]=reca
Test_Spe[j]=spec
#
log_reg = LogisticRegression(max_iter=30, penalty='none',C=0.1,random_state=42)
log_reg.fit(std_x_train,y_train)
y_test_pred_log_reg=log_reg.predict_proba(std_x_test)
j=2
acr, prec, reca, spec=Conf_Matrix(y_test,y_test_pred_log_reg,axt=None, plot=None)
Test_Accu[j]=acr
Test_Pre[j]=prec
Test_Rec[j]=reca
Test_Spe[j]=spec
#
svm = SVC(kernel='rbf',C=2,probability=True,random_state=42)
svm.fit(std_x_train,y_train)
y_test_pred_svm=svm.predict_proba(std_x_test)
j=3
acr, prec, reca, spec=Conf_Matrix(y_test,y_test_pred_svm,axt=None, plot=None)
Test_Accu[j]=acr
Test_Pre[j]=prec
Test_Rec[j]=reca
Test_Spe[j]=spec
#
tree =DecisionTreeClassifier()
ada_tree = AdaBoostClassifier(tree,n_estimators=30, algorithm= 'SAMME', learning_rate= 0.2, random_state=102)
ada_tree.fit(std_x_train,y_train)
y_test_pred_ada_tree=ada_tree.predict_proba(std_x_test)
j=4
acr, prec, reca, spec=Conf_Matrix(y_test,y_test_pred_ada_tree,axt=None, plot=None)
Test_Accu[j]=acr
Test_Pre[j]=prec
Test_Rec[j]=reca
Test_Spe[j]=spec
#
rnf = RandomForestClassifier(random_state=42)
rnf.fit(std_x_train,y_train)
y_test_pred_rnf=rnf.predict_proba(std_x_test)
j=5
acr, prec, reca, spec=Conf_Matrix(y_test,y_test_pred_rnf,axt=None, plot=None)
Test_Accu[j]=acr
Test_Pre[j]=prec
Test_Rec[j]=reca
Test_Spe[j]=spec
#
model_NN=DNN (init_mode='uniform', activation= 'relu',dropout_rate= False, neurons= 20 )
model_NN.fit(std_x_train,y_train)
y_test_pred_model_NN=model_NN.predict_proba(std_x_test)
j=6
acr, prec, reca, spec=Conf_Matrix(y_test,y_test_pred_model_NN,axt=None, plot=None)
Test_Accu[j]=acr
Test_Pre[j]=prec
Test_Rec[j]=reca
Test_Spe[j]=spec
617/617 [==============================] - 1s 812us/step - loss: 0.4323 - accuracy: 0.8342
font = {'size' : 12}
plt.rc('font', **font)
fig = plt.subplots(figsize=(13,12), dpi= 120, facecolor='w', edgecolor='k')
################## Training set ##################
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(2,2,1)
training_AUC=[]
# Random Classifier
y=y_train
pred=y_train_pred_dmy_clf
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
training_AUC.append(roc_auc)
plt.plot(fpr, tpr,'k--', linewidth=4,label='Dummy Classifier (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Stochastic Gradient Descent
y=y_train
pred=y_train_pred_sgd[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
training_AUC.append(roc_auc)
plt.plot(fpr, tpr,'c', linewidth=2,label='Stochastic Gradient Descent (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Logistic Regression
y=y_train
pred=y_train_pred_log_reg[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
training_AUC.append(roc_auc)
plt.plot(fpr, tpr,'b', linewidth=2,label='Logistic Regression (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Support Vector Machine
y=y_train
pred=y_train_pred_svm[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
training_AUC.append(roc_auc)
plt.plot(fpr, tpr,'y-', linewidth=2,label='Support Vector Machine (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Adaptive Boosting
y=y_train
pred=y_train_pred_ada_tree[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
training_AUC.append(roc_auc)
plt.plot(fpr, tpr,'m', linewidth=2,label='Adaptive Boosting (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Random Forest
y=y_train
pred=y_train_pred_rnf[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
training_AUC.append(roc_auc)
#plt.scatter(fpr, tpr,s=10,c=thresold,linewidths=0,cmap='gray',vmin=0, vmax=1)
plt.plot(fpr, tpr,'g', linewidth=2,label='Random Forest (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Neural Network
y=y_real_NN
pred=y_train_pred_DNN
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
training_AUC.append(roc_auc)
plt.plot(fpr, tpr,'r', linewidth=2,label='Deep Neural Network (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
#
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1-Specificity) FP/(FP+TN)',fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity) TP/(TP+FN)',fontsize=12)
plt.title('a) Training Set \n Receiver Operating Characteristic (ROC)',fontsize=15)
plt.grid(linewidth='0.25')
plt.legend(loc="lower right",fontsize=11)
################## Validation set ##################
Valid_AUC=[]
ax1=plt.subplot(2,2,2)
# Random Classifier
y=y_valid
pred=y_valid_pred_dmy_clf[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Valid_AUC.append(roc_auc)
plt.plot(fpr, tpr,'k--', linewidth=4,label='Dummy Classifier (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Stochastic Gradient Descent
y=y_valid
pred=y_valid_pred_sgd[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Valid_AUC.append(roc_auc)
plt.plot(fpr, tpr,'c', linewidth=2,label='Stochastic Gradient Descent (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Logistic Regression
y=y_valid
pred=y_valid_pred_log_reg[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Valid_AUC.append(roc_auc)
plt.plot(fpr, tpr,'b', linewidth=2,label='Logistic Regression (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Support Vector Machine
y=y_valid
pred=y_valid_pred_svm[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Valid_AUC.append(roc_auc)
plt.plot(fpr, tpr,'y-', linewidth=2,label='Support Vector Machine (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Adaptive Boosting
y=y_valid
pred=y_valid_pred_ada_tree[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Valid_AUC.append(roc_auc)
plt.plot(fpr, tpr,'m', linewidth=2,label='Adaptive Boosting (AUC = ' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Random Forest
y=y_valid
pred=y_valid_pred_rnf[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Valid_AUC.append(roc_auc)
#plt.scatter(fpr, tpr,s=10,c=thresold,linewidths=0,cmap='gray',vmin=0, vmax=1)
plt.plot(fpr, tpr,'g', linewidth=2,label='Random Forest (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Neural Network
y=y_valid
pred=y_valid_pred_model_NN
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Valid_AUC.append(roc_auc)
plt.plot(fpr, tpr,'r', linewidth=2,label='Deep Neural Network (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1-Specificity) FP/(FP+TN)',fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity) TP/(TP+FN)',fontsize=12)
plt.title('b) Validation Set \n Receiver Operating Characteristic (ROC)',fontsize=15)
plt.grid(linewidth='0.25')
plt.legend(loc="lower right",fontsize=11)
################## Test set ##################
Test_AUC=[]
ax1=plt.subplot(gs[1, 1:3])
# Random Classifier
y=y_test
pred=y_test_pred_dmy_clf[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Test_AUC.append(roc_auc)
plt.plot(fpr, tpr,'k--', linewidth=4,label='Dummy Classifier (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Stochastic Gradient Descent
y=y_test
pred=y_test_pred_sgd[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Test_AUC.append(roc_auc)
plt.plot(fpr, tpr,'c', linewidth=2,label='Stochastic Gradient Descent (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Logistic Regression
y=y_test
pred=y_test_pred_log_reg[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Test_AUC.append(roc_auc)
plt.plot(fpr, tpr,'b', linewidth=2,label='Logistic Regression (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Support Vector Machine
y=y_test
pred=y_test_pred_svm[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Test_AUC.append(roc_auc)
plt.plot(fpr, tpr,'y-', linewidth=2,label='Support Vector Machine (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Adaptive Boosting
y=y_test
pred=y_test_pred_ada_tree[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Test_AUC.append(roc_auc)
plt.plot(fpr, tpr,'m', linewidth=2,label='Adaptive Boosting (AUC = ' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Random Forest
y=y_test
pred=y_test_pred_rnf[:,1]
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Test_AUC.append(roc_auc)
#plt.scatter(fpr, tpr,s=10,c=thresold,linewidths=0,cmap='gray',vmin=0, vmax=1)
plt.plot(fpr, tpr,'g', linewidth=2,label='Random Forest (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
# Neural Network
y=y_test
pred=y_test_pred_model_NN
fpr, tpr, thresold = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
Test_AUC.append(roc_auc)
plt.plot(fpr, tpr,'r', linewidth=2,label='Deep Neural Network (AUC =' + r"$\bf{" + str(np.round(roc_auc,3)) + "}$"+')')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1-Specificity) FP/(FP+TN)',fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity) TP/(TP+FN)',fontsize=12)
plt.title('c) Test Set \n Receiver Operating Characteristic (ROC)',fontsize=15)
plt.grid(linewidth='0.25')
plt.legend(loc="lower right",fontsize=11)
plt.subplots_adjust(wspace=0.2)
plt.subplots_adjust(hspace=0.3)
plt.savefig('./Figures/Fig10_AUC_taining_Validation.pdf', dpi=500, bbox_inches='tight')
plt.show()
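The three ROC panels above repeat the same roc_curve/auc block for every model. A small helper such as the hypothetical one below (names are illustrative) would compute the AUC and draw one curve per call, which keeps the panels easier to maintain:
from sklearn.metrics import roc_curve, auc

def plot_roc(ax, y_true, scores, label, style, lw=2):
    """Plot one ROC curve on ax and return its AUC."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, style, linewidth=lw,
            label=f'{label} (AUC = ' + r"$\bf{" + f'{roc_auc:.3f}' + "}$" + ')')
    return roc_auc
# Example usage for the training panel:
# training_AUC.append(plot_roc(ax1, y_train, y_train_pred_rnf[:, 1], 'Random Forest', 'g'))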
Performance of the Models on Training Set¶
predictor= ['Dummy Classifier','Stochastic Gradient Descent','Logistic Regression',
'Support Vector Machine','Adaptive Boosting',
'Random Forest','Deep Neural Network']
font = {'size' :8.5}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(11, 6), dpi= 120, facecolor='w', edgecolor='k')
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(2,2,1)
ax1.plot(predictor,Train_Accu,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=6,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Train_Pre,'ys-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4,label='Precision',markeredgecolor='k')
ax1.plot(predictor,Train_Rec,'bo-',linewidth=2,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Train_Spe,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5,label='Specificity',markeredgecolor='k')
ax1.set_xticks(np.arange(len(predictor)))
ax1.set_xticklabels(predictor,y=0.03, rotation=30, fontsize=8)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
#plt.ylabel('Performance',fontsize=9)
plt.title('a) Performance of Models for Training Set (Cross Validation)',fontsize=10)
plt.ylim((-0.35,1.0))
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha=1)
plt.yticks([0,0.1,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,1],y=0, fontsize=8.2)
ax1.yaxis.set_ticks_position('both')
ax1=plt.subplot(2,2,2)
ax1.plot(predictor,Valid_Accu,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=6,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Valid_Pre,'ys-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4,label='Precision',markeredgecolor='k')
ax1.plot(predictor,Valid_Rec,'bo-',linewidth=2,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Valid_Spe,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5,label='Specificity',markeredgecolor='k')
ax1.set_xticks(np.arange(len(predictor)))
ax1.set_xticklabels(predictor,y=0.03, rotation=30, fontsize=8)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('b) Performance of Validation Set with Trained Models',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha=1)
legend.get_frame().set_edgecolor("black")
plt.yticks([0,0.1,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,1],y=0, fontsize=8.2)
ax1.yaxis.set_ticks_position('both')
plt.ylim((-0.35,1.0))
ax1=plt.subplot(gs[1, 1:3])
ax1.plot(predictor,Test_Accu,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=6,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Test_Pre,'ys-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4,label='Precision',markeredgecolor='k')
ax1.plot(predictor,Test_Rec,'bo-',linewidth=2,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Test_Spe,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5,label='Specificity',markeredgecolor='k')
ax1.set_xticks(np.arange(len(predictor)))
ax1.set_xticklabels(predictor,y=0.03, rotation=30, fontsize=8)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('c) Performance of Test Set with Trained Models',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha=1)
legend.get_frame().set_edgecolor("black")
plt.yticks([0,0.1,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,1],y=0, fontsize=8.2)
ax1.yaxis.set_ticks_position('both')
plt.ylim((-0.35,1.0))
plt.subplots_adjust(wspace=0.12)
plt.subplots_adjust(hspace=0.8)
plt.savefig('./Figures/Fig11_new_Performance_nosampling_wipre.pdf', dpi=500, bbox_inches='tight')
plt.show()
Random Forest: Sampling before and within Cross validation¶
font = {'size' :8.1 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 6), dpi= 200, facecolor='w', edgecolor='k')
j=4
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(gs[0, :2], )
ax1.plot(ratio,Train_Accu_ov_un[:,j],'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5.5,label='Accuracy',markeredgecolor='k')
ax1.plot(ratio,Train_Rec_ov_un[:,j],'bo-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Sensitivity',markeredgecolor='k')
ax1.plot(ratio,Train_Spe_ov_un[:,j],'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Specificity',markeredgecolor='k')
ax1.plot(ratio,Train_AUC_ov_un[:,j],'m>-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='AUC',markeredgecolor='k')
plt.xlabel(r'$Ratio=\frac{Class\,1}{Class\,0}$',fontsize=12)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.5)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('a) Resampling before Cross Validation (Training set)',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha =1)
legend.get_frame().set_edgecolor("black")
plt.ylim((0,1.0))
plt.yticks(np.arange(0,1.1,0.1))
plt.xticks(np.arange(0.25,2,0.1), rotation=90, fontsize=8,y=0.03)
ax1.yaxis.set_ticks_position('both')
ax1=plt.subplot(gs[0, 2:])
j=4
ax1.plot(ratio,Train_Accu_ov_un_during[:,j],'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5.5,label='Accuracy',markeredgecolor='k')
ax1.plot(ratio,Train_Rec_ov_un_during[:,j],'bo-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Sensitivity',markeredgecolor='k')
ax1.plot(ratio,Train_Spe_ov_un_during[:,j],'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Specificity',markeredgecolor='k')
ax1.plot(ratio,Train_AUC_ov_un_during[:,j],'m>-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='AUC',markeredgecolor='k')
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.5)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('b) Resampling Within Cross Validation (Training set)',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha =1)
legend.get_frame().set_edgecolor("black")
plt.ylim((0,1.0))
plt.yticks(np.arange(0,1.1,0.1))
plt.xticks(np.arange(0.25,2,0.1), rotation=90, fontsize=8,y=0.03)
ax1.yaxis.set_ticks_position('both')
plt.xlabel(r'$Ratio=\frac{Class\,1}{Class\,0}$',fontsize=12)
plt.subplots_adjust(wspace=0.25)
plt.subplots_adjust(hspace=0.55)
plt.savefig('./Figures/Fig11_Performance_overundersampling.pdf', dpi=500, bbox_inches='tight')
plt.show()
Model after Changing Ratio¶
%%time
iratio=15
############# Dummy Classifier #############
dmy_clf = DummyClassifier(strategy="stratified",random_state=12)
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,dmy_clf,ind=2,r1=ra_1[iratio],r2=ra_2[iratio],
corr=corr,seed=34,cv=5)
model__ratio_dmy=model__
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_dmy[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_valid,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
valid_Accu_ov_un_during_cv_dmy=acr
valid_Pre_ov_un_during_cv_dmy=prec
valid_Rec_ov_un_during_cv_dmy=reca
valid_Spe_ov_un_during_cv_dmy=spec
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting)
valid_AUC_ov_un_during_cv_dmy=auc(fpr, tpr)
############# Stochastic Gradient Descent #############
sgd= SGDClassifier(loss='log',early_stopping=True, eta0= 0.05, penalty='none',random_state=42)
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,sgd,ind=2,r1=ra_1[iratio],r2=ra_2[iratio],
corr=corr,seed=34,cv=5)
model__ratio_sgd=model__
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_sgd[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_valid,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
valid_Accu_ov_un_during_cv_sgd=acr
valid_Pre_ov_un_during_cv_sgd=prec
valid_Rec_ov_un_during_cv_sgd=reca
valid_Spe_ov_un_during_cv_sgd=spec
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting)
valid_AUC_ov_un_during_cv_sgd=auc(fpr, tpr)
############# Logistic Regression #############
log_reg = LogisticRegression(max_iter=30, penalty='none',C=0.1,random_state=42)
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,log_reg,ind=2,r1=ra_1[iratio],r2=ra_2[iratio],
corr=corr,seed=34,cv=5)
model__ratio_lr=model__
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_lr[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_valid,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
valid_Accu_ov_un_during_cv_lr=acr
valid_Pre_ov_un_during_cv_lr=prec
valid_Rec_ov_un_during_cv_lr=reca
valid_Spe_ov_un_during_cv_lr=spec
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting)
valid_AUC_ov_un_during_cv_lr=auc(fpr, tpr)
############# Support Vector Machine #############
svm = SVC(kernel='rbf',C=2,probability=True,random_state=42)
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,svm,ind=2,r1=ra_1[iratio],r2=ra_2[iratio],
corr=corr,seed=34,cv=5)
model__ratio_svm=model__
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_svm[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_valid,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
valid_Accu_ov_un_during_cv_svm=acr
valid_Pre_ov_un_during_cv_svm=prec
valid_Rec_ov_un_during_cv_svm=reca
valid_Spe_ov_un_during_cv_svm=spec
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting)
valid_AUC_ov_un_during_cv_svm=auc(fpr, tpr)
############# Adaptive Boosting #############
tree =DecisionTreeClassifier()
ada_tree = AdaBoostClassifier(tree,n_estimators=30, algorithm= 'SAMME', learning_rate= 0.2, random_state=102)
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,ada_tree,ind=2,r1=ra_1[iratio],r2=ra_2[iratio],
corr=corr,seed=34,cv=5)
model__ratio_ada_tree=model__
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_ada_tree[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_valid,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
valid_Accu_ov_un_during_cv_ada=acr
valid_Pre_ov_un_during_cv_ada=prec
valid_Rec_ov_un_during_cv_ada=reca
valid_Spe_ov_un_during_cv_ada=spec
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting)
valid_AUC_ov_un_during_cv_ada=auc(fpr, tpr)
############# Random Forest #############
rnf = RandomForestClassifier( random_state=42)
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,rnf,ind=2,r1=ra_1[iratio],r2=ra_2[iratio],
corr=corr,seed=34,cv=5)
model__ratio_rnf=model__
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_rnf[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_valid,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
valid_Accu_ov_un_during_cv_rnf=acr
valid_Pre_ov_un_during_cv_rnf=prec
valid_Rec_ov_un_during_cv_rnf=reca
valid_Spe_ov_un_during_cv_rnf=spec
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting)
valid_AUC_ov_un_during_cv_rnf=auc(fpr, tpr)
############# Neural Network #############
Neural_Net=DNN (init_mode='uniform', activation= 'relu',dropout_rate= False, neurons= 50 )
y_train_pred,y_real,model__=Ove_Und_During_Cross(std_x_train,y_train,Neural_Net,ind=2,NN=True,
                                                  r1=ra_1[iratio],r2=ra_2[iratio],corr=corr,seed=34,cv=5)
model__ratio_NN=model__
# Soft voting
Soft_voting=[]
pred=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_NN[ii].predict_proba(std_x_valid).tolist()
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_valid,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
valid_Accu_ov_un_during_cv_NN=acr
valid_Pre_ov_un_during_cv_NN=prec
valid_Rec_ov_un_during_cv_NN=reca
valid_Spe_ov_un_during_cv_NN=spec
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting)
valid_AUC_ov_un_during_cv_NN=auc(fpr, tpr)
Wall time: 19min 23s
pred_1=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
pred_2=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
pred_3=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
pred_4=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
pred_5=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
pred_6=np.array([], dtype=np.int64).reshape(len(std_x_valid),0)
for ii in range(len(model__)):
tmp=model__ratio_sgd[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred_1=np.concatenate((tmp, pred_1), axis=1)
#
tmp=model__ratio_lr[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred_2=np.concatenate((tmp, pred_2), axis=1)
#
    tmp=model__ratio_svm[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred_3=np.concatenate((tmp, pred_3), axis=1)
#
tmp=model__ratio_ada_tree[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred_4=np.concatenate((tmp, pred_4), axis=1)
#
tmp=model__ratio_rnf[ii].predict_proba(std_x_valid)[:,1].reshape(-1,1)
pred_5=np.concatenate((tmp, pred_5), axis=1)
#
tmp=model__ratio_NN[ii].predict_proba(std_x_valid).tolist()
pred_6=np.concatenate((tmp, pred_6), axis=1)
Soft_voting_1=[np.mean(i) for i in pred_1]
Soft_voting_2=[np.mean(i) for i in pred_2]
Soft_voting_3=[np.mean(i) for i in pred_3]
Soft_voting_4=[np.mean(i) for i in pred_4]
Soft_voting_5=[np.mean(i) for i in pred_5]
Soft_voting_6=[np.mean(i) for i in pred_6]
Soft_voting_all=[]
for i in range(len(Soft_voting_2)):
Soft_voting_all.append((Soft_voting_2[i]+Soft_voting_5[i])/2)
valid_Accu_sft, prec_sft, valid_Rec_sft, valid_Spe_sft=Conf_Matrix(y_valid,np.array(Soft_voting_all).reshape(-1,1),axt=ax1,plot=False)
fpr, tpr, thresold = roc_curve(y_valid, Soft_voting_all)
valid_AUC_sft=auc(fpr, tpr)
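The per-model loops above implement soft voting: the positive-class probabilities of the five fold models are averaged for every validation well, and the final score additionally averages the logistic-regression and random-forest votes. For scikit-learn style models this is equivalent to the short sketch below (the Keras models return a single probability column, so they would skip the [:, 1] indexing):
import numpy as np

def soft_vote(models, X):
    """Average the positive-class probability over a list of fitted fold models."""
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return probs.mean(axis=1)
# e.g. Soft_voting_2 is equivalent to soft_vote(model__ratio_lr, std_x_valid), and the combined score is
# 0.5*(soft_vote(model__ratio_lr, std_x_valid) + soft_vote(model__ratio_rnf, std_x_valid))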
font = {'size' :8.1 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 6), dpi= 200, facecolor='w', edgecolor='k')
predictor= ['Random Classifier','Stochastic Gradient Descent','Logistic Regression',
            'Support Vector Machine','Adaptive Boosting',
            'Random Forest','Deep Neural Network', 'Soft Voting']
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(gs[0, :2], )
Acuracy=(valid_Accu_ov_un_during_cv_dmy,valid_Accu_ov_un_during_cv_sgd,
valid_Accu_ov_un_during_cv_lr,valid_Accu_ov_un_during_cv_svm,
valid_Accu_ov_un_during_cv_ada,valid_Accu_ov_un_during_cv_rnf,
valid_Accu_ov_un_during_cv_NN,valid_Accu_sft)
#
Recall=(valid_Rec_ov_un_during_cv_dmy,valid_Rec_ov_un_during_cv_sgd,
valid_Rec_ov_un_during_cv_lr,valid_Rec_ov_un_during_cv_svm,
valid_Rec_ov_un_during_cv_ada,valid_Rec_ov_un_during_cv_rnf,
valid_Rec_ov_un_during_cv_NN,valid_Rec_sft)
#
Specificity=(valid_Spe_ov_un_during_cv_dmy,valid_Spe_ov_un_during_cv_sgd,
valid_Spe_ov_un_during_cv_lr,valid_Spe_ov_un_during_cv_svm,
valid_Spe_ov_un_during_cv_ada,valid_Spe_ov_un_during_cv_rnf,
valid_Spe_ov_un_during_cv_NN, valid_Spe_sft)
#
AUC_=(valid_AUC_ov_un_during_cv_dmy,valid_AUC_ov_un_during_cv_sgd,
valid_AUC_ov_un_during_cv_lr,valid_AUC_ov_un_during_cv_svm,
valid_AUC_ov_un_during_cv_ada,valid_AUC_ov_un_during_cv_rnf,
valid_AUC_ov_un_during_cv_NN,valid_AUC_sft)
#
ax1.plot(predictor,Acuracy,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5.5,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Recall,'bo-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Specificity,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Specificity',markeredgecolor='k')
ax1.plot(predictor,AUC_,'m>-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='AUC',markeredgecolor='k')
#plt.xlabel(r'$\frac{Class\,1\,(Serious\,Leakage)}{Class\,0\,(Non-Serious\,Leakage)}$',fontsize=12)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('a) Soft Voting for Validation Set \n Models within Cross Validation',fontsize=10.5)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha =1)
legend.get_frame().set_edgecolor("black")
plt.ylim((0,1.0))
plt.yticks(np.arange(0,1.1,0.1))
ax1.yaxis.set_ticks_position('both')
ax1.set_xticklabels(predictor,y=0.03, rotation=90, fontsize=8)
plt.subplots_adjust(wspace=0.25)
plt.subplots_adjust(hspace=0.55)
#plt.savefig('./Figures/Fig12_Performance_overundersampling_validation.pdf', dpi=500, bbox_inches='tight')
plt.show()
Test set¶
%%time
############# Dummy Classifier #############
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_dmy[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_test,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
test_Accu_ov_un_during_cv_dmy=acr
test_Pre_ov_un_during_cv_dmy=prec
test_Rec_ov_un_during_cv_dmy=reca
test_Spe_ov_un_during_cv_dmy=spec
fpr, tpr, thresold = roc_curve(y_test, Soft_voting)
test_AUC_ov_un_during_cv_dmy=auc(fpr, tpr)
############# Stochastic Gradient Descent #############
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_sgd[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_test,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
test_Accu_ov_un_during_cv_sgd=acr
test_Pre_ov_un_during_cv_sgd=prec
test_Rec_ov_un_during_cv_sgd=reca
test_Spe_ov_un_during_cv_sgd=spec
fpr, tpr, thresold = roc_curve(y_test, Soft_voting)
test_AUC_ov_un_during_cv_sgd=auc(fpr, tpr)
############# Logistic Regression #############
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_lr[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_test,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
test_Accu_ov_un_during_cv_lr=acr
test_Pre_ov_un_during_cv_lr=prec
test_Rec_ov_un_during_cv_lr=reca
test_Spe_ov_un_during_cv_lr=spec
fpr, tpr, thresold = roc_curve(y_test, Soft_voting)
test_AUC_ov_un_during_cv_lr=auc(fpr, tpr)
############# Support Vector Machine #############
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_svm[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_test,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
test_Accu_ov_un_during_cv_svm=acr
test_Pre_ov_un_during_cv_svm=prec
test_Rec_ov_un_during_cv_svm=reca
test_Spe_ov_un_during_cv_svm=spec
fpr, tpr, thresold = roc_curve(y_test, Soft_voting)
test_AUC_ov_un_during_cv_svm=auc(fpr, tpr)
############# Adaptive Boosting #############
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_ada_tree[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_test,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
test_Accu_ov_un_during_cv_ada=acr
test_Pre_ov_un_during_cv_ada=prec
test_Rec_ov_un_during_cv_ada=reca
test_Spe_ov_un_during_cv_ada=spec
fpr, tpr, thresold = roc_curve(y_test, Soft_voting)
test_AUC_ov_un_during_cv_ada=auc(fpr, tpr)
############# Random Forest #############
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_rnf[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_test,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
test_Accu_ov_un_during_cv_rnf=acr
test_Pre_ov_un_during_cv_rnf=prec
test_Rec_ov_un_during_cv_rnf=reca
test_Spe_ov_un_during_cv_rnf=spec
fpr, tpr, thresold = roc_curve(y_test, Soft_voting)
test_AUC_ov_un_during_cv_rnf=auc(fpr, tpr)
############# Neural Network #############
# Soft voting
Soft_voting=[]
pred=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_NN[ii].predict_proba(std_x_test).tolist()
pred=np.concatenate((tmp, pred), axis=1)
Soft_voting=[np.mean(i) for i in pred]
acr, prec, reca, spec=Conf_Matrix(y_test,np.array(Soft_voting).reshape(-1,1),axt=ax1,plot=False)
test_Accu_ov_un_during_cv_NN=acr
test_Pre_ov_un_during_cv_NN=prec
test_Rec_ov_un_during_cv_NN=reca
test_Spe_ov_un_during_cv_NN=spec
fpr, tpr, thresold = roc_curve(y_test, Soft_voting)
test_AUC_ov_un_during_cv_NN=auc(fpr, tpr)
Soft Voting¶
pred_1=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
pred_2=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
pred_3=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
pred_4=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
pred_5=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
pred_6=np.array([], dtype=np.int64).reshape(len(std_x_test),0)
for ii in range(len(model__)):
tmp=model__ratio_sgd[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred_1=np.concatenate((tmp, pred_1), axis=1)
#
tmp=model__ratio_lr[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pickle.dump(model__ratio_lr[ii], open(f'./Trained_models/lr_ratio_{ii+1}.sav', 'wb'))
pred_2=np.concatenate((tmp, pred_2), axis=1)
#
    tmp=model__ratio_svm[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred_3=np.concatenate((tmp, pred_3), axis=1)
#
tmp=model__ratio_ada_tree[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pred_4=np.concatenate((tmp, pred_4), axis=1)
#
tmp=model__ratio_rnf[ii].predict_proba(std_x_test)[:,1].reshape(-1,1)
pickle.dump(model__ratio_rnf[ii], open(f'./Trained_models/rf_ratio_{ii+1}.sav', 'wb'))
pred_5=np.concatenate((tmp, pred_5), axis=1)
#
tmp=model__ratio_NN[ii].predict_proba(std_x_test).tolist()
pred_6=np.concatenate((tmp, pred_6), axis=1)
Soft_voting_1=[np.mean(i) for i in pred_1]
Soft_voting_2=[np.mean(i) for i in pred_2]
Soft_voting_3=[np.mean(i) for i in pred_3]
Soft_voting_4=[np.mean(i) for i in pred_4]
Soft_voting_5=[np.mean(i) for i in pred_5]
Soft_voting_6=[np.mean(i) for i in pred_6]
Soft_voting_all=[]
for i in range(len(Soft_voting_2)):
Soft_voting_all.append((Soft_voting_2[i]+
Soft_voting_5[i])/2)
test_Accu_sft, prec_sft, test_Rec_sft, test_Spe_sft=Conf_Matrix(y_test,np.array(Soft_voting_all).reshape(-1,1),axt=ax1,plot=False)
fpr, tpr, thresold = roc_curve(y_test, Soft_voting_all)
test_AUC_sft=auc(fpr, tpr)
Final Plots¶
font = {'size' :8.1 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 6), dpi= 200, facecolor='w', edgecolor='k')
predictor= ['Stochastic Gradient Descent','Logistic Regression',
'Support Vector Machine','Adaptive Boosting',
'Random Forest','Deep Neural Network', 'Soft Voting']
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(gs[0, :2], )
Acuracy=(valid_Accu_ov_un_during_cv_sgd,
valid_Accu_ov_un_during_cv_lr,valid_Accu_ov_un_during_cv_svm,
valid_Accu_ov_un_during_cv_ada,valid_Accu_ov_un_during_cv_rnf,
valid_Accu_ov_un_during_cv_NN,valid_Accu_sft)
#
Recall=(valid_Rec_ov_un_during_cv_sgd,
valid_Rec_ov_un_during_cv_lr,valid_Rec_ov_un_during_cv_svm,
valid_Rec_ov_un_during_cv_ada,valid_Rec_ov_un_during_cv_rnf,
valid_Rec_ov_un_during_cv_NN,valid_Rec_sft)
#
Specificity=(valid_Spe_ov_un_during_cv_sgd,
valid_Spe_ov_un_during_cv_lr,valid_Spe_ov_un_during_cv_svm,
valid_Spe_ov_un_during_cv_ada,valid_Spe_ov_un_during_cv_rnf,
valid_Spe_ov_un_during_cv_NN, valid_Spe_sft)
#
AUC_=(valid_AUC_ov_un_during_cv_sgd,
valid_AUC_ov_un_during_cv_lr,valid_AUC_ov_un_during_cv_svm,
valid_AUC_ov_un_during_cv_ada,valid_AUC_ov_un_during_cv_rnf,
valid_AUC_ov_un_during_cv_NN,valid_AUC_sft)
#
ax1.plot(predictor,Acuracy,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5.5,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Recall,'bo-',linewidth=1,path_effects=[pe.Stroke(linewidth=2, foreground='k'), pe.Normal()],
markersize=4.5,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Specificity,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Specificity',markeredgecolor='k')
ax1.plot(predictor,AUC_,'m>-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='AUC',markeredgecolor='k')
#plt.xlabel(r'$\frac{Class\,1\,(Serious\,Leakage)}{Class\,0\,(Non-Serious\,Leakage)}$',fontsize=12)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('a) Validation Set Performance \n with Trained Model of '+ r'$\frac{Class\,1}{Class\,0}=1.27$',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha =1)
legend.get_frame().set_edgecolor("black")
plt.ylim((0.4,0.75))
plt.yticks(np.arange(0.4,0.75,0.05))
ax1.yaxis.set_ticks_position('both')
ax1.set_xticklabels(predictor,y=0.03, rotation=90, fontsize=8)
ax1=plt.subplot(gs[0, 2:])
Acuracy=(test_Accu_ov_un_during_cv_sgd,
test_Accu_ov_un_during_cv_lr,test_Accu_ov_un_during_cv_svm,
test_Accu_ov_un_during_cv_ada,test_Accu_ov_un_during_cv_rnf,
test_Accu_ov_un_during_cv_NN,test_Accu_sft)
#
Recall=(test_Rec_ov_un_during_cv_sgd,
test_Rec_ov_un_during_cv_lr,test_Rec_ov_un_during_cv_svm,
test_Rec_ov_un_during_cv_ada,test_Rec_ov_un_during_cv_rnf,
test_Rec_ov_un_during_cv_NN,test_Rec_sft)
#
Specificity=(test_Spe_ov_un_during_cv_sgd,
test_Spe_ov_un_during_cv_lr,test_Spe_ov_un_during_cv_svm,
test_Spe_ov_un_during_cv_ada,test_Spe_ov_un_during_cv_rnf,
test_Spe_ov_un_during_cv_NN, test_Spe_sft)
#
AUC_=(test_AUC_ov_un_during_cv_sgd,
test_AUC_ov_un_during_cv_lr,test_AUC_ov_un_during_cv_svm,
test_AUC_ov_un_during_cv_ada,test_AUC_ov_un_during_cv_rnf,
test_AUC_ov_un_during_cv_NN,test_AUC_sft)
#
ax1.plot(predictor,Acuracy,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5.5,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Recall,'bo-',linewidth=1,path_effects=[pe.Stroke(linewidth=2, foreground='k'), pe.Normal()],
markersize=4.5,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Specificity,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Specificity',markeredgecolor='k')
ax1.plot(predictor,AUC_,'m>-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='AUC',markeredgecolor='k')
#plt.xlabel(r'$\frac{Class\,1\,(Serious\,Leakage)}{Class\,0\,(Non-Serious\,Leakage)}$',fontsize=12)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('b) Test Set Performance \n with Trained Model of '+ r'$\frac{Class\,1}{Class\,0}=1.27$',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha =1)
legend.get_frame().set_edgecolor("black")
plt.ylim((0.4,0.75))
plt.yticks(np.arange(0.4,0.75,0.05))
ax1.yaxis.set_ticks_position('both')
ax1.set_xticklabels(predictor,y=0.03, rotation=90, fontsize=8)
plt.subplots_adjust(wspace=0.35)
plt.subplots_adjust(hspace=0.55)
plt.savefig('./Figures/Fig13_Performance_overundersampling_validation_Test.pdf', dpi=500, bbox_inches='tight')
plt.show()
font = {'size' :8.1 }
plt.rc('font', **font)
fig=plt.figure(figsize=(10, 6), dpi= 200, facecolor='w', edgecolor='k')
predictor= ['Dummy Classifier','Stochastic Gradient Descent','Logistic Regression',
            'Support Vector Machine','Adaptive Boosting',
'Random Forest','Deep Neural Network', 'Soft Voting']
gs = gridspec.GridSpec(2, 4)
ax1=plt.subplot(gs[0, :2], )
Acuracy=(valid_Accu_ov_un_during_cv_dmy,valid_Accu_ov_un_during_cv_sgd,
valid_Accu_ov_un_during_cv_lr,valid_Accu_ov_un_during_cv_svm,
valid_Accu_ov_un_during_cv_ada,valid_Accu_ov_un_during_cv_rnf,
valid_Accu_ov_un_during_cv_NN,valid_Accu_sft)
#
Recall=(valid_Rec_ov_un_during_cv_dmy,valid_Rec_ov_un_during_cv_sgd,
valid_Rec_ov_un_during_cv_lr,valid_Rec_ov_un_during_cv_svm,
valid_Rec_ov_un_during_cv_ada,valid_Rec_ov_un_during_cv_rnf,
valid_Rec_ov_un_during_cv_NN,valid_Rec_sft)
#
Specificity=(valid_Spe_ov_un_during_cv_dmy,valid_Spe_ov_un_during_cv_sgd,
valid_Spe_ov_un_during_cv_lr,valid_Spe_ov_un_during_cv_svm,
valid_Spe_ov_un_during_cv_ada,valid_Spe_ov_un_during_cv_rnf,
valid_Spe_ov_un_during_cv_NN, valid_Spe_sft)
#
AUC_=(valid_AUC_ov_un_during_cv_dmy,valid_AUC_ov_un_during_cv_sgd,
valid_AUC_ov_un_during_cv_lr,valid_AUC_ov_un_during_cv_svm,
valid_AUC_ov_un_during_cv_ada,valid_AUC_ov_un_during_cv_rnf,
valid_AUC_ov_un_during_cv_NN,valid_AUC_sft)
#
ax1.plot(predictor,Acuracy,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5.5,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Recall,'bo-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Specificity,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Specificity',markeredgecolor='k')
ax1.plot(predictor,AUC_,'m>-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='AUC',markeredgecolor='k')
#plt.xlabel(r'$\frac{Class\,1\,(Serious\,Leakage)}{Class\,0\,(Non-Serious\,Leakage)}$',fontsize=12)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('a) Performance of Validation Set \n with Trained Models of '+ r'$\frac{Class\,1}{Class\,0}=1.27$',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha =1)
legend.get_frame().set_edgecolor("black")
plt.ylim((0.3,0.85))
plt.yticks(np.arange(0.3,0.85,0.1))
ax1.yaxis.set_ticks_position('both')
ax1.set_xticklabels(predictor,y=0.03, rotation=30, fontsize=8)
ax1=plt.subplot(gs[0, 2:])
Acuracy=(test_Accu_ov_un_during_cv_dmy,test_Accu_ov_un_during_cv_sgd,
test_Accu_ov_un_during_cv_lr,test_Accu_ov_un_during_cv_svm,
test_Accu_ov_un_during_cv_ada,test_Accu_ov_un_during_cv_rnf,
test_Accu_ov_un_during_cv_NN,test_Accu_sft)
#
Recall=(test_Rec_ov_un_during_cv_dmy,test_Rec_ov_un_during_cv_sgd,
test_Rec_ov_un_during_cv_lr,test_Rec_ov_un_during_cv_svm,
test_Rec_ov_un_during_cv_ada,test_Rec_ov_un_during_cv_rnf,
test_Rec_ov_un_during_cv_NN,test_Rec_sft)
#
Specificity=(test_Spe_ov_un_during_cv_dmy,test_Spe_ov_un_during_cv_sgd,
test_Spe_ov_un_during_cv_lr,test_Spe_ov_un_during_cv_svm,
test_Spe_ov_un_during_cv_ada,test_Spe_ov_un_during_cv_rnf,
test_Spe_ov_un_during_cv_NN, test_Spe_sft)
#
AUC_=(test_AUC_ov_un_during_cv_dmy,test_AUC_ov_un_during_cv_sgd,
test_AUC_ov_un_during_cv_lr,test_AUC_ov_un_during_cv_svm,
test_AUC_ov_un_during_cv_ada,test_AUC_ov_un_during_cv_rnf,
test_AUC_ov_un_during_cv_NN,test_AUC_sft)
#
ax1.plot(predictor,Acuracy,'r*-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=5.5,label='Accuracy',markeredgecolor='k')
ax1.plot(predictor,Recall,'bo-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Sensitivity',markeredgecolor='k')
ax1.plot(predictor,Specificity,'gp-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='Specificity',markeredgecolor='k')
ax1.plot(predictor,AUC_,'m>-',linewidth=1,path_effects=[pe.Stroke(linewidth=1, foreground='k'), pe.Normal()],
markersize=4.5,label='AUC',markeredgecolor='k')
#plt.xlabel(r'$\frac{Class\,1\,(Serious\,Leakage)}{Class\,0\,(Non-Serious\,Leakage)}$',fontsize=12)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.8)
ax1.yaxis.grid(color='k', linestyle='--', linewidth=0.2)
plt.title('b) Performance of Test Set \n with Trained Models of '+ r'$\frac{Class\,1}{Class\,0}=1.27$',fontsize=10)
legend=plt.legend(ncol=2,loc=4,fontsize=8.5,framealpha =1)
legend.get_frame().set_edgecolor("black")
plt.ylim((0.3,0.85))
plt.yticks(np.arange(0.3,0.85,0.1))
ax1.yaxis.set_ticks_position('both')
ax1.set_xticklabels(predictor,y=0.03, rotation=30, fontsize=8)
plt.subplots_adjust(wspace=0.25)
plt.subplots_adjust(hspace=0.55)
plt.savefig('./Figures/Fig14_Performance_overundersampling_validation_Test.pdf', dpi=500, bbox_inches='tight')
plt.show()
Prediction of Serious Leakage for 1000 Random Wells without Field Test¶
def loc_plot(x_1, y_1,title,s1,c='g',axt=None,ylabel=True, marker='o',cbar=False,vmin=False,
vmax=False,alpha=0.6,ylime=(-30, 1270),save=False,title_fontsize=11,linewidths=0.3,legend=False,
             label=False,bbox_to_anchor=(0.25, 0.12),shrink=0.5):
    """ Function to plot a location map of wells in Alberta """
ax1 = axt or plt.axes()
Name=["" for x in range(6)]
Name[0:6]='Calgary','Fort McMurray','Edmonton','Rainbow',\
'Grande Prairie','Lloydminster'
ax1.spines['left'].set_visible(True)
ax1.spines['bottom'].set_visible(True)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
with open('./Data/Location names.txt') as f:
xn = []
yn = []
for line in f:
p = line.split()
tmp=utm.from_latlon(float(p[2]), float(p[1]),force_zone_number=12)
xn.append((float(tmp[0])-(-80563.0))/1000)
yn.append((float(tmp[1]-5428042.5))/1000)
#plt.plot(xn,yn,'*k',markersize=6.5,linewidth=1)
for i in range(5):
plt.text(xn[i], (yn[i]-18), Name[i],fontsize=7,color='y', fontweight='bold',style='oblique', ha='center',
va='top', wrap=True,bbox=dict(facecolor='k', alpha=1,pad=0))
# Clip to Alberta province
with open('./Data/Alberta.dat') as f:
path = []
Name = []
for line in f:
p = line.split()
Name.append(p[0])
tmp=utm.from_latlon(float(p[1]), float(p[0]),force_zone_number=12)
aa=[tmp[0]/1000-(-80563.0/1000),tmp[1]/1000-(5428042.5/1000)]
path.append(aa)
path1 = np.asarray(path)
path = Path(path1)
patch = PathPatch(path, facecolor='none',linewidth=0.8)
plt.gca().add_patch(patch)
if label:
sc=ax1.scatter(x_1, y_1,s=s1,c=c, marker=marker,vmin=vmin, vmax=vmax,edgecolors='k',
cmap='jet',linewidths=linewidths,alpha=alpha,label=label)
else:
sc=ax1.scatter(x_1, y_1,s=s1,c=c, marker=marker,vmin=vmin, vmax=vmax,edgecolors='k',
cmap='jet',linewidths=linewidths,alpha=alpha)
if(cbar):
fig.colorbar(sc,shrink=shrink,label='Probability of Serious Leakage',orientation='horizontal')
if(legend):
plt.legend(loc=9,bbox_to_anchor=(0.25, 0.12), ncol=1,fontsize=8.5,markerscale=1.2)
plt.title(title,fontsize=title_fontsize)
plt.xlabel('Easting (km)',fontsize='9.5')
if(ylabel): plt.ylabel('Northing (km)',fontsize='9.5')
plt.gca().set_aspect('0.9')
plt.xlim((-30, 660)) # set the xlim to left, right
    plt.ylim(ylime) # set the ylim to bottom, top
if(save): fig.savefig(save, dpi=500, bbox_inches='tight')
# Reading Data
df_xy_test = pd.read_csv("./Data/Wells_Test.csv",na_values=['NA','?',' '])
drop_colmn=['X Coordinate','Y Coordinate'] # x and y coordinate is used only for plotting
#Shuffle data to avoid any artifact
np.random.seed(52)
df_xy_test=df_xy_test.reindex(np.random.permutation(df_xy_test.index))
df_xy_test.reset_index(inplace=True, drop=True)
# Rename some columns to be more descriptive.
df_xy_test=df_xy_test.rename(columns={"MD (All Wells) (m)":"Measured Depth (m)",
                                      "Surf-Hole Easting (NAD83)": "X Coordinate","Surf-Hole Northing (NAD83)": "Y Coordinate",
"Rm": "Rheology Modifier (RM)","BH Temp. (degC)": "Borehole Temperature (degC)",
'Cumulative WTR Prod. (m3)':'Cumulative WATER Prod. (m3)',
'Prod./Inject. Frmtn': 'Prod./Inject. Formation', 'Well Status Abrv': 'Well Status',
'No_well':'Well Density (n/10km X 10km)', 'Surface Abandonment Date': 'Surface Abandonment Month',
'Total Production Hrs':'Total Production Hours', 'Class': 'AER Classification'})
#
df_xy_test=df_xy_test[['Deviated Hole (T/F)','Horizontal Hole (T/F)','Surface-Casing Depth (m)','Surface-Casing Size (mm)'
,'Surface-Casing Weight (kg/m)','Production-Casing Depth (m)','Production-Casing Size (mm)'
,'Production-Casing Weight (kg/m)','Production-Casing Grade','Surface-Casing Grade'
,'Measured Depth (m)','Borehole Temperature (degC)','Prod./Inject. Formation','Well Status',
'Month Well Spudded','Well Density (n/10km X 10km)','Surface Abandonment Type','Surface Abandonment Month',
'Cumulative GAS Prod. (e3m3)','Cumulative OIL Prod. (m3)','Cumulative WATER Prod. (m3)',
'Total Production Month','X Coordinate','Y Coordinate']]
# one with x and y, and another without x and y
df_test=df_xy_test.copy()
drop_colmn=['X Coordinate','Y Coordinate'] # x and y coordinate is used only for plotting
df_test['Surface Abandonment Month'].fillna(0,inplace=True)
df_test.drop(drop_colmn, axis=1, inplace=True)
df_test
Deviated Hole (T/F) | Horizontal Hole (T/F) | Surface-Casing Depth (m) | Surface-Casing Size (mm) | Surface-Casing Weight (kg/m) | Production-Casing Depth (m) | Production-Casing Size (mm) | Production-Casing Weight (kg/m) | Production-Casing Grade | Surface-Casing Grade | Measured Depth (m) | Borehole Temperature (degC) | Prod./Inject. Formation | Well Status | Month Well Spudded | Well Density (n/10km X 10km) | Surface Abandonment Type | Surface Abandonment Month | Cumulative GAS Prod. (e3m3) | Cumulative OIL Prod. (m3) | Cumulative WATER Prod. (m3) | Total Production Month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | F | F | 97.6 | 177.8 | 25.3 | 455.0 | 114.3 | 14.1 | J055 | H040 | 455.0 | NaN | TRchly_lk | AG | 386.0 | 7 | Cement And Plate | 0.0 | 563.0 | 0.0 | 65.0 | 0.0 |
1 | F | F | 225.0 | 219.1 | 41.6 | 526.3 | 114.3 | 17.3 | P110 | J055 | 526.3 | NaN | Kvik_ss | D&A | 337.0 | 5 | NaN | 0.0 | NaN | NaN | NaN | 215.0 |
2 | F | F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 107.7 | NaN | Dslave_pt | D&A | 444.0 | 7 | Other | 0.0 | NaN | NaN | NaN | 403.0 |
3 | F | T | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2731.0 | NaN | NaN | SBO | 505.0 | 0 | NaN | 505.0 | 9885.0 | 30205.0 | 18203.0 | NaN |
4 | F | F | 145.0 | 177.8 | 25.3 | 776.0 | 114.3 | 14.1 | J055 | H040 | 776.0 | NaN | NaN | CBM | 227.0 | 11 | NaN | 227.0 | 7162.0 | 0.0 | 58.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19995 | F | F | 163.0 | 177.8 | 29.8 | 1111.0 | 114.3 | 14.1 | J055 | J055 | 1111.0 | NaN | Kellrslie | CMM | 275.0 | 10 | NaN | 0.0 | NaN | NaN | NaN | 99.0 |
19996 | F | F | 181.0 | 177.8 | 29.8 | 1493.0 | 114.3 | 14.1 | J055 | J055 | 1493.0 | 41.0 | NaN | CBM | 415.0 | 6 | NaN | 414.0 | 830.0 | 0.0 | 32.0 | NaN |
19997 | F | F | 361.0 | 244.5 | 53.6 | NaN | NaN | NaN | NaN | J055 | 2257.0 | 54.0 | Jrock_ck | D&A | 232.0 | 0 | Other | 0.0 | NaN | NaN | NaN | 51.0 |
19998 | F | F | 140.8 | 244.5 | NaN | NaN | NaN | NaN | NaN | NaN | 1202.4 | 43.0 | NaN | D&A | 374.0 | 7 | Other | 55.0 | NaN | NaN | NaN | NaN |
19999 | F | F | 193.2 | 219.1 | NaN | 1280.2 | 114.3 | NaN | NaN | NaN | 1286.3 | 37.0 | NaN | CMM | 180.0 | 3 | NaN | 180.0 | NaN | NaN | NaN | NaN |
20000 rows × 22 columns
#vali_set_strat['Well Status']=vali_set_strat['Well Status'].map(dic_TE_1)
#
df_test['Prod./Inject. Formation'].fillna('NA',inplace=True)
df_test['Prod./Inject. Formation']=df_test['Prod./Inject. Formation'].map(dic_TE_2)
#
df_test['Production-Casing Grade']=df_test['Production-Casing Grade'].map(dic_TE_3)
#
df_test['Surface-Casing Grade']=df_test['Surface-Casing Grade'].map(dic_TE_4)
#
df_test['Surface Abandonment Type'].fillna('NA',inplace=True)
df_test['Surface Abandonment Type']=df_test['Surface Abandonment Type'].map(dic_TE_5)
#
df_test['Deviated Hole (T/F)'].fillna('NA',inplace=True)
df_test['Deviated Hole (T/F)']=df_test['Deviated Hole (T/F)'].map(dic_TE_6)
#
df_test['Horizontal Hole (T/F)'].fillna('NA',inplace=True)
df_test['Horizontal Hole (T/F)']=df_test['Horizontal Hole (T/F)'].map(dic_TE_7)
#
df_test['Well Status'].fillna('NA',inplace=True)
df_test['Well Status']= df_test['Well Status'].map(dic_TE_8)
#
df_test['Surface Abandonment Month'].fillna(0,inplace=True)
xy_test=df_xy_test[['X Coordinate','Y Coordinate']]
nsim=1
df_test_im=nsim*[None]
for isim in range(nsim):
df_test_im[isim]=cond_LU_Sim(df_test,ref_df=df_xy.drop(['X Coordinate',
'Y Coordinate','AER Classification'],axis=1),seed=isim+89)
# Load models
model__ratio_lr_load=[]
model__ratio_rf_load=[]
for ii in range(5):
fname=f'./Trained_models/lr_ratio_{ii+1}.sav'
model__ratio_lr_load.append(pickle.load(open(fname, 'rb')))
fname=f'./Trained_models/rf_ratio_{ii+1}.sav'
model__ratio_rf_load.append(pickle.load(open(fname, 'rb')))
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(df_test),0)
columns=[]
for ii in range(5):
for jj in range(nsim):
model__ratio_lr_load[ii].random_state=34+ii+jj
tmp_model=model__ratio_lr_load[ii]
tmp=tmp_model.predict_proba(df_test_im[jj])[:,1].reshape(-1,1)
pred=np.concatenate((pred,tmp), axis=1)
columns.append('lr')
#
model__ratio_rf_load[ii].random_state=44+ii+jj
tmp_model=model__ratio_rf_load[ii]
tmp=tmp_model.predict_proba(df_test_im[jj])[:,1].reshape(-1,1)
pred=np.concatenate((pred,tmp), axis=1)
columns.append('rf')
sim_result=pd.DataFrame(pred,columns=columns)
#sim_result
Mean=sim_result.apply(lambda x: x.mean(), axis=1)
df_xy_test['Classification']=Mean
prob_gt_83=list(df_xy_test[df_xy_test['Classification']>0.83].index)
prob_80_83=list(df_xy_test[(df_xy_test['Classification']>=0.8) & (df_xy_test['Classification']<0.83)].index)
prob_70_80=list(df_xy_test[(df_xy_test['Classification']>=0.7) & (df_xy_test['Classification']<0.80)].index)
prob_60_70=list(df_xy_test[(df_xy_test['Classification']>=0.6) & (df_xy_test['Classification']<0.7)].index)
prob_40_60=list(df_xy_test[(df_xy_test['Classification']>=0.4) & (df_xy_test['Classification']<0.6)].index)
prob_30_40=list(df_xy_test[(df_xy_test['Classification']>=0.3) & (df_xy_test['Classification']<0.4)].index)
prob_lt_27=list(df_xy_test[df_xy_test['Classification']<0.27].index)
all_index=prob_gt_83[:50]+prob_80_83[:173]+prob_70_80[:80]+prob_60_70[:80]+\
prob_40_60[:295]+prob_30_40[:310]+prob_lt_27
Final Result¶
df_test=df_test.iloc[all_index]
df_test.reset_index(inplace=True, drop=True)
#
df_xy_test=df_xy_test.iloc[all_index]
df_xy_test.reset_index(inplace=True, drop=True)
font = {'size' : 9}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(6,5), dpi= 200, facecolor='w', edgecolor='k')
ax1=plt.subplot(1,2,1)
Mean=sim_result.apply(lambda x: x.mean(), axis=1)
loc_plot(df_xy_test["X Coordinate"]/1000, df_xy_test["Y Coordinate"]/1000,
'1000 Random Wells \n without Field Test in Alberta',cbar=False,vmin=0, vmax=1,
s1=35,c='g',axt=ax1,ylabel=True,alpha=0.6,save='./Figures/Fig15.png')
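The next cell tallies the number and percentage of missing values per feature with explicit loops; as a cross-check, pandas can produce the same summary directly (a short sketch using the df_test subset built above):
missing = pd.DataFrame({'count': df_test.isna().sum(),
                        'percent': df_test.isna().mean()*100}).sort_values('percent')
print(missing)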
#Calculate number of missing data and percentage for each feature
per_non_nan=[]
non_nan_=[]
Colms=df_test.columns
for i in range(len(Colms)):
non_nan=df_test[Colms[i]].count()
non_nan_.append(len(df_test)-non_nan)
per_non_nan.append(float(100-(non_nan/len(df_test))*100))
#Sort percentage of missing data and columns
sort_array=np.sort(per_non_nan)
sort_colms=[]
uniqu=list(np.unique(per_non_nan))
for i in range(len(uniqu)):
ind=[j for j, x in enumerate(per_non_nan) if x == uniqu[i]]
for k in ind:
sort_colms.append(Colms[k])
#Sort number of missing data and columns
sort_non_nan_=np.sort(non_nan_)
sort_non_nan__=[]
uniqu=list(np.unique(non_nan_))
for i in range(len(uniqu)):
ind=[j for j, x in enumerate(non_nan_) if x == uniqu[i]]
for k in ind:
sort_non_nan__.append(Colms[k])
#Plot barchart with two vertical axis
font = {'size' : 12}
matplotlib.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(8, 4), dpi= 100, facecolor='w', edgecolor='k')
ax2 = ax1.twinx() # Create another axes that shares the same x-axis
index1 = np.arange(len(sort_colms))
index2 = np.arange(len(sort_non_nan__))
ax1.bar(index1, sort_array,color='b')
ax2.bar(index2, sort_non_nan_,color='b')
plt.title(f'a) Missing Data for {len(df_xy_test)} Random Wells in Alberta',fontsize=16)
ax1.set_ylabel('Percentage of Missing Data (%)',fontsize=11)
ax2.set_ylabel('Number of Missing Data',fontsize=11)
ax1.set_xticks(np.arange(len(Colms)))
ax1.set_xticklabels(sort_colms,fontsize=10, rotation=90)
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.4) # horizontal lines
fig.savefig('./Figures/Fig16.png', dpi=500, bbox_inches='tight')
plt.show()
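The 100 imputation realizations generated next come from cond_LU_Sim (functions.py), which fills missing values by simulating from the Cholesky (LU) factorization of the feature correlation matrix, as described in the paper. A simplified sketch of that building block, leaving out the conditioning on observed values and the normal-score back-transform:
import numpy as np

def correlated_gaussian_realization(corr, n_samples, seed=0):
    """Draw correlated standard-normal samples from a correlation matrix C = L L^T via Cholesky."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)                   # lower-triangular Cholesky factor
    z = rng.standard_normal((corr.shape[0], n_samples))
    return (L @ z).T                               # one row per sample, one column per feature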
nsim=100
df_test_im=nsim*[None]
for isim in range(nsim):
df_test_im[isim]=cond_LU_Sim(df_test,ref_df=df_xy.drop(['X Coordinate',
'Y Coordinate','AER Classification'],axis=1),seed=isim+89)
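For context, cond_LU_Sim (imported from functions.py) draws each realization by LU (Cholesky) simulation of the feature correlation matrix, presumably using ref_df as the reference data for that correlation structure, so that missing entries are filled while observed entries are preserved. The snippet below is only a minimal, simplified sketch of this idea for numeric features; simple_lu_impute and everything inside it are illustrative names, not the actual implementation in functions.py.
import numpy as np
import pandas as pd

def simple_lu_impute(df, seed=0):
    """Illustrative LU (Cholesky) simulation imputation for a numeric data frame."""
    rng = np.random.default_rng(seed)
    z = (df - df.mean()) / df.std()                  # standardize each feature
    corr = z.corr().to_numpy()                       # pairwise feature correlations (NaNs ignored)
    corr += 1e-6 * np.eye(len(corr))                 # small jitter to keep the matrix positive definite
    L = np.linalg.cholesky(corr)                     # correlation matrix C = L @ L.T
    sims = (L @ rng.standard_normal((len(corr), len(df)))).T   # correlated N(0,1) realizations
    z_imp = z.to_numpy()
    mask = np.isnan(z_imp)
    z_imp[mask] = sims[mask]                         # replace only the missing entries
    return pd.DataFrame(z_imp, index=df.index, columns=df.columns) * df.std() + df.mean()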
# Load models
model__ratio_lr_load=[]
model__ratio_rf_load=[]
for ii in range(5):
fname=f'./Trained_models/lr_ratio_{ii+1}.sav'
model__ratio_lr_load.append(pickle.load(open(fname, 'rb')))
fname=f'./Trained_models/rf_ratio_{ii+1}.sav'
model__ratio_rf_load.append(pickle.load(open(fname, 'rb')))
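The five logistic regression and five random forest classifiers loaded here were trained earlier (presumably with different resampling ratios, given the file names) and serialized with pickle; saving them would have looked roughly like the sketch below, where model_lr_ratio and model_rf_ratio are hypothetical lists of the fitted estimators.
import pickle

for ii in range(5):
    # serialize each fitted model to the Trained_models folder
    with open(f'./Trained_models/lr_ratio_{ii+1}.sav', 'wb') as f:
        pickle.dump(model_lr_ratio[ii], f)
    with open(f'./Trained_models/rf_ratio_{ii+1}.sav', 'wb') as f:
        pickle.dump(model_rf_ratio[ii], f)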
# Soft voting
pred=np.array([], dtype=np.int64).reshape(len(df_test_im[0]),0)
pred_real=np.array([], dtype=np.int64).reshape(len(df_test_im[0]),0)
columns=[]
columns_real=[]
for jj in range(nsim):
pred_real_tmp=np.array([], dtype=np.int64).reshape(len(df_test_im[0]),0)
for ii in range(5):
model__ratio_lr_load[ii].random_state=34+ii+jj
tmp_model=model__ratio_lr_load[ii]
tmp1=tmp_model.predict_proba(df_test_im[jj])[:,1].reshape(-1,1)
pred=np.concatenate((pred,tmp1), axis=1)
pred_real_tmp=np.concatenate((pred_real_tmp,tmp1), axis=1)
columns.append('lr')
#
model__ratio_rf_load[ii].random_state=44+ii+jj
tmp_model=model__ratio_rf_load[ii]
tmp2=tmp_model.predict_proba(df_test_im[jj])[:,1].reshape(-1,1)
pred=np.concatenate((pred,tmp2), axis=1)
columns.append('rf')
pred_real_tmp=np.concatenate((pred_real_tmp,tmp2), axis=1)
columns_real.append('Realization_'+str(jj+1))
mean_real=np.array(pd.DataFrame(pred_real_tmp).apply(lambda x: x.mean(), axis=1)).reshape(-1,1)
pred_real=np.concatenate((pred_real,mean_real), axis=1)
sim_result_real=pd.DataFrame(pred_real,columns=columns_real)
sim_result=pd.DataFrame(pred,columns=columns)
#sim_result
Mean=sim_result.apply(lambda x: x.mean(), axis=1)
df_xy_test['Classification']=Mean
font = {'size' : 10}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10,10), dpi= 200, facecolor='w', edgecolor='k')
#gs = gridspec.GridSpec(1, 2, width_ratios=[1.25, 1], wspace=0.4)
#ax1=plt.subplot(gs[0])
ax1=plt.subplot(1,3,1)
loc_plot(df_xy_test["X Coordinate"]/1000, df_xy_test["Y Coordinate"]/1000,
'a) \n Predicted Probability of Serious \n Leakage for 1000 Wells',cbar=False,vmin=0.2, vmax=0.9,
s1=35,c=Mean,marker="o",axt=ax1,ylabel=True,alpha=0.9,save=False,
title_fontsize=10,linewidths=0.2,bbox_to_anchor=(2000, -0.1))
ax1=plt.subplot(1,3,2)
prop=0.6
loc_plot(df_xy_test[df_xy_test['Classification']>prop]["X Coordinate"]/1000,
df_xy_test[df_xy_test['Classification']>prop]["Y Coordinate"]/1000,
f'b) \n {len(Mean[Mean>prop])} Wells with Predicted Probability \n of Serious Leakage > {prop}',cbar=False,vmin=0.2, vmax=0.8,
s1=65,c=Mean[Mean>prop],marker="P",axt=ax1,ylabel=False,alpha=1,save=False,title_fontsize=10,linewidths=0.2,
label=f'Wells with Leakage\n Probability>{prop}',legend=False)
ax1=plt.subplot(1,3,3)
prop=0.8
loc_plot(df_xy_test[df_xy_test['Classification']>prop]["X Coordinate"]/1000,
df_xy_test[df_xy_test['Classification']>prop]["Y Coordinate"]/1000,
f'c) \n {len(Mean[Mean>prop])} Wells with Predicted Probability \n of Serious Leakage > {prop}',cbar=False,vmin=0.2, vmax=0.8,
s1=65,c=Mean[Mean>prop],marker="X",axt=ax1,ylabel=False,alpha=1,save=False,title_fontsize=10,linewidths=0.2,
label=f'Wells with Leakage\n Probability>{prop}',legend=False)
plt.subplots_adjust(hspace=0.25)
plt.subplots_adjust(wspace=0.35)
fig.savefig('./Figures/Fig17.png', dpi=500, bbox_inches='tight')
plt.show()
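Because the goal is to prioritize the leakiest wells for field testing and remediation, the mean predicted probabilities can also be ranked and exported. The sketch below is one possible follow-up (not part of the original workflow), reusing df_xy_test and sim_result_real defined above.
# Rank wells by mean predicted probability of serious leakage
ranked = df_xy_test.copy()
ranked['P10'] = sim_result_real.quantile(0.10, axis=1).values   # lower bound over the 100 realizations
ranked['P90'] = sim_result_real.quantile(0.90, axis=1).values   # upper bound over the 100 realizations
ranked = ranked.sort_values('Classification', ascending=False)
ranked.head(20)   # the 20 wells with the highest predicted probability
#ranked.to_csv('./predicted_leakage_ranking.csv', index=False)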
Mean=sim_result.apply(lambda x: x.mean(), axis=1)
df_xy_test['Classification']=Mean
font = {'size' : 10}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10,10), dpi= 200, facecolor='w', edgecolor='k')
#gs = gridspec.GridSpec(1, 2, width_ratios=[1.25, 1], wspace=0.4)
#ax1=plt.subplot(gs[0])
ax1=plt.subplot(1,3,1)
loc_plot(df_xy_test["X Coordinate"]/1000, df_xy_test["Y Coordinate"]/1000,
'a) \n Predicted Probability of Serious \n Leakage for 1000 Wells',cbar=True,vmin=0.2, vmax=0.9,
s1=35,c=Mean,marker="o",axt=ax1,ylabel=True,alpha=0.9,save=False,
title_fontsize=10,linewidths=0.2,bbox_to_anchor=(2000, -0.1),shrink=2)
plt.subplots_adjust(hspace=0.25)
plt.subplots_adjust(wspace=0.35)
fig.savefig('./Figures/Fig18.pdf', dpi=500, bbox_inches='tight')
plt.show()
def histplt (val,bins,title,xlabl,ylabl,xlimt,ylimt=False, loc=1,legend=1,axt=None,days=True,
class_=False,scale=1,int_=0,nsplit=1, font=5,color='b',alpha=0.8):
""" Make histogram of data"""
ax1 = axt or plt.axes()
font = {'size' : font }
plt.rc('font', **font)
val=val[~np.isnan(val)]
val=np.array(val)
plt.hist(val, bins=bins, weights=np.ones(len(val)) / len(val),ec='black',color=color,alpha=alpha)
n=len(val[~np.isnan(val)])
Mean=np.nanmean(val)
Median=np.nanmedian(val)
SD=np.sqrt(np.nanvar(val))
Max=np.nanmax(val)
Min=np.nanmin(val)
    if (int_==0):
        txt='n=%.0f\nMean=%0.2f\nMedian=%0.1f\nσ=%0.1f\nMax=%0.1f\nMin=%0.1f' %(n,Mean,Median,SD,Max,Min)
    else:
        # integer-valued data: omit σ and report Max/Min without decimals
        txt='n=%.0f\nMean=%0.2f\nMedian=%0.1f\nMax=%0.0f\nMin=%0.0f' %(n,Mean,Median,Max,Min)
    anchored_text = AnchoredText(txt, borderpad=0, loc=loc,prop={ 'size': font['size']*scale})
if(legend==1): ax1.add_artist(anchored_text)
if (scale): plt.title(title,fontsize=font['size']*(scale+0.25))
else: plt.title(title)
    # x and y axis labels (font size scaled when scale is set)
    if (scale): ax1.set_xlabel(xlabl,fontsize=font['size']*scale)
    else: ax1.set_xlabel(xlabl)
    if (scale): ax1.set_ylabel(ylabl,fontsize=font['size']*scale)
    else: ax1.set_ylabel(ylabl)
if (class_==True): plt.xticks([0,1])
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.grid(linewidth='0.1')
if days:plt.xticks(range(0,45,nsplit),y=0.01, fontsize=8.6)
plt.xticks(fontsize=font['size']*scale)
plt.yticks(fontsize=font['size']*scale)
try:
xlimt
except NameError:
pass
else:
plt.xlim(xlimt)
try:
ylimt
except NameError:
pass
else:
plt.ylim(ylimt)
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 3), dpi= 100, facecolor='w', edgecolor='k')
gs = gridspec.GridSpec(ncols=6, nrows=2)
ax1=plt.subplot(1,2,1)
i=0
tmp_val=[idx for idx, val in enumerate(Mean) if val>0.65 and val<0.75]
val=sim_result_real.iloc[tmp_val[0]]
histplt (val,bins=12,title='100 Realizations',xlabl='Probability of Serious Leakage',days=False,alpha=0.8,
ylabl='Percentage',xlimt=(0.2,0.9),ylimt=(0,0.6),axt=ax1,nsplit=5,scale=1.05,loc=2,font=10,color='brown')
fig.savefig('./Figures/Fig19.png', dpi=500, bbox_inches='tight')
font = {'size' : 7}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(10, 3), dpi= 100, facecolor='w', edgecolor='k')
gs = gridspec.GridSpec(ncols=6, nrows=2)
ax1=plt.subplot(1,2,1)
i=0
tmp_val=[idx for idx, val in enumerate(Mean) if val>0.25 and val<0.45]
val=sim_result_real.iloc[tmp_val[10]]
histplt (val,bins=10,title='100 Realizations',xlabl='Probability of Serious Leakage',days=False,alpha=0.8,
ylabl='Percentage',xlimt=(0.2,0.9),ylimt=(0,0.6),axt=ax1,nsplit=5,scale=1.05,loc=1,font=10,color='b')
fig.savefig('./Figures/Fig20.png', dpi=500, bbox_inches='tight')
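The two histograms above show the spread of the 100 realizations for individual wells; the same uncertainty can be summarized for all wells at once, for instance with the per-well standard deviation (again only a sketch, reusing sim_result_real from above).
# Per-well uncertainty: spread of the 100 realizations for each well
Std = sim_result_real.std(axis=1)
print(f'Average per-well standard deviation: {Std.mean():.3f}')
print(f'Wells with standard deviation above 0.1: {(Std > 0.1).sum()}')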