Summary

Semantic search seeks to improve search accuracy by understanding the meaning of the search query. In contrast to traditional search engines, which find documents only through lexical matches, semantic search can also match synonyms and related phrasing. This makes search results more complete because the engine captures what the user is trying to ask instead of simply matching keywords to pages. The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space. Semantic search algorithms use contextual embeddings to perform look-ups, which yields closer contextual matches than lexical matching. In this notebook, we search for a question within a book and find the passages closest to the question.

Python functions and data files needed to run this notebook are available via this link.

In [1]:
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
from datasets import load_dataset # Load data set from HuggingFace

from sentence_transformers import SentenceTransformer, util # SentenceTransformer for semantic meaning
from transformers import pipeline

from random import sample, seed, shuffle
from sentence_transformers import InputExample, losses, evaluation
from torch.utils.data import DataLoader

We have two types of search:

  1. Symmetric Search: documents and queries are roughly the same size and carry the same amount of semantic content: e.g. getting news article titles for a given query

  2. Asymmetric Search: documents are usually longer than queries and carry larger amounts of semantic content: e.g. getting an entire paragraph from a textbook for answering a QUERY

A siamese **bi-encoder** architecture allows the network to learn embeddings that can be compared using cosine similarity. A **cross-encoder** is the traditional BERT setup, which is slower. A bi-encoder encodes the query and the candidates independently and uses a scoring function to compare them. A cross-encoder encodes each query-candidate pair together, capturing their joint context. The cross-encoder performs better but is more computationally expensive, while the bi-encoder is more efficient but may lack context awareness. The choice depends on the task and requirements.

image.png

  • Cosine similarity measures the cosine of the angle between two vectors. It is a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, and two vectors oriented at 90° relative to each other have a similarity of 0, independent of their magnitude. If the angle is larger than 90°, the similarity is negative.
  • Cosine similarity is advantageous because even if two similar sentences are far apart in Euclidean distance (for example, because the documents differ in size), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.

 

$\large \text{Similarity}= \cos(\theta)=\frac{\mathbf{A}\cdot\mathbf{B}}{\left\| \mathbf{A}\right\| \, \left\| \mathbf{B} \right\|}=\frac{\sum_{i=1}^{n} A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}}\sqrt{\sum_{i=1}^{n} B_{i}^{2}}}$
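
As a quick check of the formula, here is a minimal NumPy sketch. The function name `cosine_similarity` and the toy vectors are illustrative choices, not part of the notebook's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same orientation, different magnitude: similarity close to 1
same_direction = cosine_similarity([1, 2], [10, 20])
# Vectors at 90 degrees: similarity close to 0
orthogonal = cosine_similarity([1, 0], [0, 3])
# Vectors at 180 degrees: similarity close to -1
opposite = cosine_similarity([1, 0], [-2, 0])

print(same_direction, orthogonal, opposite)
```

The three cases mirror the bullets above: magnitude has no effect, only the angle between the vectors does.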

image.png

Google search with roberta model

In [2]:
# We want to search a person on Google

PERSON = 'Mehdi Rezvandehy'

# This is for educational purposes only; it is not a very efficient way of searching
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text).get_text()[:1024]

nlp = pipeline('question-answering', 
               model='deepset/roberta-base-squad2', 
               tokenizer='deepset/roberta-base-squad2', 
               max_length=50)  # question-answering pipeline with the RoBERTa model

nlp(f'Who is {PERSON}?', google_html)
Out[2]:
{'score': 0.10439878702163696,
 'start': 284,
 'end': 308,
 'answer': 'Principal Data Scientist'}

Load an Online Text Book

When applying semantic search to large documents, breaking down the text into manageable and meaningful chunks is essential for efficient processing.

Split up the Text

Splitting can be applied at the word level (a maximum number of tokens) or at the paragraph level. For example, at the word level, if a document has 800 tokens, we might split it into two chunks: the first 512 tokens in one chunk and the remaining 288 tokens in another. Each chunk is then treated as a separate input during the semantic search process.
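
The 800-token example can be sketched in a few lines. This is only an illustration: the helper `split_into_chunks` is a hypothetical name, and the placeholder strings stand in for real tokenizer output.

```python
def split_into_chunks(tokens, max_tokens=512):
    """Cut a token list into consecutive chunks of at most max_tokens tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# Placeholder tokens standing in for a tokenized 800-token document
tokens = [f"tok{i}" for i in range(800)]

chunks = split_into_chunks(tokens)
chunk_sizes = [len(c) for c in chunks]
print(chunk_sizes)  # [512, 288]
```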

In addition to word-level splitting, we might also consider splitting documents at the paragraph level. Each paragraph can be treated as a separate unit for semantic search. This approach has some advantages:

  • Context Preservation: Splitting at the paragraph level preserves the context of each paragraph, which can be important for understanding the meaning of the text.

  • Parallel Processing: We can process multiple paragraphs in parallel, taking advantage of modern hardware and speeding up the search process.

  • Granular Matching: If the query is related to a specific topic or context, paragraph-level matching identifies relevant paragraphs more accurately.

Natural Whitespace Chunking

Natural Whitespace Chunking refers to a method of splitting the text based on the natural spaces (whitespace) between paragraphs. This process can be done with or without overlap between the chunks. Without overlap means the document is split into non-overlapping chunks based on natural whitespace (line breaks) between paragraphs. Each chunk is independent and contains a distinct set of paragraphs.

Chunking Without Overlapping:

The figure below shows whitespace chunking without overlapping.

image-2.png

Pros:

  1. Non-Overlapping Chunks: Each chunk corresponds to a distinct, non-overlapping portion of the text. This can simplify downstream processing and analysis.

  2. Sequential Processing: When chunks do not overlap, you can process them sequentially, which may be more straightforward and efficient.

Cons:

  1. Context Discontinuity: If the content within a paragraph spans multiple chunks, there might be a discontinuity in context, potentially impacting the understanding of the text.

  2. Boundary Effects: Non-overlapping chunks might lead to missing important information near chunk boundaries, especially if there are critical phrases or sentences that bridge two adjacent chunks.

Chunking With Overlapping

The figure below shows whitespace chunking with overlapping:

image-3.png

Pros:

  1. Context Continuity: Overlapping chunks help maintain context continuity, as portions of adjacent chunks overlap, capturing shared information.

  2. Reduced Boundary Effects: Overlapping chunks can mitigate boundary effects by ensuring that information near chunk boundaries is present in multiple chunks.

Cons:

  1. Redundancy: Overlapping chunks may introduce redundancy, where some content is present in multiple chunks. This redundancy can impact efficiency in storage and processing.

  2. Increased Complexity: Overlapping chunks might require more complex processing, especially if you need to handle duplicate information or ensure that the analysis considers the shared context.
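
To make the trade-off concrete, the following sketch groups paragraphs into windows, with or without overlap. The function `chunk_paragraphs` and its `window` and `overlap` parameters are assumptions made for illustration; they are not the method used later in this notebook.

```python
def chunk_paragraphs(text, window=2, overlap=0):
    """Group blank-line-separated paragraphs into chunks of `window` paragraphs.

    `overlap` is the number of paragraphs shared by two adjacent chunks.
    """
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    step = window - overlap
    if step <= 0:
        raise ValueError('overlap must be smaller than window')

    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append('\n\n'.join(paragraphs[start:start + window]))
        # Stop once the current window has reached the last paragraph
        if start + window >= len(paragraphs):
            break
    return chunks

text = 'P1\n\nP2\n\nP3\n\nP4'
no_overlap = chunk_paragraphs(text, window=2, overlap=0)
with_overlap = chunk_paragraphs(text, window=2, overlap=1)

print(len(no_overlap), len(with_overlap))
```

With four paragraphs, the overlapping variant produces more chunks because each interior paragraph appears twice, which is the redundancy described in the cons above.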

In this notebook, we apply chunking without overlapping and keep only paragraphs longer than 90 characters. See below:

In [3]:
# Textbook about animals, titled "Wild Animals I Have Known"
text = urlopen("""https://www.gutenberg.org/cache/epub/3031/pg3031.txt""").read().decode() # open URL of the document and read

# split the textbook into paragraphs on blank lines and keep only paragraphs longer than 90 characters
doc = list(filter(lambda x: len(x) > 90, text.split('\r\n\r\n')))

doc = np.array(doc)

print(f'There are {len(doc)} paragraphs')
There are 645 paragraphs

Encode Paragraphs

In [4]:
# The bi-encoder encodes all documents one at a time, for asymmetric semantic search
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v4') # DistilBERT fine-tuned on the MS MARCO data set
bi_encoder.max_seq_length = 256     # Truncate long documents to 256 tokens

bi_encoder
Out[4]:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
In [5]:
# To use the bi-encoder, call its encode method. It takes an iterable of document strings
# and converts them to tensors
doc_embeddings = bi_encoder.encode(doc, convert_to_tensor=True, show_progress_bar=True)

doc_embeddings.shape
Out[5]:
torch.Size([645, 768])

Encode Query

The question must also be encoded with the same bi-encoder.

In [6]:
QUERY = """What is snare?"""  # a natural language query
In [7]:
# Encode the QUERY using the bi-encoder and convert to a tensor and find relevant documents
que_embedding = bi_encoder.encode(QUERY, convert_to_tensor=True)

# Give number of documents to retrieve with the bi-encoder
# top_k is top documents similar to query by descending Cosine similarity
cos_sim_score = util.semantic_search(que_embedding, doc_embeddings, top_k=3)[0]

cos_sim_score
Out[7]:
[{'corpus_id': 144, 'score': 0.5957522392272949},
 {'corpus_id': 604, 'score': 0.388653039932251},
 {'corpus_id': 340, 'score': 0.24888089299201965}]
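
The retrieval step above ranks corpus embeddings by cosine similarity to the query and keeps the top_k hits. A rough NumPy sketch of that idea follows; `top_k_cosine` and the two-dimensional toy embeddings are hypothetical, and this is not the library's actual implementation.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_matrix, top_k=3):
    """Return the top_k corpus rows ranked by cosine similarity to the query."""
    # Normalize so that a dot product equals the cosine similarity
    query = query_vec / np.linalg.norm(query_vec)
    corpus = corpus_matrix / np.linalg.norm(corpus_matrix, axis=1, keepdims=True)
    scores = corpus @ query
    order = np.argsort(-scores)[:top_k]  # indices sorted by descending score
    return [{'corpus_id': int(i), 'score': float(scores[i])} for i in order]

# Toy 2-D "embeddings" standing in for the 768-D document embeddings
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])

hits = top_k_cosine(query, corpus, top_k=3)
for hit in hits:
    print(hit)
```

The result has the same shape as the output above: a list of dictionaries with a corpus_id and a score, sorted from most to least similar.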
In [8]:
print(f'QUERY: {QUERY}\n')

for no, ir in enumerate(cos_sim_score):
    
    print(f'Document {no + 1}: Cosine Similarity is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
    print('\n')
QUERY: What is snare?

Document 1: Cosine Similarity is 0.596:

"A snare is something that looks like a creeper, but it doesn't grow and

it's worse than all the hawks in the world," said Molly, glancing at the

now far-away red-tail, "for there it hides night and day in the runway

till the chance to catch you comes."


Document 2: Cosine Similarity is 0.389:

It seemed, at length, a waste of time to follow him with a gun, so when

the snow was deepest, and food scarcest, Cuddy hatched a new plot. Right

across the feeding-ground, almost the only good one now in the Stormy

Moon, he set a row of snares. A cottontail rabbit, an old friend, cut

several of these with his sharp teeth, but some remained, and Redruff,

watching a far-off speck that might turn out a hawk, trod right in one

of them, and in an instant was jerked into the air to dangle by one

foot.


Document 3: Cosine Similarity is 0.249:

My window now took the place of the hollow bass wood. A number of hens

of the breed he knew so well were about the cub in the yard. Late that

afternoon as they strayed near the captive there was a sudden rattle of

the chain, and the youngster dashed at the nearest one and would have

caught him but for the chain which brought him up with a jerk. He got on

his feet and slunk back to his box, and though he afterward made several

rushes he so gauged his leap as to win or fail within the length of the

chain and never again was brought up by its cruel jerk.


In [9]:
nlp = pipeline('question-answering', 
               model='deepset/roberta-base-squad2', 
               tokenizer='deepset/roberta-base-squad2', 
               max_length=512)  # question-answering pipeline with the RoBERTa model
In [10]:
# Answer the QUERY from the top document
nlp(QUERY, str(doc[cos_sim_score[0]['corpus_id']]),max_length=50)
Out[10]:
{'score': 0.5877875089645386,
 'start': 12,
 'end': 47,
 'answer': 'something that looks like a creeper'}

We just built an "Open Book Q/A" system. It looks through the entire book, finds the paragraph closest to the question, and extracts the answer from it.

Fine-tune bi_encoder

Good and Bad Training Data

To fine-tune our bi-encoder, we need a data set of questions and answers. We use adversarial_qa from HuggingFace. A question paired with its own context is good training data (cosine similarity 1), and a question paired with an unrelated context is bad training data (cosine similarity 0). The code below is taken from Sinan Ozdemir.

In [11]:
# Load the adversarial_qa dataset from the Q/A use-case in Hugging Face
training_qa = load_dataset('adversarial_qa', 'adversarialQA', split='train')

# Initialize lists for good and bad training data
data_good_training = []
data_bad_training = []

# Iterate over the adversarial_qa dataset
last_example = None
for example in training_qa:
    # Check if the context is different from the previous example
    if last_example and example['context'] != last_example['context']:
        
        # If different, append to bad training data with a label of 0.0 (neutral)
        data_bad_training.append((example['question'], last_example['context'], 0.0))
        
    # If the context is the same as the previous example
    elif last_example and example['context'] == last_example['context']:
        
        # Append to good training data with a label of 1.0
        data_good_training.append((example['question'], example['context'], 1.0))
        
    # Update last_example for the next iteration
    last_example = example
Found cached dataset adversarial_qa (C:/Users/mrezv/.cache/huggingface/datasets/adversarial_qa/adversarialQA/1.0.0/92356be07b087c5c6a543138757828b8d61ca34de8a87807d40bbc0e6c68f04b)
In [12]:
len(data_good_training), len(data_bad_training)
Out[12]:
(27352, 2647)
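
To see how the loop above labels pairs, the same logic can be run on a toy list. The wrapper `build_pairs` and the four toy examples are made up for illustration.

```python
def build_pairs(examples):
    """Replicate the good/bad pairing loop on any list of question/context dicts."""
    good, bad = [], []
    last = None
    for example in examples:
        if last and example['context'] != last['context']:
            # New context: pair this question with the previous (unrelated) context
            bad.append((example['question'], last['context'], 0.0))
        elif last and example['context'] == last['context']:
            # Same context as before: question and context belong together
            good.append((example['question'], example['context'], 1.0))
        last = example
    return good, bad

toy = [
    {'question': 'q1', 'context': 'A'},
    {'question': 'q2', 'context': 'A'},
    {'question': 'q3', 'context': 'B'},
    {'question': 'q4', 'context': 'B'},
]

good, bad = build_pairs(toy)
print(good)
print(bad)
```

A bad pair is only created when the context changes, so there is one bad pair per context boundary. This is consistent with the counts above, where good pairs far outnumber bad ones.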

Good training data have the highest similarity (1) between questions and answers:

In [13]:
data_good_training[0]
Out[13]:
('What is surrounded by cerebrospinal fluid?',
 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.',
 1.0)

Bad training data have the lowest similarity (0) between questions and answers:

In [14]:
data_bad_training[0]
Out[14]:
('What do you think with?',
 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.',
 0.0)

Split to Training set and Test set

Now we draw 700 good samples and 700 bad samples, merge them, and shuffle the data:

In [15]:
# https://www.sbert.net/docs/training/overview.html for more information on training

seed(42)  # seed our upcoming sample

sampled_training_data = sample(data_good_training, 700) + sample(data_bad_training, 700)

shuffle(sampled_training_data)  # shuffle our data around

training_index = int(.75 * len(sampled_training_data))  # Get a 75/25 train/test split
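
The sizes of the resulting split can be checked on synthetic tuples. The placeholder data below is invented purely to show the arithmetic.

```python
from random import sample, seed, shuffle

seed(42)

# Placeholder (question, context, label) tuples standing in for the real data
good = [(f'q{i}', f'c{i}', 1.0) for i in range(1000)]
bad = [(f'q{i}', f'other{i}', 0.0) for i in range(1000)]

sampled = sample(good, 700) + sample(bad, 700)
shuffle(sampled)

training_index = int(.75 * len(sampled))
train, test = sampled[:training_index], sampled[training_index:]

print(len(train), len(test))  # 1050 350
```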

Define Training Examples

In [16]:
# Define the training examples; that is how the library is designed.
train_examples = [InputExample(texts=t[:2], label=t[2]) for t in sampled_training_data[:training_index]]

train_examples[0].__dict__
Out[16]:
{'guid': '',
 'texts': ('What were his words taken out of in the media frenzy?',
  'After receiving his J.D. from Boston College Law School, Kerry worked in Massachusetts as an Assistant District Attorney. He served as Lieutenant Governor of Massachusetts under Michael Dukakis from 1983 to 1985 and was elected to the U.S. Senate in 1984 and was sworn in the following January. On the Senate Foreign Relations Committee, he led a series of hearings from 1987 to 1989 which were a precursor to the Iran–Contra affair. Kerry was re-elected to additional terms in 1990, 1996, 2002 and 2008. In 2002, Kerry voted to authorize the President "to use force, if necessary, to disarm Saddam Hussein", but warned that the administration should exhaust its diplomatic avenues before launching war.'),
 'label': 0.0}

Dataloader

We have not used a data loader so far (we used a data collator before). A data loader is the object that shuffles a Dataset and draws batches from it.

In [17]:
# Define the training dataset, dataloader, and the training loss
# Typically, the Trainer class in Hugging Face can handle this automatically, but here's a manual setup

# Create a dataloader for training with shuffling and a batch size of 32
train_dataloader_set = DataLoader(train_examples, shuffle=True, batch_size=32, collate_fn=bi_encoder.smart_batching_collate)

A loss should also be computed to measure how well our model predicts. Here it is a cosine similarity loss, which measures how close the predicted cosine similarity is to the actual similarity label.

In [18]:
# Explicitly define the training loss using cosine similarity
train_loss_set = losses.CosineSimilarityLoss(bi_encoder)
In [19]:
train_loss_set
Out[19]:
CosineSimilarityLoss(
  (model): SentenceTransformer(
    (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: DistilBertModel 
    (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  (loss_fct): MSELoss()
  (cos_score_transformation): Identity()
)
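
The printed loss object shows an MSELoss applied to the cosine score with an identity transformation. A minimal NumPy sketch of that computation is given here, assuming toy embeddings; `cosine_similarity_loss` is a hypothetical helper, not the library's code.

```python
import numpy as np

def cosine_similarity_loss(u, v, labels):
    """Mean squared error between row-wise cosine similarity and the labels."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return float(np.mean((cos - np.asarray(labels, dtype=float)) ** 2))

# Two toy (query, context) embedding pairs: one aligned, one orthogonal
query_emb = np.array([[1.0, 0.0], [1.0, 0.0]])
context_emb = np.array([[1.0, 0.0], [0.0, 1.0]])

perfect = cosine_similarity_loss(query_emb, context_emb, [1.0, 0.0])  # labels match the geometry
worst = cosine_similarity_loss(query_emb, context_emb, [0.0, 1.0])    # labels contradict it

print(perfect, worst)
```

Training pushes the embeddings so that good pairs move toward a cosine similarity of 1 and bad pairs toward 0.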

Every batch of data offers three things:

  1. QUERY_batch: standard input ids (32: batch size, 20: length of the longest question in this batch)
  2. context_batch: standard input ids (32: batch size, 256: max size of answers)
  3. labels (32: either 0 or 1)
In [20]:
(QUERY_batch, context_batch), labels = next(iter(train_dataloader_set))  # get a sample batch of data

QUERY_batch['input_ids'].shape, context_batch['input_ids'].shape, labels.shape
Out[20]:
(torch.Size([32, 20]), torch.Size([32, 256]), torch.Size([32]))

The 256 comes from bi_encoder.max_seq_length = 256: any sequence longer than 256 tokens is truncated.

Set Evaluator

The evaluator measures embedding closeness on our evaluation set, which is the 25% of the sampled data held out from training.

In [21]:
# Evaluation data, sentences1 and sentences2 are lists of QUERYs and context respectively and scores are 0 or 1
sentences1, sentences2, scores = zip(*sampled_training_data[training_index:])
In [22]:
sentences1[0], sentences2[0], scores[0]
Out[22]:
("Who hadn't done what was needed?",
 'Han Chinese make up the vast majority of the population, and the largest Han subgroup are the speakers of Wu varieties of Chinese. There are also 400,000 members of ethnic minorities, including approximately 200,000 She people and approximately 20,000 Hui Chinese[citation needed]. Jingning She Autonomous County in Lishui is the only She autonomous county in China.',
 0.0)

EmbeddingSimilarityEvaluator evaluates embedding closeness based on cosine similarity:

In [23]:
# evaluator will evaluate embedding closeness
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
In [24]:
# Initial evaluation
bi_encoder.evaluate(evaluator)  
Out[24]:
0.6179759479051656

The initial evaluation score is about 0.62 (higher is better), and we want to increase it:

Fit Model

Finally, we can fit our model with bi_encoder.fit, passing in our training data loader and our training loss as the training objective.

In [25]:
# Fine-tune the model using the fit method using bi_encoder.fit
bi_encoder.fit(
    train_objectives=[(train_dataloader_set, train_loss_set)],  # this is how it calculates the losses to update the weights
    output_path='qa/results',
    epochs=2, 
    evaluator=evaluator 
)
In [26]:
bi_encoder.evaluate(evaluator)  # final evaluation (higher embedding similarity is better)
# Not a huge jump in performance with 2 epochs. We could try more data or more epochs
Out[26]:
0.6189941233799751

There is very little performance improvement after fine-tuning the model.

Load fine-tuned model

In [27]:
# load fine-tuned model
finetuned_bi_encoder = SentenceTransformer('qa/results')
In [28]:
# Scores are nearly unchanged after fine-tuning

# Encode document
document_embeddings = finetuned_bi_encoder.encode(doc, convert_to_tensor=True, show_progress_bar=True)

# Encode question
QUERY_embedding = finetuned_bi_encoder.encode(QUERY, convert_to_tensor=True)

# Get document cos_sim_score
cos_sim_score = util.semantic_search(QUERY_embedding, document_embeddings, top_k=3)[0]

print(f'QUERY: {QUERY}\n')

for no, ir in enumerate(cos_sim_score):
    
    print(f'Document {no + 1}: Cosine Similarity is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
    print('\n')
QUERY: What is snare?

Document 1: Cosine Similarity is 0.594:

"A snare is something that looks like a creeper, but it doesn't grow and

it's worse than all the hawks in the world," said Molly, glancing at the

now far-away red-tail, "for there it hides night and day in the runway

till the chance to catch you comes."


Document 2: Cosine Similarity is 0.391:

It seemed, at length, a waste of time to follow him with a gun, so when

the snow was deepest, and food scarcest, Cuddy hatched a new plot. Right

across the feeding-ground, almost the only good one now in the Stormy

Moon, he set a row of snares. A cottontail rabbit, an old friend, cut

several of these with his sharp teeth, but some remained, and Redruff,

watching a far-off speck that might turn out a hawk, trod right in one

of them, and in an instant was jerked into the air to dangle by one

foot.


Document 3: Cosine Similarity is 0.251:

My window now took the place of the hollow bass wood. A number of hens

of the breed he knew so well were about the cub in the yard. Late that

afternoon as they strayed near the captive there was a sudden rattle of

the chain, and the youngster dashed at the nearest one and would have

caught him but for the chain which brought him up with a jerk. He got on

his feet and slunk back to his box, and though he afterward made several

rushes he so gauged his leap as to win or fail within the length of the

chain and never again was brought up by its cruel jerk.



Search within PDF

In [29]:
import PyPDF2
from tqdm import tqdm

def extract_text_from_pdf(pdf_path):
    # Open the PDF file in read-binary mode
    with open(pdf_path, 'rb') as file:
        # Create a PDF reader object
        reader = PyPDF2.PdfReader(file)

        # Initialize an empty string to hold the text
        extracted_text = ''

        # Loop through each page in the PDF file
        for page in tqdm(reader.pages):
            # Extract the text from the page
            text = page.extract_text()

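            # Find the starting point of the text to extract
            # In this case, extracting text starting after the string ' ]'
            # Note: find() returns -1 when ' ]' is absent, so starting_point becomes 1
            # and the first character of that page is dropped
            starting_point = text.find(' ]') + 2
            extracted_text += '\n\n' + text[starting_point:]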

    return extracted_text

# Specify the path to the PDF file
pdf_path = 'machine_learning.pdf'

# Extract text from the PDF
principles_of_ds_text = extract_text_from_pdf(pdf_path)
100%|██████████| 21/21 [00:00<00:00, 25.63it/s]
In [30]:
principles_of_ds_text[:600]
Out[30]:
'\n\nESEARCH ARTICLE\nMachine learning approaches for the prediction of serious\nfluid leakage from hydrocarbon wells\nMehdi Rezvandehy1and Bernhard Mayer2\n1Department of Chemical and Petroleum Engineering, University of Calgary, Calgary, AB, Canada\n2Department of Geoscience, University of Calgary, Calgary, AB, Canada\nCorresponding author: Mehdi Rezvandehy; Email: mehdi.rezvandehy@ucalgary.ca\nReceived: 23 August 2022; Revised: 05 April 2023; Accepted: 14 April 2023\nKeywords: Energy wells; imbalanced class classification; imputation; probability estimation; resampling\nAbstract\nThe exploitation of hyd'
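
The extracted text above begins with 'ESEARCH ARTICLE', presumably 'RESEARCH ARTICLE' with its first letter cut off. This follows from str.find returning -1 when the marker is missing, so adding 2 gives a start index of 1. The sketch below reproduces the effect and shows one possible guard; `text_after_marker` is a hypothetical helper, not part of the original function.

```python
def text_after_marker(text, marker=' ]'):
    """Return the text after the first marker, or the whole text if the marker is absent."""
    idx = text.find(marker)
    if idx == -1:
        return text
    return text[idx + len(marker):]

page = 'RESEARCH ARTICLE'

# Slicing as in the extraction function: find() gives -1, so the slice starts at index 1
original_slice = page[page.find(' ]') + 2:]

# Guarded version keeps the page intact when the marker is missing
guarded = text_after_marker(page)

# When the marker is present, both approaches skip past it
with_marker = text_after_marker('[ header ]Body text')

print(repr(original_slice), repr(guarded), repr(with_marker))
```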

Maximum Token Chunking

The function below splits the text into chunks of at most a given number of tokens. It is taken from Sinan Ozdemir.

In [31]:
# Function to split the text into chunks of a maximum number of tokens.

from transformers import  AutoTokenizer

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def overlapping_chunks(text, max_tokens = 200, overlapping_factor = 5):
    '''
    max_tokens: maximum tokens per segment
    overlapping_factor: number of sentences to start each segment with, overlapping with the previous segment
    
    '''
    import re

    # Split the text using punctuation
    sentences = re.split(r'[.?!]', text)

    # Get the token count for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]

    chunks, tokens_so_far, chunk = [], 0, []

    # Iterate through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If the number of tokens so far plus the number of tokens in the current sentence is greater
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            if overlapping_factor > 0:
                chunk = chunk[-overlapping_factor:]
                tokens_so_far = sum([len(tokenizer.encode(c)) for c in chunk])
            else:
                chunk = []
                tokens_so_far = 0
                
        # If the number of tokens in the current sentence is greater than the max number of
        # tokens, go to the next sentence
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

    return chunks
In [32]:
split_no_overlap = overlapping_chunks(principles_of_ds_text, max_tokens=200, overlapping_factor=0)
avg_length = sum([len(tokenizer.encode(t)) for t in split_no_overlap
                 ]) / len(split_no_overlap)
print(f'non-overlapping chunking approach has {len(split_no_overlap)} documents with average length {avg_length:.1f} tokens')
non-overlapping chunking approach has 82 documents with average length 169.7 tokens
In [33]:
split_no_overlap[20]
Out[33]:
' Measured depth (m) 22.  Total production monthe12-4 Mehdi Rezvandehy and Bernhard Mayer\nhttps://doi. org/10. 1017/dce. 2023. 9  Published online by Cambridge University Press\n\nroperty 17 indicates the type of surface abandonment such as plate or cement; 18 is time in month since\nabandonment (counted from January 2022).  Properties 19 –22 are cumulative gas, oil, and water\nproduction and total months in production. \n3.  Workflow\nBinary classification was applied using the 22 physical properties in Table 1 as training features, while\nusing the SCVF/GM test results (AER classification) as target with serious leakage as positive class (value\n1) and nonserious leakage as negative class (value 0).'
In [34]:
# with 5 overlapping sentences per chunk
split_with_overlap = overlapping_chunks(principles_of_ds_text, max_tokens=200, overlapping_factor=5)
avg_length = sum([len(tokenizer.encode(t)) for t in split_with_overlap]) / len(split_with_overlap)
print(f'overlapping chunking approach has {len(split_with_overlap)} documents with average length {avg_length:.1f} tokens')
overlapping chunking approach has 226 documents with average length 182.9 tokens
In [35]:
split_with_overlap[20]
Out[35]:
' The AER applies two field tests for the identification of fluid\nmigration after a well is completed to produce hydrocarbon or to inject any fluid:\n1.  SCVF is the flow of gas (methane, CO\n2, etc. ) out of the casing annulus or surface casing.  SCVF is\noften referred to as internal migration.  Wells with positive SCVF are considered serious in the\nprovince of Alberta under one or several of the following conditions: (a) gas-flow rates higher than\n300 m3/d, (b) stabilized pressure >9. 8 kPa/m, (c) liquid-hydrocarbons, and (d) hydrogen sulfide\n(H2S) flow (see Alberta Energy Regulator, 2003 , for more information). \n2.'

With overlap, we see an increase in the number of document chunks, but they are all approximately the same size. The higher the overlapping factor, the more redundancy we introduce into the system.
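
The effect of the overlapping factor can be reproduced without a tokenizer. The sketch below is a simplified variant, not the function used above: it counts whitespace-separated words instead of tokens, splits on sentence-ending punctuation followed by whitespace, and also emits the final partial chunk. The name `chunk_sentences` and its parameters are assumptions.

```python
import re

def chunk_sentences(text, max_words=20, overlap=0):
    """Group sentences into chunks of at most max_words words.

    Word counts stand in for tokenizer token counts (an assumption made to
    keep the sketch dependency-free). `overlap` is the number of trailing
    sentences carried over into the next chunk.
    """
    sentences = [s.strip() for s in re.split(r'(?<=[.?!])\s+', text) if s.strip()]
    chunks, current, count = [], [], 0

    for sentence in sentences:
        n = len(sentence.split())
        if n > max_words:
            continue  # a sentence longer than the budget can never fit
        if current and count + n > max_words:
            chunks.append(' '.join(current))
            current = current[-overlap:] if overlap > 0 else []
            count = sum(len(s.split()) for s in current)
            # Drop carried-over sentences until the new sentence fits
            while current and count + n > max_words:
                count -= len(current.pop(0).split())
        current.append(sentence)
        count += n

    if current:
        chunks.append(' '.join(current))  # flush the final partial chunk
    return chunks

text = 'A b c. D e f. G h i. J k l.'
without_overlap = chunk_sentences(text, max_words=6, overlap=0)
with_one_overlap = chunk_sentences(text, max_words=6, overlap=1)

print(without_overlap)
print(with_one_overlap)
```

On this toy text the overlapping run yields more chunks of the same size, with each interior sentence repeated in two chunks.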

In [36]:
# The bi-encoder encodes all documents one at a time, for asymmetric semantic search
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v4') # DistilBERT fine-tuned on the MS MARCO data set

# To use the bi-encoder, call its encode method. It takes an iterable of document strings
# and converts them to tensors
doc_embeddings = bi_encoder.encode(split_no_overlap, convert_to_tensor=True, show_progress_bar=True)

# Encode Query
QUERY = """What is Oversampling?"""  # a natural language query

# Encode the QUERY using the bi-encoder and convert to a tensor and find relevant documents
que_embedding = bi_encoder.encode(QUERY, convert_to_tensor=True)

# Give number of documents to retrieve with the bi-encoder
# top_k is top documents similar to query by descending Cosine similarity
cos_sim_score = util.semantic_search(que_embedding, doc_embeddings, top_k=3)[0]

print(f'QUERY: {QUERY}\n')

for no, ir in enumerate(cos_sim_score):
    
    print(f'Document {no + 1}: Cosine Similarity is {ir["score"]:.3f}:\n\n{split_no_overlap[ir["corpus_id"]]}')
    print('\n')
QUERY: What is Oversampling?

Document 1: Cosine Similarity is 0.528:

e12-10 Mehdi Rezvandehy and Bernhard Mayer
https://doi. org/10. 1017/dce. 2023. 9  Published online by Cambridge University Press

imulation for oversampling to avoid the inclusion of exact duplicates as discussed in Section 4 and shown
inFigure 4b .  This approach is fast and includes associated uncertainty for resampling.  Both resampling
approaches were combined with equal percentage: undersampling with a selected percentage is appliedto the majority class to reduce the bias on that class, while also applying the same percentagefor oversampling of the minority class to improve the bias toward these instances.  This was appliedfor different ratios of
Class 1
Class 0, where Class 1 and Class 0 are the proportions of serious and nonserious
leakage, respectively.


Document 2: Cosine Similarity is 0.483:


(b)Oversampling: this approach adds random instances (duplicates) of the minority class to the
training set.  A disadvantage of this technique is that it increases the likelihood of overfittingbecause of including the exact copies of the minority class examples (Fernández et al. , 2018 ;
Brownlee, 2021 ).  In this article, integration of oversampling and undersampling approaches was
applied to resolve the issue of imbalanced data. 
3.  Finally, the main aim was to achieve a higher-performance classifier.  Ensemble Learning was
utilized to integrate multiple models to build a stronger predictor.  It works by aggregating thepredictions of a group of predictors.


Document 3: Cosine Similarity is 0.481:

 This leads to
forcing the predictive models to an unrealistically high classification for the negative class(Montague et al. , 2018 ; Brownlee, 2020 ) and underestimating the positive class.  Resampling
techniques were used to adjust the class distribution of training data to feed more balanced data intopredictive models, thereby creating a new transformed version of the training set with a different
class distribution.  Two main approaches for random resampling an imbalanced dataset are:
(a)Undersampling : this approach deletes random instances of a majority class from the training
set.  A drawback of undersampling is eliminating the instances that may be important, useful, orcritical for fitting a robust decision boundary (He and Ma, 2013 ; Brownlee, 2021 ).


In [37]:
nlp = pipeline('question-answering', 
               model='deepset/roberta-base-squad2', 
               tokenizer='deepset/roberta-base-squad2', 
               max_length=512)  # question-answering pipeline with a RoBERTa model fine-tuned on SQuAD 2.0
In [38]:
# Answer the QUERY from the top document
nlp(question=QUERY, context=str(split_no_overlap[cos_sim_score[0]['corpus_id']]), max_length=100)
Out[38]:
{'score': 0.03856489807367325,
 'start': 435,
 'end': 496,
 'answer': 'appliedto the majority class to reduce the bias on that class'}
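The score of roughly 0.04 for the top-ranked document is low, so it may be worth running the reader over several retrieved chunks and keeping the most confident span. Below is a minimal sketch of that idea. The names `best_answer` and `fake_qa` are hypothetical and not part of the notebook, and `fake_qa` is only a stand-in so the snippet runs without downloading a model. It assumes the reader returns a dict with `'score'` and `'answer'` keys, as in Out[38].

```python
def best_answer(qa_fn, question, contexts):
    """Run qa_fn on each context and return the highest-scoring result.

    qa_fn is assumed to accept question= and context= keyword arguments and
    to return a dict with at least 'score' and 'answer' keys.
    """
    best = None
    for rank, context in enumerate(contexts):
        result = dict(qa_fn(question=question, context=context))
        result['rank'] = rank  # position of the context in the retrieval ranking
        if best is None or result['score'] > best['score']:
            best = result
    return best


def fake_qa(question, context):
    # Hypothetical stand-in for the real pipeline: scores a context by the
    # fraction of question words it contains and returns the context as answer.
    words = [w.strip('?').lower() for w in question.split()]
    hits = sum(w in context.lower() for w in words)
    return {'score': hits / len(words), 'answer': context}


contexts = ['Undersampling deletes majority-class instances.',
            'Oversampling adds duplicates of the minority class.']
result = best_answer(fake_qa, 'What is Oversampling?', contexts)
print(result['rank'], result['answer'])
```

With the notebook's objects, a call along the lines of `best_answer(nlp, QUERY, [str(split_no_overlap[h['corpus_id']]) for h in cos_sim_score[:3]])` should work, assuming `cos_sim_score` is a list of hits sorted by similarity.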

Semantic Search with Clustering

An alternative to fixed-size chunking is to use clustering to build semantic documents: small chunks that are semantically similar are merged into new documents. This takes some judgment, because any change to the chunks changes the resulting embedding vectors. For instance, agglomerative clustering from scikit-learn can group similar sentences or paragraphs into new documents.

Let’s try to cluster the chunks. Agglomerative clustering is a hierarchical, bottom-up technique: each data point starts as its own cluster, and clusters are successively merged according to a linkage criterion until all data points belong to a single cluster or a specified number of clusters is reached. The merging process can be drawn as a tree-like structure called a dendrogram. Here the model is configured to take a precomputed distance matrix as input.
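To make the bottom-up merging concrete, here is a small sketch on toy data. It uses SciPy's hierarchical clustering routines rather than the scikit-learn model used in this notebook, because SciPy exposes the merge history directly. The toy vectors and the threshold of 0.5 are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy "embeddings": two directions in 2-D, three vectors near each one
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])

# Condensed vector of pairwise cosine distances (1 - cosine similarity)
dist = pdist(toy, metric='cosine')

# Bottom-up merging with complete linkage; each row of Z records one merge:
# (cluster a, cluster b, distance at which they merge, size of new cluster)
Z = linkage(dist, method='complete')

# Cut the tree either at a fixed number of clusters ...
labels_k = fcluster(Z, t=2, criterion='maxclust')
# ... or at a distance threshold, letting the data decide the count
labels_t = fcluster(Z, t=0.5, criterion='distance')

print(Z.shape)   # one merge per row: n - 1 rows for n points
print(labels_k)
print(labels_t)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the corresponding tree.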

In [39]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Assuming we possess a list of text embeddings named 'doc_embeddings'
# Calculate the cosine similarity matrix for all pairs of embeddings

matrix_cosine_sim = cosine_similarity(doc_embeddings)

# Cluster on cosine distance (1 - cosine similarity)

# Instantiate the AgglomerativeClustering model
agg_clustering = AgglomerativeClustering(
    
    # Fix the number of clusters. Alternatively, set n_clusters=None and use
    # distance_threshold to let the data determine the number of clusters
    n_clusters=4, 
    
    # With n_clusters=None, clusters stop merging once the linkage distance
    # between them reaches the threshold
    #distance_threshold=0.5, 
    
    # We are providing a precomputed distance matrix (1 - similarity matrix) as input
    # (this parameter was called `affinity` in scikit-learn versions before 1.2)
    metric='precomputed',  
    
    # Complete linkage: at each step, merge the two clusters whose maximum
    # pairwise distance between members is smallest
    linkage='complete'        
   )

# Fit the model to the cosine distance matrix (1 - similarity matrix)
agg_clustering.fit(1 - matrix_cosine_sim)

# Obtain the cluster labels for each embedding
cluster_labels_cosine = agg_clustering.labels_

# Display the number of embeddings in each cluster
unique_labels, counts = np.unique(cluster_labels_cosine, return_counts=True)
for label, count in zip(unique_labels, counts):
    print(f'Cluster {label}: {count} texts')
Cluster 0: 10 texts
Cluster 1: 28 texts
Cluster 2: 35 texts
Cluster 3: 9 texts
In [40]:
matrix_cosine_sim
Out[40]:
array([[0.99999994, 0.5833392 , 0.4138024 , ..., 0.24828397, 0.46851665,
        0.5100452 ],
       [0.5833392 , 1.0000001 , 0.6709356 , ..., 0.1305966 , 0.5283048 ,
        0.54814637],
       [0.4138024 , 0.6709356 , 0.99999994, ..., 0.13354453, 0.42463306,
        0.41541386],
       ...,
       [0.24828397, 0.1305966 , 0.13354453, ..., 0.99999994, 0.28361392,
        0.26737753],
       [0.46851665, 0.5283048 , 0.42463306, ..., 0.28361392, 0.9999999 ,
        0.4775532 ],
       [0.5100452 , 0.54814637, 0.41541386, ..., 0.26737753, 0.4775532 ,
        1.0000001 ]], dtype=float32)
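The diagonal in Out[40] shows values such as 0.99999994 and 1.0000001, which is float32 round-off. Taking 1 - similarity therefore leaves tiny nonzero, and sometimes negative, entries on the diagonal. The following sketch builds a cleaned distance matrix in plain NumPy. The function name `cosine_distance_matrix` is hypothetical, and the random array is only a stand-in for `doc_embeddings`.

```python
import numpy as np

def cosine_distance_matrix(embeddings):
    """Return a symmetric, non-negative cosine distance matrix with a zero diagonal."""
    x = np.asarray(embeddings, dtype=np.float64)
    # L2-normalise rows so that the dot product equals cosine similarity
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)
    sim = x @ x.T
    dist = 1.0 - sim
    dist = np.clip(dist, 0.0, 2.0)   # remove round-off outside [0, 2]
    dist = (dist + dist.T) / 2.0     # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)      # a point is at distance 0 from itself
    return dist

rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 8)).astype(np.float32)  # stand-in for doc_embeddings
D = cosine_distance_matrix(demo)
print(D.shape, D.min(), np.allclose(D, D.T))
```

Assuming `doc_embeddings` is array-like, `agg_clustering.fit(cosine_distance_matrix(doc_embeddings))` could be used in place of `agg_clustering.fit(1 - matrix_cosine_sim)`.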
In [42]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None: do not truncate column contents

# Build a DataFrame pairing each text chunk with its cluster label
df = pd.DataFrame()

df['Split_text'] = split_no_overlap
df['clusters'] = cluster_labels_cosine
In [43]:
df
Out[43]:
Split_text clusters
0 \n\nESEARCH ARTICLE\nMachine learning approaches for the prediction of serious\nfluid leakage from hydrocarbon wells\nMehdi Rezvandehy1and Bernhard Mayer2\n1Department of Chemical and Petroleum Engineering, University of Calgary, Calgary, AB, Canada\n2Department of Geoscience, University of Calgary, Calgary, AB, Canada\nCorresponding author: Mehdi Rezvandehy; Email: mehdi. rezvandehy@ucalgary. ca\nReceived: 23 August 2022; Revised: 05 April 2023; Accepted: 14 April 2023\nKeywords: Energy wells; imbalanced class classification; imputation; probability estimation; resampling\nAbstract\nThe exploitation of hydrocarbon reservoirs may potentially lead to contamination of soils, shallow water resources,\nand greenhouse gas emissions. 1
1 Fluids such as methane or CO 2may in some cases migrate toward the groundwater\nzone and atmosphere through and along imperfectly sealed hydrocarbon wells. Field tests in hydrocarbon-producing\nregions are routinely conducted for detecting serious leakage to prevent environmental pollution. The challenge isthat testing is costly, time-consuming, and sometimes labor-intensive. In this study, machine learning approaches\nwere applied to predict serious leakage with uncertainty quantification for wells that have not been field tested in\nAlberta, Canada. An improved imputation technique was developed by Cholesky factorization of the covariancematrix between features, where missing data are imputed via conditioning of available values. The uncertainty in\nimputed values was quantified and incorporated into the final prediction to improve decision-making. 1
2 Next, a wide\nrange of predictive algorithms and various performance metrics were considered to achieve the most reliableclassifier. However, a highly skewed distribution of field tests toward the negative class (nonserious leakage) forcespredictive models to unrealistically underestimate the minority class (serious leakage). To address this issue, a\ncombination of oversampling, undersampling, and ensemble learning was applied. By investigating all the models on\nnever-before-seen data, an optimum classifier with minimal false negative prediction was determined. The developedmethodology can be applied to identify the wells with the highest likelihood for serious fluid leakage within\nproducing fields. This information is of key importance for optimizing field test operations to achieve economic\nand environmental benefits. \nImpact Statement\nField test operations to detect methane and CO2 leakages from hydrocarbon wells can be costly. 1
3 Most wells do\nnot have leaks or are categorized as non-serious, which means that no repair is needed until they are abandoned. \nHowever, it is crucial to identify and prioritize serious leakages for immediate remediation to prevent environ-\nmental pollution. This study developed a reliable predictive model by correlating the results of historical fieldtests with various well properties, including age, depth, production/injection history, and deviation, among\nothers. The trained model can predict the likelihood of serious leakage for untested wells, allowing for the\nprioritization of wells with the highest probability of leaks for field testing. This approach leads to cost-effectivefield testing and environmental benefits. \n© The Author(s), 2023. Published by Cambridge University Press. 1
4 This is an Open Access article, distributed under the terms of the Creative Commons\nAttribution licence ( http://creativecommons. org/licenses/by/4. 0 ), which permits unrestricted re-use, distribution and reproduction, provided the\noriginal article is properly cited. Data-Centric Engineering (2023), 4: e12\ndoi:10. 1017/dce. 2023. 9\nhttps://doi. org/10. 1017/dce. 2023. 9 Published online by Cambridge University Press\n\n. Introduction\nExploitation of oil and gas reservoirs has raised public concerns regarding potential contamination of\nsoils, shallow groundwater, and increases in greenhouse gas emissions (Shindell et al. , 2009 ; Brandt et al. ,\n2014 ; Cherry et al. 1
... ... ...
77 9 Published online by Cambridge University Press\n\nyer J ,Lackey G ,Edvardsen L ,Bean A ,Carroll SA ,Huerta N ,Smith MM ,Torsæter M ,Dilmore RM and Cerasi P (2022) A\nreview of well integrity based on field experience at carbon utilization and storage sites. International Journal of Greenhouse Gas\nControl 113 , 103533. \nJournel AG and Bitanov A (2004) Uncertainty in N/G ratio in early reservoir development. Journal of Petroleum Science and\nEngineering 44 (1), 115 –130. \nKaggle (2022) “Target Encoding ”Available at https://www. kaggle. com/ryanholbrook/target-encoding (accessed 14 April 2023). 3
78 \nKang M ,Kanno CM ,Reid MC ,Zhang X ,Mauzerall DL ,Celia MA ,Chen Yand Onstott TC (2014) Direct measurements of\nmethane emissions from abandoned oil and gas wells in Pennsylvania. Proceedings of the National Academy of Sciences 111 (51),\n18173–18177. \nKhan KD and Deutsch CV (2016) Practical incorporation of multivariate parameter uncertainty in Geostatistical resource\nmodeling. Natural Resources Research 25 (1), 51–70. \nMontague JA ,Pinder GF and Watson TL (2018) Predicting gas migration through existing oil and gas wells. Environmental\nGeosciences 25 (4), 121 –132. 3
79 \nPedregosa F ,Varoquaux G ,Gramfort A ,Michel V ,Thirion B ,Grisel O ,Blondel M ,Prettenhofer P ,Weiss R ,Dubourg V ,\nVanderplas J ,Passos A ,Cournapeau D ,Brucher M ,Perrot M and Duchesnay É (2011) Scikit-learn: Machine learning in\npython. Journal of Machine Learning Research 12 , 2825–2830. \nRezvandehy M and Deutsch CV (2017) Horizontal variogram inference in the presence of widely spaced well data. Petroleum\nGeoscience 24 (2), 219 –235. 3
80 \nSandl E ,Cahill A ,Welch L and Beckie R (2021) Characterizing oil and gas wells with fugitive gas migration through Bayesian\nmultilevel logistic regression. Science of the Total Environment 769 , 144678. \nSantos MS ,Soares JP ,Abreu PH ,Araujo H and Santos J (2018) Cross-validation for imbalanced datasets: Avoiding over-\noptimistic and overfitting approaches [research frontier]. IEEE Computational Intelligence Magazine 13 (4), 59–76. \nShindell DT ,Faluvegi G ,Koch DM ,Schmidt GA ,Unger N and Bauer SE (2009) Improved attribution of climate forcing to\nemissions. Science 326 (5953), 716 –718. 3
81 \nvan Buuren Sv and Groothuis-Oudshoorn K (2010) MICE: Multivariate imputation by chained equations in r. Journal of\nStatistical Software 45 ,1–68. \nWatson TL and Bachu S (2009) Evaluation of the potential for gas and CO 2leakage along wellbores. SPE Drilling & Completion\n24(01), 115 –126. \nWisen J ,Chesnaux R ,Werring J ,Wendling G ,Baudron P and Barbecot F (2020) A portrait of wellbore leakage in northeastern\nBritish Columbia, Canada. Proceedings of the National Academy of Sciences 117 (2), 913 –922. \nCite this article: Rezvandehy M and Mayer B (2023). 1

82 rows × 2 columns

This method often produces semantically cohesive chunks, but a merged document can contain pieces that are isolated from their original surrounding text and so lose that context. It works best when the initial chunks are expected to be largely independent of one another.
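The step of forming new documents from the clustered chunks can be sketched as follows. This is one possible approach, not necessarily the one intended here: chunks that share a cluster label are concatenated in their original order. The function name `merge_chunks_by_cluster` and the example strings are hypothetical.

```python
from collections import defaultdict

def merge_chunks_by_cluster(chunks, labels, sep='\n'):
    """Concatenate chunks sharing a cluster label, keeping original order."""
    grouped = defaultdict(list)
    for chunk, label in zip(chunks, labels):
        grouped[int(label)].append(str(chunk))
    # One merged document per cluster, keyed by cluster label
    return {label: sep.join(parts) for label, parts in sorted(grouped.items())}

chunks = ['Oversampling adds minority samples.',
          'A reference list entry.',
          'Undersampling removes majority samples.']
labels = [0, 1, 0]
merged = merge_chunks_by_cluster(chunks, labels)
print(merged[0])
```

With the DataFrame above, the call would be `merge_chunks_by_cluster(df['Split_text'], df['clusters'])`. The merged documents can then be embedded with the same sentence-transformer model and searched as before. With only four clusters the merged texts are long, and embedding models typically truncate input beyond their maximum sequence length, so a larger number of clusters may be needed in practice.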
