Summary
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms and paraphrases. It makes retrieval more complete by capturing what the user is actually asking instead of simply matching keywords to pages. The idea behind semantic search is to embed all entries in the corpus, which can be sentences, paragraphs, or documents, into a vector space. Semantic search algorithms then use contextual embeddings to perform look-ups, providing closer contextual matches than purely lexical matching. In this notebook, we search a question within a book and find the closest answers to the question.
Python functions and data files needed to run this notebook are available via this link.
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
from datasets import load_dataset # Load data set from HuggingFace
from sentence_transformers import SentenceTransformer, util # SentenceTransformer for semantic meaning
from transformers import pipeline
from random import sample, seed, shuffle
from sentence_transformers import InputExample, losses, evaluation
from torch.utils.data import DataLoader
We have two types of search:
Symmetric Search: documents and queries are roughly the same size and carry the same amount of semantic content: e.g. getting news article titles for a given query
Asymmetric Search: documents are usually longer than queries and carry larger amounts of semantic content: e.g. getting an entire paragraph from a textbook for answering a query
A Siamese **bi-encoder** architecture allows the network to learn embeddings that can be compared using cosine similarity. A **cross-encoder** is a traditional BERT-style model and is slower. The bi-encoder encodes the query and each candidate independently and uses a scoring function (such as cosine similarity) to compare them; the cross-encoder encodes each query-candidate pair together, capturing their joint context. The cross-encoder generally performs better but is more computationally expensive, while the bi-encoder is more efficient but may lack context awareness. The choice depends on the task and its requirements.
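As a minimal sketch of the difference, the snippet below scores the same query-candidate pairs both ways; the cross-encoder checkpoint cross-encoder/ms-marco-MiniLM-L-6-v2 is an illustrative choice, not one used elsewhere in this notebook:
from sentence_transformers import SentenceTransformer, CrossEncoder, util
query = 'How do snares trap animals?'
candidates = ['A snare is a wire loop set on an animal trail.',
              'The weather was unusually warm that spring.']
# Bi-encoder: embed the query and the candidates independently, then compare with cosine similarity
bi_model = SentenceTransformer('msmarco-distilbert-base-v4')
bi_scores = util.cos_sim(bi_model.encode(query), bi_model.encode(candidates))
# Cross-encoder: score each (query, candidate) pair jointly in one forward pass
cross_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
cross_scores = cross_model.predict([(query, c) for c in candidates])
print(bi_scores, cross_scores)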
$\large \text{Similarity} = \cos(\theta) = \frac{\mathbf{A}\cdot\mathbf{B}}{\left\|\mathbf{A}\right\|\,\left\|\mathbf{B}\right\|} = \frac{\sum_{i=1}^{n} A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}}\,\sqrt{\sum_{i=1}^{n} B_{i}^{2}}}$
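As a quick numerical check of the formula, cosine similarity can be computed directly with NumPy:
import numpy as np
A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])
# cos(theta) = (A . B) / (||A|| ||B||)
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)  # about 1.0 because B points in the same direction as A (it is just 2*A)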
# We want to search for a person on Google
PERSON = 'Mehdi Rezvandehy'
# This is done for educational purposes only, to show how a question-answering pipeline works on raw web text
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text, 'html.parser').get_text()[:1024]
nlp = pipeline('question-answering',
model='deepset/roberta-base-squad2',
tokenizer='deepset/roberta-base-squad2',
max_length=50) # question-answering pipeline with the RoBERTa model
nlp(f'Who is {PERSON}?', google_html)
When applying semantic search to large documents, breaking down the text into manageable and meaningful chunks is essential for efficient processing.
Splitting can be applied at the word level (maximum tokens) or the paragraph level. At the word level, if a document has 800 tokens and the model accepts at most 512, we might split it into two chunks: the first 512 tokens in one chunk and the remaining 288 tokens in another. Each chunk is then treated as a separate input during the semantic search process.
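Here is a minimal, self-contained sketch of this word-level splitting, using a toy token list instead of a real tokenizer and a hypothetical split_by_max_tokens helper:
def split_by_max_tokens(tokens, max_tokens=512):
    # Split a flat list of tokens into consecutive chunks of at most max_tokens
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# A toy document of 800 "tokens" is split into chunks of 512 and 288 tokens
toy_tokens = [f'tok{i}' for i in range(800)]
print([len(chunk) for chunk in split_by_max_tokens(toy_tokens)])  # [512, 288]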
In addition to word-level splitting, we might also consider splitting documents at the paragraph level. Each paragraph can be treated as a separate unit for semantic search. This approach has some advantages:
Context Preservation: Splitting at the paragraph level helps maintain the context of each paragraph, which can be important for understanding the meaning of the text.
Parallel Processing: We can process multiple paragraphs in parallel, taking advantage of modern hardware and speeding up the search process.
Granular Matching: If the query is related to a specific topic or context, paragraph-level matching allows us to identify relevant paragraphs more accurately.
Natural whitespace chunking refers to splitting the text at the natural whitespace (blank lines) between paragraphs. This can be done with or without overlap between the chunks. Without overlap, the document is split into non-overlapping chunks based on the natural whitespace (line breaks) between paragraphs; each chunk is independent and contains a distinct set of paragraphs. Both variants are sketched in code after the pros and cons below.
The figure below shows whitespace chunking without overlap.
Pros:
Non-Overlapping Chunks: Each chunk corresponds to a distinct, non-overlapping portion of the text. This can simplify downstream processing and analysis.
Sequential Processing: When chunks do not overlap, you can process them sequentially, which may be more straightforward and efficient.
Cons:
Context Discontinuity: If the content within a paragraph spans multiple chunks, there might be a discontinuity in context, potentially impacting the understanding of the text.
Boundary Effects: Non-overlapping chunks might lead to missing important information near chunk boundaries, especially if there are critical phrases or sentences that bridge two adjacent chunks.
The figure below shows whitespace chunking with overlap:
Pros:
Context Continuity: Overlapping chunks help maintain context continuity, as portions of adjacent chunks overlap, capturing shared information.
Reduced Boundary Effects: Overlapping chunks can mitigate boundary effects by ensuring that information near chunk boundaries is present in multiple chunks.
Cons:
Redundancy: Overlapping chunks may introduce redundancy, where some content is present in multiple chunks. This redundancy can impact efficiency in storage and processing.
Increased Complexity: Overlapping chunks might require more complex processing, especially if you need to handle duplicate information or ensure that the analysis considers the shared context.
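Here is a minimal, self-contained sketch of both variants; whitespace_chunks is a hypothetical helper that groups blank-line-separated paragraphs into fixed-size chunks and optionally overlaps consecutive chunks by a given number of paragraphs:
def whitespace_chunks(text, paragraphs_per_chunk=3, overlap=0):
    # Group blank-line-separated paragraphs into chunks, optionally overlapping consecutive chunks
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    step = paragraphs_per_chunk - overlap
    return ['\n\n'.join(paragraphs[i:i + paragraphs_per_chunk])
            for i in range(0, len(paragraphs), step)]

sample_text = '\n\n'.join(f'Paragraph {i}.' for i in range(1, 8))
print(whitespace_chunks(sample_text, overlap=0))  # distinct, non-overlapping chunks
print(whitespace_chunks(sample_text, overlap=1))  # consecutive chunks share one paragraph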
In this notebook, we apply chunking without overlap and keep only paragraphs longer than 90 characters. See below:
# Textbook about animals, Title: Wild Animals I Have Known
text = urlopen("""https://www.gutenberg.org/cache/epub/3031/pg3031.txt""").read().decode() # open URL of the document and read
# split the book into paragraphs on blank lines and only keep paragraphs of more than 90 characters
doc = list(filter(lambda x: len(x) > 90, text.split('\r\n\r\n')))
doc = np.array(doc)
print(f'There are {len(doc)} paragraphs')
# The bi-encoder encodes all the documents, one at a time, for asymmetric semantic search
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v4') # DistilBERT fine-tuned on the MS MARCO data set
bi_encoder.max_seq_length = 256 # Truncate long documents to 256 tokens
bi_encoder
# To use the bi-encoder, we call the encode method of bi_encoder. It takes an iterable of document strings
# and converts them to tensors
doc_embeddings = bi_encoder.encode(doc, convert_to_tensor=True, show_progress_bar=True)
doc_embeddings.shape
The query must also be encoded:
QUERY = """What is snare?""" # a natural language query
# Encode the QUERY using the bi-encoder and convert to a tensor and find relevant documents
que_embedding = bi_encoder.encode(QUERY, convert_to_tensor=True)
# Set the number of documents to retrieve with the bi-encoder
# top_k returns the documents most similar to the query, ranked by descending cosine similarity
cos_sim_score = util.semantic_search(que_embedding, doc_embeddings, top_k=3)[0]
cos_sim_score
print(f'QUERY: {QUERY}\n')
for no, ir in enumerate(cos_sim_score):
print(f'Document {no + 1}: Cosine Similarity is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
print('\n')
nlp = pipeline('question-answering',
model='deepset/roberta-base-squad2',
tokenizer='deepset/roberta-base-squad2',
max_length=512) # question-answering pipeline with the RoBERTa model
# Answer the QUERY from the top document
nlp(QUERY, str(doc[cos_sim_score[0]['corpus_id']]),max_length=50)
We just built an "Open Book Q/A" system. It looks through the entire book and finds the paragraphs closest to the question.
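Putting the two pieces together, here is a minimal sketch of the retrieval-then-read loop, reusing the bi_encoder, doc_embeddings, doc, and nlp objects defined above (answer_from_book is a hypothetical helper name):
def answer_from_book(query, top_k=3):
    # Retrieve the most similar paragraphs with the bi-encoder, then extract an answer with the QA pipeline
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=top_k)[0]
    best_paragraph = str(doc[hits[0]['corpus_id']])
    result = nlp(query, best_paragraph)
    return {'answer': result['answer'], 'score': result['score'], 'paragraph': best_paragraph}

answer_from_book('What is snare?')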
To fine-tune our bi-encoder, we need a data set of questions and answers. We use adversarial_qa from Hugging Face. A question paired with its own context makes a good training pair (cosine similarity label of 1), while a question paired with an unrelated context makes a bad training pair (cosine similarity label of 0). The code below is retrieved from Sinan Ozdemir.
# Load the adversarial_qa dataset from the Q/A use-case in Hugging Face
training_qa = load_dataset('adversarial_qa', 'adversarialQA', split='train')
# Initialize lists for good and bad training data
data_good_training = []
data_bad_training = []
# Iterate over the adversarial_qa dataset
last_example = None
for example in training_qa:
# Check if the context is different from the previous example
if last_example and example['context'] != last_example['context']:
# If different, append to bad training data with a label of 0.0 (dissimilar)
data_bad_training.append((example['question'], last_example['context'], 0.0))
# If the context is the same as the previous example
elif last_example and example['context'] == last_example['context']:
# Append to good training data with a label of 1.0
data_good_training.append((example['question'], example['context'], 1.0))
# Update last_example for the next iteration
last_example = example
len(data_good_training), len(data_bad_training)
Good training pairs have the highest similarity label (1) between question and context:
data_good_training[0]
Bad training pairs have the lowest similarity label (0) between question and context:
data_bad_training[0]
Now we take 700 samples each of the good and bad pairs, merge them, and shuffle the data:
# https://www.sbert.net/docs/training/overview.html for more information on training
seed(42) # seed our upcoming sample
sampled_training_data = sample(data_good_training, 700) + sample(data_bad_training, 700)
shuffle(sampled_training_data) # shuffle our data around
training_index = int(.75 * len(sampled_training_data)) # Get a 75/25 train/test split
# Define the training examples as InputExample objects; that is the format the library expects
train_examples = [InputExample(texts=t[:2], label=t[2]) for t in sampled_training_data[:training_index]]
train_examples[0].__dict__
We have not used a DataLoader before (we used a data collator): a data loader is the object that specifically shuffles and grabs batches of data from a Dataset. Here it is paired with the bi-encoder's smart_batching_collate collator, which turns each batch into model-ready tensors.
# Define the training dataset, dataloader, and the training loss
# Typically, the Trainer class in Hugging Face can handle this automatically, but here's a manual setup
# Create a dataloader for training with shuffling and a batch size of 32
train_dataloader_set = DataLoader(train_examples, shuffle=True, batch_size=32, collate_fn=bi_encoder.smart_batching_collate)
A loss must also be defined to measure how well the model predicts. We use a cosine similarity loss, which measures how close the predicted cosine similarity is to the actual similarity label.
# Explicitly define the training loss using cosine similarity
train_loss_set = losses.CosineSimilarityLoss(bi_encoder)
train_loss_set
Every batch of data yields three things: the tokenized query batch, the tokenized context batch, and the labels:
(QUERY_batch, context_batch), labels = next(iter(train_dataloader_set)) # get a sample batch of data
QUERY_batch['input_ids'].shape, context_batch['input_ids'].shape, labels.shape
The 256 comes from bi_encoder.max_seq_length = 256; any sequence longer than 256 tokens is truncated.
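To see the truncation directly, here is a small check using the underlying Hugging Face tokenizer, which standard SentenceTransformer models expose as bi_encoder.tokenizer (this mirrors what the collator does internally):
long_text = 'word ' * 1000  # far longer than 256 tokens
encoded = bi_encoder.tokenizer(long_text, truncation=True, max_length=bi_encoder.max_seq_length)
print(len(encoded['input_ids']))  # 256: everything beyond max_seq_length is dropped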
The evaluator measures embedding closeness on our evaluation set, which is the held-out 25% of the data.
# Evaluation data: sentences1 and sentences2 are lists of queries and contexts respectively, and scores are 0 or 1
sentences1, sentences2, scores = zip(*sampled_training_data[training_index:])
sentences1[0], sentences2[0], scores[0]
The EmbeddingSimilarityEvaluator will evaluate embedding closeness based on cosine similarity:
# evaluator will evaluate embedding closeness
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
# Initial evaluation
bi_encoder.evaluate(evaluator)
The initial evaluation score is 0.61 (higher is better), and we want to increase it:
Finally, we can fit our model with bi_encoder.fit, where we pass our training objectives and our training loss.
# Fine-tune the model using the fit method using bi_encoder.fit
bi_encoder.fit(
train_objectives=[(train_dataloader_set, train_loss_set)], # this is how it calculates the losses to update the weights
output_path='qa/results',
epochs=2,
evaluator=evaluator
)
bi_encoder.evaluate(evaluator) # final evaluation (higher embedding similarity is better)
# Not a huge jump in performance with 2 epochs. We could try more data or more epochs
There is very little performance improvement after fine-tuning the model.
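If we wanted to push further, a possible next step is more epochs plus a learning-rate warm-up, both standard arguments of SentenceTransformer.fit; the values and the output path below are illustrative, not tuned:
# Continue fine-tuning with more epochs and a linear warm-up (illustrative values)
bi_encoder.fit(
    train_objectives=[(train_dataloader_set, train_loss_set)],
    evaluator=evaluator,
    epochs=5, # more passes over the 1,050 training pairs
    warmup_steps=100, # ramp the learning rate up before decaying it
    output_path='qa/results_more_epochs'
)
bi_encoder.evaluate(evaluator)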
# load fine-tuned model
finetuned_bi_encoder = SentenceTransformer('qa/results')
# Slightly more confident results!
# Encode document
document_embeddings = finetuned_bi_encoder.encode(doc, convert_to_tensor=True, show_progress_bar=True)
# Encode question
QUERY_embedding = finetuned_bi_encoder.encode(QUERY, convert_to_tensor=True)
# Get document cos_sim_score
cos_sim_score = util.semantic_search(QUERY_embedding, document_embeddings, top_k=3)[0]
print(f'QUERY: {QUERY}\n')
for no, ir in enumerate(cos_sim_score):
print(f'Document {no + 1}: Cosine Similarity is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
print('\n')
import PyPDF2
from tqdm import tqdm
def extract_text_from_pdf(pdf_path):
# Open the PDF file in read-binary mode
with open(pdf_path, 'rb') as file:
# Create a PDF reader object
reader = PyPDF2.PdfReader(file)
# Initialize an empty string to hold the text
extracted_text = ''
# Loop through each page in the PDF file
for page in tqdm(reader.pages):
# Extract the text from the page
text = page.extract_text()
# Find the starting point of the text to extract
# In this case, extracting text starting from the string ' ]'
starting_point = text.find(' ]') + 2
extracted_text += '\n\n' + text[starting_point:]
return extracted_text
# Specify the path to the PDF file
pdf_path = 'machine_learning.pdf'
# Extract text from the PDF
principles_of_ds_text = extract_text_from_pdf(pdf_path)
principles_of_ds_text[:600]
The function below splits the text into chunks of a maximum number of tokens. This function is retrieved from Sinan Ozdemir.
# Function to split the text into chunks of a maximum number of tokens.
from transformers import AutoTokenizer
model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
def overlapping_chunks(text, max_tokens = 200, overlapping_factor = 5):
'''
max_tokens: maximum tokens per segment
overlapping_factor: number of sentences to start each segment with, overlapping with the previous segment
'''
import re
# Split the text using punctuation
sentences = re.split(r'[.?!]', text)
# Get the token count for each sentence
n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
chunks, tokens_so_far, chunk = [], 0, []
# Iterate through the sentences and tokens joined together in a tuple
for sentence, token in zip(sentences, n_tokens):
# If the number of tokens so far plus the number of tokens in the current sentence is greater
# than the max number of tokens, then add the chunk to the list of chunks and reset
# the chunk and tokens so far
if tokens_so_far + token > max_tokens:
chunks.append(". ".join(chunk) + ".")
if overlapping_factor > 0:
chunk = chunk[-overlapping_factor:]
tokens_so_far = sum([len(tokenizer.encode(c)) for c in chunk])
else:
chunk = []
tokens_so_far = 0
# If the number of tokens in the current sentence is greater than the max number of
# tokens, go to the next sentence
if token > max_tokens:
continue
# Otherwise, add the sentence to the chunk and add the number of tokens to the total
chunk.append(sentence)
tokens_so_far += token + 1
return chunks
split_no_overlap = overlapping_chunks(principles_of_ds_text, max_tokens=200, overlapping_factor=0)
avg_length = sum([len(tokenizer.encode(t)) for t in split_no_overlap]) / len(split_no_overlap)
print(f'non-overlapping chunking approach has {len(split_no_overlap)} documents with average length {avg_length:.1f} tokens')
split_no_overlap[20]
# with 5 overlapping sentences per chunk
split_with_overlap = overlapping_chunks(principles_of_ds_text, max_tokens=200, overlapping_factor=5)
avg_length = sum([len(tokenizer.encode(t)) for t in split_with_overlap]) / len(split_with_overlap)
print(f'overlapping chunking approach has {len(split_with_overlap)} documents with average length {avg_length:.1f} tokens')
split_with_overlap[20]
With overlap, we see an increase in the number of document chunks, but they are all approximately the same size. The higher the overlapping factor, the more redundancy we introduce into the system.
# The bi-encoder encodes all the documents, one at a time, for asymmetric semantic search
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v4') # DistilBERT fine-tuned on the MS MARCO data set
# To use the bi-encoder, we call the encode method of bi_encoder. It takes an iterable of document strings
# and converts them to tensors
doc_embeddings = bi_encoder.encode(split_no_overlap, convert_to_tensor=True, show_progress_bar=True)
# Encode Query
QUERY = """What is Oversampling?""" # a natural language query
# Encode the QUERY using the bi-encoder and convert to a tensor and find relevant documents
que_embedding = bi_encoder.encode(QUERY, convert_to_tensor=True)
# Set the number of documents to retrieve with the bi-encoder
# top_k returns the documents most similar to the query, ranked by descending cosine similarity
cos_sim_score = util.semantic_search(que_embedding, doc_embeddings, top_k=3)[0]
print(f'QUERY: {QUERY}\n')
for no, ir in enumerate(cos_sim_score):
print(f'Document {no + 1}: Cosine Similarity is {ir["score"]:.3f}:\n\n{split_no_overlap[ir["corpus_id"]]}')
print('\n')
nlp = pipeline('question-answering',
model='deepset/roberta-base-squad2',
tokenizer='deepset/roberta-base-squad2',
max_length=512) # question-answering pipeline with the RoBERTa model
# Answer the QUERY from the top document
nlp(QUERY, str(split_no_overlap[cos_sim_score[0]['corpus_id']]),max_length=100)
An alternative method for document chunking is to use clustering to generate semantic documents. New documents are formed by merging small, semantically similar chunks of information. This requires some creativity, since any adjustment to the document chunks changes the resulting vector. For instance, one could use agglomerative clustering from scikit-learn to group similar sentences or paragraphs into new documents.
Let’s try to cluster the chunks. The Agglomerative Clustering model is specifically configured to use a precomputed distance matrix as input. Agglomerative Clustering is a hierarchical clustering technique that builds a hierarchy of clusters. It is a bottom-up approach, where each data point starts as its own cluster, and clusters are successively merged based on a linkage criterion until all data points belong to a single cluster or a specified number of clusters is reached. The result is a tree-like structure called a dendrogram, which illustrates the merging process.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Assuming we possess a list of text embeddings named 'doc_embeddings'
# Calculate the cosine similarity matrix for all pairs of embeddings (moving the tensor to CPU/NumPy first)
matrix_cosine_sim = cosine_similarity(doc_embeddings.cpu().numpy())
# Based on cosine similarity matrix
# Instantiate the AgglomerativeClustering model
agg_clustering = AgglomerativeClustering(
# Fix the number of clusters; alternatively, set n_clusters=None and let
# distance_threshold determine the number of clusters from the data
n_clusters=4,
# Alternative: keep merging clusters until all pairwise cluster distances exceed the threshold
#distance_threshold=0.5,
# We are providing a precomputed distance matrix (1 - similarity matrix) as input
# (newer scikit-learn versions use metric='precomputed' instead of affinity)
affinity='precomputed',
# Complete linkage merges clusters based on the maximum distance between their members
linkage='complete'
)
# Fit the model to the cosine distance matrix (1 - similarity matrix)
agg_clustering.fit(1 - matrix_cosine_sim)
# Obtain the cluster labels for each embedding
cluster_labels_cosine = agg_clustering.labels_
# Display the number of embeddings in each cluster
unique_labels, counts = np.unique(cluster_labels_cosine, return_counts=True)
for label, count in zip(unique_labels, counts):
print(f'Cluster {label}: {count} texts')
matrix_cosine_sim
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None (instead of the deprecated -1) shows the full column width
# Create an empty DataFrame
df = pd.DataFrame()
df['Split_text'] = split_no_overlap
df['clusters'] = cluster_labels_cosine
df
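To actually build the semantic documents described above, a minimal sketch (under the assumption that chunks within a cluster belong together) joins the chunks of each cluster and re-encodes the merged documents with the same bi-encoder:
# Join the chunks assigned to each cluster into one larger "semantic document"
semantic_docs = (df.groupby('clusters')['Split_text']
                   .apply(lambda chunks: '\n\n'.join(chunks))
                   .tolist())
# Re-encode the merged documents so they can be searched like any other corpus
semantic_doc_embeddings = bi_encoder.encode(semantic_docs, convert_to_tensor=True, show_progress_bar=True)
print(len(semantic_docs), semantic_doc_embeddings.shape)

With only four clusters the merged documents can become very long, so in practice one might use more clusters or a distance threshold to keep each semantic document within the encoder's sequence limit.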
This method often produces semantically cohesive chunks, but it has the drawback that isolated pieces of content may lack a contextual connection with the surrounding text. It works best when the initial chunks are expected to be relatively unrelated to each other and fairly independent.