Summary
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms. Semantic search with OpenAI uses embeddings: numerical representations of text that capture the semantic meaning of the input, enabling a more nuanced understanding of language. When a user submits a search query, the model converts it into an embedding, which is then compared to the embeddings of documents in a search index. The search results are ranked by similarity score, providing more accurate and context-aware retrieval of information than traditional keyword-based search. This approach leverages advanced language models to enhance search precision by focusing on the underlying meaning of the text. In this notebook, we explore the process of querying a question within a book and identifying the most relevant answers to that question.
Python functions and data files needed to run this notebook are available via this link.
import warnings
warnings.filterwarnings('ignore')
import openai
from openai.embeddings_utils import get_embedding
from urllib.request import urlopen
import numpy as np
from sentence_transformers import util
from transformers import pipeline
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.core.display import HTML
from sklearn.metrics.pairwise import cosine_similarity
import requests
from bs4 import BeautifulSoup
The choice of the text embedder is critical, as it determines the quality of the vector representation of the text. We have many options for vectorizing with LLMs, both open and closed source. To get off the ground quicker, we are going to use OpenAI's closed-source "Embeddings" product, which means we have limited control over its implementation and potential biases. It's important to keep in mind that when using closed-source products, we may not have access to the underlying algorithms, which can make it difficult to troubleshoot any issues that arise.
Once we convert our text into vectors, we need a mathematical way of deciding whether pieces of text are "similar" or not. Cosine similarity is a way to measure how similar two things are. It looks at the angle between two vectors and gives a score based on how close they are in direction. If the vectors point in exactly the same direction, the cosine similarity is 1. If they're perpendicular (90 degrees apart), it's 0. And if they point in opposite directions, it's -1. The size of the vectors doesn't matter, only their orientation does.
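As a quick illustration, here is a minimal NumPy sketch of cosine similarity (the vectors below are made up): it is the dot product of two vectors divided by the product of their norms.
def cos_sim(a, b):
    """Cosine similarity: dot product normalized by the vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a = np.array([1.0, 2.0, 3.0])
cos_sim(a, 2 * a)                                    # 1.0: same direction
cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 0.0: perpendicular
cos_sim(a, -a)                                       # -1.0: opposite direction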
GPT can generate text vectors to perform tasks such as the following:
Semantic Search
Semantic search refers to a type of search that understands the meaning of the query and the context of the content, rather than just matching keywords. It aims to deliver more accurate and relevant results by considering the intent behind the search terms and the relationships between different pieces of information.
Clustering
Typically involves grouping together portions of text that share similar themes, topics, or characteristics.
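For instance, a minimal clustering sketch (assuming scikit-learn; the stand-in random array and the choice of 3 clusters are arbitrary) could group text embeddings with k-means:
from sklearn.cluster import KMeans
# Stand-in embeddings; in practice use the (n_texts, n_dims) array produced by an embedding model
embeddings_demo = np.random.rand(10, 768)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings_demo)  # cluster id per text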
The focus of this notebook is semantic search. The figure below shows a flowchart for applying semantic search; most of this pipeline can be replaced with OpenAI embeddings.
The next figure shows an example of abstractive question answering with GPT:
PERSON = 'Mehdi Rezvandehy'
# Google the name. This may not be the best way to run a Google search
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text,
                            'html.parser').get_text()[:1024]
nlp = pipeline('question-answering',
               model='deepset/roberta-base-squad2',  # RoBERTa flavour of BERT fine-tuned on SQuAD2 for extractive question answering
               tokenizer='deepset/roberta-base-squad2',
               max_length=15)
nlp(question=f'Who is {PERSON}?', context=google_html)
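The pipeline returns a dictionary containing the extracted answer span and a confidence score. A small sketch with a made-up context shows the output shape (the context string here is hypothetical):
result = nlp(question='Who wrote the book?',
             context='The book Wild Animals I Have Known was written by Ernest Thompson Seton.')
result['answer'], result['score']  # extracted span and the model's confidence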
The easiest approach to implement sentence similarity is through the sentence-transformers library, which wraps most of this process into a few lines of code. First, we install sentence-transformers using pip install sentence-transformers. This library uses HuggingFace's transformers behind the scenes, so we can find sentence-transformers models on the HuggingFace Hub. We use the bert-base-nli-mean-tokens model. Let's create some sentences, initialize our model, and encode the sentences:
from sentence_transformers import SentenceTransformer
model_bert = SentenceTransformer('bert-base-nli-mean-tokens')
corpus = ["Apple and orange are completely different from each other",
"Ocean temperature is rising rapidly",
"AI has taken the world by storm",
"Global warming is happening",
"The weather is not good to play golf today",
"Never compare an apple to an orange",
"People say I am a bookworm, in fact, I do not want to waste my time on TV",
"AI has transformed the way the world works",
"It is rainy today so we should postpone our golf game",
"I love reading books than watching TV"]
sentence_embeddings = model_bert.encode(corpus)
sentence_embeddings.shape
# Pairwise cosine similarity between all sentence embeddings
sim_matrix = cosine_similarity(sentence_embeddings)
df = pd.DataFrame(sim_matrix, index=corpus, columns=corpus)
df
def matrix_occure_prob(df,title,fontsize=11,vmin=-0.1, vmax=0.8,lable1='Sentence 1',pad=55,
lable2='Sentence 2',label='Cosine Similarity',rotation_x=90,axt=None,
num_ind=False,txtfont=6,lbl_font=9,shrink=0.8,cbar_per=False,
xline=False):
    """Plot correlation matrix"""
    import matplotlib.pyplot as plt
ax = axt or plt.axes()
colmn1=list(df.columns)
colmn2=list(df.index)
corr=np.zeros((len(colmn2),len(colmn1)))
for l in range(len(colmn1)):
for l1 in range(len(colmn2)):
cc=df[colmn1[l]][df.index==colmn2[l1]].values[0]
try:
if len(cc)>1:
corr[l1,l]=cc[0]
except TypeError:
corr[l1,l]=cc
if num_ind:
ax.text(l, l1, str(round(cc,2)), va='center', ha='center',fontsize=txtfont)
im =ax.matshow(corr, cmap='jet', interpolation='nearest',vmin=vmin, vmax=vmax)
cbar =plt.colorbar(im,shrink=shrink,label=label)
if (cbar_per):
cbar.ax.set_yticklabels(['{:.0f}%'.format(x) for x in np.arange( 0,110,10)])
ax.set_xticks(np.arange(len(colmn1)))
ax.set_xticklabels(colmn1,fontsize=lbl_font)
ax.set_yticks(np.arange(len(colmn2)))
ax.set_yticklabels(colmn2,fontsize=lbl_font)
# Set ticks on both sides of axes on
ax.tick_params(axis="x", bottom=True, top=False, labelbottom=True, labeltop=False)
# Rotate and align bottom ticklabels
plt.setp([tick.label1 for tick in ax.xaxis.get_major_ticks()], rotation=rotation_x,
ha="right", va="center", rotation_mode="anchor")
# Rotate and align bottom ticklabels
plt.setp([tick.label1 for tick in ax.yaxis.get_major_ticks()], rotation=rotation_x,
ha="right", va="center", rotation_mode="anchor")
if xline:
x_labels = list(ax.get_xticklabels())
x_label_dict = dict([(x.get_text(), x.get_position()[0]) for x in x_labels])
for ix in xline:
plt.axvline(x=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')
plt.axhline(y=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')
plt.xlabel(lable1)
plt.ylabel(lable2)
ax.grid(color='k', linestyle='-', linewidth=0.05)
plt.title(f'{title}',fontsize=fontsize, pad=pad)
plt.show()
font = {'size' : 16}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(15, 10), dpi= 110, facecolor='w', edgecolor='k')
matrix_occure_prob(df,title='Cosine Similarity Matrix by BERT Pre-trained Model ',lable1='',vmin=0, pad=10,axt=ax,
vmax=0.8,cbar_per=False,lable2='',num_ind=True,txtfont=15,xline=False,fontsize=22,
lbl_font=14,label='BERT Similarity',rotation_x=30)
# "OPENAI_API_KEY": notice, this is not a open soourse model like downloading from Huggingface
# we should have API_KEY that OpenAI send yo us that is persona API key.
openai.api_key = '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'
## Here is the list of engines that OpenAI offers
#openai.Engine.list().data
# Look at the models that have either 'embed' or 'search' in their id
[e for e in openai.Engine.list().data if 'embed' in e.id or 'search' in e.id][:5]
# We use embedding version of OpenAI model
ENGINE = 'text-embedding-ada-002'
To get an OpenAI embedding, instead of downloading a model and running it locally, a single line of code suffices, as below. This can take time if we have hundreds of documents.
# Embedding each character of the first sentence separately (iterating a string yields characters, not tokens)
doc_embeddings = [get_embedding(doc_, engine=ENGINE) for doc_ in corpus[0]]
np.array(doc_embeddings).shape
# Sentence level embedding
que_embedding = np.array(get_embedding(corpus[0], engine=ENGINE))
que_embedding.shape
que_embedding
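Rather than calling get_embedding once per text, the legacy OpenAI endpoint also accepts a list of inputs in a single request; a sketch assuming the pre-1.0 openai library used throughout this notebook:
# One API call for the whole corpus (legacy openai<1.0 interface)
response = openai.Embedding.create(input=corpus, engine=ENGINE)
corpus_embeddings = np.array([item['embedding'] for item in response['data']])
corpus_embeddings.shape  # (len(corpus), 1536) for text-embedding-ada-002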
# Embed each sentence once (one API call per sentence), then compare all pairs;
# the embeddings should not be recomputed for every pair
openai_embeddings = np.array([get_embedding(sent, engine=ENGINE) for sent in corpus])
sim_matrix = cosine_similarity(openai_embeddings)
df = pd.DataFrame(sim_matrix, index=corpus, columns=corpus)
df
font = {'size' : 16}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(15, 10), dpi= 110, facecolor='w', edgecolor='k')
matrix_occure_prob(df,title=f'Cosine Similarity Matrix with OpenAI Closed-Source \n "{ENGINE}"',lable1='',vmin=0.5, pad=10,axt=ax,
vmax=1,cbar_per=False,lable2='',num_ind=True,txtfont=15,xline=False,fontsize=22,
lbl_font=14,label='OpenAI Similarity',rotation_x=30)
# Textbook about animals: Wild Animals I Have Known
text = urlopen("""https://www.gutenberg.org/cache/epub/3031/pg3031.txt""").read().decode() # open the URL of the document and read it
# Split the book into paragraphs on blank lines ('\r\n\r\n') and keep only documents of at least 90 characters
doc = list(filter(lambda x: len(x) > 90, text.split('\r\n\r\n')))
doc = np.array(doc)
print(f'There are {len(doc)} paragraphs')
doc[0]
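Splitting on blank lines is the simplest chunking strategy. An alternative sketch (the window and overlap sizes below are arbitrary choices) uses overlapping fixed-size windows so that an answer spanning a paragraph boundary is not cut in half:
def chunk_text(raw_text, window=500, overlap=100):
    """Split raw_text into overlapping character windows (sizes are illustrative)."""
    step = window - overlap
    return [raw_text[i:i + window] for i in range(0, len(raw_text), step)]
print(f'{len(chunk_text(text))} overlapping chunks')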
model_bert = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model_bert.encode(doc)
sentence_embeddings.shape
QUERY = """What is snare?""" # a natural language query
sentence_embeddings_QUERY = model_bert.encode(QUERY)
sim_all = []
for i1 in range(len(doc)):
tmp=cosine_similarity([sentence_embeddings_QUERY],[sentence_embeddings[i1]])
sim_all.append(tmp[0][0])
# Rank documents by descending cosine similarity (rank 1 = most similar)
sim_all_rank = len(sim_all) - ss.rankdata(sim_all) + 1
rank_1 = np.where(sim_all_rank == 1)[0][0]
rank_2 = np.where(sim_all_rank == 2)[0][0]
rank_3 = np.where(sim_all_rank == 3)[0][0]
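An equivalent and simpler way to get the top-ranked indices is np.argsort over the similarity list computed above:
# Indices of the three most similar documents, highest similarity first
top_3 = np.argsort(sim_all)[::-1][:3]
top_3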
# Encode the QUERY using the bi-encoder and convert to a tensor and find relevant documents
que_embedding = model_bert.encode(QUERY, convert_to_tensor=True)
# Give the number of documents to retrieve with the bi-encoder:
# top_k returns the documents most similar to the query by descending cosine similarity
cos_sim_score = util.semantic_search(que_embedding, sentence_embeddings, top_k=3)[0]
cos_sim_score
print(f'QUERY: {QUERY}\n')
for no, ir in enumerate(cos_sim_score):
print(f'Document {no + 1}: Cosine Similarity (sentence embedding) is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
print('\n')
As can be seen, the bert-base-nli-mean-tokens embedding is not effective at finding a proper answer to the asked question. Hence, OpenAI embeddings are applied (see the next section).
# We use embedding version of OpenAI model
ENGINE = 'text-embedding-ada-002'
QUERY = """What is snare?""" # a natural language query
The query (question) is encoded using OpenAI.
que_embedding = np.array(get_embedding(QUERY, engine=ENGINE))
The chunked document (the textbook) is encoded using OpenAI.
# This could take time if we have hundreds or thousands of documents
doc_embeddings = [get_embedding(document, engine=ENGINE) for document in doc]
# Transform list of lists to numpy
doc_embeddings = np.array(doc_embeddings)
doc_embeddings.shape
We can now use Sentence Transformers' semantic search utility:
cos_sim_score = util.semantic_search(que_embedding, doc_embeddings, top_k=3)[0]
cos_sim_score
que_embedding.shape
doc_embeddings.shape
print(f'QUERY: {QUERY}\n')
for no, ir in enumerate(cos_sim_score):
print(f'Document {no + 1}: Cosine Similarity (sentence embedding) is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
print('\n')
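Under the hood, util.semantic_search ranks documents by cosine similarity; an equivalent NumPy sketch over the arrays computed above:
# Normalize, then dot products give cosine similarities
q_norm = que_embedding / np.linalg.norm(que_embedding)
d_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
scores = d_norm @ q_norm
top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar paragraphs
[(int(i), float(scores[i])) for i in top_k]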
From the top document, we can answer the question:
nlp(question=QUERY, context=str(doc[cos_sim_score[0]['corpus_id']]))
OpenAI's models are not open source, so we cannot easily fine-tune them ourselves, although OpenAI does offer a fine-tuning platform. Instead, we have to use their API and know how to use it.
OpenAI has developed a variety of models; there are three types of models that OpenAI has worked on:
Text Completion
Models like GPT-3 fall into this category. These models are trained to generate human-like text based on a given prompt or context. They can be used for various natural language processing tasks, such as language translation, question answering, text summarization, and more.
Chat Completion
This aligns closely with text completion but with a specific emphasis on conversational interactions. Models like ChatGPT are designed to generate responses in a chat-like format. They can be used for creating chatbots, virtual assistants, and other conversational applications.
Image Generation
OpenAI has also explored models for working with images. One notable example is DALL-E, which is capable of generating images from textual descriptions. Another example is CLIP, which can understand and generate textual descriptions for images. These models demonstrate the capability of AI to work with visual data.
We can engineer a prompt to answer a question based on a given context:
context = doc[cos_sim_score[0]['corpus_id']]
PROMPT = f"Answer the question by having the context below: \n\nContext: {context}\nQuery: {QUERY}\nAnswer:"
print(PROMPT)
Call the OpenAI API to extract the answer from the engineered prompt.
ENGINE_ = 'gpt-3.5-turbo-instruct'
answer = openai.Completion.create(
model=ENGINE_,
prompt=PROMPT,
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
answer
# Get the completion
ans_comp = answer['choices'][0]["text"]
ans_comp
Now we engineer a prompt that answers the question in a fun way for a first grader:
PROMPT_FUN = f"Answer the question in a fun way for a first grader given the context below.\n\nContext: {context}\nQuery: {QUERY}\nAnswer:"
print(PROMPT_FUN)
ENGINE_ = 'gpt-3.5-turbo-instruct'
answer = openai.Completion.create(
model=ENGINE_,
prompt=PROMPT_FUN,
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
# Get the completion
ans_comp_fun = answer['choices'][0]["text"]
ans_comp_fun
Call the OpenAI chat API to extract the answer from the engineered prompt.
ENGINE_ = 'gpt-3.5-turbo'
answer = openai.ChatCompletion.create(
model=ENGINE_,
messages=[
{"role": "user", "content": PROMPT}
],
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
# Get the completion
ans_chat = answer['choices'][0]["message"]["content"]
ans_chat
Now we engineer the same fun first-grader prompt for the chat model:
PROMPT_FUN = f"Answer the question in a fun way for a first grader given the context below.\n\nContext: {context}\nQuery: {QUERY}\nAnswer:"
print(PROMPT_FUN)
ENGINE_ = 'gpt-3.5-turbo'
answer = openai.ChatCompletion.create(
model=ENGINE_,
messages=[
{"role": "user", "content": PROMPT_FUN}
],
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
# Get the completion
answer['choices'][0]["message"]["content"]
req_body = {"prompt":f"make an image of 'Snare' based on discription below:\n{ans_comp}",
"n":1,
"size":"512x512",
"top_p":1}
print ('Prompt is:\n')
print (req_body["prompt"])
# call OpenAI API
response = openai.Image.create( # create an image from the prompt
prompt=req_body["prompt"],
n=req_body["n"],
size=req_body["size"]
)
# The response is JSON; take the URL of the first generated image
output_img = response["data"][0]["url"]
Image(url= output_img)
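The returned URL is temporary, so to keep the image we can download it and save it locally; a sketch using requests (the filename is arbitrary):
# Download the generated image and save it to disk
img_bytes = requests.get(output_img).content
with open('snare.png', 'wb') as f:
    f.write(img_bytes)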