Summary
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms. Semantic search with OpenAI uses embeddings: numerical representations of text that capture the semantic meaning of the input, enabling a more nuanced understanding of language. When a user submits a search query, the model converts it into an embedding, which is then compared to the embeddings of documents in a search index. The search results are ranked by similarity score, providing more accurate and context-aware retrieval of information than traditional keyword-based search. This approach leverages advanced language models to enhance search precision by focusing on the underlying meaning of the text. In this notebook, we explore the process of querying a question within a book and identifying the most relevant answers to that question.
Python functions and data files needed to run this notebook are available via this link.
import warnings
warnings.filterwarnings('ignore')
import openai
from openai.embeddings_utils import get_embedding
from urllib.request import urlopen
import numpy as np
from sentence_transformers import util
from transformers import pipeline
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.core.display import HTML
from sklearn.metrics.pairwise import cosine_similarity
import requests
from bs4 import BeautifulSoup
The choice of the text embedder is critical, as it determines the quality of the vector representation of the text. We have many options for vectorizing with LLMs, both open and closed source. To get off the ground quicker, we are going to use OpenAI's closed-source "Embeddings" product, which means we have limited control over its implementation and potential biases. It's important to keep in mind that when using closed-source products, we may not have access to the underlying algorithms, which can make it difficult to troubleshoot any issues that arise.
Once we convert our text into vectors, we need a mathematical way of deciding whether pieces of text are "similar" or not. Cosine similarity is a way to measure how similar two things are. It looks at the angle between two vectors and gives a score based on how close they are in direction. If the vectors point in exactly the same direction, the cosine similarity is 1. If they're perpendicular (90 degrees apart), it's 0. And if they point in opposite directions, it's -1. The size of the vectors doesn't matter, only their orientation does.
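As a quick illustration, here is a minimal NumPy sketch of cosine similarity (the vectors below are made up): it is the dot product of two vectors divided by the product of their norms.
def cos_sim(a, b):
    """Cosine similarity: dot product normalized by the vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a = np.array([1.0, 2.0, 3.0])
cos_sim(a, 2 * a)                                    # 1.0: same direction
cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 0.0: perpendicular
cos_sim(a, -a)                                       # -1.0: opposite direction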
GPT can generate text vectors to perform tasks such as the following:
Semantic Search
Semantic search refers to a type of search that understands the meaning of the query and the context of the content, rather than just matching keywords. It aims to deliver more accurate and relevant results by considering the intent behind the search terms and the relationships between different pieces of information.
Clustering
Typically involves grouping together portions of text that share similar themes, topics, or characteristics.
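For instance, a minimal clustering sketch (assuming scikit-learn; the stand-in random array and the choice of 3 clusters are arbitrary) could group text embeddings with k-means:
from sklearn.cluster import KMeans
# Stand-in embeddings; in practice use the (n_texts, n_dims) array produced by an embedding model
embeddings_demo = np.random.rand(10, 768)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings_demo)  # cluster id per text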
The focus of this notebook is semantic search. The figure below shows a flowchart for applying semantic search; most of this pipeline can be replaced with OpenAI embeddings.
The next figure shows an example of abstractive question answering with GPT:
PERSON = 'Mehdi Rezvandehy'
# Google the name. This may not be the best way to run a Google search
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text,
                            'html.parser').get_text()[:1024]
nlp = pipeline('question-answering',
               model='deepset/roberta-base-squad2',  # RoBERTa flavour of BERT fine-tuned on SQuAD2 for extractive question answering
               tokenizer='deepset/roberta-base-squad2',
               max_length=15)
nlp(question=f'Who is {PERSON}?', context=google_html)
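The pipeline returns a dictionary containing the extracted answer span and a confidence score. A small sketch with a made-up context shows the output shape (the context string here is hypothetical):
result = nlp(question='Who wrote the book?',
             context='The book Wild Animals I Have Known was written by Ernest Thompson Seton.')
result['answer'], result['score']  # extracted span and the model's confidence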
The easiest approach to implement sentence similarity is through the sentence-transformers library, which wraps most of this process into a few lines of code. First, we install sentence-transformers using pip install sentence-transformers. This library uses HuggingFace's transformers behind the scenes, so we can find sentence-transformers models on the HuggingFace Hub. We use the bert-base-nli-mean-tokens model. Let's create some sentences, initialize our model, and encode the sentences:
from sentence_transformers import SentenceTransformer
model_bert = SentenceTransformer('bert-base-nli-mean-tokens')
corpus = ["Apple and orange are completely different from each other",
"Ocean temperature is rising rapidly",
"AI has taken the world by storm",
"Global warming is happening",
"The weather is not good to play golf today",
"Never compare an apple to an orange",
"People say I am a bookworm, in fact, I do not want to waste my time on TV",
"AI has transformed the way the world works",
"It is rainy today so we should postpone our golf game",
"I love reading books than watching TV"]
sentence_embeddings = model_bert.encode(corpus)
sentence_embeddings.shape
# Pairwise cosine similarity between all sentence embeddings
sim_matrix = cosine_similarity(sentence_embeddings)
df = pd.DataFrame(sim_matrix, index=corpus, columns=corpus)
df
def matrix_occure_prob(df,title,fontsize=11,vmin=-0.1, vmax=0.8,lable1='Sentence 1',pad=55,
lable2='Sentence 2',label='Cosine Similarity',rotation_x=90,axt=None,
num_ind=False,txtfont=6,lbl_font=9,shrink=0.8,cbar_per=False,
xline=False):
    """Plot correlation matrix"""
    import matplotlib.pyplot as plt
ax = axt or plt.axes()
colmn1=list(df.columns)
colmn2=list(df.index)
corr=np.zeros((len(colmn2),len(colmn1)))
for l in range(len(colmn1)):
for l1 in range(len(colmn2)):
cc=df[colmn1[l]][df.index==colmn2[l1]].values[0]
try:
if len(cc)>1:
corr[l1,l]=cc[0]
except TypeError:
corr[l1,l]=cc
if num_ind:
ax.text(l, l1, str(round(cc,2)), va='center', ha='center',fontsize=txtfont)
im =ax.matshow(corr, cmap='jet', interpolation='nearest',vmin=vmin, vmax=vmax)
cbar =plt.colorbar(im,shrink=shrink,label=label)
if (cbar_per):
cbar.ax.set_yticklabels(['{:.0f}%'.format(x) for x in np.arange( 0,110,10)])
ax.set_xticks(np.arange(len(colmn1)))
ax.set_xticklabels(colmn1,fontsize=lbl_font)
ax.set_yticks(np.arange(len(colmn2)))
ax.set_yticklabels(colmn2,fontsize=lbl_font)
# Set ticks on both sides of axes on
ax.tick_params(axis="x", bottom=True, top=False, labelbottom=True, labeltop=False)
# Rotate and align bottom ticklabels
plt.setp([tick.label1 for tick in ax.xaxis.get_major_ticks()], rotation=rotation_x,
ha="right", va="center", rotation_mode="anchor")
# Rotate and align bottom ticklabels
plt.setp([tick.label1 for tick in ax.yaxis.get_major_ticks()], rotation=rotation_x,
ha="right", va="center", rotation_mode="anchor")
if xline:
x_labels = list(ax.get_xticklabels())
x_label_dict = dict([(x.get_text(), x.get_position()[0]) for x in x_labels])
for ix in xline:
plt.axvline(x=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')
plt.axhline(y=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')
plt.xlabel(lable1)
plt.ylabel(lable2)
ax.grid(color='k', linestyle='-', linewidth=0.05)
plt.title(f'{title}',fontsize=fontsize, pad=pad)
plt.show()
font = {'size' : 16}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(15, 10), dpi= 110, facecolor='w', edgecolor='k')
matrix_occure_prob(df,title='Cosine Similarity Matrix by BERT Pre-trained Model ',lable1='',vmin=0, pad=10,axt=ax,
vmax=0.8,cbar_per=False,lable2='',num_ind=True,txtfont=15,xline=False,fontsize=22,
lbl_font=14,label='BERT Similarity',rotation_x=30)
# "OPENAI_API_KEY": notice, this is not a open soourse model like downloading from Huggingface
# we should have API_KEY that OpenAI send yo us that is persona API key.
openai.api_key = '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'
## Here is the list of engines that OpenAI offers
#openai.Engine.list().data
# Look at the models that have either 'embed' or 'search' in their id
[e for e in openai.Engine.list().data if 'embed' in e.id or 'search' in e.id][:5]
# We use embedding version of OpenAI model
ENGINE = 'text-embedding-ada-002'
To get an OpenAI embedding, instead of downloading a model and running it locally, a single line of code suffices, as below. This can take time if we have hundreds of documents.
# Embedding each character of the first sentence separately (iterating a string yields characters, not tokens)
doc_embeddings = [get_embedding(doc_, engine=ENGINE) for doc_ in corpus[0]]
np.array(doc_embeddings).shape
# Sentence level embedding
que_embedding = np.array(get_embedding(corpus[0], engine=ENGINE))
que_embedding.shape
que_embedding
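Rather than calling get_embedding once per text, the legacy OpenAI endpoint also accepts a list of inputs in a single request; a sketch assuming the pre-1.0 openai library used throughout this notebook:
# One API call for the whole corpus (legacy openai<1.0 interface)
response = openai.Embedding.create(input=corpus, engine=ENGINE)
corpus_embeddings = np.array([item['embedding'] for item in response['data']])
corpus_embeddings.shape  # (len(corpus), 1536) for text-embedding-ada-002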
# Embed each sentence once (one API call per sentence), then compare all pairs;
# the embeddings should not be recomputed for every pair
openai_embeddings = np.array([get_embedding(sent, engine=ENGINE) for sent in corpus])
sim_matrix = cosine_similarity(openai_embeddings)
df = pd.DataFrame(sim_matrix, index=corpus, columns=corpus)
df
font = {'size' : 16}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(15, 10), dpi= 110, facecolor='w', edgecolor='k')
matrix_occure_prob(df,title=f'Cosine Similarity Matrix with OpenAI Closed-Source \n "{ENGINE}"',lable1='',vmin=0.5, pad=10,axt=ax,
vmax=1,cbar_per=False,lable2='',num_ind=True,txtfont=15,xline=False,fontsize=22,
lbl_font=14,label='OpenAI Similarity',rotation_x=30)
# Textbook about animals: Wild Animals I Have Known
text = urlopen("""https://www.gutenberg.org/cache/epub/3031/pg3031.txt""").read().decode() # open the URL of the document and read it
# Split the book into paragraphs on blank lines ('\r\n\r\n') and keep only documents of at least 90 characters
doc = list(filter(lambda x: len(x) > 90, text.split('\r\n\r\n')))
doc = np.array(doc)
print(f'There are {len(doc)} paragraphs')
doc[0]
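Splitting on blank lines is the simplest chunking strategy. An alternative sketch (the window and overlap sizes below are arbitrary choices) uses overlapping fixed-size windows so that an answer spanning a paragraph boundary is not cut in half:
def chunk_text(raw_text, window=500, overlap=100):
    """Split raw_text into overlapping character windows (sizes are illustrative)."""
    step = window - overlap
    return [raw_text[i:i + window] for i in range(0, len(raw_text), step)]
print(f'{len(chunk_text(text))} overlapping chunks')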
model_bert = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model_bert.encode(doc)
sentence_embeddings.shape
QUERY = """What is snare?""" # a natural language query
sentence_embeddings_QUERY = model_bert.encode(QUERY)
sim_all = []
for i1 in range(len(doc)):
tmp=cosine_similarity([sentence_embeddings_QUERY],[sentence_embeddings[i1]])
sim_all.append(tmp[0][0])
# Rank documents by descending cosine similarity (rank 1 = most similar)
sim_all_rank = len(sim_all) - ss.rankdata(sim_all) + 1
rank_1 = np.where(sim_all_rank == 1)[0][0]
rank_2 = np.where(sim_all_rank == 2)[0][0]
rank_3 = np.where(sim_all_rank == 3)[0][0]
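An equivalent and simpler way to get the top-ranked indices is np.argsort over the similarity list computed above:
# Indices of the three most similar documents, highest similarity first
top_3 = np.argsort(sim_all)[::-1][:3]
top_3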
# Encode the QUERY using the bi-encoder and convert to a tensor and find relevant documents
que_embedding = model_bert.encode(QUERY, convert_to_tensor=True)
# Give the number of documents to retrieve with the bi-encoder:
# top_k returns the documents most similar to the query by descending cosine similarity
cos_sim_score = util.semantic_search(que_embedding, sentence_embeddings, top_k=3)[0]
cos_sim_score
print(f'QUERY: {QUERY}\n')
for no, ir in enumerate(cos_sim_score):
print(f'Document {no + 1}: Cosine Similarity (sentence embedding) is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
print('\n')
As can be seen, the bert-base-nli-mean-tokens embedding is not effective at finding a proper answer to the asked question. Hence, OpenAI embeddings are applied (see the next section).
# We use embedding version of OpenAI model
ENGINE = 'text-embedding-ada-002'
QUERY = """What is snare?""" # a natural language query
The query (question) is encoded using OpenAI.
que_embedding = np.array(get_embedding(QUERY, engine=ENGINE))
The chunked document (the textbook) is encoded using OpenAI.
# This could take time if we have hundreds or thousands of documents
doc_embeddings = [get_embedding(document, engine=ENGINE) for document in doc]
# Transform list of lists to numpy
doc_embeddings = np.array(doc_embeddings)
doc_embeddings.shape
We can now use Sentence Transformers' semantic search utility:
cos_sim_score = util.semantic_search(que_embedding, doc_embeddings, top_k=3)[0]
cos_sim_score
que_embedding.shape
doc_embeddings.shape
print(f'QUERY: {QUERY}\n')
for no, ir in enumerate(cos_sim_score):
print(f'Document {no + 1}: Cosine Similarity (sentence embedding) is {ir["score"]:.3f}:\n\n{doc[ir["corpus_id"]]}')
print('\n')
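Under the hood, util.semantic_search ranks documents by cosine similarity; an equivalent NumPy sketch over the arrays computed above:
# Normalize, then dot products give cosine similarities
q_norm = que_embedding / np.linalg.norm(que_embedding)
d_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
scores = d_norm @ q_norm
top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar paragraphs
[(int(i), float(scores[i])) for i in top_k]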
From the top document, we can answer the question:
nlp(question=QUERY, context=str(doc[cos_sim_score[0]['corpus_id']]))
OpenAI's models are not open source, so we cannot easily fine-tune them ourselves, although OpenAI does offer a fine-tuning platform. Instead, we have to use their API and know how to use it.
OpenAI has developed a variety of models; there are three types of models that OpenAI has worked on:
Text Completion
Models like GPT-3 fall into this category. These models are trained to generate human-like text based on a given prompt or context. They can be used for various natural language processing tasks, such as language translation, question answering, text summarization, and more.
Chat Completion
This aligns closely with text completion but with a specific emphasis on conversational interactions. Models like ChatGPT are designed to generate responses in a chat-like format. They can be used for creating chatbots, virtual assistants, and other conversational applications.
Image Generation
OpenAI has also explored models for working with images. One notable example is DALL-E, which is capable of generating images from textual descriptions. Another example is CLIP, which can understand and generate textual descriptions for images. These models demonstrate the capability of AI to work with visual data.
We can engineer a prompt to answer a question based on a given context:
context = doc[cos_sim_score[0]['corpus_id']]
PROMPT = f"Answer the question by having the context below: \n\nContext: {context}\nQuery: {QUERY}\nAnswer:"
print(PROMPT)
Call the OpenAI API to extract the answer from the engineered prompt.
ENGINE_ = 'gpt-3.5-turbo-instruct'
answer = openai.Completion.create(
model=ENGINE_,
prompt=PROMPT,
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
answer
# Get the completion
ans_comp = answer['choices'][0]["text"]
ans_comp
Now we engineer a prompt that answers the question in a fun way for a first grader:
PROMPT_FUN = f"Answer the question in a fun way for a first grader given the context below.\n\nContext: {context}\nQuery: {QUERY}\nAnswer:"
print(PROMPT_FUN)
ENGINE_ = 'gpt-3.5-turbo-instruct'
answer = openai.Completion.create(
model=ENGINE_,
prompt=PROMPT_FUN,
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
# Get the completion
ans_comp_fun = answer['choices'][0]["text"]
ans_comp_fun
Call the OpenAI chat API to extract the answer from the engineered prompt.
ENGINE_ = 'gpt-3.5-turbo'
answer = openai.ChatCompletion.create(
model=ENGINE_,
messages=[
{"role": "user", "content": PROMPT}
],
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
# Get the completion
ans_chat = answer['choices'][0]["message"]["content"]
ans_chat
Now we engineer the same fun first-grader prompt for the chat model:
PROMPT_FUN = f"Answer the question in a fun way for a first grader given the context below.\n\nContext: {context}\nQuery: {QUERY}\nAnswer:"
print(PROMPT_FUN)
ENGINE_ = 'gpt-3.5-turbo'
answer = openai.ChatCompletion.create(
model=ENGINE_,
messages=[
{"role": "user", "content": PROMPT_FUN}
],
temperature=1,
max_tokens=50,
top_p=0,
frequency_penalty=0,
presence_penalty=0
)
# Get the completion
answer['choices'][0]["message"]["content"]
req_body = {"prompt":f"make an image of 'Snare' based on discription below:\n{ans_comp}",
"n":1,
"size":"512x512",
"top_p":1}
print ('Prompt is:\n')
print (req_body["prompt"])
# call OpenAI API
response = openai.Image.create( # create an image from the prompt
prompt=req_body["prompt"],
n=req_body["n"],
size=req_body["size"]
)
# The response is JSON; take the URL of the first generated image
output_img = response["data"][0]["url"]
Image(url= output_img)
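The returned URL is temporary, so to keep the image we can download it and save it locally; a sketch using requests (the filename is arbitrary):
# Download the generated image and save it to disk
img_bytes = requests.get(output_img).content
with open('snare.png', 'wb') as f:
    f.write(img_bytes)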