Introduction to BERT Large Language Model
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a groundbreaking natural language processing (NLP) model that has significantly advanced the field of language understanding and machine learning. Developed by Google in 2018, BERT is a transformer-based neural network architecture designed to comprehend context and relationships within sentences by considering the bidirectional flow of information. This notebook will elucidate the fundamentals of Large Language Models (LLMs), demonstrating the implementation of the BERT model using the transformer architecture.
The history of neural language models started in 2001, when the first neural language model used a feed-forward neural network to predict the next word in a sentence. In 2013, Word2vec addressed the problem of creating word embeddings using two methods: the continuous bag-of-words algorithm and the skip-gram network. However, these word embeddings inherited bias from the large-scale corpora used to train the models. In 2013 and 2014, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) rose to prominence for natural language processing tasks, leading to sequence-to-sequence modeling: an encoder reads a variable-length input sequence and a decoder generates a variable-length output sequence.
Figure: sequence-to-sequence (top left), sequence-to-vector (top right), vector-to-sequence (bottom left), and encoder-decoder (bottom right) networks.
Image retrieved from Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition
In 2015, attention was introduced, allowing the decoder in sequence-to-sequence models to access the encoder's intermediate hidden states. In 2017, the transformer was introduced in the paper "Attention Is All You Need", replacing RNNs as the primary mechanism for language modeling and leading to pre-trained language models such as Google's BERT, OpenAI's GPT, and T5. These pre-trained language models (together with earlier contextual embedding models such as ELMo) showed improvements in nearly all areas of NLP compared to previous state-of-the-art results.
The transformer model is primarily based on the idea of **attention**. Attention mechanisms entered NLP around 2015 and have since become the dominant way to model language. Attention in NLP is a mechanism designed to focus on specific parts of a sequence in the context of another sequence. It is used in various NLP tasks, such as language modeling, sequence classification, language translation, and image captioning. The idea behind attention is to give the model the ability to attend to relevant information while performing a prediction task. The attention mechanism allows the model to understand relationships between different elements in a sequence and make predictions based on that information. For example, in language translation, the attention mechanism helps the model understand which parts of a source sentence are relevant for translating into a target language. There are various types of attention mechanisms, including self-attention, which is the kind of attention that powers the transformer architecture in NLP.
One of the most important forms of attention is **self-attention**, which relates each word in a sequence to every other word in the same sequence so that the model learns their relationships. This form of attention is the basis of the **transformer** architecture. One example of its use is language translation, where attention allows the model to focus on specific parts of the input sequence and understand relationships between words. The BERT model uses self-attention and can capture grammar rules and relationships between words, as demonstrated by its attention scores.
In simpler terms, self-attention helps us create similar connections but within the same sentence. Look at the following example:
“I poured water from the bottle into the cup until it was full.”
it => cup
“I poured water from the bottle into the cup until it was empty.”
it => bottle
By changing one word ("full" -> "empty"), the object that "it" refers to changes. If we are translating such a sentence, we need to know what the word "it" refers to.
The **Transformer** is a state-of-the-art natural language processing model based on an encoder-decoder architecture. It aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. It relies entirely on self-attention to compute representations of its input and output WITHOUT using sequence-aligned RNNs or convolution. In other words, it uses self-attention to model the relationships between tokens in a sentence instead of the recurrence used in RNNs and LSTMs. The Transformer consists of an encoder stack and a decoder stack: the input sentence is first converted into input embeddings and fed into the encoder, and the final representation of the input sequence is then fed into the decoder, which uses cross-attention (combining information from the encoder and decoder) to generate an output.
The Encoder stack and the Decoder stack each have their corresponding Embedding layers for their respective inputs. Finally, there is an Output layer to generate the final output. Image is retrieved from https://towardsdatascience.com/
The Encoder contains the all-important Self-attention layer that computes the relationship between different words in the sequence, as well as a Feed-forward layer.
Image is retrieved from https://towardsdatascience.com/
The Transformer uses different types of attention mechanisms: (1) multi-head attention in the encoder, (2) masked multi-head attention in the decoder, and (3) cross-attention between the encoder and decoder. These attention mechanisms allow the Transformer to learn grammatical structures and rules simultaneously and to maintain a coherent train of thought ("Attention Is All You Need").
Attention allows a language model to distinguish between the following two sentences:
There’s a very important difference between these two almost identical sentences: in the first, “it” refers to the cup. In the second, “it” refers to the pitcher. Humans don’t have a problem understanding sentences like these, but it’s a difficult problem for computers. Attention allows Transformers to make the connection correctly because they understand connections between words that aren’t just local. It’s so important that the inventors originally wanted to call Transformers “Attention Net” until they were convinced that they needed a name that would attract more, well, attention.
There are three types of language models that can be trained to predict a missing word:
1. Auto-regressive
Auto-regressive language models predict a missing word in a sentence given either past tokens or future tokens, but not both. This includes forward and backward prediction. An example is a phone's sentence auto-completion. The **GPT** family falls into this category.
2. Auto-encoding.
Auto-encoding language models strive to gain a comprehensive understanding of entire sequences of tokens given all possible context (both past and future tokens). This is great for natural language understanding tasks like sequence classification and named entity recognition; **BERT** is an example. A short sketch contrasting these first two families follows this list.
See the figure below for the difference:
3. Combinations of autoregressive and autoencoding, like T5, which can use the encoder and decoder to be more versatile and flexible in generating text. It has been shown that these combination models can generate more diverse and creative text in different contexts compared to pure decoder-based autoregressive models due to their ability to capture additional context using the encoder.
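To make the distinction concrete, here is a minimal sketch (assuming the bert-base-uncased and gpt2 checkpoints can be downloaded) contrasting an auto-encoding fill-mask pipeline with an auto-regressive text-generation pipeline; the prompts are illustrative only:

from transformers import pipeline

# Auto-encoding (BERT-style): fill in a masked token using both left and right context
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# Auto-regressive (GPT-style): continue a prompt using only past tokens
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=5)[0]["generated_text"])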
The transformer architecture was introduced in 2017 in the paper "Attention is All You Need" and has since replaced the use of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in natural language processing tasks. The transformer uses various forms of attention, including multi-headed attention and masked multi-headed attention, to perform NLP tasks effectively. Recently, the transformer architecture has started to be used in computer vision tasks as well, with a type of transformer called the vision transformer being used to replace CNNs.
Three matrices: query, key, value are applied to calculate standard attention.
Assume X is our tokenized input. X has 2 tokens (number of rows), and each token has an embedding dimension of 4 (number of columns). In real life, the embedding dimension is much bigger than 4. In this example, the input has already gone through all of BERT's preprocessing steps.
We take the matrix X and perform three linear transformations with the weight matrices WQ, WK, and WV to project it into three new vector spaces: Q (query), K (key), and V (value). The output of each matrix multiplication is a matrix whose rows are still the tokens (2 of them) but whose number of columns is reduced from 4 to 3; the input is compressed into a smaller matrix.
Image from Sinan Ozdemir
Imagine our input sequence is "I like cats". The goal is to look at each token and find its relevance to the other tokens. The figure below shows the attention scores for "like", i.e. the relevance of each key to the query "like". Green is the key space and orange is the query space.
A higher value of QK^T denotes higher semantic similarity; "I" in key space is very far from "like" in query space.
Image from Sinan Ozdemir
Because the raw scores do not add up to 1, the next step is to normalize the attention scores for "like" with a softmax. Finally, the normalized scores are multiplied by the value vectors, which yields a context-aware embedding of "like". A score can be interpreted as a share of attention; for example, 42% of the attention for "like" is paid to a particular token.
Image from Sinan Ozdemir
Here is another example. On the left, QK^T is calculated for each token, divided by sqrt(300), and then passed through a softmax function. Each row adds up to 1 and shows how much attention that token pays to every other token, including itself. For example, "it's" spends most of its attention on "here" and some of its attention on "another". On the right, we multiply each row by the 300-dimensional value vectors. When this multiplication is done, we obtain a matrix in which each row is a token and each column is a dimension of context-aware meaning.
Image from Sinan Ozdemir
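To ground the query/key/value walk-through above, here is a minimal NumPy sketch of scaled dot-product attention for a toy input of 2 tokens with embedding dimension 4, projected down to dimension 3 as in the example; the weight values are random placeholders, not anything learned by BERT:

import numpy as np

rng = np.random.default_rng(0)

# Toy input: 2 tokens, each with a 4-dimensional embedding
X = rng.normal(size=(2, 4))

# Placeholder projection weights into 3-dimensional query/key/value spaces
W_Q, W_K, W_V = (rng.normal(size=(4, 3)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # project the input into Q, K, V spaces
scores = Q @ K.T / np.sqrt(K.shape[1])  # scaled dot-product similarity
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
context = weights @ V                   # context-aware token representations

print(weights.round(2))  # each row sums to 1
print(context.shape)     # (2, 3)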
Let's take a look below:
import warnings
warnings.filterwarnings('ignore')
# Import BERT model from transformer library which has a lot of pretrained models
# transformers is HuggingFace library
from transformers import BertModel
# BertModel also requires the PyTorch library
# Grab base BERT pretrained language model
model_Bert=BertModel.from_pretrained('bert-base-uncased')
See BERT Transformation for more information.
# The base BERT model has 12 encoders in its encoder stack
len(model_Bert.encoder.layer)
# Look at first encoder
model_Bert.encoder.layer[0]
As can be seen, each encoder has self-attention with matrices for the query, key, and value, plus a dropout layer. It then takes the output of attention and runs it through another dense layer with layer normalization and dropout. Finally, the output is the representation that is fed into the next encoder (or, in the original Transformer, the decoder) for further processing.
The attention block is the most important part to discuss because the main bulk of the work happens there:
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
Multi-Headed Attention is a key feature of the transformer architecture used in models like BERT. Each encoder has multi-headed attention. Instead of having a single attention mechanism, Multi-Headed Attention splits the attention mechanism into multiple smaller mechanisms that work side by side, each learning different aspects of the input representation. Each of these attention heads operates independently, and their results are concatenated to produce a final representation. The main aim of multi-headed self-attention is to compute attention(Q, K, V) multiple times with different sets of weights, with each head potentially capturing a different relationship; the heads do not communicate with each other during this computation.
Image is retrieved from https://towardsdatascience.com/
In the BERT base model, there are 12 encoder layers and each encoder has 12 attention heads, meaning there are 144 separate attention calculations happening simultaneously, with each head potentially learning a different aspect of language. Research has shown that certain heads in the BERT model focus on specific aspects of language, such as direct object relations or possessive pronouns.
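Assuming the model_Bert loaded above, we can confirm these numbers directly from the model's configuration as a quick sanity check:

# Sanity check: 12 encoder layers x 12 attention heads = 144 attention calculations
print(model_Bert.config.num_hidden_layers)     # 12
print(model_Bert.config.num_attention_heads)   # 12
print(model_Bert.config.num_hidden_layers * model_Bert.config.num_attention_heads)  # 144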
import pandas as pd
import matplotlib.pyplot as plt
import torch
# transformers is HuggingFace library
from transformers import BertTokenizer, BertModel
from bertviz import head_view # Attention visualization
# Load a vanilla BERT-base model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Almost before we knew it, we had left the ground. The unknown holds its grounds."
tokens = tokenizer.encode(text) # words (tokens) transformation into numbers
inputs = torch.tensor(tokens).unsqueeze(0) # unsqueeze changes the shape from (10,) -> (1, 10)
inputs
Pass the sentence above through 12 encoders each with 12 attention heads: 144 different calculations all at once.
# Obtain the base BERT pretrained language model
model_Bert = BertModel.from_pretrained('bert-base-uncased')
# Obtain the attention scores from BERT
attention = model_Bert(inputs, output_attentions=True)[2]
Get the final (last) encoder's attention and average it across all 12 heads: this represents the final attention score for each of our tokens. We are looking for a matrix in which each row corresponds to one of our tokens and adds up to 1. Higher values in a row indicate how much attention that token pays to the corresponding column; a lot of attention typically ends up on the period token.
final_attention_mean = attention[-1].mean(1)[0]
df_attention = pd.DataFrame(final_attention_mean.detach()).applymap(float).round(4)
df_attention.columns = tokenizer.convert_ids_to_tokens(tokens)
df_attention.index = tokenizer.convert_ids_to_tokens(tokens)
df_attention
Given a sequence of tokens X = (x1, x2, . . . , xn), BERT wraps the input sentence with [CLS] and [SEP] tokens.
CLS stands for "classification." It is a special token added to the beginning of each input sequence in BERT. This token is used to represent the entire sequence and is trained to predict the correct classification label for the input sequence. The output of the CLS token is used as an embedding representation of the input sequence for downstream classification tasks.
SEP stands for "separator." It is another special token used in BERT to separate two sentences or sequences of words that are concatenated together as input. For example, when using BERT to perform sentence pair classification, two input sentences are concatenated with a SEP token in between them to indicate the end of the first sentence and the beginning of the second one.
In summary, the CLS token is used to represent the entire input sequence and is used as the embedding for classification tasks, while the SEP token is used to separate sequences that are concatenated together as input.
Each row is a probability distribution that adds up to 1; the columns do not necessarily add up to 1.
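As a quick check, using the df_attention frame built above, the row sums should all be approximately 1 while the column sums are arbitrary:

# Rows are softmax distributions, so they sum to 1; columns do not
print(df_attention.sum(axis=1).round(3))  # all ~1.0
print(df_attention.sum(axis=0).round(3))  # arbitrary values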
The bertviz library shows a very nice interactive visualization of how the heads and encoders are working. Each color represents one head in each encoder layer.
list_tokens= tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, list_tokens)
Layers in the visualization correspond to encoder layers.
# Zoom in on third encoding layer and our first head
head_view(attention, tokenizer.convert_ids_to_tokens(inputs[0]), layer=2, heads=[1])
# This shows attention score for that head and layer
# The paper claims that the 8th encoder layer's 10th head relates direct objects to their verbs
head_view(attention, tokenizer.convert_ids_to_tokens(inputs[0]), layer=7, heads=[9])
BERT combines all of those relations into its final representation.
# The attention for the 8th encoder's 10th head shows direct-object attention.
# We are not averaging across heads this time.
eighth_tenth = attention[7][0][9]
# Get the attention matrix
attention_df = pd.DataFrame(eighth_tenth.detach()).round(4)
attention_df.columns = tokenizer.convert_ids_to_tokens(tokens)
attention_df.index = tokenizer.convert_ids_to_tokens(tokens)
attention_df # sums across rows add up to 1. sums across columns do not
"ground" pays very high attention (64.86%) to "left".
Transfer learning is a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task. To put it simply—a model trained on one task is repurposed on a second, related task as an optimization that allows rapid progress when modeling the second task.
In NLP, transfer learning is achieved by first pre-training a model on an unlabeled text corpus in an unsupervised or semi-supervised manner, and then fine-tuning (updating) the model on a smaller labeled dataset for a specific NLP task. If training is applied only to the smaller dataset without pre-training, it is not possible to reach the same level of performance. For example, in NLP we can start from BERT, and for image classification we can start from a pre-trained ResNet.
Image retrieved from Sinan Ozdemir
For example, BERT has been pretrained on two main corpora: English Wikipedia (2.5B words) and BookCorpus (800M words), a collection of free books. BERT went through these resources multiple times to gain a general understanding of language.
The BERT weights learned during pre-training are then fine-tuned. Moreover, a separate layer is added on top of the BERT model.
So, there are three fine-tuning approaches:
Images are retrieved from Sinan Ozdemir
import torch
import numpy as np
PyTorch is an open-source machine learning (ML) framework based on the Python programming language and the Torch library. It is a library that makes deep learning accessible to us. It has two purposes:
A NumPy replacement that leverages GPU power and other acceleration techniques, which is helpful for big data and leads to faster runs.
An optimized approach to implementing neural networks with less computational cost.
PyTorch's basic objects are tensors.
# One dimension
torch.tensor([2,3,5])
# One dimension
torch.LongTensor([2,3,5])
# Two dimension
torch.tensor([[2,3,5],[1,2,9]])
np.array([[2,3,5],[1,2,9]])
c_torch=torch.zeros(2, 2)
c_torch
c_numpy=np.zeros((2, 2))
c_numpy
torch.from_numpy(c_numpy)
c_torch.numpy()
two_d=torch.tensor([[2,3,5],[1,2,9]])
print(f'two_d {two_d}')
print(f'shape is {two_d.shape}, and dim is {two_d.dim()}')
unsqueeze adds a dimension at a given location, forcing that dimension to exist:
two_d_unsqueeze = two_d.unsqueeze(0)
print(f'two_d_unsqueeze {two_d_unsqueeze}')
print(f'shape is {two_d_unsqueeze.shape}, and dim is {two_d_unsqueeze.dim()}')
unsqueeze is very useful for forcing a "batch" dimension when we want to predict on a single example.
Steps 1 to 3 are repeated until we are satisfied with our model's performance. This is a manual process that can be tedious. See the figure below:
Image retrieved from Sinan Ozdemir
To address the problem above, we can use HuggingFace's Trainer API. It wraps the entire loop above, including the loss, gradient calculation, and optimization, in a single API called Trainer. Image retrieved from Sinan Ozdemir
The key objects are:
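As a rough sketch of how those pieces fit together, here is a minimal, self-contained fine-tuning loop; the tiny two-sentence dataset and the label values are hypothetical placeholders, not part of the original notebook:

import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

class TinyDataset(Dataset):
    """A tiny in-memory dataset of tokenized texts, purely for illustration."""
    def __init__(self, tokenizer, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

clf_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
clf_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tiny_train = TinyDataset(clf_tokenizer, ["I love this movie", "This was a terrible film"], [1, 0])

training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, per_device_train_batch_size=2)
trainer = Trainer(model=clf_model, args=training_args, train_dataset=tiny_train)
trainer.train()  # forward pass, loss, backward pass, and optimizer step handled in one call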
BERT stands for Bi-directional Encoder Representation from Transformers:
A sentence is fed into BERT to get a **context-full** representation (vector embedding) of every word in the sentence. The context of each word is understood by the encoder using a multi-head attention mechanism (relating each word to every other word in the sentence).
BERT comes in different sizes. The base model has 12 encoders, which is a good mix of complexity, size, and speed. BERT-small has 4 encoders and BERT-large has 24 encoders.
# Load pretrained BERT-base model with 12 encoder and 110M parameters
model_BERT_base = BertModel.from_pretrained('bert-base-uncased')
# Model's parameters
n_params = list(model_BERT_base.named_parameters())
print(f'The BERT model has {len(n_params)} different parameters')
print('********* Embedding Layer *********\n')
for par in n_params[0:5]:
print(f'{par[0], str(tuple(par[1].size()))}')
embeddings.word_embeddings.weight: (30522, 768) means there are 30,522 tokens that BERT is aware of and can use for any NLP task, and each token has a context-free embedding of dimension 768.
print('********* First Encoder ********* \n')
for par in n_params[5:21]:
print(f'{par[0], str(tuple(par[1].size()))}')
print('********* Output Layer ********* \n')
for par in n_params[-2:]:
print(f'{par[0], str(tuple(par[1].size()))}')
pooler: a separate feed-forward network with a hyperbolic tangent activation function. When we use BERT, the pooler takes the vector embedding of the token representing the entire sentence ([CLS]), not a particular token.
# load the bert-base uncased tokenizer.
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
# tokenize a sequence
tokenizer_sentence=tokenizer_bert.encode('AI has been my friend')
tokenizer_sentence
We always have token 101 at start which refers to CLS, and 102 at the end which refers to SEP. Those are automatically added to tokenizer.
We can run this token through a model:
# running the tokens through the model
response = model_BERT_base(torch.tensor(tokenizer_sentence).unsqueeze(0))
The code above passes the tokens through our BERT model, which produces several outputs.
response
# Embedding for each token
response.last_hidden_state
Each row represents a token in the sequence, and each vector represents that token's context within the greater sequence. As mentioned before, the first row is CLS.
# The size of pooler_output
response.pooler_output.shape
pooler_output is meant to be representative of the entire sequence as a whole, not just an individual token. Its size is (1, 768), matching the model's hidden dimension.
response.pooler_output
model_BERT_base.pooler
The model's pooler is a feed-forward network with a Tanh activation.
# Get the final encoder's representation. The first element of the second dimension is the CLS token
CLS_embedding = response.last_hidden_state[:, 0, :].unsqueeze(0) # the second dimension holds all of the tokens
CLS_embedding.shape
# put CLS_embedding through model's pooler
model_BERT_base.pooler(CLS_embedding).shape
model_BERT_base.pooler(CLS_embedding)
The first dimension is our batch size (still 1) and 768 is the final embedding dimension of the model. This tensor is a vector representation of the entire sequence at large.
(model_BERT_base.pooler(CLS_embedding) == response.pooler_output).all()
Running the embedding for CLS through the pooler gives the same output as the pooler_output
tot_prms = 0
for par in model_BERT_base.parameters():  # Iterate through parameters
    if len(par.shape) == 2:
        tot_prms += par.shape[0] * par.shape[1]  # multiply matrix dimensions together and add to the total
print(f'BERT has a total of {tot_prms:,} learnable parameters.')
print('This is how we get to roughly 110M learnable parameters for BERT')
print(f"""There are only {30522 * 768:,} parameters for the context-free word embeddings. The rest of the parameters are spread over
the encoders, especially in the attention calculations""")
BERT's tokenizer splits text into tokens drawn from a vocabulary of just over 30,000 WordPiece tokens. As mentioned before, two special tokens, [CLS] and [SEP], are added at the beginning and the end of the phrase, respectively: [CLS] represents the entire sequence and [SEP] separates sentences. For example, tokenizing the sentence "AI has conquered the world" gives:
["[CLS]","AI", "has", "conquered", "the", "world","[SEP]"]
There might be words that do not exist in BERT's vocabulary, for example "Mehdi" in "Mehdi loves AI":
'Mehdi' in tokenizer.vocab
BERT deals with this by splitting the word into WordPieces: "me", "##hdi". See the tokenization below:
["[CLS]","me", "##hdi", "loves", "ai","[SEP]"]
So the takeaway from the example above is that there is a clear distinction between words and tokens; they are not interchangeable!
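A quick way to see the word/token distinction, assuming the tokenizer_bert loaded earlier, is to call tokenize directly, which returns the WordPiece strings instead of ids:

# Three words become four WordPiece tokens (before [CLS]/[SEP] are added)
print(tokenizer_bert.tokenize("Mehdi loves AI"))  # e.g. ['me', '##hdi', 'loves', 'ai']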
There are two kinds of BERT tokenization, uncased and cased (a short comparison sketch follows the table):
| uncased | cased |
| --- | --- |
| Lower-cases text and removes accents | Keeps text unchanged |
| Generic situations where case does not contribute to context | Situations where case matters (e.g., Named Entity Recognition) |
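As a small illustration of the difference, the cased checkpoint is loaded here only for comparison; the example sentence and the exact WordPiece splits shown in the comments are illustrative:

from transformers import BertTokenizer

uncased_tok = BertTokenizer.from_pretrained('bert-base-uncased')
cased_tok = BertTokenizer.from_pretrained('bert-base-cased')

sample = "Mehdi lives in Calgary"
print(uncased_tok.tokenize(sample))  # lower-cased, e.g. ['me', '##hdi', 'lives', 'in', 'calgary']
print(cased_tok.tokenize(sample))    # case preserved, which helps tasks like Named Entity Recognition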
# Load BERT's "uncased" tokenizer.
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
print(f'Number of BERTs vocabulary: {len(tokenizer_bert.vocab)}')
txt = "Example of a simple sentence!"
# Tokenization
tokens = tokenizer_bert.encode(txt)
print(tokens)
# Re-construct the sentence by decode method of tokens
tokenizer_bert.decode(tokens)
Let's try a more complex sentence:
text = "AI is my friend and has been friendly since it was invented."
tokens = tokenizer_bert.encode(text)
print(tokens)
We can show each token and its corresponding word in a nicer way below:
print(f'The sentence is "{text}", which leads to {len(tokens)} token:')
for tkn in tokens:
    print(f'Token: {tkn}, corresponding word: {tokenizer_bert.decode([tkn])}')
text = "Mehdi loves AI"
tokens = tokenizer_bert.encode(text)
for tkn in tokens:
    print(f'Token: {tkn}, corresponding word: {tokenizer_bert.decode([tkn])}')
Up to now, we were using encode, which returns ids. We can also use encode_plus, which gives us multiple things (input ids, token type ids, and the attention mask):
text = "Mehdi loves AI"
tokens = tokenizer_bert.encode_plus(text)
print(tokens)
# python is the 6th token
txt1='The coolest programming language is Python'
python_language = tokenizer.encode(txt1)
# python is the 1st token
txt2='Python can be aggressive sometimes during hunting'
python_pet = tokenizer.encode(txt2)
The processing steps below are required:
# Vector representation of 'python' in 'The coolest programming language is Python'
python_embedding_programming = model(torch.tensor(python_language).unsqueeze(0))[0][:,6,:].detach().numpy()
# Vector representation of 'python' in 'Python can be aggressive sometimes during hunting'
python_embedding_pet = model(torch.tensor(python_pet).unsqueeze(0))[0][:,1,:].detach().numpy()
# Import cosine similarity from sklearn
from sklearn.metrics.pairwise import cosine_similarity
# Calculate cosine similarity between representation of the word Python
sim=cosine_similarity(python_embedding_programming, python_embedding_pet)[0][0]
print(f'Cosine similarity between the representation of the word "Python" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt1}\n 2- {txt2} \n ')
# Vector representation of 'snake' in 'my snake is not poisonous and very friendly'
txt3='my snake is not poisonous and very friendly'
snake_embedding = model(torch.tensor(tokenizer.encode(txt3)).unsqueeze(0))[0][:,2,:].detach().numpy()
txt4='Programming is very difficult for beginner'
# Vector representation of 'programming' in 'Programming is very difficult for beginner'
programming_embedding = model(torch.tensor(tokenizer.encode(txt4)).unsqueeze(0))[0][:,1,:].detach().numpy()
sim=cosine_similarity(snake_embedding, programming_embedding)[0][0]
print(f'Cosine similarity between the representation of the word "snake" and "programming" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt3}\n 2- {txt4} \n ')
sim=cosine_similarity(python_embedding_programming, programming_embedding)[0][0]
print(f'Cosine similarity between the representation of the word "Python" and "Programming" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt1}\n 2- {txt4} \n ')
sim=cosine_similarity(python_embedding_pet, snake_embedding)[0][0]
print(f'Cosine similarity between the representation of the word "Python" and "snake" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt2}\n 2- {txt3} \n ')
def matrix_occure_prob(df,title,fontsize=11,vmin=-0.1, vmax=0.8,lable1='Sentence 1',pad=55,
lable2='Sentence 2',label='Cosine Similarity',rotation_x=90,axt=None,
num_ind=False,txtfont=6,lbl_font=9,shrink=0.8,cbar_per=False,
xline=False):
import matplotlib.pyplot as plt
"""Plot correlation matrix"""
ax = axt or plt.axes()
colmn1=list(df.columns)
colmn2=list(df.index)
corr=np.zeros((len(colmn2),len(colmn1)))
for l in range(len(colmn1)):
for l1 in range(len(colmn2)):
cc=df[colmn1[l]][df.index==colmn2[l1]].values[0]
try:
if len(cc)>1:
corr[l1,l]=cc[0]
except TypeError:
corr[l1,l]=cc
if num_ind:
ax.text(l, l1, str(round(cc,2)), va='center', ha='center',fontsize=txtfont)
im =ax.matshow(corr, cmap='jet', interpolation='nearest',vmin=vmin, vmax=vmax)
cbar =plt.colorbar(im,shrink=shrink,label=label)
if (cbar_per):
cbar.ax.set_yticklabels(['{:.0f}%'.format(x) for x in np.arange( 0,110,10)])
ax.set_xticks(np.arange(len(colmn1)))
ax.set_xticklabels(colmn1,fontsize=lbl_font)
ax.set_yticks(np.arange(len(colmn2)))
ax.set_yticklabels(colmn2,fontsize=lbl_font)
# Set ticks on both sides of axes on
ax.tick_params(axis="x", bottom=True, top=False, labelbottom=True, labeltop=False)
# Rotate and align bottom ticklabels
plt.setp([tick.label1 for tick in ax.xaxis.get_major_ticks()], rotation=rotation_x,
ha="right", va="center", rotation_mode="anchor")
# Rotate and align bottom ticklabels
plt.setp([tick.label1 for tick in ax.yaxis.get_major_ticks()], rotation=rotation_x,
ha="right", va="center", rotation_mode="anchor")
if xline:
x_labels = list(ax.get_xticklabels())
x_label_dict = dict([(x.get_text(), x.get_position()[0]) for x in x_labels])
for ix in xline:
plt.axvline(x=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')
plt.axhline(y=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')
plt.xlabel(lable1)
plt.ylabel(lable2)
ax.grid(color='k', linestyle='-', linewidth=0.05)
plt.title(f'{title}',fontsize=fontsize, pad=pad)
plt.show()
txt1 = "President greets the press in Chicago"
txt1_tokenized = tokenizer.encode(txt1)
#
txt2 = "Obama speaks to media in Illinois"
txt2_tokenized = tokenizer.encode(txt2)
df = pd.DataFrame()
for i in np.arange(1,len(txt1_tokenized)-1):
sim_all=[]
idx = []
for j in np.arange(1,len(txt2_tokenized)-1):
embedding_txt1_tokenized = model(torch.tensor(txt1_tokenized).unsqueeze(0))[0][:,i,:].detach().numpy()
embedding_txt2_tokenized = model(torch.tensor(txt2_tokenized).unsqueeze(0))[0][:,j,:].detach().numpy()
tmp = cosine_similarity(embedding_txt1_tokenized,embedding_txt2_tokenized)
sim_all.append(tmp[0][0])
idx.append(tokenizer.decode([txt2_tokenized[j]]))
df[tokenizer.decode([txt1_tokenized[i]])] = sim_all
df.index = idx
df
font = {'size' : 10}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(7, 4), dpi= 110, facecolor='w', edgecolor='k')
matrix_occure_prob(df,vmin=0.0,vmax=0.8,title='Word Similarity by BERT Model',num_ind=True,axt=ax,pad=5,
txtfont=12,lable1='',lable2='',label='BERT',xline=False)
The easiest approach for us to implement sentence similarity is through the sentence-transformers library, which wraps most of this process into a few lines of code.
First, we install sentence-transformers using pip install sentence-transformers. This library uses HuggingFace's transformers behind the scenes, so sentence-transformers models can also be found on the HuggingFace model hub.
We use the bert-base-nli-mean-tokens model. Let's create some sentences, initialize our model, and encode the sentences:
corpus = ["Global warming is happening",
"The weather is not good to play golf today",
"Never compare an apple to an orange",
"Apple and orange are completely different from each other",
"Ocean temperature is rising rapidly",
"AI has taken the world by storm",
"It is rainy today so we should postpone our golf game",
"I love reading books than watching TV",
"People say I am a bookworm, in fact, I do not want to waste my time on TV",
"AI has transformed the way the world works"]
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(corpus)
sentence_embeddings.shape
df=pd.DataFrame()
for i1 in range(len(corpus)):
sim_all=[]
for i2 in range(len(corpus)):
tmp=cosine_similarity([sentence_embeddings[i1]],[sentence_embeddings[i2]])
sim_all.append(tmp[0][0])
df[corpus[i1]]=sim_all
df.index=corpus
df
font = {'size' : 16}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(15, 10), dpi= 110, facecolor='w', edgecolor='k')
matrix_occure_prob(df,title='Text Similarity Matrix by BERT Pre-trained Model ',lable1='',vmin=0, pad=10,axt=ax,
vmax=0.8,cbar_per=False,lable2='',num_ind=True,txtfont=15,xline=False,fontsize=22,
lbl_font=14,label='BERT Similarity',rotation_x=30)
BERT applies three different types of embeddings:
Token Embeddings: map each token in the input sequence to a fixed-size embedding vector. BERT uses WordPiece tokenization, which breaks words down into smaller subword units based on their frequency in a large corpus of text.
Segment Embeddings: used for tasks that require processing pairs of sentences or sequences, such as sentence-pair classification or question answering. The segment embeddings indicate which sentence each token belongs to.
Position Embeddings: add positional information to the token embeddings. This is important because the order of words in a sentence is crucial to its meaning.
To create the final embedding, we add up all of these embeddings to get a final representation; for an 11-token input its shape is (11, 768).
Here is a schematic illustration of Token Embeddings for the input text "I like strawberries".
Image retrieved from medium
Here is a schematic illustration of Segment Embeddings for the input texts "I like cats" and "I like dogs".
Image retrieved from medium
The Segment Embeddings layer is comprised of solely two vector representations. The initial vector (index 0) is designated for all tokens associated with input 1, while the final vector (index 1) is designated for all tokens associated with input 2. In cases where there is only one input sentence, its segment embedding will be represented solely by the vector assigned to index 0 of the Segment Embeddings table.
BERT is composed of a stack of Transformer encoders, which on their own do not capture the sequential arrangement of their inputs. The inclusion of position embeddings enables BERT to comprehend input texts such as "I think, therefore I am", where the first "I" should not have the same vector representation as the second "I".
BERT was specifically created to process input sequences that have a maximum length of 512. To account for the sequential order of the input sequences, the authors of BERT enabled it to acquire a vector representation for each position through learning. This implies that the Position Embeddings layer consists of a lookup table measuring (512, 768), with the initial row representing the vector representation of any word in the first position, the subsequent row representing the vector representation of any word in the second position, and so on. Consequently, if the input comprises phrases such as "Hello world" and "Hi there," both "Hello" and "Hi" will share the same position embeddings since they are the initial word in the input sequence. Correspondingly, both "world" and "there" will have identical position embeddings.
Transformers lack recurrent components, which means that they do not possess the ability to identify the positions of words within a sentence when presented with it as a whole. One approach involves adding sine and cosine waves with varying frequencies to the word vectors.
See the equation below, which uses sine and cosine functions to encode each token's position:

$$PE(i, j) = \begin{cases} \sin\left(\dfrac{i}{10000^{\,j/d}}\right) & \text{if } j \text{ is even} \\ \cos\left(\dfrac{i}{10000^{\,(j-1)/d}}\right) & \text{if } j \text{ is odd} \end{cases}$$

where

$i$: position of the token, between 0 and 511.

$j$: position within the embedding dimension, between 0 and 767 for BERT-base.

$d$: embedding dimension, which is 768 for BERT-base.
This equation enables the model to recognize where a token falls in the sequence, e.g. whether it is at the beginning or the end. The position embedding is then added to the other embeddings, including the word embedding and the segment embedding.
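Here is a minimal NumPy sketch of these sinusoidal position encodings, using the same $i$ (token position), $j$ (embedding dimension), and $d$ (768) notation as above. Note that BERT itself learns its position embeddings rather than using fixed sinusoids, as mentioned earlier; this sketch only illustrates the sine/cosine approach described here:

import numpy as np

def sinusoidal_position_encoding(n_positions=512, d=768):
    """Build the (n_positions, d) sinusoidal position-encoding table."""
    pe = np.zeros((n_positions, d))
    positions = np.arange(n_positions)[:, None]   # i = 0 .. n_positions - 1
    div = 10000 ** (np.arange(0, d, 2) / d)       # 10000^(j/d) for even j
    pe[:, 0::2] = np.sin(positions / div)         # even dimensions use sine
    pe[:, 1::2] = np.cos(positions / div)         # odd dimensions use cosine
    return pe

pe = sinusoidal_position_encoding()
print(pe.shape)               # (512, 768)
print(pe[0, :4], pe[1, :4])   # nearby positions get similar but distinct encodings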
model_bert = BertModel.from_pretrained('bert-base-uncased')
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
The code below shows BERT's embedding layers:
word_embeddings: context-free word embeddings, size= 30522 (vocabulary) * 768 (dimension)
position_embeddings : encodes word position, size= 512 (length) * 768 (dimension)
token_type_embeddings : 0 or 1. Used to lookup the segment embedding, size= 2 (segment A/B) * 768 (dimension)
model_bert.embeddings
simple_sentence = 'I am Mehdi'
tokenizer.encode(simple_sentence, return_tensors='pt') # return_tensors='pt' saves us from converting to PyTorch tensors manually
# embedding of each token (context-free) in the sentence
model_bert.embeddings.word_embeddings(tokenizer.encode(simple_sentence, return_tensors='pt'))
If we encode a different sentence, the first and last rows (the [CLS] and [SEP] embeddings) will be identical:
model_bert.embeddings.word_embeddings(tokenizer.encode('I am Hamed. ', return_tensors='pt'))
model_bert.embeddings.position_embeddings # 512 embeddings
torch.LongTensor(range(6)) # making a long tensor of length 6 by torch
model_bert.embeddings.position_embeddings(torch.LongTensor(range(6))) # positional embeddings for our example_phrase
Each row encodes the position of a token, letting BERT know where each token appears in the sequence (recall from above that BERT learns these position embeddings).
model_bert.embeddings.token_type_embeddings # Segment A and B (2 embeddings)
torch.LongTensor([0]*6)
model_bert.embeddings.token_type_embeddings(torch.LongTensor([0]*6)) # Same embedding for all tokens
We get the same row representing the same segment.
To get the final BERT embedding, we add all three types of embeddings (word embedding, position embedding, and token type/segment embedding) and then pass the sum through LayerNorm.
# Apply feed forward normalization layer
model_bert.embeddings.LayerNorm(
model_bert.embeddings.word_embeddings(tokenizer.encode(simple_sentence, return_tensors='pt')) + \
model_bert.embeddings.position_embeddings(torch.LongTensor(range(6))) + \
model_bert.embeddings.token_type_embeddings(torch.LongTensor([0]*6))
)
The exact same matrix can be obtained by calling embeddings directly, as below:
model_bert.embeddings(tokenizer.encode(simple_sentence, return_tensors='pt'))
model_bert.embeddings(tokenizer.encode(simple_sentence, return_tensors='pt')).shape
A batch of 1 data point with 6 tokens, where each token has a fixed-length embedding vector of size 768.
Since BERT has no decoder, this representation is what the rest of BERT's encoder stack consumes.
Here is a BERT visualization:
The BERT pre-trained model is trained on two tasks: masked language modeling and next-sentence prediction.
These are not useful tasks on their own, but they help BERT learn how words work.
# Import libraries: BertForMaskedLM stands for masked language model. pipeline makes it much easier
# to perform NLP tasks without too much code
from transformers import BertForMaskedLM, pipeline
The code below initializes a masked language model with the pretrained bert-base-cased, which keeps track of accents, upper and lower case, etc.
bert_lm_mask = BertForMaskedLM.from_pretrained('bert-base-cased')
# look at the model
bert_lm_mask
# Using pipelines in transformers makes our life easier for several tasks
# The code below shows how to perform an auto-encoding language model task
# We should give the pipeline a model to do the task
# For the same result, we could pass "model=bert_lm_mask"
nlp_mask = pipeline("fill-mask", model='bert-base-cased')
type(nlp_mask.model)
nlp_mask.tokenizer
preds = nlp_mask(f"If you don’t know how to swim, you will {nlp_mask.tokenizer.mask_token} in this lake.")
print('If you don’t know how to swim, you will .... in this lake.')
for p in preds:
print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
This may not be a very useful task by itself, but tasks like this were created by the authors of BERT to teach it the basics of how words are used in sentences.
BERT Next sentence Prediction involves feeding BERT the inputs "sentence A" and "sentence B" and predicting whether "sentence B" comes directly after "sentence A" (True/False). The BERT model is trained using next-sentence prediction (NSP) and masked-language modeling (MLM).
from transformers import BertForNextSentencePrediction
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
BERT_nsp = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
BERT_nsp
text = "I like cookies!"
text2 = "Do you like them too?"
inputs = tokenizer(text, text2, return_tensors='pt') # 'pt' stands for pytorch format
inputs
inputs.input_ids # tokens for sentence A and B
inputs.token_type_ids # segment Ids (0 == A & 1 == B)
inputs.attention_mask # pay attention to everything
# pass inputs into our nsp model
# 0 == "isNextSentence" and 1 == "notNextSentence"
outputs = BERT_nsp(**inputs)
outputs
The logits define which class to predict: label 0 has a much higher score than label 1. Label 0 is "True" (sentence B comes after sentence A) and label 1 is "False" (sentence B does not come after sentence A).
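To turn these logits into probabilities, assuming the outputs object from the cell above, we can apply a softmax:

import torch

# Convert NSP logits to probabilities: index 0 = "is next sentence", index 1 = "is not"
probs = torch.softmax(outputs.logits, dim=-1)
print(probs)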
# calculate the loss by passing in a label
# If we pass in an explicit label (label 0), the tensor of logits stays the same.
outputs = BERT_nsp(**inputs, labels=torch.LongTensor([0]))
outputs
The loss is very low (close to zero), which means sentence B does indeed come after sentence A.
# calculate the loss by passing in a label
outputs = BERT_nsp(**inputs, labels=torch.LongTensor([1]))
outputs
The loss is very high, which means sentence B does not come after sentence A.
Again, next-sentence prediction on its own is not a useful task, but it does help teach BERT how to model the relationship between sentences.
Through pre-training, BERT gains a general idea about:
Whatever BERT has learned can be used to solve a specific NLP problem by fine-tuning the model.
Fine-tuning works by first feeding a sentence to a pre-trained BERT. The [CLS] token has been pre-trained on the next-sentence prediction task through the model's pooler attribute. We add another feed-forward layer after the pooler and train it to map the pooled output to the number of sequence classes we want. For the classification problem shown in the figure below, we do not care about the representation of each individual token after passing our sentence to BERT; we classify the entire sequence with a single label.
However, for token classification, we need the representation of each token and pass each of them through a feed-forward layer to classify every token into one of the labels we have. The classic example of this is Named Entity Recognition.
Question answering is the most difficult fine-tuning task. We have a question and a context, a passage that contains the answer to the question. We pass the entire sequence (question and context) to the pre-trained BERT. Similar to token classification, we add a layer on top of every single token; what we predict is whether or not that specific token represents the start or the end of the answer to the question:
When fine-tuning BERT to solve NLP tasks, we can utilize three built-in classes from the transformers library: BertForQuestionAnswering, BertForTokenClassification, and BertForSequenceClassification. These pre-trained classes are provided by Hugging Face.
from transformers import pipeline, BertForQuestionAnswering, BertForTokenClassification, BertForSequenceClassification
bert_sq_clss = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
bert_sq_clss
bert_sq_clss.classifier
A classifier from the Hugging Face model repository is selected. This model has already been fine-tuned to predict whether a sentence is positive, negative, or neutral for financial data. Of course, we are able to fine-tune our own model, but first we look at other people's models. First, we need to create a pipeline:
The ProsusAI/finbert model is retrieved from Hugging Face as a financial BERT model. It takes a short text about financial data and outputs whether that text is positive, negative, or neutral in a financial context.
bert_fin = pipeline('text-classification', model='ProsusAI/finbert', tokenizer='ProsusAI/finbert')
bert_fin('The stock market plummeted today, leaving many investors feeling stuck and uncertain about the future.')
bert_fin('Projects shows great trend for stock market.')
This BERT model has been fine-tuned for the financial context.
bert_fin.model
Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models could be trained to identify specific entities in a text, such as dates, individuals and places; and PoS tagging would identify, for example, which words in a text are verbs, nouns, and punctuation marks.
bert_toc_clss = BertForTokenClassification.from_pretrained('bert-base-uncased')
bert_toc_clss
We can train our own model, but here we first use an already fine-tuned model.
ner_classifier = pipeline("ner")
ner_classifier("Hello I'm Mehdi and I live in Calgary.")
In PoS tagging, the model recognizes parts of speech, such as nouns, pronouns, adjectives, or verbs, in a given text. The task is formulated as labeling each word with a part of the speech.
pos_classifier = pipeline("token-classification", model = "vblagoje/bert-english-uncased-finetuned-pos")
pos_classifier("Hello I'm Mehdi and I live in Calgary.")
bert_qa_clss = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
bert_qa_clss
For the *qa output*, there are only two output features, so out_features is always 2: whether a token is the start or the end of our answer.
bert_qa_clss.qa_outputs
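For reference, here is a minimal sketch of what those two output features are used for. Because bert_qa_clss above starts from the plain bert-base-uncased checkpoint, its QA head is untrained and the extracted span will be meaningless; the question/context pair is made up, and the point is only to show the mechanics of the start and end logits:

question, context = "Where does Mehdi live?", "Mehdi lives in Calgary."
qa_inputs = tokenizer(question, context, return_tensors='pt')

qa_out = bert_qa_clss(**qa_inputs)
start = qa_out.start_logits.argmax(dim=-1).item()  # most likely start token
end = qa_out.end_logits.argmax(dim=-1).item()      # most likely end token

answer_tokens = qa_inputs.input_ids[0][start:end + 1]
print(tokenizer.decode(answer_tokens))  # meaningless here because the QA head is untrained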
A QA model found on the Hugging Face model repository, a flavor of BERT called RoBERTa, trained on the SQuAD (question answering) dataset. This is the common approach to fine-tuning for question answering:
model_name = "deepset/roberta-base-squad2"
qa = pipeline(model=model_name, tokenizer=model_name, revision="v1.0", task="question-answering")
sequence = "Where is Mehdi living these days?", "Mehdi lives in Calgary"
qa(*sequence)