Introduction to BERT Large Language Model

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a groundbreaking natural language processing (NLP) model that has significantly advanced the field of language understanding and machine learning. Developed by Google in 2018, BERT is a transformer-based neural network architecture designed to comprehend context and relationships within sentences by considering the bidirectional flow of information. This notebook will elucidate the fundamentals of Large Language Models (LLMs), demonstrating the implementation of the BERT model using the transformer architecture.

Introduction

History of NLP

The history of neural language models started in 2001 with the first neural language model, which used a feed-forward neural network to predict the next word in a sentence. In 2013, Word2vec addressed the problem of creating word embeddings using two methods: the continuous bag-of-words algorithm and the skip-gram network. However, these word embeddings inherited bias from the large-scale corpora used to train the models. In 2013 and 2014, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) rose to prominence for natural language processing tasks, leading to sequence-to-sequence modeling: an encoder reads a variable-length input sequence and a decoder generates a variable-length output sequence.

image.png Sequence-to-sequence (top left), sequence-to-vector (top right), vector-to-sequence (bottom left) and Encoder-Decoder (bottom right) networks. Image retrieved from Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition.

In 2015, attention was introduced, allowing the decoder in sequence-to-sequence models to access the encoder's intermediate hidden states. In 2017, the transformer was introduced in the paper "Attention Is All You Need", replacing RNNs as the primary mechanism for language modeling and leading to pre-trained language models such as Google's BERT, OpenAI's GPT, and T5. Pre-trained models, starting with the LSTM-based ELMo and followed by transformer-based architectures, showed improvements in nearly all areas of NLP compared to previous state-of-the-art results.

Attention and Self-Attention

The transformer model is primarily based on the idea of **attention**. Attention mechanisms started to be used in NLP tasks around 2015 and have become the dominant way to perform them over the past decade. Attention in NLP is a mechanism designed to focus on specific parts of a sequence in the context of another sequence. It is used in various NLP tasks, such as language modeling, sequence classification, language translation, and image captioning. The idea behind attention is to give the model the ability to attend to relevant information while performing a prediction task. The attention mechanism allows the model to understand relationships between different elements in a sequence and make predictions based on that information. For example, in language translation, the attention mechanism helps the model understand which parts of a source sentence are relevant for translating into a target language. There are various types of attention mechanisms, including self-attention, which is the kind of attention that powers the transformer architecture in NLP.

One of the most important forms of attention is **self-attention**, which allows a model to learn relationships between each word in a sequence and every other word in that same sequence. Self-attention adapts the idea of attention by relating each word in a phrase to every other word in the phrase so that the model learns those relationships. This form of attention is the basis of the **transformer** architecture. One example of its use is language translation, where attention allows the model to focus on specific parts of the input sequence and understand relationships between words. The BERT model uses self-attention and can capture grammar rules and relationships between words, as demonstrated by its attention scores.

In simpler terms, self-attention helps us create similar connections, but within the same sentence. Look at the following example:

“I poured water from the bottle into the cup until it was full.”

it => cup

“I poured water from the bottle into the cup until it was empty.”

it => bottle

By changing one word (“full” → “empty”), the reference object for “it” changed. If we are translating such a sentence, we will want to know what the word “it” refers to.

Encoder-Decoder

The **Transformer** is a state-of-the-art natural language processing model based on an encoder-decoder architecture. It aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. It relies entirely on self-attention to compute representations of its input and output WITHOUT using sequence-aligned RNNs or convolutions. In other words, it uses self-attention to model the relationships between tokens in a sentence instead of the recurrence used in RNNs and LSTMs. The Transformer consists of an encoder stack and a decoder stack: the input sentence is first turned into input embeddings and then fed into the encoder. The final representation of the input sequence is then fed into the decoder, which uses cross-attention, combining information from the encoder and decoder, to generate an output.

image.png

The Encoder stack and the Decoder stack each have their corresponding Embedding layers for their respective inputs. Finally, there is an Output layer to generate the final output. image-4.png Image is retrieved from https://towardsdatascience.com/

The Encoder contains the all-important Self-attention layer that computes the relationship between different words in the sequence, as well as a Feed-forward layer.

image-5.png Image is retrieved from https://towardsdatascience.com/

The Transformer has different types of attention mechanisms: (1) multi-head attention in the encoder, (2) masked multi-head attention in the decoder, and (3) cross-attention between the encoder and decoder. These attention mechanisms allow the Transformer to learn grammatical structures and rules simultaneously, and to follow a train of thought ("Attention Is All You Need").

image-3.png

Attention allows a language model to distinguish between the following two sentences:

  • She poured water from the pitcher to the cup until it was full.
  • She poured water from the pitcher to the cup until it was empty.

There’s a very important difference between these two almost identical sentences: in the first, “it” refers to the cup. In the second, “it” refers to the pitcher. Humans don’t have a problem understanding sentences like these, but it’s a difficult problem for computers. Attention allows Transformers to make the connection correctly because they understand connections between words that aren’t just local. It’s so important that the inventors originally wanted to call Transformers “Attention Net” until they were convinced that they needed a name that would attract more, well, attention.

How a Text is Processed

There are three types of language models that can be trained to predict a missing word:

1. Auto-regressive

Auto-regressive language models predict a missing word in a sentence given either the past tokens or the future tokens, but not both. This includes both forward and backward prediction. A familiar example is our phone's auto-completion of a sentence. The **GPT** family falls into this category.

2. Auto-encoding.

Auto-encoding language models strive to gain a comprehensive understanding of entire sequences of tokens given all available context (both past and future tokens). This is great for natural language understanding tasks like sequence classification and named entity recognition; **BERT** is an example (a fill-mask sketch follows this list).

See the Figure below for the difference: image.png

3. Combinations of autoregressive and autoencoding, like T5, which can use both the encoder and the decoder to be more versatile and flexible in generating text. It has been shown that these combination models can generate more diverse and creative text in different contexts compared to pure decoder-based autoregressive models, thanks to their ability to capture additional context using the encoder.
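To make the auto-encoding idea concrete, here is a minimal sketch (assuming the HuggingFace transformers library and the bert-base-uncased checkpoint are available) that asks BERT to fill in a masked word using context from both sides; the example sentence is illustrative and not from this notebook:

from transformers import pipeline

# fill-mask uses an auto-encoding model to predict a masked token
# from both its left and right context
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

for prediction in fill_mask("I poured water from the bottle into the [MASK]."):
    print(prediction['token_str'], round(prediction['score'], 4))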

Text Processing with Attention: Applying the Transformer

The transformer architecture was introduced in 2017 in the paper "Attention is All You Need" and has since replaced the use of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in natural language processing tasks. The transformer uses various forms of attention, including multi-headed attention and masked multi-headed attention, to perform NLP tasks effectively. Recently, the transformer architecture has started to be used in computer vision tasks as well, with a type of transformer called the vision transformer being used to replace CNNs.

How Transformer Learns and Thinks

Standard attention is calculated using three matrices: query, key, and value.

Assume X is our tokenized input. X has 2 tokens (the number of rows), and each token has a 4-dimensional embedding (the number of columns). In real life, the embedding dimension is much bigger than 4. In this case, our input has already gone through all of BERT's preprocessing steps.

We take the matrix X and perform three linear transformations on it with the weight matrices WQ, WK and WV, projecting it into three new vector spaces: Q (query), K (key) and V (value). The output of each matrix multiplication is a matrix whose rows are still the tokens (still 2 tokens) but whose number of columns is reduced from 4 to 3: the input is compressed into a smaller matrix. A small sketch of this projection follows the list below.

image.png Image from Sinan Ozdemir

  • The query matrix is meant to represent the information we are looking for. For example, when we say a word out loud, there should be a point in doing so.
  • The key matrix is meant to represent the relevance of each word to our query, for example the reason behind saying "I love cats". It quantifies the importance of each word to the overall query that we have.
  • The value matrix intuitively represents the contextless meaning of our input tokens. Contextless means pre-attention, before looking at the other tokens: what does the word "cat" mean on its own, for example. The final aim is to obtain a context-full representation.
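As promised above, here is a minimal sketch of the projection step in PyTorch. The shapes (2 tokens, embedding dimension 4, projected dimension 3) mirror the toy example; the weight matrices are random stand-ins, not trained BERT weights:

import torch

torch.manual_seed(0)
X = torch.rand(2, 4)    # 2 tokens, each with a 4-dimensional embedding

W_Q = torch.rand(4, 3)  # weight matrices projecting into 3-dimensional spaces
W_K = torch.rand(4, 3)
W_V = torch.rand(4, 3)

Q = X @ W_Q             # queries, shape (2, 3)
K = X @ W_K             # keys,    shape (2, 3)
V = X @ W_V             # values,  shape (2, 3)
print(Q.shape, K.shape, V.shape)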

Imagine our input sequence is "I like cats". The goal is to look at each token and find its relevance to the other tokens. The Figure below shows the attention scores for "like", i.e. the relevance of each key to the query "like". Green is the key space and orange is the query space, which represents the overall aim of speaking in the first place.

A higher value of QK^T denotes higher semantic similarity; "I" in the key space is very far from "like" in the query space.

image-2.png Image from Sinan Ozdemir

As the numbers do not add up to 1, the next step is to normalize the attention scores for "like". Finally, the scores are multiplied by the value vectors of the tokens. This yields a context-full embedding of "like". It can be interpreted as saying that the semantic meaning of the word "like" should receive 42% of our attention.

image-3.png Image from Sinan Ozdemir

Here is another example. On the left, QK^T is calculated for each token, divided by sqrt(300), and then passed through a softmax function. Each row adds up to 1; each row shows how much attention a token should pay to the other tokens, including itself. For example, "it's" spends most of its attention on "here" and some of its attention on "another". On the right, we multiply each row by the 300-dimensional value matrix. When this multiplication is done, we obtain a matrix in which each row is a token and each column is a context-full dimension of meaning.

image-5.png Image from Sinan Ozdemir
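Putting the pieces together, here is a minimal sketch of scaled dot-product attention with random stand-in matrices; the division by sqrt(d_k) plays the same role as the division by sqrt(300) above:

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, d_k = 3, 4                  # toy sizes
Q = torch.rand(n_tokens, d_k)
K = torch.rand(n_tokens, d_k)
V = torch.rand(n_tokens, d_k)

scores = (Q @ K.T) / math.sqrt(d_k)   # similarity of every query with every key
weights = F.softmax(scores, dim=-1)   # normalize so each row adds up to 1
contextual = weights @ V              # context-full representation of each token

print(weights.sum(dim=-1))            # each row sums to (approximately) 1
print(contextual.shape)               # (n_tokens, d_k)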

Let's take a look below:

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Import BERT model from transformer library which has a lot of pretrained models
# transformers is HuggingFace library
from transformers import BertModel 
# BertModel also requires the PyTorch library
In [2]:
# Grab base BERT pretrained language model
model_Bert=BertModel.from_pretrained('bert-base-uncased')
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

See BERT Transformation for more information.

In [3]:
# The base BERT model has 12 encoders in its encoder stack
len(model_Bert.encoder.layer)
Out[3]:
12
In [4]:
# Look at first encoder
model_Bert.encoder.layer[0]
Out[4]:
BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=768, out_features=768, bias=True)
      (key): Linear(in_features=768, out_features=768, bias=True)
      (value): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=768, out_features=3072, bias=True)
    (intermediate_act_fn): GELUActivation()
  )
  (output): BertOutput(
    (dense): Linear(in_features=3072, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

As can be seen, it has self-attention with matrices for query, key and value, including a dropout layer. It then takes the output of attention and runs it through another dense layer with layer normalization and dropout. Finally, the output is the actual representation that is fed into the next encoder (or the decoder) for further processing.

The attention block is the most important part to discuss because the main bulk of the work happens there:

(attention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=768, out_features=768, bias=True)
    (key): Linear(in_features=768, out_features=768, bias=True)
    (value): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

Multi-Headed Self-Attention

Multi-Headed Attention is a key feature of the transformer architecture used in models like BERT. Each encoder has multi-headed attention. Instead of having a single attention mechanism, Multi-Headed Attention splits the attention mechanism into multiple smaller mechanisms that work side by side, each learning different aspects of the input representation. The main aim of multi-headed self-attention is to perform attention(Q, K, V) again and again with different sets of weights. Each of these attention heads operates independently (none of them talk to each other), and their results are concatenated to produce the final representation.

image.png
Image is retrieved from https://towardsdatascience.com/

In the BERT base model, there are 12 encoder layers and each encoder has 12 attention heads, meaning there are 144 separate attention calculations happening simultaneously, with each head potentially learning a different aspect of language. Research has shown that certain heads in the BERT model focus on specific aspects of language, such as direct object relations or possessive pronouns.
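Since 768 / 12 = 64, each head works in a 64-dimensional slice of the hidden vector. The following is only shape bookkeeping, not BERT's actual implementation, but it shows how the hidden dimension is split across heads and concatenated back:

import torch

hidden_size, n_heads = 768, 12
head_dim = hidden_size // n_heads              # 64 dimensions per head

x = torch.rand(1, 20, hidden_size)             # (batch, sequence length, hidden size)

# Split the hidden dimension into 12 heads of 64 dimensions each
per_head = x.view(1, 20, n_heads, head_dim).transpose(1, 2)    # (1, 12, 20, 64)

# ... each head computes attention independently on its own slice ...

# Concatenate the heads back together to recover the original hidden size
merged = per_head.transpose(1, 2).reshape(1, 20, hidden_size)  # (1, 20, 768)
print(per_head.shape, merged.shape)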

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import torch
# transformers is HuggingFace library
from transformers import BertTokenizer, BertModel
from bertviz import head_view # Attention visualization
In [6]:
# Load the vanilla BERT-base tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
In [7]:
text = "Almost before we knew it, we had left the ground. The unknown holds its grounds."

tokens = tokenizer.encode(text)            # transform words (tokens) into numbers
inputs = torch.tensor(tokens).unsqueeze(0) # unsqueeze changes the shape from (20,) -> (1, 20)
inputs
Out[7]:
tensor([[ 101, 2471, 2077, 2057, 2354, 2009, 1010, 2057, 2018, 2187, 1996, 2598,
         1012, 1996, 4242, 4324, 2049, 5286, 1012,  102]])

Pass the sentence above through 12 encoders each with 12 attention heads: 144 different calculations all at once.

In [8]:
# Obtain the base BERT pretrained language model
model_Bert = BertModel.from_pretrained('bert-base-uncased')

# Obtain the attention scores from BERT
attention = model_Bert(inputs, output_attentions=True)[2]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Take the final (last) encoder layer and average its attention across all 12 heads: this represents the final attention score for each of our tokens. We are looking for a matrix in which each row corresponds to one of our tokens and adds up to 1. A higher value in a row signifies how much attention that token pays to the token in the corresponding column. Notice that a lot of attention is paid to the punctuation tokens.

In [9]:
final_attention_mean = attention[-1].mean(1)[0]
In [10]:
df_attention = pd.DataFrame(final_attention_mean.detach()).applymap(float).round(4)
df_attention.columns = tokenizer.convert_ids_to_tokens(tokens)
df_attention.index = tokenizer.convert_ids_to_tokens(tokens)
df_attention 
Out[10]:
[CLS] almost before we knew it , we had left the ground . the unknown holds its grounds . [SEP]
[CLS] 0.1262 0.0181 0.0137 0.0368 0.0157 0.0090 0.0976 0.0322 0.0436 0.0182 0.0219 0.0538 0.1011 0.0548 0.0635 0.0474 0.0501 0.0362 0.0653 0.0946
almost 0.0338 0.0436 0.0286 0.0188 0.0245 0.0155 0.3000 0.0180 0.0253 0.0165 0.0167 0.0233 0.3188 0.0218 0.0172 0.0107 0.0196 0.0101 0.0136 0.0234
before 0.0174 0.0104 0.1099 0.0109 0.0273 0.0274 0.3433 0.0108 0.0124 0.0144 0.0066 0.0128 0.3619 0.0031 0.0042 0.0027 0.0041 0.0020 0.0033 0.0152
we 0.0381 0.0114 0.0113 0.0418 0.0134 0.0122 0.3188 0.0282 0.0166 0.0124 0.0125 0.0227 0.3366 0.0147 0.0189 0.0175 0.0134 0.0097 0.0157 0.0341
knew 0.0147 0.0071 0.0215 0.0113 0.1397 0.0450 0.3188 0.0094 0.0102 0.0246 0.0083 0.0162 0.3360 0.0041 0.0075 0.0052 0.0022 0.0023 0.0036 0.0123
it 0.0236 0.0108 0.0220 0.0136 0.0305 0.0643 0.3457 0.0126 0.0104 0.0173 0.0090 0.0151 0.3642 0.0054 0.0143 0.0052 0.0049 0.0029 0.0056 0.0224
, 0.0124 0.0024 0.0026 0.0075 0.0023 0.0016 0.4483 0.0059 0.0030 0.0026 0.0021 0.0044 0.4785 0.0018 0.0053 0.0020 0.0013 0.0016 0.0037 0.0107
we 0.0366 0.0104 0.0112 0.0314 0.0126 0.0104 0.3201 0.0334 0.0228 0.0149 0.0119 0.0216 0.3393 0.0143 0.0159 0.0170 0.0145 0.0106 0.0209 0.0304
had 0.0309 0.0127 0.0243 0.0234 0.0314 0.0094 0.3032 0.0226 0.0949 0.0176 0.0145 0.0162 0.3243 0.0053 0.0106 0.0067 0.0046 0.0032 0.0135 0.0307
left 0.0286 0.0075 0.0152 0.0111 0.0208 0.0076 0.3339 0.0113 0.0132 0.1071 0.0121 0.0247 0.3554 0.0037 0.0125 0.0029 0.0020 0.0027 0.0064 0.0213
the 0.0274 0.0108 0.0112 0.0177 0.0149 0.0077 0.3259 0.0147 0.0132 0.0230 0.0416 0.0338 0.3449 0.0172 0.0256 0.0088 0.0122 0.0096 0.0150 0.0250
ground 0.0344 0.0113 0.0091 0.0171 0.0114 0.0072 0.3337 0.0136 0.0128 0.0221 0.0169 0.0615 0.3536 0.0077 0.0260 0.0046 0.0045 0.0065 0.0122 0.0340
. 0.0127 0.0025 0.0027 0.0077 0.0024 0.0016 0.4474 0.0061 0.0031 0.0026 0.0021 0.0045 0.4774 0.0018 0.0055 0.0020 0.0013 0.0017 0.0038 0.0110
the 0.0429 0.0031 0.0031 0.0112 0.0032 0.0033 0.3036 0.0085 0.0039 0.0059 0.0100 0.0120 0.3214 0.0784 0.0534 0.0254 0.0485 0.0248 0.0182 0.0191
unknown 0.0509 0.0065 0.0051 0.0129 0.0067 0.0057 0.2953 0.0101 0.0055 0.0159 0.0130 0.0272 0.3109 0.0285 0.1159 0.0101 0.0179 0.0229 0.0121 0.0268
holds 0.0188 0.0017 0.0022 0.0059 0.0026 0.0018 0.3730 0.0044 0.0025 0.0037 0.0037 0.0056 0.3970 0.0106 0.0149 0.0847 0.0192 0.0300 0.0069 0.0108
its 0.0319 0.0026 0.0018 0.0077 0.0018 0.0021 0.3099 0.0064 0.0031 0.0030 0.0071 0.0074 0.3293 0.0307 0.0336 0.0459 0.0990 0.0432 0.0185 0.0148
grounds 0.0344 0.0028 0.0022 0.0063 0.0029 0.0024 0.3457 0.0047 0.0029 0.0038 0.0041 0.0071 0.3682 0.0195 0.0234 0.0176 0.0245 0.0941 0.0157 0.0175
. 0.0846 0.0127 0.0108 0.0209 0.0033 0.0040 0.3066 0.0138 0.0154 0.0053 0.0064 0.0070 0.3230 0.0220 0.0175 0.0220 0.0186 0.0197 0.0456 0.0409
[SEP] 0.0586 0.0208 0.0162 0.0372 0.0173 0.0122 0.2380 0.0308 0.0243 0.0136 0.0203 0.0287 0.2480 0.0320 0.0370 0.0275 0.0258 0.0195 0.0357 0.0563

Given a sequence of tokens X = (x1, x2, . . . , xn), BERT wraps the input sentence with [CLS] and [SEP] tokens.

  • CLS stands for "classification." It is a special token added to the beginning of each input sequence in BERT. This token is used to represent the entire sequence and is trained to predict the correct classification label for the input sequence. The output of the CLS token is used as an embedding representation of the input sequence for downstream classification tasks.

  • SEP stands for "separator." It is another special token used in BERT to separate two sentences or sequences of words that are concatenated together as input. For example, when using BERT to perform sentence pair classification, two input sentences are concatenated with a SEP token in between them to indicate the end of the first sentence and the beginning of the second one.

In summary, the CLS token is used to represent the entire input sequence and is used as the embedding for classification tasks, while the SEP token is used to separate sequences that are concatenated together as input.
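A quick way to see where the special tokens land is to tokenize a sentence pair with the tokenizer loaded above; the two sentences here are just an illustration:

pair = tokenizer("The cup was full.", "The bottle was empty.")
print(tokenizer.convert_ids_to_tokens(pair['input_ids']))
# Expected pattern: [CLS] <sentence one> [SEP] <sentence two> [SEP]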

Each row is a probability distribution: the values in each row add up to 1 (the columns do not).

The bertviz library provides a very nice interactive visualization of how the heads and encoder layers are working. Each color represents one head in the selected encoder layer.

In [11]:
list_tokens= tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, list_tokens)

In this visualization, the layers correspond to the encoders.

In [12]:
# Zoom in on the third encoding layer and its second head (indices 2 and 1)
head_view(attention, tokenizer.convert_ids_to_tokens(inputs[0]), layer=2, heads=[1])
# This shows the attention scores for that head and layer
In [13]:
# The paper claimed that the 8th encoding layer's 10th head relates a direct object to its verb
head_view(attention, tokenizer.convert_ids_to_tokens(inputs[0]), layer=7, heads=[9])

BERT combines all of those relations into its final representation.

In [14]:
# In the attention for the 8th encoder's 10th head we can see direct-object attention.
# We do not average across heads this time.
eighth_tenth = attention[7][0][9]
In [15]:
# Get the attention matrix
attention_df = pd.DataFrame(eighth_tenth.detach()).round(4)

attention_df.columns = tokenizer.convert_ids_to_tokens(tokens)
attention_df.index = tokenizer.convert_ids_to_tokens(tokens)

attention_df  # sums across rows add up to 1. sums across columns do not
Out[15]:
[CLS] almost before we knew it , we had left the ground . the unknown holds its grounds . [SEP]
[CLS] 0.0083 0.0037 0.0013 0.0026 0.0025 0.0033 0.0016 0.0042 0.0033 0.0032 0.0041 0.0124 0.0033 0.0045 0.0016 0.0033 0.0117 0.0264 0.0333 0.8654
almost 0.0621 0.0455 0.0087 0.0023 0.0004 0.0042 0.0049 0.0049 0.0053 0.0017 0.0006 0.0012 0.0125 0.0026 0.0001 0.0003 0.0004 0.0017 0.0697 0.7709
before 0.1622 0.1840 0.0251 0.0206 0.0003 0.0033 0.0012 0.0044 0.0039 0.0011 0.0007 0.0005 0.0040 0.0046 0.0002 0.0005 0.0009 0.0004 0.0187 0.5635
we 0.0143 0.0430 0.6085 0.0220 0.0122 0.0024 0.0048 0.0011 0.0011 0.0006 0.0001 0.0004 0.0010 0.0007 0.0013 0.0005 0.0004 0.0014 0.0054 0.2788
knew 0.0035 0.0134 0.7772 0.0143 0.0081 0.0015 0.0025 0.0003 0.0002 0.0002 0.0000 0.0001 0.0001 0.0001 0.0003 0.0000 0.0001 0.0001 0.0010 0.1770
it 0.0037 0.0044 0.0994 0.0282 0.4465 0.0062 0.0010 0.0020 0.0002 0.0029 0.0003 0.0021 0.0004 0.0002 0.0022 0.0021 0.0012 0.0051 0.0015 0.3904
, 0.0570 0.0105 0.0112 0.0081 0.0129 0.0337 0.0250 0.0077 0.0161 0.0022 0.0004 0.0007 0.0002 0.0001 0.0000 0.0001 0.0001 0.0005 0.0559 0.7576
we 0.0157 0.0354 0.0298 0.0047 0.0099 0.0072 0.0814 0.0180 0.0363 0.0055 0.0004 0.0018 0.0007 0.0002 0.0007 0.0004 0.0003 0.0029 0.0252 0.7236
had 0.0216 0.0440 0.0224 0.0047 0.0029 0.0047 0.2535 0.0589 0.0659 0.0034 0.0004 0.0017 0.0014 0.0004 0.0002 0.0001 0.0002 0.0007 0.0282 0.4846
left 0.0065 0.0359 0.0287 0.0036 0.0022 0.0021 0.3632 0.0984 0.1780 0.0079 0.0011 0.0048 0.0023 0.0004 0.0002 0.0002 0.0004 0.0004 0.0038 0.2599
the 0.0031 0.0014 0.0034 0.0005 0.0044 0.0020 0.0243 0.0104 0.0129 0.8723 0.0078 0.0228 0.0063 0.0001 0.0002 0.0022 0.0002 0.0035 0.0015 0.0207
ground 0.0018 0.0014 0.0006 0.0008 0.0006 0.0008 0.0056 0.0128 0.0119 0.6486 0.0667 0.0244 0.0027 0.0005 0.0006 0.0021 0.0011 0.0019 0.0011 0.2138
. 0.0179 0.0038 0.0006 0.0008 0.0013 0.0042 0.0117 0.0172 0.0357 0.0147 0.0128 0.0138 0.0395 0.0079 0.0006 0.0035 0.0030 0.0163 0.0392 0.7554
the 0.0185 0.0023 0.0040 0.0014 0.0024 0.0030 0.0026 0.0050 0.0032 0.0146 0.0006 0.0105 0.1603 0.0284 0.0149 0.0217 0.0045 0.2451 0.0510 0.4062
unknown 0.0216 0.0093 0.0079 0.0026 0.0009 0.0037 0.0006 0.0017 0.0006 0.0024 0.0008 0.0069 0.0367 0.0839 0.0335 0.0146 0.0214 0.0945 0.0154 0.6410
holds 0.0306 0.0010 0.0007 0.0005 0.0002 0.0002 0.0005 0.0008 0.0008 0.0011 0.0001 0.0005 0.3007 0.0319 0.0038 0.0414 0.0127 0.0116 0.0524 0.5084
its 0.0048 0.0002 0.0004 0.0002 0.0011 0.0003 0.0002 0.0005 0.0008 0.0056 0.0006 0.0016 0.1137 0.0152 0.0096 0.3217 0.0278 0.2357 0.0219 0.2381
grounds 0.0048 0.0003 0.0002 0.0004 0.0003 0.0001 0.0002 0.0007 0.0002 0.0015 0.0001 0.0003 0.0242 0.0052 0.0036 0.4512 0.0459 0.0360 0.0117 0.4132
. 0.0068 0.0009 0.0003 0.0006 0.0014 0.0013 0.0004 0.0016 0.0014 0.0012 0.0010 0.0020 0.0181 0.0125 0.0012 0.0419 0.0244 0.1074 0.1777 0.5978
[SEP] 0.0166 0.0028 0.0042 0.0038 0.0023 0.0027 0.0033 0.0048 0.0044 0.0059 0.0038 0.0068 0.0096 0.0069 0.0043 0.0045 0.0060 0.0080 0.0298 0.8695

"ground" pays very high attention (64.86%) to "left".

Transfer Learning

Transfer learning is a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task. To put it simply—a model trained on one task is repurposed on a second, related task as an optimization that allows rapid progress when modeling the second task.

In NLP, transfer learning is achieved by first pre-training a model on an unlabeled text corpus in an unsupervised or semi-supervised manner, and then fine-tuning (updating) the model on a smaller labeled dataset for a specific NLP task. If training were done only on the smaller dataset, without pre-training, it would not be possible to get high-performance results. For NLP we can use BERT, and for image classification we can use a ResNet, for example.

image.png Image retrieved from Sinan Ozdemir

For example, BERT has been pretrained on two main corpora: English Wikipedia (2.5B words) and BookCorpus (800M words), a collection of free books. BERT went through these resources multiple times to gain a general understanding of language.

The BERT weights learned during pretraining are then fine-tuned. Moreover, a separate layer is added on top of the BERT model.

So, there are three fine-tuning approaches:

  1. Add any additional layers on top while updating the entire model on labeled data. This is a common approach: all aspects of the model are updated. See the Figure below. This is usually the slowest approach but has the highest performance.

image.png

  2. Freeze some part of the model, for example keeping some of BERT's weights unchanged while updating the others. This approach has average speed and average performance.

image-2.png

  3. Freeze the entire model and only train the additional layers that we added on top, which are feed-forward classifiers. This is the fastest approach but has the worst performance. It can only be used for generic tasks. (A sketch of freezing appears after the figures below.)

image-3.png Images are retrieved from Sinan Ozdemir
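As a rough sketch of approaches 2 and 3 (using a hypothetical 2-class classification head, not the setup used later in this notebook), BERT's parameters can be frozen before adding a task-specific layer on top:

import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

# Approach 3: freeze the entire BERT body and train only the added classifier
for param in bert.parameters():
    param.requires_grad = False

# Additional feed-forward layer added on top of the pooled [CLS] representation
classifier = nn.Linear(bert.config.hidden_size, 2)  # hypothetical 2-class task

# Approach 2 would instead freeze only part of the model, e.g. the embeddings:
# for param in bert.embeddings.parameters():
#     param.requires_grad = False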

Pytorch Introduction

In [16]:
import torch
import numpy as np

PyTorch is an open-source machine learning (ML) framework based on the Python programming language and the Torch library. It is a library that makes deep learning accessible to us. It has two purposes:

  1. A NumPy replacement that can leverage GPUs and other acceleration techniques. This is helpful for big data and makes computations run faster.

  2. An optimized approach to implementing neural networks with less computational cost.

The basic PyTorch objects are tensors.

In [17]:
# One dimension
torch.tensor([2,3,5])
Out[17]:
tensor([2, 3, 5])
In [18]:
# One dimension
torch.LongTensor([2,3,5])
Out[18]:
tensor([2, 3, 5])
In [19]:
# Two dimension
torch.tensor([[2,3,5],[1,2,9]])
Out[19]:
tensor([[2, 3, 5],
        [1, 2, 9]])
In [20]:
np.array([[2,3,5],[1,2,9]])
Out[20]:
array([[2, 3, 5],
       [1, 2, 9]])
In [21]:
c_torch=torch.zeros(2, 2)
c_torch
Out[21]:
tensor([[0., 0.],
        [0., 0.]])
In [22]:
c_numpy=np.zeros((2, 2))
c_numpy
Out[22]:
array([[0., 0.],
       [0., 0.]])
In [23]:
torch.from_numpy(c_numpy)
Out[23]:
tensor([[0., 0.],
        [0., 0.]], dtype=torch.float64)
In [24]:
c_torch.numpy()
Out[24]:
array([[0., 0.],
       [0., 0.]], dtype=float32)
In [25]:
two_d=torch.tensor([[2,3,5],[1,2,9]])
print(f'two_d {two_d}')
print(f'shape is {two_d.shape}, and dim is {two_d.dim()}')
two_d tensor([[2, 3, 5],
        [1, 2, 9]])
shape is torch.Size([2, 3]), and dim is 2
  • unsqueeze adds a dimension at a certain location, forcing that dimension to exist
In [26]:
two_d_unsqueeze=two_d.unsqueeze(0)
print(f'two_d_unsqueeze {two_d_unsqueeze}')
print(f'shape is {two_d_unsqueeze.shape}, and dim is {two_d_unsqueeze.dim()}')
two_d_unsqueeze tensor([[[2, 3, 5],
         [1, 2, 9]]])
shape is torch.Size([1, 2, 3]), and dim is 3

unsqueeze is very useful for forcing a "batch" dimension when we want to predict on a single example.

Fine-tuning transformer with Native Pytorch

  1. Use the training data to update the model
  2. The model computes a loss function that indicates how right or wrong its predictions on the training data are
  3. Compute gradients to optimize the weights, which updates the model

Steps 1 to 3 are repeated until we are satisfied with the model's performance. This is a manual process that can be tedious. See the Figure below:

image.png Image retrieved from Sinan Ozdemir
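A bare-bones version of that manual loop might look like the following sketch; the two-sentence "dataset" and the sentiment labels are made up purely to exercise the loss/backward/step cycle:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Tiny made-up labeled batch, standing in for a real training set
batch = tokenizer(["great movie", "terrible movie"], return_tensors='pt', padding=True)
labels = torch.tensor([1, 0])

model.train()
for step in range(3):                   # 1. use the training data to update the model
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    loss = outputs.loss                 # 2. the loss says how right or wrong the model is
    loss.backward()                     # 3. compute gradients ...
    optimizer.step()                    #    ... and update the weights
    print(step, loss.item())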

Fine-tuning with HuggingFace's Trainer

To address the problem above, we can use HuggingFace's Trainer API for the training loop. It takes the entire loop above, including the loss, gradient calculation, and optimization, and wraps them in a single API called Trainer. image.png Image retrieved from Sinan Ozdemir

The key objects are:

  • Dataset - the data, split into a training set and a test set
  • DataCollator - converts the dataset into batches
  • TrainingArguments - monitors and tracks the training arguments, including the saving strategy and scheduler parameters
  • Trainer - the API to PyTorch (a sketch of wiring these together follows below)
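A hedged sketch of how those objects fit together; train_dataset and eval_dataset are assumed to be already-tokenized labeled datasets, and argument names can vary slightly across transformers versions:

from transformers import (BertTokenizer, BertForSequenceClassification,
                          TrainingArguments, Trainer, DataCollatorWithPadding)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

training_args = TrainingArguments(
    output_dir='./results',             # where checkpoints are saved
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',        # evaluate at the end of every epoch
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # assumed: tokenized labeled training split
    eval_dataset=eval_dataset,          # assumed: tokenized held-out split
    data_collator=DataCollatorWithPadding(tokenizer),
)

trainer.train()                         # runs the whole loss/gradient/update loop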

NLP with BERT

BERT stands for Bi-directional Encoder Representation from Transformers:

  • Bi-directional: an auto-encoding language model.
  • Encoder: only the encoder of the transformer is used.
  • Representation: relies on self-attention.
  • Transformers: the transformer architecture is where the encoder comes from.

A sentence is fed into BERT to get a **context-full** representation (vector embedding) of every word in the sentence. The context of each word is understood by the encoder using a multi-head attention mechanism (relating each word to every other word in the sentence).

BERT comes in different sizes. The base model has 12 encoders, which is a good mix of complexity, size, and speed. BERT-small has 4 encoders and BERT-large has 24 encoders. image.png

BERT's Architecture

In [27]:
# Load the pretrained BERT-base model with 12 encoders and 110M parameters
model_BERT_base = BertModel.from_pretrained('bert-base-uncased')
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
In [28]:
# Model's parameters 
n_params = list(model_BERT_base.named_parameters())
print(f'The BERT model has {len(n_params)} different parameters')
The BERT model has 199 different parameters
In [29]:
print('********* Embedding Layer *********\n')
for par in n_params[0:5]:
    print(f'{par[0], str(tuple(par[1].size()))}')
********* Embedding Layer *********

('embeddings.word_embeddings.weight', '(30522, 768)')
('embeddings.position_embeddings.weight', '(512, 768)')
('embeddings.token_type_embeddings.weight', '(2, 768)')
('embeddings.LayerNorm.weight', '(768,)')
('embeddings.LayerNorm.bias', '(768,)')

embeddings.word_embeddings.weight: (30522, 768) means there are 30522 tokens that BERT is aware of and can use for any NLP task; 768 means that each token has a contextless embedding of dimension 768.
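A quick sanity check of those numbers, reusing model_BERT_base and torch from the cells above:

word_emb = model_BERT_base.embeddings.word_embeddings   # nn.Embedding(30522, 768)
print(word_emb.num_embeddings, word_emb.embedding_dim)  # vocabulary size, embedding dimension

# Contextless embedding of a single token id (2009 was "it" in the earlier example)
print(word_emb(torch.tensor([2009])).shape)             # torch.Size([1, 768])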

In [30]:
print('********* First Encoder ********* \n')
for par in n_params[5:21]:
    print(f'{par[0], str(tuple(par[1].size()))}')
********* First Encoder ********* 

('encoder.layer.0.attention.self.query.weight', '(768, 768)')
('encoder.layer.0.attention.self.query.bias', '(768,)')
('encoder.layer.0.attention.self.key.weight', '(768, 768)')
('encoder.layer.0.attention.self.key.bias', '(768,)')
('encoder.layer.0.attention.self.value.weight', '(768, 768)')
('encoder.layer.0.attention.self.value.bias', '(768,)')
('encoder.layer.0.attention.output.dense.weight', '(768, 768)')
('encoder.layer.0.attention.output.dense.bias', '(768,)')
('encoder.layer.0.attention.output.LayerNorm.weight', '(768,)')
('encoder.layer.0.attention.output.LayerNorm.bias', '(768,)')
('encoder.layer.0.intermediate.dense.weight', '(3072, 768)')
('encoder.layer.0.intermediate.dense.bias', '(3072,)')
('encoder.layer.0.output.dense.weight', '(768, 3072)')
('encoder.layer.0.output.dense.bias', '(768,)')
('encoder.layer.0.output.LayerNorm.weight', '(768,)')
('encoder.layer.0.output.LayerNorm.bias', '(768,)')
In [31]:
print('********* Output Layer ********* \n')
for par in n_params[-2:]:
    print(f'{par[0], str(tuple(par[1].size()))}')    
********* Output Layer ********* 

('pooler.dense.weight', '(768, 768)')
('pooler.dense.bias', '(768,)')

pooler: a separate feed-forward layer with a hyperbolic tangent (tanh) activation function. When we use BERT, this pooler takes the vector embedding of the token that represents the entire sentence (the [CLS] token), not of a particular word.

In [32]:
# load the bert-base uncased tokenizer.
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
In [33]:
# tokenize a sequence
tokenizer_sentence=tokenizer_bert.encode('AI has been my friend') 
tokenizer_sentence
Out[33]:
[101, 9932, 2038, 2042, 2026, 2767, 102]

We always have token 101 at the start, which is the [CLS] token, and 102 at the end, which is the [SEP] token. These are added automatically by the tokenizer.

We can run these tokens through the model:

In [34]:
# Running the tokens through the model
response = model_BERT_base(torch.tensor(tokenizer_sentence).unsqueeze(0))

The code above does the following:

  1. Converts tokenizer_sentence into a tensor with a size of (7,)
  2. Simulates a batch by unsqueezing a first dimension to give a shape of (1, 7)

Passing this through our BERT model leads to many outputs.

In [35]:
response
Out[35]:
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0065,  0.0303, -0.1594,  ..., -0.1599,  0.1518,  0.3864],
         [-0.2074, -0.4378,  0.0418,  ..., -0.2403, -0.0033,  0.4402],
         [ 0.2448, -0.3865, -0.2682,  ..., -0.0998,  0.0463,  0.6762],
         ...,
         [ 0.2216,  0.2247,  0.6810,  ...,  0.0474, -0.0571,  0.0918],
         [-0.3868, -0.4962,  0.1083,  ...,  0.7687,  0.1917,  0.4949],
         [ 0.6903,  0.0883, -0.1104,  ...,  0.1298, -0.7293, -0.4013]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-8.4019e-01, -3.1882e-01, -1.1802e-01,  6.1138e-01,  1.2458e-01,
         -7.2390e-02,  7.2166e-01,  1.2822e-01,  1.3944e-02, -9.9987e-01,
          1.6440e-01,  6.1015e-01,  9.8340e-01, -1.1298e-01,  9.2865e-01,
         -4.2231e-01,  2.1135e-01, -5.8620e-01,  3.3227e-01, -4.0761e-01,
          6.7032e-01,  9.9373e-01,  4.8922e-01,  2.7396e-01,  3.8718e-01,
          7.8635e-01, -5.6028e-01,  9.2875e-01,  9.4346e-01,  7.1670e-01,
         -4.6647e-01,  1.9079e-02, -9.8936e-01, -6.4714e-02, -2.6505e-01,
         -9.8804e-01,  2.1584e-01, -7.0535e-01,  1.0201e-01,  1.6985e-01,
         -8.9834e-01,  2.1958e-01,  9.9947e-01, -4.8887e-01,  1.9320e-01,
         -2.2646e-01, -9.9998e-01,  1.6410e-01, -8.8739e-01,  8.9277e-02,
          1.1545e-01,  5.5763e-02,  7.3313e-02,  3.1227e-01,  3.8665e-01,
          1.3411e-01, -1.8745e-01,  4.4677e-02, -1.6519e-01, -4.5726e-01,
         -5.6030e-01,  3.0404e-01, -3.1251e-01, -9.0514e-01, -1.0218e-01,
         -1.1786e-02, -2.8253e-02, -1.6019e-01,  5.1660e-02, -2.1202e-01,
          8.1708e-01,  1.7194e-01,  3.2267e-01, -8.5822e-01, -2.6363e-01,
          1.3043e-01, -6.0094e-01,  1.0000e+00, -2.4463e-01, -9.7932e-01,
          1.4845e-01, -8.1988e-02,  4.7916e-01,  5.5246e-01, -3.4487e-01,
         -1.0000e+00,  1.6188e-01, -5.4785e-02, -9.8982e-01,  1.9753e-01,
          3.3677e-01, -5.3499e-02, -1.5404e-01,  5.4101e-01, -1.4771e-01,
         -3.2175e-01, -1.9403e-01, -1.2564e-01, -1.8953e-01, -8.3452e-02,
          9.5139e-02, -8.3374e-02,  1.9742e-02, -3.1122e-01,  1.2161e-01,
         -3.3803e-01, -4.2167e-01,  3.1580e-01, -2.3429e-01,  5.1750e-01,
          3.9368e-01, -2.4697e-01,  3.4244e-01, -9.5657e-01,  5.6359e-01,
         -1.7688e-01, -9.8256e-01, -5.6589e-01, -9.8972e-01,  6.6694e-01,
          5.6286e-02, -6.9763e-02,  9.6668e-01,  3.0583e-01,  1.8306e-01,
          1.2463e-01, -8.1691e-02, -1.0000e+00, -4.2852e-01, -1.0079e-01,
          2.8287e-01,  4.2447e-02, -9.7421e-01, -9.5157e-01,  4.7280e-01,
          9.5219e-01,  8.8323e-02,  9.9850e-01, -1.4773e-01,  9.3264e-01,
          1.5195e-01, -2.5326e-01, -6.5234e-02, -3.7121e-01,  4.6421e-01,
          6.1250e-02, -4.8675e-01,  1.2147e-01, -1.5196e-01, -5.3219e-02,
         -1.7494e-01, -1.4436e-01,  8.4322e-02, -9.3840e-01, -3.4755e-01,
          9.5690e-01,  5.0465e-02, -1.8532e-01,  5.7495e-01, -1.1385e-02,
         -3.5609e-01,  7.9381e-01,  4.3387e-01,  2.6833e-01, -6.5987e-02,
          3.4605e-01, -2.9071e-01,  4.4789e-01, -7.3594e-01,  1.6916e-01,
          2.5087e-01, -1.7506e-01,  5.4941e-03, -9.7758e-01, -1.9994e-01,
          3.3828e-01,  9.8842e-01,  6.5638e-01,  1.2881e-01,  2.6131e-01,
         -2.0814e-01,  4.5110e-01, -9.3941e-01,  9.8110e-01, -5.8313e-02,
          1.2784e-01,  1.8424e-01, -1.7952e-01, -8.1037e-01, -1.6230e-01,
          7.1538e-01, -3.6188e-01, -8.3163e-01,  1.5908e-01, -4.5125e-01,
         -2.9951e-01, -2.4296e-01,  3.7822e-01, -2.4523e-01, -4.0879e-01,
          2.9153e-02,  9.2314e-01,  9.1886e-01,  7.1969e-01, -5.0266e-01,
          5.2195e-01, -9.0451e-01, -3.5395e-01,  1.0104e-02,  1.4049e-01,
          2.4304e-02,  9.9160e-01, -3.7990e-01,  2.4506e-02, -9.3874e-01,
         -9.8658e-01, -1.4311e-01, -8.7904e-01, -3.8416e-03, -5.8721e-01,
          4.5871e-01, -5.3391e-02, -9.3333e-02,  3.1171e-01, -9.6045e-01,
         -6.9689e-01,  3.1778e-01, -2.4635e-01,  3.3390e-01, -2.1891e-01,
          7.4805e-01,  3.7863e-01, -4.4383e-01,  4.7100e-01,  8.9908e-01,
         -9.5270e-02, -7.7301e-01,  6.7432e-01, -2.1779e-01,  8.1611e-01,
         -5.8778e-01,  9.7637e-01,  2.9445e-01,  5.0835e-01, -9.1879e-01,
          3.7507e-02, -8.3892e-01,  1.2187e-01, -3.9495e-03, -5.5968e-01,
          6.3451e-02,  5.8330e-01,  2.2523e-01,  7.8549e-01, -3.9388e-01,
          9.8227e-01, -8.4841e-01, -9.5250e-01, -8.6930e-02, -8.5167e-02,
         -9.8785e-01,  1.9011e-01,  1.9643e-01, -9.4055e-02, -3.5262e-01,
         -3.4306e-01, -9.5470e-01,  7.6475e-01,  1.1350e-02,  9.6574e-01,
         -1.6289e-01, -8.4739e-01, -2.8138e-01, -9.1398e-01, -2.1734e-01,
         -9.5179e-02,  4.4677e-01, -2.2065e-01, -9.4875e-01,  3.9212e-01,
          5.4968e-01,  3.8606e-01,  2.4907e-01,  9.8885e-01,  9.9990e-01,
          9.7885e-01,  8.8317e-01,  8.0030e-01, -9.7612e-01, -3.4398e-01,
          9.9997e-01, -7.9403e-01, -1.0000e+00, -9.2567e-01, -4.3925e-01,
          2.1243e-01, -1.0000e+00, -8.2615e-02,  1.8239e-01, -9.1405e-01,
         -1.8479e-01,  9.7302e-01,  9.6865e-01, -1.0000e+00,  8.2077e-01,
          9.4452e-01, -6.1627e-01,  5.5988e-01, -2.3957e-01,  9.7237e-01,
          1.8515e-01,  3.9561e-01, -1.4897e-01,  2.1397e-01, -3.5993e-01,
         -7.4232e-01,  2.9146e-01, -1.5196e-02,  9.1940e-01,  3.5501e-02,
         -6.6667e-01, -9.3237e-01,  2.6392e-01,  2.0782e-02, -2.6602e-01,
         -9.6166e-01, -1.3218e-01, -2.0757e-01,  5.7610e-01,  1.9222e-02,
          1.1422e-01, -6.9864e-01,  1.9141e-02, -6.4413e-01,  2.6498e-01,
          6.3279e-01, -9.2361e-01, -5.4853e-01,  4.0075e-01, -5.9763e-01,
          1.9277e-01, -9.6033e-01,  9.6271e-01, -2.9977e-01, -1.6034e-02,
          1.0000e+00, -1.7835e-01, -8.8636e-01,  3.4273e-01,  9.2608e-02,
          1.9448e-02,  1.0000e+00,  5.7300e-01, -9.7969e-01, -5.7004e-01,
          3.8942e-01, -3.4289e-01, -4.9846e-01,  9.9842e-01, -1.1213e-01,
          1.7271e-01,  2.3863e-01,  9.7849e-01, -9.8995e-01,  8.3309e-01,
         -8.7909e-01, -9.7500e-01,  9.6340e-01,  9.3254e-01, -2.7004e-02,
         -5.3194e-01, -4.9071e-02, -8.6259e-02,  1.2259e-01, -9.4202e-01,
          5.3747e-01,  3.1679e-01, -6.0265e-02,  9.0453e-01, -5.5356e-01,
         -5.8072e-01,  2.5374e-01,  2.6564e-01,  4.5445e-01,  3.3368e-01,
          3.9240e-01, -1.6495e-01, -1.5121e-02, -1.7205e-01, -4.1955e-01,
         -9.7607e-01,  3.5962e-01,  1.0000e+00,  1.2623e-01, -3.7409e-02,
         -2.0925e-02, -1.2303e-02, -2.8448e-01,  2.6054e-01,  3.7904e-01,
         -2.1519e-01, -8.2942e-01,  1.6569e-01, -9.0627e-01, -9.8925e-01,
          6.5700e-01,  1.6491e-01, -1.5381e-01,  9.9948e-01,  1.8580e-01,
          6.7990e-02, -1.8800e-01,  6.2918e-01, -1.1658e-01,  4.5580e-01,
         -1.8539e-01,  9.7872e-01, -1.5340e-01,  5.3704e-01,  7.2691e-01,
         -5.1653e-02, -2.9198e-01, -6.2096e-01, -6.6973e-02, -9.1705e-01,
          1.9522e-01, -9.5739e-01,  9.5797e-01, -2.7874e-02,  2.4168e-01,
          5.8768e-02,  1.7149e-01,  1.0000e+00, -3.3987e-01,  3.9393e-01,
          1.6514e-01,  6.7635e-01, -9.7061e-01, -7.3446e-01, -3.6084e-01,
          1.1563e-01,  1.9032e-01, -1.7933e-01,  1.0665e-01, -9.6862e-01,
         -1.2481e-01,  6.7816e-02, -9.6239e-01, -9.8988e-01,  3.2180e-01,
          5.3455e-01, -1.1899e-02, -7.0683e-01, -6.1434e-01, -6.2079e-01,
          2.4251e-01, -1.4921e-01, -9.3819e-01,  5.7455e-01, -1.5822e-01,
          3.1907e-01, -1.3497e-01,  5.5145e-01, -4.7374e-02,  8.4845e-01,
          1.9200e-01,  5.4274e-02,  3.4452e-02, -6.4710e-01,  7.2305e-01,
         -7.5373e-01, -4.4479e-01, -3.8771e-02,  1.0000e+00, -1.8918e-01,
          2.7436e-01,  6.2532e-01,  6.5794e-01, -2.1762e-02,  2.0079e-01,
          3.5000e-01,  1.0133e-01,  1.5298e-01,  3.5570e-01, -1.2939e-01,
         -2.0919e-01,  5.5322e-01, -1.9950e-01, -1.3132e-01,  7.8193e-01,
          3.6775e-01,  6.0601e-02,  1.1336e-01, -7.4062e-02,  9.9498e-01,
         -1.5782e-01, -4.6960e-02, -3.7331e-01,  1.0194e-01, -2.2308e-01,
         -3.0289e-02,  1.0000e+00,  1.9673e-01, -1.8919e-02, -9.9090e-01,
         -1.0400e-01, -8.7667e-01,  9.9960e-01,  7.8479e-01, -7.8601e-01,
          4.2355e-01,  2.4432e-01, -3.1730e-02,  6.1198e-01, -9.7466e-02,
         -1.1662e-01,  1.1271e-01,  4.6818e-02,  9.6686e-01, -3.1075e-01,
         -9.6792e-01, -4.9323e-01,  2.9655e-01, -9.5894e-01,  9.8621e-01,
         -3.9323e-01, -1.0892e-01, -1.6482e-01, -2.7801e-02,  1.2234e-02,
         -1.5106e-01, -9.8087e-01, -1.4156e-01,  3.6931e-03,  9.6774e-01,
          1.4032e-01, -5.6110e-01, -8.7167e-01, -1.9635e-01,  1.7122e-02,
         -1.0332e-01, -9.4580e-01,  9.6719e-01, -9.8246e-01,  4.4481e-01,
          1.0000e+00,  1.6533e-01, -6.1815e-01,  2.2743e-02, -2.5573e-01,
          1.9687e-01,  2.2244e-02,  4.4107e-01, -9.5761e-01, -2.9534e-01,
         -1.1568e-01,  1.7190e-01, -8.7656e-02, -8.3539e-02,  6.3041e-01,
          1.9310e-01, -5.2143e-01, -4.8012e-01, -7.6077e-02,  2.9015e-01,
          6.4318e-01, -1.6291e-01, -5.4982e-03, -2.8394e-02,  2.8721e-02,
         -9.0883e-01, -2.4054e-01, -2.8952e-01, -9.9737e-01,  4.4196e-01,
         -1.0000e+00, -2.1939e-01, -4.4858e-01, -1.7345e-01,  8.0803e-01,
          3.9329e-01,  1.4276e-01, -7.1020e-01,  1.2349e-01,  8.9528e-01,
          7.5468e-01, -1.3425e-01,  1.2547e-01, -6.6663e-01,  1.4947e-01,
         -1.6637e-02,  2.0310e-01, -9.5943e-02,  7.1982e-01, -1.2313e-01,
          1.0000e+00,  1.0626e-02, -4.3705e-01, -9.2045e-01,  1.2155e-01,
         -1.6007e-01,  9.9997e-01, -7.3421e-01, -9.4092e-01,  2.9346e-01,
         -4.8404e-01, -8.2087e-01,  3.0309e-01, -1.2000e-01, -5.9553e-01,
         -3.1017e-01,  9.4683e-01,  6.9984e-01, -5.2796e-01,  4.0851e-01,
         -1.8308e-01, -3.4092e-01, -3.1200e-02,  7.1636e-02,  9.8628e-01,
          4.7112e-03,  7.9730e-01,  7.9710e-02, -8.0972e-02,  9.6757e-01,
          1.5996e-01,  2.3156e-01,  8.3174e-02,  1.0000e+00,  2.0440e-01,
         -9.0639e-01,  4.4556e-01, -9.7806e-01, -3.1147e-03, -9.4551e-01,
          1.8154e-01, -1.9254e-02,  8.7372e-01, -9.5097e-02,  9.5773e-01,
          1.4778e-01, -1.5501e-02,  5.9342e-02,  4.3088e-01,  2.8630e-01,
         -9.0797e-01, -9.8606e-01, -9.8646e-01,  2.2276e-01, -4.1590e-01,
          7.8981e-02,  1.9840e-01,  4.4702e-02,  2.0835e-01,  3.0080e-01,
         -1.0000e+00,  9.3251e-01,  3.4374e-01,  1.3287e-01,  9.6515e-01,
          2.8173e-01,  3.6148e-01,  9.9562e-02, -9.8304e-01, -9.3873e-01,
         -2.8620e-01, -2.0811e-01,  6.3750e-01,  6.4102e-01,  8.0429e-01,
          3.0296e-01, -4.1761e-01, -3.9641e-01,  1.2098e-01, -7.2830e-01,
         -9.9235e-01,  2.9116e-01,  2.3890e-01, -8.9118e-01,  9.6165e-01,
         -6.5592e-01, -1.2626e-01,  5.1983e-01, -1.2479e-01,  8.3263e-01,
          6.9531e-01,  1.4468e-01,  2.8531e-02,  3.2526e-01,  8.8614e-01,
          8.7362e-01,  9.8784e-01, -1.1435e-01,  7.3967e-01, -1.5916e-05,
          3.4468e-01,  6.7965e-01, -9.3931e-01,  3.1565e-02,  1.7663e-01,
         -2.4366e-02,  1.4506e-01, -6.3675e-02, -9.0421e-01,  4.9323e-01,
         -2.1407e-01,  3.2881e-01, -3.9336e-01,  1.7543e-01, -3.0572e-01,
         -2.6955e-03, -6.2087e-01, -2.7020e-01,  6.6295e-01,  2.7706e-01,
          8.9237e-01,  6.2840e-01,  2.7339e-02, -5.3093e-01, -3.4315e-03,
          9.3736e-02, -9.0249e-01,  8.0867e-01,  1.7888e-01,  2.8157e-01,
         -1.4331e-01, -1.5794e-01,  7.4887e-01, -3.0997e-01, -3.0732e-01,
         -2.6704e-01, -5.0697e-01,  8.8122e-01, -4.6491e-01, -4.0141e-01,
         -3.3163e-01,  5.4524e-01,  2.2158e-01,  9.9518e-01,  2.9895e-02,
         -7.6689e-02, -3.6579e-01, -1.7576e-01,  2.5189e-01, -7.1489e-02,
         -1.0000e+00,  3.5259e-01,  3.5202e-02,  7.0007e-02,  3.7918e-02,
         -3.7228e-02, -4.2352e-02, -9.5362e-01, -8.5418e-02,  4.5871e-02,
         -1.0965e-01, -3.9032e-01, -2.7186e-01,  5.6127e-01,  3.4586e-01,
          5.9221e-01,  8.7096e-01,  1.4091e-01,  6.9970e-01,  6.3386e-01,
          7.6531e-02, -6.5625e-01,  8.8305e-01]], grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)
In [36]:
# Embedding for each token
response.last_hidden_state
Out[36]:
tensor([[[ 0.0065,  0.0303, -0.1594,  ..., -0.1599,  0.1518,  0.3864],
         [-0.2074, -0.4378,  0.0418,  ..., -0.2403, -0.0033,  0.4402],
         [ 0.2448, -0.3865, -0.2682,  ..., -0.0998,  0.0463,  0.6762],
         ...,
         [ 0.2216,  0.2247,  0.6810,  ...,  0.0474, -0.0571,  0.0918],
         [-0.3868, -0.4962,  0.1083,  ...,  0.7687,  0.1917,  0.4949],
         [ 0.6903,  0.0883, -0.1104,  ...,  0.1298, -0.7293, -0.4013]]],
       grad_fn=<NativeLayerNormBackward0>)

Each row represents a token in the sequence, and each vector represents that token's context within the greater sequence. As mentioned before, the first row is the [CLS] token.

In [37]:
# The size of pooler_output
response.pooler_output.shape
Out[37]:
torch.Size([1, 768])

pooler_output is meant to be representative of the entire sequence as a whole, not just an individual token. Its size, (1, 768), matches the output dimension of the pooler's weight matrix.
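Since the pooler is just a dense layer followed by a tanh applied to the [CLS] vector, its output can be reproduced by hand as a sketch:

# Apply the pooler's dense layer and tanh to the [CLS] embedding (first row) manually
cls_embedding = response.last_hidden_state[:, 0]                 # shape (1, 768)
manual_pooled = torch.tanh(model_BERT_base.pooler.dense(cls_embedding))
print(torch.allclose(manual_pooled, response.pooler_output, atol=1e-6))  # expected: True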

In [38]:
response.pooler_output
Out[38]:
tensor([[-8.4019e-01, -3.1882e-01, -1.1802e-01,  6.1138e-01,  1.2458e-01,
         -7.2390e-02,  7.2166e-01,  1.2822e-01,  1.3944e-02, -9.9987e-01,
          1.6440e-01,  6.1015e-01,  9.8340e-01, -1.1298e-01,  9.2865e-01,
         -4.2231e-01,  2.1135e-01, -5.8620e-01,  3.3227e-01, -4.0761e-01,
          6.7032e-01,  9.9373e-01,  4.8922e-01,  2.7396e-01,  3.8718e-01,
          7.8635e-01, -5.6028e-01,  9.2875e-01,  9.4346e-01,  7.1670e-01,
         -4.6647e-01,  1.9079e-02, -9.8936e-01, -6.4714e-02, -2.6505e-01,
         -9.8804e-01,  2.1584e-01, -7.0535e-01,  1.0201e-01,  1.6985e-01,
         -8.9834e-01,  2.1958e-01,  9.9947e-01, -4.8887e-01,  1.9320e-01,
         -2.2646e-01, -9.9998e-01,  1.6410e-01, -8.8739e-01,  8.9277e-02,
          1.1545e-01,  5.5763e-02,  7.3313e-02,  3.1227e-01,  3.8665e-01,
          1.3411e-01, -1.8745e-01,  4.4677e-02, -1.6519e-01, -4.5726e-01,
         -5.6030e-01,  3.0404e-01, -3.1251e-01, -9.0514e-01, -1.0218e-01,
         -1.1786e-02, -2.8253e-02, -1.6019e-01,  5.1660e-02, -2.1202e-01,
          8.1708e-01,  1.7194e-01,  3.2267e-01, -8.5822e-01, -2.6363e-01,
          1.3043e-01, -6.0094e-01,  1.0000e+00, -2.4463e-01, -9.7932e-01,
          1.4845e-01, -8.1988e-02,  4.7916e-01,  5.5246e-01, -3.4487e-01,
         -1.0000e+00,  1.6188e-01, -5.4785e-02, -9.8982e-01,  1.9753e-01,
          3.3677e-01, -5.3499e-02, -1.5404e-01,  5.4101e-01, -1.4771e-01,
         -3.2175e-01, -1.9403e-01, -1.2564e-01, -1.8953e-01, -8.3452e-02,
          9.5139e-02, -8.3374e-02,  1.9742e-02, -3.1122e-01,  1.2161e-01,
         -3.3803e-01, -4.2167e-01,  3.1580e-01, -2.3429e-01,  5.1750e-01,
          3.9368e-01, -2.4697e-01,  3.4244e-01, -9.5657e-01,  5.6359e-01,
         -1.7688e-01, -9.8256e-01, -5.6589e-01, -9.8972e-01,  6.6694e-01,
          5.6286e-02, -6.9763e-02,  9.6668e-01,  3.0583e-01,  1.8306e-01,
          1.2463e-01, -8.1691e-02, -1.0000e+00, -4.2852e-01, -1.0079e-01,
          2.8287e-01,  4.2447e-02, -9.7421e-01, -9.5157e-01,  4.7280e-01,
          9.5219e-01,  8.8323e-02,  9.9850e-01, -1.4773e-01,  9.3264e-01,
          1.5195e-01, -2.5326e-01, -6.5234e-02, -3.7121e-01,  4.6421e-01,
          6.1250e-02, -4.8675e-01,  1.2147e-01, -1.5196e-01, -5.3219e-02,
         -1.7494e-01, -1.4436e-01,  8.4322e-02, -9.3840e-01, -3.4755e-01,
          9.5690e-01,  5.0465e-02, -1.8532e-01,  5.7495e-01, -1.1385e-02,
         -3.5609e-01,  7.9381e-01,  4.3387e-01,  2.6833e-01, -6.5987e-02,
          3.4605e-01, -2.9071e-01,  4.4789e-01, -7.3594e-01,  1.6916e-01,
          2.5087e-01, -1.7506e-01,  5.4941e-03, -9.7758e-01, -1.9994e-01,
          3.3828e-01,  9.8842e-01,  6.5638e-01,  1.2881e-01,  2.6131e-01,
         -2.0814e-01,  4.5110e-01, -9.3941e-01,  9.8110e-01, -5.8313e-02,
          1.2784e-01,  1.8424e-01, -1.7952e-01, -8.1037e-01, -1.6230e-01,
          7.1538e-01, -3.6188e-01, -8.3163e-01,  1.5908e-01, -4.5125e-01,
         -2.9951e-01, -2.4296e-01,  3.7822e-01, -2.4523e-01, -4.0879e-01,
          2.9153e-02,  9.2314e-01,  9.1886e-01,  7.1969e-01, -5.0266e-01,
          5.2195e-01, -9.0451e-01, -3.5395e-01,  1.0104e-02,  1.4049e-01,
          2.4304e-02,  9.9160e-01, -3.7990e-01,  2.4506e-02, -9.3874e-01,
         -9.8658e-01, -1.4311e-01, -8.7904e-01, -3.8416e-03, -5.8721e-01,
          4.5871e-01, -5.3391e-02, -9.3333e-02,  3.1171e-01, -9.6045e-01,
         -6.9689e-01,  3.1778e-01, -2.4635e-01,  3.3390e-01, -2.1891e-01,
          7.4805e-01,  3.7863e-01, -4.4383e-01,  4.7100e-01,  8.9908e-01,
         -9.5270e-02, -7.7301e-01,  6.7432e-01, -2.1779e-01,  8.1611e-01,
         -5.8778e-01,  9.7637e-01,  2.9445e-01,  5.0835e-01, -9.1879e-01,
          3.7507e-02, -8.3892e-01,  1.2187e-01, -3.9495e-03, -5.5968e-01,
          6.3451e-02,  5.8330e-01,  2.2523e-01,  7.8549e-01, -3.9388e-01,
          9.8227e-01, -8.4841e-01, -9.5250e-01, -8.6930e-02, -8.5167e-02,
         -9.8785e-01,  1.9011e-01,  1.9643e-01, -9.4055e-02, -3.5262e-01,
         -3.4306e-01, -9.5470e-01,  7.6475e-01,  1.1350e-02,  9.6574e-01,
         -1.6289e-01, -8.4739e-01, -2.8138e-01, -9.1398e-01, -2.1734e-01,
         -9.5179e-02,  4.4677e-01, -2.2065e-01, -9.4875e-01,  3.9212e-01,
          5.4968e-01,  3.8606e-01,  2.4907e-01,  9.8885e-01,  9.9990e-01,
          9.7885e-01,  8.8317e-01,  8.0030e-01, -9.7612e-01, -3.4398e-01,
          9.9997e-01, -7.9403e-01, -1.0000e+00, -9.2567e-01, -4.3925e-01,
          2.1243e-01, -1.0000e+00, -8.2615e-02,  1.8239e-01, -9.1405e-01,
         -1.8479e-01,  9.7302e-01,  9.6865e-01, -1.0000e+00,  8.2077e-01,
          9.4452e-01, -6.1627e-01,  5.5988e-01, -2.3957e-01,  9.7237e-01,
          1.8515e-01,  3.9561e-01, -1.4897e-01,  2.1397e-01, -3.5993e-01,
         -7.4232e-01,  2.9146e-01, -1.5196e-02,  9.1940e-01,  3.5501e-02,
         -6.6667e-01, -9.3237e-01,  2.6392e-01,  2.0782e-02, -2.6602e-01,
         -9.6166e-01, -1.3218e-01, -2.0757e-01,  5.7610e-01,  1.9222e-02,
          1.1422e-01, -6.9864e-01,  1.9141e-02, -6.4413e-01,  2.6498e-01,
          6.3279e-01, -9.2361e-01, -5.4853e-01,  4.0075e-01, -5.9763e-01,
          1.9277e-01, -9.6033e-01,  9.6271e-01, -2.9977e-01, -1.6034e-02,
          1.0000e+00, -1.7835e-01, -8.8636e-01,  3.4273e-01,  9.2608e-02,
          1.9448e-02,  1.0000e+00,  5.7300e-01, -9.7969e-01, -5.7004e-01,
          3.8942e-01, -3.4289e-01, -4.9846e-01,  9.9842e-01, -1.1213e-01,
          1.7271e-01,  2.3863e-01,  9.7849e-01, -9.8995e-01,  8.3309e-01,
         -8.7909e-01, -9.7500e-01,  9.6340e-01,  9.3254e-01, -2.7004e-02,
         -5.3194e-01, -4.9071e-02, -8.6259e-02,  1.2259e-01, -9.4202e-01,
          5.3747e-01,  3.1679e-01, -6.0265e-02,  9.0453e-01, -5.5356e-01,
         -5.8072e-01,  2.5374e-01,  2.6564e-01,  4.5445e-01,  3.3368e-01,
          3.9240e-01, -1.6495e-01, -1.5121e-02, -1.7205e-01, -4.1955e-01,
         -9.7607e-01,  3.5962e-01,  1.0000e+00,  1.2623e-01, -3.7409e-02,
         -2.0925e-02, -1.2303e-02, -2.8448e-01,  2.6054e-01,  3.7904e-01,
         -2.1519e-01, -8.2942e-01,  1.6569e-01, -9.0627e-01, -9.8925e-01,
          6.5700e-01,  1.6491e-01, -1.5381e-01,  9.9948e-01,  1.8580e-01,
          6.7990e-02, -1.8800e-01,  6.2918e-01, -1.1658e-01,  4.5580e-01,
         -1.8539e-01,  9.7872e-01, -1.5340e-01,  5.3704e-01,  7.2691e-01,
         -5.1653e-02, -2.9198e-01, -6.2096e-01, -6.6973e-02, -9.1705e-01,
          1.9522e-01, -9.5739e-01,  9.5797e-01, -2.7874e-02,  2.4168e-01,
          5.8768e-02,  1.7149e-01,  1.0000e+00, -3.3987e-01,  3.9393e-01,
          1.6514e-01,  6.7635e-01, -9.7061e-01, -7.3446e-01, -3.6084e-01,
          1.1563e-01,  1.9032e-01, -1.7933e-01,  1.0665e-01, -9.6862e-01,
         -1.2481e-01,  6.7816e-02, -9.6239e-01, -9.8988e-01,  3.2180e-01,
          5.3455e-01, -1.1899e-02, -7.0683e-01, -6.1434e-01, -6.2079e-01,
          2.4251e-01, -1.4921e-01, -9.3819e-01,  5.7455e-01, -1.5822e-01,
          3.1907e-01, -1.3497e-01,  5.5145e-01, -4.7374e-02,  8.4845e-01,
          1.9200e-01,  5.4274e-02,  3.4452e-02, -6.4710e-01,  7.2305e-01,
         -7.5373e-01, -4.4479e-01, -3.8771e-02,  1.0000e+00, -1.8918e-01,
          2.7436e-01,  6.2532e-01,  6.5794e-01, -2.1762e-02,  2.0079e-01,
          3.5000e-01,  1.0133e-01,  1.5298e-01,  3.5570e-01, -1.2939e-01,
         -2.0919e-01,  5.5322e-01, -1.9950e-01, -1.3132e-01,  7.8193e-01,
          3.6775e-01,  6.0601e-02,  1.1336e-01, -7.4062e-02,  9.9498e-01,
         -1.5782e-01, -4.6960e-02, -3.7331e-01,  1.0194e-01, -2.2308e-01,
         -3.0289e-02,  1.0000e+00,  1.9673e-01, -1.8919e-02, -9.9090e-01,
         -1.0400e-01, -8.7667e-01,  9.9960e-01,  7.8479e-01, -7.8601e-01,
          4.2355e-01,  2.4432e-01, -3.1730e-02,  6.1198e-01, -9.7466e-02,
         -1.1662e-01,  1.1271e-01,  4.6818e-02,  9.6686e-01, -3.1075e-01,
         -9.6792e-01, -4.9323e-01,  2.9655e-01, -9.5894e-01,  9.8621e-01,
         -3.9323e-01, -1.0892e-01, -1.6482e-01, -2.7801e-02,  1.2234e-02,
         -1.5106e-01, -9.8087e-01, -1.4156e-01,  3.6931e-03,  9.6774e-01,
          1.4032e-01, -5.6110e-01, -8.7167e-01, -1.9635e-01,  1.7122e-02,
         -1.0332e-01, -9.4580e-01,  9.6719e-01, -9.8246e-01,  4.4481e-01,
          1.0000e+00,  1.6533e-01, -6.1815e-01,  2.2743e-02, -2.5573e-01,
          1.9687e-01,  2.2244e-02,  4.4107e-01, -9.5761e-01, -2.9534e-01,
         -1.1568e-01,  1.7190e-01, -8.7656e-02, -8.3539e-02,  6.3041e-01,
          1.9310e-01, -5.2143e-01, -4.8012e-01, -7.6077e-02,  2.9015e-01,
          6.4318e-01, -1.6291e-01, -5.4982e-03, -2.8394e-02,  2.8721e-02,
         -9.0883e-01, -2.4054e-01, -2.8952e-01, -9.9737e-01,  4.4196e-01,
         -1.0000e+00, -2.1939e-01, -4.4858e-01, -1.7345e-01,  8.0803e-01,
          3.9329e-01,  1.4276e-01, -7.1020e-01,  1.2349e-01,  8.9528e-01,
          7.5468e-01, -1.3425e-01,  1.2547e-01, -6.6663e-01,  1.4947e-01,
         -1.6637e-02,  2.0310e-01, -9.5943e-02,  7.1982e-01, -1.2313e-01,
          1.0000e+00,  1.0626e-02, -4.3705e-01, -9.2045e-01,  1.2155e-01,
         -1.6007e-01,  9.9997e-01, -7.3421e-01, -9.4092e-01,  2.9346e-01,
         -4.8404e-01, -8.2087e-01,  3.0309e-01, -1.2000e-01, -5.9553e-01,
         -3.1017e-01,  9.4683e-01,  6.9984e-01, -5.2796e-01,  4.0851e-01,
         -1.8308e-01, -3.4092e-01, -3.1200e-02,  7.1636e-02,  9.8628e-01,
          4.7112e-03,  7.9730e-01,  7.9710e-02, -8.0972e-02,  9.6757e-01,
          1.5996e-01,  2.3156e-01,  8.3174e-02,  1.0000e+00,  2.0440e-01,
         -9.0639e-01,  4.4556e-01, -9.7806e-01, -3.1147e-03, -9.4551e-01,
          1.8154e-01, -1.9254e-02,  8.7372e-01, -9.5097e-02,  9.5773e-01,
          1.4778e-01, -1.5501e-02,  5.9342e-02,  4.3088e-01,  2.8630e-01,
         -9.0797e-01, -9.8606e-01, -9.8646e-01,  2.2276e-01, -4.1590e-01,
          7.8981e-02,  1.9840e-01,  4.4702e-02,  2.0835e-01,  3.0080e-01,
         -1.0000e+00,  9.3251e-01,  3.4374e-01,  1.3287e-01,  9.6515e-01,
          2.8173e-01,  3.6148e-01,  9.9562e-02, -9.8304e-01, -9.3873e-01,
         -2.8620e-01, -2.0811e-01,  6.3750e-01,  6.4102e-01,  8.0429e-01,
          3.0296e-01, -4.1761e-01, -3.9641e-01,  1.2098e-01, -7.2830e-01,
         -9.9235e-01,  2.9116e-01,  2.3890e-01, -8.9118e-01,  9.6165e-01,
         -6.5592e-01, -1.2626e-01,  5.1983e-01, -1.2479e-01,  8.3263e-01,
          6.9531e-01,  1.4468e-01,  2.8531e-02,  3.2526e-01,  8.8614e-01,
          8.7362e-01,  9.8784e-01, -1.1435e-01,  7.3967e-01, -1.5916e-05,
          3.4468e-01,  6.7965e-01, -9.3931e-01,  3.1565e-02,  1.7663e-01,
         -2.4366e-02,  1.4506e-01, -6.3675e-02, -9.0421e-01,  4.9323e-01,
         -2.1407e-01,  3.2881e-01, -3.9336e-01,  1.7543e-01, -3.0572e-01,
         -2.6955e-03, -6.2087e-01, -2.7020e-01,  6.6295e-01,  2.7706e-01,
          8.9237e-01,  6.2840e-01,  2.7339e-02, -5.3093e-01, -3.4315e-03,
          9.3736e-02, -9.0249e-01,  8.0867e-01,  1.7888e-01,  2.8157e-01,
         -1.4331e-01, -1.5794e-01,  7.4887e-01, -3.0997e-01, -3.0732e-01,
         -2.6704e-01, -5.0697e-01,  8.8122e-01, -4.6491e-01, -4.0141e-01,
         -3.3163e-01,  5.4524e-01,  2.2158e-01,  9.9518e-01,  2.9895e-02,
         -7.6689e-02, -3.6579e-01, -1.7576e-01,  2.5189e-01, -7.1489e-02,
         -1.0000e+00,  3.5259e-01,  3.5202e-02,  7.0007e-02,  3.7918e-02,
         -3.7228e-02, -4.2352e-02, -9.5362e-01, -8.5418e-02,  4.5871e-02,
         -1.0965e-01, -3.9032e-01, -2.7186e-01,  5.6127e-01,  3.4586e-01,
          5.9221e-01,  8.7096e-01,  1.4091e-01,  6.9970e-01,  6.3386e-01,
          7.6531e-02, -6.5625e-01,  8.8305e-01]], grad_fn=<TanhBackward0>)
In [39]:
model_BERT_base.pooler
Out[39]:
BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)

The model's pooler is a feed-forward network with a Tanh activation, applied to the final hidden state of the [CLS] token.

In [40]:
# Get the final encoder's representation. The first element of the second dimension is the [CLS] token
CLS_embedding = response.last_hidden_state[:, 0, :].unsqueeze(0) # the second dimension holds all of the tokens
CLS_embedding.shape
Out[40]:
torch.Size([1, 1, 768])
In [41]:
# put CLS_embedding through model's pooler
model_BERT_base.pooler(CLS_embedding).shape
Out[41]:
torch.Size([1, 768])
In [42]:
model_BERT_base.pooler(CLS_embedding)
Out[42]:
tensor([[-8.4019e-01, -3.1882e-01, -1.1802e-01,  6.1138e-01,  1.2458e-01,
         -7.2390e-02,  7.2166e-01,  1.2822e-01,  1.3944e-02, -9.9987e-01,
          1.6440e-01,  6.1015e-01,  9.8340e-01, -1.1298e-01,  9.2865e-01,
         -4.2231e-01,  2.1135e-01, -5.8620e-01,  3.3227e-01, -4.0761e-01,
          6.7032e-01,  9.9373e-01,  4.8922e-01,  2.7396e-01,  3.8718e-01,
          7.8635e-01, -5.6028e-01,  9.2875e-01,  9.4346e-01,  7.1670e-01,
         -4.6647e-01,  1.9079e-02, -9.8936e-01, -6.4714e-02, -2.6505e-01,
         -9.8804e-01,  2.1584e-01, -7.0535e-01,  1.0201e-01,  1.6985e-01,
         -8.9834e-01,  2.1958e-01,  9.9947e-01, -4.8887e-01,  1.9320e-01,
         -2.2646e-01, -9.9998e-01,  1.6410e-01, -8.8739e-01,  8.9277e-02,
          1.1545e-01,  5.5763e-02,  7.3313e-02,  3.1227e-01,  3.8665e-01,
          1.3411e-01, -1.8745e-01,  4.4677e-02, -1.6519e-01, -4.5726e-01,
         -5.6030e-01,  3.0404e-01, -3.1251e-01, -9.0514e-01, -1.0218e-01,
         -1.1786e-02, -2.8253e-02, -1.6019e-01,  5.1660e-02, -2.1202e-01,
          8.1708e-01,  1.7194e-01,  3.2267e-01, -8.5822e-01, -2.6363e-01,
          1.3043e-01, -6.0094e-01,  1.0000e+00, -2.4463e-01, -9.7932e-01,
          1.4845e-01, -8.1988e-02,  4.7916e-01,  5.5246e-01, -3.4487e-01,
         -1.0000e+00,  1.6188e-01, -5.4785e-02, -9.8982e-01,  1.9753e-01,
          3.3677e-01, -5.3499e-02, -1.5404e-01,  5.4101e-01, -1.4771e-01,
         -3.2175e-01, -1.9403e-01, -1.2564e-01, -1.8953e-01, -8.3452e-02,
          9.5139e-02, -8.3374e-02,  1.9742e-02, -3.1122e-01,  1.2161e-01,
         -3.3803e-01, -4.2167e-01,  3.1580e-01, -2.3429e-01,  5.1750e-01,
          3.9368e-01, -2.4697e-01,  3.4244e-01, -9.5657e-01,  5.6359e-01,
         -1.7688e-01, -9.8256e-01, -5.6589e-01, -9.8972e-01,  6.6694e-01,
          5.6286e-02, -6.9763e-02,  9.6668e-01,  3.0583e-01,  1.8306e-01,
          1.2463e-01, -8.1691e-02, -1.0000e+00, -4.2852e-01, -1.0079e-01,
          2.8287e-01,  4.2447e-02, -9.7421e-01, -9.5157e-01,  4.7280e-01,
          9.5219e-01,  8.8323e-02,  9.9850e-01, -1.4773e-01,  9.3264e-01,
          1.5195e-01, -2.5326e-01, -6.5234e-02, -3.7121e-01,  4.6421e-01,
          6.1250e-02, -4.8675e-01,  1.2147e-01, -1.5196e-01, -5.3219e-02,
         -1.7494e-01, -1.4436e-01,  8.4322e-02, -9.3840e-01, -3.4755e-01,
          9.5690e-01,  5.0465e-02, -1.8532e-01,  5.7495e-01, -1.1385e-02,
         -3.5609e-01,  7.9381e-01,  4.3387e-01,  2.6833e-01, -6.5987e-02,
          3.4605e-01, -2.9071e-01,  4.4789e-01, -7.3594e-01,  1.6916e-01,
          2.5087e-01, -1.7506e-01,  5.4941e-03, -9.7758e-01, -1.9994e-01,
          3.3828e-01,  9.8842e-01,  6.5638e-01,  1.2881e-01,  2.6131e-01,
         -2.0814e-01,  4.5110e-01, -9.3941e-01,  9.8110e-01, -5.8313e-02,
          1.2784e-01,  1.8424e-01, -1.7952e-01, -8.1037e-01, -1.6230e-01,
          7.1538e-01, -3.6188e-01, -8.3163e-01,  1.5908e-01, -4.5125e-01,
         -2.9951e-01, -2.4296e-01,  3.7822e-01, -2.4523e-01, -4.0879e-01,
          2.9153e-02,  9.2314e-01,  9.1886e-01,  7.1969e-01, -5.0266e-01,
          5.2195e-01, -9.0451e-01, -3.5395e-01,  1.0104e-02,  1.4049e-01,
          2.4304e-02,  9.9160e-01, -3.7990e-01,  2.4506e-02, -9.3874e-01,
         -9.8658e-01, -1.4311e-01, -8.7904e-01, -3.8416e-03, -5.8721e-01,
          4.5871e-01, -5.3391e-02, -9.3333e-02,  3.1171e-01, -9.6045e-01,
         -6.9689e-01,  3.1778e-01, -2.4635e-01,  3.3390e-01, -2.1891e-01,
          7.4805e-01,  3.7863e-01, -4.4383e-01,  4.7100e-01,  8.9908e-01,
         -9.5270e-02, -7.7301e-01,  6.7432e-01, -2.1779e-01,  8.1611e-01,
         -5.8778e-01,  9.7637e-01,  2.9445e-01,  5.0835e-01, -9.1879e-01,
          3.7507e-02, -8.3892e-01,  1.2187e-01, -3.9495e-03, -5.5968e-01,
          6.3451e-02,  5.8330e-01,  2.2523e-01,  7.8549e-01, -3.9388e-01,
          9.8227e-01, -8.4841e-01, -9.5250e-01, -8.6930e-02, -8.5167e-02,
         -9.8785e-01,  1.9011e-01,  1.9643e-01, -9.4055e-02, -3.5262e-01,
         -3.4306e-01, -9.5470e-01,  7.6475e-01,  1.1350e-02,  9.6574e-01,
         -1.6289e-01, -8.4739e-01, -2.8138e-01, -9.1398e-01, -2.1734e-01,
         -9.5179e-02,  4.4677e-01, -2.2065e-01, -9.4875e-01,  3.9212e-01,
          5.4968e-01,  3.8606e-01,  2.4907e-01,  9.8885e-01,  9.9990e-01,
          9.7885e-01,  8.8317e-01,  8.0030e-01, -9.7612e-01, -3.4398e-01,
          9.9997e-01, -7.9403e-01, -1.0000e+00, -9.2567e-01, -4.3925e-01,
          2.1243e-01, -1.0000e+00, -8.2615e-02,  1.8239e-01, -9.1405e-01,
         -1.8479e-01,  9.7302e-01,  9.6865e-01, -1.0000e+00,  8.2077e-01,
          9.4452e-01, -6.1627e-01,  5.5988e-01, -2.3957e-01,  9.7237e-01,
          1.8515e-01,  3.9561e-01, -1.4897e-01,  2.1397e-01, -3.5993e-01,
         -7.4232e-01,  2.9146e-01, -1.5196e-02,  9.1940e-01,  3.5501e-02,
         -6.6667e-01, -9.3237e-01,  2.6392e-01,  2.0782e-02, -2.6602e-01,
         -9.6166e-01, -1.3218e-01, -2.0757e-01,  5.7610e-01,  1.9222e-02,
          1.1422e-01, -6.9864e-01,  1.9141e-02, -6.4413e-01,  2.6498e-01,
          6.3279e-01, -9.2361e-01, -5.4853e-01,  4.0075e-01, -5.9763e-01,
          1.9277e-01, -9.6033e-01,  9.6271e-01, -2.9977e-01, -1.6034e-02,
          1.0000e+00, -1.7835e-01, -8.8636e-01,  3.4273e-01,  9.2608e-02,
          1.9448e-02,  1.0000e+00,  5.7300e-01, -9.7969e-01, -5.7004e-01,
          3.8942e-01, -3.4289e-01, -4.9846e-01,  9.9842e-01, -1.1213e-01,
          1.7271e-01,  2.3863e-01,  9.7849e-01, -9.8995e-01,  8.3309e-01,
         -8.7909e-01, -9.7500e-01,  9.6340e-01,  9.3254e-01, -2.7004e-02,
         -5.3194e-01, -4.9071e-02, -8.6259e-02,  1.2259e-01, -9.4202e-01,
          5.3747e-01,  3.1679e-01, -6.0265e-02,  9.0453e-01, -5.5356e-01,
         -5.8072e-01,  2.5374e-01,  2.6564e-01,  4.5445e-01,  3.3368e-01,
          3.9240e-01, -1.6495e-01, -1.5121e-02, -1.7205e-01, -4.1955e-01,
         -9.7607e-01,  3.5962e-01,  1.0000e+00,  1.2623e-01, -3.7409e-02,
         -2.0925e-02, -1.2303e-02, -2.8448e-01,  2.6054e-01,  3.7904e-01,
         -2.1519e-01, -8.2942e-01,  1.6569e-01, -9.0627e-01, -9.8925e-01,
          6.5700e-01,  1.6491e-01, -1.5381e-01,  9.9948e-01,  1.8580e-01,
          6.7990e-02, -1.8800e-01,  6.2918e-01, -1.1658e-01,  4.5580e-01,
         -1.8539e-01,  9.7872e-01, -1.5340e-01,  5.3704e-01,  7.2691e-01,
         -5.1653e-02, -2.9198e-01, -6.2096e-01, -6.6973e-02, -9.1705e-01,
          1.9522e-01, -9.5739e-01,  9.5797e-01, -2.7874e-02,  2.4168e-01,
          5.8768e-02,  1.7149e-01,  1.0000e+00, -3.3987e-01,  3.9393e-01,
          1.6514e-01,  6.7635e-01, -9.7061e-01, -7.3446e-01, -3.6084e-01,
          1.1563e-01,  1.9032e-01, -1.7933e-01,  1.0665e-01, -9.6862e-01,
         -1.2481e-01,  6.7816e-02, -9.6239e-01, -9.8988e-01,  3.2180e-01,
          5.3455e-01, -1.1899e-02, -7.0683e-01, -6.1434e-01, -6.2079e-01,
          2.4251e-01, -1.4921e-01, -9.3819e-01,  5.7455e-01, -1.5822e-01,
          3.1907e-01, -1.3497e-01,  5.5145e-01, -4.7374e-02,  8.4845e-01,
          1.9200e-01,  5.4274e-02,  3.4452e-02, -6.4710e-01,  7.2305e-01,
         -7.5373e-01, -4.4479e-01, -3.8771e-02,  1.0000e+00, -1.8918e-01,
          2.7436e-01,  6.2532e-01,  6.5794e-01, -2.1762e-02,  2.0079e-01,
          3.5000e-01,  1.0133e-01,  1.5298e-01,  3.5570e-01, -1.2939e-01,
         -2.0919e-01,  5.5322e-01, -1.9950e-01, -1.3132e-01,  7.8193e-01,
          3.6775e-01,  6.0601e-02,  1.1336e-01, -7.4062e-02,  9.9498e-01,
         -1.5782e-01, -4.6960e-02, -3.7331e-01,  1.0194e-01, -2.2308e-01,
         -3.0289e-02,  1.0000e+00,  1.9673e-01, -1.8919e-02, -9.9090e-01,
         -1.0400e-01, -8.7667e-01,  9.9960e-01,  7.8479e-01, -7.8601e-01,
          4.2355e-01,  2.4432e-01, -3.1730e-02,  6.1198e-01, -9.7466e-02,
         -1.1662e-01,  1.1271e-01,  4.6818e-02,  9.6686e-01, -3.1075e-01,
         -9.6792e-01, -4.9323e-01,  2.9655e-01, -9.5894e-01,  9.8621e-01,
         -3.9323e-01, -1.0892e-01, -1.6482e-01, -2.7801e-02,  1.2234e-02,
         -1.5106e-01, -9.8087e-01, -1.4156e-01,  3.6931e-03,  9.6774e-01,
          1.4032e-01, -5.6110e-01, -8.7167e-01, -1.9635e-01,  1.7122e-02,
         -1.0332e-01, -9.4580e-01,  9.6719e-01, -9.8246e-01,  4.4481e-01,
          1.0000e+00,  1.6533e-01, -6.1815e-01,  2.2743e-02, -2.5573e-01,
          1.9687e-01,  2.2244e-02,  4.4107e-01, -9.5761e-01, -2.9534e-01,
         -1.1568e-01,  1.7190e-01, -8.7656e-02, -8.3539e-02,  6.3041e-01,
          1.9310e-01, -5.2143e-01, -4.8012e-01, -7.6077e-02,  2.9015e-01,
          6.4318e-01, -1.6291e-01, -5.4982e-03, -2.8394e-02,  2.8721e-02,
         -9.0883e-01, -2.4054e-01, -2.8952e-01, -9.9737e-01,  4.4196e-01,
         -1.0000e+00, -2.1939e-01, -4.4858e-01, -1.7345e-01,  8.0803e-01,
          3.9329e-01,  1.4276e-01, -7.1020e-01,  1.2349e-01,  8.9528e-01,
          7.5468e-01, -1.3425e-01,  1.2547e-01, -6.6663e-01,  1.4947e-01,
         -1.6637e-02,  2.0310e-01, -9.5943e-02,  7.1982e-01, -1.2313e-01,
          1.0000e+00,  1.0626e-02, -4.3705e-01, -9.2045e-01,  1.2155e-01,
         -1.6007e-01,  9.9997e-01, -7.3421e-01, -9.4092e-01,  2.9346e-01,
         -4.8404e-01, -8.2087e-01,  3.0309e-01, -1.2000e-01, -5.9553e-01,
         -3.1017e-01,  9.4683e-01,  6.9984e-01, -5.2796e-01,  4.0851e-01,
         -1.8308e-01, -3.4092e-01, -3.1200e-02,  7.1636e-02,  9.8628e-01,
          4.7112e-03,  7.9730e-01,  7.9710e-02, -8.0972e-02,  9.6757e-01,
          1.5996e-01,  2.3156e-01,  8.3174e-02,  1.0000e+00,  2.0440e-01,
         -9.0639e-01,  4.4556e-01, -9.7806e-01, -3.1147e-03, -9.4551e-01,
          1.8154e-01, -1.9254e-02,  8.7372e-01, -9.5097e-02,  9.5773e-01,
          1.4778e-01, -1.5501e-02,  5.9342e-02,  4.3088e-01,  2.8630e-01,
         -9.0797e-01, -9.8606e-01, -9.8646e-01,  2.2276e-01, -4.1590e-01,
          7.8981e-02,  1.9840e-01,  4.4702e-02,  2.0835e-01,  3.0080e-01,
         -1.0000e+00,  9.3251e-01,  3.4374e-01,  1.3287e-01,  9.6515e-01,
          2.8173e-01,  3.6148e-01,  9.9562e-02, -9.8304e-01, -9.3873e-01,
         -2.8620e-01, -2.0811e-01,  6.3750e-01,  6.4102e-01,  8.0429e-01,
          3.0296e-01, -4.1761e-01, -3.9641e-01,  1.2098e-01, -7.2830e-01,
         -9.9235e-01,  2.9116e-01,  2.3890e-01, -8.9118e-01,  9.6165e-01,
         -6.5592e-01, -1.2626e-01,  5.1983e-01, -1.2479e-01,  8.3263e-01,
          6.9531e-01,  1.4468e-01,  2.8531e-02,  3.2526e-01,  8.8614e-01,
          8.7362e-01,  9.8784e-01, -1.1435e-01,  7.3967e-01, -1.5916e-05,
          3.4468e-01,  6.7965e-01, -9.3931e-01,  3.1565e-02,  1.7663e-01,
         -2.4366e-02,  1.4506e-01, -6.3675e-02, -9.0421e-01,  4.9323e-01,
         -2.1407e-01,  3.2881e-01, -3.9336e-01,  1.7543e-01, -3.0572e-01,
         -2.6955e-03, -6.2087e-01, -2.7020e-01,  6.6295e-01,  2.7706e-01,
          8.9237e-01,  6.2840e-01,  2.7339e-02, -5.3093e-01, -3.4315e-03,
          9.3736e-02, -9.0249e-01,  8.0867e-01,  1.7888e-01,  2.8157e-01,
         -1.4331e-01, -1.5794e-01,  7.4887e-01, -3.0997e-01, -3.0732e-01,
         -2.6704e-01, -5.0697e-01,  8.8122e-01, -4.6491e-01, -4.0141e-01,
         -3.3163e-01,  5.4524e-01,  2.2158e-01,  9.9518e-01,  2.9895e-02,
         -7.6689e-02, -3.6579e-01, -1.7576e-01,  2.5189e-01, -7.1489e-02,
         -1.0000e+00,  3.5259e-01,  3.5202e-02,  7.0007e-02,  3.7918e-02,
         -3.7228e-02, -4.2352e-02, -9.5362e-01, -8.5418e-02,  4.5871e-02,
         -1.0965e-01, -3.9032e-01, -2.7186e-01,  5.6127e-01,  3.4586e-01,
          5.9221e-01,  8.7096e-01,  1.4091e-01,  6.9970e-01,  6.3386e-01,
          7.6531e-02, -6.5625e-01,  8.8305e-01]], grad_fn=<TanhBackward0>)

The first dimension is the batch size (still 1) and 768 is the final embedding dimension of the model. This tensor is a vector representation of the entire input sequence.

In [43]:
(model_BERT_base.pooler(CLS_embedding) == response.pooler_output).all()
Out[43]:
tensor(True)

Running the [CLS] embedding through the pooler gives the same output as pooler_output.

In [44]:
tot_prms = 0
for par in model_BERT_base.parameters(): # Iterate over the model's parameter tensors
    if len(par.shape) == 2:
        tot_prms += par.shape[0] * par.shape[1] # multiply the matrix dimensions together and add to the running total
        
print(f'BERT has a total of {tot_prms:,} learnable parameters in its weight matrices.') 
print(f'This is how we arrive at the ~110M learnable parameters of BERT.') 
BERT has a total of 109,360,128 learnable parameters in its weight matrices.
This is how we arrive at the ~110M learnable parameters of BERT.
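The loop above only counts 2-D weight matrices. As a cross-check (a minimal sketch, assuming model_BERT_base is the bert-base-uncased model loaded earlier), counting every parameter tensor, including biases and LayerNorm vectors, lands slightly above this figure and close to the advertised ~110M:

# Count every learnable parameter, not just the 2-D weight matrices
total_all = sum(p.numel() for p in model_BERT_base.parameters())
print(f'All learnable parameters: {total_all:,}')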
In [45]:
print(f"""There are only {30522* 768} for context-less word embedding. The rest of parameters are scattered over 
      encoders specially out attention calculation""")
There are only 23440896 for context-less word embedding. The rest of parameters are scattered over 
      encoders specially out attention calculation

Tokenization

BERT's tokenizer splits text into tokens drawn from a vocabulary of just over 30,000 word pieces. As mentioned before, two special tokens, [CLS] and [SEP], are added at the beginning and at the end of the phrase, respectively. [CLS] represents the entire sequence and [SEP] separates sentences. For example, the tokenization of the sentence "AI has conquered the world" is:

["[CLS]","AI", "has", "conquered", "the", "world","[SEP]"]

There may be words that do not exist in BERT's vocabulary, for example "Mehdi" in "Mehdi loves AI":

In [46]:
'Mehdi' in tokenizer.vocab
Out[46]:
False

BERT deals with this by splitting the unknown word into word pieces: "me", "##hdi". See the tokenization below:

["[CLS]","me", "##hdi", "loves", "ai","[SEP]"]

The takeaway from the example above is that there is a clear distinction between a word and a token; they are not interchangeable!

  • The maximum sequence length for BERT is 512 tokens. Shorter sequences can be padded up to 512, and the model will raise an error if the input exceeds 512 tokens.
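A minimal sketch of handling this limit explicitly (assuming the tokenizer already loaded earlier in this notebook; 16 is used instead of 512 just to keep the printout small):

# Pad short inputs and truncate long ones to a fixed length instead of letting the model error out
enc = tokenizer.encode_plus('AI has conquered the world',
                            max_length=16,          # BERT's real limit is 512
                            padding='max_length',   # pad shorter sequences up to max_length
                            truncation=True,        # cut anything beyond it
                            return_tensors='pt')
print(enc['input_ids'].shape)  # torch.Size([1, 16])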

There are two kinds of BERT tokenizer: uncased and cased:

                 uncased                                  cased
  Casing         lower-cased, accents removed             kept unchanged
  Typical use    case does not contribute to context      case matters (e.g. Named Entity Recognition)

  • For the cased tokenizer, the upper-case and lower-case forms of a word are different tokens.
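A quick illustration of the difference (a small sketch; the exact word pieces depend on each model's vocabulary):

from transformers import BertTokenizer

tok_uncased = BertTokenizer.from_pretrained('bert-base-uncased')
tok_cased = BertTokenizer.from_pretrained('bert-base-cased')
print(tok_uncased.tokenize('Hello Chicago'))  # everything is lower-cased
print(tok_cased.tokenize('Hello Chicago'))    # original casing is preserved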
In [47]:
# Load BERT's "uncased" tokenizer.
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
print(f"Number of tokens in BERT's vocabulary: {len(tokenizer_bert.vocab)}")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Number of tokens in BERT's vocabulary: 30522
In [48]:
txt = "Example of a simple sentence!"

# Tokenization
tokens = tokenizer_bert.encode(txt)  
print(tokens)
[101, 2742, 1997, 1037, 3722, 6251, 999, 102]
In [49]:
# Re-construct the sentence by decode method of tokens
tokenizer_bert.decode(tokens)
Out[49]:
'[CLS] example of a simple sentence! [SEP]'

Let's try a more complex sentence:

In [50]:
text = "AI is my friend and has been friendly since it was invented."

tokens = tokenizer_bert.encode(text)
print(tokens)
[101, 9932, 2003, 2026, 2767, 1998, 2038, 2042, 5379, 2144, 2009, 2001, 8826, 1012, 102]

We can display each token and its corresponding word more clearly below:

In [51]:
print(f'The sentence is "{text}", which leads to {len(tokens)} tokens:')
for tkn in tokens:
    print(f'Token: {tkn}, corresponding word: {tokenizer_bert.decode([tkn])}')
The sentence is "AI is my friend and has been friendly since it was invented.", which leads to 15 tokens:
Token: 101, corresponding word: [CLS]
Token: 9932, corresponding word: ai
Token: 2003, corresponding word: is
Token: 2026, corresponding word: my
Token: 2767, corresponding word: friend
Token: 1998, corresponding word: and
Token: 2038, corresponding word: has
Token: 2042, corresponding word: been
Token: 5379, corresponding word: friendly
Token: 2144, corresponding word: since
Token: 2009, corresponding word: it
Token: 2001, corresponding word: was
Token: 8826, corresponding word: invented
Token: 1012, corresponding word: .
Token: 102, corresponding word: [SEP]
In [52]:
text = "Mehdi loves AI"
tokens = tokenizer_bert.encode(text)
for tkn in tokens:
    print(f'Token: {tkn}, corresponding word: {tokenizer_bert.decode([tkn])}')
Token: 101, corresponding word: [CLS]
Token: 2033, corresponding word: me
Token: 22960, corresponding word: ##hdi
Token: 7459, corresponding word: loves
Token: 9932, corresponding word: ai
Token: 102, corresponding word: [SEP]

Up to now we have used encode, which returns only token ids. We can also use encode_plus, which returns several things:

  1. input_ids: the token (word-piece) ids
  2. attention_mask: a sequence of 1s (the token should be included when computing attention) and 0s (it should not, e.g. padding)
  3. token_type_ids: also a sequence of 0s and 1s, indicating whether we are passing one or two sentences into BERT
In [53]:
text = "Mehdi loves AI"
tokens = tokenizer_bert.encode_plus(text)
print(tokens)
{'input_ids': [101, 2033, 22960, 7459, 9932, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}
In [54]:
# 'python' is at index 6 (counting [CLS] at index 0)
txt1='The coolest programming language is Python'
python_language = tokenizer.encode(txt1)

# 'python' is at index 1 (right after [CLS])
txt2='Python can be aggressive sometimes during hunting'
python_pet = tokenizer.encode(txt2)

The processing steps below are required:

  1. Pass the BERT token ids to torch.tensor
  2. Unsqueeze it to create a batch of size 1
  3. Pass it through the BERT model and take the first element (index 0) of the output
  4. Select the element at the index of the word "Python" (index 6 in the first sentence, index 1 in the second)
  5. Detach it and convert it to a numpy array
In [55]:
# Vector representation of 'Python' in 'The coolest programming language is Python'
python_embedding_programming = model(torch.tensor(python_language).unsqueeze(0))[0][:,6,:].detach().numpy()

# Vector representation of 'Python' in 'Python can be aggressive sometimes during hunting'
python_embedding_pet = model(torch.tensor(python_pet).unsqueeze(0))[0][:,1,:].detach().numpy()

Word and Sentence Similarity

In [56]:
# Import cosine similarity from sklearn
from sklearn.metrics.pairwise import cosine_similarity
In [57]:
# Calculate cosine similarity between representation of the word Python
sim=cosine_similarity(python_embedding_programming, python_embedding_pet)[0][0]
print(f'Cosine similarity between the representation of the word "Python" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt1}\n 2- {txt2} \n ')
Cosine similarity between the representation of the word "Python" in two sentences below is 0.01. 
 1- The coolest programming language is Python
 2- Python can be aggressive sometimes during hunting 
 
In [58]:
# Vector representation of 'snake' in 'my snake is not poisonous and very friendly' (index 2)
txt3='my snake is not poisonous and very friendly'
snake_embedding = model(torch.tensor(tokenizer.encode(txt3)).unsqueeze(0))[0][:,2,:].detach().numpy()

txt4='Programming is very difficult for beginner'
# Vector representation of 'programming' in 'Programming is very difficult for beginner' (index 1)
programming_embedding = model(torch.tensor(tokenizer.encode(txt4)).unsqueeze(0))[0][:,1,:].detach().numpy()
In [59]:
sim=cosine_similarity(snake_embedding, programming_embedding)[0][0]
print(f'Cosine similarity between the representation of the word "snake" and "programming" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt3}\n 2- {txt4} \n ')
Cosine similarity between the representation of the word "snake" and "programming" in two sentences below is 0.36. 
 1- my snake is not poisonous and very friendly
 2- Programming is very difficult for beginner 
 
In [60]:
sim=cosine_similarity(python_embedding_programming, programming_embedding)[0][0]
print(f'Cosine similarity between the representation of the word "Python" and "Programming" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt1}\n 2- {txt4} \n ')
Cosine similarity between the representation of the word "Python" and "Programming" in two sentences below is 0.40. 
 1- The coolest programming language is Python
 2- Programming is very difficult for beginner 
 
In [61]:
sim=cosine_similarity(python_embedding_pet, snake_embedding)[0][0]
print(f'Cosine similarity between the representation of the word "Python" and "snake" in \
two sentences below is {"{:.2f}".format(sim)}. \n 1- {txt2}\n 2- {txt3} \n ')
Cosine similarity between the representation of the word "Python" and "snake" in two sentences below is 0.44. 
 1- Python can be aggressive sometimes during hunting
 2- my snake is not poisonous and very friendly 
 
  • Another example
In [62]:
def matrix_occure_prob(df,title,fontsize=11,vmin=-0.1, vmax=0.8,lable1='Sentence 1',pad=55,
                    lable2='Sentence 2',label='Cosine Similarity',rotation_x=90,axt=None,
                    num_ind=False,txtfont=6,lbl_font=9,shrink=0.8,cbar_per=False, 
                       xline=False):  
    """Plot a similarity (correlation) matrix as an annotated heatmap."""
    import matplotlib.pyplot as plt

    ax = axt or plt.axes()
    colmn1=list(df.columns)
    colmn2=list(df.index)
    corr=np.zeros((len(colmn2),len(colmn1)))
    
    for l in range(len(colmn1)):
        for l1 in range(len(colmn2)):
            cc=df[colmn1[l]][df.index==colmn2[l1]].values[0]
            try:
                if len(cc)>1:
                    corr[l1,l]=cc[0]  
            except TypeError:
                corr[l1,l]=cc            
            if num_ind:
                ax.text(l, l1, str(round(cc,2)), va='center', ha='center',fontsize=txtfont)
    im =ax.matshow(corr, cmap='jet', interpolation='nearest',vmin=vmin, vmax=vmax)
    cbar =plt.colorbar(im,shrink=shrink,label=label) 
    if (cbar_per):
        cbar.ax.set_yticklabels(['{:.0f}%'.format(x) for x in np.arange( 0,110,10)])    

    ax.set_xticks(np.arange(len(colmn1)))
    ax.set_xticklabels(colmn1,fontsize=lbl_font)
    ax.set_yticks(np.arange(len(colmn2)))
    ax.set_yticklabels(colmn2,fontsize=lbl_font)    
    
    # Set ticks on both sides of axes on
    ax.tick_params(axis="x", bottom=True, top=False, labelbottom=True, labeltop=False)
    
    # Rotate and align bottom ticklabels
    plt.setp([tick.label1 for tick in ax.xaxis.get_major_ticks()], rotation=rotation_x,
             ha="right", va="center", rotation_mode="anchor")
    
    # Rotate and align left (y-axis) ticklabels
    plt.setp([tick.label1 for tick in ax.yaxis.get_major_ticks()], rotation=rotation_x,
             ha="right", va="center", rotation_mode="anchor")
    
    if xline:
        x_labels = list(ax.get_xticklabels())
        x_label_dict = dict([(x.get_text(), x.get_position()[0]) for x in x_labels])
        
        for ix in xline:
            plt.axvline(x=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')
            plt.axhline(y=x_label_dict[ix]-0.5,linewidth =1.2,color='k', linestyle='--')  

    plt.xlabel(lable1)
    plt.ylabel(lable2)    
    ax.grid(color='k', linestyle='-', linewidth=0.05)
    plt.title(f'{title}',fontsize=fontsize, pad=pad)
    plt.show()
In [63]:
txt1 = "President greets the press in Chicago"
txt1_tokenized = tokenizer.encode(txt1)
#
txt2 = "Obama speaks to media in Illinois"
txt2_tokenized = tokenizer.encode(txt2)
In [64]:
df = pd.DataFrame()

for i in np.arange(1,len(txt1_tokenized)-1):
    sim_all=[]
    idx = []
    for j in np.arange(1,len(txt2_tokenized)-1):
        embedding_txt1_tokenized = model(torch.tensor(txt1_tokenized).unsqueeze(0))[0][:,i,:].detach().numpy()
        embedding_txt2_tokenized = model(torch.tensor(txt2_tokenized).unsqueeze(0))[0][:,j,:].detach().numpy()
        tmp = cosine_similarity(embedding_txt1_tokenized,embedding_txt2_tokenized)
        sim_all.append(tmp[0][0]) 
        idx.append(tokenizer.decode([txt2_tokenized[j]]))
    df[tokenizer.decode([txt1_tokenized[i]])] = sim_all    
    df.index = idx
df        
Out[64]:
president greet ##s the press in chicago
obama 0.629611 0.107220 0.081697 0.131931 0.158450 0.021865 0.182913
speaks 0.333187 0.563302 0.599095 0.486098 0.366036 0.489660 0.430971
to 0.263272 0.458562 0.513506 0.554714 0.326395 0.466304 0.334229
media 0.311030 0.360933 0.298417 0.423117 0.674678 0.328684 0.357700
in 0.197451 0.358359 0.464872 0.507185 0.352629 0.740713 0.477997
illinois 0.258008 0.249743 0.240638 0.300416 0.334152 0.338123 0.740115
In [65]:
font = {'size'   : 10}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(7, 4), dpi= 110, facecolor='w', edgecolor='k') 

matrix_occure_prob(df,vmin=0.0,vmax=0.8,title='Word Similarity by BERT Model',num_ind=True,axt=ax,pad=5,
                   txtfont=12,lable1='',lable2='',label='BERT',xline=False)
  • Sentence-Transformers

The easiest way to implement sentence similarity is through the sentence-transformers library, which wraps most of this process into a few lines of code.

First, we install sentence-transformers with pip install sentence-transformers. This library uses Hugging Face's transformers behind the scenes, so sentence-transformers models can also be found on the Hugging Face model hub.

We use the bert-base-nli-mean-tokens model. Let's create some sentences, initialize our model, and encode the sentences:

In [66]:
corpus = ["Global warming is happening", 
        "The weather is not good to play golf today", 
        "Never compare an apple to an orange", 
        "Apple and orange are completely different from each other",   
        "Ocean temperature is rising rapidly",
        "AI has taken the world by storm",
        "It is rainy today so we should postpone our golf game", 
        "I love reading books than watching TV", 
        "People say I am a bookworm, in fact, I do not want to waste my time on TV",
         "AI has transformed the way the world works"]
In [67]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
In [68]:
sentence_embeddings = model.encode(corpus)
sentence_embeddings.shape
In [69]:
df=pd.DataFrame()

for i1 in range(len(corpus)):
    sim_all=[]
    for i2 in range(len(corpus)):
        tmp=cosine_similarity([sentence_embeddings[i1]],[sentence_embeddings[i2]])
        sim_all.append(tmp[0][0])
    df[corpus[i1]]=sim_all    
df.index=corpus    
df
Out[69]:
Global warming is happening The weather is not good to play golf today Never compare an apple to an orange Apple and orange are completely different from each other Ocean temperature is rising rapidly AI has taken the world by storm It is rainy today so we should postpone our golf game I love reading books than watching TV People say I am a bookworm, in fact, I do not want to waste my time on TV AI has transformed the way the world works
Global warming is happening 1.000000 0.284419 0.049815 0.135274 0.667538 0.562536 0.220875 0.111236 0.180935 0.652134
The weather is not good to play golf today 0.284419 1.000000 0.396641 0.537905 0.265201 0.475757 0.729455 0.237244 0.367050 0.270868
Never compare an apple to an orange 0.049815 0.396641 1.000000 0.760247 0.003454 0.256307 0.404155 0.145969 0.284058 0.110317
Apple and orange are completely different from each other 0.135274 0.537905 0.760247 1.000000 0.137347 0.453787 0.443579 0.183743 0.274083 0.244242
Ocean temperature is rising rapidly 0.667538 0.265201 0.003454 0.137347 1.000000 0.487562 0.095638 0.117352 0.091370 0.476548
AI has taken the world by storm 0.562536 0.475757 0.256307 0.453787 0.487562 1.000000 0.432809 0.076726 0.209972 0.631133
It is rainy today so we should postpone our golf game 0.220875 0.729455 0.404155 0.443579 0.095638 0.432809 1.000000 0.279056 0.509787 0.185599
I love reading books than watching TV 0.111236 0.237244 0.145969 0.183743 0.117352 0.076726 0.279056 1.000000 0.729154 0.203464
People say I am a bookworm, in fact, I do not want to waste my time on TV 0.180935 0.367050 0.284058 0.274083 0.091370 0.209972 0.509787 0.729154 1.000000 0.272889
AI has transformed the way the world works 0.652134 0.270868 0.110317 0.244242 0.476548 0.631133 0.185599 0.203464 0.272889 1.000000
In [70]:
font = {'size'   : 16}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(15, 10), dpi= 110, facecolor='w', edgecolor='k')  
    
matrix_occure_prob(df,title='Text Similarity Matrix by BERT Pre-trained Model ',lable1='',vmin=0, pad=10,axt=ax,
                  vmax=0.8,cbar_per=False,lable2='',num_ind=True,txtfont=15,xline=False,fontsize=22,
                   lbl_font=14,label='BERT Similarity',rotation_x=30)

Embedding of BERT

BERT applies three different types of embeddings:

  1. Token Embeddings: map each token in the input sequence to a fixed-size embedding vector. BERT uses WordPiece tokenization, which breaks words down into smaller subword units based on their frequency in a large corpus of text.

    • A lookup table of 30,522 vectors for BERT-base
    • The context-free meaning of each token
    • These vectors are learnable and can be fine-tuned during downstream training
  2. Segment Embeddings: used for tasks that process pairs of sentences or sequences, such as sentence-pair classification or question answering. The segment embeddings indicate which sentence each token belongs to.

    • A lookup table of only 2 vectors, one for sentence A and one for sentence B, which is what question-and-answer style tasks rely on
    • Learned during pre-training but fixed afterwards: every token of sentence A gets the same vector, and every token of sentence B gets the other
  3. Position Embeddings: add positional information to the token embeddings. This is important because the order of words in a sentence is crucial to its meaning.

    • Represent the position of each token in the sentence
    • In BERT these are learned during pre-training (a 512-row lookup table), unlike the fixed sinusoidal encoding of the original Transformer

image-3.png

To create the final embedding, we add these three embeddings element-wise; for an 11-token input, for example, this gives a final representation of shape (11, 768).
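In symbols, and matching what the code later in this section demonstrates, the final embedding of each token is

$E_{final} = \mathrm{LayerNorm}\left(E_{token} + E_{segment} + E_{position}\right)$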

  • Token Embeddings

Here is a schematic illustration of Token Embeddings for the input text “I like strawberries”.

image.png

Image retrieved from medium

  • Segment Embeddings

Here is a schematic illustration of Segment Embeddings for the input texts “I like cats” and “I like dogs”.

image.png Image retrieved from medium

The Segment Embeddings layer consists of only two vector representations. The first vector (index 0) is assigned to all tokens that belong to input 1, while the second vector (index 1) is assigned to all tokens that belong to input 2. If there is only one input sentence, its segment embedding is simply the vector at index 0 of the Segment Embeddings table.

  • Position Embeddings

BERT is built from a stack of Transformer encoder layers, which by themselves do not capture the sequential arrangement of their inputs. The inclusion of position embeddings is what enables BERT to make sense of input texts such as: "I think, therefore I am"

The first “I” should not have the same vector representation as the second “I”.

BERT was designed to process input sequences with a maximum length of 512 tokens. To account for the sequential order of the input, the authors of BERT let the model learn a vector representation for each position. This means that the Position Embeddings layer is a lookup table of size (512, 768), where the first row is the vector added to any word in the first position, the second row is the vector added to any word in the second position, and so on. Consequently, if the input consists of the phrases "Hello world" and "Hi there," both "Hello" and "Hi" receive the same position embedding because each is the first word of its input sequence; correspondingly, "world" and "there" share the same position embedding.

Transformers lack recurrent components, so on their own they cannot tell where a word sits within a sentence when the sentence is presented as a whole. One classic approach (used in the original Transformer) is to add sine and cosine waves of varying frequencies to the word vectors.

image.png

See the equations below, which use sine and cosine functions to encode a token's position:

  • if $j$ is even => $P(i,j)= \sin \left( \frac{i}{10000^{j/d}} \right)$
  • if $j$ is odd => $P(i,j)= \cos \left( \frac{i}{10000^{(j-1)/d}} \right)$

where

$i$: position of the token, between 0 and 511

$j$: index within the embedding dimension, between 0 and 767 for BERT-base

$d$: embedding dimension, which is 768 for BERT-base

These equations let the model recognize where in the sequence a token sits. The position embedding is then added to the other embeddings, namely the word embedding and the segment embedding.
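Here is a minimal sketch of that sinusoidal scheme (for illustration only; as noted above, BERT itself uses a learned position-embedding table, which we inspect below):

import numpy as np

def sinusoidal_position_encoding(num_positions=512, d=768):
    """Build the (num_positions, d) matrix P(i, j) from the equations above."""
    P = np.zeros((num_positions, d))
    for i in range(num_positions):
        for j in range(d):
            if j % 2 == 0:                        # even embedding index -> sine
                P[i, j] = np.sin(i / 10000 ** (j / d))
            else:                                 # odd embedding index -> cosine
                P[i, j] = np.cos(i / 10000 ** ((j - 1) / d))
    return P

print(sinusoidal_position_encoding().shape)  # (512, 768)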

In [71]:
model_bert = BertModel.from_pretrained('bert-base-uncased')
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

The code below shows the embedding modules of BERT:

word_embeddings: context-free word embeddings, size= 30522 (vocabulary) * 768 (dimension)

position_embeddings : encodes word position, size= 512 (length) * 768 (dimension)

token_type_embeddings : 0 or 1. Used to lookup the segment embedding, size= 2 (segment A/B) * 768 (dimension)

In [72]:
model_bert.embeddings
Out[72]:
BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
In [73]:
simple_sentence = 'I am Mehdi'

tokenizer.encode(simple_sentence, return_tensors='pt') # return_tensors='pt' saves us from converting to PyTorch tensors manually
Out[73]:
tensor([[  101,  1045,  2572,  2033, 22960,   102]])
In [74]:
# context-free embedding of each token in the sentence
model_bert.embeddings.word_embeddings(tokenizer.encode(simple_sentence, return_tensors='pt'))
Out[74]:
tensor([[[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
         [-0.0211,  0.0059, -0.0179,  ...,  0.0163,  0.0122,  0.0073],
         [-0.0437, -0.0150,  0.0029,  ..., -0.0282,  0.0474, -0.0448],
         [-0.0011,  0.0014, -0.0595,  ...,  0.0207,  0.0283,  0.0081],
         [-0.0025, -0.0325, -0.0462,  ..., -0.0579, -0.0926, -0.0251],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]]],
       grad_fn=<EmbeddingBackward0>)

If we encode a different sentence, the first and last rows are identical, since they correspond to the same [CLS] and [SEP] tokens (shared tokens such as "i" and "am" also match):

In [75]:
model_bert.embeddings.word_embeddings(tokenizer.encode('I am Hamed. ', return_tensors='pt'))
Out[75]:
tensor([[[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
         [-0.0211,  0.0059, -0.0179,  ...,  0.0163,  0.0122,  0.0073],
         [-0.0437, -0.0150,  0.0029,  ..., -0.0282,  0.0474, -0.0448],
         ...,
         [-0.0467,  0.0171, -0.0075,  ..., -0.0520, -0.0200,  0.0026],
         [-0.0207, -0.0020, -0.0118,  ...,  0.0128,  0.0200,  0.0259],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]]],
       grad_fn=<EmbeddingBackward0>)
In [76]:
model_bert.embeddings.position_embeddings  # 512 embeddings
Out[76]:
Embedding(512, 768)
In [77]:
torch.LongTensor(range(6)) # making a long tensor of length 6 by torch
Out[77]:
tensor([0, 1, 2, 3, 4, 5])
In [78]:
model_bert.embeddings.position_embeddings(torch.LongTensor(range(6)))  # positional embeddings for our example_phrase
Out[78]:
tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
          6.8312e-04,  1.5441e-02],
        [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
          2.9753e-02, -5.3247e-03],
        [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
          1.8741e-02, -7.3140e-03],
        [-4.1949e-03, -1.1852e-02, -2.1180e-02,  ...,  2.2455e-02,
          5.2826e-03, -1.9723e-03],
        [-5.6087e-03, -1.0445e-02, -7.2288e-03,  ...,  2.0837e-02,
          3.5402e-03,  4.7708e-03],
        [-3.0871e-03, -1.8956e-02, -1.8930e-02,  ...,  7.4045e-03,
          2.0183e-02,  3.4077e-03]], grad_fn=<EmbeddingBackward0>)

Each row is the learned position embedding for that position, which tells BERT where each token sits in the sequence.

In [79]:
model_bert.embeddings.token_type_embeddings  # Segment A and B (2 embeddings)
Out[79]:
Embedding(2, 768)
In [80]:
torch.LongTensor([0]*6)
Out[80]:
tensor([0, 0, 0, 0, 0, 0])
In [81]:
model_bert.embeddings.token_type_embeddings(torch.LongTensor([0]*6))  # Same embedding for all tokens
Out[81]:
tensor([[ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
        [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086]],
       grad_fn=<EmbeddingBackward0>)

We get the same row for every token because all of them belong to the same segment (segment A).

To get the final BERT embedding, we add all three types of embeddings (word embedding, position embedding, and segment/token-type embedding) and then pass the sum through LayerNorm.

In [82]:
# Apply feed forward normalization layer
model_bert.embeddings.LayerNorm(
    model_bert.embeddings.word_embeddings(tokenizer.encode(simple_sentence, return_tensors='pt')) + \
    model_bert.embeddings.position_embeddings(torch.LongTensor(range(6))) + \
    model_bert.embeddings.token_type_embeddings(torch.LongTensor([0]*6))
)
Out[82]:
tensor([[[ 1.6855e-01, -2.8577e-01, -3.2613e-01,  ..., -2.7571e-02,
           3.8253e-02,  1.6400e-01],
         [-3.4025e-04,  5.3974e-01, -2.8805e-01,  ...,  7.5731e-01,
           8.9008e-01,  1.6575e-01],
         [-6.3496e-01,  1.9748e-01,  2.5116e-01,  ..., -4.0819e-02,
           1.3468e+00, -6.9357e-01],
         [ 1.3645e-01,  2.2527e-01, -9.9824e-01,  ...,  7.2633e-01,
           7.5188e-01,  2.3614e-01],
         [ 4.2074e-01,  6.5248e-02, -1.3691e-01,  ..., -1.0155e-01,
          -7.3556e-01,  1.6419e-01],
         [-3.2507e-01, -3.1879e-01, -1.1632e-01,  ..., -3.9602e-01,
           4.1120e-01, -7.7552e-02]]], grad_fn=<NativeLayerNormBackward0>)

The exact same matrix can be obtained by calling the embeddings module directly:

In [83]:
model_bert.embeddings(tokenizer.encode(simple_sentence, return_tensors='pt'))
Out[83]:
tensor([[[ 1.6855e-01, -2.8577e-01, -3.2613e-01,  ..., -2.7571e-02,
           3.8253e-02,  1.6400e-01],
         [-3.4026e-04,  5.3974e-01, -2.8805e-01,  ...,  7.5731e-01,
           8.9008e-01,  1.6575e-01],
         [-6.3496e-01,  1.9748e-01,  2.5116e-01,  ..., -4.0819e-02,
           1.3468e+00, -6.9357e-01],
         [ 1.3645e-01,  2.2527e-01, -9.9824e-01,  ...,  7.2633e-01,
           7.5188e-01,  2.3614e-01],
         [ 4.2074e-01,  6.5248e-02, -1.3691e-01,  ..., -1.0155e-01,
          -7.3556e-01,  1.6419e-01],
         [-3.2507e-01, -3.1879e-01, -1.1632e-01,  ..., -3.9602e-01,
           4.1120e-01, -7.7552e-02]]], grad_fn=<NativeLayerNormBackward0>)
In [84]:
model_bert.embeddings(tokenizer.encode(simple_sentence, return_tensors='pt')).shape
Out[84]:
torch.Size([1, 6, 768])

A batch of 1 sequence with 6 tokens, where each token has a fixed-length embedding vector of 768.

Since BERT is an encoder-only model (there is no decoder), this (1, 6, 768) tensor is what gets passed through BERT's encoder stack.

Here is a visualization of BERT:

image.png

BERT Implementation: Pre-training and Fine-tuning

BERT is pre-trained on two tasks:

  1. Masked Language Modeling (MLM): replace 15% of the tokens in the corpus with a special mask token and ask BERT to predict the missing word.
  2. Next Sentence Prediction (NSP)

These are not useful tasks in themselves, but they help BERT learn how words and sentences work.

Masked Language Modeling (MLM)

In [85]:
# import libraries: BertForMaskedLM is BERT with a masked-language-modeling head; pipeline makes it
# much easier to perform NLP tasks without writing much code
from transformers import BertForMaskedLM, pipeline

The code below initializes a masked language model from the pre-trained bert-base-cased checkpoint, which keeps track of accents, upper/lower case, and so on.

In [86]:
bert_lm_mask = BertForMaskedLM.from_pretrained('bert-base-cased')
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
In [87]:
# look at the model
bert_lm_mask
Out[87]:
BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (transform_act_fn): GELUActivation()
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=28996, bias=True)
    )
  )
)
In [88]:
# Using pipelines in transformers makes our life easier for several tasks

# The code below shows how to perform an auto-encoding (fill-mask) language model task.
# We give the pipeline a model to do the task;
# for the same result, we could pass model=bert_lm_mask
nlp_mask = pipeline("fill-mask", model='bert-base-cased')  
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
In [89]:
type(nlp_mask.model)
Out[89]:
transformers.models.bert.modeling_bert.BertForMaskedLM
In [90]:
nlp_mask.tokenizer
Out[90]:
BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
In [91]:
preds = nlp_mask(f"If you don’t know how to swim, you will  {nlp_mask.tokenizer.mask_token} in this lake.")

print('If you don’t know how to swim, you will .... in this lake.')

for p in preds:
    print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
If you don’t know how to swim, you will .... in this lake.
Token:drown. Score: 72.56%
Token:die. Score: 23.95%
Token:be. Score: 0.63%
Token:drowned. Score: 0.45%
Token:fall. Score: 0.39%

This may not be a very useful task in itself, but the authors of BERT created these tasks to teach the model the basics of how words are used in sentences.
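For reference, here is a short sketch of getting the same predictions without the pipeline (assuming bert_lm_mask from above; the matching cased tokenizer is loaded fresh here):

from transformers import BertTokenizer
import torch

tokenizer_cased = BertTokenizer.from_pretrained('bert-base-cased')
text = f"If you don't know how to swim, you will {tokenizer_cased.mask_token} in this lake."
inputs = tokenizer_cased(text, return_tensors='pt')
with torch.no_grad():
    logits = bert_lm_mask(**inputs).logits                     # shape: (1, seq_len, vocab_size)
mask_pos = (inputs['input_ids'] == tokenizer_cased.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer_cased.convert_ids_to_tokens(top5))             # e.g. ['drown', 'die', ...]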

Next Sentence Prediction (NSP)

BERT's Next Sentence Prediction involves feeding BERT two inputs, "sentence A" and "sentence B", and predicting whether sentence B comes directly after sentence A (True/False). The BERT model is pre-trained using both next-sentence prediction (NSP) and masked-language modeling (MLM).

In [92]:
from transformers import BertForNextSentencePrediction
In [93]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

BERT_nsp = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
In [94]:
BERT_nsp
Out[94]:
BertForNextSentencePrediction(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (cls): BertOnlyNSPHead(
    (seq_relationship): Linear(in_features=768, out_features=2, bias=True)
  )
)
In [95]:
text = "I like cookies!"
text2 = "Do you like them too?"
In [96]:
inputs = tokenizer(text, text2, return_tensors='pt') # 'pt' stands for pytorch format
inputs
Out[96]:
{'input_ids': tensor([[  101,  1045,  2066, 16324,   999,   102,  2079,  2017,  2066,  2068,
          2205,  1029,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
In [97]:
inputs.input_ids  # tokens for sentence A and B
Out[97]:
tensor([[  101,  1045,  2066, 16324,   999,   102,  2079,  2017,  2066,  2068,
          2205,  1029,   102]])
In [98]:
inputs.token_type_ids  # segment Ids (0 == A & 1 == B)
Out[98]:
tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
In [99]:
inputs.attention_mask  # pay attention to everything
Out[99]:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
In [100]:
# pass inputs into our nsp model
# 0 == "isNextSentence" and 1 == "notNextSentence"
outputs = BERT_nsp(**inputs)
outputs
Out[100]:
NextSentencePredictorOutput(loss=None, logits=tensor([[ 6.0661, -5.6789]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The logits define which class is predicted: label 0 has a much higher score than label 1. Label 0 means "True" (sentence B comes after sentence A) and label 1 means "False" (sentence B does not come after sentence A).
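To turn these logits into probabilities we can apply a softmax (a small illustrative step using the outputs variable from the cell above):

# Softmax over the two NSP classes: index 0 = "B follows A", index 1 = "B does not follow A"
probs = torch.softmax(outputs.logits, dim=-1)
print(probs)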

In [101]:
# calculate the loss by passing in a label
# If we pass in the explicit label 0, the tensor of logits stays the same, but a loss is returned as well.
outputs = BERT_nsp(**inputs, labels=torch.LongTensor([0]))
outputs
Out[101]:
NextSentencePredictorOutput(loss=tensor(7.9870e-06, grad_fn=<NllLossBackward0>), logits=tensor([[ 6.0661, -5.6789]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The loss is very low (close to zero), which means the model agrees that sentence B comes after sentence A.

In [102]:
# calculate the loss by passing in the opposite label (label 1)
outputs = BERT_nsp(**inputs, labels=torch.LongTensor([1]))
outputs
Out[102]:
NextSentencePredictorOutput(loss=tensor(11.7450, grad_fn=<NllLossBackward0>), logits=tensor([[ 6.0661, -5.6789]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The loss is very high: the model is confident that sentence B does follow sentence A, so the label "not next" (1) fits poorly.

Again, next sentence prediction on its own is not a particularly useful task, but it does teach BERT how to model the relationship between sentences.

How to Fine-tune BERT for NLP tasks

Through pre-training, BERT acquires a general idea of:

  1. how tokens (words) are used in sentences (Masked Language Modeling)
  2. how sentences relate to one another in large corpora (Next Sentence Prediction)

Whatever BERT has learned can then be reused to solve a specific NLP problem by fine-tuning the model.

Fine-tuning works by first feeding a sentence to a pre-trained BERT. The [CLS] token has been pre-trained on the next-sentence-prediction task through the model's pooler. We then add another feed-forward layer after the pooler and train it to map the pooled representation to the number of sequence classes we want. For the classification problem shown in the figure below, we do not care about the representation of each individual token after passing our sentence through BERT; we classify the entire sequence with a single label.

image.png
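Conceptually, this is all that BertForSequenceClassification adds on top of BertModel. Here is a minimal sketch of the same idea (SimpleBertClassifier and the choice of three labels are illustrative, not part of the transformers API):

import torch.nn as nn
from transformers import BertModel

class SimpleBertClassifier(nn.Module):
    def __init__(self, num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # extra feed-forward layer on top of the pooler output ([CLS] representation)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, **inputs):
        pooled = self.bert(**inputs).pooler_output   # shape: (batch, 768)
        return self.classifier(pooled)               # shape: (batch, num_labels)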

However, for token classification we need the representation of each token and must pass each one through a feed-forward layer to classify it against the labels we have. The classic example of this is Named Entity Recognition.

image-2.png
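A minimal sketch of the token-classification setup, assuming the same bert-base-uncased backbone (SimpleBertTokenTagger is an illustrative name; BertForTokenClassification used below packages the same idea):

import torch.nn as nn
from transformers import BertModel

class SimpleBertTokenTagger(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # the same linear layer is applied to every token representation
        self.token_classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, **inputs):
        hidden = self.bert(**inputs).last_hidden_state   # (batch, seq_len, 768)
        return self.token_classifier(hidden)             # (batch, seq_len, num_labels)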

Question answering is the most difficult fine-tuning task. We have a question and a context, a passage of text that contains the answer to the question. We pass the entire sequence, question plus context, to pre-trained BERT. Similar to token classification, we add a layer on top of every single token; what we predict is whether or not that specific token is the start or the end of the answer to the question: image.png

When fine-tuning BERT for NLP tasks, we can use three built-in classes from the Hugging Face transformers library: BertForQuestionAnswering, BertForTokenClassification, and BertForSequenceClassification.

In [103]:
from transformers import pipeline, BertForQuestionAnswering, BertForTokenClassification, BertForSequenceClassification

Sequence Classification

In [122]:
bert_sq_clss = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
bert_sq_clss
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Out[122]:
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=3, bias=True)
)
In [123]:
bert_sq_clss.classifier
Out[123]:
Linear(in_features=768, out_features=3, bias=True)

A classifier from the Hugging Face model repository is selected. This model has already been fine-tuned to predict whether a sentence is positive, negative, or neutral for financial data. Of course, we could fine-tune our own model, but first we look at other people's models. To use one, we need to create a pipeline:

The ProsusAI/finbert model is retrieved from Hugging Face as a financial BERT model. It takes a short piece of text about financial data and outputs whether that text is positive, negative, or neutral in a financial context.

In [106]:
bert_fin = pipeline('text-classification', model='ProsusAI/finbert', tokenizer='ProsusAI/finbert')
In [107]:
bert_fin('The stock market plummeted today, leaving many investors feeling stuck and uncertain about the future.')
Out[107]:
[{'label': 'negative', 'score': 0.9676334857940674}]
In [108]:
bert_fin('Projects shows great trend for stock market.')
Out[108]:
[{'label': 'positive', 'score': 0.8640651702880859}]

The BERT model has been fine-tuned to work in a financial context.

In [109]:
bert_fin.model
Out[109]:
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=3, bias=True)
)

Token Classification

Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models could be trained to identify specific entities in a text, such as dates, individuals and places; and PoS tagging would identify, for example, which words in a text are verbs, nouns, and punctuation marks.

In [110]:
bert_toc_clss = BertForTokenClassification.from_pretrained('bert-base-uncased')
bert_toc_clss
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Out[110]:
BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

We could train our own model, but here we first use an already fine-tuned model.

  1. Named Entity Recognition (NER)
In [111]:
ner_classifier = pipeline("ner")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
In [112]:
ner_classifier("Hello I'm Mehdi and I live in Calgary.")
Out[112]:
[{'entity': 'I-PER',
  'score': 0.99913883,
  'index': 5,
  'word': 'Me',
  'start': 10,
  'end': 12},
 {'entity': 'I-PER',
  'score': 0.9982659,
  'index': 6,
  'word': '##hdi',
  'start': 12,
  'end': 15},
 {'entity': 'I-LOC',
  'score': 0.997134,
  'index': 11,
  'word': 'Calgary',
  'start': 30,
  'end': 37}]

Part-of-Speech (PoS) Tagging

In PoS tagging, the model recognizes parts of speech, such as nouns, pronouns, adjectives, or verbs, in a given text. The task is formulated as labeling each word with a part of speech.

In [113]:
pos_classifier = pipeline("token-classification", model = "vblagoje/bert-english-uncased-finetuned-pos")
pos_classifier("Hello I'm Mehdi and I live in Calgary.")
Out[113]:
[{'entity': 'INTJ',
  'score': 0.9966468,
  'index': 1,
  'word': 'hello',
  'start': 0,
  'end': 5},
 {'entity': 'PRON',
  'score': 0.9994266,
  'index': 2,
  'word': 'i',
  'start': 6,
  'end': 7},
 {'entity': 'AUX',
  'score': 0.99607927,
  'index': 3,
  'word': "'",
  'start': 7,
  'end': 8},
 {'entity': 'AUX',
  'score': 0.9957975,
  'index': 4,
  'word': 'm',
  'start': 8,
  'end': 9},
 {'entity': 'PROPN',
  'score': 0.9986204,
  'index': 5,
  'word': 'me',
  'start': 10,
  'end': 12},
 {'entity': 'PROPN',
  'score': 0.99869543,
  'index': 6,
  'word': '##hdi',
  'start': 12,
  'end': 15},
 {'entity': 'CCONJ',
  'score': 0.99920326,
  'index': 7,
  'word': 'and',
  'start': 16,
  'end': 19},
 {'entity': 'PRON',
  'score': 0.9994598,
  'index': 8,
  'word': 'i',
  'start': 20,
  'end': 21},
 {'entity': 'VERB',
  'score': 0.99851996,
  'index': 9,
  'word': 'live',
  'start': 22,
  'end': 26},
 {'entity': 'ADP',
  'score': 0.9993967,
  'index': 10,
  'word': 'in',
  'start': 27,
  'end': 29},
 {'entity': 'PROPN',
  'score': 0.9989605,
  'index': 11,
  'word': 'calgary',
  'start': 30,
  'end': 37},
 {'entity': 'PUNCT',
  'score': 0.9996562,
  'index': 12,
  'word': '.',
  'start': 37,
  'end': 38}]

Question Answering

In [114]:
bert_qa_clss = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
bert_qa_clss
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Out[114]:
BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
)

For the *qa_outputs* head, there are only two output features, so out_features is always 2: one logit for whether a token is the start of the answer and one for whether it is the end.

In [115]:
bert_qa_clss.qa_outputs
Out[115]:
Linear(in_features=768, out_features=2, bias=True)
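To turn those two logits per token into an answer span, we take the most likely start and end positions and decode the tokens in between. A rough sketch using the BERT tokenizer from earlier (the variable names here are new, and because bert_qa_clss still has an untrained head, the span it picks is not meaningful yet):

question, context = "Where is Mehdi living these days?", "Mehdi lives in Calgary"
qa_inputs = tokenizer(question, context, return_tensors='pt')

qa_out = bert_qa_clss(**qa_inputs)
start = qa_out.start_logits.argmax(-1).item()  # most likely start token
end = qa_out.end_logits.argmax(-1).item()      # most likely end token
answer = tokenizer.decode(qa_inputs.input_ids[0][start:end + 1])
print(answer)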

A QA model is selected from the Hugging Face model repository: a RoBERTa flavor of BERT trained on the SQuAD question-answering dataset. This is the common approach to fine-tuning for question answering:

In [116]:
model_name = "deepset/roberta-base-squad2"
qa = pipeline(model=model_name, tokenizer=model_name, revision="v1.0", task="question-answering")
In [117]:
sequence = "Where is Mehdi living these days?", "Mehdi lives in Calgary"
qa(*sequence)
Out[117]:
{'score': 0.42091551423072815, 'start': 15, 'end': 22, 'answer': 'Calgary'}
In [ ]: