Summary
Large language models (LLMs), rooted in the Transformer architecture, are specialized AI models trained on extensive text data to understand and generate human language, code, and more. These models exhibit remarkable accuracy and versatility, excelling in tasks from text classification to the generation of fluent and stylistically nuanced content.
BERT, or Bidirectional Encoder Representations from Transformers, is a pre-trained model that can be used to analyze and classify sequences of text. BERT excels at capturing contextual information by considering both the left and right context of each word in a sentence. Fine-tuning BERT for sequence classification involves adding a classification layer on top of the pre-trained BERT model and training it on a labeled dataset, which allows BERT to adapt its representations to the specific classification task. In this notebook, we first give an introduction to LLMs, focusing specifically on BERT, and then explore the process of fine-tuning BERT to detect fake news.
Python functions and data files needed to run this notebook are available via this link.
import warnings
warnings.filterwarnings('ignore')
# Import BERT model from transformer library which has a lot of pretrained models
# transformers is HuggingFace library
from transformers import BertTokenizer, BertModel
# BertModel also requires the PyTorch library
import torch
from transformers import BertForMaskedLM, pipeline
from transformers import Trainer, TrainingArguments, DistilBertForSequenceClassification, DistilBertTokenizerFast, \
DataCollatorWithPadding, pipeline
import numpy as np
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer
from datasets import load_metric, Dataset
import pandas as pd
import matplotlib.pyplot as plt
# Load the pretrained BERT-base model with 12 encoder layers and ~110M parameters
model_BERT_base = BertModel.from_pretrained('bert-base-uncased')
BERT, T5, and GPT are prominent large language models (LLMs) created by Google, Google, and OpenAI, respectively. Despite sharing the Transformer as a common foundation, these models, along with variants like RoBERTa, BART, and ELECTRA, exhibit distinct architectures. BERT, specifically designed as an autoencoding model, utilizes attention mechanisms to construct bidirectional sentence representations, excelling in tasks like sentence and token classification. While BERT doesn't directly classify text or summarize documents, its efficiency in processing extensive text corpora quickly has made it a widely adopted pre-trained model for downstream natural language processing tasks, solidifying its status as a cornerstone in the development of advanced language models within the NLP community.
There are three types of language model used to train a model to predict a missing word:
1. Auto-regressive
Auto-regressive language models predict a missing word in a sentence given either the past tokens or the future tokens, but not both of them; they perform forward or backward prediction. A familiar example is a phone's sentence auto-completion. The **GPT** family falls into this category.
2. Auto-encoding
Auto-encoding language models strive to gain a comprehensive understanding of entire sequences of tokens given all available context (both past and future tokens). This is great for natural language understanding tasks like sequence classification and named entity recognition; **BERT** is an example. A short sketch contrasting these first two objectives with Hugging Face pipelines follows this list.
See the Figure below for the difference. Image retrieved from Sinan Ozdemir
3. Combinations of autoregressive and autoencoding, like T5, which can use the encoder and decoder to be more versatile and flexible in generating text. It has been shown that these combination models can generate more diverse and creative text in different contexts compared to pure decoder-based autoregressive models due to their ability to capture additional context using the encoder.
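To make the contrast concrete, here is a minimal sketch using Hugging Face pipelines (the model names gpt2 and bert-base-uncased are illustrative choices, not the only options): an auto-regressive model continues a prompt using only the left context, while an auto-encoding model fills a masked token using both sides.
from transformers import pipeline
# Auto-regressive (GPT family): continue a prompt using only the left (past) context
generator = pipeline("text-generation", model="gpt2")
print(generator("If you don't know how to swim, you will", max_new_tokens=5)[0]["generated_text"])
# Auto-encoding (BERT family): fill in a masked token using both left and right context
filler = pipeline("fill-mask", model="bert-base-uncased")
print(filler("If you don't know how to swim, you will [MASK] in this lake.")[0]["token_str"])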
Transfer learning is a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task. To put it simply—a model trained on one task is repurposed on a second, related task as an optimization that allows rapid progress when modeling the second task.
In NLP, transfer learning is achieved by first pre-training a model on an unlabeled text corpus in an unsupervised or semi-supervised manner, and then fine-tuning (updating) the model on a smaller labeled dataset for a specific NLP task. Training only on the smaller dataset, without pre-training, would not achieve comparably high performance. For NLP we can start from BERT, and for image classification we can start from a pre-trained ResNet.
Image retrieved from Sinan Ozdemir
For example, BERT has been pre-trained on two main corpora: English Wikipedia (2.5B words) and BookCorpus (800M words), a collection of free books. BERT went through these resources multiple times to gain a general understanding of language.
During fine-tuning, the BERT weights learned in pre-training are updated, and a separate task-specific layer is added on top of the BERT model.
So, there are three fine-tuning approaches:
Images are retrieved from Sinan Ozdemir
Processes 1 to 3 are repeated until we are satisfied with the model's performance. This manual process can be tedious. See the Figure below:
Image retrieved from Sinan Ozdemir
To address the problem above, we can use HuggingFace's Trainer API for our training loop. It wraps the entire loop above, including the loss, gradient calculation, and optimization, in a single API called Trainer. Image retrieved from Sinan Ozdemir
The key objects are:
BERT stands for Bidirectional Encoder Representations from Transformers:
A sentence is fed into BERT to get a **context-full** representation (vector embedding) of every word in the sentence. The encoder understands the context of each word using a multi-head attention mechanism (relating each word to every other word in the sentence).
BERT comes in different sizes. The base model has 12 encoders, which is a good balance of complexity, size, and speed. BERT-small has 4 encoders and BERT-large has 24 encoders.
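As a quick sanity check (a minimal sketch using the model we loaded above), the architecture sizes can be read straight off the model's config:
# Read the architecture sizes from the loaded bert-base model's config
print(model_BERT_base.config.num_hidden_layers)    # 12 encoder layers in bert-base
print(model_BERT_base.config.hidden_size)          # 768-dimensional hidden states
print(model_BERT_base.config.num_attention_heads)  # 12 attention heads per encoder layer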
# Model's parameters
n_params = list(model_BERT_base.named_parameters())
print(f'The BERT model has {len(n_params)} named parameter tensors')
print('********* Embedding Layer *********\n')
for par in n_params[0:5]:
print(f'{par[0], str(tuple(par[1].size()))}')
embeddings.word_embeddings.weight
: (30522, 768) means there are 30522 tokens that BERT is aware of that can be used for any NLP task; 768 represents the fact that each token has a contextless embedding dimension of 768.
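As a minimal sketch, we can pull that embedding matrix out of the model, confirm its shape, and look at the context-less vector of a single (arbitrary) token id:
# Inspect the context-less word-embedding matrix directly
emb_matrix = model_BERT_base.embeddings.word_embeddings.weight
print(emb_matrix.shape)        # torch.Size([30522, 768]): vocabulary size x embedding dimension
print(emb_matrix[2000].shape)  # the 768-dimensional context-less embedding of an arbitrary token id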
print('********* First Encoder ********* \n')
for par in n_params[5:21]:
print(f'{par[0], str(tuple(par[1].size()))}')
print('********* Output Layer ********* \n')
for par in n_params[-2:]:
print(f'{par[0], str(tuple(par[1].size()))}')
pooler
: is a separate feed-forward network with a hyperbolic tangent (tanh) activation function. When we use BERT, the pooler takes the vector embedding of the token that represents the entire sentence (the [CLS] token), rather than that of a particular word.
# load the bert-base uncased tokenizer.
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
# tokenize a sequence
tokenizer_sentence=tokenizer_bert.encode('AI has been my friend')
tokenizer_sentence
We always have token 101 at the start, which is [CLS], and 102 at the end, which is [SEP]. These are added automatically by the tokenizer.
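A quick sketch to verify this: converting the ids back to tokens shows the special tokens the tokenizer added.
# Map the ids back to tokens; the first should be [CLS] and the last [SEP]
print(tokenizer_bert.convert_ids_to_tokens(tokenizer_sentence))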
We can run these tokens through the model:
# running the tokens through the model
response = model_BERT_base(torch.tensor(tokenizer_sentence).unsqueeze(0))
The code above adds a batch dimension with unsqueeze(0) and passes the tokens through our BERT model, which produces several outputs.
# Embedding for each token
response.last_hidden_state
Each row represents a token in the sequence, and each vector represents that token's context within the greater sequence. As mentioned before, the first row is the [CLS] token.
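A small sketch to confirm the shape, which should be (batch_size, sequence_length, hidden_size):
# One row (vector of size 768) per token, including [CLS] and [SEP]
print(response.last_hidden_state.shape)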
# The size of pooler_output
response.pooler_output.shape
pooler_output
is meant to be representative of the entire sequence as a whole, not just an individual token. The size of pooler_output
is (1, 768): one 768-dimensional vector for the whole sequence.
model_BERT_base.pooler
The model's pooler is a feed-forward network with a Tanh activation.
# Get the final encoder's representation. The first element of the second dimension is the CLS token
CLS_embedding = response.last_hidden_state[:, 0, :].unsqueeze(0)  # the second dimension holds all of the tokens
CLS_embedding.shape
# put CLS_embedding through model's pooler
model_BERT_base.pooler(CLS_embedding).shape
The first dimension is our batch size, which is still 1, and 768 is the final embedding dimension of the model. This tensor is a vector representation of the entire sequence at large.
(model_BERT_base.pooler(CLS_embedding) == response.pooler_output).all()
Running the embedding for CLS through the pooler gives the same output as the pooler_output
tot_prms = 0
for par in model_BERT_base.parameters():  # iterate over the parameter tensors
    if len(par.shape) == 2:
        tot_prms += par.shape[0] * par.shape[1]  # multiply the matrix dimensions together and add them to our total
print(f'BERT has a total of {tot_prms:,} learnable parameters.')
print('This is how we arrive at the ~110M learnable parameters of BERT')
print(f"""Only {30522 * 768:,} of them are context-less word embeddings. The rest of the parameters are spread over the
encoders, especially in the attention calculations""")
Through pre-training, BERT gains a general understanding of:
Whatever BERT has learned can be used to solve a specific NLP problem by fine-tuning the model.
Fine-tuning works by first feeding a sentence to a pre-trained BERT. The [CLS] token has been pre-trained on the next-sentence-prediction task through the pooler attribute. We then add another feed-forward layer after the pooler and train it to map the pooled representation to the number of sequence classes we want. For the classification problem shown in the Figure below, we do not care about the representation of each individual token after passing our sentence through BERT; we classify the entire sequence with a single label.
For token classification, however, we need to consider the representation of each token and pass each of them through a feed-forward layer to classify every token against the labels we have. The classic example of this is Named Entity Recognition.
Question answering is the most difficult fine-tuning task. We have a question and a context passage that contains the answer to the question. We pass the entire sequence, question plus context, to the pre-trained BERT. Similar to token classification, we add a layer on top of every single token, and what we predict is whether or not that specific token represents the start or the end of the answer to the question:
When fine-tuning BERT to solve NLP tasks, we can use three built-in classes from the transformers library: BertForQuestionAnswering, BertForTokenClassification, and BertForSequenceClassification. These pre-trained classes come with Hugging Face; a minimal sketch of instantiating them is shown below.
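Here is a minimal sketch of loading the three task heads from the same pre-trained checkpoint; the num_labels values are illustrative only.
from transformers import BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering
# Each class adds a different task-specific head on top of the same pre-trained encoder
seq_clf = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # one label per sequence
tok_clf = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=5)     # one label per token
qa_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')                    # start/end of an answer span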
BERT has several derivative architectures, each with its own advantages and drawbacks. The three most popular flavors are:
Each of these flavors aims to enhance BERT by altering its architecture and/or how it was pre-trained.
Each flavor comes with pros and cons. From an academic standpoint, ALBERT is appealing because of ideas like sentence-order prediction (SOP) and factorized embeddings, which borrow classical machine learning techniques to speed up and optimize performance. For production and real-world use, DistilBERT is often the better choice because it delivers most of the performance we are looking for in a much smaller package and is easier to deploy on the cloud.
Here are examples of BERT flavors:
nlp = pipeline("fill-mask", model='bert-base-cased')
print(type(nlp.model))
preds = nlp(f"If you don’t know how to swim, you will {nlp.tokenizer.mask_token} in this lake.")
print('If you don’t know how to swim, you will .... in this lake.')
for p in preds:
print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
Now run the same task with the roberta-base flavor.
nlp = pipeline("fill-mask", model='roberta-base') # Using a flavor of BERT called Roberta
preds = nlp(f"If you don’t know how to swim, you will {nlp.tokenizer.mask_token} in this lake.")
print('If you don’t know how to swim, you will .... in this lake.')
for p in preds:
print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
Now run the same task with the DistilRoBERTa flavor:
nlp = pipeline("fill-mask", model='distilroberta-base') # Using a flavor of BERT called Distilroberta
print(type(nlp.model))
preds = nlp(f"If you don’t know how to swim, you will {nlp.tokenizer.mask_token} in this lake.")
print('If you don’t know how to swim, you will .... in this lake.')
for p in preds:
print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
Now run the same model with DistilBERT flavor.
nlp = pipeline("fill-mask", model='distilbert-base-cased') # Using a flavor of BERT called DistilBERT
preds = nlp(f"If you don’t know how to swim, you will {nlp.tokenizer.mask_token} in this lake.")
print('If you don’t know how to swim, you will .... in this lake.')
for p in preds:
print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
We can use these models largely interchangeably; if fine-tuning one model gives an error, we can try another.
Import the training components: we use the DistilBert flavor for SequenceClassification because of its speed. DistilBertTokenizerFast is likewise used for faster tokenization. A Collator is used to create batches of data for the training pipeline. The last import is the pipeline object from Hugging Face, which here we use to run our own fine-tuned models.
From the datasets library, a companion to Transformers, we import load_metric, which lets us create custom metrics while evaluating and training our pipelines (in newer versions this is superseded by the evaluate library, which we use below), and the Dataset object, our general container for all of our data points.
Fake News Data Set
The Fake News data set is used to fine-tune the BERT model. The data can be downloaded from Kaggle.
The Figure below shows a schematic illustration of fine-tuning a BERT model for fake news detection
data_news = pd.read_csv('fake_or_real_news.csv')
print(data_news.shape)
data_news[:15]
# This code segment parses the data_news dataset into a more manageable format
sequence_labels = data_news['label']
title, tokenized_title = [], []
for news in data_news['title']:
    title.append(news)                       # the raw title string
    tokenized_title.append(news.split(' '))  # whitespace-split tokens
# Python lists for each news item
title[0], tokenized_title[0], sequence_labels[0]
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
unique_sequence_labels = list(set(sequence_labels))
unique_sequence_labels
There are two categories to predict.
sequence_labels = [unique_sequence_labels.index(l) for l in sequence_labels]
print(f'There are {len(unique_sequence_labels)} unique sequence labels')
Our final Python lists look like this:
print(tokenized_title[0])
print(title[0])
print(sequence_labels[0])
print(unique_sequence_labels[sequence_labels[0]])
🤗 Datasets provides many tools for modifying the structure and content of a dataset. These tools are important for tidying up a dataset, creating additional columns, converting between features and formats, and much more.
After collecting all of the data, we put it into a Dataset object. Then we can create a train-test split with train_test_split.
news_dataset = Dataset.from_dict(
dict(
titles=title,
label=sequence_labels,
tokens=tokenized_title,
)
)
news_dataset = news_dataset.train_test_split(test_size=0.2)
news_dataset
Here is the first element of our training set:
news_dataset['train'][0]
Next, we instantiate the tokenizer with DistilBertTokenizerFast from 'distilbert-base-uncased'. FYI, uncased means that upper versus lower case does not matter.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Create a pre-processing function that takes in a batch of titles and tokenizes them with DistilBertTokenizerFast. Why tokenize the titles if we already have tokens? Because the tokens given to us will not necessarily match the tokenized version BERT expects (a quick comparison is shown after the function below).
def preprocess_function(examples):
    return tokenizer(examples["titles"], truncation=True)  # truncation=True truncates any input longer than
                                                            # the model's 512-token limit
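To see why re-tokenizing matters, here is a quick sketch comparing the whitespace split we built earlier with the tokenizer's own sub-word output for the same title:
# The whitespace split rarely matches DistilBERT's WordPiece tokenization
print(tokenized_title[0])            # tokens from splitting on spaces
print(tokenizer.tokenize(title[0]))  # sub-word tokens produced by the DistilBERT tokenizer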
Map the tokenizer function for the entire data set:
# go over the entire dataset and tokenize it
seq_clf_tokenized_news = news_dataset.map(preprocess_function, batched=True)
seq_clf_tokenized_news
Looking at the first item, we also have input_ids and attention_mask. These are the items our model will need.
seq_clf_tokenized_news['train'][0]
DataCollatorWithPadding creates batches of data. It also dynamically pads text to the length of the longest element in the batch (on the right), making all elements the same length. While it is possible to pad your text in the tokenizer function with padding=True, dynamic padding is more efficient and makes the training process faster.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
The Data Collator pads the data so that all examples have the same input length. The attention mask is how we ignore attention scores for the padding tokens.
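As a minimal sketch of what the collator does, we can hand it two tokenized examples of different lengths (the indices 0 and 1 are arbitrary) and inspect the padded batch:
# Build a tiny batch from two tokenized examples, keeping only the fields the model needs
features = [{k: seq_clf_tokenized_news['train'][i][k] for k in ('input_ids', 'attention_mask', 'label')}
            for i in (0, 1)]
batch = data_collator(features)
print(batch['input_ids'].shape)  # (2, length of the longest example in this batch)
print(batch['attention_mask'])   # zeros (if any) mark the padded positions that attention should ignore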
It is now time to create our actual model.
sequence_clf_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',
num_labels=len(unique_sequence_labels),)
# set an index -> label dictionary
sequence_clf_model.config.id2label = {i: l for i, l in enumerate(unique_sequence_labels)}
Every model comes with a config. In this config there is an id2label attribute, which is a dictionary with integers as keys and label strings as values. See below:
sequence_clf_model.config
sequence_clf_model.config.id2label[0]
Now it is time to define a custom metric. HuggingFace uses loss as the default performance metric, but we also want accuracy as a simpler, more interpretable metric.
import evaluate
from sklearn.metrics import roc_auc_score

# note: datasets.load_metric is deprecated; the evaluate library provides the same accuracy metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):  # common pattern: take in logits and calculate accuracy on the eval set
    logits, labels = eval_pred   # logits and labels are returned from the training loop
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)  # compute the accuracy
#####################################################
def compute_metrics_binary(eval_pred):
"""metrics for binary classification"""
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
# Calculate the AUC score
auc_score = roc_auc_score(labels, preds)
# Calculate the accuracy, true positive, false positive, false negative, and true negative values
acc = metric.compute(predictions=preds, references=labels)
tp = ((preds >= 0.5) & (labels == 1)).sum()
fp = ((preds >= 0.5) & (labels == 0)).sum()
fn = ((preds < 0.5) & (labels == 1)).sum()
tn = ((preds < 0.5) & (labels == 0)).sum()
# Calculate the precision, recall, and F1 score
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)
    return {
        'Validation Accuracy': acc['accuracy'],
        'Validation Precision': precision,
        'Validation AUC': auc_score,
        'Validation Recall': recall,
        'Validation F1_Score': f1_score,
        'Validation TP': tp,
        'Validation FP': fp,
        'Validation FN': fn,
        'Validation TN': tn,
    }
#####################################################
from sklearn.metrics import classification_report
def compute_metrics_multiclass(eval_pred):
"""metrics for multiclass classification"""
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
report = classification_report(labels, preds, output_dict=True)
acc_score = report['accuracy']
pre_score = report['macro avg']['precision']
rcl_score = report['macro avg']['recall']
f1_score = report['macro avg']['f1-score']
return {
'Validation Accuracy': acc_score,
'Validation Macro Recall': rcl_score,
'Validation Macro Precision': pre_score,
'Validation Macro F1_Score': f1_score,
}
We take the pre-trained knowledge of BERT and transfer it to our supervised data set, without training for too many epochs. The code block below will appear again later because it defines our training loop.
epochs = 2
# Training argument
training_args = TrainingArguments(
    output_dir="./news_clf/results",  # local directory to save checkpoints of our model during fitting
    num_train_epochs=epochs,          # number of training epochs (2 here)
    per_device_train_batch_size=32,   # batch size for training; around 32 is common,
    per_device_eval_batch_size=32,    # sometimes less or more. The smaller the batch, the more often the model updates
    load_best_model_at_end=True,      # even if we accidentally overfit, reload the best checkpointed model at the end

    # some deep learning parameters that the Trainer is able to take in
    warmup_steps=len(seq_clf_tokenized_news['train']) // 5,  # number of warmup steps for the learning rate scheduler
    weight_decay=0.05,                # weight decay (regularization)
    logging_steps=1,                  # minimum number of steps between logs (1 means log as often as possible)
    log_level='info',
    evaluation_strategy='epoch',      # either "steps" or "epoch"; we evaluate at the end of each epoch
    eval_steps=50,
    save_strategy='epoch'             # save a checkpoint of our model after each epoch
)
# Define the trainer:
trainer = Trainer(
    model=sequence_clf_model,                       # our model (sequence_clf_model)
    args=training_args,                             # the arguments we just set above
    train_dataset=seq_clf_tokenized_news['train'],  # training split of the dataset
    eval_dataset=seq_clf_tokenized_news['test'],    # test (evaluation) split of the dataset
    compute_metrics=compute_metrics_binary,         # optional, but we want our custom metrics for the model
    data_collator=data_collator                     # data collator with padding; strictly speaking we may or may not
                                                    # need it, depending on how the batches look without it
)
Before we start training, we can run the trainer on the not-yet-fine-tuned model to measure its baseline performance.
# Get initial metrics: evaluation on test set
trainer.evaluate()
We hope the initial loss and accuracy will improve after training. Since we have not fine-tuned the model yet, the metrics reflect random guessing: the feed-forward layer on top of the model has not been updated yet.
trainer.train()
trainer.evaluate()
# make a pipeline by passing in our fine-tuned model together with its tokenizer
pipe = pipeline("text-classification", model=sequence_clf_model, tokenizer=tokenizer)
pipe('Please add Here We Go by Dispatch to my road trip playlist')
# We can save our model to the directory we specified
trainer.save_model()
We can easily build the pipeline directly from that directory. This is very useful for deploying our model to the cloud with one line of code, and it can be used in exactly the same way to get exactly the same results.
pipe = pipeline("text-classification", "./news_clf/results", tokenizer=tokenizer)
text = 'The Battle of New York: Why This Primary Matters'
pipe(text)
text = """Breaking News: Researchers have discovered a new species of dinosaur that
can breathe fire. The creature, named Pyrodino, is believed to have lived
during the Jurassic period and could shoot flames out of its nostrils,
making it one of the deadliest predators of its time."""
pipe(text)
Up to now we have updated all of the parameters, which is why training takes so long. Below, we freeze the entire BERT model except for the classification layers. This is our third fine-tuning option: freeze the whole pre-trained model and train only the layers added on top of it.
We are going to freeze every parameter of the underlying DistilBERT. The easiest way is to iterate over distilbert.parameters() and set requires_grad to False, so that only the pre_classifier and classifier layers are updated.
for param in sequence_clf_model.distilbert.parameters():
    param.requires_grad = False  # "False" means the parameter cannot be updated ("grad" stands for gradient),
                                 # so it never changes during training
By running the code above, the only layers allowed to update are:
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False)
This leads to much faster training, though it often yields somewhat worse results.
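A quick sketch to verify the effect of freezing: count how many parameters still require gradients.
# Only the pre_classifier and classifier layers should remain trainable
trainable = sum(p.numel() for p in sequence_clf_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in sequence_clf_model.parameters())
print(f'{trainable:,} of {total:,} parameters will be updated during fine-tuning')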
epochs = 2
# Training argument
training_args = TrainingArguments(
    output_dir="./news_clf/results",  # local directory to save checkpoints of our model during fitting
    num_train_epochs=epochs,          # number of training epochs (2 here)
    per_device_train_batch_size=32,   # batch size for training; around 32 is common,
    per_device_eval_batch_size=32,    # sometimes less or more. The smaller the batch, the more often the model updates
    load_best_model_at_end=True,      # even if we accidentally overfit, reload the best checkpointed model at the end

    # some deep learning parameters that the Trainer is able to take in
    warmup_steps=len(seq_clf_tokenized_news['train']) // 5,  # number of warmup steps for the learning rate scheduler
    weight_decay=0.05,                # weight decay (regularization)
    logging_steps=1,                  # minimum number of steps between logs (1 means log as often as possible)
    log_level='info',
    evaluation_strategy='epoch',      # either "steps" or "epoch"; we evaluate at the end of each epoch
    eval_steps=50,
    save_strategy='epoch'             # save a checkpoint of our model after each epoch
)
# Define the trainer:
trainer = Trainer(
    model=sequence_clf_model,                       # our model (now with the DistilBERT encoder frozen)
    args=training_args,                             # the arguments we just set above
    train_dataset=seq_clf_tokenized_news['train'],  # training split of the dataset
    eval_dataset=seq_clf_tokenized_news['test'],    # test (evaluation) split of the dataset
    compute_metrics=compute_metrics_binary,         # optional, but we want our custom metrics for the model
    data_collator=data_collator                     # data collator with padding; strictly speaking we may or may not
                                                    # need it, depending on how the batches look without it
)
trainer.evaluate()
trainer.train()
trainer.evaluate()
For this data set, freezing the parameters led to higher performance and a faster training time. In general, however, updating the entire model runs more slowly but usually yields higher performance. There is also a middle ground where we freeze only part of the model (a short sketch follows below). The best practice is to try the different approaches to updating the model's parameters and choose the one that gives the highest performance.
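As a sketch of that middle ground (assuming a freshly re-loaded sequence_clf_model whose parameters have not already been frozen above), we could freeze the embeddings and the first few DistilBERT blocks while leaving the rest trainable:
# Freeze the embeddings and the first 4 of DistilBERT's 6 transformer blocks;
# the last 2 blocks and the classification head stay trainable
for param in sequence_clf_model.distilbert.embeddings.parameters():
    param.requires_grad = False
for block in sequence_clf_model.distilbert.transformer.layer[:4]:
    for param in block.parameters():
        param.requires_grad = False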