Summary

Large language models (LLMs), rooted in the Transformer architecture, are specialized AI models trained on extensive text data to understand and generate human language, code, and more. These models exhibit remarkable accuracy and versatility, excelling in tasks from text classification to the generation of fluent and stylistically nuanced content.

BERT, or Bidirectional Encoder Representations from Transformers, involves utilizing pre-trained BERT models to analyze and classify sequences of text. BERT excels in capturing contextual information by considering both left and right context for each word in a sentence. Fine-tuning BERT for sequence classification involves adding a classification layer on top of the pre-trained BERT model and training it on a labeled dataset. This allows BERT to adapt its representations to the specific classification task. In this notebook, we initially delve into the introduction of LLM, focusing specifically on BERT, and explore the process of fine-tuning BERT to detect fake news.

Python functions and data files needed to run this notebook are available via this link.

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Import BERT model from transformer library which has a lot of pretrained models
# transformers is HuggingFace library
from transformers import BertTokenizer, BertModel
# BertModel also requires the PyTorch library
import torch
from transformers import BertForMaskedLM, pipeline
from transformers import Trainer, TrainingArguments, DistilBertForSequenceClassification, DistilBertTokenizerFast, \
     DataCollatorWithPadding, pipeline
import numpy as np
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer
from datasets import load_metric, Dataset
import pandas as pd
import matplotlib.pyplot as plt

# Load pretrained BERT-base model with 12 encoder and 110M parameters
model_BERT_base = BertModel.from_pretrained('bert-base-uncased')
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Introduction

History of Natural Language Processing (NLP)

  • NLP started in 2001 with the first neural language model using a feed forward neural network to predict future words in a sentence
  • In 2013, the Word2vec problem of creating word embeddings. However, these word embeddings showed bias
  • In 2013 and 2014, the rise of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) started for natural language processing tasks, using an encoder to reads a variable length and a decoder to generate a variable length sequence output
  • In 2015, attention mechanisms started to be used in NLP tasks and became the dominant way to perform NLP tasks over the past decade. Attention is designed to understand relationships between different elements in a sequence
  • In 2017, the transformer was introduced in the paper Attention is all you need using “self-attention” to replace RNNs and used as the primary mechanism to pre-trained language models like Facebook’s BART, Google’s BERT, OpenAI’s GPT (ChatGPT)

BERT, T5, and GPT are prominent large language models (LLMs) created by Google, Google, and OpenAI, respectively. Despite sharing the Transformer as a common foundation, these models, along with variants like RoBERTa, BART, and ELECTRA, exhibit distinct architectures. BERT, specifically designed as an autoencoding model, utilizes attention mechanisms to construct bidirectional sentence representations, excelling in tasks like sentence and token classification. While BERT doesn't directly classify text or summarize documents, its efficiency in processing extensive text corpora quickly has made it a widely adopted pre-trained model for downstream natural language processing tasks, solidifying its status as a cornerstone in the development of advanced language models within the NLP community.

How a Text is Processed

There are three types of language models to train a model to predict a missing word:

1. Auto-regressive

Auto-regressive language models predict a missing word in a sentence given past tokens and future tokens but not both of them. It includes forward and backward prediction. Example is our phone auto completion sentence. This is categorized as **GPT** family.

2. Auto-encoding.

Auto-encoding language models strive to gain a comprehensive understanding of entire sequences of tokens given all possible context (past and future tokens are given). This is great for natural language understanding tasks like sequence classification and named entity recognition and **BERT** is an example.

See Figure below die difference: image.png Image retrieved from Sinan Ozdemir

3. Combinations of autoregressive and autoencoding, like T5, which can use the encoder and decoder to be more versatile and flexible in generating text. It has been shown that these combination models can generate more diverse and creative text in different contexts compared to pure decoder-based autoregressive models due to their ability to capture additional context using the encoder.

Transfer Learning

Transfer learning is a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task. To put it simply—a model trained on one task is repurposed on a second, related task as an optimization that allows rapid progress when modeling the second task.

In NLP, transfer learning is achieved by first pre-training a model on an unlabeled text corpus in an unsupervised or semi-supervised manner, and then fine-tuning (update) the model on a smaller labeled dataset for a specific NLP task. If training is simply applied on smaller dataset without pre-training, it would not be possible to get high performance results. For example for NLP, we use BERT and for image classification we can use Resnet images

image.png Image retrieved from Sinan Ozdemir

For example, BERT has been pretrained on two main Corpora: English Wikipedia (2.5B words) and BookCorpus (800M words) which are free books. BERT went through all of these resources multiple times to gain a general understanding.

The BERT weights learnt from pretrained model are fine-tined. Moreover, a separate layer will be added on top of BERT model.

So, there are three fine-tuning approaches:

  1. Add any additional layers on top while updating the entire whole model on labeled data. This is a common approach. All aspect of the model will be updated. See Figure below. This is usually the slowest but has the highest performance.

image.png

  1. Freeze some part of model. For example, keeping some weights of BERT model unchanged while updating other weights. This approach has average speed and average speed.

image-2.png

  1. Freeze the entire model and only train the additional layers that we added on top which are feed forward classifiers. This is the fastest approach but has the worst performance. This can be only used for generic tasks.

image-3.png Images are retrieved from Sinan Ozdemir

Fine-tuning transformer with Native Pytorch

  1. Use training data to update a model
  2. The model compute loss function to indicate how right or wrong a model is for predicting the training data
  3. Compute gradients to optimize weights which leads to update model

Process 1 to 3 are repeated until we are satisfy with our model performance. This is a manual processing that can be tedious. See Figure below:

image.png Image retrieved from Sinan Ozdemir

Fine-tuning with HuggingFace's Trainer

To address the problem above, we can use HuggingFace's Trainer API in our training loop. It takes the entire loop above including loss, gradient calculation and optimization all of them are calculated in a single API called trainer. image.png Image retrieved from Sinan Ozdemir

The key objects are:

  • Dataset- Split data set to training set and test set
  • DataCollector- Convert data set to multiple batches
  • TrainingArguments- Monitor and track training arguments including saving strategy and Scheduler parameters
  • Trainer- API to the Pytorch

NLP with BERT

BERT stands for Bi-directional Encoder Representation from Transformers:

  • Bi-directional: Auto-encoding language model.
  • Encoder: Only encoder is used from transformer.
  • Representation: Counting on self-attention.
  • Transformers: transformer is the source to retrieve encoder.

A sentence is fed into BERT to get a **context-full** representation (vector embedding) of every word in sentence. The context of each word in sentence is understood by the encoder using a multi-head attention mechanism (relating each word to every other word in sentence).

BERT comes with different models. The base model has 12 encoders which a good mix of complexity and size and speed of the model. BERT-small has 4 encoders and BERT-large has 24 encoders. image.png

BERT's Architecture

In [2]:
# Model's parameters 
n_params = list(model_BERT_base.named_parameters())
print(f'The BERT model has {len(n_params)} different parameters')
The BERT model has 199 different parameters
In [3]:
print('********* Embedding Layer *********\n')
for par in n_params[0:5]:
    print(f'{par[0], str(tuple(par[1].size()))}')
********* Embedding Layer *********

('embeddings.word_embeddings.weight', '(30522, 768)')
('embeddings.position_embeddings.weight', '(512, 768)')
('embeddings.token_type_embeddings.weight', '(2, 768)')
('embeddings.LayerNorm.weight', '(768,)')
('embeddings.LayerNorm.bias', '(768,)')

embeddings.word_embeddings.weight: (30522, 768) means there are 30522 tokens that BERT is aware of that can be used for any NLP task; 768 represents the fact that each token has a contextless embedding dimension of 768.

In [4]:
print('********* First Encoder ********* \n')
for par in n_params[5:21]:
    print(f'{par[0], str(tuple(par[1].size()))}')
********* First Encoder ********* 

('encoder.layer.0.attention.self.query.weight', '(768, 768)')
('encoder.layer.0.attention.self.query.bias', '(768,)')
('encoder.layer.0.attention.self.key.weight', '(768, 768)')
('encoder.layer.0.attention.self.key.bias', '(768,)')
('encoder.layer.0.attention.self.value.weight', '(768, 768)')
('encoder.layer.0.attention.self.value.bias', '(768,)')
('encoder.layer.0.attention.output.dense.weight', '(768, 768)')
('encoder.layer.0.attention.output.dense.bias', '(768,)')
('encoder.layer.0.attention.output.LayerNorm.weight', '(768,)')
('encoder.layer.0.attention.output.LayerNorm.bias', '(768,)')
('encoder.layer.0.intermediate.dense.weight', '(3072, 768)')
('encoder.layer.0.intermediate.dense.bias', '(3072,)')
('encoder.layer.0.output.dense.weight', '(768, 3072)')
('encoder.layer.0.output.dense.bias', '(768,)')
('encoder.layer.0.output.LayerNorm.weight', '(768,)')
('encoder.layer.0.output.LayerNorm.bias', '(768,)')
In [5]:
print('********* Output Layer ********* \n')
for par in n_params[-2:]:
    print(f'{par[0], str(tuple(par[1].size()))}')    
********* Output Layer ********* 

('pooler.dense.weight', '(768, 768)')
('pooler.dense.bias', '(768,)')

pooler: is a separate feed-forward network with a hyperbolic tanh activation function. When we are using BERT, this pooler is taking the vector embedding of token representing the entire sentence, not a particular token.

In [6]:
# load the bert-base uncased tokenizer.
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
In [7]:
# tokenize a sequence
tokenizer_sentence=tokenizer_bert.encode('AI has been my friend') 
tokenizer_sentence
Out[7]:
[101, 9932, 2038, 2042, 2026, 2767, 102]

We always have token 101 at start which refers to CLS, and 102 at the end which refers to SEP. Those are automatically added to tokenizer.

We can run this token through a model:

In [8]:
# runing tokens via the model
response = model_BERT_base(torch.tensor(tokenizer_sentence).unsqueeze(0))

The code above does:

  1. Convert tokens_with_unknown_words into a tensor with a size (7,))
  2. Simulate batches by unsqueezing a first dimension to a shape of (1, 7)

Passing this through our BERT model leads to many outputs.

In [9]:
# Embedding for each token
response.last_hidden_state
Out[9]:
tensor([[[ 0.0065,  0.0303, -0.1594,  ..., -0.1599,  0.1518,  0.3864],
         [-0.2074, -0.4378,  0.0418,  ..., -0.2403, -0.0033,  0.4402],
         [ 0.2448, -0.3865, -0.2682,  ..., -0.0998,  0.0463,  0.6762],
         ...,
         [ 0.2216,  0.2247,  0.6810,  ...,  0.0474, -0.0571,  0.0918],
         [-0.3868, -0.4962,  0.1083,  ...,  0.7687,  0.1917,  0.4949],
         [ 0.6903,  0.0883, -0.1104,  ...,  0.1298, -0.7293, -0.4013]]],
       grad_fn=<NativeLayerNormBackward0>)

Each row represents a token in sequence and each vector represents that token's context in greater sequence. As mentioned before, first row is CLS.

In [10]:
# The size of pooler_output
response.pooler_output.shape
Out[10]:
torch.Size([1, 768])

pooler_output is meant to be representative of the entire sequence as a whole not just a individual token. The size of pooler_output is our weight matrix

In [11]:
model_BERT_base.pooler
Out[11]:
BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)

Model pooler is feed forward network with Tanh activation.

In [12]:
# Get the final encoder's representation. First elemnt of second dimension is CLS token
CLS_embedding = response.last_hidden_state[:, 0, :].unsqueeze(0) # second dimension holds all of the token
CLS_embedding.shape
Out[12]:
torch.Size([1, 1, 768])
In [13]:
# put CLS_embedding through model's pooler
model_BERT_base.pooler(CLS_embedding).shape
Out[13]:
torch.Size([1, 768])

First dimension is our batch size which is still 1 and 768 is final embedding dimension of our final model. This tensor is a vector representation of the entire sequence at large.

In [14]:
(model_BERT_base.pooler(CLS_embedding) == response.pooler_output).all()
Out[14]:
tensor(True)

Running the embedding for CLS through the pooler gives the same output as the pooler_output

In [15]:
tot_prms = 0
for par in model_BERT_base.parameters(): # Iterate through parameters
    if len(par.shape) == 2:
        tot_prms += par.shape[0] * par.shape[1] # multiply matrecies dimension together and add them to our total parameters
        
print(f'BERT has total of {tot_prms:,} learnable parameters.') 
print(f'This is how to get 110M learnable parameters of BERT') 
BERT has total of 109,360,128 learnable parameters.
This is how to get 110M learnable parameters of BERT
In [16]:
print(f"""There are only {30522* 768} for context-less word embedding. The rest of parameters are scattered over 
      encoders specially out attention calculation""")
There are only 23440896 for context-less word embedding. The rest of parameters are scattered over 
      encoders specially out attention calculation
In [ ]:
 

How to Fine-tune BERT for NLP tasks

BERT achieves general idea about:

  1. Token (words) used in sentences (Masked Language Models)
  2. Sentences are treated in large corpora (Next Sentence Prediction)

Whatever BERT has learned can be used to solve a specific NLP problem by fine-tuning the model.

Fine-tunning works first by feeding a sentence to a pre-trained Bert. CLS has pre-trained on next sentence prediction task through its pooler attribute. We are going to add another feedforward layer after the pooler to train to map it to the number of sequence classes we want. For classification problem as shown in Figure below, we do not care about the representation of each token after passing our sentence to the Bert. For this example, we classy the entire sequence with a label.

image.png

However, for token classification, we need to consider the representation of each token and pas them through feed forward layer to classify each token for each label we have. The classic example of this is Named Entity Recognition.

image-2.png

Question and answering is the most difficult fine-tuning. We have question and subtext is a context that has an answer to the question in it. We pass the entire sequence, question and context, to the pre-trained Bert. Similar to token classification, we will add a layer on top of every single token. What we are predicting is whether or not that specific token represents the start or end the answer of the question: image.png

When fine-tunning BERT to solve NLP tasks, we will be able to utilize three built classes from transformer library especially BertForQuestionAnswering, BertForTokenClassification, BertForSequenceClassification this pre-trained classes brought us with Hugging face.

BERT's Flavor

BERT has several derivative architecture, each has it own advantages and drawbacks. Three most popular flavor are:

  1. RoBERTa
  2. DistilBERT
  3. ALBERT

Each of these flavor to enhance BERT by altering its architecture and /or how it was pre-trained.

RoBERTa

  1. This flavor was created because authors believed that BERT is hugely under-trained
  2. There was not enough data to train BERT, 10 times more training was applied (16GB vs. 160GB)
  3. Model is bigger with 15% more parameters
  4. Next sentence prediction is removed from BERT because the authors claimed there is no use
  5. 4 times more masking task to learn by dynamic masking pattern

Distilled BERT

  1. Distillation is an approach to train a student to mimic a teacher model
  2. 60% faster than BERT with 40% fewer parameters while the performance is approximately the same as BERT

image.png

A Lite BERT (ALBERT)

  1. Has 90% fewer parameters
  2. Factorize token embedding that leads to much smaller token embedding
  3. Parameters across the encoders are shared. The leads to faster updating
  4. It was claimed NSP is useless, the authors developed sentence order prediction (SOP)

Each flavor comes with pros and cons. From academic standpoint, ALBERT is better because of the idea of SOP and factorize embedding using a lot of classical machine learning technique to speed up and optimize the performance. However, for production and real world is Distilled BERT because it provides most obvious data performance we are looking for in much smaller package and it is easier to deploy on the cloud

Here are examples of flavour of BERT:

In [17]:
nlp = pipeline("fill-mask", model='bert-base-cased')

print(type(nlp.model))

preds = nlp(f"If you don’t know how to swim, you will  {nlp.tokenizer.mask_token} in this lake.")

print('If you don’t know how to swim, you will .... in this lake.')

for p in preds:
    print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
<class 'transformers.models.bert.modeling_bert.BertForMaskedLM'>
If you don’t know how to swim, you will .... in this lake.
Token:drown. Score: 72.56%
Token:die. Score: 23.95%
Token:be. Score: 0.63%
Token:drowned. Score: 0.45%
Token:fall. Score: 0.39%

Now run the same model with roberta-base flavor.

In [18]:
nlp = pipeline("fill-mask", model='roberta-base') # Using a flavor of BERT called Roberta

preds = nlp(f"If you don’t know how to swim, you will  {nlp.tokenizer.mask_token} in this lake.")

print('If you don’t know how to swim, you will .... in this lake.')

for p in preds:
    print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
If you don’t know how to swim, you will .... in this lake.
Token: drown. Score: 90.25%
Token: die. Score: 8.34%
Token: perish. Score: 0.36%
Token: survive. Score: 0.18%
Token: starve. Score: 0.17%

Now run the same model with Distil roberta flavor:

In [19]:
nlp = pipeline("fill-mask", model='distilroberta-base') # Using a flavor of BERT called Distilroberta

print(type(nlp.model))

preds = nlp(f"If you don’t know how to swim, you will  {nlp.tokenizer.mask_token} in this lake.")

print('If you don’t know how to swim, you will .... in this lake.')

for p in preds:
    print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
<class 'transformers.models.roberta.modeling_roberta.RobertaForMaskedLM'>
If you don’t know how to swim, you will .... in this lake.
Token: drown. Score: 99.16%
Token: swim. Score: 0.34%
Token: die. Score: 0.23%
Token: perish. Score: 0.03%
Token: be. Score: 0.03%

Now run the same model with DistilBERT flavor.

In [20]:
nlp = pipeline("fill-mask", model='distilbert-base-cased')  # Using a flavor of BERT called DistilBERT

preds = nlp(f"If you don’t know how to swim, you will  {nlp.tokenizer.mask_token} in this lake.")

print('If you don’t know how to swim, you will .... in this lake.')

for p in preds:
    print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
If you don’t know how to swim, you will .... in this lake.
Token:drown. Score: 78.49%
Token:swim. Score: 12.65%
Token:die. Score: 2.16%
Token:stay. Score: 0.66%
Token:float. Score: 0.65%

We can use all these models interchangeably, if fine-tunning one model gives error we can use another model.

In [ ]:
 

Sequence Classification

Import training algorithms: we are using "DistilBert" flavor for SequenceClassification because of speed. DistilBertTokenizerFast is also applied to leverage speed for tokenization. Collator is applied to create batch of data for training pipeline. The last import is pipeline object that we can use Hugging face but here we want to use to run our own fine-tuned models.

The datasets library, a companion to Transformers, we're going to be importing load_metric, allowing us to create our own custom metrics while evaluating and training our pipelines, and the Dataset object, is our general collection holder for all of our data points.

Fake News Data Set

Fake News Data Set

Fake News Data Set is used to fine-tune BERT model. The data can be downloaded from Kaggle.

Figure below shows schematic illustration of fine-tunning a bent model for fake news detection

fake_news_bert.jpg

In [22]:
data_news = pd.read_csv('fake_or_real_news.csv')
print(data_news.shape)
data_news[:15]
(6335, 4)
Out[22]:
Unnamed: 0 title text label
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield, a Shillman Journalism Fello... FAKE
1 10294 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE
2 3608 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said Mon... REAL
3 10142 Bernie supporters on Twitter erupt in anger ag... — Kaydee King (@KaydeeKing) November 9, 2016 T... FAKE
4 875 The Battle of New York: Why This Primary Matters It's primary day in New York and front-runners... REAL
5 6903 Tehran, USA \nI’m not an immigrant, but my grandparents ... FAKE
6 7341 Girl Horrified At What She Watches Boyfriend D... Share This Baylee Luciani (left), Screenshot o... FAKE
7 95 ‘Britain’s Schindler’ Dies at 106 A Czech stockbroker who saved more than 650 Je... REAL
8 4869 Fact check: Trump and Clinton at the 'commande... Hillary Clinton and Donald Trump made some ina... REAL
9 2909 Iran reportedly makes new push for uranium con... Iranian negotiators reportedly have made a las... REAL
10 1357 With all three Clintons in Iowa, a glimpse at ... CEDAR RAPIDS, Iowa — “I had one of the most wo... REAL
11 988 Donald Trump’s Shockingly Weak Delegate Game S... Donald Trump’s organizational problems have go... REAL
12 7041 Strong Solar Storm, Tech Risks Today | S0 News... Click Here To Learn More About Alexandra's Per... FAKE
13 7623 10 Ways America Is Preparing for World War 3 October 31, 2016 at 4:52 am \nPretty factual e... FAKE
14 1571 Trump takes on Cruz, but lightly Killing Obama administration rules, dismantlin... REAL

Segment parses

In [23]:
# This code segment parses the data_news dataset into a more manageable format

titles = []
tokenized_titles = []
sequence_labels = data_news['label']

title, tokenized_title =  [], []
for news in data_news['title']:
    title.append(news)
    tokenized_title.append(news.split(' '))
    
In [24]:
# Python list for each news
title[0], tokenized_title[0], sequence_labels[0]
Out[24]:
('You Can Smell Hillary’s Fear',
 ['You', 'Can', 'Smell', 'Hillary’s', 'Fear'],
 'FAKE')
In [25]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
In [26]:
unique_sequence_labels = list(set(sequence_labels))
unique_sequence_labels
Out[26]:
['FAKE', 'REAL']

There are two categories to predict.

In [27]:
sequence_labels = [unique_sequence_labels.index(l) for l in sequence_labels]

print(f'There are {len(unique_sequence_labels)} unique sequence labels')
There are 2 unique sequence labels

Our final python list is going to be something like this:

In [28]:
print(tokenized_title[0])
print(title[0])
print(sequence_labels[0])
print(unique_sequence_labels[sequence_labels[0]])
['You', 'Can', 'Smell', 'Hillary’s', 'Fear']
You Can Smell Hillary’s Fear
0
FAKE

Split Data to Training and Test Set

🤗 Datasets provides many tools for modifying the structure and content of a dataset. These tools are important for tidying up a dataset, creating additional columns, converting between features and formats, and much more.

This guide will show you how to:

  • Reorder rows and split the dataset.
  • Rename and remove columns, and other common column operations.
  • Apply processing functions to each example in a dataset.
  • Concatenate datasets.
  • Apply a custom formatting transform.
  • Save and export processed datasets.

After getting all data, we put it in dataset object. Then we can have train-test split by train_test_split.

In [29]:
news_dataset = Dataset.from_dict(
    dict(
        titles=title, 
        label=sequence_labels,
        tokens=tokenized_title,
    )
)
news_dataset = news_dataset.train_test_split(test_size=0.2)

news_dataset
Out[29]:
DatasetDict({
    train: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 5068
    })
    test: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 1267
    })
})

Here is first element of our training set:

In [30]:
news_dataset['train'][0]
Out[30]:
{'titles': 'City, County Leaders Ask Court To Lift Injunction On Obama Immigration Programs',
 'label': 1,
 'tokens': ['City,',
  'County',
  'Leaders',
  'Ask',
  'Court',
  'To',
  'Lift',
  'Injunction',
  'On',
  'Obama',
  'Immigration',
  'Programs']}

Tokenizer

Next is to instantiate tokenizer with DistilBertTokenizerFast from 'distilbert-base-uncased'. FYI, uncased means lower or upper case of the words do not matter.

In [31]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Create a pre-process function to take in a batch of titles and tokenize them with DistilBertTokenizerFast. The question is why we are tokenizing the title if we already have the tokens. The answer is we do not necessarily know the tokens that has been given to us will match up with tokenized version for BERT.

In [32]:
def preprocess_function(examples):
    return tokenizer(examples["titles"], truncation=True) # truncation=True makes sure to exludes instances with more 
                                                            # 512 tokens

Map the tokenizer function for the entire data set:

In [33]:
# go over all our data set, tokenize them
seq_clf_tokenized_news = news_dataset.map(preprocess_function, batched=True)
In [34]:
news_dataset
Out[34]:
DatasetDict({
    train: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 5068
    })
    test: Dataset({
        features: ['titles', 'label', 'tokens'],
        num_rows: 1267
    })
})

Looking at the first item, we also have input_ids and attention_mask. These are the items we are going to need in our model.

In [35]:
seq_clf_tokenized_news['train'][0]
Out[35]:
{'titles': 'City, County Leaders Ask Court To Lift Injunction On Obama Immigration Programs',
 'label': 1,
 'tokens': ['City,',
  'County',
  'Leaders',
  'Ask',
  'Court',
  'To',
  'Lift',
  'Injunction',
  'On',
  'Obama',
  'Immigration',
  'Programs'],
 'input_ids': [101,
  2103,
  1010,
  2221,
  4177,
  3198,
  2457,
  2000,
  6336,
  22928,
  2006,
  8112,
  7521,
  3454,
  102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Batch of Data

DataCollatorWithPadding creates batch of data. It also dynamically pads text to the length of the longest element in the batch (on the right), making them all the same length. It's possible to pad your text in the tokenizer function with padding=True, dynamic padding is more efficient. This will make the training process faster.

In [36]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Data Collator will pad data so that all examples are the same input length. Attention mask is how we ignore attention scores for padding tokens

Fine-tune Model

It is now time to create our actual model.

In [37]:
sequence_clf_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', 
                                                                         num_labels=len(unique_sequence_labels),)

# set an index -> label dictionary
sequence_clf_model.config.id2label = {i: l for i, l in enumerate(unique_sequence_labels)}
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Every model comes with a config. In this config, there is id2label attribute which is a dictionary that has integer as keys and string as values. See below:

In [38]:
sequence_clf_model.config
Out[38]:
DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "FAKE",
    "1": "REAL"
  },
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}
In [39]:
sequence_clf_model.config.id2label[0]
Out[39]:
'FAKE'

Now it is the time to have a costume metric. HuggingFace always uses loss as performance metric but we need to calculate accuracy as a simpler metric.

In [40]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):  # common method to take in logits and calculate accuracy of the eval set
    logits, labels = eval_pred   # logit and label are returning from training loop
    predictions = np.argmax(logits, axis=-1) 
    return metric.compute(predictions=predictions, references=labels) # compute the accuracy

import evaluate
metric = evaluate.load("accuracy")
from sklearn.metrics import roc_auc_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

#####################################################

def compute_metrics_binary(eval_pred):
    """metrics for binary classification"""
    
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    # Calculate the AUC score
    auc_score = roc_auc_score(labels, preds)

    # Calculate the accuracy, true positive, false positive, false negative, and true negative values
    acc = metric.compute(predictions=preds, references=labels)
    tp = ((preds >= 0.5) & (labels == 1)).sum()
    fp = ((preds >= 0.5) & (labels == 0)).sum()
    fn = ((preds < 0.5) & (labels == 1)).sum()
    tn = ((preds < 0.5) & (labels == 0)).sum()

    # Calculate the precision, recall, and F1 score
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)

    return {
        'Validation Accuracy': acc['accuracy'],
        'Validation Precision': auc_score,
        'Validation AUC': precision,
        'Validation Recall': recall,
        'Validation F1_Score': f1_score,
        'Validation TP': tp,
        'Validation FP': fp,
        'Validation FN': fn,
        'Validation TN': tn,
    }

#####################################################

from sklearn.metrics import classification_report

def compute_metrics_multiclass(eval_pred):
    """metrics for multiclass classification"""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    report = classification_report(labels, preds, output_dict=True)
    acc_score = report['accuracy']
    pre_score = report['macro avg']['precision']
    rcl_score = report['macro avg']['recall']
    f1_score = report['macro avg']['f1-score']

    return {
        'Validation Accuracy': acc_score,
        'Validation Macro Recall': rcl_score,
        'Validation Macro Precision': pre_score,        
        'Validation Macro F1_Score': f1_score,
        }

Take pre-trained knowledge of BERT and transfer that knowledge to our supervised data set by not training too many epochs. The code block below is going to repeat itself again and again because it define our training loop.

In [41]:
epochs = 2

# Training argument
training_args = TrainingArguments(
    output_dir="./news_clf/results", # Local directory to save check point of our model as fitting
    num_train_epochs=epochs,         # minimum of two epochs
    per_device_train_batch_size=32,  # batch size for training and evaluation, it common to take around 32, 
    per_device_eval_batch_size=32,   # sometimes less or more, The smaller batch size, the more change model update 
    load_best_model_at_end=True,     # Even if we overfit the model by accident, load the best model through checkpoint
    
    # some deep learning parameters that the trainer is able to take in
    warmup_steps = len(seq_clf_tokenized_news['train']) // 5,  # learning rate scheduler by number of warmup steps
    weight_decay = 0.05,    # weight decay for our learning rate schedule (regularization)
    
    logging_steps = 1,  # Tell the model minimum number of steps to log between (1 means logging as much as possible)
    log_level = 'info',
    evaluation_strategy = 'epoch', # It is "steps" or "epoch", we choose epoch: how many times to stop training to test
    eval_steps = 50,
    save_strategy = 'epoch'  # save a check point of our model after each epoch
)

# Define the trainer:
trainer = Trainer(
    model=sequence_clf_model,   # take our model (sequence_clf_model)
    args=training_args,         # we just set it above
    train_dataset=seq_clf_tokenized_news['train'], # training part of dataset
    eval_dataset=seq_clf_tokenized_news['test'],   # test (evaluation) part of dataset
    compute_metrics=compute_metrics_binary,    # This part is optional but we want to calculate accuracy of our model 
    data_collator=data_collator         # data colladior with padding. Infact, we may or may not need a data collator
                                        # we can check the model to see how it lookes like with or without the collator
)

Before we start training, we can run the trainer without fine-tune model to measure performance of the model

In [42]:
# Get initial metrics: evaluation on test set
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[40/40 10:14]
Out[42]:
{'eval_loss': 0.6951481103897095,
 'eval_Validation Accuracy': 0.47277032359905286,
 'eval_Validation Precision': 0.47535773046816576,
 'eval_Validation AUC': 0.46954314720812185,
 'eval_Validation Recall': 0.5967741935483871,
 'eval_Validation F1_Score': 0.5255681818181818,
 'eval_Validation TP': 370,
 'eval_Validation FP': 418,
 'eval_Validation FN': 250,
 'eval_Validation TN': 229,
 'eval_runtime': 40.6855,
 'eval_samples_per_second': 31.141,
 'eval_steps_per_second': 0.983}

We hope the initial loss and accuracy will improve after training. Since we have not fine-tuned the model yet, the metric is random guessing. The feed-forward layer on top of the model has not been updated yet.

In [43]:
trainer.train()
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5068
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 318
  Number of trainable parameters = 66955010
[318/318 19:18, Epoch 2/2]
Epoch Training Loss Validation Loss Validation accuracy Validation precision Validation auc Validation recall Validation f1 Score Validation tp Validation fp Validation fn Validation tn
1 0.452600 0.530531 0.753749 0.752190 0.788390 0.679032 0.729636 421 113 199 534
2 0.535100 0.453855 0.803473 0.805420 0.750337 0.896774 0.817046 556 185 64 462

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-159
Configuration saved in ./news_clf/results\checkpoint-159\config.json
Model weights saved in ./news_clf/results\checkpoint-159\pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-318
Configuration saved in ./news_clf/results\checkpoint-318\config.json
Model weights saved in ./news_clf/results\checkpoint-318\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./news_clf/results\checkpoint-318 (score: 0.45385515689849854).
Out[43]:
TrainOutput(global_step=318, training_loss=0.5451465674541282, metrics={'train_runtime': 1161.5944, 'train_samples_per_second': 8.726, 'train_steps_per_second': 0.274, 'total_flos': 80716111646688.0, 'train_loss': 0.5451465674541282, 'epoch': 2.0})
In [44]:
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
[40/40 00:40]
Out[44]:
{'eval_loss': 0.45385515689849854,
 'eval_Validation Accuracy': 0.8034727703235991,
 'eval_Validation Precision': 0.8054195542703295,
 'eval_Validation AUC': 0.7503373819163293,
 'eval_Validation Recall': 0.896774193548387,
 'eval_Validation F1_Score': 0.8170462894930198,
 'eval_Validation TP': 556,
 'eval_Validation FP': 185,
 'eval_Validation FN': 64,
 'eval_Validation TN': 462,
 'eval_runtime': 40.9227,
 'eval_samples_per_second': 30.961,
 'eval_steps_per_second': 0.977,
 'epoch': 2.0}
In [45]:
# make a pipline by passing in our fine-tuned model with tokenizer
pipe = pipeline("text-classification", model=sequence_clf_model, tokenizer=tokenizer)
pipe('Please add Here We Go by Dispatch to my road trip playlist')
Out[45]:
[{'label': 'FAKE', 'score': 0.7830724716186523}]
In [46]:
# We can save our model on drirectory we specified
trainer.save_model()
Saving model checkpoint to ./news_clf/results
Configuration saved in ./news_clf/results\config.json
Model weights saved in ./news_clf/results\pytorch_model.bin

We can easily call our pipline directly from directory. This very useful for deploying our model on the cloud with one line of the code. We can use it with exact same way to get the exact result.

In [47]:
pipe = pipeline("text-classification", "./news_clf/results", tokenizer=tokenizer)
loading configuration file ./news_clf/results\config.json
Model config DistilBertConfig {
  "_name_or_path": "./news_clf/results",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "FAKE",
    "1": "REAL"
  },
  "initializer_range": 0.02,
  "label2id": null,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading configuration file ./news_clf/results\config.json
Model config DistilBertConfig {
  "_name_or_path": "./news_clf/results",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "FAKE",
    "1": "REAL"
  },
  "initializer_range": 0.02,
  "label2id": null,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading weights file ./news_clf/results\pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at ./news_clf/results.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
In [48]:
text = 'The Battle of New York: Why This Primary Matters'
pipe(text)
Out[48]:
[{'label': 'REAL', 'score': 0.9690483808517456}]
In [49]:
text = """Breaking News: Researchers have discovered a new species of dinosaur that 
        can breathe fire. The creature, named Pyrodino, is believed to have lived 
        during the Jurassic period and could shoot flames out of its nostrils, 
        making it one of the deadliest predators of its time."""
pipe(text)
Out[49]:
[{'label': 'FAKE', 'score': 0.9162157773971558}]

Freezing Model

Up to now we updated all parameters, that is why it takes too much time. Below we freeze all our BERT model except for the classification layer. This is our third option that we freeze all our pre-trained model and only train a layer on top of it.

We are going to freeze every parameter in the model. the easiest way to freeze is to iterate over all distilbert.parameters() and make them as False. It only updates pre_classifier

In [50]:
for param in sequence_clf_model.distilbert.parameters():
    param.requires_grad = False   # "False" makes the parameters unable to update. "grad" stands for gradient
                                  # it never upgrade during training        

By running the code above, the only layer allowed to be updated is below:

(pre_classifier): Linear(in_features=768, out_features=768, bias=True)

(classifier): Linear(in_features=768, out_features=2, bias=True)

(dropout): Dropout(p=0.2, inplace=False)

This leads to much faster training but will yield worse result.

In [51]:
epochs = 2

# Training argument
training_args = TrainingArguments(
    output_dir="./news_clf/results", # Local directory to save check point of our model as fitting
    num_train_epochs=epochs,         # minimum of two epochs
    per_device_train_batch_size=32,  # batch size for training and evaluation, it common to take around 32, 
    per_device_eval_batch_size=32,   # sometimes less or more, The smaller batch size, the more change model update 
    load_best_model_at_end=True,     # Even if we overfit the model by accident, load the best model through checkpoint
    
    # some deep learning parameters that the trainer is able to take in
    warmup_steps = len(seq_clf_tokenized_news['train']) // 5,  # learning rate scheduler by number of warmup steps
    weight_decay = 0.05,    # weight decay for our learning rate schedule (regularization)
    
    logging_steps = 1,  # Tell the model minimum number of steps to log between (1 means logging as much as possible)
    log_level = 'info',
    evaluation_strategy = 'epoch', # It is "steps" or "epoch", we choose epoch: how many times to stop training to test
    eval_steps = 50,
    save_strategy = 'epoch'  # save a check point of our model after each epoch
)

# Define the trainer:
trainer = Trainer(
    model=sequence_clf_model,   # take our model (sequence_clf_model)
    args=training_args,         # we just set it above
    train_dataset=seq_clf_tokenized_news['train'], # training part of dataset
    eval_dataset=seq_clf_tokenized_news['test'],   # test (evaluation) part of dataset
    compute_metrics=compute_metrics_binary,    # This part is optional but we want to calculate accuracy of our model 
    data_collator=data_collator         # data colladior with padding. Infact, we may or may not need a data collator
                                        # we can check the model to see how it lookes like with or without the collator
)
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
In [52]:
#epochs = 2
#
## Training argument
#training_args = TrainingArguments(
#    output_dir="./news_clf/results",
#    num_train_epochs=epochs,
#    per_device_train_batch_size=32,
#    per_device_eval_batch_size=32,
#    load_best_model_at_end=True,
#    
#    # some deep learning parameters that the Trainer is able to take in
#    warmup_steps = len(seq_clf_tokenized_news['train']) // 5,  # number of warmup steps for learning rate scheduler,
#    weight_decay = 0.05,
#    
#    logging_steps = 1, 
#    log_level = 'info',
#    evaluation_strategy = 'epoch',
#    eval_steps = 50,
#    save_strategy = 'epoch'
#)
#
## Define the trainer:
#
#trainer = Trainer(
#    model=frozen_sequence_clf_model,
#    args=training_args,
#    train_dataset=seq_clf_tokenized_news['train'],
#    eval_dataset=seq_clf_tokenized_news['test'],
#    compute_metrics=compute_metrics_binary,
#    data_collator=data_collator  # data colladior with padding
#)
#
In [53]:
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
[40/40 04:16]
Out[53]:
{'eval_loss': 0.45385515689849854,
 'eval_Validation Accuracy': 0.8034727703235991,
 'eval_Validation Precision': 0.8054195542703295,
 'eval_Validation AUC': 0.7503373819163293,
 'eval_Validation Recall': 0.896774193548387,
 'eval_Validation F1_Score': 0.8170462894930198,
 'eval_Validation TP': 556,
 'eval_Validation FP': 185,
 'eval_Validation FN': 64,
 'eval_Validation TN': 462,
 'eval_runtime': 40.5608,
 'eval_samples_per_second': 31.237,
 'eval_steps_per_second': 0.986}
In [54]:
trainer.train()
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5068
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 318
  Number of trainable parameters = 592130
[318/318 07:20, Epoch 2/2]
Epoch Training Loss Validation Loss Validation accuracy Validation precision Validation auc Validation recall Validation f1 Score Validation tp Validation fp Validation fn Validation tn
1 0.149300 0.415941 0.816890 0.817985 0.781159 0.869355 0.822901 539 151 81 496
2 0.473200 0.427284 0.818469 0.819329 0.788462 0.859677 0.822531 533 143 87 504

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-159
Configuration saved in ./news_clf/results\checkpoint-159\config.json
Model weights saved in ./news_clf/results\checkpoint-159\pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
Saving model checkpoint to ./news_clf/results\checkpoint-318
Configuration saved in ./news_clf/results\checkpoint-318\config.json
Model weights saved in ./news_clf/results\checkpoint-318\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./news_clf/results\checkpoint-159 (score: 0.4159414768218994).
Out[54]:
TrainOutput(global_step=318, training_loss=0.28454621440771993, metrics={'train_runtime': 441.4054, 'train_samples_per_second': 22.963, 'train_steps_per_second': 0.72, 'total_flos': 80716111646688.0, 'train_loss': 0.28454621440771993, 'epoch': 2.0})
In [55]:
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens, titles. If tokens, titles are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1267
  Batch size = 32
[40/40 00:39]
Out[55]:
{'eval_loss': 0.4159414768218994,
 'eval_Validation Accuracy': 0.8168902920284136,
 'eval_Validation Precision': 0.8179849927706037,
 'eval_Validation AUC': 0.7811594202898551,
 'eval_Validation Recall': 0.8693548387096774,
 'eval_Validation F1_Score': 0.8229007633587787,
 'eval_Validation TP': 539,
 'eval_Validation FP': 151,
 'eval_Validation FN': 81,
 'eval_Validation TN': 496,
 'eval_runtime': 40.8822,
 'eval_samples_per_second': 30.991,
 'eval_steps_per_second': 0.978,
 'epoch': 2.0}

For this data set, freezing parameters lead to higher performance and faster training time . However, when we update the entire model we usually will get slower run but higher performance. There is middle ground that we freeze part of model to see how it will work for us. The best practice is to apply different approaches for updating model's parameters and choose the updated model that gives the highest performance.