Summary
Facebook's BART (Bidirectional and Auto-Regressive Transformers) is a sequence-to-sequence pre-trained language model designed for a wide range of natural language processing tasks. It is pre-trained with a denoising autoencoder objective: the input sequence is corrupted and the model learns to reconstruct the original text. The checkpoint used here, facebook/bart-large-mnli, has additionally been fine-tuned on the MultiNLI (MNLI) dataset, which makes it suitable for zero-shot classification: the model can classify text into unseen classes without task-specific training examples by treating the input as an NLI premise and each candidate label as a hypothesis, then ranking the labels by their entailment scores. BART can also be fine-tuned for downstream tasks such as sentiment analysis, summarization, or translation by training on task-specific datasets; fine-tuning updates the model's parameters on the new task while leveraging the knowledge gained during pre-training. In this notebook, Coronavirus tweets pulled from Twitter and manually tagged with five sentiments ('Positive', 'Neutral', 'Extremely Positive', 'Extremely Negative', 'Negative') are first classified with the BART model as a zero-shot classifier, and the model is then fine-tuned on the labeled data for more accurate sentiment prediction.
Python functions and data files needed to run this notebook are available via this link.
import warnings
warnings.filterwarnings('ignore')
from transformers import pipeline
import numpy as np
import pandas as pd
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.offsetbox import AnchoredText
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from matplotlib.ticker import PercentFormatter
from nltk.tokenize import TweetTokenizer
import nltk
import re
pd.set_option('display.max_colwidth', None)
Zero-shot classification is a machine learning technique employed to categorize or classify data into predefined classes without any previous examples or training data specific to those classes. The term "zero-shot" reflects the model's ability to make predictions for classes it has never encountered during training, thus possessing zero prior exposure to them. While traditional classification models rely on labeled examples for each class they aim to recognize, zero-shot classification trains the model to comprehend relationships between different classes, enabling it to generalize to new classes based on their inherent characteristics or attributes.
The fundamental concept underlying zero-shot classification involves leveraging auxiliary information, such as class descriptions, semantic embeddings, or attributes, to impart high-level knowledge about the classes. This supplementary information assists the model in learning associations between specific features or patterns and particular classes.
Zero-shot classification proves particularly advantageous in scenarios where obtaining labeled data for all potential classes is costly, time-consuming, or impractical. This technique empowers models to generalize effectively to unseen classes, enhancing their adaptability and flexibility.
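As a rough sketch of what happens under the hood of the facebook/bart-large-mnli zero-shot classifier, the input text is used as an NLI premise and each candidate label is wrapped into a hypothesis; the example text and hypothesis template below are illustrative assumptions, not taken from the data set:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

nli_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

premise = "Supermarket shelves are empty because of panic buying."   # made-up example text
hypothesis = "This example is Negative."                             # hypothesis built from a candidate label
inputs = nli_tokenizer(premise, hypothesis, return_tensors='pt', truncation=True)
with torch.no_grad():
    logits = nli_model(**inputs).logits      # order: [contradiction, neutral, entailment]
# probability of entailment vs. contradiction, ignoring the neutral class
entail_prob = torch.softmax(logits[0, [0, 2]], dim=-1)[1].item()
print(f'Entailment probability for the label "Negative": {entail_prob:.2f}')
The zero-shot pipeline used later in this notebook repeats this scoring for every candidate label and normalizes the results into the probabilities it reports.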
The Coronavirus tweet data set is downloaded from Kaggle. The tweets were pulled from Twitter and then manually tagged. Names and usernames have been replaced with codes to avoid privacy concerns. The columns are:
1) Location (London, UK, ...)
2) Tweet At (date, e.g. 16-03-2020)
3) Original Tweet
4) Sentiment (label)
file_name = 'Corona_NLP_train.csv'
train = pd.read_csv(file_name, encoding = "ISO-8859-1")
#
file_name = 'Corona_NLP_test.csv'
test = pd.read_csv(file_name, encoding = "ISO-8859-1")
train[:4]
nltk.download('punkt')
def remove_meaningless(text):
"""
    remove tokens starting with '@', '\n', '\r', '#', or containing 'https://', which carry no meaning for sentiment
"""
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(text)
filtered_tokens = [token for token in tokens if not (token.startswith('@') or token.startswith('\n') or
token.startswith('\r') or token.startswith('#') or
'https://' in token)]
return ' '.join(filtered_tokens)
train['OriginalTweet'] = train.apply(lambda x : remove_meaningless(x['OriginalTweet']), axis=1)
test['OriginalTweet'] = test.apply(lambda x : remove_meaningless(x['OriginalTweet']), axis=1)
train[:4]
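To illustrate what remove_meaningless strips out, here is a quick check on a made-up example (the text below is not from the data set):
sample = "@user Stay home! #StaySafe more info at https://example.com please"
print(remove_meaningless(sample))  # mentions, hashtags and URLs are dropped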
tt = TweetTokenizer()
#
train['n_tokens'] = train.apply(lambda x : len(tt.tokenize(x['OriginalTweet'])), axis=1)
test['n_tokens'] = test.apply(lambda x : len(tt.tokenize(x['OriginalTweet'])), axis=1)
# Remove tweets with 10 tokens or fewer
train = train[train['n_tokens']>10]
train = train.reset_index(drop=True)
test = test[test['n_tokens']>10]
test = test.reset_index(drop=True)
train[:4]
# Since the data set is large for fine-tuning, we select 10000 samples for the training set and 2000 samples for the test set
n_samples = 10000
train = train.iloc[:n_samples].copy()
test = test.iloc[:int(n_samples*0.2)].copy()
The code below shows how to apply zero-shot classification to the Coronavirus tweet data set:
# Load pre-trained bart-large-mnli (zero shot classifier)
pipe_pre_trained = pipeline(model="facebook/bart-large-mnli")
txt = """What #CONVID19 safety measures r being taken by online shopping companies &
their courier partners @amazonIN @Flipkart etc? I fear that shopping packages which
travel vast distances through flights/trains & r handled by many along d way can b
potential #coronavirus carriers??"""
pipe_pre_trained(txt,candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
'Extremely Positive', 'Positive'])
For this text, the actual label is Negative; zero-shot classification predicts it correctly, but with a probability of only 0.42, which indicates the model is not very confident.
Another Example
txt = """It amazes me...I go to the supermarket and everything is gone, people are panic buying,
yet not one single person was wearing a facemask. Buying 200 rolls of toilet paper won't
prevent you from catching #COVID2019. Take proper precautions and stay safe. https://t.co/lNoF0xfsx4"""
pipe_pre_trained(txt,candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
'Extremely Positive', 'Positive'])
For this text, the actual label is Extremely Positive, while zero-shot classification predicts Negative with a probability of 0.35. Therefore, we fine-tune the model on the manually labeled data.
import subprocess
try:
subprocess.check_output('nvidia-smi')
print('Nvidia GPU detected!')
except Exception: # this command not being found can raise quite a few different errors depending on the configuration
print('No Nvidia GPU in system!')
def bargraph(val_ob: [list], title: str, ylabel: str, titlefontsize: int=10, xfontsize: int=5,scale: int=1,
yfontsize: int=8, select: bool= False, fontsizelable: bool= False, xshift: float=-0.1, nsim: int=False
,yshift: float=0.01,percent: bool=False, xlim: list=False, axt: bool=None, color: str='b',sort=True,
ylim: list=False, y_rot: int=0, ytick_rot: int=90, graph_float: int=1,
loc: int =1,legend: int=1) -> None:
""" vertical bargraph """
ax1 = axt or plt.axes()
tot = len(val_ob)
miss_p_ob = (len(val_ob[pd.isnull(val_ob)])/tot)*100
n_nonmis_ob = len(val_ob[~pd.isnull(val_ob)])
con = np.array(val_ob.value_counts())
len_ = len(con)
if len_ > 10: len_ = 10
cats = list(val_ob.value_counts().keys())
val_ob = con[:len_]
clmns = cats[:len_]
# Sort counts
if sort:
sort_score = sorted(zip(val_ob,clmns), reverse=True)
Clmns_sort = [sort_score[i][1] for i in range(len(clmns))]
sort_score = [sort_score[i][0] for i in range(len(clmns))]
else:
Clmns_sort = clmns
sort_score = val_ob
index1 = np.arange(len(clmns))
if (select):
Clmns_sort=Clmns_sort[:select]
sort_score=sort_score[:select]
ax1.bar(Clmns_sort, sort_score, width=0.6, align='center', alpha=1, edgecolor='k', capsize=4,color=color)
plt.title(title,fontsize=titlefontsize)
ax1.set_ylabel(ylabel,fontsize=yfontsize)
ax1.set_xticks(np.arange(len(Clmns_sort)))
ax1.set_xticklabels(Clmns_sort,fontsize=xfontsize, rotation=ytick_rot,y=0.02)
if (percent): plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2)
if (xlim): plt.xlim(xlim)
if (ylim): plt.ylim(ylim)
if (fontsizelable):
for ii in range(len(sort_score)):
if (percent):
plt.text(xshift+ii, sort_score[ii]+yshift,f'{"{0:.2f}".format(sort_score[ii]*100)}%',
fontsize=fontsizelable,rotation=y_rot,color='k')
else:
plt.text(xshift+ii, sort_score[ii]+yshift,f'{np.round(sort_score[ii],graph_float)}',
fontsize=fontsizelable,rotation=y_rot,color='k')
dic_Clmns = {}
for i in range(len(Clmns_sort)):
dic_Clmns[Clmns_sort[i]]=sort_score[i]
txt = 'n (not missing)=%.0f\nMissing=%.1f%%'
anchored_text = AnchoredText(txt %(n_nonmis_ob,miss_p_ob), borderpad=0,
loc=loc)
if(legend==1): ax1.add_artist(anchored_text)
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(14, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val_obj = train['Sentiment']
bargraph (val_obj, title=f'Training set_Coronavirus Tweet Labels', ylabel='Counts',titlefontsize=15, xfontsize=12,yfontsize=13,
percent=False,fontsizelable=12,yshift=20,xshift=-0.2,color='r',legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,3000])
ax1 = plt.subplot(1,2,2)
val_obj = test['Sentiment']
bargraph (val_obj, title=f'Test set_Coronavirus Tweet Labels', ylabel='Counts',titlefontsize=15, xfontsize=12,yfontsize=13,
percent=False,fontsizelable=12,yshift=5,xshift=-0.1,color='b',legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,650])
plt.subplots_adjust(hspace=0.5)
plt.subplots_adjust(wspace=0.3)
plt.show()
There are several common fine-tuning strategies: update all of the model's parameters, freeze part of the model and train the rest, or freeze the whole model and train only a newly added classification head. Later in this notebook most of the layers are frozen to speed up training. First, the training data is parsed into a more manageable format:
# This code segment parses the train dataset into a more manageable format
sequence_labels = train['Sentiment']
title, tokenized_title = [], []
for comments in train['OriginalTweet']:
title.append(comments)
tokenized_title.append(comments.split(' '))
unique_sequence_labels = list(set(sequence_labels))
unique_sequence_labels
sequence_labels = [unique_sequence_labels.index(l) for l in sequence_labels]
print(f'There are {len(unique_sequence_labels)} unique sequence labels')
Our final Python lists look like this:
print(tokenized_title[0])
print(title[0])
print(sequence_labels[0])
print(unique_sequence_labels[sequence_labels[0]])
🤗 Datasets offers a rich set of tools for modifying both the structure and content of a dataset: organizing rows, splitting datasets, creating additional columns, converting between features and formats, and more.
Once all the data is collected, it is wrapped in a Dataset object, and a train-test split can then be made with the train_test_split method.
from datasets import Dataset
tweet_dataset = Dataset.from_dict(
dict(
titles=title,
label=sequence_labels,
tokens=tokenized_title,
)
)
tweet_dataset = tweet_dataset.train_test_split(test_size=0.1, seed=42, shuffle=True)
tweet_dataset
Here is the distribution of labels in the test split of our dataset:
pd.Series([ii for ii in tweet_dataset['test']['label']]).value_counts()
Instantiate tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
Create a pre-process function to take in a batch of titles and tokenize them.
def preprocess_function(examples):
    return tokenizer(examples["titles"], truncation=True)  # truncation=True truncates inputs that exceed the model's maximum length
Map the tokenizer function for the entire data set:
# go over all our data set, tokenize them
seq_clf_tokenized_comments = tweet_dataset.map(preprocess_function, batched=True)
seq_clf_tokenized_comments
Looking at the first item, we also have input_ids and attention_mask. These are the items we are going to need in our model.
DataCollatorWithPadding facilitates the creation of data batches by dynamically padding text sequences to the length of the longest element in each batch, ensuring uniform length across all elements. While it is feasible to pad text within the tokenizer function using the padding=True option, dynamic padding with DataCollatorWithPadding is more efficient and speeds up training.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
The Data Collator is responsible for padding data to achieve uniform input lengths across all examples in a batch. The attention mask is a mechanism utilized to exclude attention scores associated with padding tokens, ensuring that these scores are disregarded during processing.
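As a quick sketch (with two made-up sentences), the collator pads a small batch to a common length and the attention mask flags the padding tokens with 0:
sample_batch = [tokenizer(t) for t in ["Stay safe", "Please stop panic buying toilet paper"]]
collated = data_collator(sample_batch)
print(collated['input_ids'].shape)    # both rows padded to the length of the longer example
print(collated['attention_mask'])     # 1 = real token, 0 = padding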
Create the actual model:
from transformers import BartForSequenceClassification
tweet_clf_model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli',
num_labels=len(unique_sequence_labels),
ignore_mismatched_sizes=True)
# set an index -> label dictionary
tweet_clf_model.config.id2label = {i: l for i, l in enumerate(unique_sequence_labels)}
tweet_clf_model.config.label2id = {l: i for i, l in enumerate(unique_sequence_labels)}
# Model's parameters
n_params = list(tweet_clf_model.named_parameters())
print(f'The BART model has {len(n_params)} different parameters')
n_params[0:5][0]
print('********* Embedding Layer *********\n')
for par in n_params[0:5]:
print(f'{par[0], str(tuple(par[1].size()))}')
print('********* First Encoder ********* \n')
for par in n_params[5:21]:
print(f'{par[0], str(tuple(par[1].size()))}')
print('********* Output Layer ********* \n')
for par in n_params[-2:]:
print(f'{par[0], str(tuple(par[1].size()))}')
tweet_clf_model.config
Every model comes with a config. In this config, there are id2label and label2id: within the BartForSequenceClassification model, these are two mapping dictionaries used to convert labels between their textual and numerical representations.
tweet_clf_model.config.id2label[0]
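For instance, we can map an id to its label text and back (the exact label returned depends on the ordering of unique_sequence_labels in this run):
label_text = tweet_clf_model.config.id2label[0]
print(label_text, '->', tweet_clf_model.config.label2id[label_text])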
It is now opportune to introduce a custom metric. While Hugging Face traditionally employs loss as the performance metric, there is a need to compute accuracy as a more straightforward and intuitive metric for evaluation.
We take BART's pre-trained knowledge and transfer it to our supervised data set, without training for too many epochs. The code block below defines our training loop, which will be run over the data again and again.
from transformers import TrainingArguments, Trainer
See the Hugging Face documentation for more information on fine-tuning models.
🤗 Transformers offers a Trainer class designed to facilitate the fine-tuning of any pretrained models on your specific dataset. After completing the data preprocessing tasks outlined in the previous section, there are only a few remaining steps to configure the Trainer. The most challenging aspect is likely to be the preparation of the environment for executing Trainer.train(), as the process tends to be significantly slower when run on a CPU.
Before defining our Trainer, the initial step involves defining a TrainingArguments class that consolidates all the hyperparameters utilized by the Trainer for training and evaluation. The sole mandatory argument is the directory path where the trained model, along with checkpoints, will be stored. For the remaining parameters, default values can be retained, typically sufficient for basic fine-tuning.
Subsequently, the Trainer can be instantiated by incorporating all the previously constructed objects: the model, training_args, the training and validation datasets, and the data_collator.
This initiates the fine-tuning process, which, when performed on a GPU, typically concludes within a few minutes. The Trainer provides periodic updates on the training loss every 500 steps. However, it does not furnish information on the model's performance, as:
The Trainer has not been configured to perform evaluations during training. This can be achieved by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).
A compute_metrics() function needs to be supplied to the Trainer for calculating a metric during the aforementioned evaluations. Otherwise, the evaluation results would only display the loss, which may not be the most intuitive metric.
Let's explore the process of constructing a valuable compute_metrics() function and integrating it into our next training session. This function is designed to take an EvalPrediction object, which is a named tuple featuring a predictions field and a label_ids field. Its output is expected to be a dictionary that maps strings to floats, where the strings denote the names of the returned metrics and the floats represent their respective values. To obtain predictions from our model, the Trainer.predict() command can be employed.
The result of the predict() method is yet another named tuple encompassing three fields: predictions, label_ids, and metrics. The metrics field includes the dataset's loss, along with time-related metrics such as the total prediction time and the average prediction time. Once we finalize the compute_metrics() function and incorporate it into the Trainer, this field will also encompass the metrics outputted by compute_metrics().
See Stack Overflow for how to use evaluate.load.
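As a minimal sketch of how evaluate.load works, the loaded metric exposes a compute method that takes predictions and references (the toy values below are made up):
import evaluate
toy_metric = evaluate.load("accuracy")
print(toy_metric.compute(predictions=[0, 1, 1, 2], references=[0, 1, 0, 2]))  # {'accuracy': 0.75}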
Since the BART model is very large, fine-tuning all of its parameters would take a long time, so we freeze some parameters to speed up the fine-tuning process.
# Get the number of layers
config = tweet_clf_model.config
num_layers = config.num_hidden_layers
print("Number of layers in the BART model:", num_layers)
# To speed up training, freeze every parameter up to (but not including) the last decoder layer,
# so only the final decoder layer and the classification head remain trainable
ir = 0
for name, param in tweet_clf_model.named_parameters():
    ir += 1
    #print(name)
    if 'model.decoder.layers.11' in name:  # bart-large has 12 decoder layers (indexed 0-11); stop at the last one
        print(f'Parameter {ir}: model.decoder.layers.11')
        break
    param.requires_grad = False  # disable gradient updates for this parameter
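As a quick sanity check (a minimal sketch), we can count trainable versus total parameters to confirm that most of the model is now frozen:
total_params = sum(p.numel() for p in tweet_clf_model.parameters())
trainable_params = sum(p.numel() for p in tweet_clf_model.parameters() if p.requires_grad)
print(f'Trainable parameters: {trainable_params:,} of {total_params:,} '
      f'({100*trainable_params/total_params:.1f}%)')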
import evaluate
metric = evaluate.load("accuracy")
from sklearn.metrics import roc_auc_score
def compute_metrics(eval_pred):
logits = eval_pred.predictions[0]
labels = eval_pred.label_ids
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
#####################################################
def compute_metrics_binary(eval_pred):
    """metrics for binary classification"""
    labels = eval_pred.label_ids
    logits = eval_pred.predictions[0]
    # probability of the positive class from the logits (softmax)
    probs = np.exp(logits[:, 1]) / np.exp(logits).sum(axis=1)
    preds = (probs >= 0.5).astype(int)
    # Calculate the AUC score
    auc_score = roc_auc_score(labels, probs)
    # Calculate the true positive, false positive, false negative, and true negative values
    tp = ((preds == 1) & (labels == 1)).sum()
    fp = ((preds == 1) & (labels == 0)).sum()
    fn = ((preds == 0) & (labels == 1)).sum()
    tn = ((preds == 0) & (labels == 0)).sum()
    # Calculate the precision, recall, and F1 score
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {
        'Validation Precision': precision,
        'Validation Recall': recall,
        'Validation F1_Score': f1_score,
        'Validation AUC_Score': auc_score,
        'Validation TP': tp,
        'Validation FP': fp,
        'Validation FN': fn,
        'Validation TN': tn,
    }
#####################################################
from sklearn.metrics import classification_report
def compute_metrics_multiclass(eval_pred):
"""metrics for multiclass classification"""
labels = eval_pred.label_ids
logits = eval_pred.predictions[0]
preds = np.argmax(logits, axis=-1)
report = classification_report(labels, preds, output_dict=True)
acc_score = report['accuracy']
pre_score = report['macro avg']['precision']
rcl_score = report['macro avg']['recall']
f1_score = report['macro avg']['f1-score']
return {
'Validation Accuracy': acc_score,
'Validation Macro Recall': rcl_score,
'Validation Macro Precision': pre_score,
'Validation Macro F1_Score': f1_score,
}
epochs = 5
batch_size = 5
from transformers import set_seed
set_seed(42)
tweet_clf_model.eval()
# Training arguments: hyperparameters for training and evaluation, including the
# checkpoint-saving strategy and learning rate scheduler parameters
training_args = TrainingArguments(
    output_dir="./tweet_clf/results",             # local directory to save checkpoints of our model while fitting
    num_train_epochs=epochs,                      # number of training epochs
    per_device_train_batch_size=batch_size,       # batch size for training; around 32 is common, and the smaller the
    per_device_eval_batch_size=batch_size,        # batch size, the more often the model updates
    load_best_model_at_end=True,                  # even if we overfit by accident, reload the best checkpoint at the end
    metric_for_best_model= 'Validation Accuracy', # used with load_best_model_at_end to compare checkpoints; must be the
                                                  # name of a metric returned by compute_metrics (with or without the
                                                  # "eval_" prefix); defaults to the evaluation loss if unspecified
    # some deep learning parameters that the trainer is able to take in
    warmup_steps = len(seq_clf_tokenized_comments['train']) // 2, # number of warmup steps for the learning rate scheduler
    weight_decay = 0.1,                           # weight decay for our learning rate schedule (regularization)
    seed = 42,                                    # random seed to ensure reproducibility across runs
    logging_steps = 1,                            # log after every step (1 means logging as often as possible)
    log_level = 'info',
    evaluation_strategy = 'epoch',                # "steps" or "epoch"; we evaluate at the end of every epoch
    eval_steps = 50,
    save_strategy = 'epoch'                       # save a checkpoint of our model after each epoch
)
# Define the trainer: a high-level API on top of PyTorch
trainer = Trainer(
    model=tweet_clf_model,                             # our model (tweet_clf_model)
    args=training_args,                                # the training arguments we just set above
    train_dataset=seq_clf_tokenized_comments['train'], # training part of the dataset
    eval_dataset=seq_clf_tokenized_comments['test'],   # test (evaluation) part of the dataset
    compute_metrics=compute_metrics_multiclass,        # optional, but we want extra metrics beyond the loss
    data_collator=data_collator                        # data collator with dynamic padding; depending on the batches,
                                                       # we may or may not strictly need it
)
Before we start training, we can evaluate the model as-is to measure its performance prior to fine-tuning:
# Get initial metrics: evaluation on test set
trainer.evaluate()
%%time
trainer.train()
trainer.evaluate()
## Build a text-classification pipeline from the fine-tuned model
#tweet_clf_model = tweet_clf_model.to('cpu') #put my model to cpu
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer)
text = """Wish I was physic so I could have predicted how much sanitizer and toilet
paper I would need before this got real. #stopPanicBuying #coronavirus"""
pipe(text)
text = """After almost two weeks of going absolutely nowhere, we had to go to
the grocery store to stock up on food, fruit and veggies. Some older people:
their behaviour!?? ? What part of social distancing do you not understand?!
Already had such a stressful day. ? #SocialDistancing"""
pipe(text)
# We can save our model to the directory we specified and load it later for prediction
trainer.save_model()
We can easily load our pipeline directly from the saved directory. This is very useful for deploying the model on the cloud with one line of code, and it produces exactly the same results.
pipe = pipeline("text-classification", "./tweet_clf/results", tokenizer=tokenizer)
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer, return_all_scores=True)
def refined_zero(x):
    """Return the five (label, probability) pairs predicted for a text, sorted by probability"""
    preds = pipe(x)[0]  # run the pipeline once; it returns a score for every label
    pipe_x = sorted([(p['label'], p['score']) for p in preds], key=lambda t: t[1], reverse=True)
    try:
        # flatten the sorted pairs into l1, p1, ..., l5, p5
        (l1, p1), (l2, p2), (l3, p3), (l4, p4), (l5, p5) = pipe_x[:5]
    except ValueError:
        l1 = p1 = l2 = p2 = l3 = p3 = l4 = p4 = l5 = p5 = None
    return l1, p1, l2, p2, l3, p3, l4, p4, l5, p5
titles = []
perdi = []
score = []
for ii in tweet_dataset['test']['titles']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in tweet_dataset['test']['label']:
label.append(tweet_clf_model.config.id2label[ii])
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
Actl_pred_copy['label_1'] = [i[0] for i in perdi]
Actl_pred_copy['prob_label_1'] = [i[1] for i in perdi]
#
Actl_pred_copy['label_2'] = [i[2] for i in perdi]
Actl_pred_copy['prob_label_2'] = [i[3] for i in perdi]
#
Actl_pred_copy['label_3'] = [i[4] for i in perdi]
Actl_pred_copy['prob_label_3'] = [i[5] for i in perdi]
#
Actl_pred_copy['label_4'] = [i[6] for i in perdi]
Actl_pred_copy['prob_label_4'] = [i[7] for i in perdi]
#
Actl_pred_copy['label_5'] = [i[8] for i in perdi]
Actl_pred_copy['prob_label_5'] = [i[9] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(14, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val_obj = Actl_pred['label_1']
bargraph (val_obj, title=f'Predicted Labels', ylabel='Counts',titlefontsize=15, xfontsize=10,yfontsize=13,
percent=False,fontsizelable=11,yshift=1,xshift=-0.1,color='g',legend=True,ytick_rot=25, y_rot=0, axt=ax1)
plt.show()
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Calculate the percentage
font = {'size' : 9}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 5), dpi= 120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr=np.zeros((5,5))
for i in range(len(alllabels)):
for j in range(len(alllabels)):
per_cltr[i,j] = len(Actl_pred[(Actl_pred['Actual']==alllabels[i]) &
(Actl_pred['label_1']==alllabels[j])])/len(Actl_pred[Actl_pred['Actual']==alllabels[i]])
cax =ax.matshow(per_cltr, cmap='jet', interpolation='nearest',vmin=0, vmax=1)
cbar=fig.colorbar(cax,shrink=0.6,orientation='vertical',label='Low % High %')
cbar.set_ticks([])
#plt.title('Mismatch Percentage', fontsize=14,y=1.17)
for i in range(5):
for j in range(5):
c = per_cltr[i,j]*100
ax.text(j, i, str(round(c,1))+'%', va='center',weight="bold", ha='center',fontsize=12,c='w')
columns=[f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns,fontsize=8,rotation=35,y=0.97)
columns=[f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns,fontsize=9,rotation='0')
plt.title(f'Confusion Matrix to Predict {len(titles)} Validation Set Samples', fontsize=12,y=1.25)
txt = "Overall Accuracy = "+ r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = "+ r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = "+ r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = "+ r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = "+ r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$'+": calculate metrics for each label separately and then average them\n"
txt_def+= r'$\mathbf{' + 'Micro' + '}$'+": sum True Positives, False Positives and False Negatives over all labels and then calculate metrics"
plt.text(-1, 5.5, txt_def,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
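To make the difference between macro and micro averaging concrete, here is a toy example with made-up labels (not from the data set):
toy_true = ['Positive', 'Positive', 'Negative', 'Neutral']
toy_pred = ['Positive', 'Negative', 'Negative', 'Neutral']
print(precision_score(toy_true, toy_pred, average='macro'))  # average of per-class precisions
print(precision_score(toy_true, toy_pred, average='micro'))  # precision from globally summed TP and FP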
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer, return_all_scores=True)
titles = []
perdi = []
score = []
for ii in test['OriginalTweet']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in test['Sentiment']:
label.append(ii)
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
Actl_pred_copy['label_1'] = [i[0] for i in perdi]
Actl_pred_copy['prob_label_1'] = [i[1] for i in perdi]
#
Actl_pred_copy['label_2'] = [i[2] for i in perdi]
Actl_pred_copy['prob_label_2'] = [i[3] for i in perdi]
#
Actl_pred_copy['label_3'] = [i[4] for i in perdi]
Actl_pred_copy['prob_label_3'] = [i[5] for i in perdi]
#
Actl_pred_copy['label_4'] = [i[6] for i in perdi]
Actl_pred_copy['prob_label_4'] = [i[7] for i in perdi]
#
Actl_pred_copy['label_5'] = [i[8] for i in perdi]
Actl_pred_copy['prob_label_5'] = [i[9] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
def histplt (val: list,bins: int,title: str,xlabl: str,ylabl: str,xlimt: list,
ylimt: list=False, loc: int =1,legend: int=1,axt=None,days: int=False,
class_: int=False,scale: int=1,x_tick: list=False, calc_perc: bool= True,
nsplit: int=1,font: int=5,color: str='b') -> [float] :
""" Histogram including important statistics """
ax1 = axt or plt.axes()
font = {'size' : font }
plt.rc('font', **font)
miss_n = len(val[np.isnan(val)])
tot = len(val)
n_distinct = len(np.unique(val))
miss_p = (len(val[np.isnan(val)])/tot)*100
val = val[~np.isnan(val)]
val = np.array(val)
plt.hist(val, bins=bins, weights=np.ones(len(val)) / len(val),ec='black',color=color)
n_nonmis = len(val[~np.isnan(val)])
if class_:
times = 100
else:
times = 1
Mean = np.nanmean(val)*times
Median = np.nanmedian(val)*times
sd = np.sqrt(np.nanvar(val))
Max = np.nanmax(val)
Min = np.nanmin(val)
p1 = np.quantile(val, 0.01)
p25 = np.quantile(val, 0.25)
p75 = np.quantile(val, 0.75)
p99 = np.quantile(val, 0.99)
if calc_perc == True:
txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\np1%%=%0.1f\np99%%=%0.1f\nMin=%0.1f\nMax=%0.1f'
anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,p1,p99,Min,Max), borderpad=0,
loc=loc,prop={ 'size': font['size']*scale})
else:
txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\nMin=%0.1f\nMax=%0.1f'
anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,Min,Max), borderpad=0,
loc=loc,prop={ 'size': font['size']*scale})
if(legend==1): ax1.add_artist(anchored_text)
if (scale): plt.title(title,fontsize=font['size']*(scale+0.15))
else: plt.title(title)
plt.xlabel(xlabl,fontsize=font['size'])
ax1.set_ylabel('Frequency',fontsize=font['size'])
if (scale): ax1.set_xlabel(xlabl,fontsize=font['size']*scale)
else: ax1.set_xlabel(xlabl)
try:
xlabl
except NameError:
pass
else:
if (scale): plt.xlabel(xlabl,fontsize=font['size']*scale)
else: plt.xlabel(xlabl)
try:
ylabl
except NameError:
pass
else:
if (scale): plt.ylabel(ylabl,fontsize=font['size']*scale)
else: plt.ylabel(ylabl)
if (class_==True): plt.xticks([0,1])
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.grid(linewidth='0.1')
try:
xlimt
except NameError:
pass
else:
plt.xlim(xlimt)
try:
ylimt
except NameError:
pass
else:
plt.ylim(ylimt)
if x_tick: plt.xticks(x_tick,fontsize=font['size']*scale)
plt.yticks(fontsize=font['size']*scale)
plt.grid(linewidth='0.12')
# Interquartile Range Method for outlier detection
iqr = p75 - p25
# calculate the outlier cutoff
cut_off = np.array(iqr) * 1.5
lower, upper = p25 - cut_off, p75 + cut_off
return tot, n_nonmis, n_distinct, miss_n, miss_p, Mean, Median, sd, Max, Min, p1, p25, p75, p99, sd
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val = Actl_pred.prob_label_1
_,_,_, _, _,_ ,_ ,_ ,_ ,_ ,\
_,_ ,_ ,_ ,_ = histplt (val,bins=25,title=f'Fine-tuned Model: Histogram of Predicted Probability',xlabl=None,days=False,
ylabl=None,xlimt=(0,1),ylimt=False
,axt=ax1,nsplit=5,scale=0.95,font=12,loc=2,color='g')
plt.show()
Actl_pred.Actual.value_counts()
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Calculate the percentage
font = {'size' : 9}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 5), dpi= 120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr=np.zeros((5,5))
for i in range(len(alllabels)):
for j in range(len(alllabels)):
per_cltr[i,j] = len(Actl_pred[(Actl_pred['Actual']==alllabels[i]) &
(Actl_pred['label_1']==alllabels[j])])/len(Actl_pred[Actl_pred['Actual']==alllabels[i]])
cax =ax.matshow(per_cltr, cmap='jet', interpolation='nearest',vmin=0, vmax=1)
cbar=fig.colorbar(cax,shrink=0.6,orientation='vertical',label='Low % High %')
cbar.set_ticks([])
#plt.title('Mismatch Percentage', fontsize=14,y=1.17)
for i in range(5):
for j in range(5):
c = per_cltr[i,j]*100
ax.text(j, i, str(round(c,1))+'%', va='center',weight="bold", ha='center',fontsize=12,c='w')
columns=[f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns,fontsize=8,rotation=35,y=0.97)
columns=[f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns,fontsize=9,rotation='0')
plt.title(f'Fine-tuned Model: Confusion Matrix to Predict {len(label)} Test Set Samples', fontsize=12,y=1.25)
txt = "Overall Accuracy = "+ r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = "+ r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = "+ r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = "+ r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = "+ r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$'+": calculate metrics for each label separately and then average them\n"
txt_def+= r'$\mathbf{' + 'Micro' + '}$'+": sum True Positives, False Positives and False Negatives over all labels and then calculate metrics"
plt.text(-1, 5.5, txt_def,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
pipe = pipe_pre_trained
def refined_zero(x):
    """Return the top four (label, probability) pairs from the zero-shot classifier, sorted by probability"""
    pred_zero = pipe_pre_trained(x, candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
                                                      'Extremely Positive', 'Positive'])
    pipe_x = sorted(zip(pred_zero['labels'], pred_zero['scores']), key=lambda t: t[1], reverse=True)
    try:
        # flatten the sorted pairs into l1, p1, ..., l4, p4
        (l1, p1), (l2, p2), (l3, p3), (l4, p4) = pipe_x[:4]
    except ValueError:
        l1 = p1 = l2 = p2 = l3 = p3 = l4 = p4 = None
    return l1, p1, l2, p2, l3, p3, l4, p4
titles = []
perdi = []
score = []
for ii in test['OriginalTweet']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in test['Sentiment']:
label.append(ii)
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
Actl_pred_copy['label_1'] = [i[0] for i in perdi]
Actl_pred_copy['prob_label_1'] = [i[1] for i in perdi]
#
Actl_pred_copy['label_2'] = [i[2] for i in perdi]
Actl_pred_copy['prob_label_2'] = [i[3] for i in perdi]
#
Actl_pred_copy['label_3'] = [i[4] for i in perdi]
Actl_pred_copy['prob_label_3'] = [i[5] for i in perdi]
#
Actl_pred_copy['label_4'] = [i[6] for i in perdi]
Actl_pred_copy['prob_label_4'] = [i[7] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val = Actl_pred.prob_label_1
_,_,_, _, _,_ ,_ ,_ ,_ ,_ ,\
_,_ ,_ ,_ ,_ = histplt (val,bins=25,title=f'Zero shot Model: Histogram of Predicted Probability',xlabl=None,days=False,
ylabl=None,xlimt=(0,1),ylimt=False
,axt=ax1,nsplit=5,scale=0.95,font=12,loc=2,color='g')
plt.show()
Actl_pred.Actual.value_counts()
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Calculate the percentage
font = {'size' : 9}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 5), dpi= 120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr=np.zeros((5,5))
for i in range(len(alllabels)):
for j in range(len(alllabels)):
per_cltr[i,j] = len(Actl_pred[(Actl_pred['Actual']==alllabels[i]) &
(Actl_pred['label_1']==alllabels[j])])/len(Actl_pred[Actl_pred['Actual']==alllabels[i]])
cax =ax.matshow(per_cltr, cmap='jet', interpolation='nearest',vmin=0, vmax=1)
cbar=fig.colorbar(cax,shrink=0.6,orientation='vertical',label='Low % High %')
cbar.set_ticks([])
#plt.title('Mismatch Percentage', fontsize=14,y=1.17)
for i in range(5):
for j in range(5):
c = per_cltr[i,j]*100
ax.text(j, i, str(round(c,1))+'%', va='center',weight="bold", ha='center',fontsize=12,c='w')
columns=[f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns,fontsize=8,rotation=35,y=0.97)
columns=[f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns,fontsize=9,rotation='0')
plt.title(f'Zero-shot Model: Confusion Matrix to Predict {len(label)} Test Set Samples', fontsize=12,y=1.25)
txt = "Overall Accuracy = "+ r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = "+ r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = "+ r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = "+ r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = "+ r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$'+": calculate metrics for each label separately and then average them\n"
txt_def+= r'$\mathbf{' + 'Micro' + '}$'+": sum True Positives, False Positives and False Negatives over all labels and then calculate metrics"
plt.text(-1, 5.5, txt_def,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
The performance of zero-shot classification is much lower than that of the fine-tuned model.