Summary
Facebook's BART (Bidirectional and Auto-Regressive Transformers) is a sequence-to-sequence pre-trained language model designed for a variety of natural language processing tasks. It utilizes a denoising autoencoder objective, where the model is trained to reconstruct an input sequence from a corrupted version of it. The checkpoint used in this notebook, facebook/bart-large-mnli, has additionally been fine-tuned on the MultiNLI (MNLI) dataset. One notable application of this model is zero-shot classification, where it can classify text into unseen classes without specific training examples. This is accomplished by framing classification as natural language inference: the input text serves as the premise, each candidate label is embedded in a hypothesis, and the entailment score is used as the label probability. Additionally, BART can be fine-tuned for downstream tasks such as sentiment analysis, summarization, or translation by training on task-specific datasets. Fine-tuning updates the model's parameters on the new task while leveraging the knowledge gained during pre-training, enabling BART to adapt to a wide range of natural language understanding tasks. In this notebook, Coronavirus tweets have been pulled from Twitter and manually tagged with five sentiments: 'Positive', 'Neutral', 'Extremely Positive', 'Extremely Negative', 'Negative'. The BART model is first applied as a zero-shot classifier to predict tweet sentiments, and is then fine-tuned on the labeled data for more accurate sentiment prediction.
Python functions and data files to run this notebook are available on my GitHub page.
import warnings
warnings.filterwarnings('ignore')
from transformers import pipeline
import numpy as np
import pandas as pd
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.offsetbox import AnchoredText
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from matplotlib.ticker import PercentFormatter
from nltk.tokenize import TweetTokenizer
import nltk
import re
pd.set_option('display.max_colwidth', None)
Zero-shot classification is a machine learning technique employed to categorize or classify data into predefined classes without any previous examples or training data specific to those classes. The term "zero-shot" reflects the model's ability to make predictions for classes it has never encountered during training, thus possessing zero prior exposure to them. While traditional classification models rely on labeled examples for each class they aim to recognize, zero-shot classification trains the model to comprehend relationships between different classes, enabling it to generalize to new classes based on their inherent characteristics or attributes.
The fundamental concept underlying zero-shot classification involves leveraging auxiliary information, such as class descriptions, semantic embeddings, or attributes, to impart high-level knowledge about the classes. This supplementary information assists the model in learning associations between specific features or patterns and particular classes.
Zero-shot classification proves particularly advantageous in scenarios where obtaining labeled data for all potential classes is costly, time-consuming, or impractical. This technique empowers models to generalize effectively to unseen classes, enhancing their adaptability and flexibility.
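Under the hood, the facebook/bart-large-mnli checkpoint performs zero-shot classification by framing it as natural language inference (NLI): the input text is the premise, and each candidate label is wrapped in a hypothesis such as "This example is Positive." Below is a minimal sketch of this idea with a made-up premise; it is a simplified version of what the Hugging Face pipeline does internally:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
#
nli_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
#
premise = "Supermarket shelves are empty because of panic buying."
scores = {}
for candidate in ['Positive', 'Negative', 'Neutral']:
    # each candidate label becomes an NLI hypothesis
    inputs = nli_tokenizer(premise, f"This example is {candidate}.",
                           return_tensors='pt', truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits[0]
    # MNLI logits are [contradiction, neutral, entailment]; drop the "neutral"
    # logit and take the probability of entailment as the label score
    scores[candidate] = logits[[0, 2]].softmax(dim=0)[1].item()
print(scores)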
Coronavirus Tweet Data Set
The Coronavirus Tweet data set is downloaded from Kaggle. The tweets were pulled from Twitter and then manually tagged. Names and usernames have been encoded to avoid any privacy concerns. The columns are:
1) Location (London, UK, ...)
2) Tweet At (date, e.g. 16-03-2020)
3) Original Tweet
4) Sentiment (label)
file_name = 'Corona_NLP_train.csv'
train = pd.read_csv(file_name, encoding = "ISO-8859-1")
#
file_name = 'Corona_NLP_test.csv'
test = pd.read_csv(file_name, encoding = "ISO-8859-1")
train[:4]
 | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment
---|---|---|---|---|---|---
0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8 | Neutral |
1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order | Positive |
2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P | Positive |
3 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j | Positive |
Clean Data
nltk.download('punkt')
def remove_meaningless(text):
"""
    remove tokens starting with '@', '\n', '\r', '#', or containing 'https://', which carry no meaning for sentiment
"""
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(text)
filtered_tokens = [token for token in tokens if not (token.startswith('@') or token.startswith('\n') or
token.startswith('\r') or token.startswith('#') or
'https://' in token)]
return ' '.join(filtered_tokens)
train['OriginalTweet'] = train.apply(lambda x : remove_meaningless(x['OriginalTweet']), axis=1)
test['OriginalTweet'] = test.apply(lambda x : remove_meaningless(x['OriginalTweet']), axis=1)
[nltk_data] Downloading package punkt to [nltk_data] C:\Users\mrezv\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date!
train[:4]
 | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment
---|---|---|---|---|---|---
0 | 3799 | 48751 | London | 16-03-2020 | and and | Neutral |
1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order | Positive |
2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia : Woolworths to give elderly , disabled dedicated shopping hours amid COVID - 19 outbreak | Positive |
3 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is empty ... PLEASE , don't panic , THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need . Stay calm , stay safe . | Positive |
tt = TweetTokenizer()
#
train['n_tokens'] = train.apply(lambda x : len(tt.tokenize(x['OriginalTweet'])), axis=1)
test['n_tokens'] = test.apply(lambda x : len(tt.tokenize(x['OriginalTweet'])), axis=1)
# Keep only tweets with more than 10 tokens
train = train[train['n_tokens']>10]
test = test[test['n_tokens']>10]
train[:4]
 | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment | n_tokens
---|---|---|---|---|---|---|---
1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order | Positive | 38 |
2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia : Woolworths to give elderly , disabled dedicated shopping hours amid COVID - 19 outbreak | Positive | 17 |
3 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is empty ... PLEASE , don't panic , THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need . Stay calm , stay safe . | Positive | 40 |
4 | 3803 | 48755 | NaN | 16-03-2020 | Me , ready to go at supermarket during the outbreak . Not because I'm paranoid , but because my food stock is litteraly empty . The is a serious thing , but please , don't panic . It causes shortage ... | Extremely Negative | 41 |
# Since the data set is large for fine-tuning, we select 10000 samples for the training set and 2000 samples for the test set
n_samples = 10000
train = train.iloc[:n_samples].copy()
test = test.iloc[:int(n_samples*0.2)].copy()
The code below shows how to apply zero-shot classification to the Coronavirus Tweet data set:
# Load pre-trained bart-large-mnli (zero shot classifier)
pipe_pre_trained = pipeline(model="facebook/bart-large-mnli")
txt = """What #CONVID19 safety measures r being taken by online shopping companies &
their courier partners @amazonIN @Flipkart etc? I fear that shopping packages which
travel vast distances through flights/trains & r handled by many along d way can b
potential #coronavirus carriers??"""
pipe_pre_trained(txt,candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
'Extremely Positive', 'Positive'])
{'sequence': 'What #CONVID19 safety measures r being taken by online shopping companies & \ntheir courier partners @amazonIN @Flipkart etc? I fear that shopping packages which \ntravel vast distances through flights/trains & r handled by many along d way can b \npotential #coronavirus carriers??', 'labels': ['Negative', 'Neutral', 'Extremely Negative', 'Positive', 'Extremely Positive'], 'scores': [0.4203762412071228, 0.2806272804737091, 0.1792459338903427, 0.07781031727790833, 0.041940245777368546]}
For this text, the actual label is Negative, and the zero-shot model predicts it correctly; however, the predicted probability of 0.42 is relatively low, indicating the model is not very confident.
Another Example
txt = """It amazes me...I go to the supermarket and everything is gone, people are panic buying,
yet not one single person was wearing a facemask. Buying 200 rolls of toilet paper won't
prevent you from catching #COVID2019. Take proper precautions and stay safe. https://t.co/lNoF0xfsx4"""
pipe_pre_trained(txt,candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
'Extremely Positive', 'Positive'])
{'sequence': "It amazes me...I go to the supermarket and everything is gone, people are panic buying, \nyet not one single person was wearing a facemask. Buying 200 rolls of toilet paper won't \nprevent you from catching #COVID2019. Take proper precautions and stay safe. https://t.co/lNoF0xfsx4", 'labels': ['Negative', 'Extremely Negative', 'Positive', 'Neutral', 'Extremely Positive'], 'scores': [0.3562481999397278, 0.3409477174282074, 0.1348302662372589, 0.10292468965053558, 0.06504908204078674]}
For this text, the actual label is Extremely Positive, while the zero-shot model predicts Negative with probability 0.35. Therefore, we need to fine-tune the model on the manual labels.
import subprocess
try:
subprocess.check_output('nvidia-smi')
print('Nvidia GPU detected!')
except Exception: # this command not being found can raise quite a few different errors depending on the configuration
print('No Nvidia GPU in system!')
No Nvidia GPU in system!
def bargraph(val_ob: pd.Series, title: str, ylabel: str, titlefontsize: int = 10, xfontsize: int = 5,
             scale: int = 1, yfontsize: int = 8, select: bool = False, fontsizelable: bool = False,
             xshift: float = -0.1, nsim: int = False, yshift: float = 0.01, percent: bool = False,
             xlim: list = False, axt=None, color: str = 'b', sort: bool = True,
             ylim: list = False, y_rot: int = 0, ytick_rot: int = 90, graph_float: int = 1,
             loc: int = 1, legend: int = 1) -> None:
""" vertical bargraph """
ax1 = axt or plt.axes()
tot = len(val_ob)
miss_p_ob = (len(val_ob[pd.isnull(val_ob)])/tot)*100
n_nonmis_ob = len(val_ob[~pd.isnull(val_ob)])
con = np.array(val_ob.value_counts())
len_ = len(con)
if len_ > 10: len_ = 10
cats = list(val_ob.value_counts().keys())
val_ob = con[:len_]
clmns = cats[:len_]
# Sort counts
if sort:
sort_score = sorted(zip(val_ob,clmns), reverse=True)
Clmns_sort = [sort_score[i][1] for i in range(len(clmns))]
sort_score = [sort_score[i][0] for i in range(len(clmns))]
else:
Clmns_sort = clmns
sort_score = val_ob
index1 = np.arange(len(clmns))
if (select):
Clmns_sort=Clmns_sort[:select]
sort_score=sort_score[:select]
ax1.bar(Clmns_sort, sort_score, width=0.6, align='center', alpha=1, edgecolor='k', capsize=4,color=color)
plt.title(title,fontsize=titlefontsize)
ax1.set_ylabel(ylabel,fontsize=yfontsize)
ax1.set_xticks(np.arange(len(Clmns_sort)))
ax1.set_xticklabels(Clmns_sort,fontsize=xfontsize, rotation=ytick_rot,y=0.02)
if (percent): plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2)
if (xlim): plt.xlim(xlim)
if (ylim): plt.ylim(ylim)
if (fontsizelable):
for ii in range(len(sort_score)):
if (percent):
plt.text(xshift+ii, sort_score[ii]+yshift,f'{"{0:.2f}".format(sort_score[ii]*100)}%',
fontsize=fontsizelable,rotation=y_rot,color='k')
else:
plt.text(xshift+ii, sort_score[ii]+yshift,f'{np.round(sort_score[ii],graph_float)}',
fontsize=fontsizelable,rotation=y_rot,color='k')
dic_Clmns = {}
for i in range(len(Clmns_sort)):
dic_Clmns[Clmns_sort[i]]=sort_score[i]
txt = 'n (not missing)=%.0f\nMissing=%.1f%%'
anchored_text = AnchoredText(txt %(n_nonmis_ob,miss_p_ob), borderpad=0,
loc=loc)
if(legend==1): ax1.add_artist(anchored_text)
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(14, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val_obj = train['Sentiment']
bargraph (val_obj, title=f'Training set_Coronavirus Tweet Labels', ylabel='Counts',titlefontsize=15, xfontsize=12,yfontsize=13,
percent=False,fontsizelable=12,yshift=20,xshift=-0.2,color='r',legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,3000])
ax1 = plt.subplot(1,2,2)
val_obj = test['Sentiment']
bargraph (val_obj, title=f'Test set_Coronavirus Tweet Labels', ylabel='Counts',titlefontsize=15, xfontsize=12,yfontsize=13,
percent=False,fontsizelable=12,yshift=5,xshift=-0.1,color='b',legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,650])
plt.subplots_adjust(hspace=0.5)
plt.subplots_adjust(wspace=0.3)
plt.show()
Fine-Tune BART Model
There are three fine-tuning approaches:
- Add additional layers on top and update the entire model on labeled data. This is a common approach: every weight of the model gets updated. It is usually the slowest approach but gives the highest performance.
- Freeze part of the model, for example keeping some of BART's weights unchanged while updating the others. This approach has average speed and average performance.
- Freeze the entire model and only train the additional layers added on top, which are feed-forward classifiers. This is the fastest approach but has the worst performance, and is only suitable for generic tasks (a sketch of this approach follows the list).
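As a rough sketch of the third approach (using the tweet_clf_model that is created later in this notebook; the head-only freezing shown here is illustrative, not what we actually do below):
# Sketch of the third approach: freeze the entire backbone and train only the
# classification head (assumes a BartForSequenceClassification model such as
# the tweet_clf_model created later in this notebook)
for name, param in tweet_clf_model.named_parameters():
    param.requires_grad = name.startswith('classification_head')
n_trainable = sum(p.numel() for p in tweet_clf_model.parameters() if p.requires_grad)
print(f'Trainable parameters: {n_trainable}')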
Parse Data
# This code segment parses the train dataset into a more manageable format
sequence_labels = train['Sentiment']
title, tokenized_title = [], []
for comments in train['OriginalTweet']:
    title.append(comments)
    tokenized_title.append(comments.split(' '))
unique_sequence_labels = list(set(sequence_labels))
unique_sequence_labels
['Positive', 'Neutral', 'Extremely Positive', 'Extremely Negative', 'Negative']
sequence_labels = [unique_sequence_labels.index(l) for l in sequence_labels]
print(f'There are {len(unique_sequence_labels)} unique sequence labels')
There are 5 unique sequence labels
Our final Python lists look like this:
print(tokenized_title[0])
print(title[0])
print(sequence_labels[0])
print(unique_sequence_labels[sequence_labels[0]])
['advice', 'Talk', 'to', 'your', 'neighbours', 'family', 'to', 'exchange', 'phone', 'numbers', 'create', 'contact', 'list', 'with', 'phone', 'numbers', 'of', 'neighbours', 'schools', 'employer', 'chemist', 'GP', 'set', 'up', 'online', 'shopping', 'accounts', 'if', 'poss', 'adequate', 'supplies', 'of', 'regular', 'meds', 'but', 'not', 'over', 'order'] advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order 0 Positive
Split Data to Training and Validation Set
🤗 Datasets offers a plethora of tools for modifying both the structure and content of a dataset. These tools play a pivotal role in refining a dataset, encompassing tasks such as organizing rows, splitting datasets, creating additional columns, converting between different features and formats, and more.
This guide will walk you through the following processes:
- Rearranging rows and dividing the dataset.
- Renaming and eliminating columns, along with other commonly performed column operations.
- Applying processing functions to each example within a dataset.
- Concatenating datasets.
- Implementing a custom formatting transform.
- Saving and exporting processed datasets.
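As a toy illustration of a few of these operations (the data and column names below are made up):
# Toy example (made-up data) of a few common Datasets operations
from datasets import Dataset
ds = Dataset.from_dict({'text': ['a tweet', 'another tweet'], 'label': [0, 1]})
ds = ds.rename_column('text', 'titles')                 # rename a column
ds = ds.map(lambda ex: {'n_chars': len(ex['titles'])})  # add a derived column
ds = ds.shuffle(seed=42)                                # rearrange rows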
Once all the data is collected, it is encapsulated within a dataset object. Subsequently, a train-test split can be executed using the train_test_split function.
from datasets import Dataset
tweet_dataset = Dataset.from_dict(
dict(
titles=title,
label=sequence_labels,
tokens=tokenized_title,
)
)
tweet_dataset = tweet_dataset.train_test_split(test_size=0.1, seed=42, shuffle=True)
tweet_dataset
DatasetDict({ train: Dataset({ features: ['titles', 'label', 'tokens'], num_rows: 9000 }) test: Dataset({ features: ['titles', 'label', 'tokens'], num_rows: 1000 }) })
Here is the label distribution of the test split:
pd.Series([ii for ii in tweet_dataset['test']['label']]).value_counts()
0 285 4 254 2 161 1 159 3 141 dtype: int64
Tokenizer
Instantiate tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
Create a pre-process function to take in a batch of titles and tokenize them.
def preprocess_function(examples):
    return tokenizer(examples["titles"], truncation=True)  # truncation=True truncates sequences exceeding the model's maximum length
Map the tokenizer function over the entire data set:
# go over all our data set, tokenize them
seq_clf_tokenized_comments = tweet_dataset.map(preprocess_function, batched=True)
tweet_dataset
DatasetDict({ train: Dataset({ features: ['titles', 'label', 'tokens'], num_rows: 9000 }) test: Dataset({ features: ['titles', 'label', 'tokens'], num_rows: 1000 }) })
Looking at the first item of the tokenized dataset, seq_clf_tokenized_comments, we also have input_ids and attention_mask. These are the items we are going to need in our model.
Batch of Data
DataCollatorWithPadding facilitates the creation of data batches by dynamically padding text sequences to the length of the longest element in each batch, ensuring uniform length across all elements. While it is feasible to pad text manually within the tokenizer function using the padding=True option, dynamic padding with DataCollatorWithPadding is more efficient: padding each batch only to its longest sequence speeds up training.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
The Data Collator is responsible for padding data to achieve uniform input lengths across all examples in a batch. The attention mask is a mechanism utilized to exclude attention scores associated with padding tokens, ensuring that these scores are disregarded during processing.
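As a small sketch of what the collator produces, using the tokenizer and data_collator defined above on two made-up texts:
# Sketch: dynamically pad a tiny batch of two made-up texts
features = [tokenizer(t, truncation=True) for t in
            ['short tweet', 'a noticeably longer tweet about panic buying at the supermarket']]
batch = data_collator(features)
print(batch['input_ids'].shape)    # (2, length of the longest sequence in the batch)
print(batch['attention_mask'][0])  # 1 for real tokens, 0 for the padding positions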
Create Model
Create the actual model:
from transformers import BartForSequenceClassification
tweet_clf_model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli',
num_labels=len(unique_sequence_labels),
ignore_mismatched_sizes=True)
# set an index -> label dictionary
tweet_clf_model.config.id2label = {i: l for i, l in enumerate(unique_sequence_labels)}
tweet_clf_model.config.label2id = {l: i for i, l in enumerate(unique_sequence_labels)}
Some weights of BartForSequenceClassification were not initialized from the model checkpoint at facebook/bart-large-mnli and are newly initialized because the shapes did not match: - classification_head.out_proj.weight: found shape torch.Size([3, 1024]) in the checkpoint and torch.Size([5, 1024]) in the model instantiated - classification_head.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([5]) in the model instantiated You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# Model's parameters
n_params = list(tweet_clf_model.named_parameters())
print(f'The BART model has {len(n_params)} different parameters')
The BART model has 515 different parameters
n_params[0:5][0]
('model.shared.weight', Parameter containing: tensor([[-0.0403, 0.0787, 0.1707, ..., 0.1901, 0.0629, -0.0710], [ 0.0055, -0.0050, -0.0069, ..., -0.0030, 0.0038, 0.0087], [-0.0459, 0.4714, -0.0611, ..., 0.1072, 0.0301, 0.0497], ..., [-0.0135, 0.0287, -0.0467, ..., 0.0460, -0.0252, 0.0121], [-0.0041, 0.0145, -0.0552, ..., 0.0493, 0.0098, -0.0091], [ 0.0063, 0.0296, -0.0188, ..., -0.0108, 0.0221, -0.0010]], requires_grad=True))
print('********* Embedding Layer *********\n')
for par in n_params[0:5]:
print(f'{par[0], str(tuple(par[1].size()))}')
********* Embedding Layer ********* ('model.shared.weight', '(50265, 1024)') ('model.encoder.embed_positions.weight', '(1026, 1024)') ('model.encoder.layers.0.self_attn.k_proj.weight', '(1024, 1024)') ('model.encoder.layers.0.self_attn.k_proj.bias', '(1024,)') ('model.encoder.layers.0.self_attn.v_proj.weight', '(1024, 1024)')
print('********* First Encoder ********* \n')
for par in n_params[5:21]:
print(f'{par[0], str(tuple(par[1].size()))}')
********* First Encoder ********* ('model.encoder.layers.0.self_attn.v_proj.bias', '(1024,)') ('model.encoder.layers.0.self_attn.q_proj.weight', '(1024, 1024)') ('model.encoder.layers.0.self_attn.q_proj.bias', '(1024,)') ('model.encoder.layers.0.self_attn.out_proj.weight', '(1024, 1024)') ('model.encoder.layers.0.self_attn.out_proj.bias', '(1024,)') ('model.encoder.layers.0.self_attn_layer_norm.weight', '(1024,)') ('model.encoder.layers.0.self_attn_layer_norm.bias', '(1024,)') ('model.encoder.layers.0.fc1.weight', '(4096, 1024)') ('model.encoder.layers.0.fc1.bias', '(4096,)') ('model.encoder.layers.0.fc2.weight', '(1024, 4096)') ('model.encoder.layers.0.fc2.bias', '(1024,)') ('model.encoder.layers.0.final_layer_norm.weight', '(1024,)') ('model.encoder.layers.0.final_layer_norm.bias', '(1024,)') ('model.encoder.layers.1.self_attn.k_proj.weight', '(1024, 1024)') ('model.encoder.layers.1.self_attn.k_proj.bias', '(1024,)') ('model.encoder.layers.1.self_attn.v_proj.weight', '(1024, 1024)')
print('********* Output Layer ********* \n')
for par in n_params[-2:]:
print(f'{par[0], str(tuple(par[1].size()))}')
********* Output Layer ********* ('classification_head.out_proj.weight', '(5, 1024)') ('classification_head.out_proj.bias', '(5,)')
tweet_clf_model.config
BartConfig { "_name_or_path": "facebook/bart-large-mnli", "_num_labels": 3, "activation_dropout": 0.0, "activation_function": "gelu", "add_final_layer_norm": false, "architectures": [ "BartForSequenceClassification" ], "attention_dropout": 0.0, "bos_token_id": 0, "classif_dropout": 0.0, "classifier_dropout": 0.0, "d_model": 1024, "decoder_attention_heads": 16, "decoder_ffn_dim": 4096, "decoder_layerdrop": 0.0, "decoder_layers": 12, "decoder_start_token_id": 2, "dropout": 0.1, "encoder_attention_heads": 16, "encoder_ffn_dim": 4096, "encoder_layerdrop": 0.0, "encoder_layers": 12, "eos_token_id": 2, "forced_eos_token_id": 2, "gradient_checkpointing": false, "id2label": { "0": "Positive", "1": "Neutral", "2": "Extremely Positive", "3": "Extremely Negative", "4": "Negative" }, "init_std": 0.02, "is_encoder_decoder": true, "label2id": { "Extremely Negative": 3, "Extremely Positive": 2, "Negative": 4, "Neutral": 1, "Positive": 0 }, "max_position_embeddings": 1024, "model_type": "bart", "normalize_before": false, "num_hidden_layers": 12, "output_past": false, "pad_token_id": 1, "scale_embedding": false, "transformers_version": "4.26.1", "use_cache": true, "vocab_size": 50265 }
Every model comes with a config. In this config, there are id2label and label2id:
Within the BartForSequenceClassification model, "id2label" and "label2id" are two mapping dictionaries employed for the conversion of labels between their textual and numerical representations.
- id2label: This dictionary establishes a mapping from numerical label IDs to their respective textual labels. Each label ID is uniquely associated with a specific label. For instance, in a scenario with three labels— "positive," "neutral," and "negative"— and their corresponding IDs 0, 1, and 2, the id2label mapping would appear as {0: "positive", 1: "neutral", 2: "negative"}. This mapping proves beneficial when transforming model predictions, typically presented as numerical IDs, back into their corresponding textual labels.
- label2id: Serving as the inverse of id2label, this dictionary maps textual labels to their corresponding numerical IDs. Using the aforementioned example, the label2id mapping would be {"positive": 0, "neutral": 1, "negative": 2}. This mapping becomes valuable when converting textual labels into their numerical representations, a requirement often encountered during the training or evaluation of a model.
tweet_clf_model.config.id2label[0]
'Positive'
It is now time to introduce a custom metric. Hugging Face uses loss as the default performance metric, but accuracy is a more straightforward and intuitive metric for evaluation.
Fine-tune Trainer with Labeled Data
We take the pre-trained knowledge of BART and transfer it to our supervised data set, training for only a few epochs. The code pattern below will come up again and again, because it defines our training loop.
from transformers import TrainingArguments, Trainer
See the Hugging Face documentation for more information on fine-tuning models.
🤗 Transformers offers a Trainer class designed to facilitate the fine-tuning of any pretrained models on your specific dataset. After completing the data preprocessing tasks outlined in the previous section, there are only a few remaining steps to configure the Trainer. The most challenging aspect is likely to be the preparation of the environment for executing Trainer.train(), as the process tends to be significantly slower when run on a CPU.
- Training
Before defining our Trainer, the initial step involves defining a TrainingArguments class that consolidates all the hyperparameters utilized by the Trainer for training and evaluation. The sole mandatory argument is the directory path where the trained model, along with checkpoints, will be stored. For the remaining parameters, default values can be retained, typically sufficient for basic fine-tuning.
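As a minimal sketch, a TrainingArguments object with only the mandatory argument would look like this (the full set of arguments we actually use appears later):
# Minimal TrainingArguments: only the output directory is required;
# everything else falls back to sensible defaults
args_minimal = TrainingArguments(output_dir='./tweet_clf/results')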
Subsequently, the Trainer can be instantiated by incorporating all the previously constructed objects: the model, training_args, the training and validation datasets, and the data_collator.
This initiates the fine-tuning process, which, when performed on a GPU, typically concludes within a few minutes. The Trainer provides periodic updates on the training loss every 500 steps. However, it does not furnish information on the model's performance, as:
The Trainer has not been configured to perform evaluations during training. This can be achieved by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).
A compute_metrics() function needs to be supplied to the Trainer for calculating a metric during the aforementioned evaluations. Otherwise, the evaluation results would only display the loss, which may not be the most intuitive metric.
- Evaluation
Let's explore the process of constructing a valuable compute_metrics() function and integrating it into our next training session. This function is designed to take an EvalPrediction object, which is a named tuple featuring a predictions field and a label_ids field. Its output is expected to be a dictionary that maps strings to floats, where the strings denote the names of the returned metrics and the floats represent their respective values. To obtain predictions from our model, the Trainer.predict() command can be employed.
The result of the predict() method is yet another named tuple encompassing three fields: predictions, label_ids, and metrics. The metrics field includes the dataset's loss, along with time-related metrics such as the total prediction time and the average prediction time. Once we finalize the compute_metrics() function and incorporate it into the Trainer, this field will also encompass the metrics outputted by compute_metrics().
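As a sketch, inspecting predictions this way with the trainer configured later in this notebook would look like:
# Sketch: get raw predictions from Trainer.predict (assumes the `trainer`
# object defined later in this notebook)
preds_output = trainer.predict(seq_clf_tokenized_comments['test'])
print(preds_output.metrics)  # loss, timing, and any compute_metrics() results
pred_ids = np.argmax(preds_output.predictions[0], axis=-1)  # predicted class ids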
See Stack Overflow for how to use evaluate.load.
Freezing BART Parameters
Since the BART model is very large, fine-tuning all parameters would take a long time, so we freeze some parameters to speed up the fine-tuning process.
# Get the number of layers
config = tweet_clf_model.config
num_layers = config.num_hidden_layers
print("Number of layers in the BART model:", num_layers)
Number of layers in the BART model: 12
# To speed up training, freeze all BART parameters before the last decoder layer
ir = 0
for name, param in tweet_clf_model.named_parameters():
    ir += 1
    if 'model.decoder.layers.11' in name:  # decoder layers are indexed 0-11, so stop at the last one
        print(f'Parameter {ir}: model.decoder.layers.11')
        break
    param.requires_grad = False  # freeze this parameter
Parameter 484: model.decoder.layers.11
import evaluate
metric = evaluate.load("accuracy")
from sklearn.metrics import roc_auc_score
def compute_metrics(eval_pred):
logits = eval_pred.predictions[0]
labels = eval_pred.label_ids
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
#####################################################
def compute_metrics_binary(eval_pred):
    """metrics for binary classification"""
    labels = eval_pred.label_ids
    logits = eval_pred.predictions[0]
    preds = np.argmax(logits, axis=-1)
    # Probability of the positive class (softmax over the two logits), needed for AUC
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # Calculate the AUC score
    auc_score = roc_auc_score(labels, probs[:, 1])
    # Calculate the true positive, false positive, false negative, and true negative counts
    tp = ((preds == 1) & (labels == 1)).sum()
    fp = ((preds == 1) & (labels == 0)).sum()
    fn = ((preds == 0) & (labels == 1)).sum()
    tn = ((preds == 0) & (labels == 0)).sum()
    # Calculate the precision, recall, and F1 score
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {
        'Validation Precision': precision,
        'Validation Recall': recall,
        'Validation F1_Score': f1_score,
        'Validation AUC_Score': auc_score,
        'Validation TP': tp,
        'Validation FP': fp,
        'Validation FN': fn,
        'Validation TN': tn,
    }
#####################################################
from sklearn.metrics import classification_report
def compute_metrics_multiclass(eval_pred):
"""metrics for multiclass classification"""
labels = eval_pred.label_ids
logits = eval_pred.predictions[0]
preds = np.argmax(logits, axis=-1)
report = classification_report(labels, preds, output_dict=True)
acc_score = report['accuracy']
pre_score = report['macro avg']['precision']
rcl_score = report['macro avg']['recall']
f1_score = report['macro avg']['f1-score']
return {
'Validation Accuracy': acc_score,
'Validation Macro Recall': rcl_score,
'Validation Macro Precision': pre_score,
'Validation Macro F1_Score': f1_score,
}
Trainer
epochs = 5
batch_size = 5
from transformers import set_seed
set_seed(42)
tweet_clf_model.eval()
# Training arguments: monitor and track training settings, including the saving
# strategy and scheduler parameters
training_args = TrainingArguments(
    output_dir="./tweet_clf/results",         # local directory to save checkpoints of our model while fitting
    num_train_epochs=epochs,                  # number of passes over the training set
    per_device_train_batch_size=batch_size,   # batch size for training; around 32 is common,
    per_device_eval_batch_size=batch_size,    # sometimes less or more; the smaller the batch, the more frequent the updates
    load_best_model_at_end=True,              # even if we accidentally overfit, load the best checkpoint at the end
    metric_for_best_model='Validation Accuracy',  # used with `load_best_model_at_end` to compare models; must be the
                                              # name of a metric returned by evaluation, with or without the "eval_"
                                              # prefix; defaults to "loss" if unspecified
    # some deep learning parameters that the trainer is able to take in
    warmup_steps=len(seq_clf_tokenized_comments['train']) // 2,  # learning rate scheduler warmup steps
    weight_decay=0.1,                         # weight decay for our learning rate schedule (regularization)
    seed=42,                                  # random seed to ensure reproducibility across runs
    logging_steps=1,                          # number of steps between logs (1 means logging as often as possible)
    log_level='info',
    evaluation_strategy='epoch',              # "steps" or "epoch"; we evaluate at the end of each epoch
    eval_steps=50,
    save_strategy='epoch'                     # save a checkpoint of our model after each epoch
)
# Define the trainer: API to PyTorch
trainer = Trainer(
    model=tweet_clf_model,                               # our model (tweet_clf_model)
    args=training_args,                                  # the training arguments we just set above
    train_dataset=seq_clf_tokenized_comments['train'],   # training part of the dataset
    eval_dataset=seq_clf_tokenized_comments['test'],     # test (evaluation) part of the dataset
    compute_metrics=compute_metrics_multiclass,          # optional, but we want metrics for our model
    data_collator=data_collator                          # data collator with padding; in fact, we may or may not need
                                                         # one -- we can inspect a batch with and without the collator
)
Before we start training, we can run the trainer on the not-yet-fine-tuned model to measure its baseline performance:
# Get initial metrics: evaluation on test set
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 1000 Batch size = 5 You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'eval_loss': 1.6543835401535034, 'eval_Validation Accuracy': 0.247, 'eval_Validation Macro Recall': 0.21789604394781487, 'eval_Validation Macro Precision': 0.17544466735784783, 'eval_Validation Macro F1_Score': 0.1576815486394501, 'eval_runtime': 383.0252, 'eval_samples_per_second': 2.611, 'eval_steps_per_second': 0.522}
%%time
trainer.train()
The following columns in the training set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running training ***** Num examples = 9000 Num Epochs = 5 Instantaneous batch size per device = 5 Total train batch size (w. parallel, distributed & accumulation) = 5 Gradient Accumulation steps = 1 Total optimization steps = 9000 Number of trainable parameters = 17853445
Epoch | Training Loss | Validation Loss | Validation accuracy | Validation macro recall | Validation macro precision | Validation macro f1 Score
---|---|---|---|---|---|---
1 | 1.033300 | 1.293204 | 0.437000 | 0.415319 | 0.507195 | 0.422844 |
2 | 1.265700 | 1.165269 | 0.476000 | 0.490785 | 0.543280 | 0.480481 |
3 | 0.766400 | 1.029008 | 0.584000 | 0.594189 | 0.602664 | 0.587308 |
4 | 0.666600 | 1.083287 | 0.573000 | 0.585591 | 0.593171 | 0.587646 |
5 | 0.471600 | 1.240799 | 0.568000 | 0.584214 | 0.582199 | 0.582162 |
The following columns in the evaluation set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 1000 Batch size = 5 Saving model checkpoint to ./tweet_clf/results\checkpoint-1800 Configuration saved in ./tweet_clf/results\checkpoint-1800\config.json Model weights saved in ./tweet_clf/results\checkpoint-1800\pytorch_model.bin The following columns in the evaluation set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 1000 Batch size = 5 Saving model checkpoint to ./tweet_clf/results\checkpoint-3600 Configuration saved in ./tweet_clf/results\checkpoint-3600\config.json Model weights saved in ./tweet_clf/results\checkpoint-3600\pytorch_model.bin The following columns in the evaluation set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 1000 Batch size = 5 Saving model checkpoint to ./tweet_clf/results\checkpoint-5400 Configuration saved in ./tweet_clf/results\checkpoint-5400\config.json Model weights saved in ./tweet_clf/results\checkpoint-5400\pytorch_model.bin The following columns in the evaluation set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 1000 Batch size = 5 Saving model checkpoint to ./tweet_clf/results\checkpoint-7200 Configuration saved in ./tweet_clf/results\checkpoint-7200\config.json Model weights saved in ./tweet_clf/results\checkpoint-7200\pytorch_model.bin The following columns in the evaluation set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 1000 Batch size = 5 Saving model checkpoint to ./tweet_clf/results\checkpoint-9000 Configuration saved in ./tweet_clf/results\checkpoint-9000\config.json Model weights saved in ./tweet_clf/results\checkpoint-9000\pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) Loading best model from ./tweet_clf/results\checkpoint-5400 (score: 0.584).
Wall time: 12h 20s
TrainOutput(global_step=9000, training_loss=0.9914621342285019, metrics={'train_runtime': 43220.4566, 'train_samples_per_second': 1.041, 'train_steps_per_second': 0.208, 'total_flos': 5513932597456650.0, 'train_loss': 0.9914621342285019, 'epoch': 5.0})
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `BartForSequenceClassification.forward` and have been ignored: titles, tokens. If titles, tokens are not expected by `BartForSequenceClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 1000 Batch size = 5
{'eval_loss': 1.0290082693099976, 'eval_Validation Accuracy': 0.584, 'eval_Validation Macro Recall': 0.5941889076629356, 'eval_Validation Macro Precision': 0.6026636803376203, 'eval_Validation Macro F1_Score': 0.587307594829771, 'eval_runtime': 339.3369, 'eval_samples_per_second': 2.947, 'eval_steps_per_second': 0.589, 'epoch': 5.0}
## Build a text-classification pipeline from the fine-tuned model
#tweet_clf_model = tweet_clf_model.to('cpu')  # put the model on the CPU
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer)
text = """Wish I was physic so I could have predicted how much sanitizer and toilet
paper I would need before this got real. #stopPanicBuying #coronavirus"""
pipe(text)
[{'label': 'Negative', 'score': 0.5844146609306335}]
text = """After almost two weeks of going absolutely nowhere, we had to go to
the grocery store to stock up on food, fruit and veggies. Some older people:
their behaviour!?? ? What part of social distancing do you not understand?!
Already had such a stressful day. ? #SocialDistancing"""
pipe(text)
[{'label': 'Negative', 'score': 0.5122951865196228}]
Save Model
# We can save our model to the directory we specified and load it back for prediction
trainer.save_model()
Saving model checkpoint to ./tweet_clf/results Configuration saved in ./tweet_clf/results\config.json Model weights saved in ./tweet_clf/results\pytorch_model.bin
We can easily load our pipeline directly from the directory. This is very useful for deploying our model on the cloud with one line of code, and it returns exactly the same results.
pipe = pipeline("text-classification", "./tweet_clf/results", tokenizer=tokenizer)
loading configuration file ./tweet_clf/results\config.json You passed along `num_labels=3` with an incompatible id to label map: {'0': 'Positive', '1': 'Neutral', '2': 'Extremely Positive', '3': 'Extremely Negative', '4': 'Negative'}. The number of labels wil be overwritten to 5. Model config BartConfig { "_name_or_path": "./tweet_clf/results", "_num_labels": 3, "activation_dropout": 0.0, "activation_function": "gelu", "add_final_layer_norm": false, "architectures": [ "BartForSequenceClassification" ], "attention_dropout": 0.0, "bos_token_id": 0, "classif_dropout": 0.0, "classifier_dropout": 0.0, "d_model": 1024, "decoder_attention_heads": 16, "decoder_ffn_dim": 4096, "decoder_layerdrop": 0.0, "decoder_layers": 12, "decoder_start_token_id": 2, "dropout": 0.1, "encoder_attention_heads": 16, "encoder_ffn_dim": 4096, "encoder_layerdrop": 0.0, "encoder_layers": 12, "eos_token_id": 2, "forced_eos_token_id": 2, "gradient_checkpointing": false, "id2label": { "0": "Positive", "1": "Neutral", "2": "Extremely Positive", "3": "Extremely Negative", "4": "Negative" }, "init_std": 0.02, "is_encoder_decoder": true, "label2id": { "Extremely Negative": 3, "Extremely Positive": 2, "Negative": 4, "Neutral": 1, "Positive": 0 }, "max_position_embeddings": 1024, "model_type": "bart", "normalize_before": false, "num_hidden_layers": 12, "output_past": false, "pad_token_id": 1, "problem_type": "single_label_classification", "scale_embedding": false, "torch_dtype": "float32", "transformers_version": "4.26.1", "use_cache": true, "vocab_size": 50265 } loading configuration file ./tweet_clf/results\config.json You passed along `num_labels=3` with an incompatible id to label map: {'0': 'Positive', '1': 'Neutral', '2': 'Extremely Positive', '3': 'Extremely Negative', '4': 'Negative'}. The number of labels wil be overwritten to 5. Model config BartConfig { "_name_or_path": "./tweet_clf/results", "_num_labels": 3, "activation_dropout": 0.0, "activation_function": "gelu", "add_final_layer_norm": false, "architectures": [ "BartForSequenceClassification" ], "attention_dropout": 0.0, "bos_token_id": 0, "classif_dropout": 0.0, "classifier_dropout": 0.0, "d_model": 1024, "decoder_attention_heads": 16, "decoder_ffn_dim": 4096, "decoder_layerdrop": 0.0, "decoder_layers": 12, "decoder_start_token_id": 2, "dropout": 0.1, "encoder_attention_heads": 16, "encoder_ffn_dim": 4096, "encoder_layerdrop": 0.0, "encoder_layers": 12, "eos_token_id": 2, "forced_eos_token_id": 2, "gradient_checkpointing": false, "id2label": { "0": "Positive", "1": "Neutral", "2": "Extremely Positive", "3": "Extremely Negative", "4": "Negative" }, "init_std": 0.02, "is_encoder_decoder": true, "label2id": { "Extremely Negative": 3, "Extremely Positive": 2, "Negative": 4, "Neutral": 1, "Positive": 0 }, "max_position_embeddings": 1024, "model_type": "bart", "normalize_before": false, "num_hidden_layers": 12, "output_past": false, "pad_token_id": 1, "problem_type": "single_label_classification", "scale_embedding": false, "torch_dtype": "float32", "transformers_version": "4.26.1", "use_cache": true, "vocab_size": 50265 } loading weights file ./tweet_clf/results\pytorch_model.bin All model checkpoint weights were used when initializing BartForSequenceClassification. All the weights of BartForSequenceClassification were initialized from the model checkpoint at ./tweet_clf/results. If your task is similar to the task the model of the checkpoint was trained on, you can already use BartForSequenceClassification for predictions without further training.
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer, return_all_scores=True)
def refined_zero(x):
    """Return the five (label, probability) pairs predicted for x, sorted by probability."""
    preds = pipe(x)[0]  # call the pipeline once per text
    pipe_x = sorted(((p['label'], p['score']) for p in preds), key=lambda t: t[1], reverse=True)
    # flatten to l1, p1, ..., l5, p5, padding with None if fewer than five labels come back
    pairs = [pipe_x[i] if i < len(pipe_x) else (None, None) for i in range(5)]
    return tuple(v for pair in pairs for v in pair)
titles = []
perdi = []
score = []
for ii in tweet_dataset['test']['titles']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in tweet_dataset['test']['label']:
label.append(tweet_clf_model.config.id2label[ii])
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
# unpack the five (label, probability) pairs into columns
for k in range(5):
    Actl_pred_copy[f'label_{k+1}'] = [i[2*k] for i in perdi]
    Actl_pred_copy[f'prob_label_{k+1}'] = [i[2*k+1] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
 | titles | Actual | label_1 | prob_label_1 | label_2 | prob_label_2 | label_3 | prob_label_3 | label_4 | prob_label_4 | label_5 | prob_label_5
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | I work at a grocery store and we staying open during the State lockdown , but I don't feel comfortable with being there ... what should I do ? | Extremely Positive | Neutral | 0.622212 | Positive | 0.216991 | Negative | 0.133315 | Extremely Positive | 0.024164 | Extremely Negative | 0.003319 |
1 | When many of us are busy Stock piling there is this great Sikh Volunteers Australia community who is busy in providing free food home delivery service for needy people in this Hats off to them Please share it so that it can reach needy people across the state | Extremely Positive | Extremely Positive | 0.796087 | Positive | 0.201347 | Negative | 0.002179 | Neutral | 0.000264 | Extremely Negative | 0.000123 |
2 | 1st Wave - From China 2nd Wave - Local Mass Gathering 3rd Wave - Supermarket , Balik Kampung Transport , Police Station Gathering 4th Wave - Sampai Kampung Gathering | Neutral | Neutral | 0.816192 | Negative | 0.144212 | Positive | 0.026821 | Extremely Negative | 0.011600 | Extremely Positive | 0.001176 |
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(14, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val_obj = Actl_pred['label_1']
bargraph (val_obj, title=f'Predicted Labels', ylabel='Counts',titlefontsize=15, xfontsize=10,yfontsize=13,
percent=False,fontsizelable=11,yshift=1,xshift=-0.1,color='g',legend=True,ytick_rot=25, y_rot=0, axt=ax1)
plt.show()
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Calculate the percentage
font = {'size' : 9}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 5), dpi= 120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr=np.zeros((5,5))
for i in range(len(alllabels)):
for j in range(len(alllabels)):
per_cltr[i,j] = len(Actl_pred[(Actl_pred['Actual']==alllabels[i]) &
(Actl_pred['label_1']==alllabels[j])])/len(Actl_pred[Actl_pred['Actual']==alllabels[i]])
cax =ax.matshow(per_cltr, cmap='jet', interpolation='nearest',vmin=0, vmax=1)
cbar=fig.colorbar(cax,shrink=0.6,orientation='vertical',label='Low % High %')
cbar.set_ticks([])
#plt.title('Mismatch Percentage', fontsize=14,y=1.17)
for i in range(5):
for j in range(5):
c = per_cltr[i,j]*100
ax.text(j, i, str(round(c,1))+'%', va='center',weight="bold", ha='center',fontsize=12,c='w')
columns=[f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns,fontsize=8,rotation=35,y=0.97)
columns=[f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns,fontsize=9,rotation='0')
plt.title(f'Confusion Matrix to Predict {len(titles)} Validation Set Samples', fontsize=12,y=1.25)
txt = "Overall Accuracy = "+ r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = "+ r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = "+ r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = "+ r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = "+ r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$'+": calculate metrics for each label separately and then average them\n"
txt_def+= r'$\mathbf{' + 'Micro' + '}$'+": pool all True Positives, False Positives and False Negatives across labels and then calculate metrics"
plt.text(-1, 5.5, txt_def,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
Evaluate Model On Test Set
Fine-tuned Prediction
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer, return_all_scores=True)
titles = []
perdi = []
score = []
for ii in test['OriginalTweet']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in test['Sentiment']:
label.append(ii)
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
# unpack the five (label, probability) pairs into columns
for k in range(5):
    Actl_pred_copy[f'label_{k+1}'] = [i[2*k] for i in perdi]
    Actl_pred_copy[f'prob_label_{k+1}'] = [i[2*k+1] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
 | titles | Actual | label_1 | prob_label_1 | label_2 | prob_label_2 | label_3 | prob_label_3 | label_4 | prob_label_4 | label_5 | prob_label_5
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | TRENDING : New Yorkers encounter empty supermarket shelves ( pictured , Wegmans in Brooklyn ) , sold-out online grocers ( FoodKick , MaxDelivery ) as shoppers stock up | Extremely Negative | Negative | 0.538205 | Neutral | 0.245195 | Positive | 0.193545 | Extremely Positive | 0.015169 | Extremely Negative | 0.007887 |
1 | When I couldn't find hand sanitizer at Fred Meyer , I turned to . But $ 114.97 for a 2 pack of Purell ? ? ! ! Check out how concerns are driving up prices . | Positive | Positive | 0.561097 | Neutral | 0.184296 | Extremely Positive | 0.156479 | Negative | 0.095451 | Extremely Negative | 0.002677 |
2 | Find out how you can protect yourself and loved ones from . ? | Extremely Positive | Positive | 0.485621 | Extremely Positive | 0.461758 | Neutral | 0.041895 | Negative | 0.010484 | Extremely Negative | 0.000242 |
def histplt(val: list, bins: int, title: str, xlabl: str, ylabl: str, xlimt: list,
            ylimt: list = False, loc: int = 1, legend: int = 1, axt=None, days: int = False,
            class_: int = False, scale: int = 1, x_tick: list = False, calc_perc: bool = True,
            nsplit: int = 1, font: int = 5, color: str = 'b') -> tuple:
""" Histogram including important statistics """
ax1 = axt or plt.axes()
font = {'size' : font }
plt.rc('font', **font)
miss_n = len(val[np.isnan(val)])
tot = len(val)
n_distinct = len(np.unique(val))
miss_p = (len(val[np.isnan(val)])/tot)*100
val = val[~np.isnan(val)]
val = np.array(val)
plt.hist(val, bins=bins, weights=np.ones(len(val)) / len(val),ec='black',color=color)
n_nonmis = len(val[~np.isnan(val)])
if class_:
times = 100
else:
times = 1
Mean = np.nanmean(val)*times
Median = np.nanmedian(val)*times
sd = np.sqrt(np.nanvar(val))
Max = np.nanmax(val)
Min = np.nanmin(val)
p1 = np.quantile(val, 0.01)
p25 = np.quantile(val, 0.25)
p75 = np.quantile(val, 0.75)
p99 = np.quantile(val, 0.99)
if calc_perc == True:
txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\np1%%=%0.1f\np99%%=%0.1f\nMin=%0.1f\nMax=%0.1f'
anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,p1,p99,Min,Max), borderpad=0,
loc=loc,prop={ 'size': font['size']*scale})
else:
txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\nMin=%0.1f\nMax=%0.1f'
anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,Min,Max), borderpad=0,
loc=loc,prop={ 'size': font['size']*scale})
if(legend==1): ax1.add_artist(anchored_text)
if (scale): plt.title(title,fontsize=font['size']*(scale+0.15))
else: plt.title(title)
plt.xlabel(xlabl,fontsize=font['size'])
ax1.set_ylabel('Frequency',fontsize=font['size'])
if (scale): ax1.set_xlabel(xlabl,fontsize=font['size']*scale)
else: ax1.set_xlabel(xlabl)
try:
xlabl
except NameError:
pass
else:
if (scale): plt.xlabel(xlabl,fontsize=font['size']*scale)
else: plt.xlabel(xlabl)
try:
ylabl
except NameError:
pass
else:
if (scale): plt.ylabel(ylabl,fontsize=font['size']*scale)
else: plt.ylabel(ylabl)
if (class_==True): plt.xticks([0,1])
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.grid(linewidth='0.1')
try:
xlimt
except NameError:
pass
else:
plt.xlim(xlimt)
try:
ylimt
except NameError:
pass
else:
plt.ylim(ylimt)
if x_tick: plt.xticks(x_tick,fontsize=font['size']*scale)
plt.yticks(fontsize=font['size']*scale)
plt.grid(linewidth='0.12')
# Interquartile Range Method for outlier detection
iqr = p75 - p25
# calculate the outlier cutoff
cut_off = np.array(iqr) * 1.5
lower, upper = p25 - cut_off, p75 + cut_off
return tot, n_nonmis, n_distinct, miss_n, miss_p, Mean, Median, sd, Max, Min, p1, p25, p75, p99, sd
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val = Actl_pred.prob_label_1
histplt(val, bins=25, title='Fine-tuned Model: Histogram of Predicted Probability', xlabl=None, days=False,
        ylabl=None, xlimt=(0,1), ylimt=False, axt=ax1, nsplit=5, scale=0.95, font=12, loc=2, color='g')
plt.show()
Actl_pred.Actual.value_counts()
Negative 583 Positive 497 Extremely Positive 329 Extremely Negative 316 Neutral 275 Name: Actual, dtype: int64
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Row-normalized confusion matrix: fraction of each actual class assigned to each predicted class
font = {'size': 9}
plt.rc('font', **font)
fig, ax = plt.subplots(figsize=(6, 5), dpi=120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr = np.zeros((5, 5))
for i in range(len(alllabels)):
    for j in range(len(alllabels)):
        per_cltr[i, j] = len(Actl_pred[(Actl_pred['Actual'] == alllabels[i]) &
                                       (Actl_pred['label_1'] == alllabels[j])]) / \
                         len(Actl_pred[Actl_pred['Actual'] == alllabels[i]])
cax = ax.matshow(per_cltr, cmap='jet', interpolation='nearest', vmin=0, vmax=1)
cbar = fig.colorbar(cax, shrink=0.6, orientation='vertical', label='Low % → High %')
cbar.set_ticks([])
# annotate each cell with its percentage
for i in range(5):
    for j in range(5):
        c = per_cltr[i, j] * 100
        ax.text(j, i, str(round(c, 1)) + '%', va='center', weight="bold", ha='center', fontsize=12, c='w')
columns = [f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns, fontsize=8, rotation=35, y=0.97)
columns = [f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns, fontsize=9, rotation='0')
plt.title(f'Fine-tuned Model: Confusion Matrix for {len(label)} Test Tweets', fontsize=12, y=1.25)
txt = "Overall Accuracy = " + r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = " + r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = " + r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = " + r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = " + r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt, rotation=0, color='k', ha='left', fontsize=12, bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$' + ": compute the metric for each label separately, then take the unweighted average\n"
txt_def += r'$\mathbf{' + 'Micro' + '}$' + ": sum True Positives, False Positives and False Negatives over all labels, then compute the metric globally"
plt.text(-1, 5.5, txt_def, rotation=0, color='k', ha='left', fontsize=12, bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
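The macro/micro distinction and the row-normalized matrix can be cross-checked with scikit-learn built-ins. A minimal sketch on toy labels (the two lists below are illustrative, not the notebook's data):
from sklearn.metrics import confusion_matrix

# toy labels (illustrative only)
y_true = ['Positive', 'Negative', 'Neutral', 'Positive', 'Negative']
y_pred = ['Positive', 'Neutral',  'Neutral', 'Negative', 'Negative']
classes = ['Negative', 'Neutral', 'Positive']

# normalize='true' divides each row by the count of that actual class,
# i.e. the same row percentages as the nested loop above
cm = confusion_matrix(y_true, y_pred, labels=classes, normalize='true')

# macro: per-class precision first, then unweighted average -> (0.5 + 0.5 + 1.0)/3 ≈ 0.67
macro_p = precision_score(y_true, y_pred, average='macro')
# micro: pool TP/FP over all classes first -> 3 correct out of 5 = 0.60
micro_p = precision_score(y_true, y_pred, average='micro')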
Zero-shot Prediction¶
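The predictions in this section reuse the pre-trained pipeline built earlier in the notebook. For reference, a minimal sketch of how such a pipeline is typically constructed with the Hugging Face pipeline API, assuming the facebook/bart-large-mnli checkpoint (the standard BART model for zero-shot classification):
# minimal sketch (assumed checkpoint; the notebook defined pipe_pre_trained earlier)
pipe_pre_trained = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')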
pipe = pipe_pre_trained
def refined_zero(x):
    # score the tweet against all five sentiment classes with the zero-shot pipeline
    pred_zero = pipe(x, candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
                                          'Extremely Positive', 'Positive'])
    # the pipeline returns parallel lists of labels and scores
    label_ = pred_zero['labels']
    score_ = pred_zero['scores']
    # sort (label, score) pairs by score, highest first
    pipe_x = sorted(zip(label_, score_), key=lambda pair: pair[1], reverse=True)
    try:
        l1, p1 = pipe_x[0]
        l2, p2 = pipe_x[1]
        l3, p3 = pipe_x[2]
        l4, p4 = pipe_x[3]
    except Exception:
        # fall back to None if the pipeline failed or returned fewer labels than expected
        l1 = p1 = l2 = p2 = l3 = p3 = l4 = p4 = None
    return l1, p1, l2, p2, l3, p3, l4, p4
# run zero-shot prediction for every tweet in the test set
titles = []
perdi = []
for ii in test['OriginalTweet']:
    titles.append(ii)
    run = refined_zero(ii)
    perdi.append(run)
# actual sentiment labels for the same tweets
label = []
for ii in test['Sentiment']:
    label.append(ii)
# collect actual labels and the top-4 predicted labels/probabilities into a DataFrame
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
Actl_pred_copy['label_1'] = [i[0] for i in perdi]
Actl_pred_copy['prob_label_1'] = [i[1] for i in perdi]
Actl_pred_copy['label_2'] = [i[2] for i in perdi]
Actl_pred_copy['prob_label_2'] = [i[3] for i in perdi]
Actl_pred_copy['label_3'] = [i[4] for i in perdi]
Actl_pred_copy['prob_label_3'] = [i[5] for i in perdi]
Actl_pred_copy['label_4'] = [i[6] for i in perdi]
Actl_pred_copy['prob_label_4'] = [i[7] for i in perdi]
# keep only rows where the pipeline returned a valid top-1 probability
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1'] > 0.0]
Actl_pred[:3]
|   | titles | Actual | label_1 | prob_label_1 | label_2 | prob_label_2 | label_3 | prob_label_3 | label_4 | prob_label_4 |
|---|--------|--------|---------|--------------|---------|--------------|---------|--------------|---------|--------------|
| 0 | TRENDING : New Yorkers encounter empty supermarket shelves ( pictured , Wegmans in Brooklyn ) , sold-out online grocers ( FoodKick , MaxDelivery ) as shoppers stock up | Extremely Negative | Positive | 0.719895 | Extremely Positive | 0.186460 | Negative | 0.040895 | Extremely Negative | 0.027538 |
| 1 | When I couldn't find hand sanitizer at Fred Meyer , I turned to . But $ 114.97 for a 2 pack of Purell ? ? ! ! Check out how concerns are driving up prices . | Positive | Extremely Negative | 0.618703 | Negative | 0.253448 | Neutral | 0.058025 | Positive | 0.042613 |
| 2 | Find out how you can protect yourself and loved ones from . ? | Extremely Positive | Negative | 0.430846 | Extremely Negative | 0.222634 | Positive | 0.157836 | Neutral | 0.131127 |
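As an aside, the eight label/probability columns could also be built in one step from the list of tuples returned by refined_zero; a minimal sketch, equivalent to the explicit assignments above:
# one-step construction of the same DataFrame (sketch)
cols = ['label_1', 'prob_label_1', 'label_2', 'prob_label_2',
        'label_3', 'prob_label_3', 'label_4', 'prob_label_4']
Actl_pred_copy = pd.DataFrame(perdi, columns=cols)
Actl_pred_copy.insert(0, 'Actual', label)    # actual labels as the 2nd column after the next insert
Actl_pred_copy.insert(0, 'titles', titles)   # tweet text as the 1st column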
font = {'size': 12}
plt.rc('font', **font)
fig, ax = plt.subplots(figsize=(12, 4), dpi=100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1, 2, 1)
# histogram of the top-label probabilities predicted by the zero-shot model
val = Actl_pred.prob_label_1
_ = histplt(val, bins=25, title='Zero-shot Model: Histogram of Predicted Probability',
            xlabl=None, days=False, ylabl=None, xlimt=(0, 1), ylimt=False,
            axt=ax1, nsplit=5, scale=0.95, font=12, loc=2, color='g')
plt.show()
Actl_pred.Actual.value_counts()
Negative              583
Positive              497
Extremely Positive    329
Extremely Negative    316
Neutral               275
Name: Actual, dtype: int64
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred) * 100, 1)
macro_averaged_precision = np.round(precision_score(label, pred, average='macro') * 100, 1)
micro_averaged_precision = np.round(precision_score(label, pred, average='micro') * 100, 1)
macro_averaged_recall = np.round(recall_score(label, pred, average='macro') * 100, 1)
micro_averaged_recall = np.round(recall_score(label, pred, average='micro') * 100, 1)
# Row-normalized confusion matrix: fraction of each actual class assigned to each predicted class
font = {'size': 9}
plt.rc('font', **font)
fig, ax = plt.subplots(figsize=(6, 5), dpi=120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr = np.zeros((5, 5))
for i in range(len(alllabels)):
    for j in range(len(alllabels)):
        per_cltr[i, j] = len(Actl_pred[(Actl_pred['Actual'] == alllabels[i]) &
                                       (Actl_pred['label_1'] == alllabels[j])]) / \
                         len(Actl_pred[Actl_pred['Actual'] == alllabels[i]])
cax = ax.matshow(per_cltr, cmap='jet', interpolation='nearest', vmin=0, vmax=1)
cbar = fig.colorbar(cax, shrink=0.6, orientation='vertical', label='Low % → High %')
cbar.set_ticks([])
# annotate each cell with its percentage
for i in range(5):
    for j in range(5):
        c = per_cltr[i, j] * 100
        ax.text(j, i, str(round(c, 1)) + '%', va='center', weight="bold", ha='center', fontsize=12, c='w')
columns = [f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns, fontsize=8, rotation=35, y=0.97)
columns = [f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns, fontsize=9, rotation='0')
plt.title(f'Zero-shot Model: Confusion Matrix for {len(label)} Test Tweets', fontsize=12, y=1.25)
txt = "Overall Accuracy = " + r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = " + r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = " + r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = " + r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = " + r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt, rotation=0, color='k', ha='left', fontsize=12, bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$' + ": compute the metric for each label separately, then take the unweighted average\n"
txt_def += r'$\mathbf{' + 'Micro' + '}$' + ": sum True Positives, False Positives and False Negatives over all labels, then compute the metric globally"
plt.text(-1, 5.5, txt_def, rotation=0, color='k', ha='left', fontsize=12, bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
The performance of zero-shot classification is much lower than that of the fine-tuned model. This is expected: the fine-tuned model has learned the dataset's specific five-class labeling conventions from the labeled training tweets, while the zero-shot model relies only on its MNLI pre-training.