Summary
Facebook's BART (Bidirectional and Auto-Regressive Transformers) is a sequence-to-sequence pre-trained language model designed for a wide range of natural language processing tasks. It is pre-trained with a denoising autoencoder objective: the input sequence is corrupted and the model learns to reconstruct the original text. The checkpoint used here, facebook/bart-large-mnli, has additionally been fine-tuned on the MultiNLI (MNLI) dataset, which makes it suitable for zero-shot classification: the model can classify text into unseen classes without task-specific training examples by treating the input as an NLI premise and each candidate label as a hypothesis, then ranking the labels by their entailment scores. BART can also be fine-tuned for downstream tasks such as sentiment analysis, summarization, or translation by training on task-specific datasets; fine-tuning updates the model's parameters on the new task while leveraging the knowledge gained during pre-training. In this notebook, Coronavirus tweets pulled from Twitter and manually tagged with five sentiments ('Positive', 'Neutral', 'Extremely Positive', 'Extremely Negative', 'Negative') are first classified with the BART model as a zero-shot classifier, and the model is then fine-tuned on the labeled data for more accurate sentiment prediction.
Python functions and data files needed to run this notebook are available via this link.
import warnings
warnings.filterwarnings('ignore')
from transformers import pipeline
import numpy as np
import pandas as pd
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.offsetbox import AnchoredText
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from matplotlib.ticker import PercentFormatter
from nltk.tokenize import TweetTokenizer
import nltk
import re
pd.set_option('display.max_colwidth', None)
Zero-shot classification is a machine learning technique employed to categorize or classify data into predefined classes without any previous examples or training data specific to those classes. The term "zero-shot" reflects the model's ability to make predictions for classes it has never encountered during training, thus possessing zero prior exposure to them. While traditional classification models rely on labeled examples for each class they aim to recognize, zero-shot classification trains the model to comprehend relationships between different classes, enabling it to generalize to new classes based on their inherent characteristics or attributes.
The fundamental concept underlying zero-shot classification involves leveraging auxiliary information, such as class descriptions, semantic embeddings, or attributes, to impart high-level knowledge about the classes. This supplementary information assists the model in learning associations between specific features or patterns and particular classes.
Zero-shot classification proves particularly advantageous in scenarios where obtaining labeled data for all potential classes is costly, time-consuming, or impractical. This technique empowers models to generalize effectively to unseen classes, enhancing their adaptability and flexibility.
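As a rough sketch of what happens under the hood of the facebook/bart-large-mnli zero-shot classifier, the input text is used as an NLI premise and each candidate label is wrapped into a hypothesis; the example text and hypothesis template below are illustrative assumptions, not taken from the data set:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

nli_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

premise = "Supermarket shelves are empty because of panic buying."   # made-up example text
hypothesis = "This example is Negative."                             # hypothesis built from a candidate label
inputs = nli_tokenizer(premise, hypothesis, return_tensors='pt', truncation=True)
with torch.no_grad():
    logits = nli_model(**inputs).logits      # order: [contradiction, neutral, entailment]
# probability of entailment vs. contradiction, ignoring the neutral class
entail_prob = torch.softmax(logits[0, [0, 2]], dim=-1)[1].item()
print(f'Entailment probability for the label "Negative": {entail_prob:.2f}')
The zero-shot pipeline used later in this notebook repeats this scoring for every candidate label and normalizes the results into the probabilities it reports.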
The Coronavirus tweet data set is downloaded from Kaggle. The tweets were pulled from Twitter and then manually tagged. Names and usernames have been replaced with codes to avoid privacy concerns. The columns are:
1) Location (London, UK, ...)
2) Tweet At (date, e.g. 16-03-2020)
3) Original Tweet
4) Sentiment (label)
file_name = 'Corona_NLP_train.csv'
train = pd.read_csv(file_name, encoding = "ISO-8859-1")
#
file_name = 'Corona_NLP_test.csv'
test = pd.read_csv(file_name, encoding = "ISO-8859-1")
train[:4]
nltk.download('punkt')
def remove_meaningless(text):
"""
    remove tokens starting with '@', '\n', '\r', '#', or containing 'https://', which carry no meaning for sentiment
"""
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(text)
filtered_tokens = [token for token in tokens if not (token.startswith('@') or token.startswith('\n') or
token.startswith('\r') or token.startswith('#') or
'https://' in token)]
return ' '.join(filtered_tokens)
train['OriginalTweet'] = train.apply(lambda x : remove_meaningless(x['OriginalTweet']), axis=1)
test['OriginalTweet'] = test.apply(lambda x : remove_meaningless(x['OriginalTweet']), axis=1)
train[:4]
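To illustrate what remove_meaningless strips out, here is a quick check on a made-up example (the text below is not from the data set):
sample = "@user Stay home! #StaySafe more info at https://example.com please"
print(remove_meaningless(sample))  # mentions, hashtags and URLs are dropped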
tt = TweetTokenizer()
#
train['n_tokens'] = train.apply(lambda x : len(tt.tokenize(x['OriginalTweet'])), axis=1)
test['n_tokens'] = test.apply(lambda x : len(tt.tokenize(x['OriginalTweet'])), axis=1)
# Remove tweets with 10 tokens or fewer
train = train[train['n_tokens']>10]
train = train.reset_index(drop=True)
test = test[test['n_tokens']>10]
test = test.reset_index(drop=True)
train[:4]
# Since the data set is large for fine-tuning, we select 10000 samples for the training set and 2000 samples for the test set
n_samples = 10000
train = train.iloc[:n_samples].copy()
test = test.iloc[:int(n_samples*0.2)].copy()
The code below shows how to apply zero-shot classification to the Coronavirus tweet data set:
# Load pre-trained bart-large-mnli (zero shot classifier)
pipe_pre_trained = pipeline(model="facebook/bart-large-mnli")
txt = """What #CONVID19 safety measures r being taken by online shopping companies &
their courier partners @amazonIN @Flipkart etc? I fear that shopping packages which
travel vast distances through flights/trains & r handled by many along d way can b
potential #coronavirus carriers??"""
pipe_pre_trained(txt,candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
'Extremely Positive', 'Positive'])
For this text, the actual label is Negative; zero-shot classification predicts it correctly, but with a probability of only 0.42, which indicates the model is not very confident.
Another Example
txt = """It amazes me...I go to the supermarket and everything is gone, people are panic buying,
yet not one single person was wearing a facemask. Buying 200 rolls of toilet paper won't
prevent you from catching #COVID2019. Take proper precautions and stay safe. https://t.co/lNoF0xfsx4"""
pipe_pre_trained(txt,candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
'Extremely Positive', 'Positive'])
For this text, the actual label is Extremely Positive, while zero-shot classification predicts Negative with a probability of 0.35. Therefore, we fine-tune the model on the manually labeled data.
import subprocess
try:
subprocess.check_output('nvidia-smi')
print('Nvidia GPU detected!')
except Exception: # this command not being found can raise quite a few different errors depending on the configuration
print('No Nvidia GPU in system!')
def bargraph(val_ob: [list], title: str, ylabel: str, titlefontsize: int=10, xfontsize: int=5,scale: int=1,
yfontsize: int=8, select: bool= False, fontsizelable: bool= False, xshift: float=-0.1, nsim: int=False
,yshift: float=0.01,percent: bool=False, xlim: list=False, axt: bool=None, color: str='b',sort=True,
ylim: list=False, y_rot: int=0, ytick_rot: int=90, graph_float: int=1,
loc: int =1,legend: int=1) -> None:
""" vertical bargraph """
ax1 = axt or plt.axes()
tot = len(val_ob)
miss_p_ob = (len(val_ob[pd.isnull(val_ob)])/tot)*100
n_nonmis_ob = len(val_ob[~pd.isnull(val_ob)])
con = np.array(val_ob.value_counts())
len_ = len(con)
if len_ > 10: len_ = 10
cats = list(val_ob.value_counts().keys())
val_ob = con[:len_]
clmns = cats[:len_]
# Sort counts
if sort:
sort_score = sorted(zip(val_ob,clmns), reverse=True)
Clmns_sort = [sort_score[i][1] for i in range(len(clmns))]
sort_score = [sort_score[i][0] for i in range(len(clmns))]
else:
Clmns_sort = clmns
sort_score = val_ob
index1 = np.arange(len(clmns))
if (select):
Clmns_sort=Clmns_sort[:select]
sort_score=sort_score[:select]
ax1.bar(Clmns_sort, sort_score, width=0.6, align='center', alpha=1, edgecolor='k', capsize=4,color=color)
plt.title(title,fontsize=titlefontsize)
ax1.set_ylabel(ylabel,fontsize=yfontsize)
ax1.set_xticks(np.arange(len(Clmns_sort)))
ax1.set_xticklabels(Clmns_sort,fontsize=xfontsize, rotation=ytick_rot,y=0.02)
if (percent): plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2)
if (xlim): plt.xlim(xlim)
if (ylim): plt.ylim(ylim)
if (fontsizelable):
for ii in range(len(sort_score)):
if (percent):
plt.text(xshift+ii, sort_score[ii]+yshift,f'{"{0:.2f}".format(sort_score[ii]*100)}%',
fontsize=fontsizelable,rotation=y_rot,color='k')
else:
plt.text(xshift+ii, sort_score[ii]+yshift,f'{np.round(sort_score[ii],graph_float)}',
fontsize=fontsizelable,rotation=y_rot,color='k')
dic_Clmns = {}
for i in range(len(Clmns_sort)):
dic_Clmns[Clmns_sort[i]]=sort_score[i]
txt = 'n (not missing)=%.0f\nMissing=%.1f%%'
anchored_text = AnchoredText(txt %(n_nonmis_ob,miss_p_ob), borderpad=0,
loc=loc)
if(legend==1): ax1.add_artist(anchored_text)
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(14, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val_obj = train['Sentiment']
bargraph (val_obj, title=f'Training set_Coronavirus Tweet Labels', ylabel='Counts',titlefontsize=15, xfontsize=12,yfontsize=13,
percent=False,fontsizelable=12,yshift=20,xshift=-0.2,color='r',legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,3000])
ax1 = plt.subplot(1,2,2)
val_obj = test['Sentiment']
bargraph (val_obj, title=f'Test set_Coronavirus Tweet Labels', ylabel='Counts',titlefontsize=15, xfontsize=12,yfontsize=13,
percent=False,fontsizelable=12,yshift=5,xshift=-0.1,color='b',legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,650])
plt.subplots_adjust(hspace=0.5)
plt.subplots_adjust(wspace=0.3)
plt.show()
There are several common fine-tuning strategies: update all of the model's parameters, freeze part of the model and train the rest, or freeze the whole model and train only a newly added classification head. Later in this notebook most of the layers are frozen to speed up training. First, the training data is parsed into a more manageable format:
# This code segment parses the train dataset into a more manageable format
sequence_labels = train['Sentiment']
title, tokenized_title = [], []
for comments in train['OriginalTweet']:
title.append(comments)
tokenized_title.append(comments.split(' '))
unique_sequence_labels = list(set(sequence_labels))
unique_sequence_labels
sequence_labels = [unique_sequence_labels.index(l) for l in sequence_labels]
print(f'There are {len(unique_sequence_labels)} unique sequence labels')
Our final Python lists look like this:
print(tokenized_title[0])
print(title[0])
print(sequence_labels[0])
print(unique_sequence_labels[sequence_labels[0]])
🤗 Datasets offers a rich set of tools for modifying both the structure and content of a dataset: organizing rows, splitting datasets, creating additional columns, converting between features and formats, and more.
Once all the data is collected, it is wrapped in a Dataset object, and a train-test split can then be made with the train_test_split method.
from datasets import Dataset
tweet_dataset = Dataset.from_dict(
dict(
titles=title,
label=sequence_labels,
tokens=tokenized_title,
)
)
tweet_dataset = tweet_dataset.train_test_split(test_size=0.1, seed=42, shuffle=True)
tweet_dataset
Here is the distribution of labels in the test split of our dataset:
pd.Series([ii for ii in tweet_dataset['test']['label']]).value_counts()
Instantiate tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
Create a pre-process function to take in a batch of titles and tokenize them.
def preprocess_function(examples):
    return tokenizer(examples["titles"], truncation=True)  # truncation=True truncates inputs that exceed the model's maximum length
Map the tokenizer function for the entire data set:
# go over all our data set, tokenize them
seq_clf_tokenized_comments = tweet_dataset.map(preprocess_function, batched=True)
seq_clf_tokenized_comments
Looking at the first item, we also have input_ids and attention_mask. These are the items we are going to need in our model.
DataCollatorWithPadding facilitates the creation of data batches by dynamically padding text sequences to the length of the longest element in each batch, ensuring uniform length across all elements. While it is feasible to pad text within the tokenizer function using the padding=True option, dynamic padding with DataCollatorWithPadding is more efficient and speeds up training.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
The Data Collator is responsible for padding data to achieve uniform input lengths across all examples in a batch. The attention mask is a mechanism utilized to exclude attention scores associated with padding tokens, ensuring that these scores are disregarded during processing.
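As a quick sketch (with two made-up sentences), the collator pads a small batch to a common length and the attention mask flags the padding tokens with 0:
sample_batch = [tokenizer(t) for t in ["Stay safe", "Please stop panic buying toilet paper"]]
collated = data_collator(sample_batch)
print(collated['input_ids'].shape)    # both rows padded to the length of the longer example
print(collated['attention_mask'])     # 1 = real token, 0 = padding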
Create the actual model:
from transformers import BartForSequenceClassification
tweet_clf_model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli',
num_labels=len(unique_sequence_labels),
ignore_mismatched_sizes=True)
# set an index -> label dictionary
tweet_clf_model.config.id2label = {i: l for i, l in enumerate(unique_sequence_labels)}
tweet_clf_model.config.label2id = {l: i for i, l in enumerate(unique_sequence_labels)}
# Model's parameters
n_params = list(tweet_clf_model.named_parameters())
print(f'The BART model has {len(n_params)} different parameters')
n_params[0:5][0]
print('********* Embedding Layer *********\n')
for par in n_params[0:5]:
print(f'{par[0], str(tuple(par[1].size()))}')
print('********* First Encoder ********* \n')
for par in n_params[5:21]:
print(f'{par[0], str(tuple(par[1].size()))}')
print('********* Output Layer ********* \n')
for par in n_params[-2:]:
print(f'{par[0], str(tuple(par[1].size()))}')
tweet_clf_model.config
Every model comes with a config. In this config, there are id2label and label2id: within the BartForSequenceClassification model, these are two mapping dictionaries used to convert labels between their textual and numerical representations.
tweet_clf_model.config.id2label[0]
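For instance, we can map an id to its label text and back (the exact label returned depends on the ordering of unique_sequence_labels in this run):
label_text = tweet_clf_model.config.id2label[0]
print(label_text, '->', tweet_clf_model.config.label2id[label_text])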
It is now opportune to introduce a custom metric. While Hugging Face traditionally employs loss as the performance metric, there is a need to compute accuracy as a more straightforward and intuitive metric for evaluation.
We take BART's pre-trained knowledge and transfer it to our supervised data set, without training for too many epochs. The code block below defines our training loop, which will be run over the data again and again.
from transformers import TrainingArguments, Trainer
See the Hugging Face documentation for more information on fine-tuning models.
🤗 Transformers offers a Trainer class designed to facilitate the fine-tuning of any pretrained models on your specific dataset. After completing the data preprocessing tasks outlined in the previous section, there are only a few remaining steps to configure the Trainer. The most challenging aspect is likely to be the preparation of the environment for executing Trainer.train(), as the process tends to be significantly slower when run on a CPU.
Before defining our Trainer, the initial step involves defining a TrainingArguments class that consolidates all the hyperparameters utilized by the Trainer for training and evaluation. The sole mandatory argument is the directory path where the trained model, along with checkpoints, will be stored. For the remaining parameters, default values can be retained, typically sufficient for basic fine-tuning.
Subsequently, the Trainer can be instantiated by incorporating all the previously constructed objects: the model, training_args, the training and validation datasets, and the data_collator.
This initiates the fine-tuning process, which, when performed on a GPU, typically concludes within a few minutes. The Trainer provides periodic updates on the training loss every 500 steps. However, it does not furnish information on the model's performance, as:
The Trainer has not been configured to perform evaluations during training. This can be achieved by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).
A compute_metrics() function needs to be supplied to the Trainer for calculating a metric during the aforementioned evaluations. Otherwise, the evaluation results would only display the loss, which may not be the most intuitive metric.
Let's explore the process of constructing a valuable compute_metrics() function and integrating it into our next training session. This function is designed to take an EvalPrediction object, which is a named tuple featuring a predictions field and a label_ids field. Its output is expected to be a dictionary that maps strings to floats, where the strings denote the names of the returned metrics and the floats represent their respective values. To obtain predictions from our model, the Trainer.predict() command can be employed.
The result of the predict() method is yet another named tuple encompassing three fields: predictions, label_ids, and metrics. The metrics field includes the dataset's loss, along with time-related metrics such as the total prediction time and the average prediction time. Once we finalize the compute_metrics() function and incorporate it into the Trainer, this field will also encompass the metrics outputted by compute_metrics().
See Stack Overflow for how to use evaluate.load.
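As a minimal sketch of how evaluate.load works, the loaded metric exposes a compute method that takes predictions and references (the toy values below are made up):
import evaluate
toy_metric = evaluate.load("accuracy")
print(toy_metric.compute(predictions=[0, 1, 1, 2], references=[0, 1, 0, 2]))  # {'accuracy': 0.75}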
Since the BART model is very large, fine-tuning all of its parameters would take a long time, so we freeze some parameters to speed up the fine-tuning process.
# Get the number of layers
config = tweet_clf_model.config
num_layers = config.num_hidden_layers
print("Number of layers in the BART model:", num_layers)
# To speed up training, freeze every parameter up to (but not including) the last decoder layer,
# so only the final decoder layer and the classification head remain trainable
ir = 0
for name, param in tweet_clf_model.named_parameters():
    ir += 1
    #print(name)
    if 'model.decoder.layers.11' in name:  # bart-large has 12 decoder layers (indexed 0-11); stop at the last one
        print(f'Parameter {ir}: model.decoder.layers.11')
        break
    param.requires_grad = False  # disable gradient updates for this parameter
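As a quick sanity check (a minimal sketch), we can count trainable versus total parameters to confirm that most of the model is now frozen:
total_params = sum(p.numel() for p in tweet_clf_model.parameters())
trainable_params = sum(p.numel() for p in tweet_clf_model.parameters() if p.requires_grad)
print(f'Trainable parameters: {trainable_params:,} of {total_params:,} '
      f'({100*trainable_params/total_params:.1f}%)')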
import evaluate
metric = evaluate.load("accuracy")
from sklearn.metrics import roc_auc_score
def compute_metrics(eval_pred):
logits = eval_pred.predictions[0]
labels = eval_pred.label_ids
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
#####################################################
def compute_metrics_binary(eval_pred):
    """metrics for binary classification"""
    labels = eval_pred.label_ids
    logits = eval_pred.predictions[0]
    # probability of the positive class from the logits (softmax)
    probs = np.exp(logits[:, 1]) / np.exp(logits).sum(axis=1)
    preds = (probs >= 0.5).astype(int)
    # Calculate the AUC score
    auc_score = roc_auc_score(labels, probs)
    # Calculate the true positive, false positive, false negative, and true negative values
    tp = ((preds == 1) & (labels == 1)).sum()
    fp = ((preds == 1) & (labels == 0)).sum()
    fn = ((preds == 0) & (labels == 1)).sum()
    tn = ((preds == 0) & (labels == 0)).sum()
    # Calculate the precision, recall, and F1 score
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {
        'Validation Precision': precision,
        'Validation Recall': recall,
        'Validation F1_Score': f1_score,
        'Validation AUC_Score': auc_score,
        'Validation TP': tp,
        'Validation FP': fp,
        'Validation FN': fn,
        'Validation TN': tn,
    }
#####################################################
from sklearn.metrics import classification_report
def compute_metrics_multiclass(eval_pred):
"""metrics for multiclass classification"""
labels = eval_pred.label_ids
logits = eval_pred.predictions[0]
preds = np.argmax(logits, axis=-1)
report = classification_report(labels, preds, output_dict=True)
acc_score = report['accuracy']
pre_score = report['macro avg']['precision']
rcl_score = report['macro avg']['recall']
f1_score = report['macro avg']['f1-score']
return {
'Validation Accuracy': acc_score,
'Validation Macro Recall': rcl_score,
'Validation Macro Precision': pre_score,
'Validation Macro F1_Score': f1_score,
}
epochs = 5
batch_size = 5
from transformers import set_seed
set_seed(42)
tweet_clf_model.eval()
# Training arguments: hyperparameters for training and evaluation, including the
# checkpoint-saving strategy and learning rate scheduler parameters
training_args = TrainingArguments(
    output_dir="./tweet_clf/results",             # local directory to save checkpoints of our model while fitting
    num_train_epochs=epochs,                      # number of training epochs
    per_device_train_batch_size=batch_size,       # batch size for training; around 32 is common, and the smaller the
    per_device_eval_batch_size=batch_size,        # batch size, the more often the model updates
    load_best_model_at_end=True,                  # even if we overfit by accident, reload the best checkpoint at the end
    metric_for_best_model= 'Validation Accuracy', # used with load_best_model_at_end to compare checkpoints; must be the
                                                  # name of a metric returned by compute_metrics (with or without the
                                                  # "eval_" prefix); defaults to the evaluation loss if unspecified
    # some deep learning parameters that the trainer is able to take in
    warmup_steps = len(seq_clf_tokenized_comments['train']) // 2, # number of warmup steps for the learning rate scheduler
    weight_decay = 0.1,                           # weight decay for our learning rate schedule (regularization)
    seed = 42,                                    # random seed to ensure reproducibility across runs
    logging_steps = 1,                            # log after every step (1 means logging as often as possible)
    log_level = 'info',
    evaluation_strategy = 'epoch',                # "steps" or "epoch"; we evaluate at the end of every epoch
    eval_steps = 50,
    save_strategy = 'epoch'                       # save a checkpoint of our model after each epoch
)
# Define the trainer: a high-level API on top of PyTorch
trainer = Trainer(
    model=tweet_clf_model,                             # our model (tweet_clf_model)
    args=training_args,                                # the training arguments we just set above
    train_dataset=seq_clf_tokenized_comments['train'], # training part of the dataset
    eval_dataset=seq_clf_tokenized_comments['test'],   # test (evaluation) part of the dataset
    compute_metrics=compute_metrics_multiclass,        # optional, but we want extra metrics beyond the loss
    data_collator=data_collator                        # data collator with dynamic padding; depending on the batches,
                                                       # we may or may not strictly need it
)
Before we start training, we can evaluate the model as-is to measure its performance prior to fine-tuning:
# Get initial metrics: evaluation on test set
trainer.evaluate()
%%time
trainer.train()
trainer.evaluate()
## Build a text-classification pipeline from the fine-tuned model
#tweet_clf_model = tweet_clf_model.to('cpu') #put my model to cpu
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer)
text = """Wish I was physic so I could have predicted how much sanitizer and toilet
paper I would need before this got real. #stopPanicBuying #coronavirus"""
pipe(text)
text = """After almost two weeks of going absolutely nowhere, we had to go to
the grocery store to stock up on food, fruit and veggies. Some older people:
their behaviour!?? ? What part of social distancing do you not understand?!
Already had such a stressful day. ? #SocialDistancing"""
pipe(text)
# We can save our model to the directory we specified and load it later for prediction
trainer.save_model()
We can easily load our pipeline directly from the saved directory. This is very useful for deploying the model on the cloud with one line of code, and it produces exactly the same results.
pipe = pipeline("text-classification", "./tweet_clf/results", tokenizer=tokenizer)
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer, return_all_scores=True)
def refined_zero(x):
    """Return the five (label, probability) pairs predicted for a text, sorted by probability"""
    preds = pipe(x)[0]  # run the pipeline once; it returns a score for every label
    pipe_x = sorted([(p['label'], p['score']) for p in preds], key=lambda t: t[1], reverse=True)
    try:
        # flatten the sorted pairs into l1, p1, ..., l5, p5
        (l1, p1), (l2, p2), (l3, p3), (l4, p4), (l5, p5) = pipe_x[:5]
    except ValueError:
        l1 = p1 = l2 = p2 = l3 = p3 = l4 = p4 = l5 = p5 = None
    return l1, p1, l2, p2, l3, p3, l4, p4, l5, p5
titles = []
perdi = []
score = []
for ii in tweet_dataset['test']['titles']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in tweet_dataset['test']['label']:
label.append(tweet_clf_model.config.id2label[ii])
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
Actl_pred_copy['label_1'] = [i[0] for i in perdi]
Actl_pred_copy['prob_label_1'] = [i[1] for i in perdi]
#
Actl_pred_copy['label_2'] = [i[2] for i in perdi]
Actl_pred_copy['prob_label_2'] = [i[3] for i in perdi]
#
Actl_pred_copy['label_3'] = [i[4] for i in perdi]
Actl_pred_copy['prob_label_3'] = [i[5] for i in perdi]
#
Actl_pred_copy['label_4'] = [i[6] for i in perdi]
Actl_pred_copy['prob_label_4'] = [i[7] for i in perdi]
#
Actl_pred_copy['label_5'] = [i[8] for i in perdi]
Actl_pred_copy['prob_label_5'] = [i[9] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(14, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val_obj = Actl_pred['label_1']
bargraph (val_obj, title=f'Predicted Labels', ylabel='Counts',titlefontsize=15, xfontsize=10,yfontsize=13,
percent=False,fontsizelable=11,yshift=1,xshift=-0.1,color='g',legend=True,ytick_rot=25, y_rot=0, axt=ax1)
plt.show()
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Calculate the percentage
font = {'size' : 9}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 5), dpi= 120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr=np.zeros((5,5))
for i in range(len(alllabels)):
for j in range(len(alllabels)):
per_cltr[i,j] = len(Actl_pred[(Actl_pred['Actual']==alllabels[i]) &
(Actl_pred['label_1']==alllabels[j])])/len(Actl_pred[Actl_pred['Actual']==alllabels[i]])
cax =ax.matshow(per_cltr, cmap='jet', interpolation='nearest',vmin=0, vmax=1)
cbar=fig.colorbar(cax,shrink=0.6,orientation='vertical',label='Low % High %')
cbar.set_ticks([])
#plt.title('Mismatch Percentage', fontsize=14,y=1.17)
for i in range(5):
for j in range(5):
c = per_cltr[i,j]*100
ax.text(j, i, str(round(c,1))+'%', va='center',weight="bold", ha='center',fontsize=12,c='w')
columns=[f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns,fontsize=8,rotation=35,y=0.97)
columns=[f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns,fontsize=9,rotation='0')
plt.title(f'Confusion Matrix to Predict {len(titles)} Validation Set Samples', fontsize=12,y=1.25)
txt = "Overall Accuracy = "+ r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = "+ r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = "+ r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = "+ r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = "+ r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$'+": calculate metrics for each label separately and then average them\n"
txt_def+= r'$\mathbf{' + 'Micro' + '}$'+": sum True Positives, False Positives and False Negatives over all labels and then calculate metrics"
plt.text(-1, 5.5, txt_def,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
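To make the difference between macro and micro averaging concrete, here is a toy example with made-up labels (not from the data set):
toy_true = ['Positive', 'Positive', 'Negative', 'Neutral']
toy_pred = ['Positive', 'Negative', 'Negative', 'Neutral']
print(precision_score(toy_true, toy_pred, average='macro'))  # average of per-class precisions
print(precision_score(toy_true, toy_pred, average='micro'))  # precision from globally summed TP and FP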
pipe = pipeline("text-classification", model=tweet_clf_model, tokenizer=tokenizer, return_all_scores=True)
titles = []
perdi = []
score = []
for ii in test['OriginalTweet']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in test['Sentiment']:
label.append(ii)
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
Actl_pred_copy['label_1'] = [i[0] for i in perdi]
Actl_pred_copy['prob_label_1'] = [i[1] for i in perdi]
#
Actl_pred_copy['label_2'] = [i[2] for i in perdi]
Actl_pred_copy['prob_label_2'] = [i[3] for i in perdi]
#
Actl_pred_copy['label_3'] = [i[4] for i in perdi]
Actl_pred_copy['prob_label_3'] = [i[5] for i in perdi]
#
Actl_pred_copy['label_4'] = [i[6] for i in perdi]
Actl_pred_copy['prob_label_4'] = [i[7] for i in perdi]
#
Actl_pred_copy['label_5'] = [i[8] for i in perdi]
Actl_pred_copy['prob_label_5'] = [i[9] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
def histplt (val: list,bins: int,title: str,xlabl: str,ylabl: str,xlimt: list,
ylimt: list=False, loc: int =1,legend: int=1,axt=None,days: int=False,
class_: int=False,scale: int=1,x_tick: list=False, calc_perc: bool= True,
nsplit: int=1,font: int=5,color: str='b') -> [float] :
""" Histogram including important statistics """
ax1 = axt or plt.axes()
font = {'size' : font }
plt.rc('font', **font)
miss_n = len(val[np.isnan(val)])
tot = len(val)
n_distinct = len(np.unique(val))
miss_p = (len(val[np.isnan(val)])/tot)*100
val = val[~np.isnan(val)]
val = np.array(val)
plt.hist(val, bins=bins, weights=np.ones(len(val)) / len(val),ec='black',color=color)
n_nonmis = len(val[~np.isnan(val)])
if class_:
times = 100
else:
times = 1
Mean = np.nanmean(val)*times
Median = np.nanmedian(val)*times
sd = np.sqrt(np.nanvar(val))
Max = np.nanmax(val)
Min = np.nanmin(val)
p1 = np.quantile(val, 0.01)
p25 = np.quantile(val, 0.25)
p75 = np.quantile(val, 0.75)
p99 = np.quantile(val, 0.99)
if calc_perc == True:
txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\np1%%=%0.1f\np99%%=%0.1f\nMin=%0.1f\nMax=%0.1f'
anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,p1,p99,Min,Max), borderpad=0,
loc=loc,prop={ 'size': font['size']*scale})
else:
txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\nMin=%0.1f\nMax=%0.1f'
anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,Min,Max), borderpad=0,
loc=loc,prop={ 'size': font['size']*scale})
if(legend==1): ax1.add_artist(anchored_text)
if (scale): plt.title(title,fontsize=font['size']*(scale+0.15))
else: plt.title(title)
plt.xlabel(xlabl,fontsize=font['size'])
ax1.set_ylabel('Frequency',fontsize=font['size'])
if (scale): ax1.set_xlabel(xlabl,fontsize=font['size']*scale)
else: ax1.set_xlabel(xlabl)
try:
xlabl
except NameError:
pass
else:
if (scale): plt.xlabel(xlabl,fontsize=font['size']*scale)
else: plt.xlabel(xlabl)
try:
ylabl
except NameError:
pass
else:
if (scale): plt.ylabel(ylabl,fontsize=font['size']*scale)
else: plt.ylabel(ylabl)
if (class_==True): plt.xticks([0,1])
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ax1.grid(linewidth='0.1')
try:
xlimt
except NameError:
pass
else:
plt.xlim(xlimt)
try:
ylimt
except NameError:
pass
else:
plt.ylim(ylimt)
if x_tick: plt.xticks(x_tick,fontsize=font['size']*scale)
plt.yticks(fontsize=font['size']*scale)
plt.grid(linewidth='0.12')
# Interquartile Range Method for outlier detection
iqr = p75 - p25
# calculate the outlier cutoff
cut_off = np.array(iqr) * 1.5
lower, upper = p25 - cut_off, p75 + cut_off
return tot, n_nonmis, n_distinct, miss_n, miss_p, Mean, Median, sd, Max, Min, p1, p25, p75, p99, sd
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val = Actl_pred.prob_label_1
_,_,_, _, _,_ ,_ ,_ ,_ ,_ ,\
_,_ ,_ ,_ ,_ = histplt (val,bins=25,title=f'Fine-tuned Model: Histogram of Predicted Probability',xlabl=None,days=False,
ylabl=None,xlimt=(0,1),ylimt=False
,axt=ax1,nsplit=5,scale=0.95,font=12,loc=2,color='g')
plt.show()
Actl_pred.Actual.value_counts()
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Calculate the percentage
font = {'size' : 9}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 5), dpi= 120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr=np.zeros((5,5))
for i in range(len(alllabels)):
for j in range(len(alllabels)):
per_cltr[i,j] = len(Actl_pred[(Actl_pred['Actual']==alllabels[i]) &
(Actl_pred['label_1']==alllabels[j])])/len(Actl_pred[Actl_pred['Actual']==alllabels[i]])
cax =ax.matshow(per_cltr, cmap='jet', interpolation='nearest',vmin=0, vmax=1)
cbar=fig.colorbar(cax,shrink=0.6,orientation='vertical',label='Low % High %')
cbar.set_ticks([])
#plt.title('Mismatch Percentage', fontsize=14,y=1.17)
for i in range(5):
for j in range(5):
c = per_cltr[i,j]*100
ax.text(j, i, str(round(c,1))+'%', va='center',weight="bold", ha='center',fontsize=12,c='w')
columns=[f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns,fontsize=8,rotation=35,y=0.97)
columns=[f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns,fontsize=9,rotation='0')
plt.title(f'Fine-tuned Model: Confusion Matrix to Predict {len(label)} Test Set Samples', fontsize=12,y=1.25)
txt = "Overall Accuracy = "+ r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = "+ r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = "+ r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = "+ r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = "+ r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$'+": calculate metrics for each label separately and then average them\n"
txt_def+= r'$\mathbf{' + 'Micro' + '}$'+": sum True Positives, False Positives and False Negatives over all labels and then calculate metrics"
plt.text(-1, 5.5, txt_def,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
pipe = pipe_pre_trained
def refined_zero(x):
    """Return the top four (label, probability) pairs from the zero-shot classifier, sorted by probability"""
    pred_zero = pipe_pre_trained(x, candidate_labels=['Neutral', 'Extremely Negative', 'Negative',
                                                      'Extremely Positive', 'Positive'])
    pipe_x = sorted(zip(pred_zero['labels'], pred_zero['scores']), key=lambda t: t[1], reverse=True)
    try:
        # flatten the sorted pairs into l1, p1, ..., l4, p4
        (l1, p1), (l2, p2), (l3, p3), (l4, p4) = pipe_x[:4]
    except ValueError:
        l1 = p1 = l2 = p2 = l3 = p3 = l4 = p4 = None
    return l1, p1, l2, p2, l3, p3, l4, p4
titles = []
perdi = []
score = []
for ii in test['OriginalTweet']:
titles.append(ii)
run = refined_zero(ii)
perdi.append(run)
label = []
for ii in test['Sentiment']:
label.append(ii)
Actl_pred_copy = pd.DataFrame()
Actl_pred_copy['titles'] = titles
Actl_pred_copy['Actual'] = label
Actl_pred_copy['label_1'] = [i[0] for i in perdi]
Actl_pred_copy['prob_label_1'] = [i[1] for i in perdi]
#
Actl_pred_copy['label_2'] = [i[2] for i in perdi]
Actl_pred_copy['prob_label_2'] = [i[3] for i in perdi]
#
Actl_pred_copy['label_3'] = [i[4] for i in perdi]
Actl_pred_copy['prob_label_3'] = [i[5] for i in perdi]
#
Actl_pred_copy['label_4'] = [i[6] for i in perdi]
Actl_pred_copy['prob_label_4'] = [i[7] for i in perdi]
Actl_pred = Actl_pred_copy[Actl_pred_copy['prob_label_1']>0.0]
Actl_pred[:3]
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val = Actl_pred.prob_label_1
_,_,_, _, _,_ ,_ ,_ ,_ ,_ ,\
_,_ ,_ ,_ ,_ = histplt (val,bins=25,title=f'Zero shot Model: Histogram of Predicted Probability',xlabl=None,days=False,
ylabl=None,xlimt=(0,1),ylimt=False
,axt=ax1,nsplit=5,scale=0.95,font=12,loc=2,color='g')
plt.show()
Actl_pred.Actual.value_counts()
pred = Actl_pred['label_1']
label = Actl_pred.Actual
accuracy = np.round(accuracy_score(label, pred)*100,1)
macro_averaged_precision = np.round(precision_score(label, pred, average = 'macro')*100,1)
micro_averaged_precision = np.round(precision_score(label, pred, average = 'micro')*100,1)
macro_averaged_recall = np.round(recall_score(label, pred, average = 'macro')*100,1)
micro_averaged_recall = np.round(recall_score(label, pred, average = 'micro')*100,1)
# Calculate the percentage
font = {'size' : 9}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(6, 5), dpi= 120, facecolor='w', edgecolor='k')
alllabels = list(set(label))
per_cltr=np.zeros((5,5))
for i in range(len(alllabels)):
for j in range(len(alllabels)):
per_cltr[i,j] = len(Actl_pred[(Actl_pred['Actual']==alllabels[i]) &
(Actl_pred['label_1']==alllabels[j])])/len(Actl_pred[Actl_pred['Actual']==alllabels[i]])
cax =ax.matshow(per_cltr, cmap='jet', interpolation='nearest',vmin=0, vmax=1)
cbar=fig.colorbar(cax,shrink=0.6,orientation='vertical',label='Low % High %')
cbar.set_ticks([])
#plt.title('Mismatch Percentage', fontsize=14,y=1.17)
for i in range(5):
for j in range(5):
c = per_cltr[i,j]*100
ax.text(j, i, str(round(c,1))+'%', va='center',weight="bold", ha='center',fontsize=12,c='w')
columns=[f'{alllabels[i]} \n (Predicted) ' for i in range(5)]
ax.set_xticks(np.arange(len(alllabels)))
ax.set_xticklabels(columns,fontsize=8,rotation=35,y=0.97)
columns=[f'{alllabels[i]}\n (Actual) ' for i in range(5)]
ax.set_yticks(np.arange(len(alllabels)))
ax.set_yticklabels(columns,fontsize=9,rotation='0')
plt.title(f'Zero-shot Model: Confusion Matrix to Predict {len(label)} Test Set Samples', fontsize=12,y=1.25)
txt = "Overall Accuracy = "+ r'$\mathbf{' + str(accuracy) + '}$%\n'
txt += "Macro Precision = "+ r'$\mathbf{' + str(macro_averaged_precision) + '}$%\n'
txt += "Micro Precision = "+ r'$\mathbf{' + str(micro_averaged_precision) + '}$%\n'
txt += "Macro Recall = "+ r'$\mathbf{' + str(macro_averaged_recall) + '}$%\n'
txt += "Micro Recall = "+ r'$\mathbf{' + str(micro_averaged_recall) + '}$%'
plt.text(6, 2.5, txt,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#FFE4C4', alpha=0.6))
txt_def = r'$\mathbf{' + 'Macro' + '}$'+": calculate metrics for each label separately and then average them\n"
txt_def+= r'$\mathbf{' + 'Micro' + '}$'+": sum True Positives, False Positives and False Negatives over all labels and then calculate metrics"
plt.text(-1, 5.5, txt_def,rotation=0,color='k', ha = 'left',fontsize=12,bbox=dict(facecolor='#98F5FF', alpha=0.6))
plt.show()
The performance of zero-shot classification is much lower than that of the fine-tuned model.