Summary
The T5 (Text-to-Text Transfer Transformer) model is a transformer-based architecture developed by Google Research and designed to handle a wide range of natural language processing (NLP) tasks in a unified manner: every task is framed as a text-to-text problem, so both the input and the output are sequences of text. This unified architecture enables effective transfer learning and lets a single model structure handle diverse tasks, which makes T5 a valuable tool in NLP. In this notebook, off-the-shelf (pre-trained) T5 models are first applied to various tasks, including translation, summarization, and question answering. These off-the-shelf results should only be treated as baseline predictions and cannot be used directly in production; for production, T5 should be fine-tuned. In this study, the t5-small model is fine-tuned to predict titles for abstracts.
Python functions and data files needed to run this notebook are available via this link.
import warnings
warnings.filterwarnings('ignore')
from transformers import T5ForConditionalGeneration, T5Tokenizer # Similar to GPT2
from transformers import pipeline, TrainingArguments
from transformers import Trainer, DataCollatorForSeq2Seq
import pandas as pd
pd.set_option('display.max_colwidth', None)
from datasets import Dataset
import random
Instead of relying on only an encoder or only a decoder, T5 is a complete end-to-end transformer (it uses both the encoder and the decoder), and cross-attention bridges the gap between the two. BERT is derived from the encoder part of the transformer, whereas GPT is derived from the decoder stack. There are pros and cons to each approach:
BERT: it is fast for natural language understanding tasks such as sequence and token classification, but the flexibility of creating prompts and supporting autoregressive use cases is lost.
GPT: it is very flexible and can be taught different domains and multiple types of tasks at once (sequence classification, summarization, etc.). However, generating text with GPT is slow, and it is not as robust for classification tasks as BERT.
T5 uses both the encoder and decoder stacks at the same time: it is a pure sequence-to-sequence model and can perform multiple NLP tasks at once.
Image retrieved from Sinan Ozdemir
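A minimal sketch (assuming only the transformers library already imported above) that inspects the t5-base configuration to confirm it really has both an encoder and a decoder stack:
from transformers import T5Config
config = T5Config.from_pretrained('t5-base')
print(config.is_encoder_decoder)   # True: a full sequence-to-sequence transformer
print(config.num_layers)           # number of encoder blocks
print(config.num_decoder_layers)   # number of decoder blocks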
The figure below shows four of them:
translate: translate English to German
cola: linguistic acceptability (e.g., grammar)
stsb: semantic text similarity (given two phrases, how similar they are)
summarize: summarization
These four tasks can be done by T5 off the shelf.
Each of these tasks comes with a prompt.
Image retrieved from Sinan Ozdemir
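For reference, here are the prompt prefixes used later in this notebook, collected in a plain Python dictionary (the placeholder names in angle brackets are just illustrative):
t5_task_prompts = {
    "translation":        "translate English to German: <text>",
    "acceptability":      "cola sentence: <sentence>",
    "similarity":         "stsb sentence1: <sentence1> sentence2: <sentence2>",
    "summarization":      "summarize: <text>",
    "question answering": "question: <question> context: <context>",
    "inference":          "mnli premise: <premise> hypothesis: <hypothesis>",
}
for task, prompt in t5_task_prompts.items():
    print(f"{task:20s} -> {prompt}")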
T5 was pre-trained on the C4 dataset (Colossal Clean Crawled Corpus), a cleaned-up version of Common Crawl (https://commoncrawl.org/).
Sentinel tokens are used, similarly to masking tasks, to predict missing text. BERT uses masked language modeling, but the biggest difference here is that instead of masking a single token and predicting it, T5 masks spans of multiple tokens and predicts them all (see the sentinel-token sketch after the model is loaded below).
Three training objectives explored for T5 are prefix language modeling, BERT-style denoising (masked language modeling), and deshuffling; the released models are trained with the span-corruption variant of denoising.
t5_base_model = T5ForConditionalGeneration.from_pretrained('t5-base') # there are other flavours of T5 (t5-small, t5-large, ...)
t5_base_tokenizer = T5Tokenizer.from_pretrained('t5-base')
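As a quick illustration of the sentinel-token (span-corruption) idea described above, we can feed the freshly loaded model a toy input with masked spans (the sentence itself is an arbitrary example):
# <extra_id_0>, <extra_id_1>, ... are T5's sentinel tokens; each stands in for a masked span
masked_input = "The <extra_id_0> walks in <extra_id_1> park."
masked_ids = t5_base_tokenizer(masked_input, return_tensors='pt').input_ids
filled_ids = t5_base_model.generate(masked_ids, max_length=20)
# the prediction lists each sentinel token followed by the guessed span
print(t5_base_tokenizer.decode(filled_ids[0], skip_special_tokens=False))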
input_ids = t5_base_tokenizer.encode('translate English to German: How are you doing?', return_tensors='pt')
# translate
translate_ids = t5_base_model.generate(
input_ids,
num_beams=4,
no_repeat_ngram_size=3,
max_length=20,
early_stopping=True
)
translate = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print (f"Translated text:\n{translate}")
# pass labels in to calculate the loss (the labels here are an arbitrary German sentence, so the loss will be relatively high)
ids = t5_base_tokenizer('translate English to German: How are you doing?', return_tensors='pt').input_ids
labels = t5_base_tokenizer('Wo ist die Schokolade?', return_tensors='pt').input_ids
loss = t5_base_model(input_ids=ids, labels=labels).loss
labels, loss
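The same forward pass also returns per-position logits over the vocabulary; the reported loss is simply the cross-entropy between those logits and the label ids. A small check, reusing the ids and labels from above:
outputs = t5_base_model(input_ids=ids, labels=labels)
print(outputs.logits.shape)  # (batch_size, label_sequence_length, vocab_size)
print(outputs.loss)          # same loss value as above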
Abstractive summarization is the task of creating a summary of a text without being constrained to reuse exact phrases from the source; the model is free to paraphrase (in contrast to extractive summarization, which copies spans from the source).
text ="""The transformer model is primarily based on the attention idea. In 2015, attention
mechanisms started to be used in NLP tasks and became the dominant way to perform NLP tasks
over the past decade. Attention in NLP is a mechanism designed to focus on specific parts of a
sequence in the context of another sequence. It is used to perform various NLP tasks, such as language
modeling, sequence classification, language translation, and image captioning. The idea behind attention
is to give the model the ability to attend to relevant information while performing a prediction task.
The attention mechanism allows the model to understand relationships between different elements in a sequence
and make predictions based on that information. For example, in language translation, the attention mechanism
helps the model understand which parts of a source sentence are relevant for translating into a target language.
There are various types of attention mechanisms, including self-attention, which is the kind of attention that
powers the transformer architecture in NLP.
"""
preprocessed_text = text.strip().replace("\n", " ")
print ("preprocessed text :\n", preprocessed_text)
Making a summarization prompt with T5 has the benefit of multi-task learning. With BERT, we would have to change the architecture (add task-specific layers) to perform a new NLP task; it is not multi-task oriented. GPT is multi-task oriented, but we have to format the input sequence so that GPT can tell which task is being asked for. T5 can do this with minimal effort: we only need to place the task prefix at the beginning of the input, whereas for GPT we also need to add a special token at the end.
# We only need to add "summarize: " at the beginning to apply text summerization
t5_prepared_text = "summarize: " + preprocessed_text
ids = t5_base_tokenizer.encode(t5_prepared_text, return_tensors="pt") # encode the phrase
# summarize
summary_ids = t5_base_model.generate(ids, num_beams=4, no_repeat_ngram_size=3,  # num_beams makes the prediction less random
                                     min_length=10, max_length=30, early_stopping=True
                                     )
summarized = t5_base_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"The summarized text is: \n{summarized}")
text ="""There are many approaches that use weak-supervision to train networks to
segment 2D images. By contrast, existing 3D approaches rely on full-supervision
of a subset of 2D slices of the 3D image volume. In this paper, we propose an
approach that is truly weakly-supervised in the sense that we only need to
provide a sparse set of 3D point on the surface of target objects, an easy task
that can be quickly done. We use the 3D points to deform a 3D template so that
it roughly matches the target object outlines and we introduce an architecture
that exploits the supervision provided by coarse template to train a network to
find accurate boundaries. We evaluate the performance of our approach on Computed Tomography (CT),
Magnetic Resonance Imagery (MRI) and Electron Microscopy (EM) image datasets.
We will show that it outperforms a more traditional approach to
weak-supervision in 3D at a reduced supervision cost."""
preprocessed_text = text.strip().replace("\n", " ")
# We only need to add "Make title: " at the beginning to apply text summerization
t5_prepared_text = "extract labels: " + preprocessed_text
ids = t5_base_tokenizer.encode(t5_prepared_text, return_tensors="pt") # encode the phrase
# summmarize
summmarize = t5_base_model.generate(ids,num_beams=4,no_repeat_ngram_size=3, # num_beams to make prediction less random
min_length=10,max_length=20,early_stopping=True
)
summarized = t5_base_tokenizer.decode(summmarize[0], skip_special_tokens=True)
print (f"The title of the text is: \n{summarized}")
CoLA - The Corpus of Linguistic Acceptability
CoLA (Corpus of Linguistic Acceptability) is used for checking the grammatical correctness of a sentence.
ids = t5_base_tokenizer.encode('cola sentence: How are you doing?', return_tensors='pt')
# CoLA
translate_ids = t5_base_model.generate(ids,num_beams=4,no_repeat_ngram_size=3,
max_length=20,early_stopping=True)
translate = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"is grammatically correct?: \n{translate}")
input_ids = t5_base_tokenizer.encode('cola sentence: How are you doings?', return_tensors='pt')
# CoLA
translate_ids = t5_base_model.generate(
input_ids,
max_length=20,
early_stopping=True
)
output = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"is grammatically correct?: \n{output}")
Extractive Q/A can be done with BERT and abstractive Q/A with GPT. T5 (Text-to-Text Transfer Transformer) can also be used for abstractive question answering: the input is a question together with a context, and the target is the corresponding answer.
ids = t5_base_tokenizer.encode(
'question: Where do I work? context: I live in Alberta but work in Calgary.', return_tensors='pt'
)
# Q/A
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"result: \n{result}")
STSB - Semantic Textual Similarity Benchmark
STSB scores the similarity between two sentences on a 0 to 5 scale.
sentence_one = 'Python is a scary animal'
sentence_two = 'I love coding with Python'
ids = t5_base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')
# calculate semantic similarity
translate_ids = t5_base_model.generate(ids, max_length=3, early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"semantically similar? (0-5): \n{result}")
sentence_one = 'President greets the press in Chicago'
sentence_two = 'Biden speaks to media in Illinois'
ids = t5_base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')
# calculate semantic similarity
translate_ids = t5_base_model.generate(ids, max_length=3, early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"semantically similar? (0-5): \n{result}")
MNLI - Multi-Genre Natural Language Inference
Multi-Genre Natural Language Inference is the task of determining the logical relationship (entailment, contradiction, or neutral) between a pair of sentences, with a focus on handling diverse genres and writing styles. Models that cope with this variety are applicable to a wide range of language understanding tasks.
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I will be an Engineer', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I am going to be a physician', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I will be top 2 student in my class', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
Off-the-shelf results with T5 cannot be used in production; we can only use them as baseline predictions. T5 should be fine-tuned for production models.
Paper submission systems (CMT, OpenReview, etc.) require users to upload paper titles and abstracts and then specify the subject areas their papers best belong to. Wouldn't it be nice if such submission systems provided viable subject-area suggestions for where the corresponding papers fit best?
This dataset allows developers to build baseline models for that use case. Data analysts might also enjoy analyzing how well the abstracts correlate with their noted categories, and the dataset can serve as a decent benchmark for building useful text classification systems.
base_model = T5ForConditionalGeneration.from_pretrained('t5-small')
base_tokenizer = T5Tokenizer.from_pretrained('t5-small')
df = pd.read_csv('arxiv_data.csv',encoding='latin-1')
df = df[["summaries","titles"]]
df = df.dropna()
print(df.shape)
# Since the dataset is big for fine-tuning, 10,000 samples are selected.
df = df[:10000].copy()
df.head(2)
df = df[ (df['summaries'].str.len() >=30)]
df.shape
# Pre-processing step
# Punctuation is important in grammar and important for complex decoding architectures to know when to stop!
def add_punc(s):
    if s[-1] not in ('.', '!', '?'):
        s = s + '.'
    return s
df.dropna(inplace=True)
df['summaries'] = df['summaries'].map(add_punc)
print(df.shape)
df.head(2)
random.seed(20)
paper_title = Dataset.from_pandas(df[:2000])
# Set aside some data as holdout set
Holdout = df[-100:]
# We have a prompt but only as a prefix in the encoder
prefix = "title: "
# Manually add our own labels because unlike GPT,
# we cannot assume the labels are based on the inputs
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["summaries"]]
    model_inputs = base_tokenizer(inputs, max_length=1024, truncation=True)
    labels = base_tokenizer(examples["titles"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
paper_title = paper_title.map(preprocess_function, batched=True)
paper_title[0]
paper_title = paper_title.train_test_split(test_size=.1)
paper_title
# Data collator specifically for generic sequence to sequence tasks
# Use when we are mapping one sequence to another, e.g. translation or summarization
data_collator = DataCollatorForSeq2Seq(tokenizer=base_tokenizer, model=base_model)
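To see what the collator actually does, we can hand it a couple of toy feature dictionaries (hypothetical token ids, purely for illustration): it pads input_ids with the tokenizer's pad token and pads labels with -100, so padded label positions are ignored by the loss.
toy_features = [
    {"input_ids": [100, 200, 300], "labels": [7, 8]},
    {"input_ids": [100, 200], "labels": [7, 8, 9, 10]},
]
toy_batch = data_collator(toy_features)
print(toy_batch["input_ids"])  # padded with the tokenizer's pad token id
print(toy_batch["labels"])     # padded with -100 (ignored by the loss)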
epochs = 2
batch_size = 5
training_args = TrainingArguments(
    output_dir="./T5_abst_title",            # the output directory
    overwrite_output_dir=True,               # overwrite the content of the output directory
    num_train_epochs=epochs,                 # number of training epochs
    per_device_train_batch_size=batch_size,  # batch size for training
    per_device_eval_batch_size=batch_size,   # batch size for evaluation
    logging_steps=50,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',             # "steps" or "epoch"; here we evaluate at the end of every epoch
    save_strategy='epoch'                    # save a checkpoint of the model after each epoch
)
trainer = Trainer(
    model=base_model,                    # the t5-small model loaded above
    args=training_args,                  # the training arguments we just set
    train_dataset=paper_title["train"],  # training part of the dataset
    eval_dataset=paper_title["test"],    # test (evaluation) part of the dataset
    data_collator=data_collator          # data collator with padding; in fact, we may or may not need a collator:
                                         # inspect a batch to see what it looks like with and without it
)
trainer.evaluate()
trainer.train() # total of 9 hours of training on my laptop!
trainer.evaluate()
trainer.save_model()
fine_tuned_model = T5ForConditionalGeneration.from_pretrained('./T5_abst_title')  # load from the output_dir used above
# Define a custom title-prediction function
# Note: the prompt must match the prefix used during fine-tuning ("title: ")
def predict_title(abstract):
    inputs = base_tokenizer("title: " + abstract, return_tensors="pt")
    outputs = fine_tuned_model.generate(**inputs)
    predicted_title = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return predicted_title
ir = 0
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = predict_title(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
ir = 1
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = predict_title(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
ir = 2
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = predict_title(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
ir = 3
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = predict_title(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
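The same spot check can also be written as a short loop over the first few holdout rows, using the predict_title helper defined above (a minimal sketch):
for ir in range(4):
    abstract = Holdout['summaries'].iloc[ir]
    actual_title = Holdout['titles'].iloc[ir]
    print(f"actual title:    {actual_title}")
    print(f"predicted title: {predict_title(abstract)}\n")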