Summary
The T5 (Text-to-Text Transfer Transformer) model is a transformer-based architecture developed by Google Research, designed to handle a wide range of natural language processing (NLP) tasks in a unified manner. T5 frames every NLP task as a text-to-text problem: both the input and the output are treated as sequences of text. This unified architecture enables effective transfer learning and makes T5 a versatile tool, since a single model structure can handle diverse tasks. In this notebook, off-the-shelf (pre-trained) T5 models are first applied to various tasks including translation, summarization and question answering. These pre-trained models are useful as baseline predictions, but they cannot be used directly in production; for production, T5 should be fine-tuned. In this study, the t5-small model is fine-tuned to predict titles for paper abstracts.
The Python functions and data files needed to run this notebook are available on my GitHub.
import warnings
warnings.filterwarnings('ignore')
from transformers import T5ForConditionalGeneration, T5Tokenizer # Similar to GPT2
from transformers import pipeline, TrainingArguments
from transformers import Trainer, DataCollatorForSeq2Seq
import pandas as pd
pd.set_option('display.max_colwidth', None)
from datasets import Dataset
import random
Introduction¶
Instead of relying on only an encoder or only a decoder, T5 is a complete end-to-end transformer that uses both the encoder and the decoder, with cross-attention bridging the gap between them. BERT is derived from the encoder part of the transformer, whereas GPT is derived from the decoder stack. There are pros and cons to each approach:
BERT: it is faster for natural language understanding tasks such as sequence and token classification, but the flexibility of creating prompts and supporting autoregressive use cases is lost.
GPT: it is very flexible to adapt to different domains and to multiple types of tasks at once (e.g. sequence classification and summarization). However, generating text with GPT is slow, and it is not as robust for classification tasks as BERT.
T5 uses both the encoder and decoder stacks at the same time: it is a pure sequence-to-sequence model, and it can perform multiple NLP tasks at once.
Image retrieved from Sinan Ozdemir
The figure below shows four of them:
- translate: translate English to German
- cola: linguistic acceptability, e.g. grammar
- stsb: semantic text similarity, i.e. given two phrases, how similar they are
- summarize: text summarization
These four tasks can be done by T5 off the shelf, and each task comes with its own prompt.
Image retrieved from Sinan Ozdemir
T5 was pre-trained on the C4 (Colossal Clean Crawled Corpus) dataset, which is derived from Common Crawl (https://commoncrawl.org/).
Sentinel tokens are used, similarly to masking, to predict missing text. BERT also uses masked language modeling, but the biggest difference is that instead of masking a single token and predicting it, T5 masks spans of multiple tokens and predicts them all; a minimal code sketch of this objective follows the list below.
Three training objectives of T5 are:
- causal (auto-regressive) language modeling: next-word prediction
- BERT-style objective: masking words and predicting the original text
- Deshuffling: shuffling the input randomly and predicting the original text.
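As a quick, hedged illustration of the BERT-style (span-corruption) objective, the sketch below masks two spans with sentinel tokens and lets T5 score the corresponding target spans. It loads the same t5-base checkpoint used in the next section; the example sentence and variable names are only illustrative.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

# Corrupted input: each dropped span is replaced by a sentinel token
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park',
                      return_tensors='pt').input_ids
# Target: the dropped spans, each preceded by its sentinel token
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>',
                   return_tensors='pt').input_ids

loss = model(input_ids=input_ids, labels=labels).loss  # span-corruption loss
print(loss)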
Ready-made Results with T5¶
t5_base_model = T5ForConditionalGeneration.from_pretrained('t5-base') # There are other flavours of T5 (e.g. t5-small, t5-large)
t5_base_tokenizer = T5Tokenizer.from_pretrained('t5-base')
English to German Translation¶
input_ids = t5_base_tokenizer.encode('translate English to German: How are you doing?', return_tensors='pt')
# translate
translate_ids = t5_base_model.generate(
input_ids,
num_beams=4,
no_repeat_ngram_size=3,
max_length=20,
early_stopping=True
)
translate = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print (f"Translated text:\n{translate}")
Translated text: Wie geht es Ihnen?
# pass labels in to calculate loss
ids = t5_base_tokenizer('translate English to German: How are you doing?', return_tensors='pt').input_ids
labels = t5_base_tokenizer('Wo ist die Schokolade?', return_tensors='pt').input_ids
loss = t5_base_model(input_ids=ids, labels=labels).loss
labels, loss
(tensor([[ 3488, 229, 67, 31267, 58, 1]]), tensor(4.0347, grad_fn=<NllLossBackward0>))
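For comparison, we can score the reference translation produced above as the label; the loss should come out well below the 4.03 obtained with the unrelated German sentence. A quick check, reusing the tensors already defined:
# Score the correct translation as the label for comparison
good_labels = t5_base_tokenizer('Wie geht es Ihnen?', return_tensors='pt').input_ids
good_loss = t5_base_model(input_ids=ids, labels=good_labels).loss
print(good_loss)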
Abstractive Summarization¶
Abstractive summarization is the task of creating a summary of a text without being constrained to reuse specific phrases or sentences from the source.
text ="""The transformer model is primarily based on the attention idea. In 2015, attention
mechanisms started to be used in NLP tasks and became the dominant way to perform NLP tasks
over the past decade. Attention in NLP is a mechanism designed to focus on specific parts of a
sequence in the context of another sequence. It is used to perform various NLP tasks, such as language
modeling, sequence classification, language translation, and image captioning. The idea behind attention
is to give the model the ability to attend to relevant information while performing a prediction task.
The attention mechanism allows the model to understand relationships between different elements in a sequence
and make predictions based on that information. For example, in language translation, the attention mechanism
helps the model understand which parts of a source sentence are relevant for translating into a target language.
There are various types of attention mechanisms, including self-attention, which is the kind of attention that
powers the transformer architecture in NLP.
"""
preprocessed_text = text.strip().replace("\n","")
print ("preprocessed text :\n", preprocessed_text)
preprocessed text : The transformer model is primarily based on the attention idea. In 2015, attention mechanisms started to be used in NLP tasks and became the dominant way to perform NLP tasks over the past decade. Attention in NLP is a mechanism designed to focus on specific parts of a sequence in the context of another sequence. It is used to perform various NLP tasks, such as language modeling, sequence classification, language translation, and image captioning. The idea behind attention is to give the model the ability to attend to relevant information while performing a prediction task. The attention mechanism allows the model to understand relationships between different elements in a sequence and make predictions based on that information. For example, in language translation, the attention mechanism helps the model understand which parts of a source sentence are relevant for translating into a target language. There are various types of attention mechanisms, including self-attention, which is the kind of attention that powers the transformer architecture in NLP.
Making a summarization prompt with T5 illustrates the benefit of its multi-task training. With BERT, we have to change the architecture (add task-specific layers) to perform a new NLP task; it is not multi-task oriented. GPT is multi-task oriented, but we must craft the input prompt so that GPT knows which task to perform. T5 needs only a minor change: we simply place the task prompt at the beginning of the input, whereas for GPT we would also need to add a token at the end.
# We only need to add "summarize: " at the beginning to apply text summarization
t5_prepared_text = "summarize: " + preprocessed_text
ids = t5_base_tokenizer.encode(t5_prepared_text, return_tensors="pt") # encode the phrase
# summarize
summarize_ids = t5_base_model.generate(ids, num_beams=4, no_repeat_ngram_size=3, # num_beams makes predictions less random
                                       min_length=10, max_length=30, early_stopping=True
                                       )
summarized = t5_base_tokenizer.decode(summarize_ids[0], skip_special_tokens=True)
print (f"The Summarized text is: \n{summarized}")
The Summarized text is: attention mechanisms started to be used in NLP tasks in 2015. it is a mechanism designed to focus on specific parts of a sequence
- ids: the input sequence of token ids.
- num_beams=4: the number of beams used in beam search during generation. More beams make the output less random and usually more focused and coherent.
- no_repeat_ngram_size=3: prevents repetition of n-grams; the generated text will not contain the same 3-token sequence twice.
- min_length=10: the minimum length of the generated text (at least 10 tokens), matching the call above.
- max_length=30: the maximum length of the generated text (at most 30 tokens), matching the call above.
- early_stopping=True: stops beam search as soon as enough complete candidate sequences have been found, rather than continuing to the maximum length.
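The same summary can also be produced through the high-level pipeline API imported at the top of the notebook. A minimal sketch, reusing the model and tokenizer already loaded and the same generation settings (for T5, the pipeline supplies the "summarize: " prefix automatically from the model config):
# Alternative interface: summarization pipeline with the same generation settings
summarizer = pipeline('summarization', model=t5_base_model, tokenizer=t5_base_tokenizer)
print(summarizer(preprocessed_text, num_beams=4, no_repeat_ngram_size=3,
                 min_length=10, max_length=30, early_stopping=True)[0]['summary_text'])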
text ="""There are many approaches that use weak-supervision to train networks to
segment 2D images. By contrast, existing 3D approaches rely on full-supervision
of a subset of 2D slices of the 3D image volume. In this paper, we propose an
approach that is truly weakly-supervised in the sense that we only need to
provide a sparse set of 3D point on the surface of target objects, an easy task
that can be quickly done. We use the 3D points to deform a 3D template so that
it roughly matches the target object outlines and we introduce an architecture
that exploits the supervision provided by coarse template to train a network to
find accurate boundaries. We evaluate the performance of our approach on Computed Tomography (CT),
Magnetic Resonance Imagery (MRI) and Electron Microscopy (EM) image datasets.
We will show that it outperforms a more traditional approach to
weak-supervision in 3D at a reduced supervision cost."""
preprocessed_text = text.strip().replace("\n","")
# We only need to add "Make title: " at the beginning to apply text summerization
t5_prepared_text = "extract labels: " + preprocessed_text
ids = t5_base_tokenizer.encode(t5_prepared_text, return_tensors="pt") # encode the phrase
# summmarize
summmarize = t5_base_model.generate(ids,num_beams=4,no_repeat_ngram_size=3, # num_beams to make prediction less random
min_length=10,max_length=20,early_stopping=True
)
summarized = t5_base_tokenizer.decode(summmarize[0], skip_special_tokens=True)
print (f"The title of the text is: \n{summarized}")
The title of the text is: Falk: weakly-supervised approach to train networks tosegment 2D images
CoLA: The Corpus of Linguistic Acceptability¶
CoLA (Corpus of Linguistic Acceptability) checks whether a sentence is grammatically acceptable.
ids = t5_base_tokenizer.encode('cola sentence: How are you doing?', return_tensors='pt')
# CoLA
translate_ids = t5_base_model.generate(ids,num_beams=4,no_repeat_ngram_size=3,
max_length=20,early_stopping=True)
translate = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"is grammatically correct?: \n{translate}")
is grammatically correct?: acceptable
input_ids = t5_base_tokenizer.encode('cola sentence: How are you doings?', return_tensors='pt')
# CoLA
translate_ids = t5_base_model.generate(
input_ids,
max_length=20,
early_stopping=True
)
output = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"is grammatically correct?: \n{output}")
is grammatically correct?: unacceptable
Q/A - Question/Answering¶
Extractive Q/A can be performed with BERT and abstractive Q/A with GPT. T5 (Text-to-Text Transfer Transformer) can also be used for abstractive question answering: the input contains a question together with its context, and the target is the corresponding answer.
ids = t5_base_tokenizer.encode(
    'question: Where do I work? context: I live in Alberta but work in Calgary.', return_tensors='pt'
)
# Q/A
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"result: \n{result}")
result: Calgary
STSB - Semantic Text Similarity Benchmark¶
STSB estimates the semantic similarity between two sentences on a scale from 0 to 5.
sentence_one = 'Python is scary animal'
sentence_two = 'I love coding with Python'
ids = t5_base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')
# calculate semantic similarity
translate_ids = t5_base_model.generate(ids, max_length=3, early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"semantically similar? (0-5): \n{result}")
semantically similar? (0-5): 1.0
sentence_one = 'President greets the press in Chicago'
sentence_two = 'Biden speaks to media in Illinois'
ids = t5_base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')
# calculate semantic similarity
translate_ids = t5_base_model.generate(ids, max_length=3, early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"semantically similar? (0-5): \n{result}")
semantically similar? (0-5): 1.6
MNLI - Multi-Genre Natural Language Inference¶
Multi-Genre Natural Language Inference is a challenging NLP task in which a model must determine the logical relationship between a pair of sentences (a premise and a hypothesis) drawn from diverse genres and writing styles. The model has to infer whether the hypothesis is entailed by, contradicts, or is neutral with respect to the premise, which makes MNLI a broad benchmark for language understanding.
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I will be an Engineer', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
Result: contradiction
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I am going to be a physician', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
Result: entailment
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I will be top 2 student in my class', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
Result: neutral
Off-the-shelf results with T5 cannot be used in production; we can only use them as baseline predictions. For production, T5 should be fine-tuned.
Fine-tune T5 for Abstractive Summarization¶
arXiv Paper Abstracts Data set¶
Paper submission systems (CMT, OpenReview, etc.) require users to upload paper titles and abstracts and then specify the subject areas their papers best belong to. Wouldn't it be nice if such submission systems provided viable subject-area suggestions for where the corresponding papers could best be placed?
This dataset would allow developers to build baseline models that might benefit this use case. Data analysts might also enjoy analyzing the intricacies of different papers and how well their abstracts correlate to their noted categories. Additionally, we hope that the dataset will serve as a decent benchmark for building useful text classification systems.
base_model = T5ForConditionalGeneration.from_pretrained('t5-small')
base_tokenizer = T5Tokenizer.from_pretrained('t5-small')
df = pd.read_csv('arxiv_data.csv',encoding='latin-1')
df = df[["summaries","titles"]]
df = df.dropna()
print(df.shape)
# Set aside some data as holdout set
Holdout = df[-100:]
# Since the data set is large for fine-tuning, 10,000 samples are selected.
df = df[:10000].copy()
df.head(2)
(51774, 2)
summaries | titles | |
---|---|---|
0 | Stereo matching is one of the widely used techniques for inferring depth from\nstereo images owing to its robustness and speed. It has become one of the major\ntopics of research since it finds its applications in autonomous driving,\nrobotic navigation, 3D reconstruction, and many other fields. Finding pixel\ncorrespondences in non-textured, occluded and reflective areas is the major\nchallenge in stereo matching. Recent developments have shown that semantic cues\nfrom image segmentation can be used to improve the results of stereo matching.\nMany deep neural network architectures have been proposed to leverage the\nadvantages of semantic segmentation in stereo matching. This paper aims to give\na comparison among the state of art networks both in terms of accuracy and in\nterms of speed which are of higher importance in real-time applications. | Survey on Semantic Stereo Matching / Semantic Depth Estimation |
1 | The recent advancements in artificial intelligence (AI) combined with the\nextensive amount of data generated by today's clinical systems, has led to the\ndevelopment of imaging AI solutions across the whole value chain of medical\nimaging, including image reconstruction, medical image segmentation,\nimage-based diagnosis and treatment planning. Notwithstanding the successes and\nfuture potential of AI in medical imaging, many stakeholders are concerned of\nthe potential risks and ethical implications of imaging AI solutions, which are\nperceived as complex, opaque, and difficult to comprehend, utilise, and trust\nin critical clinical applications. Despite these concerns and risks, there are\ncurrently no concrete guidelines and best practices for guiding future AI\ndevelopments in medical imaging towards increased trust, safety and adoption.\nTo bridge this gap, this paper introduces a careful selection of guiding\nprinciples drawn from the accumulated experiences, consensus, and best\npractices from five large European projects on AI in Health Imaging. These\nguiding principles are named FUTURE-AI and its building blocks consist of (i)\nFairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness\nand (vi) Explainability. In a step-by-step approach, these guidelines are\nfurther translated into a framework of concrete recommendations for specifying,\ndeveloping, evaluating, and deploying technically, clinically and ethically\ntrustworthy AI solutions into clinical practice. | FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging |
Split train and test set¶
df = df[ (df['summaries'].str.len() >=30)]
df.shape
(10000, 2)
# Pre-processing step
# Punctuation is important in grammar and important for complex decoding architectures to know when to stop!
def add_punc(s):
if s[-1] not in ('.', '!', '?'):
s = s + '.'
return s
df.dropna(inplace=True)
df['summaries'] = df['summaries'].map(add_punc)
print(df.shape)
df.head(2)
(10000, 2)
summaries | titles | |
---|---|---|
0 | Stereo matching is one of the widely used techniques for inferring depth from\nstereo images owing to its robustness and speed. It has become one of the major\ntopics of research since it finds its applications in autonomous driving,\nrobotic navigation, 3D reconstruction, and many other fields. Finding pixel\ncorrespondences in non-textured, occluded and reflective areas is the major\nchallenge in stereo matching. Recent developments have shown that semantic cues\nfrom image segmentation can be used to improve the results of stereo matching.\nMany deep neural network architectures have been proposed to leverage the\nadvantages of semantic segmentation in stereo matching. This paper aims to give\na comparison among the state of art networks both in terms of accuracy and in\nterms of speed which are of higher importance in real-time applications. | Survey on Semantic Stereo Matching / Semantic Depth Estimation |
1 | The recent advancements in artificial intelligence (AI) combined with the\nextensive amount of data generated by today's clinical systems, has led to the\ndevelopment of imaging AI solutions across the whole value chain of medical\nimaging, including image reconstruction, medical image segmentation,\nimage-based diagnosis and treatment planning. Notwithstanding the successes and\nfuture potential of AI in medical imaging, many stakeholders are concerned of\nthe potential risks and ethical implications of imaging AI solutions, which are\nperceived as complex, opaque, and difficult to comprehend, utilise, and trust\nin critical clinical applications. Despite these concerns and risks, there are\ncurrently no concrete guidelines and best practices for guiding future AI\ndevelopments in medical imaging towards increased trust, safety and adoption.\nTo bridge this gap, this paper introduces a careful selection of guiding\nprinciples drawn from the accumulated experiences, consensus, and best\npractices from five large European projects on AI in Health Imaging. These\nguiding principles are named FUTURE-AI and its building blocks consist of (i)\nFairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness\nand (vi) Explainability. In a step-by-step approach, these guidelines are\nfurther translated into a framework of concrete recommendations for specifying,\ndeveloping, evaluating, and deploying technically, clinically and ethically\ntrustworthy AI solutions into clinical practice. | FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging |
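As a quick sanity check of the add_punc helper defined above, it appends a period only when a string does not already end with sentence-final punctuation:
print(add_punc('This sentence has no final punctuation'))  # a period is appended
print(add_punc('Already punctuated!'))                      # returned unchanged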
random.seed(20)
paper_title = Dataset.from_pandas(df[:2000])
# Set aside some data as holdout set
Holdout = df[-100:]
# We have a prompt but only as a prefix in the encoder
prefix = "title: "
# Manually add our own labels because unlike GPT,
# we cannot assume the labels are based on the inputs
def preprocess_function(examples):
inputs = [prefix + doc for doc in examples["summaries"]]
model_inputs = base_tokenizer(inputs, max_length=1024, truncation=True)
labels = base_tokenizer(examples["titles"], max_length=128, truncation=True)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
paper_title = paper_title.map(preprocess_function, batched=True)
paper_title[0]
{'summaries': 'Stereo matching is one of the widely used techniques for inferring depth from\nstereo images owing to its robustness and speed. It has become one of the major\ntopics of research since it finds its applications in autonomous driving,\nrobotic navigation, 3D reconstruction, and many other fields. Finding pixel\ncorrespondences in non-textured, occluded and reflective areas is the major\nchallenge in stereo matching. Recent developments have shown that semantic cues\nfrom image segmentation can be used to improve the results of stereo matching.\nMany deep neural network architectures have been proposed to leverage the\nadvantages of semantic segmentation in stereo matching. This paper aims to give\na comparison among the state of art networks both in terms of accuracy and in\nterms of speed which are of higher importance in real-time applications.', 'titles': 'Survey on Semantic Stereo Matching / Semantic Depth Estimation', '__index_level_0__': 0, 'input_ids': [2233, 10, 30535, 8150, 19, 80, 13, 8, 5456, 261, 2097, 21, 16, 1010, 1007, 4963, 45, 16687, 1383, 3, 15942, 12, 165, 6268, 655, 11, 1634, 5, 94, 65, 582, 80, 13, 8, 779, 4064, 13, 585, 437, 34, 12902, 165, 1564, 16, 21286, 2191, 6, 20407, 8789, 6, 220, 308, 20532, 6, 11, 186, 119, 4120, 5, 14490, 3, 14251, 17215, 7, 16, 529, 18, 25616, 6, 3, 13377, 21135, 26, 11, 22891, 844, 19, 8, 779, 1921, 16, 16687, 8150, 5, 17716, 11336, 43, 2008, 24, 27632, 123, 15, 7, 45, 1023, 5508, 257, 54, 36, 261, 12, 1172, 8, 772, 13, 16687, 8150, 5, 1404, 1659, 24228, 1229, 4648, 7, 43, 118, 4382, 12, 11531, 8, 7648, 13, 27632, 5508, 257, 16, 16687, 8150, 5, 100, 1040, 3, 8345, 12, 428, 3, 9, 4993, 859, 8, 538, 13, 768, 5275, 321, 16, 1353, 13, 7452, 11, 16, 1353, 13, 1634, 84, 33, 13, 1146, 3172, 16, 490, 18, 715, 1564, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [11418, 30, 679, 348, 1225, 30535, 12296, 53, 3, 87, 679, 348, 1225, 25734, 107, 23621, 23, 106, 1]}
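As a quick sanity check of the preprocessing, the tokenized label of the first example can be decoded back to text; it should reproduce the original title (assuming the title was not truncated at 128 tokens).
# Decode the tokenized label back to text to verify the preprocessing
print(base_tokenizer.decode(paper_title[0]['labels'], skip_special_tokens=True))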
paper_title = paper_title.train_test_split(test_size=.1)
paper_title
DatasetDict({ train: Dataset({ features: ['summaries', 'titles', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'], num_rows: 1800 }) test: Dataset({ features: ['summaries', 'titles', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'], num_rows: 200 }) })
# Data collator specifically for generic sequence to sequence tasks
# Use when we are mapping one sequence to another, e.g. translation, summarization, etc.
data_collator = DataCollatorForSeq2Seq(tokenizer=base_tokenizer, model=base_model)
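To see what the collator actually produces, the sketch below collates two training examples: inputs are padded with the pad token, while label padding uses -100 so those positions are ignored by the loss (and, since the model is passed in, decoder_input_ids are created from the labels). The choice of two examples is arbitrary.
# Inspect one collated batch built from two training examples
features = [{k: paper_title['train'][i][k] for k in ('input_ids', 'attention_mask', 'labels')}
            for i in range(2)]
batch = data_collator(features)
print(batch['input_ids'].shape, batch['labels'].shape)
print(batch['labels'][:, -5:])  # the shorter title is padded with -100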
Fine-tune Trainer¶
epochs = 2
batch_size = 5
training_args = TrainingArguments(
output_dir="./T5_abst_title", # The output directory
overwrite_output_dir=True, # overwrite the content of the output directory
num_train_epochs=epochs, # number of training epochs
per_device_train_batch_size=batch_size, # batch size for training
per_device_eval_batch_size=batch_size, # batch size for evaluation
logging_steps=50,
load_best_model_at_end=True,
evaluation_strategy='epoch', # either "steps" or "epoch"; with "epoch", evaluation runs at the end of each epoch
save_strategy='epoch' # save a check point of our model after each epoch
)
trainer = Trainer(
model=base_model, # the t5-small model we are fine-tuning
args=training_args, # the training arguments we just set above
train_dataset=paper_title["train"], # training part of dataset
eval_dataset=paper_title["test"], # test (evaluation) part of dataset
data_collator=data_collator # data collator with padding. In fact, we may or may not need a data collator;
# we can check a batch to see how it looks with or without the collator
)
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5
{'eval_loss': 4.031590461730957, 'eval_runtime': 46.8441, 'eval_samples_per_second': 4.269, 'eval_steps_per_second': 0.854}
trainer.train() # total of 9 hours of training on my laptop!
The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running training ***** Num examples = 1800 Num Epochs = 2 Instantaneous batch size per device = 5 Total train batch size (w. parallel, distributed & accumulation) = 5 Gradient Accumulation steps = 1 Total optimization steps = 720 Number of trainable parameters = 60506624
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 2.502300 | 2.153708 |
2 | 2.308100 | 2.120923 |
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5 Saving model checkpoint to ./T5_abst_title\checkpoint-360 Configuration saved in ./T5_abst_title\checkpoint-360\config.json Configuration saved in ./T5_abst_title\checkpoint-360\generation_config.json Model weights saved in ./T5_abst_title\checkpoint-360\pytorch_model.bin The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5 Saving model checkpoint to ./T5_abst_title\checkpoint-720 Configuration saved in ./T5_abst_title\checkpoint-720\config.json Configuration saved in ./T5_abst_title\checkpoint-720\generation_config.json Model weights saved in ./T5_abst_title\checkpoint-720\pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) Loading best model from ./T5_abst_title\checkpoint-720 (score: 2.120922803878784).
TrainOutput(global_step=720, training_loss=2.489546706941393, metrics={'train_runtime': 2824.1471, 'train_samples_per_second': 1.275, 'train_steps_per_second': 0.255, 'total_flos': 324548053893120.0, 'train_loss': 2.489546706941393, 'epoch': 2.0})
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5
{'eval_loss': 2.120922803878784, 'eval_runtime': 46.5597, 'eval_samples_per_second': 4.296, 'eval_steps_per_second': 0.859, 'epoch': 2.0}
trainer.save_model()
Saving model checkpoint to ./T5_abst_title Configuration saved in ./T5_abst_title\config.json Configuration saved in ./T5_abst_title\generation_config.json Model weights saved in ./T5_abst_title\pytorch_model.bin
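The saved checkpoint can later be reloaded for inference without the Trainer, for example through the text2text-generation pipeline. A minimal sketch, reusing base_tokenizer (the tokenizer itself was not saved to the output directory) and the "title: " prefix used during fine-tuning:
# Reload the fine-tuned checkpoint saved above and generate a title for one holdout abstract
title_generator = pipeline('text2text-generation',
                           model='./T5_abst_title', tokenizer=base_tokenizer)
print(title_generator('title: ' + Holdout['summaries'].iloc[0],
                      num_beams=4, early_stopping=True)[0]['generated_text'])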
Prediction with fine-tuned T5 on Holdout¶
fine_tuned_model = T5ForConditionalGeneration.from_pretrained('./t5_news_summary')
# Define a custom title-extraction function
def label_extraction(abstract):
    # Use the same "title: " prefix the model was fine-tuned with
    inputs = base_tokenizer("title: " + abstract, return_tensors="pt")
    outputs = fine_tuned_model.generate(**inputs)
    extracted_labels = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return extracted_labels
loading configuration file ./t5_news_summary\config.json Model config T5Config { "_name_or_path": "t5-small", "architectures": [ "T5ForConditionalGeneration" ], "d_ff": 2048, "d_kv": 64, "d_model": 512, "decoder_start_token_id": 0, "dense_act_fn": "relu", "dropout_rate": 0.1, "eos_token_id": 1, "feed_forward_proj": "relu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": false, "layer_norm_epsilon": 1e-06, "model_type": "t5", "n_positions": 512, "num_decoder_layers": 6, "num_heads": 8, "num_layers": 6, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "task_specific_params": { "summarization": { "early_stopping": true, "length_penalty": 2.0, "max_length": 200, "min_length": 30, "no_repeat_ngram_size": 3, "num_beams": 4, "prefix": "summarize: " }, "translation_en_to_de": { "early_stopping": true, "max_length": 300, "num_beams": 4, "prefix": "translate English to German: " }, "translation_en_to_fr": { "early_stopping": true, "max_length": 300, "num_beams": 4, "prefix": "translate English to French: " }, "translation_en_to_ro": { "early_stopping": true, "max_length": 300, "num_beams": 4, "prefix": "translate English to Romanian: " } }, "torch_dtype": "float32", "transformers_version": "4.26.1", "use_cache": true, "vocab_size": 32128 } loading weights file ./t5_news_summary\pytorch_model.bin Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" } All model checkpoint weights were used when initializing T5ForConditionalGeneration. All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at ./t5_news_summary. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training. loading configuration file ./t5_news_summary\generation_config.json Generate config GenerationConfig { "_from_model_config": true, "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
Example 1¶
ir = 0
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- A representation is supposed universal if it encodes any element of the visual world (e.g., objects, scenes) in any configuration (e.g., scale, context). While not expecting pure universal representations, the goal in the literature is to improve the universality level, starting from a representation with a certain level. To do so, the state-of-the-art consists in learning CNN-based representations on a diversified training problem (e.g., ImageNet modified by adding annotated data). While it effectively increases universality, such approach still requires a large amount of efforts to satisfy the needs in annotated data. In this work, we propose two methods to improve universality, but pay special attention to limit the need of annotated data. We also propose a unified framework of the methods based on the diversifying of the training problem. Finally, to better match Atkinson's cognitive study about universal human representations, we proposed to rely on the transfer-learning scheme as well as a new metric to evaluate universality. This latter, aims us to demonstrates the interest of our methods on 10 target-problems, relating to the classification task and a variety of visual domains. -------actual title--------- Learning More Universal Representations for Transfer-Learning -------predicted title--------- Universal Representation Learning: A unified framework for universality
Example 2¶
ir = 1
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- In this work, a region-based Deep Convolutional Neural Network framework is proposed for document structure learning. The contribution of this work involves efficient training of region based classifiers and effective ensembling for document image classification. A primary level of `inter-domain' transfer learning is used by exporting weights from a pre-trained VGG16 architecture on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region based influence modelling, a secondary level of `intra-domain' transfer learning is used for rapid training of deep learning models for image segments. Finally, stacked generalization based ensembling is utilized for combining the predictions of the base deep neural network models. The proposed method achieves state-of-the-art accuracy of 92.2% on the popular RVL-CDIP document image dataset, exceeding benchmarks set by existing algorithms. -------actual title--------- Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks -------predicted title--------- Stack Generalization Based Ensembling for Document Structure Learning
Example 3¶
ir = 2
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- The use of iris as a biometric trait is widely used because of its high level of distinction and uniqueness. Nowadays, one of the major research challenges relies on the recognition of iris images obtained in visible spectrum under unconstrained environments. In this scenario, the acquired iris are affected by capture distance, rotation, blur, motion blur, low contrast and specular reflection, creating noises that disturb the iris recognition systems. Besides delineating the iris region, usually preprocessing techniques such as normalization and segmentation of noisy iris images are employed to minimize these problems. But these techniques inevitably run into some errors. In this context, we propose the use of deep representations, more specifically, architectures based on VGG and ResNet-50 networks, for dealing with the images using (and not) iris segmentation and normalization. We use transfer learning from the face domain and also propose a specific data augmentation technique for iris images. Our results show that the approach using non-normalized and only circle-delimited iris images reaches a new state of the art in the official protocol of the NICE.II competition, a subset of the UBIRIS database, one of the most challenging databases on unconstrained environments, reporting an average Equal Error Rate (EER) of 13.98% which represents an absolute reduction of about 5%. -------actual title--------- The Impact of Preprocessing on Deep Representations for Iris Recognition on Unconstrained Environments -------predicted title--------- iris image recognition using non-normalized and only circle-delimited iris images
Example 4¶
ir = 3
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- Deep learning has revolutionized the performance of classification, but meanwhile demands sufficient labeled data for training. Given insufficient data, while many techniques have been developed to help combat overfitting, the challenge remains if one tries to train deep networks, especially in the ill-posed extremely low data regimes: only a small set of labeled data are available, and nothing -- including unlabeled data -- else. Such regimes arise from practical situations where not only data labeling but also data collection itself is expensive. We propose a deep adversarial data augmentation (DADA) technique to address the problem, in which we elaborately formulate data augmentation as a problem of training a class-conditional and supervised generative adversarial network (GAN). Specifically, a new discriminator loss is proposed to fit the goal of data augmentation, through which both real and augmented samples are enforced to contribute to and be consistent in finding the decision boundaries. Tailored training techniques are developed accordingly. To quantitatively validate its effectiveness, we first perform extensive simulations to show that DADA substantially outperforms both traditional data augmentation and a few GAN-based options. We then extend experiments to three real-world small labeled datasets where existing data augmentation and/or transfer learning strategies are either less effective or infeasible. All results endorse the superior capability of DADA in enhancing the generalization ability of deep networks trained in practical extremely low data regimes. Source code is available at https://github.com/SchafferZhang/DADA. -------actual title--------- DADA: Deep Adversarial Data Augmentation for Extremely Low Data Regime Classification -------predicted title--------- Deep Analyse of Deep Learning: A Deep Analyse of a Class