Summary
The T5 (Text-to-Text Transfer Transformer) model is a transformer-based architecture developed by Google Research, designed to handle a wide range of natural language processing (NLP) tasks in a unified manner. T5 frames every NLP task as a text-to-text problem: both the input and the output are treated as sequences of text. This unified architecture enables effective transfer learning and makes T5 a versatile tool, since a single model structure can handle diverse tasks. In this notebook, off-the-shelf (pre-trained) T5 models are first applied to various tasks including translation, summarization and question answering. These pre-trained models are useful as baseline predictions, but they cannot be used directly in production; for production, T5 should be fine-tuned. In this study, the t5-small model is fine-tuned to predict titles for paper abstracts.
The Python functions and data files needed to run this notebook are available on my GitHub.
import warnings
warnings.filterwarnings('ignore')
from transformers import T5ForConditionalGeneration, T5Tokenizer # Similar to GPT2
from transformers import pipeline, TrainingArguments
from transformers import Trainer, DataCollatorForSeq2Seq
import pandas as pd
pd.set_option('display.max_colwidth', None)
from datasets import Dataset
import random
Introduction¶
Instead of relying on only an encoder or only a decoder, T5 is a complete end-to-end transformer that uses both the encoder and the decoder, with cross-attention bridging the gap between them. BERT is derived from the encoder part of the transformer, whereas GPT is derived from the decoder stack. There are pros and cons to each approach:
BERT: it is faster for natural language understanding tasks such as sequence and token classification, but the flexibility of creating prompts and supporting autoregressive use cases is lost.
GPT: it is very flexible to adapt to different domains and to multiple types of tasks at once (e.g. sequence classification and summarization). However, generating text with GPT is slow, and it is not as robust for classification tasks as BERT.
T5 uses both the encoder and decoder stacks at the same time: it is a pure sequence-to-sequence model, and it can perform multiple NLP tasks at once.
Image retrieved from Sinan Ozdemir
The figure below shows four of them:
- translate: translate English to German
- cola: linguistic acceptability, e.g. grammar
- stsb: semantic text similarity, i.e. given two phrases, how similar they are
- summarize: text summarization
These four tasks can be done by T5 off the shelf, and each task comes with its own prompt.
Image retrieved from Sinan Ozdemir
T5 was pre-trained on the C4 (Colossal Clean Crawled Corpus) dataset, which is derived from Common Crawl (https://commoncrawl.org/).
Sentinel tokens are used, similarly to masking, to predict missing text. BERT also uses masked language modeling, but the biggest difference is that instead of masking a single token and predicting it, T5 masks spans of multiple tokens and predicts them all; a minimal code sketch of this objective follows the list below.
Three training objectives of T5 are:
- causal (auto-regressive) language modeling: next-word prediction
- BERT-style objective: masking words and predicting the original text
- Deshuffling: shuffling the input randomly and predicting the original text.
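As a quick, hedged illustration of the BERT-style (span-corruption) objective, the sketch below masks two spans with sentinel tokens and lets T5 score the corresponding target spans. It loads the same t5-base checkpoint used in the next section; the example sentence and variable names are only illustrative.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

# Corrupted input: each dropped span is replaced by a sentinel token
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park',
                      return_tensors='pt').input_ids
# Target: the dropped spans, each preceded by its sentinel token
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>',
                   return_tensors='pt').input_ids

loss = model(input_ids=input_ids, labels=labels).loss  # span-corruption loss
print(loss)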
Ready-made Results with T5¶
t5_base_model = T5ForConditionalGeneration.from_pretrained('t5-base') # There are other flavours of T5 (e.g. t5-small, t5-large)
t5_base_tokenizer = T5Tokenizer.from_pretrained('t5-base')
English to German Translation¶
input_ids = t5_base_tokenizer.encode('translate English to German: How are you doing?', return_tensors='pt')
# translate
translate_ids = t5_base_model.generate(
input_ids,
num_beams=4,
no_repeat_ngram_size=3,
max_length=20,
early_stopping=True
)
translate = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print (f"Translated text:\n{translate}")
Translated text: Wie geht es Ihnen?
# pass labels in to calculate loss
ids = t5_base_tokenizer('translate English to German: How are you doing?', return_tensors='pt').input_ids
labels = t5_base_tokenizer('Wo ist die Schokolade?', return_tensors='pt').input_ids
loss = t5_base_model(input_ids=ids, labels=labels).loss
labels, loss
(tensor([[ 3488, 229, 67, 31267, 58, 1]]), tensor(4.0347, grad_fn=<NllLossBackward0>))
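For comparison, we can score the reference translation produced above as the label; the loss should come out well below the 4.03 obtained with the unrelated German sentence. A quick check, reusing the tensors already defined:
# Score the correct translation as the label for comparison
good_labels = t5_base_tokenizer('Wie geht es Ihnen?', return_tensors='pt').input_ids
good_loss = t5_base_model(input_ids=ids, labels=good_labels).loss
print(good_loss)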
Abstractive Summarization¶
Abstractive summarization is the task of creating a summary of a text without being constrained to reuse specific phrases or sentences from the source.
text ="""The transformer model is primarily based on the attention idea. In 2015, attention
mechanisms started to be used in NLP tasks and became the dominant way to perform NLP tasks
over the past decade. Attention in NLP is a mechanism designed to focus on specific parts of a
sequence in the context of another sequence. It is used to perform various NLP tasks, such as language
modeling, sequence classification, language translation, and image captioning. The idea behind attention
is to give the model the ability to attend to relevant information while performing a prediction task.
The attention mechanism allows the model to understand relationships between different elements in a sequence
and make predictions based on that information. For example, in language translation, the attention mechanism
helps the model understand which parts of a source sentence are relevant for translating into a target language.
There are various types of attention mechanisms, including self-attention, which is the kind of attention that
powers the transformer architecture in NLP.
"""
preprocessed_text = text.strip().replace("\n","")
print ("preprocessed text :\n", preprocessed_text)
preprocessed text : The transformer model is primarily based on the attention idea. In 2015, attention mechanisms started to be used in NLP tasks and became the dominant way to perform NLP tasks over the past decade. Attention in NLP is a mechanism designed to focus on specific parts of a sequence in the context of another sequence. It is used to perform various NLP tasks, such as language modeling, sequence classification, language translation, and image captioning. The idea behind attention is to give the model the ability to attend to relevant information while performing a prediction task. The attention mechanism allows the model to understand relationships between different elements in a sequence and make predictions based on that information. For example, in language translation, the attention mechanism helps the model understand which parts of a source sentence are relevant for translating into a target language. There are various types of attention mechanisms, including self-attention, which is the kind of attention that powers the transformer architecture in NLP.
Making a summarization prompt with T5 illustrates the benefit of its multi-task training. With BERT, we have to change the architecture (add task-specific layers) to perform a new NLP task; it is not multi-task oriented. GPT is multi-task oriented, but we must craft the input prompt so that GPT knows which task to perform. T5 needs only a minor change: we simply place the task prompt at the beginning of the input, whereas for GPT we would also need to add a token at the end.
# We only need to add "summarize: " at the beginning to apply text summarization
t5_prepared_text = "summarize: " + preprocessed_text
ids = t5_base_tokenizer.encode(t5_prepared_text, return_tensors="pt") # encode the phrase
# summarize
summarize_ids = t5_base_model.generate(ids, num_beams=4, no_repeat_ngram_size=3, # num_beams makes predictions less random
                                       min_length=10, max_length=30, early_stopping=True
                                       )
summarized = t5_base_tokenizer.decode(summarize_ids[0], skip_special_tokens=True)
print (f"The Summarized text is: \n{summarized}")
The Summarized text is: attention mechanisms started to be used in NLP tasks in 2015. it is a mechanism designed to focus on specific parts of a sequence
- ids: the input sequence of token ids.
- num_beams=4: the number of beams used in beam search during generation. More beams make the output less random and usually more focused and coherent.
- no_repeat_ngram_size=3: prevents repetition of n-grams; the generated text will not contain the same 3-token sequence twice.
- min_length=10: the minimum length of the generated text (at least 10 tokens), matching the call above.
- max_length=30: the maximum length of the generated text (at most 30 tokens), matching the call above.
- early_stopping=True: stops beam search as soon as enough complete candidate sequences have been found, rather than continuing to the maximum length.
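The same summary can also be produced through the high-level pipeline API imported at the top of the notebook. A minimal sketch, reusing the model and tokenizer already loaded and the same generation settings (for T5, the pipeline supplies the "summarize: " prefix automatically from the model config):
# Alternative interface: summarization pipeline with the same generation settings
summarizer = pipeline('summarization', model=t5_base_model, tokenizer=t5_base_tokenizer)
print(summarizer(preprocessed_text, num_beams=4, no_repeat_ngram_size=3,
                 min_length=10, max_length=30, early_stopping=True)[0]['summary_text'])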
text ="""There are many approaches that use weak-supervision to train networks to
segment 2D images. By contrast, existing 3D approaches rely on full-supervision
of a subset of 2D slices of the 3D image volume. In this paper, we propose an
approach that is truly weakly-supervised in the sense that we only need to
provide a sparse set of 3D point on the surface of target objects, an easy task
that can be quickly done. We use the 3D points to deform a 3D template so that
it roughly matches the target object outlines and we introduce an architecture
that exploits the supervision provided by coarse template to train a network to
find accurate boundaries. We evaluate the performance of our approach on Computed Tomography (CT),
Magnetic Resonance Imagery (MRI) and Electron Microscopy (EM) image datasets.
We will show that it outperforms a more traditional approach to
weak-supervision in 3D at a reduced supervision cost."""
preprocessed_text = text.strip().replace("\n","")
# We only need to add "Make title: " at the beginning to apply text summerization
t5_prepared_text = "extract labels: " + preprocessed_text
ids = t5_base_tokenizer.encode(t5_prepared_text, return_tensors="pt") # encode the phrase
# summmarize
summmarize = t5_base_model.generate(ids,num_beams=4,no_repeat_ngram_size=3, # num_beams to make prediction less random
min_length=10,max_length=20,early_stopping=True
)
summarized = t5_base_tokenizer.decode(summmarize[0], skip_special_tokens=True)
print (f"The title of the text is: \n{summarized}")
The title of the text is: Falk: weakly-supervised approach to train networks tosegment 2D images
CoLA: The Corpus of Linguistic Acceptability¶
CoLA (Corpus of Linguistic Acceptability) checks whether a sentence is grammatically acceptable.
ids = t5_base_tokenizer.encode('cola sentence: How are you doing?', return_tensors='pt')
# CoLA
translate_ids = t5_base_model.generate(ids,num_beams=4,no_repeat_ngram_size=3,
max_length=20,early_stopping=True)
translate = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"is grammatically correct?: \n{translate}")
is grammatically correct?: acceptable
input_ids = t5_base_tokenizer.encode('cola sentence: How are you doings?', return_tensors='pt')
# CoLA
translate_ids = t5_base_model.generate(
input_ids,
max_length=20,
early_stopping=True
)
output = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"is grammatically correct?: \n{output}")
is grammatically correct?: unacceptable
Q/A - Question/Answering¶
Extractive Q/A can be performed with BERT and abstractive Q/A with GPT. T5 (Text-to-Text Transfer Transformer) can also be used for abstractive question answering: the input contains a question together with its context, and the target is the corresponding answer.
ids = t5_base_tokenizer.encode(
    'question: Where do I work? context: I live in Alberta but work in Calgary.', return_tensors='pt'
)
# Q/A
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"result: \n{result}")
result: Calgary
STSB - Semantic Text Similarity Benchmark¶
STSB estimates the semantic similarity between two sentences on a scale from 0 to 5.
sentence_one = 'Python is scary animal'
sentence_two = 'I love coding with Python'
ids = t5_base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')
# calculate semantic similarity
translate_ids = t5_base_model.generate(ids, max_length=3, early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"semantically similar? (0-5): \n{result}")
semantically similar? (0-5): 1.0
sentence_one = 'President greets the press in Chicago'
sentence_two = 'Biden speaks to media in Illinois'
ids = t5_base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')
# calculate semantic similarity
translate_ids = t5_base_model.generate(ids, max_length=3, early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"semantically similar? (0-5): \n{result}")
semantically similar? (0-5): 1.6
MNLI - Multi-Genre Natural Language Inference¶
Multi-Genre Natural Language Inference is a challenging NLP task in which a model must determine the logical relationship between a pair of sentences (a premise and a hypothesis) drawn from diverse genres and writing styles. The model has to infer whether the hypothesis is entailed by, contradicts, or is neutral with respect to the premise, which makes MNLI a broad benchmark for language understanding.
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I will be an Engineer', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
Result: contradiction
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I am going to be a physician', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
Result: entailment
ids = t5_base_tokenizer.encode(
'mnli premise: I am going to medical school. hypothesis: I will be top 2 student in my class', return_tensors='pt')
# mnli
translate_ids = t5_base_model.generate(ids,early_stopping=True)
result = t5_base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)
print(f"Result: \n{result}")
Result: neutral
Off-the-shelf results with T5 cannot be used in production; we can only use them as baseline predictions. For production, T5 should be fine-tuned.
Fine-tune T5 for Abstractive Summarization¶
arXiv Paper Abstracts Data set¶
Paper submission systems (CMT, OpenReview, etc.) require users to upload paper titles and abstracts and then specify the subject areas their papers best belong to. Wouldn't it be nice if such submission systems provided viable subject-area suggestions for where the corresponding papers could best be placed?
This dataset would allow developers to build baseline models that might benefit this use case. Data analysts might also enjoy analyzing the intricacies of different papers and how well their abstracts correlate to their noted categories. Additionally, we hope that the dataset will serve as a decent benchmark for building useful text classification systems.
base_model = T5ForConditionalGeneration.from_pretrained('t5-small')
base_tokenizer = T5Tokenizer.from_pretrained('t5-small')
df = pd.read_csv('arxiv_data.csv',encoding='latin-1')
df = df[["summaries","titles"]]
df = df.dropna()
print(df.shape)
# Set aside some data as holdout set
Holdout = df[-100:]
# Since the data set is large for fine-tuning, 10,000 samples are selected.
df = df[:10000].copy()
df.head(2)
(51774, 2)
summaries | titles | |
---|---|---|
0 | Stereo matching is one of the widely used techniques for inferring depth from\nstereo images owing to its robustness and speed. It has become one of the major\ntopics of research since it finds its applications in autonomous driving,\nrobotic navigation, 3D reconstruction, and many other fields. Finding pixel\ncorrespondences in non-textured, occluded and reflective areas is the major\nchallenge in stereo matching. Recent developments have shown that semantic cues\nfrom image segmentation can be used to improve the results of stereo matching.\nMany deep neural network architectures have been proposed to leverage the\nadvantages of semantic segmentation in stereo matching. This paper aims to give\na comparison among the state of art networks both in terms of accuracy and in\nterms of speed which are of higher importance in real-time applications. | Survey on Semantic Stereo Matching / Semantic Depth Estimation |
1 | The recent advancements in artificial intelligence (AI) combined with the\nextensive amount of data generated by today's clinical systems, has led to the\ndevelopment of imaging AI solutions across the whole value chain of medical\nimaging, including image reconstruction, medical image segmentation,\nimage-based diagnosis and treatment planning. Notwithstanding the successes and\nfuture potential of AI in medical imaging, many stakeholders are concerned of\nthe potential risks and ethical implications of imaging AI solutions, which are\nperceived as complex, opaque, and difficult to comprehend, utilise, and trust\nin critical clinical applications. Despite these concerns and risks, there are\ncurrently no concrete guidelines and best practices for guiding future AI\ndevelopments in medical imaging towards increased trust, safety and adoption.\nTo bridge this gap, this paper introduces a careful selection of guiding\nprinciples drawn from the accumulated experiences, consensus, and best\npractices from five large European projects on AI in Health Imaging. These\nguiding principles are named FUTURE-AI and its building blocks consist of (i)\nFairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness\nand (vi) Explainability. In a step-by-step approach, these guidelines are\nfurther translated into a framework of concrete recommendations for specifying,\ndeveloping, evaluating, and deploying technically, clinically and ethically\ntrustworthy AI solutions into clinical practice. | FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging |
Split train and test set¶
df = df[ (df['summaries'].str.len() >=30)]
df.shape
(10000, 2)
# Pre-processing step
# Punctuation is important in grammar and important for complex decoding architectures to know when to stop!
def add_punc(s):
if s[-1] not in ('.', '!', '?'):
s = s + '.'
return s
df.dropna(inplace=True)
df['summaries'] = df['summaries'].map(add_punc)
print(df.shape)
df.head(2)
(10000, 2)
summaries | titles | |
---|---|---|
0 | Stereo matching is one of the widely used techniques for inferring depth from\nstereo images owing to its robustness and speed. It has become one of the major\ntopics of research since it finds its applications in autonomous driving,\nrobotic navigation, 3D reconstruction, and many other fields. Finding pixel\ncorrespondences in non-textured, occluded and reflective areas is the major\nchallenge in stereo matching. Recent developments have shown that semantic cues\nfrom image segmentation can be used to improve the results of stereo matching.\nMany deep neural network architectures have been proposed to leverage the\nadvantages of semantic segmentation in stereo matching. This paper aims to give\na comparison among the state of art networks both in terms of accuracy and in\nterms of speed which are of higher importance in real-time applications. | Survey on Semantic Stereo Matching / Semantic Depth Estimation |
1 | The recent advancements in artificial intelligence (AI) combined with the\nextensive amount of data generated by today's clinical systems, has led to the\ndevelopment of imaging AI solutions across the whole value chain of medical\nimaging, including image reconstruction, medical image segmentation,\nimage-based diagnosis and treatment planning. Notwithstanding the successes and\nfuture potential of AI in medical imaging, many stakeholders are concerned of\nthe potential risks and ethical implications of imaging AI solutions, which are\nperceived as complex, opaque, and difficult to comprehend, utilise, and trust\nin critical clinical applications. Despite these concerns and risks, there are\ncurrently no concrete guidelines and best practices for guiding future AI\ndevelopments in medical imaging towards increased trust, safety and adoption.\nTo bridge this gap, this paper introduces a careful selection of guiding\nprinciples drawn from the accumulated experiences, consensus, and best\npractices from five large European projects on AI in Health Imaging. These\nguiding principles are named FUTURE-AI and its building blocks consist of (i)\nFairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness\nand (vi) Explainability. In a step-by-step approach, these guidelines are\nfurther translated into a framework of concrete recommendations for specifying,\ndeveloping, evaluating, and deploying technically, clinically and ethically\ntrustworthy AI solutions into clinical practice. | FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging |
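As a quick sanity check of the add_punc helper defined above, it appends a period only when a string does not already end with sentence-final punctuation:
print(add_punc('This sentence has no final punctuation'))  # a period is appended
print(add_punc('Already punctuated!'))                      # returned unchanged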
random.seed(20)
paper_title = Dataset.from_pandas(df[:2000])
# Set aside some data as holdout set
Holdout = df[-100:]
# We have a prompt but only as a prefix in the encoder
prefix = "title: "
# Manually add our own labels because unlike GPT,
# we cannot assume the labels are based on the inputs
def preprocess_function(examples):
inputs = [prefix + doc for doc in examples["summaries"]]
model_inputs = base_tokenizer(inputs, max_length=1024, truncation=True)
labels = base_tokenizer(examples["titles"], max_length=128, truncation=True)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
paper_title = paper_title.map(preprocess_function, batched=True)
paper_title[0]
{'summaries': 'Stereo matching is one of the widely used techniques for inferring depth from\nstereo images owing to its robustness and speed. It has become one of the major\ntopics of research since it finds its applications in autonomous driving,\nrobotic navigation, 3D reconstruction, and many other fields. Finding pixel\ncorrespondences in non-textured, occluded and reflective areas is the major\nchallenge in stereo matching. Recent developments have shown that semantic cues\nfrom image segmentation can be used to improve the results of stereo matching.\nMany deep neural network architectures have been proposed to leverage the\nadvantages of semantic segmentation in stereo matching. This paper aims to give\na comparison among the state of art networks both in terms of accuracy and in\nterms of speed which are of higher importance in real-time applications.', 'titles': 'Survey on Semantic Stereo Matching / Semantic Depth Estimation', '__index_level_0__': 0, 'input_ids': [2233, 10, 30535, 8150, 19, 80, 13, 8, 5456, 261, 2097, 21, 16, 1010, 1007, 4963, 45, 16687, 1383, 3, 15942, 12, 165, 6268, 655, 11, 1634, 5, 94, 65, 582, 80, 13, 8, 779, 4064, 13, 585, 437, 34, 12902, 165, 1564, 16, 21286, 2191, 6, 20407, 8789, 6, 220, 308, 20532, 6, 11, 186, 119, 4120, 5, 14490, 3, 14251, 17215, 7, 16, 529, 18, 25616, 6, 3, 13377, 21135, 26, 11, 22891, 844, 19, 8, 779, 1921, 16, 16687, 8150, 5, 17716, 11336, 43, 2008, 24, 27632, 123, 15, 7, 45, 1023, 5508, 257, 54, 36, 261, 12, 1172, 8, 772, 13, 16687, 8150, 5, 1404, 1659, 24228, 1229, 4648, 7, 43, 118, 4382, 12, 11531, 8, 7648, 13, 27632, 5508, 257, 16, 16687, 8150, 5, 100, 1040, 3, 8345, 12, 428, 3, 9, 4993, 859, 8, 538, 13, 768, 5275, 321, 16, 1353, 13, 7452, 11, 16, 1353, 13, 1634, 84, 33, 13, 1146, 3172, 16, 490, 18, 715, 1564, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [11418, 30, 679, 348, 1225, 30535, 12296, 53, 3, 87, 679, 348, 1225, 25734, 107, 23621, 23, 106, 1]}
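As a quick sanity check of the preprocessing, the tokenized label of the first example can be decoded back to text; it should reproduce the original title (assuming the title was not truncated at 128 tokens).
# Decode the tokenized label back to text to verify the preprocessing
print(base_tokenizer.decode(paper_title[0]['labels'], skip_special_tokens=True))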
paper_title = paper_title.train_test_split(test_size=.1)
paper_title
DatasetDict({ train: Dataset({ features: ['summaries', 'titles', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'], num_rows: 1800 }) test: Dataset({ features: ['summaries', 'titles', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'], num_rows: 200 }) })
# Data collator specifically for generic sequence to sequence tasks
# Use when we are mapping one sequence to another, e.g. translation, summarization, etc.
data_collator = DataCollatorForSeq2Seq(tokenizer=base_tokenizer, model=base_model)
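To see what the collator actually produces, the sketch below collates two training examples: inputs are padded with the pad token, while label padding uses -100 so those positions are ignored by the loss (and, since the model is passed in, decoder_input_ids are created from the labels). The choice of two examples is arbitrary.
# Inspect one collated batch built from two training examples
features = [{k: paper_title['train'][i][k] for k in ('input_ids', 'attention_mask', 'labels')}
            for i in range(2)]
batch = data_collator(features)
print(batch['input_ids'].shape, batch['labels'].shape)
print(batch['labels'][:, -5:])  # the shorter title is padded with -100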
Fine-tune Trainer¶
epochs = 2
batch_size = 5
training_args = TrainingArguments(
output_dir="./T5_abst_title", # The output directory
overwrite_output_dir=True, # overwrite the content of the output directory
num_train_epochs=epochs, # number of training epochs
per_device_train_batch_size=batch_size, # batch size for training
per_device_eval_batch_size=batch_size, # batch size for evaluation
logging_steps=50,
load_best_model_at_end=True,
evaluation_strategy='epoch', # either "steps" or "epoch"; with "epoch", evaluation runs at the end of each epoch
save_strategy='epoch' # save a check point of our model after each epoch
)
trainer = Trainer(
model=base_model, # the t5-small model we are fine-tuning
args=training_args, # the training arguments we just set above
train_dataset=paper_title["train"], # training part of dataset
eval_dataset=paper_title["test"], # test (evaluation) part of dataset
data_collator=data_collator # data collator with padding. In fact, we may or may not need a data collator;
# we can check a batch to see how it looks with or without the collator
)
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5
{'eval_loss': 4.031590461730957, 'eval_runtime': 46.8441, 'eval_samples_per_second': 4.269, 'eval_steps_per_second': 0.854}
trainer.train() # total of 9 hours of training on my laptop!
The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running training ***** Num examples = 1800 Num Epochs = 2 Instantaneous batch size per device = 5 Total train batch size (w. parallel, distributed & accumulation) = 5 Gradient Accumulation steps = 1 Total optimization steps = 720 Number of trainable parameters = 60506624
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 2.502300 | 2.153708 |
2 | 2.308100 | 2.120923 |
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5 Saving model checkpoint to ./T5_abst_title\checkpoint-360 Configuration saved in ./T5_abst_title\checkpoint-360\config.json Configuration saved in ./T5_abst_title\checkpoint-360\generation_config.json Model weights saved in ./T5_abst_title\checkpoint-360\pytorch_model.bin The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5 Saving model checkpoint to ./T5_abst_title\checkpoint-720 Configuration saved in ./T5_abst_title\checkpoint-720\config.json Configuration saved in ./T5_abst_title\checkpoint-720\generation_config.json Model weights saved in ./T5_abst_title\checkpoint-720\pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) Loading best model from ./T5_abst_title\checkpoint-720 (score: 2.120922803878784).
TrainOutput(global_step=720, training_loss=2.489546706941393, metrics={'train_runtime': 2824.1471, 'train_samples_per_second': 1.275, 'train_steps_per_second': 0.255, 'total_flos': 324548053893120.0, 'train_loss': 2.489546706941393, 'epoch': 2.0})
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: titles, summaries, __index_level_0__. If titles, summaries, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 200 Batch size = 5
{'eval_loss': 2.120922803878784, 'eval_runtime': 46.5597, 'eval_samples_per_second': 4.296, 'eval_steps_per_second': 0.859, 'epoch': 2.0}
trainer.save_model()
Saving model checkpoint to ./T5_abst_title Configuration saved in ./T5_abst_title\config.json Configuration saved in ./T5_abst_title\generation_config.json Model weights saved in ./T5_abst_title\pytorch_model.bin
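The saved checkpoint can later be reloaded for inference without the Trainer, for example through the text2text-generation pipeline. A minimal sketch, reusing base_tokenizer (the tokenizer itself was not saved to the output directory) and the "title: " prefix used during fine-tuning:
# Reload the fine-tuned checkpoint saved above and generate a title for one holdout abstract
title_generator = pipeline('text2text-generation',
                           model='./T5_abst_title', tokenizer=base_tokenizer)
print(title_generator('title: ' + Holdout['summaries'].iloc[0],
                      num_beams=4, early_stopping=True)[0]['generated_text'])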
Prediction with fine-tuned T5 on Holdout¶
fine_tuned_model = T5ForConditionalGeneration.from_pretrained('./t5_news_summary')
# Define a custom title-extraction function
def label_extraction(abstract):
    # Use the same "title: " prefix the model was fine-tuned with
    inputs = base_tokenizer("title: " + abstract, return_tensors="pt")
    outputs = fine_tuned_model.generate(**inputs)
    extracted_labels = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return extracted_labels
loading configuration file ./t5_news_summary\config.json Model config T5Config { "_name_or_path": "t5-small", "architectures": [ "T5ForConditionalGeneration" ], "d_ff": 2048, "d_kv": 64, "d_model": 512, "decoder_start_token_id": 0, "dense_act_fn": "relu", "dropout_rate": 0.1, "eos_token_id": 1, "feed_forward_proj": "relu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": false, "layer_norm_epsilon": 1e-06, "model_type": "t5", "n_positions": 512, "num_decoder_layers": 6, "num_heads": 8, "num_layers": 6, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "task_specific_params": { "summarization": { "early_stopping": true, "length_penalty": 2.0, "max_length": 200, "min_length": 30, "no_repeat_ngram_size": 3, "num_beams": 4, "prefix": "summarize: " }, "translation_en_to_de": { "early_stopping": true, "max_length": 300, "num_beams": 4, "prefix": "translate English to German: " }, "translation_en_to_fr": { "early_stopping": true, "max_length": 300, "num_beams": 4, "prefix": "translate English to French: " }, "translation_en_to_ro": { "early_stopping": true, "max_length": 300, "num_beams": 4, "prefix": "translate English to Romanian: " } }, "torch_dtype": "float32", "transformers_version": "4.26.1", "use_cache": true, "vocab_size": 32128 } loading weights file ./t5_news_summary\pytorch_model.bin Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" } All model checkpoint weights were used when initializing T5ForConditionalGeneration. All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at ./t5_news_summary. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training. loading configuration file ./t5_news_summary\generation_config.json Generate config GenerationConfig { "_from_model_config": true, "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
Example 1¶
ir = 0
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- A representation is supposed universal if it encodes any element of the visual world (e.g., objects, scenes) in any configuration (e.g., scale, context). While not expecting pure universal representations, the goal in the literature is to improve the universality level, starting from a representation with a certain level. To do so, the state-of-the-art consists in learning CNN-based representations on a diversified training problem (e.g., ImageNet modified by adding annotated data). While it effectively increases universality, such approach still requires a large amount of efforts to satisfy the needs in annotated data. In this work, we propose two methods to improve universality, but pay special attention to limit the need of annotated data. We also propose a unified framework of the methods based on the diversifying of the training problem. Finally, to better match Atkinson's cognitive study about universal human representations, we proposed to rely on the transfer-learning scheme as well as a new metric to evaluate universality. This latter, aims us to demonstrates the interest of our methods on 10 target-problems, relating to the classification task and a variety of visual domains. -------actual title--------- Learning More Universal Representations for Transfer-Learning -------predicted title--------- Universal Representation Learning: A unified framework for universality
Example 2¶
ir = 1
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- In this work, a region-based Deep Convolutional Neural Network framework is proposed for document structure learning. The contribution of this work involves efficient training of region based classifiers and effective ensembling for document image classification. A primary level of `inter-domain' transfer learning is used by exporting weights from a pre-trained VGG16 architecture on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region based influence modelling, a secondary level of `intra-domain' transfer learning is used for rapid training of deep learning models for image segments. Finally, stacked generalization based ensembling is utilized for combining the predictions of the base deep neural network models. The proposed method achieves state-of-the-art accuracy of 92.2% on the popular RVL-CDIP document image dataset, exceeding benchmarks set by existing algorithms. -------actual title--------- Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks -------predicted title--------- Stack Generalization Based Ensembling for Document Structure Learning
Example 3¶
ir = 2
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- The use of iris as a biometric trait is widely used because of its high level of distinction and uniqueness. Nowadays, one of the major research challenges relies on the recognition of iris images obtained in visible spectrum under unconstrained environments. In this scenario, the acquired iris are affected by capture distance, rotation, blur, motion blur, low contrast and specular reflection, creating noises that disturb the iris recognition systems. Besides delineating the iris region, usually preprocessing techniques such as normalization and segmentation of noisy iris images are employed to minimize these problems. But these techniques inevitably run into some errors. In this context, we propose the use of deep representations, more specifically, architectures based on VGG and ResNet-50 networks, for dealing with the images using (and not) iris segmentation and normalization. We use transfer learning from the face domain and also propose a specific data augmentation technique for iris images. Our results show that the approach using non-normalized and only circle-delimited iris images reaches a new state of the art in the official protocol of the NICE.II competition, a subset of the UBIRIS database, one of the most challenging databases on unconstrained environments, reporting an average Equal Error Rate (EER) of 13.98% which represents an absolute reduction of about 5%. -------actual title--------- The Impact of Preprocessing on Deep Representations for Iris Recognition on Unconstrained Environments -------predicted title--------- iris image recognition using non-normalized and only circle-delimited iris images
Example 4¶
ir = 3
abstract = Holdout['summaries'].iloc[ir]
print(f'-------summary---------')
print(abstract)
title = Holdout['titles'].iloc[ir]
print(f'\n-------actual title---------')
print(title)
title_pred = extracted_labels = label_extraction(abstract)
print(f'\n-------predicted title---------')
print(title_pred)
Generate config GenerationConfig { "decoder_start_token_id": 0, "eos_token_id": 1, "pad_token_id": 0, "transformers_version": "4.26.1" }
-------summary--------- Deep learning has revolutionized the performance of classification, but meanwhile demands sufficient labeled data for training. Given insufficient data, while many techniques have been developed to help combat overfitting, the challenge remains if one tries to train deep networks, especially in the ill-posed extremely low data regimes: only a small set of labeled data are available, and nothing -- including unlabeled data -- else. Such regimes arise from practical situations where not only data labeling but also data collection itself is expensive. We propose a deep adversarial data augmentation (DADA) technique to address the problem, in which we elaborately formulate data augmentation as a problem of training a class-conditional and supervised generative adversarial network (GAN). Specifically, a new discriminator loss is proposed to fit the goal of data augmentation, through which both real and augmented samples are enforced to contribute to and be consistent in finding the decision boundaries. Tailored training techniques are developed accordingly. To quantitatively validate its effectiveness, we first perform extensive simulations to show that DADA substantially outperforms both traditional data augmentation and a few GAN-based options. We then extend experiments to three real-world small labeled datasets where existing data augmentation and/or transfer learning strategies are either less effective or infeasible. All results endorse the superior capability of DADA in enhancing the generalization ability of deep networks trained in practical extremely low data regimes. Source code is available at https://github.com/SchafferZhang/DADA. -------actual title--------- DADA: Deep Adversarial Data Augmentation for Extremely Low Data Regime Classification -------predicted title--------- Deep Analyse of Deep Learning: A Deep Analyse of a Class