Summary
GPT (Generative Pre-trained Transformer) is pre-trained on a large text corpus called BookCorpus, comprising roughly 4.5 GB of text from about 7,000 unpublished books. During pre-training, the model learns language patterns by predicting the next word in a sequence. After pre-training, GPT can be fine-tuned for specific tasks using task-specific datasets, adjusting its parameters for tasks like classification or question answering. OpenAI has iteratively improved the GPT architecture, introducing models such as GPT-2, GPT-3, GPT-3.5, and GPT-4, each trained on larger datasets with increased capacity. The widespread adoption of GPT models has significantly advanced natural language processing in research and industry. Of these, only GPT-2 is freely available on Hugging Face. In this notebook, GPT-2 few-shot learning is first applied to sentiment analysis, question answering, and text summarization. Next, the GPT-2 model is fine-tuned for "style completion", which refers to the ability of the model to generate text in a specific style.
Python functions and data files needed to run this notebook are available via this link.
from transformers import GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline
from transformers import set_seed, Trainer, TrainingArguments, GPT2Model
from torch import tensor, numel # counting number of parameters
from bertviz import model_view # visualize
import torch
import pandas as pd
set_seed(32) # set a seed for reproducibility; unlike BERT encoding, GPT text generation involves random sampling
The pre-training process of GPT involves training the model on a large corpus of text called BookCorpus (4.5 GB of text from 7,000 unpublished books of different genres). During pre-training, the model is trained to predict the next word in a sequence given the previous words. This process is known as language modeling and is used to teach the model to understand the structure and patterns of natural language.
After pre-training, the GPT model can be fine-tuned for a specific task by providing it with a smaller, task-specific dataset. Fine-tuning involves adjusting the parameters of the model to better fit the task at hand. For example, the model can be fine-tuned for tasks such as classification, similarity scoring, or question answering.
The GPT architecture has since been improved and extended by OpenAI with the release of subsequent models such as GPT-2, GPT-3, GPT-3.5, and GPT-4. These models are trained on larger datasets and have larger capacities, so they can generate more complex and coherent text. The GPT models have been widely adopted by researchers and industry practitioners and have contributed to significant advancements in natural language processing tasks.
GPT stands for Generative Pre-trained Transformer:
Generative: an auto-regressive language model; it predicts each token from only one side of the context (the past).
Pre-trained: the decoder is pre-trained on huge corpora of text.
Transformer: the decoder block is taken from the Transformer architecture.
GPT refers to a family of models (GPT-1, GPT-2, GPT-3, GPT-3.5, GPT-4).
OpenAI's public release of the latest GPT-3 and ChatGPT models put the power of large auto-regressive language models into the hands of the masses; anyone can now use these models for their own benefit. A limitation of these models:
They do not take direction; they simply try to finish your sentence in the same style. That is why few-shot learning works better, but we cannot simply ask a question and demand an answer. In January 2022, OpenAI introduced an updated version of GPT-3 called InstructGPT. It works better in many ways, showing a reduction in harmful biases and the ability to take direction from a prompt and answer without trying to finish the thought.
How InstructGPT is trained:
Prompt Engineering
Prompt Engineering is the process of designing input prompts for language model systems like GPT-3 and ChatGPT:
We can influence the output produced by the model and get something more specific and usable by carefully crafting and adjusting prompts.
Prompt Engineering can be used to guide the model to produce more relevant and coherent output for a given task, as the small sketch below illustrates.
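As a tiny illustration, the two prompts below (invented for demonstration) send the same GPT-2 model in very different directions; the second, more structured prompt nudges it toward an answer-style completion:
from transformers import pipeline, set_seed
set_seed(0)
demo_generator = pipeline('text-generation', model='gpt2')
# vague prompt: the model just continues the sentence in an arbitrary direction
print(demo_generator("Tell me about Paris", max_length=30)[0]['generated_text'])
# structured prompt: the Q/A format guides the model toward an answer-style completion
print(demo_generator("Q: What is the capital of France?\nA:", max_length=30)[0]['generated_text'])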
ChatGPT
Attention
The first step to understanding how GPT works is to understand how the attention mechanism works. This mechanism is what makes the Transformer architecture unique and distinct from recurrent approaches to language modeling. Once we have developed a solid understanding of attention, we will see how it is used within Transformer architectures such as GPT. For example, in the sentence below, the next word should be something synonymous with "big".
"The black elephant tried to get into the truck but it was too ..."
Certain other words in the sentence are important for helping us make our decision, while other words are not important at all. In other words, we are paying attention to certain words in the sentence and largely ignoring others. Wouldn't it be great if our model could do the same thing?
An attention mechanism (also known as an attention head) in a Transformer is designed to do exactly this. It is able to decide where in the input it wants to pull information from, in order to extract useful information efficiently without being clouded by irrelevant details. This makes it highly adaptable to a range of circumstances, as it can decide where it wants to look for information at inference time.
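Before diving into GPT itself, here is a minimal sketch of scaled dot-product attention in plain PyTorch (toy dimensions and random weights, purely for illustration, not GPT-2's actual implementation):
import torch
import torch.nn.functional as F
x = torch.randn(1, 5, 8)                                  # (batch, tokens, model dimension) - toy sizes
W_q, W_k, W_v = [torch.nn.Linear(8, 8, bias=False) for _ in range(3)]
q, k, v = W_q(x), W_k(x), W_v(x)                          # queries, keys, values
scores = q @ k.transpose(-2, -1) / (8 ** 0.5)             # how strongly each token "matches" every other token
weights = F.softmax(scores, dim=-1)                       # attention weights; each row sums to 1
context = weights @ v                                     # weighted mixture of value vectors
print(weights.shape, context.shape)                       # (1, 5, 5) and (1, 5, 8)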
GPT is very similar to BERT. GPT uses byte-level tokenization with a vocabulary of just over 50,000 tokens, and the special token <|endoftext|> is appended at the end of a text.
My name is Mehdi ==> ["My", "name", "is", "Mehdi","<|endoftext|>" ]
tokenizer_gpt = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer_gpt("Hi there")['input_ids']
# Having a leading space makes it a different token
tokenizer_gpt(" Hi there")['input_ids']
A space in GPT's tokenizer is treated as part of the token, so the same word is encoded differently with and without a leading space.
GPT has two types of embeddings for tokenized sentences:
1. Word Token Embeddings (WTE)
a. Context-free (context-less) meaning of each token
b. Over 50,000 possible vectors (one per vocabulary entry)
c. Learnable during training
2. Word Position Embedding (WPE)
a. Represents each token's position in the sequence
b. In GPT-2 this is also a learned embedding (unlike the fixed sinusoidal encodings of the original Transformer)
These two are analogous to BERT's token and position embeddings, but BERT has an additional segment embedding (sentence A versus sentence B) that GPT does not.
generator = pipeline('text-generation', model='gpt2') # create a pipeline for text generation using GPT-2
generator("Hello, I am a data scientist and I want to", max_length=30, num_return_sequences=3) # randomness occurs here
These are not great text predictions.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2') # GPT tokenizer
'Mehdi' in tokenizer.get_vocab() # Mehdi is not in gpt vocabulary
GPT-2 is cased by default ("Mehdi" is different from "mehdi").
txt = 'Mehdi loves working out'
tokenizer.convert_ids_to_tokens(tokenizer.encode(txt))
tokenizer.encode(txt)
encoded = tokenizer.encode(txt, return_tensors='pt') # Pytorch format
encoded
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2') # Language modeling head
model
# zoom in on the token embedding
model.transformer.wte(encoded).shape # word token embedding (wte)
Each of the 6 tokens is mapped to a vector of length 768.
# Similar to BERT, we can pass in the token positions, which gives an output of the same shape
model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6)).shape # word position embedding (wpe)
# the word token embeddings (wte) are added to the word position embeddings (wpe)
initial_input = model.transformer.wte(encoded) + model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6))
initial_input.shape
initial_input = model.transformer.drop(initial_input)
initial_input
model.lm_head
# If the input is passed block by block through the model's transformer layers, we should end up with the same hidden states
for module in model.transformer.h:
    initial_input = module(initial_input)[0]
initial_input = model.transformer.ln_f(initial_input)
(initial_input == model(encoded, output_hidden_states=True).hidden_states[-1]).all()
total_params = 0
for param in model.parameters():
    total_params += numel(param)
print(f'Number of params for GPT2 is: {total_params:,}')
Masked Self-Attention: the upper triangle of the attention matrix is masked out so that a token cannot cheat by looking at future tokens.
GPT predicts words one by one.
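A small sketch of that causal mask (illustrative only): positions above the diagonal are blocked before the softmax, so each token can only attend to itself and earlier tokens.
import torch
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))        # 1 = allowed (past/current), 0 = blocked (future)
scores = torch.randn(seq_len, seq_len)                        # toy attention scores
masked_scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = torch.softmax(masked_scores, dim=-1)                # blocked positions get weight 0
print(weights)                                                # each row only puts weight on current and earlier tokens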
phrase = 'Today is a beautiful day. but, It is going to rain!'
encoded_phrase = tokenizer(phrase, return_tensors='pt')
response = model(**encoded_phrase, output_attentions=True, output_hidden_states=True)
len(response.attentions)
# GPT does not have a sense of sentence A or B.
encoded_phrase
# access to attention mechanism
response.attentions[-1].shape # From the final decoder
For this array, the first dimension is the batch size (1), 12 is the number of heads in the final decoder block, and 14 x 14 is the attention matrix over the 14 tokens.
encoded_phrase['input_ids'].shape
We can convert these ids back to tokens:
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0])
tokens
Let's take a look at layer index 6, head 0. Note, for example, the almost 60% attention that the token "it" pays to one of the earlier tokens.
arr = response.attentions[6][0][0]
n_digits = 2
attention_df = pd.DataFrame((torch.round(arr * 10**n_digits) / (10**n_digits)).detach()).applymap(float)
attention_df.columns = tokens
attention_df.index = tokens
attention_df
The token "is" has the highest attention to "Today": 93%! Let's get a better visualization for each layer (12) and each head (12).
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0])
model_view(response.attentions, tokens)
response.hidden_states[-1].shape
response.logits.shape
Logits are the final output of the language modeling head, which applies a feed-forward layer to each of the 14 tokens, producing a score over the vocabulary for each position.
pd.DataFrame(
zip(tokens, tokenizer.convert_ids_to_tokens(response.logits.argmax(2)[0])),
columns=['Sequence', 'Next predicted token with highest probability']
)
generator(phrase, max_length=40, num_return_sequences=1, do_sample=False) # greedy search
generator(phrase, max_length=40, num_return_sequences=1, do_sample=True) # sampling instead of greedy search
Similar to BERT, the authors of GPT needed to pre-train the language model. However, masked language modeling does not make sense because GPT is an auto-regressive model, not an auto-encoder (unlike BERT). Next sentence prediction also does not make sense because we are not trying to understand sequences as a whole; GPT only performs auto-regressive language modeling, so a different type of pre-training is applied.
GPT was pre-trained on a corpora called WebText with 40 Gigabytes of text which is a collection of text data from the Internet. The WebText dataset consists of a wide range of sources, including websites, articles, books, and other publicly available written content. It contains a diverse set of topics and writing styles, allowing the model to learn patterns and information from various domains.
The pre-training process involves training the GPT model to predict the next word in a sequence of words given some context. This task is known as unsupervised learning, as the model does not require specific labels or annotations during training. By training on a large amount of text data, GPT learns the statistical patterns and relationships between words, enabling it to generate coherent and contextually appropriate text.
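As a minimal sketch of this objective with the Hugging Face API (illustrative; the sentence is arbitrary): passing labels equal to the input ids makes GPT-2 compute the next-token cross-entropy loss itself.
from transformers import GPT2Tokenizer, GPT2LMHeadModel
lm_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
lm_model = GPT2LMHeadModel.from_pretrained('gpt2')
batch = lm_tokenizer("Today is a beautiful day", return_tensors='pt')
out = lm_model(**batch, labels=batch['input_ids'])   # labels = inputs -> predict each next token
print(out.loss)                                      # average cross-entropy over the sequence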
However, GPT-2, like any large language model, has the potential to exhibit biases in its output due to the nature of its training data and the patterns it learns during training. Users and developers employing GPT-2 or similar models should be aware of these potential biases and take steps to assess and mitigate them in specific use cases. It's crucial to consider ethical considerations and adopt best practices when deploying AI models, including addressing biases and promoting fairness in their applications.
generator = pipeline('text-generation', model='gpt2', tokenizer=tokenizer)
set_seed(0)
# Bias
generator("Muslim man work during the day as a", max_length=15, num_return_sequences=4, temperature=0.8)
# temperature: Reducing temperature makes it less random
# Bias
generator("The earth would be beautiful without", max_length=15, num_return_sequences=5, temperature=0.5)
From the examples above, we should be careful about and aware of bias when using auto-regressive models, since the pre-training corpora contain biases. We should prevent these biases from propagating into our downstream tasks and decision making.
Zero-shot Learning: a task description is given to the model without any prior example.
One-shot Learning: a task description is given to the model with a single prior example.
Few-shot Learning: a task description is given to the model with as many prior examples as we can fit into the model's context window (1,024 tokens for GPT-2).
Sentiment analysis, question answering, and translation are not part of GPT-2's pre-training, which is purely the auto-regressive task of predicting tokens. GPT-2 does not know how to do these tasks explicitly, but it can figure them out implicitly from a few examples.
print(generator("""Sentiment Analysis
Text: I hate it when my laptop crashes.
Sentiment: Negative
###
Text: My day has been awesome!
Sentiment: Positive
###
Text: I am a couch potato
Sentiment:""", top_k=2, temperature=0.1, max_length=55)[0]['generated_text'])
top_k: The top_k parameter, also known as the "top-k sampling" strategy, is used to limit the number of words considered during the sampling process. When generating text, the model assigns probabilities to each possible next word. By setting the top_k value, the model only considers the k most likely words based on their probabilities. This helps to ensure that the generated text is more focused and coherent.
For example, if top_k is set to 10, the model will only consider the top 10 words with the highest probabilities for each word position, discarding the rest. The actual number of words considered can be less than k if the probability distribution is highly concentrated on a few words.
temperature: The temperature parameter controls the randomness of the generated text. It adjusts the probability distribution during sampling. A higher temperature value, such as 1.0, increases the randomness and diversity of the output. This means that less probable words have a higher chance of being selected, leading to more creative but potentially less coherent or sensible output.
On the other hand, a lower temperature value, such as 0.5, reduces randomness and makes the model more focused and deterministic. In this case, the most probable words have a higher chance of being selected, resulting in more predictable and conservative output.
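A toy sketch of what these two knobs do to a next-token distribution (the logits below are made up, not taken from GPT-2):
import torch
logits = torch.tensor([4.0, 3.5, 2.0, 0.5, -1.0])        # made-up scores for 5 candidate tokens
# temperature: divide the logits before the softmax; <1 sharpens, >1 flattens the distribution
for temp in (0.5, 1.0, 2.0):
    print(temp, torch.softmax(logits / temp, dim=-1))
# top_k: keep only the k highest-scoring tokens, renormalize, and sample among them
k = 2
top_values, top_indices = torch.topk(logits, k)
print(top_indices, torch.softmax(top_values, dim=-1))    # sampling is restricted to these k tokens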
print(generator("""Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company which develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A:""", top_k=2, max_length=215, temperature=0.5)[0]['generated_text'])
Zero-shot does not work as well for sentiment analysis.
print(generator("""Sentiment Analysis
Text: the food was great
Sentiment:""", top_k=2, temperature=0.9, max_length=55)[0]['generated_text'])
Zero-shot learning works better for the question-answering task.
print(generator(
'''Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area.
The Jets compete in the National Football League (NFL) as a member club of the league's American Football
Conference (AFC) East division.
Q: What division do the Jets play in?
A:''',
top_k=2, max_length=80, temperature=0.5)[0]['generated_text']
)
Zero-shot can be used as a summarization approach.
to_summarize = """The exploitation of hydrocarbon reservoirs may potentially lead to contamination of soils, shallow water resources, and greenhouse gas emissions. Fluids such as methane or CO2 may in some cases migrate toward the groundwater zone and atmosphere through and along imperfectly sealed hydrocarbon wells. Field tests in hydrocarbon-producing regions are routinely conducted for detecting serious leakage to prevent environmental pollution. The challenge is that testing is costly, time-consuming, and sometimes labor-intensive. In this study, machine learning approaches were applied to predict serious leakage with uncertainty quantification for wells that have not been field tested in Alberta, Canada. An improved imputation technique was developed by Cholesky factorization of the covariance matrix between features, where missing data are imputed via conditioning of available values. The uncertainty in imputed values was quantified and incorporated into the final prediction to improve decision-making. Next, a wide range of predictive algorithms and various performance metrics were considered to achieve the most reliable classifier. However, a highly skewed distribution of field tests toward the negative class (nonserious leakage) forces predictive models to unrealistically underestimate the minority class (serious leakage). To address this issue, a combination of oversampling, undersampling, and ensemble learning was applied. By investigating all the models on never-before-seen data, an optimum classifier with minimal false negative prediction was determined. The developed methodology can be applied to identify the wells with the highest likelihood for serious fluid leakage within producing fields. This information is of key importance for optimizing field test operations to achieve economic and environmental benefits."""
TL;DR stands for "too long; didn't read". Appending "TL;DR:" to the passage is a simple zero-shot prompt that asks GPT-2 to produce a summary.
print(generator(
f"""Summarization Task:\n{to_summarize}\n\nTL;DR:""",
max_length=512, top_k=2, temperature=0.8, no_repeat_ngram_size=3)[0]['generated_text'])
# no_repeat_ngram_size: prevents the model from repeating the same n-grams over and over
"Style completion" in the context of a Large Language Model (LLM) generally refers to the ability of the model to generate text in a specific style. Large language models like GPT-2 or similar ones can be fine-tuned or used in a controlled manner to produce text that matches a particular writing style. In this section, GPT-2 model will be fine-tuned by our data.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token # Set padding token to end of sequence token
# Example usage
inputs = ["This is an example input", "Another input"]
encoded_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
# Print encoded inputs
print(encoded_inputs)
GPT-2 is fine-tuned on the book Wild Animals I Have Known. The aim is for the model to read through the book multiple times (epochs) so that it can complete prompts about the book in its style.
pds_data = TextDataset(
tokenizer=tokenizer,
# textbook about animal: Title: Wild Animals I Have Known
file_path='wild_animals_book.txt',
block_size=60 # length of each chunk of text to use as a datapoint
)
pds_data
Let's take a look at the tokens of the first example:
pds_data[0], pds_data[0].shape # inspect the first point
Decode the ids to see the elements:
print(tokenizer.decode(pds_data[0]))
After creating our dataset, we need a data collator (DataCollatorForLanguageModeling):
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=False, # MLM is Masked Language Modelling
)
Let's look at what the collator is doing:
collator_example = data_collator([tokenizer('Can I have an input'), tokenizer('yes you can')])
collator_example
collator_example.input_ids # 50256 is our pad token id
tokenizer.pad_token_id
collator_example.attention_mask # Note the 0 in the attention mask where we have a pad token
collator_example.labels # note the -100 to ignore loss calculation for the padded token
# Reminder that labels are shifted *inside* the GPT model so we don't need to worry about that
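A rough sketch of what that internal shift means (an approximation for illustration, not the exact Hugging Face source): the prediction at position i is scored against the label at position i+1, and -100 labels are ignored.
import torch
import torch.nn.functional as F
def causal_lm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) with -100 for ignored positions
    shift_logits = logits[:, :-1, :]      # prediction made at each position...
    shift_labels = labels[:, 1:]          # ...is compared against the *next* token
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)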
model = GPT2LMHeadModel.from_pretrained('gpt2-large') # load up a GPT2 model
pretrained_generator = pipeline(
'text-generation', model=model, tokenizer='gpt2',
config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.8, 'top_k': 50}
)
# 'do_sample': sample from the distribution rather than always taking the most likely token
# 'top_p' and 'top_k': sharpen the prediction and make it less random
# 'temperature': a lower value makes the output more consistent
The example below is text completion without fine-tuning the model:
print('----------')
for generated_sequence in pretrained_generator('The snare is ', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer
A snare is something that looks like a creeper, but it doesn't grow and it's worse than all the hawks in the world
The main aim is that, after reading the book during fine-tuning, the model produces completions close to this answer.
epochs = 2
batch_size = 5
training_args = TrainingArguments(
output_dir="./gpt2_StCp", #The output directory
overwrite_output_dir=True, #overwrite the content of the output directory
num_train_epochs=epochs, # number of training epochs, i.e., reading the book twice
per_device_train_batch_size=batch_size, # batch size for training
per_device_eval_batch_size=batch_size, # batch size for evaluation
warmup_steps=len(pds_data.examples) // 5, # number of warmup steps for learning rate scheduler,
logging_steps=50,
load_best_model_at_end=True,
evaluation_strategy='epoch',
save_strategy='epoch'
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=pds_data.examples[:int(len(pds_data.examples)*.8)], # Training set as first 80%
eval_dataset=pds_data.examples[int(len(pds_data.examples)*.8):] # Test set as second 20%
)
trainer.evaluate()
trainer.train()
trainer.evaluate()
trainer.save_model()
loaded_model = GPT2LMHeadModel.from_pretrained('./gpt2_StCp')
finetuned_generator = pipeline(
'text-generation', model=loaded_model, tokenizer=tokenizer,
config={'max_length': 400, 'do_sample': False, 'top_p': 0.9, 'temperature': 0.8, 'top_k': 50})
Sentence completion below is after reading the book twice (2 epochs) and fine-tuning the parameters of GPT-2:
print('----------')
for generated_sequence in finetuned_generator('Molly said snares are', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer
A snare is something that looks like a creeper, but it doesn't grow and it's worse than all the hawks in the world, said Molly, glancing at the now far-away red-tail, "for there it hides night and day in the runway till the chance to catch you comes.
print('----------')
for generated_sequence in finetuned_generator('The animals that typically use the well-marked trails at Antelope Springs for drinking are', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer:
print('----------')
for generated_sequence in finetuned_generator('Spotting the first corral and ranch-house, the man is pleased, but the Mustang', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer: