Summary
GPT (Generative Pre-trained Transformer) is pre-trained on a large text corpus called BookCorpus, comprising roughly 4.5 GB of text from about 7,000 unpublished books. During pre-training, the model learns language patterns by predicting the next word in a sequence. After pre-training, GPT can be fine-tuned for specific tasks using task-specific datasets, adjusting its parameters for tasks like classification or question answering. OpenAI has iteratively improved the GPT architecture, introducing models such as GPT-2, GPT-3, GPT-3.5, and GPT-4, each trained on larger datasets with increased capacity. The widespread adoption of GPT models has significantly advanced natural language processing in research and industry. Of these, only GPT-2 is freely available on Hugging Face. In this notebook, GPT-2 few-shot learning is first applied to sentiment analysis, question answering, and text summarization. Next, the GPT-2 model is fine-tuned for "style completion", which refers to the ability of the model to generate text in a specific style.
Python functions and data files needed to run this notebook are available via this link.
from transformers import GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline
from transformers import set_seed, Trainer, TrainingArguments, GPT2Model
from torch import tensor, numel # counting number of parameters
from bertviz import model_view # visualize
import torch
import pandas as pd
set_seed(32) # set a seed for reproducibility; unlike BERT encoding, GPT text generation involves random sampling
The pre-training process of GPT involves training the model on a large corpus of text called BookCorpus (4.5 GB of text from 7,000 unpublished books of different genres). During pre-training, the model is trained to predict the next word in a sequence given the previous words. This process is known as language modeling and is used to teach the model to understand the structure and patterns of natural language.
After pre-training, the GPT model can be fine-tuned for a specific task by providing it with a smaller, task-specific dataset. Fine-tuning involves adjusting the parameters of the model to better fit the task at hand. For example, the model can be fine-tuned for tasks such as classification, similarity scoring, or question answering.
The GPT architecture has since been improved and extended by OpenAI with the release of subsequent models such as GPT-2, GPT-3, GPT-3.5, and GPT-4. These models are trained on larger datasets and have larger capacities, so they can generate more complex and coherent text. The GPT models have been widely adopted by researchers and industry practitioners and have contributed to significant advancements in natural language processing tasks.
GPT stands for Generative Pre-trained Transformer:
Generative: an auto-regressive language model; it predicts each token from only one side of the context (the past).
Pre-trained: the decoder is pre-trained on huge corpora of text.
Transformer: the decoder block is taken from the Transformer architecture.
GPT refers to a family of models (GPT-1, GPT-2, GPT-3, GPT-3.5, GPT-4).
OpenAI's public release of the latest GPT-3 and ChatGPT models put the power of large auto-regressive language models into the hands of the masses; anyone can now use these models for their own benefit. A limitation of these models:
They do not take direction; they simply try to finish your sentence in the same style. That is why few-shot learning works better, but we cannot simply ask a question and demand an answer. In January 2022, OpenAI introduced an updated version of GPT-3 called InstructGPT. It works better in many ways, showing a reduction in harmful biases and the ability to take direction from a prompt and answer without trying to finish the thought.
How InstructGPT is trained:
Prompt Engineering
Prompt Engineering is the process of designing input prompts for language model systems like GPT-3 and ChatGPT:
We can influence the output produced by the model and get something more specific and usable by carefully crafting and adjusting prompts.
Prompt Engineering can be used to guide the model to produce more relevant and coherent output for a given task, as the small sketch below illustrates.
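As a tiny illustration, the two prompts below (invented for demonstration) send the same GPT-2 model in very different directions; the second, more structured prompt nudges it toward an answer-style completion:
from transformers import pipeline, set_seed
set_seed(0)
demo_generator = pipeline('text-generation', model='gpt2')
# vague prompt: the model just continues the sentence in an arbitrary direction
print(demo_generator("Tell me about Paris", max_length=30)[0]['generated_text'])
# structured prompt: the Q/A format guides the model toward an answer-style completion
print(demo_generator("Q: What is the capital of France?\nA:", max_length=30)[0]['generated_text'])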
ChatGPT
Attention
The first step to understanding how GPT works is to understand how the attention mechanism works. This mechanism is what makes the Transformer architecture unique and distinct from recurrent approaches to language modeling. Once we have developed a solid understanding of attention, we will see how it is used within Transformer architectures such as GPT. For example, in the sentence below, the next word should be something synonymous with "big".
"The black elephant tried to get into the truck but it was too ..."
Certain other words in the sentence are important for helping us make our decision, while other words are not important at all. In other words, we are paying attention to certain words in the sentence and largely ignoring others. Wouldn't it be great if our model could do the same thing?
An attention mechanism (also known as an attention head) in a Transformer is designed to do exactly this. It is able to decide where in the input it wants to pull information from, in order to extract useful information efficiently without being clouded by irrelevant details. This makes it highly adaptable to a range of circumstances, as it can decide where it wants to look for information at inference time.
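Before diving into GPT itself, here is a minimal sketch of scaled dot-product attention in plain PyTorch (toy dimensions and random weights, purely for illustration, not GPT-2's actual implementation):
import torch
import torch.nn.functional as F
x = torch.randn(1, 5, 8)                                  # (batch, tokens, model dimension) - toy sizes
W_q, W_k, W_v = [torch.nn.Linear(8, 8, bias=False) for _ in range(3)]
q, k, v = W_q(x), W_k(x), W_v(x)                          # queries, keys, values
scores = q @ k.transpose(-2, -1) / (8 ** 0.5)             # how strongly each token "matches" every other token
weights = F.softmax(scores, dim=-1)                       # attention weights; each row sums to 1
context = weights @ v                                     # weighted mixture of value vectors
print(weights.shape, context.shape)                       # (1, 5, 5) and (1, 5, 8)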
GPT is very similar to BERT. GPT uses byte-level tokenization with a vocabulary of just over 50,000 tokens, and the special token <|endoftext|> is appended at the end of a text.
My name is Mehdi ==> ["My", "name", "is", "Mehdi","<|endoftext|>" ]
tokenizer_gpt = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer_gpt("Hi there")['input_ids']
# Having a leading space makes it a different token
tokenizer_gpt(" Hi there")['input_ids']
A space in GPT's tokenizer is treated as part of the token, so the same word is encoded differently with and without a leading space.
GPT has two types of embeddings for tokenized sentences:
1. Word Token Embeddings (WTE)
a. Context-free (context-less) meaning of each token
b. Over 50,000 possible vectors (one per vocabulary entry)
c. Learnable during training
2. Word Position Embedding (WPE)
a. Represents each token's position in the sequence
b. In GPT-2 this is also a learned embedding (unlike the fixed sinusoidal encodings of the original Transformer)
These two are analogous to BERT's token and position embeddings, but BERT has an additional segment embedding (sentence A versus sentence B) that GPT does not.
generator = pipeline('text-generation', model='gpt2') # create a pipeline for text generation using GPT-2
generator("Hello, I am a data scientist and I want to", max_length=30, num_return_sequences=3) # randomness occurs here
These are not great text predictions.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2') # GPT tokenizer
'Mehdi' in tokenizer.get_vocab() # Mehdi is not in gpt vocabulary
GPT-2 is cased by default ("Mehdi" is different from "mehdi").
txt = 'Mehdi loves working out'
tokenizer.convert_ids_to_tokens(tokenizer.encode(txt))
tokenizer.encode(txt)
encoded = tokenizer.encode(txt, return_tensors='pt') # Pytorch format
encoded
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2') # Language modeling head
model
# zoom in on the token embedding
model.transformer.wte(encoded).shape # word token embedding (wte)
Each of the 6 tokens is mapped to a vector of length 768.
# Similar to BERT, we can pass in the token positions, which gives an output of the same shape
model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6)).shape # word position embedding (wpe)
# the word token embeddings (wte) are added to the word position embeddings (wpe)
initial_input = model.transformer.wte(encoded) + model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6))
initial_input.shape
initial_input = model.transformer.drop(initial_input)
initial_input
model.lm_head
# If the input is passed block by block through the model's transformer layers, we should end up with the same hidden states
for module in model.transformer.h:
    initial_input = module(initial_input)[0]
initial_input = model.transformer.ln_f(initial_input)
(initial_input == model(encoded, output_hidden_states=True).hidden_states[-1]).all()
total_params = 0
for param in model.parameters():
    total_params += numel(param)
print(f'Number of params for GPT2 is: {total_params:,}')
Masked Self-Attention: the upper triangle of the attention matrix is masked out so that a token cannot cheat by looking at future tokens.
GPT predicts words one by one.
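A small sketch of that causal mask (illustrative only): positions above the diagonal are blocked before the softmax, so each token can only attend to itself and earlier tokens.
import torch
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))        # 1 = allowed (past/current), 0 = blocked (future)
scores = torch.randn(seq_len, seq_len)                        # toy attention scores
masked_scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = torch.softmax(masked_scores, dim=-1)                # blocked positions get weight 0
print(weights)                                                # each row only puts weight on current and earlier tokens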
phrase = 'Today is a beautiful day. but, It is going to rain!'
encoded_phrase = tokenizer(phrase, return_tensors='pt')
response = model(**encoded_phrase, output_attentions=True, output_hidden_states=True)
len(response.attentions)
# GPT does not have a sense of sentence A or B.
encoded_phrase
# access to attention mechanism
response.attentions[-1].shape # From the final decoder
For this array, the first dimension is the batch size (1), 12 is the number of heads in the final decoder block, and 14 x 14 is the attention matrix over the 14 tokens.
encoded_phrase['input_ids'].shape
We can convert these ids back to tokens:
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0])
tokens
Let's take a look at layer index 6, head 0. Note, for example, the almost 60% attention that the token "it" pays to one of the earlier tokens.
arr = response.attentions[6][0][0]
n_digits = 2
attention_df = pd.DataFrame((torch.round(arr * 10**n_digits) / (10**n_digits)).detach()).applymap(float)
attention_df.columns = tokens
attention_df.index = tokens
attention_df
The token "is" has the highest attention to "Today": 93%! Let's get a better visualization for each layer (12) and each head (12).
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0])
model_view(response.attentions, tokens)
response.hidden_states[-1].shape
response.logits.shape
Logits are the final output of the language modeling head, which applies a feed-forward layer to each of the 14 tokens, producing a score over the vocabulary for each position.
pd.DataFrame(
zip(tokens, tokenizer.convert_ids_to_tokens(response.logits.argmax(2)[0])),
columns=['Sequence', 'Next predicted token with highest probability']
)
generator(phrase, max_length=40, num_return_sequences=1, do_sample=False) # greedy search
generator(phrase, max_length=40, num_return_sequences=1, do_sample=True) # sampling instead of greedy search
Similar to BERT, the authors of GPT needed to pre-train the language model. However, masked language modeling does not make sense because GPT is an auto-regressive model, not an auto-encoder (unlike BERT). Next sentence prediction also does not make sense because we are not trying to understand sequences as a whole; GPT only performs auto-regressive language modeling, so a different type of pre-training is applied.
GPT was pre-trained on a corpora called WebText with 40 Gigabytes of text which is a collection of text data from the Internet. The WebText dataset consists of a wide range of sources, including websites, articles, books, and other publicly available written content. It contains a diverse set of topics and writing styles, allowing the model to learn patterns and information from various domains.
The pre-training process involves training the GPT model to predict the next word in a sequence of words given some context. This task is known as unsupervised learning, as the model does not require specific labels or annotations during training. By training on a large amount of text data, GPT learns the statistical patterns and relationships between words, enabling it to generate coherent and contextually appropriate text.
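As a minimal sketch of this objective with the Hugging Face API (illustrative; the sentence is arbitrary): passing labels equal to the input ids makes GPT-2 compute the next-token cross-entropy loss itself.
from transformers import GPT2Tokenizer, GPT2LMHeadModel
lm_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
lm_model = GPT2LMHeadModel.from_pretrained('gpt2')
batch = lm_tokenizer("Today is a beautiful day", return_tensors='pt')
out = lm_model(**batch, labels=batch['input_ids'])   # labels = inputs -> predict each next token
print(out.loss)                                      # average cross-entropy over the sequence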
However, GPT-2, like any large language model, has the potential to exhibit biases in its output due to the nature of its training data and the patterns it learns during training. Users and developers employing GPT-2 or similar models should be aware of these potential biases and take steps to assess and mitigate them in specific use cases. It's crucial to consider ethical considerations and adopt best practices when deploying AI models, including addressing biases and promoting fairness in their applications.
generator = pipeline('text-generation', model='gpt2', tokenizer=tokenizer)
set_seed(0)
# Bias
generator("Muslim man work during the day as a", max_length=15, num_return_sequences=4, temperature=0.8)
# temperature: Reducing temperature makes it less random
# Bias
generator("The earth would be beautiful without", max_length=15, num_return_sequences=5, temperature=0.5)
From the examples above, we should be careful about and aware of bias when using auto-regressive models, since the pre-training corpora contain biases. We should prevent these biases from propagating into our downstream tasks and decision making.
Zero-shot Learning: a task description is given to the model without any prior example.
One-shot Learning: a task description is given to the model with a single prior example.
Few-shot Learning: a task description is given to the model with as many prior examples as we can fit into the model's context window (1,024 tokens for GPT-2).
Sentiment analysis, question answering, and translation are not part of GPT-2's pre-training, which is purely the auto-regressive task of predicting tokens. GPT-2 does not know how to do these tasks explicitly, but it can figure them out implicitly from a few examples.
print(generator("""Sentiment Analysis
Text: I hate it when my laptop crashes.
Sentiment: Negative
###
Text: My day has been awesome!
Sentiment: Positive
###
Text: I am a couch potato
Sentiment:""", top_k=2, temperature=0.1, max_length=55)[0]['generated_text'])
top_k: The top_k parameter, also known as the "top-k sampling" strategy, is used to limit the number of words considered during the sampling process. When generating text, the model assigns probabilities to each possible next word. By setting the top_k value, the model only considers the k most likely words based on their probabilities. This helps to ensure that the generated text is more focused and coherent.
For example, if top_k is set to 10, the model will only consider the top 10 words with the highest probabilities for each word position, discarding the rest. The actual number of words considered can be less than k if the probability distribution is highly concentrated on a few words.
temperature: The temperature parameter controls the randomness of the generated text. It adjusts the probability distribution during sampling. A higher temperature value, such as 1.0, increases the randomness and diversity of the output. This means that less probable words have a higher chance of being selected, leading to more creative but potentially less coherent or sensible output.
On the other hand, a lower temperature value, such as 0.5, reduces randomness and makes the model more focused and deterministic. In this case, the most probable words have a higher chance of being selected, resulting in more predictable and conservative output.
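A toy sketch of what these two knobs do to a next-token distribution (the logits below are made up, not taken from GPT-2):
import torch
logits = torch.tensor([4.0, 3.5, 2.0, 0.5, -1.0])        # made-up scores for 5 candidate tokens
# temperature: divide the logits before the softmax; <1 sharpens, >1 flattens the distribution
for temp in (0.5, 1.0, 2.0):
    print(temp, torch.softmax(logits / temp, dim=-1))
# top_k: keep only the k highest-scoring tokens, renormalize, and sample among them
k = 2
top_values, top_indices = torch.topk(logits, k)
print(top_indices, torch.softmax(top_values, dim=-1))    # sampling is restricted to these k tokens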
print(generator("""Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company which develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A:""", top_k=2, max_length=215, temperature=0.5)[0]['generated_text'])
Zero-shot does not work as well for sentiment analysis.
print(generator("""Sentiment Analysis
Text: the food was great
Sentiment:""", top_k=2, temperature=0.9, max_length=55)[0]['generated_text'])
Zero-shot learning works better for the question-answering task.
print(generator(
'''Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area.
The Jets compete in the National Football League (NFL) as a member club of the league's American Football
Conference (AFC) East division.
Q: What division do the Jets play in?
A:''',
top_k=2, max_length=80, temperature=0.5)[0]['generated_text']
)
Zero-shot can be used as a summarization approach.
to_summarize = """The exploitation of hydrocarbon reservoirs may potentially lead to contamination of soils, shallow water resources, and greenhouse gas emissions. Fluids such as methane or CO2 may in some cases migrate toward the groundwater zone and atmosphere through and along imperfectly sealed hydrocarbon wells. Field tests in hydrocarbon-producing regions are routinely conducted for detecting serious leakage to prevent environmental pollution. The challenge is that testing is costly, time-consuming, and sometimes labor-intensive. In this study, machine learning approaches were applied to predict serious leakage with uncertainty quantification for wells that have not been field tested in Alberta, Canada. An improved imputation technique was developed by Cholesky factorization of the covariance matrix between features, where missing data are imputed via conditioning of available values. The uncertainty in imputed values was quantified and incorporated into the final prediction to improve decision-making. Next, a wide range of predictive algorithms and various performance metrics were considered to achieve the most reliable classifier. However, a highly skewed distribution of field tests toward the negative class (nonserious leakage) forces predictive models to unrealistically underestimate the minority class (serious leakage). To address this issue, a combination of oversampling, undersampling, and ensemble learning was applied. By investigating all the models on never-before-seen data, an optimum classifier with minimal false negative prediction was determined. The developed methodology can be applied to identify the wells with the highest likelihood for serious fluid leakage within producing fields. This information is of key importance for optimizing field test operations to achieve economic and environmental benefits."""
TL;DR stands for "too long; didn't read". Appending "TL;DR:" to the passage is a simple zero-shot prompt that asks GPT-2 to produce a summary.
print(generator(
f"""Summarization Task:\n{to_summarize}\n\nTL;DR:""",
max_length=512, top_k=2, temperature=0.8, no_repeat_ngram_size=3)[0]['generated_text'])
# no_repeat_ngram_size: prevents the model from repeating the same n-grams over and over
"Style completion" in the context of a Large Language Model (LLM) generally refers to the ability of the model to generate text in a specific style. Large language models like GPT-2 or similar ones can be fine-tuned or used in a controlled manner to produce text that matches a particular writing style. In this section, GPT-2 model will be fine-tuned by our data.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token # Set padding token to end of sequence token
# Example usage
inputs = ["This is an example input", "Another input"]
encoded_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
# Print encoded inputs
print(encoded_inputs)
GPT-2 is fine-tuned on the book Wild Animals I Have Known. The aim is for the model to read through the book multiple times (epochs) so that it can complete prompts about the book in its style.
pds_data = TextDataset(
tokenizer=tokenizer,
# textbook about animal: Title: Wild Animals I Have Known
file_path='wild_animals_book.txt',
block_size=60 # length of each chunk of text to use as a datapoint
)
pds_data
Let's take a look at the tokens of the first example:
pds_data[0], pds_data[0].shape # inspect the first point
Decode the ids to see the elements:
print(tokenizer.decode(pds_data[0]))
After creating our dataset, we need a data collator (DataCollatorForLanguageModeling):
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=False, # MLM is Masked Language Modelling
)
Let's look at what the collator is doing:
collator_example = data_collator([tokenizer('Can I have an input'), tokenizer('yes you can')])
collator_example
collator_example.input_ids # 50256 is our pad token id
tokenizer.pad_token_id
collator_example.attention_mask # Note the 0 in the attention mask where we have a pad token
collator_example.labels # note the -100 to ignore loss calculation for the padded token
# Reminder that labels are shifted *inside* the GPT model so we don't need to worry about that
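A rough sketch of what that internal shift means (an approximation for illustration, not the exact Hugging Face source): the prediction at position i is scored against the label at position i+1, and -100 labels are ignored.
import torch
import torch.nn.functional as F
def causal_lm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) with -100 for ignored positions
    shift_logits = logits[:, :-1, :]      # prediction made at each position...
    shift_labels = labels[:, 1:]          # ...is compared against the *next* token
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)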
model = GPT2LMHeadModel.from_pretrained('gpt2-large') # load up a GPT2 model
pretrained_generator = pipeline(
'text-generation', model=model, tokenizer='gpt2',
config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.8, 'top_k': 50}
)
# 'do_sample': sample from the distribution rather than always taking the most likely token
# 'top_p' and 'top_k': sharpen the prediction and make it less random
# 'temperature': a lower value makes the output more consistent
The example below is text completion without fine-tuning the model:
print('----------')
for generated_sequence in pretrained_generator('The snare is ', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer
A snare is something that looks like a creeper, but it doesn't grow and it's worse than all the hawks in the world
The main aim is that, after reading the book during fine-tuning, the model produces completions close to this answer.
epochs = 2
batch_size = 5
training_args = TrainingArguments(
output_dir="./gpt2_StCp", #The output directory
overwrite_output_dir=True, #overwrite the content of the output directory
num_train_epochs=epochs, # number of training epochs, i.e., reading the book twice
per_device_train_batch_size=batch_size, # batch size for training
per_device_eval_batch_size=batch_size, # batch size for evaluation
warmup_steps=len(pds_data.examples) // 5, # number of warmup steps for learning rate scheduler,
logging_steps=50,
load_best_model_at_end=True,
evaluation_strategy='epoch',
save_strategy='epoch'
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=pds_data.examples[:int(len(pds_data.examples)*.8)], # Training set as first 80%
eval_dataset=pds_data.examples[int(len(pds_data.examples)*.8):] # Test set as second 20%
)
trainer.evaluate()
trainer.train()
trainer.evaluate()
trainer.save_model()
loaded_model = GPT2LMHeadModel.from_pretrained('./gpt2_StCp')
finetuned_generator = pipeline(
'text-generation', model=loaded_model, tokenizer=tokenizer,
config={'max_length': 400, 'do_sample': False, 'top_p': 0.9, 'temperature': 0.8, 'top_k': 50})
Sentence completion below is after reading the book twice (2 epochs) and fine-tuning the parameters of GPT-2:
print('----------')
for generated_sequence in finetuned_generator('Molly said snares are', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer
A snare is something that looks like a creeper, but it doesn't grow and it's worse than all the hawks in the world, said Molly, glancing at the now far-away red-tail, "for there it hides night and day in the runway till the chance to catch you comes.
print('----------')
for generated_sequence in finetuned_generator('The animals that typically use the well-marked trails at Antelope Springs for drinking are', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer:
print('----------')
for generated_sequence in finetuned_generator('Spotting the first corral and ranch-house, the man is pleased, but the Mustang', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
Answer: