Summary

GPT undergoes pre-training by exposing the model to a substantial text corpus called BookCorpus, comprising 4.5 GB of text sourced from 7,000 unpublished books spanning various genres. In this pre-training phase, the model learns to predict the next word in a sequence based on preceding words, a process known as language modeling. This helps impart an understanding of the inherent structure and patterns of natural language. Following pre-training, GPT can be fine-tuned for specific tasks using a more focused, task-specific dataset. Fine-tuning involves adjusting the model's parameters to enhance its suitability for the given task, such as classification, similarity scoring, or question answering. Over time, OpenAI has enhanced and expanded the GPT architecture, introducing subsequent models like GPT-2, GPT-3, GPT-3.5, and GPT-4. These newer models are trained on larger datasets and boast increased capacities, enabling them to generate more intricate and coherent text. The widespread adoption of GPT models by both researchers and industry practitioners has played a pivotal role in advancing natural language processing tasks significantly. Only GPT-2 is freely available on HuggingFace. In this notebook, distilgpt2 model (more efficient version of the GPT-2) was fine-tuned to predict movie genre prediction.

Python functions and data files needed to run this notebook are available via this link.

In [1]:
import warnings
warnings.filterwarnings('ignore')

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline, \
                         GPT2Tokenizer
import re
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np
import openai
import matplotlib as mpl
from sklearn.metrics import accuracy_score
mpl.rcParams.update(mpl.rcParamsDefault)
from matplotlib.offsetbox import AnchoredText
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
from nltk.tokenize import TweetTokenizer
import matplotlib.ticker as ticker
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from datasets import Dataset

Introduction

GPT stands for Generative Pre-trained Transformers:

  • Generative: from Auto-regressive Language Model. It signifies predicting tokens with one side of the context, past context.

  • Pre-trained: decoders are trained on huge corpora of data.

  • Transformers: The decoder is taken from the transformer architecture.

GPT refers to a family of models:

  • GPT-1 released in 2018: 0.117B parameters
  • GPT-2 released in 2019: 1.5B parameters
  • GPT-3 released in 2020: 175B parameters

OpenAI's public release of the latest GPT-3 and ChatGPT models has democratized access to powerful autoregressive language models. Now, individuals can harness these models for various applications. However, there are certain limitations:

These models lack the ability to take explicit direction; instead, their focus lies in completing sentences in a consistent style. While few-shot learning has proven effective, the models struggle when asked direct questions with a demand for specific answers. In response to these challenges, OpenAI introduced an updated iteration of GPT-3 in January 2022 called InstructGPT. This enhanced version exhibits improvements in various aspects, including a reduction in harmful biases and the capability to follow directions from prompts, providing answers without fixating on completing the given thought.

How InstructGPT is trained:

  • Fine-tuned using reinforcement learning from human feedback (RLHF)
  • OpenAI used a crowdsourced team to label instructions based prompts and outputs
  • OpenAI trained a reward model on this dataset to detect the output that the crowd preferred
  • The reward model was used to optimize the GPT-3 policy using a proximal policy optimization (PPO) algorithm to make outputs more inline with what human were expecting.

Prompt Engineering

Prompt Engineering is the process of designing input prompts for language model system like GPT-3 and ChatGPT:

  • We can influence the output produced by the model to get something more specific and usable by carefully crafting and adjusting prompts.

  • Prompt Engineering can be used to guide the model to produce more relevant and coherent output for a given task.

Alignment in Language Models To understand why prompt engineering is crucial to LLM-application development, we first have to understand not only how LLMs are trained, but how they are aligned to human input. Alignment in language models refers to how the model understands and responds to input prompts that are ā€œin line withā€ (at least according to the people in charge of aligning the LLM) what the user expected. In standard language modeling, a model is trained to predict the next word or sequence of words based on the context of the preceding words. However, this approach alone does not allow for specific instructions or prompts to be answered by the model, which can limit its usefulness for certain applications.

Prompt engineering can be challenging if the language model has not been aligned with the prompts, as it may generate irrelevant or incorrect responses. However, some language models have been developed with extra alignment features, such as Constitutional AI-driven Reinforcement Learning from AI Feedback (RLAIF) from Anthropic or Reinforcement Learning from Human Feedback (RLHF) in OpenAIā€™s GPT series, which can incorporate explicit instructions and feedback into the modelā€™s training. These alignment techniques can improve the modelā€™s ability to understand and respond to specific prompts, making them more useful for applications such as question-answering or language translation:

ChatGPT

  • Trained similarly to InstructGPT, ChatGPT is a newer implementation of GPT from OpenAI that uses a conversational UI (user interface).

Movie Genre Data Set

The data set is movie-genre-prediction retrieved from kaggle.

Objective of data set

The goal of this competition is to design a predictive model that accurately classifies movies into their respective genres based on their titles and synopses. The challenge lies not just in achieving high accuracy, but also in ensuring that the model is efficient and interpretable.

Why This is Interesting and Relevant

Understanding movie genres based on titles and synopses is a fascinating problem for multiple reasons.. From a recommendation system perspective, an effective genre classifier can help build more personalized user recommendations, increasing user engagement on streaming platforms. In the context of box office performance, understanding the relationship between genres and how they are perceived in synopses can provide insight into patterns of commercial success or failure. Furthermore, this challenge can facilitate a deeper comprehension of movie themes and trends in the industry, contributing to cultural and societal studies.

Participants will be provided with a comprehensive dataset comprising ~100,000 movies. Each entry includes the original title, the genre(s), and the synopsis of the movie.

The dataset contains a mix of both original and AI-generated titles, genres, and synopses to test the robustness of the models.

The 10 genres include action, adventure, crime, family, fantasy, horror, mystery, romance, scifi, and thriller.

In [2]:
data_movie_genre = pd.read_csv('movie_genre.csv')

# Since the data set is big for fine-tunning, we select 2000 samples
n_sample = 40000
movie_genre = data_movie_genre[:n_sample]
Hold_out = data_movie_genre[~data_movie_genre.index.isin(movie_genre.index)][:int(n_sample*0.1)].reset_index(drop=True)
movie_genre = movie_genre.reset_index(drop=True)
movie_genre.head()
Out[2]:
id movie_name synopsis genre
0 44978 Super Me A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich. fantasy
1 50185 Entity Project A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular. horror
2 34131 Behavioral Family Therapy for Serious Psychiatric Disorders This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses. family
3 78522 Blood Glacier Scientists working in the Austrian Alps discover that a glacier is leaking a liquid that appears to be affecting local wildlife. scifi
4 2206 Apat na anino Buy Day - Four Men Widely - Apart in Life - By Night Shadows United in One Fight Venting the Fire of their Fury Against the Hated Oppressors. action
In [3]:
print(f'Shape of training set is {movie_genre.shape}')
print(f'Shape of holdout set is {Hold_out.shape}')
Shape of training set is (40000, 4)
Shape of holdout set is (4000, 4)
In [ ]:
 
In [4]:
def histplt (val: list,bins: int,title: str,xlabl: str,ylabl: str,xlimt: list,
             ylimt: list=False, loc: int =1,legend: int=1,axt=None,days: int=False,
             class_: int=False,scale: int=1,x_tick: list=False, calc_perc: bool= True,
             nsplit: int=1,font: int=5,color: str='b') -> [float] :
    
    """ Histogram including important statistics """
    
    ax1 = axt or plt.axes()
    font = {'size'   : font }
    plt.rc('font', **font) 
    
    miss_n = len(val[np.isnan(val)])
    tot = len(val)
    n_distinct = len(np.unique(val))
    miss_p = (len(val[np.isnan(val)])/tot)*100
    val = val[~np.isnan(val)]        
    val = np.array(val)
    plt.hist(val, bins=bins, weights=np.ones(len(val)) / len(val),ec='black',color=color)
    n_nonmis = len(val[~np.isnan(val)])
    if class_: 
        times = 100
    else:
        times = 1 
    Mean = np.nanmean(val)*times
    Median = np.nanmedian(val)*times
    sd = np.sqrt(np.nanvar(val))
    Max = np.nanmax(val)
    Min = np.nanmin(val)
    p1 = np.quantile(val, 0.01)
    p25 = np.quantile(val, 0.25)
    p75 = np.quantile(val, 0.75)
    p99 = np.quantile(val, 0.99)
    
    if calc_perc == True:
        txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nĻƒ=%0.1f\np1%%=%0.1f\np99%%=%0.1f\nMin=%0.1f\nMax=%0.1f'       
        anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,p1,p99,Min,Max), borderpad=0, 
                                     loc=loc,prop={ 'size': font['size']*scale})    
    else:
        txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nĻƒ=%0.1f\nMin=%0.1f\nMax=%0.1f'       
        anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,Min,Max), borderpad=0, 
                                     loc=loc,prop={ 'size': font['size']*scale})                
        
    if(legend==1): ax1.add_artist(anchored_text)
    if (scale): plt.title(title,fontsize=font['size']*(scale+0.15))
    else:       plt.title(title)
    plt.xlabel(xlabl,fontsize=font['size']) 
    ax1.set_ylabel('Frequency',fontsize=font['size'])
    if (scale): ax1.set_xlabel(xlabl,fontsize=font['size']*scale)
    else:       ax1.set_xlabel(xlabl)
    try:
        xlabl
    except NameError:
        pass    
    else:
        if (scale): plt.xlabel(xlabl,fontsize=font['size']*scale) 
        else:        plt.xlabel(xlabl)   
    try:
        ylabl
    except NameError:
        pass      
    else:
        if (scale): plt.ylabel(ylabl,fontsize=font['size']*scale)  
        else:         plt.ylabel(ylabl)  
        
    if (class_==True): plt.xticks([0,1])
    plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    ax1.grid(linewidth='0.1')
    try:
        xlimt
    except NameError:
        pass  
    else:
        plt.xlim(xlimt) 
    try:
        ylimt
    except NameError:
        pass  
    else:
        plt.ylim(ylimt)  
        
    if x_tick: plt.xticks(x_tick,fontsize=font['size']*scale)    
    plt.yticks(fontsize=font['size']*scale)  
    plt.grid(linewidth='0.12')
    
    # Interquartile Range Method for outlier detection
    iqr = p75 - p25
    
    # calculate the outlier cutoff
    cut_off = np.array(iqr) * 1.5
    lower, upper = p25 - cut_off, p75 + cut_off        
            
    
    return tot, n_nonmis, n_distinct, miss_n, miss_p, Mean, Median, sd, Max, Min, p1, p25, p75, p99, sd


#############################################################################################################

def bargraph(val_ob: [list],  title: str, ylabel: str, titlefontsize: int=10, xfontsize: int=5,scale: int=1, 
             yfontsize: int=8, select: bool= False, fontsizelable: bool= False, xshift: float=-0.1, nsim: int=False
             ,yshift: float=0.01,percent: bool=False, xlim: list=False, axt: bool=None, color: str='b',sort=True,
             ylim: list=False, y_rot: int=0, ytick_rot: int=90, graph_float: int=1, 
            loc: int =1,legend: int=1) -> None:
    
    """ vertical bargraph """
    
    ax1 = axt or plt.axes()

    tot = len(val_ob)
    miss_p_ob = (len(val_ob[pd.isnull(val_ob)])/tot)*100        
    n_nonmis_ob = len(val_ob[~pd.isnull(val_ob)])    
    con = np.array(val_ob.value_counts())
    len_ = len(con)
    #if len_ > 10: len_ = 10
    cats = list(val_ob.value_counts().keys())
    val_ob = con[:len_]
    clmns = cats[:len_]
    # Sort counts
    if sort:
        sort_score = sorted(zip(val_ob,clmns), reverse=True)
        Clmns_sort = [sort_score[i][1] for i in range(len(clmns))]
        sort_score = [sort_score[i][0] for i in range(len(clmns))]              
    else:
        Clmns_sort = clmns
        sort_score = val_ob
    index1 = np.arange(len(clmns))
    if (select):
        Clmns_sort=Clmns_sort[:select]
        sort_score=sort_score[:select]
    ax1.bar(Clmns_sort, sort_score, width=0.6, align='center', alpha=1, edgecolor='k', capsize=4,color=color)
    plt.title(title,fontsize=titlefontsize)
    ax1.set_ylabel(ylabel,fontsize=yfontsize)
    ax1.set_xticks(np.arange(len(Clmns_sort)))
    
    ax1.set_xticklabels(Clmns_sort,fontsize=xfontsize, rotation=ytick_rot,y=0.02)   
    if (percent): plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2) 
    if (xlim): plt.xlim(xlim)
    if (ylim): plt.ylim(ylim)
    if (fontsizelable):
        for ii in range(len(sort_score)):
            if (percent):
                plt.text(xshift+ii, sort_score[ii]+yshift,f'{"{0:.2f}".format(sort_score[ii]*100)}%',
                fontsize=fontsizelable,rotation=y_rot,color='k')     
            else:
                plt.text(xshift+ii, sort_score[ii]+yshift,f'{np.round(sort_score[ii],graph_float)}',
                    fontsize=fontsizelable,rotation=y_rot,color='k')                                 
    dic_Clmns = {}
    for i in range(len(Clmns_sort)):
        dic_Clmns[Clmns_sort[i]]=sort_score[i]
        
    txt = 'n (not missing)=%.0f\nMissing=%.1f%%'       
    anchored_text = AnchoredText(txt %(n_nonmis_ob,miss_p_ob), borderpad=0, 
                                 loc=loc)    
    if(legend==1): ax1.add_artist(anchored_text) 
In [5]:
font = {'size'   : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')

fig, ax = plt.subplots(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k') 

ax1 = plt.subplot(1,2,1) 
val_obj = movie_genre['genre']
bargraph (val_obj, title=f'Training set_Movie Genre Labels', ylabel='Counts',titlefontsize=15, 
          xfontsize=12,yfontsize=13,percent=False,fontsizelable=10,yshift=5,xshift=-0.2,color='b',
          legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,5500])  


ax2 = plt.subplot(1,2,2) 

tt = TweetTokenizer()
val = movie_genre.apply(lambda x : len(tt.tokenize(x['movie_name'])), axis=1)
_,_,_, _, _,_ ,_ ,_ ,_ ,_ ,\
_,_ ,_ ,_ ,_ = histplt (val,bins=50,title=f'Histogram of Length for Movie Genre',xlabl=None,days=False,
                  ylabl=None,xlimt=(0,16),ylimt=False
                  ,axt=ax2,nsplit=5,scale=0.95,font=12,loc=1,color='g')
plt.show()
In [4]:
# Start with one review:
text = data_movie_genre.genre.to_list()
token_string = " ".join(text)


# Generate the word cloud with custom settings
wordcloud = WordCloud(
    width=600, 
    height=400, 
    background_color='gray',  # Can also use 'transparent' for no background
    max_words=100,  # Control the number of words
    max_font_size=150,  # Max font size for the largest words
    min_font_size=10,   # Min font size for smaller words
    random_state=42,  # Control the randomness of word placement
    contour_color='black',  # Outline the word cloud with a contour
    contour_width=1.5  # Width of the contour
).generate(token_string)

# Plot the beautified word cloud using matplotlib
plt.figure(figsize=(12, 6))  # Set a larger figure size for higher resolution
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")  # Turn off the axis for a clean look
plt.show()

distilgpt2 Model

DistilGPT2 (short for Distilled-GPT2) is an English-language model pre-trained with the supervision of the smallest version of Generative Pre-trained Transformer 2 (GPT-2). Like GPT-2, DistilGPT2 can be used to generate text. Users of this model card should also consider information about the design, training, and limitations of GPT-2.

Model Description

DistilGPT2 is an English-language model pre-trained with the supervision of the 124 million parameter version of GPT-2. DistilGPT2, which has 82 million parameters, was developed using knowledge distillation and was designed to be a faster, lighter version of GPT-2.

Prompt Engineering

distilgpt2 is a smaller and more efficient version of the GPT-2 (Generative Pre-trained Transformer 2) language model developed by OpenAI. The "distil" in its name stands for "distillation," indicating that it is a distilled or compressed version of the original model.

Distillation involves training a smaller model (student) to mimic the behavior of a larger model (teacher). In the case of distilgpt2, it is trained to replicate the capabilities of the larger GPT-2 model but with a reduced number of parameters, making it more lightweight and suitable for environments with limited computational resources.

In [6]:
MODEL = 'distilgpt2'

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

# Add prompt for task
genre_PROMPT = 'genre Task'
synopsis_TOKEN = '\nsynopsis:'
movie_name_TOKEN = '\movie_name:'
In [7]:
movie_genre['genre_text'] = f'{genre_PROMPT}\nmovie_name: ' + movie_genre['movie_name'] +\
                                   '\nsynopsis:' +  ' ' + movie_genre['synopsis'].astype(str)+\
                                   '\ngenre:' +  ' ' + movie_genre['genre'].astype(str)
In [8]:
print(movie_genre['genre_text'][0])
genre Task
movie_name: Super Me
synopsis: A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.
genre: fantasy
In [9]:
training_examples = movie_genre['genre_text']

print(len(training_examples))
40000

Split Data into Training and Test

In [10]:
multi_task_df = pd.DataFrame({'text': training_examples})

data = Dataset.from_pandas(multi_task_df)
def preprocess(examples):
    return tokenizer(examples['text'], truncation=True)

data = data.map(preprocess, batched=True)

data = data.train_test_split(train_size=.8)
In [11]:
multi_task_df[:4]
Out[11]:
text
0 genre Task\nmovie_name: Super Me\nsynopsis: A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.\ngenre: fantasy
1 genre Task\nmovie_name: Entity Project\nsynopsis: A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular.\ngenre: horror
2 genre Task\nmovie_name: Behavioral Family Therapy for Serious Psychiatric Disorders\nsynopsis: This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses.\ngenre: family
3 genre Task\nmovie_name: Blood Glacier\nsynopsis: Scientists working in the Austrian Alps discover that a glacier is leaking a liquid that appears to be affecting local wildlife.\ngenre: scifi
In [12]:
data
Out[12]:
DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 8000
    })
})

Fine-tune Trainer

In [13]:
# Load pretrained model
model = GPT2LMHeadModel.from_pretrained(MODEL)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
In [14]:
data['train'][0]
Out[14]:
{'text': 'genre Task\nmovie_name: Against Fear\nsynopsis: A former criminal profiler with an extreme form of anxiety disorder must track down a killer who can control and induce fear in their victims.\ngenre: thriller',
 'input_ids': [35850,
  15941,
  198,
  41364,
  62,
  3672,
  25,
  12914,
  16132,
  198,
  28869,
  24608,
  25,
  317,
  1966,
  4301,
  1534,
  5329,
  351,
  281,
  3257,
  1296,
  286,
  9751,
  8967,
  1276,
  2610,
  866,
  257,
  11868,
  508,
  460,
  1630,
  290,
  21155,
  3252,
  287,
  511,
  4970,
  13,
  198,
  35850,
  25,
  32251],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}
In [15]:
epochs = 2
batch_size = 5

training_args = TrainingArguments(
    output_dir="./gpt2_movie", # The output directory
    overwrite_output_dir=True,      # overwrite the content of the output directory
    num_train_epochs=epochs,        # number of training epochs
    per_device_train_batch_size=batch_size, # batch size for training
    per_device_eval_batch_size=batch_size,  # batch size for evaluation
    logging_steps=5,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',    # It is "steps" or "epoch", we choose epoch: how many times to stop training to test
    save_strategy='epoch'           # save a check point of our model after each epoch
)

trainer = Trainer(
    model=model,                    # take our model (tweet_clf_model)
    args=training_args,             # we just set it above
    train_dataset=data['train'],    # training part of dataset
    eval_dataset=data['test'],      # test (evaluation) part of dataset
    data_collator=data_collator     # data colladior with padding. Infact, we may or may not need a data collator
                                    # we can check the model to see how it lookes like with or without the collator
) 

trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
[1600/1600 3:36:53]
Out[15]:
{'eval_loss': 4.785849571228027,
 'eval_runtime': 900.4693,
 'eval_samples_per_second': 8.884,
 'eval_steps_per_second': 1.777}
In [16]:
trainer.train()
The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 32000
  Num Epochs = 2
  Instantaneous batch size per device = 5
  Total train batch size (w. parallel, distributed & accumulation) = 5
  Gradient Accumulation steps = 1
  Total optimization steps = 12800
  Number of trainable parameters = 81912576
[12800/12800 11:52:42, Epoch 2/2]
Epoch Training Loss Validation Loss
1 3.052200 2.908587
2 2.541700 2.880669

The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
Saving model checkpoint to ./gpt2_movie\checkpoint-6400
Configuration saved in ./gpt2_movie\checkpoint-6400\config.json
Configuration saved in ./gpt2_movie\checkpoint-6400\generation_config.json
Model weights saved in ./gpt2_movie\checkpoint-6400\pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
Saving model checkpoint to ./gpt2_movie\checkpoint-12800
Configuration saved in ./gpt2_movie\checkpoint-12800\config.json
Configuration saved in ./gpt2_movie\checkpoint-12800\generation_config.json
Model weights saved in ./gpt2_movie\checkpoint-12800\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./gpt2_movie\checkpoint-12800 (score: 2.880668878555298).
Out[16]:
TrainOutput(global_step=12800, training_loss=2.947077699005604, metrics={'train_runtime': 42764.801, 'train_samples_per_second': 1.497, 'train_steps_per_second': 0.299, 'total_flos': 1089533449912320.0, 'train_loss': 2.947077699005604, 'epoch': 2.0})
In [17]:
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
[1600/1600 11:41]
Out[17]:
{'eval_loss': 2.880668878555298,
 'eval_runtime': 701.413,
 'eval_samples_per_second': 11.406,
 'eval_steps_per_second': 2.281,
 'epoch': 2.0}
In [18]:
trainer.save_model()
Saving model checkpoint to ./gpt2_movie
Configuration saved in ./gpt2_movie\config.json
Configuration saved in ./gpt2_movie\generation_config.json
Model weights saved in ./gpt2_movie\pytorch_model.bin
In [ ]:
 

OpenAI Model

Few shot learning

In [19]:
# "OPENAI_API_KEY": notice, OpenAL is not a open soourse model like downloading from Huggingface
# we should have API_KEY that OpenAI send yo us that is persona API key.
openai.api_key = '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'  
In [20]:
## Here are list of engines that OpenAI have for us
#openai.Engine.list().data
In [21]:
# Look at othe model that have either 'text' or 'gpt'
[e for e in openai.Engine.list().data if 'text' in e.id or 'gpt' in e.id][:5]
Out[21]:
[<Engine engine id=text-search-babbage-doc-001 at 0x1dd2b60dae0> JSON: {
   "object": "engine",
   "id": "text-search-babbage-doc-001",
   "ready": true,
   "owner": "openai-dev",
   "permissions": null,
   "created": null
 },
 <Engine engine id=gpt-4-0613 at 0x1dd2b60df40> JSON: {
   "object": "engine",
   "id": "gpt-4-0613",
   "ready": true,
   "owner": "openai",
   "permissions": null,
   "created": null
 },
 <Engine engine id=gpt-4 at 0x1dd2b4ce770> JSON: {
   "object": "engine",
   "id": "gpt-4",
   "ready": true,
   "owner": "openai",
   "permissions": null,
   "created": null
 },
 <Engine engine id=text-davinci-003 at 0x1dd2b4ce860> JSON: {
   "object": "engine",
   "id": "text-davinci-003",
   "ready": true,
   "owner": "openai-internal",
   "permissions": null,
   "created": null
 },
 <Engine engine id=text-search-babbage-query-001 at 0x1dd2b4ce900> JSON: {
   "object": "engine",
   "id": "text-search-babbage-query-001",
   "ready": true,
   "owner": "openai-dev",
   "permissions": null,
   "created": null
 }]
In [22]:
fewshot = 30
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'

for ii in range(30):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
                                   '\nsynopsis:' +  ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
                                   '\ngenre:' +  ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
    
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'    
i_ex=0
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[i_ex] +\
                               '\nsynopsis:' +  ' ' + Hold_out['synopsis'].astype(str).iloc[i_ex]+\
                               '\ngenre: ?' +  '\n\n'
In [23]:
genre = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": prompt}
    ],
  temperature=0.6,
  max_tokens=20,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
In [24]:
# Get the completion, with some more flavor
print(genre['choices'][0]["message"]["content"])
Based on the provided examples, it is difficult to predict the genre for the movie "The Pot Car
In [25]:
print(f'The Actual genre is ')
Hold_out[['movie_name','synopsis','genre']].loc[[i_ex]]
The Actual genre is 
Out[25]:
movie_name synopsis genre
0 The Pot Carriers James Rainbow is a first time offender sentenced to 12 months imprisonment. Redband, a trusty and Mouse, a diminutive housebreaker take him under their wing. crime
In [ ]:
 

Predict Move Genre

In [26]:
# Load fine-tuned model
loaded_model_gpt2_movie_name = GPT2LMHeadModel.from_pretrained('./gpt2_movie')

# Make generator pipeline
generator_gpt2_movie_name = pipeline('text-generation', model=loaded_model_gpt2_movie_name, tokenizer=tokenizer)
loading configuration file ./gpt2_movie\config.json
Model config GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "use_cache": true,
  "vocab_size": 50257
}

loading weights file ./gpt2_movie\pytorch_model.bin
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at ./gpt2_movie.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
loading configuration file ./gpt2_movie\generation_config.json
Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

Hold out 1

Fine-tuned distilgpt2

In [82]:
ir = 1
movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]

print(f'movie_name: {movie_name}')
print(f'synopsis: {synopsis}')
print(f'genre: {genre}')

num_tokens = len(tokenizer(synopsis)['input_ids'])
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery
In [83]:
genre_text_sample = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
                                   '\nsynopsis:' +  ' ' + synopsis+\
                                   '\ngenre:' 
print(f'prompt is:\n\n{genre_text_sample}')
prompt is:

genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre:
In [87]:
for generated_text in generator_gpt2_movie_name(genre_text_sample, top_k=1, temperature=1, 
                                                num_return_sequences=3, max_length=num_tokens + 1):
    print(generated_text['generated_text']+'\n')
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 49, but `max_length` is set to 24. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery

genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery

genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery

top_k: The top_k parameter, also known as the "top-k sampling" strategy, is used to limit the number of words considered during the sampling process. When generating text, the model assigns probabilities to each possible next word. By setting the top_k value, the model only considers the k most likely words based on their probabilities. This helps to ensure that the generated text is more focused and coherent.

For example, if top_k is set to 10, the model will only consider the top 10 words with the highest probabilities for each word position, discarding the rest. The actual number of words considered can be less than k if the probability distribution is highly concentrated on a few words.

temperature: The temperature parameter controls the randomness of the generated text. It adjusts the probability distribution during sampling. A higher temperature value, such as 1.0, increases the randomness and diversity of the output. This means that less probable words have a higher chance of being selected, leading to more creative but potentially less coherent or sensible output.

On the other hand, a lower temperature value, such as 0.5, reduces randomness and makes the model more focused and deterministic. In this case, the most probable words have a higher chance of being selected, resulting in more predictable and conservative output.

OpenAI

In [88]:
fewshot = 50
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'

for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
                                   '\nsynopsis:' +  ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
                                   '\ngenre:' +  ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
    
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'    
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[ir] +\
                               '\nsynopsis:' +  ' ' + Hold_out['synopsis'].astype(str).iloc[ir]+\
                               '\ngenre: ?' +  '\n\n'
In [89]:
genre = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": prompt}
    ],
  temperature=0.6,
  max_tokens=20,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
In [90]:
# Get the completion, with some more flavor
print(genre['choices'][0]["message"]["content"])
Based on the provided examples, it is difficult to predict the genre for the movie "Anthony Blake El
In [91]:
print(f'The Actual genre is ')
Hold_out[['movie_name','synopsis','genre']].loc[[ir]]
The Actual genre is 
Out[91]:
movie_name synopsis genre
1 Anthony Blake El Sol Es El Cerebro De La Tierra A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics. mystery

Few shot OpenAI cannot predict genre for this movie. Lets change the prompt:

In [92]:
gnrs = ', '.join(list(val_obj.value_counts().keys()))
prompt = f'For this movie synopsis \n "{Hold_out.synopsis.loc[ir]}" \n which genre out of {gnrs} can be assigned'
print(prompt)
For this movie synopsis 
 "A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics." 
 which genre out of horror, action, scifi, crime, mystery, adventure, thriller, fantasy, romance, family can be assigned
In [93]:
genre = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": prompt}
    ],
  temperature=0.6,
  max_tokens=20,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
In [94]:
# Get the completion, with some more flavor
print(genre['choices'][0]["message"]["content"])
Based on the given synopsis, it is difficult to assign a specific genre. However, the closest genre
In [ ]:
 

Hold out 2

Fine-tuned distilgpt2

In [95]:
ir = 2
movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]

print(f'movie_name: {movie_name}')
print(f'synopsis: {synopsis}')
print(f'genre: {genre}')

num_tokens = len(tokenizer(synopsis)['input_ids'])
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: crime
In [96]:
genre_text_sample = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
                                   '\nsynopsis:' +  ' ' + synopsis+\
                                   '\ngenre:' 

print(f'prompt is:\n\n{genre_text_sample}')
prompt is:

genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre:
In [97]:
for generated_text in generator_gpt2_movie_name(genre_text_sample, top_k=1, temperature=1, 
                                                num_return_sequences=3, max_length=num_tokens + 1):
    print(generated_text['generated_text']+'\n')
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 55, but `max_length` is set to 36. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: action

genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: action

genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: action

OpenAI

In [98]:
fewshot = 50
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'

for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
                                   '\nsynopsis:' +  ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
                                   '\ngenre:' +  ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
    
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'    
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[ir] +\
                               '\nsynopsis:' +  ' ' + Hold_out['synopsis'].astype(str).iloc[ir]+\
                               '\ngenre: ?' +  '\n\n'
In [99]:
genre = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": prompt}
    ],
  temperature=0.6,
  max_tokens=20,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
In [100]:
# Get the completion, with some more flavor
print(genre['choices'][0]["message"]["content"])
Based on the examples provided, it is difficult to predict the genre for the movie "Yakuza
In [101]:
print(f'The Actual genre is ')
Hold_out[['movie_name','synopsis','genre']].loc[[ir]]
The Actual genre is 
Out[101]:
movie_name synopsis genre
2 Yakuza bangaichi An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be. crime

Lets change the prompt:

In [102]:
gnrs = ', '.join(list(val_obj.value_counts().keys()))
prompt = f'For this movie synopsis \n "{Hold_out.synopsis.loc[ir]}" \n which genre out of {gnrs} can be assigned'
print(prompt)
For this movie synopsis 
 "An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be." 
 which genre out of horror, action, scifi, crime, mystery, adventure, thriller, fantasy, romance, family can be assigned
In [103]:
genre = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": prompt}
    ],
  temperature=0.6,
  max_tokens=20,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
In [104]:
# Get the completion, with some more flavor
print(genre['choices'][0]["message"]["content"])
Based on the given movie synopsis, the genre that can be assigned is crime.
In [ ]:
 

Predict all hold out set

In [ ]:
movie_name_ = []
synopsis_ = []
genre_ = []
genre_pred_ = []

df = pd.DataFrame()

# only predict 100 out 4000 hold out
for ir in range(100):
    movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]
    
    movie_name_.append(movie_name)
    synopsis_.append(synopsis)
    genre_.append(genre)
    
    
    genre_text = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
                                       '\nsynopsis:' +  ' ' + synopsis+\
                                       '\ngenre:' 
    genre_pred = generator_gpt2_movie_name(genre_text, top_k=1, temperature=1, 
                                           num_return_sequences= num_tokens + 1)[0]['generated_text']
    match = re.search(r'genre:\s*([^\\]+)', genre_pred)
    genre_pred_.append(match.group(1))
In [111]:
df['movie_name'] = movie_name_
df['synopsis'] = synopsis_
df['actual_genre'] = genre_
df['predict_genre'] = genre_pred_
In [115]:
df[:4]
Out[115]:
movie_name synopsis actual_genre predict_genre
0 The Pot Carriers James Rainbow is a first time offender sentenced to 12 months imprisonment. Redband, a trusty and Mouse, a diminutive housebreaker take him under their wing. crime crime
1 Anthony Blake El Sol Es El Cerebro De La Tierra A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics. mystery mystery
2 Yakuza bangaichi An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be. crime action
3 Rio 2096: A Story of Love and Fury Revolves around the love between an immortal hero and JanaĆ­na, the woman he has been in love with for 600 years, through Brazil's colonization, slavery, military regime and the future, in 2096, in the midst of wars for water. romance romance
In [116]:
## Extract actual and predicted values
#y_true = df['actual_genre']
#y_pred = df['predict_genre']
#
## Calculate accuracy
#accuracy = accuracy_score(y_true, y_pred)
#
#print("Accuracy:", accuracy)
In [ ]: