Summary
GPT was pre-trained on a substantial text corpus called BookCorpus, comprising 4.5 GB of text sourced from 7,000 unpublished books spanning various genres. In this pre-training phase, the model learns to predict the next word in a sequence from the preceding words, a process known as language modeling, which imparts an understanding of the inherent structure and patterns of natural language. Following pre-training, GPT can be fine-tuned for specific tasks using a smaller, task-specific dataset. Fine-tuning adjusts the model's parameters to suit the given task, such as classification, similarity scoring, or question answering. Over time, OpenAI has enhanced and expanded the GPT architecture, introducing subsequent models like GPT-2, GPT-3, GPT-3.5, and GPT-4. These newer models are trained on larger datasets and have far greater capacity, enabling them to generate more intricate and coherent text. The widespread adoption of GPT models by researchers and industry practitioners has played a pivotal role in advancing natural language processing. Of this family, only GPT-2 is freely available on Hugging Face. In this notebook, the distilgpt2 model (a smaller, more efficient version of GPT-2) is fine-tuned for movie genre prediction.
The Python functions and data files needed to run this notebook are available on my GitHub page.
import warnings
warnings.filterwarnings('ignore')
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline, \
GPT2Tokenizer
import re
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np
import openai
import matplotlib as mpl
from sklearn.metrics import accuracy_score
mpl.rcParams.update(mpl.rcParamsDefault)
from matplotlib.offsetbox import AnchoredText
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
from nltk.tokenize import TweetTokenizer
import matplotlib.ticker as ticker
from datasets import Dataset
Introduction
GPT stands for Generative Pre-trained Transformer:
Generative: GPT is an auto-regressive language model; it predicts each token using only one side of the context, the past tokens.
Pre-trained: the decoder is trained on a huge corpus of text.
Transformer: the decoder is taken from the transformer architecture.
GPT refers to a family of models:
- GPT-1 released in 2018: 0.117B parameters
- GPT-2 released in 2019: 1.5B parameters
- GPT-3 released in 2020: 175B parameters
OpenAI's public release of the latest GPT-3 and ChatGPT models has democratized access to powerful autoregressive language models. Now, individuals can harness these models for various applications. However, there are certain limitations:
These models lack the ability to take explicit direction; instead, their focus lies in completing sentences in a consistent style. While few-shot learning has proven effective, the models struggle when asked direct questions with a demand for specific answers. In response to these challenges, OpenAI introduced an updated iteration of GPT-3 in January 2022 called InstructGPT. This enhanced version exhibits improvements in various aspects, including a reduction in harmful biases and the capability to follow directions from prompts, providing answers without fixating on completing the given thought.
How InstructGPT is trained:
- Fine-tuned using reinforcement learning from human feedback (RLHF)
- OpenAI used a crowdsourced team to label instruction-based prompts and outputs
- OpenAI trained a reward model on this dataset to detect the output that the crowd preferred
- The reward model was used to optimize the GPT-3 policy using the proximal policy optimization (PPO) algorithm to make outputs more in line with what humans were expecting.
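To make the reward-modeling step concrete, here is a minimal sketch of the standard pairwise preference loss used for reward models; this illustrates the general technique, not OpenAI's actual implementation:

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred
    output above the reward of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy scalar rewards for one labeled comparison
loss = reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3]))
print(loss)  # small loss, since the preferred output already scores higher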
Prompt Engineering
Prompt Engineering is the process of designing input prompts for language model systems like GPT-3 and ChatGPT:
We can influence the output produced by the model to get something more specific and usable by carefully crafting and adjusting prompts.
Prompt Engineering can be used to guide the model to produce more relevant and coherent output for a given task.
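As a quick illustration (a hypothetical pair of prompts built around a movie from the dataset used later in this notebook), compare an unconstrained prompt with one that states the task and the allowed answers explicitly:

# A bare continuation prompt: the model may complete it in any direction
vague_prompt = "Super Me: A young scriptwriter starts bringing valuable objects back from his nightmares."

# An engineered prompt: the task and the allowed answers are stated explicitly,
# so the completion is far more likely to be a single usable genre label
engineered_prompt = (
    "Classify the movie below into exactly one genre from: action, adventure, crime, "
    "family, fantasy, horror, mystery, romance, scifi, thriller.\n"
    "movie_name: Super Me\n"
    "synopsis: A young scriptwriter starts bringing valuable objects back "
    "from his short nightmares of being chased by a demon.\n"
    "genre:"
)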
Alignment in Language Models
To understand why prompt engineering is crucial to LLM-application development, we first have to understand not only how LLMs are trained, but how they are aligned to human input. Alignment in language models refers to how the model understands and responds to input prompts that are "in line with" (at least according to the people in charge of aligning the LLM) what the user expected. In standard language modeling, a model is trained to predict the next word or sequence of words based on the context of the preceding words. However, this approach alone does not allow for specific instructions or prompts to be answered by the model, which can limit its usefulness for certain applications.
Prompt engineering can be challenging if the language model has not been aligned with the prompts, as it may generate irrelevant or incorrect responses. However, some language models have been developed with extra alignment features, such as Constitutional AI-driven Reinforcement Learning from AI Feedback (RLAIF) from Anthropic or Reinforcement Learning from Human Feedback (RLHF) in OpenAI's GPT series, which can incorporate explicit instructions and feedback into the model's training. These alignment techniques can improve the model's ability to understand and respond to specific prompts, making them more useful for applications such as question-answering or language translation:
ChatGPT
- Trained similarly to InstructGPT, ChatGPT is a newer implementation of GPT from OpenAI that uses a conversational UI (user interface).
Movie Genre Data Set
The data set is movie-genre-prediction, retrieved from Kaggle.
Objective of data set
The goal of this competition is to design a predictive model that accurately classifies movies into their respective genres based on their titles and synopses. The challenge lies not just in achieving high accuracy, but also in ensuring that the model is efficient and interpretable.
Why This is Interesting and Relevant
Understanding movie genres based on titles and synopses is a fascinating problem for multiple reasons. From a recommendation system perspective, an effective genre classifier can help build more personalized user recommendations, increasing user engagement on streaming platforms. In the context of box office performance, understanding the relationship between genres and how they are perceived in synopses can provide insight into patterns of commercial success or failure. Furthermore, this challenge can facilitate a deeper comprehension of movie themes and trends in the industry, contributing to cultural and societal studies.
Participants will be provided with a comprehensive dataset comprising ~100,000 movies. Each entry includes the original title, the genre(s), and the synopsis of the movie.
The dataset contains a mix of both original and AI-generated titles, genres, and synopses to test the robustness of the models.
The 10 genres include action, adventure, crime, family, fantasy, horror, mystery, romance, scifi, and thriller.
data_movie_genre = pd.read_csv('movie_genre.csv')
# Since the full data set is large for fine-tuning, we select the first 40,000 samples
n_sample = 40000
movie_genre = data_movie_genre[:n_sample]
Hold_out = data_movie_genre[~data_movie_genre.index.isin(movie_genre.index)][:int(n_sample*0.1)].reset_index(drop=True)
movie_genre = movie_genre.reset_index(drop=True)
movie_genre.head()
|   | id | movie_name | synopsis | genre |
|---|----|------------|----------|-------|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich. | fantasy |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular. | horror |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiatric Disorders | This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses. | family |
| 3 | 78522 | Blood Glacier | Scientists working in the Austrian Alps discover that a glacier is leaking a liquid that appears to be affecting local wildlife. | scifi |
| 4 | 2206 | Apat na anino | Buy Day - Four Men Widely - Apart in Life - By Night Shadows United in One Fight Venting the Fire of their Fury Against the Hated Oppressors. | action |
print(f'Shape of training set is {movie_genre.shape}')
print(f'Shape of holdout set is {Hold_out.shape}')
Shape of training set is (40000, 4)
Shape of holdout set is (4000, 4)
def histplt(val: list, bins: int, title: str, xlabl: str, ylabl: str, xlimt: list,
            ylimt: list=False, loc: int=1, legend: int=1, axt=None, days: int=False,
            class_: int=False, scale: int=1, x_tick: list=False, calc_perc: bool=True,
            nsplit: int=1, font: int=5, color: str='b') -> list:
    """ Histogram including important statistics """
    ax1 = axt or plt.axes()
    font = {'size': font}
    plt.rc('font', **font)
    miss_n = len(val[np.isnan(val)])
    tot = len(val)
    n_distinct = len(np.unique(val))
    miss_p = (miss_n/tot)*100
    val = np.array(val[~np.isnan(val)])
    plt.hist(val, bins=bins, weights=np.ones(len(val)) / len(val), ec='black', color=color)
    n_nonmis = len(val)
    times = 100 if class_ else 1
    Mean = np.nanmean(val)*times
    Median = np.nanmedian(val)*times
    sd = np.sqrt(np.nanvar(val))
    Max = np.nanmax(val)
    Min = np.nanmin(val)
    p1 = np.quantile(val, 0.01)
    p25 = np.quantile(val, 0.25)
    p75 = np.quantile(val, 0.75)
    p99 = np.quantile(val, 0.99)
    if calc_perc:
        txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\np1%%=%0.1f\np99%%=%0.1f\nMin=%0.1f\nMax=%0.1f'
        anchored_text = AnchoredText(txt % (n_nonmis, n_distinct, miss_p, Mean, sd, p1, p99, Min, Max),
                                     borderpad=0, loc=loc, prop={'size': font['size']*scale})
    else:
        txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\nMin=%0.1f\nMax=%0.1f'
        anchored_text = AnchoredText(txt % (n_nonmis, n_distinct, miss_p, Mean, sd, Min, Max),
                                     borderpad=0, loc=loc, prop={'size': font['size']*scale})
    if legend == 1:
        ax1.add_artist(anchored_text)
    if scale:
        plt.title(title, fontsize=font['size']*(scale+0.15))
    else:
        plt.title(title)
    ax1.set_ylabel('Frequency', fontsize=font['size'])
    if xlabl is not None:
        ax1.set_xlabel(xlabl, fontsize=font['size']*scale if scale else font['size'])
    if ylabl is not None:
        plt.ylabel(ylabl, fontsize=font['size']*scale if scale else font['size'])
    if class_ == True:
        plt.xticks([0, 1])
    plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    if xlimt:
        plt.xlim(xlimt)
    if ylimt:
        plt.ylim(ylimt)
    if x_tick:
        plt.xticks(x_tick, fontsize=font['size']*scale)
    plt.yticks(fontsize=font['size']*scale)
    plt.grid(linewidth=0.12)
    # Interquartile Range Method for outlier detection
    iqr = p75 - p25
    # calculate the outlier cutoff
    cut_off = iqr * 1.5
    lower, upper = p25 - cut_off, p75 + cut_off
    return tot, n_nonmis, n_distinct, miss_n, miss_p, Mean, Median, sd, Max, Min, p1, p25, p75, p99, sd
#############################################################################################################
def bargraph(val_ob: pd.Series, title: str, ylabel: str, titlefontsize: int=10, xfontsize: int=5, scale: int=1,
             yfontsize: int=8, select: bool=False, fontsizelable: bool=False, xshift: float=-0.1, nsim: int=False,
             yshift: float=0.01, percent: bool=False, xlim: list=False, axt: bool=None, color: str='b', sort=True,
             ylim: list=False, y_rot: int=0, ytick_rot: int=90, graph_float: int=1,
             loc: int=1, legend: int=1) -> None:
    """ vertical bargraph """
    ax1 = axt or plt.axes()
    tot = len(val_ob)
    miss_p_ob = (len(val_ob[pd.isnull(val_ob)])/tot)*100
    n_nonmis_ob = len(val_ob[~pd.isnull(val_ob)])
    con = np.array(val_ob.value_counts())
    len_ = len(con)
    cats = list(val_ob.value_counts().keys())
    val_ob = con[:len_]
    clmns = cats[:len_]
    # Sort counts
    if sort:
        sort_score = sorted(zip(val_ob, clmns), reverse=True)
        Clmns_sort = [sort_score[i][1] for i in range(len(clmns))]
        sort_score = [sort_score[i][0] for i in range(len(clmns))]
    else:
        Clmns_sort = clmns
        sort_score = val_ob
    if select:
        Clmns_sort = Clmns_sort[:select]
        sort_score = sort_score[:select]
    ax1.bar(Clmns_sort, sort_score, width=0.6, align='center', alpha=1, edgecolor='k', capsize=4, color=color)
    plt.title(title, fontsize=titlefontsize)
    ax1.set_ylabel(ylabel, fontsize=yfontsize)
    ax1.set_xticks(np.arange(len(Clmns_sort)))
    ax1.set_xticklabels(Clmns_sort, fontsize=xfontsize, rotation=ytick_rot, y=0.02)
    if percent:
        plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2)
    if xlim:
        plt.xlim(xlim)
    if ylim:
        plt.ylim(ylim)
    if fontsizelable:
        for ii in range(len(sort_score)):
            if percent:
                plt.text(xshift+ii, sort_score[ii]+yshift, f'{"{0:.2f}".format(sort_score[ii]*100)}%',
                         fontsize=fontsizelable, rotation=y_rot, color='k')
            else:
                plt.text(xshift+ii, sort_score[ii]+yshift, f'{np.round(sort_score[ii],graph_float)}',
                         fontsize=fontsizelable, rotation=y_rot, color='k')
    dic_Clmns = {}
    for i in range(len(Clmns_sort)):
        dic_Clmns[Clmns_sort[i]] = sort_score[i]
    txt = 'n (not missing)=%.0f\nMissing=%.1f%%'
    anchored_text = AnchoredText(txt % (n_nonmis_ob, miss_p_ob), borderpad=0, loc=loc)
    if legend == 1:
        ax1.add_artist(anchored_text)
font = {'size' : 12}
plt.rc('font', **font)
colors_map = plt.cm.get_cmap('jet')
fig, ax = plt.subplots(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k')
ax1 = plt.subplot(1,2,1)
val_obj = movie_genre['genre']
bargraph(val_obj, title='Training Set: Movie Genre Labels', ylabel='Counts', titlefontsize=15,
         xfontsize=12, yfontsize=13, percent=False, fontsizelable=10, yshift=5, xshift=-0.2, color='b',
         legend=True, ytick_rot=25, y_rot=0, axt=ax1, loc=1, ylim=[0,5500])
ax2 = plt.subplot(1,2,2)
tt = TweetTokenizer()
val = movie_genre.apply(lambda x: len(tt.tokenize(x['movie_name'])), axis=1)
_ = histplt(val, bins=50, title='Histogram of Movie Name Token Length', xlabl=None, days=False,
            ylabl=None, xlimt=(0, 16), ylimt=False,
            axt=ax2, nsplit=5, scale=0.95, font=12, loc=1, color='g')
plt.show()
distilgpt2 Model
DistilGPT2 (short for Distilled-GPT2) is an English-language model pre-trained with the supervision of the smallest version of Generative Pre-trained Transformer 2 (GPT-2). Like GPT-2, DistilGPT2 can be used to generate text. Users of this model card should also consider information about the design, training, and limitations of GPT-2.
Model Description
DistilGPT2 is an English-language model pre-trained with the supervision of the 124 million parameter version of GPT-2. DistilGPT2, which has 82 million parameters, was developed using knowledge distillation and was designed to be a faster, lighter version of GPT-2.
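A quick way to check these parameter counts, assuming both checkpoints can be downloaded from the Hugging Face Hub:

from transformers import GPT2LMHeadModel

# Compare student (distilgpt2) and teacher (gpt2) sizes
for name in ['distilgpt2', 'gpt2']:
    m = GPT2LMHeadModel.from_pretrained(name)
    n_params = sum(p.numel() for p in m.parameters())
    print(f'{name}: {n_params/1e6:.0f}M parameters, {m.config.n_layer} layers')
# distilgpt2 keeps GPT-2's hidden size (768) but halves the number of layers (6 vs. 12)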
Prompt Engineering
distilgpt2 is a smaller and more efficient version of the GPT-2 (Generative Pre-trained Transformer 2) language model developed by OpenAI. The "distil" in its name stands for "distillation," indicating that it is a distilled or compressed version of the original model.
Distillation involves training a smaller model (the student) to mimic the behavior of a larger model (the teacher). In the case of distilgpt2, the student is trained to replicate the capabilities of the larger GPT-2 model with a reduced number of parameters, making it more lightweight and suitable for environments with limited computational resources.
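The core of distillation is training the student to match the teacher's softened output distribution. Below is a minimal sketch of the distillation loss term, assuming PyTorch; it illustrates the technique, not the exact recipe used to produce distilgpt2:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between the teacher's and the student's softened next-token
    distributions; in practice it is combined with the ordinary language-modeling loss."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * temperature ** 2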
MODEL = 'distilgpt2'
# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
# Add prompt for task
genre_PROMPT = 'genre Task'
synopsis_TOKEN = '\nsynopsis:'
movie_name_TOKEN = '\nmovie_name:'
movie_genre['genre_text'] = f'{genre_PROMPT}\nmovie_name: ' + movie_genre['movie_name'] +\
'\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str)+\
'\ngenre:' + ' ' + movie_genre['genre'].astype(str)
print(movie_genre['genre_text'][0])
genre Task
movie_name: Super Me
synopsis: A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.
genre: fantasy
training_examples = movie_genre['genre_text']
print(len(training_examples))
40000
Split Data into Training and Test
multi_task_df = pd.DataFrame({'text': training_examples})
data = Dataset.from_pandas(multi_task_df)
def preprocess(examples):
    return tokenizer(examples['text'], truncation=True)
data = data.map(preprocess, batched=True)
data = data.train_test_split(train_size=.8)
multi_task_df[:4]
|   | text |
|---|------|
| 0 | genre Task\nmovie_name: Super Me\nsynopsis: A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.\ngenre: fantasy |
| 1 | genre Task\nmovie_name: Entity Project\nsynopsis: A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular.\ngenre: horror |
| 2 | genre Task\nmovie_name: Behavioral Family Therapy for Serious Psychiatric Disorders\nsynopsis: This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses.\ngenre: family |
| 3 | genre Task\nmovie_name: Blood Glacier\nsynopsis: Scientists working in the Austrian Alps discover that a glacier is leaking a liquid that appears to be affecting local wildlife.\ngenre: scifi |
data
DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 8000
    })
})
Fine-tune Trainer
# Load pretrained model
model = GPT2LMHeadModel.from_pretrained(MODEL)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
data['train'][0]
{'text': 'genre Task\nmovie_name: Against Fear\nsynopsis: A former criminal profiler with an extreme form of anxiety disorder must track down a killer who can control and induce fear in their victims.\ngenre: thriller', 'input_ids': [35850, 15941, 198, 41364, 62, 3672, 25, 12914, 16132, 198, 28869, 24608, 25, 317, 1966, 4301, 1534, 5329, 351, 281, 3257, 1296, 286, 9751, 8967, 1276, 2610, 866, 257, 11868, 508, 460, 1630, 290, 21155, 3252, 287, 511, 4970, 13, 198, 35850, 25, 32251], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
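To see what the collator produces, we can feed it a couple of tokenized examples by hand (dropping the raw text column, as the Trainer does internally). This is only a sanity check; the Trainer calls the collator for us during training:

# The collator pads each batch to its longest sequence and builds `labels`
# for causal LM training; padded positions get -100 so the loss ignores them
# (the one-token shift itself happens inside GPT2LMHeadModel).
features = [{k: data['train'][i][k] for k in ('input_ids', 'attention_mask')}
            for i in range(2)]
batch = data_collator(features)
print(batch['input_ids'].shape)  # (2, longest sequence in this small batch)
print(batch['labels'][0][:10])   # same ids as input_ids, with padding masked to -100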
epochs = 2
batch_size = 5
training_args = TrainingArguments(
    output_dir="./gpt2_movie",                # the output directory
    overwrite_output_dir=True,                # overwrite the content of the output directory
    num_train_epochs=epochs,                  # number of training epochs
    per_device_train_batch_size=batch_size,   # batch size for training
    per_device_eval_batch_size=batch_size,    # batch size for evaluation
    logging_steps=5,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',              # "steps" or "epoch"; with "epoch", evaluation runs at the end of each epoch
    save_strategy='epoch'                     # save a checkpoint of the model after each epoch
)
trainer = Trainer(
    model=model,                  # the distilgpt2 model loaded above
    args=training_args,           # the TrainingArguments defined above
    train_dataset=data['train'],  # training split of the dataset
    eval_dataset=data['test'],    # test (evaluation) split of the dataset
    data_collator=data_collator   # batches, pads, and builds language-modeling labels;
                                  # strictly optional here, but it saves manual padding
)
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`, you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
{'eval_loss': 4.785849571228027, 'eval_runtime': 900.4693, 'eval_samples_per_second': 8.884, 'eval_steps_per_second': 1.777}
trainer.train()
The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`, you can safely ignore this message.
***** Running training *****
  Num examples = 32000
  Num Epochs = 2
  Instantaneous batch size per device = 5
  Total train batch size (w. parallel, distributed & accumulation) = 5
  Gradient Accumulation steps = 1
  Total optimization steps = 12800
  Number of trainable parameters = 81912576
| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.052200 | 2.908587 |
| 2 | 2.541700 | 2.880669 |
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`, you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
Saving model checkpoint to ./gpt2_movie\checkpoint-6400
Configuration saved in ./gpt2_movie\checkpoint-6400\config.json
Configuration saved in ./gpt2_movie\checkpoint-6400\generation_config.json
Model weights saved in ./gpt2_movie\checkpoint-6400\pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`, you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
Saving model checkpoint to ./gpt2_movie\checkpoint-12800
Configuration saved in ./gpt2_movie\checkpoint-12800\config.json
Configuration saved in ./gpt2_movie\checkpoint-12800\generation_config.json
Model weights saved in ./gpt2_movie\checkpoint-12800\pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from ./gpt2_movie\checkpoint-12800 (score: 2.880668878555298).
TrainOutput(global_step=12800, training_loss=2.947077699005604, metrics={'train_runtime': 42764.801, 'train_samples_per_second': 1.497, 'train_steps_per_second': 0.299, 'total_flos': 1089533449912320.0, 'train_loss': 2.947077699005604, 'epoch': 2.0})
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`, you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 5
{'eval_loss': 2.880668878555298, 'eval_runtime': 701.413, 'eval_samples_per_second': 11.406, 'eval_steps_per_second': 2.281, 'epoch': 2.0}
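Since eval_loss is the mean token-level cross-entropy, its exponential is the model's perplexity on the held-out split, which is easier to interpret:

import math

# Perplexity before and after fine-tuning, from the eval losses reported above
print(f'before fine-tuning: {math.exp(4.7858):.1f}')  # ~119.8
print(f'after fine-tuning:  {math.exp(2.8807):.1f}')  # ~17.8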
trainer.save_model()
Saving model checkpoint to ./gpt2_movie
Configuration saved in ./gpt2_movie\config.json
Configuration saved in ./gpt2_movie\generation_config.json
Model weights saved in ./gpt2_movie\pytorch_model.bin
OpenAI Model
Few-shot Learning
# "OPENAI_API_KEY": notice, OpenAL is not a open soourse model like downloading from Huggingface
# we should have API_KEY that OpenAI send yo us that is persona API key.
openai.api_key = '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'
## Here is the list of engines that OpenAI offers
#openai.Engine.list().data

# Look at models that have either 'text' or 'gpt' in their id
[e for e in openai.Engine.list().data if 'text' in e.id or 'gpt' in e.id][:5]
[<Engine engine id=text-search-babbage-doc-001 at 0x1dd2b60dae0> JSON: {
   "object": "engine",
   "id": "text-search-babbage-doc-001",
   "ready": true,
   "owner": "openai-dev",
   "permissions": null,
   "created": null
 },
 <Engine engine id=gpt-4-0613 at 0x1dd2b60df40> JSON: {
   "object": "engine",
   "id": "gpt-4-0613",
   "ready": true,
   "owner": "openai",
   "permissions": null,
   "created": null
 },
 <Engine engine id=gpt-4 at 0x1dd2b4ce770> JSON: {
   "object": "engine",
   "id": "gpt-4",
   "ready": true,
   "owner": "openai",
   "permissions": null,
   "created": null
 },
 <Engine engine id=text-davinci-003 at 0x1dd2b4ce860> JSON: {
   "object": "engine",
   "id": "text-davinci-003",
   "ready": true,
   "owner": "openai-internal",
   "permissions": null,
   "created": null
 },
 <Engine engine id=text-search-babbage-query-001 at 0x1dd2b4ce900> JSON: {
   "object": "engine",
   "id": "text-search-babbage-query-001",
   "ready": true,
   "owner": "openai-dev",
   "permissions": null,
   "created": null
 }]
fewshot = 30
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'
for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
    '\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
    '\ngenre:' + ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'
i_ex=0
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[i_ex] +\
'\nsynopsis:' + ' ' + Hold_out['synopsis'].astype(str).iloc[i_ex]+\
'\ngenre: ?' + '\n\n'
genre = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": prompt}
],
temperature=0.6,
max_tokens=20,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
# Print the completion
print(genre['choices'][0]["message"]["content"])
Based on the provided examples, it is difficult to predict the genre for the movie "The Pot Car
print(f'The Actual genre is ')
Hold_out[['movie_name','synopsis','genre']].loc[[i_ex]]
The Actual genre is
|   | movie_name | synopsis | genre |
|---|------------|----------|-------|
| 0 | The Pot Carriers | James Rainbow is a first time offender sentenced to 12 months imprisonment. Redband, a trusty and Mouse, a diminutive housebreaker take him under their wing. | crime |
Predict Movie Genre
# Load fine-tuned model
loaded_model_gpt2_movie_name = GPT2LMHeadModel.from_pretrained('./gpt2_movie')
# Make generator pipeline
generator_gpt2_movie_name = pipeline('text-generation', model=loaded_model_gpt2_movie_name, tokenizer=tokenizer)
loading configuration file ./gpt2_movie\config.json
Model config GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "use_cache": true,
  "vocab_size": 50257
}
loading weights file ./gpt2_movie\pytorch_model.bin
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}
All model checkpoint weights were used when initializing GPT2LMHeadModel.
All the weights of GPT2LMHeadModel were initialized from the model checkpoint at ./gpt2_movie.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
loading configuration file ./gpt2_movie\generation_config.json
Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}
Hold-out 1
Fine-tuned distilgpt2
ir = 1
movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]
print(f'movie_name: {movie_name}')
print(f'synopsis: {synopsis}')
print(f'genre: {genre}')
num_tokens = len(tokenizer(synopsis)['input_ids'])
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery
genre_text_sample = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
'\nsynopsis:' + ' ' + synopsis+\
'\ngenre:'
print(f'prompt is:\n\n{genre_text_sample}')
prompt is:

genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre:
# Note: max_length counts the prompt plus the generated tokens, and num_tokens here is the
# synopsis length only, which is shorter than the full prompt (hence the warning below);
# max_new_tokens would be the cleaner choice.
for generated_text in generator_gpt2_movie_name(genre_text_sample, top_k=1, temperature=1,
                                                num_return_sequences=3, max_length=num_tokens + 1):
    print(generated_text['generated_text']+'\n')
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 49, but `max_length` is set to 24. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery

genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery

genre Task
movie_name: Anthony Blake El Sol Es El Cerebro De La Tierra
synopsis: A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics.
genre: mystery
top_k: The top_k parameter, also known as the "top-k sampling" strategy, is used to limit the number of words considered during the sampling process. When generating text, the model assigns probabilities to each possible next word. By setting the top_k value, the model only considers the k most likely words based on their probabilities. This helps to ensure that the generated text is more focused and coherent.
For example, if top_k is set to 10, the model will only consider the top 10 words with the highest probabilities for each word position, discarding the rest. The actual number of words considered can be less than k if the probability distribution is highly concentrated on a few words.
temperature: The temperature parameter controls the randomness of the generated text. It adjusts the probability distribution during sampling. A higher temperature value, such as 1.0, increases the randomness and diversity of the output. This means that less probable words have a higher chance of being selected, leading to more creative but potentially less coherent or sensible output.
On the other hand, a lower temperature value, such as 0.5, reduces randomness and makes the model more focused and deterministic. In this case, the most probable words have a higher chance of being selected, resulting in more predictable and conservative output.
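To see the two parameters in action on the prompt above, we can compare a near-greedy configuration with a more exploratory one (illustrative only; sampled outputs vary from run to run):

# top_k=1 is effectively greedy decoding: always the single most likely next token,
# which is why the three sequences above are identical.
deterministic = generator_gpt2_movie_name(genre_text_sample, top_k=1, temperature=1.0,
                                          max_new_tokens=2)
# A larger top_k and a higher temperature admit less likely tokens, so repeated
# calls may return different genres for borderline synopses.
exploratory = generator_gpt2_movie_name(genre_text_sample, top_k=50, temperature=1.2,
                                        max_new_tokens=2)
print(deterministic[0]['generated_text'])
print(exploratory[0]['generated_text'])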
OpenAI
fewshot = 50
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'
for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
    '\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
    '\ngenre:' + ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[ir] +\
'\nsynopsis:' + ' ' + Hold_out['synopsis'].astype(str).iloc[ir]+\
'\ngenre: ?' + '\n\n'
genre = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": prompt}
],
temperature=0.6,
max_tokens=20,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
# Print the completion
print(genre['choices'][0]["message"]["content"])
Based on the provided examples, it is difficult to predict the genre for the movie "Anthony Blake El
print(f'The Actual genre is ')
Hold_out[['movie_name','synopsis','genre']].loc[[ir]]
The Actual genre is
|   | movie_name | synopsis | genre |
|---|------------|----------|-------|
| 1 | Anthony Blake El Sol Es El Cerebro De La Tierra | A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics. | mystery |
Few-shot OpenAI cannot predict the genre for this movie. Let's change the prompt:
gnrs = ', '.join(list(val_obj.value_counts().keys()))
prompt = f'For this movie synopsis \n "{Hold_out.synopsis.loc[ir]}" \n which genre out of {gnrs} can be assigned'
print(prompt)
For this movie synopsis "A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics." which genre out of horror, action, scifi, crime, mystery, adventure, thriller, fantasy, romance, family can be assigned
genre = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": prompt}
],
temperature=0.6,
max_tokens=20,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
# Print the completion
print(genre['choices'][0]["message"]["content"])
Based on the given synopsis, it is difficult to assign a specific genre. However, the closest genre
Hold-out 2
Fine-tuned distilgpt2
ir = 2
movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]
print(f'movie_name: {movie_name}')
print(f'synopsis: {synopsis}')
print(f'genre: {genre}')
num_tokens = len(tokenizer(synopsis)['input_ids'])
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: crime
genre_text_sample = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
'\nsynopsis:' + ' ' + synopsis+\
'\ngenre:'
print(f'prompt is:\n\n{genre_text_sample}')
prompt is:

genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre:
for generated_text in generator_gpt2_movie_name(genre_text_sample, top_k=1, temperature=1,
                                                num_return_sequences=3, max_length=num_tokens + 1):
    print(generated_text['generated_text']+'\n')
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 55, but `max_length` is set to 36. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: action

genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: action

genre Task
movie_name: Yakuza bangaichi
synopsis: An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be.
genre: action
OpenAI
fewshot = 50
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'
for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
    '\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
    '\ngenre:' + ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[ir] +\
'\nsynopsis:' + ' ' + Hold_out['synopsis'].astype(str).iloc[ir]+\
'\ngenre: ?' + '\n\n'
genre = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": prompt}
],
temperature=0.6,
max_tokens=20,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
# Print the completion
print(genre['choices'][0]["message"]["content"])
Based on the examples provided, it is difficult to predict the genre for the movie "Yakuza
print(f'The Actual genre is ')
Hold_out[['movie_name','synopsis','genre']].loc[[ir]]
The Actual genre is
|   | movie_name | synopsis | genre |
|---|------------|----------|-------|
| 2 | Yakuza bangaichi | An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be. | crime |
Let's change the prompt:
gnrs = ', '.join(list(val_obj.value_counts().keys()))
prompt = f'For this movie synopsis \n "{Hold_out.synopsis.loc[ir]}" \n which genre out of {gnrs} can be assigned'
print(prompt)
For this movie synopsis "An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be." which genre out of horror, action, scifi, crime, mystery, adventure, thriller, fantasy, romance, family can be assigned
genre = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": prompt}
],
temperature=0.6,
max_tokens=20,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
# Print the completion
print(genre['choices'][0]["message"]["content"])
Based on the given movie synopsis, the genre that can be assigned is crime.
Predict the Whole Hold-out Set
movie_name_ = []
synopsis_ = []
genre_ = []
genre_pred_ = []
df = pd.DataFrame()
# Only predict the first 100 of the 4,000 hold-out samples
for ir in range(100):
    movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]
    movie_name_.append(movie_name)
    synopsis_.append(synopsis)
    genre_.append(genre)
    genre_text = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
                 '\nsynopsis:' + ' ' + synopsis+\
                 '\ngenre:'
    # As in the single examples above, allow one extra token beyond the synopsis length
    num_tokens = len(tokenizer(synopsis)['input_ids'])
    genre_pred = generator_gpt2_movie_name(genre_text, top_k=1, temperature=1,
                                           max_length=num_tokens + 1)[0]['generated_text']
    # Capture the text after the 'genre:' marker up to the end of the line
    match = re.search(r'genre:\s*([^\n]+)', genre_pred)
    genre_pred_.append(match.group(1) if match else '')
df['movie_name'] = movie_name_
df['synopsis'] = synopsis_
df['actual_genre'] = genre_
df['predict_genre'] = genre_pred_
df[:4]
|   | movie_name | synopsis | actual_genre | predict_genre |
|---|------------|----------|--------------|---------------|
| 0 | The Pot Carriers | James Rainbow is a first time offender sentenced to 12 months imprisonment. Redband, a trusty and Mouse, a diminutive housebreaker take him under their wing. | crime | crime |
| 1 | Anthony Blake El Sol Es El Cerebro De La Tierra | A boomer's reflections on mirrors, money, doors, the sun, the brain and several other paranormal topics. | mystery | mystery |
| 2 | Yakuza bangaichi | An extremely fast-moving, entertaining example of just exactly what a gendai (modern) or jitsuroku (true account) yakuza movie should be. | crime | action |
| 3 | Rio 2096: A Story of Love and Fury | Revolves around the love between an immortal hero and Janaína, the woman he has been in love with for 600 years, through Brazil's colonization, slavery, military regime and the future, in 2096, in the midst of wars for water. | romance | romance |
## Extract actual and predicted values
#y_true = df['actual_genre']
#y_pred = df['predict_genre']
#
## Calculate accuracy
#accuracy = accuracy_score(y_true, y_pred)
#
#print("Accuracy:", accuracy)