GPT is pre-trained on BookCorpus, a text corpus of roughly 4.5 GB drawn from about 7,000 unpublished books across many genres. During pre-training the model learns to predict the next word in a sequence from the words that precede it, a task known as language modeling, which teaches it the structure and patterns of natural language. After pre-training, GPT can be fine-tuned on a smaller, task-specific dataset. Fine-tuning adjusts the model's parameters for a particular task such as classification, similarity scoring, or question answering.
OpenAI has since extended the GPT architecture with GPT-2, GPT-3, GPT-3.5, and GPT-4. These later models are larger and trained on larger datasets, which lets them generate longer and more coherent text, and they have been widely adopted by researchers and industry practitioners for natural language processing tasks. Of these successors, only GPT-2 has freely downloadable weights on HuggingFace. In this notebook, the distilgpt2 model (a smaller, more efficient version of GPT-2) is fine-tuned to predict movie genres.
Python functions and data files needed to run this notebook are available via this link.
GPT stands for Generative Pre-trained Transformer:
Generative: the model is an auto-regressive language model. It predicts each token from one side of the context only, the past (left) context.
Pre-trained: the decoder is first trained on a huge corpus of unlabeled text.
Transformer: the decoder is taken from the transformer architecture.
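To make the "generative" part concrete, here is a minimal sketch of auto-regressive next-word prediction. It uses a toy bigram count model in place of a transformer decoder, so it only illustrates the left-to-right principle. The corpus and the function names (train_bigram, next_word_probs, generate) are illustrative assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the follower counts of one context word into probabilities."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

def generate(counts, start, n_words):
    """Greedy left-to-right generation: append the most frequent follower each step."""
    out = [start]
    for _ in range(n_words):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word never appeared as a context
        out.append(followers.most_common(1)[0][0])
    return out

# Toy corpus (made up for illustration)
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
generated = generate(model, "the", 3)
print(probs)
print(generated)
```

A GPT model follows the same loop, except that the probabilities come from a transformer decoder conditioned on the whole preceding context rather than on a single previous word.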
GPT refers to a family of models:
OpenAI's public release of the latest GPT-3 and ChatGPT models has democratized access to powerful autoregressive language models. Now, individuals can harness these models for various applications. However, there are certain limitations:
These models lack the ability to take explicit direction; instead, their focus lies in completing sentences in a consistent style. While few-shot learning has proven effective, the models struggle when asked direct questions with a demand for specific answers. In response to these challenges, OpenAI introduced an updated iteration of GPT-3 in January 2022 called InstructGPT. This enhanced version exhibits improvements in various aspects, including a reduction in harmful biases and the capability to follow directions from prompts, providing answers without fixating on completing the given thought.
How InstructGPT is trained:
Prompt Engineering
Prompt Engineering is the process of designing input prompts for language-model systems such as GPT-3 and ChatGPT:
We can influence the output produced by the model to get something more specific and usable by carefully crafting and adjusting prompts.
Prompt Engineering can be used to guide the model to produce more relevant and coherent output for a given task.
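As a rough illustration of crafting a prompt, the sketch below assembles an instruction, an optional list of allowed answers, a few worked examples, and the final query into one string. The helper name build_prompt, its parameters, and the example synopses are hypothetical and only show the idea.

```python
def build_prompt(task, examples, query, labels=None):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = [task]
    if labels:
        # Listing the allowed answers narrows what the model may reply with
        parts.append("Allowed answers: " + ", ".join(labels))
    for i, (text, answer) in enumerate(examples, start=1):
        parts.append(f"Example {i}\nsynopsis: {text}\ngenre: {answer}")
    # End on an open 'genre:' field so the model completes the label
    parts.append(f"synopsis: {query}\ngenre:")
    return "\n\n".join(parts)

# Made-up examples for illustration only
prompt = build_prompt(
    task="Classify the movie synopsis into one genre.",
    examples=[("A detective hunts a serial killer.", "crime"),
              ("Two strangers fall in love in Paris.", "romance")],
    query="A crew explores a distant planet.",
    labels=["crime", "romance", "scifi"],
)
print(prompt)
```

Keeping the prompt layout in one function makes it easy to vary a single element, such as the instruction or the number of examples, and compare the model's answers.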
Alignment in Language Models
To understand why prompt engineering is crucial to LLM-application development, we first have to understand not only how LLMs are trained, but also how they are aligned to human input. Alignment in language models refers to how well the model understands and responds to input prompts in a way that is "in line with" what the user expected (at least according to the people in charge of aligning the LLM). In standard language modeling, a model is trained to predict the next word or sequence of words based on the context of the preceding words. However, this approach alone does not allow for specific instructions or prompts to be answered by the model, which can limit its usefulness for certain applications.
Prompt engineering can be challenging if the language model has not been aligned with the prompts, as it may generate irrelevant or incorrect responses. However, some language models have been developed with extra alignment features, such as Constitutional AI-driven Reinforcement Learning from AI Feedback (RLAIF) from Anthropic or Reinforcement Learning from Human Feedback (RLHF) in OpenAI's GPT series, which incorporate explicit instructions and feedback into the model's training. These alignment techniques can improve the model's ability to understand and respond to specific prompts, making it more useful for applications such as question answering or language translation.
The data set is movie-genre-prediction, retrieved from Kaggle.
Objective of data set
The goal of this competition is to design a predictive model that accurately classifies movies into their respective genres based on their titles and synopses. The challenge lies not just in achieving high accuracy, but also in ensuring that the model is efficient and interpretable.
Why This is Interesting and Relevant
Understanding movie genres based on titles and synopses is a fascinating problem for multiple reasons. From a recommendation system perspective, an effective genre classifier can help build more personalized user recommendations, increasing user engagement on streaming platforms. In the context of box office performance, understanding the relationship between genres and how they are perceived in synopses can provide insight into patterns of commercial success or failure. Furthermore, this challenge can facilitate a deeper comprehension of movie themes and trends in the industry, contributing to cultural and societal studies.
Participants will be provided with a comprehensive dataset comprising ~100,000 movies. Each entry includes the original title, the genre(s), and the synopsis of the movie.
The dataset contains a mix of both original and AI-generated titles, genres, and synopses to test the robustness of the models.
The 10 genres include action, adventure, crime, family, fantasy, horror, mystery, romance, scifi, and thriller.
import pandas as pd

data_movie_genre = pd.read_csv('movie_genre.csv')
# The full data set is large for fine-tuning, so we keep the first 40,000 rows for training
n_sample = 40000
movie_genre = data_movie_genre[:n_sample]
# Hold out the next 10% of n_sample (4,000 rows) that are not in the training set
Hold_out = data_movie_genre[~data_movie_genre.index.isin(movie_genre.index)][:int(n_sample*0.1)].reset_index(drop=True)
movie_genre = movie_genre.reset_index(drop=True)
print(f'Shape of training set is {movie_genre.shape}')
print(f'Shape of holdout set is {Hold_out.shape}')
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnchoredText

def histplt (val: list,bins: int,title: str,xlabl: str,ylabl: str,xlimt: list,
             ylimt: list=False, loc: int =1,legend: int=1,axt=None,days: int=False,
             class_: int=False,scale: int=1,x_tick: list=False, calc_perc: bool= True,
             nsplit: int=1,font: int=5,color: str='b') -> [float] :
    """ Histogram including important statistics """
    ax1 = axt or plt.axes()
    font = {'size'   : font }
    plt.rc('font', **font)

    # Missing-value statistics are computed before the NaNs are dropped
    miss_n = len(val[np.isnan(val)])
    tot = len(val)
    n_distinct = len(np.unique(val))
    miss_p = (miss_n/tot)*100
    val = val[~np.isnan(val)]
    val = np.array(val)
    # The weights turn bar heights into fractions of the sample
    plt.hist(val, bins=bins, weights=np.ones(len(val)) / len(val),ec='black',color=color)
    n_nonmis = len(val[~np.isnan(val)])
    # Binary (class) variables are reported as percentages
    if class_:
        times = 100
    else:
        times = 1
    Mean = np.nanmean(val)*times
    Median = np.nanmedian(val)*times
    sd = np.sqrt(np.nanvar(val))
    Max = np.nanmax(val)
    Min = np.nanmin(val)
    p1 = np.quantile(val, 0.01)
    p25 = np.quantile(val, 0.25)
    p75 = np.quantile(val, 0.75)
    p99 = np.quantile(val, 0.99)

    # Statistics box, with or without the 1st and 99th percentiles
    if calc_perc == True:
        txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\np1%%=%0.1f\np99%%=%0.1f\nMin=%0.1f\nMax=%0.1f'
        anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,p1,p99,Min,Max), borderpad=0,
                                     loc=loc,prop={ 'size': font['size']*scale})
    else:
        txt = 'n (not missing)=%.0f\nn_distinct=%.0f\nMissing=%.1f%%\nMean=%0.2f\nσ=%0.1f\nMin=%0.1f\nMax=%0.1f'
        anchored_text = AnchoredText(txt %(n_nonmis,n_distinct,miss_p,Mean,sd,Min,Max), borderpad=0,
                                     loc=loc,prop={ 'size': font['size']*scale})
    if(legend==1): ax1.add_artist(anchored_text)

    if (scale): plt.title(title,fontsize=font['size']*(scale+0.15))
    else:       plt.title(title)
    # The try/except NameError guards of the source lost their bodies in extraction.
    # Function arguments are always defined, so plain truthiness checks are used instead.
    if xlabl:
        if (scale): ax1.set_xlabel(xlabl,fontsize=font['size']*scale)
        else:       ax1.set_xlabel(xlabl)
    if ylabl:
        if (scale): plt.ylabel(ylabl,fontsize=font['size']*scale)
        else:       plt.ylabel(ylabl)
    if (class_==True): plt.xticks([0,1])
    if xlimt: plt.xlim(xlimt)
    if ylimt: plt.ylim(ylimt)
    if x_tick: plt.xticks(x_tick,fontsize=font['size']*scale)

    # Interquartile Range Method for outlier detection
    iqr = p75 - p25
    # calculate the outlier cutoff (computed for reference; not part of the return value)
    cut_off = np.array(iqr) * 1.5
    lower, upper = p25 - cut_off, p75 + cut_off
    return tot, n_nonmis, n_distinct, miss_n, miss_p, Mean, Median, sd, Max, Min, p1, p25, p75, p99, sd
from matplotlib.ticker import PercentFormatter

def bargraph(val_ob: [list], title: str, ylabel: str, titlefontsize: int=10, xfontsize: int=5,scale: int=1,
             yfontsize: int=8, select: bool= False, fontsizelable: bool= False, xshift: float=-0.1, nsim: int=False
             ,yshift: float=0.01,percent: bool=False, xlim: list=False, axt: bool=None, color: str='b',sort=True,
             ylim: list=False, y_rot: int=0, ytick_rot: int=90, graph_float: int=1,
             loc: int =1,legend: int=1) -> None:
    """ vertical bargraph """
    ax1 = axt or plt.axes()
    tot = len(val_ob)
    miss_p_ob = (len(val_ob[pd.isnull(val_ob)])/tot)*100
    n_nonmis_ob = len(val_ob[~pd.isnull(val_ob)])
    con = np.array(val_ob.value_counts())
    len_ = len(con)
    #if len_ > 10: len_ = 10
    cats = list(val_ob.value_counts().keys())
    val_ob = con[:len_]
    clmns = cats[:len_]
    # Sort counts
    if sort:
        sort_score = sorted(zip(val_ob,clmns), reverse=True)
        Clmns_sort = [sort_score[i][1] for i in range(len(clmns))]
        sort_score = [sort_score[i][0] for i in range(len(clmns))]
    else:
        Clmns_sort = clmns
        sort_score = val_ob
    # Optionally keep only the first `select` categories
    if (select):
        Clmns_sort=Clmns_sort[:select]
        sort_score=sort_score[:select]
    index1 = np.arange(len(Clmns_sort))
    # The lines that draw the bars, title, and y label were damaged in extraction
    # and are reconstructed here from the function's arguments
    ax1.bar(index1, sort_score, width=0.6, align='center', alpha=1, edgecolor='k', capsize=4,color=color)
    plt.title(title,fontsize=titlefontsize)
    ax1.set_ylabel(ylabel,fontsize=yfontsize)
    ax1.set_xticks(index1)
    ax1.set_xticklabels(Clmns_sort,fontsize=xfontsize, rotation=ytick_rot,y=0.02)
    if (percent): plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    ax1.xaxis.grid(color='k', linestyle='--', linewidth=0.2)
    if (xlim): plt.xlim(xlim)
    if (ylim): plt.ylim(ylim)
    # Optionally write each bar's value above it (the fontsize argument is reconstructed)
    if (fontsizelable):
        for ii in range(len(sort_score)):
            if (percent):
                plt.text(xshift+ii, sort_score[ii]+yshift,f'{"{0:.2f}".format(sort_score[ii]*100)}%',
                         fontsize=fontsizelable)
            else:
                plt.text(xshift+ii, sort_score[ii]+yshift,f'{np.round(sort_score[ii],graph_float)}',
                         fontsize=fontsizelable)
    # Map each category to its count (loop body reconstructed)
    dic_Clmns = {}
    for i in range(len(Clmns_sort)):
        dic_Clmns[Clmns_sort[i]] = sort_score[i]
    txt = 'n (not missing)=%.0f\nMissing=%.1f%%'
    anchored_text = AnchoredText(txt %(n_nonmis_ob,miss_p_ob), borderpad=0,
                                 loc=loc)
    if(legend==1): ax1.add_artist(anchored_text)
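from nltk.tokenize import TweetTokenizer

font = {'size'   : 12}
plt.rc('font', **font)
# plt.figure is used so that no unused full-size axes sit behind the two panels
fig = plt.figure(figsize=(12, 4), dpi= 100, facecolor='w', edgecolor='k')

# Left panel: number of movies per genre in the training set
ax1 = plt.subplot(1,2,1)
val_obj = movie_genre['genre']
bargraph (val_obj, title=f'Training set_Movie Genre Labels', ylabel='Counts',titlefontsize=15,
          legend=True,ytick_rot=25, y_rot=0, axt=ax1,loc=1, ylim=[0,5500])

# Right panel: histogram of the number of tokens in each movie name
ax2 = plt.subplot(1,2,2)
tt = TweetTokenizer()
val = movie_genre.apply(lambda x : len(tt.tokenize(x['movie_name'])), axis=1)
# This call was truncated in the source after days=False. The values of ylabl and xlimt
# and the use of axt=ax2 are assumptions that complete it.
_ = histplt (val,bins=50,title=f'Histogram of Length for Movie Genre',xlabl=None,days=False,
             ylabl=None,xlimt=False,axt=ax2)
plt.show()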
from wordcloud import WordCloud

# Join all genre labels into one string
text = data_movie_genre.genre.astype(str).to_list()
token_string = " ".join(text)
# Generate the word cloud with custom settings
wordcloud = WordCloud(
    background_color='gray',   # For a transparent background use background_color=None with mode='RGBA'
    max_words=100,             # Control the number of words
    max_font_size=150,         # Max font size for the largest words
    min_font_size=10,          # Min font size for smaller words
    random_state=42,           # Control the randomness of word placement
    contour_color='black',     # Contour settings only take effect when a mask is supplied
    contour_width=1.5          # Width of the contour
).generate(token_string)       # the closing call was lost in extraction and is restored here
# Plot the word cloud using matplotlib
plt.figure(figsize=(12, 6))    # Set a larger figure size for higher resolution
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")                # Turn off the axis for a clean look
plt.show()
DistilGPT2 (short for Distilled-GPT2) is an English-language model pre-trained with the supervision of the smallest version of Generative Pre-trained Transformer 2 (GPT-2). Like GPT-2, DistilGPT2 can be used to generate text, and the design, training, and limitations of GPT-2 should be kept in mind when using it.
Model Description
DistilGPT2 has 82 million parameters, compared with 124 million for the GPT-2 version that supervised its training. It was developed using knowledge distillation and was designed to be a faster, lighter version of GPT-2, the language model developed by OpenAI. The "distil" in its name stands for "distillation," indicating that it is a distilled or compressed version of the original model.
Distillation involves training a smaller model (the student) to mimic the behavior of a larger model (the teacher). distilgpt2 is trained to replicate the capabilities of the larger GPT-2 model with fewer parameters, making it more lightweight and suitable for environments with limited computational resources.
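The student-teacher idea can be sketched with a generic soft-target loss: the student is penalized by the KL divergence between the teacher's and its own temperature-softened output distributions. This is a minimal illustration of that one term, not the actual DistilGPT2 training recipe, which may combine it with other losses. The function names, logits, and temperature below are assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_teacher, p_student))
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures
    return kl * temperature ** 2

# Made-up next-token logits over a three-word vocabulary
teacher = [4.0, 1.0, 0.5]
close_student = [3.6, 0.9, 0.45]   # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]      # prefers a different token

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close, loss_far)
```

A student that ranks tokens the way the teacher does gets a small loss, while one that prefers different tokens gets a large loss, which is what pushes the smaller model toward the teacher's behavior.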
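from transformers import GPT2Tokenizer
from datasets import Dataset

MODEL = 'distilgpt2'
# Load tokenizer; GPT-2 has no padding token, so the end-of-sequence token is reused
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
# Add prompt for task
genre_PROMPT = 'genre Task'
synopsis_TOKEN = '\nsynopsis:'
movie_name_TOKEN = '\nmovie_name:'
# Each training example is one text: task prompt, movie name, synopsis, then the genre label
movie_genre['genre_text'] = f'{genre_PROMPT}\nmovie_name: ' + movie_genre['movie_name'] +\
                            '\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str)+\
                            '\ngenre:' + ' ' + movie_genre['genre'].astype(str)
training_examples = movie_genre['genre_text']
multi_task_df = pd.DataFrame({'text': training_examples})
data = Dataset.from_pandas(multi_task_df)
def preprocess(examples):
    # Tokenize, truncating texts that exceed the model's maximum input length
    return tokenizer(examples['text'], truncation=True)
data = data.map(preprocess, batched=True)
# 80% of the examples for training, 20% for evaluation
data = data.train_test_split(train_size=.8)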
from transformers import (GPT2LMHeadModel, DataCollatorForLanguageModeling,
                          TrainingArguments, Trainer)

# Load pretrained model
model = GPT2LMHeadModel.from_pretrained(MODEL)
# mlm=False selects causal (next-token) language modeling instead of masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
epochs = 2
batch_size = 5
training_args = TrainingArguments(
    output_dir="./gpt2_movie",                  # The output directory
    overwrite_output_dir=True,                  # overwrite the content of the output directory
    num_train_epochs=epochs,                    # number of training epochs
    per_device_train_batch_size=batch_size,     # batch size for training
    per_device_eval_batch_size=batch_size,      # batch size for evaluation
    evaluation_strategy='epoch',                # "steps" or "epoch"; here we evaluate after each epoch
                                                # (newer transformers releases name this eval_strategy)
    save_strategy='epoch'                       # save a checkpoint of the model after each epoch
)
trainer = Trainer(
    model=model,                    # the distilgpt2 model loaded above
    args=training_args,             # the arguments set above
    train_dataset=data['train'],    # training part of the dataset
    eval_dataset=data['test'],      # test (evaluation) part of the dataset
    data_collator=data_collator     # data collator that pads each batch and builds the labels
)
# The training and saving calls are not shown in the source. They are reconstructed here
# because the fine-tuned model is later loaded from './gpt2_movie'.
trainer.train()
trainer.save_model()
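import os
import openai

# "OPENAI_API_KEY": note that OpenAI models are not open source and cannot be downloaded
# from HuggingFace. Calling them requires a personal API key issued by OpenAI.
openai.api_key = os.environ.get('OPENAI_API_KEY')  # assumes the key is stored in this environment variable
# List the engines OpenAI offers whose id contains either 'text' or 'gpt'
# (openai.Engine and openai.ChatCompletion belong to the legacy pre-1.0 openai package)
[e for e in openai.Engine.list().data if 'text' in e.id or 'gpt' in e.id][:5]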
fewshot = 30
i_ex = 0  # hold-out row to classify; the value is assumed because the source does not define i_ex
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'
# Append the first `fewshot` training rows as worked examples
for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
                      '\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
                      '\ngenre:' + ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[i_ex] +\
                  '\nsynopsis:' + ' ' + Hold_out['synopsis'].astype(str).iloc[i_ex]+\
                  '\ngenre: ?' + '\n\n'
genre = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',  # assumed; the source does not show which chat model was used
    messages=[
        {"role": "user", "content": prompt}
    ]
)
# Get the completion and compare it with the true label
print(genre['choices'][0]['message']['content'])
print(f'The Actual genre is {Hold_out["genre"].iloc[i_ex]}')
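from transformers import GPT2LMHeadModel, pipeline

# Load fine-tuned model
loaded_model_gpt2_movie_name = GPT2LMHeadModel.from_pretrained('./gpt2_movie')
# Make generator pipeline
generator_gpt2_movie_name = pipeline('text-generation', model=loaded_model_gpt2_movie_name, tokenizer=tokenizer)
ir = 1
movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]
print(f'movie_name: {movie_name}')
print(f'synopsis: {synopsis}')
print(f'genre: {genre}')
# Same layout as the training text, ending at 'genre:' so that the model completes the label
genre_text_sample = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
                    '\nsynopsis:' + ' ' + synopsis+\
                    '\ngenre:'
# Count the tokens of the full prompt (not only the synopsis), because max_length
# bounds the prompt plus the generated tokens
num_tokens = len(tokenizer(genre_text_sample)['input_ids'])
print(f'prompt is:\n\n{genre_text_sample}')
# max_length=num_tokens + 1 allows one new token; raise it if a genre label spans several tokens
for generated_text in generator_gpt2_movie_name(genre_text_sample, top_k=1, temperature=1,
                                num_return_sequences=3, max_length=num_tokens + 1):
    print(generated_text['generated_text'])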
top_k: The top_k parameter ("top-k sampling") limits the number of tokens considered at each sampling step. When generating text, the model assigns a probability to every possible next token. With top_k set, only the k most likely tokens are kept and the rest are discarded before sampling, which helps keep the generated text focused and coherent.
For example, if top_k is set to 10, the model samples each next token from the 10 most probable candidates only. With top_k=1, as in the calls in this notebook, only the single most likely token remains, so generation is effectively greedy and the returned sequences should be identical.
temperature: The temperature parameter controls the randomness of the generated text by rescaling the probability distribution before sampling. A temperature of 1.0 leaves the model's distribution unchanged. Values above 1.0 flatten it, so less probable tokens are selected more often, leading to more diverse but potentially less coherent or sensible output.
On the other hand, a lower temperature value, such as 0.5, sharpens the distribution and makes the model more focused and deterministic. In this case, the most probable tokens have a higher chance of being selected, resulting in more predictable and conservative output.
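The interaction of the two parameters can be sketched on a toy distribution. The code below is a simplified stand-in for what a generation library does internally: it rescales the logits by the temperature, keeps the top_k highest, renormalizes, and samples. The function names and the logits are made up for illustration.

```python
import math
import random

def top_k_temperature_probs(logits, top_k, temperature):
    """Return {token_index: probability} after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    # Indices of the top_k largest scaled logits
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in keep)  # subtract the maximum for numerical stability
    exps = {i: math.exp(scaled[i] - m) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sample(logits, top_k, temperature, rng):
    """Draw one token index from the filtered, renormalized distribution."""
    probs = top_k_temperature_probs(logits, top_k, temperature)
    indices = list(probs)
    return rng.choices(indices, weights=[probs[i] for i in indices], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for a four-token vocabulary
rng = random.Random(0)

p_t1 = top_k_temperature_probs(logits, top_k=2, temperature=1.0)
p_t05 = top_k_temperature_probs(logits, top_k=2, temperature=0.5)
greedy_draws = [sample(logits, top_k=1, temperature=1.0, rng=rng) for _ in range(5)]
print(p_t1)
print(p_t05)
print(greedy_draws)
```

Lowering the temperature moves probability mass toward the top token, and with top_k=1 every draw returns the same token regardless of temperature.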
fewshot = 50
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'
# Append the first `fewshot` training rows as worked examples
for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
                      '\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
                      '\ngenre:' + ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[ir] +\
                  '\nsynopsis:' + ' ' + Hold_out['synopsis'].astype(str).iloc[ir]+\
                  '\ngenre: ?' + '\n\n'
genre = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',  # assumed; the source does not show which chat model was used
    messages=[
        {"role": "user", "content": prompt}
    ]
)
# Get the completion and compare it with the true label
print(genre['choices'][0]['message']['content'])
print(f'The Actual genre is {Hold_out["genre"].iloc[ir]}')
The few-shot prompt does not yield a usable genre for this movie. Let's change the prompt:
gnrs = ', '.join(list(val_obj.value_counts().keys()))
prompt = f'For this movie synopsis \n "{Hold_out.synopsis.loc[ir]}" \n which genre out of {gnrs} can be assigned'
genre = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',  # assumed; the source does not show which chat model was used
    messages=[
        {"role": "user", "content": prompt}
    ]
)
# Get the completion
print(genre['choices'][0]['message']['content'])
ir = 2
movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]
print(f'movie_name: {movie_name}')
print(f'synopsis: {synopsis}')
print(f'genre: {genre}')
# Same layout as the training text, ending at 'genre:' so that the model completes the label
genre_text_sample = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
                    '\nsynopsis:' + ' ' + synopsis+\
                    '\ngenre:'
# Count the tokens of the full prompt, because max_length bounds prompt plus generated tokens
num_tokens = len(tokenizer(genre_text_sample)['input_ids'])
print(f'prompt is:\n\n{genre_text_sample}')
for generated_text in generator_gpt2_movie_name(genre_text_sample, top_k=1, temperature=1,
                                num_return_sequences=3, max_length=num_tokens + 1):
    print(generated_text['generated_text'])
fewshot = 50
prompt = f'See below for {fewshot} examples of "movie name", "synopsis", "genre"\n\n'
# Append the first `fewshot` training rows as worked examples
for ii in range(fewshot):
    prompt = prompt + f'Example#{ii+1}\n movie name: ' + movie_genre['movie_name'].iloc[ii] +\
                      '\nsynopsis:' + ' ' + movie_genre['synopsis'].astype(str).iloc[ii]+\
                      '\ngenre:' + ' ' + movie_genre['genre'].astype(str).iloc[ii]+'\n\n'
prompt = prompt + f'Based on learning from these {fewshot} examples, please predict genre for this example:'
prompt = prompt + f'\n movie name: ' + Hold_out['movie_name'].iloc[ir] +\
                  '\nsynopsis:' + ' ' + Hold_out['synopsis'].astype(str).iloc[ir]+\
                  '\ngenre: ?' + '\n\n'
genre = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',  # assumed; the source does not show which chat model was used
    messages=[
        {"role": "user", "content": prompt}
    ]
)
# Get the completion and compare it with the true label
print(genre['choices'][0]['message']['content'])
print(f'The Actual genre is {Hold_out["genre"].iloc[ir]}')
Let's change the prompt:
gnrs = ', '.join(list(val_obj.value_counts().keys()))
prompt = f'For this movie synopsis \n "{Hold_out.synopsis.loc[ir]}" \n which genre out of {gnrs} can be assigned'
genre = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',  # assumed; the source does not show which chat model was used
    messages=[
        {"role": "user", "content": prompt}
    ]
)
# Get the completion
print(genre['choices'][0]['message']['content'])
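import re

movie_name_ = []
synopsis_ = []
genre_ = []
genre_pred_ = []
df = pd.DataFrame()
# Predict only the first 100 of the 4,000 hold-out rows
for ir in range(100):
    movie_name, synopsis, genre = Hold_out[['movie_name','synopsis','genre']].loc[ir]
    genre_text = f'{genre_PROMPT}\nmovie_name: ' + movie_name +\
                 '\nsynopsis:' + ' ' + synopsis+\
                 '\ngenre:'
    # max_length bounds prompt plus generated tokens, so count the full prompt
    num_tokens = len(tokenizer(genre_text)['input_ids'])
    # The generator arguments were fused in the source; one returned sequence is assumed
    genre_pred = generator_gpt2_movie_name(genre_text, top_k=1, temperature=1,
                     num_return_sequences=1, max_length=num_tokens + 1)[0]['generated_text']
    # Pull out the text that follows 'genre:' up to the end of that line
    match = re.search(r'genre:\s*([^\n]+)', genre_pred)
    movie_name_.append(movie_name)
    synopsis_.append(synopsis)
    genre_.append(genre)
    genre_pred_.append(match.group(1).strip() if match else None)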
df['movie_name'] = movie_name_
df['synopsis'] = synopsis_
df['actual_genre'] = genre_
df['predict_genre'] = genre_pred_
## Extract actual and predicted values
#y_true = df['actual_genre']
#y_pred = df['predict_genre']
## Calculate accuracy
#accuracy = accuracy_score(y_true, y_pred)
#print("Accuracy:", accuracy)
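The commented-out lines above point at an accuracy check on the predicted genres. A minimal dependency-free sketch of such a check follows. It normalizes case, whitespace, and trailing punctuation before comparing, since generated text may not match the label string exactly. The helper names and the sample labels are hypothetical, and in the notebook the two lists would come from df['actual_genre'] and df['predict_genre'].

```python
def normalize(label):
    """Lower-case a label and strip surrounding whitespace and punctuation."""
    return str(label).strip().strip(".,!").lower()

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label after normalization."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    hits = sum(normalize(t) == normalize(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Made-up labels for illustration only
y_true = ["horror", "romance", "scifi", "crime"]
y_pred = [" Horror", "romance", "fantasy", "crime."]
acc = accuracy(y_true, y_pred)
print("Accuracy:", acc)
```

A prediction of None (no genre could be extracted) simply counts as a miss here, because its normalized form does not equal any genre label.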