Summary
In this notebook, we implement image captioning with a vision transformer. Image captioning associates a textual description with an image, providing an account of its contents: an image is converted into a written narrative, establishing a connection between the domains of vision (image) and language (text). We showcase how a pretrained Vision Transformer (ViT) can perform this task, using the PyTorch backend as our primary technology. The goal is to demonstrate fine-tuning a ViT to generate image captions without retraining from scratch.
The notebook for this analysis is available on my GitHub page.
from transformers import ViTModel
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, AutoFeatureExtractor, \
AutoTokenizer, TrainingArguments, Trainer
from PIL import Image
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
# torchvision has several functions for image processing
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor, Resize
import requests
from io import BytesIO
Introduction to the Vision Transformer¶
The vision transformer was first introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. It was designed to perform tasks analogous to those of an NLP transformer, and the architecture is similar:
Instead of tokens for words, the input sequence consists of fixed-size patches cut from the image.
The model can be pre-trained on many types of datasets. The model we are using here has been pre-trained on the public ImageNet-21k dataset.
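As a concrete illustration of the patch mechanism, a bit of arithmetic shows how many patches a 224x224 image produces:
# Illustrative patch arithmetic for ViT-base (patch size 16, image size 224)
image_size, patch_size, channels = 224, 16, 3
num_patches = (image_size // patch_size) ** 2       # 14 * 14 = 196 patches per image
values_per_patch = channels * patch_size ** 2       # 3 * 16 * 16 = 768 raw values per patch
print(num_patches, values_per_patch)                # 196 768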
# Load up a pretrained Google vit model on HuggingFace
vit_model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
vit_model
ViTModel( (embeddings): ViTEmbeddings( (patch_embeddings): ViTPatchEmbeddings( (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16)) ) (dropout): Dropout(p=0.0, inplace=False) ) (encoder): ViTEncoder( (layer): ModuleList( (0-11): 12 x ViTLayer( (attention): ViTAttention( (attention): ViTSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (output): ViTSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) ) (intermediate): ViTIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ViTOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True) ) ) ) (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (pooler): ViTPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) )
The ViTModel is pre-trained on ImageNet-21k. Its ViTEncoder uses the same attention mechanism as LLMs: each layer has query, key and value projections. At the beginning there is a patch_embeddings layer, which differs from an LLM (it embeds image patches rather than tokens). The ViTModel stacks 12 encoder layers and, finally, there is a pooler layer at the end.
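A quick way to see these pieces in action is to push a dummy batch through the encoder and inspect the output shapes; a minimal sketch using a random tensor in place of a real image:
# Minimal sketch: pass a dummy 224x224 RGB "image" through the ViT encoder
dummy = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)
with torch.no_grad():
    out = vit_model(pixel_values=dummy)
print(out.last_hidden_state.shape)         # torch.Size([1, 197, 768]) -> 196 patches + 1 [CLS] token
print(out.pooler_output.shape)             # torch.Size([1, 768])      -> pooled [CLS] representation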
We need a feature extractor to convert images into tensors. This can be done with the AutoFeatureExtractor class:
# Load feature extractor
feature_ext = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
C:\Users\mrezv\anaconda3\lib\site-packages\transformers\models\vit\feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead. warnings.warn(
We need to pay attention to how images are preprocessed so that it matches how the model was pretrained.
feature_ext
ViTFeatureExtractor { "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.5, 0.5, 0.5 ], "image_processor_type": "ViTFeatureExtractor", "image_std": [ 0.5, 0.5, 0.5 ], "resample": 2, "rescale_factor": 0.00392156862745098, "size": { "height": 224, "width": 224 } }
Feature extraction plays the role that a tokenizer plays in NLP: a tokenizer takes raw strings and turns them into tokens, while the feature extractor takes a raw image and performs the pre-processing that converts it into a tensor. Its main parameters include do_normalize, which normalizes the image to a given mean and standard deviation, and size, which resizes the image to 224x224.
To load an image, we can use Pillow's Image object.
img = Image.open('./image/1.jpg')
display(img)
If we use feature extraction:
import matplotlib.pyplot as plt
plt.imshow(feature_ext(img).pixel_values[0].transpose(1, 2, 0))
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
<matplotlib.image.AxesImage at 0x1128868b3a0>
feature_ext(img).pixel_values[0].shape
(3, 224, 224)
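To make the do_resize, do_rescale and do_normalize settings concrete, the tensor above can be approximated by hand; a rough sketch (assuming img is the PIL image loaded above; the exact resampling filter may differ):
# Rough manual equivalent of the feature extractor
arr = np.array(img.resize((224, 224)), dtype=np.float32)  # do_resize: size 224x224
arr = arr / 255.0                                          # do_rescale: rescale_factor = 1/255
arr = (arr - 0.5) / 0.5                                    # do_normalize: image_mean = image_std = 0.5
arr = arr.transpose(2, 0, 1)                               # HWC -> CHW, i.e. (3, 224, 224)
print(arr.shape)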
Although the processed image looks distorted to us, this preprocessing removes a lot of noise for the specific task at hand.
Fine-tune Image Captioning System¶
Vision transformers¶
# Many weights are initialized randomly, namely the cross-attention weights
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
'google/vit-base-patch16-224-in21k', # :patch:16, image size: 224*224, pre-trained on: image net in21k
'distilgpt2') # decoder is `distilgpt2`
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['transformer.h.3.crossattention.c_attn.weight', 'transformer.h.3.crossattention.q_attn.weight', 'transformer.h.5.crossattention.c_attn.weight', 'transformer.h.2.crossattention.c_proj.weight', 'transformer.h.5.ln_cross_attn.weight', 'transformer.h.2.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.5.crossattention.q_attn.weight', 'transformer.h.5.crossattention.bias', 'transformer.h.3.crossattention.c_proj.bias', 'transformer.h.3.crossattention.bias', 'transformer.h.2.ln_cross_attn.weight', 'transformer.h.1.crossattention.bias', 'transformer.h.4.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.4.crossattention.c_proj.weight', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.2.crossattention.masked_bias', 'transformer.h.3.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.3.crossattention.masked_bias', 'transformer.h.4.crossattention.c_proj.bias', 'transformer.h.3.ln_cross_attn.weight', 'transformer.h.5.crossattention.c_proj.bias', 'transformer.h.4.ln_cross_attn.weight', 'transformer.h.0.crossattention.masked_bias', 'transformer.h.2.crossattention.c_proj.bias', 'transformer.h.4.crossattention.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.2.crossattention.bias', 'transformer.h.5.crossattention.c_proj.weight', 'transformer.h.0.crossattention.bias', 'transformer.h.5.crossattention.masked_bias', 'transformer.h.4.crossattention.masked_bias', 'transformer.h.4.crossattention.c_attn.weight', 'transformer.h.1.crossattention.masked_bias', 'transformer.h.2.crossattention.q_attn.weight', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.c_attn.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The vision-encoder-decoder model is like T5 in that it has both an encoder and a decoder.
type(model.encoder)
transformers.models.vit.modeling_vit.ViTModel
type(model.decoder)
transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel
total_params = 0
for param in model.parameters():
total_params += param.numel()
print(f"The model has a combined {total_params:,} parameters")
The model has a combined 182,485,248 parameters
# Instantiate a tokenizer
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained('distilgpt2')
Load Images and Image Caption¶
A new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. … The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations
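The captions live in Flickr8k.token.txt, where each line holds an image file name with a #caption_index suffix and one caption, separated by a tab. A small illustrative parse of one such line (the caption is taken from the sample output further below; the #0 index is assumed):
# Illustrative line format of Flickr8k.token.txt: "<image_name>#<caption_index>\t<caption>"
line = "1000268201_693b08cb0e.jpg#0\tA girl going into a wooden building .\n"
img_name, caption = line.rstrip("\n").split("\t")
img_name = img_name.split("#")[0]              # drop the "#0" caption-index suffix
caption = caption.replace(' .', '').strip()    # remove the trailing " ."
print(img_name, '->', caption)                 # 1000268201_693b08cb0e.jpg -> A girl going into a wooden building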
def captions_load(image_caption,image_path, min_caption=10, max_caption=50, num_image=10):
'''
Obtain image captions and image names
'''
with open(image_caption) as caption_file:
captions = caption_file.readlines()
map_captions = {}
map_captions_test = {}
text_data = []
text_data_test = []
if num_image>=len(captions):
num_image = len(captions)
# Loading up images from data set for training set
for line in captions[:num_image]:
line = line.rstrip("\n")
# Separate image name and captions using a tab
img_name, caption = line.split("\t")
# Five different captions are assigned to each image
# Each image name has a suffix `#(caption_number)`
img_name = img_name.split("#")[0]
img_name = os.path.join(image_path, img_name.strip())
if img_name.endswith("jpg"):
caption = caption.replace(' .', '').strip()
tokens = caption.strip().split()
if len(caption) < min_caption or len(caption) > max_caption:
continue
text_data.append(caption)
if img_name in map_captions:
map_captions[img_name].append(caption)
else:
map_captions[img_name] = [caption]
# Loading up images from data set for test set
for line in captions[num_image:]:
line = line.rstrip("\n")
# Separate image name and captions using a tab
img_name, caption = line.split("\t")
# Five different captions are assigned to each image
# Each image name has a suffix `#(caption_number)`
img_name = img_name.split("#")[0]
img_name = os.path.join(image_path, img_name.strip())
if img_name.endswith("jpg"):
caption = caption.replace(' .', '').strip()
tokens = caption.strip().split()
if len(caption) < min_caption or len(caption) > max_caption:
continue
text_data_test.append(caption)
if img_name in map_captions_test:
map_captions_test[img_name].append(caption)
else:
map_captions_test[img_name] = [caption]
return map_captions, text_data, map_captions_test, text_data_test
# Load the dataset
image_path = './image/Flickr8k_Dataset'
image_caption = './image/Flickr8k.token.txt'
map_captions, caption_only, map_captions_test, text_data_test = captions_load(image_caption,
image_path, num_image=7500)
list(map_captions.items())[:2]
[('./image/Flickr8k_Dataset\\1000268201_693b08cb0e.jpg', ['A girl going into a wooden building', 'A little girl climbing into a wooden playhouse', 'A little girl climbing the stairs to her playhouse']), ('./image/Flickr8k_Dataset\\1001773457_577c3a7d70.jpg', ['A black dog and a spotted dog are fighting', 'Two dogs on pavement moving toward each other'])]
list(map_captions_test.items())[:2]
[('./image/Flickr8k_Dataset\\2308108566_2cba6bca53.jpg', ['A man rides a bike on a dirt path', 'A person biking through the woods', 'A person dirtbikes along a muddy trail', 'Dirt bike rider riding down the slope', 'Person rides a dirt bike down a dirt hill']), ('./image/Flickr8k_Dataset\\2308256827_3c0a7d514d.jpg', ['A playground with two children and an adult'])]
Image Processing¶
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
normalize = Normalize(
mean=feature_extractor.image_mean,
std=feature_extractor.image_std
)
process_image = Compose(
[
RandomResizedCrop(list(feature_extractor.size.values())), # Data augmentation. Take a random resized crop of our image
ToTensor(), # Convert to pytorch tensor
normalize # normalize pixel values to look like images during pre-training
]
)
rows = []
# It is ok to have multiple captions per image because of data augmentation
for path, captions in map_captions.items():
for caption in captions:
rows.append({'path': path, 'caption': caption})
image_df = pd.DataFrame(rows)
image_dataset = Dataset.from_pandas(image_df)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
def image_preprocess(examples):
# ViT expects pixel_values instead of input_ids
examples['pixel_values'] = [process_image(Image.open(path)) for path in examples['path']]
# We are padding tokens here instead of using a datacollator
tokenized = gpt2_tokenizer(
examples['caption'], padding='max_length', max_length=10, truncation=True
)['input_ids']
# the output captions
examples['labels'] = [[l if l != gpt2_tokenizer.pad_token_id else -100 for l in t] for t in tokenized]
# delete unused keys
del examples['path']
del examples['caption']
return examples
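To see what the label masking step produces, here is a small illustrative check (assuming the tokenizer loaded above): pad positions are replaced by -100 so the loss ignores them.
# Illustrative: pad tokens in the label sequence become -100 (ignored by the loss)
ids = gpt2_tokenizer("A dog runs", padding='max_length', max_length=10, truncation=True)['input_ids']
masked = [tok if tok != gpt2_tokenizer.pad_token_id else -100 for tok in ids]
print(ids)      # caption token ids followed by pad (eos) ids
print(masked)   # same ids with pads replaced by -100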
image_dataset = image_dataset.map(image_preprocess, batched=True)
# Train test split
image_dataset = image_dataset.train_test_split(test_size=0.1)
image_dataset
DatasetDict({ train: Dataset({ features: ['pixel_values', 'labels'], num_rows: 3259 }) test: Dataset({ features: ['pixel_values', 'labels'], num_rows: 363 }) })
# We set a pad token and a start token in our combined model to be the same as gpt2
model.config.pad_token = gpt2_tokenizer.pad_token
model.config.pad_token_id = gpt2_tokenizer.pad_token_id
model.config.decoder_start_token = gpt2_tokenizer.bos_token
model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id
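With the pad and decoder start tokens in place, a single forward pass shows what the Trainer will optimize: the ViT encodes pixel_values and distilgpt2 predicts the caption tokens through cross-attention. A minimal sketch with a random tensor as a stand-in for a preprocessed image and a made-up caption:
# Minimal sketch: one forward pass of the combined model; the Trainer minimizes this loss
dummy_pixels = torch.randn(1, 3, 224, 224)                     # stand-in for a preprocessed image
labels = gpt2_tokenizer("A dog runs on the grass", return_tensors='pt').input_ids
with torch.no_grad():
    outputs = model(pixel_values=dummy_pixels, labels=labels)  # labels are shifted internally
print(outputs.loss)                                            # causal LM cross-entropy loss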
Freezing Parameters¶
Since the ViT model is very large, fine-tuning all parameters would take a long time, so we can freeze some parameters to speed up the fine-tuning process.
# Get the number of layers
config = model.config
num_layers = config.encoder.num_hidden_layers
print("Number of hidden layers in the VisionEncoderDecoderConfig model:", num_layers)
Number of hidden layers in the VisionEncoderDecoderConfig model: 12
## To speed up training, we could freeze the ViT encoder except its last layer (not applied in this run)
#for name, param in model.encoder.named_parameters():
#    if 'encoder.layer.11' not in name:   # keep only the last ViT layer trainable
#        param.requires_grad = False      # disable training for the rest of the ViT
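A quick way to check the effect of any freezing choice is to recount the trainable parameters (a small sketch; with nothing frozen it prints the full 182,485,248):
# Count only the parameters that the optimizer will update
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")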
Fine-tune Vision Model¶
epochs = 3
batch_size = 5
from transformers import set_seed
set_seed(42)
training_args = TrainingArguments(
output_dir='./caption_image', # The output directory
overwrite_output_dir=True, # overwrite the content of the output directory
num_train_epochs=epochs, # number of training epochs
per_device_train_batch_size=batch_size, # batch size for training
per_device_eval_batch_size=batch_size, # batch size for evaluation
load_best_model_at_end=True,
log_level='info',
logging_steps=50,
evaluation_strategy='epoch',
save_strategy='epoch',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=image_dataset['train'],
eval_dataset=image_dataset['test'],
)
import os
os.environ['WANDB_DISABLED'] = 'true'
trainer.evaluate()
***** Running Evaluation ***** Num examples = 363 Batch size = 5
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
`offline` in this directory.
Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
{'eval_loss': 5.0215582847595215, 'eval_runtime': 144.0833, 'eval_samples_per_second': 2.519, 'eval_steps_per_second': 0.507}
trainer.train()
C:\Users\mrezv\anaconda3\lib\site-packages\transformers\optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning warnings.warn( ***** Running training ***** Num examples = 3259 Num Epochs = 3 Instantaneous batch size per device = 5 Total train batch size (w. parallel, distributed & accumulation) = 5 Gradient Accumulation steps = 1 Total optimization steps = 1956 Number of trainable parameters = 182485248
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 3.053000 | 2.941434 |
2 | 2.554100 | 2.813167 |
3 | 2.215500 | 2.802942 |
***** Running Evaluation ***** Num examples = 363 Batch size = 5 Saving model checkpoint to ./caption_image\checkpoint-652 Configuration saved in ./caption_image\checkpoint-652\config.json Configuration saved in ./caption_image\checkpoint-652\generation_config.json Model weights saved in ./caption_image\checkpoint-652\pytorch_model.bin ***** Running Evaluation ***** Num examples = 363 Batch size = 5 Saving model checkpoint to ./caption_image\checkpoint-1304 Configuration saved in ./caption_image\checkpoint-1304\config.json Configuration saved in ./caption_image\checkpoint-1304\generation_config.json Model weights saved in ./caption_image\checkpoint-1304\pytorch_model.bin ***** Running Evaluation ***** Num examples = 363 Batch size = 5 Saving model checkpoint to ./caption_image\checkpoint-1956 Configuration saved in ./caption_image\checkpoint-1956\config.json Configuration saved in ./caption_image\checkpoint-1956\generation_config.json Model weights saved in ./caption_image\checkpoint-1956\pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) Loading best model from ./caption_image\checkpoint-1956 (score: 2.8029420375823975).
TrainOutput(global_step=1956, training_loss=2.7051771218547547, metrics={'train_runtime': 10062.1453, 'train_samples_per_second': 0.972, 'train_steps_per_second': 0.194, 'total_flos': 1.2636248585954918e+18, 'train_loss': 2.7051771218547547, 'epoch': 3.0})
# the loss decline is starting to slow down. This is a good indication that we may want to try training on more data
trainer.save_model()
Saving model checkpoint to ./caption_image Configuration saved in ./caption_image\config.json Configuration saved in ./caption_image\generation_config.json Model weights saved in ./caption_image\pytorch_model.bin
Fine-tuned Model for Prediction¶
# loading model and config from pretrained folder
finetuned_model = VisionEncoderDecoderModel.from_pretrained('./caption_image')
loading configuration file ./caption_image\config.json Model config VisionEncoderDecoderConfig { "_commit_hash": null, "architectures": [ "VisionEncoderDecoderModel" ], "decoder": { "_name_or_path": "distilgpt2", "_num_labels": 1, "activation_function": "gelu_new", "add_cross_attention": true, "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.1, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 50256, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "embd_pdrop": 0.1, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 50256, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "is_decoder": true, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0 }, "layer_norm_epsilon": 1e-05, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_inner": null, "n_layer": 6, "n_positions": 1024, "no_repeat_ngram_size": 0, "num_beam_groups": 1, "num_beams": 1, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "reorder_and_upcast_attn": false, "repetition_penalty": 1.0, "resid_pdrop": 0.1, "return_dict": true, "return_dict_in_generate": false, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "sep_token_id": null, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "suppress_tokens": null, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": null, "torchscript": false, "transformers_version": "4.26.1", "typical_p": 1.0, "use_bfloat16": false, "use_cache": true, "vocab_size": 50257 }, "decoder_start_token": "<|endoftext|>", "decoder_start_token_id": 50256, "encoder": { "_name_or_path": "google/vit-base-patch16-224-in21k", "add_cross_attention": false, "architectures": [ "ViTModel" ], "attention_probs_dropout_prob": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "encoder_stride": 16, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 224, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-12, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "vit", "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 12, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": 
false, "pad_token_id": null, "patch_size": 16, "prefix": null, "problem_type": null, "pruned_heads": {}, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": null, "torchscript": false, "transformers_version": "4.26.1", "typical_p": 1.0, "use_bfloat16": false }, "is_encoder_decoder": true, "model_type": "vision-encoder-decoder", "pad_token": "<|endoftext|>", "pad_token_id": 50256, "tie_word_embeddings": false, "torch_dtype": "float32", "transformers_version": null } loading weights file ./caption_image\pytorch_model.bin Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" } Generate config GenerationConfig { "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" } All model checkpoint weights were used when initializing VisionEncoderDecoderModel. All the weights of VisionEncoderDecoderModel were initialized from the model checkpoint at ./caption_image. If your task is similar to the task the model of the checkpoint was trained on, you can already use VisionEncoderDecoderModel for predictions without further training. loading configuration file ./caption_image\generation_config.json Generate config GenerationConfig { "_from_model_config": true, "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" }
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
loading configuration file preprocessor_config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\preprocessor_config.json loading configuration file config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\config.json Model config ViTConfig { "_name_or_path": "google/vit-base-patch16-224-in21k", "architectures": [ "ViTModel" ], "attention_probs_dropout_prob": 0.0, "encoder_stride": 16, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "image_size": 224, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "model_type": "vit", "num_attention_heads": 12, "num_channels": 3, "num_hidden_layers": 12, "patch_size": 16, "qkv_bias": true, "transformers_version": "4.26.1" } C:\Users\mrezv\anaconda3\lib\site-packages\transformers\models\vit\feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead. warnings.warn( size should be a dictionary on of the following set of keys: ({'height', 'width'}, {'shortest_edge'}, {'longest_edge', 'shortest_edge'}), got 224. Converted to {'height': 224, 'width': 224}. Image processor ViTFeatureExtractor { "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.5, 0.5, 0.5 ], "image_processor_type": "ViTFeatureExtractor", "image_std": [ 0.5, 0.5, 0.5 ], "resample": 2, "rescale_factor": 0.00392156862745098, "size": { "height": 224, "width": 224 } }
normalize = Normalize(
mean=feature_extractor.image_mean,
std=feature_extractor.image_std
)
# Create a new composition for inference (it reuses the same crop, tensor conversion and normalization as training)
inferenceprocess_image = Compose(
[
RandomResizedCrop(list(feature_extractor.size.values())),
ToTensor(),
normalize
]
)
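Since RandomResizedCrop is stochastic, a deterministic alternative for inference could rely on the Resize transform imported earlier; a minimal sketch (not used for the results below):
# Alternative deterministic preprocessing for inference: resize only, no random crop
deterministic_process_image = Compose(
    [
        Resize(list(feature_extractor.size.values())),  # resize to 224x224
        ToTensor(),
        normalize
    ]
)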
# a helper function to caption images from the web or a file path
def caption_image(m,path,num_beams=3,max_length=15,top_k=10, num_return_sequences=5):
if 'http' in path:
response = requests.get(path)
img = Image.open(BytesIO(response.content))
image_matrix = inferenceprocess_image(img).unsqueeze(0)
else:
img = Image.open(path)
image_matrix = inferenceprocess_image(img).unsqueeze(0)
generated = m.generate(
image_matrix,
num_beams=num_beams,
max_length=max_length,
early_stopping=True,
do_sample=True,
top_k=top_k,
num_return_sequences=num_return_sequences,
)
caption_options = [gpt2_tokenizer.decode(g, skip_special_tokens=True).strip() for g in generated]
display(img)
return caption_options, generated, image_matrix
Example 1¶
captions, generated, image_matrix = caption_image( # Out of sample photo
finetuned_model, list(map_captions_test.items())[0][0]
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" } C:\Users\mrezv\anaconda3\lib\site-packages\transformers\generation\utils.py:1186: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation) warnings.warn(
['A man is riding a bike down a dirt road with a helmet on', 'A man is riding a bike through the woods on a sunny day in', 'A man is riding a bike through a forest covered area near a creek', 'A man is riding a bike down a hill in a race car with', 'A man riding a bike down a hill with a helmet on his head']
Example 2¶
captions, generated, image_matrix = caption_image( # Another one
finetuned_model, list(map_captions_test.items())[1][0]
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" }
['A boy in a blue shirt is jumping on a swing set in a', 'A boy is jumping on a swing set in a backyard gymnasium', 'A boy is jumping on a swing set on a swing set in a', 'A young boy is swinging on a swing set in a yard with a', 'A boy is jumping on a swing set in a backyard gymnasium']
Example 3¶
captions, generated, image_matrix = caption_image( # from our Flickr dataset
finetuned_model,
list(map_captions_test.items())[2][0]
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" }
['A black dog is running through the water with a stick in its mouth', 'A black dog jumps into the water to catch a tennis ball in the', 'A black dog jumping into the water to catch a ball in a lake', 'A black dog is running through the water with a stick in its mouth', 'A black dog runs through the water with a stick in its mouth,']
Example Image with url¶
url = "https://raw.githubusercontent.com/MehdiRezvandehy/Machine-Learning-Course-University-of-Calgary/master/Images/2308978137_bfe776d541.jpg"
captions, generated, image_matrix = caption_image( # Out of sample photo
finetuned_model, url
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" }
['A group of people gather to watch a group of people gather to watch', 'A group of people watch a group of people on a snowy day in', 'A group of people gather to watch a group of people gather to watch', 'A group of people gathered around a campfire outside a campfire in', 'A group of people gather for the camera at a festival in the desert']
Not Fine-tuned Model for Prediction¶
non_finetuned = VisionEncoderDecoderModel.from_encoder_decoder_pretrained('google/vit-base-patch16-224-in21k',
'distilgpt2')
loading configuration file config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\config.json Model config ViTConfig { "_name_or_path": "google/vit-base-patch16-224-in21k", "architectures": [ "ViTModel" ], "attention_probs_dropout_prob": 0.0, "encoder_stride": 16, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "image_size": 224, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "model_type": "vit", "num_attention_heads": 12, "num_channels": 3, "num_hidden_layers": 12, "patch_size": 16, "qkv_bias": true, "transformers_version": "4.26.1" } loading weights file pytorch_model.bin from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\pytorch_model.bin All model checkpoint weights were used when initializing ViTModel. All the weights of ViTModel were initialized from the model checkpoint at google/vit-base-patch16-224-in21k. If your task is similar to the task the model of the checkpoint was trained on, you can already use ViTModel for predictions without further training. loading configuration file config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilgpt2\snapshots\38cc92ec43315abd5136313225e95acc5986876c\config.json Model config GPT2Config { "_name_or_path": "distilgpt2", "_num_labels": 1, "activation_function": "gelu_new", "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.1, "bos_token_id": 50256, "embd_pdrop": 0.1, "eos_token_id": 50256, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "label2id": { "LABEL_0": 0 }, "layer_norm_epsilon": 1e-05, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_inner": null, "n_layer": 6, "n_positions": 1024, "reorder_and_upcast_attn": false, "resid_pdrop": 0.1, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "transformers_version": "4.26.1", "use_cache": true, "vocab_size": 50257 } Initializing distilgpt2 as a decoder model. Cross attention layers are added to distilgpt2 and randomly initialized if distilgpt2's architecture allows for cross attention layers. loading weights file model.safetensors from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilgpt2\snapshots\38cc92ec43315abd5136313225e95acc5986876c\model.safetensors Generate config GenerationConfig { "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" } All model checkpoint weights were used when initializing GPT2LMHeadModel. 
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['transformer.h.3.crossattention.c_attn.weight', 'transformer.h.3.crossattention.q_attn.weight', 'transformer.h.5.crossattention.c_attn.weight', 'transformer.h.2.crossattention.c_proj.weight', 'transformer.h.5.ln_cross_attn.weight', 'transformer.h.2.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.5.crossattention.q_attn.weight', 'transformer.h.5.crossattention.bias', 'transformer.h.3.crossattention.c_proj.bias', 'transformer.h.3.crossattention.bias', 'transformer.h.2.ln_cross_attn.weight', 'transformer.h.1.crossattention.bias', 'transformer.h.4.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.4.crossattention.c_proj.weight', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.2.crossattention.masked_bias', 'transformer.h.3.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.3.crossattention.masked_bias', 'transformer.h.4.crossattention.c_proj.bias', 'transformer.h.3.ln_cross_attn.weight', 'transformer.h.5.crossattention.c_proj.bias', 'transformer.h.4.ln_cross_attn.weight', 'transformer.h.0.crossattention.masked_bias', 'transformer.h.2.crossattention.c_proj.bias', 'transformer.h.4.crossattention.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.2.crossattention.bias', 'transformer.h.5.crossattention.c_proj.weight', 'transformer.h.0.crossattention.bias', 'transformer.h.5.crossattention.masked_bias', 'transformer.h.4.crossattention.masked_bias', 'transformer.h.4.crossattention.c_attn.weight', 'transformer.h.1.crossattention.masked_bias', 'transformer.h.2.crossattention.q_attn.weight', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.c_attn.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. loading configuration file generation_config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilgpt2\snapshots\38cc92ec43315abd5136313225e95acc5986876c\generation_config.json Generate config GenerationConfig { "_from_model_config": true, "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" } Setting `config.is_decoder=True` and `config.add_cross_attention=True` for decoder_config Generate config GenerationConfig { "transformers_version": "4.26.1" }
Example 1¶
captions, generated, image_matrix = caption_image( # Out of sample photo
non_finetuned, list(map_captions_test.items())[0][0]
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['.', '.', 'The U.S. military has been accused of violating the U.', 'The U.S. Department of Homeland Security announced on Monday that it', '.']
Example 2¶
captions, generated, image_matrix = caption_image( # Another one
non_finetuned, list(map_captions_test.items())[1][0]
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['"I think that\'s a good thing. I think that\'s', '"I don\'t think it\'s fair to say that it\'s', '"I don\'t think it\'s going to be easy for the', '.', '.']
Example 3¶
captions, generated, image_matrix = caption_image( # from our Flickr dataset
non_finetuned,
list(map_captions_test.items())[2][0]
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['"I don\'t know what to do with that," he said', ', and it’s a good thing.\n\n“', '.', 'The New York Times is reporting that the Trump administration is planning to build', 'A man was shot and killed in a shooting at a convenience store in']
Example Image with url¶
url = "https://raw.githubusercontent.com/MehdiRezvandehy/Machine-Learning-Course-University-of-Calgary/master/Images/2308978137_bfe776d541.jpg"
captions, generated, image_matrix = caption_image( # Out of sample photo
non_finetuned, url
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['The U.S. Department of Homeland Security has issued a warning to', '"It\'s not just about me, it\'s about me,"', '.', 'The U.S. Supreme Court has ruled that the U.S', 'A new report from the Center for Disease Control and Prevention (CDC)']
list(map_captions_test.items())[3][0]
'./image/Flickr8k_Dataset\\2308978137_bfe776d541.jpg'