Summary
In this notebook, we implement image captioning with a vision transformer. Image captioning associates a textual description with an image, providing an account of its contents: an image is converted into a written narrative, establishing a connection between the domains of vision (image) and language (text). We showcase how a pretrained Vision Transformer (ViT) can perform this task, using the PyTorch backend as our primary technology. The goal is to demonstrate fine-tuning a ViT to generate image captions without retraining from scratch.
The notebook for this analysis is available on my GitHub page.
from transformers import ViTModel
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, AutoFeatureExtractor, \
AutoTokenizer, TrainingArguments, Trainer
from PIL import Image
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
# torchvision has several functions for image processing
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor, Resize
import requests
from io import BytesIO
Introduction to the Vision Transformer¶
The vision transformer was first introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. It was designed to perform tasks analogous to those of an NLP transformer, and the architecture is similar:
Instead of tokens for words, the input sequence consists of fixed-size patches cut from the image.
The model can be pre-trained on many types of datasets. The model we are using here has been pre-trained on the public ImageNet-21k dataset.
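As a concrete illustration of the patch mechanism, a bit of arithmetic shows how many patches a 224x224 image produces:
# Illustrative patch arithmetic for ViT-base (patch size 16, image size 224)
image_size, patch_size, channels = 224, 16, 3
num_patches = (image_size // patch_size) ** 2       # 14 * 14 = 196 patches per image
values_per_patch = channels * patch_size ** 2       # 3 * 16 * 16 = 768 raw values per patch
print(num_patches, values_per_patch)                # 196 768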
# Load up a pretrained Google vit model on HuggingFace
vit_model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
vit_model
ViTModel( (embeddings): ViTEmbeddings( (patch_embeddings): ViTPatchEmbeddings( (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16)) ) (dropout): Dropout(p=0.0, inplace=False) ) (encoder): ViTEncoder( (layer): ModuleList( (0-11): 12 x ViTLayer( (attention): ViTAttention( (attention): ViTSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (output): ViTSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) ) (intermediate): ViTIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ViTOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (dropout): Dropout(p=0.0, inplace=False) ) (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True) ) ) ) (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (pooler): ViTPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) )
The ViTModel is pre-trained on ImageNet-21k. Its ViTEncoder uses the same attention mechanism as LLMs: each layer has query, key and value projections. At the beginning there is a patch_embeddings layer, which differs from an LLM (it embeds image patches rather than tokens). The ViTModel stacks 12 encoder layers and, finally, there is a pooler layer at the end.
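A quick way to see these pieces in action is to push a dummy batch through the encoder and inspect the output shapes; a minimal sketch using a random tensor in place of a real image:
# Minimal sketch: pass a dummy 224x224 RGB "image" through the ViT encoder
dummy = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)
with torch.no_grad():
    out = vit_model(pixel_values=dummy)
print(out.last_hidden_state.shape)         # torch.Size([1, 197, 768]) -> 196 patches + 1 [CLS] token
print(out.pooler_output.shape)             # torch.Size([1, 768])      -> pooled [CLS] representation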
We need a feature extractor to convert images into tensors. This can be done with the AutoFeatureExtractor class:
# Load feature extractor
feature_ext = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
C:\Users\mrezv\anaconda3\lib\site-packages\transformers\models\vit\feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead. warnings.warn(
We need to pay attention to how images are preprocessed so that it matches how the model was pretrained.
feature_ext
ViTFeatureExtractor { "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.5, 0.5, 0.5 ], "image_processor_type": "ViTFeatureExtractor", "image_std": [ 0.5, 0.5, 0.5 ], "resample": 2, "rescale_factor": 0.00392156862745098, "size": { "height": 224, "width": 224 } }
Feature extraction plays the role that a tokenizer plays in NLP: a tokenizer takes raw strings and turns them into tokens, while the feature extractor takes a raw image and performs the pre-processing that converts it into a tensor. Its main parameters include do_normalize, which normalizes the image to a given mean and standard deviation, and size, which resizes the image to 224x224.
To load an image, we can use Pillow's Image object.
img = Image.open('./image/1.jpg')
display(img)
If we use feature extraction:
import matplotlib.pyplot as plt
plt.imshow(feature_ext(img).pixel_values[0].transpose(1, 2, 0))
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
<matplotlib.image.AxesImage at 0x1128868b3a0>
feature_ext(img).pixel_values[0].shape
(3, 224, 224)
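To make the do_resize, do_rescale and do_normalize settings concrete, the tensor above can be approximated by hand; a rough sketch (assuming img is the PIL image loaded above; the exact resampling filter may differ):
# Rough manual equivalent of the feature extractor
arr = np.array(img.resize((224, 224)), dtype=np.float32)  # do_resize: size 224x224
arr = arr / 255.0                                          # do_rescale: rescale_factor = 1/255
arr = (arr - 0.5) / 0.5                                    # do_normalize: image_mean = image_std = 0.5
arr = arr.transpose(2, 0, 1)                               # HWC -> CHW, i.e. (3, 224, 224)
print(arr.shape)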
Although the processed image looks distorted to us, this preprocessing removes a lot of noise for the specific task at hand.
Fine-tune Image Captioning System¶
Vision transformers¶
# Many weights are initialized randomly, namely the cross-attention weights
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
'google/vit-base-patch16-224-in21k', # :patch:16, image size: 224*224, pre-trained on: image net in21k
'distilgpt2') # decoder is `distilgpt2`
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['transformer.h.3.crossattention.c_attn.weight', 'transformer.h.3.crossattention.q_attn.weight', 'transformer.h.5.crossattention.c_attn.weight', 'transformer.h.2.crossattention.c_proj.weight', 'transformer.h.5.ln_cross_attn.weight', 'transformer.h.2.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.5.crossattention.q_attn.weight', 'transformer.h.5.crossattention.bias', 'transformer.h.3.crossattention.c_proj.bias', 'transformer.h.3.crossattention.bias', 'transformer.h.2.ln_cross_attn.weight', 'transformer.h.1.crossattention.bias', 'transformer.h.4.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.4.crossattention.c_proj.weight', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.2.crossattention.masked_bias', 'transformer.h.3.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.3.crossattention.masked_bias', 'transformer.h.4.crossattention.c_proj.bias', 'transformer.h.3.ln_cross_attn.weight', 'transformer.h.5.crossattention.c_proj.bias', 'transformer.h.4.ln_cross_attn.weight', 'transformer.h.0.crossattention.masked_bias', 'transformer.h.2.crossattention.c_proj.bias', 'transformer.h.4.crossattention.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.2.crossattention.bias', 'transformer.h.5.crossattention.c_proj.weight', 'transformer.h.0.crossattention.bias', 'transformer.h.5.crossattention.masked_bias', 'transformer.h.4.crossattention.masked_bias', 'transformer.h.4.crossattention.c_attn.weight', 'transformer.h.1.crossattention.masked_bias', 'transformer.h.2.crossattention.q_attn.weight', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.c_attn.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The vision-encoder-decoder model is like T5 in that it has both an encoder and a decoder.
type(model.encoder)
transformers.models.vit.modeling_vit.ViTModel
type(model.decoder)
transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel
total_params = 0
for param in model.parameters():
total_params += param.numel()
print(f"The model has a combined {total_params:,} parameters")
The model has a combined 182,485,248 parameters
# Instantiate a tokenizer
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained('distilgpt2')
Load Images and Image Caption¶
A new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. … The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations
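The captions live in Flickr8k.token.txt, where each line holds an image file name with a #caption_index suffix and one caption, separated by a tab. A small illustrative parse of one such line (the caption is taken from the sample output further below; the #0 index is assumed):
# Illustrative line format of Flickr8k.token.txt: "<image_name>#<caption_index>\t<caption>"
line = "1000268201_693b08cb0e.jpg#0\tA girl going into a wooden building .\n"
img_name, caption = line.rstrip("\n").split("\t")
img_name = img_name.split("#")[0]              # drop the "#0" caption-index suffix
caption = caption.replace(' .', '').strip()    # remove the trailing " ."
print(img_name, '->', caption)                 # 1000268201_693b08cb0e.jpg -> A girl going into a wooden building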
def captions_load(image_caption,image_path, min_caption=10, max_caption=50, num_image=10):
'''
Obtain image captions and image names
'''
with open(image_caption) as caption_file:
captions = caption_file.readlines()
map_captions = {}
map_captions_test = {}
text_data = []
text_data_test = []
if num_image>=len(captions):
num_image = len(captions)
# Loading up images from data set for training set
for line in captions[:num_image]:
line = line.rstrip("\n")
# Separate image name and captions using a tab
img_name, caption = line.split("\t")
# Five different captions are assigned to each image
# Each image name has a suffix `#(caption_number)`
img_name = img_name.split("#")[0]
img_name = os.path.join(image_path, img_name.strip())
if img_name.endswith("jpg"):
caption = caption.replace(' .', '').strip()
tokens = caption.strip().split()
if len(caption) < min_caption or len(caption) > max_caption:
continue
text_data.append(caption)
if img_name in map_captions:
map_captions[img_name].append(caption)
else:
map_captions[img_name] = [caption]
# Loading up images from data set for test set
for line in captions[num_image:]:
line = line.rstrip("\n")
# Separate image name and captions using a tab
img_name, caption = line.split("\t")
# Five different captions are assigned to each image
# Each image name has a suffix `#(caption_number)`
img_name = img_name.split("#")[0]
img_name = os.path.join(image_path, img_name.strip())
if img_name.endswith("jpg"):
caption = caption.replace(' .', '').strip()
tokens = caption.strip().split()
if len(caption) < min_caption or len(caption) > max_caption:
continue
text_data_test.append(caption)
if img_name in map_captions_test:
map_captions_test[img_name].append(caption)
else:
map_captions_test[img_name] = [caption]
return map_captions, text_data, map_captions_test, text_data_test
# Load the dataset
image_path = './image/Flickr8k_Dataset'
image_caption = './image/Flickr8k.token.txt'
map_captions, caption_only, map_captions_test, text_data_test = captions_load(image_caption,
image_path, num_image=7500)
list(map_captions.items())[:2]
[('./image/Flickr8k_Dataset\\1000268201_693b08cb0e.jpg', ['A girl going into a wooden building', 'A little girl climbing into a wooden playhouse', 'A little girl climbing the stairs to her playhouse']), ('./image/Flickr8k_Dataset\\1001773457_577c3a7d70.jpg', ['A black dog and a spotted dog are fighting', 'Two dogs on pavement moving toward each other'])]
list(map_captions_test.items())[:2]
[('./image/Flickr8k_Dataset\\2308108566_2cba6bca53.jpg', ['A man rides a bike on a dirt path', 'A person biking through the woods', 'A person dirtbikes along a muddy trail', 'Dirt bike rider riding down the slope', 'Person rides a dirt bike down a dirt hill']), ('./image/Flickr8k_Dataset\\2308256827_3c0a7d514d.jpg', ['A playground with two children and an adult'])]
Image Processing¶
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
normalize = Normalize(
mean=feature_extractor.image_mean,
std=feature_extractor.image_std
)
process_image = Compose(
[
RandomResizedCrop(list(feature_extractor.size.values())), # Data augmentation. Take a random resized crop of our image
ToTensor(), # Convert to pytorch tensor
normalize # normalize pixel values to look like images during pre-training
]
)
rows = []
# It is ok to have multiple captions per image because of data augmentation
for path, captions in map_captions.items():
for caption in captions:
rows.append({'path': path, 'caption': caption})
image_df = pd.DataFrame(rows)
image_dataset = Dataset.from_pandas(image_df)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
def image_preprocess(examples):
# ViT expects pixel_values instead of input_ids
examples['pixel_values'] = [process_image(Image.open(path)) for path in examples['path']]
# We are padding tokens here instead of using a datacollator
tokenized = gpt2_tokenizer(
examples['caption'], padding='max_length', max_length=10, truncation=True
)['input_ids']
# the output captions
examples['labels'] = [[l if l != gpt2_tokenizer.pad_token_id else -100 for l in t] for t in tokenized]
# delete unused keys
del examples['path']
del examples['caption']
return examples
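To see what the label masking step produces, here is a small illustrative check (assuming the tokenizer loaded above): pad positions are replaced by -100 so the loss ignores them.
# Illustrative: pad tokens in the label sequence become -100 (ignored by the loss)
ids = gpt2_tokenizer("A dog runs", padding='max_length', max_length=10, truncation=True)['input_ids']
masked = [tok if tok != gpt2_tokenizer.pad_token_id else -100 for tok in ids]
print(ids)      # caption token ids followed by pad (eos) ids
print(masked)   # same ids with pads replaced by -100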
image_dataset = image_dataset.map(image_preprocess, batched=True)
# Train test split
image_dataset = image_dataset.train_test_split(test_size=0.1)
image_dataset
DatasetDict({ train: Dataset({ features: ['pixel_values', 'labels'], num_rows: 3259 }) test: Dataset({ features: ['pixel_values', 'labels'], num_rows: 363 }) })
# We set a pad token and a start token in our combined model to be the same as gpt2
model.config.pad_token = gpt2_tokenizer.pad_token
model.config.pad_token_id = gpt2_tokenizer.pad_token_id
model.config.decoder_start_token = gpt2_tokenizer.bos_token
model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id
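With the pad and decoder start tokens in place, a single forward pass shows what the Trainer will optimize: the ViT encodes pixel_values and distilgpt2 predicts the caption tokens through cross-attention. A minimal sketch with a random tensor as a stand-in for a preprocessed image and a made-up caption:
# Minimal sketch: one forward pass of the combined model; the Trainer minimizes this loss
dummy_pixels = torch.randn(1, 3, 224, 224)                     # stand-in for a preprocessed image
labels = gpt2_tokenizer("A dog runs on the grass", return_tensors='pt').input_ids
with torch.no_grad():
    outputs = model(pixel_values=dummy_pixels, labels=labels)  # labels are shifted internally
print(outputs.loss)                                            # causal LM cross-entropy loss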
Freezing Parameters¶
Since the ViT model is very large, fine-tuning all parameters would take a long time, so we can freeze some parameters to speed up the fine-tuning process.
# Get the number of layers
config = model.config
num_layers = config.encoder.num_hidden_layers
print("Number of hidden layers in the VisionEncoderDecoderConfig model:", num_layers)
Number of hidden layers in the VisionEncoderDecoderConfig model: 12
## To speed up training, we could freeze the ViT encoder except its last layer (not applied in this run)
#for name, param in model.encoder.named_parameters():
#    if 'encoder.layer.11' not in name:   # keep only the last ViT layer trainable
#        param.requires_grad = False      # disable training for the rest of the ViT
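A quick way to check the effect of any freezing choice is to recount the trainable parameters (a small sketch; with nothing frozen it prints the full 182,485,248):
# Count only the parameters that the optimizer will update
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")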
Fine-tune Vision Model¶
epochs = 3
batch_size = 5
from transformers import set_seed
set_seed(42)
training_args = TrainingArguments(
output_dir='./caption_image', # The output directory
overwrite_output_dir=True, # overwrite the content of the output directory
num_train_epochs=epochs, # number of training epochs
per_device_train_batch_size=batch_size, # batch size for training
per_device_eval_batch_size=batch_size, # batch size for evaluation
load_best_model_at_end=True,
log_level='info',
logging_steps=50,
evaluation_strategy='epoch',
save_strategy='epoch',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=image_dataset['train'],
eval_dataset=image_dataset['test'],
)
import os
os.environ['WANDB_DISABLED'] = 'true'
trainer.evaluate()
***** Running Evaluation ***** Num examples = 363 Batch size = 5
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
`offline` in this directory.
Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
{'eval_loss': 5.0215582847595215, 'eval_runtime': 144.0833, 'eval_samples_per_second': 2.519, 'eval_steps_per_second': 0.507}
trainer.train()
C:\Users\mrezv\anaconda3\lib\site-packages\transformers\optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning warnings.warn( ***** Running training ***** Num examples = 3259 Num Epochs = 3 Instantaneous batch size per device = 5 Total train batch size (w. parallel, distributed & accumulation) = 5 Gradient Accumulation steps = 1 Total optimization steps = 1956 Number of trainable parameters = 182485248
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 3.053000 | 2.941434 |
2 | 2.554100 | 2.813167 |
3 | 2.215500 | 2.802942 |
***** Running Evaluation ***** Num examples = 363 Batch size = 5 Saving model checkpoint to ./caption_image\checkpoint-652 Configuration saved in ./caption_image\checkpoint-652\config.json Configuration saved in ./caption_image\checkpoint-652\generation_config.json Model weights saved in ./caption_image\checkpoint-652\pytorch_model.bin ***** Running Evaluation ***** Num examples = 363 Batch size = 5 Saving model checkpoint to ./caption_image\checkpoint-1304 Configuration saved in ./caption_image\checkpoint-1304\config.json Configuration saved in ./caption_image\checkpoint-1304\generation_config.json Model weights saved in ./caption_image\checkpoint-1304\pytorch_model.bin ***** Running Evaluation ***** Num examples = 363 Batch size = 5 Saving model checkpoint to ./caption_image\checkpoint-1956 Configuration saved in ./caption_image\checkpoint-1956\config.json Configuration saved in ./caption_image\checkpoint-1956\generation_config.json Model weights saved in ./caption_image\checkpoint-1956\pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) Loading best model from ./caption_image\checkpoint-1956 (score: 2.8029420375823975).
TrainOutput(global_step=1956, training_loss=2.7051771218547547, metrics={'train_runtime': 10062.1453, 'train_samples_per_second': 0.972, 'train_steps_per_second': 0.194, 'total_flos': 1.2636248585954918e+18, 'train_loss': 2.7051771218547547, 'epoch': 3.0})
# the loss decline is starting to slow down. This is a good indication that we may want to try training on more data
trainer.save_model()
Saving model checkpoint to ./caption_image Configuration saved in ./caption_image\config.json Configuration saved in ./caption_image\generation_config.json Model weights saved in ./caption_image\pytorch_model.bin
Fine-tuned Model for Prediction¶
# loading model and config from pretrained folder
finetuned_model = VisionEncoderDecoderModel.from_pretrained('./caption_image')
loading configuration file ./caption_image\config.json Model config VisionEncoderDecoderConfig { "_commit_hash": null, "architectures": [ "VisionEncoderDecoderModel" ], "decoder": { "_name_or_path": "distilgpt2", "_num_labels": 1, "activation_function": "gelu_new", "add_cross_attention": true, "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.1, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 50256, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "embd_pdrop": 0.1, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 50256, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "is_decoder": true, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0 }, "layer_norm_epsilon": 1e-05, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_inner": null, "n_layer": 6, "n_positions": 1024, "no_repeat_ngram_size": 0, "num_beam_groups": 1, "num_beams": 1, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "reorder_and_upcast_attn": false, "repetition_penalty": 1.0, "resid_pdrop": 0.1, "return_dict": true, "return_dict_in_generate": false, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "sep_token_id": null, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "suppress_tokens": null, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": null, "torchscript": false, "transformers_version": "4.26.1", "typical_p": 1.0, "use_bfloat16": false, "use_cache": true, "vocab_size": 50257 }, "decoder_start_token": "<|endoftext|>", "decoder_start_token_id": 50256, "encoder": { "_name_or_path": "google/vit-base-patch16-224-in21k", "add_cross_attention": false, "architectures": [ "ViTModel" ], "attention_probs_dropout_prob": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "encoder_stride": 16, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 224, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-12, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "vit", "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 12, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": 
false, "pad_token_id": null, "patch_size": 16, "prefix": null, "problem_type": null, "pruned_heads": {}, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": null, "torchscript": false, "transformers_version": "4.26.1", "typical_p": 1.0, "use_bfloat16": false }, "is_encoder_decoder": true, "model_type": "vision-encoder-decoder", "pad_token": "<|endoftext|>", "pad_token_id": 50256, "tie_word_embeddings": false, "torch_dtype": "float32", "transformers_version": null } loading weights file ./caption_image\pytorch_model.bin Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" } Generate config GenerationConfig { "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" } All model checkpoint weights were used when initializing VisionEncoderDecoderModel. All the weights of VisionEncoderDecoderModel were initialized from the model checkpoint at ./caption_image. If your task is similar to the task the model of the checkpoint was trained on, you can already use VisionEncoderDecoderModel for predictions without further training. loading configuration file ./caption_image\generation_config.json Generate config GenerationConfig { "_from_model_config": true, "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" }
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
loading configuration file preprocessor_config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\preprocessor_config.json loading configuration file config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\config.json Model config ViTConfig { "_name_or_path": "google/vit-base-patch16-224-in21k", "architectures": [ "ViTModel" ], "attention_probs_dropout_prob": 0.0, "encoder_stride": 16, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "image_size": 224, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "model_type": "vit", "num_attention_heads": 12, "num_channels": 3, "num_hidden_layers": 12, "patch_size": 16, "qkv_bias": true, "transformers_version": "4.26.1" } C:\Users\mrezv\anaconda3\lib\site-packages\transformers\models\vit\feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead. warnings.warn( size should be a dictionary on of the following set of keys: ({'height', 'width'}, {'shortest_edge'}, {'longest_edge', 'shortest_edge'}), got 224. Converted to {'height': 224, 'width': 224}. Image processor ViTFeatureExtractor { "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.5, 0.5, 0.5 ], "image_processor_type": "ViTFeatureExtractor", "image_std": [ 0.5, 0.5, 0.5 ], "resample": 2, "rescale_factor": 0.00392156862745098, "size": { "height": 224, "width": 224 } }
normalize = Normalize(
mean=feature_extractor.image_mean,
std=feature_extractor.image_std
)
# Create a new composition for inference (it reuses the same crop, tensor conversion and normalization as training)
inferenceprocess_image = Compose(
[
RandomResizedCrop(list(feature_extractor.size.values())),
ToTensor(),
normalize
]
)
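Since RandomResizedCrop is stochastic, a deterministic alternative for inference could rely on the Resize transform imported earlier; a minimal sketch (not used for the results below):
# Alternative deterministic preprocessing for inference: resize only, no random crop
deterministic_process_image = Compose(
    [
        Resize(list(feature_extractor.size.values())),  # resize to 224x224
        ToTensor(),
        normalize
    ]
)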
# a helper function to caption images from the web or a file path
def caption_image(m,path,num_beams=3,max_length=15,top_k=10, num_return_sequences=5):
if 'http' in path:
response = requests.get(path)
img = Image.open(BytesIO(response.content))
image_matrix = inferenceprocess_image(img).unsqueeze(0)
else:
img = Image.open(path)
image_matrix = inferenceprocess_image(img).unsqueeze(0)
generated = m.generate(
image_matrix,
num_beams=num_beams,
max_length=max_length,
early_stopping=True,
do_sample=True,
top_k=top_k,
num_return_sequences=num_return_sequences,
)
caption_options = [gpt2_tokenizer.decode(g, skip_special_tokens=True).strip() for g in generated]
display(img)
return caption_options, generated, image_matrix
Example 1¶
captions, generated, image_matrix = caption_image( # Out of sample photo
finetuned_model, list(map_captions_test.items())[0][0]
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" } C:\Users\mrezv\anaconda3\lib\site-packages\transformers\generation\utils.py:1186: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation) warnings.warn(
['A man is riding a bike down a dirt road with a helmet on', 'A man is riding a bike through the woods on a sunny day in', 'A man is riding a bike through a forest covered area near a creek', 'A man is riding a bike down a hill in a race car with', 'A man riding a bike down a hill with a helmet on his head']
Example 2¶
captions, generated, image_matrix = caption_image( # Another one
finetuned_model, list(map_captions_test.items())[1][0]
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" }
['A boy in a blue shirt is jumping on a swing set in a', 'A boy is jumping on a swing set in a backyard gymnasium', 'A boy is jumping on a swing set on a swing set in a', 'A young boy is swinging on a swing set in a yard with a', 'A boy is jumping on a swing set in a backyard gymnasium']
Example 3¶
captions, generated, image_matrix = caption_image( # from our Flickr dataset
finetuned_model,
list(map_captions_test.items())[2][0]
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" }
['A black dog is running through the water with a stick in its mouth', 'A black dog jumps into the water to catch a tennis ball in the', 'A black dog jumping into the water to catch a ball in a lake', 'A black dog is running through the water with a stick in its mouth', 'A black dog runs through the water with a stick in its mouth,']
Example Image with url¶
url = "https://raw.githubusercontent.com/MehdiRezvandehy/Machine-Learning-Course-University-of-Calgary/master/Images/2308978137_bfe776d541.jpg"
captions, generated, image_matrix = caption_image( # Out of sample photo
finetuned_model, url
)
captions
Generate config GenerationConfig { "decoder_start_token_id": 50256, "pad_token_id": 50256, "transformers_version": "4.26.1" }
['A group of people gather to watch a group of people gather to watch', 'A group of people watch a group of people on a snowy day in', 'A group of people gather to watch a group of people gather to watch', 'A group of people gathered around a campfire outside a campfire in', 'A group of people gather for the camera at a festival in the desert']
Not Fine-tuned Model for Prediction¶
non_finetuned = VisionEncoderDecoderModel.from_encoder_decoder_pretrained('google/vit-base-patch16-224-in21k',
'distilgpt2')
loading configuration file config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\config.json Model config ViTConfig { "_name_or_path": "google/vit-base-patch16-224-in21k", "architectures": [ "ViTModel" ], "attention_probs_dropout_prob": 0.0, "encoder_stride": 16, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 768, "image_size": 224, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "model_type": "vit", "num_attention_heads": 12, "num_channels": 3, "num_hidden_layers": 12, "patch_size": 16, "qkv_bias": true, "transformers_version": "4.26.1" } loading weights file pytorch_model.bin from cache at C:\Users\mrezv/.cache\huggingface\hub\models--google--vit-base-patch16-224-in21k\snapshots\7cbdb7ee3a6bcdf99dae654893f66519c480a0f8\pytorch_model.bin All model checkpoint weights were used when initializing ViTModel. All the weights of ViTModel were initialized from the model checkpoint at google/vit-base-patch16-224-in21k. If your task is similar to the task the model of the checkpoint was trained on, you can already use ViTModel for predictions without further training. loading configuration file config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilgpt2\snapshots\38cc92ec43315abd5136313225e95acc5986876c\config.json Model config GPT2Config { "_name_or_path": "distilgpt2", "_num_labels": 1, "activation_function": "gelu_new", "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.1, "bos_token_id": 50256, "embd_pdrop": 0.1, "eos_token_id": 50256, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "label2id": { "LABEL_0": 0 }, "layer_norm_epsilon": 1e-05, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_inner": null, "n_layer": 6, "n_positions": 1024, "reorder_and_upcast_attn": false, "resid_pdrop": 0.1, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "transformers_version": "4.26.1", "use_cache": true, "vocab_size": 50257 } Initializing distilgpt2 as a decoder model. Cross attention layers are added to distilgpt2 and randomly initialized if distilgpt2's architecture allows for cross attention layers. loading weights file model.safetensors from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilgpt2\snapshots\38cc92ec43315abd5136313225e95acc5986876c\model.safetensors Generate config GenerationConfig { "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" } All model checkpoint weights were used when initializing GPT2LMHeadModel. 
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['transformer.h.3.crossattention.c_attn.weight', 'transformer.h.3.crossattention.q_attn.weight', 'transformer.h.5.crossattention.c_attn.weight', 'transformer.h.2.crossattention.c_proj.weight', 'transformer.h.5.ln_cross_attn.weight', 'transformer.h.2.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.5.crossattention.q_attn.weight', 'transformer.h.5.crossattention.bias', 'transformer.h.3.crossattention.c_proj.bias', 'transformer.h.3.crossattention.bias', 'transformer.h.2.ln_cross_attn.weight', 'transformer.h.1.crossattention.bias', 'transformer.h.4.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.4.crossattention.c_proj.weight', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.2.crossattention.masked_bias', 'transformer.h.3.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.3.crossattention.masked_bias', 'transformer.h.4.crossattention.c_proj.bias', 'transformer.h.3.ln_cross_attn.weight', 'transformer.h.5.crossattention.c_proj.bias', 'transformer.h.4.ln_cross_attn.weight', 'transformer.h.0.crossattention.masked_bias', 'transformer.h.2.crossattention.c_proj.bias', 'transformer.h.4.crossattention.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.2.crossattention.bias', 'transformer.h.5.crossattention.c_proj.weight', 'transformer.h.0.crossattention.bias', 'transformer.h.5.crossattention.masked_bias', 'transformer.h.4.crossattention.masked_bias', 'transformer.h.4.crossattention.c_attn.weight', 'transformer.h.1.crossattention.masked_bias', 'transformer.h.2.crossattention.q_attn.weight', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.c_attn.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. loading configuration file generation_config.json from cache at C:\Users\mrezv/.cache\huggingface\hub\models--distilgpt2\snapshots\38cc92ec43315abd5136313225e95acc5986876c\generation_config.json Generate config GenerationConfig { "_from_model_config": true, "bos_token_id": 50256, "eos_token_id": 50256, "transformers_version": "4.26.1" } Setting `config.is_decoder=True` and `config.add_cross_attention=True` for decoder_config Generate config GenerationConfig { "transformers_version": "4.26.1" }
Example 1¶
captions, generated, image_matrix = caption_image( # Out of sample photo
non_finetuned, list(map_captions_test.items())[0][0]
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['.', '.', 'The U.S. military has been accused of violating the U.', 'The U.S. Department of Homeland Security announced on Monday that it', '.']
Example 2¶
captions, generated, image_matrix = caption_image( # Another one
non_finetuned, list(map_captions_test.items())[1][0]
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['"I think that\'s a good thing. I think that\'s', '"I don\'t think it\'s fair to say that it\'s', '"I don\'t think it\'s going to be easy for the', '.', '.']
Example 3¶
captions, generated, image_matrix = caption_image( # from our Flickr dataset
non_finetuned,
list(map_captions_test.items())[2][0]
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['"I don\'t know what to do with that," he said', ', and it’s a good thing.\n\n“', '.', 'The New York Times is reporting that the Trump administration is planning to build', 'A man was shot and killed in a shooting at a convenience store in']
Example Image with url¶
url = "https://raw.githubusercontent.com/MehdiRezvandehy/Machine-Learning-Course-University-of-Calgary/master/Images/2308978137_bfe776d541.jpg"
captions, generated, image_matrix = caption_image( # Out of sample photo
non_finetuned, url
)
captions
Generate config GenerationConfig { "transformers_version": "4.26.1" } The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
['The U.S. Department of Homeland Security has issued a warning to', '"It\'s not just about me, it\'s about me,"', '.', 'The U.S. Supreme Court has ruled that the U.S', 'A new report from the Center for Disease Control and Prevention (CDC)']
list(map_captions_test.items())[3][0]
'./image/Flickr8k_Dataset\\2308978137_bfe776d541.jpg'