Summary

This work builds a visual search system that retrieves images based on how similar they look to a given query image. It works by converting images into numeric representations that capture their visual features. These representations are then compared to find the most visually similar images from a collection. The system demonstrates how visual similarity can be used for tasks like finding duplicates, recommending related content, or organizing large image datasets. This end-to-end example shows the core steps from image processing to search and result visualization.

Python functions and data files needed to run this notebook are available via this link.

Table of Contents

  • 1  Introduction
    • 1.1  How Visual Search Works
    • 1.2  Representation Learning for Visual Search Systems
      • 1.2.1  Tools & Frameworks for Implementation
  • 2  Image Data Processing & Feature Engineering
    • 2.1  Image Preprocessing (Feature Engineering)
  • 3  User & User-Image Interaction Data Engineering
    • 3.1  User Metadata
    • 3.2  User-Image Interaction Features
    • 3.3  Feature Engineering for User-Image Interactions
  • 4  Example: Implementation of Visual Search System
    • 4.1  Fashion Product Images Dataset
    • 4.2  Augmentation for Test set
    • 4.3  Pre-trained ResNet Model
    • 4.4  Image transformation
    • 4.5  Image Embedding for Train Set
    • 4.6  Example for Prediction
    • 4.7  Get Ground Truth Images for Test set
  • 5  Offline Evaluation Metrics
    • 5.1  Mean Reciprocal Rank (MRR)
    • 5.2  Recall@K
    • 5.3  Precision@K
    • 5.4  Mean Average Precision (mAP)
    • 5.5  Normalized Discounted Cumulative Gain (nDCG)
  • 6  Appendix

Reference: https://medium.com/analytics-vidhya/how-to-build-a-visual-search-engine-64046e58ad2f

Introduction¶

Visual search is a form of similarity search in which the items being searched are images or objects within images. It is commonly used for tasks such as finding similar products on e-commerce websites or identifying rare scenarios for training self-driving cars. The challenge in visual search, and the reason it remains an active area of research, is that users typically prioritize semantic similarity over visual similarity. For example, if I search for an image of a dog, I expect to see other images of dogs in the results, rather than images that merely share similar colors or pixel patterns.

How Visual Search Works¶

To build a visual search engine, the first step is to create an index—a data structure that maps search queries to relevant results. The search engine scans this index to identify the most similar results to a given query.

Since visual search relies on similarity matching, we must define how similarity is measured between images. This is achieved by converting images into numerical representations, known as embedding vectors, which capture various features. These embeddings are typically generated using heuristics or deep learning-based feature extraction models in computer vision. The similarity between images is then quantified using the inner product of their embedding vectors.

During a search, the engine converts a query image into an embedding vector and searches the index for the most similar vectors—those with the highest inner product relative to the query’s embedding. A brute-force search scales linearly with the number of indexed images, making it impractical for large-scale datasets containing billions of images. To address this, we use approximate nearest neighbor search (ANNS), which balances speed and accuracy effectively.

Here's a synthetic example of an index for a visual search engine. We have a small dataset of images with feature vectors extracted using a deep learning model (e.g., ResNet, ViT). We build an index that maps these feature vectors to image IDs.

| Image ID | Feature Vector (Embeddings) | Metadata (Optional) |
|----------|------------------------------|---------------------|
| img_001 | [0.12, 0.85, 0.43, ..., 0.91] | "Red car, outdoor, daytime" |
| img_002 | [0.78, 0.22, 0.56, ..., 0.34] | "Blue truck, highway, night" |
| img_003 | [0.44, 0.67, 0.89, ..., 0.12] | "Black dog, park, running" |
| img_004 | [0.31, 0.77, 0.23, ..., 0.65] | "Sunset over ocean, waves" |
| img_005 | [0.99, 0.05, 0.48, ..., 0.78] | "Person with sunglasses, city" |
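
To make the search step concrete, here is a minimal, self-contained sketch of brute-force retrieval over a toy index like the one above (the vectors and IDs are illustrative, not taken from a real dataset): every indexed embedding is compared with the query via an inner product and the highest-scoring images are returned. The FAISS index used later in this notebook replaces this linear scan with optimized (optionally approximate) search.

import numpy as np

# Toy index: five images with 4-dimensional embeddings (illustrative values only)
image_ids = ["img_001", "img_002", "img_003", "img_004", "img_005"]
index_vectors = np.array([
    [0.12, 0.85, 0.43, 0.91],
    [0.78, 0.22, 0.56, 0.34],
    [0.44, 0.67, 0.89, 0.12],
    [0.31, 0.77, 0.23, 0.65],
    [0.99, 0.05, 0.48, 0.78],
], dtype=np.float32)

# Query embedding (in practice it comes from the same feature extractor as the index)
query = np.array([0.10, 0.80, 0.40, 0.90], dtype=np.float32)

# Brute-force similarity: inner product between the query and every indexed vector
scores = index_vectors @ query

# Rank image IDs by similarity (highest inner product first) and keep the top matches
top_k = 3
for rank, idx in enumerate(np.argsort(-scores)[:top_k], start=1):
    print(f"{rank}. {image_ids[idx]} (score={scores[idx]:.3f})")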
In [1]:
#%pip install torch torchvision torchaudio
#%pip install faiss-cpu --no-build-isolation --no-cache-dir

Representation Learning for Visual Search Systems¶

In a visual search system, the goal is to retrieve images based on a query image (instead of text). Representation learning is crucial for such systems because it allows images to be transformed into compact, meaningful feature vectors that can be compared efficiently.

Step 1: Feature Extraction

  • Use a pretrained CNN (e.g., ResNet, EfficientNet) or a self-supervised model (e.g., SimCLR, DINO) to extract embeddings.

Step 2: Indexing & Storage

  • Store image embeddings in a vector database like FAISS, Annoy, or Milvus for fast similarity search.

Step 3: Similarity Search

  • Use cosine similarity or Euclidean distance to find the closest embeddings to the query image.

Step 4: Retrieval & Ranking

  • Retrieve the top-K similar images and rank them based on relevance (see the sketch after this list).
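
The four steps map onto only a few lines of code. Below is a minimal sketch under stated assumptions: a pretrained ResNet-18 from torchvision (version 0.13 or later, for the weights argument) as the feature extractor, cosine similarity via a FAISS inner-product index on L2-normalized embeddings, and placeholder paths (gallery_images/, query.jpg) that are not part of this notebook's dataset. The full worked example in Section 4 uses L2 distance instead.

import os
import numpy as np
import torch
import faiss
from PIL import Image
import torchvision.models as models
import torchvision.transforms as transforms

# Step 1: feature extractor -- a pretrained ResNet-18 without its classification head
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).squeeze().numpy().astype("float32")

# Step 2: index all gallery embeddings (cosine similarity = inner product on unit vectors)
gallery_paths = sorted(os.path.join("gallery_images", f) for f in os.listdir("gallery_images"))
gallery = np.stack([embed(p) for p in gallery_paths])
faiss.normalize_L2(gallery)
index = faiss.IndexFlatIP(gallery.shape[1])
index.add(gallery)

# Step 3: embed the query image and search the index
query = embed("query.jpg")[None, :]
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)

# Step 4: rank and report the top-K results
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. {gallery_paths[i]} (cosine similarity = {s:.3f})")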

Tools & Frameworks for Implementation¶

✅ TensorFlow/Keras or PyTorch – Train deep learning models.
✅ FAISS / Annoy / Milvus – Fast nearest neighbor search.
✅ OpenCV / PIL – Image preprocessing.
✅ Hugging Face Transformers (for CLIP) – Use multimodal search models (a minimal embedding sketch follows).
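
For reference, here is a minimal sketch of extracting an image embedding with CLIP through Hugging Face Transformers; the checkpoint name openai/clip-vit-base-patch32 and the file example.jpg are illustrative choices, and this notebook itself uses a ResNet backbone instead.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a CLIP checkpoint and its matching preprocessor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Preprocess an image and extract its embedding
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs)  # shape: (1, 512) for this checkpoint

# L2-normalize so that inner products behave as cosine similarities
features = features / features.norm(dim=-1, keepdim=True)
print(features.shape)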

Image Data Processing & Feature Engineering¶

  • Data Collection
    • Collect images from multiple sources (e.g., e-commerce websites, social media, product catalogs).
    • Store images efficiently in cloud storage (AWS S3, Google Cloud Storage) or databases (PostgreSQL, MongoDB).

Image Preprocessing (Feature Engineering)¶

Before feeding images into a deep learning model, they need to be processed:

  1. Resizing

🔹 Why? Standardizes input dimensions for deep learning models.
🔹 How?

  • CNNs typically require fixed-size inputs (e.g., 224×224 for ResNet and most ViT variants).
  • Use libraries like OpenCV, PIL, or torchvision for resizing.
from PIL import Image
image = Image.open("image.jpg").resize((224, 224))
  2. Normalization

🔹 Why? Ensures pixel values are within a specific range (e.g., [0,1] or [-1,1]) for stable training.
🔹 How?

  • Convert pixel values from [0, 255] to [0,1] or standardize with mean and std deviation.
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),  # Converts to [0,1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # ResNet normalization
])
  3. Data Augmentation

🔹 Why? Increases dataset size and diversity, reducing overfitting.
🔹 How? Apply transformations like rotation, flipping, color jitter, and cropping.

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor()
])
  4. Feature Extraction (Embeddings)
  • Use a pre-trained CNN (e.g., ResNet, ViT, CLIP) to extract deep image features.
  • Store embeddings for fast retrieval.
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove last classification layer

# Convert image to embedding
def get_embedding(image):
    image = transform(image).unsqueeze(0)
    with torch.no_grad():
        embedding = model(image).squeeze().numpy()
    return embedding

User & User-Image Interaction Data Engineering¶

Apart from image content, user interaction signals can improve search relevance.

User Metadata¶

| Feature | Description |
|---------|-------------|
| User ID | Unique identifier |
| Preferences | Categories of interest |
| Search History | Past queries and clicks |
| Location & Device | Used for personalization |

User-Image Interaction Features¶

| Feature | Description |
|---------|-------------|
| Clicks | How often a user clicks an image |
| Dwell Time | Time spent viewing an image |
| Purchases | Whether the user bought the item |
| Likes/Saves | Whether the user favorited the image |

Feature Engineering for User-Image Interactions¶

  1. One-Hot Encoding: Convert categorical features (e.g., location, device type) into vectors.
  2. Embedding Learning: Learn user embeddings (e.g., via Word2Vec or deep learning).
  3. Interaction Scores: Compute engagement scores (e.g., a weighted sum of clicks, likes, and dwell time); see the sketch below.
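
A minimal sketch of ideas 1 and 3 on a made-up interaction log (column names and weights are illustrative assumptions; the embedding learning of idea 2 is omitted):

import pandas as pd

# Hypothetical user-image interaction log (values are illustrative only)
interactions = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u3"],
    "image_id":   ["img_001", "img_002", "img_001", "img_003"],
    "device":     ["mobile", "desktop", "mobile", "tablet"],
    "clicks":     [3, 1, 0, 5],
    "dwell_time": [12.0, 3.5, 0.0, 40.0],  # seconds
    "liked":      [1, 0, 0, 1],
    "purchased":  [0, 0, 0, 1],
})

# 1. One-hot encode categorical features such as device type
encoded = pd.get_dummies(interactions, columns=["device"], prefix="device")

# 3. Interaction score: a weighted sum of engagement signals (weights are arbitrary here)
weights = {"clicks": 1.0, "dwell_time": 0.1, "liked": 2.0, "purchased": 5.0}
encoded["engagement"] = sum(encoded[col] * w for col, w in weights.items())

print(encoded[["user_id", "image_id", "engagement"]])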

Example: Implementation of Visual Search System¶

In [2]:
#pip cache purge  # Clears cached versions
#pip install imgaug
#pip install numpy==1.26.4 pandas scikit-learn matplotlib
#conda install -c pytorch faiss-cpu
#conda install pytorch cpuonly -c pytorch
In [3]:
import os
import re
import pandas as pd
import imgaug.augmenters as iaa
import cv2
import numpy as np
In [ ]:
 

Fashion Product Images Dataset¶

In [4]:
#pip install kagglehub
#import kagglehub
#
## Download latest version
#path = kagglehub.dataset_download("paramaggarwal/fashion-product-images-dataset")
#
#print("Path to dataset files:", path)

The Fashion Product Images Dataset is a large-scale dataset designed for visual search, classification, and recommendation tasks in fashion e-commerce. It contains images of fashion products, along with metadata such as category labels, product descriptions, and sometimes attributes like color, pattern, or brand.

Key Features

  • High-quality product images: Typically scraped from e-commerce websites.
  • Multiple categories: Clothing, footwear, accessories, and more.
  • Ground truth labels: Product category, subcategory, brand, and sometimes attributes (e.g., sleeve length, fabric type).
  • Perfect for visual search: Can be used to develop image-based retrieval systems (e.g., finding similar fashion items).
In [5]:
# Load images from a folder and extract embeddings
image_folder = "fashion-product-images-dataset/training_images"  
image_paths = [os.path.join(image_folder, img) for img in os.listdir(image_folder)]
In [6]:
# load test folder
image_folder_test = "fashion-product-images-dataset/test_images"  
image_paths_test = [os.path.join(image_folder_test, img) for img in os.listdir(image_folder_test)]

Augmentation for Test set¶

Augment the test-set images and save the augmented copies into the training folder; these copies later serve as the ground-truth matches for each test query.

In [7]:
def load_images_from_folder(folder):
    images, filenames = [], []
    valid_exts = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff')
    for filename in os.listdir(folder):
        if filename.lower().endswith(valid_exts):
            path = os.path.join(folder, filename)
            img = cv2.imread(path)
            if img is not None:
                images.append(img)
                filenames.append(filename)
            else:
                print(f"❌ Failed to load: {path}")
    return images, filenames

def apply_augmentation(images):
    augmenter = iaa.Sequential([
        iaa.Fliplr(0.5),
        iaa.Affine(rotate=(-20, 20)),
        iaa.GaussianBlur(sigma=(0.5, 2)),
        iaa.AdditiveGaussianNoise(scale=(5, 25)),
        iaa.Multiply((0.8, 1.2)),
    ])
    return augmenter.augment_images(images)

def save_images(folder, images, filenames, prefix="aug_"):
    os.makedirs(folder, exist_ok=True)
    for img, fname in zip(images, filenames):
        save_path = os.path.join(folder, prefix + fname)
        cv2.imwrite(save_path, img)

def main(input_folder, output_folder):
    images, filenames = load_images_from_folder(input_folder)
    if not images:
        print("❌ No images loaded.")
        return
    print(f"✅ Loaded {len(images)} images. Augmenting...")
    aug_images = apply_augmentation(images)
    save_images(output_folder, aug_images, filenames)
    print(f"✅ Augmented images saved to: {output_folder}")

if __name__ == "__main__":
    input_folder = image_folder_test
    output_folder = image_folder
    main(input_folder, output_folder)
✅ Loaded 100 images. Augmenting...
✅ Augmented images saved to: fashion-product-images-dataset/training_images

Pre-trained ResNet Model¶

In [8]:
import torch
import torchvision.transforms as transforms
import torchvision.models as models
import faiss
import numpy as np
import os
from PIL import Image
import matplotlib.pyplot as plt

# Load a pre-trained ResNet model (without the classification head)
model = models.resnet18(pretrained=True)
model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove last layer
model.eval()  # Set model to evaluation mode
C:\Users\mrezv\anaconda3\envs\faiss-env\lib\site-packages\torchvision\models\_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
C:\Users\mrezv\anaconda3\envs\faiss-env\lib\site-packages\torchvision\models\_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Out[8]:
Sequential(
  (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (5): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (6): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (7): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (8): AdaptiveAvgPool2d(output_size=(1, 1))
)

Image transformation¶

In [9]:
# Image transformation (resize, normalize, and convert to tensor)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

Image Embedding for Train Set¶

In [10]:
# Function to extract embeddings
def get_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    image = transform(image).unsqueeze(0)  # Add batch dimension
    with torch.no_grad():
        embedding = model(image).squeeze().numpy()  # Extract features
    return embedding.flatten()  # Convert to 1D array

# Get image embedding for training set
embeddings = np.array([get_embedding(img) for img in image_paths])

# Get image embedding for test set
embeddings_test = np.array([get_embedding(img) for img in image_paths_test])

FAISS (Facebook AI Similarity Search) is a high-performance library for efficient similarity search and clustering of dense vectors. It is widely used for nearest neighbor search in large-scale datasets, especially for applications like image retrieval, recommendation systems, and NLP embeddings.

How FAISS Works

  1. Indexing:

    • FAISS stores high-dimensional vectors and allows fast similarity searches.
    • It supports different types of indexes, such as:
      • IndexFlatL2: Exact search with L2 (Euclidean) distance.
      • IndexIVFFlat: Approximate search using inverted file lists for scalability.
      • IndexHNSW: Uses Hierarchical Navigable Small World (HNSW) graphs for efficiency.
  2. Search:

    • Given a query vector, FAISS quickly retrieves the K most similar vectors using approximate or exact nearest neighbor techniques.
    • It supports both Euclidean distance (L2) and inner product (IP), which is equivalent to cosine similarity when the vectors are L2-normalized.
  3. Optimization for Speed:

    • FAISS is optimized for GPU acceleration, multi-threading, and vectorized operations.
    • It's much faster than brute-force nearest neighbor search (e.g., using sklearn.neighbors.KNeighborsClassifier).

FAISS vs. KNN (KNeighborsClassifier)

| Feature | FAISS (IndexFlatL2) | KNeighborsClassifier |
|---------------|----------------------|----------------------|
| Speed | Faster for large datasets | Slower for large datasets |
| Scalability | Handles millions of vectors | Limited scalability |
| Approximation | Supports approximate search (IVF, HNSW) | Always exact |
| Distance Metrics | L2 (Euclidean), Inner Product (Cosine) | Euclidean, Manhattan, etc. |
| GPU Support | Yes | No |

When to Use FAISS?

✅ When dealing with large-scale datasets (millions of vectors).
✅ When fast retrieval is needed (e.g., real-time visual search).
✅ When you can trade some accuracy for speed (e.g., using approximate search); a minimal approximate-index sketch follows.
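
To illustrate the accuracy-for-speed trade-off, here is a minimal sketch of an approximate FAISS index (IndexIVFFlat) on random vectors; the dimensions, nlist, and nprobe values are arbitrary assumptions. The worked example below sticks with the exact IndexFlatL2.

import numpy as np
import faiss

d, n = 512, 20_000                         # embedding dimension, number of vectors
rng = np.random.default_rng(0)
xb = rng.random((n, d), dtype=np.float32)  # stand-in for real image embeddings
xq = rng.random((5, d), dtype=np.float32)  # a few query vectors

# Approximate index: vectors are assigned to nlist clusters ("inverted lists")
nlist = 256
quantizer = faiss.IndexFlatL2(d)           # coarse quantizer used for clustering
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # learn the cluster centroids
index.add(xb)

# At query time only nprobe clusters are scanned: larger nprobe = slower but more accurate
index.nprobe = 16
distances, ids = index.search(xq, 5)
print(ids)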

In [11]:
import random
#from sklearn.neighbors import KNeighborsClassifier

# Assuming embeddings is a numpy array of shape (N, D), and labels contains the image labels.
dimension = embeddings.shape[1]

# Create and build the FAISS (Facebook AI Similarity Search) index
index = faiss.IndexFlatL2(dimension)  # L2 distance for Euclidean similarity
index.add(embeddings)  # Add embeddings to the FAISS index
In [12]:
# FAISS expects np.float32 arrays; passing float64 (NumPy's default) can crash
# or cause undefined behavior. The ResNet embeddings are already float32, so this
# cast is only a safeguard (ideally it would be applied before index.add above).
embeddings = embeddings.astype('float32')
In [13]:
# Number of neighbors to retrieve
k = 5  # Retrieve 5 nearest neighbors for each query

# Perform search for a query embedding (example query_embedding is from a test image)
query_embedding = embeddings[0].reshape(1, -1).astype('float32')  # Example: querying the first image
In [14]:
print("Embeddings shape:", embeddings.shape)
print("Embeddings dtype:", embeddings.dtype)
Embeddings shape: (3432, 512)
Embeddings dtype: float32
In [15]:
distances, indices = index.search(query_embedding, k)

# indices will give us the indices of the 5 nearest neighbors
nearest_neighbors = indices[0]

Example for Prediction¶

In [16]:
for i in range(10):
    # Query a test image and find similar images in the training index
    query_image_path = image_paths_test[i]  # Change this to test different queries
    query_embedding = np.array([get_embedding(query_image_path)])
    
    distances, indices = index.search(query_embedding, k)
    
    # indices gives the positions of the k nearest neighbors in the index
    nearest_neighbors = indices[0]
    
    # Display the query image
    fig, axes = plt.subplots(1, 1, figsize=(10, 3))
    axes.imshow(Image.open(query_image_path))
    axes.set_title("Query Image")
    
    # Display the retrieved matches (j avoids shadowing the outer loop variable i)
    fig, axes = plt.subplots(1, k, figsize=(15, 5))
    for j, idx in enumerate(nearest_neighbors):
        axes[j].imshow(Image.open(image_paths[idx]))
        axes[j].set_title(f"Match {j+1}")
    
    plt.show()
[Output: for each of the 10 test queries, the query image and its top-5 retrieved matches are displayed.]

Get Ground Truth Images for Test set¶

In [17]:
ground_truth = np.zeros((len(image_paths_test), k))
for itest in range(len(image_paths_test)):
    # Query each test image and retrieve its k nearest neighbors from the index
    query_image_path = image_paths_test[itest]
    query_embedding = np.array([get_embedding(query_image_path)])
    
    distances, indices = index.search(query_embedding, k)

    # A retrieved image counts as relevant if it is one of the augmented copies
    # (saved with the "aug_" prefix when the test set was augmented above)
    for i, idx in enumerate(indices[0]):
        if "aug_" in image_paths[idx]:
            ground_truth[itest, i] = 1
In [18]:
ground_truth[:10]
Out[18]:
array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])

Offline Evaluation Metrics¶

Reference: https://amitness.com/posts/information-retrieval-evaluation

Evaluating a visual search system based on the k nearest images requires ranking-based metrics that assess how well the retrieved images match the ground truth. Here's how you can evaluate it using different approaches:


Mean Reciprocal Rank (MRR)¶

  • MRR measures how quickly the first relevant image appears in the retrieved list.

  • It is calculated as:

    $MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}$

    where $rank_i$ is the position of the first relevant image for query $i$.

A schematic illustration is available at https://amitness.com/posts/information-retrieval-evaluation.

Computation:

In [19]:
def mean_reciprocal_rank(y_true):
    reciprocal_ranks = []
    for row in y_true:
        ranks = np.where(row == 1)[0]
        if ranks.size > 0:
            reciprocal_ranks.append(1.0 / (ranks[0] + 1))  # ranks are 0-based
        else:
            reciprocal_ranks.append(0.0)
    return np.mean(reciprocal_ranks)
In [20]:
mrr = mean_reciprocal_rank(ground_truth)
print(f"mean_reciprocal_rank: {mrr:.4f}")
mean_reciprocal_rank: 0.4260
In [ ]:
 

Recall@K¶

  • Measures the proportion of queries where at least one relevant image appears in the top K retrieved images (since each query here has exactly one relevant image, this matches the standard definition of recall).
  • Formula: $ Recall@K = \frac{\text{# of queries with at least 1 relevant image in top-K}}{\text{Total queries}} $

Schematic illustration: https://amitness.com/posts/information-retrieval-evaluation

Computation:

In [21]:
def recall_at_k(y_true, k):
    """
    y_true: numpy array of shape (n_queries, top_k) with binary labels (1=relevant, 0=not)
    Assumes each query has exactly one relevant item.
    """
    clipped = y_true[:, :k]
    return np.mean(np.sum(clipped, axis=1))  # equivalent to mean of hit rate at K


rec = recall_at_k(ground_truth, k)
print(f"Recall@{k}: {rec:.2f}")
Recall@5: 0.56

Precision@K¶

  • Measures how many of the top K retrieved images are relevant.
  • Formula:

$ Precision@K = \frac{\text{# of relevant images in top-K}}{K} $

Schematic illustration: https://amitness.com/posts/information-retrieval-evaluation

Computation:

In [22]:
def precision_at_k(y_true, k):
    """
    y_true: numpy array of shape (n_queries, top_k) with binary labels (1=relevant, 0=not)
    k: cutoff rank
    """
    clipped = y_true[:, :k]
    return np.mean(np.sum(clipped, axis=1) / k)

p_at_k = precision_at_k(ground_truth, k)
print(f"Precision@{k}: {p_at_k:.2f}")
Precision@5: 0.11

Mean Average Precision (mAP)¶

  • Computes Average Precision (AP) per query and then averages over all queries.
  • AP is computed as: $ AP = \frac{1}{\text{number of relevant items}} \sum_{i} Precision@i \times rel_i $, where $rel_i = 1$ if the item at rank $i$ is relevant and $0$ otherwise.

Schematic illustration: https://amitness.com/posts/information-retrieval-evaluation

Computation:

In [23]:
def average_precision(row):
    precisions = []
    num_hits = 0
    for i, val in enumerate(row):
        if val == 1:
            num_hits += 1
            precisions.append(num_hits / (i + 1))
    return np.mean(precisions) if precisions else 0.0

def mean_average_precision(y_true):
    return np.mean([average_precision(row) for row in y_true])

print("mean_average_precision:", mean_average_precision(ground_truth))
mean_average_precision: 0.4230000000000001

Normalized Discounted Cumulative Gain (nDCG)¶

  • Considers both relevance and ranking position, penalizing relevant images appearing lower in the list.
  • DCG formula:

$ DCG@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)} $

  • nDCG is DCG normalized by the ideal DCG (IDCG), which is computed by sorting relevant images in the best possible order.

Schematic illustrations: https://amitness.com/posts/information-retrieval-evaluation

Computation:

In [24]:
def dcg(row):
    return np.sum([
        rel / np.log2(idx + 2)  # +2 because idx starts at 0
        for idx, rel in enumerate(row)
    ])

def ndcg(row):
    ideal_row = sorted(row, reverse=True)
    idcg = dcg(ideal_row)
    return dcg(row) / idcg if idcg > 0 else 0.0

def mean_ndcg(y_true):
    return np.mean([ndcg(row) for row in y_true])

ndcg_score = mean_ndcg(ground_truth)

print(f"nDCG: {ndcg_score:.4f}")
nDCG: 0.4554

Appendix¶

In [25]:
#train_image_name = []
#for i_image in image_paths:
#    # Extract filename without extension
#    filename = os.path.splitext(os.path.basename(i_image))[0]
#
#    # Extract numeric part using regex
#    match = re.search(r'\d+', filename)
#    number = match.group() if match else None
#    train_image_name.append(int(number))
##
#test_image_name = []
#for i_image in image_paths_test:
#    # Extract filename without extension
#    filename = os.path.splitext(os.path.basename(i_image))[0]
#
#    # Extract numeric part using regex
#    match = re.search(r'\d+', filename)
#    number = match.group() if match else None
#    test_image_name.append(int(number))
In [ ]:
 
In [26]:
#styles = pd.read_csv("./fashion-product-images-dataset/styles.csv", on_bad_lines='skip')
#ground_truth_test = []
#for i in range(len(test_image_name)):
#    season = styles[styles.id==test_image_name[i]].season.tolist()[0]
#    gender = styles[styles.id==test_image_name[i]].gender.tolist()[0]
#    articleType = styles[styles.id==test_image_name[i]].articleType.tolist()[0]
#    masterCategory = styles[styles.id==test_image_name[i]].masterCategory.tolist()[0]
#    subCategory = styles[styles.id==test_image_name[i]].subCategory.tolist()[0]
#    baseColour = styles[styles.id==test_image_name[i]].baseColour.tolist()[0]
#    usage = styles[styles.id==test_image_name[i]].usage.tolist()[0]
#    productDisplayName = styles[styles.id==test_image_name[i]].productDisplayName.tolist()[0]
#    
#    id_ = styles[(styles.season==season) &
#        (styles.gender==gender) &
#        (styles.articleType==articleType) &
#       (styles.masterCategory==masterCategory) &
#       (styles.subCategory==subCategory) &
#       (styles.baseColour==baseColour) &
#       (styles.productDisplayName==productDisplayName)]
#    ground_truth_test.append(id_.id.tolist())