Using GPT-4 with Vision to tag and caption images

Feb 28, 2024

This notebook explores how to leverage GPT-4V to tag & caption images.

We can leverage the multimodal capabilities of GPT-4V to provide input images along with additional context on what they represent, and prompt the model to output tags or image descriptions. The image descriptions can then be further refined with a language model (in this notebook, we'll use GPT-4-turbo) to generate captions.

Generating text from images is useful in many situations, especially those involving search.
We will illustrate a search use case in this notebook by using the generated keywords and product captions to search for products from either a text input or an image input.

As an example, we will use a dataset of Amazon furniture items, tag them with relevant keywords and generate short, descriptive captions.

# Install dependencies if needed
%pip install openai
%pip install scikit-learn
from IPython.display import Image, display
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from openai import OpenAI

# Initializing OpenAI client - see https://platform.openai.com/docs/quickstart?context=python
client = OpenAI()
# Loading dataset
dataset_path =  "data/amazon_furniture_dataset.csv"
df = pd.read_csv(dataset_path)
df.head()
asin url title brand price availability categories primary_image images upc ... color material style important_information product_overview about_item description specifications uniq_id scraped_at
0 B0CJHKVG6P https://www.amazon.com/dp/B0CJHKVG6P GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... GOYMFK $24.99 Only 13 left in stock - order soon. ['Home & Kitchen', 'Storage & Organization', '... https://m.media-amazon.com/images/I/416WaLx10j... ['https://m.media-amazon.com/images/I/416WaLx1... NaN ... White Metal Modern [] [{'Brand': ' GOYMFK '}, {'Color': ' White '}, ... ['Multiple layers: Provides ample storage spac... multiple shoes, coats, hats, and other items E... ['Brand: GOYMFK', 'Color: White', 'Material: M... 02593e81-5c09-5069-8516-b0b29f439ded 2024-02-02 15:15:08
1 B0B66QHB23 https://www.amazon.com/dp/B0B66QHB23 subrtex Leather ding Room, Dining Chairs Set o... subrtex NaN NaN ['Home & Kitchen', 'Furniture', 'Dining Room F... https://m.media-amazon.com/images/I/31SejUEWY7... ['https://m.media-amazon.com/images/I/31SejUEW... NaN ... Black Sponge Black Rubber Wood [] NaN ['【Easy Assembly】: Set of 2 dining room chairs... subrtex Dining chairs Set of 2 ['Brand: subrtex', 'Color: Black', 'Product Di... 5938d217-b8c5-5d3e-b1cf-e28e340f292e 2024-02-02 15:15:09
2 B0BXRTWLYK https://www.amazon.com/dp/B0BXRTWLYK Plant Repotting Mat MUYETOL Waterproof Transpl... MUYETOL $5.98 In Stock ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... https://m.media-amazon.com/images/I/41RgefVq70... ['https://m.media-amazon.com/images/I/41RgefVq... NaN ... Green Polyethylene Modern [] [{'Brand': ' MUYETOL '}, {'Size': ' 26.8*26.8 ... ['PLANT REPOTTING MAT SIZE: 26.8" x 26.8", squ... NaN ['Brand: MUYETOL', 'Size: 26.8*26.8', 'Item We... b2ede786-3f51-5a45-9a5b-bcf856958cd8 2024-02-02 15:15:09
3 B0C1MRB2M8 https://www.amazon.com/dp/B0C1MRB2M8 Pickleball Doormat, Welcome Doormat Absorbent ... VEWETOL $13.99 Only 10 left in stock - order soon. ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... https://m.media-amazon.com/images/I/61vz1Igler... ['https://m.media-amazon.com/images/I/61vz1Igl... NaN ... A5589 Rubber Modern [] [{'Brand': ' VEWETOL '}, {'Size': ' 16*24INCH ... ['Specifications: 16x24 Inch ', " High-Quality... The decorative doormat features a subtle textu... ['Brand: VEWETOL', 'Size: 16*24INCH', 'Materia... 8fd9377b-cfa6-5f10-835c-6b8eca2816b5 2024-02-02 15:15:10
4 B0CG1N9QRC https://www.amazon.com/dp/B0CG1N9QRC JOIN IRON Foldable TV Trays for Eating Set of ... JOIN IRON Store $89.99 Usually ships within 5 to 6 weeks ['Home & Kitchen', 'Furniture', 'Game & Recrea... https://m.media-amazon.com/images/I/41p4d4VJnN... ['https://m.media-amazon.com/images/I/41p4d4VJ... NaN ... Grey Set of 4 Iron X Classic Style [] NaN ['Includes 4 Folding Tv Tray Tables And one Co... Set of Four Folding Trays With Matching Storag... ['Brand: JOIN IRON', 'Shape: Rectangular', 'In... bdc9aa30-9439-50dc-8e89-213ea211d66a 2024-02-02 15:15:11

5 rows × 25 columns

Tag images

In this section, we'll use GPT-4V to generate relevant tags for our products.

We'll use a simple zero-shot approach to extract keywords, and deduplicate those keywords using embeddings to avoid having multiple keywords that are too similar.

We pass both the image and the product title to the model so that it does not extract keywords for other items in the picture: a scene sometimes shows several items, and we only want to tag the one named in the title.

system_prompt = '''
    You are an agent specialized in tagging images of furniture items, decorative items, or furnishings with relevant keywords that could be used to search for these items on a marketplace.
    
    You will be provided with an image and the title of the item that is depicted in the image, and your goal is to extract keywords for only the item specified. 
    
    Keywords should be concise and in lower case. 
    
    Keywords can describe things like:
    - Item type e.g. 'sofa bed', 'chair', 'desk', 'plant'
    - Item material e.g. 'wood', 'metal', 'fabric'
    - Item style e.g. 'scandinavian', 'vintage', 'industrial'
    - Item color e.g. 'red', 'blue', 'white'
    
    Only deduce material, style or color keywords when it is obvious that they make the item depicted in the image stand out.

    Return keywords in the format of an array of strings, like this:
    ['desk', 'industrial', 'metal']
    
'''

def analyze_image(img_url, title):
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        # The documented format is an object with a "url" key
                        "image_url": {"url": img_url},
                    },
                ],
            },
            {
                "role": "user",
                "content": title
            }
        ],
        max_tokens=300,
        top_p=0.1
    )

    return response.choices[0].message.content
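
The call above sends the image and the title as two separate user messages. As a variation, both can travel in a single user message made of content parts, and a local image can be sent inline as a base64 data URL. The sketch below only builds the message list; the helper names (`image_part_from_url`, `image_part_from_bytes`, `build_vision_messages`) and the example URL are hypothetical and not part of the notebook.

```python
import base64

def image_part_from_url(url: str) -> dict:
    """Content part for an image that is reachable over HTTP(S)."""
    return {"type": "image_url", "image_url": {"url": url}}

def image_part_from_bytes(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Content part for a local image, sent inline as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}

def build_vision_messages(system_prompt: str, image_part: dict, title: str) -> list:
    """System prompt followed by one user message holding both the image and the title."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [image_part, {"type": "text", "text": title}]},
    ]

messages = build_vision_messages(
    "You tag furniture images with keywords.",
    image_part_from_url("https://example.com/chair.jpg"),  # placeholder URL
    "Dining chairs, set of 2",
)
print(messages[1]["content"][1])
```

The resulting list could be passed as the `messages` argument of `client.chat.completions.create`. Whether one combined message or two separate ones gives better tags is something to check on your own data.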
examples = df.iloc[:5]
for index, ex in examples.iterrows():
    url = ex['primary_image']
    img = Image(url=url)
    display(img)
    result = analyze_image(url, ex['title'])
    print(result)
    print("\n\n")
['shoe rack', 'free standing', 'multi-layer', 'metal', 'white']



['dining chairs', 'set of 2', 'leather', 'black']



['plant repotting mat', 'waterproof', 'portable', 'foldable', 'easy to clean', 'green']



['doormat', 'absorbent', 'non-slip', 'brown']



['tv tray table set', 'foldable', 'iron', 'grey']
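
The model returns its keywords as text that looks like a Python list, so the reply has to be parsed before it can be used, and the model may occasionally wrap the list in extra text or decline to answer. A minimal parsing sketch, assuming we want lower-case keywords without duplicates (`parse_keywords` is a hypothetical helper, not part of the notebook):

```python
import ast

def parse_keywords(raw: str) -> list[str]:
    """Parse a model reply such as "['desk', 'metal']" into a clean list of keywords."""
    text = raw.strip()
    # Keep only the bracketed part, in case the reply wraps the list in extra text
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = ast.literal_eval(text[start:end + 1])
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    keywords = []
    for item in parsed:
        if isinstance(item, str):
            keyword = item.strip().lower()
            # Preserve order while dropping empty strings and repeats
            if keyword and keyword not in keywords:
                keywords.append(keyword)
    return keywords

print(parse_keywords("['Shoe Rack', 'metal', 'metal', 'white']"))
print(parse_keywords("Sorry, I cannot tag this image."))
```

Returning an empty list on failure keeps a batch job running; logging the raw reply in that case makes it easier to see why parsing failed.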



# Feel free to change the embedding model here
def get_embedding(value, model="text-embedding-3-large"): 
    embeddings = client.embeddings.create(
      model=model,
      input=value,
      encoding_format="float"
    )
    return embeddings.data[0].embedding
# Existing keywords
keywords_list = ['industrial', 'metal', 'wood', 'vintage', 'bed']
df_keywords = pd.DataFrame(keywords_list, columns=['keyword'])
df_keywords['embedding'] = df_keywords['keyword'].apply(lambda x: get_embedding(x))
df_keywords
keyword embedding
0 industrial [-0.026137426, 0.021297162, -0.007273361, -0.0...
1 metal [-0.020530997, 0.004478126, -0.011049379, -0.0...
2 wood [0.013877833, 0.02955235, 0.0006239023, -0.035...
3 vintage [-0.05235507, 0.008213689, -0.015532949, 0.002...
4 bed [-0.011677503, 0.023275835, 0.0026937425, -0.0...
def compare_keyword(keyword):
    embedded_value = get_embedding(keyword)
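    # cosine_similarity returns a 1x1 array; take the scalar so sorting and comparisons work on plain floats
    df_keywords['similarity'] = df_keywords['embedding'].apply(lambda x: cosine_similarity(np.array(x).reshape(1,-1), np.array(embedded_value).reshape(1, -1))[0][0])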
    most_similar = df_keywords.sort_values('similarity', ascending=False).iloc[0]
    return most_similar

def replace_keyword(keyword, threshold = 0.6):
    most_similar = compare_keyword(keyword)
    if most_similar['similarity'] > threshold:
        print(f"Replacing '{keyword}' with existing keyword: '{most_similar['keyword']}'")
        return most_similar['keyword']
    return keyword
# Example keywords to compare to our list of existing keywords
example_keywords = ['bed frame', 'wooden', 'vintage', 'old school', 'desk', 'table', 'old', 'metal', 'metallic', 'woody']
final_keywords = []

for k in example_keywords:
    final_keywords.append(replace_keyword(k))
    
final_keywords = set(final_keywords)
print(f"Final keywords: {final_keywords}")
Replacing 'bed frame' with existing keyword: 'bed'
Replacing 'wooden' with existing keyword: 'wood'
Replacing 'vintage' with existing keyword: 'vintage'
Replacing 'metal' with existing keyword: 'metal'
Replacing 'metallic' with existing keyword: 'metal'
Replacing 'woody' with existing keyword: 'wood'
Final keywords: {'table', 'desk', 'bed', 'old', 'vintage', 'metal', 'wood', 'old school'}
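
The example above compares new keywords against a fixed list. When tagging a whole catalog, the list of known keywords usually grows as items are processed: a keyword is either mapped onto a close existing one or registered as new. The sketch below shows that logic in plain Python, with toy 3-dimensional vectors standing in for real embeddings. The vectors, the `canonicalize` helper and the `vocabulary` dictionary are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def canonicalize(keyword, vector, vocabulary, threshold=0.6):
    """Return the closest known keyword if similar enough, else register the new one."""
    best_keyword, best_score = None, -1.0
    for known, known_vector in vocabulary.items():
        score = cosine(vector, known_vector)
        if score > best_score:
            best_keyword, best_score = known, score
    if best_keyword is not None and best_score > threshold:
        return best_keyword
    vocabulary[keyword] = vector  # remember it so later keywords can map onto it
    return keyword

# Toy 3-d vectors standing in for real embeddings (made-up values)
toy_embeddings = {
    "wood": [1.0, 0.1, 0.0],
    "wooden": [0.9, 0.2, 0.0],
    "metal": [0.0, 1.0, 0.1],
    "desk": [0.1, 0.0, 1.0],
}
vocabulary = {}
result = [canonicalize(k, v, vocabulary) for k, v in toy_embeddings.items()]
print(result)
```

Note that the outcome depends on processing order: whichever keyword of a similar group arrives first becomes the canonical one.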

Generate captions

In this section, we'll use GPT-4V to generate an image description, then use GPT-4-turbo with a few-shot examples approach to turn that description into a caption.

If few-shot examples are not enough for your use case, consider fine-tuning a model to get the generated captions to match the style & tone you are targeting.

# Cleaning up dataset columns
selected_columns = ['title', 'primary_image', 'style', 'material', 'color', 'url']
df = df[selected_columns].copy()
df.head()
title primary_image style material color url
0 GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... https://m.media-amazon.com/images/I/416WaLx10j... Modern Metal White https://www.amazon.com/dp/B0CJHKVG6P
1 subrtex Leather ding Room, Dining Chairs Set o... https://m.media-amazon.com/images/I/31SejUEWY7... Black Rubber Wood Sponge Black https://www.amazon.com/dp/B0B66QHB23
2 Plant Repotting Mat MUYETOL Waterproof Transpl... https://m.media-amazon.com/images/I/41RgefVq70... Modern Polyethylene Green https://www.amazon.com/dp/B0BXRTWLYK
3 Pickleball Doormat, Welcome Doormat Absorbent ... https://m.media-amazon.com/images/I/61vz1Igler... Modern Rubber A5589 https://www.amazon.com/dp/B0C1MRB2M8
4 JOIN IRON Foldable TV Trays for Eating Set of ... https://m.media-amazon.com/images/I/41p4d4VJnN... X Classic Style Iron Grey Set of 4 https://www.amazon.com/dp/B0CG1N9QRC
describe_system_prompt = '''
    You are a system generating descriptions for furniture items, decorative items, or furnishings on an e-commerce website.
    Provided with an image and a title, you will describe the main item that you see in the image, giving details but staying concise.
    You can describe unambiguously what the item is and its material, color, and style if clearly identifiable.
    If there are multiple items depicted, refer to the title to understand which item you should describe.
    '''

def describe_image(img_url, title):
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        temperature=0.2,
        messages=[
            {
                "role": "system",
                "content": describe_system_prompt
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        # The documented format is an object with a "url" key
                        "image_url": {"url": img_url},
                    },
                ],
            },
            {
                "role": "user",
                "content": title
            }
        ],
        max_tokens=300,
    )

    return response.choices[0].message.content
for index, row in examples.iterrows():
    print(f"{row['title'][:50]}{'...' if len(row['title']) > 50 else ''} - {row['url']} :\n")
    img_description = describe_image(row['primary_image'], row['title'])
    print(f"{img_description}\n--------------------------\n")
GOYMFK 1pc Free Standing Shoe Rack, Multi-layer Me... - https://www.amazon.com/dp/B0CJHKVG6P :

This is a free-standing shoe rack featuring a multi-layer design, constructed from metal for durability. The rack is finished in a clean white color, which gives it a modern and versatile look, suitable for various home decor styles. It includes several horizontal shelves dedicated to organizing shoes, providing ample space for multiple pairs.

Additionally, the rack is equipped with 8 double hooks, which are integrated into the frame above the shoe shelves. These hooks offer extra functionality, allowing for the hanging of accessories such as hats, scarves, or bags. The design is space-efficient and ideal for placement in living rooms, bathrooms, hallways, or entryways, where it can serve as a practical storage solution while contributing to the tidiness and aesthetic of the space.
--------------------------

subrtex Leather ding Room, Dining Chairs Set of 2,... - https://www.amazon.com/dp/B0B66QHB23 :

This image showcases a set of two dining chairs. The chairs are upholstered in black leather, featuring a sleek and modern design. They have a high back with subtle stitching details that create vertical lines, adding an element of elegance to the overall appearance. The legs of the chairs are also black, maintaining a consistent color scheme and enhancing the sophisticated look. These chairs would make a stylish addition to any contemporary dining room setting.
--------------------------

Plant Repotting Mat MUYETOL Waterproof Transplanti... - https://www.amazon.com/dp/B0BXRTWLYK :

This is a square plant repotting mat designed for indoor gardening activities such as transplanting or changing soil for plants. The mat measures 26.8 inches by 26.8 inches, providing ample space for gardening tasks. It is made from a waterproof material, which is likely a durable, easy-to-clean fabric, in a vibrant green color. The edges of the mat are raised with corner fastenings to keep soil and water contained, making the workspace tidy and preventing messes. The mat is also foldable, which allows for convenient storage when not in use. This mat is suitable for a variety of gardening tasks, including working with succulents and other small plants, and it can be a practical gift for garden enthusiasts.
--------------------------

Pickleball Doormat, Welcome Doormat Absorbent Non-... - https://www.amazon.com/dp/B0C1MRB2M8 :

This is a rectangular doormat featuring a playful design that caters to pickleball enthusiasts. The mat's background is a natural coir color, which is a fibrous material made from coconut husks, known for its durability and excellent scraping properties. Emblazoned across the mat in bold, black letters is the phrase "it's a good day to play PICKLEBALL," with the word "PICKLEBALL" being prominently displayed in larger font size. Below the text, there are two crossed pickleball paddles in black, symbolizing the sport.

The doormat measures approximately 16x24 inches, making it a suitable size for a variety of entryways. Its design suggests that it has an absorbent quality, which would be useful for wiping shoes and preventing dirt from entering the home. Additionally, the description implies that the doormat has a non-slip feature, which is likely due to a backing material that helps keep the mat in place on various floor surfaces. This mat would be a welcoming addition to the home of any pickleball player or sports enthusiast, offering both functionality and a touch of personal interest.
--------------------------

JOIN IRON Foldable TV Trays for Eating Set of 4 wi... - https://www.amazon.com/dp/B0CG1N9QRC :

This image features a set of four foldable TV trays with a stand, designed for eating or as snack tables. The tables are presented in a sleek grey finish, which gives them a modern and versatile look, suitable for a variety of home decor styles. Each tray table has a rectangular top with a wood grain texture, supported by a sturdy black iron frame that folds easily for compact storage. The accompanying stand allows for neat organization and easy access when the tables are not in use. These tables are ideal for small spaces where multifunctional furniture is essential, offering a convenient surface for meals, work, or leisure activities.
--------------------------

caption_system_prompt = '''
Your goal is to generate short, descriptive captions for images of furniture items, decorative items, or furnishings based on an image description.
You will be provided with a description of an item image and you will output a caption that captures the most important information about the item.
Your generated caption should be short (1 sentence), and include the most relevant information about the item.
The most important information could be: the type of the item, the style (if mentioned), the material if especially relevant and any distinctive features.
'''

few_shot_examples = [
    {
        "description": "This is a multi-layer metal shoe rack featuring a free-standing design. It has a clean, white finish that gives it a modern and versatile look, suitable for various home decors. The rack includes several horizontal shelves dedicated to organizing shoes, providing ample space for multiple pairs. Above the shoe storage area, there are 8 double hooks arranged in two rows, offering additional functionality for hanging items such as hats, scarves, or bags. The overall structure is sleek and space-saving, making it an ideal choice for placement in living rooms, bathrooms, hallways, or entryways where efficient use of space is essential.",
        "caption": "White metal free-standing shoe rack"
    },
    {
        "description": "The image shows a set of two dining chairs in black. These chairs are upholstered in a leather-like material, giving them a sleek and sophisticated appearance. The design features straight lines with a slight curve at the top of the high backrest, which adds a touch of elegance. The chairs have a simple, vertical stitching detail on the backrest, providing a subtle decorative element. The legs are also black, creating a uniform look that would complement a contemporary dining room setting. The chairs appear to be designed for comfort and style, suitable for both casual and formal dining environments.",
        "caption": "Set of 2 modern black leather dining chairs"
    },
    {
        "description": "This is a square plant repotting mat designed for indoor gardening tasks such as transplanting and changing soil for plants. It measures 26.8 inches by 26.8 inches and is made from a waterproof material, which appears to be a durable, easy-to-clean fabric in a vibrant green color. The edges of the mat are raised with integrated corner loops, likely to keep soil and water contained during gardening activities. The mat is foldable, enhancing its portability, and can be used as a protective surface for various gardening projects, including working with succulents. It's a practical accessory for garden enthusiasts and makes for a thoughtful gift for those who enjoy indoor plant care.",
        "caption": "Waterproof square plant repotting mat"
    }
]

formatted_examples = [[{
    "role": "user",
    "content": ex['description']
},
{
    "role": "assistant", 
    "content": ex['caption']
}]
    for ex in few_shot_examples
]

formatted_examples = [i for ex in formatted_examples for i in ex]
def caption_image(description, model="gpt-4-turbo-preview"):
    # Build a new list on every call: inserting into formatted_examples directly
    # would add another system prompt and description to it each time
    messages = [
        {
            "role": "system",
            "content": caption_system_prompt
        },
        *formatted_examples,
        {
            "role": "user",
            "content": description
        }
    ]
    response = client.chat.completions.create(
        model=model,
        temperature=0.2,
        messages=messages
    )

    return response.choices[0].message.content
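
A few-shot conversation has a fixed shape: one system message, then alternating user (description) and assistant (caption) messages for each example, then the new description as the final user message. The sketch below assembles that list from scratch on each call so that the shared examples are never modified. `build_caption_messages` and the shortened examples are hypothetical and only illustrate the structure.

```python
def build_caption_messages(system_prompt, examples, description):
    """Assemble a fresh few-shot conversation without touching the shared examples."""
    messages = [{"role": "system", "content": system_prompt}]
    for ex in examples:
        messages.append({"role": "user", "content": ex["description"]})
        messages.append({"role": "assistant", "content": ex["caption"]})
    messages.append({"role": "user", "content": description})
    return messages

# Shortened, made-up examples for illustration
examples = [
    {"description": "A white metal shoe rack with several shelves.", "caption": "White metal shoe rack"},
    {"description": "Two black leather dining chairs.", "caption": "Set of 2 black leather dining chairs"},
]
first = build_caption_messages("Write a one-sentence caption.", examples, "A green repotting mat.")
second = build_caption_messages("Write a one-sentence caption.", examples, "A grey folding tray table.")
print([m["role"] for m in first])
```

Because every call starts from the same examples, the second request is the same length as the first and does not carry over the previous description.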
examples = df.iloc[5:8]
for index, row in examples.iterrows():
    print(f"{row['title'][:50]}{'...' if len(row['title']) > 50 else ''} - {row['url']} :\n")
    img_description = describe_image(row['primary_image'], row['title'])
    print(f"{img_description}\n--------------------------\n")
    img_caption = caption_image(img_description)
    print(f"{img_caption}\n--------------------------\n")
LOVMOR 30'' Bathroom Vanity Sink Base Cabine, Stor... - https://www.amazon.com/dp/B0C9WYYFLB :

This is a LOVMOR 30'' Bathroom Vanity Sink Base Cabinet featuring a classic design with a rich brown finish. The cabinet is designed to provide ample storage with three drawers on the left side, offering organized space for bathroom essentials. The drawers are likely to have smooth glides for easy operation. Below the drawers, there is a large cabinet door that opens to reveal additional storage space, suitable for larger items. The paneling on the drawers and door features a raised, framed design, adding a touch of elegance to the overall appearance. This vanity base is versatile and can be used not only in bathrooms but also in kitchens, laundry rooms, and other areas where extra storage is needed. The construction material is not specified, but it appears to be made of wood or a wood-like composite. Please note that the countertop and sink are not included and would need to be purchased separately.
--------------------------

LOVMOR 30'' classic brown bathroom vanity base cabinet with three drawers and additional storage space.
--------------------------

Folews Bathroom Organizer Over The Toilet Storage,... - https://www.amazon.com/dp/B09NZY3R1T :

This is a 4-tier bathroom organizer designed to fit over a standard toilet, providing a space-saving storage solution. The unit is constructed with a sturdy metal frame in a black finish, which offers both durability and a sleek, modern look. The design includes four shelves that offer ample space for bathroom essentials, towels, and decorative items. Two of the shelves are designed with a metal grid pattern, while the other two feature a solid metal surface for stable storage.

Additionally, the organizer includes adjustable baskets, which can be positioned according to your storage needs, allowing for customization and flexibility. The freestanding structure is engineered to maximize the unused vertical space above the toilet, making it an ideal choice for small bathrooms or for those looking to declutter countertops and cabinets.

The overall design is minimalist and functional, with clean lines that can complement a variety of bathroom decors. The open shelving concept ensures that items are easily accessible and visible. Installation is typically straightforward, with no need for wall mounting, making it a convenient option for renters or those who prefer not to drill into walls.
--------------------------

Modern 4-tier black metal bathroom organizer with adjustable shelves and baskets, designed to fit over a standard toilet for space-saving storage.
--------------------------

GOYMFK 1pc Free Standing Shoe Rack, Multi-layer Me... - https://www.amazon.com/dp/B0CJHKVG6P :

This is a multi-functional free-standing shoe rack featuring a sturdy metal construction with a white finish. It is designed with multiple layers, providing ample space to organize and store shoes. The rack includes four tiers dedicated to shoe storage, each tier capable of holding several pairs of shoes.

Above the shoe storage area, there is an additional shelf that can be used for placing bags, small decorative items, or other accessories. At the top, the rack is equipped with 8 double hooks, which are ideal for hanging hats, scarves, coats, or umbrellas, making it a versatile piece for an entryway, living room, bathroom, or hallway.

The overall design is sleek and modern, with clean lines that would complement a variety of home decor styles. The vertical structure of the rack makes it a space-saving solution for keeping footwear and accessories organized in areas with limited floor space.
--------------------------

Multi-layer white metal shoe rack with additional shelf and 8 double hooks for versatile storage in entryways or hallways.
--------------------------

Image search

In this section, we will use the generated keywords and captions to search for items that match a given input, either text or an image.

We will leverage our embeddings model to generate embeddings for the keywords and captions and compare them to either input text or the generated caption from an input image.

# Df we'll use to compare keywords
df_keywords = pd.DataFrame(columns=['keyword', 'embedding'])
df['keywords'] = ''
df['img_description'] = ''
df['caption'] = ''
# Function to replace a keyword with an existing keyword if it's too similar
def get_keyword(keyword, df_keywords, threshold = 0.6):
    embedded_value = get_embedding(keyword)
    if len(df_keywords) > 0:
        similarities = df_keywords['embedding'].apply(lambda x: cosine_similarity(np.array(x).reshape(1,-1), np.array(embedded_value).reshape(1, -1))[0][0])
        most_similar_idx = similarities.idxmax()
        if similarities[most_similar_idx] > threshold:
            most_similar_keyword = df_keywords.loc[most_similar_idx, 'keyword']
            print(f"Replacing '{keyword}' with existing keyword: '{most_similar_keyword}'")
            return most_similar_keyword
    # Add the new keyword in place so the caller's DataFrame is updated;
    # reassigning df_keywords with pd.concat would only rebind the local name
    df_keywords.loc[len(df_keywords)] = [keyword, embedded_value]
    return keyword
import ast

def tag_and_caption(row):
    keywords = analyze_image(row['primary_image'], row['title'])
    try:
        keywords = ast.literal_eval(keywords)
        mapped_keywords = [get_keyword(k, df_keywords) for k in keywords]
    except Exception as e:
        print(f"Error parsing keywords: {keywords}")
        mapped_keywords = []
    img_description = describe_image(row['primary_image'], row['title'])
    caption = caption_image(img_description)
    return {
        'keywords': mapped_keywords,
        'img_description': img_description,
        'caption': caption
    }
df.shape
(312, 9)

Processing all 312 rows of the dataset will take a while. To test out the idea, we will only run it on the first 50 rows, which takes ~20 mins. Feel free to skip this step and load the already processed dataset (see below).

# Running on first 50 lines
for index, row in df[:50].iterrows():
    print(f"{index} - {row['title'][:50]}{'...' if len(row['title']) > 50 else ''}")
    updates = tag_and_caption(row)
    df.loc[index, updates.keys()] = updates.values()
df.head()
title primary_image style material color url keywords img_description caption
0 GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... https://m.media-amazon.com/images/I/416WaLx10j... Modern Metal White https://www.amazon.com/dp/B0CJHKVG6P [shoe rack, free standing, multi-layer, metal,... This is a free-standing shoe rack featuring a ... White metal free-standing shoe rack with multi...
1 subrtex Leather ding Room, Dining Chairs Set o... https://m.media-amazon.com/images/I/31SejUEWY7... Black Rubber Wood Sponge Black https://www.amazon.com/dp/B0B66QHB23 [dining chairs, set of 2, leather, black] This image features a set of two black dining ... Set of 2 sleek black faux leather dining chair...
2 Plant Repotting Mat MUYETOL Waterproof Transpl... https://m.media-amazon.com/images/I/41RgefVq70... Modern Polyethylene Green https://www.amazon.com/dp/B0BXRTWLYK [plant repotting mat, waterproof, portable, fo... This is a square plant repotting mat designed ... Waterproof green square plant repotting mat
3 Pickleball Doormat, Welcome Doormat Absorbent ... https://m.media-amazon.com/images/I/61vz1Igler... Modern Rubber A5589 https://www.amazon.com/dp/B0C1MRB2M8 [doormat, absorbent, non-slip, brown] This is a rectangular doormat featuring a play... Pickleball-themed coir doormat with playful de...
4 JOIN IRON Foldable TV Trays for Eating Set of ... https://m.media-amazon.com/images/I/41p4d4VJnN... X Classic Style Iron Grey Set of 4 https://www.amazon.com/dp/B0CG1N9QRC [tv tray table set, foldable, iron, grey] This image showcases a set of two foldable TV ... Set of two foldable TV trays with grey wood gr...
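
Each row needs several API calls, and most of the run time is spent waiting on the network, so one way to shorten it is to process rows concurrently. The sketch below shows the pattern with a thread pool. `fake_tag_and_caption` is a hypothetical stand-in that sleeps instead of calling the API, and `process_rows` is an illustrative helper rather than part of the notebook.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_tag_and_caption(row):
    """Hypothetical stand-in for the real per-row function, which waits on API calls."""
    time.sleep(0.05)  # simulate network latency
    return {"caption": f"Caption for {row['title']}"}

def process_rows(rows, worker, max_workers=4):
    """Run worker over rows concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, rows))

rows = [{"title": f"item {i}"} for i in range(8)]
start = time.perf_counter()
results = process_rows(rows, fake_tag_and_caption)
elapsed = time.perf_counter() - start
print(results[0], f"{elapsed:.2f}s")
```

Two caveats apply if this is adapted to the real function: API rate limits may cap how many workers are useful, and the keyword deduplication step relies on a shared table and on processing order, so it is probably safer to keep that step sequential.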
# Saving locally for later
data_path = "data/items_tagged_and_captioned.csv"
df.to_csv(data_path, index=False)
# Optional: load data from saved file
df = pd.read_csv(data_path)

Embedding captions and keywords

We can now use the generated captions and keywords to match relevant content to an input text query or caption. To do this, we will embed a combination of keywords + captions. Note: creating the embeddings will take ~3 mins to run. Feel free to load the pre-processed dataset (see below).

df_search = df.copy()
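def embed_tags_caption(x):
    # Skip rows without a caption ('' or NaN after reloading the CSV)
    if isinstance(x['caption'], str) and x['caption'] != '':
        # Keywords reloaded from CSV are the string form of a list, so parse them first
        keywords = ast.literal_eval(x['keywords']) if isinstance(x['keywords'], str) else x['keywords']
        keywords_string = ",".join(keywords) + '\n'
        content = keywords_string + x['caption']
        embedding = get_embedding(content)
        return embedding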
df_search['embedding'] = df_search.apply(lambda x: embed_tags_caption(x), axis=1)
df_search.head()
title primary_image style material color url keywords img_description caption embedding
0 GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... https://m.media-amazon.com/images/I/416WaLx10j... Modern Metal White https://www.amazon.com/dp/B0CJHKVG6P ['shoe rack', 'free standing', 'multi-layer', ... This is a free-standing shoe rack featuring a ... White metal free-standing shoe rack with multi... [-0.06596625, -0.026769113, -0.013789515, -0.0...
1 subrtex Leather ding Room, Dining Chairs Set o... https://m.media-amazon.com/images/I/31SejUEWY7... Black Rubber Wood Sponge Black https://www.amazon.com/dp/B0B66QHB23 ['dining chairs', 'set of 2', 'leather', 'black'] This image features a set of two black dining ... Set of 2 sleek black faux leather dining chair... [-0.0077859573, -0.010376813, -0.01928079, -0....
2 Plant Repotting Mat MUYETOL Waterproof Transpl... https://m.media-amazon.com/images/I/41RgefVq70... Modern Polyethylene Green https://www.amazon.com/dp/B0BXRTWLYK ['plant repotting mat', 'waterproof', 'portabl... This is a square plant repotting mat designed ... Waterproof green square plant repotting mat [-0.023248248, 0.005370147, -0.0048999498, -0....
3 Pickleball Doormat, Welcome Doormat Absorbent ... https://m.media-amazon.com/images/I/61vz1Igler... Modern Rubber A5589 https://www.amazon.com/dp/B0C1MRB2M8 ['doormat', 'absorbent', 'non-slip', 'brown'] This is a rectangular doormat featuring a play... Pickleball-themed coir doormat with playful de... [-0.028953036, -0.026369056, -0.011363288, 0.0...
4 JOIN IRON Foldable TV Trays for Eating Set of ... https://m.media-amazon.com/images/I/41p4d4VJnN... X Classic Style Iron Grey Set of 4 https://www.amazon.com/dp/B0CG1N9QRC ['tv tray table set', 'foldable', 'iron', 'grey'] This image showcases a set of two foldable TV ... Set of two foldable TV trays with grey wood gr... [-0.030723095, -0.0051356032, -0.027088132, 0....
# Keep only the lines where we have embeddings
df_search = df_search.dropna(subset=['embedding'])
print(df_search.shape)
(49, 10)
# Saving locally for later
data_embeddings_path = "data/items_tagged_and_captioned_embeddings.csv"
df_search.to_csv(data_embeddings_path, index=False)
# Optional: load data from saved file
from ast import literal_eval
df_search = pd.read_csv(data_embeddings_path)
df_search["embedding"] = df_search.embedding.apply(literal_eval).apply(np.array)
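
The `literal_eval` step is needed because CSV stores every cell as text: a list written to a CSV file comes back as its string representation, not as a list. The sketch below reproduces this round trip with the standard `csv` module to stay dependency-free, assuming the cells hold plain Python lists. The sample values are made up.

```python
import ast
import csv
import io

# Write a row whose cells hold Python lists; csv.writer stores them via str()
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["keywords", "embedding"])
writer.writerow([["shoe rack", "metal"], [0.12, -0.5, 0.33]])

# Read it back: every cell is now plain text
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["keywords"]).__name__, row["keywords"])

# literal_eval turns the text back into lists
keywords = ast.literal_eval(row["keywords"])
embedding = ast.literal_eval(row["embedding"])
print(keywords, embedding)
```

The same applies to any list-valued column reloaded from the saved files, including the keywords column.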
# Searching for N most similar results
def search_from_input_text(query, n = 2):
    embedded_value = get_embedding(query)
    df_search['similarity'] = df_search['embedding'].apply(lambda x: cosine_similarity(np.array(x).reshape(1,-1), np.array(embedded_value).reshape(1, -1)))
    most_similar = df_search.sort_values('similarity', ascending=False).iloc[:n]
    return most_similar
user_inputs = ['shoe storage', 'black metal side table', 'doormat', 'step bookshelf', 'ottoman']
for i in user_inputs:
    print(f"Input: {i}\n")
    res = search_from_input_text(i)
    for index, row in res.iterrows():
        similarity_score = row['similarity']
        if isinstance(similarity_score, np.ndarray):
            similarity_score = similarity_score[0][0]
        print(f"{row['title'][:50]}{'...' if len(row['title']) > 50 else ''} ({row['url']}) - Similarity: {similarity_score:.2f}")
        img = Image(url=row['primary_image'])
        display(img)
        print("\n\n")
Input: shoe storage

GOYMFK 1pc Free Standing Shoe Rack, Multi-layer Me... (https://www.amazon.com/dp/B0CJHKVG6P) - Similarity: 0.62


GOYMFK 1pc Free Standing Shoe Rack, Multi-layer Me... (https://www.amazon.com/dp/B0CJHKVG6P) - Similarity: 0.57


Input: black metal side table

FLYJOE Narrow Side Table with PU Leather Magazine ... (https://www.amazon.com/dp/B0CHYDTQKN) - Similarity: 0.59


HomePop Metal Accent Table Triangle Base Round Mir... (https://www.amazon.com/dp/B08N5H868H) - Similarity: 0.57


Input: doormat

Pickleball Doormat, Welcome Doormat Absorbent Non-... (https://www.amazon.com/dp/B0C1MRB2M8) - Similarity: 0.59


Caroline's Treasures PPD3013JMAT Enchanted Garden ... (https://www.amazon.com/dp/B08Q5KDSQK) - Similarity: 0.57


Input: step bookshelf

Leick Home 70007-WTGD Mixed Metal and Wood Stepped... (https://www.amazon.com/dp/B098KNRNLQ) - Similarity: 0.61


Wildkin Kids Canvas Sling Bookshelf with Storage f... (https://www.amazon.com/dp/B07GBVFZ1Y) - Similarity: 0.47


Input: ottoman

HomePop Home Decor | K2380-YDQY-2 | Luxury Large F... (https://www.amazon.com/dp/B0B94T1TZ1) - Similarity: 0.53


Moroccan Leather Pouf Ottoman for Living Room - Ro... (https://www.amazon.com/dp/B0CP45784G) - Similarity: 0.51
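
The search above computes cosine similarity one row at a time. For larger catalogs, the same ranking could be computed in a single matrix operation. The following is a minimal sketch under stated assumptions, not the notebook's method: `top_n_by_cosine` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real embeddings.

```python
import numpy as np

def top_n_by_cosine(query_vec, item_matrix, n=2):
    """Return (indices, scores) of the n rows of item_matrix most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    items = np.asarray(item_matrix, dtype=float)
    # Normalize both sides so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    scores = items @ query
    # Sort by descending score and keep the first n
    top = np.argsort(-scores)[:n]
    return top, scores[top]

# Toy vectors standing in for real embeddings (assumption for illustration)
toy_items = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
toy_query = [1.0, 0.1, 0.0]

indices, scores = top_n_by_cosine(toy_query, toy_items, n=2)
print(indices, np.round(scores, 2))
```

With the notebook's data, the item matrix could be built with something like `np.vstack(df_search['embedding'].to_list())`, assuming every row holds an embedding of the same length.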


Search from image

If the input is an image, we can find similar items by first generating a caption for that image, then embedding the caption and comparing it to the embeddings created earlier.

# We'll take a mix of images: some we haven't seen and some that are already in the dataset
example_images = df.iloc[306:]['primary_image'].to_list() + df.iloc[5:10]['primary_image'].to_list()
for i in example_images:
    img_description = describe_image(i, '')
    caption = caption_image(img_description)
    img = Image(url=i)
    print('Input: \n')
    display(img)
    res = search_from_input_text(caption, 1).iloc[0]
    similarity_score = res['similarity']
    if isinstance(similarity_score, np.ndarray):
        similarity_score = similarity_score[0][0]
    print(f"{res['title'][:50]}{'...' if len(res['title']) > 50 else ''} ({res['url']}) - Similarity: {similarity_score:.2f}")
    img_res = Image(url=res['primary_image'])
    display(img_res)
    print("\n\n")
    
Input: 

Mimoglad Office Chair, High Back Ergonomic Desk Ch... (https://www.amazon.com/dp/B0C2YQZS69) - Similarity: 0.63


Input: 

CangLong Mid Century Modern Side Chair with Wood L... (https://www.amazon.com/dp/B08RTLBD1T) - Similarity: 0.51


Input: 

MAEPA RV Shoe Storage for Bedside - 8 Extra Large ... (https://www.amazon.com/dp/B0C4PL1R3F) - Similarity: 0.61


Input: 

Chief Mfg.Swing-Arm Wall Mount Hardware Mount Blac... (https://www.amazon.com/dp/B007E40Z5K) - Similarity: 0.63


Input: 

HomePop Home Decor | K2380-YDQY-2 | Luxury Large F... (https://www.amazon.com/dp/B0B94T1TZ1) - Similarity: 0.63


Input: 

CangLong Mid Century Modern Side Chair with Wood L... (https://www.amazon.com/dp/B08RTLBD1T) - Similarity: 0.58


Input: 

LOVMOR 30'' Bathroom Vanity Sink Base Cabine, Stor... (https://www.amazon.com/dp/B0C9WYYFLB) - Similarity: 0.69


Input: 

Folews Bathroom Organizer Over The Toilet Storage,... (https://www.amazon.com/dp/B09NZY3R1T) - Similarity: 0.82


Input: 

GOYMFK 1pc Free Standing Shoe Rack, Multi-layer Me... (https://www.amazon.com/dp/B0CJHKVG6P) - Similarity: 0.69


Input: 

subrtex Leather ding Room, Dining Chairs Set of 2,... (https://www.amazon.com/dp/B0B66QHB23) - Similarity: 0.87


Input: 

Plant Repotting Mat MUYETOL Waterproof Transplanti... (https://www.amazon.com/dp/B0BXRTWLYK) - Similarity: 0.69


Wrapping up

In this notebook, we explored how to leverage the multimodal capabilities of GPT-4V to tag and caption images. By providing images along with contextual information to the model, we were able to generate tags and descriptions, and then refine those descriptions into captions with a language model such as GPT-4-turbo. This process has practical applications in various scenarios, particularly for improving search.

The search use case illustrated here can be applied directly to applications such as recommendation systems, but the techniques covered in this notebook extend beyond item search to other use cases, for example RAG applications that leverage unstructured image data.

As a next step, you could explore combining rule-based filtering on keywords with embedding search on captions to retrieve more relevant results.
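
One possible shape for that combination is sketched below, as an illustration only: items are first filtered on their keywords, and the survivors are then ranked by cosine similarity. The `hybrid_search` helper, the item fields (`title`, `keywords`, `embedding`), and the 2-dimensional toy vectors are all assumptions made for this example and do not come from the notebook.

```python
import numpy as np

def hybrid_search(query_vec, required_keywords, items, n=2):
    """Filter items by keywords, then rank the remaining items by cosine similarity.

    `items` is a list of dicts with the hypothetical fields
    'title', 'keywords' (list of str) and 'embedding' (list of float).
    """
    required = {k.lower() for k in required_keywords}
    # Rule-based step: keep items whose keywords include every required keyword
    candidates = [it for it in items if required <= {k.lower() for k in it["keywords"]}]
    if not candidates:
        # Fall back to pure embedding search if the filter excludes everything
        candidates = list(items)
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scored = []
    for it in candidates:
        v = np.asarray(it["embedding"], dtype=float)
        scored.append((float(v @ q / np.linalg.norm(v)), it["title"]))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]

# Toy catalog with made-up vectors (assumption for illustration)
toy_items = [
    {"title": "Brown doormat", "keywords": ["doormat", "brown"], "embedding": [1.0, 0.0]},
    {"title": "Green repotting mat", "keywords": ["mat", "green"], "embedding": [0.9, 0.1]},
    {"title": "Grey doormat", "keywords": ["doormat", "grey"], "embedding": [0.0, 1.0]},
]

results = hybrid_search([1.0, 0.2], ["doormat"], toy_items, n=2)
for score, title in results:
    print(f"{title}: {score:.2f}")
```

In this toy catalog, the repotting mat has the embedding closest to the query, but the keyword filter removes it because it is not tagged as a doormat. Whether a strict filter like this helps or hurts would depend on how reliable the generated keywords are, so a softer variant (for example, boosting matches instead of excluding non-matches) may be worth testing as well.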