Introduction to GPT-4o and GPT-4o mini

Jul 18, 2024
Open in Github

GPT-4o ("o" for "omni") and GPT-4o mini are natively multimodal models designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats. GPT-4o mini is the lightweight version of GPT-4o.

Background

Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model that's trained across text, vision, and audio. This unified approach ensures that all inputs — whether text, visual, or auditory — are processed cohesively by the same neural network.

GPT-4o mini is the next iteration of this omni model family, available in a smaller and cheaper version. This model offers higher accuracy than GPT-3.5 Turbo while being just as fast and supporting multimodal inputs and outputs.

Current API Capabilities

Currently, the API supports {text, image} inputs only, with {text} outputs, the same modalities as gpt-4-turbo.

Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o mini for text, image, and video understanding.

%pip install --upgrade openai

Configure the OpenAI client and submit a test request

To setup the client for our use, we need to create an API key to use with our request. Skip these steps if you already have an API key for usage.

You can get an API key by following these steps:

  1. Create a new project
  2. Generate an API key in your project
  3. (RECOMMENDED, BUT NOT REQUIRED) Setup your API key for all projects as an env var

Once we have this setup, let's start with a simple {text} input to the model for our first request. We'll use both system and user messages for our first request, and we'll receive a response from the assistant role.

from openai import OpenAI 
import os

## Set the API key and model name
MODEL="gpt-4o-mini"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))
completion = client.chat.completions.create(
  model=MODEL,
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"}, # <-- This is the system message that provides context to the model
    {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
  ]
)

print("Assistant: " + completion.choices[0].message.content)
Assistant: Of course! \( 2 + 2 = 4 \).

Image Processing

GPT-4o mini can directly process images and take intelligent actions based on the image. We can provide images in two formats:

  1. Base64 Encoded
  2. URL

Let's first view the image we'll use, then try sending this image as both Base64 and as a URL link to the API

from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "data/triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))
image generated by notebook
# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
To find the area of the triangle, you can use the formula:

\[
\text{Area} = \frac{1}{2} \times \text{base} \times \text{height}
\]

In the triangle you provided:

- The base is \(9\) (the length at the bottom).
- The height is \(5\) (the vertical line from the top vertex to the base).

Now, plug in the values:

\[
\text{Area} = \frac{1}{2} \times 9 \times 5
\]

Calculating this:

\[
\text{Area} = \frac{1}{2} \times 45 = 22.5
\]

Thus, the area of the triangle is **22.5 square units**.
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
To find the area of the triangle, you can use the formula:

\[
\text{Area} = \frac{1}{2} \times \text{base} \times \text{height}
\]

In the triangle you provided:

- The base is \(9\) (the length at the bottom).
- The height is \(5\) (the vertical line from the top vertex to the base).

Now, plug in the values:

\[
\text{Area} = \frac{1}{2} \times 9 \times 5
\]

Calculating this gives:

\[
\text{Area} = \frac{1}{2} \times 45 = 22.5
\]

Thus, the area of the triangle is **22.5 square units**.

Video Processing

While it's not possible to directly send a video to the API, GPT-4o can understand videos if you sample frames and then provide them as images.

Since GPT-4o mini in the API does not yet support audio-in (as of July 2024), we'll use a combination of GPT-4o mini and Whisper to process both the audio and visual for a provided video, and showcase two usecases:

  1. Summarization
  2. Question and Answering

Setup for Video Processing

We'll use two python packages for video processing - opencv-python and moviepy.

These require ffmpeg, so make sure to install this beforehand. Depending on your OS, you may need to run brew install ffmpeg or sudo apt install ffmpeg

%pip install opencv-python
%pip install moviepy
import cv2
from moviepy.editor import VideoFileClip
import time
import base64

# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"
def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame=0

    # Loop through the video and extract frames at specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path

# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
MoviePy - Writing audio in data/keynote_recap.mp3
                                                                      
MoviePy - Done.
Extracted 218 frames
Extracted audio to data/keynote_recap.mp3
## Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)

Audio(audio_path)
image generated by notebook