Welcome! This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using OpenAI’s Vision and Responses APIs. It focuses on multimodal data, combining image and text inputs to analyze customer experiences. The system leverages GPT-4.1 and integrates image understanding with file search to provide context-aware responses.
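As a rough sketch of the core pattern used throughout this notebook, a single Responses API call can take both text and an image as input. The model name matches the one named above, but the prompt and image URL below are placeholders, not values from this guide:

```python
# Minimal sketch: sending a text prompt plus an image to GPT-4.1 via the Responses API.
# The prompt and image URL are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe the issue shown in this customer photo."},
                {"type": "input_image", "image_url": "https://example.com/customer_photo.jpg"},
            ],
        }
    ],
)
print(response.output_text)
```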
Multimodal datasets are increasingly common, particularly in domains like healthcare, where records often contain both visual data (e.g. radiology scans) and accompanying text (e.g. clinical notes). Real-world datasets also tend to be noisy, with incomplete or missing information, making it critical to analyze multiple modalities in tandem.
This guide focuses on a customer service use case: evaluating customer feedback that may include photos and written reviews. You’ll learn how to synthetically generate both image and text inputs, use file search for context retrieval, and apply the Evals API to assess how incorporating image understanding impacts overall performance.
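For the retrieval step, context is pulled from a vector store via the built-in `file_search` tool. The sketch below is a hypothetical query, assuming a vector store has already been created and populated (the ID is a placeholder); the later "Populating Vector Store" and "Retrieval and Filtering" sections walk through the real setup:

```python
# Hypothetical sketch of context retrieval with the file_search tool.
# "vs_REPLACE_ME" is a placeholder vector store ID created in a later section.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    input="Summarize complaints about delayed deliveries.",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_REPLACE_ME"],
    }],
)
print(response.output_text)
```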
Overview
Table of Contents
- Setup & Dependencies
- Example Generations
- Data Processing
  - Load synthetic datasets
  - Merge data
- Populating Vector Store
  - Upload data for file search
  - Set up attribute filters
- Retrieval and Filtering
  - Test retrieval performance
  - Apply attribute-based filters
- Evaluation and Analysis
  - Compare predictions to ground truth
  - Analyze performance metrics