This cookbook demonstrates how to use OpenAI's Evals framework for image-based tasks. Using the Evals API, we first sample model responses to an image and prompt, then use model grading (LLM as a judge) to score each response against the image, the prompt, and a reference answer.
In this example, we will evaluate how well our model can:
- Generate appropriate responses to user prompts about images
- Align with reference answers that represent high-quality responses
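
The sample-then-grade flow described above can be sketched in plain Python. This is only an illustration of the data flow, not the Evals API itself: `EvalItem`, `run_image_eval`, the `sample` and `judge` callables, and the `pass_threshold` default are all hypothetical names chosen for this sketch. In a real run, `sample` would call the model under test and `judge` would call a grader model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalItem:
    """One evaluation example. Field names are assumptions for this sketch."""
    image_url: str   # the image the model is asked about
    prompt: str      # the user prompt about the image
    reference: str   # a high-quality reference answer


def run_image_eval(
    items: List[EvalItem],
    sample: Callable[[str, str], str],
    judge: Callable[[EvalItem, str], float],
    pass_threshold: float = 0.5,
) -> List[Dict[str, object]]:
    """Sample a response for each item, then score it with a judge.

    `sample(image_url, prompt)` returns the model's response text.
    `judge(item, response)` returns a numeric score for that response.
    """
    results = []
    for item in items:
        # Step 1: sampling, i.e. generate a response to the image and prompt.
        response = sample(item.image_url, item.prompt)
        # Step 2: model grading, i.e. score the response against the item.
        score = judge(item, response)
        results.append(
            {
                "response": response,
                "score": score,
                "passed": score >= pass_threshold,
            }
        )
    return results
```

Passing the sampler and judge in as callables keeps the sketch testable with simple stand-ins before any model calls are wired in.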