Evals are task-oriented and iterative: they're the best way to check how your LLM integration is doing and to improve it.
In the following eval, we focus on the task of detecting regressions introduced by our prompt changes.
Our use case is:

- We have been logging chat completion requests by setting `store=True` in our production chat completions requests (see the sketch after this list). Note that you can also enable "on by default" logging in your admin panel (https://platform.openai.com/settings/organization/data-controls/data-retention).
- We want to see whether our prompt changes have introduced regressions.
Evals structure
Evals have two parts: the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many Runs, each of which is evaluated against your testing criteria.
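As a rough sketch of how those two parts map onto the API, assuming the OpenAI Python SDK's Evals endpoints (`client.evals.create` and `client.evals.runs.create`): the grader prompt, labels, template placeholders, and metadata filter below are illustrative assumptions, not a definitive configuration.

```python
from openai import OpenAI

client = OpenAI()

# The "Eval": testing criteria plus the shape of the data each run will use.
eval_cfg = client.evals.create(
    name="prompt-regression-check",
    data_source_config={"type": "stored_completions"},
    testing_criteria=[
        {
            "type": "label_model",
            "name": "Helpfulness grader",
            "model": "gpt-4o",
            "input": [
                {
                    "role": "developer",
                    "content": "Label the assistant reply as helpful or unhelpful.",
                },
                {"role": "user", "content": "{{item.input}}"},
            ],
            "labels": ["helpful", "unhelpful"],
            "passing_labels": ["helpful"],
        }
    ],
)

# A "Run": one batch of data evaluated against the Eval's testing criteria.
# Here the data source is completions stored with store=True, filtered by metadata.
run = client.evals.runs.create(
    eval_cfg.id,
    name="prompt-v2",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {"prompt_version": "v2"},
        },
    },
)
print(run.id, run.status)
```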