Evals are task oriented and iterative, they're the best way to check how your LLM integration is doing and improve it.
In the following eval, we are going to focus on the task of detecting if my prompt change is a regression.
Our use-case is:
- I have an llm integration that takes a list of push notifications and summarizes them into a single condensed statement.
- I want to detect if a prompt change regresses the behavior
Evals structure
Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many runs that are evaluated by your testing criteria.