Evals are task-oriented and iterative; they're the best way to check how your LLM integration is doing and to improve it.
In the following eval, we are going to focus on testing many variants of models and prompts.
Our use-case is:
- I want to get the best possible performance out of my push notifications summarizer
Evals structure
Evals have two parts: the "Eval" and the "Run". An Eval holds the configuration for your testing criteria and the structure of the data used by your Runs. An Eval has many Runs, each of which is scored against your testing criteria.
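To make that split concrete, here is a minimal sketch of creating an Eval with the OpenAI Python SDK. The item schema, grader prompt, and names below are illustrative placeholders, not the exact configuration we build later in this guide.

```python
# A minimal sketch, assuming the OpenAI Python SDK's Evals endpoints.
# The schema fields and grader prompt are hypothetical examples.
from openai import OpenAI

client = OpenAI()

# The "Eval": testing criteria plus the shape of the data each Run will use.
push_eval = client.evals.create(
    name="push-notification-summarizer",
    data_source_config={
        "type": "custom",
        # Each test item carries the raw notifications (and, optionally, an ideal summary).
        "item_schema": {
            "type": "object",
            "properties": {
                "notifications": {"type": "string"},
                "ideal_summary": {"type": "string"},
            },
            "required": ["notifications"],
        },
        # Also expose the model's sampled output to the graders.
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "label_model",
            "name": "summary_quality",
            "model": "gpt-4o-mini",
            "input": [
                {
                    "role": "developer",
                    "content": "Label the summary 'good' if it is short, accurate, "
                               "and covers every notification; otherwise label it 'bad'.",
                },
                {
                    "role": "user",
                    "content": "Notifications: {{item.notifications}}\n"
                               "Summary: {{sample.output_text}}",
                },
            ],
            "labels": ["good", "bad"],
            "passing_labels": ["good"],
        }
    ],
)

print(push_eval.id)  # Each Run we kick off later attaches to this Eval.
```

Runs are then created against this Eval (for example with `client.evals.runs.create`), one per model or prompt variant, and every Run is graded by the same criteria so the results are directly comparable.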