How it works
New to AI evals?
Let's write your first.
Your family's dessert shop has a new robot that's botching every order. Fix him by writing your first eval, in three steps.
01
Annotate outputs
Place orders as a customer and note what goes wrong. This is error analysis: spotting concrete failures before you decide what to fix.
level 1 · interaction 3
YOU
"One honey tart, please."
BOLT
"Of course, but have you considered our LOBSTER?"
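One way to keep notes like this is as plain records, one per interaction. The sketch below is only an illustration: the `Annotation` class and its field names are hypothetical, not part of the game.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One observed interaction (hypothetical structure, for illustration)."""
    order: str      # what the customer asked for
    response: str   # what the robot replied
    note: str       # free-text observation; empty string if nothing went wrong


annotations = [
    Annotation(
        order="One honey tart, please.",
        response="Of course, but have you considered our LOBSTER?",
        note="pushes an unrelated item instead of taking the order",
    ),
]

# Keep only the interactions where something went wrong.
failures = [a for a in annotations if a.note]
```

Free-text notes come first on purpose: categories are easier to name once you have seen a handful of real failures.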
02
Prioritize errors
Stack your annotations into buckets and find the one that hurts most.
Annotation buckets · sorted
wrong tone: 7 ★
overconfident: 4
wrong item: 2
off-topic: 1
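Bucketing is just counting. As a rough sketch, assuming each annotated failure has already been given one category label, the tally could look like this (the `labels` list is made up to match the counts in the chart above):

```python
from collections import Counter

# Hypothetical labels, one per annotated failure; counts mirror the chart above.
labels = (
    ["wrong tone"] * 7
    + ["overconfident"] * 4
    + ["wrong item"] * 2
    + ["off-topic"] * 1
)

buckets = Counter(labels)

# most_common() returns (label, count) pairs, largest count first.
ranked = buckets.most_common()
worst_bucket, worst_count = ranked[0]
```

Here the largest bucket is the one to fix first. In practice you might also weigh how badly each kind of failure hurts the customer, not only how often it happens.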
03
Write the eval
For the worst category, write PASS / FAIL criteria that detect it, then run that eval before each release to catch the bug if it comes back.
EVAL #001 · TONE
Pass if: Bolt stays on the order and doesn't upsell.
Run it on a response
PASS "Of course! One honey tart, coming up."
FAIL "...but have you considered our LOBSTER?"
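In code, an eval like this could be as small as one function that returns PASS or FAIL for a response. The sketch below is a deliberately simple stand-in: the phrase list and the names `UPSELL_PATTERNS`, `eval_tone`, and `run_eval` are assumptions made for illustration, and a real tone check would probably need something richer, such as a written rubric applied by a person or a model-based grader.

```python
import re

# Assumed upsell cues (illustrative only, not an exhaustive list).
UPSELL_PATTERNS = [
    r"have you considered",
    r"would you like to add",
    r"may i suggest",
]


def eval_tone(response: str) -> str:
    """Return "PASS" if the response contains no upsell cue, else "FAIL"."""
    text = response.lower()
    for pattern in UPSELL_PATTERNS:
        if re.search(pattern, text):
            return "FAIL"
    return "PASS"


def run_eval(responses: list[str]) -> dict[str, int]:
    """Tally PASS / FAIL over a batch of responses, e.g. before a release."""
    results = {"PASS": 0, "FAIL": 0}
    for response in responses:
        results[eval_tone(response)] += 1
    return results
```

Running `run_eval` on a fixed set of orders before each release turns the criterion into a regression check: if the FAIL count rises, the upsell behavior has come back.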
Short game. Hopefully a clearer picture of what evals are and why they matter.