Fix the medieval robot.

A 10-minute intro to AI evals

How it works

New to AI evals?
Let's write your first.

Your family's dessert shop has a new robot that's botching every order. Fix him by writing your first eval, in three steps.

You place orders as a customer and note what goes wrong, which is error analysis: spotting concrete failures before you decide what to fix.

level 1 · interaction 3

YOU

"One honey tart, please."

"Of course, but have you considered our LOBSTER?"

Stack your annotations into buckets and find the one that hurts most.

Annotation buckets · sorted

★

wrong tone

overconfident

wrong item

off-topic
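
If you kept your annotations in code instead of on sticky notes, the sorting step might look like the minimal sketch below. The annotation list is made-up illustration data, and `sort_buckets` is a hypothetical helper, not something from the game.

```python
from collections import Counter

# Hypothetical annotations: one label per botched order (made-up data).
annotations = [
    "wrong tone", "wrong item", "wrong tone", "off-topic",
    "overconfident", "wrong tone", "overconfident",
]

def sort_buckets(labels):
    """Count each label and return (label, count) pairs, biggest bucket first."""
    return Counter(labels).most_common()

buckets = sort_buckets(annotations)
worst_bucket = buckets[0][0]  # the bucket that hurts most

for label, count in buckets:
    print(f"{count} x {label}")
```

With this sample data the biggest bucket is "wrong tone", so that is the one you would write an eval for first.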

For the worst category, you write PASS / FAIL criteria that detect it, then run that eval before each release to catch the bug if it comes back.

EVAL #001 · TONE

Pass if Bolt stays on the order and doesn't upsell.

Run it on a response

PASS "Of course! One honey tart, coming up."

FAIL "...but have you considered our LOBSTER?"
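
As a rough sketch, the same PASS / FAIL check could be written as a small function. It assumes a plain keyword match is enough to spot an upsell; `UPSELL_MARKERS`, `eval_tone`, and `run_eval` are hypothetical names chosen for illustration, and a real eval would likely need a richer check than this.

```python
# Hypothetical upsell markers (an assumption, not the game's actual rule).
UPSELL_MARKERS = ("have you considered", "lobster")

def eval_tone(response: str) -> str:
    """Return "PASS" if the response stays on the order, "FAIL" if it upsells."""
    text = response.lower()
    return "FAIL" if any(marker in text for marker in UPSELL_MARKERS) else "PASS"

def run_eval(responses):
    """Run the check over a batch of responses, e.g. before each release."""
    return [(eval_tone(r), r) for r in responses]

results = run_eval([
    "Of course! One honey tart, coming up.",
    "...but have you considered our LOBSTER?",
])
for verdict, response in results:
    print(verdict, response)
```

Running the same batch before every release is what turns a one-off check into an eval: if Bolt starts pushing lobster again, the FAIL shows up before customers see it.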

It's a short game. You should leave with a clearer picture of what evals are and why they matter.

From Granny

One more thing before you go.

Granny's Notes

Dear friend,

There's a robot in our shop and he's making a right mess of every order.

The dev folk say we need "evals" before they'll fix him.

Could you spare ten minutes to help us write our first one?

Granny ♡

(and the robot. his name is Bolt.)


yes I'll help fix the robot →