BLOG · AI & AGENTS

Evals before vibes: measuring agent quality

You cannot improve what you do not measure, and "it feels smarter" is not a metric. How we build eval suites that catch regressions before users do.

Analytics dashboard with charts on a screen

Teams iterate on prompts the way gamblers iterate on lucky socks: change something, squint at three outputs, declare victory. Then a model update lands and last month's "fix" quietly breaks.

If you cannot replay yesterday's failures, you are not testing — you are gambling.

A minimal eval suite

Freeze a set of real (anonymised) inputs — fifty is enough to start.
Write expected behaviours as assertions, not vibes: "must cite the source", "must refuse out-of-scope requests".
Score every prompt or model change against the set, and track the curve over time.

The point is not academic rigour. The point is that regressions become visible the day they happen, while the change that caused them is still on the table.

Start smaller than feels respectable

Fifty cases and a pass/fail column in a spreadsheet beat a sophisticated benchmark you never run.

Where LLM-as-judge fits

Judge models are useful for fuzzy criteria — tone, completeness — but anchor them with exact-match and rule-based checks. A judge with no ground truth drifts together with the system it judges.