Grepfox

Evals before vibes: measuring agent quality

You cannot improve what you do not measure, and "it feels smarter" is not a metric. How we build eval suites that catch regressions before users do.

Teams iterate on prompts the way gamblers iterate on lucky socks: change something, squint at three outputs, declare victory. Then a model update lands and last month's "fix" quietly breaks.

If you cannot replay yesterday's failures, you are not testing — you are gambling.

A minimal eval suite

  1. Freeze a set of real (anonymised) inputs — fifty is enough to start.
  2. Write expected behaviours as assertions, not vibes: "must cite the source", "must refuse out-of-scope requests".
  3. Score every prompt or model change against the set, and track the pass rate over time.
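
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not our production harness: the Case structure, the "[source: ...]" citation convention, the "out of scope" refusal marker and the fake_agent stand-in are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    case_id: str
    prompt: str                           # a frozen, anonymised input
    checks: list[Callable[[str], bool]]   # every check must pass


def cites_source(output: str) -> bool:
    # Assumed convention: citations look like "[source: ...]".
    return "[source:" in output.lower()


def refuses(output: str) -> bool:
    # Assumed refusal marker; adapt to your agent's actual wording.
    return "out of scope" in output.lower()


def run_suite(agent: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run every frozen case through the agent and record pass/fail per case."""
    results = {}
    for case in cases:
        output = agent(case.prompt)
        results[case.case_id] = all(check(output) for check in case.checks)
    return results


def pass_rate(results: dict[str, bool]) -> float:
    """Share of passing cases; this is the number to track over time."""
    return sum(results.values()) / len(results) if results else 0.0


def fake_agent(prompt: str) -> str:
    # Stand-in for the real agent call, so the sketch runs offline.
    if "weather" in prompt:
        return "Sorry, that is out of scope for this assistant."
    return "Refunds take 5 days [source: refund-policy.md]."


CASES = [
    Case("refund-01", "How long do refunds take?", [cites_source]),
    Case("scope-01", "What's the weather tomorrow?", [refuses]),
]

results = run_suite(fake_agent, CASES)
print(results, pass_rate(results))
```

Swap fake_agent for the real call, rerun after every prompt or model change, and store the per-case results next to the change that produced them.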

The point is not academic rigour. The point is that regressions become visible the day they happen, while the change that caused them is still on the table.
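
One way to make a regression visible on the day it happens is to diff each run against the last known-good one. The sketch below assumes results are stored as a mapping from case id to pass/fail; the find_regressions name and the sample ids are hypothetical.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case ids that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


# Illustrative data only: yesterday's run versus today's.
baseline = {"refund-01": True, "scope-01": True, "tone-01": False}
current = {"refund-01": True, "scope-01": False, "tone-01": True}

regressions = find_regressions(baseline, current)
print(regressions)
```

A case that disappears from the current run is treated as a regression here, on the assumption that silently dropping a test should be as loud as failing it.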

Start smaller than feels respectable

Fifty cases and a pass/fail column in a spreadsheet beat a sophisticated benchmark you never run.

Where LLM-as-judge fits

Judge models are useful for fuzzy criteria — tone, completeness — but anchor them with exact-match and rule-based checks. A judge with no ground truth drifts together with the system it judges.
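
A rough sketch of what anchoring can look like: rule-based checks gate the result, the judge only scores what survives them, and the judge itself is checked against a small human-labelled set. The Judge interface, the 0.7 threshold and the toy length_judge are assumptions for illustration; a real judge would be a model call.

```python
from typing import Callable

Judge = Callable[[str], float]  # assumed interface: returns a score in [0, 1]


def grade(output: str, rule_checks: list[Callable[[str], bool]],
          judge: Judge, threshold: float = 0.7) -> bool:
    # Rule-based checks are the anchor: if any fails, the judge is never asked.
    if not all(check(output) for check in rule_checks):
        return False
    return judge(output) >= threshold


def judge_agreement(judge: Judge, labelled: list[tuple[str, bool]],
                    threshold: float = 0.7) -> float:
    """Fraction of human-labelled examples where the judge matches the label."""
    if not labelled:
        return 0.0
    hits = sum((judge(text) >= threshold) == label for text, label in labelled)
    return hits / len(labelled)


def has_citation(output: str) -> bool:
    # Assumed convention: citations look like "[source: ...]".
    return "[source:" in output.lower()


def length_judge(output: str) -> float:
    # Toy stand-in for a judge model: longer answers score as "more complete".
    return min(len(output.split()) / 10, 1.0)


good = "Refunds take five business days once the return is received [source: policy.md]."
uncited = "A long and friendly answer that goes on for many words but cites nothing at all."

passed_good = grade(good, [has_citation], length_judge)
passed_uncited = grade(uncited, [has_citation], length_judge)

# Human labels are the ground truth the judge is measured against.
labelled = [
    ("short", False),
    ("one two three four five six seven eight nine ten", True),
    ("long but wrong answer repeated many times over and over again here", False),
]
agreement = judge_agreement(length_judge, labelled)
print(passed_good, passed_uncited, agreement)
```

If the agreement score drops after a judge-prompt or model change, the judge has moved, and its scores from before and after the change should not be compared directly.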
