Teams iterate on prompts the way gamblers iterate on lucky socks: change something, squint at three outputs, declare victory. Then a model update lands and last month's "fix" quietly breaks.
If you cannot replay yesterday's failures, you are not testing — you are gambling.
A minimal eval suite
- Freeze a set of real (anonymised) inputs — fifty is enough to start.
- Write expected behaviours as assertions, not vibes: "must cite the source", "must refuse out-of-scope requests".
- Score every prompt or model change against the set, and track the pass rate over time.
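
As a rough illustration, the three steps above might look like the sketch below. Everything named here is an assumption made for the example: the `Case` structure, the two check functions, the `[source: ...]` citation convention, the refusal phrase, and `fake_model`, which stands in for a real model call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    case_id: str
    prompt: str                    # frozen, anonymised input
    check: Callable[[str], bool]   # expected behaviour as an assertion


def cites_source(output: str) -> bool:
    # Hypothetical convention: citations look like "[source: ...]".
    return "[source:" in output.lower()


def refuses(output: str) -> bool:
    # Hypothetical convention: refusals contain this fixed phrase.
    return "can't help with that" in output.lower()


CASES = [
    Case("refund-policy", "What is the refund window?", cites_source),
    Case("out-of-scope", "Write me a poem about tax law.", refuses),
]


def run_suite(generate: Callable[[str], str], cases=CASES) -> dict[str, bool]:
    """Run every frozen case through `generate` and record pass/fail."""
    return {c.case_id: c.check(generate(c.prompt)) for c in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call; it answers the out-of-scope request
    # instead of refusing, so that case fails.
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days [source: policy.md]."
    return "Here is a poem about tax law..."


results = run_suite(fake_model)
print(results, pass_rate(results))
```

Swapping `fake_model` for the real call and appending each run's pass rate to a log gives the number to track across changes.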
The point is not academic rigour. The point is that regressions become visible the day they happen, while the change that caused them is still on the table.
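
One way to make a regression visible the same day is to diff today's results against the previous run. The sketch below assumes each run is stored as a mapping from case id to pass/fail. That storage format, the `regressions` helper, and the sample data are illustrative, not prescribed.

```python
def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return ids of cases that passed in the previous run and fail now."""
    return sorted(
        case_id
        for case_id, passed in current.items()
        if not passed and previous.get(case_id, False)
    )


# Made-up results from two consecutive runs.
yesterday = {"refund-policy": True, "out-of-scope": True, "tone": False}
today = {"refund-policy": True, "out-of-scope": False, "tone": False}

print(regressions(yesterday, today))  # ['out-of-scope']
```

Cases that were already failing ("tone" here) are deliberately left out, so the list contains only what the latest change broke.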
Start smaller than feels respectable
Fifty cases and a pass/fail column in a spreadsheet beat a sophisticated benchmark you never run.
Where LLM-as-judge fits
Judge models are useful for fuzzy criteria such as tone and completeness, but anchor them with exact-match and rule-based checks. A judge with no ground truth can drift along with the system it judges, and nothing will flag it.
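
A minimal sketch of that anchoring: the rule-based checks act as a hard gate, and the judge's score only counts once they pass. The `anchored_verdict` function, the 0.7 threshold, the citation convention, and the stub judge are all assumptions for illustration. A real judge would be a model call returning a score.

```python
from typing import Callable


def anchored_verdict(
    output: str,
    rule_checks: list[Callable[[str], bool]],
    judge: Callable[[str], float],
    threshold: float = 0.7,  # arbitrary cut-off chosen for the example
) -> bool:
    """Pass only if every rule check passes AND the judge score clears the threshold."""
    if not all(check(output) for check in rule_checks):
        return False  # hard failure: the judge is never consulted
    return judge(output) >= threshold


def has_citation(output: str) -> bool:
    # Hypothetical convention: citations look like "[source: ...]".
    return "[source:" in output


def lenient_judge(output: str) -> float:
    # Stub for a judge model that likes everything.
    return 0.9


print(anchored_verdict("A warm, complete answer.", [has_citation], lenient_judge))
print(anchored_verdict("A warm answer [source: faq.md].", [has_citation], lenient_judge))
```

Even with a judge that approves everything, the first output fails because it lacks a citation. The fuzzy score cannot override a deterministic check.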



