Every agent demo looks the same: a clean prompt, a happy path, applause. Production looks different. The same agent meets ambiguous requests, malformed data, third-party outages and users who type in three languages at once.
Production is the only benchmark that matters.
The three things that break first
- Tool calls against real APIs — timeouts and partial failures the demo never saw.
- Context discipline — conversations that outgrow the window and silently lose the one fact that mattered.
- Confidence calibration — agents that act when they should escalate to a human.
None of these are model problems. They are systems problems, and they have systems answers: retries with budgets, explicit context contracts, and escalation paths designed before launch — not after the first incident.
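As a rough illustration of "retries with budgets", here is a minimal sketch in Python. The function name `call_with_budget`, its parameters, and the choice of which exceptions count as retryable are all assumptions made for this example, not a description of any particular framework. The idea is that a tool call gets a fixed number of attempts and a fixed amount of wall-clock time, and gives up loudly when either runs out.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the attempt budget or the time budget is used up."""


def call_with_budget(
    tool: Callable[[], T],
    max_attempts: int = 3,        # assumed default, tune per tool
    deadline_s: float = 10.0,     # total wall-clock budget for all attempts
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    last_error: Optional[Exception] = None
    attempts_made = 0
    for attempt in range(max_attempts):
        attempts_made = attempt + 1
        try:
            return tool()
        except (TimeoutError, ConnectionError) as exc:
            # Only transient failures are retried; anything else propagates.
            last_error = exc
        # Exponential backoff with full jitter.
        delay = random.uniform(0, base_delay_s * 2 ** attempt)
        out_of_attempts = attempt == max_attempts - 1
        out_of_time = (clock() - start) + delay > deadline_s
        if out_of_attempts or out_of_time:
            break
        sleep(delay)
    raise RetryBudgetExceeded(
        f"gave up after {attempts_made} attempt(s)"
    ) from last_error
```

Injecting `sleep` and `clock` keeps the function testable without real waiting. When the budget runs out the caller gets one explicit exception, which is a natural place to hand off to a fallback or a human.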
What we do differently now
We ship a pilot in 2–4 weeks, but the pilot includes the boring parts: monitoring, evals and guardrails from day one. An agent without observability is a liability with a chat interface.
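One way to get basic observability from day one is to wrap every tool in a decorator that records outcome and latency. The sketch below is only an illustration: `instrumented`, `TOOL_LOG` and the `lookup_order` tool are hypothetical names, and the in-memory list stands in for whatever metrics or tracing backend a real deployment would use.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Stand-in for a real metrics or tracing sink (assumption for this example).
TOOL_LOG: List[Dict[str, Any]] = []


def instrumented(tool_name: str) -> Callable:
    """Record success, error type and latency for every call to a tool."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            record: Dict[str, Any] = {"tool": tool_name, "ok": False, "error": None}
            try:
                result = fn(*args, **kwargs)
                record["ok"] = True
                return result
            except Exception as exc:
                record["error"] = type(exc).__name__
                raise  # never swallow the failure, only observe it
            finally:
                record["latency_s"] = time.perf_counter() - start
                TOOL_LOG.append(record)

        return wrapper

    return decorator


@instrumented("lookup_order")
def lookup_order(order_id: str) -> dict:
    # Hypothetical tool used only to demonstrate the decorator.
    if not order_id:
        raise ValueError("order_id is required")
    return {"order_id": order_id, "status": "shipped"}
```

Because the record is written in `finally`, failed calls are logged as reliably as successful ones, and failed calls are the ones a demo never exercises.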
The pre-launch checklist
- Instrument every tool call before adding new capabilities.
- Replay the worst week of recorded traffic against every prompt change before it ships.
- Define the escalation path to a human before the first user sees the agent.
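The last checklist item can be made concrete as an explicit routing rule that exists in code before launch. The sketch below is an assumption-laden example, not a prescribed design: the `ProposedAction` fields, the 0.8 threshold, and the rule that irreversible actions always go to a human are illustrative choices, and how the confidence score is produced is out of scope here.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ACT = "act"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ProposedAction:
    name: str
    confidence: float  # assumed to be a score in [0, 1]
    reversible: bool   # can the action be undone without human help?


def route(action: ProposedAction, min_confidence: float = 0.8) -> Route:
    """Decide whether the agent acts or hands off to a human."""
    # Assumption for this sketch: irreversible actions always escalate.
    if not action.reversible:
        return Route.ESCALATE
    if action.confidence < min_confidence:
        return Route.ESCALATE
    return Route.ACT
```

For example, `route(ProposedAction("send_reminder", 0.93, reversible=True))` returns `Route.ACT`, while a refund marked `reversible=False` escalates regardless of its score.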
Start with the failure modes. The happy path will take care of itself.



