Shipping AI agents to production: what actually breaks

Every agent demo looks the same: a clean prompt, a happy path, applause. Production looks different. The same agent meets ambiguous requests, malformed data, third-party outages and users who type in three languages at once.

Production is the only benchmark that matters.

The three things that break first

Tool calls against real APIs — timeouts and partial failures the demo never saw.
Context discipline — conversations that outgrow the window and silently lose the one fact that mattered.
Confidence calibration — agents that act when they should escalate to a human.

None of these are model problems. They are systems problems, and they have systems answers: retries with budgets, explicit context contracts, and escalation paths designed before launch — not after the first incident.
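To make "retries with budgets" concrete, here is a minimal Python sketch. It assumes a tool call is a zero-argument callable that raises TimeoutError or ConnectionError on transient failure. The name call_with_budget, its parameters and the RetryBudgetExceeded exception are illustrative, not taken from any particular framework.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the attempt limit or the time budget is exhausted."""


def call_with_budget(
    tool: Callable[[], T],
    max_attempts: int = 3,          # hypothetical default
    budget_seconds: float = 10.0,   # hypothetical default
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry a flaky tool call with exponential backoff, bounded two ways:
    by number of attempts and by total wall-clock time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    deadline = clock() + budget_seconds
    last_error = None
    attempts_made = 0

    for attempt in range(max_attempts):
        attempts_made = attempt + 1
        try:
            return tool()
        except (TimeoutError, ConnectionError) as exc:
            # Only transient failures are retried; anything else propagates.
            last_error = exc

        delay = base_delay * (2 ** attempt)
        # Stop if this was the last attempt or the next wait would overrun the budget.
        if attempts_made == max_attempts or clock() + delay > deadline:
            break
        sleep(delay)

    raise RetryBudgetExceeded(
        f"gave up after {attempts_made} attempt(s)"
    ) from last_error
```

The point of the two limits is that the agent fails in bounded time and hands a clear error to whatever escalation path exists, rather than retrying indefinitely while the user waits.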

What we do differently now

We ship a pilot in 2–4 weeks, but the pilot includes the boring parts: monitoring, evals and guardrails from day one. An agent without observability is a liability with a chat interface.

The pre-launch checklist

Instrument every tool call before adding new capabilities.
Replay the worst week of traffic against every prompt change.
Define the escalation path to a human before the first user sees the agent.
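The first checklist item can be sketched as a small decorator, assuming tools are plain Python functions and that the standard logging module is an acceptable sink. The names instrumented, lookup_order and the "agent.tools" logger are hypothetical; a real deployment would likely send the same fields to its metrics or tracing system.

```python
import functools
import logging
import time

logger = logging.getLogger("agent.tools")  # hypothetical logger name


def instrumented(tool_name: str):
    """Wrap a tool so every call records its name, outcome and latency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "error"  # assume failure until the call returns
            try:
                result = func(*args, **kwargs)
                outcome = "ok"
                return result
            finally:
                # Runs on success and on exception, so failures are never silent.
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info(
                    "tool=%s outcome=%s latency_ms=%.1f",
                    tool_name, outcome, elapsed_ms,
                )
        return wrapper
    return decorator


@instrumented("lookup_order")
def lookup_order(order_id: str) -> dict:
    # Placeholder body standing in for a real API call.
    return {"order_id": order_id, "status": "shipped"}
```

Because the exception is re-raised after logging, the wrapper changes nothing about the tool's behavior. It only guarantees that each call leaves a record to replay and alert on.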

Start with the failure modes. The happy path will take care of itself.
