AI Agents in Production: Orchestration and Failure Patterns
The problem is not building agents. It is keeping them alive.
Building an AI agent that works in a controlled environment takes an afternoon. Building one that survives in production takes months. The difference is not the model or the prompt. It is everything that surrounds the agent when things go wrong.
And they go wrong. At 3 AM on a Friday when a vendor changes their API format without notice. When a user sends a scanned PDF at 72 DPI with diagonal text. When the model decides the best way to classify an invoice is to re-read it fourteen times because it cannot reach a satisfactory confidence score.
This whitepaper documents three orchestration patterns we have validated in production with real clients: supervisor, hierarchical, and consensus. It also documents the failure modes we discovered the hard way, and the strategies that mitigate them. Every pattern includes the conditions under which it works, the conditions under which it fails, and the numbers we use to choose between them.
Three orchestration patterns
The supervisor pattern
The supervisor pattern is the starting point for most deployments, and for good reason. A central component (the supervisor) receives requests, decides which sub-agents need to act, orchestrates execution, and consolidates results.
Our concrete implementation has three layers:
- Router. Classifies the incoming request and determines the flow. Can be a small model (Haiku, GPT-4o-mini) or rule-based logic. Cost per classification sits in the $0.0003-$0.001 range.
- Specialized sub-agents. Each scoped to a specific capability: document extraction, API queries, structured output generation. Each uses the model appropriate to its complexity.
- Consolidator. Receives sub-agent outputs, validates against JSON schemas, resolves conflicts, and produces the final result.
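The three layers above can be sketched in a few lines. This is a minimal illustration, not a framework API: the names (`route`, `SubAgent`, `supervise`) and the rule-based router are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubAgent:
    name: str
    run: Callable[[dict], dict]  # takes the request, returns a partial result

def route(request: dict) -> list[str]:
    """Router layer: rule-based here, but could be a small model."""
    if request.get("type") == "invoice":
        return ["extractor", "validator"]
    return ["extractor"]

def supervise(request: dict, agents: dict[str, SubAgent]) -> dict:
    """Run the routed sub-agents, then consolidate their outputs."""
    partials = {}
    for name in route(request):
        try:
            partials[name] = agents[name].run(request)
        except Exception as exc:
            # Fault containment: one sub-agent failing does not lose the others
            partials[name] = {"error": str(exc)}
    # Consolidator layer: merge partial results, flag any sub-agent errors
    result = {"ok": all("error" not in p for p in partials.values())}
    for p in partials.values():
        result.update({k: v for k, v in p.items() if k != "error"})
    return result

agents = {
    "extractor": SubAgent("extractor", lambda r: {"amount": 1250.0}),
    "validator": SubAgent("validator", lambda r: {"valid": True}),
}
print(supervise({"type": "invoice"}, agents))
```

A real consolidator would also validate each partial against a JSON schema before merging; the merge above only shows the control flow.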
The primary advantage of the supervisor pattern is fault containment. If the document extraction sub-agent fails, the supervisor can retry, use a fallback, or escalate to human review without losing the state of the other sub-agents that already completed their work.
Production numbers: in an order processing flow with four sub-agents, the supervisor pattern handles 96.3% of requests without human intervention, at a mean cost of $0.08 per request and a p95 latency of 12 seconds.
The limitation is centralization. The supervisor is a single point of failure. If it goes down, everything goes down. The mitigation is straightforward (redundancy, health checks, circuit breakers), but adds operational complexity.
The hierarchical pattern
The hierarchical pattern extends the supervisor with multiple levels of delegation. A top-level supervisor delegates to intermediate supervisors, which in turn orchestrate their own sub-agents. Think of it as a software org chart.
We use this pattern when the workflow is complex enough that a single supervisor becomes a cognitive bottleneck. A concrete example: insurance claims processing. The top-level supervisor determines claim type. Intermediate supervisors manage specific flows (property damage, liability, health). Each flow has its own specialized sub-agents.
The hierarchy introduces two benefits the flat pattern does not offer:
Context isolation. Each intermediate supervisor only knows its domain. The property damage supervisor does not need to know anything about health claims. This reduces prompt sizes, improves accuracy, and lowers token costs.
Organizational scalability. Different teams can own different branches of the hierarchy. The property damage team evolves its agents without affecting the health team. Interfaces between levels are defined by contracts (JSON schemas) that act as API contracts.
The cost is latency. Each level of the hierarchy adds a communication hop. In our measurements, each level adds 2 to 4 seconds. With three levels, total latency can exceed 20 seconds. For flows where the user is waiting for an interactive response, this is too much. For asynchronous processing, it is acceptable.
The practical rule we follow: if the flow has fewer than 6 sub-agents, the flat supervisor pattern is sufficient. From 6 upward, hierarchy starts making sense. Above 15, it is almost mandatory.
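The insurance claims example can be sketched as two levels of delegation. The function names and flows are illustrative assumptions; the point is that each branch sees only its own context.

```python
def classify_claim(claim: dict) -> str:
    """Top-level supervisor: determine the claim type."""
    return claim.get("kind", "property_damage")

def property_damage_flow(claim: dict) -> dict:
    # Intermediate supervisor: knows only its own domain and sub-agents
    return {"flow": "property_damage", "estimate": 2400.0}

def health_flow(claim: dict) -> dict:
    return {"flow": "health", "estimate": 180.0}

# Contracts between levels: each flow accepts a claim dict, returns a result dict
FLOWS = {"property_damage": property_damage_flow, "health": health_flow}

def process_claim(claim: dict) -> dict:
    """Top supervisor delegates; the branch's prompt never sees other domains."""
    return FLOWS[classify_claim(claim)](claim)

print(process_claim({"kind": "health"}))
```

In production each flow entry would be validated against a JSON schema at the boundary, which is what lets separate teams evolve their branches independently.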
The consensus pattern
The consensus pattern executes multiple agents in parallel on the same task and combines their results. It is the equivalent of asking for a second (and third) opinion.
We use it exclusively for tasks where accuracy is critical and the cost of error is high. Legal document classification. Financial data validation. Invoice amount extraction when OCR is ambiguous.
The typical implementation runs three instances of the same agent (sometimes with different prompts or different models) and applies an aggregation function:
- Majority vote for classification tasks. If two of three agents say the document is an invoice, it is an invoice.
- Median for numerical extraction tasks. If the three agents extract amounts of 1,250, 1,250, and 12,500 euros, the median discards the outlier.
- Union with verification for multi-field extraction tasks. All extracted fields are merged and discrepancies flagged.
The cost scales linearly with the number of instances. Three agents cost three times as much. But in contexts where a classification error has a business cost of hundreds or thousands of euros, the 3x multiplier on inference cost (cents) is irrelevant.
The numbers: in invoice extraction, moving from one agent to three with majority vote reduced the error rate from 4.2% to 0.7%. Cost went from $0.03 to $0.09 per invoice. For a client processing 8,000 invoices per month, reducing 280 monthly errors justified the $480 increase in inference costs.
Failure modes and how to survive them
The infinite loop
The most dangerous failure mode is the agent that enters a loop. It re-reads the same document, re-invokes the same tool, tries the same strategy again and again. Each iteration consumes tokens. Without an interrupt mechanism, a loop can burn hundreds of dollars in minutes.
The root cause is usually one of two things: the agent cannot resolve ambiguity with available information, or the tool it is invoking keeps returning the same unsatisfactory result.
Three protection mechanisms are mandatory:
- Iteration limit. Every agent has a configurable maximum iteration count. For most of our agents it is 5. If an agent has not completed its task in 5 iterations, it stops and escalates.
- Token budget. Every invocation has a total token limit. We set it at 3x the historical average consumption. If the current invocation costs more than $0.25 when the average is $0.08, something is wrong.
- Repetition detection. We compare tool call arguments against previous calls. If an agent invokes the same tool with the same arguments twice consecutively, it is interrupted. This catches 80% of loops before they consume significant resources.
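The three guards compose naturally into one check that runs before every tool call. A minimal sketch, assuming a `LoopGuard` object per invocation; the limit and budget values mirror the text, but the class and exception names are illustrative.

```python
class AgentStopped(Exception):
    pass

class LoopGuard:
    def __init__(self, max_iterations: int = 5, token_budget: int = 30_000):
        self.max_iterations = max_iterations
        self.token_budget = token_budget  # e.g. 3x historical average consumption
        self.iterations = 0
        self.tokens_used = 0
        self.last_call: tuple | None = None

    def check(self, tool: str, args: dict, tokens: int) -> None:
        """Call before each tool invocation; raises to interrupt the agent."""
        self.iterations += 1
        self.tokens_used += tokens
        call = (tool, tuple(sorted(args.items())))
        if self.iterations > self.max_iterations:
            raise AgentStopped("iteration limit exceeded: escalate")
        if self.tokens_used > self.token_budget:
            raise AgentStopped("token budget exceeded: escalate")
        if call == self.last_call:
            raise AgentStopped("repeated identical tool call: interrupt")
        self.last_call = call

guard = LoopGuard()
guard.check("read_document", {"doc_id": "A-17"}, tokens=900)
try:
    guard.check("read_document", {"doc_id": "A-17"}, tokens=900)
except AgentStopped as e:
    print(e)  # repeated identical tool call: interrupt
```

Repetition detection here only compares consecutive calls; a production version would keep a short window of recent calls to catch A-B-A-B cycles as well.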
Cascade failure
In a multi-agent architecture, one component’s failure can propagate. Sub-agent A fails, the supervisor retries, retries saturate the model API rate limit, sub-agents B and C start failing too because they share the same connection pool.
The mitigation is isolating circuit breakers per sub-agent. Each sub-agent has its own circuit breaker with its own threshold. If the extraction agent fails three consecutive times, its circuit breaker opens and the supervisor stops sending it requests during a cooldown period. Other sub-agents continue operating normally.
We also separate rate limiting pools. Each sub-agent has its own tokens-per-minute quota. A sub-agent that enters a loop cannot consume the quota of others.
Silent degradation
This failure mode causes the most damage because it is invisible. The agent keeps running. It produces results. The results look reasonable. But quality has declined.
We have seen this when a model provider changes model behavior without changing the API version (this happens more than the industry admits). We have seen it when input data shifts gradually: invoices that used to have a consistent format start arriving with variations because the supplier changed their system.
The only defense is continuous quality monitoring. We sample a percentage of agent outputs (between 5% and 10%) and evaluate them against ground truth. For document extraction, we compare extracted fields with human-verified values. For classification, we measure precision and recall on the sample.
We have alerts configured at three thresholds:
- Warning: accuracy drops below 95%. Investigate during business hours.
- Error: accuracy drops below 90%. Investigate immediately.
- Critical: accuracy drops below 85%. Disable the agent and process manually.
Thresholds vary by use case. For an email triage agent, 90% may be acceptable. For an agent classifying tax documents, we need 98%+.
Operational hallucination
Language models hallucinate. This is well known. What is less known is how hallucinations manifest in agents with tool access.
The most problematic case we have seen: an agent that needs to query an API to retrieve a data point, but the model “decides” it already knows the answer and generates a fabricated value instead of invoking the tool. It does not fail. It does not throw an error. It simply invents an order number, a date, an amount.
The defense is twofold. First, strict schema validation on all outputs. If the agent says the order number is X, we verify that X exists in the database. If it does not, the output is rejected. Second, trace analysis. If a trace shows the agent produced a result that includes a field that should have come from a tool call, but there is no tool call in the trace, we have an operational hallucination.
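Both checks fit in one validation pass over the output and its trace. The schema here is an assumption for illustration: `KNOWN_ORDERS` stands in for the database lookup, and the trace format is a simplified version of what a real tracing layer records.

```python
# Fields that must originate from a specific tool call, per output schema
TOOL_SOURCED_FIELDS = {"order_id": "lookup_order"}
KNOWN_ORDERS = {"ORD-1001", "ORD-1002"}  # stand-in for a database check

def validate_output(output: dict, trace: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    problems = []
    # Defense 1: verify referenced entities actually exist
    order_id = output.get("order_id")
    if order_id is not None and order_id not in KNOWN_ORDERS:
        problems.append(f"order_id {order_id!r} not found in database")
    # Defense 2: tool-sourced fields must have a matching tool call in the trace
    tools_called = {step["tool"] for step in trace if step.get("kind") == "tool_call"}
    for field, tool in TOOL_SOURCED_FIELDS.items():
        if field in output and tool not in tools_called:
            problems.append(f"{field} present but no {tool} call in trace")
    return problems

# The agent "answered" without ever calling the lookup tool: both checks fire
trace = [{"kind": "llm_call", "model": "haiku"}]
print(validate_output({"order_id": "ORD-9999"}, trace))
```

The second check is the one that catches the fabrication even when the invented value happens to look plausible.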
Human-in-the-loop: where to draw the line
The “autonomous vs. supervised” debate is a false dichotomy. The real question is: for each type of decision, what is the right level of human oversight?
We use a four-level framework:
Level 0: Fully autonomous. The agent decides and executes without intervention. Reserved for low-impact, high-confidence decisions. Examples: classifying an email as spam, updating shipment status, sending a confirmation notification.
Level 1: Autonomous with audit. The agent decides and executes, but all decisions are logged for later review. A human reviews a sample periodically. Examples: classifying invoices by accounting category, routing support tickets to queues.
Level 2: Proposal with approval. The agent proposes an action but does not execute until a human approves. Examples: drafting a client response, proposing an inventory adjustment, generating a report draft.
Level 3: Assistance only. The agent provides information and analysis, but decision and execution are entirely human. Examples: legal risk analysis, investment recommendations, medical diagnostics.
Level assignment is not static. An agent can start at Level 2 and, as it accumulates a track record of high accuracy, migrate to Level 1 and eventually Level 0 for certain decision types. The reverse path also happens: if accuracy drops, the agent moves up a level (more supervision).
The practical implementation requires a review queue. Decisions needing human approval go to a queue prioritized by urgency and amount. Human reviewers see the agent’s proposal, the evidence it used, and the confidence level. They can approve, reject, or modify. Reviewer decisions feed back into the evaluation pipeline and improve confidence thresholds over time.
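The four levels, the level migration, and the review queue can be sketched together. The accuracy thresholds for moving between levels are illustrative assumptions, not the production values.

```python
from enum import IntEnum
import heapq

class Oversight(IntEnum):
    AUTONOMOUS = 0  # Level 0: decide and execute
    AUDITED = 1     # Level 1: execute, log for sampled review
    APPROVAL = 2    # Level 2: propose, wait for a human
    ASSIST = 3      # Level 3: inform only

def adjust_level(current: Oversight, accuracy: float) -> Oversight:
    """Move toward autonomy on a strong track record; add supervision on a drop."""
    if accuracy >= 0.98 and current > Oversight.AUTONOMOUS:
        return Oversight(current - 1)
    if accuracy < 0.90 and current < Oversight.ASSIST:
        return Oversight(current + 1)
    return current

# Review queue: lower number = higher priority (e.g. derived from urgency/amount)
review_queue: list[tuple[int, str]] = []
heapq.heappush(review_queue, (2, "draft reply to client X"))
heapq.heappush(review_queue, (1, "inventory adjustment, EUR 9,400"))

print(adjust_level(Oversight.APPROVAL, 0.99))  # moves down to AUDITED
print(heapq.heappop(review_queue)[1])          # highest-priority item first
```

In practice `adjust_level` would run per decision type, not per agent, since the same agent can be Level 0 for one decision and Level 2 for another.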
Cost control: beyond the token budget
Cost control for AI agents has four dimensions that teams typically discover sequentially (and painfully).
Inference cost
The most obvious. Every LLM call has a per-token price. The common mistake is using the most powerful model for every task. Our rule: 70% of tasks in a typical flow can be handled by fast, cheap models (Haiku, GPT-4o-mini, Gemini Flash). 25% need mid-tier models (Sonnet, GPT-4o). Only 5% justify premium models (Opus, o1).
That model distribution reduces the average cost per request by 60% to 75% compared to using the premium model for everything. For a client processing 50,000 monthly requests, the difference is $3,200 per month.
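The arithmetic behind that saving is worth making explicit. The per-request prices below are illustrative assumptions (not quoted provider rates), chosen only to show how the 70/25/5 split produces a reduction in the stated range.

```python
# Hypothetical per-request costs by tier; only the 70/25/5 split is from the text
TIERS = {
    "fast":    {"share": 0.70, "cost": 0.004},
    "mid":     {"share": 0.25, "cost": 0.020},
    "premium": {"share": 0.05, "cost": 0.030},
}

blended = sum(t["share"] * t["cost"] for t in TIERS.values())
premium_only = TIERS["premium"]["cost"]
savings = 1 - blended / premium_only

print(f"blended cost/request: ${blended:.4f}")   # $0.0093
print(f"saving vs premium-only: {savings:.0%}")  # 69%
```

The exact percentage depends entirely on the price gap between tiers, which is why the text gives a range rather than a single number.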
Retry cost
An agent that retries three times before escalating costs four times as much as one that gets it right the first time. Retries are necessary, but they must be measured and optimized. If a sub-agent has a retry rate above 15%, it does not need more retries — it needs a better prompt, better input data, or a redesign.
We monitor retry rate per sub-agent as a primary metric. It is the most reliable indicator that something is silently degrading.
Human-in-the-loop cost
Every escalation to a human has a cost: the reviewer’s salary, the process wait time, the potential degradation of user experience. For one logistics client, we calculated each escalation costs EUR 4.70 in personnel time. At 200 escalations per day, that is EUR 940 daily. Reducing escalations from 200 to 80 (by improving agent accuracy from 88% to 95%) saved EUR 564 daily, or EUR 12,400 per month.
Undetected error cost
The hardest to quantify and the most expensive. A tax classification error can result in a penalty. An invoice amount error can result in an incorrect payment. A ticket assignment error can result in a breached SLA. These costs do not appear on the LLM provider’s bill, but they are real and sometimes orders of magnitude larger than inference cost.
The complete accounting of an agent’s cost includes all four dimensions. Teams that only optimize inference cost are ignoring 80% of the equation.
Observability specific to orchestration
Microservices observability is a prerequisite, but agent systems need additional layers.
Reasoning traces
Every agent invocation generates a trace that includes: the prompt sent to the model, tool calls with their arguments and responses, supervisor routing decisions, validation results, and the final output. We store these traces in a structured format that supports queries like “show me all invocations where the agent used more than 3 retries in the last hour” or “show me invocations where confidence was low but the result was correct.”
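The queries above reduce to filters over structured records. A minimal sketch; the `Trace` schema (retries, confidence, correct) is a simplified stand-in for the full trace format.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    invocation_id: str
    retries: int
    confidence: float
    correct: bool

traces = [
    Trace("inv-1", retries=0, confidence=0.97, correct=True),
    Trace("inv-2", retries=4, confidence=0.62, correct=False),
    Trace("inv-3", retries=1, confidence=0.41, correct=True),
]

# "All invocations where the agent used more than 3 retries"
heavy = [t.invocation_id for t in traces if t.retries > 3]
# "Invocations where confidence was low but the result was correct"
surprises = [t.invocation_id for t in traces if t.confidence < 0.5 and t.correct]

print(heavy)      # ['inv-2']
print(surprises)  # ['inv-3']
```

In production these records live in a queryable store (a columnar database or a tracing backend), but the shape of the question is the same.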
We use OpenTelemetry as the foundation, with custom spans for each agent component. One span for the supervisor, one span per sub-agent, one span per tool call. Attributes include tokens consumed, model used, confidence level, and validation result.
Operational dashboards
We maintain two primary dashboards. The health dashboard shows real-time metrics: requests per minute, latency p50/p95/p99, success rate, escalation rate, cumulative cost. The quality dashboard shows accuracy metrics computed from sampling: accuracy by task type, confidence drift, and the correlation between reported confidence and actual accuracy.
The second dashboard is critical because it detects the calibration problem: an agent that reports high confidence but produces incorrect results. If the correlation between confidence and accuracy diverges, human-in-the-loop thresholds need recalibration.
Agent-specific alerts
Beyond standard infrastructure alerts, we configure agent-specific ones:
- Detected loop rate > 2% of invocations.
- Mean cost per invocation > 2x historical average.
- Human escalation rate > configured threshold per use case.
- Mean time in human review queue > defined SLA.
- Accuracy drift > 3 percentage points in a 24-hour window.
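These alerts evaluate against a periodic metrics snapshot. A sketch; the metric names and sample values are illustrative, and the per-use-case escalation threshold is passed in as a parameter.

```python
def fired_alerts(m: dict, escalation_threshold: float = 0.10) -> list[str]:
    """Check one metrics snapshot against the agent-specific alert rules."""
    alerts = []
    if m["loop_rate"] > 0.02:
        alerts.append("loop rate above 2%")
    if m["mean_cost"] > 2 * m["historical_mean_cost"]:
        alerts.append("mean cost above 2x historical average")
    if m["escalation_rate"] > escalation_threshold:
        alerts.append("escalation rate above use-case threshold")
    if m["review_queue_wait_s"] > m["review_sla_s"]:
        alerts.append("review queue wait above SLA")
    if m["accuracy_drift_pp"] > 3:
        alerts.append("accuracy drift above 3 points in 24h")
    return alerts

snapshot = {
    "loop_rate": 0.011,
    "mean_cost": 0.19, "historical_mean_cost": 0.08,
    "escalation_rate": 0.06,
    "review_queue_wait_s": 400, "review_sla_s": 900,
    "accuracy_drift_pp": 1.2,
}
print(fired_alerts(snapshot))  # ['mean cost above 2x historical average']
```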
Choosing the right pattern
No pattern is universally better. The decision depends on four factors.
Flow complexity. Simple flows (fewer than 5 steps, no significant branching) work well with the supervisor pattern. Complex flows (multiple branches, multiple domains) need hierarchy.
Accuracy requirements. If an error has high cost, the consensus pattern on the critical tasks within the flow reduces error rates significantly. You do not need consensus across the entire flow, only at the high-impact decision points.
Budget and latency. The consensus pattern multiplies costs and introduces latency from parallel executions. Hierarchy adds latency from level hops. If budget is tight or latency is critical, the flat supervisor with good fallbacks is the most pragmatic option.
Team maturity. Hierarchy and consensus require more observability infrastructure and more operational discipline. If the team is starting with agents, the supervisor pattern provides enough complexity for the first iteration. Other patterns should be introduced when data shows they are needed, not before.
Combined patterns: what we run in production
In practice, our deployments combine patterns. A real example: import document processing.
The top-level supervisor classifies the document (invoice, packing list, bill of lading, customs declaration). Each document type has its own sub-supervisor with specialized extraction agents. Amount extraction uses the consensus pattern with three instances. Text field extraction (supplier name, goods description) uses a single agent.
The complete flow takes 8 to 15 seconds. It processes documents at 97.2% accuracy without human intervention. The remaining 2.8% goes to a review queue. Average cost is $0.14 per document. For a client processing 3,000 documents monthly, the total cost is $420 versus an estimated 2.5 FTEs ($7,500/month) for manual processing.
Those numbers do not include the speed advantage (8-15 seconds vs. 8-12 minutes per document) or 24/7 availability. But the direct cost numbers are what convince CFOs.
A second production example: customer service ticket processing. The supervisor receives an incoming ticket (email, form submission, or chat message) and classifies it by urgency and topic. Urgent tickets (service outage, billing error) go directly to a human agent with a pre-drafted response from the AI. Non-urgent tickets are handled by specialized sub-agents: one for order status queries (API lookup, no LLM needed), one for return requests (structured workflow with policy validation), and one for general questions (RAG-based response generation with Sonnet). The consensus pattern is not used here because the cost of a slightly imprecise response to a general question does not justify tripling the inference cost. Instead, all generated responses go through a tone and accuracy validator before sending.
This flow handles 73% of tickets end-to-end without human intervention. Average response time dropped from 4.2 hours (human-only) to 8 minutes (agent-first). Monthly cost: $380 in inference versus $4,200 in estimated human time for the same volume. The remaining 27% of tickets that get escalated arrive with full context (previous interactions, extracted order data, attempted resolution steps), which reduces average human handling time by 40%.
The state of the art and where it is heading
The agent orchestration ecosystem is maturing rapidly. Frameworks like LangGraph, CrewAI, and AutoGen provide useful abstractions but still require significant customization for production. The emerging interoperability standard is Anthropic’s Model Context Protocol (MCP), which standardizes how agents consume tools and data.
What is missing is mature tooling for testing and evaluation. Testing a deterministic agent is trivial. Testing an agent that makes probabilistic decisions on unstructured data requires evaluation infrastructure that most frameworks do not provide. Teams that invest in evaluation before features are the ones that keep their agents in production. Those that do not end up with a demo that never scales.
Orchestration is not the hardest problem in production AI agents. But it is the one that determines whether your system degrades gracefully or collapses spectacularly when something fails. And something always fails. For deeper dives into architecture patterns for autonomous agents or the real costs of operating LLMs in production, we cover both topics in detail.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
