AI & Automation

Multi-Agent AI in Production: Architecture Patterns for Enterprise Scale

Sritharan K
April 1, 2026
10 min read


The gap between an AI demo and a production AI system is enormous. A demo calls an LLM, gets a response, and shows it on screen. A production system handles failures, tracks state across long workflows, controls costs, maintains audit logs, and keeps working when a model endpoint returns a 429 or hallucinates a tool call.

Multi-agent systems add another layer. Now you have multiple LLM calls happening in sequence or in parallel, each with its own failure surface, each consuming tokens, each potentially making decisions that affect downstream agents. The operational complexity compounds quickly.

This post covers what actually matters when you move multi-agent AI from a prototype into a production backend: the architecture patterns, the failure modes you will hit, and the engineering disciplines that keep these systems reliable and cost-predictable.

What an Agent Actually Is

An agent is an LLM that has been given tools it can call and a loop that continues until a termination condition is met. The simplest possible agent, sketched here as a minimal loop against the OpenAI Python SDK (the caller supplies the tool definitions and the matching Python functions), looks like this:
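```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_agent(task: str, tools: list[dict], tool_fns: dict) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(10):  # hard iteration cap: a confused model must not loop forever
        msg = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools
        ).choices[0].message
        if not msg.tool_calls:              # no tool call means a final answer
            return msg.content
        messages.append(msg)                # keep the assistant turn in the history
        for call in msg.tool_calls:         # execute each requested tool by name
            result = tool_fns[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    raise RuntimeError("agent exceeded iteration budget")
```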

The loop is the critical part. The agent decides whether to call a tool or return a final answer. This works fine for simple tasks. It breaks down under three conditions: the task takes many steps, multiple agents need to collaborate, or the workflow has to survive process restarts.

When You Need Multiple Agents

A single agent with many tools is easier to operate than multiple agents. Start with one. Move to multiple only when you hit a genuine limitation:

  • Context window overflow: a single task requires more context than fits in one LLM call
  • Specialization: different subtasks benefit from different system prompts, different models, or different tool sets
  • Parallelism: independent subtasks can run concurrently and you need the throughput
  • Isolation: one agent's hallucination should not corrupt another's state

If your task does not hit any of these limits, one agent is the right answer. Multi-agent systems are harder to debug, harder to trace, and harder to cost-control.

Architecture Pattern 1: Orchestrator-Worker

The orchestrator receives a high-level goal, breaks it into subtasks, dispatches those subtasks to worker agents, and assembles the results. Workers are stateless and focused. The orchestrator owns the overall workflow state.
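A minimal sketch of the fan-out, where `plan_subtasks` and `call_worker` are illustrative stand-ins for the orchestrator's planning call and the worker LLM calls:

```python
import asyncio

def plan_subtasks(goal: str) -> list[str]:
    # Stand-in for the orchestrator's decomposition step (itself an LLM call).
    return [f"{goal} (part {i})" for i in range(1, 4)]

async def call_worker(subtask: str) -> str:
    # Stand-in for one focused, stateless worker LLM call.
    await asyncio.sleep(0)  # placeholder for network latency
    return f"result for: {subtask}"

async def orchestrate(goal: str) -> str:
    subtasks = plan_subtasks(goal)
    results = await asyncio.gather(            # workers run concurrently
        *(call_worker(s) for s in subtasks),
        return_exceptions=True,                # one failed worker must not kill the batch
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    return "\n".join(ok)                       # orchestrator assembles the final output

# asyncio.run(orchestrate("summarize Q3 contracts"))
```

Because workers are stateless, the orchestrator is the only component that needs durable state, which simplifies the persistence story discussed below.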

Architecture Pattern 2: Supervisor with Handoffs

In the supervisor pattern, a routing agent decides which specialist agent should handle the current state of the conversation. Control passes entirely to the specialist, which can pass it back or route to another specialist. This works well for customer-facing systems where the same conversation might involve multiple domains.
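A sketch of the routing loop; the specialist names and the `route` function are illustrative stand-ins for a cheap LLM routing call:

```python
from typing import Callable

# Each specialist returns (answer, next_agent); next_agent=None means done.
def tech_support_agent(history: list[str]) -> tuple[str, str | None]:
    return "", "billing"                       # decides this is a billing question

def billing_agent(history: list[str]) -> tuple[str, str | None]:
    return "Your invoice was sent on the 1st.", None

SPECIALISTS: dict[str, Callable] = {
    "tech_support": tech_support_agent,
    "billing": billing_agent,
}

def route(history: list[str]) -> str:
    # Stand-in for a cheap LLM routing call that names a specialist.
    return "tech_support"

def handle_turn(history: list[str], max_handoffs: int = 3) -> str:
    agent = SPECIALISTS[route(history)]
    for _ in range(max_handoffs):              # cap handoffs to avoid ping-pong loops
        answer, next_agent = agent(history)
        if next_agent is None:
            return answer                      # specialist produced the final reply
        agent = SPECIALISTS[next_agent]        # control passes fully to the next specialist
    raise RuntimeError("too many handoffs")
```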

State Management: The Most Common Production Failure

In-memory state is the #1 cause of agent system failures in production. When a process restarts, a pod gets evicted, or a timeout occurs mid-workflow, all state is lost. The agent has no idea what it already completed.

The fix is to treat agent workflows like distributed transactions: persist state at every meaningful checkpoint and design for idempotent resumption.
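A minimal checkpointing sketch using psycopg; the `workflow_steps` schema is our own illustrative convention, not a standard:

```python
import psycopg
from psycopg.types.json import Json

# Assumed schema (illustrative):
#   CREATE TABLE workflow_steps (
#     workflow_id text, step text, result jsonb,
#     PRIMARY KEY (workflow_id, step));

def run_step(conn: psycopg.Connection, workflow_id: str, step: str, fn) -> dict:
    row = conn.execute(
        "SELECT result FROM workflow_steps WHERE workflow_id = %s AND step = %s",
        (workflow_id, step),
    ).fetchone()
    if row is not None:                 # step finished before a crash/restart: reuse it
        return row[0]
    result = fn()                       # the actual agent or tool call
    conn.execute(
        "INSERT INTO workflow_steps (workflow_id, step, result) "
        "VALUES (%s, %s, %s) ON CONFLICT DO NOTHING",   # idempotent on retry
        (workflow_id, step, Json(result)),
    )
    conn.commit()
    return result
```

Because every step checks for an existing checkpoint before running, a restarted workflow skips completed work instead of repeating it (and repeating its token costs).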

Reliability: Retries, Fallbacks, and Guardrails

LLM APIs are not as reliable as internal microservices. Rate limits, timeouts, and model errors happen regularly. Every agent call needs retry logic with backoff, and every tool call needs a timeout.
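A sketch of both pieces in plain Python; the retryable exception types are placeholders you would map to your SDK's actual rate-limit and timeout errors:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Map these to your SDK's error types, e.g. its RateLimitError.
RETRYABLE = (TimeoutError, ConnectionError)

def call_with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_attempts:
                raise                                     # retry budget spent: surface it
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, delay))  # exponential backoff plus jitter

_tool_pool = ThreadPoolExecutor(max_workers=4)  # shared pool for tool execution

def call_tool_with_timeout(fn, timeout_s: float = 30.0):
    # Bounds how long the agent waits. Note the worker thread keeps running
    # past the deadline, so tools should enforce their own internal limits too.
    return _tool_pool.submit(fn).result(timeout=timeout_s)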

Observability: Tracing Agent Calls

Standard APM tools were not built for LLM workflows. You need to track: which agent called which model, with what prompt, at what token cost, and what it returned. Without this, debugging production failures is nearly impossible.
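A sketch using the OpenTelemetry tracing API (exporter setup omitted; point it at your collector). The attribute names are our own convention:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agents")

def traced_llm_call(agent_name: str, model: str, call):
    """Wrap one provider call in a span; `call` is a zero-arg function that
    performs the request and returns a response with a `.usage` field."""
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("agent.name", agent_name)
        span.set_attribute("llm.model", model)
        response = call()
        usage = getattr(response, "usage", None)
        if usage is not None:                   # record what each call cost
            span.set_attribute("llm.prompt_tokens", usage.prompt_tokens)
            span.set_attribute("llm.completion_tokens", usage.completion_tokens)
        return response
```

With one span per LLM call nested under a workflow-level span, a single trace answers the debugging questions above: which agent, which model, which prompt, and at what token cost.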

Cost Control in Production Agent Systems

A multi-agent system that runs without token budgets will surprise you on the cloud bill. The correct approach is to set hard limits per workflow and route simpler tasks to cheaper models.

  • Model routing: use GPT-4o-mini or Claude Haiku for classification, routing, and simple extraction; use GPT-4o or Claude Sonnet for reasoning and synthesis
  • Token budgets: track cumulative token usage per workflow and abort when the budget is exceeded (a minimal tracker sketch follows this list)
  • Prompt caching: OpenAI and Anthropic both support prompt caching for long system prompts, which cuts repeated call costs by 50-90%
  • Output constraints: instruct agents to respond in JSON with a defined schema; loose instructions produce verbose responses that cost more
  • Parallel calls: running independent agents in parallel reduces wall-clock time and sometimes reduces cost vs sequential calls with growing context
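
A minimal budget tracker, as referenced in the token-budgets item above; the cap and the exception type are illustrative:

```python
class BudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    """Cumulative token cap per workflow; charge after every LLM response."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.used += prompt_tokens + completion_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used} of {self.max_tokens} budgeted tokens"
            )

# One budget per workflow, charged after every call:
# budget = TokenBudget(max_tokens=50_000)
# response = client.chat.completions.create(...)
# budget.charge(response.usage.prompt_tokens, response.usage.completion_tokens)
```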

A Real Production Pattern: Document Processing Pipeline

A practical multi-agent pipeline for processing incoming contracts: one agent classifies the document type, one extracts key terms, one validates the extracted terms against business rules, and a final agent generates a structured summary for the legal team. Each agent is focused, has a small context window, and uses the cheapest model that can do its job reliably.

  • Classifier agent (gpt-4o-mini): reads the first 500 tokens, outputs document type and jurisdiction
  • Extractor agent (gpt-4o-mini): reads the full document in 2000-token chunks, outputs structured JSON with key clauses
  • Validator agent (gpt-4o): cross-checks extracted data against a rule set, flags anomalies
  • Synthesizer agent (gpt-4o): generates a business-readable summary with risk flags
  • Workflow state: persisted in PostgreSQL, resumable on failure, full trace in Grafana Tempo

This pipeline processes a 50-page contract in under 90 seconds for roughly $0.08 per document. The key design decisions that made it cost-effective: chunked extraction instead of full-document context per agent, model routing by task complexity, and prompt caching for the system prompts that do not change between calls.
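A condensed sketch of the pipeline's control flow; the four stage functions are illustrative stand-ins for the real agent calls, with the model routing noted per stage:

```python
def classify(doc: str) -> dict:        # gpt-4o-mini, first 500 tokens only
    return {"type": "NDA", "jurisdiction": "NL"}

def extract(doc: str) -> dict:         # gpt-4o-mini, 2000-token chunks
    return {"clauses": ["term", "liability"]}

def validate(data: dict) -> dict:      # gpt-4o, cross-check against rule set
    return {**data, "anomalies": []}

def synthesize(data: dict) -> str:     # gpt-4o, business-readable summary
    return "Summary with risk flags: none"

def process_contract(doc: str) -> str:
    # In production, wrap each stage in the checkpointing helper from the
    # state-management section so the pipeline resumes after a crash.
    meta = classify(doc)
    extracted = extract(doc)
    validated = validate({**meta, **extracted})
    return synthesize(validated)
```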

What Still Goes Wrong

Even with all of the above in place, production agent systems fail in predictable ways. The most common issues after six months in production:

  • Prompt drift: system prompts that worked in January start failing in March after a model update; pin model versions and run regression tests on prompts
  • Tool call schema mismatches: the LLM generates a tool call with extra or missing fields; validate every tool call before executing it (see the sketch after this list)
  • Infinite loops: add a maximum iteration count and make it configurable per workflow type
  • Context contamination: long conversation histories cause the model to fixate on early context; trim or summarize old messages before each call
  • Cascading failures: one slow agent blocks the entire orchestrator; set individual agent timeouts and design the workflow to handle partial results
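
A tool-call validation sketch with Pydantic, as referenced above; the argument schema is an illustrative example, not part of any standard:

```python
import json
from pydantic import BaseModel, ValidationError

class LookupOrderArgs(BaseModel, extra="forbid"):   # extra fields are rejected
    order_id: str
    include_history: bool = False

def parse_tool_args(raw_arguments: str) -> LookupOrderArgs | None:
    try:
        return LookupOrderArgs.model_validate(json.loads(raw_arguments))
    except (ValidationError, json.JSONDecodeError):
        return None   # invalid call: re-prompt the model instead of executing
```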

Summary

Multi-agent AI is a real engineering discipline, not a prompt engineering trick. The patterns that make it work in production are the same ones that make distributed systems work: durable state, idempotent operations, retry with backoff, hard timeouts, structured observability, and cost budgets enforced at the code level.

Start with a single agent. Add the second one only when you have a concrete reason. When you do go multi-agent, pick your orchestration pattern based on whether your tasks are sequential, parallel, or conversational, and build the persistence layer before you build anything else.

Planning a complex Python or FastAPI migration? I specialize in auditing and executing large-scale backend transformations.

Book a Strategy Call