Production AI Agent Architecture: Patterns That Actually Ship

April 10, 2026 · 18 min read · ai-agents, architecture, production, llm-engineering, claude

Most agent tutorials end at “the model calls tools in a loop, done.” That works for a demo. It falls apart the first time a tool 500s, a user asks something off-script, or the token bill crosses $20 on a single task. Production AI agent architecture is the set of patterns that keep that loop alive when reality hits.

I run 10 agents in production right now. Bash scripts calling claude -p, scheduled via systemd timers, reporting outcomes to Telegram. Not fancy. They ship work every day because the architecture around the loop is boring and deliberate. This guide is that playbook: the patterns, the must-haves, the anti-patterns, and the opinionated verdicts on what to use when.

If you want the deeper build-vs-buy framing, start with the agents build vs buy guide. This page is the hub for everything after you decide to build.

What “production” actually means for an agent

A production agent is not a prompt that works on your laptop. It is a system with four properties:

  1. Bounded cost per run. You can answer “what is the max I spend if this runs 10,000 times today?”
  2. Bounded time per run. There is a hard cap. If the agent is still going at minute 20, something kills it.
  3. Reproducible decisions. If a user complains about output, you can replay what the agent saw and why it acted.
  4. Graceful failure. When a tool breaks or a model refuses, the agent either recovers or escalates cleanly. It never silently loops.

Everything below feeds one of those four. If an architectural choice does not, cut it.

A “demo agent” has none of these. It works because you, the author, are the error handler, budget cap, and replay mechanism, sitting in front of the terminal. Production means the agent runs without you.

The Router-Planner-Executor split

The single most useful pattern I use. Split the agent into three roles, each with a different model.

  • Router: a fast, cheap model (Haiku) decides which capability or agent to dispatch. Input is the user request and a short capability catalog. Output is a single capability name.
  • Planner: a reasoning model (Sonnet or Opus with extended thinking) plans the approach for the chosen capability. Input is the user request and the available tools. Output is a structured plan.
  • Executor: the tool-use loop. Dispatches calls, handles errors, passes results back. Usually Sonnet. Can be the same model as the planner or a cheaper one.

Why three:

  • Separation of concerns. The router does not reason about how to do the work. The planner does not run tools. The executor does not decide strategy.
  • Cost control. Most requests never need a reasoning model. The router filters those out for Haiku pennies.
  • Quality control. The planner gets a clean, focused prompt without the noise of tool definitions and results. The executor gets a clear plan without the ambiguity of the original user input.

A sketch in TypeScript, stripped to the bones:

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

// 1. ROUTER - Haiku picks the capability
async function route(userMessage: string): Promise<string> {
  const res = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 50,
    system: "Return one of: search, write, schedule, escalate. Nothing else.",
    messages: [{ role: "user", content: userMessage }],
  });
  const first = res.content[0];
  // Guard instead of a blind cast: a non-text first block falls back to escalate.
  return first.type === "text" ? first.text.trim() : "escalate";
}

// 2. PLANNER - Sonnet with thinking plans the steps
async function plan(capability: string, userMessage: string) {
  const res = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 8000, // must be larger than the thinking budget below
    thinking: { type: "enabled", budget_tokens: 4000 },
    system: `You plan steps for the "${capability}" capability. Return a JSON array of steps.`,
    messages: [{ role: "user", content: userMessage }],
  });
  return res.content;
}

// 3. EXECUTOR - tool-use loop with bounded iterations
async function execute(plan: unknown, tools: Anthropic.Tool[], maxSteps = 10) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: JSON.stringify(plan) },
  ];
  for (let i = 0; i < maxSteps; i++) {
    const res = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      tools,
      messages,
    });
    if (res.stop_reason === "end_turn") return res;
    // dispatch tool calls, append tool_result, continue
  }
  throw new Error("max_steps exceeded");
}

The Claude API tool use guide goes deep on the executor loop. This page stays at the architecture level.

ReAct, Plan-then-Execute, and Reflexion

Three named patterns you should know. Pick one per capability. Do not mix them in the same loop.

ReAct (Reason + Act). The model alternates “thought” and “action” in the same context. It narrates why it is about to call a tool, calls it, observes the result, narrates again. This is what you get by default from any modern tool-use loop. Good for short tasks with unpredictable shape: “find the last invoice for this customer and summarize what changed.” Biggest weakness: context grows fast, cost grows with it. Every turn carries the prior tool results, so the 15th turn costs 10x what the 1st turn cost. Budget accordingly.

Plan-then-Execute. The planner produces a full plan upfront. The executor walks the plan step by step. If a step fails, you re-plan from the failure point rather than from scratch. Better for multi-step tasks with predictable shape (data pipelines, checklist-style tasks). Cheaper because the planner runs once per task, not per step. Easier to audit because the plan is a single artifact you can log and replay. The trade-off: if the task genuinely needs adaptive behavior, a rigid plan breaks on the first surprise.

Reflexion loop. The executor produces a result. A separate evaluator call (same or different model) grades it against criteria. If the grade is low, the executor retries with the evaluator’s feedback appended. Best for tasks where verifying is easier than doing: code generation (does it compile?), summarization against a rubric, structured extraction (does every required field have a non-null value?). Budget the loop: usually one retry is worth it, three is not. After two failed attempts, escalate.

My verdict: default to Plan-then-Execute for scheduled work and ReAct for interactive work. Add Reflexion only on the specific steps where you can auto-grade cheaply. Do not wrap the whole agent in a Reflexion loop; wrap the risky step. The quality bump is concentrated there and the cost stays bounded.
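Wrapping only the risky step can be sketched in a few lines. This is a hypothetical helper, not a library API: `runStep` stands in for your executor call and `grade` for the evaluator call; the score threshold and retry budget are illustrative.

```typescript
// Bounded Reflexion around ONE risky step. `runStep` and `grade` are
// stand-ins for your executor and evaluator model calls.
type Grade = { score: number; feedback: string };

async function withReflexion(
  runStep: (feedback?: string) => Promise<string>,
  grade: (output: string) => Promise<Grade>,
  maxRetries = 1, // one retry is usually worth it; three is not
  passScore = 0.8,
): Promise<string> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const output = await runStep(feedback);
    const result = await grade(output);
    if (result.score >= passScore) return output;
    feedback = result.feedback; // the next attempt sees the evaluator's notes
  }
  throw new Error("reflexion_budget_exhausted"); // escalate, never loop
}
```

The throw at the end is the point: when the budget runs out, the failure is loud and the caller escalates.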

Production must-haves

Every agent I run has all of these. None is optional.

  • Budget cap per run. The Claude Code SDK has --max-budget-usd. Roll your own if you are using the raw SDK: count input + output tokens, multiply by price, halt when the cap is hit. See the Claude Code SDK agents post for the pattern.
  • Iteration limit. A hard max on tool calls per task. I use 10 for most agents, 30 for research. When hit, the loop exits with a specific error, never silently.
  • Timeouts. Per-tool-call timeout (5-30s depending on tool) and per-run timeout (5-15 min). Tool timeout returns a tool_result with is_error: true. Run timeout kills the process.
  • Structured inputs and outputs. Tool use everywhere. No “please return JSON in your response.” Use the extended thinking post as a reference for how to keep the reasoning separate from the structured output.
  • Observability. Every model call, tool call, token count, and outcome goes to a log with a trace ID. Not “printed to stdout.” A file you can grep six months later.
  • Replay mechanism. Given a trace ID, you can reconstruct: the exact prompt, the exact tool responses, the exact decision. If you cannot do this, you cannot debug. If you cannot debug, you cannot improve.

The replay requirement is the one teams skip and regret. Build it on day one. A JSONL file per run is enough. One line per event: timestamp, trace ID, event type (model_call, tool_call, tool_result, error), payload. Gzip the old ones. Store for at least 30 days. The first time a user says “your agent did something weird on Tuesday,” you will be glad.

A practical layout that works: one directory per day, one file per run, named <trace_id>.jsonl. Agent writes as it goes, not just at the end, so a crash does not lose the partial trace. Add a tiny CLI that reads a trace ID and prints a human-readable summary; future you will thank present you.

Error handling patterns that don’t lose data

The default failure mode for agents is “keep going and make it worse.” Four patterns stop that.

Graceful tool failure. When a tool throws, return a tool_result with is_error: true and a short message (“rate limited, retry in 30s”). The model reads it, adapts, often tries a different parameter set. Do not raise through the loop. The loop decides whether the agent can recover.

Bounded retry. Per tool call, three tries max with exponential backoff. After that, the tool_result is a hard error and the model either picks a different tool or halts. No infinite retry. No “just one more try.”
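Both patterns above fit in one small wrapper. A sketch, assuming the loop consumes a `tool_result`-shaped object; the shape here is simplified from the real API payload:

```typescript
// Bounded retry with exponential backoff that always resolves to a
// tool_result the model can read. It never throws through the loop.
type ToolResult = { content: string; is_error?: boolean };

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry(
  fn: () => Promise<string>,
  maxTries = 3,
  baseDelayMs = 1000,
): Promise<ToolResult> {
  let lastError = "unknown error";
  for (let attempt = 0; attempt < maxTries; attempt++) {
    try {
      return { content: await fn() };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      if (attempt < maxTries - 1) {
        await sleep(baseDelayMs * 2 ** attempt); // 1s, 2s, 4s...
      }
    }
  }
  // Hard error as data: the model picks a different tool or halts.
  return { content: `failed after ${maxTries} tries: ${lastError}`, is_error: true };
}
```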

Human escalation. Some tasks cannot be completed autonomously: ambiguous input, missing credentials, policy-sensitive decisions. Give the agent an escalate_to_human(reason, context) tool. For my setup that tool posts to a Telegram thread, attaches the run trace, and waits. The agent is done for now. See the Telegram bot post for the bot side.

Cost kill-switch. If accumulated cost for a run exceeds the cap, the loop halts, logs the overrun, and notifies. It does not “try once more to finish.” This is the most important line of code in the whole system because without it, a runaway agent can burn hundreds of dollars before anyone notices. I set two thresholds: a soft warn at 70% of budget (log it, keep going) and a hard halt at 100% (stop immediately, escalate). The soft warn lets you tune caps based on real data rather than guessing.
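The soft-warn / hard-halt tracker can be sketched like this. The per-token prices are placeholders, not current rates; plug in whatever your model costs:

```typescript
// Budget tracker: soft warn at 70%, hard halt at 100%.
class BudgetTracker {
  private spentUsd = 0;
  private warned = false;

  constructor(
    private readonly capUsd: number,
    private readonly inputPricePerMTok: number,  // placeholder, e.g. 3.0
    private readonly outputPricePerMTok: number, // placeholder, e.g. 15.0
    private readonly onWarn: (spent: number) => void = () => {},
  ) {}

  // Call after every model response with its usage counts.
  record(inputTokens: number, outputTokens: number): void {
    this.spentUsd +=
      (inputTokens / 1_000_000) * this.inputPricePerMTok +
      (outputTokens / 1_000_000) * this.outputPricePerMTok;
    if (!this.warned && this.spentUsd >= 0.7 * this.capUsd) {
      this.warned = true;
      this.onWarn(this.spentUsd); // soft warn: log it, keep going
    }
    if (this.spentUsd >= this.capUsd) {
      // Hard halt: stop immediately, do not "try once more" to finish.
      throw new Error(`budget_exceeded: $${this.spentUsd.toFixed(4)}`);
    }
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```

The loop catches `budget_exceeded`, logs the overrun with the trace ID, and escalates.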

Idempotent run IDs. Assign every run a UUID. Tools that perform side effects accept the run ID as a correlation token. If the agent crashes and retries, the tools can detect a repeat and skip the side effect. Without this, every retry risks duplicate emails, duplicate invoices, duplicate charges. Cheap to build, impossible to retrofit cleanly.
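On the tool side, the guard is a keyed seen-set. A sketch with hypothetical names (`sendInvoiceEmail`, the key format) and an in-memory set standing in for what would be a database table in production:

```typescript
// Tool-side idempotency guard keyed on the run's UUID. In production the
// seen-set lives in a database, not process memory.
const completedSideEffects = new Set<string>();

async function sendInvoiceEmail(
  runId: string,
  invoiceId: string,
  send: () => Promise<void>, // the real side effect, injected for this sketch
): Promise<string> {
  const key = `${runId}:send_invoice_email:${invoiceId}`;
  if (completedSideEffects.has(key)) {
    return "skipped: already sent in this run"; // retry detected, no duplicate
  }
  await send();
  completedSideEffects.add(key);
  return "sent";
}
```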

State management without blowing prompts

The wrong answer is “stuff everything in the conversation history.” The right answer depends on the lifetime of the state.

  • Stateless per request. Default. No memory between invocations. Each run starts fresh. This is what 80% of agents should be. Easier to test, easier to reason about, cheaper.
  • Conversation memory within a session. Append messages across turns. Use prompt caching on the stable prefix (system prompt, tools, early turns) so the cache hit rate stays above 80%. If it does not, something is invalidating the prefix and you are paying full price every turn.
  • External state. Persistent state lives in a database, never in prompt history. The agent queries it with tools: get_user_preferences, get_last_invoice_id. Prompts stay small and bounded. Scale is linear in task complexity, not in user lifetime.
  • Scratchpad vs final output. Keep the model’s internal reasoning, tool results, and half-formed drafts in a scratchpad that never leaves the agent. The user-facing output is a separate, validated payload. This matters for logs too: do not ship scratchpad to end users or to monitoring dashboards. It leaks prompts and confuses readers.

A fifth pattern that gets overlooked: compaction. When conversation history grows past a threshold, summarize the oldest turns into a single compact message, drop the originals, and keep going. Done right, you reset the token count without losing the thread. Done wrong, you lose the one detail the next turn needed. The rule I follow: compact only when a turn has been fully resolved (tool called, result used, decision made). Never compact mid-reasoning.
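Mechanically, compaction is a slice-and-replace. A sketch where `summarize` stands in for a cheap model call (Haiku is a natural fit) and the thresholds are illustrative:

```typescript
// Fold the oldest turns into one summary message once history crosses a
// threshold. The recent turns keep full fidelity.
type Msg = { role: "user" | "assistant"; content: string };

async function compact(
  messages: Msg[],
  summarize: (turns: Msg[]) => Promise<string>,
  maxMessages = 40,
  keepRecent = 10,
): Promise<Msg[]> {
  if (messages.length <= maxMessages) return messages; // nothing to do yet
  const old = messages.slice(0, messages.length - keepRecent);
  const recent = messages.slice(messages.length - keepRecent);
  const summary = await summarize(old);
  return [
    { role: "user", content: `[Summary of earlier turns] ${summary}` },
    ...recent,
  ];
}
```

Call it only at turn boundaries, per the rule above: a resolved turn is safe to fold, a mid-reasoning turn is not.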

Tool design principles

The model cannot read your source code. It reads the tool description and the JSON schema. That is the entire contract.

  • Descriptive names and docs. search_invoices(customer_id, date_range) beats query. The description tells the model when to use it and what to expect back.
  • One concept per tool. Do not overload parameters. A do_thing(action, target, extras) tool fails more often than create_thing, update_thing, delete_thing as three separate tools. The model routes to the right one on its own.
  • Narrow schemas. Required fields required, optionals clearly marked, enums where possible. Every free-form string is a failure waiting.
  • Idempotent where possible. Retries should not create duplicate side effects. Use client-generated IDs or upserts. When a tool cannot be idempotent (sending email), say so in the description so the model retries carefully.
  • Rate-limit aware. Tools handle backoff internally. The agent should never see “429 too many requests”; it should see “rate limited, retry in 5s” and decide whether to wait or try something else.
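Putting the first three principles together, a tool definition in the shape the Messages API expects (name, description, JSON Schema) might look like this. The tool itself is hypothetical:

```typescript
// Illustrative tool definition: descriptive name, a description that says
// when to use it and what comes back, required fields required, enums
// instead of free-form strings.
const searchInvoices = {
  name: "search_invoices",
  description:
    "Search a customer's invoices by date range. Use when the user asks " +
    "about billing history. Returns up to 20 invoices, newest first.",
  input_schema: {
    type: "object",
    properties: {
      customer_id: { type: "string", description: "Internal customer ID" },
      date_range: {
        type: "string",
        enum: ["last_30_days", "last_90_days", "last_year"],
        description: "How far back to search",
      },
      status: {
        type: "string",
        enum: ["paid", "unpaid", "any"],
        description: "Optional filter; defaults to 'any'",
      },
    },
    required: ["customer_id", "date_range"],
  },
};
```

Note what is absent: no free-form `query` string, no catch-all `options` object. Everything the model can pass is enumerated.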

Return errors as data. When a tool fails, the tool_result with is_error: true should contain a structured payload the model can reason about: {"error": "not_found", "suggestion": "try search_invoices with a wider date range"}. Free-form stack traces waste tokens and teach the model nothing. Treat errors like any other output: typed, intentional, useful for the next turn.

Version tools, not prompts. When you change a tool’s schema, bump its version and keep the old definition callable for a deprecation window. Prompt changes are easy; tool contract changes break agents silently. Treat tool schemas as a public API, because for your agent, they are.

More patterns and working code in the tool use guide. The build an MCP server post shows how to package tools as a reusable server once you have more than a handful.

RAG inside agents

Retrieval shows up two ways. Pick deliberately.

Retrieval as a tool. search_docs(query) returns the top N chunks. The model decides when to call it. Good when the agent handles varied queries and you do not know in advance which docs are relevant. Costs: extra tool call round trip per retrieval, unpredictable chunk count.

Retrieval as context injection. You run retrieval before the model call, prepend the results to the system prompt, and cache the result. Good when the relevant context is predictable (onboarding agent, product support with a small FAQ). Costs: larger prompts, lower cache hit rate if retrieval results change often.

My default: tool-based for open-ended agents, inject for narrow agents with a known corpus. You can mix them. Full pipeline details in the RAG pipeline tutorial and the vector DB comparison.
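The injection variant can be sketched as a request builder. This assumes the Messages API shape where `system` can be a list of text blocks and `cache_control` marks the end of the stable, cached prefix; `retrieve` stands in for your retriever:

```typescript
// Retrieval-as-context-injection: retrieve before the model call, keep the
// stable instructions in a cached prefix, append the chunks after it.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

async function buildInjectedRequest(
  userMessage: string,
  retrieve: (query: string) => Promise<string[]>,
) {
  const chunks = await retrieve(userMessage);
  const system: SystemBlock[] = [
    {
      type: "text",
      text: "You answer product questions from the provided docs only.",
      cache_control: { type: "ephemeral" }, // end of the stable, cached prefix
    },
    {
      // Chunks change per request, so they sit after the cache marker:
      // only this block misses the cache.
      type: "text",
      text: chunks.map((c, i) => `[doc ${i + 1}]\n${c}`).join("\n\n"),
    },
  ];
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system,
    messages: [{ role: "user" as const, content: userMessage }],
  };
}
```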

Do not mistake RAG for memory. RAG retrieves from a corpus the agent was designed to search. Memory is “what has this agent learned about this user across sessions.” Different problem, different pattern.

Memory patterns that are actually useful

Most “AI agent memory” products are a vector database plus retrieval, dressed up. Decide whether you actually need memory before adding it. If you can solve the problem with external state + tools, do that instead.

Three useful memory types:

  • Episodic. What happened in this session. Lives in conversation history, bounded by a token budget.
  • Semantic. Facts the agent learned about the user or the world, persisted across sessions. Lives in a database. The agent reads it at session start via a tool, writes updates via a tool. Prompts do not grow unbounded.
  • Procedural. How-to patterns the agent uses. Almost always scripted, not learned. Live as prompt fragments or examples in the codebase, version controlled.

Default to no memory. Add episodic when the agent needs to reference earlier turns. Add semantic only when there is a concrete user-facing reason (“agent should remember my timezone across sessions”). Procedural is code, not memory. Treat it like code.

Single-agent vs multi-agent

Most problems are single-agent with specialized tools. Multi-agent is overused because it is more interesting to build. Resist.

Multi-agent is worth it when one of these is true:

  • Different models suit different subtasks. Haiku routes, Sonnet plans, Opus handles the reasoning-heavy core. This is really the Router-Planner-Executor split, which most people already do inside one logical agent.
  • Parallelism is worth the complexity. Three independent research queries in parallel, then a synthesis step. The coordination cost is paid for by the wall clock savings.
  • Clear handoff contracts. Agent A produces output X in a schema, Agent B consumes X. If the contract is fuzzy, you do not have two agents; you have one agent with a communication bug.

My take: default single-agent until you have a concrete reason. When I say I run “10 agents,” I mean 10 separate bash scripts, each doing one narrow job, each single-agent inside. That is not a multi-agent system. That is 10 small systems. The difference matters.

Testing agents you can actually ship

If your only test is “I ran it once and it worked,” you do not have tests.

  • Golden tasks. A curated set of 10-50 realistic inputs with expected behavior. Run on every commit that touches prompt or tool code. These are not unit tests of the model; they are unit tests of your agent’s decisions.
  • Snapshot assertions on structured output. Compare JSON, not prose. Normalize whitespace. Allow known-variable fields (timestamps, IDs) to vary. Assert the shape and the decision-carrying fields.
  • Cost budget assertions. Each golden task has a max cost. If a prompt change pushes costs over, the test fails. You notice before the bill does.
  • Offline eval with fixtures. Mock tool responses. Assert the agent picked the right tools in the right order. This catches “agent stopped calling search_docs after the refactor” bugs that eyeball testing misses.
  • Production shadow traffic. Real traffic, real outputs, labeled over time. A sample of production runs gets human-reviewed weekly. Trends in outcome quality inform prompt and tool changes.
  • Regression suite on model upgrades. When a new model version ships, run the full golden task set on both old and new. Compare outputs, costs, and latency side by side. Upgrade when the new model wins on the metrics you care about, not because it is newer. Sometimes the old model is better for your specific task.

Cost assertions are the test type most teams skip, and they pay for it. One bad prompt change can triple your bill in a week and look fine in output quality. The test fails; you fix it; you move on. No all-hands postmortem needed.
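A golden-task runner that checks both the decision and the cost fits in one function. A sketch with hypothetical shapes: `runAgent` stands in for your agent entry point, which is assumed to report the capability it picked and the measured cost of the run:

```typescript
// Golden tasks with cost assertions. Returns failures instead of throwing,
// so it slots into any test runner.
type AgentRun = { capability: string; costUsd: number };
type GoldenTask = { input: string; expectedCapability: string; maxCostUsd: number };

async function runGoldenTasks(
  runAgent: (input: string) => Promise<AgentRun>,
  tasks: GoldenTask[],
): Promise<string[]> {
  const failures: string[] = [];
  for (const task of tasks) {
    const result = await runAgent(task.input);
    // Assert the decision-carrying field, not the prose.
    if (result.capability !== task.expectedCapability) {
      failures.push(
        `${task.input}: picked ${result.capability}, expected ${task.expectedCapability}`,
      );
    }
    // Cost regression fails the build before it shows up on the bill.
    if (result.costUsd > task.maxCostUsd) {
      failures.push(`${task.input}: cost $${result.costUsd} over cap $${task.maxCostUsd}`);
    }
  }
  return failures;
}
```

In offline mode, `runAgent` is the agent with mocked tools; in shadow mode, it replays recorded production traces.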

Deployment patterns

How the agent actually runs in production. Four common shapes.

  • Claude Code CLI headless. claude -p "task" in a bash script, scheduled via cron or systemd timer. Minimal infra, full MCP tool access. This is what I use for 10 production agents. Details in the Claude Code SDK agents post.
  • SDK embedded in an app. Node or Python service imports the SDK, exposes an API. Good when the agent is part of a user-facing product. You own the server, the auth, the rate limits.
  • Worker queue. Tasks hit a queue (Redis, SQS, RabbitMQ). Workers pick up, run the agent, write results back. Good when agent runs are long and you need to scale horizontally.
  • Webhook-triggered. An API endpoint receives an event, fires the agent, returns async. Good for integrations (GitHub events, Telegram messages, form submissions).

Pick the smallest shape that fits. A cron script is a deployment. You do not need Kubernetes for an agent that runs four times a day.

The boring architecture that ships

What I actually run, as a reference point, not a recommendation to copy exactly.

  • 10 agents. Each is a bash script in /home/user/ calling claude -p --model opus with a specific task prompt.
  • Each script is scheduled by a systemd timer with OnCalendar= in Madrid time. Timers handle DST correctly; cron does not.
  • Each script logs to /var/log/<agent>.log. On success it appends a marker. On failure it exits non-zero.
  • Each script ends with a telegram-notify.sh "message" call, so I see outcomes in Telegram, not by reading logs.
  • All of it runs on one Debian VPS. No Kubernetes, no Docker, no queue. Just systemd, bash, and the Claude API.

That is the architecture. It processes my morning briefing, weekly planning, job screens, content follow-ups, and more. Nothing here would impress a platform engineer. Everything here ships reliably, costs under $50 a month, and I can debug any failure from a terminal in two minutes.
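For reference, the timer half of that setup is a pair of small unit files. Names, paths, and the schedule here are illustrative; the timezone suffix on OnCalendar is what makes the DST handling work:

```ini
# /etc/systemd/system/morning-brief.service  (illustrative name and path)
[Unit]
Description=Morning briefing agent

[Service]
Type=oneshot
ExecStart=/home/user/morning-brief.sh

# /etc/systemd/system/morning-brief.timer
[Unit]
Description=Run the morning briefing agent daily

[Timer]
OnCalendar=*-*-* 07:30:00 Europe/Madrid
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with systemctl enable --now morning-brief.timer. Persistent=true runs a missed job at next boot instead of silently skipping it.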

The lesson is not “copy this setup.” The lesson is: every component above is there because an earlier, fancier version failed and got replaced with something boring. Your architecture should evolve the same way.

Anti-patterns I see most

  • Building a framework first. One agent, then a second, then a third. After three, extract the shared pieces. Not before. Frameworks written before agents exist solve problems that do not.
  • Over-engineering the tool catalog. Most production agents need 5 to 10 tools. I see 40-tool catalogs where 30 tools are unused. The model gets worse at picking from a long list. Cut.
  • Depending on a specific model version forever. Pin the model in code, but plan for upgrades. New versions get cheaper, smarter, or both, every few months. Budget a quarter for migration work. Do not let a three-year-old pin be your architecture.
  • Ignoring the non-happy path. 80% of production agent work is edge cases, retries, escalations, and cost containment. If your design doc is all happy path, you have not designed the agent. You have described the demo.
  • Testing in production only. Build offline evals early. It is far cheaper to find a regression in a golden task than in a user complaint.

Where to go next

This is the hub. Follow the threads that apply.

Production agents are not hard because the AI is hard. They are hard because the surrounding system has to be boring, bounded, observable, and recoverable. Build that system first. The model is the easy part.

Download the AI Automation Checklist (PDF)

Free 2-page PDF. No spam. No newsletter. No sharing. Just the checklist.