How to Choose an LLM for Production: 7 Criteria That Matter

April 17, 2026 · 13 min read · llm, production, decision-guide, ai-architecture

Most teams pick an LLM for production the wrong way. They read a leaderboard, pick the top model, and wire it into an endpoint. Six weeks later they hit a rate limit during a traffic spike, or a compliance reviewer asks where EU data is processed, or the p99 latency kills a user-facing flow. Then the real selection work starts, under pressure, in production.

This guide shows how to choose an LLM for production the right way, before any of that happens. I run AI agents and LLM-backed automations for DACH clients, and every production deployment I’ve shipped went through the same seven-criteria filter. The order matters. Skip one and you will find out later, usually on a weekend.

The short version: production LLM requirements are not about benchmarks. They are about compliance, latency, context, cost, reliability, and vendor risk, evaluated against your specific workload. Below is the framework, a decision tree, an evaluation process, and a checklist you can run before you commit.

The 7 criteria that matter for production

These are the LLM selection criteria I use, in priority order. Priority matters because some filters are absolute. If data residency rules out a provider, no amount of benchmark score saves it.

  1. Data residency and compliance
  2. Workload type (chat, extraction, reasoning, agent loop, multimodal)
  3. Latency SLA (p50, p95, p99)
  4. Context length required (realistic, not theoretical)
  5. Cost ceiling at projected volume
  6. Tool use and function calling reliability
  7. Vendor risk and multi-provider strategy

Work through them in order. Each criterion either eliminates providers or narrows the shortlist. By the end you should have two or three real candidates to test, not twelve.

Criterion 1: Data residency and compliance

This is the first filter because it is the cheapest to get wrong and the most expensive to fix.

What to measure. Where is the inference endpoint physically hosted? Where are logs stored? What is the default training-data opt-in status? Is there a signed Data Processing Agreement available? Is there a HIPAA BAA if you touch health data? Does the provider have a current SOC 2 Type II report? For EU workloads, is there an EU-resident region with no transatlantic processing?

Red flags. “We can sign a DPA” without a ready template. Logs stored in a different region than inference. Training on customer data by default with opt-out buried in account settings. No published sub-processor list. No region selection at all (your request can land anywhere).

Real tradeoffs. If you need EU data residency, your shortlist is roughly: Anthropic on AWS Bedrock in EU regions, Azure OpenAI in EU regions, Mistral hosted in France, or self-hosted open models on EU infrastructure like Hetzner or Scaleway. OpenAI’s direct API and Google Gemini’s direct API have improved, but verify the current region options against your compliance team’s actual requirements, not marketing pages. For enterprise use cases with strict DACH regulatory exposure, this filter alone collapses most of the market.

Criterion 2: Workload fit

Not every model is good at every task. Leaderboards average across tasks and hide this.

What to measure. Classify your workload honestly. Is it a chat interface where the model answers freeform questions? A structured extraction pipeline pulling fields from documents? A reasoning task like code generation or multi-step planning? An agent loop where the model calls tools and decides what to do next? A multimodal task involving images, audio, or document layout?

Red flags. Picking a general chat model for a pure extraction workload and paying for capabilities you do not use. Picking a cheap model for agent loops and watching it fail at tool selection. Assuming multimodal performance from a text benchmark.

Real tradeoffs. For structured extraction, smaller and cheaper models with good tool use (Haiku 4.5, GPT-4o-mini, Gemini Flash) are usually the right choice. For agent loops that need reliable tool calling and planning, Claude Sonnet 4.6 or Opus 4.7 with extended thinking is what I reach for. For high-volume chat, Sonnet-tier or Gemini Pro tier balances quality and cost. For audio output or image generation inside one call, OpenAI currently leads. Pick the model that matches the workload, not the one that wins on MMLU.

I have shipped extraction pipelines on Haiku that beat Opus on the same task once the prompt was tuned. I have also seen teams default to the flagship tier for a classification job and pay 20x for no accuracy improvement. Workload fit is almost always the right first optimization before you look at cost.

Criterion 3: Latency SLA

Latency is the criterion teams most often measure wrong. They check p50 on a quiet afternoon and ship.

What to measure. p50, p95, and p99 latency under realistic concurrent load, measured from your application’s network location, with your actual prompt sizes and output lengths. Time to first token (TTFT) matters for streaming UIs. Total time matters for batch jobs.

Red flags. Measuring only p50. Measuring from your laptop instead of your production region. Testing with a 100-token prompt when production sends 4,000 tokens. Ignoring the tail, where the 99th percentile can be four to ten times the median.

Real tradeoffs. Smaller models are faster. Streaming hides latency for chat. For a user-facing sync call, I target p95 under 2 seconds, and that usually forces me to Haiku-tier or Flash-tier with careful prompt sizing. For a background agent, p95 of 20 seconds is fine. Write the SLO down before you pick. Extended thinking adds seconds but can replace multiple round trips, which on net is often faster.
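The measurement loop is simple enough to sketch. This is a minimal harness for collecting p50/p95/p99 under concurrent load; `fake_call` is a stand-in you would replace with your real provider client, production-sized prompts, and production network location:

```python
import math
import random
import time
from concurrent.futures import ThreadPoolExecutor

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p percent of the data."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

def measure(call, n_requests=200, concurrency=20):
    """Fire n_requests at the given concurrency and collect wall-clock latency."""
    def timed(_):
        start = time.perf_counter()
        call()  # your real request, with your real prompt sizes
        return time.perf_counter() - start
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, range(n_requests)))
    return {p: percentile(latencies, p) for p in (50, 95, 99)}

# Stand-in for a real API call; replace with your provider SDK.
fake_call = lambda: time.sleep(random.uniform(0.001, 0.01))
stats = measure(fake_call, n_requests=50, concurrency=10)
print(stats)
```

Run it from the same region your application runs in, at your actual peak concurrency, or the numbers mean nothing.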

Criterion 4: Context length

Everyone quotes the advertised context window. Almost no one measures what the model actually does at that length.

What to measure. The realistic context length your workload needs, padded by 20 percent for growth. Then, separately, how well each candidate performs at that length. Needle-in-a-haystack retrieval is a floor, not a ceiling. Test on your real prompts with your real instructions placed at the top, middle, and end.

Red flags. Relying on a 1M or 2M context window claim without testing. Stuffing full conversation histories into every call because the window allows it. Ignoring the cost multiplier on long inputs at high volume.

Real tradeoffs. If you genuinely need 1M+ tokens at scale and quality, Gemini’s long context is the default answer today. For 200K with strong recall, Claude Sonnet and Opus are reliable. For most real workloads, you do not need the full window. You need good retrieval, prompt caching on the stable prefix, and discipline about what you send. Prompt caching turns long system prompts from a cost problem into a warm-path speedup. I cache aggressively and keep working context short.
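Position testing is easy to automate. A sketch of a probe that plants a known fact at the top, middle, and end of filler context and scores recall; `ask_model` is your provider call, stubbed here with a toy model that only "sees" the tail of the prompt to illustrate position bias:

```python
def build_probe(filler_docs, needle, position):
    """Insert a known fact (the needle) at the top, middle, or end of the context."""
    docs = list(filler_docs)
    slot = {"top": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(slot, needle)
    return "\n\n".join(docs)

def recall_at_positions(ask_model, filler_docs, needle, expected):
    """Score whether the model recovers the needle from each position.
    ask_model(prompt) -> str is your real provider call."""
    scores = {}
    for position in ("top", "middle", "end"):
        prompt = build_probe(filler_docs, needle, position)
        answer = ask_model(prompt + "\n\nQuestion: what is the invoice number?")
        scores[position] = expected.lower() in answer.lower()
    return scores

# Toy stub that only attends to the last 200 characters, to show tail bias.
stub = lambda prompt: prompt[-200:]
filler = [f"Unrelated paragraph {i}. " * 5 for i in range(50)]
print(recall_at_positions(stub, filler, "Invoice number: INV-4711.", "INV-4711"))
```

Swap the stub for each candidate model at your realistic context length and the position profile falls out of the scores.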

Criterion 5: Cost at scale

Cost is criterion five, not criterion one, because premature cost optimization picks the wrong model for the task and then you pay for a second migration.

What to measure. Cost per request at your realistic input and output token counts, not per million tokens in the abstract. Multiply by projected monthly volume. Add the hidden costs: cache writes, tool-use round trips, retries on failure, fallback calls to a second provider. Compare against a ceiling you set before you start shopping.

Red flags. Comparing only per-million-token input prices. Forgetting that output tokens cost 3 to 5 times input. Not factoring in retries: a 5 percent error rate with a two-attempt policy means at least 1.05x cost. Ignoring that committed-use discounts change the math at scale.

Real tradeoffs. For high-volume, cost-sensitive workloads where a small model can meet the accuracy bar, Haiku 4.5, GPT-4o-mini, and Gemini Flash sit in the same value zone. Use prompt caching on any stable system prompt over roughly 1,000 tokens. For a quantitative walkthrough of current token pricing across providers, see my LLM API cost comparison. For the structural differences in how each provider bills (cache, tools, batch), see LLM API comparison.
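The arithmetic above fits in one function. A back-of-envelope cost model with the retry and cache terms made explicit; all prices and the cache discount are placeholder assumptions, not quotes, so plug in current numbers from the provider’s pricing page:

```python
def monthly_cost(
    requests_per_month,
    input_tokens,             # realistic tokens per request, not a toy prompt
    output_tokens,
    price_in_per_mtok,        # USD per million input tokens (placeholder)
    price_out_per_mtok,       # USD per million output tokens (placeholder)
    cached_fraction=0.0,      # share of input served from the prompt cache
    cache_discount=0.9,       # assumed cache-read discount; verify per provider
    retry_rate=0.05,          # expected fraction of requests retried once
):
    """Back-of-envelope monthly spend at projected volume."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    per_request = (
        fresh / 1e6 * price_in_per_mtok
        + cached / 1e6 * price_in_per_mtok * (1 - cache_discount)
        + output_tokens / 1e6 * price_out_per_mtok
    )
    return per_request * requests_per_month * (1 + retry_rate)

# 1.5M requests/month, 4K in / 500 out, with 75% of input cache-served.
estimate = monthly_cost(1_500_000, 4000, 500, 1.0, 5.0, cached_fraction=0.75)
print(round(estimate, 2))
```

Run it once per candidate at your projected volume and compare against the ceiling you set before shopping.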

Criterion 6: Tool use reliability

If your workload involves function calling, this criterion moves up the list. Tool use is where models silently differ most.

What to measure. Success rate at selecting the correct tool. Success rate at filling required arguments. Behavior when arguments are ambiguous. Recovery after a tool error. Parallel tool call support when you need it. Schema adherence, meaning the model returns valid JSON that parses on the first attempt, not after a retry loop.

Red flags. Hallucinated tool names. Invented arguments. Silent schema drift where the model returns user_id as a string when the schema says integer. Over-calling tools in a loop. Under-calling, where the model answers from parametric memory instead of using the tool you gave it.

Real tradeoffs. Claude’s tool use is the most reliable I have shipped, especially combined with extended thinking on harder agent decisions. GPT-4 class models are solid. Gemini has improved. Smaller models across providers are weaker at tool use, so if your pipeline is agentic, do not default to the cheapest tier without testing. For production patterns on Claude specifically, the migrate OpenAI to Claude guide covers the tool-use translation layer.
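Most of the red flags above are mechanically checkable. A minimal validator for model-emitted tool calls that catches hallucinated tool names, missing arguments, schema drift, and invented arguments; the `get_invoice` tool and its schema are illustrative, not any real API:

```python
TOOLS = {
    "get_invoice": {
        "required": {"invoice_id": int},
        "optional": {"include_lines": bool},
    }
}

def validate_tool_call(call):
    """Return a list of problems with a tool call; empty list means valid."""
    problems = []
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return [f"hallucinated tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    for arg, typ in spec["required"].items():
        if arg not in args:
            problems.append(f"missing required argument: {arg}")
        elif not isinstance(args[arg], typ):
            problems.append(
                f"schema drift: {arg} is {type(args[arg]).__name__}, "
                f"expected {typ.__name__}"
            )
    known = set(spec["required"]) | set(spec["optional"])
    for arg in args:
        if arg not in known:
            problems.append(f"invented argument: {arg}")
    return problems

# The classic silent failure: right tool, right field, wrong type.
print(validate_tool_call({"name": "get_invoice", "arguments": {"invoice_id": "42"}}))
```

Run every candidate’s tool calls through a validator like this during evaluation and count first-attempt pass rates; retry loops hide the difference between providers.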

Criterion 7: Vendor risk

Single-provider production is a decision, not a default. Make it deliberately.

What to measure. Provider uptime history and incident transparency. Rate limit headroom at your peak. Pricing stability (how often and by how much the provider has changed prices in the last 24 months). Geographic concentration. Your switching cost if they double the price or deprecate the model.

Red flags. No status page or a status page that stays green during known outages. Rate limits that cannot be raised above your peak. A single model family with no equivalent elsewhere. Lock-in through proprietary features you cannot replicate.

Real tradeoffs. For anything business-critical I build with a failover provider from day one, even if 99 percent of traffic goes to the primary. The two candidates share a common interface layer in my code so switching is a config change, not a rewrite. For workloads where the model output itself is the product, I consider a self-hosted LLM as a strategic fallback, even if day-to-day I use an API.

The vendor risk question is not “will this provider go down.” They all do, sometimes for hours. The question is “what does my system do when they do.” If the answer is “returns errors to users,” vendor risk is your bottleneck, not your model quality. Two providers behind an interface layer, one prompt format, one monitoring surface. That is the pattern I run in every production deployment I ship.
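The interface-layer pattern can be sketched in a few lines. The provider callables here are stand-ins for real SDK clients; in production each one wraps a vendor's API behind the same signature, so switching the primary is a config change:

```python
class ProviderError(Exception):
    pass

class LLMClient:
    """One interface, two providers, one retry policy."""
    def __init__(self, primary, fallback, max_attempts=2):
        self.providers = [primary, fallback]
        self.max_attempts = max_attempts

    def complete(self, prompt):
        last_error = None
        for provider in self.providers:
            for _ in range(self.max_attempts):
                try:
                    return provider(prompt)
                except ProviderError as exc:
                    last_error = exc  # log here, retry, then fail over
        raise ProviderError(f"all providers exhausted: {last_error}")

# Stubs standing in for two real vendor clients.
def flaky_primary(prompt):
    raise ProviderError("rate limited")

def steady_fallback(prompt):
    return f"echo: {prompt}"

client = LLMClient(flaky_primary, steady_fallback)
print(client.complete("ping"))  # primary fails, fallback answers
```

The real version also needs a prompt-format translation per provider and a shared monitoring surface, but the shape is the same.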

A decision tree you can follow today

This is the short path from workload to provider shortlist. Run the top filter first, then the rest in order.

  • EU data residency required? > Anthropic via Bedrock EU, Azure OpenAI EU region, Mistral EU, or self-hosted open model
  • Multimodal output (audio generation, image generation in-call)? > OpenAI
  • 1M+ tokens of context at scale with quality? > Gemini
  • Agentic reasoning critical, many tool calls, multi-step plans? > Claude Sonnet 4.6 or Opus 4.7 with extended thinking
  • Cost-sensitive high-volume extraction or classification? > Haiku 4.5, GPT-4o-mini, or Gemini Flash
  • Experimental, research, or strict IP control? > Mistral, DeepSeek, Llama, or another open model self-hosted

This is a starting shortlist, not a verdict. The next step is the evaluation process.

The evaluation process

How to actually pick, once the shortlist exists. This is the LLM evaluation checklist I run on every serious production choice.

Step 1: Define the task and the “good enough” bar. Write down what success means in one sentence. Example: “Extract the six fields from an invoice with 97 percent field-level accuracy, p95 latency under 3 seconds, at 50,000 invoices per day.” If you cannot write this sentence, you are not ready to pick a model.

Step 2: Build a test set. 100 to 500 realistic examples, labeled by a human. Cover the common case, the edge cases, and the adversarial ones (corrupted input, wrong language, missing fields). This is the most undervalued step. Teams that skip it end up arguing about vibes in production.

Step 3: Run each candidate. For each shortlisted model, run the full test set. Track three numbers per request: accuracy against your label, latency, and cost. Automate it so you can rerun when a new model drops.

Step 4: Shortlist to two or three. Anyone below the “good enough” accuracy bar is out. Among the remainder, pick the two or three that balance accuracy, latency, and cost against your priorities.

Step 5: Production pilot. Ship the top candidate to 1 to 5 percent of real traffic for two to four weeks. Log every request. Watch for distribution shift, where the test set was cleaner than production reality.

Step 6: Monitor the right metrics. Accuracy drift (sample production outputs, label them, compare to the test-set baseline). p95 latency by region. Cost per inference end to end. Failure modes (timeout, rate limit, schema error, refusal).

Step 7: Commit, or add a failover. If the pilot holds, commit the primary. If it is close between two providers, ship both behind a feature flag and split traffic. Either way, the failover path is already plumbed by now.
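Steps 3 and 4 are worth automating so you can rerun them when a new model drops. A sketch of the harness: `model_fn` and `cost_fn` are stand-ins for your provider call and your pricing math, and the exact-match accuracy check is a simplification (real extraction tasks usually score field by field):

```python
import time

def evaluate(model_fn, cost_fn, test_set, accuracy_bar=0.97):
    """Run one candidate over a labeled test set; track the three numbers."""
    correct, latencies, cost = 0, [], 0.0
    for example in test_set:
        start = time.perf_counter()
        output = model_fn(example["input"])
        latencies.append(time.perf_counter() - start)
        cost += cost_fn(example["input"], output)
        correct += int(output == example["label"])
    accuracy = correct / len(test_set)
    return {
        "accuracy": accuracy,
        "meets_bar": accuracy >= accuracy_bar,   # step 4: below the bar is out
        "p95_latency": sorted(latencies)[int(0.95 * len(latencies))],
        "total_cost": cost,
    }

# Stub candidate over a toy labeled test set.
test_set = [{"input": f"doc {i}", "label": f"field {i}"} for i in range(20)]
stub_model = lambda text: text.replace("doc", "field")
report = evaluate(stub_model, lambda inp, out: 0.001, test_set)
print(report["accuracy"], report["meets_bar"])
```

Run the same harness per shortlisted model and the comparison table writes itself.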

Common mistakes I see

  • Choosing based on leaderboards instead of your workload. A model can win MMLU and lose your extraction task. Only your test set tells the truth.
  • Not measuring p95 and p99. p50 hides the tail. The tail is what wakes your on-call engineer.
  • Ignoring rate limits. They bite at scale, often during a launch. Size for peak plus 2x buffer and verify the provider can actually grant it.
  • Not building a failover from day one. Failover added post-incident is twice the work. An interface layer and a second provider from the start cost you a day.
  • Picking the cheapest model and hoping. Always size against accuracy first, then optimize cost against your “good enough” bar. The reverse order wastes a migration.
  • Stuffing context because the window allows it. Long prompts are slow and expensive. Retrieve what matters, cache the prefix, keep working context lean.
  • Treating tool use as a commodity. Reliability differs across providers and tiers. Test it explicitly if your workload is agentic. See production AI agent architecture for the full pattern.

Production requirements checklist

Run this before you declare a workload production-ready. These are the production LLM requirements I verify on every deployment.

  • Data Processing Agreement signed and filed
  • Region selected and verified in the request response headers, not just the console
  • Training-data opt-out confirmed for all API keys in use
  • Rate limits sized for peak traffic plus 2x buffer, confirmed in writing with the provider
  • Failover provider configured with an interface layer, tested end to end at least once
  • Prompt versioning in place (every prompt hashed and logged with the response)
  • Cost monitoring per endpoint and per team, with budget alerts
  • Latency SLO defined (p50, p95, p99 targets) with alerting on breach
  • Prompt injection defenses at the boundary (input filtering, output validation, tool-call sandboxing)
  • PII redaction before the request leaves your perimeter, where applicable
  • Audit logs of every call, retained per your compliance requirements
  • On-call runbook for model outages, rate-limit hits, and quality regressions

If any of these is blank, that is your next piece of work.
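The prompt-versioning item from the checklist is one of the cheapest to implement. A sketch of the idea: hash the exact prompt text and log the hash with every response, so a quality regression is traceable to an exact prompt version (`log_call` and its record shape are illustrative; wire this into your real logger):

```python
import hashlib
import time

def log_call(prompt, response, log):
    """Attach a stable content hash to every prompt so identical prompt
    versions share an identifier and edits are visible in the logs."""
    record = {
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response": response,
        "ts": time.time(),
    }
    log.append(record)
    return record["prompt_sha256"]

log = []
v1 = log_call("Extract the six fields from the invoice.", "ok", log)
v2 = log_call("Extract the six fields from the invoice.", "ok", log)
print(v1 == v2)  # same prompt text -> same version hash
```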

When to revisit

The right LLM choice is not permanent. Revisit when any of these fires:

  • Every 6 months by default. Models improve fast. A re-run of your test set against the current top three takes a day and occasionally uncovers a 2x cost or quality shift.
  • When pricing changes significantly. A 30 percent price cut on a competitor, or a new cached-input tier, can change the answer. So can a sudden price increase on your primary.
  • When you cross a volume threshold where committed-use or enterprise agreements apply. The math at 1 million calls per month is different from the math at 10 million. Negotiate.
  • When a new model family opens a capability you could not use before. Longer context that actually works, native audio output, reliable tool use at a cheaper tier. Re-scope what is possible, not just what is cheaper.
  • When a compliance requirement shifts. New jurisdiction, new customer segment, new regulation. Data residency pulls the decision tree back to criterion one.

The process above, run once, gives you the framework. Run it every six months and you keep the production choice honest instead of inherited.
