LLM API Cost Comparison 2026: Framework, Not a Stale Table
Every llm api cost comparison I see online has the same problem: it goes stale in two weeks. One provider drops a new tier, another halves its output price, a reasoning model ships at triple the cost. By the time the post ranks on Google, the numbers are wrong and the rankings are meaningless.
So this piece is not a table you check once. It is the framework I use to model llm api pricing for my own production workloads, plus a snapshot of list prices as of April 2026, plus four realistic scenarios run through that framework. The scenarios are the point. Plug your own traffic into them, change the model, get a defensible monthly cost number.
A quick verdict before we dig in: for most DACH and EU teams shipping production AI work in 2026, the cost-effective default stack is Haiku 4.5 for classification and extraction, Sonnet 4.6 for agentic workloads with tool use, and Opus 4.7 selectively when reasoning quality justifies the premium. GPT-4o-mini, Gemini Flash, and DeepSeek V3 compete hard on the low end. o1/o3 and Claude extended thinking are worth the money only when a wrong intermediate step cascades into real damage.
Current list prices (2026-04)
Snapshot only. Verify on the provider’s pricing page before you commit anything to a budget. These shift every quarter.
| Model | Input ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|
| Claude Opus 4.7 | $15.00 | $75.00 | Frontier reasoning, 5x output premium |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Workhorse, tool use default |
| Claude Haiku 4.5 | $0.80 | $4.00 | Fast, cheap, strong for extraction |
| OpenAI GPT-4o | $2.50 | $10.00 | Similar tier to Sonnet |
| OpenAI GPT-4o-mini | $0.15 | $0.60 | Cheapest quality tier from OpenAI |
| OpenAI o1 / o3 | ~$15.00 | ~$60.00 | Reasoning, thinking tokens billed as output |
| Gemini 1.5 Pro | ~$1.25 | ~$5.00 | Under 128k context, doubles over it |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 | Aggressively priced, weaker on nuance |
| Mistral Large | ~$2.00 | ~$6.00 | EU-hosted option |
| Mistral Small | ~$0.20 | ~$0.60 | Good self-host candidate |
| DeepSeek V3 | ~$0.27 | ~$1.10 | Strong for the price, hosting tradeoffs |
Two things to notice before you move on. First, output is almost always 4x to 5x input on the same model. That ratio is the single biggest lever on your monthly bill. Second, the spread between the cheapest and most expensive tier is about 200x. That means model selection dominates every other optimization you can do.
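To see how hard that output ratio bites, price a single unremarkable call at the Sonnet 4.6 list rates above:

```python
# One Sonnet 4.6 call at list price: 5,000 input tokens, 1,000 output tokens.
input_cost = 5_000 * 3.00 / 1_000_000     # $0.015
output_cost = 1_000 * 15.00 / 1_000_000   # $0.015
total = input_cost + output_cost          # $0.03
# Output is a fifth of the tokens but half the cost of the call.
```

A fifth of the tokens, half of the bill. That is the shape of every invoice I have audited.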
For a deeper tradeoff breakdown on the two main providers I use day to day, see Claude API vs OpenAI for business automation. If you are already on OpenAI and sizing a move, migrate OpenAI to Claude covers the practical steps.
The cost you don’t see in the table
The list price is the headline. The bill is shaped by everything around it.
Batch API discounts. Most providers offer 50% off if you can tolerate a turnaround of up to 24 hours. Document processing, embeddings, nightly summarization, evaluation runs: all candidates. If half your volume is async, you just halved half your bill.
Enterprise commitments. At 5M+ tokens per day you can negotiate 10% to 30% off in exchange for a committed spend. It locks you to a provider, and provider lock-in is real. I would not recommend it before your workload has stabilized for at least a quarter.
Rate limits. Frontier models come with tight rate limits on lower tiers. If you need higher throughput, you pay for a higher tier, which can mean a monthly minimum that shows up nowhere on the llm token cost page.
Error retries. Failed calls still cost. Timeouts, malformed JSON that your parser rejects, tool calls that loop, 5xxs on the provider side. I multiply every estimate by 1.1x as a safety factor and I still miss sometimes.
Self-hosted. GPU instance cost plus staff time plus observability plus the occasional 3 AM on-call. Worth it at very high volume or when data residency forbids sending tokens to a US provider. Not worth it for a weekend project. I break this down in Hetzner vs AWS for AI workloads.
Tokenizer quirks. Different tokenizers count the same text differently. GPT-4o’s tokenizer is roughly 15% more efficient than Claude’s for English prose. For German, code, and long structured data the gap narrows and sometimes reverses. A naive “we switched to provider X and saved 20%” number is often 5% actual savings plus 15% tokenizer difference.
Prompt caching changes the math
This is where the cheapest llm api 2026 headlines lose the plot. Caching restructures the bill.
Anthropic. Cache writes cost 1.25x the normal input rate. Cache reads cost 0.1x. The default TTL is 5 minutes, with a longer option available. For any workload with a stable system prompt, a fixed tool schema, or a repeated document prefix, caching drops input cost by roughly 90% on the cached portion after the first write. I wrote up the production pattern in Claude API prompt caching.
OpenAI. Automatic 50% discount on cached input, no code change required. Less control, less saving, but zero configuration overhead.
Gemini. Context caching exists with its own pricing structure. Discount is smaller than Anthropic’s and the setup is more involved.
If your workload has a 20k token prefix that repeats across every call (a long system prompt, a tool catalog, a reference document), and you make more than a handful of calls per minute, Anthropic caching can be the deciding factor. I have seen agentic workloads where caching takes Sonnet from 20% more expensive than GPT-4o to 40% cheaper.
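Here is roughly what the Anthropic pattern looks like in practice. A minimal sketch, assuming the current SDK shape; the model id is a placeholder, and the short prefix here stands in for your real 20k token system prompt or tool catalog:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_PREFIX = "You are a support triage assistant. ..."  # the long prefix that never changes

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_PREFIX,
            "cache_control": {"type": "ephemeral"},  # cache write at 1.25x once, 0.1x reads afterwards
        }
    ],
    messages=[{"role": "user", "content": "Customer says the invoice totals don't match."}],
)

# The usage block tells you whether the cache actually hit.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

If cache_read_input_tokens stays at zero in production, your prefix is not as stable as you think, and you are paying the 1.25x write rate on every single call.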
Output tokens cost more than you think
The dirty secret of llm api pricing is that output is priced 4x to 5x higher than input on frontier tiers. The reason is simple: output is where the compute happens. Each output token is a full forward pass. Each input token, once caching kicks in, is close to free.
The implication is direct. If your prompt encourages verbose responses, you are paying a premium on every token of filler. I have cut production bills by 30% just by tightening the instructions: “Respond in at most 3 bullet points. No preamble. No summary.” Output tokens are the first thing I look at when a cost regression hits.
When you design a schema for structured output, the same rule applies. A JSON response with 20 keys when you need 5 is not just sloppy, it is expensive.
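Capping output is a one-line change plus one blunt instruction. A sketch with the Anthropic SDK, placeholder model id; the same idea applies on any provider via its maximum output setting:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5",   # placeholder model id
    max_tokens=300,             # hard ceiling: a chatty answer cannot blow past this
    system="Respond in at most 3 bullet points. No preamble. No summary.",
    messages=[{"role": "user", "content": "Why did last night's deployment fail?"}],
)
print(response.usage.output_tokens)  # the number to watch; it is the 4x-5x priced side of the bill
```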
Extended thinking and reasoning: paying for quality
Reasoning models (o1, o3, Claude with extended thinking enabled) generate intermediate thinking tokens that are billed at the output rate. A single call can consume 10k to 50k thinking tokens before it emits the visible answer. At $60 to $75 per million output tokens, that is real money per call.
The math shifts the break-even. A standard Sonnet 4.6 call at 5k input and 1k visible output costs about $0.03. The same task routed through extended thinking with a 20k thinking budget costs about $0.35, more than 10x. Multiply by 1000 daily runs and the choice becomes a budget line item, not a technical detail.
When is reasoning mode worth the premium? Multi-step decisions where one wrong intermediate step cascades. Contract review, medical triage summaries, tax logic, code diagnosis that feeds into an automated action. If an error at step 3 destroys the value of steps 4 through 10, pay for thinking. If each step is independently verifiable and cheap to retry, don’t. The Claude extended thinking post walks through the budget decision in production.
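Enabling thinking is an explicit, budgeted choice, which is exactly how I want a 10x cost multiplier to behave. A sketch assuming the current Anthropic API shape; the model id and the contract-review prompt are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",    # placeholder model id
    max_tokens=24000,           # must leave room for the thinking budget plus the visible answer
    thinking={"type": "enabled", "budget_tokens": 20000},  # thinking tokens are billed at the output rate
    messages=[{"role": "user", "content": "Review this contract clause for conflicting obligations: ..."}],
)

# Thinking lands in output_tokens, so the usage block is where the 10x shows up.
print(response.usage.output_tokens)
```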
Four realistic cost scenarios
This is the part of the llm cost calculator that usually gets skipped: actual workloads with actual numbers. All costs are list price, no caching, no batch discount, no retries factored in. Round to the nearest dollar, multiply by 30 for monthly.
Scenario 1: Customer support bot, 10,000 chats per day
Assumptions: 2,000 input tokens per turn, 500 output per turn, 5 turns per chat.
Daily: 100M input tokens, 25M output tokens.
| Model | Daily input | Daily output | Daily total | Monthly |
|---|---|---|---|---|
| Claude Haiku 4.5 | $80 | $100 | $180 | $5,400 |
| GPT-4o-mini | $15 | $15 | $30 | $900 |
| Gemini 1.5 Flash | $7.50 | $7.50 | $15 | $450 |
At this tier the llm token cost gap is real. Gemini Flash and GPT-4o-mini undercut Haiku by 6x to 12x on raw list price. The counterweight is quality on edge cases, tool use reliability, and the cost of a failed escalation. If one bad answer triggers a human support ticket worth $8, you absorb a lot of price difference before Haiku stops paying back.
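The arithmetic behind that table is three lines. Written out once here so you can swap in your own traffic for the remaining scenarios; prices are the list prices from the snapshot at the top, nothing else:

```python
# $/1M tokens (input, output), from the April 2026 snapshot above.
PRICES = {
    "haiku-4.5":    (0.80, 4.00),
    "gpt-4o-mini":  (0.15, 0.60),
    "gemini-flash": (0.075, 0.30),
}

def daily_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """One day of traffic at list price: no caching, no batch discount, no retries."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Scenario 1: 10,000 chats x 5 turns x (2,000 in + 500 out) tokens per turn.
daily_in = 10_000 * 5 * 2_000   # 100M input tokens
daily_out = 10_000 * 5 * 500    # 25M output tokens

for model in PRICES:
    d = daily_cost(model, daily_in, daily_out)
    print(f"{model}: ${d:,.0f}/day, ${d * 30:,.0f}/month")
```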
Scenario 2: Agentic workflow, 1,000 runs per day
Assumptions: 15k input, 3k output per turn, 5 turns per run with tool use.
Daily: 75M input, 15M output.
| Model | Daily input | Daily output | Daily total | Monthly |
|---|---|---|---|---|
| Claude Sonnet 4.6 | $225 | $225 | $450 | $13,500 |
| GPT-4o | $188 | $150 | $338 | $10,140 |
| Gemini 1.5 Pro | $94 | $75 | $169 | $5,070 |
GPT-4o wins on raw price against Sonnet 4.6. Gemini Pro wins against both. The picture flips the moment you enable prompt caching on a stable tool schema: Sonnet with caching on 80% of the input drops to roughly $8,600 monthly, and the comparison gets interesting again.
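That caching figure is worth sanity-checking rather than trusting. A back-of-the-envelope, assuming 80% of the daily input hits the cache at Anthropic's 0.1x read rate and treating the 1.25x write overhead as noise across thousands of calls:

```python
daily_input_tokens = 75_000_000
daily_output_tokens = 15_000_000
cached_share = 0.80

input_cost = (daily_input_tokens * (1 - cached_share) * 3.00                   # uncached input at $3/1M
              + daily_input_tokens * cached_share * 3.00 * 0.10) / 1_000_000   # cache reads at 0.1x
output_cost = daily_output_tokens * 15.00 / 1_000_000                          # output is untouched by caching

print((input_cost + output_cost) * 30)   # ~ $8,600/month, down from $13,500 at list price
```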
Scenario 3: Document summarization, 5,000 docs per day
Assumptions: 20k input, 1k output per doc. Heavy prefix reuse if you use a consistent system prompt and a shared instruction template.
Daily: 100M input, 5M output.
| Model | No caching, monthly | With caching (Anthropic), monthly |
|---|---|---|
| Claude Sonnet 4.6 | $11,250 | ~$4,500 |
| GPT-4o | $9,000 | ~$6,500 (50% auto cache) |
| Gemini 1.5 Pro | $4,500 | lower, harder to estimate |
This is where Anthropic caching does real work. The document is the variable part, the system prompt plus instructions plus format guide is the constant. Cache the constant, pay full price for the document, and Sonnet beats GPT-4o by a comfortable margin. Batch API would halve either number again if you can wait.
Scenario 4: Heavy reasoning, 100 runs per day
Assumptions: 5k input, 2k visible output, 20k reasoning tokens per run.
Daily: 0.5M input, 2.2M effective output (including thinking) for the reasoning rows, 0.2M output for the plain Sonnet baseline.
| Model | Daily cost | Monthly |
|---|---|---|
| Claude Opus 4.7 with extended thinking | $173 | $5,190 |
| OpenAI o1 / o3 | $140 | $4,200 |
| Claude Sonnet 4.6 (no thinking) | $4.50 | $135 |
On this profile, reasoning mode costs roughly 30x to 40x as much as a plain Sonnet call, and nearly all of the difference is thinking tokens billed at the output rate. If you do not have evidence that the chain of thought meaningfully improves your accuracy on this specific task, you are spending $4k to $5k per month on a benchmark headline.
Batch API, commits, and self-hosting
Three levers that sit outside the headline table.
Batch. Flip anything tolerant of a 24-hour turnaround to batch and halve it. Overnight reports, backfill jobs, evaluation runs, content generation pipelines that feed next-day publishing. There is no reason to pay real-time prices for work that runs at 3 AM.
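Queuing batch work takes minutes to wire up. A sketch of the Anthropic Message Batches flow, assuming the current SDK shape; the model id and the overnight_documents iterable are stand-ins for your own pipeline:

```python
import anthropic

client = anthropic.Anthropic()

overnight_documents = [("rep-001", "First report text ..."), ("rep-002", "Second report text ...")]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": doc_id,
            "params": {
                "model": "claude-sonnet-4-6",   # placeholder model id
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Summarize for the morning digest:\n\n{text}"}],
            },
        }
        for doc_id, text in overnight_documents
    ]
)

# Poll the batch id later; results arrive within 24 hours at roughly half the real-time price.
print(batch.id, batch.processing_status)
```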
Commits. If your spend has been stable for two quarters and you have no plans to switch providers, a commit discount of 15% to 25% is reasonable to ask for. If your spend is still shifting, skip it. The lock-in costs more than the saving.
Self-hosted. Llama 3.x, Qwen 2.5, Mistral, and DeepSeek V3 in self-hosted mode can undercut API pricing by 3x to 10x per token at sustained volume. The real cost is on-call burden, GPU availability, and the engineer-weeks to get the stack stable. I only recommend self-hosted when at least one of these applies: (a) data cannot leave your infrastructure, (b) sustained spend is above $10k monthly on a non-frontier tier, or (c) you already have the GPU and ops expertise in-house.
Tokenizer quirks across providers
A common mistake: comparing 1M tokens at provider A to 1M tokens at provider B without noticing that the same text tokenizes differently.
- GPT-4o’s `o200k_base` tokenizer is about 15% more efficient on English prose than Claude’s tokenizer.
- For German, the gap narrows significantly. Long compound nouns tokenize worse on both, and Claude’s tokenizer handles some of the common endings slightly better.
- For code, tokenizer efficiency depends heavily on the language and indentation style. Python with 4-space indent is close to a wash. Heavily nested JSON can diverge by 10% or more.
When you are benchmarking openai vs anthropic pricing, run your actual production prompts through both tokenizers and compare the bill, not the list price. I have seen “20% cheaper” turn into “5% cheaper” after correcting for tokenization.
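Counting is cheap, so count instead of guessing. A sketch: tiktoken's o200k_base encoding for the OpenAI side, and Anthropic's token-counting endpoint for the Claude side (assuming the current SDK shape and a placeholder model id):

```python
import tiktoken
import anthropic

text = open("sample_production_prompt.txt").read()   # a real production prompt, not lorem ipsum

# OpenAI side: count locally with the o200k_base encoding GPT-4o uses.
openai_tokens = len(tiktoken.get_encoding("o200k_base").encode(text))

# Claude side: ask the API how it would tokenize the same text.
client = anthropic.Anthropic()
claude_tokens = client.messages.count_tokens(
    model="claude-sonnet-4-6",   # placeholder model id
    messages=[{"role": "user", "content": text}],
).input_tokens

print(f"OpenAI: {openai_tokens}  Claude: {claude_tokens}  ratio: {claude_tokens / openai_tokens:.2f}")
```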
Which LLM for which cost regime?
Budget tiers map cleanly to decision rules.
Under $50 per month. Pick on feature fit, not price. Every tier is cheap enough that model quality and SDK ergonomics matter more than a few dollars.
$50 to $500 per month. Haiku 4.5, GPT-4o-mini, and Gemini Flash dominate. Spend your time tightening prompts, shortening outputs, and running evals. At this tier, a better prompt beats a cheaper provider.
$500 to $5,000 per month. Sonnet 4.6, GPT-4o, and Gemini Pro tradeoffs dominate. Prompt caching is mandatory, not optional. Batch API for non-real-time work. Output length discipline pays for itself every week.
$5,000 per month and up. Negotiate commits. Audit your cache hit rates. Evaluate self-hosted non-frontier models for the tasks where quality tolerance allows it. At this scale, a 20% saving is an engineering salary.
Cost monitoring that catches problems early
A cost-effective setup that never gets audited will drift. A few patterns that have saved me from bad surprises.
Log usage per call. Every response from the Anthropic and OpenAI SDKs includes a usage object. Persist it. SQLite is enough. Columns: timestamp, model, input tokens, output tokens, cached tokens, workload tag.
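The whole thing is one table and one insert. A minimal sketch for the Anthropic response shape; the cached-token field name is the one I use today and may differ on other providers:

```python
import sqlite3
import time

conn = sqlite3.connect("llm_usage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS usage (
        ts         INTEGER,
        model      TEXT,
        input_tok  INTEGER,
        output_tok INTEGER,
        cached_tok INTEGER,
        workload   TEXT
    )
""")

def log_usage(response, model: str, workload: str) -> None:
    """Persist the usage block from an API response."""
    u = response.usage
    conn.execute(
        "INSERT INTO usage VALUES (?, ?, ?, ?, ?, ?)",
        (int(time.time()), model, u.input_tokens, u.output_tokens,
         getattr(u, "cache_read_input_tokens", 0) or 0, workload),
    )
    conn.commit()
```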
Weekly rollup. A cron job that sums the week and compares to the previous one. If spend is up more than 20% and traffic is not, something regressed: a longer prompt, a worse cache hit rate, a silent retry loop, a new model version that is chattier.
Daily anomaly alerts. If today’s cost is more than 2x yesterday’s, ping yourself. I run this through a Telegram notification. Most of the time it is a traffic spike. Occasionally it is a runaway agentic loop that would have cost four digits by morning.
Per-feature attribution. Tag every call with a workload label. When the bill grows, you want to know whether the growth came from the support bot, the summarization pipeline, or the batch eval runs. Without attribution, every cost regression turns into a two-hour debugging session.
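The rollup, the anomaly check, and the attribution all reduce to queries over that one table. A sketch, assuming the schema from the logging example above:

```python
import sqlite3
import time

DAY = 86400
now = int(time.time())
conn = sqlite3.connect("llm_usage.db")

# Weekly rollup per workload tag: is the growth coming from the bot or the batch evals?
for workload, input_tok, output_tok, calls in conn.execute(
    """SELECT workload, SUM(input_tok), SUM(output_tok), COUNT(*)
       FROM usage WHERE ts >= ? GROUP BY workload""",
    (now - 7 * DAY,),
):
    print(f"{workload}: {input_tok:,} in / {output_tok:,} out across {calls:,} calls")

# Daily anomaly check: last 24h of output tokens vs the 24h before that.
today = conn.execute("SELECT COALESCE(SUM(output_tok), 0) FROM usage WHERE ts >= ?",
                     (now - DAY,)).fetchone()[0]
prior = conn.execute("SELECT COALESCE(SUM(output_tok), 0) FROM usage WHERE ts >= ? AND ts < ?",
                     (now - 2 * DAY, now - DAY)).fetchone()[0]
if prior and today > 2 * prior:
    print("Output volume more than doubled day over day -- check for a runaway loop before the bill does.")
```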
Which should you choose?
If I had to reduce this whole post to five rules:
- Use the llm cost calculator approach from Scenarios 1 through 4. Plug in your traffic. Don’t guess.
- Enable prompt caching anywhere your prefix is stable. It is the single biggest lever after model choice.
- Shorten outputs. Every output token is 4x to 5x more expensive than an input token.
- Move async workloads to the batch API. A free 50% discount.
- Review weekly. Spend drifts. Your 2026-04 benchmark will be stale by 2026-07.
For most production workloads I build, the default stack is Haiku 4.5 for extraction and classification, Sonnet 4.6 with caching for agentic tool use, and Opus 4.7 reserved for the narrow set of decisions where reasoning quality changes the downstream outcome. GPT-4o-mini is a strong alternative on the low end. Gemini Flash is worth a look when raw cost dominates. o1, o3, and extended thinking stay on the shelf until a specific workload proves they earn their price.