Self-Hosted LLM vs API Cost: Break-Even Analysis (2026)
Every few months a client asks me the same question. “We’re burning $8k/mo on Claude. Should we self-host Llama?” The answer is almost always no, and the reason has nothing to do with whether the model is good enough. It has to do with what a GPU costs when it’s idle, and how much engineering time it takes to keep a serving stack healthy at 3am.
This guide breaks down self-hosted LLM vs API cost with real numbers. Hetzner GPU pricing, RunPod and Lambda hourly rates, Claude Sonnet 4.6 and Haiku 4.5 token pricing, and the break-even points that actually matter. The goal is to give you a decision framework, not a marketing pitch for either side.
If you want the raw per-token pricing for hosted models, I wrote that up in the LLM API cost comparison. This guide is the other half: what changes when you put the model on your own GPU.
Verdict up front
APIs win for 95% of production workloads in 2026. That number is not hedging. For most teams building agents, chat features, extraction pipelines, and internal tools, the sticker price of Claude Sonnet 4.6 at $3/$15 per million tokens looks expensive on a spreadsheet and turns out to be the cheapest path by a wide margin once you account for engineering time, idle GPU cost, and the ops load of running your own inference.
Self-hosting wins in five cases, and only these five:
- Very high volume. Past roughly 10M tokens/day of steady load, the math starts tilting toward self-hosted, and past 50M tokens/day it's clearly ahead if an open model meets your quality bar. Against Haiku-class pricing, the thresholds sit several times higher.
- Strict data residency that no API provider can meet. Claude, OpenAI, and the major hyperscalers all have EU and US options, BAA, zero-retention endpoints, and region pinning. “Data has to never leave our network” is a narrower claim than it was two years ago.
- Fine-tuned custom models. If you need weights trained on your data that you actually own and can redeploy, you’re self-hosting by definition.
- Cost-sensitive bulk with predictable load. High-volume classification, extraction, embeddings, and summarization where you can keep a GPU at 70%+ utilization around the clock.
- Models the API providers don’t offer. Small specialized models, older model versions frozen for a legacy integration, domain-tuned open weights from Hugging Face that have no hosted equivalent.
If your workload isn’t one of those five, stop reading and pick the right hosted LLM for production. Everything below is for the cases where self-hosting is genuinely on the table.
The cost model for APIs
The cost model for the Claude API is three numbers: input price, output price, and zero infrastructure.
For 2026:
| Model | Input ($/1M tok) | Output ($/1M tok) |
|---|---|---|
| Claude Opus 4.7 | $15 | $75 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Haiku 4.5 | $0.80 | $4 |
That’s the whole pricing sheet. No idle cost, no reserved capacity, no GPU you pay for when nobody is using it. If you send zero requests this month you pay zero dollars. Scale is elastic in both directions.
On top of that you get prompt caching (90% discount on cached input), batch processing (50% discount for non-realtime jobs), and a global SDK that handles retries, rate limits, and streaming without any work on your end. The effective per-token rate for a well-tuned workload with prompt caching enabled is often 30-50% below the sticker price.
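As a sketch, the whole API cost model fits in a few lines. Prices come from the table above; the 10% rate for cached input reflects the 90% caching discount, and everything else here is illustrative:

```python
PRICES = {  # $ per 1M tokens (input, output), from the table above
    "opus-4.7":   (15.00, 75.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5":  (0.80, 4.00),
}

def monthly_api_cost(model, input_tok_day, output_tok_day,
                     cached_fraction=0.0, days=30):
    """Monthly API spend. Cached input bills at 10% of the input price."""
    in_price, out_price = PRICES[model]
    cached = input_tok_day * cached_fraction
    uncached = input_tok_day - cached
    daily = (uncached * in_price
             + cached * in_price * 0.10
             + output_tok_day * out_price) / 1e6
    return daily * days

# 1M tokens/day on Sonnet, 50/50 split, no caching:
monthly_api_cost("sonnet-4.6", 500_000, 500_000)  # -> 270.0 ($/month)
```

Add a `cached_fraction` of 0.8 or so, which is realistic for agents with a large stable system prompt, and the effective rate drops meaningfully, which is where the 30-50% figure above comes from.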
The failure modes are: per-token cost adds up at scale, your latency depends on the provider’s queue depth, and you’re subject to their rate limits (which you can raise on request but not eliminate).
The cost model for self-hosting
Self-hosting has four cost buckets, and most people only see the first one.
Hardware or rental. You either buy a GPU and rack it, or you rent one. On Hetzner, a GEX44 with an RTX 4000 Ada (20GB VRAM) runs €184/mo. That's enough to serve 7B-14B models at full precision or a 30B-class model quantized; a 4-bit quantized 70B needs roughly 40GB and doesn't fit. A dedicated H100 on RunPod is about $2.50/hr on-demand, Lambda Labs around $2.80/hr. Running an H100 24/7 for a month is $1,800-2,000.
For comparison on the cloud GPU side, I have a separate writeup on GPU cloud comparison for AI inference that covers spot pricing and reserved instances. And Hetzner vs AWS for AI workloads covers the bare-metal vs hyperscaler tradeoff for anything steady-state.
Idle time. A GPU at rest costs the same as a GPU at 100%. If your load is bursty (peak at 9am, nothing at 2am), you’re paying for capacity you’re not using. The Claude API charges you zero when you’re idle. This is the single biggest cost difference at low-to-medium scale.
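To see how hard idle time bites, price the GPU per effective token. The numbers below are illustrative assumptions: a ~$1,900/month H100 and a notional 1,000 tokens/sec of peak serving throughput.

```python
def selfhost_cost_per_mtok(monthly_gpu_cost, peak_tokens_per_sec, utilization):
    """Effective $ per 1M tokens for a flat-rate GPU: every idle hour
    is amortized over the tokens you actually served."""
    tokens_per_month = peak_tokens_per_sec * utilization * 86_400 * 30
    return monthly_gpu_cost / tokens_per_month * 1e6

# Same $1,900/month H100, different utilization:
selfhost_cost_per_mtok(1900, 1000, 0.70)  # ~ $1.05 per 1M tokens
selfhost_cost_per_mtok(1900, 1000, 0.10)  # ~ $7.33 per 1M tokens
```

At 10% utilization the "cheap" GPU is already near Sonnet's blended rate, which is the whole point: the API provider amortizes idle across thousands of customers, you amortize it across your own traffic.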
Engineering time. Getting a production-grade inference setup running is 2-4 weeks of focused work for someone who’s done it before, longer if it’s their first time. That’s vLLM deployment, model download and validation, load testing, autoscaling, observability, and a deployment pipeline. Then it’s ongoing: model updates, serving stack upgrades, incident response. Budget 10-20% of one engineer’s time permanently if you want it to stay healthy.
The invisible stack. API queueing, retries, rate limit backoff, streaming, multi-region failover, cost monitoring, request logging, latency SLOs, security hardening of the inference endpoint. The Anthropic SDK does all of this. You get to rebuild it.
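One small slice of that invisible stack, as a sketch: retry with full-jitter exponential backoff around a flaky inference call. The hosted SDKs ship this behavior; self-hosting means writing and tuning it yourself.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base=0.5, cap=30.0,
                      retryable=(TimeoutError, ConnectionError)):
    """Retry fn() on transient errors, sleeping a random interval drawn
    from an exponentially growing window (full jitter), capped at `cap`."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

And this is the easy piece. Rate-limit-aware queueing, streaming, and multi-region failover are each bigger than this.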
Break-even math at realistic volumes
Here’s where the rubber meets the road. Four scenarios, all assuming Claude Sonnet 4.6 as the API baseline and a reasonable open model (Llama 3.3 70B or Qwen 2.5 72B) as the self-hosted alternative. Assume a 50/50 input/output split for simplicity.
Scenario: 1M tokens/day
Claude Sonnet 4.6: 500k input @ $3/M + 500k output @ $15/M = $1.50 + $7.50 = $9/day, or about $270/month.
Self-hosting can't beat $270/month. A 40GB GPU capable of serving a quantized 70B rents for more than that on its own, and even a €184/mo GEX44 running a smaller model saves you at most $86/month while sitting idle 95% of the time. Against 3-4 weeks of upfront engineering, that's a multi-year payback before you count ongoing ops.
Verdict: API wins, not close.
Scenario: 10M tokens/day
Claude Sonnet 4.6: 5M input + 5M output = $15 + $75 = $90/day, roughly $2,700/month.
A quantized 70B open model with vLLM needs roughly 40GB of VRAM, which is more than a GEX44 offers, so figure a 48GB card or an A100 40GB-class rental instead, on the order of €500-1,000/month. With vLLM batching, a 4-bit 70B on that class of hardware sustains a few hundred tokens/sec aggregate, so 10M tokens/day (about 115 tokens/sec average) leaves significant headroom. Engineering time: still 2-4 weeks upfront plus ongoing.
Break-even against Sonnet: yes, if quality is acceptable. Break-even against Haiku 4.5 ($0.80/$4, so $24/day, $720/mo) is harder. Haiku handles many of the workloads people reach for self-hosting to cover.
Verdict: depends on which API model you’re actually comparing against. Against Sonnet, self-hosting is cheaper at this volume. Against Haiku, it’s a wash.
Scenario: 100M tokens/day
Claude Sonnet 4.6: $900/day, $27k/month. Haiku 4.5: $240/day, $7.2k/month.
At 100M tokens/day (about 1,150 tokens/sec average), a single mid-range GPU is tapped out. You're looking at 2-4 GPUs, likely H100-class if you want headroom, so $4-8k/month on cloud GPU. Self-hosted wins decisively against Sonnet, and is competitive against Haiku when you factor in throughput advantages on bulk workloads.
This is where fine-tuning also enters the picture. If you can fine-tune a 7B-13B model on your specific task and serve it on a single A4000, unit economics get even better.
Verdict: self-hosted wins if an open model clears your quality bar.
Scenario: 1B tokens/day
At this scale, you’re running a small cluster. 4-8 H100s, dedicated ops, custom routing. Claude Sonnet would be $9k/day ($270k/month). Self-hosted with fine-tuned models is in the $30-80k/month range all-in including engineering overhead. The break-even is obvious, but it’s no longer a side project. You need a team.
Rough rule of thumb: break-even for self-hosting against Claude Sonnet 4.6 sits around 2-5M tokens/day if an open 70B model meets your quality bar. Against Haiku 4.5, it’s closer to 15-25M tokens/day. Against Opus 4.7, break-even starts around 500k tokens/day, but you’re also accepting a quality drop that’s rarely worth it for the kind of workload that justifies Opus in the first place.
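The rule of thumb above falls out of one division. The $1,000/month all-in figure for a single-GPU setup below is an assumption; swap in your own number:

```python
def breakeven_tokens_per_day(selfhost_monthly_usd, in_price, out_price,
                             output_share=0.5, days=30):
    """Tokens/day at which flat self-host spend equals API spend,
    for a given input/output split. Prices are $ per 1M tokens."""
    blended = in_price * (1 - output_share) + out_price * output_share
    return selfhost_monthly_usd / days / blended * 1e6

# ~$1,000/month all-in single-GPU setup vs Sonnet 4.6 ($3/$15):
breakeven_tokens_per_day(1000, 3, 15)    # ~3.7M tokens/day
# vs Haiku 4.5 ($0.80/$4):
breakeven_tokens_per_day(1000, 0.80, 4)  # ~13.9M tokens/day
```

Push the self-host bill up toward honest all-in numbers (engineering amortization included) and the break-even climbs accordingly.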
What you give up self-hosting
This is the list that kills most self-hosting plans when teams actually confront it.
Model quality at the top. Claude Opus 4.7 is meaningfully ahead of the best open weights on reasoning, code, and agentic tasks. The gap on everyday tasks has narrowed a lot, but the gap on hard tasks (multi-step reasoning, complex tool use, long-form generation with strong coherence) is still real.
Tool use reliability. Claude’s tool_use is production-grade in a way open models are still catching up to. If your agent calls five tools in sequence and needs the JSON schema to be right every time, Claude’s first-party tool use does that. Open models with function calling retrofitted via prompt templates fail more often, and the failures are harder to debug. I wrote more about this in the Claude API tool use guide.
Long-context recall. Open models with 128k context windows often fall off a cliff past 32k, especially for needle-in-haystack recall. Claude holds coherence across 200k+ tokens in production in ways the open ecosystem hasn’t matched.
Vision and multimodal. Claude accepts images natively, handles PDFs, and generates structured output about visual content in the same API call. Self-hosted vision models exist but require a separate pipeline.
Prompt caching. Anthropic’s prompt caching gives you a 90% discount on cached input tokens and is a one-line code change. Self-hosted requires you to build your own caching layer, which is doable but non-trivial.
Ecosystem. SDK in every language, docs, community examples, debugging tools, third-party integrations. The hosted API ecosystem is years ahead of the open self-hosted tooling.
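The prompt caching item above really is a one-line change. Here is a sketch of the Messages API request shape; the model string and system prompt are placeholders, and the `cache_control` block is the mechanism Anthropic's API uses:

```python
SYSTEM_PROMPT = "You are a contract-analysis assistant."  # imagine 10k tokens of rubric here

def build_request(user_msg):
    """Messages API payload with the stable system prompt marked cacheable."""
    return {
        "model": "claude-sonnet-4-6",  # placeholder model string
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # <- the one line
        }],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

Every request that reuses the same system prompt then bills that prefix at the cached rate. Self-hosted, the equivalent (prefix caching in your serving layer) is doable but it's your code to build and operate.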
What you gain self-hosting
Data never leaves your network. This matters for regulated industries, classified workloads, and anything where your legal team has decided no external API is acceptable.
No rate limits. You’re capped by your own GPU, not by a provider’s rate limit policy. For workloads that need sustained high throughput, this is a real advantage.
Weights you own. Fine-tuned checkpoints you can redeploy anywhere. No vendor lock-in on the model itself.
Predictable flat-rate cost. Once the GPU is paid for, marginal cost per token is effectively zero. This makes budgeting trivial and removes the “we burned $30k in API calls this month” surprise.
Latency predictability. No API queue variance. Your p99 is whatever your GPU serves at, full stop.
Models nobody else hosts. Smaller specialized models, older versions frozen for a legacy pipeline, domain-specific tunes from Hugging Face with no hosted equivalent.
Which open models are realistic for self-hosting
Not every open model is production-ready. Here’s what I’d actually put into a serving stack in 2026.
Llama 3.3 70B. Strong general-purpose, wide tooling support, vLLM serves it well. At fp16 the weights alone need around 140GB of VRAM, which means 2x A100 80GB or 2x H100 with headroom for KV cache. Quantized to 4-bit (AWQ or GPTQ) it fits in roughly 40GB, so a single 48GB card (RTX 6000 Ada, L40S) or an A100 80GB works.
Qwen 2.5 72B. Competitive with Llama 3.3 on most benchmarks, often better on multilingual. Same VRAM profile. Strong option if you want a non-Meta alternative.
DeepSeek V3. Mixture-of-experts architecture, strong reasoning, serves with vLLM. Interesting for teams that want near-Sonnet-level performance on reasoning tasks in an open model. Bigger total parameter count but active parameters per token are lower.
Mistral. Older open-weight versions (Mixtral, Mistral Large before it went closed) are still solid for many workloads.
Phi-4. Microsoft's 14B model, surprisingly strong on narrow tasks, runs on a single 24GB consumer GPU quantized or at 8-bit. Good fit for classification, extraction, and structured output where you don't need frontier reasoning.
7B-13B models. For embedding, classification, simple extraction, and routing, a 7-13B model on a single 20GB GPU is often all you need. This is where Hetzner GEX44 economics start looking excellent.
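A back-of-envelope VRAM rule ties the sizing claims above together. This counts weights only, so treat it as a floor: KV cache and activations come on top and grow with context length and batch size.

```python
def weight_vram_gb(params_billion, bits_per_param=16):
    """Weights-only VRAM: params x bits / 8. KV cache and activations
    come on top, so this is a floor, not a budget."""
    return params_billion * bits_per_param / 8

weight_vram_gb(70)      # 140.0 -> fp16 Llama 3.3 70B: two 80GB GPUs
weight_vram_gb(70, 4)   # 35.0  -> 4-bit: fits a 40-48GB card
weight_vram_gb(14, 8)   # 14.0  -> 8-bit Phi-4 on a 24GB consumer GPU
```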
If you’re deploying any of these on Kubernetes, I have a longer writeup on self-hosted LLM on Kubernetes covering vLLM deployment, autoscaling, and observability.
The serving stack options
The inference server matters almost as much as the model.
vLLM. My default. Fastest, production-ready, PagedAttention makes memory usage efficient for batched inference. Strong OpenAI-compatible API. Active development. If you’re serving at scale, this is the starting point.
Text Generation Inference (TGI). Hugging Face’s inference server. Solid, widely deployed, good integration with the HF model hub. Slightly behind vLLM on raw throughput in most benchmarks I’ve run but perfectly production-viable.
Ollama. Excellent for development and local experimentation. Not production-grade for multi-tenant serving. I use it on my dev box, I would not put it behind a production endpoint.
llama.cpp. CPU-friendly, runs on edge devices, great for Mac deployments. Slower than GPU-based stacks but useful when GPU isn’t an option.
SGLang. Newer, competitive with vLLM on throughput, particularly strong for structured output and constrained decoding. Worth watching.
For production, vLLM or TGI. Everything else is dev tooling.
The real cost categories people forget
The spreadsheet usually reads "GPU cost + electricity = total". That equation is missing most of its terms.
GPU capex or ongoing rent. The obvious one.
Idle time. GPU at rest costs the same as GPU at 100%. If you can’t keep utilization above 60%, per-token economics look worse than the brochure suggests. The API provider amortizes idle across thousands of customers; you amortize it across your own traffic alone.
Engineering time, initial. 2-4 weeks of focused senior engineering to get a production serving stack up: deployment, autoscaling, monitoring, load testing, security.
Engineering time, ongoing. 10-20% of one senior engineer permanently: model updates, serving stack upgrades, incident response, cost optimization, quality regression tracking when you swap models.
Observability. Cost per request, latency percentiles, quality drift, token accounting. The API gives you most of this out of the box. You build it.
Model updates. When Anthropic ships Sonnet 4.7 or Haiku 5.0, you get it with a model string change. When a new open model drops, you download, benchmark, re-validate on your eval suite, test serving performance, and redeploy. Budget a week per major model swap.
Security surface. Your inference endpoint is now an attack surface you own. Authentication, authorization, rate limiting, DDoS protection, input validation to prevent prompt injection from leaking to the model layer.
Redundancy. The API has multi-region failover built in. You rebuild it. A single-GPU deployment has a single failure domain, which is fine for batch but not for real-time.
If you add these honestly, self-hosting economics shift meaningfully. A setup that looks $2k/month cheaper on hardware often costs $4-6k/month more once you include engineering amortization.
The hybrid pattern
Here’s what most production teams I work with actually end up doing. Not pure self-hosted, not pure API. Both.
Self-host for:
- Bulk classification (millions of records, simple schema)
- Extraction from structured or semi-structured data
- Embeddings (a small model on a cheap GPU runs these for pennies)
- Simple summarization where quality bar is “good enough”
- Any workload where p95 volume is predictable and sustained
Claude API for:
- Reasoning, multi-step agents, complex tool use
- Multilingual content generation
- Long-context tasks (analyzing full contracts, long codebases)
- Vision and multimodal
- Anything customer-facing where quality variance matters
- Burst traffic that would push your self-hosted GPU past capacity
A simple routing layer decides per request. Classification input? Self-host. Customer support agent? Claude. Embedding a document batch? Self-host. Analyzing a 50-page contract? Claude.
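A sketch of that router follows. The task names, token threshold, and model strings are all placeholders for whatever your stack actually uses:

```python
BULK_TASKS = {"classify", "extract", "embed", "summarize"}

def route(task, input_tokens, self_host_limit=8_000):
    """Per-request routing: predictable bulk work goes to the local vLLM
    deployment, everything else (or anything long-context) goes to Claude."""
    if task in BULK_TASKS and input_tokens <= self_host_limit:
        return ("vllm", "llama-3.3-70b-awq")  # hypothetical deployment name
    return ("claude", "claude-sonnet-4-6")    # placeholder model string

route("classify", 1_200)   # -> ("vllm", ...)
route("agent", 1_200)      # -> ("claude", ...)
route("extract", 60_000)   # long context -> ("claude", ...)
```

In production this grows conditions for burst overflow and quality tiers, but the shape stays this simple: a pure function from request metadata to backend.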
This pattern captures the cost savings of self-hosting on the workloads that justify it, while keeping the quality and ops simplicity of API for the workloads that need it. Almost nobody ends up at 100% self-host. The teams that claim they did usually have a second hosted pipeline they forgot to mention.
Six questions before self-hosting
Walk through these honestly. Skip the ones you want to rationalize away at your own risk.
What’s my p95 token volume in the next 12 months? If it’s under 5M tokens/day against Sonnet, under 20M tokens/day against Haiku, the math almost certainly doesn’t work. Pay for the API.
Is my workload simple enough that an open 70B or smaller model performs well? Run an eval. Actually run it. “I think Llama will be fine” is not evidence. Take your top 200 prompts, run them through Sonnet and Llama, score the outputs. If Llama is within 10% on your metric, good. If it’s 30% worse, no self-hosting will fix that.
Do I need fine-tuning? If yes, self-host is the path. If no, the argument for self-host gets weaker.
Do I have GPU ops capability, or budget to hire it? If your team has never run a GPU in production, this is a hidden 3-6 month cost you’re not counting. If you can hire it, budget €90-130k/year for someone who’s actually done it.
Does my data HAVE to never leave my infra? Not “we’d prefer” but actually, legally, contractually required. Check first whether zero-retention endpoints from API providers satisfy your legal team, because increasingly they do.
Am I willing to re-benchmark every 6 months as open models improve and frontier models improve? Self-hosting is not a one-time decision. The gap between open and closed moves. Your decision has to move with it.
If you answered yes to 4 or more, self-hosting may be worth evaluating seriously. If you answered yes to fewer than 4, default to the API. That’s not a cop-out, it’s the right call for the volume and workload pattern most teams actually have.
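The eval in question two reduces to a few lines once you've scored both models' outputs on the same prompts. The scoring itself is the hard part, and it's yours to define:

```python
def quality_gap(api_scores, open_scores):
    """Relative quality drop of the open model vs the API model,
    on your own per-prompt metric (higher = better)."""
    api_mean = sum(api_scores) / len(api_scores)
    open_mean = sum(open_scores) / len(open_scores)
    return (api_mean - open_mean) / api_mean

# Within ~10% (gap <= 0.10): self-hosting is plausible.
# 30% worse (gap ~ 0.30): no amount of GPU savings fixes that.
```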
Which should you choose?
For most teams in 2026, the answer is API. Claude Sonnet 4.6 for quality-sensitive work, Haiku 4.5 for cost-sensitive bulk, both with prompt caching to squeeze the effective rate down another 30-50%. The engineering time you save pays for a lot of tokens.
Self-host when volume is genuinely in the 10M+/day range on a workload where an open model meets your quality bar, or when you have hard data residency requirements no API can satisfy, or when you need fine-tuned weights you own. In those cases, start with vLLM on a Hetzner GEX44 or a single H100 rental, prove the unit economics with 30 days of real traffic, then scale.
The hybrid pattern is where most serious production setups land. Self-host the predictable bulk, use Claude for the high-value and bursty work. One router between them. That’s the architecture that actually ships.