Question 1

When does self-hosting an LLM become cheaper than calling an API?

Accepted Answer

The crossover sits between 50 and 300 million output tokens per month for a 70B-class model, depending on GPU rental price, utilization, and whether you can use batch and prompt caching on the API side. Below that, the API almost always wins because you pay only for what you use. Above that, a well-utilized GPU on spot pricing undercuts even discounted API calls.
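The crossover can be sketched as a one-line division: a fixed monthly self-hosting bill divided by the API price per million output tokens. This is a minimal sketch, not the page's calculator. The function name and both example prices are hypothetical, and it ignores input tokens and the throughput ceiling of the GPU.

```python
def break_even_mtok(monthly_self_hosted_cost: float, api_price_per_mtok: float) -> float:
    """Return the monthly output volume, in millions of tokens, at which a
    fixed self-hosting bill equals a pay-per-token API bill."""
    if monthly_self_hosted_cost < 0:
        raise ValueError("monthly_self_hosted_cost must be non-negative")
    if api_price_per_mtok <= 0:
        raise ValueError("api_price_per_mtok must be positive")
    return monthly_self_hosted_cost / api_price_per_mtok


# Hypothetical inputs: 1,800 per month for the GPU and 10 per million
# output tokens on the API, both in the same currency.
crossover = break_even_mtok(1800.0, 10.0)
print(f"break-even at {crossover:.0f}M output tokens per month")
```

Below the returned volume the API bill is smaller. Above it the fixed GPU bill is smaller, provided the GPU can actually serve that many tokens.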

Question 2

What is the real cost of self-hosting beyond the GPU?

Accepted Answer

Sticker GPU rent (around 620 euros per month for an L40S, 1,800 euros for an H100 on spot) is roughly half the real cost. Add engineering time for vLLM tuning, autoscaling, monitoring, and incident response. Budget 4 to 8 hours per month at your loaded engineering rate. With no Linux ops capacity, the API wins even at scale.
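To see how engineering time moves the total, here is a minimal sketch that adds ops hours to the GPU rent. The 620 euro rent and the 6 hours (the middle of the 4 to 8 hour range) come from the answer above. The loaded rate of 100 euros per hour is a hypothetical placeholder, so substitute your own.

```python
def monthly_self_hosted_cost(gpu_rent: float, ops_hours: float, hourly_rate: float) -> float:
    """GPU rent plus engineering time, all per month and in one currency."""
    if min(gpu_rent, ops_hours, hourly_rate) < 0:
        raise ValueError("all inputs must be non-negative")
    return gpu_rent + ops_hours * hourly_rate


gpu_rent = 620.0       # L40S sticker rent from the answer above
ops_hours = 6.0        # middle of the 4 to 8 hour budget
hourly_rate = 100.0    # hypothetical loaded engineering rate

total = monthly_self_hosted_cost(gpu_rent, ops_hours, hourly_rate)
gpu_share = gpu_rent / total
print(f"total: {total:.0f} per month, GPU share: {gpu_share:.0%}")
```

With these inputs the GPU rent is about half of the total, which matches the rule of thumb in the answer. A higher loaded rate or more incident hours pushes the GPU share lower.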

Question 3

Does the Anthropic Batch API change the break-even point?

Accepted Answer

Yes, by roughly 2x. The Batch API gives a 50 percent discount on both input and output tokens for any workload that can tolerate a turnaround of up to 24 hours, and halving the API price doubles the volume at which self-hosting starts to pay off. Overnight reports, evaluation runs, backfills, and async classification all qualify. Combined with prompt caching at 90 percent off cached input, an API workload can shed two thirds of its bill before self-hosting is worth considering.
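The combined effect of the two discounts can be estimated with a short sketch. The function name, the example bill, and the fractions are hypothetical. The sketch assumes the two discounts multiply, and it ignores any surcharge for writing to the cache, so treat the result as an upper bound on savings.

```python
def discounted_api_cost(
    input_cost: float,
    output_cost: float,
    cached_fraction: float,
    batch_fraction: float,
    cache_discount: float = 0.90,  # 90 percent off cached input
    batch_discount: float = 0.50,  # 50 percent off batched traffic
) -> float:
    """Estimate a monthly API bill after prompt caching and batch discounts.

    input_cost and output_cost are the undiscounted monthly costs.
    cached_fraction is the share of input tokens served from cache.
    batch_fraction is the share of traffic sent through the batch endpoint.
    """
    for name, value in (("cached_fraction", cached_fraction),
                        ("batch_fraction", batch_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    input_after_cache = input_cost * (1.0 - cached_fraction * cache_discount)
    subtotal = input_after_cache + output_cost
    return subtotal * (1.0 - batch_fraction * batch_discount)


# Hypothetical input-heavy workload: 800 on input, 200 on output,
# 80 percent of input cached, everything batch-tolerant.
before = discounted_api_cost(800.0, 200.0, 0.0, 0.0)
after = discounted_api_cost(800.0, 200.0, 0.8, 1.0)
print(f"bill drops from {before:.0f} to {after:.0f}")
```

How much of the bill disappears depends heavily on how input-heavy the workload is. Output tokens only benefit from the batch discount.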

Question 4

What is the hybrid pattern and when does it win?

Accepted Answer

Hybrid means routing latency-sensitive traffic (chat, real-time agents) to the API and shifting batch or bulk traffic (extraction, classification, evaluation) to a self-hosted model. It wins when one workload has tight p95 requirements you cannot meet on a small GPU, and another workload runs millions of tokens per night that would burn API credit. Most teams underestimate how much traffic is actually batch-tolerant.
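The routing rule itself is small. The following is a minimal sketch with hypothetical workload names and volumes. It tags each workload as latency-sensitive or not, sends it to one of two backends, and totals the daily token volume per backend so you can see how much traffic is batch-tolerant.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    latency_sensitive: bool
    tokens_per_day: int


def route(workload: Workload) -> str:
    """Latency-sensitive traffic goes to the API, the rest to the self-hosted model."""
    return "api" if workload.latency_sensitive else "self_hosted"


def split_volume(workloads: list[Workload]) -> dict[str, int]:
    """Sum daily tokens per backend."""
    totals: dict[str, int] = defaultdict(int)
    for w in workloads:
        totals[route(w)] += w.tokens_per_day
    return dict(totals)


# Hypothetical traffic mix.
workloads = [
    Workload("chat", latency_sensitive=True, tokens_per_day=2_000_000),
    Workload("nightly_extraction", latency_sensitive=False, tokens_per_day=40_000_000),
    Workload("eval_runs", latency_sensitive=False, tokens_per_day=15_000_000),
]
print(split_volume(workloads))
```

A real router would also need a fallback path to the API when the self-hosted model is saturated or down. That is left out here.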

Question 5

Why does utilization matter so much for self-hosted cost?

Accepted Answer

A GPU you pay for 24 hours a day only earns its keep when it is actually serving tokens. At 30 percent utilization, your effective cost per token is roughly 3x what the same GPU would cost per token fully loaded (1 / 0.3 ≈ 3.3). The break-even calculation assumes 60 to 80 percent utilization, which requires either steady traffic or aggressive batching. If your traffic is bursty and small, self-hosting will look cheap on the spreadsheet and expensive in reality.
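The utilization effect is a division by the utilization fraction, which is where the figure of about 3x at 30 percent comes from. This is a minimal sketch. The peak throughput of 400 million tokens per month is a hypothetical number, not a benchmark for any particular GPU.

```python
def utilization_multiplier(utilization: float) -> float:
    """How many times more each token costs compared with a fully loaded GPU."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


def cost_per_mtok(monthly_cost: float, peak_mtok_per_month: float, utilization: float) -> float:
    """Effective cost per million tokens actually served."""
    if peak_mtok_per_month <= 0:
        raise ValueError("peak_mtok_per_month must be positive")
    served = peak_mtok_per_month * utilization
    return monthly_cost * utilization_multiplier(utilization) / peak_mtok_per_month if served else float("inf")


# Hypothetical GPU: 1,800 per month, 400M tokens per month at full load.
for u in (0.30, 0.75):
    print(f"utilization {u:.0%}: x{utilization_multiplier(u):.2f}, "
          f"{cost_per_mtok(1800.0, 400.0, u):.2f} per million tokens")
```

Comparing the two rows shows why a break-even figure computed at 60 to 80 percent utilization does not transfer to a bursty, lightly loaded deployment.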

Self-Hosted LLM vs API Break-Even Calculator

Your workload

Volume

Model class

API levers

Self-hosted

Monthly cost

How the calculator counts