Self-Hosted LLM vs API Break-Even Calculator

Enter your monthly tokens and see instantly whether the Claude/OpenAI API, self-hosted vLLM, or a hybrid pattern is actually cheapest, with batch discount, prompt cache, and real GPU utilization factored in.

API pricing is straightforward: you pay for tokens. Self-hosting is not straightforward: you pay for a GPU, for utilization, for engineering time, and for the operational tail. This calculator models all four costs against your real volume so the crossover point shows up where it actually sits, not where a vendor blog claims.

Methodology and the full cost teardown: Self-Hosted LLM vs API Cost: Break-Even Analysis. Related: LLM API Cost Comparison, Self-Hosted LLM on Kubernetes, How to Choose an LLM for Production.

Your workload

Volume

50M
Log scale: 100K to 10B tokens per month.
10M
Output dominates cost on most APIs: an output token is priced at about 5x an input token.

Model class

Picks the API pricing and the matched OSS model + GPU.

API levers

30%
Share of the workload that can wait 24h (overnight reports, evals, backfills) and gets the 50% batch discount.
40%
Share of input tokens served from cache (90% cheaper).
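
The two API levers can be sketched as a small function. This is a minimal illustration, not the calculator's actual code: the per-million-token prices ($3 input, $15 output) are hypothetical placeholders chosen to match the 5x output ratio, the 50M and 10M volumes are read as input and output tokens, and applying the batch discount on top of the cache discount is an assumption about how the two combine.

```python
def api_monthly_cost(
    input_tokens: float,
    output_tokens: float,
    input_price_per_m: float,
    output_price_per_m: float,
    batch_share: float = 0.0,
    cache_share: float = 0.0,
    batch_discount: float = 0.50,  # "50% off" for batch-tolerant traffic
    cache_discount: float = 0.90,  # cached input tokens are "90% cheaper"
) -> float:
    """Monthly API cost, in the same currency as the prices passed in."""
    for share in (batch_share, cache_share):
        if not 0.0 <= share <= 1.0:
            raise ValueError("shares must be between 0 and 1")

    # Only input tokens can be served from the prompt cache.
    input_cost = (
        input_tokens / 1e6 * input_price_per_m * (1 - cache_share * cache_discount)
    )
    output_cost = output_tokens / 1e6 * output_price_per_m

    # Assumption: the batch discount applies to the whole bill for the
    # batch-tolerant share of traffic, after the cache discount.
    batch_factor = 1 - batch_share * batch_discount
    return (input_cost + output_cost) * batch_factor


# Hypothetical prices: $3 per 1M input tokens, $15 per 1M output tokens.
baseline = api_monthly_cost(50e6, 10e6, 3.0, 15.0)
with_levers = api_monthly_cost(
    50e6, 10e6, 3.0, 15.0, batch_share=0.30, cache_share=0.40
)
print(f"baseline: {baseline:.2f}  with levers: {with_levers:.2f}")
```

With these placeholder prices, a 30% batch share and a 40% cache share cut the bill by roughly 30%, which moves the break-even volume for self-hosting noticeably upward.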

Self-hosted

60%
Realistic GPU utilization is 40-70%. Below 30%, self-hosting rarely makes sense.
6h
Engineering time for vLLM tuning, monitoring, and incidents, costed at a fully loaded rate of 130 EUR/h.

Monthly cost

Note: Self-hosted assumes vLLM on spot GPU pricing with continuous batching. The cost includes GPU rent plus ops time at a loaded rate of 130 EUR/h. Output throughput scales with utilization, so a 30%-utilized H100 is ~2x more expensive per token than the headline rate suggests. Hybrid routes batch-tolerant traffic to self-hosted and interactive traffic to the API.
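
The utilization effect is simple arithmetic: the GPU bills by the hour whether or not it is busy, so halving the delivered throughput doubles the cost per token. A rough sketch follows, assuming throughput scales linearly with utilization and taking 60% as the reference point behind the headline rate. The pairing of a 70B-class model (~400 output tokens/sec at 60%) with a single H100 at $2.50/h is an assumption for illustration.

```python
REFERENCE_UTILIZATION = 0.60  # assumption: throughput figures are quoted at 60%


def usd_per_million_output_tokens(
    gpu_usd_per_hour: float,
    tokens_per_sec_at_reference: float,
    utilization: float,
) -> float:
    """GPU rent per 1M output tokens at a given average utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Assumption: delivered throughput scales linearly with utilization.
    tokens_per_sec = (
        tokens_per_sec_at_reference * utilization / REFERENCE_UTILIZATION
    )
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1e6


# Hypothetical pairing: 70B-class model on one H100 at the spot rate.
at_60 = usd_per_million_output_tokens(2.50, 400.0, 0.60)
at_30 = usd_per_million_output_tokens(2.50, 400.0, 0.30)
print(f"60%: ${at_60:.2f}/M  30%: ${at_30:.2f}/M  ratio: {at_30 / at_60:.1f}x")
```

Under these assumptions the GPU rent is about $1.74 per million output tokens at 60% utilization and twice that at 30%, before any ops time is added.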

LLM infra plan in 24 hours

You see the crossover point. If you want a concrete deployment plan (vLLM config, autoscaling, fallback to API, monitoring), I deliver it in writing within 24 hours.

How the calculator counts

Four cost paths, modeled against your monthly token volume.

USD/EUR: 0.92. GPU spot rates: L40S $0.86/h, H100 $2.50/h. Throughput assumptions (output tokens/sec at 60% utilization): 8B class ~1,500, 70B class ~400, 405B class ~120. Verify your own rates before committing, because spot pricing moves weekly.
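
To make the self-hosted path concrete, here is a hedged sketch of how these constants could combine into a monthly figure. It is not the calculator's source. The names are hypothetical, 730 hours per month is an assumption, 6 ops hours is read as a monthly figure, USD/EUR 0.92 is read as 1 USD = 0.92 EUR, and the model-to-GPU pairings (8B on L40S, 70B on H100) are guesses. The 405B class is left out because the page does not say how many GPUs it needs.

```python
import math
from dataclasses import dataclass

USD_TO_EUR = 0.92             # from the page, read as 1 USD = 0.92 EUR
HOURS_PER_MONTH = 730         # assumption: always-on GPU, average month
OPS_RATE_EUR = 130.0          # loaded engineering rate from the page
REFERENCE_UTILIZATION = 0.60  # throughput figures are quoted at 60%


@dataclass(frozen=True)
class ModelClass:
    name: str
    gpu_usd_per_hour: float
    tokens_per_sec_at_reference: float


# Assumed pairings of model class and GPU; rates and throughput from the page.
CLASS_8B = ModelClass("8B on L40S", 0.86, 1500.0)
CLASS_70B = ModelClass("70B on H100", 2.50, 400.0)


def self_hosted_monthly_eur(
    output_tokens: float,
    model: ModelClass,
    utilization: float = 0.60,
    ops_hours: float = 6.0,
) -> tuple[int, float]:
    """Return (GPU count, monthly cost in EUR) for a given output volume."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Assumption: delivered throughput scales linearly with utilization.
    tokens_per_sec = (
        model.tokens_per_sec_at_reference * utilization / REFERENCE_UTILIZATION
    )
    tokens_per_gpu_month = tokens_per_sec * 3600 * HOURS_PER_MONTH
    gpus = max(1, math.ceil(output_tokens / tokens_per_gpu_month))
    gpu_rent_eur = gpus * model.gpu_usd_per_hour * HOURS_PER_MONTH * USD_TO_EUR
    return gpus, gpu_rent_eur + ops_hours * OPS_RATE_EUR


small_gpus, small_cost = self_hosted_monthly_eur(10e6, CLASS_70B)
large_gpus, large_cost = self_hosted_monthly_eur(5e9, CLASS_70B)
print(f"10M output tokens: {small_gpus} GPU, {small_cost:.0f} EUR/month")
print(f"5B output tokens:  {large_gpus} GPUs, {large_cost:.0f} EUR/month")
```

The point of the sketch is the cost shape: self-hosting has a fixed floor of one GPU plus ops time no matter how few tokens you send, and then grows in GPU-sized steps. The API path grows linearly from zero, so the crossover sits where the linear line passes the first step.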