GPU Cloud Comparison for AI Inference: 2026 Reality Check

April 4, 2026 · 13 min read · gpu, cloud, ai-inference, comparison, infrastructure

You want to run LLM inference in 2026 and the GPU cloud market has fragmented into roughly three camps: developer-first hourly clouds (Lambda, RunPod, Vast.ai), enterprise Kubernetes clouds (CoreWeave, AWS, GCP, Azure), and fixed-price European hosts (Hetzner, Nebius). The right pick depends less on the raw dollar-per-hour number and more on your utilization pattern, your compliance story, and your network egress shape.

This is the GPU cloud comparison for AI inference that engineers actually need when planning production workloads. I will not pretend there is one winner. The honest answer is that Hetzner dominates for always-on L40S-class inference in the EU, RunPod Secure is the sweet spot for spiky workloads, CoreWeave and the hyperscalers are the only real answer for compliance-heavy H100 SXM, and Vast.ai only earns a spot in the experimentation phase.

Prices in this guide are snapshots from April 2026 and vary by region and inventory. Verify current rates before you sign anything longer than a month.

The providers worth considering in 2026

A short-list of who actually matters for production inference right now, and what they are known for.

  • Lambda Labs: on-demand and reserved, strong H100 and A100 inventory, clean CLI, no managed k8s. US-heavy.
  • RunPod: Community Cloud (consumer boards, price-aggressive) plus Secure Cloud (data-center grade). Serverless GPU product for pay-per-request inference. The developer-favorite in 2026.
  • Vast.ai: marketplace of third-party hosts. Variable quality. Good for experimentation, poor for production SLAs.
  • CoreWeave: enterprise k8s-native, H100 heavy, strong networking (NVLink, InfiniBand), SOC 2 Type II. Limited developer-friendliness.
  • Paperspace (part of DigitalOcean): competitive on L4 and A100, developer UX, global footprint inherited from DO.
  • Hetzner: fixed monthly billing, Falkenstein and Helsinki data centers, no H100 but RTX 4000 Ada, L4, L40S, RTX 6000 Ada. Best-in-class egress allowance.
  • AWS EC2 (G5, G6, P4d, P5): full control, most expensive, every compliance box you could want.
  • GCP (A3, A3 Mega with H100 SXM): strong if you already live on GCP, tight integration with Vertex AI and GKE.
  • Azure (ND H100 v5): enterprise sales cycle, procurement-friendly for regulated firms.
  • Fluidstack: newer aggressive-pricing entrant, reserved capacity focus.
  • Nebius: European GPU cloud, growing presence, H100 and L40S inventory in Finland.

If you are choosing between Hetzner and hyperscaler GPU for predictable inference, I cover that decision in depth in Hetzner vs AWS for AI workloads.

Price snapshot per GPU type

Hourly and fixed-monthly pricing per GPU class, April 2026. This is what you pay per GPU, not per instance, and excludes egress and storage.

| GPU | Hetzner | RunPod Community | RunPod Secure | Lambda | CoreWeave | AWS on-demand | GCP on-demand |
|---|---|---|---|---|---|---|---|
| RTX 4000 Ada (20GB) | EUR 184/mo fixed | $0.29/hr | n/a | n/a | n/a | n/a | n/a |
| L4 (24GB) | EUR 199/mo fixed | $0.39/hr | $0.44/hr | n/a | n/a | $0.80/hr (g6) | $0.70/hr |
| L40S (48GB) | EUR 349/mo fixed | $0.79/hr | $1.20/hr | $1.40/hr | $1.50/hr | n/a | n/a |
| RTX 6000 Ada (48GB) | EUR 439/mo fixed | $0.77/hr | $1.19/hr | n/a | n/a | n/a | n/a |
| A100 40GB | n/a | $1.19/hr | $1.50/hr | $1.29/hr | $1.65/hr | $3.06/hr (p4d) | $2.93/hr |
| A100 80GB | n/a | $1.49/hr | $1.89/hr | $1.79/hr | $1.85/hr | $4.10/hr | $3.67/hr |
| H100 PCIe 80GB | n/a | $1.99/hr | $2.50/hr | $2.49/hr | $2.23/hr | n/a | n/a |
| H100 SXM 80GB | n/a | n/a | $3.39/hr | $2.99/hr | $3.10/hr | $4.50/hr (p5) | $3.92/hr |
| B200 (limited) | n/a | waitlist | waitlist | waitlist | available | waitlist | waitlist |

The numbers are moving every quarter. The shape is stable: Community-tier GPUs are 30 to 50 percent below Secure-tier, Secure-tier is 30 to 50 percent below hyperscalers, and Hetzner fixed-monthly beats everyone if utilization is north of ~60 percent.

Workload fit: which provider for which job

There is no universal answer, but there are clear archetypes.

  • Development and experimentation: Vast.ai spot nodes, RunPod Community, Paperspace hourly notebooks. Pre-emption is fine because you are iterating.
  • Staging and pre-production: RunPod Secure or Lambda on-demand. You want a real SLA but not yet a commit.
  • Production inference, always-on: Hetzner GPU monthly for L4, L40S, RTX 6000 Ada class. Or reserved Lambda/CoreWeave for A100/H100 class. Fixed cost beats hourly once utilization crosses about 60 percent.
  • Enterprise compliance: CoreWeave for k8s-native SOC 2. AWS, GCP, Azure for HIPAA, FedRAMP, GDPR-certified regions.
  • Peak burst / traffic spikes: RunPod Serverless, Lambda on-demand, or AWS spot G6 instances. Billed per second or per request.
  • Training runs requiring NVLink: CoreWeave H100 SXM clusters, AWS P5, GCP A3 Mega. The fabric matters as much as the GPU.

If you are weighing the build-it-yourself path against a managed inference API for some of these tiers, self-hosted LLM vs API walks through the decision matrix.

Feature comparison matrix

GPU hourly price is one data point. The operational features often matter more for production.

| Feature | Lambda | RunPod | Vast.ai | CoreWeave | Hetzner | AWS | GCP | Azure |
|---|---|---|---|---|---|---|---|---|
| Hourly billing | yes | yes | yes | yes | no | yes | yes | yes |
| Per-second billing | no | yes (Serverless) | no | yes | no | yes | yes | yes |
| Fixed monthly | reserved only | no | no | commit | yes | reserved | commit | commit |
| Spot / pre-emptible | no | yes (Community) | yes | no | no | yes | yes | yes |
| Managed Kubernetes | no | no | no | yes (native) | no | EKS | GKE | AKS |
| BYO container | yes | yes | yes | yes | yes | yes | yes | yes |
| Pre-built CUDA/PyTorch images | yes | yes | yes | yes | no | yes | yes | yes |
| Reserved discounts | yes (30-40%) | Savings Plans (2026) | n/a | custom | n/a (low list) | yes (~30%) | yes | yes |
| Terraform / Pulumi provider | yes (community) | yes | no | yes | yes (hcloud) | yes | yes | yes |
| US regions | yes | yes | yes | yes | no | yes | yes | yes |
| EU regions | limited | yes | yes | yes | yes (DE, FI) | yes | yes | yes |
| Asia regions | no | yes | yes | no | no | yes | yes | yes |
| GDPR / EU data residency | partial | yes | varies by host | yes | yes (native) | yes | yes | yes |
| SOC 2 | yes | Secure only | no | yes | no (ISO 27001) | yes | yes | yes |
| HIPAA BAA | no | no | no | yes | no | yes | yes | yes |
| Typical H100 queue time | minutes | minutes | hours to days | minutes | n/a | minutes to hours | minutes | hours to days |
| Egress included | none | internal free | varies | contract | 20 TB | none | none | none |

Provider deep dive

Lambda Labs

Lambda is the straightforward pick. The inventory for H100 PCIe and A100 is consistently available in the US, the CLI (lambda-cloud) is pleasant, and their 1-Click Clusters give you multi-node setups (NVLink within a node, InfiniBand between nodes) without a procurement call. Reserved pricing kicks in around 30 to 40 percent off list at one-year terms.

The gaps: no managed Kubernetes, no EU region coverage worth committing to (a small Netherlands presence but inventory is thin), no HIPAA. If you need compliance beyond SOC 2, Lambda is not the answer.

I use Lambda for training-adjacent workloads where I want hourly billing on a trusted platform and do not need k8s. For production always-on inference, I tend to move off Lambda onto either Hetzner (if L40S-class is enough) or CoreWeave (if I need H100 SXM at scale).

RunPod

RunPod is the developer favorite in 2026. Two tiers matter. Community Cloud runs on third-party hosts with consumer and prosumer GPUs at aggressive prices, and it is fine for development and low-criticality inference. Secure Cloud runs in Tier III data centers and is the tier that carries the SOC 2 attestation.

The killer feature is RunPod Serverless GPU. You deploy a container, RunPod scales it from zero to N based on incoming requests, and you pay per second of GPU time used. For bursty inference traffic this is the most cost-efficient model I have found, though cold starts on large models (30B+) are still measured in tens of seconds.
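
For a sense of what that looks like in practice, here is a minimal worker sketch using the runpod Python SDK's handler pattern. The "inference" inside the handler is a placeholder, not a real model stack:

```python
# Minimal RunPod Serverless worker sketch (pip install runpod).
# In a real worker you would load the model once at module import,
# so only cold starts pay the weight-loading time.
import runpod


def handler(job):
    prompt = job["input"].get("prompt", "")
    # Placeholder inference; replace with a real generate() call.
    return {"output": f"echo: {prompt}"}


# Blocks and polls for jobs; RunPod scales workers from zero to N
# with traffic and bills per second while a worker is active.
runpod.serverless.start({"handler": handler})
```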

Caveats. Community Cloud nodes can and will be pulled by the underlying host with short notice, so do not put production traffic on them. The API is solid but the web console has quirks. Egress is free between RunPod regions but charged to the open internet.

Vast.ai

Vast.ai is a marketplace, not a cloud. You bid on GPU hours from hosts who have listed capacity. The cheapest A100 and H100 hours on the internet live here, but reliability ranges from enterprise-grade to hobbyist-running-a-box-in-a-garage. The filtering UI lets you require data center hosts, verified scores, and uptime history, which narrows the pool to something usable.

My rule: Vast.ai for experimentation, never for production inference that faces paying customers. The money saved is not worth the 2 AM pages when a host reboots.

CoreWeave

CoreWeave is the enterprise k8s-native option. H100-heavy inventory, strong NVLink and InfiniBand fabrics, managed Kubernetes that actually understands GPU workloads (node labels, device plugins, topology-aware scheduling). They sell via contracts with committed capacity, not pure hourly consumption, though hourly is available.

If your team is already operating on Kubernetes and you need H100 SXM at scale with a SOC 2 story, CoreWeave is the default choice. If you are a two-person team running one inference pod, it is overkill and the onboarding will feel like enterprise SaaS procurement.

Hetzner

Hetzner’s GPU servers are dedicated machines billed at a fixed monthly price. The current lineup includes RTX 4000 Ada at EUR 184/mo, L4 at EUR 199/mo, L40S at EUR 349/mo, and RTX 6000 Ada at EUR 439/mo. No A100 or H100. Data centers in Germany and Finland.

For always-on inference that fits in 48GB of VRAM (quantized 70B models, most 30B models, embedding servers, Whisper, Stable Diffusion), Hetzner is unbeatable on effective cost per hour once you hit reasonable utilization. Egress is 20 TB included per server at 1 Gbps, then EUR 1 per TB. No other provider gets close on that dimension.

Trade-offs: setup is provisioning a bare Linux box, installing NVIDIA drivers and CUDA yourself, no Kubernetes, no autoscaling, no managed inference layer. If you want hands-off, this is not it. If you want predictable cost and you are comfortable running systemd services, this is the best value in the market.
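
After the driver and CUDA install, a quick sanity check like this (assuming you install the CUDA build of PyTorch) confirms the GPU is actually visible before you deploy anything:

```python
# Post-setup sanity check on a fresh GPU box: confirms the NVIDIA
# driver and CUDA runtime are visible to PyTorch.
import torch

assert torch.cuda.is_available(), "No CUDA device visible -- check driver install"
props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))          # e.g. "NVIDIA L40S"
print(f"{props.total_memory / 1e9:.0f} GB VRAM")
```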

AWS, GCP, Azure

The hyperscalers charge roughly 1.5 to 2x the Lambda or CoreWeave rate for equivalent GPUs, and the justification is everything around the GPU: VPC networking, IAM, managed services, certifications, and the ability for your procurement team to issue one PO that covers the entire stack.

Pick AWS when you need P5 (H100 SXM) in an eu-west or us-east region with EKS, or when HIPAA BAA and FedRAMP are hard requirements. Pick GCP when you already run on GCP and A3 instances can share a VPC with your existing services. Pick Azure when the company is Microsoft-first and procurement speed matters more than unit economics.

Network egress is often the real cost

GPU-hour price gets all the attention. Egress pricing decides what your invoice looks like when you run a chatty inference API with users streaming responses.

| Provider | Egress pricing | Typical "gotcha" |
|---|---|---|
| AWS | ~$90/TB after 100 GB free | Cross-AZ and cross-region charges stack on top |
| GCP | $80 to $120/TB depending on destination | Egress to internet vs peered networks differs |
| Azure | ~$87/TB after 100 GB free | Zone-to-zone charges in some regions |
| Lambda | ~$0.10/GB ($100/TB) | Flat, predictable |
| RunPod | ~$0.05/GB to internet | Free intra-region, some regional variance |
| CoreWeave | negotiated in contract | Typically below hyperscaler list |
| Hetzner | 20 TB included, EUR 1/TB after | Effectively free for most inference workloads |
| Vast.ai | depends on host | Check per-listing, some hosts cap bandwidth |
| Paperspace | ~$0.05/GB | Inherits DO network pricing |

For a streaming chat API serving a million messages per day at a few KB each, egress is a rounding error. For a RAG system returning large context chunks or a vision model streaming images, egress can exceed GPU spend. I have seen AWS bills where the P4d instance cost EUR 2,200 for the month and egress added EUR 1,800.
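
The back-of-envelope math, with payload sizes as illustrative assumptions:

```python
# Back-of-envelope egress estimate for the two scenarios above.
# Payload sizes are illustrative assumptions, not measurements.

def monthly_egress_tb(requests_per_day: int, kb_per_response: float) -> float:
    """30-day month, decimal units: 1 TB = 1e9 KB."""
    return requests_per_day * 30 * kb_per_response / 1e9

chat_tb = monthly_egress_tb(1_000_000, 4)      # ~4 KB per streamed chat reply
vision_tb = monthly_egress_tb(1_000_000, 500)  # ~500 KB per returned image

for name, tb in [("chat", chat_tb), ("vision", vision_tb)]:
    aws_usd = max(tb - 0.1, 0) * 90            # ~$90/TB after 100 GB free
    hetzner_eur = max(tb - 20, 0) * 1          # 20 TB included, EUR 1/TB after
    print(f"{name}: {tb:.2f} TB/mo -> AWS ~${aws_usd:,.0f}, Hetzner ~EUR {hetzner_eur:,.0f}")

# chat:   0.12 TB/mo -> AWS ~$2 (rounding error), Hetzner EUR 0
# vision: 15.00 TB/mo -> AWS ~$1,341, Hetzner EUR 0 (inside the allowance)
```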

Reserved instances and commits

Hourly list price is the worst price on most of these platforms. If you have six months of certainty about your workload, reserved pricing is where the real economics live.

  • Lambda reserved: 1-year terms at roughly 30 to 40 percent off on-demand, 3-year terms deeper. Minimum term is the commitment; you pay whether you use it or not.
  • AWS Savings Plans and Reserved Instances: 1-year at ~30 percent off, 3-year at ~50 percent off. Convertible versus standard affects flexibility.
  • GCP Committed Use Discounts: similar curve to AWS, with a flexible CUD product that covers instance families.
  • Azure Reserved VM Instances: similar, three-year terms get to ~55 percent off.
  • CoreWeave: custom contracts, typically starting at 6-month commits. Pricing is negotiated per customer.
  • RunPod: Savings Plans rolled out in early 2026, 20 to 30 percent off on-demand for six and twelve month terms.
  • Hetzner: no reserved tier. The list price is already low enough that the discount math does not matter. Cancel monthly if the workload goes away.

The mental model: a reserved term bills every hour, used or not, so a 30 to 40 percent discount only pays off once expected utilization clears roughly 60 to 70 percent of wall-clock hours — call it 100-plus hours per week, sustained for the whole term. Below that, stay hourly and let the platform eat the idle capacity.

The real cost at realistic utilization

Raw hourly price is a trap. The number that matters is your effective monthly cost at your actual utilization. Here is a realistic scenario: one inference pod running 24 hours per day, 7 days per week, for a month (~730 hours).

| Setup | Hourly equivalent | Monthly cost (~730 hrs) |
|---|---|---|
| Hetzner L40S (fixed monthly) | ~$0.51/hr | EUR 349 (~$380) |
| RunPod Secure L40S | $1.20/hr | $876 |
| Lambda L40S on-demand | $1.40/hr | $1,022 |
| AWS G6e (L40S equiv) | ~$2.15/hr | $1,570 |
| CoreWeave L40S | $1.50/hr | $1,095 |

Hetzner wins always-on by a factor of 2 to 4 against hourly clouds at 100 percent utilization. At a third of that (pod running 8 hours per day), the calculus flips: Hetzner is still EUR 349, but RunPod Secure drops to ~$288. Below roughly 35 to 45 percent utilization, depending on which hourly rate you compare against, hourly billing wins. Above that, fixed-monthly wins.
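
The break-even arithmetic behind those thresholds, using the L40S rates from the price snapshot (EUR 349 taken as roughly $380):

```python
# At what utilization does Hetzner's fixed EUR 349/mo (~$380) L40S
# beat hourly billing? Rates from the price snapshot above.
FIXED_MONTHLY_USD = 380.0
HOURS_PER_MONTH = 730

hourly_rates = {
    "RunPod Community": 0.79,
    "RunPod Secure": 1.20,
    "Lambda": 1.40,
    "CoreWeave": 1.50,
    "AWS G6e (equiv)": 2.15,
}

for provider, rate in hourly_rates.items():
    breakeven = FIXED_MONTHLY_USD / (rate * HOURS_PER_MONTH)
    print(f"{provider}: fixed wins above {breakeven:.0%} utilization")

# Community ~66%, Secure ~43%, Lambda ~37%, CoreWeave ~35%, AWS ~24%.
# The "north of ~60 percent" rule of thumb earlier is the break-even
# against Community-tier pricing; against Secure-tier it is lower.
```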

This is exactly the kind of math I run through in LLM API cost comparison when clients try to decide whether self-hosting is even worth it versus a per-token API.

Inference serving stacks

Every major GPU cloud supports vLLM, TGI (text-generation-inference), and Ollama as container images. The actual difference in serving stack is small once you have the GPU. vLLM is the default for high-throughput batched serving, TGI is close behind, Ollama is the go-to for dev convenience and quantized models.
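
For reference, vLLM's offline batching API looks like this; the model name is illustrative and needs to fit your card's VRAM:

```python
# Minimal vLLM batched-inference sketch (pip install vllm).
# Model name is illustrative; pick one that fits your GPU's VRAM.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the trade-offs of fixed vs hourly GPU billing.",
    "When does NVLink actually matter for inference?",
]

# vLLM schedules these together via continuous batching.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```

In production you would more likely run vLLM's OpenAI-compatible HTTP server behind your own API layer rather than the offline API shown here.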

For production k8s deployments of these stacks, the companion piece self-hosted LLM on Kubernetes covers the GPU operator setup, autoscaling, and node topology patterns I use.

If you skip the ops layer entirely and go with managed inference (Fireworks, Together, Replicate, Anyscale), you trade GPU control for per-token pricing in the $0.20 to $2.00 per million tokens range depending on model size. That is often the right answer for teams running fewer than one GPU’s worth of traffic.
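
A quick sanity check on "one GPU's worth of traffic", using the Hetzner L40S figure from earlier and assumed points from that per-token price range:

```python
# When does a dedicated GPU beat per-token managed inference?
# Uses the ~$380/mo L40S figure from above; per-token prices are
# illustrative points from the $0.20-$2.00 per million token range.
GPU_MONTHLY_USD = 380.0

for usd_per_mtok in (0.20, 0.50, 2.00):
    breakeven_mtok = GPU_MONTHLY_USD / usd_per_mtok
    print(f"${usd_per_mtok:.2f}/M tokens: GPU wins above "
          f"~{breakeven_mtok:,.0f}M tokens/month (~{breakeven_mtok / 30:,.0f}M/day)")

# $0.20 -> ~1,900M/mo; $0.50 -> ~760M/mo; $2.00 -> ~190M/mo.
# Ignores ops time and assumes you can keep the GPU busy.
```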

Security and compliance

Certification by provider, as of April 2026:

| Framework | Lambda | RunPod | CoreWeave | Hetzner | AWS | GCP | Azure |
|---|---|---|---|---|---|---|---|
| SOC 2 Type II | yes | Secure tier | yes | no | yes | yes | yes |
| ISO 27001 | partial | no | yes | yes | yes | yes | yes |
| HIPAA BAA | no | no | yes (contract) | no | yes | yes | yes |
| GDPR / EU residency | partial | EU regions | yes | native | eu-* regions | europe-* | EU regions |
| FedRAMP | no | no | High (In Process) | no | GovCloud | Gov regions | Gov cloud |
| Air-gapped / private | no | no | yes (contract) | no | GovCloud | Gov regions | Gov cloud |

For regulated workloads the list thins out fast. HIPAA narrows to AWS, GCP, Azure, and CoreWeave. FedRAMP narrows to AWS GovCloud and Azure Government. GDPR with EU data residency widens it again: Hetzner, Nebius, RunPod EU, and the hyperscalers’ European regions all qualify when configured correctly.

A playbook for choosing

A decision flow that maps common situations to a provider.

  • Do you need EU data residency and nothing else compliance-wise? → Hetzner for predictable workloads, Nebius if you need H100.
  • Do you need H100 SXM with NVLink? → CoreWeave first, Lambda second, AWS P5 if you need hyperscaler compliance.
  • Are you ops-heavy and running on Kubernetes already? → CoreWeave native, or EKS/GKE/AKS with GPU node pools.
  • Are you running always-on inference on L4 or L40S class? → Hetzner monthly. The math wins.
  • Are you serving spiky traffic with quiet hours? → RunPod Serverless or AWS G6 spot.
  • Is procurement long and compliance-heavy? → AWS, GCP, or Azure, whichever your company already has a contract with.
  • Are you still in experimentation mode? → Vast.ai or RunPod Community. Spot-priced, pre-emptible, fine for notebooks.
  • Do you need HIPAA BAA? → AWS, GCP, Azure, or CoreWeave with contract.

One warning on the framing of “lowest price wins”. Reliability matters. A Community Cloud node at half the Secure Cloud price looks great until it gets reclaimed during an inference burst and your customers see 503s. Match the reliability tier to the workload criticality, not the other way around.
