Self-Hosted LLM on Kubernetes: A Production vLLM Deployment

April 5, 2026 · 16 min read · kubernetes, llm, self-hosted, vllm, gpu

Most teams asking about self-hosted LLM Kubernetes deployments should not be running Kubernetes for this at all. The honest answer is that vLLM on a single GPU box, wrapped in systemd or Docker Compose, covers more use cases than anyone wants to admit. Kubernetes earns its keep only when you already run it, or when you need horizontal scaling, multi-tenant isolation, or proper rolling deploys across a GPU node pool.

This guide assumes you have decided Kubernetes is the right answer. I will walk through the reference architecture I use when I deploy LLMs to k8s, full manifests for vLLM serving Llama 3.3 70B quantized, and the operational gotchas that bite everyone the first time. No platform-specific magic, no abstracting behind a managed vendor. Just the YAML you need and the reasoning behind each field.

If you are still on the fence about whether to self-host at all, the self-hosted LLM vs API comparison handles that decision. This post starts at the point where you have already chosen self-hosted and are picking the platform.

When Kubernetes is the right answer

Kubernetes makes sense for LLM serving in four situations, and they are narrower than the Kubernetes community likes to admit.

You already run Kubernetes for the rest of your workloads. Adding an LLM Deployment to an existing cluster is cheaper than standing up a separate stack. You reuse the same RBAC, the same monitoring, the same CI/CD. The marginal cost of one more namespace is small.

You need horizontal scaling of inference pods. One GPU replica is not enough. You see sustained traffic that benefits from three, five, or ten pods running behind a load balancer, and the request rate varies enough that autoscaling matters. Below that threshold, one fat node is simpler.

You need pod-level isolation for multi-tenancy. Different tenants hit different models, or the same model with different rate limits and quotas. Namespaces and NetworkPolicies give you blast-radius isolation that a single server cannot.
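As a sketch of that blast-radius isolation, here is a per-tenant NetworkPolicy that only admits traffic from the ingress controller's namespace. The names are illustrative, and the kubernetes.io/metadata.name label requires Kubernetes 1.21+:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-only
  namespace: tenant-a            # illustrative tenant namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```

With one of these per tenant namespace, pods in one tenant cannot call another tenant's model endpoint directly, only the shared ingress path.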

You want controlled rollouts on GPU-backed nodes. Rolling updates, canary deploys, blue-green for a new vLLM version, all without dropping requests. This is the killer feature when you actually need it.

When it is overkill

Single-model, single-team deployment. One product, one model, one team maintaining it. Kubernetes buys you nothing here. Put vLLM on a GPU VM with systemd, point a reverse proxy at it, done. The Linux VPS AI development setup shows the pattern.

No existing Kubernetes experience. The learning curve for k8s plus GPU plus vLLM is steep. If your team has never run a production cluster, the operational risk dominates any theoretical benefit. Docker Compose covers dev and staging trivially, and the Docker Compose AI development stack is where I start almost every project.

Dev and staging only. There is no scaling requirement. There are no SLOs. Nobody is getting paged. Compose is fine. Kubernetes here is resume-driven architecture.

Latency-sensitive edge deployment. If your users expect sub-100ms TTFT and you are running on general-purpose cloud k8s, the control plane overhead and network hops hurt. Bare metal wins.

The reference architecture

Here is the shape of what we are building. Everything lives in one namespace for simplicity, though in practice you split by environment or team.

llm-serving/
  Namespace          llm-serving
  ConfigMap          vllm-config         (serving args, chat templates)
  Secret             hf-secret           (Hugging Face token)
  PersistentVolumeClaim  model-cache-pvc  (50 to 200 GB, read-many)
  Deployment         vllm-llama3         (2+ replicas, GPU nodeSelector)
  Service            vllm-llama3-svc     (ClusterIP)
  Ingress            vllm-llama3-ing     (NGINX + cert-manager + auth)
  HPA                vllm-llama3-hpa     (scales on queue depth)
  ServiceMonitor     vllm-llama3-metrics (Prometheus scrape)

The boring shape matters. You want each resource to do one thing. The Deployment runs vLLM. The Service gives it a stable in-cluster DNS name. The Ingress handles TLS and auth. The HPA scales. The PVC holds weights so you do not redownload 140 GB on every pod restart. Keep these concerns separated and you can swap any piece without touching the others.

GPU node setup

Before any of the manifests work, the cluster needs to know about GPUs.

Install the NVIDIA Device Plugin as a DaemonSet across your GPU node pool. On most managed Kubernetes offerings this is a one-line Helm install; on bare metal it needs the NVIDIA driver baked into the node image first. After install, kubectl describe node <gpu-node> shows nvidia.com/gpu: 1 (or more) in the capacity block. If it does not, nothing else on this page will work.

Run a dedicated GPU node pool. Mix and match hurts. Cluster autoscalers get confused when GPU pods land on non-GPU nodes, and pod bin-packing across heterogeneous hardware is a losing battle. Pick a single GPU class per pool: T4 and L4 for small models and batch work, L40S or A100 for 70B class with quantization, H100 for full precision serving or training. I run L40S for most 70B quantized deployments. Price-performance is hard to beat for inference.

Taint GPU nodes to keep regular workloads off them. kubectl taint nodes <node> nvidia.com/gpu=present:NoSchedule, then add a matching toleration to your vLLM Deployment spec. GPU nodes cost eight to twenty times what CPU nodes cost. You do not want a sidecar logger squatting on them.

The Hetzner vs AWS for AI workloads comparison covers the provider-level tradeoffs. On AWS EKS, the NVIDIA operator handles most of this. On Hetzner bare metal, you will install the driver and device plugin yourself, and it is worth it for the price delta.

Finally, the cluster autoscaler needs to understand GPU requests. Recent versions do, but double-check. A pod pending on nvidia.com/gpu: 1 should trigger a new GPU node to spin up within a few minutes. If it does not, your autoscaler is not counting GPU resources correctly.

Handling model weights

Llama 3.3 70B is 140 GB in full precision, roughly 40 GB quantized to AWQ. You do not want to download that on every pod start. There are three strategies, each with real tradeoffs.

Init container that downloads from Hugging Face. Simple, no storage infrastructure, works day one. Cold start is slow (5 to 30 minutes depending on model size and link speed), and if HF rate-limits you during a scale-up event, half your pods fail readiness. Fine for experiments and early staging, painful in production.

PersistentVolume shared read-only across pods. Weights live on a PVC, pods mount it, startup drops to the time it takes vLLM to load weights into GPU memory (30 to 120 seconds). The catch is storage class support. ReadWriteMany or ReadOnlyMany access modes are required, and not every storage backend supports them. On AWS, EFS works but adds latency on first read. On Hetzner or bare metal, NFS or Ceph handle it cleanly. A PVC populated once by a one-shot Job is the canonical pattern.

Baked-in container image. Model weights in the image itself. Cold start is as fast as it gets. The image is 40 to 150 GB. Every pull hits the registry hard, every base image bump forces a full rebuild, and most registries charge per-GB egress. Valid for locked-down air-gapped deployments, painful everywhere else.

My recommendation: PersistentVolume for production, init container for experiments, baked image only if you have a compliance reason. Here is the Job that populates the PVC once, before the Deployment ever runs.

apiVersion: batch/v1
kind: Job
metadata:
  name: model-cache-populate
  namespace: llm-serving
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: downloader
          image: python:3.12-slim
          command:
            - /bin/sh
            - -c
            - |
              pip install huggingface_hub
              # HF_HOME=/cache writes the standard hub cache layout,
              # which is what the serving pods expect to find offline
              huggingface-cli download meta-llama/Meta-Llama-3.3-70B-Instruct
          env:
            - name: HF_HOME
              value: /cache
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /cache
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

Run this once, verify the PVC has the weights, then let the Deployment pods mount it read-only.

The full Deployment manifest

This is the core artifact. vLLM serving Llama 3.3 70B quantized, two replicas, L40S nodes, PVC-backed weight cache, Prometheus-ready, with a readiness probe that accounts for vLLM’s slow first load.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
  namespace: llm-serving
  labels:
    app: vllm-llama3
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: vllm-llama3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-L40S"
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.3
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - meta-llama/Meta-Llama-3.3-70B-Instruct
            - --quantization
            - awq_marlin
            - --gpu-memory-utilization
            - "0.9"
            - --max-model-len
            - "32768"
            - --max-num-seqs
            - "64"
            - --disable-log-requests
          ports:
            - name: http
              containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 64Gi
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: 48Gi
              cpu: "4"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: spawn
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 600
            periodSeconds: 30
            failureThreshold: 5
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
      terminationGracePeriodSeconds: 120

A few fields are load-bearing and worth calling out.

nodeSelector: nvidia.com/gpu.product: "NVIDIA-L40S" pins to a specific GPU model. Without it you might land on a T4 and get OOM before weights finish loading. The NVIDIA GPU feature discovery DaemonSet populates this label automatically.

--quantization awq_marlin is how 70B fits on a single 48 GB L40S. Note that vLLM does not quantize on the fly: --model must point at a checkpoint that is already AWQ-quantized, not at the full-precision repo. Without quantization, you need two GPUs and tensor parallelism, which doubles cost. AWQ with the Marlin kernel is the current sweet spot for quality versus throughput on L40S.

--gpu-memory-utilization 0.9 reserves 90% of GPU memory for vLLM’s KV cache. Leave some headroom for CUDA kernels. Pushing to 0.95 works sometimes, then fails under load.

--max-model-len 32768 caps context at 32k tokens. Llama 3.3 supports 128k, but larger context eats KV cache and cuts throughput. Match this to your actual longest prompt.

HF_HUB_OFFLINE: "1" forces vLLM to use the PVC cache only, no network calls to Hugging Face at runtime. Prevents the nightmare scenario where HF is down and your pods cannot start.

emptyDir: medium: Memory for /dev/shm gives vLLM’s multiproc workers enough shared memory. Default shm is 64 MB. vLLM will segfault on anything nontrivial without this.

readinessProbe.initialDelaySeconds: 180 plus failureThreshold: 30 gives the pod 180 seconds before probing starts, then up to 5 minutes of probe failures before giving up. Weight loading is slow even from a PVC.
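An alternative worth considering: since Kubernetes 1.18, a startupProbe can hold off the readiness and liveness probes entirely until the first weight load finishes, which reads cleaner than a large initialDelaySeconds. A sketch against the same /health endpoint:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 90    # up to 15 minutes for the first weight load
```

With this in place, the readiness and liveness probes can drop their initial delays, because they only start firing after the startup probe succeeds.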

terminationGracePeriodSeconds: 120 lets in-flight requests drain on rolling updates. Default 30 seconds will cut requests short.
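One companion resource the manifest above does not include: node drains and cluster upgrades can evict both replicas at once, and a PodDisruptionBudget prevents that.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-llama3-pdb
  namespace: llm-serving
spec:
  minAvailable: 1            # never drain below one serving pod
  selector:
    matchLabels:
      app: vllm-llama3
```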

The Deployment pairs with a matching PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  storageClassName: nfs-client

Storage class matters. Use a backend that supports ReadOnlyMany or ReadWriteMany. EFS, NFS, Ceph, or CSI drivers that advertise multi-attach. Default AWS gp3 will not work here, it is RWO only.

Service, Ingress, and auth

Internal traffic hits a ClusterIP. External traffic goes through Ingress with TLS and authentication. Never expose vLLM’s OpenAI-compatible API raw to the internet.

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-svc
  namespace: llm-serving
spec:
  type: ClusterIP
  selector:
    app: vllm-llama3
  ports:
    - name: http
      port: 80
      targetPort: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-llama3-ing
  namespace: llm-serving
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/auth-url: "http://auth-proxy.llm-serving.svc.cluster.local/verify"
    nginx.ingress.kubernetes.io/auth-response-headers: "X-Tenant-Id"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm.example.com
      secretName: vllm-tls
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-llama3-svc
                port:
                  number: 80

Two small things that trip people up. The proxy timeouts need to be long enough for streaming responses, otherwise NGINX kills the connection mid-generation. 300 seconds is usually enough. And the auth-url annotation wires in a separate auth service that validates API tokens or OIDC and returns 200 or 401. Rolling your own is fine for internal use. For anything customer-facing, an OIDC proxy like oauth2-proxy handles the edge cases.

If your cluster sits behind a cloud LoadBalancer, watch out for idle connection timeouts. AWS ALB defaults to 60 seconds, which kills long generations. Bump it to 300 or more. GCP is similar. On Hetzner Load Balancer, the default is fine.

Horizontal pod autoscaling

The naive approach is autoscaling on CPU. It sort of works because vLLM burns host CPU proportional to request load, but it is a lagging indicator and you end up scaling too late. The better signal is queue depth.

vLLM exposes vllm:num_requests_waiting at /metrics. Prometheus scrapes it. A Prometheus Adapter exposes it as a custom metric to the HPA.
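The adapter needs a rule that maps the colon-separated vLLM series to a name the HPA can reference. A sketch in Prometheus Adapter config syntax, assuming the scrape attaches namespace and pod labels:

```yaml
rules:
  - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "vllm:num_requests_waiting"
      as: "vllm_num_requests_waiting"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```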

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-llama3-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300

The scale-up is aggressive because new GPU pods take minutes to come online, so you want to react fast. The scale-down is slow because you do not want to churn expensive GPU nodes. Ten minutes of sustained low queue depth before shedding a replica is a sensible starting point.

If you cannot set up a custom metrics adapter, CPU-based HPA works as a rough proxy in a pinch. Target 70% CPU, accept that you will scale slightly late, move on.

Monitoring vLLM

vLLM exposes Prometheus metrics at /metrics out of the box. The ones I watch:

  • vllm:e2e_request_latency_seconds histogram. p95 and p99 are the SLO metrics.
  • vllm:time_to_first_token_seconds. Streaming UX depends on this.
  • vllm:generation_tokens_total rate. Tokens per second across all pods.
  • vllm:num_requests_running and vllm:num_requests_waiting. Queue health.
  • vllm:gpu_cache_usage_perc. KV cache pressure. Approaches 100% under load, stays there if you are overcommitted.
  • DCGM_FI_DEV_GPU_UTIL from the NVIDIA DCGM exporter. Actual GPU utilization.
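The ServiceMonitor from the architecture list, sketched for the Prometheus Operator. Two assumptions to adapt: the release: prometheus label must match your Prometheus instance's serviceMonitorSelector, and the Service needs an app: vllm-llama3 label for the selector to find it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-llama3-metrics
  namespace: llm-serving
  labels:
    release: prometheus        # assumption: matches your Prometheus selector
spec:
  selector:
    matchLabels:
      app: vllm-llama3         # assumption: the Service carries this label
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```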

Grafana’s community vLLM dashboard on grafana.com drops in as a template. Start there, then tune panels for your SLOs. The systemd services for AI servers post has the metrics playbook for the non-Kubernetes variant, and most of it ports over.

Alerts I run:

  • p95 latency above SLO for 5 minutes.
  • num_requests_waiting above 20 for 3 minutes (HPA is too slow).
  • GPU utilization above 95% sustained (running hot, need to scale).
  • Pod restart rate above 1 per hour (something is crashing).
  • PVC nearing capacity (model cache growing unexpectedly).
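With the Prometheus Operator, the queue-depth alert from that list looks roughly like this; the thresholds are the starting points above, tune them to your SLOs:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-llama3-alerts
  namespace: llm-serving
spec:
  groups:
    - name: vllm-serving
      rules:
        - alert: VLLMQueueBacklog
          expr: sum(vllm:num_requests_waiting) > 20
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "vLLM queue depth above 20 for 3m; HPA may be scaling too slowly"
```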

Multi-model strategies

Running more than one model complicates the picture.

One Deployment per model. Cleanest. Each model gets its own replicas, its own scaling curve, its own resource envelope. Wasteful if some models handle 95% of traffic and others are used twice a day. Great if usage is balanced.

Router layer with LiteLLM or similar. One external endpoint, LiteLLM routes to the right backend based on model name in the request. Adds a hop and a failure mode, but lets you present a clean OpenAI-compatible API that fronts any number of internal deployments. I use this pattern whenever there are more than two models.
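A minimal LiteLLM proxy config for that pattern, pointing at the in-cluster Service from earlier. The second entry is purely illustrative, and the openai/ prefix tells LiteLLM the backend speaks the OpenAI-compatible API:

```yaml
model_list:
  - model_name: llama-3.3-70b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.3-70B-Instruct
      api_base: http://vllm-llama3-svc.llm-serving.svc.cluster.local/v1
      api_key: dummy           # vLLM ignores this unless started with --api-key
  - model_name: small-model    # illustrative second backend
    litellm_params:
      model: openai/some-org/some-smaller-model
      api_base: http://vllm-small-svc.llm-serving.svc.cluster.local/v1
      api_key: dummy
```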

Dynamic model loading. vLLM does not hot-swap models. SGLang and some newer serving stacks do, with real tradeoffs on first-request latency. Worth evaluating if you have a long tail of rarely-used models and cannot justify dedicated replicas.

In practice I mix the first two. Deployments for high-traffic models, LiteLLM router in front, fall back to managed inference (Fireworks, Together, Replicate) for the tail. The production AI agent architecture guide walks through the routing logic in more detail.

Cost modeling at k8s scale

Rough numbers for two L40S nodes running 24/7:

Component                    Monthly cost
2x L40S instances (cloud)    $1400 to $1800
Cluster control plane        $75 (EKS, GKE) or $0 (bare metal)
Storage (PVC, 200 GB)        $20 to $40
LoadBalancer                 $20 to $30
Egress                       highly variable
Total baseline               $1500 to $1900

Compare that to API spend at equivalent throughput. A single L40S running Llama 3.3 70B AWQ hits roughly 800 to 1500 tokens per second sustained, depending on prompt shape. Two replicas at 60% utilization is maybe 100 million tokens per day. At $0.60 per million input plus $2 per million output blended for a mid-tier API, that is $60 to $200 per day, roughly $1800 to $6000 per month.

The crossover is usually around 30% utilization. Below that, an API is cheaper. Above 60%, self-hosted wins comfortably, assuming you actually need 70B quality. At 100% utilization on cheaper hardware, you are looking at a 5x to 10x cost reduction versus API spend.

These numbers move constantly. The Hetzner vs AWS for AI workloads breakdown goes deeper on provider-level pricing, and the GPU cloud comparison handles dedicated GPU rentals where the math shifts again.

Managed inference (Fireworks, Together, Replicate) is the middle option. You pay per token like an API, but on open models. If your open model choice is locked in and you do not want the ops, this skips Kubernetes entirely. It is often the right answer for teams of one or two engineers.

Gotchas

First pod startup takes 10 to 30 minutes when weights download on start, and a minute or two even from a warm PVC. Readiness probe initialDelaySeconds and failureThreshold must cover it. Watch the pod logs on first deploy. If vLLM logs show it is still loading weights when Kubernetes gives up, bump the failure threshold.

OOM when loading 70B without quantization. Single L40S is 48 GB. 70B fp16 is 140 GB. Use --quantization awq_marlin or --quantization gptq. If you insist on full precision, tensor-parallel across two GPUs with --tensor-parallel-size 2 and matching nvidia.com/gpu: 2 in resources.
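The full-precision variant changes only these fragments of the Deployment above. Note that fp16 70B needs two 80 GB GPUs (A100 or H100), not two L40S:

```yaml
args:
  - --model
  - meta-llama/Meta-Llama-3.3-70B-Instruct
  - --tensor-parallel-size
  - "2"
resources:
  limits:
    nvidia.com/gpu: 2          # both GPUs must sit on the same node
  requests:
    nvidia.com/gpu: 2
```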

vLLM ships breaking changes. Pin the image. vllm/vllm-openai:latest in production is asking for a 3 AM page when a new release lands. Pin to a tag like v0.7.3 and upgrade deliberately.

ReadWriteMany is not universal. Default cloud block storage is RWO. You need EFS, NFS, Ceph, or a CSI driver that advertises multi-attach. Check your storage class before designing around shared PVCs.

LoadBalancer long-connection timeouts. AWS ALB and GCP LB kill idle connections at 60 seconds by default. Streaming responses look idle from the LB’s perspective. Bump the idle timeout to 300+ or use NGINX Ingress with matching proxy-read-timeout.

HF rate limits during scale events. If you are using init-container downloads and scale from 2 to 8 pods, all 6 new pods hit HF simultaneously. They will 429. PVC-backed cache solves this. Keep the init-container pattern for dev only.

KV cache pressure under bursty load. When vllm:gpu_cache_usage_perc hits 100%, vLLM starts preempting requests. Latency spikes hard. If you see this regularly, either scale horizontally or drop --max-model-len.

Node autoscaler confusion on GPU requests. Older Cluster Autoscaler versions do not count nvidia.com/gpu correctly. Pods go pending forever. Check your autoscaler version and logs, and test scale-up from zero GPU nodes before you rely on it.

When to give up on self-hosted

There is a clear threshold, and it is worth naming.

Your cost analysis shows an API is cheaper at your actual volume. If you are below 30% GPU utilization with no growth in sight, stop. Move to an API or managed inference. The engineering time you spend tuning k8s is worth more than the savings.

Open model quality no longer meets your bar. If Claude Sonnet 4.6 or GPT-5 solves your task and Llama 3.3 70B does not, the quality gap wins. Pay for the API, focus on prompts and evals, revisit self-hosting when open models close the gap again.

Ops cost exceeds savings. One full-time engineer spending 20% of their time on vLLM operations is $5000 to $8000 per month in loaded cost. If self-hosting saves less than that per month, do not self-host.

Regulatory requirements change. Sometimes you start self-hosted for data sovereignty, then your compliance team approves a vendor with the right certs and region. Switch. There is no trophy for running your own GPU cluster.

The honest pattern: start with an API, move to managed inference when you need open models, move to self-hosted Kubernetes only when scale and economics both justify it. Most teams never need the third step, and the ones that do usually know long before they write the first manifest.
