Docker Compose AI ML Development Stack: Local LLM, Vector DB, Full YAML

March 20, 2026 · 11 min read · docker, ai-development, ollama, qdrant, devops

Every AI project I start now begins the same way: docker compose up -d and I have Ollama, Qdrant, Postgres, Redis, and a LiteLLM proxy running in under two minutes. No pyenv conflicts, no homebrew drift, no “works on my machine”. One YAML file, one command, identical stack across my laptop and my dev VPS.

This is a tutorial for a full docker compose AI ML development stack. Copy the YAML, run it, pull a model, and start building. I use this exact layout for prototyping RAG pipelines, testing MCP servers, and running my cron-driven Claude agents before they ship to production.

The stack fits on a 16 GB laptop. It also scales to a Hetzner CCX43 for a shared team dev environment. What it is not: a production setup. For self-hosted LLM inference at scale, jump to the Kubernetes guide. This one is for the loop you actually iterate in.

Why Docker Compose for AI dev

I tried three paths before settling on Compose for a local AI dev environment.

Bare-metal install worked until I had three Python versions fighting over torch and an Ollama binary that had silently moved its model cache. Every new project reset my dev box. CUDA versions drifted. Pip installs cross-contaminated. When I wiped the machine to start over, I lost two days rebuilding my environment.

Kubernetes was the opposite problem. Minikube and k3d work, but the overhead of writing manifests, managing namespaces, ingresses, service accounts, and port-forwarding for a project I might delete in a week is not worth it. Kubernetes belongs in production, where operational cost pays for itself in reliability and autoscaling. It does not belong in the tight inner loop of trying a new model or swapping a vector DB.

Docker Compose sits in the middle. One file describes the stack. docker compose up starts it. docker compose down stops it. docker compose down -v nukes every volume and lets me start clean. The docker ai stack pattern is the fastest feedback loop I have found, and the same YAML moves to a staging VPS when I need to share it with someone. Compose is the shortest path between “idea” and “running code that touches a real LLM, a real vector DB, and a real Postgres”.

Three concrete wins:

  • Repeatable. A new engineer clones the repo, runs one command, and the stack is identical to mine. Onboarding goes from half a day of install-README-debugging to ten minutes waiting for images to pull.
  • Isolated. Nothing pollutes the host. System Python stays clean. Node versions do not collide. If I trash a database in dev, I trash a Docker volume, not my workstation.
  • Disposable. Teardown is instant. I can blow the whole thing away and rebuild in thirty seconds. This matters more than it sounds. The cost of experimenting drops to near zero, so I experiment more.

What we are building

The target is a local AI dev environment with eight services. Four are core. The rest are optional, but I almost always include them.

Core services:

  • Ollama. Local LLM runner. Pulls GGUF-quantized models and serves them on port 11434, including an OpenAI-compatible /v1 endpoint. Handles Llama 3.3, Qwen 2.5, Mistral, Phi, and anything else on the Ollama registry.
  • Qdrant. Vector DB. I use it for RAG prototyping, embedding search, and semantic task lookup. Web dashboard on port 6333.
  • Postgres 16. General relational store. Every AI app eventually needs one for tasks, users, job queues, audit logs.
  • Redis 7. Cache layer and background queue. Pairs with BullMQ, Celery, or Sidekiq.

Optional but recommended:

  • LiteLLM proxy. Unified OpenAI-compatible API in front of Ollama, Claude, and OpenAI. Swap providers with a config change, not a code change.
  • Grafana + Prometheus. Observability. Prometheus scrapes metrics from LiteLLM and Postgres. Grafana dashboards show tokens per second, cache hit rate, queue depth.
  • n8n. Visual automation for testing agent workflows before I rewrite them as bash or Go.

That covers local LLM serving via Ollama, vector search, relational data, caching, and metrics. Everything a prototype needs.

The full docker-compose.yml

Save this as docker-compose.yml at the root of your project. It is the complete stack.

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ai-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    restart: unless-stopped
    # Remove the deploy block if you have no NVIDIA GPU.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  qdrant:
    image: qdrant/qdrant:latest
    container_name: ai-qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant:/qdrant/storage
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    container_name: ai-postgres
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_USER: ${POSTGRES_USER:-aidev}
      POSTGRES_DB: ${POSTGRES_DB:-ai_dev}
    ports:
      - "5432:5432"
    volumes:
      - pg:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-aidev}"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    container_name: ai-redis
    ports:
      - "6379:6379"
    volumes:
      - redis:/data
    restart: unless-stopped

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: ai-litellm
    ports:
      - "4000:4000"
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
    volumes:
      - ./litellm.config.yaml:/app/config.yaml:ro
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - ollama
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: ai-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: ai-grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
    volumes:
      - grafana:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

  n8n:
    image: n8nio/n8n:latest
    container_name: ai-n8n
    ports:
      - "5678:5678"
    environment:
      N8N_BASIC_AUTH_ACTIVE: "true"
      N8N_BASIC_AUTH_USER: ${N8N_USER:-admin}
      N8N_BASIC_AUTH_PASSWORD: ${N8N_PASSWORD}
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: postgres
      DB_POSTGRESDB_DATABASE: ${POSTGRES_DB:-ai_dev}
      DB_POSTGRESDB_USER: ${POSTGRES_USER:-aidev}
      DB_POSTGRESDB_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - n8n:/home/node/.n8n
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped

volumes:
  ollama:
  qdrant:
  pg:
  redis:
  prometheus:
  grafana:
  n8n:

And the .env file next to it:

# .env
POSTGRES_USER=aidev
POSTGRES_PASSWORD=changeme_local_only
POSTGRES_DB=ai_dev

LITELLM_MASTER_KEY=sk-local-dev-key
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

GRAFANA_PASSWORD=admin
N8N_USER=admin
N8N_PASSWORD=changeme_local_only

Minimal litellm.config.yaml:

model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://ollama:11434
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

Minimal prometheus.yml:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: litellm
    static_configs:
      - targets: ["litellm:4000"]

First-run commands

From the project directory:

docker compose up -d
docker compose ps

You should see eight containers running. Next, pull a model into Ollama. Start small. A 3B model fits on any modern laptop and is fast enough for iteration:

docker exec -it ai-ollama ollama pull llama3.2:3b

If you have a GPU with 24 GB or more of VRAM, pull a bigger one:

docker exec -it ai-ollama ollama pull llama3.3:70b-instruct-q4_K_M

Verify Ollama responds:

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2:3b","prompt":"What is Docker Compose in one sentence?","stream":false}'

Check Qdrant:

curl http://localhost:6333/collections

Confirm LiteLLM is proxying correctly:

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-local-dev-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"local-llama","messages":[{"role":"user","content":"ping"}]}'
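The same check works from application code. A standard-library sketch against the LiteLLM proxy above; build_chat_request is my own helper name, not a LiteLLM API:

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the LiteLLM proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("http://localhost:4000", "sk-local-dev-key", "local-llama", "ping")
# With the stack running, uncomment to send it:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the proxy speaks the OpenAI wire format, any OpenAI SDK pointed at base_url http://localhost:4000 works the same way.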

Grafana is at http://localhost:3000, Qdrant dashboard at http://localhost:6333/dashboard, n8n at http://localhost:5678.

GPU support notes

Linux with NVIDIA is the happy path. Install the NVIDIA Container Toolkit first:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify GPU visibility from inside the container:

docker exec -it ai-ollama nvidia-smi

On macOS, Docker Desktop has no GPU passthrough. If you are on an M-series Mac and care about inference speed, run Ollama natively on the host (brew install ollama), point the rest of the Compose stack at http://host.docker.internal:11434, and delete the ollama service from the YAML. You keep the containerized vector DB and data services without fighting Metal.
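A docker-compose.override.yml sketch for that macOS setup, assuming Ollama runs natively on the host. The extra_hosts line is a no-op on Docker Desktop (where host.docker.internal already resolves) but required on Linux:

```yaml
# docker-compose.override.yml — point containers at a host-run Ollama
services:
  litellm:
    environment:
      OLLAMA_API_BASE: http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"
```

Also change api_base in litellm.config.yaml to the same URL, since the ollama service name no longer resolves once you delete it from the stack.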

CPU-only is viable for 3B to 7B models. Expect 5 to 20 tokens per second on a modern Ryzen or M-series. Fine for dev, painful for anything interactive.

Adding dev services for your app code

Compose is not just for infra. Your own app runs here too. Add a service that mounts your source and hot-reloads.

TypeScript/Node example:

  app:
    image: node:22-alpine
    working_dir: /app
    volumes:
      - ./src:/app/src
      - ./package.json:/app/package.json
      - ./tsconfig.json:/app/tsconfig.json
      - node_modules:/app/node_modules
    command: sh -c "npm install && npm run dev"
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://aidev:${POSTGRES_PASSWORD}@postgres:5432/ai_dev
      REDIS_URL: redis://redis:6379
      OLLAMA_URL: http://ollama:11434
      LITELLM_URL: http://litellm:4000
    depends_on:
      - postgres
      - redis
      - ollama

One gotcha: node_modules here is a named volume, so add it to the top-level volumes: block or Compose will refuse to start with an undefined-volume error. The Python/FastAPI version:

  app:
    image: python:3.12-slim
    working_dir: /app
    volumes:
      - ./:/app
    command: sh -c "pip install -r requirements.txt && uvicorn main:app --host 0.0.0.0 --reload"
    ports:
      - "8000:8000"
    depends_on:
      - postgres
      - qdrant
      - ollama

Inside your app, point the Ollama client at http://ollama:11434, not localhost. Docker Compose sets up a network where service names resolve as hostnames.

Common dev patterns

Four patterns I use on nearly every project.

RAG prototyping with Qdrant. Ingest local docs, embed with a small sentence-transformers model (or Ollama’s nomic-embed-text), write vectors to Qdrant. Query with a filter, inspect results in the Qdrant dashboard at http://localhost:6333/dashboard, then wire the retrieved chunks into a prompt through LiteLLM. The full walkthrough is in the RAG pipeline tutorial. For picking the right vector DB beyond dev, see Qdrant vs Pinecone vs Weaviate.
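The ingest step of that pattern, sketched with the standard library only. The chunk sizes are arbitrary dev defaults, the function names are mine, and /api/embeddings with nomic-embed-text assumes you have pulled that model into Ollama:

```python
import json
import urllib.request

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Greedy fixed-size character chunks with overlap; good enough for dev."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed_request(text: str) -> urllib.request.Request:
    """Build a request to Ollama's embedding endpoint for one chunk."""
    body = json.dumps({"model": "nomic-embed-text", "prompt": text}).encode()
    return urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=body, headers={"Content-Type": "application/json"}, method="POST",
    )

chunks = chunk("some long document " * 200)
# With the stack up, each response carries a 768-dim vector ready for Qdrant:
# vectors = [json.load(urllib.request.urlopen(embed_request(c)))["embedding"] for c in chunks]
```

Character chunking is crude but deterministic, which is what you want while debugging retrieval; swap in a token-aware splitter once the pipeline shape settles.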

LiteLLM swap between local and cloud. In dev I hit local-llama through LiteLLM. In staging I flip the model name to claude-sonnet. My app code does not change. One env var, different backend. This is the single most useful piece of the whole stack. It means the same codebase can run offline on a plane with a 3B model, or in a paid pilot on Claude Sonnet, with zero code churn.
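In app code that swap is one environment variable. A sketch, where LLM_MODEL is my own variable name, not a LiteLLM convention:

```python
import os

# Same code path in dev and staging; only the model alias changes.
# Dev: LLM_MODEL unset, falls back to the local Ollama model.
# Staging: LLM_MODEL=claude-sonnet, routed to Anthropic by LiteLLM.
MODEL = os.environ.get("LLM_MODEL", "local-llama")

def completion_payload(prompt: str) -> dict:
    """Request body for the LiteLLM proxy; the alias decides the backend."""
    return {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
```

The aliases live in litellm.config.yaml, so adding a provider never touches application code.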

Compose as the MCP host. I run Claude Code on the host and point it at the containerized Postgres via the postgres MCP server. Combined with the filesystem MCP and a sqlite MCP, Claude Code can query real data during development without any cloud round trip. Schema introspection, row counts, sample queries, all driven from within my editor. Pair this with the Ollama container and Claude Code can even fall back to a local model when I am rate limited.
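The wiring for that is a few lines of MCP config. A sketch of the Postgres entry for a project-scoped .mcp.json, using the reference @modelcontextprotocol/server-postgres package and the credentials from the .env above; since Claude Code runs on the host, it connects through the published localhost:5432 port, not the Compose service name:

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://aidev:changeme_local_only@localhost:5432/ai_dev"
      ]
    }
  }
}
```

Verify the exact file location and server package against your Claude Code version; the MCP ecosystem moves quickly.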

Host your Linux VPS the same way. I reuse this Compose file on the VPS for shared team dev. Same services, same ports, different .env. The Linux VPS AI development setup guide covers firewall, reverse proxy, and TLS on top of this stack. The key insight: dev and staging share infrastructure code, so nothing surprises me when I push the same Compose to a bigger box.

Extending the stack

Things I add per project, not always:

  • JupyterLab for notebook exploration: jupyter/scipy-notebook, mount ./notebooks, port 8888.
  • MinIO for S3-compatible object storage when I am testing file pipelines offline: minio/minio, ports 9000 and 9001.
  • pgvector extension in Postgres when I want embeddings colocated with relational data. Swap the image to pgvector/pgvector:pg16.
  • Traefik as a reverse proxy once the stack has more than three HTTP services. Gives every service a nice *.localhost hostname.
  • n8n for agent workflow prototyping. I use n8n to sketch agent flows before rewriting them as code. If you want to self-host n8n for real, the n8n self-hosting guide covers the production path.

Cleanup

Stop everything, keep data:

docker compose down

Stop everything and delete volumes:

docker compose down -v

Reclaim disk space from unused images and build cache:

docker system prune -a

Ollama model blobs are the biggest offenders on disk. Check usage:

docker system df -v | grep ollama

Moving to production

This stack is a dev stack. Do not deploy it as-is.

For production self-hosted LLM inference, read the self-hosted LLM on Kubernetes guide. It covers GPU scheduling, model sharding, and proper autoscaling for inference pods.

For a middle ground, the same Compose file runs on a VPS for staging. Use systemd to manage docker compose up as a service, put Traefik in front for TLS, and keep Postgres backups on a schedule. Good enough for internal tools and paid pilots. Not good enough for user-facing workloads.
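A minimal systemd unit sketch for that, assuming the stack lives in /opt/ai-stack and the Compose v2 plugin is installed; adjust paths to taste:

```ini
# /etc/systemd/system/ai-stack.service
[Unit]
Description=AI dev stack (Docker Compose)
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/opt/ai-stack
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now ai-stack, and the stack survives VPS reboots without a cron hack.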

Troubleshooting

“Ollama: GPU not detected”. Run docker exec -it ai-ollama nvidia-smi. If it fails, the NVIDIA Container Toolkit is not installed or the Docker daemon was not restarted after install. On WSL2, check that the host driver is the WSL-compatible build.

“Out of memory on Ollama”. Switch to a quantized model. llama3.3:70b-instruct-q4_K_M uses roughly 40 GB of VRAM. For 8 to 16 GB cards, use llama3.1:8b-instruct-q4_K_M (the Llama 3.2 text models only come in 1B and 3B) or the 3B variant. Ollama also splits layers between VRAM and CPU RAM automatically, so a partial offload is often enough.
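The rough arithmetic behind those numbers: a q4_K_M quant averages about 4.5 bits per parameter. That 4.5-bit figure is an approximation from GGUF quantization behavior, not an Ollama internal, and KV cache sits on top, so treat the result as a floor:

```python
def vram_gb(params_b: float, bits_per_param: float = 4.5) -> float:
    """Back-of-envelope weight memory in GB for a quantized model."""
    return params_b * bits_per_param / 8

print(round(vram_gb(70), 1))  # 39.4 -> the "roughly 40 GB" for a 70B q4_K_M
print(round(vram_gb(8), 1))   # 4.5  -> an 8B q4 fits an 8 GB card, barely
print(round(vram_gb(3), 1))   # 1.7  -> why 3B models run on any modern laptop
```

Run the estimate against your card's VRAM before pulling a multi-gigabyte model blob.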

“Qdrant will not start”. Port 6333 conflict. Another process on the host is using it. sudo lsof -i :6333 to find it. Or change the Compose port mapping to 16333:6333.

“Postgres data gone after restart”. The volumes: block at the bottom of the file is missing or the service is not referencing it. Without a named volume, Docker creates an anonymous volume per container, and that data is easy to lose when the container is recreated (and is deleted outright by docker compose down -v or docker volume prune).

“Slow inference on CPU”. You are running a model too big for your hardware. Drop to a 3B or use an API via LiteLLM for the slow path. Keep Ollama for offline iteration and hit Claude Sonnet for anything interactive.

“n8n cannot connect to Postgres”. The depends_on with condition: service_healthy requires the healthcheck on Postgres. If you deleted it, n8n races the DB on startup and crashes. Put it back or use a retry loop in your n8n config.

“LiteLLM returns 401”. The LITELLM_MASTER_KEY env var must be passed in the Authorization: Bearer header on every request. Check the .env is loaded (docker compose config prints the resolved values).

“Ollama pull hangs on a large model”. Ollama downloads in chunks over HTTPS. Corporate or hotel networks sometimes reset long-lived connections. Run the pull on a stable connection, or pull outside the container with a curl directly to the Ollama registry and mount the blobs into the volume.

“LiteLLM cannot reach Ollama”. Inside the Compose network, use http://ollama:11434, not localhost. The localhost of the LiteLLM container is the container itself, not the Ollama service. The service name is the DNS name.

“docker compose up is slow on first run”. First run pulls seven or eight images. On a slow connection that is 3 to 5 GB of layers. Subsequent up calls are near-instant. Run docker compose pull as a separate warm-up step if you cycle the stack often.

“Qdrant dashboard shows no collections”. You have not created any yet. Qdrant starts empty. Create one:

curl -X PUT http://localhost:6333/collections/my_docs \
  -H 'Content-Type: application/json' \
  -d '{"vectors":{"size":768,"distance":"Cosine"}}'

Download the AI Automation Checklist (PDF)