Claude Code with Local LLMs and ANTHROPIC_BASE_URL: Ollama, LM Studio, llama.cpp, vLLM

Native Anthropic endpoints, tool-call compatibility, and context-window sizing for local Claude Code.

Last tested: April 2026. See Changelog at the bottom.

TL;DR cheat sheet

| Goal | Use |
| --- | --- |
| MacBook Air | Gemma 4 26B-A4B Q4, 32K context, LM Studio or Ollama |
| MacBook Pro | Gemma 4 26B-A4B Q4 / UD-Q4, 64K context, llama.cpp or LM Studio |
| Claude Code minimum | 32K context (anything below is a chat demo) |
| Best local backend | LM Studio or Ollama first; llama.cpp for advanced; vLLM for servers |
| Avoid | 8K / 16K context, dense 31B Gemma 4 on 32 GB machines, old llama.cpp builds |

The local-Claude-Code rule of thumb

Three things decide whether a local Claude Code session works:

  • Model quality decides whether the answer is smart.
  • Tool-call formatting decides whether Claude Code can act on the answer.
  • Context length decides whether the session survives past the first few edits.

For local coding agents: 32K is the floor. 64K is the sweet spot. Anything below 32K is a chat demo, not Claude Code.

Recommended setup

Use this first. Don’t shop the buffet of alternatives until you’ve tried this one.

  • Backend: LM Studio (≥ 0.4.1) or Ollama (≥ v0.14.0) — both expose a native Anthropic compatible local endpoint, no proxy needed.
  • Model: gemma4:26b-a4b (Gemma 4 26B-A4B-it, Q4 quant). MoE active-param ≈ 3.88 B → laptop-friendly latency, tool-use trained directly into the model.
  • Context: 32K context on a MacBook Air, 64K context on a MacBook Pro M5 Pro/Max with 48 GB+ RAM.
  • Machine: 32 GB+ RAM strongly preferred. 24 GB works at 24K–32K with care.

If your backend has no Anthropic-compatible mode and only exposes an OpenAI-compatible local endpoint, run LiteLLM in front (see section 7).


1. Environment variables Claude Code reads

# Where Claude Code POSTs requests. Default: https://api.anthropic.com
ANTHROPIC_BASE_URL=http://localhost:11434

# Sent as auth. Local servers usually accept any non-empty value.
ANTHROPIC_AUTH_TOKEN=ollama

# Map Claude Code's "claude-opus-X-Y" / "claude-sonnet-X-Y" / "claude-haiku-X-Y"
# to model names your local backend serves.
ANTHROPIC_DEFAULT_OPUS_MODEL=gemma4:26b-a4b
ANTHROPIC_DEFAULT_SONNET_MODEL=gemma4:26b-a4b
ANTHROPIC_DEFAULT_HAIKU_MODEL=gpt-oss:20b

claude

Or override per-invocation:

claude --model gemma4:26b-a4b

If ANTHROPIC_BASE_URL is set but the URL doesn’t respond with the right shape, Claude Code does not fall back to the cloud. It errors out.
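A quick shape check before launching Claude Code (a sketch, assuming your backend exposes the Anthropic-style /v1/messages route — Ollama ≥ v0.14.0 and LM Studio ≥ 0.4.1 do, see sections 3–4 — and that the model name matches something your backend actually serves):

curl -s "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_AUTH_TOKEN" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "gemma4:26b-a4b", "max_tokens": 32,
       "messages": [{"role": "user", "content": "Say ok."}]}'

A JSON reply with "type": "message" and a non-empty content array means the shape is right. A 404 or an HTML error page means it is not, and Claude Code will error out.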


2. Context length: the hidden failure mode

Claude Code is not a chat prompt. Before your actual request, the backend sees:

  • Claude Code’s system prompt (~6–10K tokens by itself)
  • tool definitions for Read / Edit / Bash / Grep / Glob / TodoWrite
  • conversation history
  • file excerpts and full reads
  • diffs
  • command output
  • retry/error messages from failed tool calls

That means 8K and 16K contexts are misleading tests. They may answer a chat question, but they are not enough for reliable agentic coding. The session survives a handful of turns, then silently degrades — file edits truncate, tool calls drop arguments, the loop gets confused.
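A rough back-of-envelope for a 32K window (illustrative numbers only, not measurements; the real figures depend on the Claude Code version and your repo):

# system prompt             ~8,000 tokens
# tool definitions          ~2,000
# history after ~10 turns   ~8,000
# one 400-line file read    ~5,000
# diffs + command output    ~3,000
# --------------------------------
# ~26,000 consumed before the model writes its next edit

At 8K or 16K the same turn overflows before your actual request is even considered.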

Practical context tiers

| Context | Verdict | What happens |
| --- | --- | --- |
| 8K | Broken for Claude Code | System prompt + tools eat the window before your code arrives. Chat-only. |
| 16K | Demo only | Tiny edits, short sessions. Not a real test of any model. |
| 25K | LM Studio’s stated minimum | Good enough for small tasks if tool calls are reliable. |
| 32K | Real minimum | Ollama recommends this floor. Use as your default. |
| 64K | Sweet spot | Best balance on 32 GB+ machines. Handles medium repos and multi-file edits. |
| 128K+ | Diminishing returns | Prefill latency and KV-cache memory rise hard. Worth it only on high-memory servers, and only for repo-wide reads. |

Apple Silicon context presets

| Machine | Recommended context | Notes |
| --- | --- | --- |
| MacBook Air M5, 16 GB | 16K–24K | Use smaller models (≤8B). 26B-A4B is tight. |
| MacBook Air M5, 24 GB | 24K–32K | 32K is the target; keep other apps light. |
| MacBook Air M5, 32 GB | 32K | Best Air setup. Higher rarely beats thermal throttling. |
| MacBook Pro M5 Pro, 24 GB | 32K | Better sustained perf than Air at the same context. |
| MacBook Pro M5 Pro, 48/64 GB | 64K | Sweet spot for serious local coding. |
| MacBook Pro M5 Max, 64/128 GB | 64K default, 128K experimental | Use 128K for repo-wide analysis, not every edit loop. |

Note: backend docs differ — LM Studio says “start at 25K, increase for better results,” Ollama recommends 32K. Use 32K as the cross-backend baseline. Reading “25K” as “25K is enough” is the most common mistake.


3. Claude Code Ollama setup (native, v0.14.0+)

Ollama announced Anthropic Messages API compatibility on 2026-01-16. No proxy, no LiteLLM, no nothing.

# Set context length first — this is the most important knob
export OLLAMA_CONTEXT_LENGTH=32768   # 65536 on a Pro

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434

claude --model gemma4:26b-a4b

Cloud-hosted Ollama models work too:

claude --model glm-4.7:cloud
claude --model minimax-m2.1:cloud

Two known limits of Ollama’s Anthropic-compat layer (April 2026):

  • No prompt caching. Anthropic’s cache_control doesn’t apply — every Claude Code request re-processes the system prompt and conversation history from scratch.
  • No tool_choice. Claude Code occasionally uses tool_choice to force a specific tool call. Ollama’s compat layer ignores it. When it matters, Claude Code may pick the wrong tool and get stuck in a loop.
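If you prefer to pin the context length to the model rather than the environment variable, an Ollama Modelfile does the same job (a sketch; gemma4-32k is just a name I made up for the derived model):

# Modelfile
FROM gemma4:26b-a4b
PARAMETER num_ctx 32768

ollama create gemma4-32k -f Modelfile
claude --model gemma4-32k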

4. Claude Code LM Studio setup (native, 0.4.1+)

LM Studio added the Anthropic-compatible /v1/messages endpoint on 2026-01-30. Streaming, tool calls, and the Anthropic message shape are all supported natively.

# Set context to at least 32K in the LM Studio UI (or higher; see section 2)
lms server start --port 1234

export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio

claude --model openai/gpt-oss-20b

For VS Code with the Claude Code extension (env vars from your shell are NOT inherited by VS Code):

// .vscode/settings.json
{
  "claudeCode.environmentVariables": [
    { "name": "ANTHROPIC_BASE_URL", "value": "http://localhost:1234" },
    { "name": "ANTHROPIC_AUTH_TOKEN", "value": "lmstudio" }
  ]
}

LM Studio’s docs say “at least 25K.” Set 32K. See section 2.
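If you script the model load instead of using the UI, recent lms builds also accept a context flag at load time (this is an assumption about your lms version — confirm the exact flag with lms load --help before relying on it):

lms load openai/gpt-oss-20b --context-length 32768
lms server start --port 1234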


5. Claude Code llama.cpp setup (Apple Silicon fast path for Gemma 4 26B-A4B)

If you’re on Apple Silicon and want the absolute lowest overhead with Gemma 4 26B-A4B, llama.cpp’s server is faster per-token than Ollama or LM Studio. You need a recent build (one that supports -hf for HuggingFace pulls and --jinja for chat templates).

./build/bin/llama-server \
  -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  -c 65536 \
  --jinja
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=llama-cpp
claude --model gemma-4-26B-A4B

Flags that matter:

  • -c 65536 sets 64K context (drop to -c 32768 on tighter machines).
  • -ngl 99 offloads all layers to Metal/GPU.
  • --jinja is required for Gemma 4’s chat template to render correctly. Without it, tool calls won’t format and you’ll see <unused24> / <unused49> tokens leaking into output.
  • -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M pulls the GGUF straight from HuggingFace.

Caveat: llama.cpp’s Anthropic-compat is partial. Works for chat and basic tool calling. Streaming-shape and some Anthropic-specific request fields are rougher than Ollama or LM Studio. If something breaks weirdly, fall back to Ollama. llama.cpp is the speed play, not the compatibility play.
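Two quick checks before pointing Claude Code at it (both routes are built into llama-server; the exact JSON shape varies by build):

curl -s http://127.0.0.1:8080/health
curl -s http://127.0.0.1:8080/props | head -c 400

health should report ok; in recent builds props echoes the loaded chat template, which lets you confirm --jinja actually picked up Gemma 4’s template before you burn time debugging tool calls.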


6. Claude Code vLLM setup (native + tool parser)

vLLM ships an official Claude Code integration. Three things matter at server start: a tool-calling-capable model, --enable-auto-tool-choice, and the right --tool-call-parser.

vllm serve openai/gpt-oss-120b \
  --served-model-name my-model \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --port 8000
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_DEFAULT_OPUS_MODEL=my-model
export ANTHROPIC_DEFAULT_SONNET_MODEL=my-model
export ANTHROPIC_DEFAULT_HAIKU_MODEL=my-model

claude

The --tool-call-parser value depends on the model family — openai for the gpt-oss family, llama3_json for Llama 3.x, hermes for Hermes. Wrong parser → tool calls return as plain text and Claude Code’s edit/grep/bash tools silently no-op.
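A quick way to confirm the alias vLLM is serving matches what Claude Code will request (the OpenAI-compatible /v1/models route ships with vllm serve):

curl -s http://localhost:8000/v1/models
# Look for the --served-model-name value ("my-model" above) in the returned id list.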


7. LiteLLM — for fallbacks, not for translation

With Ollama, LM Studio, llama.cpp, and vLLM all speaking native Anthropic now, LiteLLM’s role changes. It’s no longer “the translator” — it’s the router for fallbacks, request logging, per-tenant keys, and rate limits. Also the right answer if your only local option is an OpenAI compatible local endpoint.

# litellm-config.yaml
model_list:
  - model_name: claude-opus-4-7
    litellm_params:
      model: openai/my-vllm-model
      api_base: http://vllm:8000/v1

  - model_name: claude-sonnet-4-6
    litellm_params:
      model: ollama/gemma4:26b-a4b
      api_base: http://ollama:11434

  - model_name: claude-haiku-4-5
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks:
    - claude-opus-4-7: ["claude-haiku-4-5"]   # local fail → cloud Haiku

The single biggest win: when a local tool call silently fails, LiteLLM falls back to cloud Haiku transparently. Claude Code keeps working.
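Running it looks roughly like this (a sketch; 4000 is LiteLLM’s default proxy port, and the auth token can be any non-empty value unless you’ve configured proxy keys):

litellm --config litellm-config.yaml --port 4000

export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=anything
claude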


8. Common failures (the error strings developers google)

tool_use parse error / invalid tool call / tool_use is not supported

Three different symptoms, one root cause: the model is not emitting Anthropic-format tool_use content blocks.

The most deceptive symptom is the silent one — Claude Code starts, prints the model’s plain-prose answer (“I would change the file like this…”), and nothing happens. No file edit, no error.

Common causes (April 2026):

  • vLLM: missing --enable-auto-tool-choice or wrong --tool-call-parser.
  • Ollama: model that wasn’t trained for tool calling (avoid stock llama3.x instruct).
  • llama.cpp: missing --jinja. The chat template renders incorrectly and you see literal <unused24> / <unused49> tokens.
  • LM Studio: model file is fine but the loaded preset uses the wrong template.

context length exceeded / model stopped mid-edit

Claude Code’s prompts overflow the configured window. The session may finish a single turn, then truncate the next file edit silently. Fix: raise context to at least 32K. If you’re already at 32K and still hitting this, the model is reading too aggressively — drop to fewer tools or shorter file reads.

empty assistant response

Backend returned 200 OK with an empty content array. Causes:

  • Streaming SSE format mismatch (mostly llama.cpp).
  • Tool-call parser swallowed the message because it couldn’t parse it.
  • Model emitted only a <unused24> / <unused49> token and the parser dropped the rest.

Fix: switch backend (Ollama or LM Studio if you were on llama.cpp), or upgrade llama.cpp to a build with the patched Gemma 4 chat template.

model not found / 404 the model X does not exist

Claude Code asked for claude-opus-4-7 but the backend serves gpt-oss:20b or gemma4:26b-a4b. Fixes:

  • Set ANTHROPIC_DEFAULT_OPUS_MODEL (plus _SONNET_ and _HAIKU_) to the backend’s actual model name.
  • Use claude --model <backend-name> per call.
  • Map the names in LiteLLM (the model_name: field is what Claude Code asks for; model: is what gets served).

messages: Extra inputs are not permitted (HTTP 422)

Some backends are stricter than Anthropic’s own. They reject Anthropic-specific fields (cache_control, thinking, tools[].input_schema, metadata.user_id). Fix: upgrade the backend, or run a small middleware proxy that strips the unsupported fields before forwarding.

ANTHROPIC_BASE_URL ignored / Claude Code still calls the real API

  • Env var was set in .zshrc after the shell session started — restart the terminal.
  • ~/.config/claude/config.json or a --api-key flag is overriding the env var.
  • VS Code: env vars from your shell are NOT inherited. Use claudeCode.environmentVariables in workspace settings (section 4).

Run echo $ANTHROPIC_BASE_URL inside the same shell that runs claude. If it comes back empty, you have a sourcing problem.


9. Debug flow

When something breaks, walk this tree before swapping backends:

  1. Did the model load?
    • No → check quant size vs RAM. 26B-A4B Q4 needs ~16 GB free; bigger quants need more.
  2. Is the context at least 32K?
    • No → raise to 32K (Air) or 64K (Pro). See section 2.
  3. Are tool calls malformed? (Look for <unused24>, <unused49>, plain prose where you expected an edit.)
    • Yes → switch to native Anthropic mode (Ollama/LM Studio), or for vLLM verify --tool-call-parser, or for llama.cpp add --jinja.
  4. Does Claude Code stop mid-edit?
    • Yes → context exhaustion. Reduce how much each turn reads (shorter file reads, trimmed command output), or raise the context window if RAM allows. See section 2.
  5. Is the model hallucinating files that don’t exist?
    • Yes → the model isn’t calling Read before Edit. Add a CLAUDE.md rule that requires reading before editing (see the snippet below), or switch to a model with stronger tool-use training (Gemma 4 26B-A4B is solid here).

10. Smoke test

Verify your setup with one prompt. Ask Claude Code:

Create a small FastAPI app with one /health endpoint, add a pytest test for it, run pytest, and fix any failures.

Passes if:

  • It reads/writes files correctly (no hallucinated paths).
  • It runs the test command (you see real pytest output).
  • It patches a failure (e.g. missing dependency) without losing context.
  • It does not lose tool-call format (no <unused24> / <unused49> leakage).
  • It does not truncate after the first edit.

Expected terminal feel:

✓ model loaded     (gemma4:26b-a4b, Q4_K_M)
✓ context: 32768
✓ tool call parsed (Edit)
✓ edited file      (app.py)
✓ tool call parsed (Bash)
✓ tests passed

If any of these are missing, walk the debug flow above.


11. Compatibility matrix (April 2026)

| Backend | Native Anthropic API | Tool calls | Context floor | Notes |
| --- | --- | --- | --- | --- |
| Ollama (≥ v0.14.0) | Yes | Depends on model | 32K context (cross-backend baseline) | Easiest setup. No prompt caching, no tool_choice (see section 3). |
| LM Studio (≥ 0.4.1) | Yes | Yes (out of the box) | Stated 25K, use 32K | Streaming + tool_use blocks supported natively. VS Code extension takes workspace env vars. |
| llama.cpp server | Partial | Yes with --jinja | 32K, 64K on Pro | Lowest overhead on Apple Silicon. Rougher Anthropic-compat. Best path for Gemma 4 26B-A4B. |
| vLLM | Yes | Yes with --enable-auto-tool-choice + correct parser | Model-dependent | Best throughput. Requires correct parser per model family. |
| LiteLLM | Routes to any backend | Whatever the backend supports | n/a | Use for fallbacks and logging, or to wrap an OpenAI compatible local endpoint as Anthropic. |
| Direct Ollama < v0.14.0 | No | No | n/a | Upgrade. |

12. Hardware × model × context × backend (the cheat-sheet table)

A developer should not have to infer what to use:

| Machine | Model | Context | Backend | Verdict |
| --- | --- | --- | --- | --- |
| MacBook Air M5, 16 GB | Gemma 4 E4B | 16K–24K | LM Studio | usable for small tasks |
| MacBook Air M5, 24 GB | Gemma 4 26B-A4B Q4 | 24K–32K | Ollama / LM Studio | good |
| MacBook Air M5, 32 GB | Gemma 4 26B-A4B Q4 | 32K | Ollama / LM Studio | best Air setup |
| MacBook Pro M5 Pro, 48 GB | Gemma 4 26B-A4B Q4/UD-Q4 | 64K | llama.cpp / LM Studio | sweet spot |
| MacBook Pro M5 Max, 64 GB+ | Gemma 4 26B-A4B or 31B | 64K–128K | llama.cpp / vLLM | best local |

This is the single most copied table in this gist. Bookmark it.


13. Gemma 4 26B-A4B: the Apple Silicon sweet spot

For Mac local Claude Code, the standout Gemma 4 variant is 26B-A4B-it, not the dense 31B. Reasons:

  • Google trained tool-use directly into Gemma 4 (not bolted on as a fine-tune). It works on the first try, not after three retries.
  • The 26B MoE activates only ~3.88 B params per inference, so latency is in the 4 B-model range — around 300 tok/sec on M2 Ultra.
  • Strong tool-use behavior, good enough coding quality for private/local workflows.
  • Fits at useful context sizes on high-memory MacBooks.

Why 26B-A4B instead of 31B?

  • Faster tool calls — every Claude Code turn is bottlenecked by tool-call latency, not single-shot quality.
  • Lower active-parameter count keeps prefill cheap.
  • Better fit for laptops — 31B dense needs more RAM and more thermal headroom.
  • Enough quality for iterative coding; the agent loop matters more than peak IQ.
  • 31B may be better for single-shot answers — but Claude Code is many small turns, not one big answer.

For Gemma 4 local coding specifically: pick 26B-A4B unless you’re on a 64 GB+ Pro and you’ve measured that 31B Q4 actually finishes turns faster on your hardware.


14. Other model picks for Claude Code (April 2026)

If Gemma 4 isn’t available or you want to compare:

  • gpt-oss:20b — easy starting point. Tool calling reliable, runs on a single decent GPU. Recommended in Ollama’s and LM Studio’s official Claude Code blog posts.
  • gpt-oss:120b — much smarter on real codebases. The vLLM Claude Code integration page uses this as the example. Needs serious VRAM.
  • qwen3-coder — purpose-built for coding. Strong tool-call performance on Ollama. Frequently called the strongest local pick for Claude Code in March/April 2026 community threads.
  • qwen3.5 family — the 35B MoE variants are reported as the strongest agentic-coding open models in this size class. Verify tool-call support per quant.
  • glm-4.7-flash / glm-4.7:cloud — strong agentic coder. Available as an Ollama cloud model (no local GPU needed).
  • minimax-m2.1:cloud — newer Ollama cloud option, agentic-tuned.

What to avoid: stock llama3.x instruct models without tool fine-tuning. They will look like they work, then silently fail on file edits.


15. Setups I would avoid

  • 8K context. Too small for Claude Code. The system prompt eats it before your code arrives.
  • 16K context. Demos only. Don’t judge a model by 16K behavior.
  • Old llama.cpp builds with Gemma 4. No --jinja or no patched chat template → <unused24> / <unused49> token leakage.
  • 128K context on a 32 GB laptop. KV cache + prefill latency tax > the benefit.
  • Judging model quality before tool calls are stable. Fix the parser/template first, then evaluate the model.
  • Routing through LiteLLM when the backend is already native Anthropic. Adds a hop for nothing — only use LiteLLM for fallbacks or when wrapping an OpenAI compatible local endpoint.

16. Reusable startup script

Drop this in start-claude-code-local.sh and chmod +x. Default 32K context, override via env.

#!/usr/bin/env bash
set -euo pipefail

export OLLAMA_CONTEXT_LENGTH="${OLLAMA_CONTEXT_LENGTH:-32768}"
export ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL:-http://localhost:11434}"
export ANTHROPIC_AUTH_TOKEN="${ANTHROPIC_AUTH_TOKEN:-ollama}"
export ANTHROPIC_DEFAULT_OPUS_MODEL="${ANTHROPIC_DEFAULT_OPUS_MODEL:-gemma4:26b-a4b}"
export ANTHROPIC_DEFAULT_SONNET_MODEL="${ANTHROPIC_DEFAULT_SONNET_MODEL:-gemma4:26b-a4b}"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="${ANTHROPIC_DEFAULT_HAIKU_MODEL:-gpt-oss:20b}"

echo "Starting Ollama with context=$OLLAMA_CONTEXT_LENGTH"
ollama serve &
OLLAMA_PID=$!

# Wait for Ollama to be ready
until curl -sf "$ANTHROPIC_BASE_URL/api/version" > /dev/null; do
  sleep 0.5
done

echo "Launching Claude Code → $ANTHROPIC_BASE_URL"
echo "Model: $ANTHROPIC_DEFAULT_OPUS_MODEL"

claude

kill $OLLAMA_PID 2>/dev/null || true

For LM Studio, swap ollama serve for lms server start --port 1234 and update the env vars accordingly.
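The LM Studio variant of the same skeleton looks like this (a sketch; the readiness check polls the OpenAI-style /v1/models route instead of Ollama’s /api/version):

export ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL:-http://localhost:1234}"
export ANTHROPIC_AUTH_TOKEN="${ANTHROPIC_AUTH_TOKEN:-lmstudio}"

lms server start --port 1234

# Wait for LM Studio's server to answer before launching Claude Code
until curl -sf "$ANTHROPIC_BASE_URL/v1/models" > /dev/null; do
  sleep 0.5
done

claude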

This script (and additions for other backends as they ship) lives in the companion repo:

github.com/renezander030/local-ai-coding-stack: git clone, chmod +x scripts/start-claude-code-local.sh, run.


17. Production recommendation

For real work, do not let Claude Code talk directly to a single local endpoint without a fallback path:

Claude Code
   │  ANTHROPIC_BASE_URL
   ▼
LiteLLM (router + logger)
   │  primary
   ▼
Ollama / LM Studio / llama.cpp / vLLM (local)
   │  on tool-call failure or 5xx
   ▼
Cloud Claude Haiku (fallback)
   │
   ▼
Audit log

Model swaps without restarting Claude Code; transparent fallback when local tool calling silently fails; request logs you can grep when something goes wrong. Same five-contract pattern from agent-approval-gate.


18. When local models are the wrong choice

  • Repo-wide refactors. Multi-step tool flows compound silent tool-call failures. Local fine-tunes drop accuracy fast.
  • Security-sensitive edits without an approval gate. Use agent-approval-gate and the local-vs-cloud question becomes secondary.
  • Tool-heavy sessions (50+ tool calls). Every silent failure compounds.
  • Anything billed by your time. A failed local tool call costs your time; a successful Haiku call is roughly $0.001.

Local Claude Code is a fit for: chat-only assist on private code, classification/summarization sub-steps, air-gapped environments.


Series

This gist is part of Production AI Automation Notes — a running set of repos and gists on shipping AI agents outside demos.



Reader contributions

If you get this working on a different Mac/RAM/model combo, comment with:

  • machine
  • RAM
  • backend
  • model + quant
  • context length
  • what worked / what failed

The compatibility matrix and hardware table are updated weekly from these reports.

Changelog

2026-04-28

  • Added TL;DR cheat sheet, Recommended setup section, smoke test, debug flow, reusable startup script, hardware × model × context × backend table.
  • Expanded error-string section to include <unused24> / <unused49> template-leak symptoms.
  • Added 26B-A4B vs 31B comparison bullets.
  • Added “Setups I would avoid.”
  • Renamed Update log → Changelog.
  • Added Gemma 4 26B-A4B context recommendations.
  • Added MacBook Air vs Pro presets.
  • Added 32K / 64K Claude Code guidance.
  • Backend coverage rewritten: Ollama, LM Studio, vLLM all native Anthropic; llama.cpp added as Apple Silicon fast path.
  • LiteLLM repositioned as fallback router (and OpenAI-compat wrapper), not translator.

2026-04-22

  • Initial publish.