<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Vllm on René Zander | AI Automation Consultant</title><link>https://renezander.com/tags/vllm/</link><description>Recent content in Vllm on René Zander | AI Automation Consultant</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 29 Apr 2026 07:00:00 +0200</lastBuildDate><atom:link href="https://renezander.com/tags/vllm/index.xml" rel="self" type="application/rss+xml"/><item><title>Claude Code with Local LLMs and ANTHROPIC_BASE_URL: Ollama, LM Studio, llama.cpp, vLLM</title><link>https://renezander.com/guides/claude-code-local-llm-anthropic-base-url/</link><pubDate>Wed, 29 Apr 2026 07:00:00 +0200</pubDate><guid>https://renezander.com/guides/claude-code-local-llm-anthropic-base-url/</guid><description>&lt;p>&lt;em>Native Anthropic endpoints, tool-call compatibility, and context-window sizing for local Claude Code.&lt;/em>&lt;/p>
&lt;p>&lt;em>Last tested: April 2026. See Changelog at the bottom.&lt;/em>&lt;/p>
&lt;h2 id="tldr-cheat-sheet">TL;DR cheat sheet&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Goal&lt;/th>
 &lt;th>Use&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>MacBook Air&lt;/td>
 &lt;td>Gemma 4 26B-A4B Q4, &lt;strong>32K context&lt;/strong>, LM Studio or Ollama&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>MacBook Pro&lt;/td>
 &lt;td>Gemma 4 26B-A4B Q4 / UD-Q4, &lt;strong>64K context&lt;/strong>, llama.cpp or LM Studio&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Claude Code minimum&lt;/td>
 &lt;td>&lt;strong>32K context&lt;/strong> (anything below is a chat demo)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Best local backend&lt;/td>
 &lt;td>LM Studio or Ollama first; llama.cpp for advanced setups; vLLM for servers&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Avoid&lt;/td>
 &lt;td>8K / 16K context, dense 31B Gemma 4 on 32 GB machines, old llama.cpp builds&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
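&lt;p>Before pointing Claude Code at any of these backends, it is worth confirming that the server actually answers on the Anthropic Messages API at the URL you plan to put into &lt;code>ANTHROPIC_BASE_URL&lt;/code>. Below is a minimal Python sketch using the &lt;code>anthropic&lt;/code> SDK; the port &lt;code>8080&lt;/code> and the model name &lt;code>local-model&lt;/code> are placeholders for whatever your backend exposes.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sanity-check a local Anthropic-compatible endpoint before wiring it into
# Claude Code (e.g. export ANTHROPIC_BASE_URL=http://localhost:8080).
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",  # same value you would export as ANTHROPIC_BASE_URL
    api_key="local",                   # local backends usually ignore the key but still require one
)

reply = client.messages.create(
    model="local-model",               # placeholder; use the model name your backend actually serves
    max_tokens=128,
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(reply.content[0].text)
&lt;/code>&lt;/pre>
&lt;p>If this round trip fails, Claude Code will fail against the same endpoint, so it is the cheapest place to debug the base URL and model name before touching any Claude Code settings.&lt;/p>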
&lt;h2 id="the-local-claude-code-rule-of-thumb">The local-Claude-Code rule of thumb&lt;/h2>
&lt;p>Three things decide whether a local Claude Code session works:&lt;/p></description></item><item><title>Self-Hosted LLM on Kubernetes: A Production vLLM Deployment</title><link>https://renezander.com/blog/self-hosted-llm-kubernetes/</link><pubDate>Sun, 05 Apr 2026 07:00:00 +0200</pubDate><guid>https://renezander.com/blog/self-hosted-llm-kubernetes/</guid><description>&lt;p>Most teams asking about self-hosted LLM Kubernetes deployments should not be running Kubernetes for this at all. The honest answer is that vLLM on a single GPU box, wrapped in systemd or Docker Compose, covers more use cases than anyone wants to admit. Kubernetes earns its keep only when you already run it, or when you need horizontal scaling, multi-tenant isolation, or proper rolling deploys across a GPU node pool.&lt;/p></description></item></channel></rss>