RAG Pipeline Tutorial: Build a Production Document Q&A System with Qdrant and Claude

April 1, 2026 · 16 min read · rag, claude, qdrant, vector-database, llm

Most RAG tutorials ship a toy. You paste a PDF, it answers one question, and the moment you point it at 500 documents the retrieval goes sideways and Claude hallucinates half the citations. This one is the opposite. I am going to walk through the pipeline I actually run in production, line by line, with the tradeoffs called out where they bit me.

The verdict first. If your corpus is under 200k tokens and rarely changes, skip RAG and stuff it all into Claude’s context window. If your corpus is larger, changes often, or you need hard citations, build the pipeline in this tutorial end to end with Qdrant, a local embedding model, and Claude Sonnet 4.6. That is the sweet spot for cost and quality in 2026.

This is a code-heavy retrieval-augmented generation tutorial. You will leave with a working document Q&A system, an evaluation harness, and a production checklist. Everything below is TypeScript, because that is what I ship, but the Python equivalent is a one-to-one translation.

What you’ll build

A document Q&A service. You feed it PDFs, markdown files, and HTML pages. Users ask questions in natural language. The system returns answers grounded in the source documents, with citations pointing back to the exact chunk and page that supported each claim.

Concretely, two CLI scripts:

  • ingest.ts loads files, chunks them, embeds the chunks, and upserts everything into Qdrant with metadata.
  • query.ts embeds the user question, searches Qdrant for the top matching chunks, stuffs them into a Claude prompt, and returns a structured answer with citations.

That is it. No orchestration framework, no LangChain, no managed service. Roughly 200 lines of code across the two files. You can build RAG workflows on top of this core and it will scale to a few million chunks before Qdrant starts to sweat.

The stack

Same components as my Docker Compose AI development stack, so if you already run that you have most of this live.

  • Vector storage: Qdrant. Open source, Rust, fast, runs in one container, supports hybrid search natively. I compare it to the alternatives in Qdrant vs Pinecone vs Weaviate. For anything self-hosted Qdrant is the default pick.
  • Embeddings: bge-large-en-v1.5 via sentence-transformers, or nomic-embed-text via Ollama. Both run locally, both are free, both are good enough for 95 percent of use cases. API embeddings (OpenAI, Cohere, Voyage) are marginally better and cost real money. I default to local.
  • Generation: Claude Sonnet 4.6 (claude-sonnet-4-6). Great instruction following, 200k context, tool use for structured output, prompt caching for the retrieved chunks.
  • Glue: Node 20, TypeScript, @anthropic-ai/sdk, @qdrant/js-client-rest, pdf-parse, cheerio.

That is the whole stack. No vector DB managed service fee, no embedding API bill, and Claude usage stays low because the retrieved context is small and cacheable. This is the architecture I recommend in the production AI agent architecture pillar as the default RAG component.

The four stages of RAG

Every RAG system is four stages. If you understand these in order, the failure modes become obvious and you stop debugging symptoms.

  1. Ingestion: load documents, chunk them, embed each chunk, upsert to the vector DB with metadata.
  2. Retrieval: embed the user query with the same model, search the vector DB, get the top K chunks.
  3. Augmentation: format the retrieved chunks into a structured prompt block.
  4. Generation: call Claude with the augmented prompt, get a grounded answer with citations.

The first two are a pure search problem. If retrieval is bad, no amount of prompt engineering will save you. The last two are a generation problem. If retrieval is good and generation still fails, the prompt is wrong or the model is hallucinating.

Stages run independently. Ingestion happens once per document (and re-runs on updates). Retrieval and generation happen on every user query. Separate them in your codebase from day one.

Stage 1: Ingestion

// ingest.ts
import { QdrantClient } from "@qdrant/js-client-rest";
import { pipeline } from "@xenova/transformers";
import pdf from "pdf-parse";
import * as cheerio from "cheerio";
import { readFile, readdir } from "fs/promises";
import { randomUUID } from "crypto";

const qdrant = new QdrantClient({ url: "http://localhost:6333" });
const COLLECTION = "docs";
const EMBED_MODEL = "Xenova/bge-large-en-v1.5";
const DIM = 1024;

async function ensureCollection() {
  const exists = await qdrant.collectionExists(COLLECTION);
  if (exists.exists) return;
  await qdrant.createCollection(COLLECTION, {
    vectors: { size: DIM, distance: "Cosine" },
  });
}

async function loadDoc(path: string): Promise<{ text: string; type: string }> {
  const buf = await readFile(path);
  if (path.endsWith(".pdf")) {
    const parsed = await pdf(buf);
    return { text: parsed.text, type: "pdf" };
  }
  if (path.endsWith(".html")) {
    const $ = cheerio.load(buf.toString());
    $("script, style, nav, footer").remove();
    return { text: $("body").text(), type: "html" };
  }
  return { text: buf.toString(), type: "markdown" };
}

That is the boring part. Now the hard part.

Chunking strategy

Chunking is where most RAG systems die. Too small and you lose context. Too big and retrieval gets noisy. The defaults most tutorials ship (500 characters, no overlap) are wrong for almost every real corpus.

My production defaults:

  • Token-based window: 500 to 1000 tokens per chunk. Use the gpt-tokenizer package or tiktoken for counting. Character counts lie because CJK text and code have very different character-to-token ratios.
  • Overlap: 100 tokens between adjacent chunks. This keeps sentences and ideas from getting sliced at chunk boundaries.
  • Respect structure: split on H2 and H3 boundaries first, then paragraphs, then sentences. Never split mid-sentence if you can help it. Never split mid code block, ever.
  • Recursive splitter: the LangChain-style recursive character splitter is a good baseline. Try the largest separator first (\n## ), fall back to the next (\n\n), then sentences, then characters. Keep splitting until chunks are under the token budget.
import { encode } from "gpt-tokenizer";

function chunkText(text: string, maxTokens = 800, overlap = 100): string[] {
  // Split on markdown headings first
  const sections = text.split(/\n(?=#{1,3} )/);
  const chunks: string[] = [];
  for (const section of sections) {
    const tokens = encode(section);
    if (tokens.length <= maxTokens) {
      chunks.push(section.trim());
      continue;
    }
    // Fall back to paragraphs
    const paragraphs = section.split(/\n\n+/);
    let buf: string[] = [];
    let bufTokens = 0;
    for (const para of paragraphs) {
      const paraTokens = encode(para).length;
      if (bufTokens + paraTokens > maxTokens && buf.length) {
        chunks.push(buf.join("\n\n"));
        // Start the next chunk with the last paragraph as overlap,
        // but only if it fits within the overlap budget
        const tail = buf[buf.length - 1];
        const tailTokens = encode(tail).length;
        if (tailTokens <= overlap) {
          buf = [tail];
          bufTokens = tailTokens;
        } else {
          buf = [];
          bufTokens = 0;
        }
      }
      buf.push(para);
      bufTokens += paraTokens;
    }
    if (buf.length) chunks.push(buf.join("\n\n"));
  }
  return chunks.filter((c) => c.length > 50);
}

For technical documentation with lots of code, guard code fences explicitly. I track fence state in a flag and refuse to split while inside a fence. A fifty-line code example stays intact, even if it exceeds the token budget.
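A minimal sketch of that guard, under the assumption that code blocks use markdown-style triple-backtick fences (the function name is mine): allow a chunk boundary at a blank line, but only while outside a fence.

```typescript
// Sketch: split text at blank lines, but never inside a code fence.
// Toggles fence state on every line that opens or closes a ``` block.
function splitOutsideFences(text: string): string[] {
  const segments: string[] = [];
  let current: string[] = [];
  let inFence = false;
  for (const line of text.split("\n")) {
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    current.push(line);
    // A blank line outside a fence is a legal boundary
    if (!inFence && line.trim() === "" && current.length > 1) {
      segments.push(current.join("\n"));
      current = [];
    }
  }
  if (current.length) segments.push(current.join("\n"));
  return segments;
}
```

Run this before the paragraph splitter, so fenced blocks arrive at the token-budget stage as atomic units.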

Embedding choice

The model you pick here determines the ceiling of your retrieval quality. Swap it later and you have to re-embed everything.

  • Local, fast, small: all-MiniLM-L6-v2. 384 dimensions, runs on CPU, fine for small corpora.
  • Local, slow, strong: bge-large-en-v1.5. 1024 dimensions, great accuracy, needs a decent CPU or GPU.
  • Ollama-hosted: nomic-embed-text. 768 dimensions, free, very solid. My default when Ollama is already in the stack.
  • API, best quality: OpenAI text-embedding-3-large, Cohere embed-english-v3, or Voyage voyage-3. Pay per token, beat the local models by a small margin on hard retrieval tasks.

I ship bge-large-en-v1.5 for English corpora and nomic-embed-text multilingual when I need German. The quality gap to API models is small. The cost gap is large when you ingest a million chunks.
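Whichever model you pick, make the dimension explicit and fail fast. A tiny guard like this (the helper name is mine) catches a swapped model before it silently poisons the collection:

```typescript
// Sketch: assert the embedding dimension before upserting or searching.
// A model swap (1024-dim bge vs 768-dim nomic) otherwise surfaces as a
// confusing Qdrant error deep inside a batch, or not at all.
function assertDim(vec: number[], expected: number): number[] {
  if (vec.length !== expected) {
    throw new Error(`embedding dim ${vec.length} does not match collection dim ${expected}`);
  }
  return vec;
}
```

Wrap both sides: `assertDim(vectors[j], DIM)` at upsert time and `assertDim(qvec, DIM)` at query time.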

Metadata matters

Every chunk goes into Qdrant with a payload. Everything you will ever want to filter on goes in the payload now, not later.

type ChunkPayload = {
  doc_id: string;
  chunk_id: string;
  source: string;      // file path or URL
  page?: number;       // for PDFs
  doc_type: string;    // pdf | markdown | html
  created_at: string;
  acl?: string[];      // who is allowed to see this
  text: string;        // the chunk itself, for reading back
};

Skipping ACLs at ingest is the single most common production bug I see. Retrofitting access control after launch is miserable. Put the user groups or tenant IDs in the payload now.
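A sketch of what that payload buys you at query time. The `acl` key matches the payload type above; `aclFilter` is my helper name:

```typescript
// Sketch: build a Qdrant payload filter so search only returns chunks
// whose acl list overlaps the requesting user's groups.
function aclFilter(userGroups: string[]) {
  return {
    must: [{ key: "acl", match: { any: userGroups } }],
  };
}
```

Pass it on every search, e.g. `qdrant.search(COLLECTION, { vector: qvec, limit: 8, filter: aclFilter(user.groups), with_payload: true })`, so authorization happens inside the vector store rather than after the fact.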

Upsert loop

let extractorPromise: Promise<any> | null = null;

async function embed(texts: string[]): Promise<number[][]> {
  // Load the model once and reuse it; re-creating the pipeline on every
  // batch reloads the weights each time
  extractorPromise ??= pipeline("feature-extraction", EMBED_MODEL);
  const extractor = await extractorPromise;
  const out = await extractor(texts, { pooling: "mean", normalize: true });
  return out.tolist();
}

async function ingestFile(path: string) {
  const { text, type } = await loadDoc(path);
  const chunks = chunkText(text);
  const doc_id = randomUUID();
  const BATCH = 100;

  for (let i = 0; i < chunks.length; i += BATCH) {
    const batch = chunks.slice(i, i + BATCH);
    const vectors = await embed(batch);
    const points = batch.map((chunk, j) => ({
      id: randomUUID(),
      vector: vectors[j],
      payload: {
        doc_id,
        chunk_id: `${doc_id}:${i + j}`,
        source: path,
        doc_type: type,
        created_at: new Date().toISOString(),
        text: chunk,
      },
    }));
    await qdrant.upsert(COLLECTION, { wait: true, points });
  }
}

async function main() {
  await ensureCollection();
  const dir = process.argv[2];
  for (const file of await readdir(dir)) {
    await ingestFile(`${dir}/${file}`);
    console.log(`ingested ${file}`);
  }
}
main();

Batch size 100 is the throughput sweet spot for Qdrant on a single node. Smaller batches waste round trips. Larger batches blow memory and can hit request size limits.

Stage 2: Retrieval

Query embedding must use the exact same model as ingestion. Mixing embedding models is the second most common production bug. If you have to swap embedding models, re-embed the whole corpus. There is no shortcut.

// query.ts (retrieval portion)
async function retrieve(question: string, k = 8) {
  const [qvec] = await embed([question]);
  const results = await qdrant.search(COLLECTION, {
    vector: qvec,
    limit: k,
    with_payload: true,
  });
  return results;
}

Top K is typically 5 to 10. Start at 8. Below 5 you miss relevant context. Above 15 the noise drowns the signal and Claude gets confused.

Pure vector search misses exact matches. If the user asks about error code E_TIMEOUT_503 and no chunk is semantically close, you get nothing. Hybrid search combines vector similarity with BM25 keyword matching. Qdrant supports this natively through sparse vectors.

// Requires a collection created with named vectors, e.g.
//   vectors: { dense: { size: DIM, distance: "Cosine" } },
//   sparse_vectors: { sparse: {} }
// sparseVec is a client-side sparse embedding of the question
// ({ indices, values }), e.g. from a BM25 or SPLADE encoder.
const results = await qdrant.query(COLLECTION, {
  prefetch: [
    { query: qvec, using: "dense", limit: 20 },
    { query: sparseVec, using: "sparse", limit: 20 },
  ],
  query: { fusion: "rrf" },
  limit: 8,
});

RRF (reciprocal rank fusion) merges the two result sets. The sparse path catches exact keyword hits. The dense path catches paraphrases. I turn hybrid on for any corpus with technical jargon, code, product names, or error codes.
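For intuition, RRF is simple enough to write out by hand. Each result list contributes 1/(k + rank) to a chunk's fused score, so a chunk ranked well in either list floats to the top. This is a sketch of the math Qdrant runs server-side, not code you need to ship:

```typescript
// Sketch: reciprocal rank fusion over two (or more) ranked id lists.
// k = 60 is the conventional damping constant; rank is zero-based here.
function rrfScores(rankedLists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return scores;
}
```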

Reranking

Reranking takes the top 20 to 50 candidates and scores them again with a cross-encoder. Cross-encoders read the query and chunk together, so they score relevance more accurately than vector similarity. The cost is latency, 100 to 300 ms per batch.

Use reranking when precision matters more than latency. Legal research, compliance Q&A, medical information. Skip it for interactive chat where every second counts.

Good rerankers in 2026: bge-reranker-v2-m3 (open weights, multilingual) and Cohere Rerank 3.5 (API, fast, excellent quality).
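The control flow is the same whichever reranker you pick. A sketch, with `crossScore` as a stand-in for the actual model call (a local bge-reranker endpoint, Cohere's API, whatever you wire in):

```typescript
type ScoredHit = { payload: { text: string; source: string }; score: number };

// Sketch: re-score the top vector-search candidates with a cross-encoder
// and keep the best finalK. crossScore(query, passage) is a placeholder
// for your reranker of choice.
async function rerank(
  question: string,
  candidates: ScoredHit[],
  crossScore: (query: string, passage: string) => Promise<number>,
  finalK = 8
): Promise<ScoredHit[]> {
  const scored = await Promise.all(
    candidates.map(async (hit) => ({
      hit,
      rr: await crossScore(question, hit.payload.text),
    }))
  );
  scored.sort((a, b) => b.rr - a.rr);
  return scored.slice(0, finalK).map((s) => s.hit);
}
```

Retrieve 30 to 50 candidates from Qdrant, rerank, and pass only the top 8 to Claude.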

Stage 3: Augmentation

Take the retrieved chunks and format them into a prompt block Claude can parse cleanly. Use XML tags. Anthropic has explicitly trained Claude to respect XML structure, and it makes citations trivial.

function buildContext(hits: Awaited<ReturnType<typeof retrieve>>): string {
  return hits
    .map(
      (hit, i) => `<document id="${i + 1}" source="${hit.payload.source}">
${hit.payload.text}
</document>`
    )
    .join("\n");
}

const SYSTEM = `You are a document Q&A assistant.

Answer using ONLY information from <documents>. Cite sources by document id.
If the answer is not in the documents, respond: "I don't have that information in the provided documents."
Never invent facts, page numbers, or sources.`;

The system prompt is load-bearing. “Use only the documents” plus “say I don’t know” catches most hallucinations. The citation instruction makes the output auditable.

For structured citations, use tool use. This gives you typed JSON out of the box, which I cover in detail in Claude API structured output and the general Claude API tool use guide.

Stage 4: Generation

import Anthropic from "@anthropic-ai/sdk";
const claude = new Anthropic();

const ANSWER_TOOL = {
  name: "answer_with_citations",
  description: "Return a grounded answer with source citations.",
  input_schema: {
    type: "object" as const,
    properties: {
      answer: { type: "string" },
      citations: {
        type: "array",
        items: {
          type: "object",
          properties: {
            document_id: { type: "integer" },
            quote: { type: "string" },
          },
          required: ["document_id", "quote"],
        },
      },
    },
    required: ["answer", "citations"],
  },
};

async function answer(question: string) {
  const hits = await retrieve(question);
  const context = buildContext(hits);

  const response = await claude.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM,
      },
      {
        type: "text",
        text: `<documents>\n${context}\n</documents>`,
        cache_control: { type: "ephemeral" },
      },
    ],
    tools: [ANSWER_TOOL],
    tool_choice: { type: "tool", name: "answer_with_citations" },
    messages: [{ role: "user", content: question }],
  });

  const toolUse = response.content.find((b) => b.type === "tool_use");
  if (!toolUse || toolUse.type !== "tool_use") throw new Error("no tool_use block in response");
  const result = toolUse.input as {
    answer: string;
    citations: { document_id: number; quote: string }[];
  };
  return { result, usage: response.usage, hits };
}

The tool_choice: { type: "tool", ... } forces Claude to emit the structured answer every time. No prefill required, no fragile JSON parsing. The cache_control on the documents block is the magic optimization I cover next.
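Forced tool use guarantees the shape, not the content: Claude can still cite a document id that was never in the context. A cheap guard (the helper name is mine) before showing citations to users:

```typescript
type Citation = { document_id: number; quote: string };

// Sketch: drop citations pointing at document ids outside 1..numDocs.
// Anything dropped is worth logging; it usually means the model invented
// a source, which is exactly what the system prompt forbids.
function validCitations(citations: Citation[], numDocs: number): Citation[] {
  return citations.filter(
    (c) => Number.isInteger(c.document_id) && c.document_id >= 1 && c.document_id <= numDocs
  );
}
```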

Full code walkthrough

Two files. One CLI wrapper.

# install
npm i @anthropic-ai/sdk @qdrant/js-client-rest @xenova/transformers \
      pdf-parse cheerio gpt-tokenizer

# run Qdrant
docker run -p 6333:6333 -v "$(pwd)/qdrant:/qdrant/storage" qdrant/qdrant

# ingest
npx tsx ingest.ts ./corpus/

# ask
npx tsx query.ts "What is our refund policy for annual subscriptions?"

The query.ts wrapper:

async function main() {
  const question = process.argv.slice(2).join(" ");
  const { result, usage, hits } = await answer(question);
  console.log(result.answer);
  console.log("\nSources:");
  for (const c of result.citations) {
    const hit = hits[c.document_id - 1];
    console.log(`  [${c.document_id}] ${hit.payload.source}`);
    console.log(`      "${c.quote.slice(0, 100)}..."`);
  }
  console.log(`\nTokens: ${usage.input_tokens} in, ${usage.output_tokens} out`);
}
main();

That is the whole document Q&A system. Around 200 lines. Production ready modulo the hardening section below.

Optimization layer

This is where a working prototype becomes something you can bill for.

Prompt cache the retrieved context. The cache_control: { type: "ephemeral" } on the documents block means repeated queries that surface the same chunks hit the cache. Cache reads are 90 percent cheaper than fresh reads. Full details in Claude API prompt caching. On a FAQ workload with a warm cache my input token bill dropped by an order of magnitude.

Query rewriting. Send the user question to Claude Haiku with a prompt like “rewrite this question into three retrieval queries that cover different angles”. Retrieve against each, union the top K, deduplicate. Catches cases where the user’s phrasing does not match the corpus vocabulary.
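The union-and-dedupe step is the fiddly part. A sketch, keyed on the Qdrant point id, keeping each chunk's best score across the query variants:

```typescript
// Sketch: merge hit lists from several query variants. Dedupe by point
// id, keep the highest score per chunk, return sorted by score.
function mergeHits<T extends { id: string | number; score: number }>(
  resultSets: T[][]
): T[] {
  const best = new Map<string | number, T>();
  for (const hits of resultSets) {
    for (const hit of hits) {
      const prev = best.get(hit.id);
      if (!prev || hit.score > prev.score) best.set(hit.id, hit);
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```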

Parent-child chunking. Embed and retrieve small chunks (300 tokens) for precision. When you pass chunks to Claude, expand each hit to include the surrounding paragraph or section. You get the precision of small chunks and the context of large ones.

Re-retrieval on “I don’t know”. Parse Claude’s answer. If it says it does not have the information, widen the retrieval (larger K, or query rewrite variants) and try once more before returning the empty answer to the user. Cheap win on edge cases.

Async for non-interactive. Batch jobs (nightly re-ingestion, offline QA generation, eval runs) should go through the Claude Batch API. 50 percent discount, 24 hour SLA, perfect for eval harnesses.

Evaluation that catches regressions

You cannot improve what you cannot measure. The minute you ship RAG without an eval harness, someone is going to change a chunking parameter and silently break retrieval quality.

Build a test set of 50 to 100 question-answer pairs drawn from your actual docs. Three metrics:

  • Retrieval precision at K: of the top K chunks, how many are actually relevant? Label manually the first time, then reuse the labels.
  • Answer accuracy: does the answer match the expected answer? Grade with a stronger model (Claude Opus 4.7) acting as judge.
  • Faithfulness: is every claim in the answer actually supported by a retrieved chunk? RAGAS automates this. Or use Claude as judge with a prompt asking it to flag unsupported claims.
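The first metric is a few lines of code once you have the labels. A sketch, per query:

```typescript
// Sketch: precision@K for one query. relevantIds is the set of chunk ids
// a human labeled relevant for this question.
function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  const topK = retrievedIds.slice(0, k);
  if (topK.length === 0) return 0;
  return topK.filter((id) => relevantIds.has(id)).length / topK.length;
}
```

Average it over the test set and fail CI when the mean drops below the committed baseline.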

Run the eval on every PR that touches the pipeline. A regression in retrieval precision is a one-line commit message away. You want CI to catch it.

Production hardening

The list that separates a demo from something on-call-worthy.

  • Monitoring: log retrieval latency, generation latency, prompt cache hit rate, input and output tokens per query. Graph them. Alert on p95 regressions.
  • Cost tracking: embedding cost per document ingested, Claude cost per query. Tag by tenant or user so you can attribute spend.
  • Rate limits: retrieval and generation have different ceilings. Qdrant handles thousands of QPS. Claude has per-minute token limits. Rate limit each independently so a burst of queries does not DoS your Anthropic quota.
  • Fallback paths: if Qdrant times out, fall back to “no context, let Claude answer from general knowledge” with a banner telling the user. Or refuse politely. Never silently return a hallucinated answer.
  • ACL enforcement: filter Qdrant search by the user’s allowed ACL groups in the payload. Never return a chunk the user is not entitled to see, even if it ranked in the top K.
  • Source freshness: track created_at and updated_at per document. Re-ingest changed files. Delete chunks for deleted documents. Stale retrieval is worse than no retrieval.
  • Replay harness: store (query, retrieved IDs, answer) tuples. When a user reports a bad answer, replay it against the latest pipeline and debug in one step.

Scaling knobs

When you outgrow the single-node default.

  • Qdrant: enable sharding for corpora over 10 million chunks. Add replicas for HA. Turn on scalar quantization (int8) to cut memory use 4x with marginal recall loss. Product quantization cuts it 16x if you can tolerate 2 to 3 percent recall drop.
  • Embeddings: batch inference on a GPU once you cross 100k new chunks a day. A single RTX 4090 handles most mid-size corpora. At scale, use the NVIDIA Triton inference server or a managed embedding API.
  • Claude: async queueing via the Batch API for anything non-interactive. For interactive workloads, spread requests across multiple API keys or regions if you hit rate limits.
  • Cold cache warmup: on deployment, pre-run your top 100 FAQ queries to warm the prompt cache. First-time users hit the fast path.

Common mistakes

I have made every one of these.

  • Over-chunking: 200-token chunks look neat but lose context. Claude gets the right chunk and still cannot answer because the supporting sentence got cut off.
  • Under-chunking: 4000-token chunks retrieve well but bury the answer in noise. Retrieval precision drops.
  • Different embedding models for ingest and query. Silent killer. Results look vaguely relevant but not quite. Always log and assert the model name on both sides.
  • No metadata filters. Filter on language, document type, tenant, recency. Users ask about recent docs and you return three-year-old ones because you did not filter on date.
  • Skipping reranking in precision-sensitive workloads. You will blame the LLM for bad answers when retrieval was the problem.
  • No evaluation. You will tune knobs by vibes, and every “improvement” will regress something else. Write the eval harness in week one.
  • Ignoring prompt caching. You will burn money on repeated context tokens. Cache the documents block.

Alternative architectures

RAG is not always the answer.

  • Long-context-only: if your entire corpus is under 200k tokens and changes rarely, put the whole thing in Claude’s context with prompt caching. No vector DB, no chunking, no retrieval step. A single legal contract, a product manual, one codebase. This is the lowest-effort, highest-quality option when it fits.
  • Fine-tuning: training a model on your corpus. Rarely worth it vs RAG. Fine-tuning bakes in knowledge but cannot be updated without retraining, and Claude is not fine-tunable via API. Leave this for edge cases with stable domains and latency-critical deployment.
  • Hybrid: keep your 10 most-used docs in Claude’s cached context (always available), and fall back to RAG for the long tail. Best of both worlds if you have a clear usage Pareto.