Claude Extended Thinking: When the Budget Pays Off

March 27, 2026 · 10 min read · claude-api, reasoning, agents, llm-infrastructure
The first time I turned on Claude extended thinking for a real agent, the run went from 4 seconds to 47. The output was better. The bill was worse. That tradeoff is the whole story.

Claude extended thinking lets Opus or Sonnet produce a block of visible reasoning tokens before the final answer. You give it a budget, it spends that budget thinking, and you pay for every thinking token at the output rate. The upside is measurable quality gains on multi-step problems. The downside is latency and cost that scale with the budget you set.

My verdict after shipping this on agent loops, code generation, and planning tasks: default off, enable selectively. Extended thinking is a power tool, not a universal upgrade. This post walks through what it does, what it costs, and the exact task shapes where the budget pays for itself.

What extended thinking actually does

When you call the Claude API with a thinking parameter, the model generates a thinking content block before the normal text block. You see the reasoning. So does the model, on the next turn if you keep it in the history.

The API shape is minimal:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 16000,
  thinking: { type: "enabled", budget_tokens: 10000 },
  messages: [
    { role: "user", content: "Refactor this function to be pure without changing its signature: ..." }
  ]
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("reasoning:", block.thinking);
  } else if (block.type === "text") {
    console.log("answer:", block.text);
  }
}

Three constraints to know up front:

  1. Opus and Sonnet only. Haiku does not support thinking. If you need reasoning at Haiku prices, you are out of luck.
  2. Thinking tokens are billed as output tokens. A 10,000-token budget on Opus is a real line item.
  3. budget_tokens must be less than max_tokens. The thinking budget is carved out of your output allocation.

The model can stop early. If it finishes reasoning in 2,400 tokens, you pay for 2,400, not 10,000. The budget is a cap, not a target. In practice Opus uses between 40 and 90 percent of the budget on tasks that actually need it, and almost none on tasks that do not.
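If you want to verify that early stop in production, the response's usage field is the place to look. The API does not report thinking tokens as a separate count, but usage.output_tokens includes them, so you can track an upper bound against your budget. A minimal sketch (the helper and the example numbers are mine, not from the API docs):

```typescript
// Rough check of how much of the thinking budget a response consumed.
// usage.output_tokens includes thinking tokens plus the final answer,
// so this is an upper bound on thinking spend, not an exact count.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

function thinkingBudgetUtilization(usage: Usage, budgetTokens: number): number {
  return Math.min(usage.output_tokens / budgetTokens, 1);
}

// A run that stopped well short of a 10k budget:
console.log(thinkingBudgetUtilization({ input_tokens: 1200, output_tokens: 2900 }, 10000)); // 0.29
```

Log this per request for a week and the right budget tier usually becomes obvious.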

The cost of thinking

Output tokens on Opus are roughly 5x the input price. That ratio is why thinking budgets matter.

Here is what a single request looks like at each common tier, assuming Opus 4.7 at current pricing (output $75 per million tokens):

| Budget tier | Thinking tokens | Cost per request | Typical use |
| --- | --- | --- | --- |
| Light | 1,024 | ~$0.08 | Quick disambiguation, small plans |
| Medium | 5,000 | ~$0.38 | Single-hop reasoning, short code gen |
| Heavy | 16,000 | ~$1.20 | Multi-step planning, complex refactors |
| Max | 64,000 | ~$4.80 | Research-grade analysis, architectural decisions |

That is per request, before the final answer’s output tokens. On Sonnet 4.6 the numbers are about one-fifth of Opus, which is why a lot of production thinking setups run Sonnet even when the team defaults to Opus for non-thinking work.

If you are doing 10,000 requests a day with a 10k budget on Opus, you are spending $7,500 a day just on thinking. For a customer-facing feature, that math does not work. For a once-a-day architectural planning agent, it is trivial.
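That arithmetic is worth scripting before you ship. A small sketch using the $75 per million output-token figure from the table above (the function name is mine):

```typescript
// Worst-case daily spend on thinking tokens alone, assuming every request
// exhausts its budget (the budget is a cap, so real spend is usually lower).
function dailyThinkingCostUsd(
  requestsPerDay: number,
  budgetTokens: number,
  outputPricePerMTok: number
): number {
  return (requestsPerDay * budgetTokens * outputPricePerMTok) / 1_000_000;
}

console.log(dailyThinkingCostUsd(10_000, 10_000, 75)); // 7500 — the customer-facing case
console.log(dailyThinkingCostUsd(1, 16_000, 75)); // 1.2 — a once-a-day planning agent, per run
```

Run your own volumes through this before enabling thinking on any high-traffic path.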

The other hidden cost is latency. A 10k-token think takes roughly 15 to 30 seconds on Opus, depending on load. A 64k think can run over a minute. Your p99 is no longer measured in seconds.

Tasks where thinking earns its keep

I have run thinking on and off across several production systems. The pattern is clear: it helps where one wrong step cascades, and it wastes money everywhere else.

Agentic loops with tool use. When an agent has to pick between five tools, each with different parameter shapes, a thinking block before the tool call reduces wrong-tool picks and parameter hallucinations. My Graffiti profiling pipeline calls three sequential tools per customer, and enabling a 4k thinking budget on the planning step cut retry rate by a visible margin. The thinking cost is small because it only runs once per agent session, not per tool call.

Code generation with constraints. “Refactor this function to be pure without changing the type signature” is exactly the kind of problem where Opus without thinking will sometimes rewrite the signature anyway. With thinking enabled it notices the constraint, reasons about which lines violate it, and produces output that passes the original tests. This is my strongest use case for thinking.

Multi-hop reasoning over structured data. If the question is “given this JSON of 40 customer events, which user is most likely to churn and why”, thinking helps. The model walks through the events, forms hypotheses, rejects some, and commits. Without thinking, it tends to latch onto the first signal.

Complex planning. Building an agent plan, writing a migration strategy, or designing an API contract. Anywhere the output needs internal consistency across 10+ decisions. I run my weekly planning cron with a 16k thinking budget on Opus, once a week. Cost is negligible at that cadence.

Tasks where it’s pure overhead

These are the ones where I have turned thinking back off after measuring:

Customer-facing chat. The latency kills the feel. A 15-second wait with no streaming makes users think the service is down. Even with streaming (more on that below), the time-to-first-visible-text is too long for any interactive UX.

Summarization and extraction. “Summarize this email in three bullets” does not need reasoning. The model already knows how to summarize. You are paying extra tokens to watch it think about a task it would get right on the first try.

High-volume classification. If you are labeling 100k support tickets a day, the cost per item matters more than the quality bump from thinking. Run Haiku without thinking, or Sonnet without thinking, and accept the small accuracy hit.

Simple retrieval and formatting. “Pull the total from this invoice” or “convert this markdown to HTML” has no reasoning surface area. Thinking adds cost and zero quality.

The pattern: if the task has one obvious path, thinking is waste. If the task has decision points where picking wrong costs real money downstream, the budget pays off.

Streaming thinking to users

When you stream a response with thinking enabled, the thinking block comes first, token by token, then the final text block. You have three choices:

  1. Hide the thinking entirely. Show a “thinking…” spinner. User waits. Works for background jobs, not interactive UIs.
  2. Show the thinking live. Render the reasoning as it streams. This is what Claude.ai does in its UI. Feels transparent and sometimes educational, but most users do not want to read 10,000 tokens of reasoning.
  3. Summarize and show a progress pulse. Stream the thinking into a collapsed panel, show a one-line “analyzing inputs… considering tradeoffs…” summary. This is the best UX I have found for production apps.

The SDK gives you thinking_delta events in the stream, separate from text_delta. Route them to different UI surfaces:

const stream = await client.messages.stream({
  model: "claude-opus-4-7",
  max_tokens: 16000,
  thinking: { type: "enabled", budget_tokens: 8000 },
  messages: [{ role: "user", content: prompt }]
});

for await (const event of stream) {
  if (event.type === "content_block_delta") {
    if (event.delta.type === "thinking_delta") {
      renderReasoningPanel(event.delta.thinking);
    } else if (event.delta.type === "text_delta") {
      renderAnswer(event.delta.text);
    }
  }
}

One rule I follow: never show thinking output verbatim to end users in a professional context. It is raw, sometimes rambles, and can reveal system prompt details. Summarize or hide it.
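One way to honor that rule while still giving users a sense of progress is option 3 from above: collapse the thinking deltas into an occasional status line rather than rendering them verbatim. A sketch (the pulse format and threshold are arbitrary choices of mine):

```typescript
// Collapse raw thinking deltas into an occasional one-line pulse instead of
// showing the reasoning verbatim. Emits a status line every N characters seen.
function makeThinkingPulse(everyNChars: number, onPulse: (msg: string) => void) {
  let seen = 0;
  let lastPulseAt = 0;
  return (delta: string) => {
    seen += delta.length;
    if (seen - lastPulseAt >= everyNChars) {
      lastPulseAt = seen;
      onPulse(`thinking… (${seen} chars of reasoning so far)`);
    }
  };
}

// Wire this to the thinking_delta branch of the stream loop:
const pulses: string[] = [];
const onThinkingDelta = makeThinkingPulse(100, msg => pulses.push(msg));
for (let i = 0; i < 30; i++) onThinkingDelta("ten chars."); // 300 chars of reasoning
console.log(pulses.length); // 3
```

The full reasoning can still go to a collapsed panel or your logs; the pulse is all the user sees by default.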

How thinking interacts with other features

Prompt caching. Thinking output is not cacheable (it is generated fresh each turn), but once a thinking block is in the message history, it counts as input for the next turn and can be cached like any other input. If you are keeping a conversation going, the previous turn’s thinking becomes cached context. See Claude API prompt caching for the full caching model.
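Concretely, that means placing a cache breakpoint at the tail of the history before the next request. A sketch, assuming the standard cache_control: { type: "ephemeral" } marker on a content block (the helper name and example history are mine):

```typescript
// Mark the last content block of the conversation history as a cache
// breakpoint, so the whole prefix (previous thinking blocks included) can be
// billed as cached input on the next turn.
type ContentBlock = {
  type: string;
  cache_control?: { type: "ephemeral" };
  [k: string]: unknown;
};
type Message = { role: "user" | "assistant"; content: ContentBlock[] };

function withCacheBreakpoint(messages: Message[]): Message[] {
  const copy = messages.map(m => ({ ...m, content: m.content.map(b => ({ ...b })) }));
  const lastMessage = copy[copy.length - 1];
  const lastBlock = lastMessage.content[lastMessage.content.length - 1];
  lastBlock.cache_control = { type: "ephemeral" };
  return copy;
}

// History from the previous turn, thinking block and all:
const history: Message[] = [
  { role: "user", content: [{ type: "text", text: "score this customer" }] },
  {
    role: "assistant",
    content: [
      { type: "thinking", thinking: "weighing the signals…" },
      { type: "text", text: "High churn risk." }
    ]
  }
];
const cached = withCacheBreakpoint(history);
console.log(cached[1].content[1].cache_control); // breakpoint lands on the final text block
```

Pass the marked array as messages on the next create call; everything before the breakpoint is eligible for the cache.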

Tool use. Thinking happens before tool calls. The model reasons, then decides which tool to invoke. This is where thinking shines for agents, because the reasoning influences tool selection. The Claude Code SDK agents pattern uses this exact combination: thinking for planning, tool calls for execution.

Structured output. Do not combine thinking with prefill-based JSON extraction. The thinking block will break your prefill expectations. Use the tool-use pattern instead: define a tool with your JSON schema, let the model think, then have it call the tool. See Claude API structured output for why tool use beats prefill when thinking is in play.
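A sketch of the extraction side of that pattern (the tool name record_churn_assessment and the simplified Block type are hypothetical; note the model is offered the tool rather than forced into it, since as I read the docs forced tool_choice is not compatible with thinking):

```typescript
// Pull the structured payload out of a tool_use block, skipping past the
// thinking block that precedes it. Block is a simplified view of response content.
type Block =
  | { type: "thinking"; thinking: string }
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

function extractToolInput(content: Block[], toolName: string): unknown {
  const block = content.find(b => b.type === "tool_use" && b.name === toolName);
  return block && block.type === "tool_use" ? block.input : null;
}

// Typical response shape with thinking enabled: reasoning first, then the call.
const content: Block[] = [
  { type: "thinking", thinking: "three signals point to churn…" },
  { type: "tool_use", id: "tu_1", name: "record_churn_assessment", input: { risk: "high" } }
];
const payload = extractToolInput(content, "record_churn_assessment");
console.log(payload); // the { risk: "high" } payload from the tool call
```

The tool's input_schema does the job the prefill used to do: the payload arrives already shaped to your schema, with the thinking block safely out of band.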

An agent loop that uses thinking selectively

Here is the pattern I run in production. Thinking is enabled only for the planning step, not for each tool execution. That keeps cost bounded and latency acceptable.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools = [
  {
    name: "fetch_customer_events",
    description: "Fetch recent events for a customer by ID",
    input_schema: {
      type: "object",
      properties: { customer_id: { type: "string" } },
      required: ["customer_id"]
    }
  },
  {
    name: "score_churn_risk",
    description: "Score a customer's churn risk given their event history",
    input_schema: {
      type: "object",
      properties: { events: { type: "array", items: { type: "object" } } },
      required: ["events"]
    }
  }
];

async function runAgent(userPrompt: string) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userPrompt }
  ];

  // First turn: thinking enabled for planning
  let response = await client.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 16000,
    thinking: { type: "enabled", budget_tokens: 4000 },
    tools,
    messages
  });

  // Subsequent turns: no thinking, just tool execution
  while (response.stop_reason === "tool_use") {
    const toolUse = response.content.find(b => b.type === "tool_use");
    if (!toolUse || toolUse.type !== "tool_use") break;

    const toolResult = await executeTool(toolUse.name, toolUse.input);

    messages.push({ role: "assistant", content: response.content });
    messages.push({
      role: "user",
      content: [{ type: "tool_result", tool_use_id: toolUse.id, content: toolResult }]
    });

    response = await client.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 4000,
      tools,
      messages
    });
  }

  return response.content.find(b => b.type === "text");
}

async function executeTool(name: string, input: unknown): Promise<string> {
  // your real tool dispatch
  return JSON.stringify({ ok: true });
}

The first call costs more (thinking budget plus planning output). Every follow-up is cheap and fast because thinking is off for tool dispatch. I have seen this pattern cut total session cost by 60 percent versus naive “thinking on every turn” setups, while keeping the quality benefit where it matters.

When to enable it

Here is my decision flow:

  1. Is the task interactive (user is waiting)? Yes > thinking off. No > continue.
  2. Does a wrong step cost real money downstream? Yes > enable thinking with a small budget (2k to 8k) and measure. No > thinking off.
  3. Are you doing more than 1,000 of these a day? Yes > measure the cost delta carefully before shipping. No > budget freely; it does not matter at low volume.

Start at 4,000 tokens. Go up only if you see the model hitting the budget ceiling (check for stop_reason: "max_tokens" on the response). Go down if it consistently uses less than half.
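That tuning loop can be encoded directly. A sketch (the doubling and halving thresholds are illustrative, and output tokens include the final answer as well as thinking, so treat the halving check as approximate):

```typescript
// Suggest the next thinking budget from observed behavior: double when the
// model hits the cap, halve when it uses under half the budget, and floor
// at 1,024 (the light tier from the table above).
function nextBudget(currentBudget: number, stopReason: string, outputTokens: number): number {
  if (stopReason === "max_tokens") return currentBudget * 2;
  if (outputTokens < currentBudget / 2) return Math.max(1024, Math.floor(currentBudget / 2));
  return currentBudget;
}

console.log(nextBudget(4000, "max_tokens", 4000)); // 8000 — hitting the ceiling, raise
console.log(nextBudget(8000, "end_turn", 2400)); // 4000 — wasted headroom, lower
console.log(nextBudget(4000, "end_turn", 3000)); // 4000 — in the sweet spot, keep
```

Run this offline over a week of logged responses rather than adjusting live; budget changes shift latency and cost at the same time.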

Do not default-enable thinking across your whole application. The bill will surprise you, the p99 latency will degrade, and for most tasks the quality gain is not there. Pick the two or three steps in your system where a cascading wrong choice is expensive, turn it on there, and leave it off everywhere else.
