How do I cut vision token costs in a browser agent?

Move click grounding to a local 2B specialist model. Frontier vision models handle reasoning (next action); the local model emits JSON bboxes for clickable elements. Vision token spend on the grounding step drops to zero.

Why does browser-use spend so many tokens on screenshots?

Every step bundles two jobs into one frontier-VLM call: reasoning over the page state and grounding the click coordinates. Both are billed at frontier rates per screenshot, 20 to 50 times per agent run.

What is the ScreenSpot-v2 benchmark?

ScreenSpot-v2 measures whether a vision model can locate a UI element from a natural-language target description. GPT-4o scores 18.3% on it. A 2B specialist trained on UI screenshots can reach 45 to 90% on the same benchmark.

Can a 2B model beat GPT-4o at vision tasks?

Not in general. On the narrow click-grounding task, yes. browserground (2B parameters, Qwen3-VL base) scores 45.3% on ScreenSpot-v2 versus GPT-4o's 18.3%. Specialists beat generalists at narrow jobs.

How does a local grounding model integrate with browser-use or Skyvern?

Via a thin adapter. The local model parses the screenshot into a JSON list of clickable elements with bboxes. You pass that JSON, not the raw image, to the frontier reasoning model. Adapters exist for browser-use, Skyvern, Claude Computer Use, and Operator-style patterns.

What does it cost to fine-tune a 2B vision model for UI grounding?

About $4 of cloud GPU on a single L40S over three hours. browserground used roughly 13k labeled UI screenshots across macOS, Android, and web data with LoRA rank 16, one epoch.

Splitting Grounding from Reasoning in Browser-Agent Stacks

May 19, 2026 · 4 min read · ai, opensource, productivity

Everyone’s posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.

Mine broke too. The submit button on a checkout form that the frontier vision model literally couldn’t see. Billed at $0.01-0.05 per call, called 20-50 times per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $4 hits that same button 2.5x more reliably on ScreenSpot-v2.

The architecture is the bug, not the model.

Two Jobs, One Forward Pass

Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy.

browser-use (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step:

Given this page, what should the agent do next, and where exactly does it click?

Two jobs welded together. Reasoning (“what next”) is a probabilistic problem worth a frontier model. Grounding (“where exactly”) is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels.

You’re paying frontier-tier rates for the second job. Per screenshot. Every step of the loop.

Grounding Is a Parser Problem

Once you name it as a parser problem, the right tool changes. You don’t need 200 billion parameters to emit a JSON list of clickable elements. You need a model that:

Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision
Outputs strict JSON without hallucinating bounding boxes
Runs locally so the per-step cost is zero

A 2B specialist trained on screen-parsing data. Not a frontier model.

I trained one. Total cost: $4 of RunPod compute on a single L40S GPU over three hours. The result, browserground, hits 45.3% on ScreenSpot-v2 against GPT-4o’s 18.3%. A 2.5x beat at the click-grounding job, by a model roughly 100x smaller. It runs in 2 GB of unified memory on a MacBook Air at 1.8 seconds per inference. A drop-in for any agent loop currently handing screenshots to a frontier API.

The Reasoning Model Gets Its Reasoning Capacity Back

When you split the call, the frontier model stops seeing pixels. It sees:

{
  "elements": [
    {"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]},
    {"id": "e8", "label": "Edit cart",    "type": "link",   "bbox": [...]}
  ]
}

Now the frontier model does the job it’s good at: deciding e7 vs e8 given the agent’s goal. A reasoning question over structured input. Cheap. Reliable. Auditable.

Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 17M LoRA parameters). Agent traces become debuggable. You read the structured grounding output before the reasoning step ever runs.

What Anthropic and OpenAI Ship Next

The frontier providers will absorb grounding into their own small models. Within twelve months, “fast vision” or “tool vision” tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-4 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull ~300k category downloads a month between them.

When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn’t. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.

Stack owners who didn’t split will find their grounding step has quietly become someone else’s API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor’s deprecation calendar.

The Diagnostic

Pull up your last failed agent trace. Three numbers:

Total tokens spent on vision calls per agent step.
Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).
Per-run vision cost at your current API rates.

If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.

Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.

I build the split layer. browserground is the open-source reference. Model: huggingface.co/renezander030/browserground. CLI: github.com/renezander030/browserground. npm: browserground (npm install -g browserground, then browserground parse <img> --target "..."). Apache-2.0. Plugin entry points for Claude Code and Codex CLI. PRs welcome, especially eval cases where it fails.

Splitting Grounding from Reasoning in Browser-Agent Stacks

Two Jobs, One Forward Pass

Grounding Is a Parser Problem

The Reasoning Model Gets Its Reasoning Capacity Back

What Anthropic and OpenAI Ship Next

The Diagnostic

Before you go —

Almost there

Splitting Grounding from Reasoning in Browser-Agent Stacks

Two Jobs, One Forward Pass

Grounding Is a Parser Problem

The Reasoning Model Gets Its Reasoning Capacity Back

What Anthropic and OpenAI Ship Next

The Diagnostic

Scope my automation in 24h

Request received