Splitting Grounding from Reasoning in Browser-Agent Stacks

Everyone’s posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.
Mine broke too. The submit button on a checkout form that the frontier vision model literally couldn’t see. Billed at $0.01-0.05 per call, called 20-50 times per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $4 hits that same button 2.5x more reliably on ScreenSpot-v2.
The architecture is the bug, not the model.
Two Jobs, One Forward Pass
Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy.
browser-use (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step:
Given this page, what should the agent do next, and where exactly does it click?
Two jobs welded together. Reasoning (“what next”) is a probabilistic problem worth a frontier model. Grounding (“where exactly”) is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels.
You’re paying frontier-tier rates for the second job. Per screenshot. Every step of the loop.
Grounding Is a Parser Problem
Once you name it as a parser problem, the right tool changes. You don’t need 200 billion parameters to emit a JSON list of clickable elements. You need a model that:
- Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision
- Outputs strict JSON without hallucinating bounding boxes
- Runs locally so the per-step cost is zero
A 2B specialist trained on screen-parsing data. Not a frontier model.
I trained one. Total cost: $4 of RunPod compute on a single L40S GPU over three hours. The result, browserground, hits 45.3% on ScreenSpot-v2 against GPT-4o’s 18.3%. A 2.5x beat at the click-grounding job, by a model roughly 100x smaller. It runs in 2 GB of unified memory on a MacBook Air at 1.8 seconds per inference. A drop-in for any agent loop currently handing screenshots to a frontier API.
The Reasoning Model Gets Its Reasoning Capacity Back
When you split the call, the frontier model stops seeing pixels. It sees:
{
"elements": [
{"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]},
{"id": "e8", "label": "Edit cart", "type": "link", "bbox": [...]}
]
}
Now the frontier model does the job it’s good at: deciding e7 vs e8 given the agent’s goal. A reasoning question over structured input. Cheap. Reliable. Auditable.
Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 17M LoRA parameters). Agent traces become debuggable. You read the structured grounding output before the reasoning step ever runs.
What Anthropic and OpenAI Ship Next
The frontier providers will absorb grounding into their own small models. Within twelve months, “fast vision” or “tool vision” tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-4 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull ~300k category downloads a month between them.
When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn’t. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.
Stack owners who didn’t split will find their grounding step has quietly become someone else’s API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor’s deprecation calendar.
The Diagnostic
Pull up your last failed agent trace. Three numbers:
- Total tokens spent on vision calls per agent step.
- Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).
- Per-run vision cost at your current API rates.
If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.
Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.
I build the split layer. browserground is the open-source reference. Model: huggingface.co/renezander030/browserground. CLI: github.com/renezander030/browserground. npm: browserground (npm install -g browserground, then browserground parse <img> --target "..."). Apache-2.0. Plugin entry points for Claude Code and Codex CLI. PRs welcome, especially eval cases where it fails.