Build the Harness Once With Your Best Model. Run It on a Cheap One.

June 3, 2026 · 4 min read · ai, agents, llm
Build the Harness Once With Your Best Model. Run It on a Cheap One.

Build your AI skill once with your best model. Then run it on a model that costs a tenth as much until the next flagship ships. The output will not drop.

That sounds like a downgrade. It is not. It fixes the two things that make AI agents painful right now: they forget, and the good ones cost. Both get fixed in the same place, and it is not the model you pick.

You explain the goal, the agent nails it twice, then on the third run it quietly drops the one constraint that mattered. Upgrading the model does not fix that. It only makes the dropped constraint cost more. You are paying frontier prices to be forgotten more politely.

The goal is living in two places that cannot hold it. In the conversation, where it rots the moment the context gets long. And inside the model, where keeping it sharp burns money on every run.

The constraint it dropped on Tuesday belongs in a script

Quality comes from whatever checks the work. The model that produced it is almost incidental. So decide the exact exit criteria for each step of your skill, then write a deterministic script that enforces them. The folders exist. The file parses. The test passes. The lint is clean. The agent reads the script’s verdict instead of grading its own output.

A script cannot forget the goal. That is the whole point. Your agent drops constraints because you trusted a probabilistic system to hold a hard requirement in its head. Move the requirement into code that fails the run when it is broken, and forgetting stops being possible. You are not repeating yourself anymore, because the harness repeats it for you, every run, exactly.

Where the expensive model actually earns its price

This is the one place a frontier model earns its price. Use the best model you have to build the skill as exactly as you can today. Name the phases. State the precise goal of each. Get the exit scripts right. That is hard, judgment-heavy work, and you do it once.

Then swap the model out and run the skill on something cheap. Gemini 2.5 Flash through OpenRouter, driven from the opencode desktop app if you want a UI instead of a terminal. The cheap model generates. The scripts gate. You review the scripts’ output, not the model’s opinion of its own work.

The cheap model clears the same bar, because the bar is enforced outside it. A model that costs a fraction as much produces work you can trust. It did not get smarter overnight. It is no longer the thing deciding whether the work is good enough to ship.

The frontier model is a contractor you re-hire on release day

Here is the cadence. A new flagship model ships. You bring it in for one job: build any new skills, and re-validate every harness you already run against the new ceiling. Then you let it go. Until the next flagship drops, you run everything exclusively on cheap and local models, small language models included, wherever they win on the bottom line.

That inverts the dependency everyone assumes they are stuck with. You are not renting frontier intelligence for as long as the product lives. You pay top rate for a few build days a release cycle, and the thing that runs ten thousand times a month is a small model that costs almost nothing. The forgetting is gone, because a script holds the goal. The bill no longer scales with quality, because a cheap model clears the scripts.

I build the harness, not a standing dependency on whoever ships the smartest model this quarter.

Open the last agent you argued with. How much of that conversation was you re-explaining a goal a script could have held? And which model were you paying to forget it?

Written quote in 24h

5 fields. I reply within 24h with either “yes, fixed price X, duration Y” or “no, here’s why not”.

Request received

You’ll hear from me within 24h with an honest assessment.

Prefer to talk? 30-min roadmap call →