Can a cheaper LLM match a flagship model's output quality?

On its own, often not. Behind a deterministic harness, yes. When exit criteria are enforced by scripts the model cannot pass without meeting, a cheaper model produces work you can trust, because the quality bar lives outside the model instead of inside it.

What is a deterministic harness for an AI agent?

A set of local scripts that check the agent's output against exact exit criteria: folders exist, files parse, tests pass, lint is clean. The agent reads the script's verdict instead of grading its own work, so the standard is enforced the same way every run.

Why do AI agents forget instructions, and how do I stop repeating myself?

A model holds requirements probabilistically, so a long conversation eventually drops one. Move the requirement into code that fails the run when it is broken. A script cannot forget, so you stop re-explaining the same constraint.

How do I run an AI skill on a cheap model like Gemini 2.5 Flash?

Build the skill with your best model first, encode each phase's exit criteria as scripts, then point the runner at Gemini 2.5 Flash through OpenRouter. The opencode desktop app gives you a UI if you do not want a terminal. The model generates, the scripts gate.

Should an AI model evaluate its own output?

No. Self-evaluation lets the model forget or rationalize the goal. A deterministic script evaluates against fixed criteria, and the agent reads that verdict. Evaluation belongs outside the thing being evaluated.

Build the Harness Once With Your Best Model. Run It on a Cheap One.

June 3, 2026 · 4 min read · ai, agents, llm

Build your AI skill once with your best model. Then run it on a model that costs a tenth as much until the next flagship ships. The output will not drop.

That sounds like a downgrade. It is not. It fixes the two things that make AI agents painful right now: they forget, and the good ones cost. Both get fixed in the same place, and it is not the model you pick.

You explain the goal, the agent nails it twice, then on the third run it quietly drops the one constraint that mattered. Upgrading the model does not fix that. It only makes the dropped constraint cost more. You are paying frontier prices to be forgotten more politely.

The goal is living in two places that cannot hold it. In the conversation, where it rots the moment the context gets long. And inside the model, where keeping it sharp burns money on every run.

The constraint it dropped on Tuesday belongs in a script

Quality comes from whatever checks the work. The model that produced it is almost incidental. So decide the exact exit criteria for each step of your skill, then write a deterministic script that enforces them. The folders exist. The file parses. The test passes. The lint is clean. The agent reads the script’s verdict instead of grading its own output.

A script cannot forget the goal. That is the whole point. Your agent drops constraints because you trusted a probabilistic system to hold a hard requirement in its head. Move the requirement into code that fails the run when it is broken, and forgetting stops being possible. You are not repeating yourself anymore, because the harness repeats it for you, every run, exactly.

Where the expensive model actually earns its price

This is the one place a frontier model earns its price. Use the best model you have to build the skill as exactly as you can today. Name the phases. State the precise goal of each. Get the exit scripts right. That is hard, judgment-heavy work, and you do it once.

Then swap the model out and run the skill on something cheap. Gemini 2.5 Flash through OpenRouter, driven from the opencode desktop app if you want a UI instead of a terminal. The cheap model generates. The scripts gate. You review the scripts’ output, not the model’s opinion of its own work.

The cheap model clears the same bar, because the bar is enforced outside it. A model that costs a fraction as much produces work you can trust. It did not get smarter overnight. It is no longer the thing deciding whether the work is good enough to ship.

The frontier model is a contractor you re-hire on release day

Here is the cadence. A new flagship model ships. You bring it in for one job: build any new skills, and re-validate every harness you already run against the new ceiling. Then you let it go. Until the next flagship drops, you run everything exclusively on cheap and local models, small language models included, wherever they win on the bottom line.

That inverts the dependency everyone assumes they are stuck with. You are not renting frontier intelligence for as long as the product lives. You pay top rate for a few build days a release cycle, and the thing that runs ten thousand times a month is a small model that costs almost nothing. The forgetting is gone, because a script holds the goal. The bill no longer scales with quality, because a cheap model clears the scripts.

I build the harness, not a standing dependency on whoever ships the smartest model this quarter.

Open the last agent you argued with. How much of that conversation was you re-explaining a goal a script could have held? And which model were you paying to forget it?

Fixed price and milestones — or a clear no with reasons.

From pilot to production

Running an AI pilot that is not production-ready yet? That is exactly what I do: audit, fixed-price scope, delivery in 2–6 weeks.

Make your AI pilot production-ready → Production audit ($1,900 fixed)

Build the Harness Once With Your Best Model. Run It on a Cheap One.

The constraint it dropped on Tuesday belongs in a script

Where the expensive model actually earns its price

The frontier model is a contractor you re-hire on release day

Before you go —

Almost there

Build the Harness Once With Your Best Model. Run It on a Cheap One.

The constraint it dropped on Tuesday belongs in a script

Where the expensive model actually earns its price

The frontier model is a contractor you re-hire on release day

Scope my automation in 24h

Request received