
What We Learned After $45 and 52 AI Coding Benchmarks (The Part That Matters to You)
The biggest factor in AI development cost isn't the model — it's the brief. 52 benchmarks and an 85% total cost reduction, explained for decision-makers.
UpGPT Team
April 19, 2026 · 9 min read
Before the Numbers
We spent three weeks running controlled experiments on AI-assisted software development. We tested every major approach. We let an adversarial reviewer tear our methodology apart. We re-ran things that gave us answers we liked to make sure we weren't fooling ourselves.
Here's what we found, and what it means if you're using AI to build software — or paying someone to build it for you.
Our results come from a specific setup — a production Next.js/TypeScript/Supabase codebase, claude-sonnet-4-6 as the worker model, and a class of greenfield TypeScript tasks. Your numbers will differ. The directions (contracts help, retries hurt, compression saves tokens) appear structurally sound based on the research literature — but magnitudes are setup-specific. The full technical detail, every table, and every run is in the detailed technical version.
The One Finding That Changes Everything
The biggest variable in AI-assisted development is not the tool, the model, or the parallelism. It's what you tell the AI before it starts.
We call this a CONTRACT.md — a structured brief written before any worker touches code. It contains exact TypeScript interfaces, exact database column names, exact import paths, exact SQL conventions, explicit non-goals. The AI stops exploring and starts executing.
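To make that concrete, here is a minimal sketch of what such a brief could contain. The feature, file paths, table, and interface below are invented for illustration — the article does not publish a real contract:

```markdown
# CONTRACT.md — Add bookmarks feature (illustrative example)

## Exact interfaces
interface Bookmark {
  id: string;          // uuid, primary key
  user_id: string;     // FK → auth.users.id
  url: string;
  created_at: string;  // ISO 8601, column default now()
}

## Exact paths
- Route: app/api/bookmarks/route.ts
- Import the Supabase client from "@/lib/supabase/server"

## SQL conventions
- snake_case columns, plural table names (public.bookmarks)

## Non-goals
- No UI changes. No auth changes. No migrations beyond the bookmarks table.
```

The point is not the template — it's that every name the worker would otherwise have to discover by reading the codebase is stated up front.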
The numbers: -54% cost. -60% time. Quality score goes from 5/10 to 9/10.
Same model. Same architecture. Just a better document.
That's not a marginal improvement. That's the difference between an AI session that costs $5.45 and one that costs $2.51 — before any other optimization. Stack in the compression and orchestration improvements below, and the same session reaches $0.83. An 85% total cost reduction, compared to running these sessions naively.
We proved this in a proper 2×2 factorial experiment — 20 controlled runs, four conditions, with the CONTRACT.md as the only variable. It held up.
What Doesn't Work (That Sounds Like It Should)
Parallel agent teams — more expensive, same quality
Anthropic markets "Agent Teams" as a way to parallelize work and finish faster. Technically true. But the cost is brutal:
- For a medium task: 124% more expensive than sequential at equivalent quality
- For a large task: 73% more expensive
- Quality: no better, sometimes worse
Parallelism has a tax — every agent loads the codebase context independently. Three agents means three copies of the same 80,000-token context. The cache burn dominates. Agent Teams is a cost trap. Avoid it.
Retry loops — they make the output worse
The intuition seems sound: if the AI misses something, have it check its own work and try again. Several research papers recommend it. We built it, ran N=5 controlled experiments, and found the opposite.
Self-improvement retries: improved acceptance criteria by 1 item, degraded overall quality from 9/10 to 6/10, and cost 2.1× more.
The mechanism: when a model retries, it doesn't make surgical edits. It regenerates entire files. Fixing a broken import path means rewriting the whole route file — and losing all the CRUD endpoints and tests that were correct the first time. We saw this in 15 retry attempts across 3 runs. It happened every single time.
"Just use a smarter model to review the output" — doesn't help when the brief is good
We tested having Opus read the output in full and make surgical edits (no retry loop — just a targeted one-shot patch). Clean N=5 retest with full file context:

Zero quality gain. +56% cost. When the CONTRACT.md is well-formed, Sonnet already reaches 9.8/10. There's nothing for Opus to fix.
The right move: write the contract right. Don't retry. Don't add a review pass. The brief is the quality lever — not the verification architecture.
What Does Work
In order of impact:
- The CONTRACT.md brief (always). For any task touching 3+ files: write the contract first. Cost ~$0.15 to generate, saves 47–54% on the task itself. Paid back on the first run, every time.
- File content compression (always). Instead of feeding entire source files into context, extract only exported symbols using AST parsing. We measured 85–91% token reduction across a real production codebase. Zero quality tradeoff. Set up once, free forever.
- Cheap orchestrator for long sessions. Replacing the full orchestrator model with a lightweight model that reads/writes a compact state file cuts orchestration cost by 96%.
- Cheap models for boilerplate workers. Haiku with a CONTRACT.md scores 9.0/10 on the same task Sonnet scores 9.8/10 — at 64% lower cost. The scaffolding does most of the work.
- Sequential execution as the default. When speed isn't the primary constraint, a single sequential session with a good contract consistently beats parallel approaches on cost and matches them on quality.
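The compression idea in the second item above can be shown in a few lines. The real pipeline uses full AST parsing; the line-based filter below is only a simplified stand-in to illustrate what "exported symbols only" means — it ignores multi-line signatures and re-exports:

```typescript
// Simplified stand-in for AST-based context compression: keep only
// the signature lines of exported symbols, dropping bodies, private
// helpers, and comments. A real implementation would walk the AST.
function compressForContext(source: string): string {
  const kept: string[] = [];
  for (const rawLine of source.split("\n")) {
    const line = rawLine.trim();
    if (/^export\s+(async\s+)?(const|function|class|interface|type|enum)\b/.test(line)) {
      // Truncate at the opening brace so only the signature survives.
      kept.push(line.replace(/\s*\{[\s\S]*$/, " { /* … */ }"));
    }
  }
  return kept.join("\n");
}

// A 4-line file with one exported function and one private helper
// compresses to a single signature line:
const file = [
  "export function getUser(id: string) {",
  "  return db.query(id);",
  "}",
  "function privateHelper() {}",
].join("\n");
console.log(compressForContext(file)); // only the getUser signature remains
```

On real files, most of the token count lives in bodies and helpers, which is why signature-only extraction can cut context so dramatically.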
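The lightweight-orchestrator pattern in the third item boils down to a compact state file that the cheap model reads and rewrites between steps, instead of carrying the full session history in context. The schema below is invented for illustration — the article does not publish its actual state format:

```typescript
import * as fs from "fs";

// Hypothetical compact session state: the orchestrator only ever
// sees this small JSON blob, never the full conversation history.
interface SessionState {
  task: string;
  completed: string[]; // finished subtask ids
  pending: string[];   // remaining subtask ids
  notes: string[];     // short facts workers must not rediscover
}

function loadState(path: string): SessionState {
  return JSON.parse(fs.readFileSync(path, "utf8"));
}

// Mark one subtask done and optionally record a fact for later workers.
function advance(state: SessionState, finished: string, note?: string): SessionState {
  return {
    ...state,
    completed: [...state.completed, finished],
    pending: state.pending.filter((id) => id !== finished),
    notes: note ? [...state.notes, note] : state.notes,
  };
}

function saveState(path: string, state: SessionState): void {
  fs.writeFileSync(path, JSON.stringify(state, null, 2));
}
```

Because the state file stays a few hundred tokens regardless of session length, the orchestration cost stops growing with the conversation — which is where the quoted 96% saving comes from.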
The Stacked Numbers
Starting point: a large 8-agent coding session costs $5.45 when run naively (no compression, no contracts, full model for orchestration).

$5.45 → $0.83. -85%. These aren't projections. These are measured results on our production codebase. The full methodology is in the technical version.
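Assuming the reductions stack multiplicatively — our reading of the figures above; the exact decomposition is in the technical version — the arithmetic works out like this:

```typescript
const naive = 5.45;                                // naive 8-agent session, USD
const afterContract = naive * (1 - 0.54);          // CONTRACT.md alone → ≈ $2.51
const fullyOptimized = 0.83;                       // measured, after compression + cheap orchestration
const totalReduction = 1 - fullyOptimized / naive; // ≈ 0.85, i.e. the headline -85%
console.log(afterContract.toFixed(2), Math.round(totalReduction * 100)); // 2.51 85
```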
What This Means for How You Build Software
This methodology — CONTRACT-first development, AST compression, lightweight orchestration, and model routing — is baked into every UpGPT engagement. When we build for your team, you start from the optimized state on day one, not the state teams eventually reach after weeks of costly iteration.
The open-source CLI and benchmark data are available at upgpt.ai/tools/upcommander for technical teams who want to understand the methodology.
If you're evaluating a development agency or AI solutions partner: Ask them what their AI session methodology looks like. If the answer involves "we just run it and see what happens" or "we use retry loops to catch quality issues," you're likely paying 2–3× more than necessary for the same output quality. The benchmark data is public — the numbers are reproducible.
If you're evaluating AI coding platforms: The overhead question is real. Tools that promise speed through parallelism often deliver it at cost multiples that don't appear in demos. Ask for cost-per-task data on tasks similar to yours, not just wall-clock time.
What This Means if You're Thinking About AI Solutions for Your Business
UpGPT designs, builds, and operates agentic AI solutions for businesses — architecture, implementation, deployment, and ongoing administration.
These benchmarks are why that matters.
The 85% cost reduction we achieve today required 52+ controlled experiments, an adversarial review, a formal 2×2 factorial design, and findings that overturned nearly every assumption we started with. That research is now baked into how we build — every solution gets CONTRACT generation, AST compression, and lightweight orchestration from day one. You start from the optimized state, not the one you'd eventually discover yourself after weeks of iteration.
And we keep benchmarking. Multi-model routing (Haiku for boilerplate, Sonnet for design) is confirmed: 64% cost savings with a 0.8pt quality tradeoff. Next: production-scale refactoring and multi-model routing in parallel worker sessions.
If you want to explore what an agentic solution looks like for your business, the conversation starts here.
The Short Version
- Write the contract first. Always. For any task touching 3+ files. It's the single biggest lever in the entire stack.
- Don't parallelize unless speed has direct business value. The cost premium is real and documented.
- Don't use retry loops. They degrade quality. Don't add an Opus review pass either — when the contract is good, Sonnet already hits 9.8/10.
- Compress your file context. 85–91% token reduction. Free.
- The model matters less than the scaffolding. Haiku + CONTRACT = 9.0/10 at 64% less than Sonnet's cost.
- Grading is cross-vendor reliable. Opus, GPT-4o, and Gemini agree within ±1 point. Gemini is strictest and catches issues the others miss.
Full methodology, raw data, and every uncomfortable number is in the detailed technical version.
Developers: The CLI is open source — full benchmark data, methodology, and source at upgpt.ai/tools/upcommander. A free VS Code extension is in development — star the repo to follow.