Your AI Coding Limits Are a Planning Problem, Not a Model Problem

If you have ever stared at a Claude usage limit, switched to a local model, and watched it produce code that does not compile, you have probably blamed the model. The local model is too small. The cloud provider is too expensive. The tool is too limited.

That blame is almost always misplaced.

The failure is rarely the worker. It is the blueprint.

In this guide, we walk through a workflow that lets you punch far above the weight class of your local hardware, repeatedly ship working software, and avoid the predictable mid-task collapse that plagues almost every "run an agent and walk away" setup. We will use a top-down survival game built with a frontier architect and a local execution model as a running case study, but the pattern applies to any multi-file software project.

The central claim is simple: your AI coding limits are planning limits, not model limits. Most people skip the plan because it feels slower. It is not. It is the difference between building a house with an architect and building a house by showing up with a hammer and hope.

The Architect, Blueprint, and Contractor Model

Think of software construction the same way you think of physical construction.

You do not show up to a job site with a hammer and start swinging. You hire an architect. The architect sits with you, asks twenty-plus questions, and produces a set of drawings: where the load-bearing walls go, where the plumbing runs, what the house will look like in five years when the tree you planted today has grown.

Then you hire trades. Framers. Electricians. Plumbers. They do not redesign the house. They build what is already drawn. They are cheaper, they are faster, and they are only as good as the plan handed to them.

You are the general contractor on site. You are not swinging hammers. You are not drafting blueprints. You are reading the plan, handing the trades one task at a time, and verifying each one before moving on.

In the AI coding world:

The architect is the frontier model. Claude. Codex. GPT-4 class.
The trades are the execution model. Gemma. Llama. Qwen.
You are still the GC. The job site does not run itself.

This is not an analogy. It is the actual control flow.

graph TD
    A[User + Architect Model<br/>Claude / Codex] -->|Interview & Design| B[Blueprint<br/>Architecture, Build Plan, Tasks]
    B -->|One task at a time| C[Contractor Model<br/>Gemma 4 31B]
    C -->|Implement & Stop| D{GC Verification}
    D -->|Pass: next task| B
    D -->|Fail or stuck| E[Escalate to Architect]
    E -->|Revised Guidance| B

Figure 1: The four-stage architect-blueprint-contractor workflow.

The architect never swings the hammer. The contractor never redesigns the house. You never stop verifying. This loop is boring, tedious, and almost always produces working code.

Why Your Limits Are Planning Limits

When you hit a Claude limit, the immediate reaction is to "optimize" the model. Use a smaller model. Use a local model. Use a cheaper API. Use fewer tokens per message.

All of these are worker-side optimizations.

The actual bottleneck is that most developers hand the AI a sentence or two and expect a multi-file application. That is not a model limitation. That is a specification limitation. If you hand a human engineer a sticky note that says "build a clone of Vampire Survivors," they will not produce clean architecture either. They will produce garbage, ask for clarification, or quit.

Models are the same. A model is a worker. A worker with a perfect blueprint can do amazing work. A worker with a blueprint cut in half will produce garbage, regardless of how large the context window is.

This distinction matters because it changes your optimization target. Instead of chasing the newest, biggest model, you invest time in the interview with the architect. That investment pays back in every subsequent task.

Stage 1: The Interview

This is the stage everyone skips, and it is also the stage that saves the most tokens later.

You sit down with the architect. You do not tell it what to build. You let it ask you questions until both of you know exactly what you are building.

In practice, this looks like giving the architect one sentence and letting it interview you for twenty to twenty-five minutes.

One-sentence prompt:

"I want to build a Vampire Survivors-style top-down shooter in vanilla HTML and JavaScript."

Architect response: The architect comes back with twenty-plus questions:

What movement scheme? WASD? Arrow keys? Both?
How many enemy types?
How does difficulty scale over time?
Is there autofire?
What is the scoring model?
Does progress persist between sessions?
What state machine governs gameplay states?
What polish features matter most?

Each answer locks one piece of the build. By the end of the interview, you have a shared mental model with the architect. There are no surprises.

sequenceDiagram
    participant You
    participant Architect
    You->>Architect: One-sentence project idea
    loop Interview
        Architect->>You: Question about player movement
        You->>Architect: Answer
        Architect->>You: Question about enemy variety
        You->>Architect: Answer
    end
    Architect->>You: Seven markdown files (brief, stories, architecture, build plan, local/cloud split, interview notes, task list)

Figure 2: The interview stage. The architect extracts requirements; you do not skip it.

If you skip this stage, you will save twenty minutes at the start and lose hours later. The architecture document, the build plan, and the task list are the outputs of the interview. Do not skip the architect meeting.

Stage 2: The Blueprint

In the case study, the architect produces seven markdown files:

Product Brief — what we are building and why.
User Stories — who uses what and how.
Architecture Plan — file structure, state machines, module boundaries.
Build Plan — a list of tasks, each small enough to finish in one turn.
Local versus Cloud Split — which model handles which layer.
Interview Notes — the Q&A from Stage 1.
Task List — fifteen concrete implementation tasks.

This is the blueprint. It is detailed. It is boring. It is correct.

The contractor does not get all of these files. The contractor gets the build plan only, divided into fifteen tasks. Each task is small enough that a model can finish it in one response.

Why? Because large models are generalists. They get distracted. If you hand them the full context and say "do everything," they will hallucinate file structures, invent enemy behaviors you never asked for, and quit halfway through task three.

If you hand them one task and say "stop when done," they implement that task, and they stop.
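Splitting a build plan into single-task prompts can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tooling used in the case study: the function names `splitTasks` and `taskPrompt` are hypothetical, and the sketch assumes every task starts with a `## Task N: Title` heading.

```javascript
// Hypothetical helper: split a build plan into one record per task.
// Assumes each task begins with a "## Task N: Title" heading.
function splitTasks(markdown) {
  const tasks = [];
  let current = null;
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    const match = /^## Task (\d+):\s*(.+)$/.exec(line);
    if (match) {
      current = { number: Number(match[1]), title: match[2], body: [] };
      tasks.push(current);
    } else if (current && line !== "") {
      current.body.push(line);
    }
  }
  return tasks;
}

// Build the prompt for exactly one task, ending with an explicit stop instruction.
function taskPrompt(task) {
  return [
    `Implement task ${task.number} (${task.title}) only.`,
    ...task.body,
    "Stop when done and wait for verification.",
  ].join("\n");
}

const plan = [
  "## Task 1: Canvas Setup",
  "- Create index.html, style.css, main.js",
  "",
  "## Task 2: Player Movement",
  "- Add player square at canvas center",
].join("\n");

const tasks = splitTasks(plan);
console.log(taskPrompt(tasks[0]));
```

The contractor never sees task two while it is working on task one, which is the whole point of the split.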

graph LR
    B[Blueprint Outputs] --> P[Product Brief]
    B --> U[User Stories]
    B --> A[Architecture Plan]
    B --> BP[Build Plan]
    B --> LC[Local vs Cloud Split]
    B --> IN[Interview Notes]
    B --> T[Task List - 15 items]
    T --> T1[Task 1: Setup]
    T --> T2[Task 2: Player Movement]
    T --> T3[Task 3: Enemy Spawner]
    T --> TN[...]

Figure 3: The blueprint outputs. The contractor sees only the build plan with task boundaries.

Stage 3: Execution

You hand the contractor the build plan.

In the case study, the contractor was Gemma 4, 31 billion parameters, running on Ollama Cloud. The setup required two commands:

ollama signin
ollama launch claude --model gemma4:31b --cloud

From there, Claude Code talks to the contractor through the Ollama bridge. The workflow is:

Open the project in Claude Code with the contractor model active.
Drop the build plan markdown into the project.
Say: "Read the build plan. Implement task one only. Stop and wait."
The contractor writes the code. It stops.
You verify.
If it looks right, you say: "Proceed to task two."

This is not autonomous. This is not "let the agent run for three hours." This is task by task.

flowchart LR
    Start[Load Build Plan] --> T1[Task 1]
    T1 --> V1{Verify}
    V1 -->|Pass| T2[Task 2]
    V1 -->|Fail| E1[Escalate to Architect]
    E1 -->|Revised task| T1
    T2 --> V2{Verify}
    V2 -->|Pass| T3[Task 3]
    V2 -->|Fail| E2[Escalate to Architect]
    E2 -->|Revised task| T2
    T3 --> Dots[...]
    Dots --> End[Build Complete]

Figure 4: The task-by-task loop. Human verification between every task.

Why the autonomous loop fails

You might be tempted to hand the build plan to the model and say: "Verify each task. Run all the tests. Come back when you are finished."

This does not work. In the case study, the 31B Gemma 4 model quit twice under exactly these conditions: same machine, same context, same prompt. Once after task three. Once after task five. There was no error. It just stopped.

The lesson is clear: there is no job size yet that a local model can be trusted to finish unattended. The contractor needs the GC there to hand it the next task and verify the last one.

This is not a failure of the model. It is a limitation of the control loop. Treat it like a real job site. The trades do not call the architects. The GC calls the trades.

Context Window Reality Check

One of the most misunderstood concepts in local model deployment is the relationship between context window size and model capability.

Claude Code does not just send your prompt to the model. It sends a system prompt containing definitions, instructions, agent setup, and the entire tool interface. That system prompt alone is fifty to sixty-five thousand tokens before your actual question ever enters the picture.

This means:

A model with a thirty-two thousand token context window cannot even see the full system prompt. The model is guessing. The cue "I need more information" is the model telling you it received a blueprint that was cut in half.
A model with a one-hundred-twenty-eight thousand token context window can see the plan. It can execute. But if it only has four billion parameters, it does not have the skill to finish the job. It will give generic responses, abandon tasks halfway, and follow the plan loosely.

graph TD
    subgraph Context Window Anatomy
        S[System Prompt<br/>50k-65k tokens] --> T[Tasks + Plan<br/>+ your prompt]
        T --> C[Total Context Needed]
    end
    32[32k Context] -->|Cutoff| Fail[Model sees half the blueprint]
    128[128k Context] -->|Enough| Model4[4B Model]
    128 -->|Enough| Model31[31B Model]
    Model4 -->|Fails| Generic[Generic responses, quits early]
    Model31 -->|Success| Capable[Capable execution]

Figure 5: Context window anatomy. A 32k context cuts off the system prompt. A 4B parameter model cannot execute even with full context.

The punch line: big context window does not equal capable model. You need both.
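The first half of that check is plain arithmetic, and it can be sketched as a small budget function. This is an illustration only: `contextBudget` is a hypothetical name, the system prompt figure uses the upper end of the fifty to sixty-five thousand token range quoted above, and the plan and reply sizes are assumed round numbers, not measurements.

```javascript
// Hypothetical budget check: does the whole request fit in the context window?
// All token counts are caller-supplied estimates.
function contextBudget({ windowTokens, systemPromptTokens, planTokens, replyReserve }) {
  const needed = systemPromptTokens + planTokens + replyReserve;
  return { needed, fits: needed <= windowTokens, headroom: windowTokens - needed };
}

// 65k system prompt (upper end of the quoted range); plan and reply sizes are assumptions.
const request = { systemPromptTokens: 65000, planTokens: 4000, replyReserve: 8000 };

const small = contextBudget({ windowTokens: 32000, ...request });
const large = contextBudget({ windowTokens: 128000, ...request });

console.log(small.fits, small.headroom); // false -45000
console.log(large.fits, large.headroom); // true 51000
```

A passing budget check only says the model can see the blueprint. It says nothing about whether the model has the parameters to build from it.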

Hardware Selection: A Practical Guide

If you are on a sixteen-gigabyte MacBook, do not bother running the 4B model locally. It will see the plan and still fail. It does not have enough parameters to run a full coding workflow.

If you have thirty-two gigabytes of RAM, you can run the 26B Gemma 4 model locally for some workloads. The 31B cloud version is still the safer bet.

If you have sixty-four gigabytes or more, you can run the 31B locally and it will work, but the cloud free tier is still cheaper and easier.

The realistic path for most readers is an open-weight model, with no Anthropic API key and no five-thousand-dollar rig in the closet. Ollama Cloud covers testing and light usage for free.

flowchart TD
    Start[What is your RAM?] -->|16 GB| Cloud31[Ollama Cloud<br/>Gemma 4 31B<br/>Free tier]
    Start -->|32 GB| Choice{Agent work?}
    Choice -->|Light| Local26[Local 26B]
    Choice -->|Heavy| Cloud31
    Start -->|64 GB| Local31[Local 31B<br/>or Cloud 31B]
    Start -->|128 GB| Aggressive[Run everything locally<br/>No cloud needed]

Figure 6: Hardware selection flowchart. Pick by your hardware, not by the hype.
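The flowchart can also be written as a small lookup function. This is a sketch of the decision tree above, nothing more: `pickRuntime` is a hypothetical name, and treating each RAM figure as a "this much or more" threshold is an assumption, since the flowchart only lists exact sizes.

```javascript
// Hypothetical encoding of the hardware flowchart.
// Assumption: each RAM size in the chart is treated as a minimum threshold.
function pickRuntime(ramGB, heavyAgentWork = false) {
  if (ramGB >= 128) return "local: run everything locally";
  if (ramGB >= 64) return "local 31B or cloud 31B";
  if (ramGB >= 32) return heavyAgentWork ? "cloud 31B" : "local 26B";
  return "cloud 31B"; // 16 GB and below: use the cloud tier
}

console.log(pickRuntime(16));       // cloud 31B
console.log(pickRuntime(32));       // local 26B
console.log(pickRuntime(32, true)); // cloud 31B
```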

Why Ollama Cloud matters

Ollama recently made their cloud tier free for low-use testing. That is not a gimmick. It is the actual answer for people without the right device.

# Sign in to Ollama Cloud
ollama signin

# Launch the 31B Gemma 4 model on cloud GPUs
ollama launch claude --model gemma4:31b --cloud

# From Claude Code, the contractor is now Gemma 4 31B
# running on Ollama's GPUs, not your laptop.

The workflow is identical to local. The latency is higher. The cost is zero for low usage. The capability is real.

The Four Stages in Detail

Let us expand the four stages with the exact artifacts and commands used in the case study.

Stage 1: Interview with the Architect

You open a new Claude Code session. You give the one-sentence prompt. You let Claude ask questions. You answer them. You do not rush this.

Time investment: twenty to twenty-five minutes.

Outputs used later: none yet. This is shared understanding.

Stage 2: Blueprint Generation

After the interview, Claude generates seven markdown files. You review them. You refine them. You make sure the state machine definitions are correct. You make sure the enemy spawn logic is described in detail.

Only when the blueprint is solid do you proceed.

Time investment: another fifteen to thirty minutes, depending on project size.

Outputs:

docs/product-brief.md
docs/user-stories.md
docs/architecture.md
docs/build-plan.md
docs/local-vs-cloud.md
docs/interview-notes.md
docs/tasks.md

Stage 3: Task Execution

You open Claude Code with the contractor model active. You open the project folder. You drop docs/build-plan.md in. You say:

Read docs/build-plan.md. Implement task one only. Stop when done.

Task one is small. Maybe it is: "Create index.html, style.css, and main.js. Set up an 800x600 canvas. Render a black background each frame."

The contractor writes three files. It stops.

You run the game. You check:

Does the canvas render?
Is the background black?
Is the console free of errors?

If all three answers are yes, you say: "Proceed to task two."

If no, you copy the error. You switch back to the architect. You say:

"Task one produced this error. Here is what the build plan intended. Revise the task specification."

Then you paste the revised task back to the contractor.

This continues for fifteen tasks.

Time investment: depends on task size. The case study took several hours of actual work, but it was a first-time implementation of a non-trivial game and needed no major refactors.

Outputs: working software, verified incrementally.

Stage 4: Escalation

When the contractor gets stuck, you escalate to the architect. You do not ask the contractor to debug itself. You ask the architect.

In the case study, escalation was rare because the blueprint was good. But it does happen. The architect should never be given the full project context in the escalating message; give it just the task, the intended outcome, and the observed failure. That keeps architect calls cheap.

graph TD
    subgraph Architect [Claude / Codex]
        I[Interview] --> BP[Blueprint]
    end
    subgraph Contractor [Gemma 4 31B]
        T1[Task] --> V{Verify}
    end
    BP --> T1
    V -->|Pass| T2[Next Task]
    V -->|Fail| E[Escalate]
    E -->|One task context| A[Architect Fix]
    A -->|Revised task| T1
    T2 --> V

Figure 7: The escalation path. The architect gets only the failing task and the expected outcome.
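Keeping the escalation message small is easier if you build it from a fixed template. The sketch below is one possible shape, assuming a hypothetical `escalationMessage` helper; the field names and the sample error text are made up for illustration and do not come from the case study.

```javascript
// Hypothetical helper: build a minimal escalation message for the architect.
// It carries only the task, the intended outcome, and the observed failure.
function escalationMessage({ taskNumber, taskSpec, expected, observed }) {
  return [
    `Task ${taskNumber} failed verification.`,
    "",
    "Task specification:",
    taskSpec,
    "",
    `Intended outcome: ${expected}`,
    `Observed failure: ${observed}`,
    "",
    "Revise the task specification. Do not redesign other tasks.",
  ].join("\n");
}

// Sample values are invented for the example.
const message = escalationMessage({
  taskNumber: 3,
  taskSpec: "Player fires projectile every 250ms.",
  expected: "continuous stream of bullets",
  observed: "no projectiles appear; console shows a TypeError",
});
console.log(message);
```

Nothing in the template has room for the full project context, which is what keeps architect calls cheap.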

A Real Case Study: Vampire Survivors Clone

To make this concrete, here is exactly what happened in the video.

Hardware: M2 Pro, sixteen gigabytes RAM.

First attempt: Gemma 4, four billion parameters, thirty-two thousand token context.

Prompt: full Claude Code system prompt plus the build plan.

Result: "I need more information." Every time.

Diagnosis: The 32k context window cut off the system prompt. The contractor literally could not see the instructions.

Second attempt: Gemma 4, four billion parameters, one hundred twenty-eight thousand token context.

Result: The model could read the plan. It produced generic responses, abandoned tasks halfway, and refused to follow the state machine specification.

Diagnosis: The model is too small. It does not have enough parameters to run a multi-step coding workflow.

Third attempt: Gemma 4, thirty-one billion parameters, one hundred twenty-eight thousand token context, running on Ollama Cloud.

Result: Task-by-task execution worked. The game built incrementally. Autofire worked. Enemy spawning matched the spec. Scoring persisted. Game over screen rendered.

Parallel comparison: The same 31B model, same machine, different prompt.

Vague prompt: "Make me a Vampire Survivors clone in HTML and JavaScript with enemies and a player." Result: Hallucinated file structure. Missing autofire. Made-up enemy behaviors. Half the features missing. Game barely runs. Thrown out.
Plan prompt: Full build plan, task-by-task. Result: Working game. All three enemy types spawning at correct intervals. Autofire working. Scoring persisting. Game over screen. Exactly like the specs.

Same model. Same hardware. Different plan. Different universe.

graph LR
    subgraph Same Hardware & Model
        V[Vague Prompt] --> G1[Garbage]
        P[Plan Prompt] --> G2[Working Game]
    end
    style V fill:#ffcccc
    style P fill:#ccffcc
    style G1 fill:#ffcccc
    style G2 fill:#ccffcc

Figure 8: The comparison. Same model, same machine, different prompt quality.

Code Example: The Honest Contract

The pattern is so reliable that you can script it. Here is a minimal bash loop for task-by-task execution.

#!/bin/bash
# honest-contract.sh - Task-by-task contractor dispatch

PROJECT_DIR="./vampire-survivors"
BUILD_PLAN="docs/build-plan.md"   # relative to PROJECT_DIR, because we cd into it
CONTRACTOR="gemma4:31b"
TOTAL_TASKS=15

cd "$PROJECT_DIR" || exit 1

for i in $(seq 1 "$TOTAL_TASKS"); do
    echo "===== Task $i ====="

    # 1. Hand the contractor exactly one task.
    # Assumption: non-interactive print mode (-p) with --model; adjust the
    # invocation and permission settings to match your Claude Code setup.
    claude --model "$CONTRACTOR" \
        -p "Read $BUILD_PLAN. Implement task $i only. Stop when done."

    # 2. Wait for human verification
    echo "Task $i complete. Verify before continuing."
    echo "Press Enter to accept task $i (Ctrl+C to stop and escalate)..."
    read -r

    # 3. Optional: git checkpoint after each verified task
    git add .
    # Commit only if the task actually changed something
    git diff --cached --quiet || git commit -m "Task $i: verified via honest-contract"
done

This script does one thing: it prevents the autonomous loop. The human must press Enter after each task. You cannot accidentally walk away.

The verification checklist

What do you verify between tasks?

The files created match the blueprint.
The code compiles and runs without errors.
The behavior matches the task specification exactly.
No extra features were invented.
No planned features were omitted.

If any of these fail, you do not proceed. You escalate to the architect.
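Parts of this checklist can be made mechanical. As a rough sketch, the first and fourth items (files match the blueprint, nothing extra invented) reduce to comparing two lists. The names `compareFiles` and `verdict` are hypothetical, and the file lists are passed in by hand here rather than read from disk.

```javascript
// Hypothetical check: compare the files a task should create with the files that exist.
function compareFiles(expectedFiles, actualFiles) {
  const expected = new Set(expectedFiles);
  const actual = new Set(actualFiles);
  return {
    missing: expectedFiles.filter((f) => !actual.has(f)),       // planned but absent
    unexpected: actualFiles.filter((f) => !expected.has(f)),    // invented by the contractor
  };
}

// Any mismatch means the task does not pass; it goes back to the architect.
function verdict(report) {
  return report.missing.length === 0 && report.unexpected.length === 0
    ? "pass"
    : "escalate";
}

const report = compareFiles(
  ["index.html", "style.css", "main.js"],
  ["index.html", "main.js", "enemies.js"]
);
console.log(report.missing, report.unexpected, verdict(report));
```

Behavior still has to be checked by running the application; a file comparison only catches structural drift.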

Mermaid Diagrams in Practice

The diagrams in this post use Mermaid syntax (version 11.7.0 or later). Here is a state machine diagram for the game itself, taken directly from the blueprint produced in the case study.

stateDiagram-v2
    [*] --> Menu
    Menu --> Playing: Start Game
    Playing --> Paused: Escape
    Paused --> Playing: Resume
    Playing --> GameOver: Player HP <= 0
    GameOver --> Menu: Return to Menu
    Playing --> Upgrade: Level Up
    Upgrade --> Playing: Select Upgrade

Figure 9: Gameplay state machine. Defined in architecture plan; implemented task by task.

The architect defines the states. The contractor implements the transitions. The GC verifies each one. If the architect skips this diagram, the contractor will invent states. You will end up with a pause menu that does not actually pause the game loop.
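One way a contractor might implement this diagram is a plain transition table. The sketch below is an assumption about the shape of the code, not the case study's actual implementation: the event names (`startGame`, `escape`, and so on) are hypothetical labels for the diagram's edges, and `shouldUpdateWorld` is an illustrative guard for the game loop.

```javascript
// Transition table mirroring the state diagram above.
// Event names are hypothetical labels for the diagram's edge text.
const transitions = {
  Menu: { startGame: "Playing" },
  Playing: { escape: "Paused", playerDead: "GameOver", levelUp: "Upgrade" },
  Paused: { resume: "Playing" },
  Upgrade: { selectUpgrade: "Playing" },
  GameOver: { returnToMenu: "Menu" },
};

// Return the next state, or throw if the blueprint defines no such transition.
function nextState(state, event) {
  const next = transitions[state]?.[event];
  if (!next) throw new Error(`No transition from ${state} on ${event}`);
  return next;
}

// The game loop should advance the simulation only while Playing,
// so a pause actually pauses.
function shouldUpdateWorld(state) {
  return state === "Playing";
}

let state = "Menu";
state = nextState(state, "startGame"); // Playing
state = nextState(state, "escape");    // Paused
console.log(state, shouldUpdateWorld(state)); // Paused false
```

Because undefined transitions throw, an invented state or edge shows up during verification instead of hiding in the game loop.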

Why Most Autonomous Agent Setups Fail

If you watch the AI coding space, almost every "autonomous agent" setup follows the same pattern:

Give the agent a vague goal.
Let it run for a long time.
Hope it finishes.

This fails for two reasons:

First, the goal is vague. "Build a to-do app" is not a spec. It is a category. The agent will make architectural decisions that are convenient for the model, not optimal for the user.

Second, the run is too long. Even with a perfect plan, even with a 31B parameter model, autonomous loops collapse. The model loses track of where it is in the plan. It repeats work. It makes decisions that are locally optimal but globally wrong.

The architect-blueprint-contractor loop fixes both problems. The plan is specific. The contractor runs for one task, then stops. The GC keeps the global state.

graph TD
    Vague[Vague Goal] --> A1[Agent thinks for itself]
    A1 --> H1[Hallucinates architecture]
    H1 --> F1[Repeat work, drift, fail]
    Plan[Specific Plan] --> C1[Task 1]
    C1 --> V1[Verify]
    V1 --> C2[Task 2]
    C2 --> V2[Verify]
    V2 --> C3[...Continue]

Figure 10: Autonomous vague goal vs disciplined task loop.

The Token Math

A common objection is that the architect stage wastes tokens. You pay Claude for twenty-five minutes of interview and seven markdown files. Is that not expensive?

It depends on what you compare it to.

If you compare it to one vague prompt that produces garbage, yes, it looks expensive.

If you compare it to the cost of re-running the generation five times because the model misread your vague instructions, or re-architecting after task five because the file structure is wrong, the architect stage is cheaper.

The plan is the biggest token saver in the workflow. Every unanswered question in the interview becomes a correction later. Corrections cost more tokens than questions.

In the case study:

Architect stage: twenty to twenty-five minutes of interview, seven markdown files.
Contractor stage: fifteen tasks, task-by-task, with verification.
Total number of full re-architectures: zero.
Total number of major corrections: zero.

That is the point. The workflow does not eliminate work. It eliminates rework.

Checklist: Running the Workflow on Any Project

Here is a practical checklist you can follow for your next AI-assisted build.

Before you start

Install the contractor runtime. If using Ollama Cloud: ollama signin && ollama launch claude --model gemma4:31b --cloud.
Open a fresh Claude Code session for the architect.
Open a terminal or editor for the project workspace.

Stage 1: Interview

Write one sentence describing what you want.
Answer every question the architect asks.
Do not add features later. If you think of new features, add them to a "future" list, not the current build.
Review the seven markdown files before proceeding.

Stage 2: Blueprint acceptance

Confirm the state machine is defined.
Confirm enemy types, scoring, and persistence are specified.
Confirm the file structure matches your preferences.
Request fifteen or fewer tasks, each one small.

Stage 3: Task loop

Open Claude Code with the contractor model active.
Drop the build plan into the project.
Implement task one only. Stop and wait.
Run the application. Verify behavior.
If verification passes, proceed. If not, escalate to architect.
Repeat until all tasks are complete.

Wrap-up: Polish

Review the finished project against the original brief.
Note missing features in the "future" list.
Commit the working build.
Share or ship.

This checklist is intentionally boring. That is the point. Boring workflows produce reliable software. Exciting workflows produce exciting bugs.

Applying the Workflow Beyond Games

The case study uses a top-down survival game, but the pattern applies to any multi-file software project.

Web applications

For a React dashboard, the architect produces:

Data model and API contract.
Component tree and state management approach.
Authentication flow and route structure.
Build tasks, one per component or feature.

The contractor implements each component. You verify routing and state.

CLI tools

For a Rust CLI, the architect produces:

Argument parsing specification.
Subcommand hierarchy.
Error handling strategy.
Build tasks, one per subcommand or module.

The contractor implements each module. You verify the binary runs and the help text is correct.

Infrastructure as code

For a Terraform module, the architect produces:

Resource graph and dependency order.
Variable and output contract.
Environment separation strategy.
Build tasks, one per resource group.

The contractor writes each module. You verify terraform plan shows no surprises.

The loop does not change. The inputs change. The verification criteria change. The discipline stays the same.

Case Study Deep Dive: The Exact Artifacts

To make this actionable, here are excerpts from the artifacts the architect produced after the interview in the case study, lightly edited for clarity.

Product Brief (excerpt)

Build a Vampire Survivors-style top-down shooter. Player moves with WASD. Autofire enabled by default. Enemies spawn from edges and move toward player. Three enemy types with increasing difficulty. Score persists in localStorage. Game over when player reaches zero health. Upgrade choices on level up. No external dependencies. Vanilla HTML, CSS, JavaScript only.

User Stories (excerpt)

As a player, I want to move my character with WASD keys so I can dodge enemies.
As a player, I want autofire so I can focus on movement.
As a player, I want score tracking so I can beat my own record.
As a player, I want upgrade choices so my build evolves over time.

Build Plan (excerpt)

## Task 1: Canvas Setup

- Create index.html, style.css, main.js
- Set canvas to 800x600, centered on page
- Render black background each frame
- Expected output: black screen, no errors

## Task 2: Player Movement

- Add player square at canvas center
- WASD movement at 200 pixels per second
- Player stays within canvas bounds
- Expected output: square moves with keys

## Task 3: Autofire

- Player fires projectile every 250ms
- Projectile moves straight right
- Remove projectile when off-screen
- Expected output: continuous stream of bullets

The pattern for each task is identical: one task, one expected outcome, one verification step.
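To show how little code a right-sized task asks for, here is a rough sketch of the logic behind Task 2 and Task 3 as pure functions. It is not the contractor's output from the case study. The function names, the `size` field for the player square, and the timer-carrying approach are all assumptions, and diagonal movement is left unnormalized to keep the example short.

```javascript
const CANVAS = { width: 800, height: 600 }; // from Task 1
const PLAYER_SPEED = 200;                   // pixels per second, from Task 2
const FIRE_INTERVAL = 250;                  // milliseconds, from Task 3

// Move the player by WASD input and clamp to the canvas.
// x and y are the top-left corner; `size` is an assumed square side length.
function movePlayer(player, keys, dtSeconds) {
  const dx = (keys.d ? 1 : 0) - (keys.a ? 1 : 0);
  const dy = (keys.s ? 1 : 0) - (keys.w ? 1 : 0);
  const x = player.x + dx * PLAYER_SPEED * dtSeconds;
  const y = player.y + dy * PLAYER_SPEED * dtSeconds;
  return {
    ...player,
    x: Math.min(Math.max(x, 0), CANVAS.width - player.size),
    y: Math.min(Math.max(y, 0), CANVAS.height - player.size),
  };
}

// Work out how many shots are due after elapsedMs, carrying the remainder forward.
function shotsDue(timerMs, elapsedMs) {
  const total = timerMs + elapsedMs;
  return { shots: Math.floor(total / FIRE_INTERVAL), timerMs: total % FIRE_INTERVAL };
}

const moved = movePlayer({ x: 790, y: 300, size: 20 }, { d: true }, 0.5);
console.log(moved.x);            // 780 (clamped at the right edge)
console.log(shotsDue(100, 400)); // { shots: 2, timerMs: 0 }
```

Each function maps to one line of the task's expected output, which makes the verification step a matter of watching one behavior.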

Task List structure

Each task in the tasks.md file contains:

Task number and title.
Inputs: what files or context the contractor should read.
Implementation steps: numbered, concrete, and small.
Expected output: what a passing verification looks like.
Verification criteria: what you check when the contractor stops.

If the expected output is "a playable game with three enemy types," the task is too big. If the expected output is "a red square renders at coordinates 100,100," the task is right-sized.
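A quick structural check can catch incomplete or oversized tasks before they reach the contractor. The sketch below assumes a hypothetical task object whose property names mirror the five items above; the five-step limit is an arbitrary heuristic chosen for the example, not a rule from the case study.

```javascript
// Property names are hypothetical; they mirror the task structure listed above.
const REQUIRED_FIELDS = ["number", "title", "inputs", "steps", "expectedOutput", "verification"];

// Return a list of problems; an empty list means the task looks well-formed.
function validateTask(task, maxSteps = 5) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = task[field];
    const empty =
      value === undefined || value === "" || (Array.isArray(value) && value.length === 0);
    if (empty) problems.push(`missing ${field}`);
  }
  // Assumed heuristic: many steps usually means the task should be split.
  if (Array.isArray(task.steps) && task.steps.length > maxSteps) {
    problems.push(`too many steps (${task.steps.length} > ${maxSteps}); consider splitting`);
  }
  return problems;
}

const task = {
  number: 1,
  title: "Canvas Setup",
  inputs: ["docs/build-plan.md"],
  steps: ["Create index.html, style.css, main.js", "Set canvas to 800x600"],
  expectedOutput: "black screen, no errors",
  verification: "",
};
console.log(validateTask(task)); // [ 'missing verification' ]
```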

The Honest Contract Pattern

The term "honest contract" describes the relationship between you, the architect, and the contractor. The contract has three clauses:

Clause 1: The architect does not lie.

If the architect does not know something, it says so. If the plan has a gap, it flags it. A good architect produces a blueprint with unknowns called out, not invented answers.

Clause 2: The contractor does nothing beyond the task.

The contractor reads the task. Implements it. Stops. It does not refactor tasks N through N+5 "while it is there." It does not add features it thinks you might want. It does not optimize code that has not been measured.

Clause 3: The GC verifies before proceeding.

No verification, no next task. If you skip verification to save time, you are not running this workflow. You are running the "hope it works" workflow, and that is the workflow that produces broken software.

graph LR
    subgraph Honest Contract [The Three Clauses]
        A[Architect<br/>No lies] --> B[Contractor<br/>No scope creep]
        B --> C[GC<br/>No skipping verification]
        C --> A
    end

Figure 11: The honest contract. Each role respects its boundary.

Failure Modes and How to Recover

Even with a good plan, things go wrong. Here are the most common failure modes and the recovery procedure.

The contractor quits mid-task

Symptoms: Model output ends abruptly. No error message. No status indicator. Task is incomplete.

Recovery:

Do not retry the same prompt.
Check if the task is too large. Split it in half.
If splitting does not help, escalate the subtask to the architect. Ask for a smaller subtask definition.
Resume with the smaller subtask.

The contractor produces working code with the wrong behavior

Symptoms: Code runs. Game launches. But the enemy spawns at the player's position instead of the edges.

Recovery:

Do not patch the code yourself yet.
Escalate to the architect with the observation.
Ask the architect to revise the task specification.
Hand the revised task to the contractor.

The architect produces an inconsistent blueprint

Symptoms: Task 3 references a module that Task 1 does not create. The state machine in the architecture plan conflicts with the build plan.

Recovery:

Pause the workflow.
Review the conflicting files.
Present the conflict to the architect.
Ask for a revised, consistent blueprint.
Do not proceed until the blueprint is internally consistent.

The blueprint is correct but the verification reveals a design flaw

Symptoms: Task runs perfectly. But when Task 7 builds on it, the integration breaks.

Recovery:

Accept the completed task.
Escalate the integration issue to the architect.
Let the architect revise the plan for Tasks 7 onward.
Continue from the revised task.

The key recovery principle is: fix the plan, then continue execution. Do not try to fix broken software without fixing the plan first. The broken software is a symptom of a broken plan or a broken verification step, not a model problem.

Measuring the Workflow

If you want data on whether this workflow actually saves tokens and time, instrument it.

Track:

Architect tokens spent: prompt + response in Stage 1 and Stage 2.
Contractor tokens spent: prompt + response per task in Stage 3.
Number of escalations.
Number of re-architectures.
Number of tasks completed without correction.
Total time from start to working software.
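A per-task log is enough to compute most of these numbers. The following is a minimal sketch with a hypothetical `summarize` function and log shape; the sample figures are invented for illustration and are not measurements from the case study.

```javascript
// Hypothetical log shape: one entry per task with token count and outcome flags.
function summarize(taskLog, architectTokens) {
  const contractorTokens = taskLog.reduce((sum, t) => sum + t.tokens, 0);
  const escalations = taskLog.filter((t) => t.escalated).length;
  const clean = taskLog.filter((t) => !t.escalated && !t.corrected).length;
  return {
    architectTokens,
    contractorTokens,
    escalations,
    // Share of tasks completed without escalation or correction
    cleanRate: taskLog.length === 0 ? 0 : clean / taskLog.length,
  };
}

// Invented sample data, for illustration only.
const log = [
  { task: 1, tokens: 3000, escalated: false, corrected: true },
  { task: 2, tokens: 2500, escalated: false, corrected: false },
  { task: 3, tokens: 4000, escalated: true, corrected: false },
  { task: 4, tokens: 2500, escalated: false, corrected: false },
];
console.log(summarize(log, 40000));
// { architectTokens: 40000, contractorTokens: 12000, escalations: 1, cleanRate: 0.5 }
```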

In the case study, the metrics were:

Architect stage: twenty to twenty-five minutes of interview, seven files, no re-architectures.
Contractor stage: fifteen tasks, fifteen successful implementations, zero escalations.
Correction rate: effectively zero after Task 1.
Total time: several hours for a non-trivial game, first-time implementation.

Compare that to the vague-prompt workflow, which produced garbage in twenty minutes and required a full restart.

The numbers do not lie. The plan is the biggest lever.

Advanced Topic: Long-Horizon Projects

For projects with more than twenty tasks, the workflow needs one addition: milestones.

A milestone is a checkpoint where you rebuild the shared mental model. At a milestone:

The architect reviews all completed tasks.
The architect updates the blueprint for any new requirements.
The contractor starts fresh on the next milestone.

This prevents the drift that occurs when a fifteen-task plan grows into a fifty-task plan without updated architecture. Without milestones, Task 30 might conflict with Task 12 because requirements changed and nobody updated the plan.

A practical rule: add a milestone every five to ten tasks. If the project is user-facing, add a milestone after every feature that completes a user story.

graph TD
    subgraph M1 ["Milestone 1: Tasks 1-5"]
        M1T1[Task 1] --> M1V1{Verify}
        M1V1 --> M1T2[Task 2]
        M1T2 --> M1V2{Verify}
        M1V2 --> M1T3[Task 3]
        M1T3 --> MS1[Milestone Review]
    end
    subgraph M2 ["Milestone 2: Tasks 6-10"]
        M2T1[Task 6] --> M2V1{Verify}
        M2V1 --> M2T2[Task 7]
        M2T2 --> M2V2{Verify}
        M2V2 --> M2T3[Task 8]
        M2T3 --> MS2[Milestone Review]
    end
    subgraph M3 ["Milestone 3: Tasks 11-15"]
        M3T1[Task 11] --> M3V1{Verify}
        M3V1 --> M3T2[Task 12]
        M3T2 --> M3V2{Verify}
        M3V2 --> M3T3[Task 13]
        M3T3 --> MS3[Milestone Review]
    end
    MS1 -->|Update Plan| M2T1
    MS2 -->|Update Plan| M3T1
    MS3 -->|Final Test| Complete[Build Complete]

Figure 12: Milestone structure for long-horizon projects. Rebuild shared mental model every five to ten tasks.
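Grouping a task list into milestones is simple chunking. A minimal sketch, assuming a hypothetical `planMilestones` helper and a fixed milestone size of five (the low end of the five-to-ten rule):

```javascript
// Hypothetical helper: group task numbers into fixed-size milestones.
// A milestone review happens after the last task in each group.
function planMilestones(taskNumbers, size = 5) {
  const milestones = [];
  for (let i = 0; i < taskNumbers.length; i += size) {
    milestones.push({
      milestone: milestones.length + 1,
      tasks: taskNumbers.slice(i, i + size),
    });
  }
  return milestones;
}

const taskNumbers = Array.from({ length: 15 }, (_, i) => i + 1);
for (const m of planMilestones(taskNumbers)) {
  // Three milestones: tasks 1-5, 6-10, 11-15
  console.log(`Milestone ${m.milestone}: tasks ${m.tasks[0]}-${m.tasks[m.tasks.length - 1]}`);
}
```

For user-facing projects, fixed-size chunks are only a starting point; milestone boundaries should follow completed user stories.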

The Ralph Wiggum Loop and Iterative Planning

One advanced variation worth mentioning is the Ralph Wiggum Loop. In this pattern, after completing a milestone, you ask the architect:

"What did we learn in the last five tasks that changes the plan for the next five?"

This keeps the plan alive. It prevents the blueprint from becoming stale. It also surfaces architectural decisions you did not anticipate until you saw working code.

In practice, the Ralph Wiggum Loop adds five minutes per milestone and prevents the drift that ruins long-horizon projects. If you are building anything larger than a prototype, use it.
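A small helper can fill in the task ranges so you ask the same question at every milestone. This is a sketch under assumptions: `retro_prompt` is a hypothetical name, and the default milestone size of five mirrors the quoted question.

```shell
# retro_prompt LAST_TASK [SIZE]: print the retrospective question for the
# milestone that ended at LAST_TASK. Runs in a subshell so "size" stays local.
retro_prompt() (
    size=${2:-5}
    printf 'What did we learn in tasks %d-%d that changes the plan for tasks %d-%d?\n' \
        "$(( $1 - size + 1 ))" "$1" "$(( $1 + 1 ))" "$(( $1 + size ))"
)

retro_prompt 10
# prints: What did we learn in tasks 6-10 that changes the plan for tasks 11-15?
```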

When to Escalate vs When to Fix It Yourself

Not every problem needs the architect. Here is a decision rule:

Escalate to the architect when:

The task behavior does not match the specification.
The contractor produced a file structure that is not in the blueprint.
The task required an architectural decision not covered in the plan.
The contractor quit without explanation.
You find yourself about to write a long corrective prompt that changes the architecture.

Fix it yourself when:

A typo or syntax error needs a one-line change.
An import is missing or a path needs adjusting.
A visual tweak does not change the logic.
A console warning does not affect behavior.

The rule is simple: if you are changing the spec, escalate. If you are correcting a typo, fix it.

If you are not sure which bucket you are in, escalate. The architect handles both types. The cost difference is small.
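The rule can be written down as a lookup, which is handy if you keep a log of problems as you go. The category labels below are hypothetical shorthand for the bullets above, not part of any tool.

```shell
# triage CATEGORY: print "fix" or "escalate" for a problem category.
triage() {
    case "$1" in
        typo|syntax|import|path|visual|warning)
            echo "fix" ;;
        behavior|structure|architecture|quit|spec)
            echo "escalate" ;;
        *)
            # Not sure which bucket this is: the rule above says escalate.
            echo "escalate" ;;
    esac
}

triage import     # prints: fix
triage behavior   # prints: escalate
```

The default branch encodes the tie-breaker: anything you cannot classify goes to the architect.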

Conclusion

Your Claude limits are not a model problem. They are a planning problem.

The workflow is not new. It is just honest about how software gets built. You interview the architect. You get the blueprint. You hand the contractor one task. You verify. You repeat.

The model size matters. The context window matters. But neither matters as much as the plan, and neither matters at all if you are not in the driver's seat.

Build the plan. Run the loop. Verify every task.

The same model with a better plan will outperform a bigger model with a vague prompt every single time.

The Architect, Blueprint, and Contractor Model

Think of software construction the same way you think of physical construction.

In the AI coding world:

The architect is the frontier model. Claude. Codex. GPT-4 class.
The trades are the execution model. Gemma. Llama. Qwen.
You are still the GC. The job site does not run itself.

This is not an analogy. It is the actual control flow.

graph TD
    A[User + Architect Model<br/>Claude / Codex] -->|Interview & Design| B[Blueprint<br/>Architecture, Build Plan, Tasks]
    B -->|Task 1| C[Contractor Model<br/>Gemma 4 31B]
    C -->|Implement & Stop| D{GC Verification}
    D -->|Pass| B
    D -->|Fail| E[Escalate to Architect]
    E -->|Revised Guidance| B
    D -->|Escalate Stuck Issue| E
    C -->|Task N| D

Figure 1: The four-stage architect-blueprint-contractor workflow.

The architect never swings the hammer. The contractor never redesigns the house. You never stop verifying. This loop is boring, tedious, and almost always produces working code.

Why Your Limits Are Planning Limits

When you hit a Claude limit, the immediate reaction is to "optimize" the model. Use a smaller model. Use a local model. Use a cheaper API. Use fewer tokens per message.

All of these are worker-side optimizations.

Stage 1: The Interview

This is the stage everyone skips, and it is also the stage that saves the most tokens later.

You sit down with the architect. You do not tell it what to build. You let it ask you questions until both of you know exactly what you are building.

In practice, this looks like giving the architect one sentence and letting it interview you for twenty to twenty-five minutes.

One-sentence prompt:

"I want to build a Vampire Survivors-style top-down shooter in vanilla HTML and JavaScript."

Architect response: The architect comes back with twenty-plus questions:

What movement scheme? WASD? Arrow keys? Both?
How many enemy types?
How does difficulty scale over time?
Is there autofire?
What is the scoring model?
Does progress persist between sessions?
What state machine governs gameplay states?
What polish features matter most?

Each answer locks one piece of the build. By the end of the interview, you have a shared mental model with the architect. There are no surprises.

sequenceDiagram
    participant You
    participant Architect
    You->>Architect: One-sentence project idea
    loop Interview
        Architect->>You: Question about player movement
        You->>Architect: Answer
        Architect->>You: Question about enemy variety
        You->>Architect: Answer
    end
    Architect->>You: Seven markdown files: brief, stories, architecture, build plan, local/cloud split, interview notes, task list

Figure 2: The interview stage. The architect extracts requirements; you do not skip it.

Stage 2: The Blueprint

The architect produces seven markdown files in the case study described:

Product Brief — what we are building and why.
User Stories — who uses what and how.
Architecture Plan — file structure, state machines, module boundaries.
Build Plan — a list of tasks, each small enough to finish in one turn.
Local versus Cloud Split — which model handles which layer.
Interview Notes — the Q&A from Stage 1.
Task List — fifteen concrete implementation tasks.

This is the blueprint. It is detailed. It is boring. It is correct.

The contractor does not get all of these files. The contractor gets only the build plan, divided into fifteen tasks. Each task is small enough that a model can finish it in one response.

If you hand the contractor one task and say "stop when done," it implements that task, and it stops.

graph LR
    B[Blueprint Outputs] --> P[Product Brief]
    B --> U[User Stories]
    B --> A[Architecture Plan]
    B --> BP[Build Plan]
    B --> LC[Local vs Cloud Split]
    B --> IN[Interview Notes]
    B --> T[Task List - 15 items]
    T --> T1[Task 1: Setup]
    T --> T2[Task 2: Player Movement]
    T --> T3[Task 3: Enemy Spawner]
    T --> TN[...]

Figure 3: The blueprint outputs. The contractor sees only the build plan with task boundaries.

Stage 3: Execution

You hand the contractor the build plan.

In the case study, the contractor was Gemma 4, 31 billion parameters, running on Ollama Cloud. The setup required two commands:

ollama signin
ollama launch claude --model gemma4:31b --cloud

From there, Claude Code talks to the contractor through the Ollama bridge. The workflow is:

Open the project in Claude Code with the contractor model active.
Drop the build plan markdown into the project.
Say: "Read the build plan. Implement task one only. Stop and wait."
The contractor writes the code. It stops.
You verify.
If it looks right, you say: "Proceed to task two."

This is not autonomous. This is not "let the agent run for three hours." This is task by task.

flowchart LR
    Start[Load Build Plan] --> T1[Task 1]
    T1 --> V1{Verify}
    V1 -->|Pass| T2[Task 2]
    V1 -->|Fail| E1[Escalate to Architect]
    E1 -->|Revised task| T1
    T2 --> V2{Verify}
    V2 -->|Pass| T3[Task 3]
    V2 -->|Fail| E2[Escalate to Architect]
    E2 -->|Revised task| T2
    T3 --> Dots[...]
    Dots --> End[Build Complete]

Figure 4: The task-by-task loop. Human verification between every task.

Why the autonomous loop fails

You might be tempted to hand the build plan to the model and say: "Verify each task. Run all the tests. Come back when you are finished."

The lesson is clear: there is no autonomous job size for local models yet. The contractor needs the GC there to hand it the next task and verify the last one.

This is not a failure of the model. It is a limitation of the control loop. Treat it like a real job site. The trades do not call the architects. The GC calls the trades.

Context Window Reality Check

One of the most misunderstood concepts in local model deployment is the relationship between context window size and model capability.

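The Claude Code system prompt alone runs roughly fifty to sixty-five thousand tokens, and the plan, the tasks, and your own prompt all sit on top of it. This means: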

A model with a thirty-two thousand token context window cannot even see the full system prompt. The model is guessing. The cue "I need more information" is the model telling you it received a blueprint that was cut in half.
A model with a one-hundred-twenty-eight thousand token context window can see the plan. It can execute. But if it only has four billion parameters, it does not have the skill to finish the job. It will give generic responses, abandon tasks halfway, and follow the plan loosely.

graph TD
    subgraph Context Window Anatomy
        S[System Prompt<br/>50k-65k tokens] --> T[Tasks + Plan<br/>+ your prompt]
        T --> C[Total Context Needed]
    end
    32[32k Context] -->|Cutoff| Fail[Model sees half the blueprint]
    128[128k Context] -->|Enough| Model4[4B Model]
    128 -->|Enough| Model31[31B Model]
    Model4 -->|Fails| Generic[Generic responses, quits early]
    Model31 -->|Success| Capable[Capable execution]

Figure 5: Context window anatomy. A 32k context cuts off the system prompt. A 4B parameter model cannot execute even with full context.

The punch line: big context window does not equal capable model. You need both.
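The context half of that claim is plain arithmetic, sketched below. The system prompt figures come from the range quoted above; the 8,000-token allowance for the plan and your prompt is an assumed number for illustration, and `fits_context` is a hypothetical helper.

```shell
# fits_context WINDOW SYSTEM_PROMPT PLAN_AND_PROMPT (all in tokens).
# Runs in a subshell so "needed" stays local.
fits_context() (
    needed=$(( $2 + $3 ))
    if [ "$needed" -le "$1" ]; then
        echo "fits: $needed of $1 tokens used"
    else
        echo "truncated: needs $needed tokens, window is $1"
    fi
)

fits_context 32000 50000 8000
# prints: truncated: needs 58000 tokens, window is 32000
fits_context 128000 65000 8000
# prints: fits: 73000 of 128000 tokens used
```

This only tells you whether the model can see the plan. It says nothing about whether the model has the parameters to execute it.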

Hardware Selection: A Practical Guide

If you are on a sixteen-gigabyte MacBook, do not bother running the 4B model locally. It will see the plan and still fail. It does not have enough parameters to run a full coding workflow.

If you have thirty-two gigabytes of RAM, you can run the 26B Gemma 4 model locally for some workloads. The 31B cloud version is still the safer bet.

If you have sixty-four gigabytes or more, you can run the 31B locally and it will work, but the cloud free tier is still cheaper and easier.

The realistic path for most viewers is an open-weight model, no Anthropic API key, no five-thousand-dollar rig in the closet. Ollama Cloud covers the testing and light usage for free.

flowchart TD
    Start[What is your RAM?] -->|16 GB| Cloud31[Ollama Cloud<br/>Gemma 4 31B<br/>Free tier]
    Start -->|32 GB| Choice{Agent work?}
    Choice -->|Light| Local26[Local 26B]
    Choice -->|Heavy| Cloud31
    Start -->|64 GB| Local31[Local 31B<br/>or Cloud 31B]
    Start -->|128 GB| Aggressive[Run everything locally<br/>No cloud needed]

Figure 6: Hardware selection flowchart. Pick by your hardware, not by the hype.
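The flowchart reduces to a few comparisons. The sketch below mirrors its branches; `pick_runtime` and its output strings are hypothetical, and treating an unspecified workload as heavy is an assumption that errs toward the cloud option.

```shell
# pick_runtime RAM_GB [light|heavy]: print the suggested setup.
# Runs in a subshell so "workload" stays local.
pick_runtime() (
    workload=${2:-heavy}
    if [ "$1" -ge 128 ]; then
        echo "local: run everything locally"
    elif [ "$1" -ge 64 ]; then
        echo "local 31B or cloud 31B"
    elif [ "$1" -ge 32 ] && [ "$workload" = "light" ]; then
        echo "local 26B"
    else
        # Under 32 GB, or heavy agent work on 32 GB.
        echo "cloud 31B"
    fi
)

pick_runtime 16         # prints: cloud 31B
pick_runtime 32 light   # prints: local 26B
```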

Why Ollama Cloud matters

Ollama recently made their cloud tier free for low-use testing. That is not a gimmick. It is the actual answer for people without the right device.

# Sign in to Ollama Cloud
ollama signin

# Launch the 31B Gemma 4 model on cloud GPUs
ollama launch claude --model gemma4:31b --cloud

# From Claude Code, the contractor is now Gemma 4 31B
# running on Ollama's GPUs, not your laptop.

The workflow is identical to local. The latency is higher. The cost is zero for low usage. The capability is real.

The Four Stages in Detail

Let us expand the four stages with the exact artifacts and commands used in the case study.

Stage 1: Interview with the Architect

You open a new Claude Code session. You give the one-sentence prompt. You let Claude ask questions. You answer them. You do not rush this.

Time investment: twenty to twenty-five minutes.

Outputs used later: none yet. This is shared understanding.

Stage 2: Blueprint Generation

Only when the blueprint is solid do you proceed.

Time investment: another fifteen to thirty minutes, depending on project size.

Outputs:

docs/product-brief.md
docs/user-stories.md
docs/architecture.md
docs/build-plan.md
docs/local-vs-cloud.md
docs/interview-notes.md
docs/tasks.md

Stage 3: Task Execution

You open Claude Code with the contractor model active. You open the project folder. You drop the docs folder in. You say:

Read docs/build-plan.md. Implement task one only. Stop when done.

Task one is small. Maybe it is: "Create index.html, style.css, and main.js. Set up the canvas. Render a black background each frame."

The contractor writes three files. It stops.

You run the game. You check:

Does the canvas render?
Is the background black?
Is the console free of errors?

If all three pass, you say: "Proceed to task two."

If no, you copy the error. You switch back to the architect. You say:

"Task one produced this error. Here is what the build plan intended. Revise the task specification."

Then you paste the revised task back to the contractor.

This continues for fifteen tasks.

Time investment: depends on task size. The case study took several hours of actual work, but it was first-time implementation of a non-trivial game with no major refactors.

Outputs: working software, verified incrementally.

Stage 4: Escalation

When the contractor gets stuck, you escalate to the architect. You do not ask the contractor to debug itself. You ask the architect.

graph TD
    subgraph Architect [Claude / Codex]
        I[Interview] --> BP[Blueprint]
    end
    subgraph Contractor [Gemma 4 31B]
        T1[Task] --> V{Verify}
    end
    BP --> T1
    V -->|Pass| T2[Next Task]
    V -->|Fail| E[Escalate]
    E -->|One task context| A[Architect Fix]
    A -->|Revised task| T1
    T2 --> V

Figure 7: The escalation path. The architect gets only the failing task and the expected outcome.
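Keeping the escalation to one task of context is easier if you can cut that task out of the build plan mechanically. The sketch below assumes the plan uses `## Task N: Title` headings, as in the case study's build plan; `task_section` and `escalation_prompt` are hypothetical helpers.

```shell
# task_section FILE N: print the "## Task N:" section of a build plan,
# stopping at the next "## Task" heading.
task_section() {
    awk -v n="$2" '
        /^## Task [0-9]+:/ { printing = ($3 == (n ":")) }
        printing
    ' "$1"
}

# escalation_prompt FILE N ERROR_TEXT: assemble the message for the architect.
escalation_prompt() {
    printf 'Task %s produced this error:\n%s\n\n' "$2" "$3"
    printf 'Here is what the build plan intended:\n%s\n\n' "$(task_section "$1" "$2")"
    printf 'Revise the task specification.\n'
}
```

Calling `escalation_prompt docs/build-plan.md 3 "$(cat error.txt)"` would print a message containing only Task 3 and the error, which you then paste into the architect session. The `error.txt` file is a placeholder for wherever you saved the output.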

A Real Case Study: Vampire Survivors Clone

To make this concrete, here is exactly what happened in the video.

Hardware: M2 Pro, sixteen gigabytes RAM.

First attempt: Gemma 4, four billion parameters, thirty-two thousand token context.

Prompt: full Claude Code system prompt plus the build plan.

Result: "I need more information." Every time.

Diagnosis: The 32k context window cut off the system prompt. The contractor literally could not see the instructions.

Second attempt: Gemma 4, four billion parameters, one hundred twenty-eight thousand token context.

Result: The model could read the plan. It produced generic responses, abandoned tasks halfway, and refused to follow the state machine specification.

Diagnosis: The model is too small. It does not have enough parameters to run a multi-step coding workflow.

Third attempt: Gemma 4, thirty-one billion parameters, one hundred twenty-eight thousand token context, running on Ollama Cloud.

Result: Task-by-task execution worked. The game built incrementally. Autofire worked. Enemy spawning matched the spec. Scoring persisted. Game over screen rendered.

Parallel comparison: The same 31B model, same machine, different prompt.

Vague prompt: "Make me a Vampire Survivors clone in HTML and JavaScript with enemies and a player." Result: Hallucinated file structure. Missing autofire. Made-up enemy behaviors. Half features missing. Game barely runs. Thrown out.
Plan prompt: Full build plan, task-by-task. Result: Working game. All three enemy types spawning at correct intervals. Autofire working. Scoring persisting. Game over screen. Exactly like the specs.

Same model. Same hardware. Different plan. Different universe.

graph LR
    subgraph Same Hardware & Model
        V[Vague Prompt] --> G1[Garbage]
        P[Plan Prompt] --> G2[Working Game]
    end
    style V fill:#ffcccc
    style P fill:#ccffcc
    style G1 fill:#ffcccc
    style G2 fill:#ccffcc

Figure 8: The comparison. Same model, same machine, different prompt quality.

Code Example: The Honest Contract

The pattern is so reliable that you can script it. Here is a minimal bash loop for task-by-task execution.

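#!/bin/bash
# honest-contract.sh - Task-by-task contractor dispatch

PROJECT_DIR="./vampire-survivors"
BUILD_PLAN="docs/build-plan.md"   # relative to PROJECT_DIR, because we cd into it
CONTRACTOR="gemma4:31b"
TOTAL_TASKS=15

cd "$PROJECT_DIR" || exit 1

for i in $(seq 1 "$TOTAL_TASKS"); do
    echo "===== Task $i ====="

    # 1. Hand the contractor exactly one task.
    #    Assumes the claude CLI is already routed to the contractor model
    #    (for example through the Ollama setup above). -p runs a single
    #    prompt and exits; check the flags against claude --help.
    claude --model "$CONTRACTOR" \
        -p "Read $BUILD_PLAN. Implement task $i only. Stop when done."

    # 2. Wait for human verification
    echo "Task $i complete. Verify before continuing."
    echo "Press Enter once task $i is verified (Ctrl-C to stop and escalate)..."
    read -r

    # 3. Optional: git checkpoint after each verified task.
    #    Skip the commit when the task changed nothing.
    git add .
    git diff --cached --quiet || git commit -m "Task $i: verified via honest-contract"
done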

This script does one thing: it prevents the autonomous loop. The human must press Enter after each task. You cannot accidentally walk away.

The verification checklist

What do you verify between tasks?

The files created match the blueprint.
The code compiles and runs without errors.
The behavior matches the task specification exactly.
No extra features were invented.
No planned features were omitted.

If any of these fail, you do not proceed. You escalate to the architect.
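The first item on the checklist is the easiest to automate. The sketch below only checks that the files a task promised exist and are non-empty; the other four checks still need you to run the software and read the task. `missing_files` is a hypothetical helper.

```shell
# missing_files DIR FILE...: print each expected file that is absent or empty.
# Runs in a subshell so "dir" and "f" stay local.
missing_files() (
    dir=$1
    shift
    for f in "$@"; do
        [ -s "$dir/$f" ] || echo "$f"
    done
)

# files_ok DIR FILE...: succeed only when nothing is missing.
files_ok() {
    [ -z "$(missing_files "$@")" ]
}
```

After task one of the case study, `files_ok . index.html style.css main.js` should succeed. If it does not, `missing_files` tells you which file to mention when you escalate.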

Mermaid Diagrams in Practice

The diagrams in this post use Mermaid syntax (version 11.7.0+). Here is the state machine diagram for the game itself, taken directly from the blueprint produced in the case study.

stateDiagram-v2
    [*] --> Menu
    Menu --> Playing: Start Game
    Playing --> Paused: Escape
    Paused --> Playing: Resume
    Playing --> GameOver: Player HP <= 0
    GameOver --> Menu: Return to Menu
    Playing --> Upgrade: Level Up
    Upgrade --> Playing: Select Upgrade

Figure 9: Gameplay state machine. Defined in architecture plan; implemented task by task.
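A diagram like this is also a transition table, and a table is easy to check before any game code exists. The sketch below encodes the arrows from Figure 9 in shell, purely as a verification aid; the game itself is written in JavaScript, and the event names are hypothetical labels for the arrows.

```shell
# next_state STATE EVENT: print the resulting state, or fail for a
# transition the diagram does not allow.
next_state() {
    case "$1:$2" in
        Menu:start)           echo "Playing" ;;
        Playing:escape)       echo "Paused" ;;
        Paused:resume)        echo "Playing" ;;
        Playing:player_dead)  echo "GameOver" ;;
        GameOver:menu)        echo "Menu" ;;
        Playing:level_up)     echo "Upgrade" ;;
        Upgrade:select)       echo "Playing" ;;
        *)                    return 1 ;;
    esac
}

next_state Playing level_up   # prints: Upgrade
next_state Paused level_up || echo "not allowed"   # prints: not allowed
```

When the contractor implements the state machine, any transition it adds that this table rejects is scope creep by definition.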

Why Most Autonomous Agent Setups Fail

If you watch the AI coding space, almost every "autonomous agent" setup follows the same pattern:

Give the agent a vague goal.
Let it run for a long time.
Hope it finishes.

This fails for two reasons:

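First, the goal is vague. "Build a to-do app" is not a spec. It is a category. The agent will make architectural decisions that are convenient for the model, not optimal for the user.

Second, the run is long. With nobody holding the global state between steps, the agent repeats work, drifts from the goal, and fails partway through.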

The architect-blueprint-contractor loop fixes both problems. The plan is specific. The contractor runs for one task, then stops. The GC keeps the global state.

graph TD
    Vague[Vague Goal] --> A1[Agent thinks for itself]
    A1 --> H1[Hallucinates architecture]
    H1 --> F1[Repeat work, drift, fail]
    Plan[Specific Plan] --> C1[Task 1]
    C1 --> V1[Verify]
    V1 --> C2[Task 2]
    C2 --> V2[Verify]
    V2 --> C3[...Continue]

Figure 10: Autonomous vague goal vs disciplined task loop.

The Token Math

A common objection is that the architect stage wastes tokens. You pay Claude for twenty-five minutes of interview and seven markdown files. Is that not expensive?

It depends on what you compare it to.

If you compare it to one vague prompt that produces garbage, yes, it looks expensive.

The plan is the biggest token saver in the workflow. Every unanswered question in the interview becomes a correction later. Corrections cost more tokens than questions.

In the case study:

Architect stage: twenty to twenty-five minutes of interview, seven markdown files.
Contractor stage: fifteen tasks, task-by-task, with verification.
Total number of full re-architectures: zero.
Total number of major corrections: zero.

That is the point. The workflow does not eliminate work. It eliminates rework.

Checklist: Running the Workflow on Any Project

Here is a practical checklist you can follow for your next AI-assisted build.

Before you start

Install the contractor runtime. If using Ollama Cloud: ollama signin && ollama launch claude --model gemma4:31b --cloud.
Open a fresh Claude Code session for the architect.
Open a terminal or editor for the project workspace.

Stage 1: Interview

Write one sentence describing what you want.
Answer every question the architect asks.
Do not add features later. If you think of new features, add them to a "future" list, not the current build.
Review the seven markdown files before proceeding.

Stage 2: Blueprint acceptance

Confirm the state machine is defined.
Confirm enemy types, scoring, and persistence are specified.
Confirm the file structure matches your preferences.
Request fifteen or fewer tasks, each one small.

Stage 3: Task loop

Open Claude Code with the contractor model active.
Drop the build plan into the project.
Implement task one only. Stop and wait.
Run the application. Verify behavior.
If verification passes, proceed. If not, escalate to architect.
Repeat until all tasks are complete.

Stage 4: Polish

Review the finished project against the original brief.
Note missing features in the "future" list.
Commit the working build.
Share or ship.

This checklist is intentionally boring. That is the point. Boring workflows produce reliable software. Exciting workflows produce exciting bugs.

Applying the Workflow Beyond Games

The case study uses a top-down survival game. The pattern applies to any multi-file software project.

Web applications

For a React dashboard, the architect produces:

Data model and API contract.
Component tree and state management approach.
Authentication flow and route structure.
Build tasks, one per component or feature.

The contractor implements each component. You verify routing and state.

CLI tools

For a Rust CLI, the architect produces:

Argument parsing specification.
Subcommand hierarchy.
Error handling strategy.
Build tasks, one per subcommand or module.

The contractor implements each module. You verify the binary runs and the help text is correct.

Infrastructure as code

For a Terraform module, the architect produces:

Resource graph and dependency order.
Variable and output contract.
Environment separation strategy.
Build tasks, one per resource group.

The contractor implements each resource group. You verify terraform plan shows no surprises.

The loop does not change. The inputs change. The verification criteria change. The discipline stays the same.

Case Study Deep Dive: The Exact Artifacts

To make this actionable, here are the exact artifacts produced in the interview stage of the case study, with minor edits for clarity.

Product Brief (excerpt)

Build a Vampire Survivors-style top-down shooter. Player moves with WASD. Autofire enabled by default. Enemies spawn from edges and move toward player. Three enemy types with increasing difficulty. Score persists in localStorage. Game over when player reaches zero health. Upgrade choices on level up. No external dependencies. Vanilla HTML, CSS, JavaScript only.

User Stories (excerpt)

As a player, I want to move my character with WASD keys so I can dodge enemies.
As a player, I want autofire so I can focus on movement.
As a player, I want score tracking so I can beat my own record.
As a player, I want upgrade choices so my build evolves over time.

Build Plan (excerpt)

## Task 1: Canvas Setup

- Create index.html, style.css, main.js
- Set canvas to 800x600, centered on page
- Render black background each frame
- Expected output: black screen, no errors

## Task 2: Player Movement

- Add player square at canvas center
- WASD movement at 200 pixels per second
- Player stays within canvas bounds
- Expected output: square moves with keys

## Task 3: Autofire

- Player fires projectile every 250ms
- Projectile moves straight right
- Remove projectile when off-screen
- Expected output: continuous stream of bullets

The pattern for each task is identical: one task, one expected outcome, one verification step.

Task List structure

Each task in the tasks.md file contains:

Task number and title.
Inputs: what files or context the contractor should read.
Implementation steps: numbered, concrete, and small.
Expected output: what a passing verification looks like.
Verification criteria: what you check when the contractor stops.

If the expected output is "a playable game with three enemy types," the task is too big. If the expected output is "a red square renders at coordinates 100,100," the task is right-sized.
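Before handing a plan to the contractor, it is worth checking that every task actually states its expected output. The sketch below assumes the plan uses `## Task` headings and an `Expected output:` line per task, as in the build plan excerpt; `plan_gaps` is a hypothetical helper.

```shell
# plan_gaps FILE: print how many "## Task" sections lack an
# "Expected output:" line. Zero means every task states its outcome.
plan_gaps() {
    awk '
        /^## Task /        { if (in_task && !seen) gaps++; in_task = 1; seen = 0 }
        /Expected output:/ { seen = 1 }
        END                { if (in_task && !seen) gaps++; print gaps + 0 }
    ' "$1"
}
```

Running `plan_gaps docs/build-plan.md` and getting anything other than 0 is a reason to go back to the architect before the first task starts. It does not judge whether a task is right-sized; that still takes a human read.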

The Honest Contract Pattern

The term "honest contract" describes the relationship between you, the architect, and the contractor. The contract has three clauses:

Clause 1: The architect does not lie.

If the architect does not know something, it says so. If the plan has a gap, it flags it. A good architect produces a blueprint with unknowns called out, not invented answers.

Clause 2: The contractor does nothing beyond the task.

Clause 3: The GC verifies before proceeding.

graph LR
    subgraph HonestContract [The Three Clauses]
        A[Architect<br/>No lies] --> B[Contractor<br/>No scope creep]
        B --> C[GC<br/>No skipping verification]
        C --> A
    end

Figure 11: The honest contract. Each role respects its boundary.

Failure Modes and How to Recover

Even with a good plan, things go wrong. Here are the most common failure modes and the recovery procedure.

The contractor quits mid-task

Symptoms: Model output ends abruptly. No error message. No status indicator. Task is incomplete.

Recovery:

Do not retry the same prompt.
Check if the task is too large. Split it in half.
If splitting does not help, escalate the subtask to the architect. Ask for a smaller subtask definition.
Resume with the smaller subtask.

The contractor produces working code with the wrong behavior

Symptoms: Code runs. Game launches. But the enemy spawns at the player's position instead of the edges.

Recovery:

Do not patch the code yourself yet.
Escalate to the architect with the observation.
Ask the architect to revise the task specification.
Hand the revised task to the contractor.

The architect produces an inconsistent blueprint

Symptoms: Task 3 references a module that Task 1 does not create. The state machine in the architecture plan conflicts with the build plan.

Recovery:

Pause the workflow.
Review the conflicting files.
Present the conflict to the architect.
Ask for a revised, consistent blueprint.
Do not proceed until the blueprint is internally consistent.

The blueprint is correct but the verification reveals a design flaw

Symptoms: Task runs perfectly. But when Task 7 builds on it, the integration breaks.

Recovery:

Accept the completed task.
Escalate the integration issue to the architect.
Let the architect revise the plan for Tasks 7 onward.
Continue from the revised task.

Measuring the Workflow

If you want data on whether this workflow actually saves tokens and time, instrument it.

Track:

Architect tokens spent: prompt + response in Stage 1 and Stage 2.
Contractor tokens spent: prompt + response per task in Stage 3.
Number of escalations.
Number of re-architectures.
Number of tasks completed without correction.
Total time from start to working software.
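A flat log is enough to capture the task-level counts. The sketch below is one possible shape, not a prescribed format: the `task-log.csv` name, the `pass` and `escalated` labels, and both function names are assumptions. Token counts would have to come from whatever usage reporting your tools provide.

```shell
# Hypothetical CSV log with one line per task: task_number,result
# where result is "pass" or "escalated".
LOG_FILE="${LOG_FILE:-task-log.csv}"

# log_task TASK RESULT: append one record.
log_task() {
    echo "$1,$2" >> "$LOG_FILE"
}

# summarize: print totals for the run so far.
summarize() {
    awk -F, '
        { total++ }
        $2 == "escalated" { escalated++ }
        END {
            printf "tasks=%d escalations=%d clean=%d\n", total, escalated, total - escalated
        }
    ' "$LOG_FILE"
}
```

Call `log_task 3 pass` or `log_task 4 escalated` right after each verification step, then `summarize` at the end of the build.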

In the case study, the metrics were:

Architect stage: twenty to twenty-five minutes of interview, seven files, no re-architectures.
Contractor stage: fifteen tasks, fifteen successful implementations, zero escalations.
Correction rate: effectively zero after Task 1.
Total time: several hours for a non-trivial game, first-time implementation.

Compare that to the vague-prompt workflow, which produced garbage in twenty minutes and required a full restart.

The numbers do not lie. The plan is the biggest lever.
