June 12, 2026
If you have ever stared at a Claude usage limit, switched to a local model, and watched it produce code that does not compile, you have probably blamed the model. The local model is too small. The cloud provider is too expensive. The tool is too limited.
That blame is almost always misplaced.
The failure is rarely the worker. It is the blueprint.
In this guide, we walk through a workflow that lets you punch far above the weight class of your local hardware, repeatedly ship working software, and avoid the predictable mid-task collapse that plagues almost every "run an agent and walk away" setup. We will use a top-down survival game built with a frontier architect and a local execution model as a running case study, but the pattern applies to any multi-file software project.
The central claim is simple: your AI coding limits are planning limits, not model limits. Most people skip the plan because it feels slower. It is not. It is the difference between building a house with an architect and building a house by showing up with a hammer and hope.
Think of software construction the same way you think of physical construction.
You do not show up to a job site with a hammer and start swinging. You hire an architect. The architect sits with you, asks twenty-plus questions, and produces a set of drawings: where the load-bearing walls go, where the plumbing runs, what the house will look like in five years when the tree you planted today has grown.
Then you hire trades. Framers. Electricians. Plumbers. They do not redesign the house. They build what is already drawn. They are cheaper, they are faster, and they are only as good as the plan handed to them.
You are the general contractor on site. You are not swinging hammers. You are not drafting blueprints. You are reading the plan, handing the trades one task at a time, and verifying each one before moving on.
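In the AI coding world, the roles map directly: the architect is a frontier model such as Claude or Codex, the contractor is a cheaper execution model such as Gemma 4 31B, and the general contractor is you.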
This is not an analogy. It is the actual control flow.
graph TD
A[User + Architect Model<br/>Claude / Codex] -->|Interview & Design| B[Blueprint<br/>Architecture, Build Plan, Tasks]
B -->|Task 1| C[Contractor Model<br/>Gemma 4 31B]
C -->|Implement & Stop| D{GC Verification}
D -->|Pass| B
D -->|Fail| E[Escalate to Architect]
E -->|Revised Guidance| B
D -->|Escalate Stuck Issue| E
C -->|Task N| D
Figure 1: The four-stage architect-blueprint-contractor workflow.
The architect never swings the hammer. The contractor never redesigns the house. You never stop verifying. This loop is boring, tedious, and almost always produces working code.
When you hit a Claude limit, the immediate reaction is to "optimize" the model. Use a smaller model. Use a local model. Use a cheaper API. Use fewer tokens per message.
All of these are worker-side optimizations.
The actual bottleneck is that most developers hand the AI a sentence or two and expect a multi-file application. That is not a model limitation. That is a specification limitation. If you hand a human engineer a sticky note that says "build a clone of Vampire Survivors," they will not produce clean architecture either. They will produce garbage, ask for clarification, or quit.
Models are the same. A model is a worker. A worker with a perfect blueprint can do amazing work. A worker with a blueprint cut in half will produce garbage, regardless of how large the context window is.
This distinction matters because it changes your optimization target. Instead of chasing the newest, biggest model, you invest time in the interview with the architect. That investment pays back in every subsequent task.
This is the stage everyone skips, and it is also the stage that saves the most tokens later.
You sit down with the architect. You do not tell it what to build. You let it ask you questions until both of you know exactly what you are building.
In practice, this looks like giving the architect one sentence and letting it interview you for twenty to twenty-five minutes.
One-sentence prompt:
"I want to build a Vampire Survivors-style top-down shooter in vanilla HTML and JavaScript."
Architect response: The architect comes back with twenty-plus questions:
Each answer locks one piece of the build. By the end of the interview, you have a shared mental model with the architect. There are no surprises.
sequenceDiagram
participant You
participant Architect
You->>Architect: One-sentence project idea
loop Interview
Architect->>You: Question about player movement
You->>Architect: Answer
Architect->>You: Question about enemy variety
You->>Architect: Answer
end
Architect->>You: Seven markdown files: brief, stories, architecture, build plan, local/cloud split
Figure 2: The interview stage. The architect extracts requirements; you do not skip it.
If you skip this stage, you will save twenty minutes at the start and lose hours later. The architecture document, the build plan, and the task list are the outputs of the interview. Do not skip the architect meeting.
The architect produces seven markdown files in the case study described:
This is the blueprint. It is detailed. It is boring. It is correct.
The contractor does not get the whole set. It gets the build plan only, divided into fifteen tasks. Each task is small enough that a model can finish it in one response.
Why? Because large models are generalists. They get distracted. If you hand them the full context and say "do everything," they will hallucinate file structures, invent enemy behaviors you never asked for, and quit halfway through task three.
If you hand them one task and say "stop when done," they implement that task, and they stop.
graph LR
B[Blueprint Outputs] --> P[Product Brief]
B --> U[User Stories]
B --> A[Architecture Plan]
B --> BP[Build Plan]
B --> LC[Local vs Cloud Split]
B --> T[Task List - 15 items]
T --> T1[Task 1: Setup]
T --> T2[Task 2: Player Movement]
T --> T3[Task 3: Enemy Spawner]
T --> TN[...]
Figure 3: The blueprint outputs. The contractor sees only the build plan with task boundaries.
You hand the contractor the build plan.
In the case study, the contractor was Gemma 4, 31 billion parameters, running on Ollama Cloud. The setup required two commands:
ollama signin
ollama launch claude --model gemma4:31b --cloud
From there, Claude Code talks to the contractor through the Ollama bridge. The workflow is:
This is not autonomous. This is not "let the agent run for three hours." This is task by task.
flowchart LR
Start[Load Build Plan] --> T1[Task 1]
T1 --> V1{Verify}
V1 -->|Pass| T2[Task 2]
V1 -->|Fail| E1[Escalate to Architect]
E1 --> T2
T2 --> V2{Verify}
V2 -->|Pass| T3[Task 3]
V2 -->|Fail| E2[Escalate]
T3 --> Dots[...]
Dots --> End[Build Complete]
Figure 4: The task-by-task loop. Human verification between every task.
You might be tempted to hand the build plan to the model and say: "Verify each task. Run all the tests. Come back when you are finished."
This does not work. In the case study, Gemma 4, 31 billion parameters, same machine, same context, same prompt, quit twice. Once after task three. Once after task five. There was no error. It just stopped.
The lesson is clear: there is no autonomous job size for local models yet. The contractor needs the GC there to hand it the next task and verify the last one.
This is not a failure of the model. It is a limitation of the control loop. Treat it like a real job site. The trades do not call the architects. The GC calls the trades.
One of the most misunderstood concepts in local model deployment is the relationship between context window size and model capability.
Claude Code does not just send your prompt to the model. It sends a system prompt containing definitions, instructions, agent setup, and the entire tool interface. That system prompt alone is fifty to sixty-five thousand tokens before your actual question ever enters the picture.
This means:
graph TD
subgraph Context Window Anatomy
S[System Prompt<br/>50k-65k tokens] --> T[Tasks + Plan<br/>+ your prompt]
T --> C[Total Context Needed]
end
32[32k Context] -->|Cutoff| Fail[Model sees half the blueprint]
128[128k Context] -->|Enough| Model4[4B Model]
128 -->|Enough| Model31[31B Model]
Model4 -->|Fails| Generic[Generic responses, quits early]
Model31 -->|Success| Capable[Capable execution]
Figure 5: Context window anatomy. A 32k context cuts off the system prompt. A 4B parameter model cannot execute even with full context.
The punch line: big context window does not equal capable model. You need both.
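The context arithmetic above can be sketched as a quick budget check. This is a minimal illustration, not a measurement: the 50k to 65k system prompt range comes from the text, while `PLAN_TOKENS` and the function name `contextBudget` are assumptions made up for the example.

```javascript
// Rough context budget check.
// The 50k-65k system prompt range is quoted in the article;
// PLAN_TOKENS is a hypothetical size for the build plan plus one task prompt.
function contextBudget(contextWindow, systemPromptTokens, planTokens) {
  const needed = systemPromptTokens + planTokens;
  return {
    needed,
    remaining: contextWindow - needed,
    fits: needed <= contextWindow,
  };
}

const SYSTEM_PROMPT_LOW = 50_000;
const SYSTEM_PROMPT_HIGH = 65_000;
const PLAN_TOKENS = 4_000; // assumption for illustration only

for (const windowSize of [32_000, 128_000]) {
  const best = contextBudget(windowSize, SYSTEM_PROMPT_LOW, PLAN_TOKENS);
  const worst = contextBudget(windowSize, SYSTEM_PROMPT_HIGH, PLAN_TOKENS);
  console.log(
    `${windowSize} tokens: fits (small system prompt)=${best.fits}, ` +
      `fits (large system prompt)=${worst.fits}`
  );
}
```

Under these assumptions the 32k window fails before the plan is even counted, which matches the first attempt in the case study. The check says nothing about whether the model is capable once the context fits; that is the separate parameter-count problem.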
If you are on a sixteen-gigabyte MacBook, do not bother running the 4B model locally. It will see the plan and still fail. It does not have enough parameters to run a full coding workflow.
If you have thirty-two gigabytes of RAM, you can run the 26B Gemma 4 model locally for some workloads. The 31B cloud version is still the safer bet.
If you have sixty-four gigabytes or more, you can run the 31B locally and it will work, but the cloud free tier is still cheaper and easier.
The realistic path for most readers is an open-weight model: no Anthropic API key, no five-thousand-dollar rig in the closet. Ollama Cloud covers testing and light usage for free.
flowchart TD
Start[What is your RAM?] -->|16 GB| Cloud31[Ollama Cloud<br/>Gemma 4 31B<br/>Free tier]
Start -->|32 GB| Choice{Agent work?}
Choice -->|Light| Local26[Local 26B]
Choice -->|Heavy| Cloud31
Start -->|64 GB| Local31[Local 31B<br/>or Cloud 31B]
Start -->|128 GB| Aggressive[Run everything locally<br/>No cloud needed]
Figure 6: Hardware selection flowchart. Pick by your hardware, not by the hype.
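The flowchart in Figure 6 can be written down as a small lookup, sketched here. The function name `pickContractor` is hypothetical, and treating each RAM figure as a lower bound (so 48 GB behaves like 32 GB) is an assumption; the figure only names the four exact sizes.

```javascript
// Encodes Figure 6 as a function.
// Assumption: each RAM figure in the chart is treated as a lower bound.
function pickContractor(ramGb, heavyAgentWork = false) {
  if (ramGb >= 128) return "local: run everything locally";
  if (ramGb >= 64) return "local 31B or cloud 31B";
  if (ramGb >= 32) return heavyAgentWork ? "cloud 31B" : "local 26B";
  return "cloud 31B"; // 16 GB and below: Ollama Cloud free tier
}

console.log(pickContractor(16));        // cloud 31B
console.log(pickContractor(32, false)); // local 26B
console.log(pickContractor(32, true));  // cloud 31B
console.log(pickContractor(64));        // local 31B or cloud 31B
```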
Ollama recently made their cloud tier free for low-use testing. That is not a gimmick. It is the actual answer for people without the right device.
# Sign in to Ollama Cloud
ollama signin
# Launch the 31B Gemma 4 model on cloud GPUs
ollama launch claude --model gemma4:31b --cloud
# From Claude Code, the contractor is now Gemma 4 31B
# running on Ollama's GPUs, not your laptop.
The workflow is identical to local. The latency is higher. The cost is zero for low usage. The capability is real.
Let us expand the four stages with the exact artifacts and commands used in the case study.
You open a new Claude Code session. You give the one-sentence prompt. You let Claude ask questions. You answer them. You do not rush this.
Time investment: twenty to twenty-five minutes.
Outputs used later: none yet. This is shared understanding.
After the interview, Claude generates seven markdown files. You review them. You refine them. You make sure the state machine definitions are correct. You make sure the enemy spawn logic is described in detail.
Only when the blueprint is solid do you proceed.
Time investment: another fifteen to thirty minutes, depending on project size.
Outputs:
- docs/product-brief.md
- docs/user-stories.md
- docs/architecture.md
- docs/build-plan.md
- docs/local-vs-cloud.md
- docs/interview-notes.md
- docs/tasks.md

You open Claude Code with the contractor model active. You open the project folder. You drop the docs folder in. You say:
"Read docs/build-plan.md. Implement task one only. Stop when done."
Task one is small. Maybe it is: "Create index.html, style.css, and main.js. Set up the canvas. Render a black square. Autofire enabled."
The contractor writes three files. It stops.
You run the game. You check:
If yes, you say: "Proceed to task two."
If no, you copy the error. You switch back to the architect. You say:
"Task one produced this error. Here is what the build plan intended. Revise the task specification."
Then you paste the revised task back to the contractor.
This continues for fifteen tasks.
Time investment: depends on task size. The case study took several hours of actual work, but it was a first-time implementation of a non-trivial game with no major refactors.
Outputs: working software, verified incrementally.
When the contractor gets stuck, you escalate to the architect. You do not ask the contractor to debug itself. You ask the architect.
In the case study, escalation was rare because the blueprint was good. But it does happen. Never give the architect the full project context in the escalation message; give it just the task, the intended outcome, and the observed failure. That keeps architect calls cheap.
graph TD
subgraph Architect [Claude / Codex]
I[Interview] --> BP[Blueprint]
end
subgraph Contractor [Gemma 4 31B]
T1[Task] --> V{Verify}
end
BP --> T1
V -->|Pass| T2[Next Task]
V -->|Fail| E[Escalate]
E -->|One task context| A[Architect Fix]
A --> T2
T2 --> V
Figure 7: The escalation path. The architect gets only the failing task and the expected outcome.
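One way to keep escalations small is to assemble the message from exactly three pieces: the task, the intended outcome, and the observed failure. The sketch below shows one possible shape. The function name `buildEscalationPrompt`, its field names, and the sample error text are all hypothetical; the closing sentence reuses the wording from the escalation example earlier in the article.

```javascript
// Builds a minimal escalation message for the architect.
// It deliberately accepts only three pieces of context, so the rest of the
// project cannot leak into the architect call by accident.
function buildEscalationPrompt({ taskNumber, taskSpec, intendedOutcome, observedFailure }) {
  return [
    `Task ${taskNumber} produced this error.`,
    "",
    "Task specification:",
    taskSpec.trim(),
    "",
    "Intended outcome:",
    intendedOutcome.trim(),
    "",
    "Observed failure:",
    observedFailure.trim(),
    "",
    "Revise the task specification.",
  ].join("\n");
}

// Example with made-up failure text.
console.log(
  buildEscalationPrompt({
    taskNumber: 1,
    taskSpec: "Create index.html, style.css, main.js. Render black background each frame.",
    intendedOutcome: "Black screen, no errors.",
    observedFailure: "Console shows an error on page load (hypothetical example).",
  })
);
```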
To make this concrete, here is exactly what happened in the video.
Hardware: M2 Pro, sixteen gigabytes RAM.
First attempt: Gemma 4, four billion parameters, thirty-two thousand token context.
Prompt: full Claude Code system prompt plus the build plan.
Result: "I need more information." Every time.
Diagnosis: The 32k context window cut off the system prompt. The contractor literally could not see the instructions.
Second attempt: Gemma 4, four billion parameters, one hundred twenty-eight thousand token context.
Result: The model could read the plan. It produced generic responses, abandoned tasks halfway, and refused to follow the state machine specification.
Diagnosis: The model is too small. It does not have enough parameters to run a multi-step coding workflow.
Third attempt: Gemma 4, thirty-one billion parameters, one hundred twenty-eight thousand token context, running on Ollama Cloud.
Result: Task-by-task execution worked. The game built incrementally. Autofire worked. Enemy spawning matched the spec. Scoring persisted. Game over screen rendered.
Parallel comparison: The same 31B model, same machine, different prompt.
Vague prompt: "Make me a Vampire Survivors clone in HTML and JavaScript with enemies and a player." Result: Hallucinated file structure. Missing autofire. Made-up enemy behaviors. Half the features missing. Game barely runs. Thrown out.
Plan prompt: Full build plan, task-by-task. Result: Working game. All three enemy types spawning at correct intervals. Autofire working. Scoring persisting. Game over screen. Exactly like the specs.
Same model. Same hardware. Different plan. Different universe.
graph LR
subgraph Same Hardware & Model
V[Vague Prompt] --> G1[Garbage]
P[Plan Prompt] --> G2[Working Game]
end
style V fill:#ffcccc
style P fill:#ccffcc
style G1 fill:#ffcccc
style G2 fill:#ccffcc
Figure 8: The comparison. Same model, same machine, different prompt quality.
The pattern is so reliable that you can script it. Here is a minimal bash loop for task-by-task execution.
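#!/bin/bash
# honest-contract.sh - Task-by-task contractor dispatch
# Assumes Claude Code is already routed to the contractor model
# (for example, through the Ollama bridge set up earlier).
PROJECT_DIR="./vampire-survivors"
BUILD_PLAN="docs/build-plan.md"  # relative to PROJECT_DIR, because we cd into it below
CONTRACTOR="gemma4:31b"
TOTAL_TASKS=15
cd "$PROJECT_DIR" || exit 1
for i in $(seq 1 "$TOTAL_TASKS"); do
  echo "===== Task $i ====="
  # 1. Hand over exactly one task. -p runs a single non-interactive prompt.
  #    Depending on your permission settings, non-interactive runs may need
  #    file-edit permissions granted in advance.
  claude --model "$CONTRACTOR" \
    -p "Read $BUILD_PLAN. Implement task $i only. Stop when done."
  # 2. Wait for human verification
  echo "Task $i complete. Verify before continuing."
  read -r -p "Press Enter once task $i is verified (Ctrl-C to stop and escalate)... "
  # 3. Optional: git checkpoint after each verified task (skipped if nothing changed)
  git add .
  git diff --cached --quiet || git commit -m "Task $i: [automated via honest-contract]"
done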
This script does one thing: it prevents the autonomous loop. The human must press Enter after each task. You cannot accidentally walk away.
What do you verify between tasks?
If any of these fail, you do not proceed. You escalate to the architect.
Here is a state machine diagram for the game itself, written in Mermaid 11.7.0+ syntax and taken directly from the blueprint produced in the case study.
stateDiagram-v2
[*] --> Menu
Menu --> Playing: Start Game
Playing --> Paused: Escape
Paused --> Playing: Resume
Playing --> GameOver: Player HP <= 0
GameOver --> Menu: Return to Menu
Playing --> Upgrade: Level Up
Upgrade --> Playing: Select Upgrade
Figure 9: Gameplay state machine. Defined in architecture plan; implemented task by task.
The architect defines the states. The contractor implements the transitions. The GC verifies each one. If the architect skips this diagram, the contractor will invent states. You will end up with a pause menu that does not actually pause the game loop.
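A transition table is one way the contractor might implement Figure 9 without inventing states. This is a sketch, not the code from the case study: the state names come from the diagram, but the event names (`startGame`, `escape`, and so on) and both function names are assumptions.

```javascript
// State names mirror Figure 9. Event names are hypothetical labels
// for the transitions shown on the diagram's arrows.
const TRANSITIONS = {
  Menu: { startGame: "Playing" },
  Playing: { escape: "Paused", playerDead: "GameOver", levelUp: "Upgrade" },
  Paused: { resume: "Playing" },
  Upgrade: { selectUpgrade: "Playing" },
  GameOver: { returnToMenu: "Menu" },
};

// Returns the next state, or throws if the diagram has no such arrow.
function nextState(state, event) {
  const row = TRANSITIONS[state];
  if (!row || !Object.hasOwn(row, event)) {
    throw new Error(`Illegal transition: ${state} on ${event}`);
  }
  return row[event];
}

// Only Playing advances the simulation, so Paused really pauses the game loop.
function shouldUpdateWorld(state) {
  return state === "Playing";
}

let state = "Menu";
state = nextState(state, "startGame"); // Playing
state = nextState(state, "escape");    // Paused
console.log(state, shouldUpdateWorld(state)); // Paused false
```

Rejecting unknown transitions makes verification easy for the GC: any arrow that is not in the diagram fails loudly instead of silently creating a new state.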
If you watch the AI coding space, almost every "autonomous agent" setup follows the same pattern:
This fails for two reasons:
First, the goal is vague. "Build a to-do app" is not a spec. It is a category. The agent will make architectural decisions that are convenient for the model, not optimal for the user.
Second, the run is too long. Even with a perfect plan, even with a 31B parameter model, autonomous loops collapse. The model loses track of where it is in the plan. It repeats work. It makes decisions that are locally optimal but globally wrong.
The architect-blueprint-contractor loop fixes both problems. The plan is specific. The contractor runs for one task, then stops. The GC keeps the global state.
graph TD
Vague[Vague Goal] --> A1[Agent thinks for itself]
A1 --> H1[Hallucinates architecture]
H1 --> F1[Repeat work, drift, fail]
Plan[Specific Plan] --> C1[Task 1]
C1 --> V1[Verify]
V1 --> C2[Task 2]
C2 --> V2[Verify]
V2 --> C3[...Continue]
Figure 10: Autonomous vague goal vs disciplined task loop.
A common objection is that the architect stage wastes tokens. You pay Claude for twenty-five minutes of interview and seven markdown files. Is that not expensive?
It depends on what you compare it to.
If you compare it to one vague prompt that produces garbage, yes, it looks expensive.
If you compare it to the cost of re-running the generation five times because the model misread your vague instructions, or re-architecting after task five because the file structure is wrong, the architect stage is cheaper.
The plan is the biggest token saver in the workflow. Every unanswered question in the interview becomes a correction later. Corrections cost more tokens than questions.
In the case study:
That is the point. The workflow does not eliminate work. It eliminates rework.
Here is a practical checklist you can follow for your next AI-assisted build.
ollama signin && ollama launch claude --model gemma4:31b --cloud
This checklist is intentionally boring. That is the point. Boring workflows produce reliable software. Exciting workflows produce exciting bugs.
The case study uses a top-down survival game. The pattern works for everything.
For a React dashboard, the architect produces:
The contractor implements each component. You verify routing and state.
For a Rust CLI, the architect produces:
The contractor implements each module. You verify the binary runs and the help text is correct.
For a Terraform module, the architect produces:
The contractor applies each module. You verify terraform plan shows no surprises.
The loop does not change. The inputs change. The verification criteria change. The discipline stays the same.
To make this actionable, here are the exact artifacts produced in the interview stage of the case study, with minor edits for clarity.
Build a Vampire Survivors-style top-down shooter. Player moves with WASD. Autofire enabled by default. Enemies spawn from edges and move toward player. Three enemy types with increasing difficulty. Score persists in localStorage. Game over when player reaches zero health. Upgrade choices on level up. No external dependencies. Vanilla HTML, CSS, JavaScript only.
## Task 1: Canvas Setup
- Create index.html, style.css, main.js
- Set canvas to 800x600, centered on page
- Render black background each frame
- Expected output: black screen, no errors
## Task 2: Player Movement
- Add player square at canvas center
- WASD movement at 200 pixels per second
- Player stays within canvas bounds
- Expected output: square moves with keys
## Task 3: Autofire
- Player fires projectile every 250ms
- Projectile moves straight right
- Remove projectile when off-screen
- Expected output: continuous stream of bullets
The pattern for each task is identical: one task, one expected outcome, one verification step.
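To show how small a right-sized task is, here is a sketch of what the logic for Tasks 2 and 3 might look like as pure functions. It is not the contractor's actual output. The canvas size, speed, and fire interval come from the build plan above; `PLAYER_SIZE`, the function names, and the `keys` object shape are assumptions.

```javascript
const CANVAS = { width: 800, height: 600 }; // from Task 1
const PLAYER_SPEED = 200;    // pixels per second, from Task 2
const FIRE_INTERVAL_MS = 250; // from Task 3
const PLAYER_SIZE = 20;      // hypothetical: the plan does not give a size

function clamp(value, min, max) {
  return Math.min(Math.max(value, min), max);
}

// Task 2: move by speed * elapsed time, then keep the square inside the canvas.
// `keys` is assumed to look like { w: true, a: false, s: false, d: true }.
// Diagonal speed is not normalized here because the plan does not ask for it.
function updatePlayer(player, keys, dtSeconds) {
  const dx = (keys.d ? 1 : 0) - (keys.a ? 1 : 0);
  const dy = (keys.s ? 1 : 0) - (keys.w ? 1 : 0);
  return {
    x: clamp(player.x + dx * PLAYER_SPEED * dtSeconds, 0, CANVAS.width - PLAYER_SIZE),
    y: clamp(player.y + dy * PLAYER_SPEED * dtSeconds, 0, CANVAS.height - PLAYER_SIZE),
  };
}

// Task 3: autofire is ready when at least 250 ms have passed since the last shot.
function readyToFire(lastShotMs, nowMs) {
  return nowMs - lastShotMs >= FIRE_INTERVAL_MS;
}

console.log(updatePlayer({ x: 390, y: 290 }, { d: true }, 0.5)); // { x: 490, y: 290 }
console.log(readyToFire(0, 249), readyToFire(0, 250));           // false true
```

Each function maps to one expected output in the plan, which is what makes the verification step a quick check rather than a code review.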
Each task in the tasks.md file contains:
If the expected output is "a playable game with three enemy types," the task is too big. If the expected output is "a red square renders at coordinates 100,100," the task is right-sized.
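Because the build plan follows a regular shape, the GC can lint it before handing anything to the contractor. The sketch below parses the `## Task N: Title` headings and `- Expected output:` bullets used in the excerpt above and flags tasks with no expected output. The function names are hypothetical, and the parser assumes the plan keeps exactly that format.

```javascript
// Parses a build plan written in the format shown in the excerpt:
//   ## Task N: Title
//   - step
//   - Expected output: ...
function parseBuildPlan(markdown) {
  const tasks = [];
  let current = null;
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    const heading = line.match(/^## Task (\d+): (.+)$/);
    if (heading) {
      current = { number: Number(heading[1]), title: heading[2], steps: [], expectedOutput: null };
      tasks.push(current);
      continue;
    }
    if (!current || !line.startsWith("- ")) continue; // ignore text outside tasks
    const item = line.slice(2);
    const expected = item.match(/^Expected output:\s*(.+)$/i);
    if (expected) {
      current.expectedOutput = expected[1];
    } else {
      current.steps.push(item);
    }
  }
  return tasks;
}

// A task without an expected output cannot be verified, so flag it.
function tasksMissingExpectedOutput(tasks) {
  return tasks.filter((t) => t.expectedOutput === null).map((t) => t.number);
}

const plan = [
  "## Task 1: Canvas Setup",
  "- Create index.html, style.css, main.js",
  "- Expected output: black screen, no errors",
  "## Task 2: Player Movement",
  "- Add player square at canvas center",
].join("\n");

const tasks = parseBuildPlan(plan);
console.log(tasks.map((t) => t.title));         // [ 'Canvas Setup', 'Player Movement' ]
console.log(tasksMissingExpectedOutput(tasks)); // [ 2 ]
```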
The term "honest contract" describes the relationship between you, the architect, and the contractor. The contract has three clauses:
Clause 1: The architect does not lie.
If the architect does not know something, it says so. If the plan has a gap, it flags it. A good architect produces a blueprint with unknowns called out, not invented answers.
Clause 2: The contractor does nothing beyond the task.
The contractor reads the task. Implements it. Stops. It does not refactor tasks N through N+5 "while it is there." It does not add features it thinks you might want. It does not optimize code that has not been measured.
Clause 3: The GC verifies before proceeding.
No verification, no next task. If you skip verification to save time, you are not running this workflow. You are running the "hope it works" workflow, and that is the workflow that produces broken software.
graph LR
subgraph HonestContract [The Three Clauses]
A[Architect<br/>No lies] --> B[Contractor<br/>No scope creep]
B --> C[GC<br/>No skipping verification]
C --> A
end
Figure 11: The honest contract. Each role respects its boundary.
Even with a good plan, things go wrong. Here are the most common failure modes and the recovery procedure.
Symptoms: Model output ends abruptly. No error message. No status indicator. Task is incomplete.
Recovery:
Symptoms: Code runs. Game launches. But the enemy spawns at the player's position instead of the edges.
Recovery:
Symptoms: Task 3 references a module that Task 1 does not create. The state machine in the architecture plan conflicts with the build plan.
Recovery:
Symptoms: Task runs perfectly. But when Task 7 builds on it, the integration breaks.
Recovery:
The key recovery principle is: fix the plan, then continue execution. Do not try to fix broken software without fixing the plan first. The broken software is a symptom of a broken plan or a broken verification step, not a model problem.
If you want data on whether this workflow actually saves tokens and time, instrument it.
Track:
In the case study, the metrics were:
Compare that to the vague-prompt workflow, which produced garbage in twenty minutes and required a full restart.
The numbers do not lie. The plan is the biggest lever.
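If you want to instrument the loop, a per-task log with a small summary function is enough to start. The sketch below is one possible shape. The field names (`attempts`, `escalated`, `minutes`) are assumptions, and the sample entries are placeholders, not measurements from the case study.

```javascript
// Summarizes a per-task log.
// Field names and sample values are hypothetical placeholders.
function summarizeRun(taskLog) {
  const tasks = taskLog.length;
  const attempts = taskLog.reduce((sum, t) => sum + t.attempts, 0);
  const escalations = taskLog.filter((t) => t.escalated).length;
  const totalMinutes = taskLog.reduce((sum, t) => sum + t.minutes, 0);
  return {
    tasks,
    reworkAttempts: attempts - tasks, // attempts beyond the first try per task
    escalationRate: tasks === 0 ? 0 : escalations / tasks,
    totalMinutes,
  };
}

const exampleLog = [
  { task: 1, attempts: 1, escalated: false, minutes: 5 },
  { task: 2, attempts: 2, escalated: true, minutes: 12 },
  { task: 3, attempts: 1, escalated: false, minutes: 8 },
];

console.log(summarizeRun(exampleLog));
```

Recording rework attempts separately from task count is the design choice that matters here, since the workflow's claim is about eliminating rework rather than work.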
For projects with more than twenty tasks, the workflow needs one addition: milestones.
A milestone is a checkpoint where you rebuild the shared mental model. At a milestone:
This prevents the drift that occurs when a fifteen-task plan grows into a fifty-task plan without updated architecture. Without milestones, Task 30 might conflict with Task 12 because requirements changed and nobody updated the plan.
A practical rule: add a milestone every five to ten tasks. If the project is user-facing, add a milestone after every feature that completes a user story.
graph TD
subgraph M1 [Milestone 1: Tasks 1-5]
M1T1[Task 1] --> M1V1{Verify}
M1V1 --> M1T2[Task 2]
M1T2 --> M1V2{Verify}
M1V2 --> M1T3[Task 3]
M1T3 --> MS1[Milestone Review]
end
subgraph M2 [Milestone 2: Tasks 6-10]
M2T1[Task 6] --> M2V1{Verify}
M2V1 --> M2T2[Task 7]
M2T2 --> M2V2{Verify}
M2V2 --> M2T3[Task 8]
M2T3 --> MS2[Milestone Review]
end
subgraph M3 [Milestone 3: Tasks 11-15]
M3T1[Task 11] --> M3V1{Verify}
M3V1 --> M3T2[Task 12]
M3T2 --> M3V2{Verify}
M3V2 --> M3T3[Task 13]
M3T3 --> MS3[Milestone Review]
end
MS1 -->|Update Plan| M2T1
MS2 -->|Update Plan| M3T1
MS3 -->|Final Test| Complete[Build Complete]
Figure 12: Milestone structure for long-horizon projects. Rebuild shared mental model every five to ten tasks.
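Splitting a task list into milestones is simple arithmetic, sketched below. The function name `planMilestones` and the default of five tasks per milestone are assumptions; the five-to-ten range comes from the rule above.

```javascript
// Splits tasks 1..totalTasks into consecutive milestones.
// The default of 5 tasks per milestone is an assumption within the
// five-to-ten range suggested in the article.
function planMilestones(totalTasks, tasksPerMilestone = 5) {
  if (!Number.isInteger(totalTasks) || totalTasks < 1) {
    throw new RangeError("totalTasks must be a positive integer");
  }
  if (!Number.isInteger(tasksPerMilestone) || tasksPerMilestone < 1) {
    throw new RangeError("tasksPerMilestone must be a positive integer");
  }
  const milestones = [];
  for (let first = 1; first <= totalTasks; first += tasksPerMilestone) {
    const last = Math.min(first + tasksPerMilestone - 1, totalTasks);
    milestones.push({ first, last }); // a milestone review follows task `last`
  }
  return milestones;
}

console.log(planMilestones(15));
// [ { first: 1, last: 5 }, { first: 6, last: 10 }, { first: 11, last: 15 } ]
console.log(planMilestones(22, 10));
// [ { first: 1, last: 10 }, { first: 11, last: 20 }, { first: 21, last: 22 } ]
```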
One advanced variation worth mentioning is the Ralph Wiggum Loop. In this pattern, after completing a milestone, you ask the architect:
"What did we learn in the last five tasks that changes the plan for the next five?"
This keeps the plan alive. It prevents the blueprint from becoming stale. It also surfaces architectural decisions you did not anticipate until you saw working code.
In practice, the Ralph Wiggum Loop adds five minutes per milestone and prevents the drift that ruins long-horizon projects. If you are building anything larger than a prototype, use it.
Not every problem needs the architect. Here is a decision rule:
Escalate to the architect when:
Fix it yourself when:
The rule is simple: if you are changing the spec, escalate. If you are correcting a typo, fix it.
If you are not sure which bucket you are in, escalate. The architect handles both types. The cost difference is small.
Your Claude limits are not a model problem. They are a planning problem.
The workflow is not new. It is just honest about how software gets built. You interview the architect. You get the blueprint. You hand the contractor one task. You verify. You repeat.
The model size matters. The context window matters. But neither matters as much as the plan, and neither matters at all if you are not in the driver's seat.
Build the plan. Run the loop. Verify every task.
The same model with a better plan will outperform a bigger model with a vague prompt every single time.
This article is based on the workflow demonstrated by Daniel Zendel, using a Vampire Survivors-style game as a case study. The transcript is from the original YouTube video: https://youtu.be/gtLNQX_2NQc