Why pay for tokens when you can own the factory?
You're not asking "can a local model answer my questions?" You're asking:
can it handle the sub-tasks that Opus delegates when you say "do the analysis"?
Those sub-tasks are where 90% of your tokens actually go.
We classified 750 real prompts from 56 Claude Code projects, ran 63 tasks across 16 models
on three hardware tiers, and built an orchestration benchmark to find out.
We analyzed 750 real prompts from 56 Claude Code projects. Only 2.7% genuinely need frontier models.
750 prompts · 56 projects · classified by category, complexity, and minimum model tier.
User prompts are the tip of the iceberg. When you say "do the analysis", the orchestrator
(Claude Code / Hugin) spawns dozens of sub-tasks: write this script, read these files,
summarize that diff, run these tests.
Those sub-tasks are where ~90% of tokens are spent.
The real question is not "can local models answer my questions" but
"can local models handle what Opus delegates?"
750 prompts classified using Qwen2.5-7B-Instruct via OpenRouter. Schema: category, complexity, min_model_tier (edge/local-small/local-large/frontier). Source: actual Claude Code session transcripts.
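As a rough illustration of that schema, here is a minimal sketch of one classified record and of how a tier share could be computed from it. Only the three field names and the four tier values come from the text above; the value types, the `tierShare` helper, and the synthetic sample are assumptions made for illustration.

```typescript
// Hypothetical shape of one classified prompt. Field names and tier values are
// taken from the schema description; the value types are assumptions.
type ModelTier = "edge" | "local-small" | "local-large" | "frontier";

interface PromptClassification {
  category: string;        // assumed free-form label, e.g. "debugging"
  complexity: number;      // assumed numeric scale
  min_model_tier: ModelTier;
}

// Fraction of prompts whose minimum tier is one of the given tiers.
function tierShare(rows: PromptClassification[], tiers: ModelTier[]): number {
  if (rows.length === 0) return 0;
  const hits = rows.filter((r) => tiers.includes(r.min_model_tier)).length;
  return hits / rows.length;
}

// Synthetic data for illustration only: 20 of 750 prompts marked "frontier".
const sample: PromptClassification[] = Array.from(
  { length: 750 },
  (_, i): PromptClassification => ({
    category: "example",
    complexity: 1,
    min_model_tier: i < 20 ? "frontier" : "local-small",
  }),
);

const frontierShare = tierShare(sample, ["frontier"]);
console.log((frontierShare * 100).toFixed(1) + "%"); // "2.7%"
```

With 750 prompts, a 2.7% frontier share corresponds to roughly 20 prompts, which is what the synthetic sample mimics.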
We ran 63 tasks across 16+ models via OpenRouter. Two AI judges scored every answer independently.
63 tasks across 7 categories. Dual LLM judges (Claude Opus 4.5 + OpenAI o4-mini). Scale: 0 = fail, 1 = acceptable, 2 = good. Cloud scores are upper bounds — local inference may differ (see local benchmark notes below).
Note: GLM-4.7-Flash scores 1.27 cloud but ~0.81 locally with thinking disabled. Gemma4-26B scores 1.64 cloud, local benchmark in progress.
| # | Model | Score | 128 GB | 256 GB | Active Params | Avg Time |
|---|---|---|---|---|---|---|
| 1 | GPT-oss-120B | 1.89 | Yes* | Yes | 5.1B MoE | 102s |
| 2 | Qwen3-235B-MoE | 1.68 | No | Yes | 22B MoE | 159s |
| 3 | MiniMax-M2.5 | 1.68 | No | Yes | 15B MoE | 62s |
| 4 | GLM-4.7-Flash † | 1.64 | Yes | Yes | ~3B MoE | 65s |
| 4 | Gemma4-26B †† | 1.64 | Yes | Yes | 4B MoE | 55s |
| 6 | Qwen3-Coder-Next | 1.62 | Yes | Yes | 3B MoE | 19s |
| 6 | Qwen3-32B | 1.62 | Yes | Yes | 32B dense | 443s |
| 6 | Devstral-2 | 1.62 | Yes* | Yes | 123B dense | 14s |
| 9 | Qwen3.5-35B-A3B | 1.57 | Yes | Yes | 3B MoE | 28s |
| 10 | Qwen3-14B | 1.51 | Yes | Yes | 14B dense | 84s |
| 11 | Nemotron-3-Super | 1.51 | Yes* | Yes | 12B hybrid | 7.5s |
| 12 | DS-R1-32B | 1.35 | Yes | Yes | 32B dense | 82s |
| 13 | Llama 3.3-70B | 1.33 | No | Yes | 70B dense | 66s |
| 14 | Qwen2.5-Coder-32B | 0.33 | Yes | Yes | 32B dense | 5.4s |
* Fits as sole model only. No room for a second model alongside it after macOS overhead (~20 GB).
† GLM-4.7-Flash cloud score is with thinking enabled (~2,886 tokens avg). Local score with thinking disabled (required to avoid a 10-minute TTFT on the M4 Air): ~0.81. With thinking enabled locally on a Mac Studio M5, the estimated score is ~1.22.
†† Gemma4-26B cloud (4B active MoE). Local benchmark in progress on M4 Air. Gemma4-E2B (2B) local benchmark complete.
[Task Definitions] ---> [Runner (OpenRouter / Ollama)] ---> [SQLite DB]
                                                                   |
[Analysis Reports] <--- [Judge (Opus + o4-mini)] <-----------------+
Each phase is a separate CLI command. All state is persisted in SQLite (WAL mode). Runs are idempotent — failed or interrupted runs can be resumed with --resume without re-billing completed tasks.
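The resume behaviour described above can be sketched as "skip any (model, task) pair that already has a stored result". The snippet below is only an illustration of that idea: `RunStore`, `MemoryStore`, `ModelRunner`, and `runBatch` are hypothetical names, and an in-memory map stands in for the SQLite table, whose actual schema is not shown here.

```typescript
// Hypothetical persistence interface; the project's real SQLite schema is not shown.
interface RunStore {
  has(key: string): boolean;
  save(key: string, response: string): void;
}

// In-memory stand-in for the SQLite table, for illustration only.
class MemoryStore implements RunStore {
  private rows = new Map<string, string>();
  has(key: string): boolean {
    return this.rows.has(key);
  }
  save(key: string, response: string): void {
    this.rows.set(key, response);
  }
}

type ModelRunner = (model: string, taskId: string) => Promise<string>;

// Runs every (model, task) pair that has no stored result yet and
// returns how many new model calls were made.
async function runBatch(
  models: string[],
  taskIds: string[],
  store: RunStore,
  callModel: ModelRunner,
): Promise<number> {
  let newCalls = 0;
  for (const model of models) {
    for (const taskId of taskIds) {
      const key = `${model}::${taskId}`;
      if (store.has(key)) continue; // finished in an earlier run: skip, no re-billing
      const response = await callModel(model, taskId);
      store.save(key, response); // persist right away so an interruption loses at most one call
      newCalls++;
    }
  }
  return newCalls;
}

// Usage: the second batch only pays for the task that was not run before.
const demoStore = new MemoryStore();
const fakeRunner: ModelRunner = async (model, taskId) => `${model} answered ${taskId}`;

runBatch(["model-a"], ["t1", "t2"], demoStore, fakeRunner).then((first) =>
  runBatch(["model-a"], ["t1", "t2", "t3"], demoStore, fakeRunner).then((second) =>
    console.log(first, second), // 2 1
  ),
);
```

Saving each result as soon as it arrives is what makes a rerun safe: the loop is the same on a fresh start and on a resume, and only the store contents differ.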
Every model response is scored by two independent judges from different model families:
| Judge | Model | Why |
|---|---|---|
| Judge A | Claude Opus 4.5 (via OpenRouter) | Strong on idiomatic code quality |
| Judge B | OpenAI o4-mini (via OpenRouter) | Reasoning model, strong on correctness |
Scoring uses a 3-point scale across 3 dimensions:
| Score | Correctness | Completeness | Quality |
|---|---|---|---|
| fail (0) | Broken or wrong | Missing key parts | Unusable structure |
| acceptable (1) | Works, some gaps | Most covered | OK but room for improvement |
| good (2) | Correct, edge cases | All addressed | Professional quality |
Judge agreement on this evaluation: 100%. Disagreements >1.5 on any dimension trigger manual review (none were triggered).
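To make the rubric concrete, here is a small sketch of how two judges' scores could be combined and checked against the review threshold. The three dimensions, the 0–2 scale, and the 1.5 threshold come from the text; averaging over both judges and all dimensions is an assumption, and `combinedScore` and `needsManualReview` are illustrative names, not the project's API.

```typescript
type Grade = 0 | 1 | 2; // 0 = fail, 1 = acceptable, 2 = good

interface JudgeScore {
  correctness: Grade;
  completeness: Grade;
  quality: Grade;
}

const DIMENSIONS = ["correctness", "completeness", "quality"] as const;

// Assumed aggregation: mean over both judges and all three dimensions.
function combinedScore(a: JudgeScore, b: JudgeScore): number {
  let total = 0;
  for (const d of DIMENSIONS) total += a[d] + b[d];
  return total / (2 * DIMENSIONS.length);
}

// Dimensions on which the two judges differ by more than the threshold.
function needsManualReview(a: JudgeScore, b: JudgeScore, threshold = 1.5): string[] {
  return DIMENSIONS.filter((d) => Math.abs(a[d] - b[d]) > threshold);
}

const judgeA: JudgeScore = { correctness: 2, completeness: 2, quality: 1 };
const judgeB: JudgeScore = { correctness: 2, completeness: 1, quality: 1 };

console.log(combinedScore(judgeA, judgeB).toFixed(2)); // "1.50"
console.log(needsManualReview(judgeA, judgeB));        // []
```

On an integer 0–2 scale, a gap above 1.5 can only occur when one judge says "fail" and the other says "good" on the same dimension.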
30 tasks across 7 categories, sourced from two approaches:
20 generic tasks (designed to cover coding fundamentals): simple coding (4), refactoring (3), architecture (2), debugging (3), multi-file generation (2), reasoning (1), non-coding (5).
10 real-world tasks (mined from 6,567 actual Claude Code prompts): status reports, error debugging, commit messages, Express refactoring, Swedish email extraction, documentation, deployment debugging, slide generation, MCP tool implementation, fact-checking.
Real-world tasks were QA’d by Codex (GPT-5.4) in an adversarial review pass. 3 tasks were revised based on feedback (ambiguous dates, missing context, over-broad scope).
OpenRouter serves models at provider-chosen precision (typically FP16/BF16). Local inference uses Q4/Q8 quantization. Our three-tier test (Cloud vs Air vs Pi) showed Qwen3-14B quality holds within noise locally (1.48 local vs 1.55 cloud). However, highly sparse MoE models like GLM-4.7-Flash showed more degradation (1.07 local vs 1.70 cloud). All cloud scores should be treated as upper bounds.
Full source code, task definitions, and raw data on GitHub →
GPT-oss-120B is OpenAI’s open-weight MoE model (Apache 2.0). Key properties:
| Property | Value |
|---|---|
| Total parameters | 117B |
| Active parameters per token | 5.1B (MoE routing) |
| Native format | MXFP4 (~80 GB on disk) |
| Apple Silicon | Metal reference implementation from OpenAI |
| Quality score | 1.89 / 2.0 (near-perfect across all categories) |
The MoE architecture means it reads only ~10 GB of weights per token, despite being a 117B model. On a Mac Studio M4 Max (546 GB/s bandwidth), that translates to ~55 tok/s — fast enough for interactive use.
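The arithmetic behind that estimate is a simple bandwidth bound: if each generated token requires streaming the active weights from memory once, tokens per second cannot exceed bandwidth divided by the bytes read per token. A minimal sketch, assuming the ~10 GB per token figure above and ignoring everything except weight reads (`estimateTokensPerSecond` is an illustrative name):

```typescript
// Upper-bound decode speed when every token streams the active weights once:
// tok/s ≈ memory bandwidth / weights read per token.
function estimateTokensPerSecond(bandwidthGBps: number, activeWeightsGB: number): number {
  if (activeWeightsGB <= 0) throw new Error("activeWeightsGB must be positive");
  return bandwidthGBps / activeWeightsGB;
}

const m4MaxTokPerSec = estimateTokensPerSecond(546, 10);
console.log(m4MaxTokPerSec.toFixed(1)); // "54.6"
```

This is a ceiling, not a measurement; real throughput depends on the runtime and has to be confirmed on actual hardware.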
This model has not been validated running locally via Ollama/MLX on Apple Silicon. The MXFP4 format and Metal implementation exist but real-world performance needs Phase B testing on actual hardware. This is the single highest-risk item in our recommendation.
One task. One model. Cloud, laptop, and a Raspberry Pi. Same quality — wildly different speeds.
Task: "Summarize this project's status and prioritize next steps." Best available model per tier.
Same quality everywhere. 20x–50x slower locally — but free and private.
On a Mac Studio (546 GB/s), those 238s become ~60s.
| Platform | RAM | Bandwidth | Power | Noise | Price (SEK) |
|---|---|---|---|---|---|
| Mac Studio M4 Max 128GB | 128 GB unified | 546 GB/s | ~120W | Silent | ~40,000 |
| Mac Studio M3 Ultra 256GB | 256 GB unified | 819 GB/s | ~180W | Silent | ~95,000 |
| RTX 5090 PC | 32 GB VRAM | 1,792 GB/s | ~950W | Very loud | ~54,000 |
| RTX 4090 PC (used) | 24 GB VRAM | 1,008 GB/s | ~750W | Loud | ~36,500 |
| Dual RTX 5090 | 64 GB VRAM | 3,584 GB/s | ~1,500W | Extreme | ~106,000 |
| MacBook Air M4 32GB | 32 GB unified | 120 GB/s | ~15W | Silent | ~20,000 |
| Raspberry Pi 5 8GB | 8 GB | ~30 GB/s | ~10W | Silent | ~1,200 |
Unified memory is the differentiator. NVIDIA GPUs have higher bandwidth but are limited by VRAM — an RTX 5090 with 32 GB cannot run GPT-oss-120B (80 GB). You’d need dual GPUs with PCIe tensor parallelism, which adds complexity, noise, power draw, and costs more than the Mac.
The Mac Studio at 128 GB unified memory fits GPT-oss-120B + GLM-4.7-Flash simultaneously in a box that draws 120W and is silent. An equivalent NVIDIA setup draws 1,500W and sounds like a jet engine.
Mac Studio (8h/day): ~150 SEK/month. RTX 5090 PC (8h/day): ~450 SEK/month. ~10,800 SEK saved over 36 months — that covers 25% of the Mac’s purchase price.
Neither the RTX 4090 nor the RTX 5090 supports NVLink. Multi-GPU model parallelism uses PCIe bandwidth (~25 GB/s practical), which is 24x slower than NVLink. It works for inference but limits throughput and adds latency.
Hardware only makes sense at the right scale. Here's the honest math across three scenarios.
Mac Studio M5 Max* 128 GB — ~40,000 SEK est. — amortized over 36 months + electricity (100 SEK/mo).
⚠ Reality check: Most heavy users are on Claude MAX ($200/mo flat), not pay-per-token. Our billing audit showed $20/mo on OpenRouter — all benchmarking, zero production. Scenario B (subscription) is the relevant comparison for most users.
Subscriptions are flat-rate. Hardware enables a tier downgrade: swap MAX for Pro (~210 SEK/mo) and run the rest locally. Tiers: Pro $20 · MAX 5x $100 · MAX 20x $200 (per person/month)
Note: buying hardware is also a bet that subscription prices will rise over time. The hardware cost is fixed; the MAX price is not.
Usage split: 30% input tokens, 70% output tokens. Hardware depreciation: 36 months. Electricity: ~100 SEK/month. Subscription tiers: Pro $20/mo (~210 SEK), MAX 5x $100/mo (~1,050 SEK), MAX 20x $200/mo (~2,100 SEK).
Token generation is bounded by memory bandwidth. A 128 GB Mac Studio (546 GB/s) running one model 24/7 has a hard cap:
| Model | Est. tok/s | Max tokens/month (24/7) | Quality |
|---|---|---|---|
| Gemma4-E2B | 150 | ~389M | 1.30 / 2.0 |
| GLM-4.7-Flash | 120 | ~311M | 1.01 / 2.0 local |
| Qwen3-Coder-Next | 100 | ~259M | 1.55 / 2.0 |
| Gemma4-26B | 90 | ~233M | 1.64 / 2.0 |
| GPT-oss-120B | 60 | ~155M | 1.89 / 2.0 |
Total throughput is fixed by bandwidth — more concurrent users slice the same pie. The hardware cannot replace all API volume at high usage. It complements cloud by handling sub-tasks cheaply.
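The monthly caps in the table follow directly from the estimated tokens per second. A minimal sketch, assuming a 30-day month (which reproduces the table's figures to within rounding); `maxTokensPerMonth` is an illustrative helper:

```typescript
const SECONDS_PER_MONTH = 30 * 24 * 60 * 60; // 2,592,000 s, assuming a 30-day month

// Hard cap for one model generating continuously, 24/7.
function maxTokensPerMonth(tokensPerSecond: number): number {
  return tokensPerSecond * SECONDS_PER_MONTH;
}

// Estimated tok/s per model, copied from the table above.
const estimatedTokPerSec: Record<string, number> = {
  "Gemma4-E2B": 150,
  "GLM-4.7-Flash": 120,
  "Qwen3-Coder-Next": 100,
  "Gemma4-26B": 90,
  "GPT-oss-120B": 60,
};

for (const [model, tps] of Object.entries(estimatedTokPerSec)) {
  const millions = maxTokensPerMonth(tps) / 1e6;
  console.log(`${model}: ${millions.toFixed(1)}M tokens/month`);
}
```

Because the cap scales linearly with tok/s, several concurrent users share this same total rather than each getting their own.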
Billing reality: Real usage is 2.8B tokens/month via Claude MAX subscription, of which 95.4% are cache reads. Only ~13M output tokens/month are actually generated — local inference has no cache equivalent, so it addresses only the output slice.
For pay-per-token users, break-even depends on which API calls go local. The orchestration benchmark proved local Gemma4-26B replaces Sonnet-quality sub-tasks (1.98/2.0 hybrid vs 1.88/2.0 cloud). So local tokens displace Sonnet pricing ($3/$15 per M tokens = ~120 SEK/M tokens).
| Local tokens replace… | API price (SEK/M tok) | Tokens/month to break even | Generation time/day |
|---|---|---|---|
| Sonnet sub-tasks | ~120 SEK/M | ~10M | ~1 hour |
| GPT-oss-120B | ~10 SEK/M | ~124M | ~16 hours |
| Gemma4-26B cloud | ~3.4 SEK/M | ~361M | Exceeds 24/7 capacity |
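The Sonnet row can be reproduced from numbers stated elsewhere on this page: a 30/70 input/output split, 40,000 SEK of hardware amortized over 36 months plus 100 SEK/month electricity, and an exchange rate of about 10.5 SEK per USD (implied by "$20/mo (~210 SEK)"). The sketch below is one way to get there; whether the table was computed exactly like this is an assumption, and the function names are illustrative.

```typescript
const SEK_PER_USD = 10.5; // implied by "$20/mo (~210 SEK)"; an assumption, not a quoted rate

// Blended API price in SEK per million tokens for a given input/output mix.
function blendedPriceSekPerMTok(inputUsd: number, outputUsd: number, inputShare = 0.3): number {
  const usdPerMTok = inputShare * inputUsd + (1 - inputShare) * outputUsd;
  return usdPerMTok * SEK_PER_USD;
}

// Millions of tokens per month that must move local to cover the hardware cost.
function breakEvenMTokPerMonth(hwCostSekPerMonth: number, apiPriceSekPerMTok: number): number {
  return hwCostSekPerMonth / apiPriceSekPerMTok;
}

const hwCostPerMonth = 40000 / 36 + 100;           // ≈ 1,211 SEK
const sonnetPrice = blendedPriceSekPerMTok(3, 15); // $3 input / $15 output per M tokens
const sonnetBreakEven = breakEvenMTokPerMonth(hwCostPerMonth, sonnetPrice);

console.log(sonnetPrice.toFixed(1));     // "119.7"
console.log(sonnetBreakEven.toFixed(1)); // "10.1"
```

The cheaper the API model being displaced, the more tokens are needed to break even, which is why the last row exceeds what one machine can generate.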
Key insight: For subscription users (most heavy users), the API displacement table above is irrelevant. The real case is subscription displacement: hardware lets you downgrade from MAX ($200/mo) to Pro ($20/mo), saving $180/mo per user. See Scenario B above.
With hardware handling sub-tasks, users can downgrade from MAX to Pro. The saving is fixed — doesn’t depend on token volume, only on quality being good enough (it is: 1.83–1.98/2.0).
| Scenario | Sub saving/month | HW cost/month | Break-even | Saves over 3yr |
|---|---|---|---|---|
| 1 user MAX 5x → Pro | 840 SEK | 1,211 SEK | Never (HW costs more) | −13k SEK |
| 2 users MAX 5x → Pro | 1,680 SEK | 1,211 SEK | ~25 months | +17k SEK |
| 2 users MAX 20x → Pro | 3,780 SEK | 1,211 SEK | ~11 months | +93k SEK |
| 4 users MAX 5x → Pro | 3,360 SEK | 1,211 SEK | ~12 months | +77k SEK |
Not included: time spent on setup/maintenance, opportunity cost of waiting for tasks to complete locally (slower), the value of privacy (hard to quantify), resale value of the hardware after 3 years, and the asymmetric bet that subscription prices rise over time while hardware cost is fixed.
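For readers who want to plug in their own numbers, the subscription math can be sketched as follows. The constants are the page's own estimates (40,000 SEK hardware, 100 SEK/month electricity, 36-month horizon); the break-even definition used here (months until net savings cover the purchase price) and the names `Downgrade`, `Outcome`, and `evaluateDowngrade` are assumptions for illustration.

```typescript
const HARDWARE_SEK = 40000;   // purchase price estimate from the text
const ELECTRICITY_SEK = 100;  // per month
const HORIZON_MONTHS = 36;    // depreciation period

interface Downgrade {
  users: number;
  fromSekPerMonth: number; // current tier price per user
  toSekPerMonth: number;   // target tier price per user
}

interface Outcome {
  savingPerMonth: number;
  breakEvenMonths: number | null; // null = not reached within the horizon
  netOverHorizon: number;
}

function evaluateDowngrade(d: Downgrade): Outcome {
  const savingPerMonth = d.users * (d.fromSekPerMonth - d.toSekPerMonth);
  const netPerMonth = savingPerMonth - ELECTRICITY_SEK;
  const months = netPerMonth > 0 ? HARDWARE_SEK / netPerMonth : Infinity;
  return {
    savingPerMonth,
    breakEvenMonths: months <= HORIZON_MONTHS ? months : null,
    netOverHorizon: netPerMonth * HORIZON_MONTHS - HARDWARE_SEK,
  };
}

// Two users going from MAX 5x (~1,050 SEK) to Pro (~210 SEK).
const twoUsersMax5x = evaluateDowngrade({ users: 2, fromSekPerMonth: 1050, toSekPerMonth: 210 });
console.log(twoUsersMax5x.savingPerMonth);              // 1680
console.log(twoUsersMax5x.breakEvenMonths?.toFixed(1)); // "25.3"
console.log(twoUsersMax5x.netOverHorizon);              // 16880
```

A single user on MAX 5x comes out negative under the same formula, which matches the "Never" row: the monthly saving is smaller than the monthly hardware cost.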
Two options depending on your priorities: maximum quality or maximum headroom.
Mac Studio M5 Max* · 128 GB Unified Memory · ~614 GB/s bandwidth · Silent · ~120W
Best quality, but you're all-in on one model. Swap between models by unloading/loading as needed (Ollama handles this automatically, but incurs a ~5-10s cold start).
Total: ~50 GB loaded · ~58 GB free after OS. Rock solid — no swap risk, handles multiple concurrent users, both models always hot. Slightly lower peak quality (1.62-1.64) but much more practical for daily use and the hackathon booth.
Or forget the trade-offs entirely. Load everything. Run the best model and three more alongside it.
Total: ~230 GB loaded on a Mac Studio M5 Ultra* 256 GB (~95,000 SEK). All four models hot simultaneously. No compromises, no swapping. The everything machine.
Co-op cost at 4 people: ~720 SEK/person/month — still cheaper than one Claude MAX subscription.
* M5 Max / M5 Ultra not yet announced. Specs estimated from Apple Silicon generational improvements. All benchmarks run on M4 equivalent (via OpenRouter). Plan: buy when M5 ships (summer 2026).
Nearly everything. 97.3% of real prompts can run on local models. Only architecture tasks skew frontier-heavy.
Measured from 750 real Claude Code session prompts across 56 projects. Not estimates — actual classification.
78% can run on tiny models (Gemma4-E2B, GLM-4.7-Flash) — not just "local", but fast local.
Architecture is the most frontier-heavy category at 18.4% needing Claude/GPT-4. Everything else is overwhelmingly local.
We built a compound-task benchmark to answer the real question: when Opus delegates sub-tasks, can local models replace Sonnet?
7 compound tasks × 3+ strategies. Each task runs 2–4 sub-tasks then synthesizes a final answer. Judged on the final output only.
Each task decomposes into 2–4 sub-tasks executed by the execution model (Sonnet or local), with outputs injected into the next prompt. A final synthesis call by the orchestrator (Opus or local) assembles the deliverable. Only the final output is judged — same pipeline, same rubric.
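The pipeline described above can be sketched in a few lines. This is not the code in src/orchestrator/, only an illustration under stated assumptions: `ModelCall`, `CompoundTask`, and `runCompoundTask` are hypothetical names, the prompt wording is invented, and every earlier output is passed forward (the text only says outputs are injected into the next prompt).

```typescript
// A model call is abstracted as an async function from prompt to text.
type ModelCall = (prompt: string) => Promise<string>;

interface CompoundTask {
  goal: string;
  subTasks: string[]; // 2-4 prompts, executed in order
}

function formatResults(outputs: string[]): string {
  return outputs.map((o, i) => `Result of step ${i + 1}:\n${o}`).join("\n\n");
}

async function runCompoundTask(
  task: CompoundTask,
  executor: ModelCall,     // Sonnet or a local model
  orchestrator: ModelCall, // Opus or a local model
): Promise<string> {
  const outputs: string[] = [];
  for (const subTask of task.subTasks) {
    // Inject what earlier steps produced into the next sub-task prompt.
    const context = formatResults(outputs);
    const prompt = context ? `${context}\n\nNext step: ${subTask}` : subTask;
    outputs.push(await executor(prompt));
  }
  // Final synthesis call: only this output would be sent to the judges.
  const synthesisPrompt =
    `Goal: ${task.goal}\n\n${formatResults(outputs)}\n\nAssemble the final deliverable.`;
  return orchestrator(synthesisPrompt);
}

// Usage with stub models standing in for real API or Ollama calls.
const stubExecutor: ModelCall = async (prompt) => `done (${prompt.length} chars of prompt)`;
const stubOrchestrator: ModelCall = async (prompt) => `FINAL\n${prompt}`;

runCompoundTask(
  { goal: "Port and validate a function", subTasks: ["Read the source", "Write the port"] },
  stubExecutor,
  stubOrchestrator,
).then((finalAnswer) => console.log(finalAnswer));
```

Swapping a strategy then amounts to passing different functions for `executor` and `orchestrator` while the task and the judging stay the same.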
Only weak spot: ct-006 Code Port + Validation scored 1.33 fully-local vs 1.83 cloud-only. Equivalence checking is where the gap shows. All other 6 tasks: 1.83–2.00.
Source: src/orchestrator/ (strategies.ts, executor.ts, index.ts). Runs via tsx src/orchestrator/index.ts --batch orch-v1
What's done, what's not, and what we're honest about not knowing yet.
GPT-oss-120B can be run through Ollama (ollama pull gpt-oss:120b), llama.cpp (native MXFP4), and MLX (via OpenHarmony-MLX). The model needs ~80 GB RAM, leaving ~28 GB free on a 128 GB Mac Studio, which is tight but functional as a solo model.
Don't trust our benchmarks? Good. Add yours and we'll run them.
The framework is open-source. Fork it, add a task definition, open a PR. We'll run it against all 12 models and publish the results here.
Add your task to src/tasks/real-world.ts (or create a new category file).
Follow the TaskDefinition interface:
{
  id: 'community-001',
  category: 'debugging',   // or: simple-coding, refactoring, architecture,
                           //     multi-file, reasoning, non-coding
  title: 'Your Task Title',
  difficulty: 3,           // 1-5
  maxTokens: 2000,
  tags: ['your', 'tags'],
  prompt: `Your full prompt here. Be specific.
Include any code snippets or context the model needs.`,
  expectedCapabilities: [
    'what a good answer should include',
    'another expected capability',
  ],
}
Run npx vitest run tests/tasks.test.ts to validate.
From data to hardware. Three phases, six months, one machine.
Done: 16 models × 63 tasks via OpenRouter. Local benchmarks on M4 Air (Gemma4-E2B, GLM-4.7-Flash, Gemma4-26B). 750 real prompts classified: 97.3% offloadable. Orchestration benchmark built and cloud baseline established: 1.88/2.0, $0.21/task.
In Progress: Run hybrid orchestration strategies (Opus→local Gemma4-26B, Opus→Qwen3.5). If hybrid scores ≥1.58/2.0, the hardware is justified. Also: aider + Ollama agentic testing on Pi, quantization comparison (Q4 vs Q8), concurrent load tests. Finalize co-op group.
Planned: Mac Studio M5 Max 128 GB drops (expected). Purchase as co-op. Deploy GPT-oss-120B + GLM-4.7-Flash. Set up Ollama as always-on inference server. Host first hackathon booth: deck chairs, fika, and local AI in the garden.
300 SEK/month. Unlimited local inference. Full privacy. No rate limits. 5 concurrent users. Silent. Always on.
* Mac Studio M5 Max: expected release summer 2026. Benchmarks based on M4 Max specs; M5 likely ~12% faster.