LOCAL INFERENCE MACHINE

Why pay for tokens when you can own the factory?

You're not asking "can a local model answer my questions?" You're asking: can it handle the sub-tasks that Opus delegates when you say "do the analysis"? Those sub-tasks are where 90% of your tokens actually go.

We classified 750 real prompts from 56 Claude Code projects, ran 63 tasks across 16 models on three hardware tiers, and built an orchestration benchmark to find out.

16 models tested · 63 tasks · 97.3% of prompts offloadable · 3 hardware tiers

97.3% Offloadable

We analyzed 750 real prompts from 56 Claude Code projects. Only 2.7% genuinely need frontier models.

750 prompts · 56 projects · classified by category, complexity, and minimum model tier.

97.3%: Runs on Local Models (≤ local-large tier, e.g. Gemma4-26B, GLM-4.7-Flash)
78.0%: Runs on Edge Tier (Gemma4-E2B or GLM-4.7-Flash quality suffices)
2.7%: Needs Frontier (Claude / GPT-4 class; mostly architecture tasks)

Top Prompt Categories

Coding: 30.8%
Questions / research: 23.3%
Communication / writing: 13.5%
Architecture: most frontier-heavy category (18.4% needs frontier)

The Hidden Multiplier

User prompts are the tip of the iceberg. When you say "do the analysis", the orchestrator (Claude Code / Hugin) spawns dozens of sub-tasks: write this script, read these files, summarize that diff, run these tests.

Those sub-tasks are where ~90% of tokens are spent. The real question is not "can local models answer my questions" but "can local models handle what Opus delegates?"

750 prompts classified using Qwen2.5-7B-Instruct via OpenRouter. Schema: category, complexity, min_model_tier (edge/local-small/local-large/frontier). Source: actual Claude Code session transcripts.
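
The schema above can be sketched as a small type plus a tier-share calculation. This is an illustration only: the type names, the `shareAtOrBelow` helper, the numeric `complexity` field, and the four sample records are assumptions made for the example, not the project's actual code or data.

```typescript
// Tier names come from the schema above; everything else is illustrative.
type ModelTier = "edge" | "local-small" | "local-large" | "frontier";

interface ClassifiedPrompt {
  category: string;
  complexity: number; // assumed numeric; the source does not specify the scale
  min_model_tier: ModelTier;
}

// Tiers ordered from cheapest to most capable.
const TIER_ORDER: ModelTier[] = ["edge", "local-small", "local-large", "frontier"];

// Fraction of prompts whose minimum tier is at or below `ceiling`.
function shareAtOrBelow(prompts: ClassifiedPrompt[], ceiling: ModelTier): number {
  if (prompts.length === 0) return 0;
  const limit = TIER_ORDER.indexOf(ceiling);
  const hits = prompts.filter((p) => TIER_ORDER.indexOf(p.min_model_tier) <= limit);
  return hits.length / prompts.length;
}

// Toy sample (not real data) showing the calculation.
const samplePrompts: ClassifiedPrompt[] = [
  { category: "coding", complexity: 2, min_model_tier: "edge" },
  { category: "questions", complexity: 1, min_model_tier: "local-small" },
  { category: "writing", complexity: 3, min_model_tier: "local-large" },
  { category: "architecture", complexity: 5, min_model_tier: "frontier" },
];

console.log(shareAtOrBelow(samplePrompts, "local-large")); // 0.75
console.log(shareAtOrBelow(samplePrompts, "local-small")); // 0.5
```

Applied to the real 750 classifications, the same calculation with a `local-large` ceiling would correspond to the 97.3% figure, and a `local-small` ceiling to the 78% figure.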

The Leaderboard

We ran 63 tasks across 16+ models via OpenRouter. Two AI judges scored every answer independently.

63 tasks across 7 categories. Dual LLM judges (Claude Opus 4.5 + OpenAI o4-mini). Scale: 0 = fail, 1 = acceptable, 2 = good. Cloud scores are upper bounds — local inference may differ (see local benchmark notes below).

Note: GLM-4.7-Flash scores 1.64 in the cloud (thinking enabled) but ~0.81 locally with thinking disabled. Gemma4-26B scores 1.64 in the cloud; its local benchmark is in progress.

# | Model | Score | 128 GB | 256 GB | Active Params | Avg Time
1 | GPT-oss-120B | 1.89 | Yes* | Yes | 5.1B MoE | 102s
2 | Qwen3-235B-MoE | 1.68 | No | Yes | 22B MoE | 159s
3 | MiniMax-M2.5 | 1.68 | No | Yes | 15B MoE | 62s
4 | GLM-4.7-Flash † | 1.64 | Yes | Yes | ~3B MoE | 65s
4 | Gemma4-26B †† | 1.64 | Yes | Yes | 4B MoE | 55s
6 | Qwen3-Coder-Next | 1.62 | Yes | Yes | 3B MoE | 19s
6 | Qwen3-32B | 1.62 | Yes | Yes | 32B dense | 443s
6 | Devstral-2 | 1.62 | Yes* | Yes | 123B dense | 14s
9 | Qwen3.5-35B-A3B | 1.57 | Yes | Yes | 3B MoE | 28s
10 | Qwen3-14B | 1.51 | Yes | Yes | 14B dense | 84s
11 | Nemotron-3-Super | 1.51 | Yes* | Yes | 12B hybrid | 7.5s
12 | DS-R1-32B | 1.35 | Yes | Yes | 32B dense | 82s
13 | Llama 3.3-70B | 1.33 | No | Yes | 70B dense | 66s
14 | Qwen2.5-Coder-32B | 0.33 | Yes | Yes | 32B dense | 5.4s

* Fits as sole model only. No room for a second model alongside it after macOS overhead (~20 GB).
† GLM-4.7-Flash cloud score is with thinking enabled (~2,886 tokens avg). Local score with thinking disabled (required to avoid 10-minute TTFT on M4 Air): ~0.81. Thinking-enabled locally on Mac Studio M5 would score ~1.22 est.
†† Gemma4-26B cloud (4B active MoE). Local benchmark in progress on M4 Air. Gemma4-E2B (2B) local benchmark complete.

Dual LLM Judge: Every response is judged independently by Claude Opus 4.5 and OpenAI o4-mini. Using two model families reduces systematic bias. Disagreements >1.5 trigger manual review.

63 Tasks: 7 categories: simple-coding, refactoring, architecture, debugging, multi-file, reasoning, non-coding. Includes Grimnir-specific tasks (librarian, delegated) and real-world tasks mined from actual Claude Code sessions. Compound tasks (ct-001 to ct-007) run separately via the orchestration benchmark.

Coding + Non-Coding: Not just HumanEval. Tasks include Swedish email extraction, MCP tool generation, Fortnox API refactoring, project status reports, and slide content generation.

Real-World Prompts: 10 tasks derived from analyzing the user's actual prompt history: commit messages (8% of prompts), debugging (5%), documentation (8%), status reports (7%), Swedish tasks (5%).

100% Judge Agreement: Opus and o4-mini agreed on the score for every single evaluation in this batch. This is unusually high, likely because the 3-point scale (fail/acceptable/good) is coarser than typical 1-5 scales.

Technical Deep Dive: Evaluation Methodology

Pipeline Architecture

[Task Definitions] ---> [Runner (OpenRouter / Ollama)] ---> [SQLite DB]
                                                                    |
[Analysis Reports] <--- [Judge (Opus + o4-mini)] <-----------------+

Each phase is a separate CLI command. All state is persisted in SQLite (WAL mode). Runs are idempotent — failed or interrupted runs can be resumed with --resume without re-billing completed tasks.
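
A minimal sketch of how resumable, idempotent runs can work, assuming results are keyed by model and task. The `ResultStore` class, `runBatch` function, and the in-memory `Map` (standing in for the SQLite table) are hypothetical names for illustration; the project's actual runner may be structured differently.

```typescript
// Hypothetical sketch of an idempotent, resumable run loop.
// A Map stands in for the SQLite results table; key = model + task id.
type RunResult = { model: string; taskId: string; output: string };

class ResultStore {
  private rows = new Map<string, RunResult>();
  private key(model: string, taskId: string): string {
    return `${model}::${taskId}`;
  }
  has(model: string, taskId: string): boolean {
    return this.rows.has(this.key(model, taskId));
  }
  save(r: RunResult): void {
    this.rows.set(this.key(r.model, r.taskId), r);
  }
  count(): number {
    return this.rows.size;
  }
}

// Runs every (model, task) pair that has no stored result yet and returns
// how many calls were actually made. Synchronous for brevity; a real runner
// would await a network call here.
function runBatch(
  store: ResultStore,
  models: string[],
  taskIds: string[],
  callModel: (model: string, taskId: string) => string,
): number {
  let calls = 0;
  for (const model of models) {
    for (const taskId of taskIds) {
      if (store.has(model, taskId)) continue; // completed earlier: skip, no re-billing
      const output = callModel(model, taskId);
      store.save({ model, taskId, output }); // persist before moving on
      calls++;
    }
  }
  return calls;
}

const demoStore = new ResultStore();
const fakeCall = (model: string, taskId: string): string => `${model} answered ${taskId}`;
const firstPass = runBatch(demoStore, ["model-a"], ["t1", "t2"], fakeCall);
const resumedPass = runBatch(demoStore, ["model-a"], ["t1", "t2", "t3"], fakeCall);
console.log(firstPass, resumedPass); // 2 1
```

Saving each result immediately after the call is what makes an interrupted batch safe to restart: the second pass only pays for the one task that was missing.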

Judging Protocol

Every model response is scored by two independent judges from different model families:

Judge | Model | Why
Judge A | Claude Opus 4.5 (via OpenRouter) | Strong on idiomatic code quality
Judge B | OpenAI o4-mini (via OpenRouter) | Reasoning model, strong on correctness

Scoring uses a 3-point scale across 3 dimensions:

Score | Correctness | Completeness | Quality
fail (0) | Broken or wrong | Missing key parts | Unusable structure
acceptable (1) | Works, some gaps | Most covered | OK but room for improvement
good (2) | Correct, edge cases | All addressed | Professional quality

Judge agreement on this evaluation: 100%. Disagreements >1.5 on any dimension trigger manual review (none were triggered).
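
The protocol above can be sketched as a function that combines two judges' scores and flags large disagreements. The source does not say how per-dimension scores are aggregated, so the plain mean used here is an assumption, and `combineJudges` and its types are hypothetical names.

```typescript
// Scores use the 0-2 scale from the rubric, on three dimensions.
type Dimension = "correctness" | "completeness" | "quality";
type JudgeScores = Record<Dimension, number>;

const DIMENSIONS: Dimension[] = ["correctness", "completeness", "quality"];

interface CombinedVerdict {
  score: number; // assumed aggregation: mean over both judges and all dimensions
  needsReview: boolean; // true if any dimension differs by more than maxGap
}

function combineJudges(a: JudgeScores, b: JudgeScores, maxGap = 1.5): CombinedVerdict {
  let total = 0;
  let needsReview = false;
  for (const d of DIMENSIONS) {
    total += a[d] + b[d];
    if (Math.abs(a[d] - b[d]) > maxGap) needsReview = true;
  }
  return { score: total / (2 * DIMENSIONS.length), needsReview };
}

const agreeing = combineJudges(
  { correctness: 2, completeness: 2, quality: 1 },
  { correctness: 2, completeness: 1, quality: 1 },
);
console.log(agreeing); // { score: 1.5, needsReview: false }
```

On a 0-2 scale, a gap above 1.5 can only occur when one judge says fail and the other says good, so the review trigger fires only on direct contradictions.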

Task Design

30 tasks across 7 categories, sourced from two approaches:

20 generic tasks (designed to cover coding fundamentals): simple coding (4), refactoring (3), architecture (2), debugging (3), multi-file generation (2), reasoning (1), non-coding (5).

10 real-world tasks (mined from 6,567 actual Claude Code prompts): status reports, error debugging, commit messages, Express refactoring, Swedish email extraction, documentation, deployment debugging, slide generation, MCP tool implementation, fact-checking.

Real-world tasks were QA’d by Codex (GPT-5.4) in an adversarial review pass. 3 tasks were revised based on feedback (ambiguous dates, missing context, over-broad scope).

Proxy Validity

OpenRouter serves models at provider-chosen precision (typically FP16/BF16). Local inference uses Q4/Q8 quantization. Our three-tier test (Cloud vs Air vs Pi) showed Qwen3-14B quality holds within noise locally (1.48 local vs 1.55 cloud). However, highly sparse MoE models like GLM-4.7-Flash showed more degradation (1.07 local vs 1.70 cloud). All cloud scores should be treated as upper bounds.

Full source code, task definitions, and raw data on GitHub →

Technical Deep Dive: Why GPT-oss-120B Wins

GPT-oss-120B is OpenAI’s open-weight MoE model (Apache 2.0). Key properties:

Property | Value
Total parameters | 117B
Active parameters per token | 5.1B (MoE routing)
Native format | MXFP4 (~80 GB on disk)
Apple Silicon | Metal reference implementation from OpenAI
Quality score | 1.89 / 2.0 (near-perfect across all categories)

The MoE architecture means it reads only ~10 GB of weights per token, despite being a 117B model. On a Mac Studio M4 Max (546 GB/s bandwidth), that translates to ~55 tok/s — fast enough for interactive use.
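
The estimate above is a bandwidth-bound calculation: tokens per second is roughly memory bandwidth divided by the weight bytes read per token. A minimal sketch, using the two figures from the text (the function name is illustrative):

```typescript
// Decode speed is roughly memory bandwidth divided by the bytes read per token.
// This ignores compute, KV-cache reads, and runtime overhead, so treat it as a ceiling.
function estimateTokensPerSecond(bandwidthGBps: number, weightsReadPerTokenGB: number): number {
  if (weightsReadPerTokenGB <= 0) {
    throw new RangeError("weightsReadPerTokenGB must be positive");
  }
  return bandwidthGBps / weightsReadPerTokenGB;
}

// Figures from the text: 546 GB/s bandwidth, ~10 GB of weights read per token.
const m4MaxEstimate = estimateTokensPerSecond(546, 10);
console.log(m4MaxEstimate.toFixed(1)); // "54.6"
```

Because the result is an upper bound, measured throughput on real hardware would be expected to land below it.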

Caveat

We have not yet validated this model running locally via Ollama/MLX on Apple Silicon. The MXFP4 format and Metal implementation exist, but real-world performance needs Phase B testing on actual hardware. This is the single highest-risk item in our recommendation.

The Speed Test

One task, the best available model on each tier: cloud, laptop, and a Raspberry Pi. Same quality, wildly different speeds.

Task: "Summarize this project's status and prioritize next steps." Best available model per tier.

Tier | Hardware | Model | Time | Score | Throughput | Cost
OpenRouter | Cloud API | Qwen3-14B | 12s | 2.0 / 2.0 | - | ~$0.001/task
MacBook Air M4 | 32 GB · 120 GB/s | Qwen3-14B | 238s | 2.0 / 2.0 | 6.5 tok/s | Free
Raspberry Pi 5 | 8 GB · CPU only | Qwen2.5-7B | 559s | 2.0 / 2.0 | 0.6 tok/s | Free

Same quality everywhere. 20x–50x slower locally — but free and private.
On a Mac Studio (546 GB/s), those 238s become ~60s.

Technical Deep Dive: Hardware Comparison Matrix

Platforms Evaluated

Platform | RAM | Bandwidth | Power | Noise | Price (SEK)
Mac Studio M4 Max 128GB | 128 GB unified | 546 GB/s | ~120W | Silent | ~40,000
Mac Studio M3 Ultra 256GB | 256 GB unified | 819 GB/s | ~180W | Silent | ~95,000
RTX 5090 PC | 32 GB VRAM | 1,792 GB/s | ~950W | Very loud | ~54,000
RTX 4090 PC (used) | 24 GB VRAM | 1,008 GB/s | ~750W | Loud | ~36,500
Dual RTX 5090 | 64 GB VRAM | 3,584 GB/s | ~1,500W | Extreme | ~106,000
MacBook Air M4 32GB | 32 GB unified | 120 GB/s | ~15W | Silent | ~20,000
Raspberry Pi 5 8GB | 8 GB | ~30 GB/s | ~10W | Silent | ~1,200

Why Mac Studio over NVIDIA?

Unified memory is the differentiator. NVIDIA GPUs have higher bandwidth but are limited by VRAM — an RTX 5090 with 32 GB cannot run GPT-oss-120B (80 GB). You’d need dual GPUs with PCIe tensor parallelism, which adds complexity, noise, power draw, and costs more than the Mac.

The Mac Studio at 128 GB unified memory fits GPT-oss-120B + GLM-4.7-Flash simultaneously in a box that draws 120W and is silent. An equivalent NVIDIA setup draws 1,500W and sounds like a jet engine.

Over 3 years: electricity savings

Mac Studio (8h/day): ~150 SEK/month. RTX 5090 PC (8h/day): ~450 SEK/month. That is ~300 SEK/month, or ~10,800 SEK saved over 36 months, which covers about 27% of the Mac’s purchase price.

RTX 4090/5090: No NVLink

Neither the RTX 4090 nor the RTX 5090 supports NVLink. Multi-GPU model parallelism uses PCIe bandwidth (~25 GB/s practical), which is 24x slower than NVLink. It works for inference but limits throughput and adds latency.

What It Costs

Hardware only makes sense at the right scale. Here's the honest math across three scenarios.

Mac Studio M5 Max* 128 GB — ~40,000 SEK est. — amortized over 36 months + electricity (100 SEK/mo).

Scenario A — vs API (pay-per-token)

Reality check: Most heavy users are on Claude MAX ($200/mo flat), not pay-per-token. Our billing audit showed $20/mo on OpenRouter — all benchmarking, zero production. Scenario B (subscription) is the relevant comparison for most users.

Usage | API cost | Hardware
Light use (~10M tok/mo) | SEK 98/mo | never breaks even
Moderate (~100M tok/mo) | SEK 977/mo | never breaks even
Heavy (~1B tok/mo) | SEK 9,765/mo | break-even: 5 months

Scenario B — vs Subscription (Claude MAX)

Subscriptions are flat-rate. Hardware enables a tier downgrade: swap MAX for Pro (~210 SEK/mo) and run the rest locally. Tiers: Pro $20 · MAX 5x $100 · MAX 20x $200 (per person/month)

Scenario | Hardware | Hardware cost | Result
1 user, MAX 5x → Pro | 128 GB | SEK 1,211/mo | Sub saving: 840 SEK/mo; never breaks even
2 users, MAX 5x → Pro each | 128 GB | SEK 606/person/mo | Break-even: ~25 months
2 users, MAX 20x → Pro each | 128 GB | SEK 606/person/mo | Break-even: ~11 months; saves 93k SEK over 3 years

Note: buying hardware is also a bet that subscription prices will rise over time. Hardware cost is fixed; MAX pricing is not.

Technical Deep Dive: Full Cost Model

Assumptions

Usage split: 30% input tokens, 70% output tokens. Hardware depreciation: 36 months. Electricity: ~100 SEK/month. Subscription tiers: Pro $20/mo (~210 SEK), MAX 5x $100/mo (~1,050 SEK), MAX 20x $200/mo (~2,100 SEK).

Throughput ceiling — what the hardware can actually produce

Token generation is bounded by memory bandwidth. A 128 GB Mac Studio (546 GB/s) running one model 24/7 has a hard cap:

Model | Est. tok/s | Max tokens/month (24/7) | Quality
Gemma4-E2B | 150 | ~389M | 1.30 / 2.0
GLM-4.7-Flash | 120 | ~311M | 1.01 / 2.0 local
Qwen3-Coder-Next | 100 | ~259M | 1.55 / 2.0
Gemma4-26B | 90 | ~233M | 1.64 / 2.0
GPT-oss-120B | 60 | ~155M | 1.89 / 2.0

Total throughput is fixed by bandwidth — more concurrent users slice the same pie. The hardware cannot replace all API volume at high usage. It complements cloud by handling sub-tasks cheaply.
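
The monthly ceiling follows directly from tokens per second times seconds in a month. A small sketch, assuming a 30-day month (the source does not state the month length) and using the estimated speeds from the throughput table; the names are illustrative:

```typescript
const SECONDS_PER_DAY = 86400;

// Maximum tokens a single model can emit in a month when running 24/7.
// Assumes a 30-day month by default.
function maxTokensPerMonth(tokensPerSecond: number, days = 30): number {
  return tokensPerSecond * SECONDS_PER_DAY * days;
}

// Estimated speeds from the throughput table.
const estSpeeds: Record<string, number> = {
  "Gemma4-E2B": 150,
  "GLM-4.7-Flash": 120,
  "Qwen3-Coder-Next": 100,
  "Gemma4-26B": 90,
  "GPT-oss-120B": 60,
};

for (const [model, tps] of Object.entries(estSpeeds)) {
  const millions = maxTokensPerMonth(tps) / 1e6;
  console.log(`${model}: ~${millions.toFixed(1)}M tokens/month`);
}
```

The cap applies to the machine, not per user: concurrent sessions share the same total.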

What the hardware actually replaces (the real break-even)

Billing reality: Real usage is 2.8B tokens/month via Claude MAX subscription, of which 95.4% are cache reads. Only ~13M output tokens/month are actually generated — local inference has no cache equivalent, so it addresses only the output slice.

For pay-per-token users, break-even depends on which API calls go local. The orchestration benchmark proved local Gemma4-26B replaces Sonnet-quality sub-tasks (1.98/2.0 hybrid vs 1.88/2.0 cloud). So local tokens displace Sonnet pricing ($3/$15 per M tokens = ~120 SEK/M tokens).

Local tokens replace… | API price (SEK/M tok) | Tokens/month to break even | Generation time/day
Sonnet sub-tasks | ~120 SEK/M | ~10M | ~1 hour
GPT-oss-120B | ~10 SEK/M | ~124M | ~16 hours
Gemma4-26B cloud | ~3.4 SEK/M | ~361M | Exceeds 24/7 capacity
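
The Sonnet row can be reproduced from the stated assumptions: a 30/70 input/output split, $3/$15 per M tokens, and a monthly hardware cost of 1,211 SEK. The exchange rate of 10.5 SEK per USD is inferred from "$20 ≈ 210 SEK" and is an assumption, as are the function names.

```typescript
// Exchange rate inferred from the subscription tiers ($20 ≈ 210 SEK); an assumption.
const SEK_PER_USD = 10.5;

// Blended price per million tokens given an input/output usage split.
function blendedPriceSekPerM(inputUsd: number, outputUsd: number, inputShare = 0.3): number {
  const usd = inputShare * inputUsd + (1 - inputShare) * outputUsd;
  return usd * SEK_PER_USD;
}

// Monthly token volume (in millions) at which displaced API spend equals hardware cost.
function breakEvenMillionTokens(hwCostSekPerMonth: number, priceSekPerM: number): number {
  if (priceSekPerM <= 0) throw new RangeError("priceSekPerM must be positive");
  return hwCostSekPerMonth / priceSekPerM;
}

const sonnetPrice = blendedPriceSekPerM(3, 15); // ≈ 119.7 SEK per M tokens
const sonnetBreakEven = breakEvenMillionTokens(1211, sonnetPrice); // ≈ 10.1M tokens/month
console.log(sonnetPrice.toFixed(1), sonnetBreakEven.toFixed(1));
```

The cheaper the API being displaced, the more tokens the hardware must generate to pay for itself, which is why the lowest-priced row exceeds what the machine can produce.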

Key insight: For subscription users (most heavy users), the API displacement table above is irrelevant. The real case is subscription displacement: hardware lets you downgrade from MAX ($200/mo) to Pro ($20/mo), saving $180/mo per user. See Scenario B above.

Subscription comparison (MAX → Pro + local)

With hardware handling sub-tasks, users can downgrade from MAX to Pro. The saving is fixed — doesn’t depend on token volume, only on quality being good enough (it is: 1.83–1.98/2.0).

Scenario | Sub saving/month | HW cost/month | Break-even | Saves over 3yr
1 user MAX 5x → Pro | 840 SEK | 1,211 SEK | Never (HW costs more) | −13k SEK
2 users MAX 5x → Pro | 1,680 SEK | 1,211 SEK | ~25 months | +17k SEK
2 users MAX 20x → Pro | 3,780 SEK | 1,211 SEK | ~11 months | +93k SEK
4 users MAX 5x → Pro | 3,360 SEK | 1,211 SEK | ~12 months | +77k SEK
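
A sketch of the arithmetic behind the subscription comparison, using the stated inputs: 40,000 SEK hardware, 100 SEK/month electricity, and a 36-month horizon. Treating "never" as "not within the 36-month horizon" is this example's reading of the cost model, and the names are illustrative.

```typescript
// Inputs from the cost model: 40,000 SEK hardware, 100 SEK/month electricity, 36 months.
interface CoopScenario {
  users: number;
  savingPerUserSek: number; // monthly subscription saving per user
}

const HW_PRICE_SEK = 40000;
const ELECTRICITY_SEK = 100;
const HORIZON_MONTHS = 36;

// Months until cumulative net savings cover the purchase price.
// Returns Infinity when that does not happen within the horizon.
function breakEvenMonths(s: CoopScenario): number {
  const netPerMonth = s.users * s.savingPerUserSek - ELECTRICITY_SEK;
  if (netPerMonth <= 0) return Infinity;
  const months = HW_PRICE_SEK / netPerMonth;
  return months <= HORIZON_MONTHS ? months : Infinity;
}

// Total saving over the horizon after hardware and electricity.
function netSavingOverHorizon(s: CoopScenario): number {
  const subscriptionSaving = s.users * s.savingPerUserSek * HORIZON_MONTHS;
  return subscriptionSaving - HW_PRICE_SEK - ELECTRICITY_SEK * HORIZON_MONTHS;
}

const twoUsersMax5x: CoopScenario = { users: 2, savingPerUserSek: 840 };
const twoUsersMax20x: CoopScenario = { users: 2, savingPerUserSek: 1890 };

console.log(breakEvenMonths(twoUsersMax5x).toFixed(1)); // "25.3"
console.log(breakEvenMonths(twoUsersMax20x).toFixed(1)); // "10.9"
console.log(netSavingOverHorizon(twoUsersMax20x)); // 92480
```

The saving depends only on head count and tier, not on token volume, so adding co-op members shortens the break-even directly.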

What’s NOT in the cost model

Not included: time spent on setup/maintenance, opportunity cost of waiting for tasks to complete locally (slower), the value of privacy (hard to quantify), resale value of the hardware after 3 years, and the asymmetric bet that subscription prices rise over time while hardware cost is fixed.

The Setup

Two options depending on your priorities: maximum quality or maximum headroom.

Mac Studio M5 Max* · 128 GB Unified Memory · ~614 GB/s bandwidth · Silent · ~120W

Option A: Maximum Quality (solo model)

Single Model — Best Score in the Evaluation
GPT-oss-120B
RAM: ~80 GB (native MXFP4)
Active: 5.1B params (MoE)
Score: 1.89 / 2.0 — near-perfect
Est. speed: ~55 tok/s
Concurrent users: 5
Free after OS: ~28 GB — works but no room for a second model

Best quality, but you're all-in on one model. Swap between models by unloading/loading as needed (Ollama handles this automatically, but incurs a ~5-10s cold start).

Option B: Dual-Model Setup (recommended for co-op)

Quality — Harder Tasks
Qwen3-Coder-Next
RAM: ~46 GB
Active: 3B params (MoE)
Score: 1.62 / 2.0
Est. speed: ~90 tok/s
Concurrent users: 9
Utility — Fast Tasks
GLM-4.7-Flash
RAM: ~4 GB
Active: ~3B params (MoE)
Score: 1.64 / 2.0
Est. speed: ~90 tok/s
Concurrent users: 9

Total: ~50 GB loaded · ~58 GB free after OS. Rock solid — no swap risk, handles multiple concurrent users, both models always hot. Slightly lower peak quality (1.62-1.64) but much more practical for daily use and the hackathon booth.
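
The headroom figures for Options A and B come from a simple memory budget: total RAM minus OS overhead minus loaded models. A minimal sketch, using the ~20 GB macOS overhead quoted in the leaderboard footnote; it does not account for KV cache or context memory, and the names are illustrative.

```typescript
interface LoadedModel {
  label: string;
  ramGB: number;
}

// Free memory after OS overhead and loaded model weights.
// KV cache and context buffers are not modelled here.
function freeMemoryGB(totalGB: number, models: LoadedModel[], osOverheadGB = 20): number {
  const loaded = models.reduce((sum, m) => sum + m.ramGB, 0);
  return totalGB - osOverheadGB - loaded;
}

const optionA: LoadedModel[] = [{ label: "GPT-oss-120B", ramGB: 80 }];
const optionB: LoadedModel[] = [
  { label: "Qwen3-Coder-Next", ramGB: 46 },
  { label: "GLM-4.7-Flash", ramGB: 4 },
];

console.log(freeMemoryGB(128, optionA)); // 28
console.log(freeMemoryGB(128, optionB)); // 58
```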

Option C: The 256 GB Beast

Or forget the trade-offs entirely. Load everything. Run the best model and three more alongside it.

Flagship — Near-Perfect Quality
GPT-oss-120B
RAM: ~80 GB
Score: 1.89 / 2.0
Est. speed: ~70 tok/s (819 GB/s on Ultra)
Coding — Ultra-Fast MoE
Qwen3-Coder-Next
RAM: ~46 GB
Score: 1.62 / 2.0
Est. speed: ~130 tok/s
Utility — Quick Tasks
GLM-4.7-Flash
RAM: ~4 GB
Score: 1.64 / 2.0
Est. speed: ~130 tok/s
Reasoning — 256 GB Exclusive
MiniMax-M2.5
RAM: ~100 GB
Score: 1.68 / 2.0
Only runs on 256 GB

Total: ~230 GB loaded on a Mac Studio M5 Ultra* 256 GB (~95,000 SEK). All four models hot simultaneously. No compromises, no swapping. The everything machine.
Co-op cost at 4 people: ~720 SEK/person/month — still cheaper than one Claude MAX subscription.

* M5 Max / M5 Ultra not yet announced. Specs estimated from Apple Silicon generational improvements. All benchmarks run on M4 equivalent (via OpenRouter). Plan: buy when M5 ships (summer 2026).

What Moves Local?

Nearly everything. 97.3% of real prompts can run on local models. Only architecture tasks skew frontier-heavy.

Measured from 750 real Claude Code session prompts across 56 projects. Not estimates — actual classification.

Local: 97.3% (of which edge / local-small: 78%) · Frontier: 2.7%

78% can run on tiny models (Gemma4-E2B, GLM-4.7-Flash) — not just "local", but fast local.

Runs Locally (97.3%)

  • Coding (30.8% of prompts)
  • Questions / research (23.3%)
  • Communication / writing (13.5%)
  • Refactoring, debugging, docs: 2.0
  • Commit messages, status reports: 2.0
  • Simple coding, Swedish text: 1.5–1.7

Stays Cloud (2.7%)

  • Complex architecture (18.4% frontier): cloud
  • Multi-step strategic planning: varies
  • Research requiring web search: N/A
  • Long context (>128K tokens): N/A

Architecture is the most frontier-heavy category at 18.4% needing Claude/GPT-4. Everything else is overwhelmingly local.

Orchestration Benchmark

We built a compound-task benchmark to answer the real question: when Opus delegates sub-tasks, can local models replace Sonnet?

7 compound tasks × 3+ strategies. Each task runs 2–4 sub-tasks then synthesizes a final answer. Judged on the final output only.

Strategy | Setup | Score | Cost | Status
A: Cloud-Only (baseline) | Opus orchestrates Sonnet | 1.88 / 2.0 | $0.21/task | Baseline complete
B: Hybrid (winner) | Opus orchestrates local Gemma4-26B | 1.98 / 2.0 (beats cloud) | $0.15/task (orchestrator only, 29% cheaper) | Complete, 7/7 tasks
C: Fully Local (zero cost) | Gemma4-26B orchestrates & executes | 1.83 / 2.0 (−0.05 vs cloud) | $0.00/task | Complete, 7/7 tasks

7 Compound Tasks

ct-001 — Feature + Tests + Review
ct-002 — Research & Recommend
ct-003 — Debug + Fix + Regression
ct-004 — PR Review + Changelog
ct-005 — Architecture Assessment
ct-006 — Code Port + Validation
ct-007 — Daily Briefing

Each task decomposes into 2–4 sub-tasks executed by the execution model (Sonnet or local), with outputs injected into the next prompt. A final synthesis call by the orchestrator (Opus or local) assembles the deliverable. Only the final output is judged — same pipeline, same rubric.
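
The pipeline described above can be sketched as a chain of executor calls followed by one orchestrator call. This is not the code in src/orchestrator/; `runCompoundTask`, the prompt wording, and the choice to inject all earlier outputs (rather than only the previous one) are assumptions for illustration.

```typescript
// A model call is abstracted as a function from prompt to completion.
// Synchronous here for brevity; a real call would be async.
type ModelCall = (prompt: string) => string;

interface CompoundTask {
  id: string;
  subTasks: string[]; // 2-4 sub-task instructions, run in order
  synthesisPrompt: string; // instruction for the final deliverable
}

function runCompoundTask(task: CompoundTask, executor: ModelCall, orchestrator: ModelCall): string {
  const outputs: string[] = [];
  for (const instruction of task.subTasks) {
    // Inject earlier outputs so each step can build on the previous ones.
    const context = outputs.map((o, i) => `Step ${i + 1} output:\n${o}`).join("\n\n");
    const prompt = context ? `${context}\n\nNext step: ${instruction}` : instruction;
    outputs.push(executor(prompt));
  }
  // The orchestrator assembles the final answer; only this string is judged.
  const finalPrompt = `${task.synthesisPrompt}\n\n${outputs.join("\n\n")}`;
  return orchestrator(finalPrompt);
}

// Stub models so the flow can be inspected without any API calls.
const seenPrompts: string[] = [];
const demoResult = runCompoundTask(
  { id: "ct-demo", subTasks: ["write code", "write tests"], synthesisPrompt: "Assemble the deliverable." },
  (prompt) => {
    seenPrompts.push(prompt);
    return `result ${seenPrompts.length}`;
  },
  (prompt) => prompt, // echo orchestrator so the assembled prompt is visible
);
console.log(demoResult);
```

Swapping the `executor` argument between a cloud call and a local call is all that distinguishes the cloud-only and hybrid strategies in this sketch; the fully local strategy would pass the local model as both arguments.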

Verdict

✓ Hybrid — JUSTIFIED
Score delta: +0.10 vs cloud (threshold: −0.30)
Cost savings: 29% (threshold: >50%) — borderline
Coverage: 7/7 tasks
Best quality at lowest cloud spend
✓ Fully Local — JUSTIFIED
Score delta: −0.05 vs cloud (within threshold)
Cost savings: 100%
Coverage: 7/7 tasks
Near-cloud quality at zero marginal cost
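
The two thresholds in the verdict can be expressed as a small check. The source marks Hybrid as justified even though its cost savings fall below the 50% threshold, so this sketch reports the quality gate and the cost gate separately rather than guessing how they combine; `assess` and its types are hypothetical names.

```typescript
interface StrategyResult {
  label: string;
  score: number; // average score out of 2.0
  costUsdPerTask: number;
}

interface Verdict {
  scoreDelta: number; // candidate minus baseline
  costSavingsPct: number;
  passesQuality: boolean; // score drop no worse than maxScoreDrop
  passesCost: boolean; // savings above minSavingsPct
}

function assess(
  candidate: StrategyResult,
  baseline: StrategyResult,
  maxScoreDrop = 0.3,
  minSavingsPct = 50,
): Verdict {
  const scoreDelta = candidate.score - baseline.score;
  const costSavingsPct = (1 - candidate.costUsdPerTask / baseline.costUsdPerTask) * 100;
  return {
    scoreDelta,
    costSavingsPct,
    passesQuality: scoreDelta >= -maxScoreDrop,
    passesCost: costSavingsPct > minSavingsPct,
  };
}

const cloudOnly: StrategyResult = { label: "Cloud-Only", score: 1.88, costUsdPerTask: 0.21 };
const hybridRun: StrategyResult = { label: "Hybrid", score: 1.98, costUsdPerTask: 0.15 };
const fullyLocal: StrategyResult = { label: "Fully Local", score: 1.83, costUsdPerTask: 0 };

const hybridVerdict = assess(hybridRun, cloudOnly);
const localVerdict = assess(fullyLocal, cloudOnly);
console.log(hybridVerdict.passesQuality, hybridVerdict.passesCost); // true false
console.log(localVerdict.passesQuality, localVerdict.passesCost); // true true
```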

Only weak spot: ct-006 Code Port + Validation scored 1.33 fully-local vs 1.83 cloud-only. Equivalence checking is where the gap shows. All other 6 tasks: 1.83–2.00.

Source: src/orchestrator/ — strategies.ts, executor.ts, index.ts — runs via tsx src/orchestrator/index.ts --batch orch-v1

FAQ

What's done, what's not, and what we're honest about not knowing yet.

What is the current status of this evaluation?
[Done] Cloud quality screening: 16 models × 63 tasks, dual-judge, multiple batches
[Done] Local benchmarks: Gemma4-E2B (63 tasks), GLM-4.7-Flash (63 tasks), Gemma4-26B on M4 Air
[Done] Prompt mining: 750 real prompts classified, 97.3% offloadable finding
[Done] Cost analysis and hardware comparison
[Done] Orchestration benchmark framework built (7 compound tasks, 3 strategies)
[Done] Cloud-only baseline: 1.88/2.0 avg score, $0.21/task
[Done] Hybrid strategy (Opus→local Gemma4-26B): 1.98/2.0, beats cloud-only
[Done] Fully-local strategy (Gemma4-26B): 1.83/2.0, $0.00/task, within 0.05 of cloud
[Pending] Concurrent user load testing (1/3/5 simultaneous sessions)
[Pending] Q4 vs Q8 quantization quality comparison
[Planned] Alternative harness evaluation (aider + Ollama for agentic tasks on Pi)
[Planned] Mac Studio M5 hardware purchase (summer 2026, pending orchestration results)
How reliable are these scores?
The scores are directional, not definitive. 63 tasks gives solid category-level confidence. The dual-judge approach (Opus + o4-mini) with 100% agreement on most batches adds reliability. Cloud scores are an upper bound — local quantized inference may score lower. The GLM case is the sharpest example: cloud 1.64 (thinking enabled) vs local ~0.81 (thinking disabled, required to avoid 10-min TTFT). Dense models degrade less: Qwen3-14B local was 1.48 vs 1.55 cloud.
Why not just use existing benchmarks (HumanEval, SWE-Bench)?
Existing benchmarks test generic capabilities. This evaluation tests your actual use cases — Swedish email extraction, MCP tool generation, Fortnox invoice refactoring, project status summaries. The question isn’t "which model is smartest?" but "which model can do my work well enough that I can downgrade my cloud subscription?"
What about fine-tuning?
Not evaluated. Fine-tuning could improve task-specific quality (especially for Swedish text and domain-specific patterns like Fortnox API formats), but it adds complexity and maintenance burden. The base models scored well enough on most tasks without fine-tuning.
Is GPT-oss-120B actually runnable locally?
Yes, but with caveats. People have run it on Apple Silicon via Ollama (ollama pull gpt-oss:120b), llama.cpp (native MXFP4), and MLX (via OpenHarmony-MLX). The model needs ~80 GB RAM, leaving ~28 GB free on a 128 GB Mac Studio — tight but functional as a solo model.

Real-world performance: Best reported is ~40 tok/s on optimized MLX, but practical experience is often slower. The Metal reference implementation from OpenAI is still experimental, not production-grade. Most Mac users actually run the smaller GPT-oss-20B (13 GB) instead.

Our take: GPT-oss-120B scored highest in our evaluation (1.89/2.0), but the local runtime is not yet battle-tested. Option B (Qwen3-Coder-Next + GLM-4.7-Flash) is the safer recommendation — both models are confirmed running reliably on Apple Silicon, use only 50 GB total, and leave comfortable headroom. If GPT-oss-120B matures, it becomes Option A.
What if I have different tasks?
The framework is open-source and designed for custom tasks. You can add your own task definitions and re-run the evaluation. See the section below for how to submit tasks.
What’s the plan ahead?
Right now (April 2026): Run orchestration hybrid strategies (Opus→Gemma4-26B, Opus→Qwen3.5) to see if local models can replace Sonnet in sub-task delegation. This is the critical gate for the hardware decision.
Short term: If hybrid scores within 0.3 of cloud-only (threshold: 1.58/2.0), the case for hardware is strong; the Opus→Gemma4-26B hybrid already clears this at 1.98/2.0. Also testing aider + Ollama as an alternative harness for Pi.
Medium term (Summer 2026): Wait for M5 Max announcement. Buy the hardware if orchestration results justify it. Deploy GPT-oss-120B + Gemma4-26B as the co-op stack.
Long term: Build the hackathon booth. Run Hugin (local orchestrator) as the agent substrate. 300 SEK/person/month for unlimited local inference.

Submit Your Own Tasks

Don't trust our benchmarks? Good. Add yours and we'll run them.

The framework is open-source. Fork it, add a task definition, open a PR. We'll run it against every model in the evaluation and publish the results here.

How to contribute a task

1. Fork the repo
2. Add your task to src/tasks/real-world.ts (or create a new category file)
3. Follow the TaskDefinition interface:
{
  id: 'community-001',
  category: 'debugging',        // or: simple-coding, refactoring, architecture,
                                 //     multi-file, reasoning, non-coding
  title: 'Your Task Title',
  difficulty: 3,                 // 1-5
  maxTokens: 2000,
  tags: ['your', 'tags'],
  prompt: `Your full prompt here. Be specific.
Include any code snippets or context the model needs.`,
  expectedCapabilities: [
    'what a good answer should include',
    'another expected capability',
  ],
}
4. Run npx vitest run tests/tasks.test.ts to validate
5. Open a PR with a description of what your task tests and why it matters
6. We’ll run it against every model in the evaluation, judge it, and add results to this page
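
For orientation, here is a guess at what the TaskDefinition shape might look like, inferred only from the example in step 3. The real interface in the repo may differ, and `checkTask` is a hypothetical helper, not the validation that tests/tasks.test.ts performs.

```typescript
// Hypothetical reconstruction of the TaskDefinition shape, inferred only from
// the example in step 3. The real interface in the repo may differ.
type TaskCategory =
  | "simple-coding"
  | "refactoring"
  | "architecture"
  | "debugging"
  | "multi-file"
  | "reasoning"
  | "non-coding";

interface TaskDefinition {
  id: string;
  category: TaskCategory;
  title: string;
  difficulty: number; // 1-5
  maxTokens: number;
  tags: string[];
  prompt: string;
  expectedCapabilities: string[];
}

// Returns a list of problems; an empty list means the task looks well-formed.
function checkTask(t: TaskDefinition): string[] {
  const problems: string[] = [];
  if (!t.id.trim()) problems.push("id is empty");
  if (!Number.isInteger(t.difficulty) || t.difficulty < 1 || t.difficulty > 5) {
    problems.push("difficulty must be an integer from 1 to 5");
  }
  if (t.maxTokens <= 0) problems.push("maxTokens must be positive");
  if (!t.prompt.trim()) problems.push("prompt is empty");
  if (t.expectedCapabilities.length === 0) problems.push("expectedCapabilities is empty");
  return problems;
}

const communityTask: TaskDefinition = {
  id: "community-001",
  category: "debugging",
  title: "Your Task Title",
  difficulty: 3,
  maxTokens: 2000,
  tags: ["your", "tags"],
  prompt: "Your full prompt here. Be specific.",
  expectedCapabilities: ["what a good answer should include"],
};

console.log(checkTask(communityTask)); // []
```

Typing `category` as a union means a misspelled category fails at compile time, before the task ever reaches a test run.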

View the source on GitHub →

Timeline

From data to hardware. Three phases, six months, one machine.

MARCH–APRIL 2026

Phase A: Evaluate & Screen

[Done] 16 models × 63 tasks via OpenRouter. Local benchmarks on M4 Air (Gemma4-E2B, GLM-4.7-Flash, Gemma4-26B). 750 real prompts classified: 97.3% offloadable. Orchestration benchmark built and cloud baseline established: 1.88/2.0, $0.21/task.

APRIL–MAY 2026

Phase B: Orchestration & Decision

[In Progress] Run hybrid orchestration strategies (Opus→local Gemma4-26B, Opus→Qwen3.5). If hybrid scores ≥1.58/2.0: hardware justified. Also: aider + Ollama agentic testing on Pi, quantization comparison (Q4 vs Q8), concurrent load tests. Finalize co-op group.

JUNE–JULY 2026

Phase C: Buy & Deploy

[Planned] Mac Studio M5 Max 128 GB drops (expected). Purchase as co-op. Deploy GPT-oss-120B + GLM-4.7-Flash. Set up Ollama as always-on inference server. Host first hackathon booth: sun loungers, fika, and local AI in the garden.

// The Invitation

Join the Co-op

300 SEK/month. Unlimited local inference. Full privacy. No rate limits. 5 concurrent users. Silent. Always on.

Get In Touch

* Mac Studio M5 Max — expected release summer 2026. Benchmarks based on M4 Max specs; M5 likely ~12% faster.