Why pay for tokens when you can own the factory?
Burning through your AI subscription every month and still hitting rate limits?
Curious about running models on your own hardware but unsure if the quality holds up,
what it actually costs, or whether it's worth the investment?
We tested 12 models across 30 real tasks on three hardware tiers
— from a Raspberry Pi to a Mac Studio — so you don't have to guess.
We threw 30 real tasks at 12 open-weight models. Two AI judges scored every answer.
30 tasks. Dual LLM judges (Claude Opus + OpenAI o4-mini). Scale: 0 = fail, 1 = acceptable, 2 = good.
| # | Model | Score | Fits 128 GB | Fits 256 GB | Active Params | Avg Time |
|---|---|---|---|---|---|---|
| 1 | GPT-oss-120B | 1.89 | Yes* | Yes | 5.1B MoE | 102s |
| 2 | Qwen3-235B-MoE | 1.68 | No | Yes | 22B MoE | 159s |
| 3 | MiniMax-M2.5 | 1.68 | No | Yes | 15B MoE | 62s |
| 4 | GLM-4.7-Flash | 1.64 | Yes | Yes | ~3B MoE | 65s |
| 5 | Qwen3-Coder-Next | 1.62 | Yes | Yes | 3B MoE | 19s |
| 5 | Qwen3-32B | 1.62 | Yes | Yes | 32B dense | 443s |
| 5 | Devstral-2 | 1.62 | Yes* | Yes | 123B dense | 14s |
| 8 | Qwen3-14B | 1.51 | Yes | Yes | 14B dense | 84s |
| 9 | Nemotron-3-Super | 1.51 | Yes* | Yes | 12B hybrid | 7.5s |
| 10 | DS-R1-32B | 1.35 | Yes | Yes | 32B dense | 82s |
| 11 | Llama 3.3-70B | 1.33 | No | Yes | 70B dense | 66s |
| 12 | Qwen2.5-Coder-32B | 0.33 | Yes | Yes | 32B dense | 5.4s |
* Fits as sole model only. No room for a second model alongside it after macOS overhead (~20 GB).
```
[Task Definitions] ---> [Runner (OpenRouter / Ollama)] ---> [SQLite DB]
                                                                 |
[Analysis Reports] <--- [Judge (Opus + o4-mini)] <---------------+
```
Each phase is a separate CLI command. All state is persisted in SQLite (WAL mode). Runs are idempotent — failed or interrupted runs can be resumed with `--resume` without re-billing completed tasks.
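The resume logic boils down to "skip anything already persisted." A minimal sketch of that idea, using an in-memory set in place of the SQLite results table (the function and parameter names here are illustrative, not the framework's actual API):

```typescript
// Idempotent task execution: completed task IDs are checked before running,
// so a resumed run skips them and never re-bills a finished task.
type TaskResult = { taskId: string; output: string };

function runBenchmark(
  taskIds: string[],
  completed: Set<string>,            // stands in for: SELECT task_id FROM results
  runTask: (id: string) => string,   // stands in for the OpenRouter/Ollama call
): TaskResult[] {
  const results: TaskResult[] = [];
  for (const taskId of taskIds) {
    if (completed.has(taskId)) continue; // already persisted: skip on --resume
    const output = runTask(taskId);
    completed.add(taskId);               // stands in for: INSERT INTO results
    results.push({ taskId, output });
  }
  return results;
}
```

Because completion is recorded per task, a crash mid-run loses at most the task in flight.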
Every model response is scored by two independent judges from different model families:
| Judge | Model | Why |
|---|---|---|
| Judge A | Claude Opus 4.5 (via OpenRouter) | Strong on idiomatic code quality |
| Judge B | OpenAI o4-mini (via OpenRouter) | Reasoning model, strong on correctness |
Scoring uses a 3-point scale across 3 dimensions:
| Score | Correctness | Completeness | Quality |
|---|---|---|---|
| fail (0) | Broken or wrong | Missing key parts | Unusable structure |
| acceptable (1) | Works, some gaps | Most covered | OK but room for improvement |
| good (2) | Correct, edge cases | All addressed | Professional quality |
Judge agreement on this evaluation: 100%. Disagreements >1.5 on any dimension trigger manual review (none were triggered).
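Aggregation and the manual-review trigger can be sketched like this (a simplified illustration under the scoring rules above; the real framework's field names may differ):

```typescript
// Per-response scores from one judge, on the 0/1/2 scale per dimension.
type Dimensions = { correctness: number; completeness: number; quality: number };

// Average the two judges across dimensions; flag for manual review if they
// disagree by more than 1.5 on any single dimension.
function aggregate(a: Dimensions, b: Dimensions) {
  const dims = ["correctness", "completeness", "quality"] as const;
  const needsReview = dims.some((d) => Math.abs(a[d] - b[d]) > 1.5);
  const score =
    dims.reduce((sum, d) => sum + (a[d] + b[d]) / 2, 0) / dims.length;
  return { score, needsReview };
}
```

With a 3-point scale, a >1.5 gap only fires when one judge says "good" and the other says "fail" — exactly the cases worth a human look.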
30 tasks across 7 categories, sourced from two approaches:
20 generic tasks (designed to cover coding fundamentals): simple coding (4), refactoring (3), architecture (2), debugging (3), multi-file generation (2), reasoning (1), non-coding (5).
10 real-world tasks (mined from 6,567 actual Claude Code prompts): status reports, error debugging, commit messages, Express refactoring, Swedish email extraction, documentation, deployment debugging, slide generation, MCP tool implementation, fact-checking.
Real-world tasks were QA’d by Codex (GPT-5.4) in an adversarial review pass. 3 tasks were revised based on feedback (ambiguous dates, missing context, over-broad scope).
OpenRouter serves models at provider-chosen precision (typically FP16/BF16). Local inference uses Q4/Q8 quantization. Our three-tier test (Cloud vs Air vs Pi) showed Qwen3-14B quality holds within noise locally (1.48 local vs 1.55 cloud). However, highly sparse MoE models like GLM-4.7-Flash showed more degradation (1.07 local vs 1.70 cloud). All cloud scores should be treated as upper bounds.
Full source code, task definitions, and raw data on GitHub →
GPT-oss-120B is OpenAI’s open-weight MoE model (Apache 2.0). Key properties:
| Property | Value |
|---|---|
| Total parameters | 117B |
| Active parameters per token | 5.1B (MoE routing) |
| Native format | MXFP4 (~80 GB on disk) |
| Apple Silicon | Metal reference implementation from OpenAI |
| Quality score | 1.89 / 2.0 (near-perfect across all categories) |
The MoE architecture means it reads only ~10 GB of weights per token, despite being a 117B model. On a Mac Studio M4 Max (546 GB/s bandwidth), that translates to ~55 tok/s — fast enough for interactive use.
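The back-of-envelope math, assuming decoding is memory-bandwidth-bound (a sketch, not a measured figure):

```typescript
// Rough decode-speed estimate for a bandwidth-bound model:
// tokens/s ≈ memory bandwidth / bytes of weights read per token.
function estimateTokPerSec(
  bandwidthGBs: number,
  weightsReadPerTokenGB: number,
): number {
  return bandwidthGBs / weightsReadPerTokenGB;
}

// M4 Max (546 GB/s) reading ~10 GB per token → ~55 tok/s.
```

This is an upper bound: real throughput also pays for KV-cache reads, attention compute, and sampling overhead.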
This model has not been validated running locally via Ollama/MLX on Apple Silicon. The MXFP4 format and Metal implementation exist but real-world performance needs Phase B testing on actual hardware. This is the single highest-risk item in our recommendation.
One task. One model. Cloud, laptop, and a Raspberry Pi. Same quality — wildly different speeds.
Task: "Summarize this project's status and prioritize next steps." Best available model per tier.
Same quality everywhere. 20x–50x slower locally — but free and private.
On a Mac Studio (546 GB/s), those 238s become ~60s.
| Platform | RAM | Bandwidth | Power | Noise | Price (SEK) |
|---|---|---|---|---|---|
| Mac Studio M4 Max 128GB | 128 GB unified | 546 GB/s | ~120W | Silent | ~40,000 |
| Mac Studio M3 Ultra 256GB | 256 GB unified | 819 GB/s | ~180W | Silent | ~95,000 |
| RTX 5090 PC | 32 GB VRAM | 1,792 GB/s | ~950W | Very loud | ~54,000 |
| RTX 4090 PC (used) | 24 GB VRAM | 1,008 GB/s | ~750W | Loud | ~36,500 |
| Dual RTX 5090 | 64 GB VRAM | 3,584 GB/s | ~1,500W | Extreme | ~106,000 |
| MacBook Air M4 32GB | 32 GB unified | 120 GB/s | ~15W | Silent | ~20,000 |
| Raspberry Pi 5 8GB | 8 GB | ~30 GB/s | ~10W | Silent | ~1,200 |
Unified memory is the differentiator. NVIDIA GPUs have higher bandwidth but are limited by VRAM — an RTX 5090 with 32 GB cannot run GPT-oss-120B (80 GB). You’d need dual GPUs with PCIe tensor parallelism, which adds complexity, noise, power draw, and costs more than the Mac.
The Mac Studio at 128 GB unified memory fits GPT-oss-120B + GLM-4.7-Flash simultaneously in a box that draws 120W and is silent. An equivalent NVIDIA setup draws 1,500W and sounds like a jet engine.
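The fit constraint is simple arithmetic. A sketch (the ~20 GB macOS overhead figure is from the leaderboard footnote; treat the numbers as rough):

```typescript
// A model's weights must fit in available accelerator memory: VRAM on
// discrete GPUs, or unified RAM minus OS overhead on Apple Silicon.
function fitsInMemory(
  modelSizeGB: number,
  memoryGB: number,
  overheadGB: number = 0,
): boolean {
  return modelSizeGB <= memoryGB - overheadGB;
}

// RTX 5090 (32 GB VRAM) vs GPT-oss-120B (~80 GB on disk): does not fit.
// Mac Studio 128 GB unified, ~20 GB macOS overhead: fits, as the sole model.
```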
Mac Studio (8h/day): ~150 SEK/month. RTX 5090 PC (8h/day): ~450 SEK/month. ~10,800 SEK saved over 36 months — that covers 25% of the Mac’s purchase price.
Neither the RTX 4090 nor RTX 5090 support NVLink. Multi-GPU model parallelism uses PCIe bandwidth (~25 GB/s practical), which is 24x slower than NVLink. It works for inference but limits throughput and adds latency.
Hardware pays for itself. Faster if you split with friends. The math is brutal for API providers.
Mac Studio M5 Max* 128 GB — ~40,000 SEK est. — amortized over 36 months plus electricity.
Based on actual measured token usage: ~1.9B tokens/month (daily: ~115M, weekly: ~740M). Usage split: 30% input tokens, 70% output tokens. Hardware depreciation: 36 months. Electricity: 2 SEK/kWh, 8h/day average load.
| Model | Input price/M | Output price/M | Monthly (1.9B tok) |
|---|---|---|---|
| Qwen3-32B | $0.10 | $0.30 | ~4,600 SEK |
| GPT-oss-120B | $0.30 | $1.20 | ~18,000 SEK |
| Qwen3-14B | $0.14 | $0.56 | ~8,500 SEK |
| GLM-4.7-Flash | $0.06 | $0.40 | ~5,300 SEK |
Even the cheapest model (Qwen3-32B) costs 4,600 SEK/month. The Mac Studio at 1,186 SEK/month is 4x cheaper.
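The monthly figures in the table follow directly from the 30/70 input/output split. A sketch (the ~10 SEK/USD exchange rate is our assumption):

```typescript
// Monthly API cost in USD. Prices are per million tokens;
// usage is split 30% input / 70% output, per the measured ratio.
function monthlyApiCostUSD(
  totalTokens: number,
  inputPricePerM: number,
  outputPricePerM: number,
): number {
  const inputTok = totalTokens * 0.3;
  const outputTok = totalTokens * 0.7;
  return (
    (inputTok / 1e6) * inputPricePerM + (outputTok / 1e6) * outputPricePerM
  );
}

// Qwen3-32B at 1.9B tokens/month:
// 570M × $0.10/M + 1330M × $0.30/M = $456 ≈ 4,600 SEK at ~10 SEK/USD.
```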
| Usage | API/month (Qwen3-32B) | HW/month | Break-even |
|---|---|---|---|
| 500M tok/mo (light) | ~1,200 SEK | 1,186 SEK | Never (roughly equal) |
| 1B tok/mo | ~2,400 SEK | 1,186 SEK | ~33 months |
| 1.9B tok/mo (actual) | ~4,600 SEK | 1,186 SEK | ~12 months |
| 3B tok/mo (co-op) | ~7,300 SEK | 1,186 SEK | ~7 months |
Not included: time spent on setup/maintenance, opportunity cost of waiting for tasks to complete locally (slower), the value of privacy (hard to quantify), and the resale value of the hardware after 3 years.
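The break-even column is just hardware price divided by monthly savings. A sketch using the numbers above:

```typescript
// Months until the hardware pays for itself versus API spend.
// Returns Infinity when the API is not more expensive than the hardware.
function breakEvenMonths(
  hardwarePriceSEK: number,
  apiMonthlySEK: number,
  hardwareMonthlySEK: number,
): number {
  const savings = apiMonthlySEK - hardwareMonthlySEK;
  return savings > 0 ? hardwarePriceSEK / savings : Infinity;
}

// Actual usage tier: 40000 / (4600 - 1186) ≈ 11.7 → ~12 months.
// Light usage tier: savings ≈ 0, so break-even never arrives.
```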
Two options depending on your priorities: maximum quality or maximum headroom.
Mac Studio M5 Max* · 128 GB Unified Memory · ~614 GB/s bandwidth · Silent · ~120W
Best quality, but you're all-in on one model. Swap between models by unloading/loading as needed (Ollama handles this automatically, but incurs a ~5-10s cold start).
Total: ~50 GB loaded · ~58 GB free after OS. Rock solid — no swap risk, handles multiple concurrent users, both models always hot. Slightly lower peak quality (1.62-1.64) but much more practical for daily use and the hackathon booth.
Or forget the trade-offs entirely. Load everything. Run the best model and three more alongside it.
Total: ~230 GB loaded on a Mac Studio M5 Ultra* 256 GB (~95,000 SEK). All four models hot simultaneously. No compromises, no swapping. The everything machine.
Co-op cost at 4 people: ~720 SEK/person/month — still cheaper than one Claude MAX subscription.
* M5 Max / M5 Ultra not yet announced. Specs estimated from Apple Silicon generational improvements. All benchmarks run on M4 equivalent (via OpenRouter). Plan: buy when M5 ships (summer 2026).
Half your work doesn't need the cloud. Refactoring, debugging, and docs score perfect locally.
Analysis of 6,567 actual Claude Code prompts. ~50% of work can run locally.
What's done, what's not, and what we're honest about not knowing yet.
Local runtime options include Ollama (`ollama pull gpt-oss:120b`), llama.cpp (native MXFP4), and MLX (via OpenHarmony-MLX). The model needs ~80 GB RAM, leaving ~28 GB free on a 128 GB Mac Studio — tight but functional as a solo model.
Don't trust our benchmarks? Good. Add yours and we'll run them.
The framework is open-source. Fork it, add a task definition, open a PR. We'll run it against all 12 models and publish the results here.
1. Add your task to `src/tasks/real-world.ts` (or create a new category file), following the `TaskDefinition` interface:

```typescript
{
  id: 'community-001',
  category: 'debugging', // or: simple-coding, refactoring, architecture,
                         //     multi-file, reasoning, non-coding
  title: 'Your Task Title',
  difficulty: 3, // 1-5
  maxTokens: 2000,
  tags: ['your', 'tags'],
  prompt: `Your full prompt here. Be specific.
Include any code snippets or context the model needs.`,
  expectedCapabilities: [
    'what a good answer should include',
    'another expected capability',
  ],
}
```

2. Run `npx vitest run tests/tasks.test.ts` to validate.

From data to hardware. Three phases, six months, one machine.
Done: Run 385 inference tests across 12 models and 3 hardware tiers. Share results. Collect feedback from co-op candidates. Accept community task submissions via PR. Validate GPT-oss-120B running locally on rented Mac Studio.
Pending: Concurrent user load testing on real hardware (1/3/5 sessions). Agentic workflow testing with aider + Ollama. Quantization quality comparison (Q4 vs Q8). Finalize co-op group and cost-sharing agreement. Monitor M5 Max announcement from Apple.
Planned: Mac Studio M5 Max 128 GB drops (expected). Purchase as co-op. Deploy GPT-oss-120B + GLM-4.7-Flash. Set up Ollama as always-on inference server. Host first hackathon booth — solstolar, fika, and local AI in the garden.
300 SEK/month. Unlimited local inference. Full privacy. No rate limits. 5 concurrent users. Silent. Always on.
Get In Touch

* Mac Studio M5 Max — expected release summer 2026. Benchmarks based on M4 Max specs; M5 likely ~12% faster.