LOCAL INFERENCE
MACHINE

Why pay for tokens when you can own the factory?

Burning through your AI subscription every month and still hitting rate limits? Curious about running models on your own hardware but unsure if the quality holds up, what it actually costs, or whether it's worth the investment?

We tested 12 models across 30 real tasks on three hardware tiers — from a Raspberry Pi to a Mac Studio — so you don't have to guess.

12 Models Tested · 385 Inference Runs · 758 Judge Evaluations · 3 Hardware Tiers

The Leaderboard

We threw 30 real tasks at 12 open-weight models. Two AI judges scored every answer.

30 tasks. Dual LLM judges (Claude Opus + OpenAI o4-mini). Scale: 0 = fail, 1 = acceptable, 2 = good.

#  | Model             | Score | 128 GB | 256 GB | Active Params | Avg Time
1  | GPT-oss-120B      | 1.89  | Yes*   | Yes    | 5.1B MoE      | 102s
2  | Qwen3-235B-MoE    | 1.68  | No     | Yes    | 22B MoE       | 159s
3  | MiniMax-M2.5      | 1.68  | No     | Yes    | 15B MoE       | 62s
4  | GLM-4.7-Flash     | 1.64  | Yes    | Yes    | ~3B MoE       | 65s
5  | Qwen3-Coder-Next  | 1.62  | Yes    | Yes    | 3B MoE        | 19s
5  | Qwen3-32B         | 1.62  | Yes    | Yes    | 32B dense     | 443s
5  | Devstral-2        | 1.62  | Yes*   | Yes    | 123B dense    | 14s
8  | Qwen3-14B         | 1.51  | Yes    | Yes    | 14B dense     | 84s
9  | Nemotron-3-Super  | 1.51  | Yes*   | Yes    | 12B hybrid    | 7.5s
10 | DS-R1-32B         | 1.35  | Yes    | Yes    | 32B dense     | 82s
11 | Llama 3.3-70B     | 1.33  | No     | Yes    | 70B dense     | 66s
12 | Qwen2.5-Coder-32B | 0.33  | Yes    | Yes    | 32B dense     | 5.4s

* Fits as sole model only. No room for a second model alongside it after macOS overhead (~20 GB).

  • Dual LLM Judge: Every response is judged independently by Claude Opus 4.5 and OpenAI o4-mini. Using two model families reduces systematic bias. Disagreements >1.5 trigger manual review.
  • 30 Tasks: 15 generic coding tasks (algorithms, refactoring, debugging, architecture), 5 non-coding (summarization, extraction, planning), and 10 real-world tasks mined from 6,567 actual Claude Code prompts.
  • Coding + Non-Coding: Not just HumanEval. Tasks include Swedish email extraction, MCP tool generation, Fortnox API refactoring, project status reports, and slide content generation.
  • Real-World Prompts: 10 tasks derived from analyzing the user's actual prompt history: commit messages (8% of prompts), debugging (5%), documentation (8%), status reports (7%), Swedish tasks (5%).
  • 100% Judge Agreement: Opus and o4-mini agreed on the score for every single evaluation in this batch. This is unusually high — likely because the 3-point scale (fail/acceptable/good) is coarser than typical 1-5 scales.
Technical Deep Dive: Evaluation Methodology

Pipeline Architecture

[Task Definitions] ---> [Runner (OpenRouter / Ollama)] ---> [SQLite DB]
                                                                    |
[Analysis Reports] <--- [Judge (Opus + o4-mini)] <-----------------+

Each phase is a separate CLI command. All state is persisted in SQLite (WAL mode). Runs are idempotent — failed or interrupted runs can be resumed with --resume without re-billing completed tasks.
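
The resume behavior can be sketched as a simple set-difference over completed (model, task) pairs. This is an illustrative reconstruction, not the repo's actual code; the names `RunRow` and `remainingRuns` are ours:

```typescript
// Hypothetical sketch of --resume: completed (model, task) pairs are read
// from the SQLite run table and filtered out before any new API calls are
// billed. Re-running after an interruption only executes the remainder.
type RunRow = { model: string; taskId: string };

function remainingRuns(
  allModels: string[],
  allTasks: string[],
  completed: RunRow[],
): RunRow[] {
  const done = new Set(completed.map((r) => `${r.model}::${r.taskId}`));
  const pending: RunRow[] = [];
  for (const model of allModels) {
    for (const taskId of allTasks) {
      if (!done.has(`${model}::${taskId}`)) pending.push({ model, taskId });
    }
  }
  return pending;
}
```

Because the completed set is keyed on (model, task), a crash mid-batch costs nothing: the next invocation recomputes the pending list and continues.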

Judging Protocol

Every model response is scored by two independent judges from different model families:

Judge   | Model                            | Why
Judge A | Claude Opus 4.5 (via OpenRouter) | Strong on idiomatic code quality
Judge B | OpenAI o4-mini (via OpenRouter)  | Reasoning model, strong on correctness

Scoring uses a 3-point scale across 3 dimensions:

Score          | Correctness         | Completeness      | Quality
fail (0)       | Broken or wrong     | Missing key parts | Unusable structure
acceptable (1) | Works, some gaps    | Most covered      | OK but room for improvement
good (2)       | Correct, edge cases | All addressed     | Professional quality

Judge agreement on this evaluation: 100%. Disagreements >1.5 on any dimension trigger manual review (none were triggered).
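
The aggregation step can be sketched as follows (an illustrative reconstruction; the type and function names are ours, not from the repo): each judge scores three dimensions on 0..2, the final score averages the two judges, and any per-dimension gap above 1.5 flags the response for manual review.

```typescript
// Dual-judge aggregation sketch: mean of two judges across 3 dimensions,
// with a review flag when any single dimension disagrees by more than 1.5.
type Dims = { correctness: number; completeness: number; quality: number };

function aggregate(a: Dims, b: Dims): { score: number; needsReview: boolean } {
  const dims: (keyof Dims)[] = ["correctness", "completeness", "quality"];
  // Average the two judges per dimension, then average across dimensions.
  const score = dims.reduce((s, d) => s + (a[d] + b[d]) / 2, 0) / dims.length;
  // On a 0..2 scale, a gap > 1.5 means one judge said fail and the other good.
  const needsReview = dims.some((d) => Math.abs(a[d] - b[d]) > 1.5);
  return { score, needsReview };
}
```

Note that on a 0..2 scale, the only way to trip the >1.5 threshold is a fail-vs-good split, which helps explain why no reviews were triggered.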

Task Design

30 tasks across 7 categories, sourced from two approaches:

20 generic tasks (designed to cover coding fundamentals): simple coding (4), refactoring (3), architecture (2), debugging (3), multi-file generation (2), reasoning (1), non-coding (5).

10 real-world tasks (mined from 6,567 actual Claude Code prompts): status reports, error debugging, commit messages, Express refactoring, Swedish email extraction, documentation, deployment debugging, slide generation, MCP tool implementation, fact-checking.

Real-world tasks were QA’d by Codex (GPT-5.4) in an adversarial review pass. 3 tasks were revised based on feedback (ambiguous dates, missing context, over-broad scope).

Proxy Validity

OpenRouter serves models at provider-chosen precision (typically FP16/BF16). Local inference uses Q4/Q8 quantization. Our three-tier test (Cloud vs Air vs Pi) showed Qwen3-14B quality holds within noise locally (1.48 local vs 1.55 cloud). However, highly sparse MoE models like GLM-4.7-Flash showed more degradation (1.07 local vs 1.70 cloud). All cloud scores should be treated as upper bounds.

Full source code, task definitions, and raw data on GitHub →

Technical Deep Dive: Why GPT-oss-120B Wins

GPT-oss-120B is OpenAI’s open-weight MoE model (Apache 2.0). Key properties:

Property                    | Value
Total parameters            | 117B
Active parameters per token | 5.1B (MoE routing)
Native format               | MXFP4 (~80 GB on disk)
Apple Silicon               | Metal reference implementation from OpenAI
Quality score               | 1.89 / 2.0 (near-perfect across all categories)

The MoE architecture means it reads only ~10 GB of weights per token, despite being a 117B model. On a Mac Studio M4 Max (546 GB/s bandwidth), that translates to ~55 tok/s — fast enough for interactive use.
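
The back-of-envelope behind that figure: autoregressive decoding is memory-bandwidth-bound, so tokens per second is roughly bandwidth divided by the bytes of weights read per token. A minimal sketch (our simplification; it ignores KV-cache traffic and prompt processing):

```typescript
// Rough decode-speed estimate for a bandwidth-bound model:
// tok/s ≈ memory bandwidth (GB/s) / active weights read per token (GB).
function estimateTokPerSec(bandwidthGBs: number, activeWeightsGB: number): number {
  return bandwidthGBs / activeWeightsGB;
}

// M4 Max: 546 GB/s over ~10 GB of active MoE weights ≈ 55 tok/s.
```

The same formula explains why MoE models punch above their weight locally: a dense 117B model would read ~8x more bytes per token and decode ~8x slower on the same machine.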

Caveat

This model has not been validated running locally via Ollama/MLX on Apple Silicon. The MXFP4 format and Metal implementation exist but real-world performance needs Phase B testing on actual hardware. This is the single highest-risk item in our recommendation.

The Speed Test

One task. One model. Cloud, laptop, and a Raspberry Pi. Same quality — wildly different speeds.

Task: "Summarize this project's status and prioritize next steps." Best available model per tier.

OpenRouter (Cloud API · Qwen3-14B): 12s · Score 2.0 / 2.0 · ~$0.001/task
MacBook Air M4 (32 GB · 120 GB/s · Qwen3-14B): 238s · Score 2.0 / 2.0 · 6.5 tok/s · Free
Raspberry Pi 5 (8 GB · CPU only · Qwen2.5-7B): 559s · Score 2.0 / 2.0 · 0.6 tok/s · Free

Same quality everywhere. 20x–50x slower locally — but free and private.
On a Mac Studio (546 GB/s), those 238s become ~60s.
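
That projection falls out of the bandwidth-bound model: wall-clock time scales inversely with memory bandwidth. A quick sketch (our simplification, ignoring fixed prompt-processing overhead, which is why the text rounds up to ~60s):

```typescript
// For bandwidth-bound decoding, runtime scales inversely with bandwidth.
function scaleRuntime(seconds: number, fromGBs: number, toGBs: number): number {
  return seconds * (fromGBs / toGBs);
}

// MacBook Air (120 GB/s) took 238s; scaled to a Mac Studio (546 GB/s)
// that is ~52s, consistent with the ~60s estimate above.
```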

Technical Deep Dive: Hardware Comparison Matrix

Platforms Evaluated

Platform                  | RAM            | Bandwidth  | Power   | Noise     | Price (SEK)
Mac Studio M4 Max 128GB   | 128 GB unified | 546 GB/s   | ~120W   | Silent    | ~40,000
Mac Studio M3 Ultra 256GB | 256 GB unified | 819 GB/s   | ~180W   | Silent    | ~95,000
RTX 5090 PC               | 32 GB VRAM     | 1,792 GB/s | ~950W   | Very loud | ~54,000
RTX 4090 PC (used)        | 24 GB VRAM     | 1,008 GB/s | ~750W   | Loud      | ~36,500
Dual RTX 5090             | 64 GB VRAM     | 3,584 GB/s | ~1,500W | Extreme   | ~106,000
MacBook Air M4 32GB       | 32 GB unified  | 120 GB/s   | ~15W    | Silent    | ~20,000
Raspberry Pi 5 8GB        | 8 GB           | ~30 GB/s   | ~10W    | Silent    | ~1,200

Why Mac Studio over NVIDIA?

Unified memory is the differentiator. NVIDIA GPUs have higher bandwidth but are limited by VRAM — an RTX 5090 with 32 GB cannot run GPT-oss-120B (80 GB). You’d need dual GPUs with PCIe tensor parallelism, which adds complexity, noise, power draw, and costs more than the Mac.

The Mac Studio at 128 GB unified memory fits GPT-oss-120B + GLM-4.7-Flash simultaneously in a box that draws 120W and is silent. An equivalent NVIDIA setup draws 1,500W and sounds like a jet engine.

Over 3 years: electricity savings

Mac Studio (8h/day): ~150 SEK/month. RTX 5090 PC (8h/day): ~450 SEK/month. ~10,800 SEK saved over 36 months — that covers 25% of the Mac’s purchase price.
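
The arithmetic behind those figures, as a sketch (our reconstruction from the stated assumptions of 2 SEK/kWh and 8 h/day; the Mac's ~150 SEK/month presumably also counts idle draw outside that window):

```typescript
// Monthly electricity cost: kW * hours/day * 30 days * price per kWh.
function monthlySEK(watts: number, hoursPerDay: number, sekPerKWh: number): number {
  return (watts / 1000) * hoursPerDay * 30 * sekPerKWh;
}

// RTX 5090 PC at ~950 W: ≈ 456 SEK/month, matching the ~450 SEK figure.
// Savings at the stated monthlies: (450 - 150) * 36 months = 10,800 SEK.
```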

RTX 4090/5090: No NVLink

Neither the RTX 4090 nor RTX 5090 support NVLink. Multi-GPU model parallelism uses PCIe bandwidth (~25 GB/s practical), which is 24x slower than NVLink. It works for inference but limits throughput and adds latency.

What It Costs

Hardware pays for itself. Faster if you split with friends. The math is brutal for API providers.

Mac Studio M5 Max* 128 GB — ~40,000 SEK est. — amortized over 36 months + electricity. Based on actual usage: ~1.9B tokens/month.

Current Plan: Claude MAX, SEK 1,000/mo. Runs out of compute.
API Alternative: OpenRouter, SEK 4,800/mo. Full volume, no limits.
Solo Hardware: Mac Studio 128 GB, SEK 1,186/mo. Break-even: 8 months.

Technical Deep Dive: Full Cost Model

Assumptions

Based on actual measured token usage: ~1.9B tokens/month (daily: ~115M, weekly: ~740M). Usage split: 30% input tokens, 70% output tokens. Hardware depreciation: 36 months. Electricity: 2 SEK/kWh, 8h/day average load.

API costs at actual volume (SEK/month)

Model         | Input price/M | Output price/M | Monthly (1.9B tok)
Qwen3-32B     | $0.10         | $0.30          | ~4,600 SEK
GPT-oss-120B  | $0.30         | $1.20          | ~18,000 SEK
Qwen3-14B     | $0.14         | $0.56          | ~8,500 SEK
GLM-4.7-Flash | $0.06         | $0.40          | ~5,300 SEK

Even the cheapest model (Qwen3-32B) costs 4,600 SEK/month. The Mac Studio at 1,186 SEK/month is 4x cheaper.
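
How those monthly figures fall out of the assumptions, as a sketch (the ~10 SEK/USD exchange rate is our assumption, not stated in the source):

```typescript
// Monthly API cost: split the token volume into input/output shares,
// price each at its per-million rate, convert USD to SEK.
function monthlyApiSEK(
  tokensPerMonthM: number, // total tokens per month, in millions
  inputShare: number,      // e.g. 0.3 for the 30/70 split
  usdPerMInput: number,
  usdPerMOutput: number,
  sekPerUsd = 10,          // assumed FX rate
): number {
  const usd =
    tokensPerMonthM * inputShare * usdPerMInput +
    tokensPerMonthM * (1 - inputShare) * usdPerMOutput;
  return usd * sekPerUsd;
}

// Qwen3-32B: 1,900M tokens, 30/70 split, $0.10/$0.30 → ≈ 4,560 SEK (~4,600).
```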

Break-even sensitivity

Usage                | API/month (Qwen3-32B) | HW/month  | Break-even
500M tok/mo (light)  | ~1,200 SEK            | 1,186 SEK | Never (roughly equal)
1B tok/mo            | ~2,400 SEK            | 1,186 SEK | ~33 months
1.9B tok/mo (actual) | ~4,600 SEK            | 1,186 SEK | ~12 months
3B tok/mo (co-op)    | ~7,300 SEK            | 1,186 SEK | ~7 months

What’s NOT in the cost model

Not included: time spent on setup/maintenance, opportunity cost of waiting for tasks to complete locally (slower), the value of privacy (hard to quantify), and the resale value of the hardware after 3 years.

The Setup

Two options depending on your priorities: maximum quality or maximum headroom.

Mac Studio M5 Max* · 128 GB Unified Memory · ~614 GB/s bandwidth · Silent · ~120W

Option A: Maximum Quality (solo model)

Single Model — Best Score in the Evaluation
GPT-oss-120B
RAM: ~80 GB (native MXFP4)
Active: 5.1B params (MoE)
Score: 1.89 / 2.0 — near-perfect
Est. speed: ~55 tok/s
Concurrent users: 5
Free after OS: ~28 GB — works but no room for a second model

Best quality, but you're all-in on one model. Swap models by unloading and loading as needed (Ollama handles this automatically, but each swap incurs a ~5-10s cold start).
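
The cold-start penalty can be managed with Ollama's keep_alive setting, which controls how long a model stays resident after a request. A minimal sketch of the request payload (the helper name is ours; see Ollama's API docs for the authoritative shape):

```typescript
// Sketch: keep a model loaded in Ollama after a request via keep_alive.
// keep_alive accepts a duration string ("30m", "24h") or -1 to keep the
// model resident indefinitely; the default unloads after ~5 minutes.
function generatePayload(model: string, prompt: string, keepAlive: string | number = "30m") {
  return {
    model,
    prompt,
    stream: false,
    keep_alive: keepAlive, // how long the model stays loaded after this call
  };
}

// Send with:
// fetch("http://localhost:11434/api/generate", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(generatePayload("gpt-oss:120b", "hello", -1)),
// });
```

For an always-on server, setting keep_alive to -1 (or the OLLAMA_KEEP_ALIVE environment variable) makes the loaded model behave like Option B's "always hot" setup.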

Option B: Dual-Model Setup (recommended for co-op)

Quality — Harder Tasks
Qwen3-Coder-Next
RAM: ~46 GB
Active: 3B params (MoE)
Score: 1.62 / 2.0
Est. speed: ~90 tok/s
Concurrent users: 9
Utility — Fast Tasks
GLM-4.7-Flash
RAM: ~4 GB
Active: ~3B params (MoE)
Score: 1.64 / 2.0
Est. speed: ~90 tok/s
Concurrent users: 9

Total: ~50 GB loaded · ~58 GB free after OS. Rock solid — no swap risk, handles multiple concurrent users, both models always hot. Slightly lower peak quality (1.62-1.64) but much more practical for daily use and the hackathon booth.

Option C: The 256 GB Beast

Or forget the trade-offs entirely. Load everything. Run the best model and three more alongside it.

Flagship — Near-Perfect Quality
GPT-oss-120B
RAM: ~80 GB
Score: 1.89 / 2.0
Est. speed: ~70 tok/s (819 GB/s on Ultra)
Coding — Ultra-Fast MoE
Qwen3-Coder-Next
RAM: ~46 GB
Score: 1.62 / 2.0
Est. speed: ~130 tok/s
Utility — Quick Tasks
GLM-4.7-Flash
RAM: ~4 GB
Score: 1.64 / 2.0
Est. speed: ~130 tok/s
Reasoning — 256 GB Exclusive
MiniMax-M2.5
RAM: ~100 GB
Score: 1.68 / 2.0
Only runs on 256 GB

Total: ~230 GB loaded on a Mac Studio M5 Ultra* 256 GB (~95,000 SEK). All four models hot simultaneously. No compromises, no swapping. The everything machine.
Co-op cost at 4 people: ~720 SEK/person/month — still cheaper than one Claude MAX subscription.

* M5 Max / M5 Ultra not yet announced. Specs estimated from Apple Silicon generational improvements. All benchmarks run on M4 equivalent (via OpenRouter). Plan: buy when M5 ships (summer 2026).

What Moves Local?

Half your work doesn't need the cloud. Refactoring, debugging, and docs score perfect locally.

Analysis of 6,567 actual Claude Code prompts. ~50% of work can run locally.


Runs Locally

  • Refactoring: 2.0
  • Debugging: 2.0
  • Documentation: 2.0
  • Commit messages: 2.0
  • Status reports: 2.0
  • Simple coding: 1.5
  • Swedish text: 1.5
  • Non-coding assistant: 1.7

Stays on Cloud

  • Complex architecture: 1.0
  • Research (web search): N/A
  • Multi-turn creative: varies
  • Long context (>128K): N/A
  • Strategic planning: varies

FAQ

What's done, what's not, and what we're honest about not knowing yet.

What is the current status of this evaluation?
[Done] Phase A — Cloud quality screening (12 models, 30 tasks, 758 judge evaluations)
[Done] Three-tier testing (Cloud / MacBook Air M4 / Raspberry Pi 5)
[Done] Cost analysis and hardware comparison
[Done] Real-world task mining from 6,567 actual prompts
[Pending] Phase B — Validate GPT-oss-120B running locally on real Apple Silicon
[Pending] Phase B — Concurrent user load testing (1/3/5 simultaneous sessions)
[Pending] Phase B — Q4 vs Q8 quantization quality comparison on real hardware
[Planned] Agentic workflow testing (aider + Ollama for multi-turn coding)
[Planned] M5 Max evaluation (when Apple announces, expected summer 2026)
How reliable are these scores?
The scores are directional, not definitive. 30 tasks is enough to rank models confidently but not enough for precise per-category claims (some categories have only 2-4 tasks). The dual-judge approach (Opus + o4-mini) with 100% agreement adds confidence. Cloud scores are an upper bound — local quantized inference may score slightly lower, though our Air test showed minimal degradation for dense models.
Why not just use existing benchmarks (HumanEval, SWE-Bench)?
Existing benchmarks test generic capabilities. This evaluation tests your actual use cases — Swedish email extraction, MCP tool generation, Fortnox invoice refactoring, project status summaries. The question isn’t "which model is smartest?" but "which model can do my work well enough that I stop paying for API tokens?"
What about fine-tuning?
Not evaluated. Fine-tuning could improve task-specific quality (especially for Swedish text and domain-specific patterns like Fortnox API formats), but it adds complexity and maintenance burden. The base models scored well enough on most tasks without fine-tuning.
Is GPT-oss-120B actually runnable locally?
Yes, but with caveats. People have run it on Apple Silicon via Ollama (ollama pull gpt-oss:120b), llama.cpp (native MXFP4), and MLX (via OpenHarmony-MLX). The model needs ~80 GB RAM, leaving ~28 GB free on a 128 GB Mac Studio — tight but functional as a solo model.

Real-world performance: Best reported is ~40 tok/s on optimized MLX, but practical experience is often slower. The Metal reference implementation from OpenAI is still experimental, not production-grade. Most Mac users actually run the smaller GPT-oss-20B (13 GB) instead.

Our take: GPT-oss-120B scored highest in our evaluation (1.89/2.0), but the local runtime is not yet battle-tested. Option B (Qwen3-Coder-Next + GLM-4.7-Flash) is the safer recommendation — both models are confirmed running reliably on Apple Silicon, use only 50 GB total, and leave comfortable headroom. If GPT-oss-120B matures, it becomes Option A.
What if I have different tasks?
The framework is open-source and designed for custom tasks. You can add your own task definitions and re-run the evaluation. See the section below for how to submit tasks.
What’s the plan ahead?
Short term (April 2026): Rent a Mac Studio for a weekend, validate GPT-oss-120B locally, run concurrent load tests, measure real tok/s and TTFT.
Medium term (Summer 2026): Wait for M5 Max announcement. If bandwidth improves to 614+ GB/s, that’s 12% faster inference for free. Buy the hardware, deploy for the co-op.
Long term: Build the hackathon booth. Host vibe-coding workshops. Run a local inference service for 3-5 people at 300 SEK/person/month.

Submit Your Own Tasks

Don't trust our benchmarks? Good. Add yours and we'll run them.

The framework is open-source. Fork it, add a task definition, open a PR. We'll run it against all 12 models and publish the results here.

How to contribute a task

1. Fork the repo
2. Add your task to src/tasks/real-world.ts (or create a new category file)
3. Follow the TaskDefinition interface:
{
  id: 'community-001',
  category: 'debugging',        // or: simple-coding, refactoring, architecture,
                                 //     multi-file, reasoning, non-coding
  title: 'Your Task Title',
  difficulty: 3,                 // 1-5
  maxTokens: 2000,
  tags: ['your', 'tags'],
  prompt: `Your full prompt here. Be specific.
Include any code snippets or context the model needs.`,
  expectedCapabilities: [
    'what a good answer should include',
    'another expected capability',
  ],
}
4. Run npx vitest run tests/tasks.test.ts to validate
5. Open a PR with a description of what your task tests and why it matters
6. We’ll run it against all 12 models, judge it, and add results to this page
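
Inferred from the sample above, the TaskDefinition interface presumably looks roughly like this. This is our reconstruction from the example's fields; check src/tasks/ in the repo for the authoritative definition:

```typescript
// Rough shape of TaskDefinition, inferred from the sample task above.
// The category union mirrors the categories listed in the comment.
type TaskCategory =
  | "simple-coding"
  | "refactoring"
  | "architecture"
  | "debugging"
  | "multi-file"
  | "reasoning"
  | "non-coding";

interface TaskDefinition {
  id: string;                     // e.g. 'community-001'
  category: TaskCategory;
  title: string;
  difficulty: 1 | 2 | 3 | 4 | 5;
  maxTokens: number;
  tags: string[];
  prompt: string;                 // the full prompt sent to each model
  expectedCapabilities: string[]; // what the judges look for in a good answer
}
```

Typing your task against an interface like this is also what the vitest validation step in step 4 checks for structurally.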

View the source on GitHub →

Timeline

From data to hardware. Three phases, six months, one machine.

MARCH–APRIL 2026

Phase A: Evaluate & Decide

[Done] Run 385 inference tests across 12 models and 3 hardware tiers. Share results. Collect feedback from co-op candidates. Accept community task submissions via PR. Validate GPT-oss-120B running locally on rented Mac Studio.

MAY–JUNE 2026

Phase B: Validate & Prepare

[Pending] Concurrent user load testing on real hardware (1/3/5 sessions). Agentic workflow testing with aider + Ollama. Quantization quality comparison (Q4 vs Q8). Finalize co-op group and cost-sharing agreement. Monitor M5 Max announcement from Apple.

JUNE–JULY 2026

Phase C: Buy & Deploy

[Planned] Mac Studio M5 Max 128 GB drops (expected). Purchase as co-op. Deploy GPT-oss-120B + GLM-4.7-Flash. Set up Ollama as always-on inference server. Host first hackathon booth — solstolar, fika, and local AI in the garden.

// The Invitation

Join the Co-op

300 SEK/month. Unlimited local inference. Full privacy. No rate limits. 5 concurrent users. Silent. Always on.

Get In Touch

* Mac Studio M5 Max — expected release summer 2026. Benchmarks based on M4 Max specs; M5 likely ~12% faster.