LOCAL INFERENCE
MACHINE

Why pay for tokens when you can own the factory?

Burning through your AI subscription every month and still hitting rate limits? Curious about running models on your own hardware but unsure if the quality holds up, what it actually costs, or whether it's worth the investment?

We tested 12 models across 30 real tasks on three hardware tiers — from a Raspberry Pi to a Mac Studio — so you don't have to guess.

12 Models Tested · 385 Inference Runs · 758 Judge Evaluations · 3 Hardware Tiers

The Leaderboard

We threw 30 real tasks at 12 open-weight models. Two AI judges scored every answer.

30 tasks. Dual LLM judges (Claude Opus + OpenAI o4-mini). Scale: 0 = fail, 1 = acceptable, 2 = good.

#  | Model             | Score | 128 GB | 256 GB | Active Params | Avg Time
1  | GPT-oss-120B      | 1.89  | Yes*   | Yes    | 5.1B MoE      | 102s
2  | Qwen3-235B-MoE    | 1.68  | No     | Yes    | 22B MoE       | 159s
3  | MiniMax-M2.5      | 1.68  | No     | Yes    | 15B MoE       | 62s
4  | GLM-4.7-Flash     | 1.64  | Yes    | Yes    | ~3B MoE       | 65s
5  | Qwen3-Coder-Next  | 1.62  | Yes    | Yes    | 3B MoE        | 19s
5  | Qwen3-32B         | 1.62  | Yes    | Yes    | 32B dense     | 443s
5  | Devstral-2        | 1.62  | Yes*   | Yes    | 123B dense    | 14s
8  | Qwen3-14B         | 1.51  | Yes    | Yes    | 14B dense     | 84s
9  | Nemotron-3-Super  | 1.51  | Yes*   | Yes    | 12B hybrid    | 7.5s
10 | DS-R1-32B         | 1.35  | Yes    | Yes    | 32B dense     | 82s
11 | Llama 3.3-70B     | 1.33  | No     | Yes    | 70B dense     | 66s
12 | Qwen2.5-Coder-32B | 0.33  | Yes    | Yes    | 32B dense     | 5.4s

* Fits as sole model only. No room for a second model alongside it after macOS overhead (~20 GB).

  • Dual LLM Judge: Every response is judged independently by Claude Opus 4.5 and OpenAI o4-mini. Using two model families reduces systematic bias. Disagreements >1.5 trigger manual review.
  • 30 Tasks: 15 generic coding tasks (algorithms, refactoring, debugging, architecture), 5 non-coding (summarization, extraction, planning), and 10 real-world tasks mined from 6,567 actual Claude Code prompts.
  • Coding + Non-Coding: Not just HumanEval. Tasks include Swedish email extraction, MCP tool generation, Fortnox API refactoring, project status reports, and slide content generation.
  • Real-World Prompts: 10 tasks derived from analyzing the user's actual prompt history: commit messages (8% of prompts), debugging (5%), documentation (8%), status reports (7%), Swedish tasks (5%).
  • 100% Judge Agreement: Opus and o4-mini agreed on the score for every single evaluation in this batch. This is unusually high — likely because the 3-point scale (fail/acceptable/good) is coarser than typical 1-5 scales.
Technical Deep Dive: Evaluation Methodology

Pipeline Architecture

[Task Definitions] ---> [Runner (OpenRouter / Ollama)] ---> [SQLite DB]
                                                                    |
[Analysis Reports] <--- [Judge (Opus + o4-mini)] <-----------------+

Each phase is a separate CLI command. All state is persisted in SQLite (WAL mode). Runs are idempotent — failed or interrupted runs can be resumed with --resume without re-billing completed tasks.
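
The resume behavior can be sketched as a simple set-difference over completed (model, task) pairs. This is an illustrative reconstruction, not the repo's actual code; the names `RunRow` and `remainingRuns` are ours:

```typescript
// Hypothetical sketch of --resume: completed (model, task) pairs are read
// from the SQLite run table and filtered out before any new API calls are
// billed. Re-running after an interruption only executes the remainder.
type RunRow = { model: string; taskId: string };

function remainingRuns(
  allModels: string[],
  allTasks: string[],
  completed: RunRow[],
): RunRow[] {
  const done = new Set(completed.map((r) => `${r.model}::${r.taskId}`));
  const pending: RunRow[] = [];
  for (const model of allModels) {
    for (const taskId of allTasks) {
      if (!done.has(`${model}::${taskId}`)) pending.push({ model, taskId });
    }
  }
  return pending;
}
```

Because the completed set is keyed on (model, task), a crash mid-batch costs nothing: the next invocation recomputes the pending list and continues.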

Judging Protocol

Every model response is scored by two independent judges from different model families:

Judge   | Model                            | Why
Judge A | Claude Opus 4.5 (via OpenRouter) | Strong on idiomatic code quality
Judge B | OpenAI o4-mini (via OpenRouter)  | Reasoning model, strong on correctness

Scoring uses a 3-point scale across 3 dimensions:

Score          | Correctness         | Completeness      | Quality
fail (0)       | Broken or wrong     | Missing key parts | Unusable structure
acceptable (1) | Works, some gaps    | Most covered      | OK but room for improvement
good (2)       | Correct, edge cases | All addressed     | Professional quality

Judge agreement on this evaluation: 100%. Disagreements >1.5 on any dimension trigger manual review (none were triggered).
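
The aggregation step can be sketched as follows (an illustrative reconstruction; the type and function names are ours, not from the repo): each judge scores three dimensions on 0..2, the final score averages the two judges, and any per-dimension gap above 1.5 flags the response for manual review.

```typescript
// Dual-judge aggregation sketch: mean of two judges across 3 dimensions,
// with a review flag when any single dimension disagrees by more than 1.5.
type Dims = { correctness: number; completeness: number; quality: number };

function aggregate(a: Dims, b: Dims): { score: number; needsReview: boolean } {
  const dims: (keyof Dims)[] = ["correctness", "completeness", "quality"];
  // Average the two judges per dimension, then average across dimensions.
  const score = dims.reduce((s, d) => s + (a[d] + b[d]) / 2, 0) / dims.length;
  // On a 0..2 scale, a gap > 1.5 means one judge said fail and the other good.
  const needsReview = dims.some((d) => Math.abs(a[d] - b[d]) > 1.5);
  return { score, needsReview };
}
```

Note that on a 0..2 scale, the only way to trip the >1.5 threshold is a fail-vs-good split, which helps explain why no reviews were triggered.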

Task Design

30 tasks across 7 categories, sourced from two approaches:

20 generic tasks (designed to cover coding fundamentals): simple coding (4), refactoring (3), architecture (2), debugging (3), multi-file generation (2), reasoning (1), non-coding (5).

10 real-world tasks (mined from 6,567 actual Claude Code prompts): status reports, error debugging, commit messages, Express refactoring, Swedish email extraction, documentation, deployment debugging, slide generation, MCP tool implementation, fact-checking.

Real-world tasks were QA’d by Codex (GPT-5.4) in an adversarial review pass. 3 tasks were revised based on feedback (ambiguous dates, missing context, over-broad scope).

Proxy Validity

OpenRouter serves models at provider-chosen precision (typically FP16/BF16). Local inference uses Q4/Q8 quantization. Our three-tier test (Cloud vs Air vs Pi) showed Qwen3-14B quality holds within noise locally (1.48 local vs 1.55 cloud). However, highly sparse MoE models like GLM-4.7-Flash showed more degradation (1.07 local vs 1.70 cloud). All cloud scores should be treated as upper bounds.

Full source code, task definitions, and raw data on GitHub →

Technical Deep Dive: Why GPT-oss-120B Wins

GPT-oss-120B is OpenAI’s open-weight MoE model (Apache 2.0). Key properties:

Property                    | Value
Total parameters            | 117B
Active parameters per token | 5.1B (MoE routing)
Native format               | MXFP4 (~80 GB on disk)
Apple Silicon               | Metal reference implementation from OpenAI
Quality score               | 1.89 / 2.0 (near-perfect across all categories)

The MoE architecture means it reads only ~10 GB of weights per token, despite being a 117B model. On a Mac Studio M4 Max (546 GB/s bandwidth), that translates to ~55 tok/s — fast enough for interactive use.
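
The back-of-envelope behind that figure: autoregressive decoding is memory-bandwidth-bound, so tokens per second is roughly bandwidth divided by the bytes of weights read per token. A minimal sketch (our simplification; it ignores KV-cache traffic and prompt processing):

```typescript
// Rough decode-speed estimate for a bandwidth-bound model:
// tok/s ≈ memory bandwidth (GB/s) / active weights read per token (GB).
function estimateTokPerSec(bandwidthGBs: number, activeWeightsGB: number): number {
  return bandwidthGBs / activeWeightsGB;
}

// M4 Max: 546 GB/s over ~10 GB of active MoE weights ≈ 55 tok/s.
```

The same formula explains why MoE models punch above their weight locally: a dense 117B model would read ~8x more bytes per token and decode ~8x slower on the same machine.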

Caveat

This model has not been validated running locally via Ollama/MLX on Apple Silicon. The MXFP4 format and Metal implementation exist but real-world performance needs Phase B testing on actual hardware. This is the single highest-risk item in our recommendation.

The Speed Test

One task. One model. Cloud, laptop, and a Raspberry Pi. Same quality — wildly different speeds.

Task: "Summarize this project's status and prioritize next steps." Best available model per tier.

OpenRouter (Cloud API · Qwen3-14B): 12s · Score 2.0 / 2.0 · ~$0.001/task
MacBook Air M4 (32 GB · 120 GB/s · Qwen3-14B): 238s · Score 2.0 / 2.0 · 6.5 tok/s · Free
Raspberry Pi 5 (8 GB · CPU only · Qwen2.5-7B): 559s · Score 2.0 / 2.0 · 0.6 tok/s · Free

Same quality everywhere. 20x–50x slower locally — but free and private.
On a Mac Studio (546 GB/s), those 238s become ~60s.
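
That projection falls out of the bandwidth-bound model: wall-clock time scales inversely with memory bandwidth. A quick sketch (our simplification, ignoring fixed prompt-processing overhead, which is why the text rounds up to ~60s):

```typescript
// For bandwidth-bound decoding, runtime scales inversely with bandwidth.
function scaleRuntime(seconds: number, fromGBs: number, toGBs: number): number {
  return seconds * (fromGBs / toGBs);
}

// MacBook Air (120 GB/s) took 238s; scaled to a Mac Studio (546 GB/s)
// that is ~52s, consistent with the ~60s estimate above.
```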

Technical Deep Dive: Hardware Comparison Matrix

Platforms Evaluated

Platform                  | RAM            | Bandwidth  | Power   | Noise     | Price (SEK)
Mac Studio M4 Max 128GB   | 128 GB unified | 546 GB/s   | ~120W   | Silent    | ~40,000
Mac Studio M3 Ultra 256GB | 256 GB unified | 819 GB/s   | ~180W   | Silent    | ~95,000
RTX 5090 PC               | 32 GB VRAM     | 1,792 GB/s | ~950W   | Very loud | ~54,000
RTX 4090 PC (used)        | 24 GB VRAM     | 1,008 GB/s | ~750W   | Loud      | ~36,500
Dual RTX 5090             | 64 GB VRAM     | 3,584 GB/s | ~1,500W | Extreme   | ~106,000
MacBook Air M4 32GB       | 32 GB unified  | 120 GB/s   | ~15W    | Silent    | ~20,000
Raspberry Pi 5 8GB        | 8 GB           | ~30 GB/s   | ~10W    | Silent    | ~1,200

Why Mac Studio over NVIDIA?

Unified memory is the differentiator. NVIDIA GPUs have higher bandwidth but are limited by VRAM — an RTX 5090 with 32 GB cannot run GPT-oss-120B (80 GB). You’d need dual GPUs with PCIe tensor parallelism, which adds complexity, noise, power draw, and costs more than the Mac.

The Mac Studio at 128 GB unified memory fits GPT-oss-120B + GLM-4.7-Flash simultaneously in a box that draws 120W and is silent. An equivalent NVIDIA setup draws 1,500W and sounds like a jet engine.

Over 3 years: electricity savings

Mac Studio (8h/day): ~150 SEK/month. RTX 5090 PC (8h/day): ~450 SEK/month. ~10,800 SEK saved over 36 months — that covers 25% of the Mac’s purchase price.
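
The arithmetic behind those figures, as a sketch (our reconstruction from the stated assumptions of 2 SEK/kWh and 8 h/day; the Mac's ~150 SEK/month presumably also counts idle draw outside that window):

```typescript
// Monthly electricity cost: kW * hours/day * 30 days * price per kWh.
function monthlySEK(watts: number, hoursPerDay: number, sekPerKWh: number): number {
  return (watts / 1000) * hoursPerDay * 30 * sekPerKWh;
}

// RTX 5090 PC at ~950 W: ≈ 456 SEK/month, matching the ~450 SEK figure.
// Savings at the stated monthlies: (450 - 150) * 36 months = 10,800 SEK.
```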

RTX 4090/5090: No NVLink

Neither the RTX 4090 nor RTX 5090 support NVLink. Multi-GPU model parallelism uses PCIe bandwidth (~25 GB/s practical), which is 24x slower than NVLink. It works for inference but limits throughput and adds latency.

What It Costs

Hardware pays for itself. Faster if you split with friends. The math is brutal for API providers.

Mac Studio M5 Max* 128 GB — ~40,000 SEK est. — amortized over 36 months + electricity. Based on actual usage: ~1.9B tokens/month.

Current Plan: Claude MAX, SEK 1,000/mo. Runs out of compute.
API Alternative: OpenRouter, SEK 4,800/mo. Full volume, no limits.
Solo Hardware: Mac Studio 128 GB, SEK 1,186/mo. Break-even: 8 months.

Technical Deep Dive: Full Cost Model

Assumptions

Based on actual measured token usage: ~1.9B tokens/month (daily: ~115M, weekly: ~740M). Usage split: 30% input tokens, 70% output tokens. Hardware depreciation: 36 months. Electricity: 2 SEK/kWh, 8h/day average load.

API costs at actual volume (SEK/month)

Model         | Input price/M | Output price/M | Monthly (1.9B tok)
Qwen3-32B     | $0.10         | $0.30          | ~4,600 SEK
GPT-oss-120B  | $0.30         | $1.20          | ~18,000 SEK
Qwen3-14B     | $0.14         | $0.56          | ~8,500 SEK
GLM-4.7-Flash | $0.06         | $0.40          | ~5,300 SEK

Even the cheapest model (Qwen3-32B) costs 4,600 SEK/month. The Mac Studio at 1,186 SEK/month is 4x cheaper.
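
How those monthly figures fall out of the assumptions, as a sketch (the ~10 SEK/USD exchange rate is our assumption, not stated in the source):

```typescript
// Monthly API cost: split the token volume into input/output shares,
// price each at its per-million rate, convert USD to SEK.
function monthlyApiSEK(
  tokensPerMonthM: number, // total tokens per month, in millions
  inputShare: number,      // e.g. 0.3 for the 30/70 split
  usdPerMInput: number,
  usdPerMOutput: number,
  sekPerUsd = 10,          // assumed FX rate
): number {
  const usd =
    tokensPerMonthM * inputShare * usdPerMInput +
    tokensPerMonthM * (1 - inputShare) * usdPerMOutput;
  return usd * sekPerUsd;
}

// Qwen3-32B: 1,900M tokens, 30/70 split, $0.10/$0.30 → ≈ 4,560 SEK (~4,600).
```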

Break-even sensitivity

Usage                | API/month (Qwen3-32B) | HW/month  | Break-even
500M tok/mo (light)  | ~1,200 SEK            | 1,186 SEK | Never (roughly equal)
1B tok/mo            | ~2,400 SEK            | 1,186 SEK | ~33 months
1.9B tok/mo (actual) | ~4,600 SEK            | 1,186 SEK | ~12 months
3B tok/mo (co-op)    | ~7,300 SEK            | 1,186 SEK | ~7 months

What’s NOT in the cost model

Not included: time spent on setup/maintenance, opportunity cost of waiting for tasks to complete locally (slower), the value of privacy (hard to quantify), and the resale value of the hardware after 3 years.

The Setup

Two options depending on your priorities: maximum quality or maximum headroom.

Mac Studio M5 Max* · 128 GB Unified Memory · ~614 GB/s bandwidth · Silent · ~120W

Option A: Maximum Quality (solo model)

Single Model — Best Score in the Evaluation
GPT-oss-120B
RAM: ~80 GB (native MXFP4)
Active: 5.1B params (MoE)
Score: 1.89 / 2.0 — near-perfect
Est. speed: ~55 tok/s
Concurrent users: 5
Free after OS: ~28 GB — works but no room for a second model

Best quality, but you're all-in on one model. Swap models by unloading and loading as needed (Ollama handles this automatically, but each swap incurs a ~5-10s cold start).
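
The cold-start penalty can be managed with Ollama's keep_alive setting, which controls how long a model stays resident after a request. A minimal sketch of the request payload (the helper name is ours; see Ollama's API docs for the authoritative shape):

```typescript
// Sketch: keep a model loaded in Ollama after a request via keep_alive.
// keep_alive accepts a duration string ("30m", "24h") or -1 to keep the
// model resident indefinitely; the default unloads after ~5 minutes.
function generatePayload(model: string, prompt: string, keepAlive: string | number = "30m") {
  return {
    model,
    prompt,
    stream: false,
    keep_alive: keepAlive, // how long the model stays loaded after this call
  };
}

// Send with:
// fetch("http://localhost:11434/api/generate", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(generatePayload("gpt-oss:120b", "hello", -1)),
// });
```

For an always-on server, setting keep_alive to -1 (or the OLLAMA_KEEP_ALIVE environment variable) makes the loaded model behave like Option B's "always hot" setup.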

Option B: Dual-Model Setup (recommended for co-op)

Quality — Harder Tasks
Qwen3-Coder-Next
RAM: ~46 GB
Active: 3B params (MoE)
Score: 1.62 / 2.0
Est. speed: ~90 tok/s
Concurrent users: 9
Utility — Fast Tasks
GLM-4.7-Flash
RAM: ~4 GB
Active: ~3B params (MoE)
Score: 1.64 / 2.0
Est. speed: ~90 tok/s
Concurrent users: 9

Total: ~50 GB loaded · ~58 GB free after OS. Rock solid — no swap risk, handles multiple concurrent users, both models always hot. Slightly lower peak quality (1.62-1.64) but much more practical for daily use and the hackathon booth.

Option C: The 256 GB Beast

Or forget the trade-offs entirely. Load everything. Run the best model and three more alongside it.

Flagship — Near-Perfect Quality
GPT-oss-120B
RAM: ~80 GB
Score: 1.89 / 2.0
Est. speed: ~70 tok/s (819 GB/s on Ultra)
Coding — Ultra-Fast MoE
Qwen3-Coder-Next
RAM: ~46 GB
Score: 1.62 / 2.0
Est. speed: ~130 tok/s
Utility — Quick Tasks
GLM-4.7-Flash
RAM: ~4 GB
Score: 1.64 / 2.0
Est. speed: ~130 tok/s
Reasoning — 256 GB Exclusive
MiniMax-M2.5
RAM: ~100 GB
Score: 1.68 / 2.0
Only runs on 256 GB

Total: ~230 GB loaded on a Mac Studio M5 Ultra* 256 GB (~95,000 SEK). All four models hot simultaneously. No compromises, no swapping. The everything machine.
Co-op cost at 4 people: ~720 SEK/person/month — still cheaper than one Claude MAX subscription.

* M5 Max / M5 Ultra not yet announced. Specs estimated from Apple Silicon generational improvements. All benchmarks run on M4 equivalent (via OpenRouter). Plan: buy when M5 ships (summer 2026).

What Moves Local?

Half your work doesn't need the cloud. Refactoring, debugging, and docs score perfect locally.

Analysis of 6,567 actual Claude Code prompts. ~50% of work can run locally.


Runs Locally

  • Refactoring: 2.0
  • Debugging: 2.0
  • Documentation: 2.0
  • Commit messages: 2.0
  • Status reports: 2.0
  • Simple coding: 1.5
  • Swedish text: 1.5
  • Non-coding assistant: 1.7

Stays on Cloud

  • Complex architecture: 1.0
  • Research (web search): N/A
  • Multi-turn creative: varies
  • Long context (>128K): N/A
  • Strategic planning: varies

FAQ

What's done, what's not, and what we're honest about not knowing yet.

What is the current status of this evaluation?
[Done] Phase A — Cloud quality screening (12 models, 30 tasks, 758 judge evaluations)
[Done] Three-tier testing (Cloud / MacBook Air M4 / Raspberry Pi 5)
[Done] Cost analysis and hardware comparison
[Done] Real-world task mining from 6,567 actual prompts
[Pending] Phase B — Validate GPT-oss-120B running locally on real Apple Silicon
[Pending] Phase B — Concurrent user load testing (1/3/5 simultaneous sessions)
[Pending] Phase B — Q4 vs Q8 quantization quality comparison on real hardware
[Planned] Agentic workflow testing (aider + Ollama for multi-turn coding)
[Planned] M5 Max evaluation (when Apple announces, expected summer 2026)
How reliable are these scores?
The scores are directional, not definitive. 30 tasks is enough to rank models confidently but not enough for precise per-category claims (some categories have only 2-4 tasks). The dual-judge approach (Opus + o4-mini) with 100% agreement adds confidence. Cloud scores are an upper bound — local quantized inference may score slightly lower, though our Air test showed minimal degradation for dense models.
Why not just use existing benchmarks (HumanEval, SWE-Bench)?
Existing benchmarks test generic capabilities. This evaluation tests your actual use cases — Swedish email extraction, MCP tool generation, Fortnox invoice refactoring, project status summaries. The question isn’t "which model is smartest?" but "which model can do my work well enough that I stop paying for API tokens?"
What about fine-tuning?
Not evaluated. Fine-tuning could improve task-specific quality (especially for Swedish text and domain-specific patterns like Fortnox API formats), but it adds complexity and maintenance burden. The base models scored well enough on most tasks without fine-tuning.
Is GPT-oss-120B actually runnable locally?
Yes, but with caveats. People have run it on Apple Silicon via Ollama (ollama pull gpt-oss:120b), llama.cpp (native MXFP4), and MLX (via OpenHarmony-MLX). The model needs ~80 GB RAM, leaving ~28 GB free on a 128 GB Mac Studio — tight but functional as a solo model.

Real-world performance: Best reported is ~40 tok/s on optimized MLX, but practical experience is often slower. The Metal reference implementation from OpenAI is still experimental, not production-grade. Most Mac users actually run the smaller GPT-oss-20B (13 GB) instead.

Our take: GPT-oss-120B scored highest in our evaluation (1.89/2.0), but the local runtime is not yet battle-tested. Option B (Qwen3-Coder-Next + GLM-4.7-Flash) is the safer recommendation — both models are confirmed running reliably on Apple Silicon, use only 50 GB total, and leave comfortable headroom. If GPT-oss-120B matures, it becomes Option A.
What if I have different tasks?
The framework is open-source and designed for custom tasks. You can add your own task definitions and re-run the evaluation. See the section below for how to submit tasks.
What’s the plan ahead?
Short term (April 2026): Rent a Mac Studio for a weekend, validate GPT-oss-120B locally, run concurrent load tests, measure real tok/s and TTFT.
Medium term (Summer 2026): Wait for M5 Max announcement. If bandwidth improves to 614+ GB/s, that’s 12% faster inference for free. Buy the hardware, deploy for the co-op.
Long term: Build the hackathon booth. Host vibe-coding workshops. Run a local inference service for 3-5 people at 300 SEK/person/month.

Submit Your Own Tasks

Don't trust our benchmarks? Good. Add yours and we'll run them.

The framework is open-source. Fork it, add a task definition, open a PR. We'll run it against all 12 models and publish the results here.

How to contribute a task

1. Fork the repo
2. Add your task to src/tasks/real-world.ts (or create a new category file)
3. Follow the TaskDefinition interface:
{
  id: 'community-001',
  category: 'debugging',        // or: simple-coding, refactoring, architecture,
                                 //     multi-file, reasoning, non-coding
  title: 'Your Task Title',
  difficulty: 3,                 // 1-5
  maxTokens: 2000,
  tags: ['your', 'tags'],
  prompt: `Your full prompt here. Be specific.
Include any code snippets or context the model needs.`,
  expectedCapabilities: [
    'what a good answer should include',
    'another expected capability',
  ],
}
4. Run npx vitest run tests/tasks.test.ts to validate
5. Open a PR with a description of what your task tests and why it matters
6. We’ll run it against all 12 models, judge it, and add results to this page
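
Inferred from the sample above, the TaskDefinition interface presumably looks roughly like this. This is our reconstruction from the example's fields; check src/tasks/ in the repo for the authoritative definition:

```typescript
// Rough shape of TaskDefinition, inferred from the sample task above.
// The category union mirrors the categories listed in the comment.
type TaskCategory =
  | "simple-coding"
  | "refactoring"
  | "architecture"
  | "debugging"
  | "multi-file"
  | "reasoning"
  | "non-coding";

interface TaskDefinition {
  id: string;                     // e.g. 'community-001'
  category: TaskCategory;
  title: string;
  difficulty: 1 | 2 | 3 | 4 | 5;
  maxTokens: number;
  tags: string[];
  prompt: string;                 // the full prompt sent to each model
  expectedCapabilities: string[]; // what the judges look for in a good answer
}
```

Typing your task against an interface like this is also what the vitest validation step in step 4 checks for structurally.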

View the source on GitHub →

Timeline

From data to hardware. Three phases, six months, one machine.

MARCH–APRIL 2026

Phase A: Evaluate & Decide

[Done] Run 385 inference tests across 12 models and 3 hardware tiers. Share results. Collect feedback from co-op candidates. Accept community task submissions via PR. Validate GPT-oss-120B running locally on rented Mac Studio.

MAY–JUNE 2026

Phase B: Validate & Prepare

[Pending] Concurrent user load testing on real hardware (1/3/5 sessions). Agentic workflow testing with aider + Ollama. Quantization quality comparison (Q4 vs Q8). Finalize co-op group and cost-sharing agreement. Monitor M5 Max announcement from Apple.

JUNE–JULY 2026

Phase C: Buy & Deploy

[Planned] Mac Studio M5 Max 128 GB drops (expected). Purchase as co-op. Deploy GPT-oss-120B + GLM-4.7-Flash. Set up Ollama as always-on inference server. Host first hackathon booth — solstolar, fika, and local AI in the garden.

// The Invitation

Join the Co-op

300 SEK/month. Unlimited local inference. Full privacy. No rate limits. 5 concurrent users. Silent. Always on.

Get In Touch

* Mac Studio M5 Max — expected release summer 2026. Benchmarks based on M4 Max specs; M5 likely ~12% faster.