LLM Build Leaderboard
Each game on our roadmaps is built with multiple LLMs. This page tracks how each model performs — speed, quality, cost, and bugs — so you can pick the right model for your next build.
23 benchmark runs
| Score | Model | Provider | Game | Time | Tokens | Cost | Bugs | Notes |
|---|---|---|---|---|---|---|---|---|
| 95 | Mistral Small | Mistral API | AI Dodge | 12s | 1,754 | $0 | 0 | Cleanest output — zero bugs, 12s, all features working. |
| 95 | DeepSeek V4 Pro | DeepSeek API | 3D Space Dodge | 2s | 8,335 | $0.0750 | 0 | Fastest build (~2s). Largest output. Bot mode. Premium cost ($0.075). |
| 92 | DeepSeek V4 Flash (0.4 temp) | DeepSeek API | 3D Space Dodge | 30s | 8,000 | $0.0010 | 0 | Two prompts. Complete game + bot mode on first attempt. 680 lines. |
| 91 | Tencent Hy3 Preview | OpenRouter | 3D Space Dodge | 5s | 7,459 | $0 | 0 | Both prompts. Bot mode. Large output (16.1 KB). Fast free-tier inference. |
| 90 | DeepSeek V4 Flash (0.7 temp) | DeepSeek API | AI Dodge | 68s | 8,399 | $0.0012 | 0 | Largest output (429 lines, 15,487 chars). Full feature set. |
| 90 | Xiaomi Mimo V2.5 | OpenRouter | 3D Space Dodge | 2s | 7,970 | $0.0050 | 0 | Both prompts. Bot mode. 13.3 KB output. Competitive performance ($0.005). |
| 89 | Nemotron 3 Ultra (550B) | OpenRouter | 3D Space Dodge | 2s | 7,350 | $0.0030 | 0 | 550B model. ~2s inference. Bot mode working. $0.003 total. |
| 88 | owl-alpha | OpenRouter (free tier) | AI Dodge | 53s | 2,024 | $0 | 0 | Best free-tier output (8,272 chars). No bugs. |
| 88 | Mistral Small | Mistral API | 3D Space Dodge | 21s | 6,689 | $0 | 0 | Two prompts. Clean Three.js output. Bot mode working. Fast build. |
| 85 | owl-alpha | OpenRouter (free tier) | 3D Space Dodge | 6s | 8,913 | $0 | 0 | Two prompts. Fast build (6s total). Bot mode working. |
| 82 | DeepSeek V4 Flash (0.4 temp) | DeepSeek API | AI Dodge | 14s | 2,077 | $0.0005 | 3 | Interactive build across 5 prompts. Fast cloud time (14s). |
| 82 | Gemma-4-31B | OpenRouter (free tier) | 3D Space Dodge | 2s | 4,032 | $0 | 0 | Both prompts. Bot mode working. 31B model, fast free-tier inference. |
| 80 | poolside/laguna-m.1 | OpenRouter (free tier) | 3D Space Dodge | 2s | 7,796 | $0 | 0 | PASS after fix: Model used MeshBasicMaterial (no emissive) instead of MeshStandardMaterial. One-word fix. Bot mode working. |
| 78 | openai/gpt-oss-20b | Local — Mac Mini M4 | AI Dodge | 23s | 1,823 | $0 | 2 | Best local model. Fast (23s), functional. Speed + double-R bugs. |
| 76 | Nemotron 3 Ultra | OpenRouter (free tier) | AI Dodge | 86s | 2,015 | $0 | 1 | Most compact (132 lines). May lack edge wrapping. |
| 74 | Qwen 3.5 9B | Local — Mac Mini M4 | AI Dodge | 190s | 3,210 | $0 | 1 | Best local code quality. Clean structure, few bugs. |
| 72 | Llama 3.1 8B | Local — Mac Mini M4 | AI Dodge | 55s | 1,220 | $0 | 1 | Most reliable local fallback. Sparse but functional. |
| 70 | Gemma-4-12b-qat | Local — Mac Mini M4 | AI Dodge | 281s | 3,563 | $0 | 0 | Slowest AI Dodge run (281s). Output adequate but not proportional to the build time. |
| 70 | Gemma-4-12b-qat | Local — Apple Mac Mini M4 (16GB) | 3D Space Dodge | 691s | 8,431 | $0 | 0 | Slowest benchmark (11.5 min). Both prompts succeeded. Bot mode working. |
| 68 | Gemma-4-12b-coder-fable | Local — Mac Mini M4 | AI Dodge | 112s | 1,463 | $0 | 1 | Shortest working output (3,451 chars). Missing some features. |
| 68 | Qwen 3.5 9B | Local — Apple Mac Mini M4 (16GB) | 3D Space Dodge | 462s | 5,435 | $0 | 6 | DEGRADED: 6 const reassignment errors at runtime. WebGL renders (680x480). Game runs, but the errors degrade gameplay. 40% of tokens spent on chain-of-thought reasoning. |
| 65 | DeepSeek V4 Pro | DeepSeek API | AI Dodge | 101s | 8,399 | $0.0012 | 0 | Spent most tokens on reasoning. Only 50 lines of actual game code. |
| 60 | GPT-OSS-120B | OpenRouter (free tier) | 3D Space Dodge | 2s | 3,292 | $0 | 0 | Prompt 1 OK. Prompt 2 failed: no bot mode, glow, or boundary. Instruction-following gap. |
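
One way to use the table when picking a model is to filter by game, cost, and bug count, then rank what remains by score and build time. The sketch below does this for a handful of rows copied from the table. The names `Run`, `RUNS`, and `pick_models`, and the default thresholds, are illustrative assumptions and not part of any tooling behind this leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    score: int
    model: str
    game: str
    seconds: int
    cost_usd: float
    bugs: int


# A subset of rows copied from the table above (hypothetical structure).
RUNS = [
    Run(95, "DeepSeek V4 Pro", "3D Space Dodge", 2, 0.0750, 0),
    Run(92, "DeepSeek V4 Flash (0.4 temp)", "3D Space Dodge", 30, 0.0010, 0),
    Run(91, "Tencent Hy3 Preview", "3D Space Dodge", 5, 0.0, 0),
    Run(90, "Xiaomi Mimo V2.5", "3D Space Dodge", 2, 0.0050, 0),
    Run(68, "Qwen 3.5 9B", "3D Space Dodge", 462, 0.0, 6),
    Run(95, "Mistral Small", "AI Dodge", 12, 0.0, 0),
    Run(82, "DeepSeek V4 Flash (0.4 temp)", "AI Dodge", 14, 0.0005, 3),
]


def pick_models(runs, game, max_cost_usd=0.0, max_bugs=0):
    """Return runs for `game` within the cost and bug limits, best first."""
    eligible = [
        r for r in runs
        if r.game == game and r.cost_usd <= max_cost_usd and r.bugs <= max_bugs
    ]
    # Highest score first; ties are broken by the shorter build time.
    return sorted(eligible, key=lambda r: (-r.score, r.seconds))


if __name__ == "__main__":
    # Allow up to one cent per build for the 3D game, with zero bugs.
    for r in pick_models(RUNS, "3D Space Dodge", max_cost_usd=0.01):
        print(f"{r.score}  {r.model}  {r.seconds}s  ${r.cost_usd:.4f}")
```

Leaving `max_cost_usd` at its default of `0.0` restricts the result to runs that cost nothing, which is a quick way to compare the free and local options for a given game.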