LLM Build Leaderboard

Each game on our roadmaps is built with multiple LLMs. This page tracks how each model performs — speed, quality, cost, and bugs — so you can pick the right model for your next build.

23 benchmark runs
Score | Model | Provider | Game | Time | Tokens | Cost | Bugs | Notes
95 | Mistral Small | Mistral API | AI Dodge | 12s | 1,754 | $0 | 0 | Cleanest output: zero bugs, 12s, all features working.
95 | DeepSeek V4 Pro | DeepSeek API | 3D Space Dodge | 2s | 8,335 | $0.0750 | 0 | Fastest build (~2s). Largest output. Bot mode. Premium cost ($0.075).
92 | DeepSeek V4 Flash (0.4 temp) | DeepSeek API | 3D Space Dodge | 30s | 8,000 | $0.0010 | 0 | Two prompts. Complete game + bot mode on first attempt. 680 lines.
91 | Tencent Hy3 Preview | OpenRouter | 3D Space Dodge | 5s | 7,459 | $0 | 0 | Both prompts. Bot mode. Large output (16.1 KB). Fast free-tier inference.
90 | DeepSeek V4 Flash (0.7 temp) | DeepSeek API | AI Dodge | 68s | 8,399 | $0.0012 | 0 | Largest output (429 lines, 15,487 chars). Full feature set.
90 | Xiaomi Mimo V2.5 | OpenRouter | 3D Space Dodge | 2s | 7,970 | $0.0050 | 0 | Both prompts. Bot mode. 13.3 KB output. Competitive performance ($0.005).
89 | Nemotron 3 Ultra (550B) | OpenRouter | 3D Space Dodge | 2s | 7,350 | $0.0030 | 0 | 550B model. ~2s inference. Bot mode working. $0.003 total.
88 | owl-alpha | OpenRouter (free) | AI Dodge | 53s | 2,024 | $0 | 0 | Best free-tier output (8,272 chars). No bugs.
88 | Mistral Small | Mistral API | 3D Space Dodge | 21s | 6,689 | $0 | 0 | Two prompts. Clean Three.js output. Bot mode working. Fast build.
85 | owl-alpha | OpenRouter (free) | 3D Space Dodge | 6s | 8,913 | $0 | 0 | Two prompts. Fastest build (6s total). Bot mode working.
82 | DeepSeek V4 Flash (0.4 temp) | DeepSeek API | AI Dodge | 14s | 2,077 | $0.0005 | 3 | Interactive build across 5 prompts. Fastest cloud time.
82 | Gemma-4-31B | OpenRouter (free tier) | 3D Space Dodge | 2s | 4,032 | $0 | 0 | Both prompts. Bot mode working. 31B model, fast free-tier inference.
80 | poolside/laguna-m.1 | OpenRouter (free tier) | 3D Space Dodge | 2s | 7,796 | $0 | 0 | PASS after fix: Model used MeshBasicMaterial (no emissive) instead of MeshStandardMaterial. One-word fix. Bot mode working.
78 | openai/gpt-oss-20b | Local: Mac Mini M4 | AI Dodge | 23s | 1,823 | $0 | 2 | Best local model. Fast (23s), functional. Speed + double-R bugs.
76 | Nemotron 3 Ultra | OpenRouter (free) | AI Dodge | 86s | 2,015 | $0 | 1 | Most compact (132 lines). May lack edge wrapping.
74 | Qwen 3.5 9B | Local: Mac Mini M4 | AI Dodge | 190s | 3,210 | $0 | 1 | Best local code quality. Clean structure, few bugs.
72 | Llama 3.1 8B | Local: Mac Mini M4 | AI Dodge | 55s | 1,220 | $0 | 1 | Most reliable local fallback. Sparse but functional.
70 | Gemma-4-12b-qat | Local: Mac Mini M4 | AI Dodge | 281s | 3,563 | $0 | 0 | Slowest by far (281s). Output adequate but not proportional to time.
70 | Gemma-4-12b-qat | Local: Apple Mac Mini M4 (16GB) | 3D Space Dodge | 691s | 8,431 | $0 | 0 | Slowest benchmark (11.5 min). Both prompts succeeded. Bot mode working.
68 | Gemma-4-12b-coder-fable | Local: Mac Mini M4 | AI Dodge | 112s | 1,463 | $0 | 1 | Shortest working output (3,451 chars). Missing some features.
68 | Qwen 3.5 9B | Local: Apple Mac Mini M4 (16GB) | 3D Space Dodge | 462s | 5,435 | $0 | 6 | DEGRADED: 6 const reassignment errors at runtime. WebGL renders (680x480). Game runs but errors degrade gameplay. 40% tokens on CoT reasoning.
65 | DeepSeek V4 Pro | DeepSeek API | AI Dodge | 101s | 8,399 | $0.0012 | 0 | Spent tokens on reasoning. Only 50 lines actual game code.
60 | GPT-OSS-120B | OpenRouter (free tier) | 3D Space Dodge | 2s | 3,292 | $0 | 0 | Prompt 1 OK. Prompt 2 failed: no bot mode, glow, or boundary. Instruction gap.
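
One way to use these numbers when picking a model is to filter the runs by your own limits and then rank what is left. The sketch below is only an illustration: the `Run` dataclass, the `pick` helper, and its default thresholds are hypothetical names chosen here, not part of any leaderboard tooling, and the sample rows are a hand-copied subset of the 3D Space Dodge results above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark row (hypothetical structure, fields mirror the table)."""
    score: int
    model: str
    game: str
    seconds: int
    cost_usd: float
    bugs: int


# A few 3D Space Dodge rows copied by hand from the leaderboard above.
RUNS = [
    Run(95, "DeepSeek V4 Pro", "3D Space Dodge", 2, 0.0750, 0),
    Run(92, "DeepSeek V4 Flash (0.4 temp)", "3D Space Dodge", 30, 0.0010, 0),
    Run(91, "Tencent Hy3 Preview", "3D Space Dodge", 5, 0.0, 0),
    Run(88, "Mistral Small", "3D Space Dodge", 21, 0.0, 0),
    Run(68, "Qwen 3.5 9B", "3D Space Dodge", 462, 0.0, 6),
]


def pick(runs, game, max_cost_usd=0.0, max_bugs=0, max_seconds=60):
    """Return runs for `game` within the given limits.

    Results are sorted by score (highest first), then by build time.
    The default limits are arbitrary assumptions: free, bug-free, under a minute.
    """
    eligible = [
        r for r in runs
        if r.game == game
        and r.cost_usd <= max_cost_usd
        and r.bugs <= max_bugs
        and r.seconds <= max_seconds
    ]
    return sorted(eligible, key=lambda r: (-r.score, r.seconds))


if __name__ == "__main__":
    # Free, bug-free, sub-minute builds for 3D Space Dodge.
    for run in pick(RUNS, "3D Space Dodge"):
        print(run.score, run.model, f"{run.seconds}s")
```

Raising `max_cost_usd` (for example to 0.10) lets paid runs such as DeepSeek V4 Pro back into the ranking, which makes the speed-versus-cost trade-off in the table explicit.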