LLM Build Leaderboard

Each game on our roadmaps is built with multiple LLMs. This page tracks how each model performs on speed, quality, cost, and bugs, so you can pick the right model for your next build.

23 benchmark runs

Score	Model	Provider	Game	Time	Tokens	Cost	Bugs	Notes
95	Mistral Small	Mistral API	AI Dodge	12s	1,754	$0	0	Cleanest output — zero bugs, 12s, all features working.
95	DeepSeek V4 Pro	DeepSeek API	3D Space Dodge	2s	8,335	$0.0750	0	Fastest build (~2s). Largest output. Bot mode. Premium cost ($0.075).
92	DeepSeek V4 Flash (0.4 temp)	DeepSeek API	3D Space Dodge	30s	8,000	$0.0010	0	Two prompts. Complete game + bot mode on first attempt. 680 lines.
91	Tencent Hy3 Preview	OpenRouter	3D Space Dodge	5s	7,459	$0	0	Both prompts. Bot mode. Large output (16.1 KB). Fast free-tier inference.
90	DeepSeek V4 Flash (0.7 temp)	DeepSeek API	AI Dodge	68s	8,399	$0.0012	0	Largest output (429 lines, 15,487 chars). Full feature set.
90	Xiaomi Mimo V2.5	OpenRouter	3D Space Dodge	2s	7,970	$0.0050	0	Both prompts. Bot mode. 13.3 KB output. Competitive performance ($0.005).
89	Nemotron 3 Ultra (550B)	OpenRouter	3D Space Dodge	2s	7,350	$0.0030	0	550B model. ~2s inference. Bot mode working. $0.003 total.
88	owl-alpha	OpenRouter (free tier)	AI Dodge	53s	2,024	$0	0	Best free-tier output (8,272 chars). No bugs.
88	Mistral Small	Mistral API	3D Space Dodge	21s	6,689	$0	0	Two prompts. Clean Three.js output. Bot mode working. Fast build.
85	owl-alpha	OpenRouter (free tier)	3D Space Dodge	6s	8,913	$0	0	Two prompts. Fast build (6s total). Bot mode working.
82	DeepSeek V4 Flash (0.4 temp)	DeepSeek API	AI Dodge	14s	2,077	$0.0005	3	Interactive build across 5 prompts. Fastest cloud time.
82	Gemma-4-31B	OpenRouter (free tier)	3D Space Dodge	2s	4,032	$0	0	Both prompts. Bot mode working. 31B model, fast free-tier inference.
80	poolside/laguna-m.1	OpenRouter (free tier)	3D Space Dodge	2s	7,796	$0	0	PASS after fix: Model used MeshBasicMaterial (no emissive) instead of MeshStandardMaterial. One-word fix. Bot mode working.
78	openai/gpt-oss-20b	Local — Mac Mini M4	AI Dodge	23s	1,823	$0	2	Best local model. Fast (23s), functional. Speed + double-R bugs.
76	Nemotron 3 Ultra	OpenRouter (free tier)	AI Dodge	86s	2,015	$0	1	Most compact (132 lines). May lack edge wrapping.
74	Qwen 3.5 9B	Local — Mac Mini M4	AI Dodge	190s	3,210	$0	1	Best local code quality. Clean structure, few bugs.
72	Llama 3.1 8B	Local — Mac Mini M4	AI Dodge	55s	1,220	$0	1	Most reliable local fallback. Sparse but functional.
70	Gemma-4-12b-qat	Local — Mac Mini M4	AI Dodge	281s	3,563	$0	0	Slowest by far (281s). Output adequate but not proportional to time.
70	Gemma-4-12b-qat	Local — Apple Mac Mini M4 (16GB)	3D Space Dodge	691s	8,431	$0	0	Slowest benchmark (11.5 min). Both prompts succeeded. Bot mode working.
68	Gemma-4-12b-coder-fable	Local — Mac Mini M4	AI Dodge	112s	1,463	$0	1	Shortest working output (3,451 chars). Missing some features.
68	Qwen 3.5 9B	Local — Apple Mac Mini M4 (16GB)	3D Space Dodge	462s	5,435	$0	6	DEGRADED: 6 const reassignment errors at runtime. WebGL renders (680x480). Game runs but errors degrade gameplay. 40% tokens on CoT reasoning.
65	DeepSeek V4 Pro	DeepSeek API	AI Dodge	101s	8,399	$0.0012	0	Spent tokens on reasoning. Only 50 lines actual game code.
60	GPT-OSS-120B	OpenRouter (free tier)	3D Space Dodge	2s	3,292	$0	0	Prompt 1 OK. Prompt 2 failed — no bot mode, glow, or boundary. Instruction gap.
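
To pick a model from the table above, one option is to filter the runs by game and then rank them. The snippet below is a minimal sketch, not part of the leaderboard itself: the `Run` class, the `pick` helper, and its parameter names are hypothetical, and only a handful of rows are copied in (Tokens and Notes omitted).

```python
from dataclasses import dataclass


@dataclass
class Run:
    score: int
    model: str
    provider: str
    game: str
    seconds: int
    cost_usd: float
    bugs: int


# A few rows copied from the table above; extend with the rest as needed.
RUNS = [
    Run(95, "Mistral Small", "Mistral API", "AI Dodge", 12, 0.0, 0),
    Run(95, "DeepSeek V4 Pro", "DeepSeek API", "3D Space Dodge", 2, 0.0750, 0),
    Run(92, "DeepSeek V4 Flash (0.4 temp)", "DeepSeek API", "3D Space Dodge", 30, 0.0010, 0),
    Run(91, "Tencent Hy3 Preview", "OpenRouter", "3D Space Dodge", 5, 0.0, 0),
    Run(82, "DeepSeek V4 Flash (0.4 temp)", "DeepSeek API", "AI Dodge", 14, 0.0005, 3),
    Run(65, "DeepSeek V4 Pro", "DeepSeek API", "AI Dodge", 101, 0.0012, 0),
]


def pick(runs, game, max_cost_usd=float("inf"), max_bugs=0):
    """Return runs for one game within the cost and bug limits, best score first."""
    matches = [
        r for r in runs
        if r.game == game and r.cost_usd <= max_cost_usd and r.bugs <= max_bugs
    ]
    # Highest score first; shorter build time breaks ties.
    return sorted(matches, key=lambda r: (-r.score, r.seconds))


if __name__ == "__main__":
    for run in pick(RUNS, "3D Space Dodge", max_cost_usd=0.0):
        print(run.score, run.model, run.provider)
```

Raising `max_bugs` or `max_cost_usd` widens the shortlist; for example, `pick(RUNS, "AI Dodge", max_bugs=3)` also admits the interactive DeepSeek V4 Flash run.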

How Scoring Works