18 LLMs Build a 3D Game — Who Ships and Who Breaks?

Can a language model build a playable 3D game from a single prompt? We took 18 models — from cloud APIs to local 4B quantized models — and gave each the exact same task: build a Three.js survival dodge game with player sphere, enemy cubes, WASD movement, score tracking, and camera following. Then a second prompt: add bot mode, glow pulse, and arena boundary ring.

No human edits. No intermediate feedback. Two shots per model.

The Test

Prompt 1:

Build a 3D game using Three.js r128 loaded from CDN. The game is a top-down survival dodge where the player is a glowing blue sphere on a dark grid plane. Red cube enemies spawn from outside the visible area and home toward the player. WASD moves the player. Score increases every second. Game over when an enemy touches the player. Press R to restart. Camera should follow the player at a low angle. Use a 680x480 canvas. Dark theme with neon-style emissive materials.
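
For orientation, the core logic this prompt asks for is small. The sketch below is ours, not taken from any model's output: `stepEnemy`, `collides`, and the default radii are illustrative assumptions, and plain `{x, z}` objects stand in for Three.js mesh positions so it runs without a browser.

```javascript
// Hypothetical helpers; plain {x, z} objects stand in for mesh.position.

// Move an enemy toward the player by at most `speed * dt` units on the XZ plane.
function stepEnemy(enemy, player, speed, dt) {
  const dx = player.x - enemy.x;
  const dz = player.z - enemy.z;
  const dist = Math.hypot(dx, dz);
  if (dist === 0) return { x: enemy.x, z: enemy.z }; // already on the player
  const step = Math.min(speed * dt, dist); // never overshoot the target
  return { x: enemy.x + (dx / dist) * step, z: enemy.z + (dz / dist) * step };
}

// Sphere-vs-cube contact approximated as circle-vs-circle (assumed radii).
function collides(enemy, player, enemyRadius = 0.5, playerRadius = 0.5) {
  const dist = Math.hypot(player.x - enemy.x, player.z - enemy.z);
  return dist < enemyRadius + playerRadius;
}
```

A game loop would call `stepEnemy` once per enemy per frame and end the run the first time `collides` returns true.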

Prompt 2 (applied to the output of Prompt 1):

Add bot mode (?bot=true), glow pulse on player sphere, and dim arena boundary ring.
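
To make the second prompt concrete, here is a rough sketch of two of the three features. It is our illustration, not any model's output; the function names, the default pulse values, and the "run from the nearest enemy" strategy are all assumptions.

```javascript
// Hypothetical helpers illustrating Prompt 2.

// True when the query string contains bot=true.
function isBotMode(search) {
  return new URLSearchParams(search).get("bot") === "true";
}

// Glow value oscillating between base - amplitude and base + amplitude.
function glowIntensity(timeSeconds, base = 1.0, amplitude = 0.4, hz = 1.0) {
  return base + amplitude * Math.sin(2 * Math.PI * hz * timeSeconds);
}

// Simplest possible bot: a unit vector pointing away from the nearest enemy.
function botDirection(player, enemies) {
  let nearest = null;
  let best = Infinity;
  for (const e of enemies) {
    const d = Math.hypot(e.x - player.x, e.z - player.z);
    if (d < best) { best = d; nearest = e; }
  }
  if (nearest === null || best === 0) return { x: 0, z: 0 };
  return { x: (player.x - nearest.x) / best, z: (player.z - nearest.z) / best };
}
```

In a browser you would pass `window.location.search` to `isBotMode` and assign the pulse value to the `emissiveIntensity` of a MeshStandardMaterial each frame.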

The Full Results

All scores are composite: output quality, build speed, token efficiency, bug count, and feature completeness. Results verified via Playwright with WebGL-enabled headless Chromium (SwiftShader software rendering).

Rank	Model	Type	Time	Tokens	Size	Cost	Score	Status
1	DeepSeek V4 Pro	API	~2s	8,335	18.5 KB	$0.075	95	PASS
2	DeepSeek V4 Flash	API	30s	~8K	14.3 KB	$0.001	92	PASS
3	Tencent Hy3 Preview	API	~5s	7,459	16.1 KB	$0	91	PASS
4	Xiaomi Mimo V2.5	API	~2s	7,970	13.3 KB	$0.005	90	PASS
5	Nemotron 3 Ultra (550B)	API	~2s	7,350	11.7 KB	$0.003	89	PASS
6	Mistral Small	Free API	21s	6,689	11.6 KB	$0	88	PASS
7	owl-alpha	Free API	6s	8,913	13.1 KB	$0	85	PASS
8	Gemma-4-31B	Free API	~2s	4,032	7.9 KB	$0	82	PASS
9	poolside/laguna-m.1	Free API	~2s	7,796	6.7 KB	$0	80	PASS
10	Gemma-4-12b-qat	Local	691s	8,431	8.3 KB	$0	70	PASS
11	Qwen 3.5 9B	Local	462s	5,435	9.5 KB	$0	68	DEGRADED
12	GPT-OSS-120B	Free API	~2s	3,292	6.7 KB	$0	60	PASS
—	GPT-OSS-20B	Local	38s	1,235	4.2 KB	$0	—	FAILED
—	Llama 3.1 8B	Local	128s	2,591	8.0 KB	$0	—	FAILED
—	GLM-4.6V-Flash	Local	354s	6,063	8.8 KB	$0	—	FAILED
—	Gemma-4-12b-coder-fable	Local	181s	2,128	3.4 KB	$0	—	FAILED
—	Nemotron-3-Nano-4B	Local	209s	5,211	4.8 KB	$0	—	FAILED
—	Gemma-4-12b-agentic-fable5	Local	234s	2,577	3.6 KB	$0	—	FAILED

Status key: PASS = canvas renders with WebGL, zero JS errors. DEGRADED = canvas renders but has code bugs from the model (affects score). FAILED = no canvas or game never renders.
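
The status key reduces to two observations per variant. As a minimal sketch, assuming a hypothetical result shape of `{ canvasRendered, jsErrors }` collected by the browser check:

```javascript
// Hypothetical result shape: { canvasRendered: boolean, jsErrors: string[] }.
function classify(result) {
  if (!result.canvasRendered) return "FAILED";
  return result.jsErrors.length === 0 ? "PASS" : "DEGRADED";
}
```

A variant with a WebGL canvas and six logged errors would come back as DEGRADED; one with no canvas is FAILED regardless of its error count.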

DEGRADED breakdown (verified with WebGL animation-loop testing):

Degraded Variants	WebGL	Score UI	Errors	Verdict
Qwen 3.5 9B	680x480	✅	6 const reassignment	Game runs, errors degrade gameplay
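
Why does a game with const reassignment errors still run? Reassigning a `const` is a runtime TypeError, not a parse error, so the script loads and only the offending statement fails when it executes. The snippet below is a generic illustration of that behavior, not Qwen's actual code.

```javascript
// Reassigning a const binding throws only when that line executes.
function brokenTick() {
  const score = 0;
  try {
    score += 1; // TypeError: Assignment to constant variable.
  } catch (err) {
    return err.name;
  }
  return "no error";
}

// The fix is to declare mutable state with let.
function fixedTick() {
  let score = 0;
  score += 1;
  return score;
}
```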

Demoted to FAILED (verified broken in a real browser):

Llama 3.1 8B: a 2D context locks the canvas before WebGL can use it
GLM-4.6V-Flash: 2 extra closing parens break the parser
Gemma-4-12b-coder-fable: output truncated at the token limit
Nemotron-3-Nano-4B: `now` used at top level without declaration

Promoted to PASS: poolside/laguna-m.1, after a one-word material fix.
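
The Llama 3.1 8B failure comes from a browser rule: once a canvas has handed out one context type, a request for a different type returns null. The sketch below mimics that rule with a `FakeCanvas` class so it runs outside a browser. The class is a simplified stand-in, not the real HTMLCanvasElement, and `requireWebGL` is a hypothetical guard.

```javascript
// Simplified stand-in for the canvas context-mode rule: the first context
// type requested wins, and a later request for a different type returns null.
class FakeCanvas {
  constructor() {
    this.mode = null;
  }
  getContext(type) {
    if (this.mode === null) this.mode = type;
    return this.mode === type ? { type } : null;
  }
}

// Hypothetical guard: fail loudly instead of handing null to a renderer.
function requireWebGL(canvas) {
  const gl = canvas.getContext("webgl");
  if (gl === null) {
    throw new Error("canvas is already bound to another context type");
  }
  return gl;
}
```

Calling `getContext("2d")` on a canvas and then `requireWebGL` on the same canvas throws, which is the situation the model's output created before handing the canvas to WebGLRenderer.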

What the Numbers Tell Us

11 PASS, 1 DEGRADED, 6 FAILED from 18 models tested. The only DEGRADED variant — Qwen 3.5 9B — renders with WebGL and is playable but has 6 const reassignment errors at runtime. The 6 failures cover every category: memory eviction, WebGL context conflict, syntax errors, truncated output, and undefined variables.

The biggest differentiator is output correctness, not speed. A model that finishes in 2 seconds with broken code loses to one that takes 30 seconds with clean output. Fast builds with syntax errors or runtime bugs still count as failures.

One variant required a framework-level fix: poolside/laguna-m.1 used MeshBasicMaterial (which has no emissive property) and then called .emissive.setRGB(), a Three.js API misuse. The automated check did not flag it; it was caught only when a user saw a blank screen. The fix was a one-word change to MeshStandardMaterial. This highlights a testing blind spot: WebGL context existing ≠ game loop running. The error fires in the animation loop, after the initial frame renders.
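
One way to close that blind spot is to drive the frame function a few times and collect whatever it throws, rather than stopping once the canvas exists. This is a sketch under assumptions: the plain objects below only imitate the relevant shape of Three.js materials, and `runFrames` is a hypothetical helper, not part of our Playwright setup.

```javascript
// Plain objects standing in for Three.js materials (assumption, so this runs in Node).
// Like MeshBasicMaterial, basicLike has no `emissive`; standardLike does.
const basicLike = { color: { setRGB() {} } };
const standardLike = { color: { setRGB() {} }, emissive: { setRGB() {} } };

// A frame function in the style of the bug: it touches `emissive` every frame.
function makeFrame(material) {
  return (t) => material.emissive.setRGB(0, 0.5 + 0.5 * Math.sin(t), 1);
}

// Hypothetical check: run `frames` iterations and collect thrown error names.
function runFrames(frame, frames) {
  const errors = [];
  for (let i = 0; i < frames; i++) {
    try {
      frame(i / 60);
    } catch (err) {
      errors.push(err.name);
    }
  }
  return errors;
}
```

With `basicLike`, every frame throws a TypeError because `material.emissive` is undefined. With `standardLike`, the list of errors stays empty.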

Model code bugs are part of the test. We did not fix broken model outputs for other DEGRADED variants. If a model produces syntactically invalid JavaScript, undefined variables, or runtime errors — that’s its score.

Speed gap is absurd. Cloud models finish in 2-30 seconds. Local models take 2-11 minutes. The 550B Nemotron Ultra on OpenRouter finished in about 2 seconds, faster than the 4B Nemotron Nano running locally (209s). That’s roughly a 100x speedup (209s vs ~2s) from using an API instead of local hardware.

Size ≠ quality. DeepSeek V4 Pro produced the largest output (18.5 KB) and scored highest, but small output doesn’t mean broken: Gemma-4-31B scored 82 with 7.9 KB, and the smallest passing outputs (poolside/laguna-m.1 and GPT-OSS-120B) were 6.7 KB each. The real signal is whether the output is complete and syntactically valid.

Speed matters less than correctness. A fast model that ships broken JavaScript loses to a slow model with clean output. Speed still counts once the output is correct: if you’re iterating during development, builds that take seconds instead of minutes save you hours.

Chain-of-thought models waste budget. Qwen 3.5 9B spent 40% of its token budget (2,165 of 5,435 tokens) on reasoning before generating code. That doesn’t make the output better — it just makes it slower and more expensive.

Post-Processing Fixes Applied

Raw model outputs weren’t always clean. Several variants needed minimal fixes before they could render:

Fix	Variants	Root Cause
Stripped `[truncated]` artifact from JS	Gemma-4-12b-coder-fable	Model hit max token limit mid-output, leaving marker text in code
Added `let now = Date.now()`	GLM-4.6V-Flash, poolside/laguna-m.1, GPT-OSS-120B	Model used variable `now` without declaring it
Set `display:none` on game-over elements	8 variants	Game-over divs were visible by default — JS hides them too late
Set `display:none` on restart hints	3 variants	Same issue — “Press R” text visible during gameplay
Added `color: #e6edf3` to body	All 18 variants	Score UI elements inherited black text on dark background
Stripped line-number prefixes from JS	5 variants	Chrome-addition script accidentally embedded `read_file` line markers into JavaScript
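
The `now` row is a scoping problem. The sketch below reproduces the pattern in isolation; it is a generic illustration rather than any model's actual code, and the function names are ours.

```javascript
// `now` exists only as a parameter of animate(), so code outside cannot see it.
function animate(now) {
  return now; // fine: the parameter is in scope here
}

function brokenInit() {
  try {
    return now - 0; // ReferenceError: now is not defined
  } catch (err) {
    return err.name;
  }
}

// The patch from the table: declare `now` before first use.
function patchedInit() {
  let now = Date.now();
  return typeof now;
}
```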

Beyond the rows above and the one-word poolside/laguna-m.1 material change, no model output bugs were fixed. JS errors that remain in a variant (const reassignment, undefined variables, syntax errors) are left as-is, because those bugs are part of the benchmark results. Most of the post-processing addressed artifacts from the chrome-wrapper build process and UI elements that should have been hidden by default.

6 variants remain FAILED. No post-processing can fix a model that was evicted from memory mid-task or produced syntactically invalid output.

Cost Breakdown

Paid models are cheap for single builds — the most expensive (DeepSeek V4 Pro) cost $0.075 for a complete game. Free tier and local models cost nothing but come with speed or rate-limit tradeoffs.

Model	Tokens Used	Cost	Cost per 1K tokens
DeepSeek V4 Pro	8,335	$0.075	$0.009
DeepSeek V4 Flash	~8,000	$0.001	$0.0001
Xiaomi Mimo V2.5	7,970	$0.005	$0.0006
Nemotron 3 Ultra (550B)	7,350	$0.003	$0.0004
Mistral Small (free tier)	6,689	$0	—
owl-alpha (free tier)	8,913	$0	—
Local models	1,235–8,431	$0	—
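
The last column is plain division, shown here so the figures can be rechecked. The function name is ours.

```javascript
// Cost per 1,000 tokens, in dollars.
function costPer1k(costUsd, tokens) {
  return (costUsd / tokens) * 1000;
}
```

For DeepSeek V4 Pro, `costPer1k(0.075, 8335)` comes to just under $0.009, which matches the table after rounding.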

At these prices, API cost is never the bottleneck — build time and iteration speed are the real constraints. Local models take 2-11 minutes per build; cloud models finish in seconds. If your time is worth anything, the paid APIs save you hours per project.

Era Grading

This was a 1970s-era game (C grade on our roadmap scale). Every model that completed both prompts earned a C — meaning they can handle fundamentals: 3D scene setup, physics, basic AI, and canvas rendering. The question now is whether these models can scale to 1980s complexity (Pac-Man, Mega Man) and beyond.

Try the Games Yourself

Every PASS and DEGRADED output is playable; FAILED variants never render. Each variant page shows the exact build metrics:

Variant	Model	Time	Cost	Status	Notes
DeepSeek V4 Pro	DeepSeek V4 Pro	~2s	$0.075	PASS	Largest output (18.5 KB). Clean.
3D Space Dodge	DeepSeek V4 Flash	30s	$0.001	PASS	Baseline. Full game on 1st prompt.
Hy3 Preview	Tencent Hy3 Preview	~5s	$0	PASS	16.1 KB output. Clean.
Mimo V2.5	Xiaomi Mimo V2.5	~2s	$0.005	PASS	13.3 KB. Clean.
Nemotron 3 Ultra	Nemotron 3 Ultra (550B)	~2s	$0.003	PASS	550B model. Clean.
Mistral Small	Mistral Small	21s	$0	PASS	Clean output.
owl-alpha	owl-alpha	6s	$0	PASS	Free tier. Clean.
Gemma-4-31B	Gemma-4-31B	~2s	$0	PASS	Free tier. Clean.
poolside/laguna-m.1	poolside/laguna-m.1	~2s	$0	PASS	Fix applied: MeshBasicMaterial→MeshStandardMaterial.
Gemma-qat	Gemma-4-12b-qat (local)	691s	$0	PASS	Slowest build (11.5 min). Clean WebGL output.
Qwen 3.5 9B	Qwen 3.5 9B (local)	462s	$0	DEGRADED	WebGL renders. 6 const reassignment errors.
GPT-OSS-120B	GPT-OSS-120B	~2s	$0	PASS	Prompt 2 ignored — no bot mode. Zero JS errors otherwise.
—	GPT-OSS-20B (local)	38s	$0	FAILED	Model evicted from memory. Prompt 2 never ran.
—	Llama 3.1 8B (local)	128s	$0	FAILED	Canvas locked to 2D before WebGL: getContext('2d') is called first, then WebGLRenderer fails.
—	GLM-4.6V-Flash (local)	354s	$0	FAILED	2 extra closing parens break parser after scene init. Game never runs.
—	Gemma-4-12b-coder-fable (local)	181s	$0	FAILED	Output truncated at token limit.
—	Nemotron-3-Nano-4B (local)	209s	$0	FAILED	`now` used at top level but never declared — only exists as animate(now) param.
—	Gemma-4-12b-agentic-fable5 (local)	234s	$0	FAILED	Multiple syntax errors in model output. Game never renders.

Partial builds, where Prompt 2 did not complete, are flagged in the Notes column.

Apart from the poolside/laguna-m.1 material fix noted in the table, model bugs were left in place. The DEGRADED variant has genuine code bugs from the model output, and those bugs are part of the test results. Other patches were limited to the items listed under Post-Processing Fixes.

Key Takeaway

11 PASS, 1 DEGRADED, 6 FAILED from 18 models tested. All 10 API models passed (DeepSeek API, OpenRouter, Free API). Only 2 of 8 local models produced working output (Gemma-4-12b-qat PASS, Qwen 3.5 9B DEGRADED); the other 6 failed entirely. Speed and output size matter less than whether the code actually runs: a model that finishes in 2 seconds with broken JavaScript is worse than one that takes 30 seconds with clean output. The next test is harder: can these models build an 80s-era game with AI, tilemaps, and state machines?

Why Local Models Failed

The 6 local failures aren’t about parameter count alone. Gemma-4-12b-coder-fable (12B) hit the token limit mid-function and Gemma-4-12b-agentic-fable5 (12B) produced syntactically invalid output, yet Gemma-4-12b-qat (12B, same size) passed cleanly. Meanwhile, Mistral Small passed on the free API tier with a score of 88. We did not verify its parameter count for this run; published Mistral Small releases are in the 22B to 24B range, so it is not clearly smaller than the local models that failed.

Three patterns emerge:

Quantization quality matters more than size. The two local models that work (Gemma-qat, Qwen 3.5 9B) are quantization-optimized variants, not raw checkpoints. QAT (quantization-aware training) preserves code-generation capability that standard quantization strips away.
Test-time compute is the real bottleneck. Local models share memory with the OS on a 16GB Mac Mini. When they hit a tricky spot (closing a complex expression, tracking nested parentheses), they have no headroom to self-correct. API models run on dedicated hardware with higher generation budgets — they can afford to be verbose and careful.
Iteration kills local models. Prompt 1 → Prompt 2 was where most broke. They could produce an initial scene, but couldn’t correctly modify it without introducing syntax errors (extra parens, undefined vars, wrong material types). That’s a working-memory / context-retention issue under compound instructions.
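
Failures like the extra closing parens can be caught before a page ever loads. A minimal sketch, assuming the model output is a classic script rather than an ES module: `new Function` compiles the source as a function body and throws SyntaxError on malformed input, and because the resulting function is never called, nothing executes. The `syntaxError` helper is hypothetical and was not part of our pipeline.

```javascript
// Hypothetical pre-flight check: parse the script without running it.
// Returns null when the source parses, otherwise the error message.
function syntaxError(source) {
  try {
    new Function(source); // compiled, never invoked
    return null;
  } catch (err) {
    return err instanceof SyntaxError ? err.message : String(err);
  }
}
```

This is a coarse filter. It rejects `import` and `export` statements, accepts a stray top-level `return`, and says nothing about runtime errors such as undeclared variables or const reassignment.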

Does a single comprehensive prompt help?

We tested this hypothesis on Mistral Small (which passed under 2-prompt format) by giving it a single prompt containing ALL requirements at once (game + bot mode + glow + boundary). The result was worse, not better:

Metric	2-Prompt	Single Prompt
Output size	11,675 chars	4,159 chars
Bot mode	✅	❌ Missing
Glow pulse	✅	❌ Missing
Arena boundary	✅	✅ Present
JS errors	0	0
Material type	MeshStandardMaterial (correct)	MeshBasicMaterial + emissive (bug)

The single-prompt game rendered with no JS errors but was missing bot mode and glow entirely. The model appears to have dropped features when handed the longer prompt. In this one trial, the 2-prompt format produced more complete output, plausibly because each prompt has a narrower scope.

The real issue isn’t the 2-prompt format. Local models could not reliably produce syntactically correct code under it, and the one single-prompt trial (run on an API model, not a local one) gives no indication that merging the prompts would help. The iteration process exposes weaknesses in working memory and syntax tracking that API models handle gracefully.

The takeaway for builders: if you’re iterating on a game with an LLM, use an API model. Local models are improving, but a 6-of-8 failure rate is too high for production use, even for a simple 3D dodge game.