We Asked 10 LLMs to Build the Same Game — Here's How Each One Did
We gave 10 different LLMs (local and cloud) the exact same prompt to build an AI dodge game in Phaser.js. We tracked tokens, code quality, bugs, and speed. The results surprised us.
What happens when you give 10 different language models the exact same prompt to build a browser game? We tested exactly that.
The task: build a Phaser.js 3 game where the player is a blue circle dodging red AI enemies. Identical prompt, same game engine, zero human edits. We fed it to cloud APIs (DeepSeek V4, Mistral, OpenRouter) and local models (GPT-OSS-20B, Llama 3.1, Gemma 4, Qwen 3.5) and compared the results.
Playable Demos
Two versions are embedded below. All other model builds are playable at their own URLs.
All games: Move with WASD/arrows, survive as long as you can, press R to restart.
| Metric | Value |
|---|---|
| Model | DeepSeek V4 Flash (API) |
| Provider | DeepSeek API |
| Model ID | deepseek-v4-flash |
| Temperature | 0.7 |
| Max Tokens | 8,192 |
| Tokens Used | 8,399 (207 in / 8,192 out) |
| Build Time | 68s |
| File Size | 14.8 KB |
| Cost | $0.0012 |
| Notes | Largest original output (15,487 chars, 429 lines) |
| Source | github.com/driphtyio/ai-dodge |
| Metric | Value |
|---|---|
| Model | openai/gpt-oss-20b (local) |
| Provider | Local — Apple Mac Mini M4 (16GB) |
| Model ID | openai/gpt-oss-20b |
| Temperature | 0.7 |
| Max Tokens | 8,192 |
| Tokens Used | 1,823 (269 in / 1,570 out) |
| Build Time | 23s |
| File Size | 14.8 KB |
| Cost | $0 (local) |
| Notes | Original output had player speed bug + double R key |
| Source | github.com/driphtyio/ai-dodge |
All other model builds are playable here:
- /games/ai-dodge/ — DeepSeek V4 Flash (0.4 temp, interactive build)
- /games/ai-dodge/mistral-small — Mistral Small (cleanest output)
- /games/ai-dodge/owl-alpha — owl-alpha (largest free output)
- /games/ai-dodge/nemotron-3-ultra — Nemotron 3 Ultra
- /games/ai-dodge/local-llama-3.1-8b-local — Llama 3.1 8B
- /games/ai-dodge/local-qwen-3.5-9b — Qwen 3.5 9B
- /games/ai-dodge/local-gemma-4-12b-qat — Gemma-4-12b-qat
- /games/ai-dodge/local-gemma-4-12b-coder-fable5-composer2.5 — Gemma-4-12b-coder-fable
The Lineup
We tested 10 models across cloud APIs and local inference:
| Tier | Models |
|---|---|
| Cloud APIs | DeepSeek V4 Flash, DeepSeek V4 Pro, Mistral Small, OpenRouter (owl-alpha, Nemotron 3 Ultra) |
| Local — Apple Mac Mini M4 (16GB) | openai/gpt-oss-20b, Llama 3.1 8B, Gemma-4-12b-qat, Qwen 3.5 9B, Gemma-4-12b-coder-fable |
| Blocked by content policy | Groq (70B + 8B), Cerebras (120B) — ironic given they’re great at playing games via API |
Results
| Model | Status | Time | Tokens | Lines | Chars | Cost |
|---|---|---|---|---|---|---|
| DeepSeek V4 Flash (0.7 temp) | ✅ PASS | 68s | 8,399 | 429 | 15,487 | $0.0012 |
| DeepSeek V4 Flash (0.4 temp) | ✅ PASS | 14s | 2,077 | 256 | 7,376 | $0.0005 |
| Mistral Small | ✅ PASS | 12s | 1,754 | 233 | 6,869 | $0 |
| owl-alpha (OpenRouter) | ✅ PASS | 53s | 2,024 | 241 | 8,272 | $0 |
| DeepSeek V4 Pro | ✅ PASS | 101s | 8,399 | 50 | 1,424 | $0.0012 |
| Nemotron 3 Ultra (OpenRouter) | ✅ PASS | 86s | 2,015 | 132 | 5,323 | $0 |
| openai/gpt-oss-20b (local) | ✅ PASS | 23s | 1,823 | 216 | 4,849 | $0 |
| Qwen 3.5 9B (local) | ✅ PASS | 190s | 3,210 | 177 | 4,804 | $0 |
| Llama 3.1 8B (local) | ✅ PASS | 55s | 1,220 | 153 | 5,183 | $0 |
| Gemma-4-12b-qat (local) | ✅ PASS | 281s | 3,563 | 182 | 5,410 | $0 |
| Gemma-4-12b-coder-fable (local) | ✅ PASS | 112s | 1,463 | 107 | 3,451 | $0 |
| Lfm2.5-8b (local) | ⚠️ NO LOGIC | 128s | 8,369 | 551 | 25,966 | $0 |
| Vibethinker-3b (local) | ❌ OVERFLOW | 204s | 8,383 | 1 | 0 | — |
| Groq 70B / 8B | ❌ POLICY | 1s | — | — | — | — |
| Cerebras 120B | ❌ POLICY | 0.3s | — | — | — | — |
| Averages (PASS only) | | | | | | |
| API / Cloud avg (n=6) | ✅ | 56s | 4,111 | 224 | 7,458 | $0.0005 |
| Local avg (n=5) | ✅ | 132s | 2,256 | 167 | 4,739 | $0 |
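The averages can be rechecked from the PASS rows. The following is a minimal sketch with the time, token, and line figures copied from the table above; the `passRows` layout and the `tierAverages` helper are names made up for this illustration, not part of the original benchmark harness.

```javascript
// PASS rows copied from the results table: [model, tier, seconds, tokens, lines].
// The array layout and helper name are assumptions made for this sketch.
const passRows = [
  ["DeepSeek V4 Flash (0.7 temp)", "cloud", 68, 8399, 429],
  ["DeepSeek V4 Flash (0.4 temp)", "cloud", 14, 2077, 256],
  ["Mistral Small", "cloud", 12, 1754, 233],
  ["owl-alpha (OpenRouter)", "cloud", 53, 2024, 241],
  ["DeepSeek V4 Pro", "cloud", 101, 8399, 50],
  ["Nemotron 3 Ultra (OpenRouter)", "cloud", 86, 2015, 132],
  ["openai/gpt-oss-20b", "local", 23, 1823, 216],
  ["Qwen 3.5 9B", "local", 190, 3210, 177],
  ["Llama 3.1 8B", "local", 55, 1220, 153],
  ["Gemma-4-12b-qat", "local", 281, 3563, 182],
  ["Gemma-4-12b-coder-fable", "local", 112, 1463, 107],
];

// Average the numeric columns for one tier, rounded to whole numbers.
function tierAverages(rows, tier) {
  const picked = rows.filter((row) => row[1] === tier);
  const mean = (index) =>
    Math.round(picked.reduce((sum, row) => sum + row[index], 0) / picked.length);
  return { n: picked.length, seconds: mean(2), tokens: mean(3), lines: mean(4) };
}

console.log(tierAverages(passRows, "cloud")); // { n: 6, seconds: 56, tokens: 4111, lines: 224 }
console.log(tierAverages(passRows, "local")); // { n: 5, seconds: 132, tokens: 2256, lines: 167 }
```

Adding a new model to the comparison only needs one more row in `passRows`.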
Total Cost
$0.0032 — that’s all 10 successful API calls combined. The local models cost $0 (they ran on our machine). The cloud calls (DeepSeek) cost less than a penny. For simple 2D canvas games under 500 lines, the cost is effectively zero — but this doesn’t scale linearly to larger projects.
Bugs & Key Findings
| Model | Bugs | Notes |
|---|---|---|
| DeepSeek V4 Flash (0.7 temp) | No bugs | Largest output — 429 lines, full feature set |
| DeepSeek V4 Flash (0.4 temp) | Score timer stacked on restart; no edge wrapping; no speed scaling | Fastest cloud (14s) |
| Mistral Small | None | Cleanest output — zero bugs, no edits needed |
| owl-alpha (OpenRouter) | None reported | Best free-tier — 8,272 chars, no bugs |
| DeepSeek V4 Pro | No bugs | Spent 8,192 tokens on reasoning before code. Only 50 lines of actual game |
| Nemotron 3 Ultra (OpenRouter) | May lack edge wrapping | Most compact — 132 lines |
| openai/gpt-oss-20b (local) | Player speed coupled to enemy speed; double R key; add.circle() shapes | Best local — fast (23s), functional, $0 |
| Qwen 3.5 9B (local) | Few bugs | Best local code quality — clean structure, functional first attempt |
| Llama 3.1 8B (local) | Sparse game logic | Most reliable local fallback |
| Gemma-4-12b-qat (local) | No bugs | Slowest by far (281s) |
| Gemma-4-12b-coder-fable (local) | Missing some features | Shortest working output (3,451 chars) |
| Lfm2.5-8b (local) | 551 lines of HTML with zero game logic | Empty canvas — no game code |
| Vibethinker-3b (local) | Reasoning overflow at 8,383 tokens | Too small (3B) for this task |
| Groq 70B / 8B | Error 1010 — provider-level content policy | Model works via other providers |
| Cerebras 120B | Error 1010 — provider-level content policy | Model works via other providers |
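The "score timer stacked on restart" entry is a common pattern in generated games. We did not trace the exact cause in the DeepSeek build, so the following is only a sketch of one way this kind of bug usually arises and how to avoid it: a restart starts a second one-second timer without cancelling the first, so the score ticks twice per second. The `ScoreTimer` class and its injectable timer functions are hypothetical names used for illustration.

```javascript
// Hypothetical score timer that cannot stack: start() always cancels the
// previous interval before creating a new one.
class ScoreTimer {
  constructor(
    setIntervalFn = (fn, ms) => setInterval(fn, ms),
    clearIntervalFn = (id) => clearInterval(id)
  ) {
    this.setIntervalFn = setIntervalFn;
    this.clearIntervalFn = clearIntervalFn;
    this.handle = null;
    this.score = 0;
  }

  // Called on first start and on every restart (R key).
  start() {
    this.stop(); // without this line, each restart adds another ticking timer
    this.score = 0;
    this.handle = this.setIntervalFn(() => {
      this.score += 1; // +1 per second, as the prompt requires
    }, 1000);
  }

  // Called on game over.
  stop() {
    if (this.handle !== null) {
      this.clearIntervalFn(this.handle);
      this.handle = null;
    }
  }
}

// Usage inside a game: const timer = new ScoreTimer(); timer.start();
```

Inside Phaser the same idea applies to `this.time.addEvent(...)`: keep the returned event and remove it before creating a new one, or create it in `create()` so a scene restart tears it down.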
The Prompt
In case you want to run your own benchmark:
“Create a complete, playable Phaser.js 3 game as a single HTML file. Player is a blue circle that moves with WASD/arrow keys. Red circles spawn from screen edges and chase the player with simple AI (seek behavior). Score increases by 1 every second. Collision with any red circle = game over. Press R to restart after game over. Enemies wrap around screen edges. Player wraps around screen edges too. Enemies speed up gradually as score increases. Player has a subtle glow/pulse effect. Dark theme. 680x480. Centered canvas with instructions below.”
Want to pit your favorite model against this? Drop it the same prompt and see if it beats Mistral Small’s clean record.
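For reference when judging a model's output, the core mechanics the prompt asks for (seek behavior, edge wrapping, speed scaling, collision) can be sketched as plain functions with no Phaser dependency. This is a minimal illustration, not any model's actual output; the function names and the speed constants (`base`, `perPoint`, `cap`) are assumptions chosen for the example.

```javascript
const WIDTH = 680;  // canvas size from the prompt
const HEIGHT = 480;

// Wrap a coordinate so leaving one edge re-enters from the opposite edge.
function wrap(value, max) {
  return ((value % max) + max) % max;
}

// Seek: a velocity of the given magnitude pointing from the enemy to the player.
function seekVelocity(enemy, player, speed) {
  const dx = player.x - enemy.x;
  const dy = player.y - enemy.y;
  const dist = Math.hypot(dx, dy);
  if (dist === 0) return { vx: 0, vy: 0 };
  return { vx: (dx / dist) * speed, vy: (dy / dist) * speed };
}

// Gradual speed-up with score. The constants are hypothetical tuning values.
function enemySpeed(score, base = 60, perPoint = 2, cap = 200) {
  return Math.min(cap, base + perPoint * score);
}

// Advance one enemy by dtSeconds, then wrap it back onto the screen.
// Simplification: the enemy seeks in a straight line and ignores the
// possibly shorter path across a wrapped edge.
function stepEnemy(enemy, player, score, dtSeconds) {
  const { vx, vy } = seekVelocity(enemy, player, enemySpeed(score));
  return {
    x: wrap(enemy.x + vx * dtSeconds, WIDTH),
    y: wrap(enemy.y + vy * dtSeconds, HEIGHT),
  };
}

// Circle-vs-circle collision: centers closer than the sum of the radii.
function collides(a, b, radiusSum) {
  return Math.hypot(a.x - b.x, a.y - b.y) < radiusSum;
}
```

A generated game that is missing any of these four pieces corresponds to the "no edge wrapping" or "no speed scaling" entries in the bug table.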
Methodology
- All models received the same prompt verbatim
- No human edits or iterations after generation
- Temperature: 0.4-0.7, max_tokens: 4096-8192
- Local models ran on 10.0.0.25:1234 (LM Studio) — Apple Mac Mini M4 (16GB memory)
- Cloud models used official API endpoints with free/free-tier keys
- Games were validated by rendering in browser and checking for core features
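To reproduce a single run, a request along these lines should work against LM Studio's OpenAI-compatible endpoint. This is a sketch of the general shape, not the exact harness used for this benchmark: the helper names are made up, and the response fields follow the OpenAI chat-completions format that LM Studio and most cloud providers mirror.

```javascript
// Build an OpenAI-style chat-completions body. Defaults match the upper end
// of the settings listed above (temperature 0.7, max_tokens 8192).
function buildChatRequest(model, prompt, temperature = 0.7, maxTokens = 8192) {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    temperature,
    max_tokens: maxTokens,
  };
}

// Send one request and record wall-clock build time and token usage.
// baseUrl is e.g. "http://10.0.0.25:1234" for the local LM Studio server.
// Assumes a runtime with global fetch (browsers, Node 18+).
async function generateGame(baseUrl, apiKey, body) {
  const started = Date.now();
  const response = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Request failed with HTTP ${response.status}`);
  }
  const data = await response.json();
  return {
    html: data.choices[0].message.content, // the generated single-file game
    usage: data.usage,                     // prompt/completion/total token counts
    seconds: (Date.now() - started) / 1000,
  };
}

// Example (not executed here):
// const body = buildChatRequest("openai/gpt-oss-20b", promptText, 0.7, 8192);
// const result = await generateGame("http://10.0.0.25:1234", "not-needed", body);
```

Keeping the body builder separate from the network call makes it easy to rerun the same prompt at a different temperature or token cap and compare the results.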
Lessons Learned
- Model size doesn’t predict output quality. Mistral Small (the smallest cloud model) produced the cleanest code with zero bugs. DeepSeek V4 Pro (the more expensive model) spent its entire 8,192 token budget on reasoning text before the code — only 50 lines of actual game.
- Temperature and token limits are bigger levers than model choice. The same model (DeepSeek V4 Flash) at 0.4 temp / 4,096 tokens produced 7,376 chars in 14s at $0.0005. At 0.7 temp / 8,192 tokens it produced 15,487 chars in 68s at $0.0012. The model didn’t change — the settings did.
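- Local inference is viable for simple games but slow. All 5 local models produced functional games at $0 cost, but build times ranged from 23s (GPT-OSS-20B) to 281s (Gemma-4-12b-qat). On average that is about 2.4x slower than the cloud runs (132s vs 56s), and the slowest local build took roughly 20x as long as the 14s DeepSeek V4 Flash run.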
- Provider policy blocks more models than model capability. Groq and Cerebras returned error 1010 for the same prompt that worked everywhere else. The models themselves (Llama 3.3 70B, GPT-OSS-120B) can generate game code — their API safety layers blocked it.
- Output token budget is the real bottleneck. Models capped at 4,096 tokens produced shorter, sometimes incomplete code. At 8,192 tokens, most models had room to finish the full game. DeepSeek V4 Pro was the exception: it spent its budget on reasoning instead of code, a model-specific behavior that a higher token limit does not fix by itself.
- Phaser.js 3.60 patterns trip up most models. Common failure modes across nearly every model: using `add.circle()` instead of `generateTexture()` + `physics.add.sprite()`, binding keyboard events inside `update()` instead of `create()`, and omitting the parent `<div id="game">`. These aren’t logic bugs; they’re API knowledge gaps.
- First-attempt quality is the strongest differentiator. Mistral Small and owl-alpha produced zero-bug games on the first try. Most of the other builds had at least one bug or missing feature that would need a fix pass. For a “single prompt, no edits” benchmark, this is the metric that matters most for real use.
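The three Phaser patterns named in the lessons above look roughly like this when done the expected way. This is a minimal sketch, assuming Phaser 3 is loaded from a script tag and the page contains `<div id="game"></div>`. The helper names, the player speed, and the colors are illustrative choices, and game-over handling is left out.

```javascript
// Draw a filled circle once and bake it into a texture, so it can back a
// physics sprite (instead of using an add.circle() shape).
function makeCircleTexture(scene, key, radius, color) {
  const g = scene.add.graphics();
  g.fillStyle(color, 1);
  g.fillCircle(radius, radius, radius);
  g.generateTexture(key, radius * 2, radius * 2);
  g.destroy(); // the graphics object is no longer needed once the texture exists
}

function create() {
  makeCircleTexture(this, "player", 12, 0x3399ff);
  this.player = this.physics.add.sprite(340, 240, "player");
  this.cursors = this.input.keyboard.createCursorKeys();
  // Bind restart once here. Binding inside update() would register a new
  // listener on every frame.
  this.input.keyboard.on("keydown-R", () => this.scene.restart());
}

function update() {
  const speed = 200; // hypothetical player speed, independent of enemy speed
  this.player.setVelocity(0);
  if (this.cursors.left.isDown) this.player.setVelocityX(-speed);
  else if (this.cursors.right.isDown) this.player.setVelocityX(speed);
  if (this.cursors.up.isDown) this.player.setVelocityY(-speed);
  else if (this.cursors.down.isDown) this.player.setVelocityY(speed);
  this.physics.world.wrap(this.player, 0); // re-enter from the opposite edge
}

// Takes the Phaser namespace as an argument so the config can be built and
// inspected without a browser.
function buildConfig(PhaserLib) {
  return {
    type: PhaserLib.AUTO,
    width: 680,
    height: 480,
    parent: "game", // must match the id of the container div
    backgroundColor: "#111111",
    physics: { default: "arcade" },
    scene: { create, update },
  };
}

// In the browser: new Phaser.Game(buildConfig(Phaser));
```

Only arrow keys are wired up here; WASD can be added in `create()` with `this.input.keyboard.addKeys("W,A,S,D")`.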
What’s Next
This comparison revealed something unexpected: bigger and more expensive doesn’t mean better code. Mistral Small (the smallest cloud model) produced the cleanest output. DeepSeek V4 Flash (0.7 temp) produced the most code, and the bug table lists it as bug-free, but it used the full 8,192-token budget and took 68s to get there. And Groq/Cerebras won’t build games at all but are happy to play them.
Upcoming posts will extend this benchmark to:
- Iterative builds — which model fixes its bugs fastest with a second prompt
- Bot performance — which LLM survives longest when playing its own game
- Larger scope — how models handle a 500-line game spec vs this 150-word prompt