AoE2 LLM Arena — Technical Documentation
A two-tier AI agent that plays Age of Empires II: Definitive Edition. It ships with a synthetic evaluation tier (Arena) that races prompt/model variants against an in-memory AoE2-lite world, and a web UI for replaying and forking past runs.
Architecture overview
Dashed lines indicate optional / off-by-default components. The real-game tier runs without YOLO; the arena tier defaults to the in-process broker; Redis is a Phase C add-on.
Reading paths
Short curated routes through the tutorial — pick one based on what you want to learn, instead of reading all 23 chapters end-to-end.
- 15-minute tour — 01 System Overview → 07 Detector Architecture → 14 Arena Overview.
- LLM-agent design — Parts 1, 2, and 8: 01, 04, 05, 06, 22, 23.
- Computer vision — Parts 3–5: 07, 08, 09, 11, 13.
- Arena infra — Parts 6–7: 14, 15, 16, 17, 18, 19, 20.
See also the Glossary for one-line definitions of terms used throughout the tutorial.
Table of contents
Part 1: Real-game architecture
| # | Chapter | Description | Key files |
|---|---|---|---|
| 01 | System Overview | Two-tier design, graceful degradation, async architecture | config.py, main.py |
| 02 | Game Loop Pipeline | Capture-detect-alarm-strategist-execute-verify cycle (RTC pipelining, reactive tier) | game_loop.py, reactive.py, turn_phases.py, goals.py, screen.py |
| 03 | Action Model & Execution | Pydantic action types, target_id/target_class resolution | models.py, executor.py |
Part 2: LLM integration
| # | Chapter | Description | Key files |
|---|---|---|---|
| 04 | Provider Pattern | Abstract base, Claude executor (text-only), strategist (text + local OCR) | providers/base.py, providers/claude.py, providers/strategist.py |
| 05 | Prompt Engineering | Executor + strategist prompt design | prompts/system.md, prompts/strategist.md |
| 06 | Context Injection | Memory system, goals, resources, dynamic game knowledge | memory.py, goals.py, providers/claude.py |
Part 3: Entity detection
| # | Chapter | Description | Key files |
|---|---|---|---|
| 07 | Detector Architecture | EntityDetector, PyTorch/ONNX/Mock backends, 60-class taxonomy | packages/detection/src/inference/detector.py |
| 08 | Training Pipeline | Synthetic data, augmentations, YOLO26n training | training/generate_training_data.py, training/train_yolo.py |
| 09 | Labeling & Active Learning | CVAT workflow, COCO/YOLO conversion, class definitions | labeling/prepare_training.py, labeling/class_mapping.py |
Part 4: Game knowledge
| # | Chapter | Description | Key files |
|---|---|---|---|
| 10 | Knowledge Database | SQLite schema, data sources, dynamic queries | packages/data/src/game_knowledge.py, packages/data/src/fetch_aoe2_data.py |
| 11 | Sprite Extraction | SLD format, DXT1 decompression, player color recoloring | packages/detection/src/extraction/sld_extractor.py |
Part 5: Operations
| # | Chapter | Description | Key files |
|---|---|---|---|
| 12 | Cloud Training | Lambda Labs workflow, dataset packaging, cost analysis | tmp/train_v2_lambda.sh |
| 13 | Class Schema Evolution | Schema history, unified 60-class taxonomy, legacy mapping | labeling/class_mapping.py, training/config/classes.yaml |
Part 6: Evaluation arena
| # | Chapter | Description | Key files |
|---|---|---|---|
| 14 | Arena Overview | race / smoke / rank — when to use which | apps/arena/src/__main__.py, apps/arena/src/race.py |
| 15 | Event Broker | Protocol, in-process vs Redis, backpressure, /metrics | packages/evaluation/src/event_broker.py, packages/evaluation/src/redis_broker.py, packages/evaluation/src/broker_factory.py |
| 16 | DuckDB Persister and Replay | Event log schema, cold-path reader, fork primitive | packages/evaluation/src/event_log.py, packages/evaluation/src/duckdb_persister.py, packages/evaluation/src/fork.py |
| 17 | Ranking Pipeline | Bradley-Terry MLE, scenarios, bootstrap CIs | apps/arena/src/ranking.py, apps/arena/src/scenarios.py, apps/arena/src/profiles/ranking-v1.yaml |
| 18 | Synthetic World Sim | AoE2-lite economy model + perception projection | packages/evaluation/src/world_sim.py |
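
The Bradley-Terry MLE named in chapter 17 can be illustrated with a minimal sketch. This is not the code in apps/arena/src/ranking.py: the function names, the `wins` input shape, and the normalization are assumptions made for illustration, and the bootstrap confidence intervals are left out. It fits one strength per variant from pairwise win counts using the standard MM (Zermelo) update.

```python
def bradley_terry(wins, iterations=500, tol=1e-10):
    """Fit Bradley-Terry strengths from pairwise results.

    wins[(a, b)] is the number of times variant a beat variant b
    (hypothetical input shape). Returns {variant: strength}, scaled so
    the strengths sum to the number of variants.
    """
    players = sorted({p for pair in wins for p in pair})
    if not players:
        return {}

    total_wins = {p: 0.0 for p in players}
    games = {}  # unordered pair -> number of games played between them
    for (winner, loser), n in wins.items():
        if n <= 0 or winner == loser:
            continue  # ignore empty or degenerate entries
        total_wins[winner] += n
        key = frozenset((winner, loser))
        games[key] = games.get(key, 0.0) + n

    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for i in players:
            # MM update: wins_i / sum_j games_ij / (s_i + s_j)
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in players
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = total_wins[i] / denom if denom > 0 else strength[i]
        scale = len(players) / sum(updated.values())
        updated = {p: v * scale for p, v in updated.items()}
        delta = max(abs(updated[p] - strength[p]) for p in players)
        strength = updated
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Model probability that a beats b: s_a / (s_a + s_b)."""
    return strength[a] / (strength[a] + strength[b])


# Two variants, a wins 3 of 4 races: the MLE reproduces P(a beats b) = 3/4.
two_way = bradley_terry({("a", "b"): 3, ("b", "a"): 1})
```

Unlike a raw win rate, the fitted strengths account for who each variant raced against, which is the motivation recorded in ADR 0004. A variant that never wins is driven toward strength 0 in this sketch; a production ranking would typically add a prior or regularization to handle that case.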
Part 7: Arena web
| # | Chapter | Description | Key files |
|---|---|---|---|
| 19 | Web Architecture | FastAPI lifespan, /events dispatch, reaper, /forks flow | apps/api/src/server.py, apps/api/src/forks.py |
| 20 | Fork and Diff UI | Timeline scrubber, World/Trace/Diff/Operator tabs | apps/dashboard/src/App.tsx, panels/* |
| 21 | Running the UI Locally | Dev proxy, VITE_API_BASE_URL, deployment modes | apps/dashboard/vite.config.ts |
Part 8: Autoresearch
| # | Chapter | Description | Key files |
|---|---|---|---|
| 22 | Autoresearch Overview | Reflective mutate → run → score → accept/revert loop (Pareto frontier) | apps/autoresearch/src/orchestrator.py, apps/autoresearch/src/pareto.py, apps/autoresearch/src/trace.py, apps/autoresearch/src/config.yaml |
| 23 | Prompt Mutation and Memory | Mutator constraints, protected sections, memory chain | apps/autoresearch/src/prompt_mutator.py, apps/autoresearch/src/memory_chain.py |
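
The accept/revert step against a Pareto frontier, named in chapter 22, can be sketched as follows. This is an illustrative assumption, not the logic in apps/autoresearch/src/pareto.py: the score tuples, the "higher is better" convention, and both function names are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_frontier(frontier, candidate):
    """Try to add a candidate score vector to the frontier.

    Returns (accepted, new_frontier). A candidate is reverted (rejected)
    when an existing point dominates or equals it; otherwise it is
    accepted and every point it dominates is dropped.
    """
    if any(kept == candidate or dominates(kept, candidate) for kept in frontier):
        return False, list(frontier)
    survivors = [kept for kept in frontier if not dominates(candidate, kept)]
    return True, survivors + [candidate]


# Hypothetical objectives: (win rate, negated cost).
frontier = []
for scores in [(0.5, 0.5), (0.4, 0.4), (0.6, 0.4), (0.7, 0.6)]:
    accepted, frontier = update_frontier(frontier, scores)
```

In this sketch a mutation is kept only when no existing variant is at least as good on every objective, so the frontier holds the set of mutually incomparable trade-offs rather than a single best score.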
Architecture Decision Records (ADRs)
Short (~1 page) decisions that shaped the current architecture. Read these to understand the “why”; the chapters above describe the “what”.
- ADR 0001 — Broker-first event architecture
- ADR 0002 — Redis Streams as cross-process broker backend
- ADR 0003 — pyright → basedpyright with `reportAny`
- ADR 0004 — Bradley-Terry ranking over simple win-rate
- ADR 0005 — Vite + React + Tailwind for arena UI
Runbooks
“You have a problem right now” checklists. Symptom → diagnosis → command, not narrative.
- Redis broker operations — compose stack, password rotation, key inspection.
- Switching the broker backend — in-process ↔ Redis switching, verification.
- Debugging a stuck fork or replay — what to check, in what order.
- Windows VM agent bring-up — fast path + symptom matrix. Full first-time setup is in deployment-guide.md.
- Retrain the detection model (v6 / YOLO26n) — end-to-end retraining loop: sprite extraction, real-terrain backgrounds, synthetic generation, cvat.ai annotation, Lambda training, deploy.
Design specs (frozen historical)
Original architectural proposals. Status headers note what shipped. Kept for “why we built it this way” context; current state lives in the chapters above.
- Event Broker Architecture — log-first SSE design that became Part 6 (chapters 15–16) and ADRs 0001–0002. Status: SHIPPED.
- Synthetic Arena Analysis — fork/race/mutate/observe analysis that became Parts 6–7. Status: SUPERSEDED BY IMPLEMENTATION.
- Autoresearch Plan — 5-phase Karpathy-inspired plan. Status: PARTIALLY SHIPPED (Phases 0–1; 2–5 unbuilt).
Explorations
Speculative scratch documents that haven’t crystallized into shipped designs.
- eval-virtualbox-ideas.md — notes on VirtualBox-based headless game replay.
Quick links
- Game loop entry point — the capture-detect-think-act cycle.
- Action types reference.
- System prompt — what the executor LLM knows.
- 60-class taxonomy.
- Arena CLI cheatsheet.
- Broker backpressure semantics.
Conventions
- Code references: `file.py:42` format points to exact source lines (paths relative to `agent/`).
- Cross-references: `[Chapter N](./path)` between related topics.
- Status callouts: design docs carry a `Status:` line noting whether they’re proposals, shipped, or superseded.
- Optional modules: dashed lines in diagrams; explicit notes for graceful-fallback dependencies.
- ADRs vs chapters: ADRs answer “why?” in 1 page. Chapters answer “how does it work today?” in detail. Design specs in `design/` answer “how did we get here?” and are frozen in time.
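
The code-reference convention above is simple enough to resolve mechanically. The helper below is a hypothetical sketch, not part of the repository: `CodeRef` and `parse_code_ref` are made-up names, and the `agent` root is taken from the convention as stated.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class CodeRef(NamedTuple):
    path: PurePosixPath  # path including the assumed repository root
    line: int            # 1-based source line


def parse_code_ref(ref: str, root: str = "agent") -> CodeRef:
    """Split a 'file.py:42' style reference into a rooted path and a line."""
    file_part, sep, line_part = ref.rpartition(":")
    if not sep or not file_part or not (line_part.isascii() and line_part.isdigit()):
        raise ValueError(f"not a file:line reference: {ref!r}")
    return CodeRef(PurePosixPath(root) / file_part, int(line_part))


# 'providers/claude.py:120' resolves to agent/providers/claude.py, line 120.
ref = parse_code_ref("providers/claude.py:120")
```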