AoE2 LLM Arena — Technical Documentation

A two-tier AI agent that plays Age of Empires II: Definitive Edition, plus a synthetic evaluation tier (Arena) that races prompt/model variants against an in-memory AoE2-lite world, and a web UI for replaying and forking past runs.

Architecture overview

Dashed lines indicate optional / off-by-default components. The real-game tier runs without YOLO; the arena tier defaults to the in-process broker; Redis is a Phase C add-on.

Reading paths

Short curated routes through the tutorial — pick one based on what you want to learn, instead of reading all 23 chapters end-to-end.

15-minute tour — 01 System Overview → 07 Detector Architecture → 14 Arena Overview.
LLM-agent design — Parts 1, 2, and 8: 01, 04, 05, 06, 22, 23.
Computer vision — Parts 3–5: 07, 08, 09, 11, 13.
Arena infra — Parts 6–7: 14, 15, 16, 17, 18, 19, 20.

See also the Glossary for one-line definitions of terms used throughout the tutorial.

Part 1: Real-game architecture

#	Chapter	Description	Key files
01	System Overview	Two-tier design, graceful degradation, async architecture	`config.py`, `main.py`
02	Game Loop Pipeline	Capture-detect-alarm-strategist-execute-verify cycle (RTC pipelining, reactive tier)	`game_loop.py`, `reactive.py`, `turn_phases.py`, `goals.py`, `screen.py`
03	Action Model & Execution	Pydantic action types, target_id/target_class resolution	`models.py`, `executor.py`

Part 2: LLM integration

#	Chapter	Description	Key files
04	Provider Pattern	Abstract base, Claude executor (text-only), strategist (text + local OCR)	`providers/base.py`, `providers/claude.py`, `providers/strategist.py`
05	Prompt Engineering	Executor + strategist prompt design	`prompts/system.md`, `prompts/strategist.md`
06	Context Injection	Memory system, goals, resources, dynamic game knowledge	`memory.py`, `goals.py`, `providers/claude.py`

Part 3: Entity detection

#	Chapter	Description	Key files
07	Detector Architecture	EntityDetector, PyTorch/ONNX/Mock backends, 60-class taxonomy	`packages/detection/src/inference/detector.py`
08	Training Pipeline	Synthetic data, augmentations, YOLO26n training	`training/generate_training_data.py`, `training/train_yolo.py`
09	Labeling & Active Learning	CVAT workflow, COCO/YOLO conversion, class definitions	`labeling/prepare_training.py`, `labeling/class_mapping.py`

Part 4: Game knowledge

#	Chapter	Description	Key files
10	Knowledge Database	SQLite schema, data sources, dynamic queries	`packages/data/src/game_knowledge.py`, `packages/data/src/fetch_aoe2_data.py`
11	Sprite Extraction	SLD format, DXT1 decompression, player color recoloring	`packages/detection/src/extraction/sld_extractor.py`
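
As a rough illustration of the DXT1 decompression step named in chapter 11, the sketch below decodes a single standard DXT1 (BC1) block into a 4×4 grid of RGBA pixels. It follows the generic BC1 layout only; any SLD-specific framing or variations handled by `sld_extractor.py` are not modelled, and the function names `rgb565_to_rgb888` and `decode_dxt1_block` are hypothetical, not taken from the repository.

```python
import struct

def rgb565_to_rgb888(c: int) -> tuple[int, int, int]:
    """Expand a packed RGB565 value to 8 bits per channel by bit replication."""
    r = (c >> 11) & 0x1F
    g = (c >> 5) & 0x3F
    b = c & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

def _mix(a, b, wa, wb):
    """Weighted integer average of two RGB triples (floor division)."""
    total = wa + wb
    return tuple((wa * x + wb * y) // total for x, y in zip(a, b))

def decode_dxt1_block(block: bytes) -> list[list[tuple[int, int, int, int]]]:
    """Decode one 8-byte BC1 block into 4 rows of 4 RGBA tuples (hypothetical helper)."""
    if len(block) != 8:
        raise ValueError("a DXT1 block is exactly 8 bytes")
    # Two little-endian RGB565 endpoints, then 16 two-bit palette indices.
    c0, c1, bits = struct.unpack("<HHI", block)
    p0, p1 = rgb565_to_rgb888(c0), rgb565_to_rgb888(c1)
    if c0 > c1:
        # Four-colour mode: two interpolated colours, all opaque.
        palette = [p0 + (255,), p1 + (255,),
                   _mix(p0, p1, 2, 1) + (255,), _mix(p0, p1, 1, 2) + (255,)]
    else:
        # Three-colour mode: one midpoint colour, index 3 is transparent black.
        palette = [p0 + (255,), p1 + (255,),
                   _mix(p0, p1, 1, 1) + (255,), (0, 0, 0, 0)]
    rows = []
    for y in range(4):
        row = []
        for x in range(4):
            idx = (bits >> (2 * (4 * y + x))) & 0x3  # texel (x, y), row-major
            row.append(palette[idx])
        rows.append(row)
    return rows
```

Interpolation rounding differs slightly between decoders; floor division is used here for simplicity.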

Part 5: Operations

#	Chapter	Description	Key files
12	Cloud Training	Lambda Labs workflow, dataset packaging, cost analysis	`tmp/train_v2_lambda.sh`
13	Class Schema Evolution	Schema history, unified 60-class taxonomy, legacy mapping	`labeling/class_mapping.py`, `training/config/classes.yaml`

Part 6: Evaluation arena

#	Chapter	Description	Key files
14	Arena Overview	race / smoke / rank — when to use which	`apps/arena/src/__main__.py`, `apps/arena/src/race.py`
15	Event Broker	Protocol, in-process vs Redis, backpressure, `/metrics`	`packages/evaluation/src/event_broker.py`, `packages/evaluation/src/redis_broker.py`, `packages/evaluation/src/broker_factory.py`
16	DuckDB Persister and Replay	Event log schema, cold-path reader, fork primitive	`packages/evaluation/src/event_log.py`, `packages/evaluation/src/duckdb_persister.py`, `packages/evaluation/src/fork.py`
17	Ranking Pipeline	Bradley-Terry MLE, scenarios, bootstrap CIs	`apps/arena/src/ranking.py`, `apps/arena/src/scenarios.py`, `apps/arena/src/profiles/ranking-v1.yaml`
18	Synthetic World Sim	AoE2-lite economy model + perception projection	`packages/evaluation/src/world_sim.py`
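
For readers unfamiliar with the Bradley-Terry MLE mentioned in chapter 17, here is a minimal sketch of one common way to fit it: the classic minorization-maximization (Zermelo) iteration over a pairwise win matrix. The function name `bradley_terry` and the `wins[i][j]` input layout are assumptions for illustration; the actual `ranking.py` may use a different optimizer, and the bootstrap CIs are not shown.

```python
def bradley_terry(wins: list[list[float]], iters: int = 200, tol: float = 1e-10) -> list[float]:
    """Estimate Bradley-Terry strengths from wins[i][j] = times i beat j.

    Returns strengths normalized to sum to 1. Hypothetical helper using the
    MM update p_i <- W_i / sum_j n_ij / (p_i + p_j).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and (p[i] + p[j]) > 0
            )
            # A variant with no recorded games keeps its previous strength.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p
```

With two variants where the first wins 3 of 4 matches, the fitted strengths give a 0.75 win probability for the first, matching the observed rate.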

Part 7: Arena web

#	Chapter	Description	Key files
19	Web Architecture	FastAPI lifespan, `/events` dispatch, reaper, `/forks` flow	`apps/api/src/server.py`, `apps/api/src/forks.py`
20	Fork and Diff UI	Timeline scrubber, World/Trace/Diff/Operator tabs	`apps/dashboard/src/App.tsx`, `panels/*`
21	Running the UI Locally	Dev proxy, VITE_API_BASE_URL, deployment modes	`apps/dashboard/vite.config.ts`

Part 8: Autoresearch

#	Chapter	Description	Key files
22	Autoresearch Overview	Reflective mutate → run → score → accept/revert loop (Pareto frontier)	`apps/autoresearch/src/orchestrator.py`, `apps/autoresearch/src/pareto.py`, `apps/autoresearch/src/trace.py`, `apps/autoresearch/src/config.yaml`
23	Prompt Mutation and Memory	Mutator constraints, protected sections, memory chain	`apps/autoresearch/src/prompt_mutator.py`, `apps/autoresearch/src/memory_chain.py`
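
The accept/revert step against a Pareto frontier, as named in chapter 22, can be sketched roughly as follows. This assumes every score is a tuple of objectives where higher is better; the helper names `dominates` and `update_frontier` are hypothetical, and the real `orchestrator.py` and `pareto.py` may apply different objectives or tie-breaking.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_frontier(
    frontier: list[tuple[float, ...]], candidate: tuple[float, ...]
) -> tuple[list[tuple[float, ...]], bool]:
    """Return (new_frontier, accepted) for one scored candidate (illustrative only)."""
    # Revert: the candidate is dominated by, or identical to, a kept point.
    if any(dominates(kept, candidate) or kept == candidate for kept in frontier):
        return frontier, False
    # Accept: drop any kept points the candidate now dominates.
    survivors = [kept for kept in frontier if not dominates(candidate, kept)]
    return survivors + [candidate], True
```

A mutation that trades one objective for another is accepted alongside the existing points, while one that is worse on every objective is reverted.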

Architecture Decision Records (ADRs)

Short (~1 page) decisions that shaped the current architecture. Read these to understand the “why”; the chapters above describe the “what”.

Runbooks

“You have a problem right now” checklists. Symptom → diagnosis → command, not narrative.

Redis broker operations — compose stack, password rotation, key inspection.
Switching the broker backend — in-process ↔ Redis switching, verification.
Debugging a stuck fork or replay — what to check, in what order.
Windows VM agent bring-up — fast path + symptom matrix. Full first-time setup is in deployment-guide.md.
Retrain the detection model (v6 / YOLO26n) — end-to-end retraining loop: sprite extraction, real-terrain backgrounds, synthetic generation, cvat.ai annotation, Lambda training, deploy.

Design specs (frozen historical)

Original architectural proposals. Status headers note what shipped. Kept as “why we built it this way” context; current state lives in the chapters above.

Event Broker Architecture — log-first SSE design that became Part 6, chapters 15–16, and ADRs 0001–0002. Status: SHIPPED.
Synthetic Arena Analysis — fork/race/mutate/observe analysis that became Parts 6–7. Status: SUPERSEDED BY IMPLEMENTATION.
Autoresearch Plan — 5-phase Karpathy-inspired plan. Status: PARTIALLY SHIPPED (Phases 0–1; 2–5 unbuilt).

Explorations

Speculative scratch documents that haven’t crystallized into shipped designs.

eval-virtualbox-ideas.md — notes on VirtualBox-based headless game replay.

Quick links

Game loop entry point — the capture-detect-think-act cycle.
Action types reference.
System prompt — what the executor LLM knows.
60-class taxonomy.
Arena CLI cheatsheet.
Broker backpressure semantics.

Conventions

Code references: file.py:42 format points to exact source lines (paths relative to agent/).
Cross-references: [Chapter N](./path) between related topics.
Status callouts: design docs carry a Status: line noting whether they’re proposals, shipped, or superseded.
Optional modules: dashed lines in diagrams; explicit notes for graceful-fallback dependencies.
ADRs vs chapters: ADRs answer “why?” in 1 page. Chapters answer “how does it work today?” in detail. Design specs in design/ answer “how did we get here?” and are frozen in time.
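
As a small illustration of the code-reference convention above, the sketch below splits a `file.py:42` style reference into a path under `agent/` and a line number. The helper `parse_code_ref` is hypothetical and not part of the repository; it only shows how such references could be resolved by tooling.

```python
import re
from pathlib import PurePosixPath

# path (no spaces or colons), a colon, then a decimal line number
CODE_REF = re.compile(r"^(?P<path>[^\s:]+):(?P<line>\d+)$")

def parse_code_ref(ref: str, root: str = "agent") -> tuple[PurePosixPath, int]:
    """Resolve 'file.py:42' to (root/file.py, 42); raises ValueError if malformed."""
    match = CODE_REF.match(ref)
    if match is None:
        raise ValueError(f"not a code reference: {ref!r}")
    return PurePosixPath(root) / match["path"], int(match["line"])
```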