AoE2 LLM Arena — Technical Documentation

A two-tier AI agent that plays Age of Empires II: Definitive Edition, a synthetic evaluation tier (Arena) that races prompt/model variants against an in-memory AoE2-lite world, and a web UI for replaying and forking past runs.


Architecture overview

[Figure: architecture diagram. Edge labels: optional, HTTP, SSE. Components by group:]

Real-game tier (Windows VM): gameplay_agent/main.py, game_loop.py, screen.py, executor.py, goals.py, providers/claude.py, providers/strategist.py

Detection (macOS host, optional): detector.py (YOLO26n), server/app.py (CoreML / ONNX)

Synthetic Arena tier: arena/__main__.py (race / smoke / rank), evaluation/world_sim.py, arena/ranking.py (Bradley-Terry), MultiRunBrokerSink, make_broker, InProcessEventBroker, RedisStreamsBroker, DuckDB log (logs/arena/...)

Arena Web (operator surface): apps/api/src/server.py (FastAPI + SSE), apps/api/src/forks.py (POST /forks → async replay), apps/dashboard (Vite + React + Tailwind)

Autoresearch (prompt evolution): autoresearch/orchestrator.py, prompt_mutator.py, prompts/system.md, game_runner.py, memory_chain.py, memories/*.md

Dashed lines indicate optional / off-by-default components. The real-game tier runs without YOLO; the arena tier defaults to the in-process broker; Redis is a Phase C add-on.


Reading paths

Short curated routes through the tutorial — pick one based on what you want to learn, instead of reading all 23 chapters end-to-end.

See also the Glossary for one-line definitions of terms used throughout the tutorial.


Table of contents

Part 1: Real-game architecture

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 01 | System Overview | Two-tier design, graceful degradation, async architecture | config.py, main.py |
| 02 | Game Loop Pipeline | Capture-detect-alarm-strategist-execute-verify cycle (RTC pipelining, reactive tier) | game_loop.py, reactive.py, turn_phases.py, goals.py, screen.py |
| 03 | Action Model & Execution | Pydantic action types, target_id/target_class resolution | models.py, executor.py |

Part 2: LLM integration

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 04 | Provider Pattern | Abstract base, Claude executor (text-only), strategist (text + local OCR) | providers/base.py, providers/claude.py, providers/strategist.py |
| 05 | Prompt Engineering | Executor + strategist prompt design | prompts/system.md, prompts/strategist.md |
| 06 | Context Injection | Memory system, goals, resources, dynamic game knowledge | memory.py, goals.py, providers/claude.py |

Part 3: Entity detection

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 07 | Detector Architecture | EntityDetector, PyTorch/ONNX/Mock backends, 60-class taxonomy | packages/detection/src/inference/detector.py |
| 08 | Training Pipeline | Synthetic data, augmentations, YOLO26n training | training/generate_training_data.py, training/train_yolo.py |
| 09 | Labeling & Active Learning | CVAT workflow, COCO/YOLO conversion, class definitions | labeling/prepare_training.py, labeling/class_mapping.py |

Part 4: Game knowledge

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 10 | Knowledge Database | SQLite schema, data sources, dynamic queries | packages/data/src/game_knowledge.py, packages/data/src/fetch_aoe2_data.py |
| 11 | Sprite Extraction | SLD format, DXT1 decompression, player color recoloring | packages/detection/src/extraction/sld_extractor.py |

Part 5: Operations

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 12 | Cloud Training | Lambda Labs workflow, dataset packaging, cost analysis | tmp/train_v2_lambda.sh |
| 13 | Class Schema Evolution | Schema history, unified 60-class taxonomy, legacy mapping | labeling/class_mapping.py, training/config/classes.yaml |

Part 6: Evaluation arena

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 14 | Arena Overview | race / smoke / rank: when to use which | apps/arena/src/__main__.py, apps/arena/src/race.py |
| 15 | Event Broker | Protocol, in-process vs Redis, backpressure, /metrics | packages/evaluation/src/event_broker.py, packages/evaluation/src/redis_broker.py, packages/evaluation/src/broker_factory.py |
| 16 | DuckDB Persister and Replay | Event log schema, cold-path reader, fork primitive | packages/evaluation/src/event_log.py, packages/evaluation/src/duckdb_persister.py, packages/evaluation/src/fork.py |
| 17 | Ranking Pipeline | Bradley-Terry MLE, scenarios, bootstrap CIs | apps/arena/src/ranking.py, apps/arena/src/scenarios.py, apps/arena/src/profiles/ranking-v1.yaml |
| 18 | Synthetic World Sim | AoE2-lite economy model + perception projection | packages/evaluation/src/world_sim.py |
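
Chapter 17 names Bradley-Terry MLE as the ranking method. For readers new to it, the following is a minimal sketch of such a fit using the standard minorization-maximization update. It is not the code in apps/arena/src/ranking.py: the function name `bradley_terry`, the `wins` input format, and the stopping parameters are all assumptions made for illustration, and it assumes at least one game has been played.

```python
from collections import defaultdict


def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins maps (winner, loser) -> number of wins (hypothetical input format).
    Returns a dict of strengths normalized to sum to 1.
    """
    players = sorted({p for pair in wins for p in pair})
    total_wins = defaultdict(float)  # W_i: total wins of player i
    games = defaultdict(float)       # n_ij: games played between i and j
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        games[frozenset((winner, loser))] += count

    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(n_iter):
        updated = {}
        for i in players:
            denom = 0.0
            for j in players:
                n_ij = games.get(frozenset((i, j)), 0.0) if j != i else 0.0
                if n_ij > 0:
                    denom += n_ij / (strength[i] + strength[j])
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            updated[i] = total_wins[i] / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        updated = {p: v / norm for p, v in updated.items()}
        delta = max(abs(updated[p] - strength[p]) for p in players)
        strength = updated
        if delta < tol:
            break
    return strength
```

With two variants and a 3:1 record, `bradley_terry({("A", "B"): 3, ("B", "A"): 1})` yields strengths of 0.75 and 0.25, matching the observed win rate. A variant with zero wins is driven to strength 0, which is the degenerate MLE; a production pipeline would likely add regularization or confidence intervals on top.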

Part 7: Arena web

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 19 | Web Architecture | FastAPI lifespan, /events dispatch, reaper, /forks flow | apps/api/src/server.py, apps/api/src/forks.py |
| 20 | Fork and Diff UI | Timeline scrubber, World/Trace/Diff/Operator tabs | apps/dashboard/src/App.tsx, panels/* |
| 21 | Running the UI Locally | Dev proxy, VITE_API_BASE_URL, deployment modes | apps/dashboard/vite.config.ts |

Part 8: Autoresearch

| # | Chapter | Description | Key files |
|---|---------|-------------|-----------|
| 22 | Autoresearch Overview | Reflective mutate → run → score → accept/revert loop (Pareto frontier) | apps/autoresearch/src/orchestrator.py, apps/autoresearch/src/pareto.py, apps/autoresearch/src/trace.py, apps/autoresearch/src/config.yaml |
| 23 | Prompt Mutation and Memory | Mutator constraints, protected sections, memory chain | apps/autoresearch/src/prompt_mutator.py, apps/autoresearch/src/memory_chain.py |
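
The mutate → run → score → accept/revert loop in chapter 22 can be pictured roughly as below. This is a simplified sketch, not the orchestrator's actual code: it uses a single scalar score in place of the Pareto frontier, and the names `evolve`, `mutate`, and `run_and_score` are hypothetical stand-ins for whatever the real modules expose.

```python
from typing import Callable, Tuple


def evolve(
    prompt: str,
    mutate: Callable[[str], str],
    run_and_score: Callable[[str], float],
    rounds: int,
) -> Tuple[str, float]:
    """Hill-climb over prompt variants (illustrative only).

    Each round mutates the current best prompt, runs and scores the
    candidate, and keeps it only if the score strictly improves.
    """
    best_prompt = prompt
    best_score = run_and_score(prompt)
    for _ in range(rounds):
        candidate = mutate(best_prompt)     # mutate
        score = run_and_score(candidate)    # run + score
        if score > best_score:              # accept
            best_prompt, best_score = candidate, score
        # otherwise revert: best_prompt is left unchanged
    return best_prompt, best_score
```

A multi-objective version would replace the `score > best_score` test with a dominance check against a set of non-dominated candidates, which is presumably what the Pareto frontier in chapter 22 refers to.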

Architecture Decision Records (ADRs)

Short (~1 page) decisions that shaped the current architecture. Read these to understand the why; chapters above describe the what.


Runbooks

“You have a problem right now” checklists. Symptom → diagnosis → command, not narrative.


Design specs (frozen historical)

Original architectural proposals. Status headers note what shipped. Kept for “why we built it this way” context; current state lives in the chapters above.


Explorations

Speculative scratch documents that haven’t crystallized into shipped designs.



Conventions