Two-tier LLM agent · Synthetic evaluation arena

An AI agent that plays Age of Empires II — and the arena that tests it.

A Sonnet strategist reads the screen and sets goals. A Haiku executor reads YOLO entity detections and emits actions. A synthetic arena races prompt and model variants against a deterministic world. This site is the technical documentation.
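The two-tier split can be pictured as a loop in which the expensive strategist runs occasionally and the cheap executor runs every tick. The sketch below is illustrative only: `Detection`, `Action`, `run_tick`, the stub functions, the `"sheep"` label, and the `strategist_every` cadence are all hypothetical names and assumptions, not the gameplay-agent package's actual API, and the model calls are replaced by plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str  # class name reported by the detector (hypothetical labels)
    x: int
    y: int


@dataclass(frozen=True)
class Action:
    kind: str
    x: int
    y: int


# Stand-ins for the two model tiers; real calls would go to an LLM API.
Strategist = Callable[[bytes], List[str]]
Executor = Callable[[List[str], List[Detection]], List[Action]]


def run_tick(
    frame: bytes,
    detections: List[Detection],
    strategist: Strategist,
    executor: Executor,
    goals: List[str],
    tick: int,
    strategist_every: int = 10,  # assumed cadence, not taken from the source
) -> Tuple[List[str], List[Action]]:
    """Refresh goals on strategist ticks, then always ask the executor for actions."""
    if tick % strategist_every == 0:
        goals = strategist(frame)
    return goals, executor(goals, detections)


def stub_strategist(frame: bytes) -> List[str]:
    return ["gather_food"]


def stub_executor(goals: List[str], detections: List[Detection]) -> List[Action]:
    if "gather_food" not in goals:
        return []
    return [Action("right_click", d.x, d.y) for d in detections if d.label == "sheep"]
```

Keeping the executor's input to goals plus detections (no raw frame) is what lets the cheaper tier run on every tick in this sketch.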


Figure: real screenshot of Age of Empires II Definitive Edition running in VMware Fusion on macOS. Dark Age, with two villagers and a few forest patches on a partially explored map.

1. Capture screen: frame grabbed from the AoE2 client.

Real screenshot · real detector output (6 entities on this frame) · strategist/executor copy is illustrative.

Start the tutorial

Eight Parts, ~23 chapters. Starts at the system overview and ends at autoresearch.

Begin

Browse by package

Nine packages in a uv workspace. Pick the one closest to what you want to change.

Package map

Runbooks

Operational guides for when something is broken or needs to be restarted.

Open runbooks

Pick a reading path

Short curated routes through the 23-chapter tutorial. Pick the one closest to what you want to learn — or scroll down for the full arc.

15-minute tour

~15 min

Three chapters that give you the whole system in one sitting: how the agent thinks, what it sees, and how we know it's getting better.

LLM-agent design

~6 chapters

If you build LLM agents: how a two-tier strategist/executor split, prompt caching, and prompt-evolution loops play out in production.

Computer vision

~5 chapters

If you train detectors or care about synthetic data: YOLO architecture, our 60-class taxonomy, training pipeline, sprite extraction, and class-schema evolution.

Arena infra

~7 chapters

If you build distributed/observability systems: event broker, DuckDB persister, Bradley-Terry ranking, FastAPI + SSE backend, and the fork/diff UI.

The tutorial arc

The eight Parts in order. Pick a part to jump in.

Packages

Nine packages share types through core.

Shared

core

Shared event / payload / world-state types. The protocol everyone agrees on.

data

SQLite game-knowledge database: tech trees, building stats, unit properties.

Part IV

Detection

detection

YOLO inference, model training, labeling UI, ownership classification (60 classes, 92.2% mAP50).

depends on core · Part III

detection-server

macOS CoreML/ONNX inference endpoint (~15 ms on the Neural Engine vs. 1.2 s on CPU).

depends on detection · Part III

Real game

gameplay-agent

Real-game loop, goal manager, alarm system, Sonnet strategist + Haiku executor.

depends on core, data, detection · Part I

Arena

evaluation

Event broker (in-process / Redis), DuckDB persister, deterministic world_sim projection.

depends on core · Part VI
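As a rough picture of the in-process flavor of the event broker, the sketch below shows synchronous publish/subscribe with a list-backed subscriber standing in for the DuckDB persister. `InProcessBroker`, `ListPersister`, and the `"turn"` topic are hypothetical names chosen for illustration; the evaluation package's real classes and topics may differ.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class InProcessBroker:
    """Hypothetical synchronous pub/sub broker; not the project's actual class."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Deliver the event to every subscriber of the topic; return how many received it."""
        handlers = list(self._handlers.get(topic, ()))
        for handler in handlers:
            handler(event)
        return len(handlers)


class ListPersister:
    """Stand-in for a persister: appends a copy of each event to an in-memory list."""

    def __init__(self) -> None:
        self.rows: List[Event] = []

    def __call__(self, event: Event) -> None:
        self.rows.append(dict(event))
```

Because producers only see `publish`, an implementation backed by Redis could be swapped in behind the same two methods without touching them.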

arena

Synthetic CLI (race/smoke/rank), Bradley-Terry ranking, multi-run orchestration.

depends on core, evaluation, gameplay-agent · Part VI
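Bradley-Terry ranking turns pairwise outcomes into one strength score per variant, with the probability that i beats j modeled as s_i / (s_i + s_j). The sketch below fits those strengths with the standard minorization-maximization update. It is a generic textbook version, not the arena package's code; the function name, the `(winner, loser)` input shape, and the assumption that winner and loser are always distinct are illustrative.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple


def bradley_terry(
    results: Iterable[Tuple[str, str]], iterations: int = 200
) -> Dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs using MM updates."""
    wins: Dict[str, float] = defaultdict(float)
    games: Dict[FrozenSet[str], float] = defaultdict(float)
    players = set()
    for winner, loser in results:
        wins[winner] += 1
        pair = frozenset((winner, loser))
        games[pair] += 1
        players.update(pair)

    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            # Sum n_pq / (s_p + s_q) over every opponent q that p has faced.
            denom = 0.0
            for pair, n in games.items():
                if p in pair:
                    (q,) = pair - {p}
                    denom += n / (strength[p] + strength[q])
            updated[p] = wins[p] / denom if denom else strength[p]
        # Rescale so the mean strength stays at 1; only ratios are meaningful.
        total = sum(updated.values())
        strength = {p: s * len(players) / total for p, s in updated.items()}
    return strength
```

For example, if variant A beats variant B in three of four races, the fitted strengths satisfy s_A / s_B = 3, which matches the observed 75% win rate.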

arena-web

FastAPI + SSE backend for live tailing, DuckDB queries, fork replay. Powers the internal UI.

depends on core, evaluation, arena · Part VII
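Live tailing over SSE comes down to writing text frames in the Server-Sent Events wire format. The helper below sketches that framing using only the standard library. `format_sse` and the `"turn"` event name are hypothetical; how arena-web actually names its events and shapes its payloads is not shown here.

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None, event_id: Optional[int] = None) -> str:
    """Serialize one JSON payload as a Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Compact JSON never contains raw newlines, so a single data line suffices.
    lines.append("data: " + json.dumps(data, separators=(",", ":")))
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"
```

A generator yielding such strings can be served with a streaming response using the `text/event-stream` media type, and a browser `EventSource` would receive each frame as one event.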

Operations

autoresearch

Automated prompt-optimization loop: mutator, game_runner, memory-chain evolution.

depends on gameplay-agent · Part VIII

See it in action

The internal arena UI shows what the agent perceives, decides, and does, turn by turn.

Runbooks

Operational guides for the moments when something is broken or needs to be restarted.

Ready to dig in?

The tutorial walks the full system end to end. Each chapter calls out the files it describes, so you can read the code alongside the docs.

Start at Chapter 01