
Synthetic Arena: An Analysis of Forkable, Raceable, Mutable Agent Evaluation for AoE2 LLM Arena

Date: 2026-05-11
Author: Claude (research + analysis)
Status: SUPERSEDED BY IMPLEMENTATION. Fork / race / mutate / observe shipped through Phase 9 plus the broker rollout. Frozen historical analysis; for current state see Part 6: Evaluation Arena and Part 7: Arena Web. (The “vision-LLM” strategist mentioned below also predates the move to local OCR; the strategist is now text-only.)


Context

The AoE2 LLM Arena agent is maturing past the point where single-run, single-config, real-game testing scales. Iterating on prompts, models, perception parameters, or strategist cadence currently requires running a real AoE2 instance in a Windows VM, paying ~$0.50/run for live LLM tests, and reading file-based structlogs after the fact. The autoresearch loop exists and runs sequential games, but it cannot:

  1. Fork: start N agents from an identical mid-game state and let them diverge.
  2. Race: run config variants in parallel and pick a winner with statistical rigor.
  3. Mutate: pause a run, change a world parameter (resources, unit counts, fog), and resume.
  4. Observe / steer: surface a live web view of agent reasoning, world state, and fork-diffs.

We want all four. The scope of this analysis is a synthetic perception layer: the real gameplay_agent runtime (detection_phase → strategist_phase → turn_phases → executor) talks to a fake world instead of AoE2.exe. Variants race on prompts, models, perception, and loop pacing as a configurable grid.

This document maps the field’s best practices to those four capabilities and recommends a concrete architecture rooted in the existing codebase.


Current state — what exists, what’s missing

A 60-second tour of what was found in this repo:

| Capability | Already exists | Gap |
|---|---|---|
| Stateful synthetic world | evaluation/world_sim.py (resources, ages, villager queue, building costs) | Doesn’t emit detections; no perturbation API |
| Synthetic perception | detection/inference/mock.py (mock_detect() returns frozen Dark Age) | Stateless; no awareness of world_sim state |
| Decoupled agent loop | gameplay_agent/{detection,strategist,turn}_phase.py (recent refactor) | Phase modules read globals (singleton config); no instance scoping |
| Scenario harness | evaluation/runner.py + scenarios/*.yaml + assertion DSL | One-shot; no fork / branching / pause |
| Experiment orchestrator | autoresearch/orchestrator.py, game_runner.py | Sequential, single-machine, no ranking |
| Replay log | structlog → logs/YYYY_MM_DD/game.txt + optional screenshots | Append-only text; not queryable; not event-sourced |
| Metrics | AgentMemory.get_metrics_snapshot() returns 20+ fields | Per-run only, no cross-run aggregation/UI |
| Web UI | Empty .superset/config.json placeholder; FastAPI present (for detection server) | No game-state dashboard exists |
| Determinism | random.seed(42) in mock_detect | LLM temperature is hardcoded SDK default; random.uniform() in executor.py:221 is unseeded |
| Reference paths | gameplay_agent/main.py, gameplay_agent/game_loop.py, gameplay_agent/config.py, gameplay_agent/memory.py, gameplay_agent/providers/claude_tools.py | |

The takeaway: this is not a greenfield environment. The shape of the answer is “compose existing parts into a fork-able harness”, not “build a simulator from scratch.”


Field survey: best practices, with citations

1. Forking / branching from a common state

Canonical patterns

Core invariant: a forkable simulation must snapshot the tuple (world_state, agent_state, RNG_state). Missing the RNG state turns “deterministic replay” into a lie. For LLM agents the tuple grows: also (LLM_context, prompt_cache_key).
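The snapshot tuple above can be sketched as a small data class. This is a minimal illustration, not the project's implementation: the Snapshot class, take_snapshot(), and restore_rng() are hypothetical names, and world and agent state are assumed to be plain dictionaries.

```python
import copy
import random
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Snapshot:
    """Everything needed to resume or fork a run at one timestep (hypothetical shape)."""
    world_state: dict[str, Any]
    agent_state: dict[str, Any]
    rng_state: tuple  # opaque value from random.Random.getstate()
    llm_context: list[dict[str, str]] = field(default_factory=list)
    prompt_cache_key: str = ""


def take_snapshot(world, agent, rng, llm_context, prompt_cache_key) -> Snapshot:
    # Deep-copy mutable state so later ticks cannot alter the snapshot.
    return Snapshot(
        world_state=copy.deepcopy(world),
        agent_state=copy.deepcopy(agent),
        rng_state=rng.getstate(),
        llm_context=copy.deepcopy(llm_context),
        prompt_cache_key=prompt_cache_key,
    )


def restore_rng(snapshot: Snapshot) -> random.Random:
    # Each fork gets its own generator positioned exactly where the parent was.
    rng = random.Random()
    rng.setstate(snapshot.rng_state)
    return rng
```

Two forks restored from the same snapshot draw identical random numbers until their inputs diverge, which is the property that makes a fork comparison meaningful.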

2. Multi-variant racing / tournament evaluation

3. Pause / inject / resume — interactive simulation manipulation

The dominant pattern is event-sourced replay: store every action + RNG draw in an append-only log, replay deterministically to any timestep.

Concrete pattern: event-source every game; periodic full snapshots as cheap insurance against replay drift; fork(run_id, t, mutation_fn) API; store in DuckDB/SQLite at single-machine scale.
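As a rough sketch of that pattern, the class below keeps an append-only event list, takes a full snapshot every few events, and rebuilds any timestep from the nearest earlier snapshot. The event kinds ("gather", "spend"), the flat resource dictionary, and the SNAPSHOT_EVERY cadence are illustrative assumptions, not the project's schema.

```python
from dataclasses import dataclass, field

SNAPSHOT_EVERY = 10  # hypothetical cadence; tune against replay cost


def apply_event(state: dict, event: dict) -> dict:
    """Pure reducer: return a new state with one event applied."""
    new_state = dict(state)
    resource = event["resource"]
    if event["kind"] == "gather":
        new_state[resource] = new_state.get(resource, 0) + event["amount"]
    elif event["kind"] == "spend":
        new_state[resource] = new_state.get(resource, 0) - event["amount"]
    else:
        raise ValueError(f"unknown event kind: {event['kind']}")
    return new_state


@dataclass
class EventLog:
    initial_state: dict
    events: list = field(default_factory=list)
    snapshots: dict = field(default_factory=dict)  # t -> state after t events

    def append(self, event: dict) -> None:
        self.events.append(event)
        t = len(self.events)
        if t % SNAPSHOT_EVERY == 0:
            self.snapshots[t] = self.state_at(t)

    def state_at(self, t: int) -> dict:
        """Rebuild the state after the first t events."""
        if not 0 <= t <= len(self.events):
            raise IndexError(f"t={t} is outside the log")
        # Start from the nearest snapshot at or before t, else from the start.
        base_t = max((s for s in self.snapshots if s <= t), default=0)
        state = dict(self.snapshots.get(base_t, self.initial_state))
        for event in self.events[base_t:t]:
            state = apply_event(state, event)
        return state
```

Because apply_event() is pure, comparing a replayed state against a stored snapshot is also a cheap way to detect replay drift.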

4. Web interfaces for observing / steering

5. Stochasticity & reproducibility — the LLM problem

LLMs are not deterministic at temperature=0. Recent results that change how to evaluate:

Mitigations, in order of cost:

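  1. temperature=0, plus a fixed seed where the API supports one (OpenAI exposes a best-effort seed parameter; verify per provider, since not every API, Anthropic's Messages API included, documents one).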
  2. Prompt caching — identical prefix → more consistent outputs + cheaper.
  3. Run N trials per condition (N≥20 for ranking; N≥100 to detect <5% deltas). Single-trial benchmarks are unreliable.
  4. Log and compare action sequences, not only outcomes — win-rate can be stable while paths diverge wildly.
  5. Pin model snapshot (claude-sonnet-4-6, not floating aliases).
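Mitigations 3 and 4 can be made concrete with two small helpers. This is a sketch under stated assumptions: outcomes are treated as binary wins, the Wilson score interval is one reasonable choice of confidence interval (the source does not prescribe one), and both function names are hypothetical.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wins / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def first_divergence(a: list, b: list):
    """Index of the first differing action, or None if the sequences are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they diverge where it ends.
    return None if len(a) == len(b) else min(len(a), len(b))
```

For example, 12 wins in 20 trials gives an interval of roughly 0.39 to 0.78, which still contains 0.5, so that result alone does not separate the variant from a coin flip. first_divergence() supports the path comparison: two runs with the same outcome can still split at turn 2.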

6. Prior work directly relevant to AoE2

| System | Domain | Pattern to steal | Source |
|---|---|---|---|
| PyAge2 | AoE2 (!) | OpenAI-Gym wrapper, DLL injection, 20–30 min game → seconds | github |
| aoe2-ai-module | AoE2 DE | Unofficial AI scripting extensions (closest thing to a real API) | github |
| AlphaStar | StarCraft II | League + PFSP + exploiter agents + raw-vs-camera replay | Nature |
| SIMA / SIMA 2 | 3D-game generalist | 600-skill atomic eval taxonomy, OCR task-completion detection | SIMA 2 paper |
| Voyager | Minecraft | Skill library as content-addressable code, self-verification loop | arxiv 2305.16291 |
| Generative Agents | Social sandbox | URL-addressable timestep replay, NL world-state injection | arxiv 2304.03442 |
| Inspect AI | LLM-agent harness | Sandboxing, external-agent adapter, static-HTML trace viewer | inspect.aisi.org.uk |
| Reverb | RL infra | Table-of-trajectories abstraction, priority sampling | arxiv 2102.04736 |

PyAge2 is the closest prior art for AoE2 itself — even if not adopted directly, its action-space shaping decisions encode hard-won AoE2-specific lessons.


The scope choice — synthetic perception layer — is the right pivot. The agent’s whole code path (detection → ownership → context build → LLM call → tool dispatch → executor) runs unchanged; only the bottom layer (screenshot capture + YOLO detection + executor mouse/keyboard sinks) is replaced.

┌─────────────────────────────────────────────────────────────┐
│  Arena Controller (new)                                     │
│  - spawns N agent processes per Run                          │
│  - injects WorldState seed + ConfigProfile                  │
│  - subscribes to per-agent event streams                    │
└──────────────────────────┬──────────────────────────────────┘

            ┌──────────────┼──────────────┐
            │              │              │
   ┌────────▼────┐ ┌──────▼─────┐ ┌──────▼─────┐
   │ Agent A     │ │ Agent B    │ │ Agent C    │
   │ ConfigProf. │ │ ConfigProf.│ │ ConfigProf.│
   │             │ │            │ │            │
   │ game_loop ──┼─┼─ game_loop─┼─┼─ game_loop │ ◀── real code path
   │  ↓ phases   │ │  ↓ phases  │ │  ↓ phases  │
   │ detection_  │ │ detection_ │ │ detection_ │
   │  phase      │ │  phase     │ │  phase     │
   └──┬──────────┘ └──┬─────────┘ └──┬─────────┘
      │ swap          │ swap         │ swap
      ▼               ▼              ▼
   ┌────────────────────────────────────────────┐
   │ SyntheticWorldServer (new)                 │
   │   - per-agent WorldState instance          │
   │   - tick() drives resources/age/queues     │
   │   - render() → DetectedEntity[]            │
   │   - apply_action() consumes agent actions  │
   │   - mutate() applies operator perturbation │
   │   - snapshot()/restore() for fork & resume │
   └────────────────────────────────────────────┘

                           ▼ event-sourced
   ┌────────────────────────────────────────────┐
   │ Event Log (SQLite/DuckDB)                  │
   │   (run_id, agent_id, t, kind, payload)     │
   └──────────────────────────┬─────────────────┘

   ┌────────────────────────────────────────────┐
   │ FastAPI + SSE + React (new)                │
   │   - live trace per agent                   │
   │   - world-state timeline + minimap         │
   │   - fork diff view                         │
   │   - operator mutation form                 │
   └────────────────────────────────────────────┘

Key design decisions

A. Synthetic world = world_sim.py + a perception projection

Promote evaluation/world_sim.py to a first-class SyntheticWorld with:

Why this works: world_sim.py already models the right state shape for AoE2 Dark→Imperial regression testing. The current mock_detect() is a constant function of (screenshot dimensions); upgrading it to be a function of WorldState is small, additive, and unblocks everything else.
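A minimal sketch of what "mock detection as a function of WorldState" could look like. The field names on DetectedEntity and WorldState, the pixel sizes, and the placement scheme are all assumptions made for illustration; the real schemas live in the detection and evaluation packages and may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedEntity:  # hypothetical schema, not the real detector's
    class_name: str
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels
    confidence: float


@dataclass
class WorldState:  # hypothetical subset of the state in world_sim.py
    villagers: int = 3
    town_centers: int = 1
    houses: int = 0
    seed: int = 42


SIZES = {"town_center": 160, "house": 64, "villager": 24}  # illustrative pixel sizes


def render(world: WorldState, width: int = 1920, height: int = 1080) -> list[DetectedEntity]:
    """Project world state to detections; the same world always yields the same output."""
    rng = random.Random(world.seed)  # fresh generator per call keeps render() pure
    counts = [
        ("town_center", world.town_centers),
        ("house", world.houses),
        ("villager", world.villagers),
    ]
    detections = []
    for class_name, count in counts:
        size = SIZES[class_name]
        for _ in range(count):
            x = rng.randint(0, width - size)
            y = rng.randint(0, height - size)
            detections.append(DetectedEntity(class_name, (x, y, x + size, y + size), 1.0))
    return detections
```

A production version would track per-entity positions in the world state rather than re-deriving them from a seed, so that adding a house does not move the villagers.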

Calibrate WorldState constants (unit costs, build times, age requirements, gather rates) from openage’s converted nyan data files rather than hand-encoding them. openage itself is pre-alpha and not viable as a simulation backend today (“gameplay is basically non-functional” per their README), but its asset-converter output of the original AoE2 DAT files is authoritative ground truth for game constants. Depending on it as data, not as a runtime, avoids the GPLv3 and C++/Qt/Cython build complexity that adopting the engine itself would bring.

B. Agent process isolation via configuration scoping

The singleton config in gameplay_agent/config.py is the single biggest blocker to parallel racing. Three options, ranked:

  1. Process-per-agent + env-var injection (recommended). Each variant is a subprocess with its own env. Already supported by Config.from_env(). Lowest code change. Natural fault isolation.
  2. Thread-per-agent + ContextVar scoping. Faster startup but needs surgical removal of every import-time "from .config import config" read.
  3. Asyncio-per-agent. Same problem as #2 but with cooperative scheduling.

Pick #1 unless racing 50+ variants on one machine, in which case revisit #2. Pair with a ConfigProfile schema — a YAML file enumerating model, temperature (newly exposed knob), strategist_interval, detection_imgsz, loop_delay, etc. The autoresearch orchestrator already has YAML scaffolding to extend.
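Option 1 can be sketched with the standard library alone. The ConfigProfile fields follow the knobs listed above, but the AGENT_* variable names and the spawn_agent() helper are hypothetical; the real names are whatever Config.from_env() already reads.

```python
import os
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigProfile:
    name: str
    model: str
    temperature: float
    strategist_interval: int
    loop_delay: float

    def to_env(self) -> dict[str, str]:
        # AGENT_* names are assumptions for illustration only.
        return {
            "AGENT_PROFILE": self.name,
            "AGENT_MODEL": self.model,
            "AGENT_TEMPERATURE": str(self.temperature),
            "AGENT_STRATEGIST_INTERVAL": str(self.strategist_interval),
            "AGENT_LOOP_DELAY": str(self.loop_delay),
        }


def spawn_agent(profile: ConfigProfile, argv: list[str]) -> subprocess.Popen:
    """Start one variant in its own process with its own environment."""
    env = {**os.environ, **profile.to_env()}  # the parent's environment is left untouched
    return subprocess.Popen(argv, env=env, stdout=subprocess.PIPE, text=True)
```

Usage would look like spawn_agent(profile, [sys.executable, "-m", "gameplay_agent.main"]), assuming the agent exposes a module entry point. Because each child reads its own environment at import time, the singleton config needs no changes.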

C. Event log = OpenTelemetry traces → DuckDB → Langfuse

Replace ad-hoc structlog text with an event-sourced log. Schema:

events(run_id, agent_id, t, kind, payload_json, ts)
  kind ∈ { 'turn_start', 'observation', 'llm_prompt', 'llm_response',
           'action', 'action_result', 'world_mutation', 'fork', 'metric' }

Backend: DuckDB or SQLite: queryable, no server. Mirror to Langfuse (self-hosted, MIT, OTel-native) for LLM-trace observability. This is the table-of-trajectories pattern from Reverb at a single-machine scale.

Forks and mutations are first-class event types: a fork is (parent_run_id, parent_t) → child_run_id; the world is reconstructed by replaying events up to parent_t, applying any mutation, then continuing.
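The schema and the fork rule can be sketched against SQLite from the standard library (the text allows either backend; DuckDB would need its own driver). This version copies the parent's events into the child run rather than referencing them, and takes the mutation as a JSON-serializable payload instead of a function. Both are simplifying assumptions, as are the helper names and the "arena" agent_id.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    run_id       TEXT    NOT NULL,
    agent_id     TEXT    NOT NULL,
    t            INTEGER NOT NULL,
    kind         TEXT    NOT NULL,
    payload_json TEXT    NOT NULL,
    ts           TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_events_run_t ON events (run_id, t);
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append(conn, run_id, agent_id, t, kind, payload) -> None:
    conn.execute(
        "INSERT INTO events (run_id, agent_id, t, kind, payload_json) VALUES (?, ?, ?, ?, ?)",
        (run_id, agent_id, t, kind, json.dumps(payload)),
    )


def fork(conn, run_id: str, t: int, mutation=None) -> str:
    """Create a child run that shares the parent's history up to and including t."""
    child = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO events (run_id, agent_id, t, kind, payload_json, ts) "
        "SELECT ?, agent_id, t, kind, payload_json, ts FROM events "
        "WHERE run_id = ? AND t <= ? ORDER BY t, rowid",
        (child, run_id, t),
    )
    # Record lineage, then the optional perturbation, as ordinary events.
    append(conn, child, "arena", t, "fork", {"parent_run_id": run_id, "parent_t": t})
    if mutation is not None:
        append(conn, child, "arena", t, "world_mutation", mutation)
    conn.commit()
    return child
```

Replaying the child's rows in insertion order then yields the parent's world at parent_t with the mutation applied, after which the child's own events accumulate under its run_id.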

D. Determinism protocol

Exposing controls already implicit in the architecture:

Accept that determinism is asymptotic — temp=0 flips 5–12% of decisions per arxiv 2408.04667. Plan for statistics, not exact replay.
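On the non-LLM side, one control that is fully in the project's hands is the unseeded random.uniform() call in the executor. A minimal sketch of the fix, assuming the executor can take an injected jitter source; JitterSource and its delay() bounds are hypothetical names and values.

```python
import random


class JitterSource:
    """Per-run random delays drawn from a private, seedable generator."""

    def __init__(self, seed=None):
        # A private Random instance keeps this stream independent of the
        # module-level generator that other code may consume or reseed.
        self._rng = random.Random(seed)

    def delay(self, low: float = 0.05, high: float = 0.25) -> float:
        return self._rng.uniform(low, high)
```

Two runs built with JitterSource(seed=42) see the same delay sequence, while seed=None keeps the current nondeterministic behavior for real-game runs.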

E. Ranking and racing — start simple, grow toward AlphaStar

Don’t build v3 before v1 is providing signal.
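The phased plan names Bradley-Terry as the first ranking model. A compact sketch of fitting it with the standard minorization-maximization update follows, assuming the race results have already been reduced to a pairwise win-count matrix; the function name and the input shape are illustrative choices.

```python
def bradley_terry(wins: list[list[int]], iters: int = 200, tol: float = 1e-9) -> list[float]:
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times variant i beat variant j.
    Returns strengths normalized to sum to 1. The estimate is only well
    behaved when every variant has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents actually played: games_ij / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p
```

Under this model the probability that variant i beats variant j is p[i] / (p[i] + p[j]). Confidence intervals, which the plan also calls for, could be layered on by refitting over bootstrap resamples of the match list.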

F. Web UI — replay-log-driven, three panels

Single biggest leverage point in the UI: the Generative Agents (Smallville) pattern of a static replay viewer reading from the event log. The UI doesn’t drive runs, it watches them. Buying that decoupling early avoids socket/lifecycle complexity.
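Since the viewer only reads the log, the server side reduces to turning stored rows into Server-Sent Events frames. The sketch below shows just that framing step with the standard library; wiring it into a FastAPI streaming response is left out, and the (event_id, kind, payload) row shape is an assumption.

```python
import json
from typing import Iterable, Iterator


def sse_frame(event_id: int, kind: str, payload: dict) -> str:
    """Format one event-log row as a Server-Sent Events frame."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    data = json.dumps(payload, separators=(",", ":"))
    return f"id: {event_id}\nevent: {kind}\ndata: {data}\n\n"


def stream_events(rows: Iterable[tuple[int, str, dict]], last_event_id: int = 0) -> Iterator[str]:
    """Yield frames for rows newer than the client's last seen id."""
    for event_id, kind, payload in rows:
        if event_id > last_event_id:
            yield sse_frame(event_id, kind, payload)
```

Carrying the event id in each frame lets a browser that reconnects send Last-Event-ID and resume where it left off, so the UI process can restart without touching the runs.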

G. Chaos mode for robustness testing

A mutate() library — destroy random units, fog map regions, swap civs, simulate API latency spikes, inject malformed detections. Run a baseline variant against a chaos schedule and rank by graceful-degradation metric (composite score under perturbation / composite score baseline). Plugs directly into the event log as world_mutation events.
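One way to structure such a library is a name-to-function registry plus a schedule keyed by turn. Everything here is a hypothetical sketch: the registry, the two sample mutations, the flat world dictionary, and the fixed 200-food drain are illustrative, not the project's API.

```python
import random
from typing import Callable

Mutation = Callable[[dict, random.Random], dict]
MUTATIONS: dict[str, Mutation] = {}


def mutation(name: str):
    """Decorator that registers a perturbation under a stable name."""
    def register(fn: Mutation) -> Mutation:
        MUTATIONS[name] = fn
        return fn
    return register


@mutation("destroy_random_units")
def destroy_random_units(world: dict, rng: random.Random) -> dict:
    out = dict(world)
    lost = rng.randint(1, max(1, out["villagers"] // 2))
    out["villagers"] = max(0, out["villagers"] - lost)
    return out


@mutation("drain_food")
def drain_food(world: dict, rng: random.Random) -> dict:
    out = dict(world)
    out["food"] = max(0, out["food"] - 200)
    return out


def apply_schedule(world: dict, schedule: dict[int, list[str]], t: int, seed: int) -> dict:
    """Apply every mutation scheduled for turn t, reproducibly for a given seed."""
    rng = random.Random(f"{seed}:{t}")  # string seeds are deterministic across runs
    for name in schedule.get(t, []):
        world = MUTATIONS[name](world, rng)
    return world


def graceful_degradation(score_under_chaos: float, score_baseline: float) -> float:
    """Composite score under perturbation divided by the baseline composite score."""
    if score_baseline <= 0:
        raise ValueError("baseline score must be positive")
    return score_under_chaos / score_baseline
```

Each applied name would also be written to the event log as a world_mutation event, so a chaos run stays replayable like any other.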

H. Infrastructure & reproducibility

The existing project deliberately runs with minimal infrastructure (pip + justfile + GitHub Actions; native execution split across a Windows VM and a macOS host). The synthetic arena introduces stateful third-party services (LLM-trace store, object storage for replay artifacts) that benefit from containerization without disturbing the existing real-game tier.

Architectural split:

Containerized services (docker-compose.yml, all images digest-pinned):

| Service | Image | Purpose | Volume |
|---|---|---|---|
| langfuse-web + langfuse-worker | langfuse/langfuse:3@sha256:… | LLM trace UI + OTel ingestion | (uses langfuse-db) |
| langfuse-db | postgres:17@sha256:… | Langfuse backing store | langfuse-pg-data |
| clickhouse | clickhouse/clickhouse-server:24@sha256:… | Langfuse analytics (required by v3+) | clickhouse-data |
| minio | minio/minio:RELEASE.YYYY-MM-DD@sha256:… | S3-compatible store for replays, screenshots, event-log snapshots | minio-data |
| otel-collector | otel/opentelemetry-collector-contrib:0.112@sha256:… | OpenTelemetry ingestion → Langfuse | (stateless) |

Single bridge network arena-net; only Langfuse UI exposed to host by default. Postgres, ClickHouse, MinIO unreachable from host except via the service network.

Native (uncontainerized) components:

Why DuckDB-as-file, not containerized Postgres, for the event log: the event log is OLAP-shaped (aggregate over millions of (run_id, t, kind) rows), single-writer, query-only for the UI. DuckDB outperforms Postgres for this access pattern, is a single file (trivial to back up, commit as fixture, ship as replay artifact), and avoids inter-process I/O. Postgres is already in the stack for Langfuse; reusing it would couple arena tail-latency to Langfuse container health for no benefit.

Python dependency reproducibility:

Image pinning:

Secrets & config:

Bring-up commands (justfile additions):

just arena-infra-up       # docker compose up -d (services only)
just arena-infra-down     # docker compose down (preserves volumes)
just arena-infra-nuke     # docker compose down -v (DATA LOSS; clean slate)
just arena-infra-logs     # tail logs from all services
just arena-infra-status   # docker compose ps + health-check summary
just arena-up             # arena-infra-up + native arena controller + web UI

CI integration:

Backup & data lifecycle:

Onboarding flow (new contributor):

git clone …
uv sync                    # installs Python deps from uv.lock
cp .env.example .env       # fill in ANTHROPIC_API_KEY
just arena-infra-up        # docker compose up, ~30s
just arena-smoke           # 50-turn synthetic run, exits when assertions pass

Five commands. No manual Postgres / ClickHouse / MinIO install.


Suggested phased build sequence

Ordered for early signal, no premature infrastructure.

| Phase | Capability | Concrete deliverable | Effort |
|---|---|---|---|
| 0 | Infra & reproducibility baseline (see §H) | Migrate to uv + commit uv.lock; docker-compose.yml with digest-pinned Langfuse + Postgres + ClickHouse + MinIO; .env.example; Renovate config; six new just arena-* targets | 1–2 days |
| 1 | Synthetic world projects to perception | SyntheticWorld.render() → DetectedEntity[] in evaluation/world_sim.py; mock_detect() consults it | 2–3 days |
| 2 | Agent runs against synthetic world | Wire game_loop to use SyntheticWorld in test mode; executor.py actions consumed by world.apply_action() | 3–5 days |
| 3 | Determinism knobs | Expose temperature, seed in Config; pin model snapshot; seed random.uniform() | 1 day |
| 4 | Event log | DuckDB schema + a thin structlog → events writer; replace text logs in test mode | 2–3 days |
| 5 | Fork primitive | fork(run_id, t, mutation_fn=None) → new run_id; replay events to t, apply mutation, branch | 3–5 days |
| 6 | Multi-process racing | Subprocess pool driven by ConfigProfile YAML; aggregate metrics; simple plots | 3–5 days |
| 7 | Web UI v1 | FastAPI + SSE + minimal React; three panels reading event log | 5–8 days |
| 8 | Bradley-Terry ranking | Pairwise outcome model; per-condition CIs | 2–3 days |
| 9 | Pause / resume / inject UI | Operator panel triggers mutate() and forks | 3–5 days |
| 10+ | League / Inspect AI / chaos schedule / Langfuse mirror | As needed | |
| Watch | Track openage maturity | Revisit as a potential SyntheticWorld backend if their simulation reaches alpha and exposes an out-of-process agent API. Until then, consume their nyan data files only (see §A). | Ongoing |

Phase 0 is a one-time infra/reproducibility setup that pays back from phase 4 onward. Phases 1–3 unblock everything else and are cheap. Phases 4–5 are the core of the proposal. Phases 6–9 are the user-visible features. Treat phase 10+ as opportunistic.


Future: deployment and competitive multi-agent research

The phased plan above is scoped to local development of the synthetic arena. The longer-term direction the local environment is meant to enable:

Goal: improve agent configs, system prompts, and settings without running the actual game, by racing competitive agent populations in deterministic synthetic environments and ranking the winners.

Design intent:

  1. Controlled rounds, varied across rounds. Within a research round, every competing agent runs against the same SyntheticWorld seed and trajectory; variation lives only in the ConfigProfile (prompt template, model, temperature, strategist cadence, etc.). Across rounds, environments differ — so a config that wins reflects config quality, not lucky environment match. This is a 2-axis experimental design (configs × environments) and the natural progression of the league/PFSP pattern from §E. It also enables statistical decomposition of “due to the config” vs “due to the environment” vs “due to interaction” (e.g., hierarchical Bradley-Terry or two-way ANOVA over the round results).

  2. Continuity with autoresearch/. The existing autoresearch/orchestrator.py already runs sequential games and collects metrics; the future arena replaces its execution backend (synthetic instead of real-game), its concurrency model (parallel instead of sequential), and its evaluation (ranked tournament instead of standalone). The conceptual layer — “run experiments, mutate configs, learn what works” — stays the same. Treat the synthetic arena as the next backend for autoresearch, not as a replacement.

  3. Hosted, not laptop-bound. Phase 0’s local docker-compose stack is the substrate. The eventual target is a cloud or dedicated-server deployment where rounds run unattended at much higher trial counts. Same compose file, different .env (cloud-managed Postgres, S3-backed event log, etc.) — the architectural shape is identical; only where the services run changes. This is why phase 0 invests in digest-pinning and uv.lock now: lifting the local stack to a remote host should cost ~zero infra surprises.

Explicitly out of scope until the local arena is built: cloud deployment, secrets management beyond .env, multi-host orchestration, automated prompt mutation, ranked-round scheduling. Designing the cloud or autoresearch-integrated version before the local substrate exists is premature — every design decision there depends on what shape the local primitives end up taking.

When this section converts from future plan to next iteration: once the local arena is providing usable ranking signal (after phases 6–8 land). At that point the cost/benefit of remote rounds becomes concrete, and this section gets promoted into a fresh phased plan of its own.


Risks and tradeoffs

  1. Sim-to-real gap. A synthetic perception layer is intentionally lower fidelity than AoE2.exe. Risk: prompt/strategy variants that win in the synth lose in reality. Mitigation: keep the existing real-game test path; run final candidates against AoE2 before declaring victory. The two-tier eval pattern (fast synth + expensive real) is the AlphaStar/OpenAI Five default for a reason.

  2. Determinism is asymptotic. Even with all the knobs, expect 5–12% per-decision variance and ~20–40% tool-path variance (arxiv 2601.15322). Don’t promise exact replay; promise statistical replay over N trials. Build CIs into ranking from day one.

  3. Singleton config refactor. If variants share a process, phase 6 requires removing the import-time "from .config import config" reads scattered across gameplay_agent/. Easy to underestimate, so grep first. Process-per-agent (env var) sidesteps the deepest refactor; thread/asyncio doesn’t.

  4. YOLO detection vs synthetic projection drift. SyntheticWorld.render() must match the real detector’s output schema closely or the agent will perceive different worlds in synth vs real. Mitigation: write a contract test that runs real detection on a screenshot, then a synthetic render of a near-equivalent world, and asserts the schemas align. The existing tests/test_detector.py is the natural home.

  5. Event-log schema lock-in. Once the UI and ranking depend on the schema, changing it is painful. Mitigation: version events from day one (schema_version column); use Pydantic for payloads so migration is mechanical.

  6. Web UI scope creep. “Edit world parameters from the browser” is a feature surface that can grow indefinitely. Mitigation: ship a read-only replay viewer first (phase 7). Mutation UI (phase 9) only after the read-only view is in use.

  7. Benchmark exploitability. Per Berkeley RDI, agents will exploit any shortcut the eval permits (trustworthy benchmarks). Randomize map seeds, civ assignment, starting resources; never let the agent read state the real game wouldn’t expose.

  8. Container image drift. Digest-pinned images mean upstream security fixes don’t land automatically. Mitigation: Renovate scheduled weekly with auto-merge for patch/minor digest bumps; manual review for major-version bumps. The opt-in integration CI re-runs on every dependency PR, so a bad bump fails loudly before merge.
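The digest-pinning rule behind risk 8 is easy to enforce mechanically. A minimal sketch of such a guard follows, assuming image references appear on their own "image:" lines in the compose file; it is a line-based check, not a YAML parser, and the function name is hypothetical.

```python
import re

# Matches "image: <ref>" lines, tolerating optional quotes and trailing comments.
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?(?P<ref>[^'\"\s#]+)")
# A pinned reference ends in a full 64-hex-character sha256 digest.
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(compose_text: str) -> list[str]:
    """Return every image reference that is not pinned to a sha256 digest."""
    unpinned = []
    for line in compose_text.splitlines():
        match = IMAGE_LINE.match(line)
        if match and not DIGEST.search(match.group("ref")):
            unpinned.append(match.group("ref"))
    return unpinned
```

In CI this would read docker-compose.yml and fail the job when the returned list is non-empty, which keeps tag-only references from slipping in through review.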


Critical files to touch (forward reference for execution)

| Capability | File | Action |
|---|---|---|
| World projection | evaluation/world_sim.py | Add render() → DetectedEntity[], snapshot(), restore(), mutate() |
| Synthetic detection | detection/inference/mock.py | Accept optional world: SyntheticWorld; project state to detections |
| Agent action sink | gameplay_agent/executor.py | Pluggable backend: pyautogui (real) vs SyntheticWorld.apply_action() (test); seed random.uniform() |
| Determinism knobs | gameplay_agent/config.py, gameplay_agent/providers/claude.py | Expose temperature, seed; pin model snapshot |
| Event log | new evaluation/event_log.py (DuckDB) | Schema, writer, replay |
| Fork | new evaluation/fork.py | fork(run_id, t, mutation_fn) |
| Arena controller | new arena/controller.py, arena/config_profile.py | Subprocess pool + profile loader |
| Web UI | new arena/web/ (FastAPI) + arena/web/ui/ (React) | SSE + three panels |
| Ranking | new arena/ranking.py | Bradley-Terry over event log |
| Infra orchestration | new docker-compose.yml, docker-compose.ci.yml | Digest-pinned services (Langfuse, Postgres, ClickHouse, MinIO, OTel collector) |
| Python lock | pyproject.toml, new uv.lock | Migrate to uv; commit lockfile; CI verifies with uv lock --locked |
| Env contract | new .env.example | All required vars documented (ANTHROPIC_API_KEY, LANGFUSE_SECRET, MINIO_ROOT_PASSWORD, …) |
| Renovate | new .github/renovate.json | Auto-PR for Docker digest + Python lock bumps |
| Integration CI | new .github/workflows/arena-integration.yml | Opt-in or nightly; brings up compose stack with tmpfs volumes |
| Existing reuse | evaluation/runner.py, evaluation/assertions.py, autoresearch/orchestrator.py, gameplay_agent/{detection,strategist,turn}_phase.py | No structural change; consumed unchanged |

Verification (when implementation begins)

End-to-end smoke for the synthetic arena:

  1. just arena-infra-up && just arena-infra-status — all services healthy within 60s; Langfuse UI reachable at http://localhost:3000; MinIO console at http://localhost:9001. CI: uv lock --locked exits 0; grep guard finds no unpinned digests.
  2. just synth-arena-smoke — runs two agent variants (same prompt, different temperature) against the same SyntheticWorld seed for 100 turns, writes events to DuckDB, prints metric deltas.
  3. just fork-test — runs agent A to t=50, forks two children with different loop_delay, asserts both children produce events with parent_run_id == A, asserts child snapshots match A at t=50.
  4. just mutate-test — runs agent, pauses at t=30, applies mutate({food: -200}), resumes, asserts the next observation event reflects the mutation.
  5. just web-smoke — starts FastAPI + UI, drives a 50-turn run, opens browser, asserts SSE stream produces ≥1 event per turn and the timeline panel renders.
  6. pytest evaluation/ — existing scenario regression tests should pass unchanged (the synthetic arena is additive, not a replacement).
  7. Sim-to-real check: run the same variant against a real AoE2 instance for 1 game; compare turn-1 DetectedEntity[] from real detector vs synthetic render of an equivalent Dark Age state; assert schema match.

Implementation runbook

Step-by-step pickup tasks for executing this design. Each entry references the relevant design section(s) and the “Critical files to touch” table; it does not restate design content. Tasks within a phase are typically dependent on the prior task; phases unblock in the order shown in Suggested phased build sequence.

Phase 0 — Infra & reproducibility baseline

References: §H. Phased table row 0.

0.1 — Migrate Python deps to uv with committed lockfile

0.2 — docker-compose for stateful services

0.3 — Image-pin enforcement & Renovate

Phase 1 — Synthetic world projects to perception

References: §A. Phased table row 1.

1.1 — SyntheticWorld.render() → list[DetectedEntity]

Phase 2 — Agent runs against synthetic world

References: §A, §B. Phased table row 2.

2.1 — Pluggable executor sink

2.2 — Wire game_loop to the synth path under test mode

Phase 3 — Determinism knobs

References: §D. Phased table row 3.

3.1 — Expose temperature and seed in Config

3.2 — Seed random.uniform() in executor

Phase 4 — Event log

References: §C. Phased table row 4.

4.1 — DuckDB event log schema + structlog adapter

Phase 5 — Fork primitive

References: §C (“Fork primitive”). Phased table row 5.

5.1 — fork(run_id, t, mutation_fn=None) → new_run_id

Phase 6 — Multi-process racing

References: §B, §E. Phased table row 6.

6.1 — ConfigProfile YAML schema

6.2 — Subprocess-pool arena controller

Phase 7 — Web UI v1

References: §F. Phased table row 7.

7.1 — FastAPI + SSE backend

7.2 — React frontend skeleton (Vite)

7.3 — Three-panel layout (world / trace / diff)

Phase 8 — Bradley-Terry ranking

References: §E. Phased table row 8.

8.1 — Bradley-Terry pairwise ranking over event log

Phase 9 — Pause / resume / inject UI

References: §F (operator panel), §G. Phased table row 9.

9.1 — Operator mutation panel

Phase 10+ — Opportunistic

Per the phased table: League / Inspect AI / chaos schedule / Langfuse mirror. Spawn runbook entries here only when the local arena (phases 0–8) is in active use.


Sources

Forking: OpenSpiel docs, OpenSpiel serialize, Gymnasium, PettingZoo, RLlib checkpointing.

Tournament/league: AlphaStar blog, AlphaStar Nature PDF, PBT blog, PBT paper, Chatbot Arena, Inspect AI, AgentBench, Berkeley RDI broken benchmarks.

Pause/inject: Reverb paper, Reverb github, DRL adversarial survey, LLM-agent robustness, Chaos for AI.

Stochasticity: Non-Determinism of Deterministic LLM Settings, Numerical Sources of Nondeterminism, Defeating Nondeterminism (Thinking Machines), Replayable Financial Agents.

Web UIs: Langfuse, LLM observability comparison, AlphaStar visualizer resources, PySC2.

Prior work: Voyager, Generative Agents, OpenAI Five, SIMA 2, Cicero, TALES, PyAge2, aoe2-ai-module.