AoE2 · LLM Arena

ADR 0004 — Bradley–Terry ranking over simple win-rate

Status: Accepted (2026-05). Shipped as Phase 8. Context: Chapter 17 — Ranking Pipeline.

Decision

For multi-profile evaluation, fit a Bradley–Terry MLE on the pairwise win matrix via iterative Minorization-Maximization. Report 95% percentile-bootstrap confidence intervals. Use a symmetric +0.5 smoothing prior to keep the MLE bounded when one profile is undefeated.
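A minimal sketch of the fit described above, not the code in ranking.py. The function names, the game-level resampling scheme, and the reading of "symmetric +0.5" as half a virtual win in each direction of every pair are all assumptions made for illustration.

```python
import numpy as np

def fit_bradley_terry(wins, prior=0.5, max_iter=1000, tol=1e-10):
    """Fit Bradley-Terry strengths by MM. wins[i, j] = games i won against j."""
    w = np.asarray(wins, dtype=float).copy()
    n = w.shape[0]
    # Assumed smoothing: `prior` virtual wins in both directions of every pair,
    # so an undefeated profile still has a finite strength estimate.
    w += prior * (1.0 - np.eye(n))
    games = w + w.T              # games played between i and j (incl. virtual)
    total_wins = w.sum(axis=1)   # total wins of each profile
    p = np.ones(n) / n
    for _ in range(max_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()     # strengths are only defined up to scale
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

def bootstrap_ci(wins, n_boot=200, seed=0):
    """95% percentile-bootstrap interval per profile, resampling single games."""
    wins = np.asarray(wins, dtype=int)
    n = wins.shape[0]
    rng = np.random.default_rng(seed)
    winners, losers = np.nonzero(wins)
    counts = wins[winners, losers]
    w_idx = np.repeat(winners, counts)   # one entry per game played
    l_idx = np.repeat(losers, counts)
    samples = np.empty((n_boot, n))
    for b in range(n_boot):
        pick = rng.integers(0, len(w_idx), size=len(w_idx))
        resampled = np.zeros((n, n))
        np.add.at(resampled, (w_idx[pick], l_idx[pick]), 1)
        samples[b] = fit_bradley_terry(resampled)
    lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
    return lo, hi

# Made-up win matrix: profile 0 is undefeated against both opponents.
wins = [[0, 5, 5],
        [0, 0, 3],
        [0, 2, 0]]
strengths = fit_bradley_terry(wins)
ci_low, ci_high = bootstrap_ci(wins, n_boot=100)
```

Without the prior, the undefeated profile's strength would grow without bound relative to the others. With it, the MM iteration converges to a finite estimate and the bootstrap replicates stay well-defined even when a resample drops every loss for some profile.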

What we considered

| Option | Pros | Cons |
| --- | --- | --- |
| Simple win-rate (wins / total) | Trivial to compute. | Doesn’t account for matchup structure when not all profiles meet equally; no statistical guarantee. |
| Elo iterative update | Familiar from chess. | Sensitive to scheduling; no natural confidence intervals; tunable K-factor adds another knob. |
| Bradley–Terry MLE | Simple iterative fit (MM); natural pairwise interpretation; CI via bootstrap. | Needs smoothing to handle undefeated profiles; requires numpy. |
| TrueSkill / Glicko | Per-player rating with uncertainty built in. | Heavy machinery for our small-N case; extra deps. |

Why Bradley–Terry

What we explicitly traded away

Why the lexicographic scoring (not a weighted sum)

composite_score = (age_index, population, food + wood) (ranking.py:80). The tuple is compared lexicographically, not collapsed into a weighted sum.

The rationale is the same one that drove the autoresearch composite score: reaching a higher age dominates any amount of lower-age economy. A weighted sum would let an agent “cheat” by stockpiling food in Dark Age to outrank a Feudal-Age opponent. Lexicographic ordering encodes the actual game-strategic intuition: you cannot trade age progress for resources.
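The difference can be seen in a small sketch. The `Snapshot` container, the weights in `weighted_score`, and the two example end states are hypothetical; only the tuple layout of `composite_score` comes from the text above.

```python
from typing import NamedTuple

class Snapshot(NamedTuple):
    """Hypothetical end-of-game state for one agent."""
    age_index: int   # 0 = Dark Age, 1 = Feudal Age, ...
    population: int
    food: int
    wood: int

def composite_score(s: Snapshot) -> tuple:
    # Python compares tuples element by element, i.e. lexicographically.
    return (s.age_index, s.population, s.food + s.wood)

def weighted_score(s: Snapshot, w_age=1000.0, w_pop=10.0, w_res=1.0) -> float:
    # Illustrative weights only; any finite weights can be out-stockpiled.
    return w_age * s.age_index + w_pop * s.population + w_res * (s.food + s.wood)

dark_age_hoarder = Snapshot(age_index=0, population=30, food=4000, wood=3000)
feudal_agent = Snapshot(age_index=1, population=25, food=200, wood=150)

lexicographic_prefers_feudal = (
    composite_score(feudal_agent) > composite_score(dark_age_hoarder)
)
weighted_prefers_hoarder = (
    weighted_score(dark_age_hoarder) > weighted_score(feudal_agent)
)
```

With these made-up numbers the weighted sum ranks the Dark Age hoarder first (7300 vs 1600), while the tuple comparison ranks the Feudal Age agent first because `age_index` is decided before population or resources are ever consulted.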

The score function is passed in (ranking.py:307, score_fn parameter) so research with different goals can swap it without changing the harness.
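As a rough illustration of what an injectable score function buys, the sketch below decides a match from two end states. `pick_winner`, the dict-based state, `military_score`, and the `military_count` field are hypothetical and are not taken from ranking.py; only the `score_fn` parameter name and the default tuple come from the text.

```python
from typing import Any, Callable, Mapping

State = Mapping[str, int]
ScoreFn = Callable[[State], Any]  # must return something orderable

def composite_score(state: State) -> tuple:
    # Default ordering from this ADR: age first, then population, then resources.
    return (state["age_index"], state["population"], state["food"] + state["wood"])

def military_score(state: State) -> tuple:
    # Hypothetical alternative for research that cares about army size.
    return (state["military_count"], state["population"])

def pick_winner(a: State, b: State, score_fn: ScoreFn = composite_score) -> str:
    """Return 'a', 'b', or 'draw'; the harness logic never inspects the fields."""
    score_a, score_b = score_fn(a), score_fn(b)
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "draw"

state_a = {"age_index": 2, "population": 40, "food": 300, "wood": 200, "military_count": 5}
state_b = {"age_index": 1, "population": 55, "food": 900, "wood": 700, "military_count": 20}

default_result = pick_winner(state_a, state_b)
military_result = pick_winner(state_a, state_b, score_fn=military_score)
```

The same pair of end states yields a different winner under each score function, while `pick_winner` itself stays unchanged.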

Consequences

Positive

Negative