Chapter 2: Game Loop Pipeline
The game loop is the heartbeat of the agent. On each iteration it captures a screenshot, detects entities, checks for threats, optionally runs the strategist, builds text context, asks the executor for actions, and executes them. A routine iteration takes roughly 2-4 seconds (see 2.3).
2.1 The Iteration Cycle
The main loop is implemented in game_loop() with phase logic decomposed into named functions.
Step 1: Check game is running
Calls is_game_running() which searches for a window titled "Age of Empires II: Definitive Edition" via pygetwindow. If the window is gone, the loop exits.
Step 2: Ensure focus
Calls ensure_game_focused(). If focus fails, the iteration is skipped with continue and a 1-second sleep.
Step 3: Capture screenshot — _capture_screenshot()
Uses the mss library to grab the game window region, convert from BGRA to RGB via PIL, and encode as JPEG. Returns (bytes, width, height). Optionally saves screenshots to disk when config.save_screenshots is enabled.
Step 4: Run entity detection — _run_detection()
Entity detection runs a single forward pass at imgsz=640 — the resolution the v6 model was trained at (config.detection_imgsz = 640, config.adaptive_sahi = False). On real screenshots this beats both higher-resolution and SAHI-tiled inference: tiling a 3024px frame into 640 crops makes objects ~2.4× larger than training scale and lowers real F1 (≈0.42 single-pass @640 vs ≈0.04 with full SAHI). SAHI is implemented but off; see Chapter 7 §7.4 for the measurement.
When config.adaptive_sahi is True (it isn’t, by default), the loop instead runs adaptive SAHI and forces full SAHI on the first iteration, every full_sahi_interval turns, and after an alarm. This is the parked path for a future SAHI-native model.
Results are cached in the executor module via set_detected_entities() for later target_id/target_class resolution. Entity IDs persist across frames via the Kalman filter tracker. Detection failures are caught and logged without breaking the loop.
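The mode selection described in this step can be sketched as a small pure function. This is an illustration only: the DetectionConfig class, the 0-based iteration counter, and the returned mode names are assumptions, while the field names and defaults follow the config keys quoted in this chapter.

```python
from dataclasses import dataclass


@dataclass
class DetectionConfig:
    # Field names mirror the config keys named in the text; the class itself is hypothetical.
    detection_imgsz: int = 640
    adaptive_sahi: bool = False
    full_sahi_interval: int = 5


def choose_detection_mode(cfg: DetectionConfig, iteration: int, alarm: bool) -> str:
    """Return 'single_pass', 'full_sahi', or 'adaptive_sahi' for this iteration.

    `iteration` is assumed to be 0-based, so iteration 0 is the first frame.
    """
    if not cfg.adaptive_sahi:
        # Deployed path: one forward pass at cfg.detection_imgsz.
        return "single_pass"
    if iteration == 0 or alarm or iteration % cfg.full_sahi_interval == 0:
        # Parked path: force exhaustive tiling on the first frame,
        # every full_sahi_interval turns, and after an alarm.
        return "full_sahi"
    return "adaptive_sahi"


if __name__ == "__main__":
    print(choose_detection_mode(DetectionConfig(), iteration=0, alarm=True))
    print(choose_detection_mode(DetectionConfig(adaptive_sahi=True), iteration=3, alarm=False))
```

With the default configuration every call returns the single-pass mode, which matches the deployed behavior described above.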
Step 5: Classify ownership — _classify_entities()
For military entities, a color-based classifier checks blue pixel dominance in the health bar and unit body regions. In AoE2:DE, Player 1 is always blue. Entities are tagged [own] or [enemy] in the text context sent to the executor.
Entity formatting uses build_entity_summary() from entity_utils.py, which normalizes both DetectedEntity objects and plain dicts via extract_attrs().
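A minimal sketch of a blue-dominance check, to make the idea in this step concrete. It works on plain lists of RGB tuples rather than NumPy arrays, and the margin and threshold values, the function names, and the rule of taking the stronger of the two regions are all assumptions, not the project's tuned classifier.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def blue_fraction(pixels: Iterable[Pixel], margin: int = 40) -> float:
    """Fraction of pixels whose blue channel exceeds both red and green by `margin`."""
    total = 0
    blue = 0
    for r, g, b in pixels:
        total += 1
        if b - max(r, g) >= margin:
            blue += 1
    return blue / total if total else 0.0


def classify_ownership(health_bar: Iterable[Pixel], body: Iterable[Pixel],
                       threshold: float = 0.3) -> str:
    """Tag 'own' when either region is blue-dominant, otherwise 'enemy'.

    Assumes Player 1 is blue, as stated in the text; margin and threshold are illustrative.
    """
    score = max(blue_fraction(health_bar), blue_fraction(body))
    return "own" if score >= threshold else "enemy"


if __name__ == "__main__":
    blue_bar = [(20, 30, 200)] * 8 + [(0, 0, 0)] * 2
    red_body = [(200, 30, 20)] * 10
    print(classify_ownership(blue_bar, red_body))  # blue health bar dominates
    print(classify_ownership(red_body, red_body))
```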
Step 6: Alarm check
Scans detected entities for 21 enemy military classes (militia_line, archer_line, knight_line, etc.). Uses ownership classification to filter out own units. If enemy threats are found, injects a priority-10 “Defend base” goal and triggers an early strategist run.
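The alarm check can be sketched as a filter over the detected entities. The Entity and Goal dataclasses and the check_alarm name are hypothetical, the class set is a three-item subset of the 21 classes mentioned above, and treating unclassified entities as threats is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Subset for illustration; the text says the real list has 21 classes.
ENEMY_MILITARY_CLASSES = {"militia_line", "archer_line", "knight_line"}


@dataclass
class Entity:
    entity_id: str
    cls: str
    ownership: Optional[str] = None  # "own", "enemy", or None if unclassified


@dataclass
class Goal:
    description: str
    priority: int


def check_alarm(entities: List[Entity]) -> Tuple[Optional[Goal], List[Entity]]:
    """Return (defend goal, threats) when enemy military is visible, else (None, [])."""
    threats = [
        e for e in entities
        # Own units are filtered out; unclassified units count as threats (assumption).
        if e.cls in ENEMY_MILITARY_CLASSES and e.ownership != "own"
    ]
    if not threats:
        return None, []
    return Goal("Defend base", priority=10), threats


if __name__ == "__main__":
    seen = [
        Entity("villager_1", "villager", "own"),
        Entity("knight_0", "knight_line", "enemy"),
        Entity("archer_2", "archer_line", "own"),
    ]
    goal, threats = check_alarm(seen)
    print(goal, [t.entity_id for t in threats])
```

A non-None goal is also the signal the loop would use to trigger the early strategist run.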
Step 7: Launch strategist — _maybe_launch_strategist()
The strategist (Sonnet) runs every N turns (default 10), on the first successful iteration, or when an alarm is triggered. It is launched asynchronously via asyncio.create_task() so it runs in the background while the executor continues. If a previous strategist task is still pending, it is reused rather than launching a new one.
The strategist:
- Reads resource values, population, and age from the resource bar locally via OCR (resource_ocr.read_resource_bar, RapidOCR). The screenshot is passed in for this, but no image is sent to Claude. Field geometry is auto-detected per frame (autodetect_calibration), with an optional calibration.<W>x<H>.yaml override.
- Sends those readings to Sonnet in a text-only prompt
- Creates 3-5 prioritized goals
- Returns resource readings that are cached for the executor
The strategist uses messages.parse() with a StrategistResponse Pydantic model for structured output. In the cleanup phase, any pending strategist task is awaited to ensure goals are finalized.
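The launch-or-reuse behavior can be sketched with plain asyncio. StrategistRunner and its method names are hypothetical, the strategist call is replaced by any coroutine function returning a list of goals, and "first successful iteration" is approximated by "no launch has happened yet".

```python
import asyncio
from typing import Any, Callable, Coroutine, List, Optional


class StrategistRunner:
    """Launches a background strategist call and reuses it while it is pending."""

    def __init__(self, run_strategist: Callable[[], Coroutine[Any, Any, List[str]]],
                 interval: int = 10) -> None:
        self._run = run_strategist
        self._interval = interval
        self._task: Optional["asyncio.Task[List[str]]"] = None
        self.launches = 0

    def maybe_launch(self, turn: int, alarm: bool) -> Optional["asyncio.Task[List[str]]"]:
        if self._task is not None and not self._task.done():
            return self._task                      # still running: reuse it
        first_run = self.launches == 0
        if first_run or alarm or turn % self._interval == 0:
            self._task = asyncio.create_task(self._run())
            self.launches += 1
            return self._task
        return None                                # not due this turn

    async def finalize(self) -> Optional[List[str]]:
        """Cleanup phase: wait for the last strategist task so its goals are not lost."""
        if self._task is None:
            return None
        return await self._task


async def _demo() -> None:
    async def fake_strategist() -> List[str]:
        await asyncio.sleep(0.01)                  # stands in for the Sonnet call
        return ["Queue villagers"]

    runner = StrategistRunner(fake_strategist)
    first = runner.maybe_launch(turn=1, alarm=False)
    second = runner.maybe_launch(turn=2, alarm=True)   # pending task is reused
    print(first is second, await runner.finalize())


if __name__ == "__main__":
    asyncio.run(_demo())
```

Because maybe_launch never awaits, the executor path keeps running while the task completes in the background.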
Step 8: Build context — _build_llm_context()
Assembles text context from multiple sources, layered in this order:
- Detected entities: YOLO results formatted as text, for example sheep_0: sheep at (456,789) [95%]
- Active goals: from goal manager, sorted by priority, for example [HIGH] Queue villagers: 4/10 (40%)
- Resource readings: cached from strategist, for example Food: 250, Wood: 180, Gold: 50, Stone: 100
- Game state: from memory (population, age, under_attack flags)
- Recent decisions: last 3 turns with action feedback
- Dynamic game knowledge: affordable units/buildings based on current resources (optional)
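The layering order above can be sketched as a function that joins pre-formatted sections. The function signature and the section titles are assumptions made for illustration; only the ordering, the three-turn window, and the optional knowledge layer come from the text.

```python
from typing import Dict, List, Optional


def build_llm_context(entities: List[str], goals: List[str],
                      resources: Optional[Dict[str, int]],
                      game_state: Dict[str, object],
                      recent_decisions: List[str],
                      knowledge: Optional[str] = None) -> str:
    """Join the context layers in the documented order, skipping empty ones."""
    sections: List[str] = []

    def add(title: str, lines: List[str]) -> None:
        if lines:
            sections.append(title + ":\n" + "\n".join(lines))

    add("Detected entities", entities)
    add("Active goals", goals)                       # caller passes them sorted by priority
    if resources:
        add("Resources", [", ".join(f"{k}: {v}" for k, v in resources.items())])
    add("Game state", [f"{k}: {v}" for k, v in game_state.items()])
    add("Recent decisions", recent_decisions[-3:])   # last 3 turns only
    if knowledge:
        add("Game knowledge", [knowledge])           # optional layer
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_llm_context(
        ["sheep_0: sheep at (456,789) [95%]"],
        ["[HIGH] Queue villagers: 4/10 (40%)"],
        {"Food": 250, "Wood": 180, "Gold": 50, "Stone": 100},
        {"population": 4, "age": "Dark", "under_attack": False},
        ["turn 7: queued villager (ok)"],
    ))
```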
Step 9: Get actions from executor (pipelined)
Routine turns are pipelined RTC-style (request-to-completion overlap): the executor call for this turn is launched as a background task via asyncio.create_task(provider.get_actions(...)), and while it computes the agent does useful work in that window:
- Ground commands (turn 1 only) — zoom in, select scout, enable auto-scout
- Reactive tier (reactive.decide): deterministic, no-LLM upkeep run every turn. It queues a villager when population is below the age cap, and sends an idle villager to the nearest resource. It returns nothing on alarm, ceding combat to the LLM.
- Previous turn’s committed head: the plan launched last turn is drained and executed now (Step 11), against freshly re-detected entities.
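The reactive tier's two rules are simple enough to sketch directly. The function below is an illustration, not the real reactive.decide: its signature, the (entity_id, x, y) tuples, and the action dictionaries are all hypothetical.

```python
from math import hypot
from typing import Dict, List, Tuple

Position = Tuple[str, float, float]  # (entity_id, x, y) in screenshot coordinates


def reactive_decide(population: int, age_cap: int,
                    idle_villagers: List[Position],
                    resources: List[Position],
                    alarm: bool) -> List[Dict[str, str]]:
    """Deterministic upkeep: no LLM involved, and nothing at all during an alarm."""
    if alarm:
        return []                                   # cede combat turns to the LLM
    actions: List[Dict[str, str]] = []
    if population < age_cap:
        actions.append({"type": "queue_villager"})
    if idle_villagers and resources:
        vid, vx, vy = idle_villagers[0]             # one idle villager per turn
        target = min(resources, key=lambda r: hypot(r[1] - vx, r[2] - vy))
        actions.append({"type": "send_villager", "villager_id": vid, "target_id": target[0]})
    return actions


if __name__ == "__main__":
    print(reactive_decide(
        population=4, age_cap=10,
        idle_villagers=[("villager_3", 100, 100)],
        resources=[("sheep_0", 456, 789), ("tree_2", 120, 90)],
        alarm=False,
    ))
```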
The freshly-launched plan is held as a _PendingPlan and executed on the next iteration. Whether a turn pipelines is decided by _should_pipeline (= provider._use_single_shot(context)): combat/housing turns can’t pipeline because the tool loop executes its own actions mid-call, so they run synchronously — any pending routine plan is discarded (its frame is stale) and the turn’s actions execute the same turn.
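The overlap between this turn's executor call and last turn's plan can be sketched with asyncio tasks. PendingPlan and run_turn are hypothetical stand-ins for _PendingPlan and the loop body, actions are plain strings, revalidation is omitted, and cancelling a discarded plan's task is an assumption about what "discarded" means.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable, Coroutine, List, Optional


@dataclass
class PendingPlan:
    turn: int
    task: "asyncio.Task[List[str]]"


async def run_turn(turn: int, pending: Optional[PendingPlan],
                   get_actions: Callable[[int], Coroutine[Any, Any, List[str]]],
                   should_pipeline: bool,
                   execute: Callable[[List[str]], None],
                   commit_max: int = 2) -> Optional[PendingPlan]:
    """Run one turn and return the plan to carry into the next turn, if any."""
    if not should_pipeline:
        if pending is not None:
            pending.task.cancel()                  # stale frame: drop the routine plan
        execute(await get_actions(turn))           # synchronous path executes this turn
        return None
    # Launch this turn's call first so it computes while we do other work.
    new_plan = PendingPlan(turn, asyncio.create_task(get_actions(turn)))
    if pending is not None:
        plan = await pending.task                  # drain last turn's plan
        execute(plan[:commit_max])                 # only the committed head runs
    return new_plan


async def _demo() -> List[str]:
    log: List[str] = []

    async def get_actions(turn: int) -> List[str]:
        await asyncio.sleep(0.01)                  # stands in for the API roundtrip
        return [f"t{turn}-a", f"t{turn}-b", f"t{turn}-c"]

    pending = None
    for turn, pipelined in [(1, True), (2, True), (3, False)]:
        pending = await run_turn(turn, pending, get_actions, pipelined, log.extend)
    return log


if __name__ == "__main__":
    print(asyncio.run(_demo()))
```

In the demo, turn 2 executes the first two actions planned on turn 1, and the combat-style turn 3 discards turn 2's plan and runs its own actions immediately.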
The executor is 100% text-based — no screenshot. get_actions() picks one of two paths per turn (_use_single_shot): routine turns take a single-shot structured call (_call_single_shot — one roundtrip; the returned actions are executed by the game loop), while combat/housing turns take an agentic tool loop (_call_api) where Claude calls tools one at a time (up to max_tool_iterations = 7), each executed locally via execute_action() with the result fed back. Composite tools (build, send_villager, queue_villager) execute multi-step sequences within a single tool call, eliminating intermediate API roundtrips.
Step 10: Update memory and goals — _process_response()
Creates a Turn record, updates GameState from the executor’s observations. Evaluates goal progress against the updated state. Computes a turn reward based on resource deltas, population changes, and age progression. Checks for game-over conditions (victory, defeat, timeout).
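To make the reward description concrete, here is one way a turn reward could combine the three signals named above. The Snapshot dataclass and all three weights are assumptions chosen for illustration; the source does not state the actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    food: int
    wood: int
    gold: int
    stone: int
    population: int
    age: int  # assumed encoding: 0 = Dark Age, 1 = Feudal Age, and so on


def turn_reward(prev: Snapshot, curr: Snapshot,
                w_resource: float = 0.01, w_pop: float = 1.0, w_age: float = 50.0) -> float:
    """Weighted sum of resource deltas, population change, and age progression."""
    resource_delta = (
        (curr.food - prev.food) + (curr.wood - prev.wood)
        + (curr.gold - prev.gold) + (curr.stone - prev.stone)
    )
    return (
        w_resource * resource_delta
        + w_pop * (curr.population - prev.population)
        + w_age * (curr.age - prev.age)
    )


if __name__ == "__main__":
    before = Snapshot(food=200, wood=100, gold=0, stone=0, population=5, age=0)
    after = Snapshot(food=250, wood=180, gold=50, stone=100, population=6, age=0)
    print(round(turn_reward(before, after), 2))
```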
Step 11: Execute actions — _execute_turn_actions()
If the agentic tool loop already executed actions (indicated by the actions_already_executed flag), this step just records the results. Otherwise, it executes LLM actions:
- For a pipelined turn, only the committed head runs: the first pipeline_commit_max (2) actions, after _revalidate_against_fresh drops any whose target_id/target_class no longer resolves against the current frame. The tail is discarded because next turn’s plan supersedes it from fresher perception.
- Resolves target_id or target_class to coordinates from cached entity positions
- Translates coordinates from screenshot-relative to screen-absolute
- Executes via pyautogui with action_delay (50ms) between actions
- Tracks success/failure via ActionResult; failed actions are recorded in memory as feedback for the next turn
- Verifies effects (R1). After entity-affecting actions (a build/placement, or a camera move), it re-detects and records a verification line: "CONFIRMED built: <class>" on success, or the exact phrase "no visible change" on a miss. That line feeds the stuck-loop detector in memory.get_context_for_llm, so repeated no-ops escalate to a warning the LLM sees. Routine economy turns with no entity expectation skip the extra rescan.
- If no actions were returned, logs no_actions_fallback (the reactive tier already handled routine upkeep this turn).
- On rescan: true, runs the rescan pipeline:
  - Tracker prediction check: if tracker confidence > 80%, extrapolate positions via Kalman predict (~0ms, no screenshot or inference needed)
  - Screenshot capture: if prediction not used
  - Frame differencing: compare to previous frame; skip detection if MAD < 3%
  - Fast detection: single-pass detect_fast() at imgsz=640 (the same mode as the main detection step)
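The committed-head rule, target resolution, and coordinate translation from the list above can be sketched together. All function names, the dictionary shapes for actions and entities, and the choice to match by id only when a target_id is present are assumptions; the translation also ignores any display scaling factor.

```python
from typing import Dict, List, Optional, Tuple

Action = Dict[str, object]
Entity = Dict[str, object]  # e.g. {"id": "sheep_0", "cls": "sheep", "x": 456, "y": 789}


def resolve_target(action: Action, entities: List[Entity]) -> Optional[Tuple[int, int]]:
    """Resolve target_id (preferred) or target_class to screenshot coordinates."""
    target_id = action.get("target_id")
    target_class = action.get("target_class")
    for e in entities:
        if target_id is not None:
            if e["id"] == target_id:
                return int(e["x"]), int(e["y"])
        elif target_class is not None and e["cls"] == target_class:
            return int(e["x"]), int(e["y"])       # first entity of that class
    return None


def committed_head(plan: List[Action], fresh_entities: List[Entity],
                   commit_max: int = 2) -> List[Action]:
    """Drop actions whose target no longer resolves, then keep the first commit_max."""
    valid = []
    for action in plan:
        targeted = "target_id" in action or "target_class" in action
        if targeted and resolve_target(action, fresh_entities) is None:
            continue                               # target vanished from the fresh frame
        valid.append(action)
    return valid[:commit_max]                      # the tail is discarded


def to_screen(x: int, y: int, window_left: int, window_top: int) -> Tuple[int, int]:
    """Translate screenshot-relative coordinates to screen-absolute ones."""
    return x + window_left, y + window_top


if __name__ == "__main__":
    fresh = [{"id": "sheep_0", "cls": "sheep", "x": 456, "y": 789}]
    plan = [
        {"type": "click", "target_id": "deer_4"},     # no longer visible
        {"type": "click", "target_class": "sheep"},
        {"type": "hotkey", "key": "h"},
        {"type": "click", "target_id": "sheep_0"},    # valid, but beyond the head
    ]
    head = committed_head(plan, fresh)
    print(head, to_screen(*resolve_target(head[0], fresh), 100, 50))
```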
Step 12: Wait
asyncio.sleep(config.loop_delay) — default 0.3 seconds.
Deep dive — Adaptive SAHI and ROI clustering (built, measured, parked)
This describes a path we built and then disabled. The agent ships single-pass @640, not SAHI. We measured all three modes on real held-out frames (evaluate_real.py): single-pass @640 ≈ 0.42 real F1, @1280 ≈ 0.21, full SAHI ≈ 0.04. SAHI loses because tiling a 3024px frame into 640 crops shows the model objects ~2.4× larger than its imgsz=640 training scale. The adaptive-SAHI machinery below stays in the codebase for a future model retrained at SAHI-native scale; it’s kept here as the design we’d reach for then, not what runs now.
The base problem (as it looked pre-v6). A game screenshot resized to YOLO’s 640×640 throws away resolution that small entities (sheep, scouts) depend on — if the model was trained on larger crops. The v6 fix turned out to be simpler than tiling: train and infer at the same 640, so the resize is exactly what the model expects.
SAHI: the classic fix. Slicing Aided Hyper Inference (Akyon et al., 2022) slices the input into overlapping tiles, runs the detector on each tile at full resolution, then merges the per-tile predictions back into image coordinates (with extra NMS on overlap zones). For a 1920×1080 screen with 640×640 tiles and 20% overlap (a 512px stride), that’s 8 tiles per frame: 4 columns × 2 rows, with the last tile in each direction clamped to the image edge. It raises recall on small objects when the tile scale matches training, which, for v6 at retina resolution, it doesn’t.
Adaptive SAHI — the parked two-pass scheme. Pays the tiling cost only where it matters:
- Fast scan. Run YOLO once at imgsz=1280 (in between the fast 640 pass and the slow tiled approach). Catches every medium-and-large object plus a noisy first guess at where the small stuff is. ~60 ms.
- ROI clustering. Take the bounding boxes of the small/uncertain detections, cluster them into a handful of regions of interest using a Union-Find (disjoint-set) data structure. Two boxes belong to the same ROI if their inflated bounding boxes intersect; Union-Find walks the box list once and assigns each to its cluster root in near-constant amortized time.
- Targeted SAHI. Run YOLO at full resolution only on those 3–8 ROI tiles. ~40–140 ms depending on ROI count.
- Merge + NMS. Combine fast-scan predictions outside the ROIs with the targeted-SAHI predictions inside them, then a final NMS pass.
Why Union-Find for clustering? We need: given N boxes, group any pair that overlap into the same cluster. The clusters are connected components, not centroidal blobs, so K-means doesn’t fit. DBSCAN works but is more code. Union-Find keeps the grouping itself cheap: each box starts in its own set, any time two boxes overlap you union() their sets, and each union or find costs O(α(N)) amortized, which is practically constant, in about 30 lines of code. Finding the overlapping pairs is a separate cost: the naive pairwise test is O(N²), which is fine at N ≤ 50 and would need a spatial index (a grid or sweep line) to scale beyond that. The chapter’s full SAHI fallback is what you’d use if you wanted exhaustive coverage of every tile regardless of content.
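A minimal sketch of ROI clustering with a disjoint-set structure follows. The names, the 16-pixel padding, and the choice to return one enclosing box per cluster are assumptions. Note that the overlap tests in this sketch are a plain pairwise loop, so only the grouping step benefits from Union-Find.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class UnionFind:
    def __init__(self, n: int) -> None:
        self.parent = list(range(n))   # each box starts in its own set
        self.size = [1] * n

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra                # attach the smaller tree to the larger
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def inflate(box: Box, pad: float) -> Box:
    return box[0] - pad, box[1] - pad, box[2] + pad, box[3] + pad


def intersects(a: Box, b: Box) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def cluster_rois(boxes: List[Box], pad: float = 16.0) -> List[Box]:
    """Group boxes whose inflated versions overlap; return one enclosing ROI per group."""
    inflated = [inflate(b, pad) for b in boxes]
    uf = UnionFind(len(boxes))
    for i in range(len(inflated)):             # pairwise overlap tests (quadratic)
        for j in range(i + 1, len(inflated)):
            if intersects(inflated[i], inflated[j]):
                uf.union(i, j)
    rois: Dict[int, Box] = {}
    for i, b in enumerate(inflated):
        root = uf.find(i)
        r = rois.get(root)
        rois[root] = b if r is None else (
            min(r[0], b[0]), min(r[1], b[1]), max(r[2], b[2]), max(r[3], b[3]))
    return sorted(rois.values())


if __name__ == "__main__":
    print(cluster_rois([(0, 0, 10, 10), (20, 0, 30, 10), (500, 500, 510, 510)], pad=8))
```

Clusters are transitive: if box A overlaps B and B overlaps C, all three land in one ROI even when A and C do not touch, which is the connected-components behavior the text asks for.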
When adaptive SAHI is wrong to use. On the first frame there are no prior boxes, so there are no ROIs, so only the fast scan runs and small entities get missed. The fix is to force full SAHI on the first frame, on every Nth frame (full_sahi_interval=5), and after an alarm. These are the three branches you see in Step 4.
Further reading. Akyon et al., Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection (2022). Sedgewick & Wayne, Algorithms (4th ed.), §1.5 for the Union-Find treatment.
Deep dive — Kalman filters and the Hungarian algorithm (how IDs persist across frames)
Why we need this: the detector gives you a fresh list of boxes every frame, but those boxes don’t carry identity — frame N’s sheep at (455, 790) and frame N+1’s sheep at (462, 794) have to be linked back to the same sheep so the LLM can refer to sheep_0 across turns. That’s the multi-object tracking problem, and the canonical solution is a two-step loop: predict where each tracked entity is now → match predictions to detections → update tracks with the matched detections.
The Kalman filter — mental model. Picture each tracked entity carrying a small Gaussian “cloud” describing where the system thinks it is. The cloud has a mean (best guess: [x, y, vx, vy] for a constant-velocity model) and a covariance (how uncertain that guess is). On every frame we do two things:
- Predict. Push the mean forward by Δt using the velocity (x' = x + vx·Δt) and grow the covariance, because uncertainty always increases when you don’t see new evidence. In closed form the prediction is μ' = F·μ, Σ' = F·Σ·Fᵀ + Q, where F is the state-transition matrix and Q is the process noise (how much we expect the entity to deviate from constant velocity per step).
- Update. When a detection arrives, the filter computes a weighted average between the prediction and the observation, weighted by their relative uncertainties (the Kalman gain K). A confident prediction + noisy observation → trust the prediction. Stale prediction + crisp observation → trust the observation.
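The predict and update steps can be written out by hand for a single axis. This sketch tracks [position, velocity] for one coordinate only (the tracker described here uses [x, y, vx, vy]), observes position alone, and uses assumed noise values; the class name is hypothetical and the 2×2 matrix algebra is expanded so no array library is needed.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    """Constant-velocity Kalman filter for one axis: state is [position, velocity]."""
    x: float = 0.0      # position mean
    v: float = 0.0      # velocity mean
    pxx: float = 100.0  # position variance
    pxv: float = 0.0    # position/velocity covariance
    pvv: float = 100.0  # velocity variance
    q: float = 0.01     # process noise added to each variance per step (assumed)
    r: float = 1.0      # measurement noise variance (assumed)

    def predict(self, dt: float = 1.0) -> None:
        # mu' = F mu with F = [[1, dt], [0, 1]]
        self.x += self.v * dt
        # Sigma' = F Sigma F^T + Q, expanded for the 2x2 case
        pxx = self.pxx + 2.0 * dt * self.pxv + dt * dt * self.pvv + self.q
        pxv = self.pxv + dt * self.pvv
        pvv = self.pvv + self.q
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv

    def update(self, z: float) -> None:
        # Only position is observed: H = [1, 0]
        s = self.pxx + self.r          # innovation variance
        k_x = self.pxx / s             # Kalman gain, position component
        k_v = self.pxv / s             # Kalman gain, velocity component
        residual = z - self.x
        self.x += k_x * residual
        self.v += k_v * residual
        # Sigma = (I - K H) Sigma
        pxx = (1.0 - k_x) * self.pxx
        pxv = (1.0 - k_x) * self.pxv
        pvv = self.pvv - k_v * self.pxv
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv


if __name__ == "__main__":
    kf = Kalman1D()
    for t in range(1, 21):             # target moving 2 px per frame
        kf.predict()
        kf.update(2.0 * t)
    print(round(kf.x, 2), round(kf.v, 2))
```

With a wide prior and a small measurement noise, the gain is close to 1 on the first update, so the estimate snaps to the first observation; as the covariance shrinks, later observations are weighted less.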
Hungarian assignment — the matching step. Before we can call “update,” we have to know which detection goes with which track. Build an N × M cost matrix (N tracked entities × M detections this frame) where each cell is some distance — pixel distance, 1 - IoU, or a Mahalanobis distance using the track’s covariance. The Hungarian algorithm finds the assignment that minimizes total cost in O((N+M)³) — globally optimal, not greedy. Unmatched detections become new tracks; unmatched tracks decay (and after a few frames of misses, get retired).
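The matching step can be sketched with a compact potentials-based implementation of the Hungarian algorithm. The function names are hypothetical, the cost matrix is a plain list of lists, and the sketch leaves out gating (rejecting matches above a cost threshold), which a real tracker would typically add.

```python
from math import inf
from typing import List, Tuple


def _solve(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Min-cost assignment of every row to a distinct column (requires rows <= cols)."""
    n, m = len(cost), len(cost[0])
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:    # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def match_tracks(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return sorted (track_index, detection_index) pairs minimizing total cost."""
    if not cost or not cost[0]:
        return []
    if len(cost) > len(cost[0]):       # more tracks than detections: solve the transpose
        transposed = [list(col) for col in zip(*cost)]
        return sorted((t, d) for d, t in _solve(transposed))
    return _solve(cost)


if __name__ == "__main__":
    # Greedy would pair track 0 with detection 0 (cost 1) and be forced into cost 100.
    print(match_tracks([[1, 2], [1, 100]]))
```

Tracks or detections that do not appear in the returned pairs are the unmatched ones: unmatched detections would seed new tracks, and unmatched tracks would decay. If SciPy is available, scipy.optimize.linear_sum_assignment solves the same problem.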
Why we picked it. Kalman is lightweight (a few small matrix ops per track, no training, no GPU), probabilistic (the covariance is a built-in confidence score — the rescan path uses confidence > 80% to skip a screenshot entirely), and well-behaved under partial occlusion (the prediction carries the entity forward through a few missed frames). Hungarian is exact and fast at the scale of dozens of entities.
When it breaks. Rapid acceleration (a knight charging) violates the constant-velocity model: the prediction lags and matching can swap IDs. Heavy occlusion (a building behind a tree) eventually exhausts the “no-update tolerance” and the track is retired, then re-spawns with a new ID. For our use case (top-down isometric, slow units, mostly-visible entities) this is rarely a problem; for highway traffic or sports you’d reach for a more robust tracker such as ByteTrack or BoT-SORT, which build on the same Kalman + Hungarian core and add stronger association (BoT-SORT also adds camera-motion compensation and appearance features).
Further reading. Welch & Bishop, An Introduction to the Kalman Filter (2006) — the classic 20-page write-up. SORT (Bewley et al., 2016) is the original “Kalman + Hungarian” multi-object tracker and is the algorithmic ancestor of our tracker.
2.2 Single-Iteration Test Mode
run_single_iteration() runs one cycle without looping:
python -m gameplay_agent --test
It captures a screenshot, runs detection, builds context, and gets actions from Claude, but does not execute them by default. It returns all intermediate results for debugging.
2.3 Loop Timing
| Phase | Duration | Source |
|---|---|---|
| Window check + focus | ~600ms worst case | window.py (3 retries, 200ms each) |
| Screenshot capture | ~10-30ms | mss grab + PIL convert + JPEG encode |
| YOLO detection (single-pass @640) | one forward pass | The deployed path (adaptive_sahi=False); cost is backend/hardware-dependent |
| Ownership classification | ~5ms | NumPy pixel analysis |
| Strategist call (periodic) | 3-8s | Sonnet text call (resources via local OCR) |
| Executor single-shot (routine turns) | ~2-4s | One messages.parse call, no tool loop |
| Executor call (per tool iteration) | ~3s | Roundtrips in the agentic loop (combat/housing) |
| Action execution | ~50ms per action | pyautogui + 50ms inter-action delay |
| Rescan: tracker prediction | ~0ms | Kalman extrapolation (confidence > 80%) |
| Rescan: fast detection | one forward pass | Single-pass YOLO at imgsz=640 |
| Loop delay | 0.3s | config.loop_delay |
Cycle time depends on the path. Routine turns take about one single-shot roundtrip (~2-4s of API time plus the 0.3s loop delay); because they pipeline, the previous turn’s committed head and the reactive tier execute during that roundtrip rather than adding to it. Combat/housing turns run the agentic tool loop synchronously and can reach ~20-30s when they use all 7 iterations. Composite tools shorten the tool-loop path further (~9s saved per building placement). The strategist runs in the background and does not add to cycle time.
2.4 Error Handling
The main loop wraps everything in try/except:
- KeyboardInterrupt: logs and exits cleanly
- Any other exception: logs the error with iteration number and re-raises
Individual steps have their own error handling:
- Detection failures are caught and logged — the loop continues without detection
- Focus failures skip the iteration
- Strategist failures are caught — executor continues with stale goals/readings
- API errors in the executor return a wait action
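The two layers of error handling can be sketched as nested try/except blocks. This is a simplified synchronous illustration: run_loop, the injected detect and act callables, and the DetectionError type are hypothetical, and only the detection case of the per-step handlers is shown.

```python
import logging
from typing import Callable, List

log = logging.getLogger("game_loop_sketch")


class DetectionError(Exception):
    """Hypothetical error type standing in for any detection failure."""


def run_loop(iterations: int,
             detect: Callable[[int], List[str]],
             act: Callable[[int, List[str]], None]) -> int:
    """Run up to `iterations` turns and return how many completed."""
    completed = 0
    i = 0
    try:
        for i in range(iterations):
            try:
                entities = detect(i)
            except DetectionError as exc:
                # Step-level handling: log and continue without detection.
                log.warning("detection failed on iteration %d: %s", i, exc)
                entities = []
            act(i, entities)
            completed += 1
    except KeyboardInterrupt:
        log.info("interrupted; exiting cleanly")
    except Exception:
        # Loop-level handling: record the iteration number, then re-raise.
        log.exception("unhandled error on iteration %d", i)
        raise
    return completed


if __name__ == "__main__":
    def flaky_detect(i: int) -> List[str]:
        if i == 1:
            raise DetectionError("model not loaded")
        return ["sheep_0"]

    print(run_loop(3, flaky_detect, lambda i, ents: None))
```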
2.5 Time Budget
The game loop supports a time_budget parameter (seconds). When elapsed time exceeds the budget, the loop exits with game_end_reason = "timeout". Used by the autoresearch framework for timed experiments.
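A minimal sketch of the budget check, assuming the budget is tested at the top of each iteration against a monotonic clock. The run_with_budget name, the step callable, and the injectable clock are hypothetical; only the "timeout" end reason comes from the text.

```python
import time
from typing import Callable, Optional


def run_with_budget(step: Callable[[], Optional[str]],
                    time_budget: Optional[float] = None,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Call step() until it returns a game_end_reason or the budget is exceeded.

    step() returns None to keep going, or a reason such as "victory" or "defeat".
    """
    start = clock()
    while True:
        if time_budget is not None and clock() - start > time_budget:
            return "timeout"
        reason = step()
        if reason is not None:
            return reason


if __name__ == "__main__":
    def slow_step() -> Optional[str]:
        time.sleep(0.02)        # stands in for one full iteration
        return None

    print(run_with_budget(slow_step, time_budget=0.05))
```

Using a monotonic clock keeps the budget unaffected by system clock adjustments during a long experiment.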
Summary
- 12-step iteration cycle: check → focus → capture → detect → classify → alarm → strategist → context → executor → memory → execute+verify → wait
- Cycle time depends on the path: routine turns single-shot in ~2-4s; combat/housing turns run the agentic tool loop and can reach ~30s
- Routine turns are pipelined (RTC): the executor call overlaps the previous turn’s committed head plus the deterministic reactive tier (villager queue/reassign) via asyncio.create_task(); combat turns run synchronously
- Action-effect verification (R1): entity-affecting actions are confirmed by re-detection, and misses emit "no visible change" into the stuck-loop detector
- Strategist runs asynchronously in the background; executor runs every turn
- Composite tools (build, send_villager, queue_villager) eliminate multiple API roundtrips per sequence
- Detection is a single pass at imgsz=640 (training resolution); SAHI is implemented but disabled because it lowers real F1 at retina resolution
- Rescans use tracker prediction (~0ms) or single-pass fast detection
- Goal-driven with reward computation per turn
- Action failure feedback tracked via ActionResult and fed back to memory
Related Topics
- Chapter 1: System Overview — component dependencies and graceful degradation
- Chapter 3: Action Model & Execution — how actions are validated and executed
- Chapter 6: Context Injection — what context the LLM receives