AoE2 · LLM Arena

Runbook — Debugging a Stuck Fork or Replay

Use this runbook when POST /forks returns a child_run_id but the run never finishes: there is no Complete status, no events arrive past the first few, and the run eventually hits the broker buffer and starts dropping. The steps below are in diagnosis order.

First, decide what “stuck” means

| Frontend says | What it usually means |
| --- | --- |
| Streaming + low event count + no growth | The replay is blocked or crashed; check the server logs. |
| Streaming + steady growth + then Error | Broker overflow without auto-reconnect, or backend crash. |
| Closed immediately | Replay exited early; check for an exception in server logs. |
| Run never appears in /runs | Replay crashed before any event was persisted; check server logs. |

Step 1 — read the server log

The replay task uses logger.exception(...) on any exception (apps/api/src/forks.py:248). The trace shows up wherever uvicorn is writing logs. Most “stuck” cases turn out to be:

Step 2 — check /metrics

curl -s http://localhost:8000/metrics | jq .
| Counter | Meaning | What to look for |
| --- | --- | --- |
| events_published | Total publishes since server start | Should grow during an active replay. |
| events_streamed | Total envelopes yielded to consumers | Should grow with each SSE consumer. |
| streams_dropped | Consumers that hit BrokerOverflowError | Non-zero ⇒ a consumer fell behind; expected to be small / zero. |
| runs_open | Currently-publishing runs | If non-zero long after the producer should be done, the producer didn’t close the run. |

A stuck replay typically shows runs_open > 0 while events_published stays constant for an unusually long time. The reaper grace period is 30 min by default: once a run has gone more than 30 min without progress, the reaper will eventually drop its buffer, and any reader will start getting BrokerOverflowError (or a 404 on the cold-path fallback if the run wasn’t persisted yet).
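The stall signature described above (runs_open > 0 with a flat events_published) can be checked mechanically across a few snapshots. The following is a minimal sketch, not part of the codebase: it assumes /metrics returns a flat JSON object keyed by the counter names in the table, and the helper names fetch_metrics and looks_stalled are hypothetical.

```python
import json
import urllib.request


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> dict:
    """Fetch one metrics snapshot.

    Assumption: the endpoint returns a flat JSON object such as
    {"events_published": 42, "runs_open": 1, ...}.
    """
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def looks_stalled(samples: list[dict], min_samples: int = 3) -> bool:
    """Return True if the most recent snapshots match the stall signature.

    The signature is: every one of the last `min_samples` snapshots has
    runs_open > 0, and events_published is identical across all of them.
    """
    if len(samples) < min_samples:
        return False  # not enough history to judge
    recent = samples[-min_samples:]
    all_open = all(s["runs_open"] > 0 for s in recent)
    published = {s["events_published"] for s in recent}
    return all_open and len(published) == 1
```

In practice you would append fetch_metrics() to a list every 10 to 30 seconds and call looks_stalled on it. A flat counter over a short window can also just mean a slow LLM call, so treat a True result as a prompt to continue with the next steps, not as proof.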

Step 3 — inspect the fork tasks set

If you have access to the running process, this is the gold standard:

# In a REPL attached to the server (or via a debug endpoint you add)
for task in app.state.fork_tasks:
    print(task.get_name(), task.done(), task.get_coro())
    if not task.done():
        # Print the frame where the pending task's coroutine is suspended
        task.print_stack(limit=3)
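To see what this inspection yields without attaching to the server, here is a self-contained sketch. It uses a hypothetical stand-in coroutine, replay_stub, that parks on an asyncio.Event; the real replay coroutine and its await points are not shown here. It relies only on the standard Task methods get_name(), done(), and get_stack().

```python
import asyncio


async def replay_stub(gate: asyncio.Event) -> None:
    # Hypothetical stand-in for the replay coroutine: it parks here until
    # the gate is set, mimicking a replay blocked on some await.
    await gate.wait()


def describe(task: asyncio.Task) -> dict:
    """Summarize a task: its name, done flag, and where it is suspended."""
    frames = task.get_stack(limit=1)  # one frame for a suspended coroutine
    where = None
    if frames:
        frame = frames[0]
        where = f"{frame.f_code.co_name}:{frame.f_lineno}"
    return {"name": task.get_name(), "done": task.done(), "suspended_at": where}


async def demo() -> dict:
    gate = asyncio.Event()
    task = asyncio.create_task(replay_stub(gate), name="fork-demo")
    await asyncio.sleep(0)  # let the task run up to its first await
    info = describe(task)   # inspect while it is still parked
    gate.set()              # release it so the demo exits cleanly
    await task
    return info


info = asyncio.run(demo())
print(info)
```

The suspended_at value names the coroutine function and the line of the await it is parked on, which is the piece of information that distinguishes one stuck spot from another.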

A task that reports done() == False while its coroutine (from get_coro()) is parked on a specific await tells you exactly where the replay is stuck. Common stuck spots:

Step 4 — Redis-backend specific checks

# Is the producer's :open sentinel still there?
redis-cli -a "$REDIS_PASSWORD" EXISTS arena:run:<child_run_id>:open
# 1 = open, 0 = closed/never-opened

# How many events did the producer write?
redis-cli -a "$REDIS_PASSWORD" XLEN arena:run:<child_run_id>:events

# Has the consumer fallen behind the head?
redis-cli -a "$REDIS_PASSWORD" XINFO STREAM arena:run:<child_run_id>:events
# Look at first-entry vs last-entry IDs

If EXISTS returns 0 but the stream has entries, the producer closed normally but the consumer hasn’t terminated — likely the consumer’s await self.flush() queue isn’t draining. See Runbook: redis-broker-ops.

If EXISTS returns 1 and the stream length is stable across repeated XLEN calls, the producer is stuck mid-replay (go back to Step 1 and read the server logs).
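The two rules above can be written down as a small decision function. This is a sketch for illustration only: diagnose is a hypothetical helper, and its inputs are the values you read off the redis-cli commands (the EXISTS result, and two or more XLEN readings taken a few seconds apart).

```python
def diagnose(open_exists: int, xlen_samples: list[int]) -> str:
    """Map the Redis checks onto the runbook's interpretation.

    open_exists:   result of EXISTS on the :open key (1 or 0).
    xlen_samples:  successive XLEN readings of the :events stream.
    """
    has_entries = bool(xlen_samples) and xlen_samples[-1] > 0
    # "Stable" needs at least two readings that are all equal.
    stable = len(xlen_samples) >= 2 and len(set(xlen_samples)) == 1

    if open_exists == 0 and has_entries:
        return "producer closed; consumer not terminating (see redis-broker-ops)"
    if open_exists == 1 and stable:
        return "producer stuck mid-replay (read the server log)"
    if open_exists == 1:
        return "producer still writing"
    return "closed or never opened, no events"
```

For example, diagnose(1, [17, 17, 17]) reports a producer stuck mid-replay, while diagnose(1, [17, 19, 24]) reports a producer that is still making progress.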

Step 5 — recover

Options, in order of preference:

  1. Wait it out. If the producer is genuinely doing an LLM call, give it 60s.
  2. Restart the server. The lifespan teardown cancels all fork_tasks and gathers them with return_exceptions=True (apps/api/src/server.py:241), so a clean restart drains. Existing finalized runs in DuckDB are unaffected.
  3. Manual reap (Redis). If the run is wedged with no persister catching up, you can delete the Redis keys directly:
    redis-cli -a "$REDIS_PASSWORD" DEL arena:run:<rid>:events arena:run:<rid>:seq arena:run:<rid>:open
    This is destructive — any consumer mid-stream will get BrokerOverflowError or empty results. Only do this if you’re already going to restart everything.
  4. Manual reap (in-process). Restart the server. There’s no “reap one run” endpoint by design; the reaper does it on a schedule.

Step 6 — file the bug

If the stuck case is reproducible, capture:

The replay path (apps/api/src/forks.py:_replay) has annotated lifecycle ordering — any new “stuck” failure mode is usually a violation of that ordering or a new SDK gotcha worth recording in the source as a comment.