Merge PR: Phase 1 Slice 3 — single-host resilience (crash recovery, WIP checkpoint/resume, retry) + execution insights

Reviewed against §11/§25/§26 single-host scope; selftest 34/34 (regression intact); WIP protects current branch + PID-reuse guard on orphan recovery verified.
2026-05-29 19:00:13 -07:00 · 2026-05-29 19:00:13 -07:00 · 41f91d7ea1
commit 41f91d7ea1
parent bc0c0e263c 87a4bf591a
5 changed files with 657 additions and 40 deletions
--- a/agent-queue/README.md
+++ b/agent-queue/README.md
@ -113,7 +113,7 @@ are otherwise **no-ops until a later phase** (they do not yet affect execution).
 | `prefers` | RESERVED | _(none)_ | soft routing/affinity hints (e.g. `[factory:mac-2]`) |
 | `budget` | RESERVED | _(none)_ | `{ usd, tokens, wall }` ceilings (`wall` enforcement is a later slice) |
 | `deps` / `deps-mode` | RESERVED | _(none)_ | DAG dependencies (single-host blocking is a later slice) |
-| `retry` | RESERVED | _(none)_ | `{ max, backoff, on }` retry policy |
+| `retry` | **active** | _(none)_ | `{ max: N, backoff: 5m, on: [timeout, verify_failed, crash] }` — requeue failures with backoff up to `max`, then `retries_exhausted` (see **Resilience**) |
 | `review-policy` | RESERVED | _(none)_ | `auto\|manual\|reviewers:[…]` |
 | `artifacts` | RESERVED | _(none)_ | extra outputs to capture (coverage, screenshots) |
 | `tracker-item` | RESERVED | _(none)_ | link back to the originating tracker task |
@ -160,8 +160,10 @@ misparsed as a flag.
 | `init` | create the `queue/` folders |
 | `add <file> [--engine E] [--cwd P] [--yolo\|--no-yolo]` | queue a prompt into `inbox/` |
 | `run [--max N] [--engine E] [--once]` | process the inbox (foreground loop) |
-| `status` | kanban counts + running-worker table (marks `⚠ stalled` workers) |
+| `status` | kanban counts + running-worker table (marks `⚠ stalled`; per-job insights sub-line) |
 | `watch [interval]` | live `status` (bash), redrawn every N seconds (default 2) |
+| `insights [job]` | per-job metrics, or a recent-jobs table + per-engine token/cost/success rollup (see **Insights**) |
+| `recover` | reclaim orphaned `building/` jobs (dead worker) back to `inbox/` (see **Resilience**) |
 | `dash [--interval N]` | **interactive Node dashboard** — navigable numbered job list with single-key actions (see below) |
 | `stop` | kill running workers + the run loop |
 | `logs <job> [-f]` | print / follow a job's log |
@ -225,7 +227,59 @@ queue/
 **`result=` values** written to `<job>.meta`: `review`, `testing`, `shipped`,
 `failed`, `timeout`, `verify_failed`, `rejected`, `requeued`, `capability_mismatch`
 (host missing a required capability — agent never launched), `no_engine`
-(an `engine-class` had no available engine).
+(an `engine-class` had no available engine), `retries_exhausted` (failed after
+`retry.max` attempts — single-host dead-letter stand-in), `retry_scheduled`
+(transient: requeued for another attempt), `recovered` (transient: an orphan was
+reclaimed to `inbox/`).
+
+## Resilience (crash recovery & work preservation)
+
+Single-host implementations of the durability model (roadmap §25):
+
+- **Orphan recovery.** A job left in `building/` whose worker process is dead (no
+  live `pid`, PID-reuse-guarded by `pidstart`) is an orphan from a previous
+  crash/power-off. On `run` startup and on every loop iteration (or on demand via
+  `agent-queue.sh recover`) it is moved back to `inbox/` with `attempts`
+  incremented. Recovery is **idempotent** — once moved out of `building/` it is
+  never recovered twice.
+- **WIP checkpointing.** When a job's `cwd` is a git repo, the worker creates/checks
+  out a dedicated branch **`aq/wip/<job>`** at start and commits any changes to it
+  on **every** exit path — success, failure, timeout, and SIGTERM/SIGINT (via a
+  trap). It **never** commits to `main`/your current branch. Non-git `cwd` is
+  skipped cleanly. `wip_branch` / `wip_base` / `wip_commit` are recorded in the meta.
+- **Resume.** When an orphan/retry of a job whose `aq/wip/<job>` branch already
+  exists is relaunched, that branch is checked out first so the agent **continues
+  from the checkpoint** instead of from zero.
+- **Retry policy** (`retry` frontmatter, now active). On a failure whose class is in
+  `on` (`crash`/`agent_error` for a non-zero agent exit, `timeout`, `verify_failed`)
+  the job is requeued to `inbox/` honoring `backoff` (selection skips it until
+  `next_eligible`) up to `max` attempts; on exhaustion it lands in `failed/` with
+  `result=retries_exhausted`, preserving the WIP branch + full log. No `retry` =
+  no retry (Phase-0 behavior).
+
+All bookkeeping (`attempts`, `next_eligible`, `wip_*`) is append-only in the meta
+and re-derivable from the meta + folder location, so recovery is crash-safe.
+
+## Insights (metrics & token accounting)
+
+Each finished run records into `<job>.meta`: `duration_s`, `exit`, `result`,
+`attempts`, and — for a git `cwd` — `files_changed` / `lines_added` /
+`lines_deleted` (diffed `wip_base..HEAD`). A single `parse_usage <engine> <log>`
+adapter extracts `model` / `tokens_in` / `tokens_out` / `tokens_cached` /
+`cost_usd` / `turns` / `tool_calls` when the engine exposes them.
+
+```bash
+agent-queue.sh insights <job>   # full metrics for one job
+agent-queue.sh insights         # recent-jobs table + per-engine rollup
+```
+
+> **Token caveat (honest):** real usage is captured only where the engine surfaces
+> it. A cooperating wrapper may emit a machine-readable `AQ_USAGE key=value …` line;
+> otherwise per-engine heuristics apply (Claude/Codex token fields parsed; Devin
+> session metrics + Copilot are API-only and currently TODO in `parse_usage`). When
+> a value is not provider-reported it is **omitted or flagged `usage_estimated`** —
+> numbers are never fabricated. The per-engine rollup marks totals that include any
+> estimated value with `*`.

 ## Config (env overrides)

--- a/agent-queue/agent-queue.sh
+++ b/agent-queue/agent-queue.sh
@ -469,11 +469,26 @@ run_worker() {
    return 1
  fi

+  local started; started=$(grep '^started=' "$metaf" 2>/dev/null | tail -1 | cut -d= -f2)
+
  # Strip our frontmatter so the agent only sees the task body.
  local bodyf="$STATE/$job.body.md"
  strip_frontmatter "$doing_file" > "$bodyf"
  build_agent_cmd "$engine" "$bodyf" "$yolo"

+  # ── WIP checkpoint setup (§25.2): on a git cwd, create/checkout aq/wip/<job>
+  # so partial work survives a crash; a trap guarantees a checkpoint on EVERY
+  # exit path (success, failure, timeout, SIGTERM/SIGINT). Non-git cwd: no-op. ──
+  WIP_ACTIVE=0; WIP_BASE=""; WIP_DONE=0
+  _worker_trap() {
+    [[ "$WIP_DONE" == 1 ]] && return
+    WIP_DONE=1
+    [[ "$WIP_ACTIVE" == 1 ]] && _wip_checkpoint "$job" "$cwd" "$metaf" "$logf" "trap-exit"
+  }
+  trap '_worker_trap' EXIT
+  trap '_worker_trap; exit 143' INT TERM
+  _wip_start "$job" "$cwd" "$metaf" "$logf" || true
+
  _run_agent() {
    if [[ -n "$AGENT_STDIN" ]]; then
      ( cd "$cwd" && "${AGENT_CMD[@]}" < "$AGENT_STDIN" )
@ -525,23 +540,27 @@ run_worker() {
  fi
  rm -f "$tmo_flag"

-  echo "exit=$rc" >> "$metaf"
+  # ── Preserve work + capture run metrics on EVERY path (§25.2/§26.1/§26.2) ──
+  if [[ "$WIP_ACTIVE" == 1 ]]; then
+    _wip_checkpoint "$job" "$cwd" "$metaf" "$logf" "agent-exit"; WIP_DONE=1
+  fi
+  _numstat_into_meta "$cwd" "$WIP_BASE" "$metaf"
+  parse_usage "$engine" "$logf" >> "$metaf"
+
  if $timed_out; then
-    mv "$doing_file" "$FAILED/" 2>/dev/null
-    echo "result=timeout" >> "$metaf"
-    echo "ended=$(date +%s)" >> "$metaf"
    echo "TIMED OUT after ${tmo}s (rc=$rc): $(date)" >> "$logf"
+    _finish_failure "$job" "$doing_file" "$metaf" "$logf" "timeout" "$rc" "$started"
  elif [[ $rc -eq 0 ]]; then
    # Agent succeeded: land in review/, then run the auto-QA verify gate. The
    # worker is still alive here so the concurrency slot stays held through
    # verification — `ended=` is written only once we reach a resting stage.
    mv "$doing_file" "$REVIEW/" 2>/dev/null
    local review_file="$REVIEW/$job.md"
+    echo "exit=$rc" >> "$metaf"
    echo "completed OK (rc=0): landed in review — $(date)" >> "$logf"
    local verify; verify=$(fm_get "$review_file" verify "$DEFAULT_VERIFY")
    if [[ -z "$verify" ]]; then
-      echo "result=review" >> "$metaf"
-      echo "ended=$(date +%s)" >> "$metaf"
+      _meta_end "$metaf" "review" "$started"
      echo "no verify command — parked in review for manual promote: $(date)" >> "$logf"
    else
      echo "----- verify: $verify -----" >> "$logf"
@ -550,24 +569,273 @@ run_worker() {
      echo "verify_exit=$vrc" >> "$metaf"
      if [[ $vrc -eq 0 ]]; then
        mv "$review_file" "$TESTING/" 2>/dev/null
-        echo "result=testing" >> "$metaf"
-        echo "ended=$(date +%s)" >> "$metaf"
+        _meta_end "$metaf" "testing" "$started"
        echo "VERIFY PASSED — promoted to testing (QA): $(date)" >> "$logf"
      else
-        mv "$review_file" "$FAILED/" 2>/dev/null
-        echo "result=verify_failed" >> "$metaf"
-        echo "ended=$(date +%s)" >> "$metaf"
        echo "VERIFY FAILED (rc=$vrc): $(date)" >> "$logf"
+        # verify ran on the review_file; retry policy may requeue it.
+        _finish_failure "$job" "$review_file" "$metaf" "$logf" "verify_failed" "$rc" "$started"
      fi
    fi
  else
-    mv "$doing_file" "$FAILED/" 2>/dev/null
-    echo "result=failed" >> "$metaf"
-    echo "ended=$(date +%s)" >> "$metaf"
    echo "FAILED (rc=$rc): $(date)" >> "$logf"
+    _finish_failure "$job" "$doing_file" "$metaf" "$logf" "crash" "$rc" "$started"
  fi
 }

+# ── Resilience & insights helpers (Phase 1 — single-host §25/§26) ────
+#
+# _in_list <item> <space-separated-list> -> 0 if item is present.
+_in_list() { case " ${2:-} " in *" $1 "*) return 0;; esac; return 1; }
+
+# _meta_end <metafile> <result> <started-epoch> -> append result/ended/duration
+# (append-only; never truncates a live meta — §25 state integrity).
+_meta_end() {
+  local mf=$1 res=$2 now; now=$(date +%s); local st=${3:-$now}
+  [[ "$st" =~ ^[0-9]+$ ]] || st=$now
+  { echo "result=$res"; echo "ended=$now"; echo "duration_s=$(( now - st ))"; } >> "$mf"
+}
+
+# ── git WIP checkpointing (§25.2) ──
+_is_git_repo() { git -C "$1" rev-parse --is-inside-work-tree >/dev/null 2>&1; }
+_wip_branch()  { printf 'aq/wip/%s' "$1"; }
+
+# _wip_start <job> <cwd> <metaf> <logf> -> ensure/checkout the WIP branch.
+# Sets globals WIP_ACTIVE (1 when a git WIP branch is in play) and WIP_BASE.
+# RESUME: if aq/wip/<job> already exists (orphan/retry relaunch), check it out so
+# the agent continues from the checkpoint instead of from zero. The base commit is
+# persisted in a write-once sidecar so it survives meta re-creation across attempts.
+_wip_start() {
+  local job=$1 cwd=$2 metaf=$3 logf=$4
+  WIP_ACTIVE=0; WIP_BASE=""
+  if ! _is_git_repo "$cwd"; then
+    echo "wip: cwd not a git repo — skipping checkpoint" >> "$logf"
+    return 1
+  fi
+  local br basefile base; br=$(_wip_branch "$job"); basefile="$STATE/$job.wipbase"
+  if git -C "$cwd" show-ref --verify --quiet "refs/heads/$br"; then
+    git -C "$cwd" checkout "$br" >/dev/null 2>&1 \
+      || echo "wip: could not checkout $br (resume) — staying on current branch" >> "$logf"
+    echo "wip: resuming on $br ($(date))" >> "$logf"
+  else
+    base=$(git -C "$cwd" rev-parse --short HEAD 2>/dev/null || echo "")
+    if ! git -C "$cwd" checkout -b "$br" >/dev/null 2>&1; then
+      echo "wip: could not create $br — skipping checkpoint" >> "$logf"
+      return 1
+    fi
+    [[ -n "$base" ]] && printf '%s\n' "$base" > "$basefile"
+    echo "wip: created $br from ${base:-<root>} ($(date))" >> "$logf"
+  fi
+  base=$(cat "$basefile" 2>/dev/null || echo "")
+  { echo "wip_branch=$br"; echo "wip_base=$base"; } >> "$metaf"
+  WIP_ACTIVE=1; WIP_BASE="$base"
+  return 0
+}
+
+# _wip_checkpoint <job> <cwd> <metaf> <logf> <stage> -> commit any changes in cwd
+# onto aq/wip/<job>. Idempotent (no commit when the tree is clean). NEVER commits
+# to a non-WIP branch — protects main/protected branches (§12).
+_wip_checkpoint() {
+  local job=$1 cwd=$2 metaf=$3 logf=$4 stage=$5
+  _is_git_repo "$cwd" || return 0
+  local br cur; br=$(_wip_branch "$job")
+  cur=$(git -C "$cwd" symbolic-ref --short HEAD 2>/dev/null || echo "")
+  if [[ "$cur" != "$br" ]]; then
+    git -C "$cwd" checkout "$br" >/dev/null 2>&1 \
+      || { echo "wip: not on $br — skipping checkpoint to protect '$cur'" >> "$logf"; return 0; }
+  fi
+  git -C "$cwd" add -A >/dev/null 2>&1
+  if git -C "$cwd" diff --cached --quiet 2>/dev/null; then
+    return 0   # nothing to preserve
+  fi
+  git -C "$cwd" -c user.email=agent-queue@local -c user.name=agent-queue \
+    commit -q -m "aq wip: $job ($stage)" >/dev/null 2>&1
+  local c; c=$(git -C "$cwd" rev-parse --short HEAD 2>/dev/null || echo "")
+  [[ -n "$c" ]] && echo "wip_commit=$c" >> "$metaf"
+  echo "wip: checkpoint ($stage) -> ${c:-?} ($(date))" >> "$logf"
+}
+
+# _numstat_into_meta <cwd> <base> <metaf> -> record files_changed/lines_added/
+# lines_deleted for the run (base..HEAD on the WIP branch; binary files count 0).
+_numstat_into_meta() {
+  local cwd=$1 base=$2 metaf=$3 out stats
+  _is_git_repo "$cwd" || return 0
+  if [[ -n "$base" ]]; then
+    out=$(git -C "$cwd" diff --numstat "$base" HEAD 2>/dev/null)
+  else
+    out=$(git -C "$cwd" diff --numstat HEAD 2>/dev/null)
+  fi
+  [[ -n "$out" ]] || return 0
+  stats=$(printf '%s\n' "$out" | awk '
+    { if ($1 ~ /^[0-9]+$/) a+=$1; if ($2 ~ /^[0-9]+$/) d+=$2; f++ }
+    END { printf "%d %d %d", f+0, a+0, d+0 }')
+  # shellcheck disable=SC2086
+  set -- $stats
+  { echo "files_changed=$1"; echo "lines_added=$2"; echo "lines_deleted=$3"; } >> "$metaf"
+}
+
+# ── retry policy (§5/§11/§25.3) ──
+# Parse `retry: { max: N, backoff: 5m, on: [classes] }`. Absent/empty -> no retry.
+_retry_max()      { local m; m=$(printf '%s' "${1:-}" | grep -oE 'max:[[:space:]]*[0-9]+' | grep -oE '[0-9]+' | head -1); echo "${m:-0}"; }
+_retry_backoff_s() { local b; b=$(printf '%s' "${1:-}" | grep -oE 'backoff:[[:space:]]*[0-9]+[smhd]?' | sed -E 's/^backoff:[[:space:]]*//' | head -1); _dur_to_secs "${b:-0}"; }
+_retry_on() {
+  local raw=${1:-} inside
+  inside=$(printf '%s' "$raw" | sed -nE 's/.*on:[[:space:]]*\[([^]]*)\].*/\1/p')
+  [[ -n "$inside" ]] || inside=$(printf '%s' "$raw" | sed -nE 's/.*on:[[:space:]]*([A-Za-z_,[:space:]-]+).*/\1/p')
+  parse_list "$inside" | tr '\n' ' '
+}
+# _class_retryable <class> <on-list> -> 0 if this failure class should retry.
+# `crash` and `agent_error` are synonyms for a non-zero agent exit.
+_class_retryable() {
+  local class=$1 on=$2
+  case "$class" in
+    crash) _in_list crash "$on" || _in_list agent_error "$on";;
+    *) _in_list "$class" "$on";;
+  esac
+}
+
+# _finish_failure <job> <file> <metaf> <logf> <class> <rc> <started> -> apply the
+# retry policy to a failed run. Retries (requeue to inbox with backoff via
+# next_eligible) while the class is in `retry.on` and attempts <= max; on
+# exhaustion -> failed/ result=retries_exhausted; with no policy -> failed/ with
+# the natural result (failed|timeout|verify_failed). The WIP branch is preserved
+# either way so a retry resumes from the checkpoint.
+_finish_failure() {
+  local job=$1 file=$2 metaf=$3 logf=$4 class=$5 rc=$6 started=$7
+  local raw max on attempts; raw=$(fm_get "$file" retry "")
+  attempts=$(grep '^attempts=' "$metaf" 2>/dev/null | tail -1 | cut -d= -f2); attempts=${attempts:-1}
+  max=$(_retry_max "$raw"); on=$(_retry_on "$raw")
+  if [[ "$max" -gt 0 ]] && _class_retryable "$class" "$on" && [[ "$attempts" -le "$max" ]]; then
+    local backoff now next na; backoff=$(_retry_backoff_s "$raw")
+    now=$(date +%s); next=$(( now + backoff )); na=$(( attempts + 1 ))
+    mv "$file" "$INBOX/" 2>/dev/null
+    {
+      echo "exit=$rc"; echo "attempts=$na"; echo "next_eligible=$next"
+      echo "retry_class=$class"; echo "result=retry_scheduled"
+      echo "ended=$now"; echo "duration_s=$(( now - ${started:-now} ))"
+    } >> "$metaf"
+    echo "RETRY scheduled: class=$class, attempt $attempts/$max, backoff ${backoff}s -> inbox ($(date))" >> "$logf"
+    return 0
+  fi
+  local result
+  if [[ "$max" -gt 0 ]] && _class_retryable "$class" "$on"; then
+    result="retries_exhausted"
+    echo "RETRIES EXHAUSTED after $attempts attempt(s) (class=$class) -> failed/ ($(date))" >> "$logf"
+  else
+    case "$class" in
+      timeout) result="timeout";;
+      verify_failed) result="verify_failed";;
+      *) result="failed";;
+    esac
+  fi
+  echo "exit=$rc" >> "$metaf"
+  mv "$file" "$FAILED/" 2>/dev/null
+  _meta_end "$metaf" "$result" "$started"
+}
+
+# ── orphan recovery (§25.3) ──
+# recover_orphans -> move building/ jobs whose worker is dead back to inbox/ for
+# re-selection (resume-aware via the WIP branch), incrementing attempts. Idempotent:
+# once moved out of building/ a job is never recovered twice. A retry-capped job
+# that has exhausted crash retries goes to failed/ result=retries_exhausted instead
+# of looping forever; otherwise recovery never strands work.
+recover_orphans() {
+  local f job metaf pid pidstart
+  for f in "$BUILDING"/*.md; do
+    [[ -e "$f" ]] || continue
+    job=$(basename "$f"); job=${job%.md}; metaf="$STATE/$job.meta"
+    if [[ -f "$metaf" ]] && ! grep -q '^ended=' "$metaf"; then
+      pid=$(grep '^pid=' "$metaf" 2>/dev/null | tail -1 | cut -d= -f2)
+      pidstart=$(grep '^pidstart=' "$metaf" 2>/dev/null | tail -1 | cut -d= -f2-)
+      _pid_alive "$pid" "$pidstart" && continue   # a live worker still owns it
+    fi
+    local prev na raw max now; prev=$(grep '^attempts=' "$metaf" 2>/dev/null | tail -1 | cut -d= -f2); prev=${prev:-1}
+    raw=$(fm_get "$f" retry ""); max=$(_retry_max "$raw"); now=$(date +%s)
+    if [[ "$max" -gt 0 ]] && _class_retryable crash "$(_retry_on "$raw")" && [[ "$prev" -gt "$max" ]]; then
+      mv "$f" "$FAILED/" 2>/dev/null
+      { echo "attempts=$prev"; echo "recovered=$now"; } >> "$metaf"
+      _meta_end "$metaf" "retries_exhausted" "$(grep '^started=' "$metaf" | tail -1 | cut -d= -f2)"
+      echo "ORPHAN: $job exhausted crash retries -> failed/ ($(date))" >> "$LOGS/$job.log"
+      log "↻ orphan $C_BOLD$job$C_RESET exhausted retries -> failed"
+      continue
+    fi
+    na=$(( prev + 1 ))
+    local next=""; [[ "$max" -gt 0 ]] && next=$(( now + $(_retry_backoff_s "$raw") ))
+    mv "$f" "$INBOX/" 2>/dev/null
+    {
+      echo "attempts=$na"; echo "recovered=$now"
+      [[ -n "$next" ]] && echo "next_eligible=$next"
+      echo "result=recovered"; echo "ended=$now"
+    } >> "$metaf"
+    echo "ORPHAN RECOVERED: $job (worker dead) -> inbox, attempt now $na ($(date))" >> "$LOGS/$job.log"
+    log "↻ recovered orphan $C_BOLD$job$C_RESET (attempt $na)"
+  done
+}
+
+# ── token / cost capture (§26.2) ──
+# parse_usage <engine> <logfile> -> emit `key=value` usage lines (model, tokens_in,
+# tokens_out, tokens_cached, cost_usd, turns, tool_calls, usage_estimated) when the
+# engine's output exposes them. This is the SINGLE place per-engine extraction lives.
+# A wrapper of any engine may emit a machine-readable `AQ_USAGE k=v ...` line, which
+# is always honored; engine-specific heuristics are best-effort (real where known,
+# TODO otherwise). Never fabricate precise numbers — omit or mark usage_estimated.
+parse_usage() {
+  local engine=$1 log=$2
+  [[ -f "$log" ]] || return 0
+  # 1) Generic, explicit usage line (preferred; emitted by any cooperating wrapper).
+  local line; line=$(grep -E '^AQ_USAGE ' "$log" 2>/dev/null | tail -1)
+  if [[ -n "$line" ]]; then
+    local kv
+    for kv in ${line#AQ_USAGE }; do
+      case "$kv" in
+        model=*|tokens_in=*|tokens_out=*|tokens_cached=*|cost_usd=*|turns=*|tool_calls=*|usage_estimated=*) echo "$kv";;
+      esac
+    done
+    return 0
+  fi
+  # 2) Engine-specific best-effort heuristics (real where the format is known).
+  local ti to
+  case "$engine" in
+    claude)
+      # Claude Code can surface usage as JSON-ish input_tokens/output_tokens.
+      ti=$(grep -oE '"input_tokens"[": ]+[0-9]+' "$log" 2>/dev/null | grep -oE '[0-9]+' | tail -1)
+      to=$(grep -oE '"output_tokens"[": ]+[0-9]+' "$log" 2>/dev/null | grep -oE '[0-9]+' | tail -1)
+      [[ -n "$ti" ]] && echo "tokens_in=$ti"
+      [[ -n "$to" ]] && echo "tokens_out=$to"
+      ;;
+    codex)
+      # OpenAI usage object: prompt_tokens / completion_tokens.
+      ti=$(grep -oE '"prompt_tokens"[": ]+[0-9]+' "$log" 2>/dev/null | grep -oE '[0-9]+' | tail -1)
+      to=$(grep -oE '"completion_tokens"[": ]+[0-9]+' "$log" 2>/dev/null | grep -oE '[0-9]+' | tail -1)
+      [[ -n "$ti" ]] && echo "tokens_in=$ti"
+      [[ -n "$to" ]] && echo "tokens_out=$to"
+      ;;
+    devin)   : ;;  # TODO: Devin session metrics are exposed via API, not the local log.
+    copilot) : ;;  # TODO: GitHub Copilot CLI usage format not yet documented here.
+  esac
+  return 0
+}
+
+# ── insights helpers (§26) ──
+# _meta_val <metafile> <key> -> last value for key (append-only safe), else empty.
+_meta_val() { grep "^$2=" "$1" 2>/dev/null | tail -1 | cut -d= -f2-; }
+# _result_is_success <result> -> 0 if the agent run succeeded (reached a good stage).
+_result_is_success() { case "$1" in review|testing|shipped) return 0;; *) return 1;; esac; }
+
+# _insights_line <metafile> -> compact one-line metrics summary for status/dash.
+_insights_line() {
+  local f=$1 s="" v ti to la ld
+  v=$(_meta_val "$f" attempts);   [[ -n "$v" ]] && s+="attempt $v  "
+  v=$(_meta_val "$f" duration_s); [[ -n "$v" ]] && s+="${v}s  "
+  ti=$(_meta_val "$f" tokens_in); to=$(_meta_val "$f" tokens_out)
+  [[ -n "$ti$to" ]] && s+="tok ${ti:-0}/${to:-0}  "
+  v=$(_meta_val "$f" cost_usd);   [[ -n "$v" ]] && s+="usd=$v  "
+  la=$(_meta_val "$f" lines_added); ld=$(_meta_val "$f" lines_deleted)
+  [[ -n "$la$ld" ]] && s+="+${la:-0}/-${ld:-0}  "
+  [[ -n "$(_meta_val "$f" usage_estimated)" ]] && s+="(est)  "
+  printf 'insights: %s' "${s:-(pending)}"
+}
+
 # ── Commands ────────────────────────────────────────────────────────
 cmd_init() { ensure_dirs; log "queue initialized at $C_BOLD$QUEUE_ROOT$C_RESET"; }

@ -671,20 +939,30 @@ cmd_run() {
  echo "$$" > "$STATE/daemon.pid"
  trap 'rm -f "$STATE/daemon.pid"; log "run loop stopped"; exit 0' INT TERM
  log "run loop started (max=$MAX_CONCURRENCY, default engine=$DEFAULT_ENGINE). Ctrl-C to stop."
+  # Crash recovery (§25.3): reclaim jobs orphaned in building/ by a previous
+  # crash/power-off before launching anything new.
+  recover_orphans

  while true; do
+    # continuously sweep for orphans (a worker that died mid-loop)
+    recover_orphans
    local running; running=$(active_workers)
    # launch jobs while we have capacity and an eligible inbox file
    while [[ "$running" -lt "$MAX_CONCURRENCY" ]]; do
      # pick by priority (critical→low) then age, skipping files whose lock key
      # is currently busy, so two jobs sharing a cwd (or `lock:` key) never run
      # at once regardless of --max. inbox_sorted replaces the old pure-FIFO sort.
+      # Also skip jobs still inside their retry/recovery backoff (next_eligible).
      local busy; busy=$(busy_keys)
-      local next="" cand cand_key
+      local next="" cand cand_key cand_job cand_ne now_s
+      now_s=$(date +%s)
      while IFS= read -r cand; do
        [[ -n "$cand" ]] || continue
        cand_key=$(lock_key_for "$cand")
        if printf '%s\n' "$busy" | grep -qxF -- "$cand_key"; then continue; fi
+        cand_job=$(basename "$cand"); cand_job=${cand_job%.md}
+        cand_ne=$(grep '^next_eligible=' "$STATE/$cand_job.meta" 2>/dev/null | tail -1 | cut -d= -f2)
+        if [[ "$cand_ne" =~ ^[0-9]+$ ]] && [[ "$cand_ne" -gt "$now_s" ]]; then continue; fi
        next="$cand"; break
      done < <(inbox_sorted)
      [[ -z "$next" ]] && break
@ -700,6 +978,11 @@ cmd_run() {
      w_cwd=$(fm_get "$doing_file" cwd "$PWD")
      w_yolo=$(fm_get "$doing_file" yolo "true")
      w_key=$(lock_key_for "$doing_file")
+      # Preserve the attempt counter across requeues (retry/orphan recovery set
+      # it before re-queuing); a fresh job starts at 1. Read BEFORE truncating
+      # the meta below so the count survives re-creation (§25.4 crash-safe).
+      local w_attempts; w_attempts=$(grep '^attempts=' "$STATE/$job.meta" 2>/dev/null | tail -1 | cut -d= -f2)
+      [[ "$w_attempts" =~ ^[0-9]+$ ]] && [[ "$w_attempts" -gt 0 ]] || w_attempts=1
      # write meta BEFORE launch (no pid yet), then append the worker pid from $!.
      # The new manifest fields (§5) are recorded here; only priority,
      # capabilities, engine-class and idempotency-key are functional this
@ -711,6 +994,7 @@ cmd_run() {
        echo "yolo=$w_yolo"
        echo "lock=$w_key"
        echo "started=$(date +%s)"
+        echo "attempts=$w_attempts"
        echo "priority=$(fm_get "$doing_file" priority medium)"
        echo "profile=$(fm_get "$doing_file" profile "")"
        echo "engine_class=$(fm_get "$doing_file" engine-class "")"
@ -785,6 +1069,7 @@ cmd_status() {
    [[ -n "$m_caps" ]] && extra+="caps=$m_caps "
    [[ -n "$m_trk" ]] && extra+="tracker=$m_trk "
    [[ -n "$extra" ]] && printf '      %s%s%s\n' "$C_DIM" "$extra" "$C_RESET"
+    printf '      %s%s%s\n' "$C_DIM" "$(_insights_line "$f")" "$C_RESET"
  done
  $printed || printf '  %sno workers running%s\n' "$C_DIM" "$C_RESET"
  echo
@ -795,6 +1080,68 @@ cmd_watch() {
  while true; do clear; cmd_status; sleep "$interval"; done
 }

+# recover — reclaim orphaned building/ jobs (dead worker) back to inbox/. Runs
+# automatically inside `run`; exposed for operators + crash-recovery testing.
+cmd_recover() { ensure_dirs; recover_orphans; log "orphan sweep complete"; }
+
+# insights [job] — per-job metrics, or a recent-jobs table + per-engine rollup.
+cmd_insights() {
+  ensure_dirs
+  local job="${1:-}"
+  if [[ -n "$job" ]]; then
+    local f="$STATE/$job.meta"
+    [[ -f "$f" ]] || f=$(ls -1t "$STATE"/*"$job"*.meta 2>/dev/null | head -1)
+    [[ -f "$f" ]] || die "no job meta matching '$job'"
+    local jn; jn=$(basename "$f"); jn=${jn%.meta}
+    printf '\n%s  INSIGHTS %s%s%s\n' "$C_BOLD" "$C_CYAN" "$jn" "$C_RESET"
+    local k val
+    for k in engine result attempts started ended duration_s exit verify_exit \
+             model tokens_in tokens_out tokens_cached cost_usd turns tool_calls usage_estimated \
+             files_changed lines_added lines_deleted wip_branch wip_base wip_commit \
+             next_eligible retry_class recovered; do
+      val=$(_meta_val "$f" "$k")
+      [[ -n "$val" ]] && printf '  %-15s %s\n' "$k" "$val"
+    done
+    echo
+    return 0
+  fi
+
+  printf '\n%s  INSIGHTS — recent finished jobs%s %s%s%s\n' \
+    "$C_BOLD" "$C_RESET" "$C_DIM" "$QUEUE_ROOT" "$C_RESET"
+  printf '  %-26s %-8s %-16s %6s %10s %9s\n' "job" "engine" "result" "dur" "tok(i/o)" "cost"
+  local f rows=0 agg; agg=$(mktemp "${TMPDIR:-/tmp}/aq-insights.XXXXXX")
+  while IFS= read -r f; do
+    [[ -n "$f" ]] || continue
+    grep -q '^ended=' "$f" || continue
+    local jn eng res dur ti to cost est
+    jn=$(basename "$f"); jn=${jn%.meta}
+    eng=$(_meta_val "$f" engine); res=$(_meta_val "$f" result); dur=$(_meta_val "$f" duration_s)
+    ti=$(_meta_val "$f" tokens_in); to=$(_meta_val "$f" tokens_out); cost=$(_meta_val "$f" cost_usd)
+    est=$(_meta_val "$f" usage_estimated)
+    rows=$((rows+1))
+    [[ $rows -le 15 ]] && printf '  %-26.26s %-8.8s %-16.16s %5ss %10s %9s\n' \
+      "$jn" "${eng:-?}" "${res:-?}" "${dur:-0}" "${ti:-0}/${to:-0}" "${cost:-–}"
+    local succ=0; _result_is_success "$res" && succ=1
+    printf '%s|%s|%s|%s|%s|%s|%s\n' "${eng:-?}" "${ti:-0}" "${to:-0}" "${cost:-0}" "${dur:-0}" "$succ" "$est" >> "$agg"
+  done < <(ls -1t "$STATE"/*.meta 2>/dev/null)
+  if [[ $rows -eq 0 ]]; then
+    printf '  %sno finished jobs yet%s\n\n' "$C_DIM" "$C_RESET"; rm -f "$agg"; return 0
+  fi
+  printf '\n%s  ROLLUP BY ENGINE%s\n' "$C_BOLD" "$C_RESET"
+  printf '  %-8s %5s %10s %10s %10s %8s\n' "engine" "jobs" "tok_in" "tok_out" "cost" "success"
+  local e
+  for e in $(cut -d'|' -f1 "$agg" | sort -u); do
+    awk -F'|' -v eng="$e" '
+      $1==eng { jobs++; ti+=$2; to+=$3; cost+=$4; succ+=$6; if ($7!="") est=1 }
+      END {
+        rate = jobs>0 ? (succ*100.0/jobs) : 0
+        printf "  %-8s %5d %10d %10d %9.4f%s %6.0f%%\n", eng, jobs, ti, to, cost, (est?"*":" "), rate
+      }' "$agg"
+  done
+  printf '  %s* total includes estimated token/cost values%s\n\n' "$C_DIM" "$C_RESET"
+  rm -f "$agg"
+}
+
 cmd_dash() {
  command -v node >/dev/null 2>&1 || die "node not found — use 'watch' for the bash status view"
  AGENT_QUEUE_ROOT="$QUEUE_ROOT" exec node "$SCRIPT_DIR/dashboard.mjs" "$@"
@ -948,8 +1295,10 @@ ${C_BOLD}COMMANDS${C_RESET}
       --engine devin|claude|codex   --cwd PATH   --yolo | --no-yolo
  run [--max N] [--engine E] [--once]
                               process inbox/ (foreground loop; Ctrl-C to stop)
-  status                       show kanban counts + running workers
+  status                       show kanban counts + running workers (+ insights)
  watch [interval]            live status (default 2s, bash)
+  insights [job]               per-job metrics, or recent table + per-engine rollup
+  recover                      reclaim orphaned building/ jobs (dead worker) -> inbox
  dash [--interval N]          richer live Node dashboard (recent shipped/failed too)
  stop                         kill running workers + the run loop
  logs <job> [-f]              print (or follow) a job's log
@ -978,10 +1327,15 @@ ${C_BOLD}TASK FRONTMATTER${C_RESET} (top of each .md)
  prefers-engine: [claude]    # optional order hint for engine-class resolution
  capabilities: [os:any, node>=20, has:git]  # hard host requirements; unmet -> failed (capability_mismatch)
  idempotency-key: my-task-1  # re-adding same key+body = no-op; same key+different body = reject/supersede
+  retry: { max: 2, backoff: 5m, on: [timeout, verify_failed, crash] }  # requeue on these classes up to max, then retries_exhausted
  # --- reserved (parsed + shown in status, but no-op until a later phase) ---
-  profile:  prefers:  budget:  deps:  deps-mode:  retry:  review-policy:  artifacts:  tracker-item:
+  profile:  prefers:  budget:  deps:  deps-mode:  review-policy:  artifacts:  tracker-item:
  ---

+${C_BOLD}RESILIENCE${C_RESET}  crash-safe: orphaned building/ jobs (dead worker) are recovered to inbox/ on
+            'run' startup; git-repo cwd work is checkpointed to branch aq/wip/<job> on every
+            exit (resumed on retry); 'retry' requeues failures with backoff. See 'insights'.
+
 ${C_BOLD}ENV${C_RESET}
  AGENT_QUEUE_ROOT (=$QUEUE_ROOT)   AGENT_QUEUE_MAX (=$MAX_CONCURRENCY)
  AGENT_QUEUE_ENGINE (=$DEFAULT_ENGINE)   AGENT_QUEUE_VERIFY (default verify cmd)
@ -997,6 +1351,8 @@ main() {
    run)    cmd_run "$@";;
    status) cmd_status "$@";;
    watch)  cmd_watch "$@";;
+    insights) cmd_insights "$@";;
+    recover) cmd_recover "$@";;
    dash|dashboard) cmd_dash "$@";;
    stop)   cmd_stop "$@";;
    logs)   cmd_logs "$@";;
--- a/agent-queue/dashboard.mjs
+++ b/agent-queue/dashboard.mjs
@ -72,6 +72,18 @@ const parseMeta = (file) => {
  return out;
 };

+// Compact per-job insights (read-only from meta; agent-queue.sh is the source of
+// truth). Surfaces tokens or cost + attempts + line deltas for finished jobs.
+const insightsTag = (m) => {
+  const parts = [];
+  if (m.attempts && m.attempts !== '1') parts.push(`x${m.attempts}`);
+  if (m.cost_usd) parts.push(`$${m.cost_usd}${m.usage_estimated ? '~' : ''}`);
+  else if (m.tokens_in || m.tokens_out) parts.push(`tok ${m.tokens_in || 0}/${m.tokens_out || 0}`);
+  if (m.lines_added || m.lines_deleted) parts.push(`+${m.lines_added || 0}/-${m.lines_deleted || 0}`);
+  if (m.duration_s) parts.push(`${m.duration_s}s`);
+  return parts.join(' ');
+};
+
 const pidAlive = (pid) => {
  if (!pid) return false;
  try { process.kill(Number(pid), 0); return true; } catch { return false; }
@ -307,7 +319,9 @@ function drawBoard() {
  } else {
    for (const m of recent) {
      const res = m.result || '';
-      const failedRes = res === 'failed' || res === 'timeout' || res === 'verify_failed' || res === 'rejected';
+      const failedRes = res === 'failed' || res === 'timeout' || res === 'verify_failed' ||
+        res === 'rejected' || res === 'retries_exhausted' || res === 'capability_mismatch' ||
+        res === 'no_engine';
      const mark = failedRes ? c('red', '✕') : c('green', '▣');
      const when = m.ended ? new Date(Number(m.ended) * 1000).toLocaleTimeString() : '';
      let label;
@ -317,12 +331,13 @@ function drawBoard() {
      else if (res === 'verify_failed') label = c('red', 'verify failed');
      else if (res === 'timeout') label = c('red', 'timeout');
      else if (res === 'rejected') label = c('red', 'rejected');
+      else if (res === 'retries_exhausted') label = c('red', 'retries exhausted');
      else if (res === 'failed') label = c('red', 'failed rc=' + (m.exit || '?'));
      else label = c('gray', res || '?');
      out.push(
        `    ${mark} ${trunc(m.job || '?', 34).padEnd(34)} ` +
        `${c('gray', (m.engine || '').padEnd(7))} ` +
-        `${label}  ${c('gray', when)}`
+        `${label}  ${c('gray', when)}  ${c('cyan', insightsTag(m))}`
      );
    }
  }
--- a/agent-queue/docs/GIGAFACTORY_ROADMAP.md
+++ b/agent-queue/docs/GIGAFACTORY_ROADMAP.md
@ -11,7 +11,7 @@
 | Phase | Theme | Status | % | Gate |
 | ----- | ----- | ------ | - | ---- |
 | **0** | Baseline (today) | ✅ shipped | 100% | `selftest.sh` green |
-| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 35% | adapter e2e + selftest |
+| **1** | Manifest + profiles + capabilities + tracker adapter (single host) | ◐ in progress | 55% | adapter e2e + selftest |
 | **2** | Coordinator as platform-service module + Cosmos + multi-factory leasing | ☐ not started | 0% | fleet e2e + module tests |
 | **3** | Fleet control plane in tracker-web + DAG deps + budgets + scoring router | ☐ not started | 0% | web e2e + router tests |
 | **4** | Message bus + autoscaling + cross-OS capability marketplace | ☐ not started | 0% | load/chaos suite |
@ -289,10 +289,10 @@ Three transports were evaluated. **Decision: platform-service-native coordinator
 - [ ] Canonical stages enforced server-side: `queued → assigned → building → review → testing → shipped` (+ `blocked`, `failed`, `dead_letter`); transitions validated (illegal transition → 409).
 - [ ] Per-profile default `verify`; per-job override; verify runs at the factory, result reported as an event.
 - [ ] Human gates: `review-policy` routes to reviewers; multi-reviewer support (P3).
- [ ] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped.
+- [x] **Dead-letter**: after `retry.max` exhausted, job → `dead_letter` with full diagnostics; never silently dropped. *(P1-S3 single-host stand-in: `failed/` `result=retries_exhausted`, WIP branch + full log preserved.)*
 - [ ] **Backpressure**: when no factory can take more, jobs stay `queued` (no thrash); SLA timers visible.
 - [ ] **Ship semantics** are profile-configurable (merged+green vs `pr-opened`, §10); `shipped` is terminal-success, `dead_letter` terminal-failure; `blocked` (unmet deps) is distinct from `queued`.
- [ ] **Retry vs idempotency**: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry.
+- [x] **Retry vs idempotency**: a retry creates a new `fleet_runs` attempt under the same job/`idempotency-key` (never a duplicate job); backoff honored; `retry.on` filters which failure classes retry. *(P1-S3 single-host: `attempts` counter survives requeue; `backoff`→`next_eligible` gates selection; `on` filters timeout/verify_failed/crash.)*
 - **Acceptance:** a perpetually-failing job lands in `dead_letter` after configured retries; a passing one auto-advances to `testing` then waits for human `ship`; an illegal transition is rejected.
 - **Verify gate:** lifecycle state-machine unit tests (all transitions + illegal-transition rejection + retry/dead-letter path).

@ -340,14 +340,16 @@ Each phase: **Goal → checklist → Exit criteria**. Don't start a phase until
 ### Phase 1 — Manifest + profiles + capabilities + tracker adapter (single host)
 **Goal:** richer single-host runner that understands profiles/capabilities and bridges to tracker — no distributed infra yet.

-> **Slice progress — P1-S1 (this commit):** manifest parsing (all §5 fields, defaulted + backward-compatible), `priority` ordering, capability detection+match gate, `engine-class` resolution, and `idempotency-key` dedupe are **done** on the bash runner. Profiles, `deps` DAG, `retry`/`budget.wall`, `allowed-scope`, the tracker adapter, and dashboard surfacing remain **for later slices**.
+> **Slice progress — P1-S1:** manifest parsing (all §5 fields, defaulted + backward-compatible), `priority` ordering, capability detection+match gate, `engine-class` resolution, and `idempotency-key` dedupe are **done** on the bash runner.
+>
+> **Slice progress — P1-S3 (resilience & insights, single host):** crash recovery (`recover_orphans` + `aq recover`), git WIP checkpoint/resume (`aq/wip/<job>`), functional `retry` policy (backoff + `retries_exhausted`), and execution insights (`parse_usage`, per-run metrics in meta, `aq insights`, `status`/`dash` insights) are **done** — see §11/§25/§26. Profiles, `deps` DAG, `budget.wall`, `allowed-scope`, and the tracker adapter remain **for later slices**.

 - [x] Extend `agent-queue.sh` frontmatter parsing for all new manifest fields (§5), defaulted + backward-compatible. *(P1-S1)*
 - [ ] Add `profiles/` directory + profile resolution (persona injection, default verify/caps/scope) (§6).
 - [x] Local capability detection + a job/factory capability match check before launch (§8 subset). *(P1-S1: `detect_capabilities` + `caps_match`; mismatch ⇒ `failed/` `result=capability_mismatch`, agent never launched.)*
 - [x] `priority` ordering in the inbox pick (replace pure FIFO with priority-then-age). *(P1-S1: `inbox_sorted`; per-lock serialization preserved.)*
 - [ ] `deps` (DAG) blocking on a single host; `idempotency-key` dedupe on `add`. *(P1-S1: `idempotency-key` dedupe DONE; `deps` DAG blocking still pending.)*
- [ ] `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`).
+- [ ] `retry` with backoff into `failed`/requeue; `budget.wall` enforced (extends `timeout`). *(P1-S3: `retry` with backoff + `retries_exhausted` DONE; `budget.wall` still pending.)*
 - [ ] `allowed-scope` guardrail (warn-only this phase) + post-run diff report.
 - [ ] **Tracker adapter** `aq from-tracker <ITEM>` + `aq to-tracker` event poster (§10 P1).
 - [ ] Dashboard shows profile + priority + capability tags + tracker-item link. *(P1-S1: `status` shows priority/profile/caps/tracker-item; Node `dash` surfacing pending.)*
@ -618,16 +620,16 @@ Composite work obeys the same SoT discipline as the core contract (§4 immutable
 - [ ] A factory only ever holds a **transient materialized copy** (temp prompt file) fetched from the API — losing the factory loses nothing. On the offline edge, the `.md` file on disk is the durable copy and reconciles on reconnect (§9).

 ### 25.2 Work-in-progress is preserved (checkpointing)
- [ ] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/<jobId>`** at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy).
- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it.
- [ ] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window.
+- [x] For a git-repo `cwd`, the worker commits **WIP to a dedicated branch `aq/wip/<jobId>`** at start and on every exit path (success, failure, timeout, signal) — partial work is never lost to a crash. Never commits to `main`/protected branches (§12 push policy). *(P1-S3: `_wip_start`/`_wip_checkpoint` + EXIT/INT/TERM trap; non-git cwd skipped.)*
+- [ ] `fleet_jobs.checkpoint` records the WIP branch + last commit so any worker can find it. *(P2 Cosmos; single-host records `wip_branch`/`wip_base`/`wip_commit` in `<job>.meta`.)*
+- [x] Long agents checkpoint periodically where the engine supports it; otherwise the start/exit commits bound the loss window. *(P1-S3: start + every-exit-path commits bound the loss window.)*

 ### 25.3 Recovery is automatic, resumable, and fenced
- [ ] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded.
- [ ] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/<jobId>` exists, the new worker **resumes from the checkpoint** instead of restarting from zero.
- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of *visible* outcomes.
- [ ] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics — never silently dropped.
- [ ] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery).
+- [x] **Orphan detection:** on coordinator/runner startup (and continuously), a job in `building/assigned` whose worker is dead (no live lease / dead pid) is an **orphan**; it is recovered, not stranded. *(P1-S3: `recover_orphans` on `run` startup + each loop, and `agent-queue.sh recover`; dead-pid + `pidstart` reuse guard.)*
+- [x] **Resume vs restart:** recovery starts a **new `fleet_runs` attempt**; if `aq/wip/<jobId>` exists, the new worker **resumes from the checkpoint** instead of restarting from zero. *(P1-S3: relaunch checks out `aq/wip/<job>`; `attempts` incremented.)*
+- [ ] **Fencing (§4):** the reclaimed run gets a higher `leaseEpoch`; the dead/zombie worker's late commits/ship reports are rejected — no double-execution of *visible* outcomes. *(P2 — distributed leasing; out of single-host scope.)*
+- [x] **Retry policy** (`retry.max/backoff/on`): agent `rc≠0` / `timeout` / `verify_failed` requeue with backoff up to `max`; on exhaustion → `dead_letter` (P2) / `failed` (P1 stand-in) with full diagnostics — never silently dropped. *(P1-S3 single-host.)*
+- [x] **State integrity:** all run state is **append-only / optimistic-concurrency guarded** (§13); recovery is idempotent (running it twice yields one recovery). *(P1-S3 single-host: meta is append-only + re-derivable from folder location; `_etag` guard is P2.)*

 ### 25.4 Crash taxonomy (all handled)
 | Failure | Detection | Recovery |
@ -646,11 +648,11 @@ Composite work obeys the same SoT discipline as the core contract (§4 immutable

 **Goal:** per-job/run visibility into **token usage, cost, model, latency, and tool activity** — to drive budgets (§5/§12), cost burndown (§17), and learned routing (§14 P5).

- [ ] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries.
- [ ] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` — same caveat as `budget.usd/tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction.
- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14).
- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present.
- [ ] **Privacy:** telemetry carries metrics + pointers only — **never prompt content or secrets** (redaction §12).
+- [x] **Per-run telemetry record** (in `fleet_runs`, streamed as `fleet_events`): engine, model, **tokensIn/Out (+cached)**, **cost USD** (`estimated:true` when not provider-reported), wall + CPU time, **turn count, tool-call counts**, verify pass/fail, **filesChanged, linesAdded/Deleted**, attempt number, retries. *(P1-S3 single-host: recorded in `<job>.meta` — `duration_s`, `files_changed`/`lines_added`/`lines_deleted`, tokens/cost/turns/tool_calls, `attempts`; CPU time not captured.)*
+- [x] **Token source (honest feasibility):** capture real usage where the engine/provider exposes it (Claude/Codex/OpenAI usage in responses; Devin session metrics); otherwise **estimate** from log heuristics and mark `estimated` — same caveat as `budget.usd/tokens` (§5). A single `parse_usage(engine, log)` adapter centralizes per-engine extraction. *(P1-S3: `parse_usage` adapter; generic `AQ_USAGE` line + Claude/Codex heuristics; Devin/Copilot TODO; `usage_estimated` flag, never fabricated.)*
+- [ ] **Aggregation/rollups:** per job, roadmap (§24), product, factory, engine, profile, and day. Powers cost burndown (§17) and the learned-routing eval (§14). *(P1-S3 partial: `aq insights` does per-job + per-engine rollup; product/factory/profile/day are P2/P3.)*
+- [ ] **Surfacing:** control-plane panels (tokens, cost, success/first-pass/human-edit rates) + a CLI insights summary at the edge; reuse the platform-service telemetry module where present. *(P1-S3 partial: edge CLI `aq insights` + `status`/`dash` insights line done; web control-plane panels are P3.)*
+- [x] **Privacy:** telemetry carries metrics + pointers only — **never prompt content or secrets** (redaction §12). *(P1-S3: insights/meta record only metrics; no prompt body or secrets added.)*
 - **Acceptance:** after a run, its `fleet_runs` carries token/cost/duration/tool/diff metrics (real where metered, flagged `estimated` otherwise); dashboards show per-engine and per-profile cost + token totals; a budget breach is detectable from telemetry alone.
 - **Verify gate:** telemetry unit tests (capture + rollup); a metered-engine run records real tokens; an unmetered run records estimated + flagged; aggregation totals verified.

--- a/agent-queue/selftest.sh
+++ b/agent-queue/selftest.sh
@ -230,4 +230,194 @@ cnt=$(find "$AGENT_QUEUE_ROOT/inbox" -maxdepth 1 -type f -name '*.md' 2>/dev/nul
 [ "$cnt" = "0" ] && pass "idempotency: a rejected add enqueues nothing" \
  || fail "idempotency: rejected add should not enqueue (inbox=$cnt)"

+# ─────────────────────────────────────────────────────────────────────
+# Phase 1 — Slice 3 cases (resilience & insights, single host).
+# Use temp git repos + stubs; never touches a real queue.
+# ─────────────────────────────────────────────────────────────────────
+metaval() { grep "^$2=" "$1" 2>/dev/null | tail -1 | cut -d= -f2-; }
+mkrepo() {
+  local d=$1; mkdir -p "$d"; git -C "$d" init -q
+  git -C "$d" config user.email t@t; git -C "$d" config user.name selftest
+  echo seed > "$d/seed.txt"; git -C "$d" add -A; git -C "$d" commit -q -m seed
+}
+
+# 12. orphan recovery: a building/ job whose worker pid is dead → `recover`
+# moves it to inbox/ with attempts incremented; a second recover is a no-op.
+export AGENT_QUEUE_ROOT="$tmp/queue-orphan"
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: devin' "cwd: $work" 'yolo: true' '---' '' '# orphan task' \
+  > "$AGENT_QUEUE_ROOT/building/orphanjob.md"
+# pid 1 is alive but pidstart is bogus → the PID-reuse guard marks it dead.
+printf '%s\n' 'job=orphanjob' 'engine=devin' "cwd=$work" 'started=1' 'attempts=1' 'pid=1' 'pidstart=NOPE' \
+  > "$AGENT_QUEUE_ROOT/.state/orphanjob.meta"
+"$AQ" recover >/dev/null 2>&1
+if [ -f "$AGENT_QUEUE_ROOT/inbox/orphanjob.md" ] && [ ! -f "$AGENT_QUEUE_ROOT/building/orphanjob.md" ]; then
+  pass "orphan recovery: dead-worker building/ job recovered to inbox/"
+else
+  ls -R "$AGENT_QUEUE_ROOT" >&2 || true; fail "orphan not recovered to inbox/"
+fi
+[ "$(metaval "$AGENT_QUEUE_ROOT/.state/orphanjob.meta" attempts)" = "2" ] \
+  && pass "orphan recovery: attempts incremented (1 -> 2)" \
+  || fail "orphan recovery: attempts not incremented (got $(metaval "$AGENT_QUEUE_ROOT/.state/orphanjob.meta" attempts))"
+"$AQ" recover >/dev/null 2>&1   # idempotent: nothing left in building/
+inbn=$(find "$AGENT_QUEUE_ROOT/inbox" -maxdepth 1 -name 'orphanjob.md' | wc -l | tr -d ' ')
+[ "$inbn" = "1" ] && [ "$(metaval "$AGENT_QUEUE_ROOT/.state/orphanjob.meta" attempts)" = "2" ] \
+  && pass "orphan recovery: idempotent (twice recovers once)" \
+  || fail "orphan recovery not idempotent (inbox=$inbn attempts=$(metaval "$AGENT_QUEUE_ROOT/.state/orphanjob.meta" attempts))"
+
+# 13. WIP checkpoint (git) + numstat: a git-repo cwd whose agent writes a 3-line
+# file → branch aq/wip/<job> has a commit with the change, main is untouched,
+# and lines_added is recorded.
+export AGENT_QUEUE_ROOT="$tmp/queue-wip"
+repo="$tmp/repo-wip"; mkrepo "$repo"
+mainbr=$(git -C "$repo" symbolic-ref --short HEAD)
+wipstub="$tmp/wip-engine"
+printf '#!/usr/bin/env bash\nprintf '"'"'a\\nb\\nc\\n'"'"' > created_by_agent.txt\nexit 0\n' > "$wipstub"
+chmod +x "$wipstub"
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: devin' "cwd: $repo" 'yolo: true' '---' '' '# wip task' \
+  > "$AGENT_QUEUE_ROOT/inbox/wipjob.md"
+DEVIN_BIN="$wipstub" "$AQ" run --once >/dev/null 2>&1
+# capture the log first (avoid `git log | grep -q` — under pipefail the early
+# grep -q exit SIGPIPEs git log and falsely fails the pipeline).
+wiplog=$(git -C "$repo" log --oneline aq/wip/wipjob 2>/dev/null || true)
+if git -C "$repo" show-ref --verify --quiet refs/heads/aq/wip/wipjob \
+   && [[ "$wiplog" == *"aq wip: wipjob"* ]] \
+   && git -C "$repo" show aq/wip/wipjob:created_by_agent.txt >/dev/null 2>&1; then
+  pass "wip checkpoint: aq/wip/wipjob has a commit with the agent's change"
+else
+  git -C "$repo" branch -a >&2 || true; fail "wip checkpoint branch/commit missing"
+fi
+if git -C "$repo" cat-file -e "$mainbr":created_by_agent.txt 2>/dev/null; then
+  fail "wip checkpoint: main branch was modified (must be untouched)"
+else
+  pass "wip checkpoint: main branch ($mainbr) untouched"
+fi
+[ "$(metaval "$AGENT_QUEUE_ROOT/.state/wipjob.meta" lines_added)" = "3" ] \
+  && pass "insights numstat: lines_added recorded (=3)" \
+  || fail "insights numstat: lines_added wrong (got $(metaval "$AGENT_QUEUE_ROOT/.state/wipjob.meta" lines_added))"
+
+# 13b. non-git cwd → WIP skipped cleanly (no error), job still completes.
+export AGENT_QUEUE_ROOT="$tmp/queue-nogit"
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: devin' "cwd: $work" 'yolo: true' '---' '' '# nogit task' \
+  > "$AGENT_QUEUE_ROOT/inbox/nogitjob.md"
+DEVIN_BIN="$stub" "$AQ" run --once >/dev/null 2>&1
+if ls "$AGENT_QUEUE_ROOT"/review/*.md >/dev/null 2>&1 \
+   && grep -q 'not a git repo' "$AGENT_QUEUE_ROOT/logs/nogitjob.log" 2>/dev/null; then
+  pass "wip checkpoint: non-git cwd skipped cleanly → review/"
+else
+  fail "non-git cwd run did not complete cleanly"
+fi
+
+# 14. WIP resume: an orphan whose aq/wip/<job> already has a prior commit →
+# the relaunch checks out that branch (agent sees HEAD on aq/wip/<job>).
+export AGENT_QUEUE_ROOT="$tmp/queue-resume"
+repo2="$tmp/repo-resume"; mkrepo "$repo2"
+mainbr2=$(git -C "$repo2" symbolic-ref --short HEAD)
+git -C "$repo2" checkout -q -b aq/wip/resumejob
+echo prior > "$repo2/prior.txt"; git -C "$repo2" add -A; git -C "$repo2" commit -q -m "aq wip: resumejob (prior)"
+git -C "$repo2" checkout -q "$mainbr2"
+resumeout="$tmp/resume-head.txt"; rm -f "$resumeout"
+resumestub="$tmp/resume-engine"
+printf '#!/usr/bin/env bash\ngit rev-parse --abbrev-ref HEAD > %q 2>/dev/null\nexit 0\n' "$resumeout" > "$resumestub"
+chmod +x "$resumestub"
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: devin' "cwd: $repo2" 'yolo: true' '---' '' '# resume task' \
+  > "$AGENT_QUEUE_ROOT/building/resumejob.md"
+printf '%s\n' 'job=resumejob' 'engine=devin' "cwd=$repo2" 'started=1' 'attempts=1' 'pid=1' 'pidstart=NOPE' \
+  > "$AGENT_QUEUE_ROOT/.state/resumejob.meta"
+DEVIN_BIN="$resumestub" "$AQ" run --once >/dev/null 2>&1
+if [ "$(cat "$resumeout" 2>/dev/null)" = "aq/wip/resumejob" ]; then
+  pass "wip resume: recovered job ran with HEAD on aq/wip/resumejob"
+else
+  echo "resume HEAD was: $(cat "$resumeout" 2>/dev/null)" >&2
+  fail "wip resume did not check out the existing WIP branch"
+fi
+
+# 15. retry on verify_failed: max=1 → requeued once (attempts=2) then failed/
+# result=retries_exhausted; a backoff (next_eligible) is recorded.
+export AGENT_QUEUE_ROOT="$tmp/queue-retry"
+export AGENT_QUEUE_POLL=1
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: devin' "cwd: $work" 'yolo: true' 'verify: false' \
+  'retry: { max: 1, backoff: 1s, on: [verify_failed] }' '---' '' '# retry task' \
+  > "$AGENT_QUEUE_ROOT/inbox/retryjob.md"
+DEVIN_BIN="$stub" "$AQ" run --once >/dev/null 2>&1
+if ls "$AGENT_QUEUE_ROOT"/failed/retryjob.md >/dev/null 2>&1 \
+   && [ "$(metaval "$AGENT_QUEUE_ROOT/.state/retryjob.meta" result)" = "retries_exhausted" ] \
+   && [ "$(metaval "$AGENT_QUEUE_ROOT/.state/retryjob.meta" attempts)" = "2" ]; then
+  pass "retry(verify_failed): requeued once (attempts=2) then retries_exhausted"
+else
+  fail "retry(verify_failed) wrong (result=$(metaval "$AGENT_QUEUE_ROOT/.state/retryjob.meta" result) attempts=$(metaval "$AGENT_QUEUE_ROOT/.state/retryjob.meta" attempts))"
+fi
+grep -q 'RETRY scheduled' "$AGENT_QUEUE_ROOT/logs/retryjob.log" 2>/dev/null \
+  && pass "retry: backoff RETRY scheduled (next_eligible honored)" \
+  || fail "retry: no RETRY scheduled line in log"
+
+# 16. retry on crash: rc!=0 with on=[crash] retries; without crash it does not.
+crashstub="$tmp/crash-engine"
+printf '#!/usr/bin/env bash\nexit 3\n' > "$crashstub"; chmod +x "$crashstub"
+export AGENT_QUEUE_ROOT="$tmp/queue-crash"
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: devin' "cwd: $work" 'yolo: true' \
+  'retry: { max: 1, backoff: 1s, on: [crash] }' '---' '' '# crash-retry task' \
+  > "$AGENT_QUEUE_ROOT/inbox/crashjob.md"
+DEVIN_BIN="$crashstub" "$AQ" run --once >/dev/null 2>&1
+[ "$(metaval "$AGENT_QUEUE_ROOT/.state/crashjob.meta" result)" = "retries_exhausted" ] \
+  && [ "$(metaval "$AGENT_QUEUE_ROOT/.state/crashjob.meta" attempts)" = "2" ] \
+  && pass "retry(crash): rc!=0 with on=[crash] retried then retries_exhausted (attempts=2)" \
+  || fail "retry(crash) wrong (result=$(metaval "$AGENT_QUEUE_ROOT/.state/crashjob.meta" result) attempts=$(metaval "$AGENT_QUEUE_ROOT/.state/crashjob.meta" attempts))"
+export AGENT_QUEUE_ROOT="$tmp/queue-nocrash"
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: devin' "cwd: $work" 'yolo: true' \
+  'retry: { max: 1, backoff: 1s, on: [verify_failed] }' '---' '' '# crash-no-retry task' \
+  > "$AGENT_QUEUE_ROOT/inbox/nocrashjob.md"
+DEVIN_BIN="$crashstub" "$AQ" run --once >/dev/null 2>&1
+[ "$(metaval "$AGENT_QUEUE_ROOT/.state/nocrashjob.meta" result)" = "failed" ] \
+  && [ "$(metaval "$AGENT_QUEUE_ROOT/.state/nocrashjob.meta" attempts)" = "1" ] \
+  && pass "retry(crash): crash not in on -> straight to failed/ (no retry)" \
+  || fail "retry(crash) should not retry when crash not in on (result=$(metaval "$AGENT_QUEUE_ROOT/.state/nocrashjob.meta" result) attempts=$(metaval "$AGENT_QUEUE_ROOT/.state/nocrashjob.meta" attempts))"
+unset AGENT_QUEUE_POLL
+
+# 17. insights parse: a stub log with a usage line → parse_usage records tokens/
+# cost into meta; `insights <job>` prints them; a no-usage log doesn't crash.
+export AGENT_QUEUE_ROOT="$tmp/queue-usage"
+usagestub="$tmp/usage-engine"
+printf '#!/usr/bin/env bash\necho "AQ_USAGE model=claude-test tokens_in=100 tokens_out=50 cost_usd=0.0021 turns=3 tool_calls=5"\nexit 0\n' > "$usagestub"
+chmod +x "$usagestub"
+"$AQ" init >/dev/null
+printf '%s\n' '---' 'engine: claude' "cwd: $work" 'yolo: true' '---' '' '# usage task' \
+  > "$AGENT_QUEUE_ROOT/inbox/usagejob.md"
+CLAUDE_BIN="$usagestub" "$AQ" run --once >/dev/null 2>&1
+if [ "$(metaval "$AGENT_QUEUE_ROOT/.state/usagejob.meta" tokens_in)" = "100" ] \
+   && [ "$(metaval "$AGENT_QUEUE_ROOT/.state/usagejob.meta" cost_usd)" = "0.0021" ]; then
+  pass "insights parse_usage: tokens/cost extracted into meta"
+else
+  fail "parse_usage did not record tokens/cost (tokens_in=$(metaval "$AGENT_QUEUE_ROOT/.state/usagejob.meta" tokens_in))"
+fi
+ins=$("$AQ" insights usagejob 2>/dev/null || true)
+if [[ "$ins" == *tokens_in* && "$ins" == *0.0021* ]]; then
+  pass "insights <job>: prints per-job metrics"
+else
+  fail "insights <job> did not print metrics"
+fi
+printf '%s\n' '---' 'engine: claude' "cwd: $work" 'yolo: true' '---' '' '# no-usage task' \
+  > "$AGENT_QUEUE_ROOT/inbox/nousagejob.md"
+CLAUDE_BIN="$stub" "$AQ" run --once >/dev/null 2>&1
+if "$AQ" insights nousagejob >/dev/null 2>&1 \
+   && [ -z "$(metaval "$AGENT_QUEUE_ROOT/.state/nousagejob.meta" tokens_in)" ]; then
+  pass "insights: no-usage log omits token fields without crashing"
+else
+  fail "insights crashed or fabricated tokens for a no-usage log"
+fi
+
+# 18. insights aggregate: two finished jobs → per-engine rollup with totals + rate.
+out=$("$AQ" insights 2>/dev/null || true)
+if [[ "$out" == *"ROLLUP BY ENGINE"* ]] && grep -qE 'claude .* 100 .* 50' <<<"$out"; then
+  pass "insights aggregate: per-engine rollup with token totals"
+else
+  printf '%s\n' "$out" >&2; fail "insights aggregate rollup missing/incorrect"
+fi
+
 echo "self-test PASS"