bytelyst-devops-tools/agent-queue/docs/jobs/phase1-slice3.md
Saravanakumar D 257efcb4bc docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move GIGAFACTORY_ROADMAP.md and GIGAFACTORY_SYSTEM_OVERVIEW.md under
agent-queue/docs/gigafactory/ so the scattered top-level docs are easy to
discover. Update the README links, the overview code-map, and all phase
job-spec source-of-truth paths to the new location. Pure docs move; no
behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:01:23 -07:00


engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
yolo: true
lock: devops-tools
timeout: 4h

ROLE: Senior engineer. Implement Phase 1 — Slice 3: RESILIENCE & INSIGHTS (single host). This is a LARGE, fully self-contained slice (git + log parsing only — NO network, NO external service, NO credentials) so it runs end-to-end without blockers.

SOURCE OF TRUTH: agent-queue/docs/gigafactory/GIGAFACTORY_ROADMAP.md (read §11 lifecycle/retry, §25 durability/crash-recovery, §26 execution insights, §17 observability, §14 Phase 1). Implement the SINGLE-HOST bash equivalents of §25 and §26.

PREREQUISITE / BRANCHING:

  • Builds on Slice 1 (PR #1, branch feat/gigafactory-p1-slice1).
  • Base on main IF PR #1 (and PR #2 if present) are merged; otherwise branch off feat/gigafactory-p1-slice1. Do NOT revert or duplicate earlier slice code.
  • This slice is INDEPENDENT of Slice 2 (profiles/deps) — do not depend on it.
  • New branch: feat/gigafactory-p1-slice3. Commit in logical steps, push, open a PR. DO NOT merge (human gate).

STRICT SCOPE:

  • Edit ONLY under agent-queue/ (agent-queue.sh, selftest.sh, README.md, docs/gigafactory/GIGAFACTORY_ROADMAP.md). No other repo.
  • DO NOT modify/delete anything under agent-queue/queue/ (live jobs). DO NOT run agent-queue.sh run against the real queue. selftest.sh uses its own temp AGENT_QUEUE_ROOT and temp git repos only.
  • bash, single host, macOS + Linux safe, zero new runtime deps.

==================================================================
A. CRASH RECOVERY & WORK PRESERVATION (single-host §25)

A1. ORPHAN RECOVERY: On run startup (and at the top of each run loop), detect jobs stuck in building/ whose worker is no longer alive: the meta has a pid= whose process is dead (a pidstart mismatch also counts as dead, which guards against PID reuse), or there is no live pid at all. Such a job is an ORPHAN from a previous crash/power-off. Recover it deterministically (never lose or strand it):

  • increment an attempts= counter in the meta,
  • log a clear recovery line,
  • move it back to inbox/ for re-selection (subject to retry policy A3),
  • recovery MUST be idempotent (running it twice recovers once).
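A minimal sketch of the A1 recovery pass, assuming one `<job>.meta` file per job with append-only `key=value` lines and sibling job files sharing the `<job>.` prefix. The helper names (`meta_get`, `worker_alive`, `recover_orphans`) and the file layout are illustrative assumptions, not the existing agent-queue.sh functions.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the <job>.meta layout are assumptions.

# meta_get <meta> <key>: print the LAST value recorded for key (last write wins).
meta_get() {
  awk -F= -v k="$2" '$1 == k { v = substr($0, length(k) + 2) } END { print v }' "$1"
}

# pid_start <pid>: process start time as reported by ps (empty if the pid is gone).
pid_start() {
  ps -o lstart= -p "$1" 2>/dev/null | sed 's/^ *//; s/ *$//'
}

# worker_alive <meta>: true only if pid= is running and, when pidstart= was
# recorded, the start time still matches (guards against PID reuse).
worker_alive() {
  local meta=$1 pid want
  pid=$(meta_get "$meta" pid)
  [ -n "$pid" ] || return 1
  kill -0 "$pid" 2>/dev/null || return 1
  want=$(meta_get "$meta" pidstart)
  [ -z "$want" ] || [ "$(pid_start "$pid")" = "$want" ]
}

# recover_orphans <queue-root>: move every dead building/ job back to inbox/.
recover_orphans() {
  local root=$1 meta job attempts
  for meta in "$root"/building/*.meta; do
    [ -e "$meta" ] || continue            # glob matched nothing
    if worker_alive "$meta"; then
      continue
    fi
    job=$(basename "$meta" .meta)
    attempts=$(meta_get "$meta" attempts)
    printf 'attempts=%s\n' "$(( ${attempts:-0} + 1 ))" >> "$meta"   # append-only
    echo "recover: orphan $job -> inbox/ (worker not alive)" >&2
    mv "$root/building/$job".* "$root/inbox/"
  done
}
```

A second call finds nothing left in building/ for the recovered job, which is what makes the pass idempotent in this sketch. Note that `kill -0` also fails for a live process owned by another user, so the check assumes the queue and its workers run as the same user.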

A2. WIP CHECKPOINTING (work preservation): when a job's cwd is inside a git repo, the worker preserves partial work on a dedicated branch so a crash never loses it:

  • at START: ensure/create branch aq/wip/<job> (from current HEAD), record wip_branch= + wip_base= in meta. NEVER touch main/protected branches.
  • on EVERY exit path (success, failure, timeout, signal/trap): commit any changes in cwd to aq/wip/<job> with a message like "aq wip: (<stage/exit>)" and record wip_commit= in meta.
  • use a trap so even SIGTERM/SIGINT/timeout still checkpoints.
  • if cwd is NOT a git repo: skip cleanly (log "wip: cwd not a git repo").

RESUME: when an orphan/retry of a job whose aq/wip/<job> branch exists is relaunched, check out / fast-forward that branch first so the agent continues from the checkpoint instead of from zero. Document the resume behavior.
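One way the checkpoint-on-every-exit requirement could be structured is an EXIT trap inside a subshell. The sketch below is an illustration under assumptions: `wip_start`, `wip_checkpoint`, and `run_job` are hypothetical names, the commit message includes the job id as an assumed format, and meta bookkeeping (wip_branch=, wip_base=, wip_commit=) is left out for brevity.

```bash
#!/usr/bin/env bash
# Sketch only: wip_start, wip_checkpoint and run_job are hypothetical names.

in_git_repo() {
  git -C "$1" rev-parse --is-inside-work-tree >/dev/null 2>&1
}

# wip_start <cwd> <job>: switch to aq/wip/<job>, creating it from HEAD if missing.
wip_start() {
  local cwd=$1 branch="aq/wip/$2"
  if ! in_git_repo "$cwd"; then
    echo "wip: cwd not a git repo" >&2
    return 0
  fi
  if git -C "$cwd" show-ref --verify --quiet "refs/heads/$branch"; then
    git -C "$cwd" checkout -q "$branch"      # resume from the earlier checkpoint
  else
    git -C "$cwd" checkout -q -b "$branch"   # first attempt: branch from HEAD
  fi
}

# wip_checkpoint <cwd> <job> <stage>: commit pending changes to the WIP branch.
wip_checkpoint() {
  local cwd=$1 job=$2 stage=$3 branch="aq/wip/$2"
  in_git_repo "$cwd" || return 0
  # Refuse to commit anywhere but the WIP branch, so main is never touched.
  if [ "$(git -C "$cwd" symbolic-ref --short -q HEAD)" != "$branch" ]; then
    echo "wip: not on $branch, skipping checkpoint" >&2
    return 0
  fi
  git -C "$cwd" add -A
  if git -C "$cwd" diff --cached --quiet; then
    return 0                                 # nothing to checkpoint
  fi
  git -C "$cwd" commit -q -m "aq wip: $job ($stage)"
}

# run_job <cwd> <job> <command...>: the ( ) body runs in a subshell, so the
# EXIT trap fires on every way out, including the exits forced by INT/TERM.
run_job() (
  cwd=$1 job=$2
  shift 2
  wip_start "$cwd" "$job"
  trap 'rc=$?; wip_checkpoint "$cwd" "$job" "exit=$rc"; exit "$rc"' EXIT
  trap 'exit 130' INT
  trap 'exit 143' TERM
  "$@"
)
```

Usage would look like `run_job /path/to/repo my-job some-agent --flag`. One caveat: bash runs a signal trap only after the current foreground command returns, so a real worker would likely start the agent in the background and `wait` on it to react to SIGTERM promptly.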

A3. RETRY POLICY (make the reserved retry field FUNCTIONAL): parse retry: { max: N, backoff: 5m, on: [timeout, verify_failed, crash] }. On a failure whose class is in on (agent rc!=0 => crash/agent_error, timeout => timeout, verify fail => verify_failed), requeue to inbox/ with the backoff delay honored (record next_eligible= epoch; selection skips until then) up to max attempts. On exhaustion → failed/ with result=retries_exhausted (single-host stand-in for dead_letter), preserving the wip branch + full diagnostics in the log. Default when retry absent = no retry (current behavior).
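The retry decision can be sketched as a pure function, which keeps it easy to test. This is an illustration under assumptions: `to_seconds` and `retry_decision` are hypothetical names, the `on` list is passed as a comma-separated string without spaces, and `attempts` counts runs finished so far, so `max: 1` allows one requeue (matching the attempts=2 expectation in the retry test case).

```bash
#!/usr/bin/env bash
# Sketch only: function names and argument order are assumptions.

# to_seconds <duration>: "90", "30s", "5m", "2h" -> seconds; fails on anything else.
to_seconds() {
  local d=$1 n=${1%[smh]}
  case $n in
    ''|*[!0-9]*) return 1 ;;
  esac
  case $d in
    *h) echo $(( 10#$n * 3600 )) ;;
    *m) echo $(( 10#$n * 60 )) ;;
    *)  echo $(( 10#$n )) ;;
  esac
}

# retry_decision <class> <attempts> <max> <on-csv> <backoff> <now-epoch>
# Prints one of:
#   requeue next_eligible=<epoch>
#   fail result=retries_exhausted
#   fail                              (class not retryable, or no retry configured)
retry_decision() {
  local class=$1 attempts=$2 max=$3 on=$4 backoff=$5 now=$6 delay
  case ",$on," in
    *",$class,"*) ;;
    *) echo "fail"; return 0 ;;
  esac
  if [ "$max" -le 0 ]; then
    echo "fail"
    return 0
  fi
  if [ "$attempts" -le "$max" ]; then
    delay=$(to_seconds "$backoff") || return 1
    echo "requeue next_eligible=$(( now + delay ))"
  else
    echo "fail result=retries_exhausted"
  fi
}
```

For example, `retry_decision timeout 1 2 "timeout,crash" 5m "$(date +%s)"` prints a requeue line with next_eligible five minutes out; selection would then skip the job until that epoch has passed.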

A4. STATE INTEGRITY: keep all meta writes append-only (as today); never truncate a live meta. Recovery/retry/backoff bookkeeping must be crash-safe (re-derivable from meta + folder location).

==================================================================
B. EXECUTION INSIGHTS & TOKEN ACCOUNTING (single-host §26)

B1. PER-RUN METRICS: on completion, record into the job meta: duration_s, exit, result, attempts, and repo deltas for the run — files_changed, lines_added, lines_deleted (from git -C <cwd> diff --numstat against wip_base, or against HEAD~ if applicable).

B2. TOKEN/COST CAPTURE (best-effort, honest): add a single extensible adapter parse_usage <engine> <logfile> that extracts, when present in the engine's output: model, tokens_in, tokens_out, tokens_cached, cost_usd, turns, tool_calls. Where the engine does not expose usage, omit the field or set an estimated=true marker — DO NOT fabricate precise numbers. Centralize all per-engine patterns in this one function (devin/claude/codex/copilot stubs; real patterns where known, TODO-commented otherwise).

B3. SURFACE in status: add an insights sub-line per finished/running job (duration, attempts, tokens/cost if known, +/- lines).

B4. NEW COMMAND aq insights [job]:

  • with a job id: print that job's full metrics.
  • without: print a table of recent finished jobs + an AGGREGATE rollup by engine (total tokens, total cost (mark if any estimated), job count, success rate, avg duration).

B5. dashboard.mjs: surface a compact insights column/panel (tokens or cost + attempts) for finished jobs. Keep it read-only from meta (agent-queue.sh stays the single source of truth).

B6. PRIVACY: never write prompt content or secrets into meta/insights/logs beyond what already exists.
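A possible shape for the B2 adapter is sketched below. The `stub` engine and its `usage: key=value ...` log line are a hypothetical format invented here for selftest purposes; they are not the output of any real engine. The real engines are left as TODO branches that emit nothing, in line with the "do not fabricate" rule.

```bash
#!/usr/bin/env bash
# Sketch only: the "stub" engine and its log format are assumptions for testing.

# parse_usage <engine> <logfile>: print whitelisted key=value usage fields,
# one per line, or nothing at all when the log carries no usage data.
parse_usage() {
  local engine=$1 log=$2
  [ -r "$log" ] || return 0
  case $engine in
    stub)
      # Hypothetical line: "usage: model=m1 tokens_in=120 tokens_out=30 ..."
      # Only the last usage line counts; unknown keys are dropped (privacy).
      grep '^usage: ' "$log" | tail -n 1 | tr ' ' '\n' |
        grep -E '^(model|tokens_in|tokens_out|tokens_cached|cost_usd|turns|tool_calls)=' ||
        true
      ;;
    devin|claude|codex|copilot)
      : # TODO: real patterns not known here; emit nothing rather than guess
      ;;
  esac
  return 0
}
```

A caller could append the output directly to the job meta, for example `parse_usage "$engine" "$log" >> "$meta"`, where `$engine`, `$log`, and `$meta` are placeholders for whatever the worker already tracks. Filtering against a fixed key whitelist keeps stray log content out of meta.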

==================================================================
TESTS (selftest.sh — tests are sacred; only ADD; use temp git repos + stubs)

  • orphan recovery: craft a building/ job whose meta pid is a dead PID → a run startup recovers it to inbox/ with attempts incremented; running recovery twice recovers exactly once.
  • wip checkpoint (git): job with a git-repo cwd that creates a file → after the run, branch aq/wip/<job> exists and contains a commit with the change; main branch untouched. Non-git cwd → skipped cleanly (no error).
  • wip resume: a recovered job whose aq/wip/<job> has a prior commit → the relaunch checks out that branch (assert HEAD is on aq/wip/<job> when the agent runs).
  • retry policy: verify-fail job with retry.max=1 on=[verify_failed] → requeued once (attempts=2) then → failed/ result=retries_exhausted; backoff next_eligible respected (job not picked before its delay — use a tiny backoff like 1s).
  • retry on crash: agent rc!=0 with on=[crash] retries; without crash in on, it goes straight to failed/ (no retry).
  • insights parse: feed a stub engine log containing a known usage line → parse_usage extracts tokens/cost into meta; aq insights <job> prints them; a no-usage log → fields omitted/estimated, no crash.
  • insights aggregate: two finished jobs → aq insights prints a per-engine rollup with correct totals + success rate.
  • numstat deltas: a run that adds N lines → lines_added recorded.
  • REGRESSION: all existing selftest cases (Slice 0 + Slice 1) still green.
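For the numstat deltas case, the per-run totals can be derived by summing `git diff --numstat` output. A small sketch, assuming hypothetical helper names `sum_numstat` and `numstat_totals`; binary files, which numstat reports as `-`, count toward files_changed but contribute no lines.

```bash
#!/usr/bin/env bash
# Sketch only: sum_numstat and numstat_totals are hypothetical names.

# sum_numstat: read "added<TAB>deleted<TAB>path" lines on stdin, print meta fields.
sum_numstat() {
  awk '
    { files++ }
    $1 != "-" { added += $1 }
    $2 != "-" { deleted += $2 }
    END {
      printf "files_changed=%d\nlines_added=%d\nlines_deleted=%d\n", files, added, deleted
    }'
}

# numstat_totals <cwd> <base>: committed changes between <base> and HEAD.
numstat_totals() {
  git -C "$1" diff --numstat "$2" HEAD | sum_numstat
}
```

Comparing `<base>` against HEAD only sees committed changes, so this assumes the WIP checkpoint commit has already run; otherwise uncommitted and untracked files would be missed.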

==================================================================
DOCS

  • README: new "Resilience" section (orphan recovery, WIP checkpoint/resume, retry) and "Insights" section (metrics, aq insights, token caveat) + document the retry frontmatter (now active) and the new result= values (retries_exhausted). Update the manifest table: move retry from RESERVED to ACTIVE.
  • docs/gigafactory/GIGAFACTORY_ROADMAP.md: tick the single-host items you fully completed in §11 (retry/dead-letter stand-in), §25 (orphan/WIP/retry — note "single-host subset"), §26 (capture/insights — single-host subset); bump §0 Phase 1 %.

==================================================================
CONSTRAINTS

  • bash style consistent with the existing script; no new runtime deps; mac+linux safe (no GNU-only flags without a fallback — note macOS has BSD date/stat); no emojis in code; no leftover debug noise; conventional commits.
  • Be careful with set -euo pipefail + traps so the WIP-on-exit checkpoint always runs even on failure/timeout.
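The BSD date/stat note above usually comes down to trying the GNU form first and falling back to the BSD form. A minimal sketch with hypothetical helper names (`file_mtime`, `epoch_to_iso`); the GNU form goes first because on Linux `stat -f` and `date -r` exist but mean something different (filesystem status and a file's timestamp, respectively).

```bash
#!/usr/bin/env bash
# Sketch only: file_mtime and epoch_to_iso are hypothetical helper names.

# file_mtime <path>: modification time in epoch seconds.
file_mtime() {
  stat -c %Y "$1" 2>/dev/null ||   # GNU coreutils
    stat -f %m "$1"                # BSD / macOS
}

# epoch_to_iso <epoch>: UTC timestamp such as 1970-01-01T00:00:00Z.
epoch_to_iso() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||   # GNU date
    date -u -r "$1" +%Y-%m-%dT%H:%M:%SZ                 # BSD / macOS date
}
```

`date +%s` itself is available on both platforms, so values like next_eligible= can be computed without a fallback; only formatting and stat need the two-step form.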

VERIFY GATE (must pass before finishing):

  • bash agent-queue/selftest.sh → fully green (existing + all new cases).
  • bash -n agent-queue/agent-queue.sh ; node --check agent-queue/dashboard.mjs.
  • shellcheck --severity=error agent-queue/agent-queue.sh (if available) → clean.

FINAL OUTPUT — print the implementation report in EXACTLY this format:

Implementation Report — Phase 1 Slice 3

Branch & commits

  • branch / based-on: (based on main | feat/gigafactory-p1-slice1)
  • commits: (one per line)
  • PR: <url or "opened, not merged">

Files changed

  • :

What was implemented (A1-A4, B1-B6)

  • : <how, key functions added/changed>

Tests added

  • : (plus selftest.sh PASS/FAIL summary)

Verify gate results

  • selftest.sh: <PASS/FAIL + counts>
  • bash -n / node --check / shellcheck:

Deviations / assumptions

  • <anything changed from spec and why; which engines have real token parsing vs TODO>

Suggested next slice

  • <what should come next (likely: tracker adapter aq from-tracker/to-tracker, P2)>