bytelyst-devops-tools/agent-queue/docs/jobs/phase1-slice3.md
Saravanakumar D 237481247e docs(gigafactory): uppercase GIGAFACTORY folder + add index README
Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every
reference (README, system-overview code-map, and all phase job specs). Add an
index README that lists the docs and points to the companion docs in
learning_ai_common_plat. Docs-only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:21:31 -07:00

---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
yolo: true
lock: devops-tools
timeout: 4h
---
ROLE: Senior engineer. Implement Phase 1 — Slice 3: RESILIENCE & INSIGHTS (single host).
This is a LARGE, fully self-contained slice (git + log parsing only — NO network,
NO external service, NO credentials) so it runs end-to-end without blockers.
SOURCE OF TRUTH: agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md (read §11 lifecycle/retry,
§25 durability/crash-recovery, §26 execution insights, §17 observability, §14 Phase 1).
Implement the SINGLE-HOST bash equivalents of §25 and §26.
PREREQUISITE / BRANCHING:
- Builds on Slice 1 (PR #1, branch feat/gigafactory-p1-slice1).
- Base on `main` IF PR #1 (and PR #2 if present) are merged; otherwise branch off
feat/gigafactory-p1-slice1. Do NOT revert or duplicate earlier slice code.
- This slice is INDEPENDENT of Slice 2 (profiles/deps) — do not depend on it.
- New branch: feat/gigafactory-p1-slice3. Commit in logical steps, push, open a PR.
DO NOT merge (human gate).
STRICT SCOPE:
- Edit ONLY under agent-queue/ (agent-queue.sh, selftest.sh, dashboard.mjs, README.md,
docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md). No other repo.
- DO NOT modify/delete anything under agent-queue/queue/ (live jobs). DO NOT run
`agent-queue.sh run` against the real queue. selftest.sh uses its own temp
AGENT_QUEUE_ROOT and temp git repos only.
- bash, single host, macOS + Linux safe, zero new runtime deps.
==================================================================
A. CRASH RECOVERY & WORK PRESERVATION (single-host §25)
==================================================================
A1. ORPHAN RECOVERY: On `run` startup (and at the top of each run loop), detect
jobs stuck in building/ whose worker is no longer alive: the meta has no `pid=`,
the `pid=` process is dead, or the live process's start time does not match the
recorded `pidstart=` (this guards against PID reuse). Such a job is an ORPHAN
from a previous crash/power-off.
Recover it deterministically (never lose or strand it):
- increment an `attempts=` counter in the meta,
- log a clear recovery line,
- move it back to inbox/ for re-selection (subject to retry policy A3),
- recovery MUST be idempotent (running it twice recovers once).
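
The liveness check and idempotent requeue in A1 could be sketched roughly as below. This is a minimal illustration, not the script's actual code: the helper names (`meta_get`, `pid_start`, `worker_alive`, `recover_orphan`) and the on-disk layout (`building/<job>.md` with a sibling `<job>.meta` of append-only `key=value` lines) are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed layout: <root>/building/<job>.md with a sibling <job>.meta
# holding append-only key=value lines (last value wins).

# Print the most recent value recorded for a key.
meta_get() {
  awk -F= -v k="$2" '$1 == k { v = substr($0, length(k) + 2) } END { if (v != "") print v }' "$1"
}

# Start time of a process as reported by ps (same flags on BSD/macOS and procps).
pid_start() {
  ps -o lstart= -p "$1" 2>/dev/null | sed 's/^ *//; s/ *$//'
}

# Succeeds only when the recorded worker is alive and is the same process.
worker_alive() {
  local meta=$1 pid recorded
  pid=$(meta_get "$meta" pid)
  if [ -z "$pid" ]; then return 1; fi
  if ! kill -0 "$pid" 2>/dev/null; then return 1; fi
  recorded=$(meta_get "$meta" pidstart)
  if [ -z "$recorded" ]; then return 0; fi     # nothing to compare against
  [ "$recorded" = "$(pid_start "$pid")" ]        # mismatch means the PID was reused
}

# Requeue one orphan. A second call is a no-op because the job left building/.
recover_orphan() {
  local root=$1 job=$2 meta attempts
  meta="$root/building/$job.meta"
  if [ ! -f "$root/building/$job.md" ]; then return 0; fi
  if worker_alive "$meta"; then return 0; fi
  attempts=$(meta_get "$meta" attempts)
  attempts=$(( ${attempts:-0} + 1 ))
  printf 'attempts=%s\n' "$attempts" >> "$meta"
  mkdir -p "$root/inbox"
  mv "$meta" "$root/inbox/$job.meta"
  mv "$root/building/$job.md" "$root/inbox/$job.md"
  echo "recover: $job was orphaned in building/, requeued (attempt $attempts)"
}
```

A crash between the append and the two `mv` calls could still double-count an attempt, so a real implementation would need to derive the counter more carefully. `kill -0` also reports failure for a live process owned by another user, which is acceptable on a single-user host.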
A2. WIP CHECKPOINTING (work preservation): when a job's `cwd` is inside a git repo,
the worker preserves partial work on a dedicated branch so a crash never loses it:
- at START: ensure/create branch `aq/wip/<job>` (from current HEAD), record
`wip_branch=` + `wip_base=` in meta. NEVER touch main/protected branches.
- on EVERY exit path (success, failure, timeout, signal/trap): commit any
changes in cwd to `aq/wip/<job>` with a message like
"aq wip: <job> (<stage/exit>)" and record `wip_commit=` in meta.
- use a trap so even SIGTERM/SIGINT/timeout still checkpoints.
- if cwd is NOT a git repo: skip cleanly (log "wip: cwd not a git repo").
RESUME: when an orphan/retry of a job whose `aq/wip/<job>` branch exists is
relaunched, check out / fast-forward that branch first so the agent continues
from the checkpoint instead of from zero. Document the resume behavior.
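
One possible shape for the A2 checkpoint and resume logic is sketched below. The function names (`wip_start`, `wip_checkpoint`, `run_with_checkpoint`), the `WIP_CWD`/`WIP_JOB` variables, and the fixed commit identity are all hypothetical choices for illustration; the real script may structure this differently.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not the script's real ones.

in_git_repo() { git -C "$1" rev-parse --is-inside-work-tree >/dev/null 2>&1; }

# Create aq/wip/<job> from HEAD, or switch to it when it already exists (resume).
# Prints meta lines on stdout; log messages go to stderr.
wip_start() {
  local cwd=$1 branch="aq/wip/$2"
  if ! in_git_repo "$cwd"; then
    echo "wip: cwd not a git repo" >&2
    return 0
  fi
  if git -C "$cwd" show-ref --verify --quiet "refs/heads/$branch"; then
    git -C "$cwd" checkout --quiet "$branch"
  else
    echo "wip_base=$(git -C "$cwd" rev-parse HEAD)"
    git -C "$cwd" checkout --quiet -b "$branch"
  fi
  echo "wip_branch=$branch"
}

# Commit whatever changed under cwd, but only while on the job's own WIP branch.
wip_checkpoint() {
  local cwd=$1 job=$2 stage=$3
  in_git_repo "$cwd" || return 0
  [ "$(git -C "$cwd" symbolic-ref --short HEAD 2>/dev/null)" = "aq/wip/$job" ] || return 0
  git -C "$cwd" add -A -- .
  if git -C "$cwd" diff --cached --quiet; then return 0; fi   # nothing to commit
  # Assumption: a fixed identity so the commit works without user git config.
  git -C "$cwd" -c user.name=agent-queue -c user.email=agent-queue@localhost \
    commit --quiet -m "aq wip: $job ($stage)"
  echo "wip_commit=$(git -C "$cwd" rev-parse HEAD)"
}

# Run a command in a subshell whose EXIT trap always checkpoints.
run_with_checkpoint() {
  local cwd=$1 job=$2 rc=0
  shift 2
  (
    WIP_CWD=$cwd WIP_JOB=$job
    trap 'exit 130' INT       # turn signals into a normal exit so EXIT fires
    trap 'exit 143' TERM
    trap 'wip_checkpoint "$WIP_CWD" "$WIP_JOB" "exit=$?"' EXIT
    "$@"
    exit "$?"
  ) || rc=$?
  return "$rc"
}
```

Bash defers a trapped signal until the foreground child returns, so a timeout wrapper would still need to stop the agent process itself before the checkpoint can run.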
A3. RETRY POLICY (make the reserved `retry` field FUNCTIONAL):
parse `retry: { max: N, backoff: 5m, on: [timeout, verify_failed, crash] }`.
On a failure whose class is in `on` (agent rc!=0 => crash (agent_error),
timeout => timeout, verify fail => verify_failed), requeue to inbox/ with the
backoff delay honored (record `next_eligible=` epoch; selection skips the job
until then), for up to `max` retries (so at most max+1 runs). On exhaustion →
failed/ with result=retries_exhausted (single-host stand-in for dead_letter),
preserving the wip branch + full diagnostics in the log. Default when `retry`
is absent = no retry (current behavior).
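
The retry bookkeeping in A3 reduces to three small decisions, sketched here under stated assumptions: the helper names are hypothetical, `attempts` counts runs so far (so `max: 1` allows one retry after the first run), and backoff values use an `s`/`m`/`h` suffix or bare seconds.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the attempts/max convention are assumptions.

# Convert "5m", "30s", "2h" or bare seconds into seconds.
backoff_to_seconds() {
  local v=$1 mult=1 n
  case $v in
    *s) n=${v%s} ;;
    *m) n=${v%m}; mult=60 ;;
    *h) n=${v%h}; mult=3600 ;;
    *)  n=$v ;;
  esac
  case $n in
    ''|*[!0-9]*) echo "bad backoff: $v" >&2; return 1 ;;
  esac
  echo $(( n * mult ))
}

# Succeeds when the failure class appears in the `on` list, e.g. "[timeout, crash]".
class_retryable() {
  local list=$1 class=$2 item
  for item in $(printf '%s' "$list" | tr -d '[] ' | tr ',' ' '); do
    if [ "$item" = "$class" ]; then return 0; fi
  done
  return 1
}

# Print "requeue <next_eligible_epoch>" or "exhausted".
retry_decision() {
  local attempts=$1 max=$2 backoff=$3 now=$4 delay
  if [ $(( attempts - 1 )) -ge "$max" ]; then
    echo exhausted
    return 0
  fi
  delay=$(backoff_to_seconds "$backoff") || return 1
  echo "requeue $(( now + delay ))"
}

# Selection skips a job until its next_eligible epoch has passed.
is_eligible() {
  [ "${1:-0}" -le "$2" ]
}
```

For example, `retry_decision 1 1 5m "$(date +%s)"` yields a requeue five minutes out, while the same call with `attempts=2` reports exhaustion, which matches the "requeued once, then failed/" sequence the tests below expect.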
A4. STATE INTEGRITY: keep all meta writes append-only (as today); never truncate a
live meta. Recovery/retry/backoff bookkeeping must be crash-safe (re-derivable
from meta + folder location).
==================================================================
B. EXECUTION INSIGHTS & TOKEN ACCOUNTING (single-host §26)
==================================================================
B1. PER-RUN METRICS: on completion, record into the job meta:
duration_s, exit, result, attempts, and repo deltas for the run —
files_changed, lines_added, lines_deleted (from `git -C <cwd> diff --numstat`
against wip_base, or against HEAD~ when no wip_base is recorded).
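
As an illustration of the repo-delta metrics in B1, the sketch below sums `git diff --numstat` output between a base commit and HEAD. The function name `numstat_totals` is hypothetical, and it assumes the run's changes are already committed (for example by the WIP checkpoint), since untracked files do not appear in a working-tree diff.

```shell
#!/usr/bin/env bash
# Sketch only. numstat prints "<added>\t<deleted>\t<path>" per file,
# with "-" in both count columns for binary files.
numstat_totals() {
  local cwd=$1 base=$2
  git -C "$cwd" diff --numstat "$base" HEAD |
    awk '{
           files++
           if ($1 != "-") added += $1
           if ($2 != "-") deleted += $2
         }
         END {
           printf "files_changed=%d\nlines_added=%d\nlines_deleted=%d\n", files, added, deleted
         }'
}
```

The output is already in `key=value` form, so it could be appended to the job meta as-is.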
B2. TOKEN/COST CAPTURE (best-effort, honest): add a single extensible adapter
`parse_usage <engine> <logfile>` that extracts, when present in the engine's
output: model, tokens_in, tokens_out, tokens_cached, cost_usd, turns,
tool_calls. Where the engine does not expose usage, omit the field or set an
`estimated=true` marker — DO NOT fabricate precise numbers. Centralize all
per-engine patterns in this one function (devin/claude/codex/copilot stubs;
real patterns where known, TODO-commented otherwise).
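
A possible skeleton for the `parse_usage` adapter in B2 follows. The `stub` engine and its `usage: key=value ...` log line are invented purely for testing; the real output formats of devin, claude, codex and copilot are not assumed here, so those branches emit nothing rather than guess.

```shell
#!/usr/bin/env bash
# Sketch only. Emits key=value lines for known usage fields and nothing else.
parse_usage() {
  local engine=$1 log=$2
  [ -r "$log" ] || return 0
  case $engine in
    stub)
      # Hypothetical test-only format, one line such as:
      #   usage: model=m1 tokens_in=120 tokens_out=45 cost_usd=0.0031
      # Only allow-listed keys pass through, so stray log content never reaches meta.
      grep '^usage: ' "$log" | tail -n 1 | sed 's/^usage: //' | tr ' ' '\n' |
        grep -E '^(model|tokens_in|tokens_out|tokens_cached|cost_usd|turns|tool_calls)=' || true
      ;;
    devin|claude|codex|copilot)
      # TODO: add real patterns once each engine's usage output is confirmed.
      ;;
    *)
      ;;
  esac
}
```

Keeping an allow-list of field names in the one function also supports the privacy rule in B6, because anything outside the list is dropped before it is written.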
B3. SURFACE in `status`: add an insights sub-line per finished/running job
(duration, attempts, tokens/cost if known, +/- lines).
B4. NEW COMMAND `aq insights [job]`:
- with a job id: print that job's full metrics.
- without: print a table of recent finished jobs + an AGGREGATE rollup by
engine (total tokens, total cost (mark if any estimated), job count,
success rate, avg duration).
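
The per-engine rollup in B4 can be computed in a single awk pass over the finished jobs' meta files, roughly as sketched below. The function name `insights_rollup`, the `AQ_OK_RESULT` variable, and the assumption that a successful job records `result=ok` are hypothetical; substitute whatever result value the script actually writes.

```shell
#!/usr/bin/env bash
# Sketch only. Input: meta files of append-only key=value lines (last value wins).
insights_rollup() {
  [ "$#" -gt 0 ] || return 0
  awk -F= -v ok="${AQ_OK_RESULT:-ok}" '
    # Fold the file just read into the per-engine totals.
    function flush(   e) {
      if (!seen) return
      e = (m["engine"] != "") ? m["engine"] : "unknown"
      jobs[e]++
      if (m["result"] == ok) good[e]++
      dur[e]  += m["duration_s"]
      tok[e]  += m["tokens_in"] + m["tokens_out"]
      cost[e] += m["cost_usd"]
      if (m["estimated"] == "true") est[e] = 1
      split("", m)                      # portable way to clear the array
    }
    FNR == 1 { flush(); seen = 1 }      # a new file starts: close the previous one
    { m[$1] = substr($0, length($1) + 2) }
    END {
      flush()
      for (e in jobs)
        printf "%s jobs=%d success=%d%% avg_duration_s=%d tokens=%d cost_usd=%.4f%s\n",
          e, jobs[e], 100 * good[e] / jobs[e], dur[e] / jobs[e], tok[e], cost[e],
          (est[e] ? " (estimated)" : "")
    }
  ' "$@" | sort
}
```

Missing fields count as zero, so a job without token data still contributes to job count, success rate and duration without inflating the token totals.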
B5. dashboard.mjs: surface a compact insights column/panel (tokens or cost +
attempts) for finished jobs. Keep it read-only from meta (agent-queue.sh
stays the single source of truth).
B6. PRIVACY: never write prompt content or secrets into meta/insights/logs beyond
what already exists.
==================================================================
TESTS (selftest.sh — tests are sacred; only ADD; use temp git repos + stubs)
==================================================================
- orphan recovery: craft a building/ job whose meta pid is a dead PID → a `run`
startup recovers it to inbox/ with attempts incremented; running recovery twice
recovers exactly once.
- wip checkpoint (git): job with a git-repo cwd that creates a file → after the
run, branch aq/wip/<job> exists and contains a commit with the change; main
branch untouched. Non-git cwd → skipped cleanly (no error).
- wip resume: a recovered job whose aq/wip/<job> has a prior commit → the relaunch
checks out that branch (assert HEAD is on aq/wip/<job> when the agent runs).
- retry policy: verify-fail job with retry.max=1 on=[verify_failed] → requeued once
(attempts=2) then → failed/ result=retries_exhausted; backoff next_eligible
respected (job not picked before its delay — use a tiny backoff like 1s).
- retry on crash: agent rc!=0 with on=[crash] retries; without `crash` in `on`,
it goes straight to failed/ (no retry).
- insights parse: feed a stub engine log containing a known usage line →
parse_usage extracts tokens/cost into meta; `aq insights <job>` prints them;
a no-usage log → fields omitted/estimated, no crash.
- insights aggregate: two finished jobs → `aq insights` prints a per-engine rollup
with correct totals + success rate.
- numstat deltas: a run that adds N lines → lines_added recorded.
- REGRESSION: all existing selftest cases (Slice 0 + Slice 1) still green.
==================================================================
DOCS
==================================================================
- README: new "Resilience" section (orphan recovery, WIP checkpoint/resume, retry)
and "Insights" section (metrics, `aq insights`, token caveat) + document the
`retry` frontmatter (now active) and the new result= values
(retries_exhausted). Update the manifest table: move `retry` from RESERVED to ACTIVE.
- docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md: tick the single-host items you fully completed in
§11 (retry/dead-letter stand-in), §25 (orphan/WIP/retry — note "single-host
subset"), §26 (capture/insights — single-host subset); bump §0 Phase 1 %.
==================================================================
CONSTRAINTS
==================================================================
- bash style consistent with the existing script; no new runtime deps; mac+linux
safe (no GNU-only flags without a fallback — note macOS has BSD date/stat);
no emojis in code; no leftover debug noise; conventional commits.
- Be careful with `set -euo pipefail` + traps so the WIP-on-exit checkpoint always
runs even on failure/timeout.
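
For the macOS/Linux constraint above, one common pattern is to try the GNU form first and fall back to the BSD form. The helper names below are hypothetical; the flags themselves (`stat -c %Y` and `date -d @N` on GNU, `stat -f %m` and `date -r N` on BSD) are the standard ones on each platform.

```shell
#!/usr/bin/env bash
# Sketch only: portable wrappers for two calls that differ between GNU and BSD.

# File modification time as epoch seconds.
file_mtime() {
  stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# Epoch seconds to an ISO-8601 UTC timestamp.
epoch_to_iso() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -r "$1" +%Y-%m-%dT%H:%M:%SZ
}
```

The GNU form must come first: on Linux, `date -r` and `stat -f` are valid flags with different meanings, so trying the BSD form first would return wrong values instead of failing over.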
VERIFY GATE (must pass before finishing):
- bash agent-queue/selftest.sh → fully green (existing + all new cases).
- bash -n agent-queue/agent-queue.sh ; node --check agent-queue/dashboard.mjs.
- shellcheck --severity=error agent-queue/agent-queue.sh (if available) → clean.
FINAL OUTPUT — print the implementation report in EXACTLY this format:
## Implementation Report — Phase 1 Slice 3
### Branch & commits
- branch / based-on: <name> (based on main | feat/gigafactory-p1-slice1)
- commits: <sha> <message> (one per line)
- PR: <url or "opened, not merged">
### Files changed
- <path>: <one-line summary>
### What was implemented (A1-A4, B1-B6)
- <item>: <how, key functions added/changed>
### Tests added
- <test name>: <what it asserts> (plus selftest.sh PASS/FAIL summary)
### Verify gate results
- selftest.sh: <PASS/FAIL + counts>
- bash -n / node --check / shellcheck: <result>
### Deviations / assumptions
- <anything changed from spec and why; which engines have real token parsing vs TODO>
### Suggested next slice
- <what should come next (likely: tracker adapter aq from-tracker/to-tracker, P2)>