Commit Graph

14 Commits

Author SHA1 Message Date
saravanakumardb1
1758bc1ab1 feat(agent-queue): single-host crash recovery, WIP checkpoint/resume, retry + insights (P1-S3)
Implements the single-host bash equivalents of roadmap §25 (durability/crash
recovery) and §26 (execution insights), plus §11 retry/dead-letter stand-in.

Resilience (A1-A4):
- recover_orphans + `recover` command: building/ jobs with a dead worker (dead
  pid, pidstart reuse-guard) are moved back to inbox/ with attempts incremented,
  on `run` startup and each loop. Idempotent (folder location is the guard).
- WIP checkpointing: for a git cwd, _wip_start creates/checks out aq/wip/<job>
  and _wip_checkpoint commits changes on every exit path via an EXIT/INT/TERM
  trap; never commits to main/current branch; non-git cwd skipped. RESUME: a
  relaunch whose aq/wip/<job> exists checks it out first (continue from
  checkpoint). wip_base persisted in a write-once sidecar.
- retry policy (now functional): retry { max, backoff, on } requeues failures
  whose class (timeout|verify_failed|crash) is in `on`, honoring backoff via
  next_eligible (selection skips until eligible), up to max attempts; exhaustion
  -> failed/ result=retries_exhausted with the WIP branch + full log preserved.
- state integrity: all meta writes stay append-only; attempts/next_eligible/wip_*
  are re-derivable; recovery is crash-safe.

Insights (B1-B6):
- per-run metrics into meta: duration_s, exit, result, attempts, and (git cwd)
  files_changed/lines_added/lines_deleted from numstat wip_base..HEAD.
- parse_usage(engine, log) adapter: generic AQ_USAGE line + Claude/Codex token
  heuristics; Devin/Copilot TODO; usage_estimated flag; never fabricates numbers.
- status insights sub-line; new `insights [job]` command (per-job metrics or a
  recent table + per-engine token/cost/success/duration rollup).
- privacy: only metrics are recorded, never prompt content or secrets.

Backward-compatible: legacy .md and non-git cwd behave exactly as before.
2026-05-29 18:43:21 -07:00
saravanakumardb1
0be5b34123 feat(agent-queue): evolved manifest, priority, capabilities, engine-class, idempotency (P1-S1)
Implements Gigafactory Phase 1 - Slice 1 in the bash runner (backward-compatible;
a legacy engine/cwd/yolo-only .md behaves exactly as before):

- Parse all new §5 manifest keys via fm_get with safe defaults; record them in
  <job>.meta and surface priority/profile/capabilities/tracker-item in `status`.
  Only priority, capabilities, engine-class and idempotency-key are functional
  this slice; the rest (profile, prefers, budget, deps, deps-mode, retry,
  review-policy, artifacts, tracker-item) are stored but inert.
- priority ordering: inbox_sorted picks critical>high>medium>low, ties by oldest;
  per-lock serialization preserved.
- capability grammar + match: detect_capabilities advertises os/engine/node/has
  tokens; caps_match honors key, key:value, key<op>version and os:any. A job whose
  declared capabilities the host cannot satisfy is moved to failed/ with
  result=capability_mismatch and the agent is never launched.
- engine-class resolution: explicit engine wins; else engine-class picks the first
  available engine honoring prefers-engine (agentic-coder->devin,claude,codex;
  chat-coder->copilot). No available engine -> result=no_engine. Adds copilot to
  the engine driver + COPILOT_BIN.
- idempotency-key dedupe on add: same key+body -> no-op; same key+different body
  supersedes an inbox prior, else is rejected with a clear error.

No change to queue/ data or the run/ship lifecycle. macOS + Linux safe.
2026-05-29 17:44:19 -07:00
saravanakumardb1
4ed4d75a67 feat(agent-queue): default max concurrency 2->3 (still env/flag configurable)
- AGENT_QUEUE_MAX default 3 (override via env or run --max N)
- sync README quick-start + env table + bytelyst-cli example to --max 3
2026-05-29 16:09:12 -07:00
saravanakumardb1
af1bc6904e feat(agent-queue): build/ship lifecycle with auto-QA verify gate + manual ship
Redesign the kanban runner stages from inbox->doing->done/failed to
inbox->building->review->testing->shipped (+ failed):

- worker: agent rc=0 lands in review/, then runs the configurable
  verify command (frontmatter verify: / AGENT_QUEUE_VERIFY) in cwd;
  pass -> testing/ (QA), fail -> failed/, none -> parks in review/
- new commands: ship (testing->shipped, manual gate), promote
  (advance one stage), reject (review/testing->failed); requeue now
  also pulls from review/testing
- status + dashboard.mjs render all six stages; RECENT panel labels
  shipped/testing/review/verify_failed/timeout/rejected
- README: new lifecycle diagram, verify: frontmatter, result= glossary,
  command table + folder layout
- selftest: assert no-verify->review, verify-pass->testing->ship->shipped,
  verify-fail->failed
- rename queue/doing->building, queue/done->review; add testing/ shipped/
2026-05-29 16:03:01 -07:00
saravanakumardb1
27feba36fa fix(agent-queue): status used undefined live_workers; call active_workers 2026-05-29 15:27:15 -07:00
saravanakumardb1
1f15520c4f feat(agent-queue): add requeue and clean commands
- requeue <job>: move a failed job back to inbox/ and drop stale meta/body so
  it re-runs cleanly
- clean [--keep N]: archive finished jobs' logs+meta beyond the newest N
  (default 50) into queue/.archive/<ts>/; running jobs + .md records untouched
- document both in usage + bytelyst-cli subcommand list
2026-05-28 22:31:56 -07:00
saravanakumardb1
4239648876 fix(agent-queue): verify pid start time to defeat pid reuse
Record pidstart (ps lstart) at launch and verify it in all liveness checks
(_meta_active, status, stop) via _pid_alive, so a recycled pid can never be
mistaken for our worker. Falls back to plain liveness when no start time recorded.
2026-05-28 22:24:50 -07:00
saravanakumardb1
a849a30e11 feat(agent-queue): refuse a second run when a daemon is already active
cmd_run now checks daemon.pid liveness up front: if a run loop is alive it exits
with an error (protecting the single-launcher invariant locking depends on); a
stale daemon.pid (dead pid) is cleared and the run proceeds.
2026-05-28 22:21:31 -07:00
saravanakumardb1
11935d0539 fix(agent-queue): reserve concurrency slot before backgrounding worker
Replace live_workers with reservation-aware active_workers + shared _meta_active:
a job counts toward --max the moment its meta is written (before the worker is
backgrounded), so --max can never be exceeded. A <30s guard prevents a meta
orphaned mid-launch from pinning a slot. busy_keys now shares _meta_active.
2026-05-28 22:17:36 -07:00
saravanakumardb1
79331d591f feat(agent-queue): flag stalled workers in status + dash
Mark a running worker '⚠ stalled' when its log has not changed for more than
AGENT_QUEUE_STALL_MIN minutes (default 10), using log mtime as the freshness
signal. Implemented in both the bash status table and the Node dashboard.
2026-05-28 22:15:26 -07:00
saravanakumardb1
3b71f0117a feat(agent-queue): per-job timeout via frontmatter timeout:
Honor 'timeout: 45m' (90s|45m|2h|1d) by wrapping the agent in timeout/gtimeout
when available (hard process-tree kill), else a portable bash watchdog. On expiry
the job moves doing->failed with result=timeout and a TIMED OUT log line.
2026-05-28 22:13:50 -07:00
saravanakumardb1
f14e6c2336 feat(agent-queue): per-cwd locking so two agents never share a repo
Serialize jobs by lock key (frontmatter 'lock:' override, default cwd) via the
single run-loop's pre-launch eligibility check; the oldest non-busy job is picked
regardless of --max. Adds a flock-based worker guard where flock exists (Linux);
macOS relies on the single-daemon model. Records lock= in job meta.
2026-05-28 22:10:30 -07:00
saravanakumardb1
169e944c3c feat(agent-queue): Node live dashboard + bytelyst-cli integration
- dashboard.mjs: zero-dep Node TUI (running workers w/ engine, elapsed,
  cwd, last log line + recent done/failed); 'dash' subcommand execs it
- bytelyst-cli.sh: 'agent-queue' / 'aq' passthrough handled before the
  GITHUB_TOKEN + jq/curl gates; usage + interactive-menu entry
- README: document dash + bytelyst-cli usage
2026-05-28 21:39:25 -07:00
saravanakumardb1
179108504f feat(agent-queue): folder-kanban runner for devin/claude/codex CLIs
Add a zero-dependency, bash 3.2-compatible queue runner that executes
prompt .md files through headless coding-agent CLIs in auto-approve mode,
moving them inbox -> doing -> done/failed with per-job logs and live status.

- pluggable engine drivers (devin --prompt-file, claude/codex via stdin)
- per-task YAML frontmatter: engine, cwd, yolo
- subcommands: init, add, run (--max N), status, watch, stop, logs
- runtime queue/ state gitignored
2026-05-28 21:35:59 -07:00