bytelyst-devops-tools/agent-queue
saravanakumardb1 1e0a17bbc0 test(agent-queue): tracker adapter selftest cases (P1-S4)
Adds (never weakens) 7 stub-driven cases (AQ_TRACKER_API_CMD stub, no live service): from-tracker create + label mapping + idempotent; to-tracker shipped echo (PATCH done + metrics comment, asserts NO prompt body sent) + idempotent; HTTP 500 non-fatal; AQ_TRACKER_AUTO auto-echo on run. Full suite green (53 checks).
2026-05-29 21:35:16 -07:00

agent-queue

A zero-dependency folder "kanban" runner for headless coding-agent CLIs — Devin, Claude Code, and OpenAI Codex. Drop prompt .md files into a folder, and they get executed (in auto-approve mode) one slot at a time, moving through inbox → building → review → testing → shipped (plus failed) with live status.

Vision & roadmap: the long-term direction is a distributed multi-machine "gigafactory" (fleet of factories × tools × profiles, scheduler-routed, built on platform-service + tracker-web). It is specified as a checklist-driven implementation roadmap in docs/GIGAFACTORY_ROADMAP.md.

Build/ship lifecycle — auto-QA, manual ship:

inbox ─▶ building ─▶ review ─▶ testing ─▶ shipped
  (queued)  (agent     (rc=0;    (verify    (you ran
            running)   awaiting  passed —    `ship`)
                       verify)   QA gate)
                          │
        agent rc≠0 /      │ verify fails
        timeout ──────────┴──────────────▶ failed
  • Auto: agent exits 0 → review/. If a verify: command is configured it runs automatically: pass → testing/ (QA), fail → failed/. No verify: → the job parks in review/ for a manual promote.
  • Manual: you ship a testing/ job → shipped/ (the human gate). Shipping is never automatic.
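
The transition rules above can be summarized as a small decision function. This is a minimal sketch, not code taken from agent-queue.sh: `next_stage` and its three arguments are hypothetical names chosen for illustration, and a timeout is assumed to surface as a non-zero agent exit.

```shell
#!/usr/bin/env bash
# next_stage AGENT_RC HAS_VERIFY [VERIFY_RC]
#   AGENT_RC    exit code of the agent process
#   HAS_VERIFY  "yes" if a verify: command is configured, anything else otherwise
#   VERIFY_RC   exit code of the verify command (ignored when HAS_VERIFY != yes)
# Prints the folder the job should move to. Hypothetical helper for illustration.
next_stage() {
  local agent_rc=$1 has_verify=$2 verify_rc=${3:-0}
  if [ "$agent_rc" -ne 0 ]; then
    echo failed      # agent rc != 0 (assumed to include timeouts)
  elif [ "$has_verify" != yes ]; then
    echo review      # no verify: configured, park for a manual promote
  elif [ "$verify_rc" -eq 0 ]; then
    echo testing     # verify passed, wait for the manual ship
  else
    echo failed      # verify failed
  fi
}
```

Note that no branch ever prints shipped: that stage is reached only through the manual ship (or promote) command.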

Why this exists: the agent CLIs ship a minimal local interface (no built-in batch/queue/dashboard — that lives in their cloud products). This is the zero-dependency bash glue that turns "run one prompt interactively" into "queue many and walk away."


Quick start

cd learning_ai_devops_tools/agent-queue
chmod +x agent-queue.sh
./agent-queue.sh init

# queue a roadmap for Devin, running in the tracker-web repo, auto-approving everything
./agent-queue.sh add ~/roadmaps/UX-2.md \
  --engine devin \
  --cwd /Users/sd9235/code/mygh/learning_ai_common_plat/dashboards/tracker-web \
  --yolo

# start processing (foreground; Ctrl-C to stop). Run up to 3 agents at once (default).
./agent-queue.sh run --max 3

In a second terminal, watch progress:

./agent-queue.sh watch
  AGENT QUEUE  /…/agent-queue/queue
  inbox 3   building 2   review 1   testing 2   shipped 5   failed 0   running 2/2

  RUNNING
    20260528-2130__UX-2        devin     4m12s  pid 51234  ⏺ Edited src/app/dashboard/items/page.tsx
    20260528-2131__UX-3        claude    1m02s  pid 51290  Running: pnpm typecheck

How a task is configured

Each .md carries optional frontmatter telling the runner which engine to use, which directory to run in, and whether to auto-approve:

---
engine: devin          # devin | claude | codex | copilot  (default: $AGENT_QUEUE_ENGINE)
cwd: /abs/path/to/repo # where the agent executes   (default: cwd when added)
yolo: true             # auto-approve ALL tools      (default: true)
lock: my-repo          # optional mutex key (default: cwd). Jobs sharing a key run serially
timeout: 45m           # optional. 90s|45m|2h|1d. On expiry → failed (result=timeout)
verify: pnpm -s test   # optional auto-QA gate. Runs in cwd after rc=0:
                       #   pass → testing/ (QA),  fail → failed/
                       #   (omit to park in review/ for manual promote)
---

# Your task / roadmap goes here
...

add --engine/--cwd/--yolo will inject this frontmatter for you if the file doesn't already have a --- block.
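
For readers who want to see how such a frontmatter block can be read from bash, here is a minimal sketch using awk. The helper names `fm_get` and `fm_body` are hypothetical (they are not claimed to exist in agent-queue.sh), and the sketch assumes simple one-line `key: value` entries with optional trailing `# comments`.

```shell
#!/usr/bin/env bash
# fm_get FILE KEY: print the value of KEY from a leading '---' frontmatter block.
# Hypothetical helper; assumes one "key: value" per line.
fm_get() {
  awk -v key="$2" '
    NR == 1 && $0 != "---" { exit }      # file has no frontmatter block
    NR == 1 { next }
    $0 == "---" { exit }                 # end of the frontmatter block
    index($0, key ":") == 1 {
      val = substr($0, length(key) + 2)
      sub(/[ \t]+#.*$/, "", val)         # drop a trailing "# comment"
      gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim surrounding whitespace
      print val
      exit
    }
  ' "$1"
}

# fm_body FILE: print the file with the leading frontmatter block removed.
fm_body() {
  awk '
    NR == 1 && $0 == "---" { in_fm = 1; next }
    in_fm && $0 == "---"   { in_fm = 0; next }
    !in_fm                 { print }
  ' "$1"
}
```

A value that itself contains whitespace followed by `#` would be truncated by the comment stripping, so a real parser would need quoting rules that this sketch leaves out.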

Manifest fields (Gigafactory Phase 1)

The runner parses the richer gigafactory manifest backward-compatibly — a legacy engine/cwd/yolo-only .md behaves exactly as before. Fields marked RESERVED are parsed, stored in .state/<job>.meta, and shown in status, but are otherwise no-ops until a later phase (they do not yet affect execution).

| Field | Status | Default | Meaning |
| --- | --- | --- | --- |
| engine | active | $AGENT_QUEUE_ENGINE | explicit engine (devin\|claude\|codex\|copilot); always wins over engine-class |
| cwd / yolo / lock / timeout / verify | active | see above | Phase-0 behavior, unchanged |
| priority | active | medium | critical\|high\|medium\|low. Inbox is picked highest-priority first, then oldest (was pure FIFO) |
| engine-class | active | (none) | used only when engine is unset: agentic-coder → devin, claude, codex; chat-coder → copilot. Picks the first available engine. No engine available → job fails with result=no_engine |
| prefers-engine | active | (none) | optional order hint for engine-class resolution, e.g. [claude, devin] |
| capabilities | active | (none) | hard host requirements, e.g. [os:any, node>=20, has:git]. If the host can't satisfy them the job is sent to failed/ with result=capability_mismatch and the agent is never launched (grammar below) |
| idempotency-key | active | (none) | dedupe on add (semantics below) |
| profile | active | (none) | inherit persona + verify/caps/engine-class/prefers-engine/allowed-scope/review-policy from profiles/<name>.md (job fields override; see Profiles) |
| prefers | RESERVED | (none) | soft routing/affinity hints (e.g. [factory:mac-2]) |
| budget | RESERVED | (none) | { usd, tokens, wall } ceilings (wall enforcement is a later slice) |
| deps / deps-mode | active | (none) | block until each referenced idempotency-key is in shipped/ (or testing/ when deps-mode: soft). Submit-time cycle detection (see Profiles & deps) |
| retry | active | (none) | { max: N, backoff: 5m, on: [timeout, verify_failed, crash] }: requeue failures with backoff up to max, then retries_exhausted (see Resilience) |
| review-policy | RESERVED | (none) | auto\|manual\|reviewers:[…] |
| artifacts | RESERVED | (none) | extra outputs to capture (coverage, screenshots) |
| tracker-item | RESERVED | (none) | link back to the originating tracker task |
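
The priority rule in the table above (highest priority first, then oldest) can be sketched as a sort over the inbox. This is an illustration only: `priority_rank` and `pick_next` are hypothetical names, the fallback of unknown values to medium is an assumption, and "oldest" is approximated by the lexical order of timestamp-prefixed job names such as 20260528-2130__UX-2.

```shell
#!/usr/bin/env bash
# priority_rank PRIORITY: map a priority name to a sortable number (lower runs first).
priority_rank() {
  case "${1:-medium}" in
    critical) echo 0 ;;
    high)     echo 1 ;;
    medium)   echo 2 ;;
    low)      echo 3 ;;
    *)        echo 2 ;;   # assumption: unknown values are treated as medium
  esac
}

# pick_next: read "<priority> <job-name>" lines on stdin, print the job to run next.
# Assumes job names start with a sortable timestamp, so lexical order means oldest first.
pick_next() {
  local prio name tab
  tab=$(printf '\t')
  while read -r prio name; do
    printf '%s\t%s\n' "$(priority_rank "$prio")" "$name"
  done | LC_ALL=C sort -t "$tab" -k1,1n -k2,2 | head -n 1 | cut -f2
}
```

Fed three inbox entries where two are high priority, `pick_next` returns the older of the two high-priority jobs.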

Capability grammar (a job matches a host iff every required token is satisfied):

| Token form | Example | Satisfied when |
| --- | --- | --- |
| key (bare presence) | gpu | the host advertises key in any form |
| key:value (exact) | os:mac, engine:devin, has:git | the host advertises that exact token |
| key:any (wildcard) | os:any | the host advertises any key:* (so os:any matches every host) |
| key<op>version (>= > = <= <) | node>=20 | numeric/semver-major compare vs the host's key:<n> |

The host advertises (via detect_capabilities): os:<mac|linux>, engine:<each available engine>, node:<major>, and has:<git|pnpm|docker> when present.
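
To make the grammar concrete, here is a minimal bash sketch of a matcher for the four token forms. `cap_satisfied` is a hypothetical name and this is not the runner's actual implementation; version comparison is reduced to the major number, and the host capability list is passed in as plain arguments rather than read from detect_capabilities.

```shell
#!/usr/bin/env bash
# cap_satisfied TOKEN HOST_CAP...: exit 0 if the host capabilities satisfy TOKEN.
# Hypothetical sketch of the capability grammar; compares major versions only.
cap_satisfied() {
  local tok=$1; shift
  local re='^([a-z_-]+)(>=|<=|>|<|=)([0-9]+)'
  local cap key op want have
  if [[ $tok =~ $re ]]; then                      # key<op>version
    key=${BASH_REMATCH[1]} op=${BASH_REMATCH[2]} want=${BASH_REMATCH[3]}
    for cap in "$@"; do
      [[ $cap == "$key":* ]] || continue
      have=${cap#*:}; have=${have%%.*}            # keep the major part of key:<n>
      [[ $have =~ ^[0-9]+$ ]] || continue
      case $op in
        '>=') [ "$have" -ge "$want" ] && return 0 ;;
        '<=') [ "$have" -le "$want" ] && return 0 ;;
        '>')  [ "$have" -gt "$want" ] && return 0 ;;
        '<')  [ "$have" -lt "$want" ] && return 0 ;;
        '=')  [ "$have" -eq "$want" ] && return 0 ;;
      esac
    done
    return 1
  fi
  for cap in "$@"; do
    case $tok in
      *:any) [[ $cap == "${tok%any}"* ]] && return 0 ;;               # key:any matches any key:*
      *:*)   [[ $cap == "$tok" ]] && return 0 ;;                      # exact key:value
      *)     [[ $cap == "$tok" || $cap == "$tok":* ]] && return 0 ;;  # bare key, any form
    esac
  done
  return 1
}
```

A job matches when every one of its tokens passes, for example `cap_satisfied 'node>=20' os:mac node:22 has:git` succeeds while `cap_satisfied gpu os:mac node:22 has:git` fails.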

idempotency-key semantics (on add, hashing the frontmatter-stripped body):

  • same key + same body → no-op (logged duplicate, skipped).
  • same key + different body, prior job still in inbox/ → supersedes it (replaces the queued file).
  • same key + different body, prior job already past inbox/ (building/review/testing/shipped) → rejected with a clear error (use a new key, or requeue the existing job).
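
The three cases above reduce to a small decision table. The sketch below is illustrative: `body_hash` and `idem_action` are hypothetical names, `cksum` stands in for whatever hash the runner really uses, and treating every stage other than inbox as "past inbox/" is an assumption.

```shell
#!/usr/bin/env bash
# body_hash FILE: checksum of the file with its leading frontmatter block removed.
# cksum is a stand-in here; the real tool may use a stronger hash.
body_hash() {
  awk 'NR==1 && $0=="---"{fm=1;next} fm && $0=="---"{fm=0;next} !fm' "$1" \
    | cksum | cut -d' ' -f1
}

# idem_action NEW_HASH [PRIOR_HASH PRIOR_STAGE]
# Prints what add should do for a job whose idempotency-key may already be known.
idem_action() {
  local new=$1 old=${2:-} stage=${3:-}
  if [ -z "$old" ]; then
    echo enqueue       # key not seen before
  elif [ "$new" = "$old" ]; then
    echo duplicate     # same key + same body: no-op
  elif [ "$stage" = inbox ]; then
    echo supersede     # same key + different body, still queued: replace the file
  else
    echo reject        # same key + different body, already past inbox/
  fi
}
```

Because the hash covers only the frontmatter-stripped body, changing just the engine or timeout of an otherwise identical prompt still counts as the same body.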

Engine mapping

| engine: | Command run | Auto-approve flag (yolo: true) |
| --- | --- | --- |
| devin | devin -p --prompt-file <body> | --permission-mode dangerous |
| claude | claude -p (body on stdin) | --dangerously-skip-permissions |
| codex | codex exec (body on stdin) | --dangerously-bypass-approvals-and-sandbox |
| copilot | copilot -p (body on stdin) | --allow-all-tools (best-effort; chat-coder class target) |

The frontmatter is stripped before the body reaches the agent. Engines that read the body on stdin (claude, codex, copilot) never see it as an argument, so a body starting with -- is never misparsed as a flag.

Flags drift between CLI versions — if one changes, edit build_agent_cmd() in agent-queue.sh (it's the single place each engine is mapped).
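
As a rough picture of what a single mapping point can look like, the sketch below prints (without running) the command line for each engine, using only the commands and flags listed in the mapping above. `agent_cmd_sketch` is a hypothetical name; it is not the real build_agent_cmd(), whose signature and internals are not shown in this README.

```shell
#!/usr/bin/env bash
# agent_cmd_sketch ENGINE BODY_FILE [YOLO]: print the command line for an engine.
# Illustration only: it echoes the mapping and does not launch anything.
agent_cmd_sketch() {
  local engine=$1 body=$2 yolo=${3:-true} cmd flag
  case $engine in
    devin)   cmd="devin -p --prompt-file $body"; flag="--permission-mode dangerous" ;;
    claude)  cmd="claude -p";                    flag="--dangerously-skip-permissions" ;;
    codex)   cmd="codex exec";                   flag="--dangerously-bypass-approvals-and-sandbox" ;;
    copilot) cmd="copilot -p";                   flag="--allow-all-tools" ;;
    *) echo "unknown engine: $engine" >&2; return 1 ;;
  esac
  [ "$yolo" = true ] && cmd="$cmd $flag"         # auto-approve flag only when yolo: true
  [ "$engine" = devin ] || cmd="$cmd < $body"    # other engines read the body on stdin
  echo "$cmd"
}
```

Keeping the per-engine differences in one `case` statement means a renamed CLI flag is a one-line change.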

Commands

| Command | What it does |
| --- | --- |
| init | create the queue/ folders |
| add <file> [--engine E] [--cwd P] [--yolo\|--no-yolo] | queue a prompt into inbox/ |
| run [--max N] [--engine E] [--once] | process the inbox (foreground loop) |
| status | kanban counts + running-worker table (marks ⚠ stalled; per-job insights sub-line) |
| watch [interval] | live status (bash), redrawn every N seconds (default 2) |
| insights [job] | per-job metrics, or a recent-jobs table + per-engine token/cost/success rollup (see Insights) |
| recover | reclaim orphaned building/ jobs (dead worker) back to inbox/ (see Resilience) |
| dash [--interval N] | interactive Node dashboard: navigable numbered job list with single-key actions (see below) |
| stop | kill running workers + the run loop |
| logs <job> [-f] | print / follow a job's log |
| promote <job> | advance one stage forward: review → testing → shipped |
| ship <job> | manual gate: move a testing/ (QA) job → shipped/ |
| reject <job> | send a review/ or testing/ job → failed/ |
| requeue <job> | move a failed/review/testing job back to inbox/ for a fresh run |
| clean [--keep N] | archive finished logs+meta beyond the newest N (default 50) into queue/.archive/ |

Only one run loop may be active per queue — a second run against the same queue is refused while the first is alive (a stale daemon.pid is cleared).
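
The single-run-loop guard can be pictured as a pid-file check. The following is a minimal sketch under stated assumptions: `acquire_run_lock` is a hypothetical name, liveness is tested with `kill -0`, and the check-then-write is not atomic (the README's flock hardening on Linux would cover that gap).

```shell
#!/usr/bin/env bash
# acquire_run_lock PIDFILE: refuse if another live run loop owns PIDFILE,
# clear a stale file otherwise, then record our own pid. Hypothetical helper.
acquire_run_lock() {
  local pidfile=$1 pid=""
  if [ -f "$pidfile" ]; then
    pid=$(cat "$pidfile" 2>/dev/null) || pid=""
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
      echo "run loop already active (pid $pid)" >&2
      return 1
    fi
    rm -f "$pidfile"        # stale: the recorded process no longer exists
  fi
  echo "$$" > "$pidfile"
}
```

A pid alone can be reused by an unrelated process after a reboot, which is why the Resilience section also mentions a pidstart guard; this sketch omits that check.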

Interactive dashboard (dash)

dash is a single-script, menu-driven control panel (think a tiny "glassbox"). It shows the kanban counts, live RUNNING workers (engine, elapsed, last log line, stall), a navigable numbered JOBS list, and RECENT finished jobs — and lets you act on jobs without leaving the screen. Every action shells out to agent-queue.sh, so the script stays the single source of truth.

| Key | Action |
| --- | --- |
| ↑/↓, j/k, 1-9 | select a job in the JOBS list |
| enter / l | view the selected job's log (live, auto-refreshing) |
| p | promote (review → testing → shipped) |
| s | ship (testing/QA → shipped, the manual gate) |
| x | reject (review/testing → failed), asks y/n |
| u | requeue (failed/review/testing → inbox), asks y/n |
| r | start the run loop (detached → logs/run-loop.log) |
| S | stop the run loop + running workers |
| g | refresh now · ?/h help · q/Ctrl-C quit |

The header shows a ● run loop pid N / ○ run loop stopped indicator. Run it in a TTY for the interactive mode; piped/non-TTY it falls back to a read-only live view.

Via bytelyst-cli.sh

Wired into the repo's unified CLI (no GitHub token required for this subcommand):

./bytelyst-cli.sh agent-queue run --max 3     # full passthrough
./bytelyst-cli.sh aq status                   # short alias

Folder layout

queue/
  inbox/    # drop / queued .md files (oldest eligible picked first)
  building/ # currently executing (agent running)
  review/   # agent exited 0 — awaiting the auto-QA verify gate (or manual promote)
  testing/  # verify passed (QA) — awaiting manual `ship`
  shipped/  # manually shipped — the terminal success stage
  failed/   # non-zero exit, bad cwd, timeout, verify failure, or manual reject
  logs/     # <job>.log — full agent + verify output
  locks/    # per-key flock files (Linux hardening; unused on macOS)
  .state/   # <job>.meta heartbeats + daemon.pid (runtime only)
  .archive/ # <ts>/ — logs+meta moved here by `clean`

result= values written to <job>.meta: review, testing, shipped, failed, timeout, verify_failed, rejected, requeued, capability_mismatch (host missing a required capability — agent never launched), no_engine (an engine-class had no available engine), retries_exhausted (failed after retry.max attempts — single-host dead-letter stand-in), retry_scheduled (transient: requeued for another attempt), recovered (transient: an orphan was reclaimed to inbox/).

Profiles & deps

Profiles (roadmap §6)

A profile is a reusable role preset in profiles/<name>.md. A job opts in with profile: <name> and inherits any of these fields it does not set itself: verify (from the profile's default-verify), capabilities, engine-class, prefers-engine, allowed-scope, review-policy. The profile's persona block is prepended to the body sent to the engine (the job .md on disk is unchanged; secrets are never logged). Resolution runs before the capability gate and engine resolution, so inherited caps / engine-class take effect.

Precedence: job field > profile field > built-in default. Set AGENT_QUEUE_PROFILES to point at a different catalog directory (defaults to ./profiles).

Starter catalog: developer, backend-engineer, frontend-engineer, ux-designer, ui-designer, qa, reviewer, docs-writer, and a reserved planner. Each presets name, persona, capabilities, default-verify, engine-class, prefers-engine, allowed-scope, and review-policy.

allowed-scope (warn-only this phase). After a run on a git cwd, changed paths outside the profile/job allowed-scope globs (dir/** matches the whole subtree) are logged as a WARNING and recorded as scope_warning= in the meta — non-blocking (the job is not failed). path_in_scope is exposed as a unit-testable function.
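
A scope check of this kind can be written with bash pattern matching alone. The sketch below is an assumption-laden illustration, not the README's path_in_scope itself: `path_in_scope_sketch` is a hypothetical name, `dir/**` is treated as a prefix match on `dir/`, and any other glob is matched with `[[ == ]]`, where `*` also crosses `/`.

```shell
#!/usr/bin/env bash
# path_in_scope_sketch PATH GLOB...: exit 0 if PATH matches any allowed-scope glob.
# Hypothetical illustration of the "dir/** matches the whole subtree" rule.
path_in_scope_sketch() {
  local path=$1 glob; shift
  for glob in "$@"; do
    case $glob in
      */\*\*) [[ $path == "${glob%\*\*}"* ]] && return 0 ;;  # dir/** covers the subtree
      *)      [[ $path == $glob ]] && return 0 ;;             # plain glob (intentionally unquoted)
    esac
  done
  return 1
}
```

Running it over the changed paths of a job (for instance the output of `git diff --name-only`) and logging every path that returns non-zero would reproduce the warn-only behavior described above.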

deps / DAG, single host (roadmap §5)

deps: [keyA, keyB] references other jobs by their author-controlled idempotency-key. A dep is satisfied when a job with that key is in shipped/ (default), or in shipped/ or testing/ when the dependent job sets deps-mode: soft. A job with unmet deps is blocked: it is skipped in inbox selection (never launched, never failed) and surfaced in status as blocked (waiting on: <keys>), then re-evaluated every loop until its deps are met. add performs submit-time cycle detection over the inbox + active-stage dep graph and rejects (nonzero exit) a job that would create a cycle. Cross-machine deps are P2.
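
The "blocked until deps are met" rule can be sketched as a pure function over lists of keys. `unmet_deps` is a hypothetical name, and passing the keys found in shipped/ and testing/ as space-separated strings is an assumption made to keep the example self-contained.

```shell
#!/usr/bin/env bash
# unmet_deps MODE "DEP KEYS" "KEYS IN SHIPPED" "KEYS IN TESTING"
# Prints the dep keys still unmet (empty output means the job is eligible).
# MODE is "soft" for deps-mode: soft; anything else means the default (shipped only).
unmet_deps() {
  local mode=$1 deps=$2 shipped=" $3 " testing=" $4 " key unmet=""
  for key in $deps; do
    case $shipped in *" $key "*) continue ;; esac     # satisfied by shipped/
    if [ "$mode" = soft ]; then
      case $testing in *" $key "*) continue ;; esac   # soft mode also accepts testing/
    fi
    unmet="$unmet $key"
  done
  echo "${unmet# }"
}
```

A non-empty result is what status would surface as blocked (waiting on: <keys>); the function is simply re-run on every loop iteration.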

Resilience (crash recovery & work preservation)

Single-host implementations of the durability model (roadmap §25):

  • Orphan recovery. A job left in building/ whose worker process is dead (no live pid, PID-reuse-guarded by pidstart) is an orphan from a previous crash/power-off. On run startup and on every loop iteration (or on demand via agent-queue.sh recover) it is moved back to inbox/ with attempts incremented. Recovery is idempotent — once moved out of building/ it is never recovered twice.
  • WIP checkpointing. When a job's cwd is a git repo, the worker creates/checks out a dedicated branch aq/wip/<job> at start and commits any changes to it on every exit path — success, failure, timeout, and SIGTERM/SIGINT (via a trap). It never commits to main/your current branch. Non-git cwd is skipped cleanly. wip_branch / wip_base / wip_commit are recorded in the meta.
  • Resume. When an orphan/retry of a job whose aq/wip/<job> branch already exists is relaunched, that branch is checked out first so the agent continues from the checkpoint instead of from zero.
  • Retry policy (retry frontmatter, now active). On a failure whose class is in on (crash/agent_error for a non-zero agent exit, timeout, verify_failed) the job is requeued to inbox/ honoring backoff (selection skips it until next_eligible) up to max attempts; on exhaustion it lands in failed/ with result=retries_exhausted, preserving the WIP branch + full log. Without a retry: field, failures are not retried (Phase-0 behavior).

All bookkeeping (attempts, next_eligible, wip_*) is append-only in the meta and re-derivable from the meta + folder location, so recovery is crash-safe.
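
The retry rule can be condensed into a small decision function. This is a sketch under assumptions the README does not settle: `retry_decision` and `next_eligible_at` are hypothetical names, attempts is taken to count runs completed so far (including the one that just failed), and a retry is scheduled while attempts is below max.

```shell
#!/usr/bin/env bash
# retry_decision CLASS ATTEMPTS MAX "ON LIST"
#   CLASS     failure class of the run that just ended (timeout, verify_failed, crash, ...)
#   ATTEMPTS  runs completed so far, including this one (assumption)
#   MAX       retry.max
#   ON LIST   space-separated classes from retry.on
retry_decision() {
  local class=$1 attempts=$2 max=$3 on=" $4 "
  case $on in
    *" $class "*) ;;                 # class is retryable
    *) echo failed; return 0 ;;      # not listed in on: plain failure, no retry
  esac
  if [ "$attempts" -lt "$max" ]; then
    echo retry_scheduled             # requeue to inbox/, skipped until next_eligible
  else
    echo retries_exhausted           # lands in failed/
  fi
}

# next_eligible_at NOW_EPOCH BACKOFF_SECONDS: earliest time the job may be picked again.
next_eligible_at() {
  echo $(( $1 + $2 ))
}
```

With `retry: { max: 3, backoff: 5m, on: [timeout] }`, a first timeout would yield retry_scheduled and a next_eligible five minutes (300 seconds) after the failure.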

Insights (metrics & token accounting)

Each finished run records into <job>.meta: duration_s, exit, result, attempts, and, for a git cwd, files_changed / lines_added / lines_deleted (diffed wip_base..HEAD). A single parse_usage <engine> <log> adapter extracts model / tokens_in / tokens_out / tokens_cached / cost_usd / turns / tool_calls when the engine exposes them.

agent-queue.sh insights <job>   # full metrics for one job
agent-queue.sh insights         # recent-jobs table + per-engine rollup

Token caveat (honest): real usage is captured only where the engine surfaces it. A cooperating wrapper may emit a machine-readable AQ_USAGE key=value … line; otherwise per-engine heuristics apply (Claude/Codex token fields parsed; Devin session metrics + Copilot are API-only and currently TODO in parse_usage). When a value is not provider-reported it is omitted or flagged usage_estimated — numbers are never fabricated. The per-engine rollup marks totals that include any estimated value with *.
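
Reading an AQ_USAGE line out of a log needs only standard text tools. The sketch below is not the real parse_usage: `parse_aq_usage` is a hypothetical name, and it assumes the line starts with `AQ_USAGE `, that values contain no spaces, and that the last such line in the log wins.

```shell
#!/usr/bin/env bash
# parse_aq_usage LOGFILE: print key=value pairs (one per line) from the last
# AQ_USAGE line in the log. Prints nothing and returns non-zero if there is none,
# so absent usage is never replaced by made-up numbers.
parse_aq_usage() {
  grep '^AQ_USAGE ' "$1" | tail -n 1 | cut -d' ' -f2- | tr ' ' '\n' | grep '='
}
```

For a log whose last usage line is `AQ_USAGE tokens_in=120 tokens_out=45`, piping the output through `grep '^tokens_in='` yields `tokens_in=120`.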

Config (env overrides)

| Var | Default | Meaning |
| --- | --- | --- |
| AGENT_QUEUE_ROOT | ./queue | where the kanban folders live |
| AGENT_QUEUE_MAX | 3 | max concurrent agents (override per-run with run --max N) |
| AGENT_QUEUE_ENGINE | devin | default engine when none in frontmatter |
| AGENT_QUEUE_POLL | 3 | inbox poll interval (seconds) |
| AGENT_QUEUE_VERIFY | (empty) | default auto-QA verify command; per-job verify: overrides it |
| AGENT_QUEUE_STALL_MIN | 10 | minutes of unchanged log before a worker is ⚠ stalled |
| DEVIN_BIN / CLAUDE_BIN / CODEX_BIN / COPILOT_BIN | autodetected | override CLI binary paths |
| FLOCK_BIN / TIMEOUT_BIN | autodetected | flock (lock hardening) and timeout/gtimeout (hard timeouts); absent on stock macOS (see Portability notes) |

⚠️ Safety

Running agents with yolo: true means no approval prompts — they will edit files, run shell commands, and commit unattended. Mitigate:

  • Prefer scope-locked prompt files (e.g. "edit only under dashboards/tracker-web/").
  • Tell prompts not to git push — review commits before they leave your machine.
  • Same-repo safety is automatic: jobs sharing a cwd (or lock: key) are serialized, so two agents never run in one repo at once — even at --max 2+.
  • Set a timeout: on long jobs so a wedged agent can't run forever.
  • Watch cost: each job is a full agent session.

Portability notes

  • macOS has no flock/timeout; locking relies on the single run-loop (enforced by the second-run refusal) and timeouts use a pure-bash watchdog. Install coreutils (gtimeout) for hard process-tree kills.
  • Linux (incl. Gitea CI) uses flock + timeout for cross-process hardening.
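
A pure-bash watchdog of the kind mentioned for macOS can be sketched in a few lines. Both helpers are hypothetical (`duration_to_seconds`, `run_with_timeout`) and simplified: the watchdog signals only the direct child, not its whole process tree, and a timeout is reported as exit status 143 (128 + SIGTERM).

```shell
#!/usr/bin/env bash
# duration_to_seconds DURATION: convert 90s | 45m | 2h | 1d to seconds.
duration_to_seconds() {
  local d=$1 n=${1%[smhd]}
  case $n in ''|*[!0-9]*) echo "bad duration: $d" >&2; return 1 ;; esac
  case $d in
    *s) echo $(( 10#$n )) ;;            # 10# avoids octal parsing of values like 08
    *m) echo $(( 10#$n * 60 )) ;;
    *h) echo $(( 10#$n * 3600 )) ;;
    *d) echo $(( 10#$n * 86400 )) ;;
    *)  echo "bad duration: $d" >&2; return 1 ;;
  esac
}

# run_with_timeout SECONDS CMD...: run CMD, send it SIGTERM after SECONDS.
# Returns CMD's exit status (143 when the watchdog fired).
run_with_timeout() {
  local secs=$1 rc=0; shift
  "$@" &
  local pid=$!
  ( sleep "$secs"; kill -TERM "$pid" 2>/dev/null ) >/dev/null 2>&1 &
  local dog=$!
  wait "$pid" || rc=$?
  kill "$dog" 2>/dev/null || true       # command finished first: cancel the watchdog
  return "$rc"
}
```

Usage would look like `run_with_timeout "$(duration_to_seconds 45m)" some-agent-command`. Where timeout or gtimeout is installed it remains the better choice, since this sketch cannot kill grandchildren of the agent.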

Roadmap / nice-to-haves

  • [x] Per-repo lock to serialize same-repo jobs automatically (lock: / cwd).
  • [x] Per-job timeout: with hard kill (or bash watchdog fallback).
  • [x] Stall detection in status/dash.
  • [x] requeue failed jobs + clean/archive old runs.
  • [x] Build/ship lifecycle: review → testing → shipped with auto-QA verify: gate + manual ship.
  • [ ] --push opt-in policy + commit review gate.
  • [ ] Optional notifications (Slack/desktop) on done/failed/stall.
  • [ ] Persisted run-loop as a daemon/service with auto-restart.