From c52c165fd6e1c000406555068c83042a03d46460 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Thu, 28 May 2026 22:33:20 -0700
Subject: [PATCH] docs(agent-queue): document locking, timeout, stall, requeue/clean

Update README command table (requeue/clean, stall marker, single-run
note), frontmatter (lock/timeout), engine mapping (stdin), config
(STALL_MIN, FLOCK_BIN/TIMEOUT_BIN), folder layout (locks/.archive),
Safety (automatic same-repo serialization + portability notes), and
mark roadmap items done.
---
 agent-queue/README.md | 56 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/agent-queue/README.md b/agent-queue/README.md
index f2f210f..864c650 100644
--- a/agent-queue/README.md
+++ b/agent-queue/README.md
@@ -6,8 +6,9 @@ and they get executed (in auto-approve mode) one slot at a time, moving through
 `inbox → doing → done/failed` with live status.
 
 > **Why this exists:** the agent CLIs ship a minimal local interface (no built-in
-> batch/queue/dashboard — that lives in their *cloud* products). This is the ~250-line
-> glue that turns "run one prompt interactively" into "queue many and walk away."
+> batch/queue/dashboard — that lives in their *cloud* products). This is the
+> zero-dependency bash glue that turns "run one prompt interactively" into
+> "queue many and walk away."
 
 ---
 
@@ -55,6 +56,8 @@ which directory to run in, and whether to auto-approve:
 engine: devin # devin | claude | codex (default: $AGENT_QUEUE_ENGINE)
 cwd: /abs/path/to/repo # where the agent executes (default: cwd when added)
 yolo: true # auto-approve ALL tools (default: true)
+lock: my-repo # optional mutex key (default: cwd). Jobs sharing a key run serially
+timeout: 45m # optional. 90s|45m|2h|1d. On expiry → failed (result=timeout)
 ---
 
 # Your task / roadmap goes here
@@ -68,9 +71,13 @@ already have a `---` block.
 
 | `engine:` | Command run | Auto-approve flag (`yolo: true`) |
 | --------- | ----------- | -------------------------------- |
-| `devin` | `devin -p --prompt-file ` | `--permission-mode dangerous` |
-| `claude` | `claude -p ""` | `--dangerously-skip-permissions` |
-| `codex` | `codex exec ""` | `--dangerously-bypass-approvals-and-sandbox` |
+| `devin` | `devin -p --prompt-file ` | `--permission-mode dangerous` |
+| `claude` | `claude -p` (body on **stdin**) | `--dangerously-skip-permissions` |
+| `codex` | `codex exec` (body on **stdin**) | `--dangerously-bypass-approvals-and-sandbox` |
+
+The frontmatter is **stripped** before the body reaches the agent, and
+claude/codex receive it on **stdin** so a body starting with `--` is never
+misparsed as a flag.
 
 > Flags drift between CLI versions — if one changes, edit `build_agent_cmd()` in
 > `agent-queue.sh` (it's the single place each engine is mapped).
@@ -82,11 +89,16 @@ already have a `---` block.
 | `init` | create the `queue/` folders |
 | `add [--engine E] [--cwd P] [--yolo\|--no-yolo]` | queue a prompt into `inbox/` |
 | `run [--max N] [--engine E] [--once]` | process the inbox (foreground loop) |
-| `status` | kanban counts + running-worker table |
+| `status` | kanban counts + running-worker table (marks `⚠ stalled` workers) |
 | `watch [interval]` | live `status` (bash), redrawn every N seconds (default 2) |
-| `dash [--interval N]` | richer **Node** live dashboard — running workers (engine, elapsed, last log line) + recent done/failed |
+| `dash [--interval N]` | richer **Node** live dashboard — running workers (engine, elapsed, last log line, stall) + recent done/failed |
 | `stop` | kill running workers + the run loop |
 | `logs [-f]` | print / follow a job's log |
+| `requeue ` | move a failed job back to `inbox/` for a fresh run |
+| `clean [--keep N]` | archive finished logs+meta beyond the newest N (default 50) into `queue/.archive/` |
+
+Only one `run` loop may be active per queue — a second `run` against the same
+queue is refused while the first is alive (a stale `daemon.pid` is cleared).
 
 ## Via `bytelyst-cli.sh`
 
@@ -101,12 +113,14 @@ Wired into the repo's unified CLI (no GitHub token required for this subcommand)
 
 ```
 queue/
-  inbox/ # drop / queued .md files (oldest picked first)
+  inbox/ # drop / queued .md files (oldest eligible picked first)
   doing/ # currently executing
   done/ # exited 0
-  failed/ # non-zero exit (or bad cwd)
+  failed/ # non-zero exit, bad cwd, or timeout (result=timeout)
   logs/ # .log — full agent output
+  locks/ # per-key flock files (Linux hardening; unused on macOS)
   .state/ # .meta heartbeats + daemon.pid (runtime only)
+  .archive/ # / — logs+meta moved here by `clean`
 ```
 
 ## Config (env overrides)
@@ -117,7 +131,9 @@ queue/
 | `AGENT_QUEUE_MAX` | `2` | max concurrent agents |
 | `AGENT_QUEUE_ENGINE` | `devin` | default engine when none in frontmatter |
 | `AGENT_QUEUE_POLL` | `3` | inbox poll interval (seconds) |
+| `AGENT_QUEUE_STALL_MIN` | `10` | minutes of unchanged log before a worker is `⚠ stalled` |
 | `DEVIN_BIN` / `CLAUDE_BIN` / `CODEX_BIN` | autodetected | override CLI binary paths |
+| `FLOCK_BIN` / `TIMEOUT_BIN` | autodetected | `flock` (lock hardening) and `timeout`/`gtimeout` (hard timeouts); absent on stock macOS — see notes |
 
 ## ⚠️ Safety
 
@@ -126,12 +142,24 @@ run shell commands, and commit unattended. Mitigate:
 
 - Prefer **scope-locked** prompt files (e.g. "edit only under `dashboards/tracker-web/`").
 - Tell prompts **not to `git push`** — review commits before they leave your machine.
-- Avoid queueing two tasks that touch the **same repo** concurrently (git contention).
- Use `--max 1` if all tasks share a repo.
+- **Same-repo safety is automatic:** jobs sharing a `cwd` (or `lock:` key) are
+  serialized, so two agents never run in one repo at once — even at `--max 2+`.
+- Set a `timeout:` on long jobs so a wedged agent can't run forever.
 - Watch cost: each job is a full agent session.
+
+### Portability notes
+
+- **macOS** has no `flock`/`timeout`; locking relies on the single run-loop
+  (enforced by the second-run refusal) and timeouts use a pure-bash watchdog.
+  Install coreutils (`gtimeout`) for hard process-tree kills.
+- **Linux** (incl. Gitea CI) uses `flock` + `timeout` for cross-process hardening.
 
 ## Roadmap / nice-to-haves
 
-- `--push` opt-in + per-repo lock to serialize same-repo jobs automatically.
-- Node/TS rewrite with a richer live TUI dashboard.
-- `done`-folder retention / archive by date.
+- [x] Per-repo lock to serialize same-repo jobs automatically (`lock:` / cwd).
+- [x] Per-job `timeout:` with hard kill (or bash watchdog fallback).
+- [x] Stall detection in `status`/`dash`.
+- [x] `requeue` failed jobs + `clean`/archive old runs.
+- [ ] `--push` opt-in policy + commit review gate.
+- [ ] Optional notifications (Slack/desktop) on done/failed/stall.
+- [ ] Persisted run-loop as a daemon/service with auto-restart.
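
Review note: the new `timeout:` frontmatter key accepts values of the form `90s|45m|2h|1d`. The patch does not show how `agent-queue.sh` parses them, so the following is only a minimal sketch of one way such a value could be turned into seconds. The function name `duration_to_seconds` is hypothetical and is not taken from the script. It uses only bash built-ins, so it would also run on stock macOS, where `timeout`/`gtimeout` may be absent.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from agent-queue.sh): convert a `timeout:` value
# such as 90s, 45m, 2h, or 1d into whole seconds.
duration_to_seconds() {
  local value="$1"
  # Accept a non-negative integer followed by exactly one unit letter.
  if [[ ! "$value" =~ ^([0-9]+)([smhd])$ ]]; then
    echo "invalid duration: $value" >&2
    return 1
  fi
  local amount="${BASH_REMATCH[1]}" unit="${BASH_REMATCH[2]}"
  # 10# forces base 10 so a value like 08m is not read as octal.
  case "$unit" in
    s) echo $(( 10#$amount )) ;;
    m) echo $(( 10#$amount * 60 )) ;;
    h) echo $(( 10#$amount * 3600 )) ;;
    d) echo $(( 10#$amount * 86400 )) ;;
  esac
}
```

For example, `duration_to_seconds 45m` prints `2700`, and an unsupported value such as `10x` returns a non-zero status. A caller could treat that status as a reason to reject the job rather than run it without a limit, though whether the real script does so is not stated in this patch.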