docs(agent-queue): document locking, timeout, stall, requeue/clean

Update README command table (requeue/clean, stall marker, single-run note),
frontmatter (lock/timeout), engine mapping (stdin), config (STALL_MIN,
FLOCK_BIN/TIMEOUT_BIN), folder layout (locks/.archive), Safety (automatic
same-repo serialization + portability notes), and mark roadmap items done.
This commit is contained in:
saravanakumardb1 2026-05-28 22:33:20 -07:00
parent 1f15520c4f
commit c52c165fd6

View File

@ -6,8 +6,9 @@ and they get executed (in auto-approve mode) one slot at a time, moving through
`inbox → doing → done/failed` with live status.
> **Why this exists:** the agent CLIs ship a minimal local interface (no built-in
> batch/queue/dashboard — that lives in their *cloud* products). This is the ~250-line
> glue that turns "run one prompt interactively" into "queue many and walk away."
> batch/queue/dashboard — that lives in their *cloud* products). This is the
> zero-dependency bash glue that turns "run one prompt interactively" into
> "queue many and walk away."
---
@ -55,6 +56,8 @@ which directory to run in, and whether to auto-approve:
engine: devin # devin | claude | codex (default: $AGENT_QUEUE_ENGINE)
cwd: /abs/path/to/repo # where the agent executes (default: cwd when added)
yolo: true # auto-approve ALL tools (default: true)
lock: my-repo # optional mutex key (default: cwd). Jobs sharing a key run serially
timeout: 45m # optional. 90s|45m|2h|1d. On expiry → failed (result=timeout)
---
# Your task / roadmap goes here
@ -68,9 +71,13 @@ already have a `---` block.
| `engine:` | Command run | Auto-approve flag (`yolo: true`) |
| --------- | ----------- | -------------------------------- |
| `devin` | `devin -p --prompt-file <file>` | `--permission-mode dangerous` |
| `claude` | `claude -p "<file contents>"` | `--dangerously-skip-permissions` |
| `codex` | `codex exec "<file contents>"` | `--dangerously-bypass-approvals-and-sandbox` |
| `devin` | `devin -p --prompt-file <body>` | `--permission-mode dangerous` |
| `claude` | `claude -p` (body on **stdin**) | `--dangerously-skip-permissions` |
| `codex` | `codex exec` (body on **stdin**) | `--dangerously-bypass-approvals-and-sandbox` |
The frontmatter is **stripped** before the body reaches the agent, and
claude/codex receive it on **stdin** so a body starting with `--` is never
misparsed as a flag.
> Flags drift between CLI versions — if one changes, edit `build_agent_cmd()` in
> `agent-queue.sh` (it's the single place each engine is mapped).
@ -82,11 +89,16 @@ already have a `---` block.
| `init` | create the `queue/` folders |
| `add <file> [--engine E] [--cwd P] [--yolo\|--no-yolo]` | queue a prompt into `inbox/` |
| `run [--max N] [--engine E] [--once]` | process the inbox (foreground loop) |
| `status` | kanban counts + running-worker table |
| `status` | kanban counts + running-worker table (marks `⚠ stalled` workers) |
| `watch [interval]` | live `status` (bash), redrawn every N seconds (default 2) |
| `dash [--interval N]` | richer **Node** live dashboard — running workers (engine, elapsed, last log line) + recent done/failed |
| `dash [--interval N]` | richer **Node** live dashboard — running workers (engine, elapsed, last log line, stall) + recent done/failed |
| `stop` | kill running workers + the run loop |
| `logs <job> [-f]` | print / follow a job's log |
| `requeue <job>` | move a failed job back to `inbox/` for a fresh run |
| `clean [--keep N]` | archive finished logs+meta beyond the newest N (default 50) into `queue/.archive/` |
Only one `run` loop may be active per queue — a second `run` against the same
queue is refused while the first is alive (a stale `daemon.pid` is cleared).
## Via `bytelyst-cli.sh`
@ -101,12 +113,14 @@ Wired into the repo's unified CLI (no GitHub token required for this subcommand)
```
queue/
inbox/ # drop / queued .md files (oldest picked first)
inbox/ # drop / queued .md files (oldest eligible picked first)
doing/ # currently executing
done/ # exited 0
failed/ # non-zero exit (or bad cwd)
failed/ # non-zero exit, bad cwd, or timeout (result=timeout)
logs/ # <job>.log — full agent output
locks/ # per-key flock files (Linux hardening; unused on macOS)
.state/ # <job>.meta heartbeats + daemon.pid (runtime only)
.archive/ # <ts>/ — logs+meta moved here by `clean`
```
## Config (env overrides)
@ -117,7 +131,9 @@ queue/
| `AGENT_QUEUE_MAX` | `2` | max concurrent agents |
| `AGENT_QUEUE_ENGINE` | `devin` | default engine when none in frontmatter |
| `AGENT_QUEUE_POLL` | `3` | inbox poll interval (seconds) |
| `AGENT_QUEUE_STALL_MIN` | `10` | minutes of unchanged log before a worker is `⚠ stalled` |
| `DEVIN_BIN` / `CLAUDE_BIN` / `CODEX_BIN` | autodetected | override CLI binary paths |
| `FLOCK_BIN` / `TIMEOUT_BIN` | autodetected | `flock` (lock hardening) and `timeout`/`gtimeout` (hard timeouts); absent on stock macOS — see notes |
## ⚠️ Safety
@ -126,12 +142,24 @@ run shell commands, and commit unattended. Mitigate:
- Prefer **scope-locked** prompt files (e.g. "edit only under `dashboards/tracker-web/`").
- Tell prompts **not to `git push`** — review commits before they leave your machine.
- Avoid queueing two tasks that touch the **same repo** concurrently (git contention).
Use `--max 1` if all tasks share a repo.
- **Same-repo safety is automatic:** jobs sharing a `cwd` (or `lock:` key) are
serialized, so two agents never run in one repo at once — even at `--max 2+`.
- Set a `timeout:` on long jobs so a wedged agent can't run forever.
- Watch cost: each job is a full agent session.
### Portability notes
- **macOS** has no `flock`/`timeout`; locking relies on the single run-loop
(enforced by the second-run refusal) and timeouts use a pure-bash watchdog.
Install coreutils (`gtimeout`) for hard process-tree kills.
- **Linux** (incl. Gitea CI) uses `flock` + `timeout` for cross-process hardening.
## Roadmap / nice-to-haves
- `--push` opt-in + per-repo lock to serialize same-repo jobs automatically.
- Node/TS rewrite with a richer live TUI dashboard.
- `done`-folder retention / archive by date.
- [x] Per-repo lock to serialize same-repo jobs automatically (`lock:` / cwd).
- [x] Per-job `timeout:` with hard kill (or bash watchdog fallback).
- [x] Stall detection in `status`/`dash`.
- [x] `requeue` failed jobs + `clean`/archive old runs.
- [ ] `--push` opt-in policy + commit review gate.
- [ ] Optional notifications (Slack/desktop) on done/failed/stall.
- [ ] Persisted run-loop as a daemon/service with auto-restart.