- docs/GIGAFACTORY_ROADMAP.md: distributed multi-machine fleet vision
(factory x tool x profile routing) as a checklist-driven, phased
implementation roadmap (Phase 0-5) with acceptance criteria, verify
gates, and a 100% Definition-of-Done rubric
- committed path: coordinator as a platform-service module + control
plane on tracker-web, reached via a thin tracker adapter first; bash
runner survives as the offline edge factory agent
- README: add vision/roadmap pointer
Turn dash into a menu-driven control panel (single mjs script):
- numbered, arrow/j-k/1-9 selectable JOBS list (review/testing/failed/inbox)
- single-key actions wired to agent-queue.sh (single source of truth):
p promote, s ship, x reject, u requeue (reject/requeue confirm y/n)
- enter/l opens a live log viewer; r starts a detached run loop, S stops it
- run-loop pid indicator, transient action flashes, ? help overlay
- non-TTY falls back to the read-only live view
- README: dash command + interactive key table
- requeue <job>: move a failed job back to inbox/ and drop stale meta/body so
it re-runs cleanly
- clean [--keep N]: archive finished jobs' logs+meta beyond the newest N
(default 50) into queue/.archive/<ts>/; running jobs + .md records untouched
- document both in usage + bytelyst-cli subcommand list
Record pidstart (ps lstart) at launch and verify it in all liveness checks
(_meta_active, status, stop) via _pid_alive, so a recycled pid can never be
mistaken for our worker. Falls back to plain liveness when no start time recorded.
cmd_run now checks daemon.pid liveness up front: if a run loop is alive it exits
with an error (protecting the single-launcher invariant locking depends on); a
stale daemon.pid (dead pid) is cleared and the run proceeds.
Replace live_workers with reservation-aware active_workers + shared _meta_active:
a job counts toward --max the moment its meta is written (before the worker is
backgrounded), so --max can never be exceeded. A <30s guard prevents a meta
orphaned mid-launch from pinning a slot. busy_keys now shares _meta_active.
Mark a running worker '⚠ stalled' when its log has not changed for more than
AGENT_QUEUE_STALL_MIN minutes (default 10), using log mtime as the freshness
signal. Implemented in both the bash status table and the Node dashboard.
Honor 'timeout: 45m' (90s|45m|2h|1d) by wrapping the agent in timeout/gtimeout
when available (hard process-tree kill), else a portable bash watchdog. On expiry
the job moves doing->failed with result=timeout and a TIMED OUT log line.
Serialize jobs by lock key (frontmatter 'lock:' override, default cwd) via the
single run-loop's pre-launch eligibility check; the oldest non-busy job is picked
regardless of --max. Adds a flock-based worker guard where flock exists (Linux);
macOS relies on the single-daemon model. Records lock= in job meta.