Two runner-side reliability improvements (additive, opt-out via env where relevant):
- Lease-renewal retry: a renewal lost to a transient blip (timeout/5xx/proxy)
previously let the lease expire and the coordinator reclaim a still-running job,
wasting the work. fleet_lease_renew now retries a TRANSIENT failure a few times
with a short backoff (AQ_FLEET_RENEW_RETRIES=2, AQ_FLEET_RENEW_BACKOFF_SEC=2);
a 409/412 FENCE is terminal and never retried.
- Graceful drain: a new `drain` command signals a running loop (via a $STATE/draining
flag) to stop taking NEW work (coordinator claim + local inbox), let in-flight
jobs finish, release their leases, and exit cleanly — ideal before a deploy.
Distinct from `stop` (kills workers immediately) and the `--drain`/`--once` startup
flag (drains the queue to empty). The flag is cleared at run-loop start and on exit.
bash -n clean on both files; ./selftest.sh PASS; `drain` smoke-tested.
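A minimal sketch of the transient-only retry described above, not the actual fleet_lease_renew. The names renew_with_retry and renew_once are hypothetical; renew_once stands in for the real HTTP call and is assumed to print a status code (anything that is not 2xx, 409 or 412 is treated as transient here).

```shell
# Placeholder for the real renewal request; assumed to print an HTTP status code.
renew_once() { echo 200; }

# Returns 0 on success, 2 on a fence (409/412), 1 once transient retries run out.
renew_with_retry() {
  local retries="${AQ_FLEET_RENEW_RETRIES:-2}"
  local backoff="${AQ_FLEET_RENEW_BACKOFF_SEC:-2}"
  local attempt=0 code
  while :; do
    code="$(renew_once)"
    case "$code" in
      2??) return 0 ;;          # lease renewed
      409|412) return 2 ;;      # fence: terminal, never retried
    esac
    # Anything else (timeout, 5xx, proxy error) is treated as transient.
    if [ "$attempt" -ge "$retries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$backoff"
  done
}
```

With the defaults this makes at most three attempts (one initial call plus two retries) before giving up.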
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Three runner-side robustness fixes (behavior-preserving, opt-out where relevant):
- resolve_engine now availability-checks an EXPLICIT engine (mirroring the
engine-class path): if the requested engine's binary isn't installed it emits
the no-engine signal so the job is marked no_engine, instead of invoking a
missing binary and surfacing a generic crash.
- The run-loop INT/TERM trap now best-effort releases leases for in-flight
building/ jobs (new fleet_release_all_active) so a stopped factory's jobs are
reclaimable immediately rather than waiting out the ~900s lease TTL.
- _cache_prune GCs cached repo checkouts under $STATE/repos not accessed in
AQ_FLEET_CACHE_TTL_DAYS days (default 14; 0 disables), run once at run-loop
startup, to stop unbounded disk growth. Guards against rm on an empty base path.
bash -n passes on both files; ./selftest.sh PASS.
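A rough sketch of a TTL-based prune with the empty-path guard, assuming access time is judged from the top-level checkout directory. The function name cache_prune is hypothetical and the real _cache_prune may differ.

```shell
# Remove top-level checkouts under <base> not accessed within the TTL.
cache_prune() {
  local base="$1"                                  # e.g. "$STATE/repos"
  local ttl_days="${AQ_FLEET_CACHE_TTL_DAYS:-14}"
  # Guard: never run rm against an empty, root, or missing base path.
  [ -n "$base" ] && [ "$base" != "/" ] && [ -d "$base" ] || return 0
  # 0 (or a non-numeric value) disables pruning.
  [ "$ttl_days" -gt 0 ] 2>/dev/null || return 0
  # -atime +N matches entries last accessed more than N full days ago.
  find "$base" -mindepth 1 -maxdepth 1 -type d -atime +"$ttl_days" \
    -exec rm -rf -- {} +
}
```

On filesystems mounted with noatime the access time is not updated on reads, so an mtime-based check may be a safer choice there.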
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
When materializing a claimed fleet job, write `engine: <pick>` into the job
frontmatter (resolve_engine then runs it). Only a KNOWN engine
(devin/claude/codex/copilot) is honored — never the run's 'unknown'/class
placeholder — so an engineless job still falls back to the factory default
(AGENT_QUEUE_ENGINE). No behavior change for existing jobs.
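A small sketch of the known-engine filter, under the assumption that the pick arrives as a plain string. The helper name engine_frontmatter_line is hypothetical.

```shell
# Print an "engine: <pick>" frontmatter line only for a known engine.
engine_frontmatter_line() {
  case "$1" in
    devin|claude|codex|copilot) printf 'engine: %s\n' "$1" ;;
    *) : ;;  # 'unknown', class placeholders, or empty: emit nothing
  esac
}
```

Emitting nothing leaves the frontmatter without an engine key, so the factory default still applies.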
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
parse_usage now always emits engine=<engine>, and the devin arm extracts the
ATIF export's session_id. fleet-client includes engine + sessionId in the run
insights it reports, so the coordinator/UI can show the real engine (not
'unknown') and a session handle for traceability/recovery (devin --resume <id>).
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Without AQ_FLEET_PR=1 + AQ_FLEET_REPO_BASE a job's repo is ignored and the agent
just runs the prompt in the sandbox cwd (no PR). Add both (PR on by default,
REPO_BASE = the repos' parent dir; FLEET_PR=0 to opt out) + a PRODUCTS
subset-restart note so a busy factory can be left running.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Adds agent-queue-boot.sh (PATH repair + ~/.agent-queue.env overrides + caffeinate
wrap) and launchd/ (install.sh + README) so the run loop auto-starts on login and
survives reboot/crash — the persistence layer tmux+caffeinate alone cannot give.
No secrets tracked (host config lives in untracked ~/.agent-queue.env).
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Extends deploy-gigafactory.sh to optionally start the web tracker (tracker-web)
alongside platform-service and registered factories: adds --full (backend +
register + tracker), --with-tracker, --tracker-only, a per-process pid file with
child-aware --stop, and waits for both to be healthy. Gitignore the new runtime
tracker pid.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The operational _start_fleet.sh lives in a local (untracked) sandbox, so the
gate + heartbeat-cadence settings weren't version-controlled anywhere. Add
demo/start-fleet.example.sh: a parameterized, sanitized launcher (one
agent-queue.sh run daemon per product against a live platform-service) that
ships the two settings you must get right — AQ_FLEET_GATE=1 (M0 RU gate) and
AQ_FLEET_LEASE_RENEW_SEC=30 (heartbeat cadence < the 90s stale threshold). No
hardcoded paths/secrets; everything env-overridable. Documented in demo/README.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- System overview: mark Phase 4 in-progress (M0 RU gate shipped), add
fleet_queue_state container + GET /fleet/queue-state, document the heartbeat
cadence vs 90s stale gotcha, the tracker-web caps=build form bug, the missing
deregister API, and the `ended=` race fix; drop the now-false "roadmap §0 stale"
and "boxes 384/386 unticked" claims (both reconciled); link the redesign doc.
- Roadmap: §0 Phase 4 -> in progress (M0); align the Phase-2 §8 spec endpoint
sketches to the as-built API (/fleet/factories/enroll, /factories/heartbeat,
/fleet/claim) + note the heartbeat cadence and the M0 gate.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
run_worker marked the job ended (testing) right after moving to testing/, BEFORE
opening/merging the PR and reporting to the coordinator. Once ended= is written,
_meta_active returns false, active_workers drops to 0, and "run --once" could
drain-exit (and callers could observe completion) while the background worker was
still opening the PR — a real race that made the PR-mode selftest flaky and could
free a concurrency slot prematurely in production.
Move the ended= write to the end of the success path (after PR open/merge +
testing/shipped reports). No behavior change on the autoship/ship path. Full
selftest now passes deterministically across repeated runs.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Adds AQ_FLEET_GATE (default OFF): the run loop point-reads the cheap per-product
queue version (GET /fleet/queue-state) and SKIPS the expensive /fleet/claim while
the version is unchanged and it is not mid-drain, with a periodic safety backstop
and fail-open-on-read-error so work is never stranded. Keeps POLL_SECONDS for
local job responsiveness rather than raising it globally. selftest 39b covers the
gate decisions; reconciles the M0 section of the dispatch redesign doc.
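The gate decision can be sketched roughly as below. This is an illustration only: the function name gate_decide, its argument order, and the backstop default of 20 skipped polls are assumptions, not values taken from the runner.

```shell
# Prints "claim" or "skip" for one poll of the run loop.
gate_decide() {
  local current="$1"        # version just read from queue-state; empty = read failed
  local last="$2"           # version seen when the last claim was made
  local draining="$3"       # 1 while the loop considers itself mid-drain
  local idle_polls="$4"     # consecutive polls skipped so far
  local backstop="${5:-20}" # assumed: force a claim after this many skips

  if [ -z "$current" ]; then echo claim; return; fi                  # fail open on read error
  if [ "$draining" = "1" ]; then echo claim; return; fi              # mid-drain: keep claiming
  if [ "$current" != "$last" ]; then echo claim; return; fi          # queue version changed
  if [ "$idle_polls" -ge "$backstop" ]; then echo claim; return; fi  # periodic safety backstop
  echo skip
}
```

Checking the read failure first is what keeps work from being stranded when the cheap read is unavailable.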
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Proposes moving fleet work-dispatch off Cosmos busy-polling onto Azure Service
Bus in a coordinator-owns-scheduling / broker-owns-delivery hybrid, fixing the
product-as-queue routing smell and the idle-poll RU cost. Includes phased
migration (M0 RU quick win -> shadow -> cutover -> scale-to-zero) with a ticked
checklist. Self-reviewed (v2) for the outbox/change-feed, message-size,
long-job lock, idempotency, and routing-model consistency issues.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Devin's export has tokens but no USD cost; estimate cost_usd from a per-model
$/1M price map (Opus/Sonnet/Haiku) and flag usage_estimated so the dashboard
shows it as approx.
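A sketch of the arithmetic behind such an estimate. The helper name estimate_cost_usd is hypothetical, the per-1M prices are supplied by the caller (no real price values are assumed), and cached tokens are ignored for simplicity.

```shell
# cost_usd ~= (tokens_in * price_in + tokens_out * price_out) / 1,000,000
estimate_cost_usd() {
  local tokens_in="$1" tokens_out="$2" price_in_per_m="$3" price_out_per_m="$4"
  awk -v ti="$tokens_in" -v to="$tokens_out" \
      -v pin="$price_in_per_m" -v pout="$price_out_per_m" \
      'BEGIN { printf "%.6f\n", (ti * pin + to * pout) / 1000000 }'
}
```

For example, `estimate_cost_usd 1000000 500000 2 10` uses made-up prices of $2 and $10 per 1M tokens.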
When AQ_FLEET_REPO_BASE/<repo> is an existing checkout, create a git worktree off it
for branch aq/job/<id> (shares objects + remotes, leaves the main checkout
untouched) instead of cloning. Falls back to clone for remote-only repos. selftest
exercises the worktree path.
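The worktree-or-clone choice might look like the sketch below. The helper name checkout_mode is hypothetical, and testing for a .git entry is an assumed way of recognising an existing checkout.

```shell
# Prints "worktree" when <base>/<repo> is an existing checkout, else "clone".
checkout_mode() {
  local base="$1" repo="$2"
  # .git is a directory in a normal checkout and a file in a linked worktree.
  if [ -n "$base" ] && [ -e "$base/$repo/.git" ]; then
    echo worktree
  else
    echo clone
  fi
}
```

In the worktree case the branch would then be created with something like `git -C "$base/$repo" worktree add -b "aq/job/$id" "$dest" "$base_branch"`, where dest and base_branch are placeholders.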
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Claim now carries verify (drives the existing verify gate -> PR opens only if it
passes) and autoMerge (squash-merge via gh pr merge after the PR opens, non-fatal).
selftest covers both.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
In PR mode the agent is asked to write .aq_pr.md (line 1 = PR title, then a markdown
description) based on the task + the diff it produced. The factory reads it for
`gh pr create` (via --body-file) and removes it before committing (never part of the
PR). Falls back to a derived title if absent. selftest asserts the authored title is
used and .aq_pr.md is not committed.
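A minimal sketch of reading the authored file, assuming line 1 is the title and everything after it is the body. The helper names are hypothetical.

```shell
# Title is the first line; body is everything from line 2 onward.
pr_title() { head -n 1 "$1"; }
pr_body()  { tail -n +2 "$1"; }

# Use the authored title when the file exists and is non-empty, else the fallback.
pr_title_or_fallback() {
  local file="$1" fallback="$2" title=""
  if [ -f "$file" ]; then
    title="$(pr_title "$file")"
  fi
  printf '%s\n' "${title:-$fallback}"
}
```

The body can be written to a temporary file and passed to `gh pr create` with `--body-file`.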
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
MVP: when AQ_FLEET_REPO_BASE/<repo> is an existing local checkout, use it as the
clone source (fast, no network) and push/PR to its GitHub origin — embedded creds
in the local origin URL are stripped (gh credential helper handles auth). Selftest
PASS (full-path bare-repo fallback unchanged).
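Stripping embedded credentials from an origin URL can be sketched with a single sed substitution. The helper name strip_url_creds is hypothetical, and only scheme://user[:pass]@host URLs are rewritten; scp-style remotes are left alone.

```shell
# Remove a "user[:password]@" userinfo part from a scheme:// URL.
strip_url_creds() {
  printf '%s\n' "$1" | sed -E 's#^([a-zA-Z][a-zA-Z0-9+.-]*://)[^/@]*@#\1#'
}
```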
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
When AQ_FLEET_PR=1 and a claimed fleet job carries a `repo`, run the agent in an
isolated checkout on branch aq/job/<fleetJobId> (off baseBranch), then on a passing
verify commit/push and `gh pr create`. The PR URL + branch are recorded in the meta
and reported on lease release (-> the coordinator stores them on the run).
- fleet-client: parse repo/baseBranch from the claim, carry them in frontmatter;
fleet_report_insights now sends prUrl/branch.
- _fleet_pr_prepare (clone/fetch + branch, local-path aware, identity fallback) and
_fleet_pr_open (commit/push/gh pr create). WIP checkpointing is skipped for PR jobs
(the pushed branch is the durable artifact).
- New flags: AQ_FLEET_PR, AQ_FLEET_REPOS_DIR, GH_BIN. README documented.
- selftest: +1 case (bare-repo origin + gh stub) — branch pushed, PR opened, prUrl
reported on release. Full self-test PASS.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Previously the factory reported up to `review` and "shipping is always manual",
so a coordinator job never reached a terminal stage autonomously.
- On a passing local verify, always report `testing` to the coordinator so its
stage reflects that QA passed (was stuck at `review`).
- New AQ_FLEET_AUTOSHIP=1: the factory's verify gate IS the test phase, so advance
the coordinator job testing -> shipped and land it in shipped/ locally. This
closes the testing->shipped gap for an autonomous submit -> shipped pipeline.
Default off keeps the human review gate authoritative (job rests at testing).
selftest: +2 cases (autoship reports testing+shipped + lands in shipped/; autoship
OFF reports testing but withholds shipped). Full self-test PASS.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
7-doc kit mapping the JD competency matrix to the ByteLyst ecosystem:
ecosystem-as-RAG-fabric architecture, competency deep-dives, STAR bank,
enhancement roadmap, banking blueprints, and a glossary quick-ref.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Devin does not surface token/cost in its stdout or local log, so parse_usage
previously emitted nothing for the devin engine (runs showed no metrics). Devin
DOES expose per-step usage in its ATIF conversation export.
- build_agent_cmd: pass `--export <path>` for the devin engine (path derived from
the job log path so parse_usage can find it; harmless 4th arg for other engines).
- parse_usage devin: read the export and sum per-step metadata.metrics
input_tokens / output_tokens / cache_read_tokens; take model from agent.model_name.
Pure grep/awk, no new dependency. USD cost is left unset (the export carries token
counts but not cost) — the dashboard shows tokens + model, cost stays blank.
These feed fleet_report_insights, so live devin fleet runs now report tokens +
model to the coordinator (verified live: model "Claude Opus 4.8", tokensIn/out +
cache populated on a real run).
selftest: +1 case (parse_usage devin sums per-step tokens + model from --export).
Full self-test PASS.
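A grep/awk sketch of the per-step summation. It assumes the export is JSON in which each metric appears as `"<key>": <integer>`; the actual export layout is not reproduced here, and the helper name sum_json_int_key is hypothetical.

```shell
# Sum every integer value of "<key>": N found in <file>; prints 0 if none.
sum_json_int_key() {
  local key="$1" file="$2"
  { grep -oE "\"$key\"[[:space:]]*:[[:space:]]*[0-9]+" "$file" || true; } \
    | awk -F: '{ sum += $2 } END { print sum + 0 }'
}
```

Calling it once each for input_tokens, output_tokens and cache_read_tokens gives the three totals.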
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
#1 fleet_report_insights: on a successful fleet run the factory now reports the
parsed cost/token/effort metrics (model, tokensIn/Out/cached, costUsd, turns,
toolCalls) plus the run result onto the coordinator run via POST
.../lease/release (which also frees the lease). parse_usage already extracted
these into the job meta; they were never sent. Engines that do not expose usage
locally (devin) still land result + endedAt.
#2 normalize AQ_FLEET_API: platform-service mounts fleet under /api, so a base
without it silently returned 404 on every call. Strip a trailing slash and
append /api unless already present, so AQ_FLEET_API=http://host:4003 works too.
selftest: +2 cases (insights reported via lease/release; API-base normalization).
Full self-test PASS.
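The API-base normalization from #2 can be sketched as follows. The helper name normalize_fleet_api is hypothetical.

```shell
# Strip one trailing slash, then append /api unless the base already ends with it.
normalize_fleet_api() {
  local base="${1%/}"
  case "$base" in
    */api) printf '%s\n' "$base" ;;
    *)     printf '%s\n' "$base/api" ;;
  esac
}
```

Both `http://host:4003` and `http://host:4003/api/` normalize to `http://host:4003/api`.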
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Document a phased roadmap for the single-VM deployment layer (build-off-VM,
recreate-in-place to cut downtime, change-detection + BuildKit guarantee,
image slimming + resource caps, artifact-based rollback). Scoped to deploy
orchestration; defers image-build internals to docker-build-optimization-roadmap.
Register the doc in repo-map.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Document the two ways @bytelyst/* packages resolve (local workspace links
vs Gitea npm registry for Docker/CI), the common 'registry offline' local-dev
failure and its fix (sibling directory layout, not a token), and the
deploy-side 'package not published' / token issues with remediation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document how the daemon + agents must run after a review found jobs executing
in --yolo/dangerous mode directly against live working trees (the root cause of
repo dirtiness + duplicate commits). Policy: per-job worktree off origin/main,
branch-per-task + PR, yolo:false by default (dangerous only in disposable
sandboxes), clean-tree contract, one writer per repo. Linked from the README.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two bugs caused duplicate items on re-run. First, the dedupe list request used
limit=500, but the server caps limit at 100, so it returned 400; that failure was
silently treated as an empty set, so every item looked new. Second, meta productIds
weren't registered, so GET /items 400'd ("Unknown product"). The seeder now registers
every referenced product first (idempotent) and lists with limit=100; dedupe failures
are logged loudly. Verified idempotent: re-run skips all 16.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
deploy-gigafactory.sh loads platform-service/.env, starts the fleet backend,
waits for /health, and registers the ecosystem products (idempotent) so live
/api/fleet/* calls resolve. Supports --stop / --register-only / --no-register.
Registered the 11 ecosystem products against the configured Cosmos during a
live run; note fleet metrics needs a composite index on real Azure Cosmos.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add a Work Tracking entry to README Primary Entry Points and a short pointer
in CLAUDE.md, both routing to scripts/tracker-seed/ and the AGENTS.md section.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add a "Cutting Tracker Items" section to AGENTS.md and register
scripts/tracker-seed/ in docs/repo-map.md so future "cut items to track"
requests route to the seed tooling instead of ad-hoc API calls.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every
reference (README, system-overview code-map, and all phase job specs). Add an
index README that lists the docs and points to the companion docs in
learning_ai_common_plat. Docs-only; no behavior change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Files the ENGINEERING_REVIEW_SCORECARD.md P0-P3 action plan as tracker items
(one per affected product) via the platform-service POST /api/items API.
Dependency-free Node seeder mints an HS256 token from $JWT_SECRET, dedupes by
title, and supports --dry-run. No live writes performed (stack is down); run
the script once the platform stack is up.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Move GIGAFACTORY_ROADMAP.md and GIGAFACTORY_SYSTEM_OVERVIEW.md under
agent-queue/docs/gigafactory/ so the scattered top-level docs are easy to
discover. Update the README links, the overview code-map, and all phase
job-spec source-of-truth paths to the new location. Pure docs move; no
behavior change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Jobs move through .state/{inbox,building,testing,review,failed,shipped,logs} at
runtime, which constantly dirtied the repo and blocked clean rebases. Ignore
the per-job lifecycle files (keeping each dir via .gitkeep) and stop tracking
the consumed inbox job instances.
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>