Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every reference (README, system-overview code-map, and all phase job specs). Add an index README that lists the docs and points to the companion docs in learning_ai_common_plat. Docs-only; no behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
yolo: true
lock: agent-queue
timeout: 4h
---

ROLE: Senior bash + distributed-systems engineer. Implement PHASE 2 SLICE 3 —
FACTORY-AGENT INTEGRATION: make the single-host `agent-queue.sh` runner act as a
"factory" that registers / heartbeats / claims / reports against the already-merged
`fleet` coordinator in platform-service, **behind a feature flag**, while keeping
the existing offline git-queue path 100% intact when the flag is off.

NON-NEGOTIABLE DESIGN RULE (prevents merge churn + regressions):
- Put ALL coordinator-client logic in a NEW separate file `agent-queue/lib/fleet-client.sh`
  that `agent-queue.sh` sources. Touch `agent-queue.sh` only at a few well-defined hook
  points (claim source, stage-transition reporting, dispatch/help). The offline git-queue
  code path MUST be byte-for-byte behaviorally unchanged when `AQ_FLEET` is unset/0.
- Gate every coordinator interaction on `AQ_FLEET=1`. Default (unset) = today's offline
  behavior. All 53 existing selftest checks MUST still pass unchanged.

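A minimal sketch of the flag gate described above, for illustration only. `fleet_enabled` is named in this spec; `fleet_hook` is a hypothetical helper that is not part of the runner and only shows how a hook point can stay a no-op when the flag is off.

```bash
#!/usr/bin/env bash
# Sketch only. fleet_hook is a hypothetical name, not an existing runner function.

# True iff AQ_FLEET is exactly "1"; unset, empty, or 0 all mean offline mode.
fleet_enabled() {
  [ "${AQ_FLEET:-0}" = "1" ]
}

# Run a fleet function only when the flag is on. With the flag off this
# returns 0 without running anything, so the offline path never reaches
# coordinator code.
fleet_hook() {
  fleet_enabled || return 0
  "$@"
}
```

With this shape, a hook point in `agent-queue.sh` could be a single line such as `fleet_hook fleet_heartbeat`, which keeps the diff against the offline path small.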
READ FIRST (verify the real contract — do not guess):
- agent-queue/agent-queue.sh — the runner. Study: the manifest/lifecycle stages
  (queued→assigned→building→review→testing→shipped + blocked/failed/dead_letter),
  `run_worker`/`cmd_run`/`ship`/`promote`, the Slice-4 `tracker_api` curl wrapper +
  `_api_call` + awk JSON helpers (REUSE these patterns — POSIX awk, curl-only, no jq),
  and the Slice-4 auto-echo hooks. Mirror that style exactly.
- agent-queue/selftest.sh — how stub-driven HTTP tests work (the tracker stub overrides
  the curl wrapper). Build the fleet stub the same way.
- THE COORDINATOR CONTRACT (read-only, in the sibling repo
  ../learning_ai_common_plat/services/platform-service/src/modules/fleet/routes.ts):
  all routes are registered under the `/api` prefix. Exact endpoints:
    POST  /api/fleet/factories/heartbeat     {factoryId, capabilities[], health, load}
    POST  /api/fleet/claim                   {factoryId, capabilities[]} -> job + leaseEpoch + lease expiry (or empty)
    GET   /api/fleet/jobs/:id
    PATCH /api/fleet/jobs/:id                fenced stage transition: {stage, checkpoint?, leaseEpoch}
    POST  /api/fleet/jobs/:id/lease/renew    {leaseEpoch}
    POST  /api/fleet/jobs/:id/lease/release  {leaseEpoch}
    GET   /api/fleet/jobs/:id/runs
    GET   /api/fleet/jobs/:id/events
  Note: there is NO client-side "register factory" or "append event" endpoint — registration
  is the heartbeat upsert, and `fleet_events` are written SERVER-SIDE by the coordinator on
  each PATCH/claim. The coordinator owns `leaseEpoch` fencing: a PATCH/renew carrying a stale
  epoch is rejected (409/conflict).
- ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md §7 (claim loop),
  §8 (factory/heartbeat/claim/report/drain), §9 (split-brain/offline-degrade), §18 (fencing).

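The request bodies listed above can be assembled without jq. The following is a rough sketch, not the Slice-4 helpers themselves: the function names `fleet_json_caps`, `fleet_json_claim`, `fleet_json_report`, and `json_field` are hypothetical, the response shape used in the usage note is an assumption, and `leaseEpoch` is assumed to be numeric. `routes.ts` remains the authority on the real shapes.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for building and reading fleet JSON.
# No escaping is done, so values must not contain quotes or backslashes.

# "linux bash" -> ["linux","bash"]
fleet_json_caps() {
  local out="" cap
  for cap in "$@"; do
    out="${out:+$out,}\"$cap\""
  done
  printf '[%s]' "$out"
}

# Body for POST /api/fleet/claim: {factoryId, capabilities[]}
fleet_json_claim() {  # factory_id cap...
  local factory_id="$1"
  shift
  printf '{"factoryId":"%s","capabilities":%s}' "$factory_id" "$(fleet_json_caps "$@")"
}

# Body for PATCH /api/fleet/jobs/:id (optional checkpoint omitted here).
fleet_json_report() {  # stage lease_epoch
  printf '{"stage":"%s","leaseEpoch":%s}' "$1" "$2"
}

# Print the first value of KEY from JSON on stdin using POSIX awk.
# Handles plain strings and bare numbers/literals; escaped quotes are not handled.
json_field() {  # key
  awk -v key="$1" '
    {
      pat = "\"" key "\"[ \t]*:[ \t]*"
      if (match($0, pat)) {
        rest = substr($0, RSTART + RLENGTH)
        if (substr(rest, 1, 1) == "\"") {   # string value
          rest = substr(rest, 2)
          sub(/".*/, "", rest)
        } else {                            # number or bare literal
          sub(/[,} \t].*/, "", rest)
        }
        print rest
        exit
      }
    }'
}
```

For example, given an assumed claim response of `{"job":{"id":"job-42"},"leaseEpoch":7}`, piping it through `json_field leaseEpoch` yields `7`. Because the key pattern includes the opening quote, `json_field id` does not match `factoryId`.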
PREREQUISITE / BRANCHING:
- Branch off CURRENT `main` (Phase 1 complete; foundation + hardening merged).
  New branch: feat/gigafactory-p2-slice3. Commit in logical steps. Push + open a PR.
  DO NOT merge.

CONFIG BLOCK (env, in fleet-client.sh; document in README):
- AQ_FLEET (0/1, default 0 — master switch; 0 = pure offline git-queue)
- AQ_FLEET_API (default http://localhost:4003/api)
- AQ_FLEET_TOKEN (bearer; never hardcode)
- AQ_PRODUCT_ID (reuse the Slice-4 var; X-Product-Id header)
- AQ_FACTORY_ID (default: hostname + short rand; stable per process)
- AQ_FLEET_LEASE_RENEW_SEC (default 300), AQ_FLEET_CAPS (auto-detected caps override)

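One possible shape for that config block, as a sketch. The variable names and defaults come from the list above; the function name `fleet_load_config` and the exact factory-id format (host name plus four hex digits) are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: fleet_load_config is a hypothetical name.

fleet_load_config() {
  # Assign defaults only when the variable is unset or empty.
  : "${AQ_FLEET:=0}"
  : "${AQ_FLEET_API:=http://localhost:4003/api}"
  : "${AQ_FLEET_LEASE_RENEW_SEC:=300}"

  # AQ_FLEET_TOKEN and AQ_PRODUCT_ID deliberately get no default:
  # secrets and product ids must come from the environment.

  # Generate the factory id once; later calls keep the same value,
  # which keeps it stable for the life of the process.
  if [ -z "${AQ_FACTORY_ID:-}" ]; then
    AQ_FACTORY_ID="${HOSTNAME:-unknown-host}-$(printf '%04x' "$RANDOM")"
  fi
  export AQ_FLEET AQ_FLEET_API AQ_FLEET_LEASE_RENEW_SEC AQ_FACTORY_ID
}

fleet_load_config
```

Exporting `AQ_FACTORY_ID` lets child processes started by the runner see the same id rather than generating their own.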
DELIVERABLES

1. `agent-queue/lib/fleet-client.sh` (new) — a sourced library, curl-only + POSIX awk
   (reuse Slice-4 helpers; do not add deps):
   - `fleet_enabled` — returns true iff AQ_FLEET=1 (guard for every other fn).
   - `fleet_api METHOD PATH [json]` — curl wrapper adding bearer + X-Product-Id; returns
     body; captures HTTP code; non-2xx is logged and surfaced (never crashes the runner).
   - `fleet_detect_caps` — reuse the runner's existing capability auto-detection (os, engines,
     tools) to build the capabilities array.
   - `fleet_heartbeat` — POST factories/heartbeat (registration == first heartbeat); call at
     loop start + every AQ_FLEET_LEASE_RENEW_SEC during long runs.
   - `fleet_claim` — POST /fleet/claim with caps; parse job id + bodyMd + leaseEpoch + lease
     expiry; materialize a transient local job file (reuse the Slice-4 from-tracker
     materialization) so the existing runner executes it unchanged. Store leaseEpoch in the
     job meta.
   - `fleet_report STAGE [checkpoint]` — PATCH /fleet/jobs/:id with {stage, checkpoint?,
     leaseEpoch}. **Fencing-aware:** if the coordinator returns conflict/409 (stale epoch),
     the worker MUST self-abort the job (stop work, do NOT ship/merge) and log a fenced-abort
     event — a reclaimed/zombie worker can never corrupt coordinator state.
   - `fleet_lease_renew` / `fleet_lease_release` — fenced; renew on a timer while building;
     release on terminal stages.
   - `fleet_checkpoint` — capture {wipBranch, wipCommit} and send via fleet_report so a
     reclaim can resume (durability, §25).

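A hedged sketch of how `fleet_api` and the fencing-aware `fleet_report` from the list above might fit together. It does not reuse the real Slice-4 helpers, whose signatures are not shown here. The globals `FLEET_HTTP_CODE`, `FLEET_JOB_ID`, and `FLEET_LEASE_EPOCH`, and the return code 2 for a fenced abort, are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only. FLEET_HTTP_CODE, FLEET_JOB_ID and FLEET_LEASE_EPOCH are
# hypothetical globals; the real job meta may store these differently.

FLEET_HTTP_CODE=""

# fleet_api METHOD PATH [json]
# Prints the response body, records the status in FLEET_HTTP_CODE, and
# returns non-zero on non-2xx or transport failure without exiting the caller.
# Call it directly (not inside $(...)) when FLEET_HTTP_CODE is needed afterwards.
fleet_api() {
  local method="$1" path="$2" body="${3:-}" raw nl=$'\n'
  local args=(-sS -X "$method" -w '\n%{http_code}'
    -H "Authorization: Bearer ${AQ_FLEET_TOKEN:-}"
    -H "X-Product-Id: ${AQ_PRODUCT_ID:-}"
    -H "Content-Type: application/json")
  if [ -n "$body" ]; then
    args+=(-d "$body")
  fi
  raw="$(curl "${args[@]}" "${AQ_FLEET_API:-http://localhost:4003/api}${path}")" || {
    FLEET_HTTP_CODE="000"
    echo "fleet_api: $method $path unreachable" >&2
    return 1
  }
  FLEET_HTTP_CODE="${raw##*"$nl"}"   # last line is the status written by -w
  printf '%s\n' "${raw%"$nl"*}"      # everything before it is the body
  case "$FLEET_HTTP_CODE" in
    2??) return 0 ;;
    *) echo "fleet_api: $method $path -> HTTP $FLEET_HTTP_CODE" >&2; return 1 ;;
  esac
}

# fleet_report STAGE [checkpoint-json]
# Returns 0 on success, 2 on a fenced abort (HTTP 409), 1 on any other failure.
fleet_report() {
  local stage="$1" checkpoint="${2:-}" payload
  payload="{\"stage\":\"${stage}\",\"leaseEpoch\":${FLEET_LEASE_EPOCH}"
  if [ -n "$checkpoint" ]; then
    payload="${payload},\"checkpoint\":${checkpoint}"
  fi
  payload="${payload}}"
  fleet_api PATCH "/fleet/jobs/${FLEET_JOB_ID}" "$payload" >/dev/null && return 0
  if [ "$FLEET_HTTP_CODE" = "409" ]; then
    echo "fenced-abort: job ${FLEET_JOB_ID} epoch ${FLEET_LEASE_EPOCH} is stale; not shipping" >&2
    return 2
  fi
  return 1
}
```

A distinct return code for the 409 case lets the caller tell "stop and do not ship" apart from "coordinator unreachable, keep working locally", which the offline-degrade rules treat differently.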
2. Wire `agent-queue.sh` at MINIMAL hook points (all guarded by `fleet_enabled`):
   - source `lib/fleet-client.sh` near the top.
   - claim: when AQ_FLEET=1 and the local inbox is empty, try `fleet_claim` before idling
     (coordinator jobs interleave with local `.md` files; local files still work).
   - stage transitions (building/review/testing/shipped/failed): call `fleet_report` +
     checkpoint — REPLACE the meaning of the Slice-4 direct tracker echo when AQ_FLEET=1
     (the coordinator records `fleet_events`, becoming the audit source of truth → "tracker
     echo routed through fleet_events"); keep the direct tracker echo as the offline path.
   - heartbeat timer in the run loop; lease renew while a fleet job is building; release on done.
   - new subcommands: `aq fleet-status` (heartbeat + show claimable count) and surface
     factoryId/leaseEpoch in `status`; add to dispatch + help.

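The claim hook in the list above could look roughly like this. `next_job` is a hypothetical name for whatever the runner uses to pick its next job, and the `fleet_claim` shown here is only a placeholder so the sketch runs on its own; the real one would POST /fleet/claim and materialize a job file.

```bash
#!/usr/bin/env bash
# Sketch only: next_job is hypothetical; fleet_claim below is a placeholder.

fleet_enabled() {
  [ "${AQ_FLEET:-0}" = "1" ]
}

# Placeholder. The real function would claim from the coordinator, write a
# transient job file into INBOX, and print its path.
fleet_claim() {  # inbox
  return 1
}

# Print the path of the next job file, or return non-zero to idle.
# Local .md files always win; the coordinator is consulted only when the
# inbox is empty and AQ_FLEET=1.
next_job() {  # inbox
  local inbox="$1" f
  for f in "$inbox"/*.md; do
    [ -e "$f" ] || continue   # unmatched glob stays literal; skip it
    printf '%s\n' "$f"
    return 0
  done
  fleet_enabled || return 1   # offline mode: nothing to do, idle as today
  fleet_claim "$inbox"
}
```

Checking the local inbox first means the flag-off path never reaches the `fleet_enabled` test with a different outcome than today: an empty inbox still idles.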
3. OFFLINE-DEGRADE + SPLIT-BRAIN (§9/§18): if the coordinator is unreachable mid-job, the
   runner finishes the in-flight job locally and reconciles on the next reachable call; on
   reconnect it presents its leaseEpoch — if the coordinator reports it stale (reclaimed),
   the local result is quarantined (marked, NOT auto-shipped) and surfaced for human triage.

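The reconcile decision above reduces to a three-way branch. In this sketch the function name `reconcile_result`, the `QUARANTINED` marker file, and the return-code convention (0 accepted, 2 stale epoch, anything else still unreachable) are all assumptions for illustration; the runner may mark quarantined results differently.

```bash
#!/usr/bin/env bash
# Sketch only: names and the marker file are hypothetical.

# reconcile_result JOB_DIR REPORT_RC
# REPORT_RC is the outcome of presenting the leaseEpoch on reconnect:
#   0     coordinator accepted the epoch
#   2     coordinator reported the epoch stale (job was reclaimed)
#   other coordinator still unreachable
# Prints the action the runner should take next.
reconcile_result() {
  local job_dir="$1" rc="$2"
  case "$rc" in
    0)
      echo ship
      ;;
    2)
      # Mark the local result so nothing ships it automatically.
      : > "$job_dir/QUARANTINED"
      echo quarantine
      ;;
    *)
      echo retry
      ;;
  esac
}
```

Keeping the unreachable case as `retry` rather than `ship` reflects the rule that a result produced while disconnected is only shipped after the coordinator has confirmed the epoch.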
TESTS — extend `agent-queue/selftest.sh` (stub the fleet API exactly like the tracker stub;
tests are sacred, all 53 prior checks stay green):
- flag off (default): AQ_FLEET unset → ZERO fleet API calls; existing offline flow identical
  (re-assert a couple of the offline cases under flag-off).
- heartbeat/register: AQ_FLEET=1 loop start → stub receives POST factories/heartbeat with caps.
- claim: stub returns a job → runner materializes a local job (bodyMd + leaseEpoch in meta)
  and executes it to review/.
- report + checkpoint: building/review/testing → stub receives PATCH /fleet/jobs/:id with the
  correct stage + leaseEpoch (+ checkpoint on building).
- FENCING: stub returns conflict on PATCH (stale epoch) → worker self-aborts, job NOT shipped,
  a fenced-abort is logged/surfaced.
- lease renew: long-running stub → at least one renew call with current leaseEpoch.
- offline-degrade: stub returns connection error mid-job → job still completes locally; on
  next call presenting a now-stale epoch → result quarantined (not auto-shipped).
- no-leak: assert the prompt/bodyMd + token are never sent in a report/comment payload they
  shouldn't be (reuse the Slice-4 sentinel check).

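A rough illustration of the stub approach for the flag-off and heartbeat cases. The real `selftest.sh` harness is not reproduced here, so `FLEET_STUB_LOG`, `fleet_stub_calls`, and `check` are hypothetical names, and the `fleet_heartbeat` body is a stand-in with an empty payload.

```bash
#!/usr/bin/env bash
# Sketch only: harness names are hypothetical, not taken from selftest.sh.

FLEET_STUB_LOG="$(mktemp)"

# Stub replacing the real wrapper: record "METHOD PATH", return a canned body.
# A file is used instead of a counter so calls made in subshells are counted too.
fleet_api() {
  printf '%s %s\n' "$1" "$2" >> "$FLEET_STUB_LOG"
  printf '{}\n'
}

# Number of recorded calls (0 for an empty log).
fleet_stub_calls() {
  awk 'END { print NR }' "$FLEET_STUB_LOG"
}

fleet_enabled() {
  [ "${AQ_FLEET:-0}" = "1" ]
}

# Stand-in for the function under test.
fleet_heartbeat() {
  fleet_enabled || return 0
  fleet_api POST /fleet/factories/heartbeat '{}' >/dev/null
}

# check NAME EXPECTED ACTUAL
check() {
  if [ "$2" = "$3" ]; then
    echo "ok - $1"
  else
    echo "FAIL - $1 (want '$2', got '$3')"
    return 1
  fi
}

( unset AQ_FLEET; fleet_heartbeat )
check "flag off: zero fleet calls" 0 "$(fleet_stub_calls)"

( AQ_FLEET=1; fleet_heartbeat )
check "flag on: heartbeat sent" "POST /fleet/factories/heartbeat" "$(cat "$FLEET_STUB_LOG")"
```

Recording method and path per call also supports the later cases, for example asserting that a PATCH to `/fleet/jobs/:id` was seen, or that no renew call happened after a fenced abort.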
VERIFY GATE (must all pass):
- bash agent-queue/selftest.sh (all prior 53 + new fleet cases green; none weakened)
- bash -n agent-queue/agent-queue.sh && bash -n agent-queue/lib/fleet-client.sh
- node --check agent-queue/dashboard.mjs (if present/unchanged)
- shellcheck --severity=error agent-queue/agent-queue.sh agent-queue/lib/fleet-client.sh

DOCS:
- README: a "Fleet integration (Phase 2)" section — the AQ_FLEET flag, env table, the
  claim/heartbeat/report/fence/renew protocol, offline-degrade + quarantine behavior, and a
  one-paragraph "offline vs fleet mode" explainer.
- Tick the relevant §8/§9/§14 Phase-2 boxes in GIGAFACTORY_ROADMAP.md with a P2-S3 slice note.

CONSTRAINTS: bash + curl + POSIX awk only (no jq, no new deps); reuse Slice-4 helpers; never
hardcode tokens/secrets; offline path unchanged when AQ_FLEET unset; conventional commits
(feat(agent-queue): ...); never weaken a test; do not edit the sibling common-plat repo.

FINAL OUTPUT — print the report in EXACTLY this format:

## Implementation Report — Phase 2 Slice 3 (factory-agent integration)
### Branch & commits / PR
### Files changed
- <path>: <summary>
### What was implemented
- fleet-client.sh: <functions + flag gating>
- agent-queue.sh hook points: <the few places touched + why minimal>
- fencing + offline-degrade + quarantine: <how>
- tracker echo via fleet_events: <how>
### Tests added
- <name>: <assertion> (esp. flag-off no-op, claim, fenced self-abort, offline quarantine)
- selftest summary: <N checks = 53 prior + M new>
### Verify gate results
- selftest / bash -n / node --check / shellcheck: <results>
### Deviations / assumptions
- <claim/lease contract details, anything stubbed, how registration maps to heartbeat>
### Suggested next slice
- Phase 2 remaining: scheduler/router wiring, factory enrollment + scoped tokens, feature-flag
  shadow/dual-run, and the two-factory parallel demo (Phase 2 exit criteria).