bytelyst-devops-tools/agent-queue/docs/jobs/phase2-slice3.md
Saravanakumar D 257efcb4bc docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
2026-05-30 21:01:23 -07:00

engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
yolo: true
lock: agent-queue
timeout: 4h

ROLE: Senior bash + distributed-systems engineer. Implement PHASE 2 SLICE 3 — FACTORY-AGENT INTEGRATION: make the single-host agent-queue.sh runner act as a "factory" that registers / heartbeats / claims / reports against the already-merged fleet coordinator in platform-service, behind a feature flag, while keeping the existing offline git-queue path 100% intact when the flag is off.

NON-NEGOTIABLE DESIGN RULE (prevents merge churn + regressions):

  • Put ALL coordinator-client logic in a NEW separate file agent-queue/lib/fleet-client.sh that agent-queue.sh sources. Touch agent-queue.sh only at a few well-defined hook points (claim source, stage-transition reporting, dispatch/help). The offline git-queue code path MUST be byte-for-byte behaviorally unchanged when AQ_FLEET is unset/0.
  • Gate every coordinator interaction on AQ_FLEET=1. Default (unset) = today's offline behavior. All 53 existing selftest checks MUST still pass unchanged.
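
The gating rule above can be sketched as follows. This is a minimal illustration, not the required implementation: fleet_enabled is the guard named later in this spec, while fleet_hook is a hypothetical wrapper introduced here only to keep each hook site in agent-queue.sh to a single line.

```shell
# Sketch only: fleet_hook is a hypothetical helper, not part of the spec.

# True only when AQ_FLEET is exactly "1"; unset, empty, or "0" all mean offline.
fleet_enabled() {
  [ "${AQ_FLEET:-0}" = "1" ]
}

# Run a fleet function only when the flag is on. With the flag off this
# returns 0 without invoking anything, so the offline path never reaches
# coordinator code.
fleet_hook() {
  fleet_enabled || return 0
  "$@"
}
```

Under this sketch a hook site would read like: fleet_hook fleet_report building. With AQ_FLEET unset the line is a no-op that returns success, which is what the flag-off selftest case needs to observe.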

READ FIRST (verify the real contract — do not guess):

  • agent-queue/agent-queue.sh — the runner. Study: the manifest/lifecycle stages (queued→assigned→building→review→testing→shipped + blocked/failed/dead_letter), run_worker/cmd_run/ship/promote, the Slice-4 tracker_api curl wrapper + _api_call + awk JSON helpers (REUSE these patterns — POSIX awk, curl-only, no jq), and the Slice-4 auto-echo hooks. Mirror that style exactly.
  • agent-queue/selftest.sh — how stub-driven HTTP tests work (the tracker stub overrides the curl wrapper). Build the fleet stub the same way.
  • THE COORDINATOR CONTRACT (read-only, in the sibling repo ../learning_ai_common_plat/services/platform-service/src/modules/fleet/routes.ts): all routes are registered under the /api prefix. Exact endpoints:

    • POST /api/fleet/factories/heartbeat {factoryId, capabilities[], health, load}
    • POST /api/fleet/claim {factoryId, capabilities[]} -> job + leaseEpoch + lease expiry (or empty)
    • GET /api/fleet/jobs/:id
    • PATCH /api/fleet/jobs/:id (fenced stage transition) {stage, checkpoint?, leaseEpoch}
    • POST /api/fleet/jobs/:id/lease/renew {leaseEpoch}
    • POST /api/fleet/jobs/:id/lease/release {leaseEpoch}
    • GET /api/fleet/jobs/:id/runs
    • GET /api/fleet/jobs/:id/events

    Note: there is NO client-side "register factory" or "append event" endpoint. Registration is the heartbeat upsert, and fleet_events are written SERVER-SIDE by the coordinator on each PATCH/claim. The coordinator owns leaseEpoch fencing: a PATCH/renew carrying a stale epoch is rejected (409/conflict).
  • ../learning_ai_devops_tools/agent-queue/docs/gigafactory/GIGAFACTORY_ROADMAP.md §7 (claim loop), §8 (factory/heartbeat/claim/report/drain), §9 (split-brain/offline-degrade), §18 (fencing).
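
Since the contract must be parsed with POSIX awk only, here is a minimal sketch of pulling one scalar field out of a claim response. The function name fleet_json_get and the sample response shape are assumptions made for illustration; the real Slice-4 awk helpers and the real routes.ts response should take precedence once read.

```shell
# Sketch only: fleet_json_get is a hypothetical name; prefer the Slice-4 helpers.
# Usage: printf '%s\n' "$json" | fleet_json_get KEY
# Prints the first value found for "KEY" (string or bare scalar); exits 1 if absent.
# String escapes are passed through as-is (not decoded).
fleet_json_get() {
  awk -v key="$1" '
    { buf = buf $0 }
    END {
      pat = "\"" key "\"[ \t]*:[ \t]*"
      if (!match(buf, pat)) exit 1
      rest = substr(buf, RSTART + RLENGTH)
      if (substr(rest, 1, 1) == "\"") {
        # String value: copy characters up to the next unescaped quote.
        rest = substr(rest, 2)
        out = ""
        while (length(rest) > 0) {
          c = substr(rest, 1, 1)
          if (c == "\\") { out = out substr(rest, 1, 2); rest = substr(rest, 3); continue }
          if (c == "\"") break
          out = out c
          rest = substr(rest, 2)
        }
        print out
      } else if (match(rest, /^[^],} \t]+/)) {
        # Bare value: number, true, false, or null.
        print substr(rest, RSTART, RLENGTH)
      } else {
        exit 1
      }
    }'
}
```

The helper matches the quoted key, so asking for id does not accidentally match factoryId. It returns the first occurrence only, which is enough for flat fields such as leaseEpoch but would need care if the same key appears at several nesting levels.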

PREREQUISITE / BRANCHING:

  • Branch off CURRENT main (Phase 1 complete; foundation + hardening merged). New branch: feat/gigafactory-p2-slice3. Commit in logical steps. Push + open a PR. DO NOT merge.

CONFIG BLOCK (env, in fleet-client.sh; document in README):

  • AQ_FLEET (0/1, default 0 — master switch; 0 = pure offline git-queue)
  • AQ_FLEET_API (default http://localhost:4003/api)
  • AQ_FLEET_TOKEN (bearer; never hardcode)
  • AQ_PRODUCT_ID (reuse the Slice-4 var; X-Product-Id header)
  • AQ_FACTORY_ID (default: hostname + short rand; stable per process)
  • AQ_FLEET_LEASE_RENEW_SEC (default 300 seconds), AQ_FLEET_CAPS (optional; overrides the auto-detected caps)
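
One way the config block could be laid out in fleet-client.sh is sketched below. The function name fleet_load_config and the exact factory-id format (host name, a dash, six hex characters) are assumptions; the spec only asks for "hostname + short rand; stable per process".

```shell
# Sketch only: fleet_load_config and the id format are hypothetical.
fleet_load_config() {
  local host rand
  # Defaults apply only when the variable is unset or empty.
  : "${AQ_FLEET:=0}"
  : "${AQ_FLEET_API:=http://localhost:4003/api}"
  : "${AQ_FLEET_LEASE_RENEW_SEC:=300}"
  # AQ_FLEET_TOKEN and AQ_PRODUCT_ID deliberately get no default:
  # they must come from the environment, never from the script.
  if [ -z "${AQ_FACTORY_ID:-}" ]; then
    host="$(hostname 2>/dev/null || uname -n)"
    rand="$(od -An -N3 -tx1 /dev/urandom | tr -d ' \n')"
    # Computed once and then kept, so the id stays stable for this process.
    AQ_FACTORY_ID="${host}-${rand}"
  fi
  export AQ_FLEET AQ_FLEET_API AQ_FLEET_LEASE_RENEW_SEC AQ_FACTORY_ID
}
```

Calling the function a second time keeps the first AQ_FACTORY_ID, and any value already present in the environment wins over the default.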

DELIVERABLES

  1. agent-queue/lib/fleet-client.sh (new) — a sourced library, curl-only + POSIX awk (reuse Slice-4 helpers; do not add deps):

    • fleet_enabled — returns true iff AQ_FLEET=1 (guard for every other fn).
    • fleet_api METHOD PATH [json] — curl wrapper adding bearer + X-Product-Id; returns body; captures HTTP code; non-2xx is logged and surfaced (never crashes the runner).
    • fleet_detect_caps — reuse the runner's existing capability auto-detection (os, engines, tools) to build the capabilities array.
    • fleet_heartbeat — POST /fleet/factories/heartbeat (registration == first heartbeat); call at loop start + every AQ_FLEET_LEASE_RENEW_SEC during long runs.
    • fleet_claim — POST /fleet/claim with caps; parse job id + bodyMd + leaseEpoch + lease expiry; materialize a transient local job file (reuse the Slice-4 from-tracker materialization) so the existing runner executes it unchanged. Store leaseEpoch in the job meta.
    • fleet_report STAGE [checkpoint] — PATCH /fleet/jobs/:id with {stage, checkpoint?, leaseEpoch}. Fencing-aware: if the coordinator returns conflict/409 (stale epoch), the worker MUST self-abort the job (stop work, do NOT ship/merge) and log a fenced-abort event — a reclaimed/zombie worker can never corrupt coordinator state.
    • fleet_lease_renew / fleet_lease_release — fenced; renew on a timer while building; release on terminal stages.
    • fleet_checkpoint — capture {wipBranch, wipCommit} and send via fleet_report so a reclaim can resume (durability, §25).
  2. Wire agent-queue.sh at MINIMAL hook points (all guarded by fleet_enabled):

    • source lib/fleet-client.sh near the top.
    • claim: when AQ_FLEET=1 and the local inbox is empty, try fleet_claim before idling (coordinator jobs interleave with local .md files; local files still work).
    • stage transitions (building/review/testing/shipped/failed): call fleet_report + checkpoint. When AQ_FLEET=1 this REPLACES the Slice-4 direct tracker echo: the coordinator records fleet_events and becomes the audit source of truth ("tracker echo routed through fleet_events"). Keep the direct tracker echo as the offline path.
    • heartbeat timer in the run loop; lease renew while a fleet job is building; release on done.
    • new subcommand: aq fleet-status (sends a heartbeat + shows the claimable count), added to dispatch + help; also surface factoryId/leaseEpoch in status.
  3. OFFLINE-DEGRADE + SPLIT-BRAIN (§9/§18): if the coordinator is unreachable mid-job, the runner finishes the in-flight job locally and reconciles on the next reachable call; on reconnect it presents its leaseEpoch — if the coordinator reports it stale (reclaimed), the local result is quarantined (marked, NOT auto-shipped) and surfaced for human triage.
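
The wrapper and the fencing rule described in the deliverables can be sketched as below. Several names are assumptions introduced for illustration: the _fleet_curl seam, the FLEET_JOB_ID and FLEET_LEASE_EPOCH variables (standing in for values read from the job meta), and the FLEET_HTTP_CODE / FLEET_BODY result variables. The sketch also stores the body in a variable rather than printing it, so the HTTP code survives; that is a design choice of this example, not something the spec mandates.

```shell
# Sketch only. _fleet_curl, FLEET_JOB_ID, FLEET_LEASE_EPOCH, FLEET_HTTP_CODE and
# FLEET_BODY are hypothetical names, not taken from the spec.

# Network seam: a selftest can redefine this function to stub the coordinator.
_fleet_curl() { curl "$@"; }

# fleet_api METHOD PATH [json]
# Sets FLEET_HTTP_CODE and FLEET_BODY. Returns 0 on 2xx, 1 otherwise; never exits.
fleet_api() {
  local method="$1" path="$2" json="${3:-}" raw nl
  nl='
'
  set -- -sS -X "$method" \
    -H "Authorization: Bearer ${AQ_FLEET_TOKEN:-}" \
    -H "X-Product-Id: ${AQ_PRODUCT_ID:-}" \
    -H "Content-Type: application/json" \
    -w '\n%{http_code}'
  if [ -n "$json" ]; then set -- "$@" --data "$json"; fi
  # A transport failure (coordinator unreachable) is recorded as code 000.
  raw="$(_fleet_curl "$@" "${AQ_FLEET_API:-http://localhost:4003/api}${path}" 2>/dev/null)" || raw="${nl}000"
  FLEET_HTTP_CODE="${raw##*"$nl"}"
  FLEET_BODY="${raw%"$nl"*}"
  case "$FLEET_HTTP_CODE" in
    2??) return 0 ;;
  esac
  printf 'fleet_api: %s %s -> HTTP %s\n' "$method" "$path" "$FLEET_HTTP_CODE" >&2
  return 1
}

# fleet_report STAGE [checkpoint_json]
# Returns 0 = recorded, 2 = fenced (stale epoch: caller must abort, not ship),
# 1 = any other failure (for example coordinator unreachable).
fleet_report() {
  local stage="$1" checkpoint="${2:-}" payload
  # STAGE is one of the fixed lifecycle words, so no JSON escaping is done here.
  payload="{\"stage\":\"${stage}\",\"leaseEpoch\":${FLEET_LEASE_EPOCH}"
  if [ -n "$checkpoint" ]; then payload="${payload},\"checkpoint\":${checkpoint}"; fi
  payload="${payload}}"
  if fleet_api PATCH "/fleet/jobs/${FLEET_JOB_ID}" "$payload"; then
    return 0
  fi
  if [ "$FLEET_HTTP_CODE" = "409" ]; then
    printf 'fenced-abort: job=%s epoch=%s stage=%s\n' "$FLEET_JOB_ID" "$FLEET_LEASE_EPOCH" "$stage" >&2
    return 2
  fi
  return 1
}
```

Distinguishing return code 2 from 1 lets the caller treat the two cases differently: a fenced report stops the job and quarantines any local result, while an unreachable coordinator lets the in-flight job finish locally and reconcile later.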

TESTS — extend agent-queue/selftest.sh (stub the fleet API exactly like the tracker stub; tests are sacred, all 53 prior checks stay green):

  • flag off (default): AQ_FLEET unset → ZERO fleet API calls; existing offline flow identical (re-assert a couple of the offline cases under flag-off).
  • heartbeat/register: AQ_FLEET=1 loop start → stub receives POST /fleet/factories/heartbeat with caps.
  • claim: stub returns a job → runner materializes a local job (bodyMd + leaseEpoch in meta) and executes it to review/.
  • report + checkpoint: building/review/testing → stub receives PATCH /fleet/jobs/:id with the correct stage + leaseEpoch (+ checkpoint on building).
  • FENCING: stub returns conflict on PATCH (stale epoch) → worker self-aborts, job NOT shipped, a fenced-abort is logged/surfaced.
  • lease renew: long-running stub → at least one renew call with current leaseEpoch.
  • offline-degrade: stub returns connection error mid-job → job still completes locally; on next call presenting a now-stale epoch → result quarantined (not auto-shipped).
  • no-leak: assert that the prompt/bodyMd and the token never appear in any report/comment payload (reuse the Slice-4 sentinel check).
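
A stub-driven case in the style described above might look like the following sketch. It does not reproduce the real selftest.sh harness: check, stub_calls, STUB_LOG, and the _fleet_curl seam are hypothetical names, and fleet_heartbeat here is a tiny stand-in so the example runs on its own.

```shell
# Sketch only: all helper names below are illustrative, not from selftest.sh.
STUB_LOG="$(mktemp)"
PASS=0
FAIL=0

# Stub for the network seam: record the call, answer with a canned 200.
_fleet_curl() {
  printf '%s\n' "$*" >> "$STUB_LOG"
  printf '{}\n200'
}

# Minimal stand-ins for the code under test.
fleet_enabled() { [ "${AQ_FLEET:-0}" = "1" ]; }
fleet_heartbeat() {
  fleet_enabled || return 0
  _fleet_curl -X POST "${AQ_FLEET_API:-http://localhost:4003/api}/fleet/factories/heartbeat" >/dev/null
}

# check DESCRIPTION COMMAND...: run COMMAND and tally the result.
check() {
  local desc="$1"
  shift
  if "$@"; then
    PASS=$((PASS + 1)); printf 'ok   %s\n' "$desc"
  else
    FAIL=$((FAIL + 1)); printf 'FAIL %s\n' "$desc"
  fi
}

# Number of calls the stub has recorded so far.
stub_calls() { wc -l < "$STUB_LOG" | tr -d ' '; }

# Case 1: flag off means the stub must never be reached.
: > "$STUB_LOG"
unset AQ_FLEET
fleet_heartbeat
check "flag off: zero fleet API calls" [ "$(stub_calls)" -eq 0 ]

# Case 2: flag on means the heartbeat POST reaches the stub.
: > "$STUB_LOG"
AQ_FLEET=1
fleet_heartbeat
check "flag on: heartbeat POST reaches the stub" \
  grep -q 'POST .*/fleet/factories/heartbeat' "$STUB_LOG"

rm -f "$STUB_LOG"
```

Logging calls to a file rather than a shell variable matters because the wrapper is usually invoked inside a command substitution, where variable changes made by the stub would be lost.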

VERIFY GATE (must all pass):

  • bash agent-queue/selftest.sh (all prior 53 + new fleet cases green; none weakened)
  • bash -n agent-queue/agent-queue.sh && bash -n agent-queue/lib/fleet-client.sh
  • node --check agent-queue/dashboard.mjs (if present/unchanged)
  • shellcheck --severity=error agent-queue/agent-queue.sh agent-queue/lib/fleet-client.sh

DOCS:

  • README: a "Fleet integration (Phase 2)" section — the AQ_FLEET flag, env table, the claim/heartbeat/report/fence/renew protocol, offline-degrade + quarantine behavior, and a one-paragraph "offline vs fleet mode" explainer.
  • Tick the relevant §8/§9/§14 Phase-2 boxes in GIGAFACTORY_ROADMAP.md with a P2-S3 slice note.

CONSTRAINTS: bash + curl + POSIX awk only (no jq, no new deps); reuse Slice-4 helpers; never hardcode tokens/secrets; offline path unchanged when AQ_FLEET unset; conventional commits (feat(agent-queue): ...); never weaken a test; do not edit the sibling common-plat repo.

FINAL OUTPUT — print the report in EXACTLY this format:

Implementation Report — Phase 2 Slice 3 (factory-agent integration)

Branch & commits / PR

Files changed

  • :

What was implemented

  • fleet-client.sh: <functions + flag gating>
  • agent-queue.sh hook points: <the few places touched + why minimal>
  • fencing + offline-degrade + quarantine:
  • tracker echo via fleet_events:

Tests added

  • : (esp. flag-off no-op, claim, fenced self-abort, offline quarantine)
  • selftest summary: <N checks = 53 prior + M new>

Verify gate results

  • selftest / bash -n / node --check / shellcheck:

Deviations / assumptions

  • <claim/lease contract details, anything stubbed, how registration maps to heartbeat>

Suggested next slice

  • Phase 2 remaining: scheduler/router wiring, factory enrollment + scoped tokens, feature-flag shadow/dual-run, and the two-factory parallel demo (Phase 2 exit criteria).