Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every reference (README, system-overview code-map, and all phase job specs). Add an index README that lists the docs and points to the companion docs in learning_ai_common_plat. Docs-only; no behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
106 lines
6.5 KiB
Markdown
---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
yolo: true
lock: agent-queue
timeout: 4h
---

ROLE: Senior bash + distributed-systems engineer. Implement PHASE 2 — FLEET FEATURE FLAGS
+ SHADOW / DUAL-RUN for the agent-queue runner: a safe, reversible path to validate the
fleet coordinator against the proven single-host (P1) behavior BEFORE any real cutover.

PARALLEL-SAFETY (another Devin is running in a DIFFERENT repo — learning_ai_common_plat —
on enrollment/tokens; no file overlap with you. Stay within the agent-queue repo):
- You OWN: agent-queue/lib/fleet-client.sh, agent-queue/agent-queue.sh (the fleet hook
  points only), agent-queue/selftest.sh, agent-queue/README.md,
  agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md.
- Keep the offline git-queue path unchanged when fleet is off. All 60 existing selftest
  checks MUST stay green.

READ FIRST:
- agent-queue/lib/fleet-client.sh — the P2-S3 client: fleet_enabled, fleet_api,
  fleet_claim, fleet_report, lease renew/release, fleet_quarantine. You EXTEND this.
- agent-queue/agent-queue.sh — the run loop + the existing fleet hook points + the offline
  path (cmd_add/run_worker/ship). Study how AQ_FLEET gates everything today.
- agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md §9 (split-brain / offline degrade), §16/§17
  (feature flags fleet.enabled / fleet.route_via_service), §27 (cutover & rollback).

PREREQUISITE / BRANCHING: branch off CURRENT main → feat/gigafactory-p2-flags-shadow.
Push + open PR. DO NOT merge.

FLAG MODEL (three explicit, independently-toggleable levels; document precedence):
- AQ_FLEET=0|1         master switch (exists). 0 ⇒ pure offline, zero coordinator calls.
- AQ_FLEET_ROUTE=0|1   route_via_service: when 1 (and AQ_FLEET=1) the coordinator is
                       AUTHORITATIVE for claim/assignment (today's P2-S3 behavior).
                       When 0, the LOCAL inbox is authoritative (coordinator not used to
                       source work) — this is the pre-cutover state.
- AQ_FLEET_SHADOW=0|1  shadow/dual-run: when 1 (requires AQ_FLEET=1, AQ_FLEET_ROUTE=0)
                       the runner does its normal OFFLINE/local processing as the
                       authoritative path, and IN PARALLEL queries the coordinator
                       (shadow claim + shadow report) WITHOUT acting on its responses —
                       purely to compare decisions and record divergence. Shadow NEVER
                       ships, quarantines, or mutates real job state.

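The precedence rules above could be resolved with a few small predicates. This is a minimal sketch, not the required implementation: `fleet_enabled` exists in the P2-S3 client, but its body here, the default of `0` for unset flags, the warning text, and the `fleet_mode` helper are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: function bodies, defaults, and fleet_mode are assumptions.

# Master switch: anything other than "1" means pure offline.
fleet_enabled() { [ "${AQ_FLEET:-0}" = "1" ]; }

# ROUTE is honored only when the master switch is on.
fleet_route_enabled() { fleet_enabled && [ "${AQ_FLEET_ROUTE:-0}" = "1" ]; }

# SHADOW needs AQ_FLEET=1 and AQ_FLEET_ROUTE=0; if ROUTE is also 1, ROUTE wins.
fleet_shadow_enabled() {
  fleet_enabled || return 1
  [ "${AQ_FLEET_SHADOW:-0}" = "1" ] || return 1
  if fleet_route_enabled; then
    echo "warn: AQ_FLEET_ROUTE=1 overrides AQ_FLEET_SHADOW=1; shadow disabled" >&2
    return 1
  fi
  return 0
}

# Hypothetical helper: one-word summary of the resolved mode, e.g. for status output.
fleet_mode() {
  if fleet_route_enabled; then echo route
  elif fleet_shadow_enabled; then echo shadow
  elif fleet_enabled; then echo local
  else echo offline
  fi
}
```

With `AQ_FLEET=1 AQ_FLEET_ROUTE=0 AQ_FLEET_SHADOW=1` exported, `fleet_mode` prints `shadow`; with `AQ_FLEET=0` it prints `offline` regardless of the other two flags.
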
DELIVERABLES
1. fleet-client.sh additions (all guarded; no-ops unless their flag is on):
   - fleet_route_enabled / fleet_shadow_enabled helpers (precedence: SHADOW only meaningful
     when ROUTE=0; if both ROUTE=1 and SHADOW=1, ROUTE wins and a warning is logged).
   - fleet_shadow_claim — asks the coordinator what it WOULD assign for this factory's caps,
     without claiming a lease for real (read-only / dry-run; if the API has no dry-run, claim
     and then immediately release the lease, or use a shadow factoryId — pick the
     least-invasive option and document it). Returns the would-be job id (or none).
   - fleet_shadow_compare — given the LOCAL decision (the job the offline path actually ran)
     and the coordinator's would-be decision, classify AGREE / DIVERGE / COORD_EMPTY /
     LOCAL_EMPTY and append a structured line to a shadow log
     (agent-queue/queue/.state/fleet-shadow.log: ts, localJob, coordJob, verdict).
   - fleet_shadow_report — mirrors stage transitions to the coordinator as shadow events
     (clearly flagged shadow=1) so reporting is exercised, but divergence in the coordinator
     response is logged, never acted on.
2. agent-queue.sh wiring (minimal, flag-gated):
   - run loop: if SHADOW on, after the local authoritative decision each iteration, call
     fleet_shadow_claim + fleet_shadow_compare (best-effort, error-swallowed — shadow must
     NEVER fail a real job).
   - ROUTE flag: thread it so claim sourcing honors it (ROUTE=1 ⇒ coordinator-sourced as
     today; ROUTE=0 ⇒ local inbox authoritative even when AQ_FLEET=1).
   - new subcommand `aq fleet-shadow-report` — summarize the shadow log (counts of
     AGREE/DIVERGE/…, last N divergences). Add to dispatch + help.
   - surface the three flags' resolved state in `aq status` / `aq fleet-status`.
3. Cutover safety: document the recommended rollout ladder in README — (1) AQ_FLEET=1,
   ROUTE=0, SHADOW=1 (observe, zero risk) → (2) inspect agreement rate → (3) flip ROUTE=1
   once agreement is high → rollback = set ROUTE=0 (and/or AQ_FLEET=0) at any time.

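The compare step and the error-swallowing requirement described above might look roughly like the following. It is a sketch under stated assumptions: the tab-separated log layout, the `AQ_SHADOW_LOG` override, the `-` placeholder for "no job", treating "both sides empty" as AGREE, and the `fleet_shadow_verdict` and `shadow_best_effort` helper names are all hypothetical choices, not part of the existing client.

```bash
#!/usr/bin/env bash
# Sketch only: log format, AQ_SHADOW_LOG, and the two helper names are assumptions.

AQ_SHADOW_LOG="${AQ_SHADOW_LOG:-agent-queue/queue/.state/fleet-shadow.log}"

# Classify the local decision against the coordinator's would-be decision.
# An empty string means "no job". Both empty is counted as AGREE (an assumption).
fleet_shadow_verdict() {
  local local_job="$1" coord_job="$2"
  if [ -z "$local_job" ] && [ -z "$coord_job" ]; then echo AGREE
  elif [ -z "$local_job" ]; then echo LOCAL_EMPTY
  elif [ -z "$coord_job" ]; then echo COORD_EMPTY
  elif [ "$local_job" = "$coord_job" ]; then echo AGREE
  else echo DIVERGE
  fi
}

# Append one structured line: ts, localJob, coordJob, verdict.
fleet_shadow_compare() {
  local local_job="$1" coord_job="$2" verdict ts
  verdict="$(fleet_shadow_verdict "$local_job" "$coord_job")"
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  mkdir -p "$(dirname "$AQ_SHADOW_LOG")"
  printf '%s\t%s\t%s\t%s\n' \
    "$ts" "${local_job:--}" "${coord_job:--}" "$verdict" >>"$AQ_SHADOW_LOG"
}

# Run a shadow step in a subshell and swallow any failure, so a coordinator
# error or an unexpected exit can never change the real job's outcome.
shadow_best_effort() { ( "$@" ) >/dev/null 2>&1 || true; }
```

In the run loop this would be called as `shadow_best_effort fleet_shadow_compare "$local_job" "$coord_job"` after the local decision has already been made, so the real job path never waits on or branches on the result.
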
TESTS — extend selftest.sh (stub the coordinator like the P2-S3 fleet stub; all 60 prior
checks stay green):
- flags off: AQ_FLEET=0 ⇒ zero coordinator calls (incl. shadow); offline flow identical.
- shadow agree: stub returns the same job the local path runs ⇒ shadow log records AGREE;
  the real job still ships via the offline/local path; coordinator state NOT mutated for real.
- shadow diverge: stub returns a different/empty job ⇒ DIVERGE/COORD_EMPTY logged; real job
  still completes; nothing quarantined.
- shadow is non-fatal: coordinator 5xx/timeout during shadow ⇒ real job still completes,
  exit 0, a shadow-error noted.
- ROUTE precedence: ROUTE=1 + SHADOW=1 ⇒ ROUTE path taken, warning logged, no shadow compare.
- ROUTE=0 + AQ_FLEET=1 ⇒ local inbox is authoritative (coordinator not used to source work).
- fleet-shadow-report summarizes the log counts correctly.

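For the last check, the summary behind `aq fleet-shadow-report` can be produced with POSIX awk alone. The sketch below assumes a tab-separated log of the form `ts<TAB>localJob<TAB>coordJob<TAB>verdict`; that layout, the output format, and the `fleet_shadow_summary` name are illustrative assumptions rather than the project's actual format.

```bash
#!/usr/bin/env bash
# Sketch only: log layout, output format, and function name are assumptions.

# fleet_shadow_summary LOG [N]: print per-verdict counts and the last N divergences.
fleet_shadow_summary() {
  local log="$1" last_n="${2:-5}"
  [ -r "$log" ] || { echo "no shadow log at $log"; return 0; }
  awk -F '\t' -v n="$last_n" '
    { count[$4]++ }                       # tally every verdict
    $4 == "DIVERGE" { rows[++d] = $0 }    # remember divergent lines in order
    END {
      split("AGREE DIVERGE COORD_EMPTY LOCAL_EMPTY", order, " ")
      for (i = 1; i <= 4; i++) printf "%s=%d\n", order[i], count[order[i]]
      start = (d > n) ? d - n + 1 : 1
      for (i = start; i <= d; i++) print "divergence: " rows[i]
    }' "$log"
}
```

A selftest case can write a handful of known lines to a temporary log, call `fleet_shadow_summary "$log" 1`, and grep the output for the expected `AGREE=` and `DIVERGE=` counts.
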
VERIFY GATE:
- bash agent-queue/selftest.sh (60 prior + new shadow/flag cases; none weakened)
- bash -n agent-queue/agent-queue.sh && bash -n agent-queue/lib/fleet-client.sh
- shellcheck --severity=error agent-queue/agent-queue.sh agent-queue/lib/fleet-client.sh
- node --check agent-queue/dashboard.mjs (if unchanged)

CONSTRAINTS: bash + curl + POSIX awk only (no jq/new deps); reuse P2-S3 helpers; shadow must
be strictly side-effect-free on real job state; offline path unchanged when AQ_FLEET=0;
never hardcode tokens; conventional commits (feat(agent-queue): ...); never weaken a test;
do not edit the common-plat repo.

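Since jq is off the table, pulling a job id out of a coordinator response has to be done with awk. The following is a rough sketch only: the response shape and the `jobId` field name are assumptions (the real API may differ), `json_str_field` is a hypothetical helper, and where the P2-S3 client already has a parsing helper that one should be reused instead. It handles flat string fields and does not cope with escaped quotes inside values or regex metacharacters in the key.

```bash
#!/usr/bin/env bash
# Sketch only: json_str_field and the "jobId" field name are assumptions.

# json_str_field KEY: read JSON on stdin, print the first string value for "KEY".
# Prints nothing when the key is absent or its value is not a string (e.g. null).
json_str_field() {
  awk -v key="$1" '
    {
      pat = "\"" key "\"[ \t]*:[ \t]*\""
      if (match($0, pat)) {
        rest = substr($0, RSTART + RLENGTH)   # text after the opening quote
        end = index(rest, "\"")               # position of the closing quote
        if (end > 0) { print substr(rest, 1, end - 1); exit }
      }
    }'
}
```

For example, `printf '{"jobId":"job-42","state":"queued"}\n' | json_str_field jobId` prints `job-42`, and a response with `"jobId":null` prints nothing, which maps naturally onto the "or none" case of a shadow claim.
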
FINAL OUTPUT — report in EXACTLY this format:
## Implementation Report — Phase 2 Feature Flags + Shadow/Dual-run
### Branch & commits / PR
### Files changed
### What was implemented (flag model + precedence, shadow claim/compare/report, cutover ladder)
### Tests added (+ selftest summary = 60 prior + N new; esp. flags-off no-op, shadow non-fatal, ROUTE precedence)
### Verify gate results
### Deviations / assumptions (how shadow claim avoids real lease mutation)
### Suggested next slice