bytelyst-devops-tools/agent-queue/docs/jobs/phase2-feature-flags-shadow.md

engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
yolo: true
lock: agent-queue
timeout: 4h

ROLE: Senior bash + distributed-systems engineer. Implement PHASE 2: FLEET FEATURE FLAGS + SHADOW / DUAL-RUN for the agent-queue runner: a safe, reversible path to validate the fleet coordinator against the proven single-host (P1) behavior BEFORE any real cutover.

PARALLEL-SAFETY (another Devin is running in a DIFFERENT repo — learning_ai_common_plat — on enrollment/tokens; no file overlap with you. Stay within the agent-queue repo):

  • You OWN: agent-queue/lib/fleet-client.sh, agent-queue/agent-queue.sh (the fleet hook points only), agent-queue/selftest.sh, agent-queue/README.md, agent-queue/docs/GIGAFACTORY_ROADMAP.md.
  • Keep the offline git-queue path unchanged when fleet is off. All 60 existing selftest checks MUST stay green.

READ FIRST:

  • agent-queue/lib/fleet-client.sh — the P2-S3 client: fleet_enabled, fleet_api, fleet_claim, fleet_report, lease renew/release, fleet_quarantine. You EXTEND this.
  • agent-queue/agent-queue.sh — the run loop + the existing fleet hook points + the offline path (cmd_add/run_worker/ship). Study how AQ_FLEET gates everything today.
  • agent-queue/docs/GIGAFACTORY_ROADMAP.md §9 (split-brain / offline degrade), §16/§17 (feature flags fleet.enabled / fleet.route_via_service), §27 (cutover & rollback).

PREREQUISITE / BRANCHING: branch off CURRENT main → feat/gigafactory-p2-flags-shadow. Push + open PR. DO NOT merge.

FLAG MODEL (three explicit, independently-toggleable levels; document precedence):

  • AQ_FLEET=0|1 master switch (exists). 0 ⇒ pure offline, zero coordinator calls.
  • AQ_FLEET_ROUTE=0|1 route_via_service: when 1 (and AQ_FLEET=1) the coordinator is AUTHORITATIVE for claim/assignment (today's P2-S3 behavior). When 0, the LOCAL inbox is authoritative (coordinator not used to source work) — this is the pre-cutover state.
  • AQ_FLEET_SHADOW=0|1 shadow/dual-run: when 1 (requires AQ_FLEET=1, AQ_FLEET_ROUTE=0) the runner does its normal OFFLINE/local processing as the authoritative path, and IN PARALLEL queries the coordinator (shadow claim + shadow report) WITHOUT acting on its responses — purely to compare decisions and record divergence. Shadow NEVER ships, quarantines, or mutates real job state.
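The precedence among the three flags can be sketched as a single resolver. This is a minimal illustration, not the existing fleet-client.sh code: the function name fleet_resolve_mode and the mode labels (offline, local, shadow, route) are hypothetical, and treating an unset flag as 0 is an assumption the spec does not state.

```bash
#!/usr/bin/env bash
# Hypothetical helper: collapse the three flags into one mode string.
# Flag names come from the spec; the function name, mode labels, and the
# "unset means 0" default are assumptions made for illustration.
fleet_resolve_mode() {
  local fleet="${AQ_FLEET:-0}" route="${AQ_FLEET_ROUTE:-0}" shadow="${AQ_FLEET_SHADOW:-0}"

  # Master switch off: pure offline; the other two flags are ignored.
  if [ "$fleet" != "1" ]; then
    echo "offline"
    return 0
  fi

  # ROUTE wins over SHADOW; warn on stderr when both are set.
  if [ "$route" = "1" ]; then
    if [ "$shadow" = "1" ]; then
      echo "warn: AQ_FLEET_ROUTE=1 overrides AQ_FLEET_SHADOW=1; shadow disabled" >&2
    fi
    echo "route"
    return 0
  fi

  # ROUTE=0: the local inbox is authoritative, optionally with shadow compare.
  if [ "$shadow" = "1" ]; then
    echo "shadow"
  else
    echo "local"
  fi
}
```

A caller could branch on the result, for example `case "$(fleet_resolve_mode)" in shadow) ... ;; esac`, so the precedence rule lives in one place.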

DELIVERABLES

  1. fleet-client.sh additions (all guarded; no-ops unless their flag is on):
    • fleet_route_enabled / fleet_shadow_enabled helpers (precedence: SHADOW only meaningful when ROUTE=0; if both ROUTE=1 and SHADOW=1, ROUTE wins and a warning is logged).
    • fleet_shadow_claim: asks the coordinator what it WOULD assign for this factory's caps, without claiming a lease for real (read-only / dry-run). If the API has no dry-run, either claim and then immediately release the lease, or use a shadow factoryId; pick the least invasive option and document it. Returns the would-be job id (or none).
    • fleet_shadow_compare — given the LOCAL decision (the job the offline path actually ran) and the coordinator's would-be decision, classify AGREE / DIVERGE / COORD_EMPTY / LOCAL_EMPTY and append a structured line to a shadow log (agent-queue/queue/.state/fleet-shadow.log: ts, localJob, coordJob, verdict).
    • fleet_shadow_report — mirrors stage transitions to the coordinator as shadow events (clearly flagged shadow=1) so reporting is exercised, but divergence in the coordinator response is logged, never acted on.
  2. agent-queue.sh wiring (minimal, flag-gated):
    • run loop: if SHADOW on, after the local authoritative decision each iteration, call fleet_shadow_claim + fleet_shadow_compare (best-effort, error-swallowed — shadow must NEVER fail a real job).
    • ROUTE flag: thread it so claim sourcing honors it (ROUTE=1 ⇒ coordinator-sourced as today; ROUTE=0 ⇒ local inbox authoritative even when AQ_FLEET=1).
    • new subcommand aq fleet-shadow-report — summarize the shadow log (counts of AGREE/DIVERGE/…, last N divergences). Add to dispatch + help.
    • surface the three flags' resolved state in aq status / aq fleet-status.
  3. Cutover safety: document the recommended rollout ladder in README — (1) AQ_FLEET=1, ROUTE=0, SHADOW=1 (observe, zero risk) → (2) inspect agreement rate → (3) flip ROUTE=1 once agreement is high → rollback = set ROUTE=0 (and/or AQ_FLEET=0) at any time.
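The compare step from deliverable 1 might look like the sketch below. The verdict names and log fields come from the spec; everything else is an assumption: the key=value line layout, the AQ_SHADOW_LOG override variable, the "-" placeholder for an empty job, and the choice to record AGREE when both sides are empty (the spec does not cover that case).

```bash
#!/usr/bin/env bash
# Sketch of verdict classification. Both-empty => AGREE is an assumption.
fleet_shadow_classify() {
  local local_job="${1:-}" coord_job="${2:-}"
  if [ -z "$local_job" ] && [ -z "$coord_job" ]; then
    echo "AGREE"
  elif [ -z "$local_job" ]; then
    echo "LOCAL_EMPTY"
  elif [ -z "$coord_job" ]; then
    echo "COORD_EMPTY"
  elif [ "$local_job" = "$coord_job" ]; then
    echo "AGREE"
  else
    echo "DIVERGE"
  fi
}

# Append one structured line per comparison. Always returns 0 so that a
# logging problem can never fail the real job.
# AQ_SHADOW_LOG is a hypothetical override; the default path is from the spec.
fleet_shadow_compare() {
  local local_job="${1:-}" coord_job="${2:-}" verdict ts log
  log="${AQ_SHADOW_LOG:-agent-queue/queue/.state/fleet-shadow.log}"
  verdict="$(fleet_shadow_classify "$local_job" "$coord_job")"
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  mkdir -p "$(dirname "$log")" 2>/dev/null || return 0
  # "-" stands in for an empty job id so every line has the same four fields.
  printf 'ts=%s localJob=%s coordJob=%s verdict=%s\n' \
    "$ts" "${local_job:--}" "${coord_job:--}" "$verdict" >>"$log" || true
  return 0
}
```

In the run loop this would be called after the local decision, for example `fleet_shadow_compare "$local_job" "$coord_job" || true`, where both variable names are placeholders.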

TESTS — extend selftest.sh (stub the coordinator like the P2-S3 fleet stub; all 60 prior checks stay green):

  • flags off: AQ_FLEET=0 ⇒ zero coordinator calls (incl. shadow); offline flow identical.
  • shadow agree: stub returns the same job the local path runs ⇒ shadow log records AGREE; the real job still ships via the offline/local path; coordinator state NOT mutated for real.
  • shadow diverge: stub returns a different/empty job ⇒ DIVERGE/COORD_EMPTY logged; real job still completes; nothing quarantined.
  • shadow is non-fatal: coordinator 5xx/timeout during shadow ⇒ real job still completes, exit 0, a shadow-error noted.
  • ROUTE precedence: ROUTE=1 + SHADOW=1 ⇒ ROUTE path taken, warning logged, no shadow compare.
  • ROUTE=0 + AQ_FLEET=1 ⇒ local inbox is authoritative (coordinator not used to source work).
  • fleet-shadow-report summarizes the log counts correctly.
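For the fleet-shadow-report case, a summarizer in POSIX awk could be as small as the sketch below. It assumes each log line holds space-separated key=value fields including verdict=...; that layout, the function name fleet_shadow_summary, and the output format are illustrative choices, since the spec names only the fields.

```bash
#!/usr/bin/env bash
# Summarize a shadow log: per-verdict counts plus the last N DIVERGE lines.
# Assumes key=value fields per line (a hypothetical layout).
fleet_shadow_summary() {
  local log="$1" last_n="${2:-5}"
  [ -r "$log" ] || { echo "no shadow log at $log"; return 0; }
  awk -v n="$last_n" '
    {
      verdict = "UNKNOWN"
      for (i = 1; i <= NF; i++)
        if ($i ~ /^verdict=/) verdict = substr($i, 9)  # skip the 8-char "verdict="
      count[verdict]++
      total++
      if (verdict == "DIVERGE") diverged[++d] = $0
    }
    END {
      printf "total=%d", total
      split("AGREE DIVERGE COORD_EMPTY LOCAL_EMPTY", order, " ")
      for (k = 1; k <= 4; k++) printf " %s=%d", order[k], count[order[k]]
      printf "\n"
      # Print only the most recent n divergences, oldest first.
      start = (d > n) ? d - n + 1 : 1
      for (j = start; j <= d; j++) print "diverge: " diverged[j]
    }
  ' "$log"
}
```

A selftest check can then assert on the first output line, which keeps the test independent of timestamps in the log.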

VERIFY GATE:

  • bash agent-queue/selftest.sh (60 prior + new shadow/flag cases; none weakened)
  • bash -n agent-queue/agent-queue.sh && bash -n agent-queue/lib/fleet-client.sh
  • shellcheck --severity=error agent-queue/agent-queue.sh agent-queue/lib/fleet-client.sh
  • node --check agent-queue/dashboard.mjs (if unchanged)

CONSTRAINTS: bash + curl + POSIX awk only (no jq/new deps); reuse P2-S3 helpers; shadow must be strictly side-effect-free on real job state; offline path unchanged when AQ_FLEET=0; never hardcode tokens; conventional commits (feat(agent-queue): ...); never weaken a test; do not edit the common-plat repo.
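Because jq is off the table, reading a would-be job id out of a coordinator response has to be done with awk. The sketch below is one way to pull a string field from a small, flat JSON object. The helper name json_str_field and the jobId field are hypothetical (the real response shape is whatever the P2-S3 API returns), and the helper deliberately does not handle escaped quotes, nested objects, or keys containing regex metacharacters. If the P2-S3 client already has a helper for this, reuse that instead.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of a top-level string field from a
# small, flat JSON object read on stdin. Prints nothing if the key is
# missing or its value is not a string.
json_str_field() {
  local key="$1"
  awk -v key="$key" '
    {
      # Match: "key" <optional space> : <optional space> "
      pat = "\"" key "\"[ \t]*:[ \t]*\""
      if (match($0, pat)) {
        rest = substr($0, RSTART + RLENGTH)
        stop = index(rest, "\"")          # closing quote of the value
        if (stop > 0) { print substr(rest, 1, stop - 1); exit }
      }
    }
  '
}
```

For example, `printf '%s\n' "$response" | json_str_field jobId` would yield the id, or an empty string when the coordinator has nothing to assign; `$response` here is a placeholder for the body returned by curl.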

FINAL OUTPUT — report in EXACTLY this format:

Implementation Report — Phase 2 Feature Flags + Shadow/Dual-run

Branch & commits / PR

Files changed

What was implemented (flag model + precedence, shadow claim/compare/report, cutover ladder)

Tests added (+ selftest summary = 60 prior + N new; esp. flags-off no-op, shadow non-fatal, ROUTE precedence)

Verify gate results

Deviations / assumptions (how shadow claim avoids real lease mutation)

Suggested next slice