docs(agent-queue): draft P2 prompts — factory enrollment+tokens (§12) + feature flags/shadow-dualrun
parent cf5428acd1
commit 5c0ae020c0
97
agent-queue/docs/jobs/phase2-enrollment-tokens.md
Normal file
@@ -0,0 +1,97 @@
---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat-enrollment
timeout: 4h
---

ROLE: Senior backend + security engineer. Implement PHASE 2 — FACTORY ENROLLMENT +
SCOPED ROTATABLE TOKENS (§12) for the fleet coordinator in platform-service, plus two
small artifact-route hardening fixes found in review.

PARALLEL-SAFETY (another Devin is running in a DIFFERENT repo — agent-queue/devops-tools —
on feature flags; no file overlap with you. Stay within platform-service):
- You OWN: a NEW modules/fleet/enrollment.ts, modules/fleet/tokens.ts (or one
  enrollment.ts), enrollment.test.ts, and ADDITIVE edits to types.ts, repository.ts,
  routes.ts, cosmos-init.ts (factory token fields + enrollment endpoints + token-auth
  middleware). You MAY edit artifacts-blob.ts/routes.ts ONLY for the two review fixes below.
- You MUST NOT change the scheduler.ts scoring, coordinator.ts claim/lease/fence CAS, or
  the heartbeat/claim PAYLOAD shape (only ADD an optional auth check around them, behind a
  flag — see below). Do not break any of the existing 79 fleet tests / 1591 platform tests.

READ FIRST:
- modules/fleet/types.ts — FleetFactoryDoc (id, productId, capabilities, health, load,
  lastHeartbeatAt...). repository.ts — factory upsert (heartbeat). routes.ts — POST
  /fleet/factories/heartbeat, POST /fleet/claim (these will optionally require a token).
- modules/auth/** in platform-service AND ../../packages/auth — reuse the EXISTING token/
  hashing primitives (bcrypt/sha-256 recovery-code pattern). Do NOT invent new crypto.
  Tokens are stored HASHED at rest; the plaintext is returned exactly once at enroll/rotate.
- ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md §12 (enrollment,
  scoped tokens, rotation, revocation) + §18 (trust boundary).

PREREQUISITE / BRANCHING: branch off CURRENT main → feat/gigafactory-p2-enrollment.
Push + open PR. DO NOT merge.

DELIVERABLES
1. Factory enrollment + token lifecycle (enrollment.ts):
   - enrollFactory({productId, capabilities, label?}) → creates/links a FleetFactoryDoc and
     issues a SCOPED token: scope = {productId, factoryId, capabilities[]}. Persist only the
     HASH (+ tokenId, createdAt, lastUsedAt, status). Return plaintext token ONCE.
   - rotateToken(factoryId, productId) → issue a new token, invalidate the previous (grace:
     mark old `rotating` with a short overlap TTL so an in-flight worker isn't cut off).
   - revokeToken(tokenId|factoryId, productId) → status=revoked; immediately rejected.
   - verifyToken(plaintext) → resolves {factoryId, productId, capabilities, status} or null;
     constant-time hash compare; updates lastUsedAt. Revoked/expired ⇒ null.
2. Token-auth on the fleet endpoints — GATED so existing tests keep passing:
   - Add a `requireFactoryToken` check to POST /fleet/factories/heartbeat and POST
     /fleet/claim that is ENFORCED only when enforcement is on (env/flag
     FLEET_REQUIRE_FACTORY_TOKEN, default OFF so the 79 existing tests are unaffected). When
     on: missing/invalid/revoked token ⇒ 401; token scope must cover the requested productId
     + the claim's capabilities ⇒ else 403. When off: behaves exactly as today.
   - The claim's effective capabilities/productId must be taken from the VERIFIED token scope
     when enforcement is on (a factory cannot claim outside its scope).
3. Routes (additive): POST /fleet/factories/enroll, POST /fleet/factories/:id/token/rotate,
   POST /fleet/factories/:id/token/revoke — all auth + productId + Zod validated, registered
   like the existing fleet routes (do not reorder others).
4. REVIEW FIXES (small, same module):
   - listArtifactsByJob must be productId-scoped: thread `productId` through
     repo.listArtifactsByJob + the GET /fleet/jobs/:id/artifacts handler (use the request
     productId), so a caller can only list artifacts for their own product.
   - Upload must prefer the request/auth productId over body.productId (drop the
     `body.productId ||` precedence; use getRequestProductId(req), body value only as a
     non-overriding hint or removed).
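
ILLUSTRATIVE SKETCH (not part of the spec): the rotate/revoke rules in deliverable 1 can be
modelled as a small status check. Everything here is an assumption made for illustration:
the `overlapExpiresAt` field, the `ROTATION_OVERLAP_MS` value, and the helper names are
hypothetical, and hashing is deliberately left out because the existing auth primitives
must be reused for that.

```typescript
// Hypothetical token record. Only tokenId and status come from the prompt;
// overlapExpiresAt is an assumed field for the rotation grace window.
type TokenStatus = "active" | "rotating" | "revoked";

interface FactoryTokenRecord {
  tokenId: string;
  factoryId: string;
  productId: string;
  status: TokenStatus;
  overlapExpiresAt?: number; // epoch ms; assumed, only meaningful while `rotating`
}

// Assumed overlap TTL. The real value is a design decision to report under
// "Deviations / assumptions".
const ROTATION_OVERLAP_MS = 5 * 60 * 1000;

// A token is usable when active, or when rotating and still inside the overlap window.
function isTokenUsable(record: FactoryTokenRecord, now: number): boolean {
  switch (record.status) {
    case "active":
      return true;
    case "rotating":
      return record.overlapExpiresAt !== undefined && now < record.overlapExpiresAt;
    case "revoked":
      return false;
  }
}

// Called on the PREVIOUS token when a new one is issued.
function markRotating(record: FactoryTokenRecord, now: number): FactoryTokenRecord {
  return { ...record, status: "rotating", overlapExpiresAt: now + ROTATION_OVERLAP_MS };
}

// Revocation takes effect immediately, regardless of any overlap window.
function markRevoked(record: FactoryTokenRecord): FactoryTokenRecord {
  return { ...record, status: "revoked" };
}
```

Under this model verifyToken would return null whenever isTokenUsable is false, which
covers both the "revoked" and the "expired overlap" cases.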

TESTS (enrollment.test.ts + targeted additions; tests are sacred, all prior green):
- enroll returns a plaintext token once; the stored doc holds only a hash (assert no
  plaintext persisted) + scope (productId, capabilities).
- verifyToken: valid → scope; tampered/unknown → null; revoked → null.
- rotate: old token still works during the overlap TTL, then is rejected; new token works.
- revoke: immediate rejection.
- enforcement OFF (default): heartbeat/claim behave exactly as the existing tests expect
  (re-assert claim works with NO token).
- enforcement ON: no token → 401; out-of-scope productId or capability → 403; in-scope → ok,
  and claim is constrained to the token's scope.
- artifact fixes: list is productId-scoped (a different product cannot see the pointers);
  upload ignores a spoofed body.productId.
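
ILLUSTRATIVE SKETCH (not part of the spec): the enforcement OFF / 401 / 403 expectations
above can be expressed as one pure decision function, which keeps the route middleware
thin and easy to test. The names `TokenScope`, `ClaimRequest`, and `authorizeFleetRequest`
are hypothetical; `scope === null` stands in for verifyToken returning null.

```typescript
// Hypothetical shapes, mirroring the scope described in deliverable 1.
interface TokenScope {
  productId: string;
  factoryId: string;
  capabilities: string[];
}

interface ClaimRequest {
  productId: string;
  capabilities: string[];
}

type AuthDecision =
  | { status: 200; effective: ClaimRequest }
  | { status: 401 | 403 };

function authorizeFleetRequest(
  enforce: boolean,
  scope: TokenScope | null,
  request: ClaimRequest,
): AuthDecision {
  // Enforcement off: behave exactly as today, the request passes through untouched.
  if (!enforce) return { status: 200, effective: request };

  // Missing, invalid, revoked, or expired token.
  if (scope === null) return { status: 401 };

  // The token scope must cover the requested productId and every requested capability.
  const allowed = new Set(scope.capabilities);
  const inScope =
    request.productId === scope.productId &&
    request.capabilities.every((capability) => allowed.has(capability));
  if (!inScope) return { status: 403 };

  // Effective productId/capabilities come from the VERIFIED scope, not the request body.
  return {
    status: 200,
    effective: { productId: scope.productId, capabilities: [...scope.capabilities] },
  };
}
```

Each test bullet then maps to one call: enforcement off with no token yields 200, enforcement
on with a null scope yields 401, and a foreign productId or capability yields 403.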

VERIFY GATE:
- pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet (all green;
  count grows from 79)
- pnpm --filter @lysnrai/platform-service build
- pnpm build && pnpm test (no regression across consumers)

CONSTRAINTS: ESM .js imports; no any; no console.log; productId on every doc; tokens HASHED
at rest, plaintext shown once; reuse existing auth/crypto primitives (no new schemes);
enforcement default OFF; conventional commits (feat(platform-service): ...); do not touch
scheduler scoring or the claim CAS; do not edit the agent-queue repo.

FINAL OUTPUT — report in EXACTLY this format:
## Implementation Report — Phase 2 Factory Enrollment + Scoped Tokens (§12)
### Branch & commits / PR
### Files changed
### What was implemented (enroll/rotate/revoke/verify, scope model, gated auth, artifact fixes)
### Tests added (+ pnpm test summary; esp. hashed-at-rest, scope 401/403, enforcement-off no-op)
### Verify gate results
### Deviations / assumptions (which crypto primitive, rotation overlap TTL, flag name)
### Suggested next slice
105
agent-queue/docs/jobs/phase2-feature-flags-shadow.md
Normal file
@@ -0,0 +1,105 @@
---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_devops_tools
yolo: true
lock: agent-queue
timeout: 4h
---

ROLE: Senior bash + distributed-systems engineer. Implement PHASE 2 — FLEET FEATURE FLAGS
+ SHADOW / DUAL-RUN for the agent-queue runner: a safe, reversible path to validate the
fleet coordinator against the proven single-host (P1) behavior BEFORE any real cutover.

PARALLEL-SAFETY (another Devin is running in a DIFFERENT repo — learning_ai_common_plat —
on enrollment/tokens; no file overlap with you. Stay within the agent-queue repo):
- You OWN: agent-queue/lib/fleet-client.sh, agent-queue/agent-queue.sh (the fleet hook
  points only), agent-queue/selftest.sh, agent-queue/README.md,
  agent-queue/docs/GIGAFACTORY_ROADMAP.md.
- Keep the offline git-queue path unchanged when fleet is off. All 60 existing selftest
  checks MUST stay green.

READ FIRST:
- agent-queue/lib/fleet-client.sh — the P2-S3 client: fleet_enabled, fleet_api,
  fleet_claim, fleet_report, lease renew/release, fleet_quarantine. You EXTEND this.
- agent-queue/agent-queue.sh — the run loop + the existing fleet hook points + the offline
  path (cmd_add/run_worker/ship). Study how AQ_FLEET gates everything today.
- agent-queue/docs/GIGAFACTORY_ROADMAP.md §9 (split-brain / offline degrade), §16/§17
  (feature flags fleet.enabled / fleet.route_via_service), §27 (cutover & rollback).

PREREQUISITE / BRANCHING: branch off CURRENT main → feat/gigafactory-p2-flags-shadow.
Push + open PR. DO NOT merge.

FLAG MODEL (three explicit, independently-toggleable levels; document precedence):
- AQ_FLEET=0|1         master switch (exists). 0 ⇒ pure offline, zero coordinator calls.
- AQ_FLEET_ROUTE=0|1   route_via_service: when 1 (and AQ_FLEET=1) the coordinator is
                       AUTHORITATIVE for claim/assignment (today's P2-S3 behavior).
                       When 0, the LOCAL inbox is authoritative (coordinator not used to
                       source work) — this is the pre-cutover state.
- AQ_FLEET_SHADOW=0|1  shadow/dual-run: when 1 (requires AQ_FLEET=1, AQ_FLEET_ROUTE=0)
                       the runner does its normal OFFLINE/local processing as the
                       authoritative path, and IN PARALLEL queries the coordinator
                       (shadow claim + shadow report) WITHOUT acting on its responses —
                       purely to compare decisions and record divergence. Shadow NEVER
                       ships, quarantines, or mutates real job state.
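
ILLUSTRATIVE SKETCH (not part of the spec): the precedence between the three flags can be
written down as a single resolution function. It is sketched in TypeScript only as a
language-neutral reference model; the runner itself stays bash. The mode names
(`offline`, `local`, `shadow`, `route`) and `resolveFleetMode` are hypothetical, and
treating any value other than "1" as off is an assumption.

```typescript
type FleetMode = "offline" | "local" | "shadow" | "route";

interface ResolvedFlags {
  mode: FleetMode;
  warning?: string;
}

// Assumption: only the literal string "1" turns a flag on; unset or anything else is off.
function flagOn(value: string | undefined): boolean {
  return value === "1";
}

function resolveFleetMode(env: Record<string, string | undefined>): ResolvedFlags {
  // Master switch off: pure offline, the other two flags are ignored entirely.
  if (!flagOn(env["AQ_FLEET"])) return { mode: "offline" };

  const route = flagOn(env["AQ_FLEET_ROUTE"]);
  const shadow = flagOn(env["AQ_FLEET_SHADOW"]);

  // ROUTE wins over SHADOW; the conflict is surfaced as a warning, not an error.
  if (route && shadow) {
    return { mode: "route", warning: "AQ_FLEET_SHADOW ignored because AQ_FLEET_ROUTE=1" };
  }
  if (route) return { mode: "route" };

  // Coordinator reachable but not authoritative: shadow compares, local just runs.
  if (shadow) return { mode: "shadow" };
  return { mode: "local" };
}
```

In this model AQ_FLEET=1 with both other flags off resolves to `local`: the coordinator is
enabled but is not used to source work, which matches the pre-cutover state described above.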

DELIVERABLES
1. fleet-client.sh additions (all guarded; no-ops unless their flag is on):
   - fleet_route_enabled / fleet_shadow_enabled helpers (precedence: SHADOW only meaningful
     when ROUTE=0; if both ROUTE=1 and SHADOW=1, ROUTE wins and a warning is logged).
   - fleet_shadow_claim — asks the coordinator what it WOULD assign for this factory's caps,
     without claiming a lease for real (read-only / dry-run; if the API has no dry-run, claim
     then immediately lease/release, or use a shadow factoryId — pick the least-invasive and
     document it). Returns the would-be job id (or none).
   - fleet_shadow_compare — given the LOCAL decision (the job the offline path actually ran)
     and the coordinator's would-be decision, classify AGREE / DIVERGE / COORD_EMPTY /
     LOCAL_EMPTY and append a structured line to a shadow log
     (agent-queue/queue/.state/fleet-shadow.log: ts, localJob, coordJob, verdict).
   - fleet_shadow_report — mirrors stage transitions to the coordinator as shadow events
     (clearly flagged shadow=1) so reporting is exercised, but divergence in the coordinator
     response is logged, never acted on.
2. agent-queue.sh wiring (minimal, flag-gated):
   - run loop: if SHADOW on, after the local authoritative decision each iteration, call
     fleet_shadow_claim + fleet_shadow_compare (best-effort, error-swallowed — shadow must
     NEVER fail a real job).
   - ROUTE flag: thread it so claim sourcing honors it (ROUTE=1 ⇒ coordinator-sourced as
     today; ROUTE=0 ⇒ local inbox authoritative even when AQ_FLEET=1).
   - new subcommand `aq fleet-shadow-report` — summarize the shadow log (counts of
     AGREE/DIVERGE/…, last N divergences). Add to dispatch + help.
   - surface the three flags' resolved state in `aq status` / `aq fleet-status`.
3. Cutover safety: document the recommended rollout ladder in README — (1) AQ_FLEET=1,
   ROUTE=0, SHADOW=1 (observe, zero risk) → (2) inspect agreement rate → (3) flip ROUTE=1
   once agreement is high → rollback = set ROUTE=0 (and/or AQ_FLEET=0) at any time.
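
ILLUSTRATIVE SKETCH (not part of the spec): the four verdicts that fleet_shadow_compare
records can be pinned down with a small reference classifier, written in TypeScript purely
as a model for the bash implementation. Three points are assumptions not stated above:
both sides empty is treated as AGREE, an empty job is written as "-", and log fields are
tab-separated.

```typescript
type Verdict = "AGREE" | "DIVERGE" | "COORD_EMPTY" | "LOCAL_EMPTY";

// null means "no job": the local path ran nothing, or the coordinator would assign nothing.
function classifyShadow(localJob: string | null, coordJob: string | null): Verdict {
  // Assumption: neither side picked a job, so the two decisions agree.
  if (localJob === null && coordJob === null) return "AGREE";
  if (coordJob === null) return "COORD_EMPTY";
  if (localJob === null) return "LOCAL_EMPTY";
  return localJob === coordJob ? "AGREE" : "DIVERGE";
}

// One structured log line: ts, localJob, coordJob, verdict (tab-separated by assumption).
function formatShadowLine(ts: string, localJob: string | null, coordJob: string | null): string {
  const verdict = classifyShadow(localJob, coordJob);
  return [ts, localJob ?? "-", coordJob ?? "-", verdict].join("\t");
}
```

Keeping classification separate from logging means the verdict logic can be checked
case by case without touching the shadow log file.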

TESTS — extend selftest.sh (stub the coordinator like the P2-S3 fleet stub; all 60 prior
checks stay green):
- flags off: AQ_FLEET=0 ⇒ zero coordinator calls (incl. shadow); offline flow identical.
- shadow agree: stub returns the same job the local path runs ⇒ shadow log records AGREE;
  the real job still ships via the offline/local path; coordinator state NOT mutated for real.
- shadow diverge: stub returns a different/empty job ⇒ DIVERGE/COORD_EMPTY logged; real job
  still completes; nothing quarantined.
- shadow is non-fatal: coordinator 5xx/timeout during shadow ⇒ real job still completes,
  exit 0, a shadow-error noted.
- ROUTE precedence: ROUTE=1 + SHADOW=1 ⇒ ROUTE path taken, warning logged, no shadow compare.
- ROUTE=0 + AQ_FLEET=1 ⇒ local inbox is authoritative (coordinator not used to source work).
- fleet-shadow-report summarizes the log counts correctly.
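
ILLUSTRATIVE SKETCH (not part of the spec): what "summarizes the log counts correctly"
could mean, as a TypeScript reference model for the bash/awk subcommand. The
tab-separated layout with the verdict in the fourth field, the MALFORMED bucket, and the
name `summarizeShadowLog` are all assumptions made for illustration.

```typescript
interface ShadowSummary {
  counts: Record<string, number>;
  lastDivergences: string[];
}

// Counts verdicts per line and keeps the most recent `lastN` DIVERGE lines.
function summarizeShadowLog(logText: string, lastN: number): ShadowSummary {
  const counts: Record<string, number> = {};
  const divergences: string[] = [];

  for (const line of logText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines, e.g. a trailing newline

    // Assumed layout: ts <TAB> localJob <TAB> coordJob <TAB> verdict
    const fields = line.split("\t");
    const verdict = fields[3] ?? "MALFORMED";

    counts[verdict] = (counts[verdict] ?? 0) + 1;
    if (verdict === "DIVERGE") divergences.push(line);
  }

  // slice(-0) would return the whole array, so guard the lastN <= 0 case explicitly.
  return { counts, lastDivergences: lastN > 0 ? divergences.slice(-lastN) : [] };
}
```

A selftest case can then write a known log, run the subcommand, and compare its counts
against the numbers this model would produce for the same input.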

VERIFY GATE:
- bash agent-queue/selftest.sh (60 prior + new shadow/flag cases; none weakened)
- bash -n agent-queue/agent-queue.sh && bash -n agent-queue/lib/fleet-client.sh
- shellcheck --severity=error agent-queue/agent-queue.sh agent-queue/lib/fleet-client.sh
- node --check agent-queue/dashboard.mjs (if unchanged)

CONSTRAINTS: bash + curl + POSIX awk only (no jq/new deps); reuse P2-S3 helpers; shadow must
be strictly side-effect-free on real job state; offline path unchanged when AQ_FLEET=0;
never hardcode tokens; conventional commits (feat(agent-queue): ...); never weaken a test;
do not edit the common-plat repo.

FINAL OUTPUT — report in EXACTLY this format:
## Implementation Report — Phase 2 Feature Flags + Shadow/Dual-run
### Branch & commits / PR
### Files changed
### What was implemented (flag model + precedence, shadow claim/compare/report, cutover ladder)
### Tests added (+ selftest summary = 60 prior + N new; esp. flags-off no-op, shadow non-fatal, ROUTE precedence)
### Verify gate results
### Deviations / assumptions (how shadow claim avoids real lease mutation)
### Suggested next slice