bytelyst-devops-tools/agent-queue/docs/jobs/phase2-enrollment-tokens.md

engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat-enrollment
timeout: 4h

ROLE: Senior backend + security engineer. Implement PHASE 2 — FACTORY ENROLLMENT + SCOPED ROTATABLE TOKENS (§12) for the fleet coordinator in platform-service, plus two small artifact-route hardening fixes found in review.

PARALLEL-SAFETY (another Devin is running in a DIFFERENT repo — agent-queue/devops-tools — on feature flags; no file overlap with you. Stay within platform-service):

  • You OWN: a NEW modules/fleet/enrollment.ts, modules/fleet/tokens.ts (or one enrollment.ts), enrollment.test.ts, and ADDITIVE edits to types.ts, repository.ts, routes.ts, cosmos-init.ts (factory token fields + enrollment endpoints + token-auth middleware). You MAY edit artifacts-blob.ts/routes.ts ONLY for the two review fixes below.
  • You MUST NOT change the scheduler.ts scoring, coordinator.ts claim/lease/fence CAS, or the heartbeat/claim PAYLOAD shape (only ADD an optional auth check around them, behind a flag — see below). Do not break any of the existing 79 fleet tests / 1591 platform tests.

READ FIRST:

  • modules/fleet/types.ts — FleetFactoryDoc (id, productId, capabilities, health, load, lastHeartbeatAt...). repository.ts — factory upsert (heartbeat). routes.ts — POST /fleet/factories/heartbeat, POST /fleet/claim (these will optionally require a token).
  • modules/auth/** in platform-service AND ../../packages/auth: reuse the EXISTING token/hashing primitives (bcrypt/sha-256 recovery-code pattern). Do NOT invent new crypto. Tokens are stored HASHED at rest; the plaintext is returned exactly once at enroll/rotate.
  • ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md §12 (enrollment, scoped tokens, rotation, revocation) + §18 (trust boundary).

PREREQUISITE / BRANCHING: branch off CURRENT main → feat/gigafactory-p2-enrollment. Push + open PR. DO NOT merge.

DELIVERABLES

  1. Factory enrollment + token lifecycle (enrollment.ts):
    • enrollFactory({productId, capabilities, label?}) → creates/links a FleetFactoryDoc and issues a SCOPED token: scope = {productId, factoryId, capabilities[]}. Persist only the HASH (+ tokenId, createdAt, lastUsedAt, status). Return plaintext token ONCE.
    • rotateToken(factoryId, productId) → issue a new token and invalidate the previous one (grace: mark the old token as rotating with a short overlap TTL so an in-flight worker isn't cut off).
    • revokeToken(tokenId|factoryId, productId) → status=revoked; immediately rejected.
    • verifyToken(plaintext) → resolves {factoryId, productId, capabilities, status} or null; constant-time hash compare; updates lastUsedAt. Revoked/expired ⇒ null.
  2. Token-auth on the fleet endpoints — GATED so existing tests keep passing:
    • Add a requireFactoryToken check to POST /fleet/factories/heartbeat and POST /fleet/claim that is ENFORCED only when enforcement is on (env/flag FLEET_REQUIRE_FACTORY_TOKEN, default OFF so the 79 existing tests are unaffected). When on: missing/invalid/revoked token ⇒ 401; the token scope must cover the requested productId + the claim's capabilities, else ⇒ 403. When off: behaves exactly as today.
    • The claim's effective capabilities/productId must be taken from the VERIFIED token scope when enforcement is on (a factory cannot claim outside its scope).
  3. Routes (additive): POST /fleet/factories/enroll, POST /fleet/factories/:id/token/rotate, POST /fleet/factories/:id/token/revoke — all auth + productId + Zod validated, registered like the existing fleet routes (do not reorder others).
  4. REVIEW FIXES (small, same module):
    • listArtifactsByJob must be productId-scoped: thread productId through repo.listArtifactsByJob + the GET /fleet/jobs/:id/artifacts handler (use the request productId), so a caller can only list artifacts for their own product.
    • Upload must prefer the request/auth productId over body.productId: drop the current precedence that lets body.productId win and use getRequestProductId(req); keep the body value only as a non-overriding hint, or remove it.
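
The token lifecycle in deliverable 1 can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the intended implementation: SHA-256 from node:crypto stands in for whichever existing primitive modules/auth exposes (the spec requires reusing that, not this), an in-memory Map stands in for the repository, the `tokenId.secret` token format and the 60-second overlap TTL are made up for the example, and enrollFactory takes a ready-made factoryId instead of creating a FleetFactoryDoc.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

type TokenStatus = "active" | "rotating" | "revoked";
interface TokenScope { productId: string; factoryId: string; capabilities: string[]; }

// Hypothetical stored shape: only the hash is persisted, never the plaintext.
interface TokenRecord extends TokenScope {
  tokenId: string;
  hash: Buffer;
  status: TokenStatus;
  createdAt: number;
  lastUsedAt: number | null;
  expiresAt: number | null; // set only while status === "rotating"
}

const OVERLAP_TTL_MS = 60_000; // assumed grace window; the spec only says "short"
const records = new Map<string, TokenRecord>(); // in-memory stand-in for the repository

const sha256 = (s: string): Buffer => createHash("sha256").update(s).digest();

// Create a record for the scope and return the plaintext exactly once.
function issue(scope: TokenScope, now: number): string {
  const tokenId = randomBytes(8).toString("hex");
  const plaintext = `${tokenId}.${randomBytes(32).toString("hex")}`;
  records.set(tokenId, {
    ...scope, tokenId, hash: sha256(plaintext), status: "active",
    createdAt: now, lastUsedAt: null, expiresAt: null,
  });
  return plaintext;
}

function enrollFactory(scope: TokenScope, now: number = Date.now()): string {
  return issue(scope, now);
}

// Old active tokens keep working until the overlap TTL elapses.
function rotateToken(factoryId: string, productId: string, now: number = Date.now()): string | null {
  let scope: TokenScope | null = null;
  for (const r of Array.from(records.values())) {
    if (r.factoryId === factoryId && r.productId === productId && r.status === "active") {
      r.status = "rotating";
      r.expiresAt = now + OVERLAP_TTL_MS;
      scope = { productId, factoryId, capabilities: r.capabilities };
    }
  }
  return scope ? issue(scope, now) : null;
}

function revokeToken(factoryId: string, productId: string): void {
  for (const r of Array.from(records.values())) {
    if (r.factoryId === factoryId && r.productId === productId) r.status = "revoked";
  }
}

function verifyToken(
  plaintext: string,
  now: number = Date.now(),
): (TokenScope & { status: TokenStatus }) | null {
  const [tokenId = ""] = plaintext.split(".");
  const r = records.get(tokenId);
  if (!r) return null;
  // Both digests are 32 bytes, so timingSafeEqual never throws on length.
  if (!timingSafeEqual(sha256(plaintext), r.hash)) return null;
  if (r.status === "revoked") return null;
  if (r.status === "rotating" && (r.expiresAt === null || now >= r.expiresAt)) return null;
  r.lastUsedAt = now;
  return {
    productId: r.productId, factoryId: r.factoryId,
    capabilities: r.capabilities, status: r.status,
  };
}
```

Passing `now` explicitly keeps the overlap window testable without fake timers; a real implementation would also need to decide whether revocation by tokenId and by factoryId share one code path.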

TESTS (enrollment.test.ts + targeted additions; tests are sacred, all prior green):

  • enroll returns a plaintext token once; the stored doc holds only a hash (assert no plaintext persisted) + scope (productId, capabilities).
  • verifyToken: valid → scope; tampered/unknown → null; revoked → null.
  • rotate: old token still works during the overlap TTL, then is rejected; new token works.
  • revoke: immediate rejection.
  • enforcement OFF (default): heartbeat/claim behave exactly as the existing tests expect (re-assert claim works with NO token).
  • enforcement ON: no token → 401; out-of-scope productId or capability → 403; in-scope → ok, and claim is constrained to the token's scope.
  • artifact fixes: list is productId-scoped (a different product cannot see the pointers); upload ignores a spoofed body.productId.
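
The enforcement OFF/ON cases above reduce to a small decision table, which may be easier to test as a pure function than through the routes. The sketch below assumes a hypothetical helper named gateClaim; the `enforce` parameter stands in for FLEET_REQUIRE_FACTORY_TOKEN, and `scope` for the result of verifyToken (null when the token is missing, invalid, or revoked). Returning the token's own capabilities as the effective set is one reading of "taken from the VERIFIED token scope", not something the spec pins down.

```typescript
interface TokenScope { productId: string; factoryId: string; capabilities: string[]; }
interface ClaimRequest { productId: string; capabilities: string[]; }

type GateResult =
  | { status: 200; productId: string; capabilities: string[] }
  | { status: 401 | 403; error: string };

// Hypothetical pure helper: decides the outcome of a claim before it reaches
// the coordinator. It does not touch the claim payload shape or the CAS logic.
function gateClaim(enforce: boolean, scope: TokenScope | null, req: ClaimRequest): GateResult {
  // Enforcement off: behave exactly as today and pass the request through.
  if (!enforce) {
    return { status: 200, productId: req.productId, capabilities: req.capabilities };
  }
  // Missing, invalid, or revoked token.
  if (scope === null) {
    return { status: 401, error: "missing or invalid factory token" };
  }
  // The token must cover the requested productId.
  if (scope.productId !== req.productId) {
    return { status: 403, error: "productId outside token scope" };
  }
  // The token must cover every requested capability.
  const allowed = new Set(scope.capabilities);
  const outOfScope = req.capabilities.filter((c) => !allowed.has(c));
  if (outOfScope.length > 0) {
    return { status: 403, error: `capabilities outside token scope: ${outOfScope.join(", ")}` };
  }
  // Effective values come from the verified scope, not from the request body.
  return { status: 200, productId: scope.productId, capabilities: scope.capabilities };
}
```

A route-level requireFactoryToken check could call such a helper after verifying the token and map the 401/403 results onto the HTTP response, leaving the existing handlers unchanged when the flag is off.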

VERIFY GATE:

  • pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet (all green; count grows from 79)
  • pnpm --filter @lysnrai/platform-service build
  • pnpm build && pnpm test (no regression across consumers)

CONSTRAINTS: ESM .js imports; no any; no console.log; productId on every doc; tokens HASHED at rest, plaintext shown once; reuse existing auth/crypto primitives (no new schemes); enforcement default OFF; conventional commits (feat(platform-service): ...); do not touch scheduler scoring or the claim CAS; do not edit the agent-queue repo.

FINAL OUTPUT — report in EXACTLY this format:

Implementation Report — Phase 2 Factory Enrollment + Scoped Tokens (§12)

Branch & commits / PR

Files changed

What was implemented (enroll/rotate/revoke/verify, scope model, gated auth, artifact fixes)

Tests added (+ pnpm test summary; esp. hashed-at-rest, scope 401/403, enforcement-off no-op)

Verify gate results

Deviations / assumptions (which crypto primitive, rotation overlap TTL, flag name)

Suggested next slice