bytelyst-devops-tools/agent-queue/docs/jobs/phase2-scheduler.md
Saravanakumar D 237481247e docs(gigafactory): uppercase GIGAFACTORY folder + add index README
Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every
reference (README, system-overview code-map, and all phase job specs). Add an
index README that lists the docs and points to the companion docs in
learning_ai_common_plat. Docs-only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:21:31 -07:00

engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat-scheduler
timeout: 4h

ROLE: Senior backend engineer. Implement the PHASE 2 SCHEDULER / ROUTER CORE (§7) for the fleet coordinator: a deterministic, fixed-weight scoring engine that picks WHICH job a claiming factory gets, and wire it into the atomic claim.

PARALLEL-SAFETY (two other Devins are running — DO NOT collide):

  • You OWN: services/platform-service/src/modules/fleet/scheduler.ts (NEW), scheduler.test.ts (NEW), and the candidate-ranking section of coordinator.ts + coordinator.test.ts.
  • You MUST NOT touch: types.ts, repository.ts, routes.ts, cosmos-init.ts, server.ts (another Devin is editing those for fleet_artifacts). If you need a new type, define it inside scheduler.ts and export it from there, even when the wiring seems to call for a types.ts change. Import the existing FleetJobDoc/FleetFactoryDoc from types.ts (read-only).
  • A third Devin is in a different repo (agent-queue) — no overlap.

READ FIRST:

  • services/platform-service/src/modules/fleet/coordinator.ts — claimNextJob / tryClaimJob: today it selects "highest-priority, oldest, deps-satisfied, capability-subset". You will replace the SELECTION step with the scoring engine (keep the atomic tryClaimJob CAS exactly as-is).
  • types.ts (read-only) — FleetJobDoc (priority, capabilities, budget, createdAt, deps, stage), FleetFactoryDoc (capabilities, health, load, seatLimit).
  • ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md §7 (the formula + tie-breaks + phasing note: Phase 2 = fixed weights; Phase 3 = tunable + preemption).
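
For orientation, the selection rule quoted above ("highest-priority, oldest, deps-satisfied, capability-subset") can be sketched as pure helpers. This is not the coordinator's actual code: BaselineJob, pickBaseline and both predicates are hypothetical names, and the field types (numeric priority where larger means more urgent, ISO-8601 createdAt, deps as job ids) are assumptions to check against types.ts.

```typescript
// Hypothetical minimal job shape; the real FleetJobDoc is defined in types.ts.
interface BaselineJob {
  id: string;
  priority: number;        // assumption: larger number = higher priority
  createdAt: string;       // assumption: ISO-8601 timestamp
  capabilities: string[];  // capabilities the job requires
  deps: string[];          // assumption: ids of jobs that must finish first
}

/** True when every capability the job requires is offered by the factory. */
function isCapabilitySubset(required: readonly string[], offered: readonly string[]): boolean {
  const have = new Set(offered);
  return required.every((cap) => have.has(cap));
}

/** True when every dependency id appears in the set of completed job ids. */
function depsSatisfied(deps: readonly string[], done: ReadonlySet<string>): boolean {
  return deps.every((id) => done.has(id));
}

/** Baseline pick: eligible jobs ordered by priority (desc), then createdAt (asc), then id. */
function pickBaseline(
  jobs: readonly BaselineJob[],
  factoryCaps: readonly string[],
  done: ReadonlySet<string>,
): BaselineJob | null {
  const eligible = jobs.filter(
    (j) => depsSatisfied(j.deps, done) && isCapabilitySubset(j.capabilities, factoryCaps),
  );
  const ordered = [...eligible].sort(
    (a, b) =>
      b.priority - a.priority ||
      Date.parse(a.createdAt) - Date.parse(b.createdAt) ||
      (a.id < b.id ? -1 : a.id > b.id ? 1 : 0), // final id comparison keeps the order total
  );
  return ordered[0] ?? null;
}
```

Keeping the two predicates free of I/O is what would let them move into scheduler.ts unchanged if the coordinator currently has them inline.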

PREREQUISITE / BRANCHING: branch off CURRENT main → feat/gigafactory-p2-scheduler. Push + open PR. DO NOT merge.

DELIVERABLES

  1. scheduler.ts (pure, no I/O, fully unit-testable):
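    • Weight config (fixed defaults, overridable via a passed-in object — NOT env here):
      score = w1·capabilityFit + w2·affinity(prefersEngine/repo-stickiness) + w3·(1/(1+load)) + w4·costFit(budget) + w5·health − w6·starvationPenalty(age)
      (Take the sign of the w6 term from ROADMAP §7; whatever its form, an aged job must end up outranking a fresh one.)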
    • scoreCandidate(job, factory, ctx, weights?) → { score, breakdown } — return the per-term breakdown for explainability (§7/Phase-3 readiness).
    • selectJob(candidates: FleetJobDoc[], factory, ctx, weights?) → FleetJobDoc | null — filter to deps-satisfied + capability-subset (reuse the coordinator's existing predicates; if they're inline, extract pure helpers INTO scheduler.ts), then rank by score; deterministic tie-break: higher priority → older createdAt → lower cost class.
    • Pure, synchronous, no datastore calls. Health/load come from the factory doc; age from job.createdAt vs ctx.now (coordinator-authoritative time, passed in).
  2. Wire into coordinator.claimNextJob: replace the ad-hoc selection with selectJob(...), passing the existing candidate set + the claiming factory + ctx.now. Keep tryClaimJob's rev/updateIfMatch CAS and lease/fence logic byte-for-byte unchanged. If the claim has no factory capabilities/health context today, thread the minimal fields through ClaimContext (additive, in coordinator.ts only).
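
A minimal sketch of what deliverable 1 could look like, under stated assumptions. The Job, Factory and Ctx shapes are hypothetical stand-ins for FleetJobDoc, FleetFactoryDoc and the claim context; costClass, maxCostClass, engine, prefersEngine and completedJobIds are invented for illustration; the weight values are placeholders, not chosen defaults. The starvation term is modeled as a subtracted penalty that decays with age, which is one reading of §7 that still lets old jobs rise.

```typescript
// Hypothetical shapes; the real FleetJobDoc / FleetFactoryDoc come from types.ts.
interface Job {
  id: string;
  priority: number;        // assumption: larger = more urgent
  capabilities: string[];
  deps: string[];
  createdAt: string;       // assumption: ISO-8601
  costClass: number;       // assumption: numeric class derived from budget
  prefersEngine?: string;
}
interface Factory {
  engine: string;
  capabilities: string[];
  health: 'ok' | 'degraded'; // assumption: two-state health
  load: number;
  maxCostClass: number;      // assumption
}
interface Ctx { now: number; completedJobIds: ReadonlySet<string> }

interface Weights {
  capabilityFit: number; affinity: number; load: number;
  costFit: number; health: number; starvation: number;
}
type Breakdown = Record<keyof Weights, number>;

// Illustrative values only.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 3, affinity: 1, load: 2, costFit: 1, health: 2, starvation: 1,
};

const hasCaps = (job: Job, f: Factory): boolean =>
  job.capabilities.every((c) => f.capabilities.includes(c));

function scoreCandidate(
  job: Job, factory: Factory, ctx: Ctx, weights: Weights = DEFAULT_WEIGHTS,
): { score: number; breakdown: Breakdown } {
  const ageHours = Math.max(0, ctx.now - Date.parse(job.createdAt)) / 3_600_000;
  const breakdown: Breakdown = {
    capabilityFit: weights.capabilityFit * (hasCaps(job, factory) ? 1 : 0),
    affinity: weights.affinity * (job.prefersEngine === factory.engine ? 1 : 0),
    load: weights.load / (1 + factory.load),
    costFit: weights.costFit * (job.costClass <= factory.maxCostClass ? 1 : 0),
    health: weights.health * (factory.health === 'ok' ? 1 : 0.5),
    // Subtracted penalty that shrinks as the job ages, so waiting jobs rise.
    starvation: -weights.starvation / (1 + ageHours),
  };
  // The score is exactly the sum of the weighted terms in the breakdown.
  const score = Object.values(breakdown).reduce((sum, term) => sum + term, 0);
  return { score, breakdown };
}

function selectJob(
  candidates: readonly Job[], factory: Factory, ctx: Ctx, weights: Weights = DEFAULT_WEIGHTS,
): Job | null {
  const ranked = candidates
    .filter((j) => hasCaps(j, factory) && j.deps.every((d) => ctx.completedJobIds.has(d)))
    .map((job) => ({ job, score: scoreCandidate(job, factory, ctx, weights).score }))
    .sort(
      (a, b) =>
        b.score - a.score ||                                              // best score first
        b.job.priority - a.job.priority ||                                // then higher priority
        Date.parse(a.job.createdAt) - Date.parse(b.job.createdAt) ||      // then older
        a.job.costClass - b.job.costClass ||                              // then lower cost class
        (a.job.id < b.job.id ? -1 : a.job.id > b.job.id ? 1 : 0),         // id keeps the order total
    );
  return ranked[0]?.job ?? null;
}
```

Because ctx.now is passed in rather than read from the clock, the same inputs always give the same pick, which is what the determinism test relies on. The final id comparison is an addition beyond the three tie-breaks listed above, included only so that two otherwise identical jobs cannot swap places between runs.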

TESTS (scheduler.test.ts + additions to coordinator.test.ts — tests are sacred):

  • capabilityFit: a factory missing a required cap → candidate filtered out (never selected).
  • priority dominates when all else equal; age breaks ties deterministically.
  • load: higher-load factory lowers score (1/(1+load)); health: degraded < ok.
  • starvation: an old low-priority job eventually outranks a fresh low-priority one.
  • costFit: a job exceeding the factory/budget cost class is penalized/last.
  • breakdown: scoreCandidate returns each weighted term (sums to score).
  • selectJob determinism: same inputs → same pick across runs; empty/no-eligible → null.
  • coordinator integration: claimNextJob still returns exactly one winner under the existing concurrency tests (all prior fleet tests stay green); selection now follows the score.
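
The "exactly one winner" property in the last bullet rests on the compare-and-swap inside tryClaimJob, not on the scheduler. The sketch below shows why two factories that select the same job cannot both claim it. MemoryStore, StoredJob and tryClaimSketch are hypothetical stand-ins; the real repository API and the lease/fence logic are not shown in this spec and are not modeled here.

```typescript
// Hypothetical in-memory stand-in for the job store.
interface StoredJob {
  id: string;
  rev: number;                     // revision used as the CAS token
  status: 'queued' | 'claimed';
  claimedBy?: string;
}

class MemoryStore {
  private readonly jobs = new Map<string, StoredJob>();

  put(job: StoredJob): void {
    this.jobs.set(job.id, { ...job });
  }

  /** Returns a copy, so callers hold a snapshot rather than live state. */
  get(id: string): StoredJob | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }

  /** Writes only if the stored rev still equals expectedRev; bumps rev on success. */
  updateIfMatch(next: StoredJob, expectedRev: number): boolean {
    const current = this.jobs.get(next.id);
    if (!current || current.rev !== expectedRev) return false;
    this.jobs.set(next.id, { ...next, rev: expectedRev + 1 });
    return true;
  }
}

/** Attempts a claim from a previously read snapshot; fails if the job changed since. */
function tryClaimSketch(store: MemoryStore, snapshot: StoredJob, factoryId: string): boolean {
  if (snapshot.status !== 'queued') return false;
  return store.updateIfMatch(
    { ...snapshot, status: 'claimed', claimedBy: factoryId },
    snapshot.rev,
  );
}
```

If two factories both read the job at rev 0 and both call tryClaimSketch, the first write moves the stored rev to 1 and the second fails its rev check. Swapping the selection step for selectJob changes which job each factory attempts, but leaves this guarantee untouched as long as the CAS itself is not edited.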

VERIFY GATE:

  • pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet (all green)
  • pnpm --filter @lysnrai/platform-service build
  • pnpm build && pnpm test (no regression)

CONSTRAINTS: ESM .js imports; no any; no console.log; fixed weights this phase (tunable + preemption are Phase 3 — do NOT build them); pure scheduler (no I/O); conventional commits (feat(platform-service): ...); do not touch the files reserved above; do not edit the agent-queue repo.

FINAL OUTPUT — report in EXACTLY this format:

Implementation Report — Phase 2 Scheduler/Router Core (§7)

Branch & commits / PR

Files changed

What was implemented (scoring terms, tie-breaks, coordinator wiring)

Tests added (+ pnpm test summary)

Verify gate results

Deviations / assumptions (what ctx fields were threaded, weight defaults chosen)

Suggested next slice (Phase 3 tunable weights + preemption + explainability UI)