bytelyst-devops-tools/agent-queue/docs/jobs/phase2-scheduler.md
Saravanakumar D 237481247e docs(gigafactory): uppercase GIGAFACTORY folder + add index README
Rename agent-queue/docs/gigafactory/ to docs/GIGAFACTORY/ and update every
reference (README, system-overview code-map, and all phase job specs). Add an
index README that lists the docs and points to the companion docs in
learning_ai_common_plat. Docs-only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:21:31 -07:00

---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat-scheduler
timeout: 4h
---
ROLE: Senior backend engineer. Implement the PHASE 2 SCHEDULER / ROUTER CORE (§7)
for the fleet coordinator: a deterministic, fixed-weight scoring engine that picks
WHICH job a claiming factory gets, and wire it into the atomic claim.
PARALLEL-SAFETY (two other Devins are running — DO NOT collide):
- You OWN: services/platform-service/src/modules/fleet/scheduler.ts (NEW),
scheduler.test.ts (NEW), and the candidate-ranking section of coordinator.ts +
coordinator.test.ts.
- You MUST NOT touch: types.ts, repository.ts, routes.ts, cosmos-init.ts, server.ts
(another Devin is editing those for fleet_artifacts). If you need a new type, define
it inside scheduler.ts. If wiring truly requires a types.ts change, instead re-export
from scheduler.ts. Import existing FleetJobDoc/FleetFactoryDoc from types.ts (read-only).
- A third Devin is in a different repo (agent-queue) — no overlap.
READ FIRST:
- services/platform-service/src/modules/fleet/coordinator.ts — claimNextJob /
tryClaimJob: today it selects "highest-priority, oldest, deps-satisfied,
capability-subset". You will replace the SELECTION step with the scoring engine (keep the atomic
tryClaimJob CAS exactly as-is).
- types.ts (read-only) — FleetJobDoc (priority, capabilities, budget, createdAt, deps,
stage), FleetFactoryDoc (capabilities, health, load, seatLimit).
- ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/GIGAFACTORY_ROADMAP.md §7 (the formula
+ tie-breaks + phasing note: Phase 2 = fixed weights; Phase 3 = tunable + preemption).
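
For orientation, a minimal sketch of the selection rule described above ("highest-priority, oldest, deps-satisfied, capability-subset") written as pure helpers. This is NOT the actual coordinator code: `JobLike`, `FactoryLike`, `depsSatisfied`, `capabilitySubset`, and `legacyPick` are hypothetical names, the shapes are trimmed-down stand-ins for FleetJobDoc/FleetFactoryDoc, and "larger priority number = more urgent" is an assumption.

```typescript
// Hypothetical, trimmed-down shapes; the real docs live in types.ts.
interface JobLike {
  id: string;
  priority: number; // assumption: larger number = more urgent
  capabilities: string[]; // capabilities the job requires
  deps: string[]; // ids of jobs that must finish first
  createdAt: string; // ISO-8601 timestamp
}

interface FactoryLike {
  capabilities: string[];
}

// True when every dependency id is in the completed set.
function depsSatisfied(job: JobLike, completed: ReadonlySet<string>): boolean {
  return job.deps.every((id) => completed.has(id));
}

// True when the job's required capabilities are a subset of the factory's.
function capabilitySubset(job: JobLike, factory: FactoryLike): boolean {
  const have = new Set(factory.capabilities);
  return job.capabilities.every((cap) => have.has(cap));
}

// Filter to eligible jobs, then order by priority (desc), age (oldest first),
// and finally id so the result never depends on input order.
function legacyPick(
  jobs: readonly JobLike[],
  factory: FactoryLike,
  completed: ReadonlySet<string>,
): JobLike | null {
  const eligible = jobs.filter(
    (job) => depsSatisfied(job, completed) && capabilitySubset(job, factory),
  );
  const sorted = [...eligible].sort(
    (a, b) =>
      b.priority - a.priority ||
      Date.parse(a.createdAt) - Date.parse(b.createdAt) ||
      (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  );
  return sorted[0] ?? null;
}
```

Keeping the two predicates separate from the ordering is what lets the scoring engine reuse the filter while swapping out only the ranking.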
PREREQUISITE / BRANCHING: branch off CURRENT main → feat/gigafactory-p2-scheduler.
Push + open PR. DO NOT merge.
DELIVERABLES
1. scheduler.ts (pure, no I/O, fully unit-testable):
- Weight config (fixed defaults, overridable via a passed-in object — NOT env here):
score = w1·capabilityFit + w2·affinity(prefersEngine/repo-stickiness)
+ w3·(1/(1+load)) + w4·costFit(budget) + w5·health - w6·starvationPenalty(age)
- `scoreCandidate(job, factory, ctx, weights?) → { score, breakdown }` — return the
per-term breakdown for explainability (§7/Phase-3 readiness).
- `selectJob(candidates: FleetJobDoc[], factory, ctx, weights?) → FleetJobDoc | null`:
filter to deps-satisfied + capability-subset (reuse the coordinator's existing
predicates; if they're inline, extract pure helpers INTO scheduler.ts), then rank by
score; deterministic tie-break: higher priority → older createdAt → lower cost class.
- Pure, synchronous, no datastore calls. Health/load come from the factory doc; age
from job.createdAt vs ctx.now (coordinator-authoritative time, passed in).
2. Wire into coordinator.claimNextJob: replace the ad-hoc selection with
`selectJob(...)`, passing the existing candidate set + the claiming factory + ctx.now.
Keep tryClaimJob's rev/updateIfMatch CAS and lease/fence logic byte-for-byte unchanged.
If the claim has no factory capabilities/health context today, thread the minimal fields
through ClaimContext (additive, in coordinator.ts only).
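
A minimal sketch of how the two scheduler functions could fit together, offered as an illustration rather than the required implementation. Everything here is an assumption: the `Job`, `Factory`, and `Ctx` shapes are hypothetical stand-ins for the real docs, the weight defaults are placeholders, each term's normalization is invented, and the starvation term is modeled as a subtracted penalty that starts at 1 for a brand-new job and decays toward 0 with age (so older jobs score higher). The final id tie-break is an extra added only to make the order total.

```typescript
// All shapes below are hypothetical stand-ins for FleetJobDoc / FleetFactoryDoc.
interface Job {
  id: string;
  priority: number; // assumption: larger = more urgent
  capabilities: string[];
  deps: string[];
  costClass: number; // assumption: smaller = cheaper
  prefersEngine?: string;
  createdAt: string; // ISO-8601
}
interface Factory {
  engine: string;
  capabilities: string[];
  load: number;
  health: "ok" | "degraded";
  maxCostClass: number;
}
interface Ctx {
  now: number; // coordinator-authoritative time, epoch ms
  completed: ReadonlySet<string>; // ids of finished jobs
}
type Term = "capabilityFit" | "affinity" | "load" | "costFit" | "health" | "starvation";
type Weights = Record<Term, number>;

// Placeholder defaults; the real values are left to the implementer.
const DEFAULT_WEIGHTS: Weights = {
  capabilityFit: 3, affinity: 1, load: 2, costFit: 2, health: 2, starvation: 1,
};

function hasAllCaps(job: Job, factory: Factory): boolean {
  return job.capabilities.every((cap) => factory.capabilities.includes(cap));
}

// Returns the total score plus each signed, weighted term (they sum to score).
function scoreCandidate(
  job: Job, factory: Factory, ctx: Ctx, weights: Weights = DEFAULT_WEIGHTS,
): { score: number; breakdown: Weights } {
  const ageHours = Math.max(0, ctx.now - Date.parse(job.createdAt)) / 3_600_000;
  const breakdown: Weights = {
    capabilityFit: weights.capabilityFit * (hasAllCaps(job, factory) ? 1 : 0),
    affinity: weights.affinity * (job.prefersEngine === factory.engine ? 1 : 0),
    load: weights.load * (1 / (1 + factory.load)),
    costFit: weights.costFit * (job.costClass <= factory.maxCostClass ? 1 : 0),
    health: weights.health * (factory.health === "ok" ? 1 : 0.5),
    // Subtracted term: penalty is 1 for a new job and shrinks as the job ages.
    starvation: -weights.starvation * (1 / (1 + ageHours)),
  };
  const score = Object.values(breakdown).reduce((sum, term) => sum + term, 0);
  return { score, breakdown };
}

// Filter to eligible jobs, rank by score, then apply the deterministic tie-breaks.
function selectJob(
  candidates: readonly Job[], factory: Factory, ctx: Ctx, weights: Weights = DEFAULT_WEIGHTS,
): Job | null {
  const ranked = candidates
    .filter((job) => hasAllCaps(job, factory) && job.deps.every((d) => ctx.completed.has(d)))
    .map((job) => ({ job, score: scoreCandidate(job, factory, ctx, weights).score }))
    .sort(
      (a, b) =>
        b.score - a.score ||
        b.job.priority - a.job.priority || // higher priority first
        Date.parse(a.job.createdAt) - Date.parse(b.job.createdAt) || // older first
        a.job.costClass - b.job.costClass || // lower cost class first
        (a.job.id < b.job.id ? -1 : a.job.id > b.job.id ? 1 : 0), // extra: total order
    );
  return ranked[0]?.job ?? null;
}
```

Because `ctx.now` is passed in rather than read from the clock, both functions stay pure and a test can pin the time to assert exact orderings.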
TESTS (scheduler.test.ts + additions to coordinator.test.ts — tests are sacred):
- capabilityFit: a factory missing a required cap → candidate filtered out (never selected).
- priority dominates when all else equal; age breaks ties deterministically.
- load: higher-load factory lowers score (1/(1+load)); health: degraded < ok.
- starvation: an old low-priority job eventually outranks a fresh low-priority one.
- costFit: a job exceeding the factory/budget cost class is penalized/last.
- breakdown: scoreCandidate returns each weighted term (sums to score).
- selectJob determinism: same inputs → same pick across runs; empty/no-eligible → null.
- coordinator integration: claimNextJob still returns exactly one winner under the existing
concurrency tests (all prior fleet tests stay green); selection now follows the score.
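
To illustrate why swapping the selection step should not affect the "exactly one winner" guarantee, here is a hedged sketch of a select-then-CAS claim loop over an in-memory store. It is not the real coordinator: `MemoryStore`, `JobRow`, and the `Select` callback are hypothetical, the bodies of `tryClaimJob` and `claimNextJob` below are invented for illustration (only the function names and the rev/updateIfMatch idea come from this spec), and lease/fence handling is omitted entirely.

```typescript
// Hypothetical row shape; the real job doc carries many more fields.
interface JobRow {
  id: string;
  rev: number;
  status: "queued" | "claimed";
  claimedBy?: string;
}

// Hypothetical in-memory stand-in for the datastore.
class MemoryStore {
  private rows = new Map<string, JobRow>();

  put(row: JobRow): void {
    this.rows.set(row.id, { ...row });
  }

  get(id: string): JobRow | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }

  queued(): JobRow[] {
    return Array.from(this.rows.values())
      .filter((row) => row.status === "queued")
      .map((row) => ({ ...row }));
  }

  // Compare-and-swap: write only if the stored rev still equals expectedRev.
  updateIfMatch(next: JobRow, expectedRev: number): boolean {
    const current = this.rows.get(next.id);
    if (!current || current.rev !== expectedRev) return false;
    this.rows.set(next.id, { ...next, rev: expectedRev + 1 });
    return true;
  }
}

// Attempts the claim against the snapshot the caller read earlier.
function tryClaimJob(store: MemoryStore, snapshot: JobRow, factoryId: string): boolean {
  if (snapshot.status !== "queued") return false;
  return store.updateIfMatch(
    { ...snapshot, status: "claimed", claimedBy: factoryId },
    snapshot.rev,
  );
}

// Selection is injected, so a scoring engine can replace it without touching the CAS.
type Select = (candidates: readonly JobRow[]) => JobRow | null;

function claimNextJob(
  store: MemoryStore,
  factoryId: string,
  select: Select,
  maxAttempts = 3,
): JobRow | null {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const pick = select(store.queued());
    if (pick === null) return null;
    if (tryClaimJob(store, pick, factoryId)) return store.get(pick.id) ?? null;
    // Lost the race: loop to re-read candidates and pick again.
  }
  return null;
}
```

Two factories holding the same stale snapshot can both call `tryClaimJob`, but only the first write matches the stored rev, so the second returns false regardless of how the job was selected.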
VERIFY GATE:
- pnpm --filter @lysnrai/platform-service exec vitest run src/modules/fleet (all green)
- pnpm --filter @lysnrai/platform-service build
- pnpm build && pnpm test (no regression)
CONSTRAINTS: ESM .js imports; no any; no console.log; fixed weights this phase (tunable +
preemption are Phase 3; do NOT build them); pure scheduler (no I/O); conventional commits
(feat(platform-service): ...); do not touch the files reserved above; do not edit the
agent-queue repo.
FINAL OUTPUT: report in EXACTLY this format:
## Implementation Report — Phase 2 Scheduler/Router Core (§7)
### Branch & commits / PR
### Files changed
### What was implemented (scoring terms, tie-breaks, coordinator wiring)
### Tests added (+ pnpm test summary)
### Verify gate results
### Deviations / assumptions (what ctx fields were threaded, weight defaults chosen)
### Suggested next slice (Phase 3 tunable weights + preemption + explainability UI)