docs(agent-queue): add Phase 3 overnight (10h) job — tunable scoring+preemption, DAG, budgets, tracker-web control plane
parent eaaa545e6c
commit 2d76af916d

agent-queue/docs/jobs/phase3-overnight.md (normal file, 162 lines added)
@@ -0,0 +1,162 @@
---
engine: devin
cwd: /Users/sd9235/code/mygh/learning_ai_common_plat
yolo: true
lock: common-plat-phase3
timeout: 10h
---

ROLE: Senior full-stack engineer. Implement PHASE 3 of the Agent Gigafactory END-TO-END
in `learning_ai_common_plat`, SEQUENTIALLY, over a long unattended run: smart routing
(tunable weights + preemption), DAG job decomposition, per-product budgets, and the
tracker-web fleet control plane. Work SLICE BY SLICE; each slice is self-contained,
fully tested, and pushed before the next begins. This is an overnight run: favor
correctness and small verifiable steps, and never leave main or the PR in a broken state.

================================================================================
PREREQUISITE (the operator guarantees this before starting): Phase 2 is COMPLETE and
merged to origin/main: fleet foundation, atomic claim, scheduler/router core, artifacts,
enrollment+tokens, feature-flags/shadow, the in-process tracker->module wiring, and the
two-factory demo are ALL on main. You branch off CURRENT origin/main.
================================================================================

GLOBAL GUARDRAILS (unattended danger mode; obey strictly):
- Branch: feat/gigafactory-phase3 off CURRENT origin/main. ONE long-lived branch; ONE
  commit per slice (conventional commits). Push after EVERY slice. Open ONE PR after
  Slice 1 and keep pushing to it. DO NOT MERGE anything. DO NOT touch origin/main.
- Tests are SACRED: never delete, weaken, skip, or `.skip`/`.only` a test to go green.
  If you cannot make a slice pass honestly, see the FAILURE PROTOCOL below.
- A slice is "done" only when its VERIFY GATE is fully green. Never start slice N+1 with
  slice N red.
- Reserved / DO NOT TOUCH: the agent-queue repo (different repo), unrelated services
  (cowork-service, extraction-service), packages/* internals (consume, don't edit),
  and any backup/* or dependabot/* branches. Stay in services/platform-service +
  dashboards/tracker-web.
- Conventions: ESM `.js` import specifiers; no `any`; no console.log (use app.log/req.log,
  and the tracker-web logger/telemetry pattern); every Cosmos doc carries `productId`;
  reuse @bytelyst/* packages and existing module patterns (types.ts -> repository.ts ->
  routes.ts). Do NOT hardcode colors/URLs/secrets.
- CHECKPOINTING: maintain docs/gigafactory-phase3-progress.md on the branch. After each
  slice, record: slice name, status (DONE/WIP/FAILED), commit sha, verify-gate result,
  and any follow-ups. Commit it WITH the slice. If you resume after an interruption, read
  it first and continue from the first not-DONE slice.

FAILURE PROTOCOL (per slice): attempt the verify gate up to 3 times, fixing the ROOT
cause each time (not the test). If still red after 3 honest attempts: commit the WIP with
message `wip(<scope>): <slice> — BLOCKED: <one-line reason>`, mark it FAILED in
progress.md with the exact failing output, and MOVE ON to the next slice that does NOT
depend on it (dependencies noted per slice). Never thrash; never fake green.

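The attempt-and-move-on rule in the FAILURE PROTOCOL can be pictured as two small pure helpers. This is a minimal sketch, not something the repo provides: `GateResult`, `runSlice`, `SlicePlan`, and `eligibleAfterFailure` are hypothetical names, and it assumes only hard dependencies are listed in `dependsOn`.

```typescript
// Illustrative only: every name here is hypothetical, not an existing repo API.
type GateResult = { green: boolean; output: string };

interface SliceOutcome {
  status: "DONE" | "FAILED";
  attempts: number;
  lastOutput: string; // the exact gate output to record in progress.md
}

// Run the verify gate up to maxAttempts times, applying a root-cause fix
// between red runs. Stops at the first green run.
function runSlice(
  verify: () => GateResult,
  fixRootCause: (failure: GateResult) => void,
  maxAttempts = 3,
): SliceOutcome {
  let last: GateResult = { green: false, output: "" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = verify();
    if (last.green) {
      return { status: "DONE", attempts: attempt, lastOutput: last.output };
    }
    // No point fixing after the final attempt; the slice is marked FAILED.
    if (attempt < maxAttempts) fixRootCause(last);
  }
  return { status: "FAILED", attempts: maxAttempts, lastOutput: last.output };
}

interface SlicePlan {
  name: string;
  dependsOn: string[]; // assumption: hard dependencies only
}

// After a failure, keep only the remaining slices that do not depend on a failed one.
function eligibleAfterFailure(
  remaining: readonly SlicePlan[],
  failed: ReadonlySet<string>,
): SlicePlan[] {
  return remaining.filter((s) => !s.dependsOn.some((d) => failed.has(d)));
}
```
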
READ FIRST:
- ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md: §7 (scoring;
  Phase-3 = tunable weights + preemption), §5/§6 (DAG/deps), §11/§13 (budgets), §14
  Phase-3 checklist + Exit criteria, §16 Definition-of-Done.
- services/platform-service/src/modules/fleet/{scheduler,coordinator,repository,routes,types}.ts:
  the engine you extend (read the existing claim/lease/fence/selectJob).
- dashboards/tracker-web/: match its App-Router structure (src/app, src/app/api),
  data-fetching/auth pattern, @bytelyst/ui + design-tokens usage, vitest + Playwright
  (e2e/, playwright.config.ts) setup. The existing fleet HTTP API you consume:
  POST/GET /fleet/jobs, GET /fleet/jobs/:id, PATCH /fleet/jobs/:id, POST /fleet/claim,
  lease renew/release, POST /fleet/factories/heartbeat|enroll, token rotate/revoke,
  GET /fleet/jobs/:id/runs, GET /fleet/jobs/:id/events, artifacts upload/list/get/delete.

--------------------------------------------------------------------------------
SLICE 1 — Tunable scoring weights + preemption (backend; depends on: nothing)
--------------------------------------------------------------------------------
Extend the PURE scheduler (scheduler.ts) without breaking §7 Phase-2 behavior:
- Weights become configurable per-product/per-request (passed in; fixed defaults preserved
  so existing tests stay green). Add a small typed FleetWeightConfig resolver (defaults ->
  optional product override). NO env reads inside the pure module.
- Preemption: a `selectWithPreemption(candidates, runningJobs, factory, ctx, weights?)`
  that, when a CRITICAL job cannot be placed and only lower-priority jobs are running,
  returns a preemption decision { evict: jobId, reason, breakdown }. PURE, no I/O.
- Wire preemption into the coordinator behind a flag (FLEET_PREEMPTION, default OFF; OFF =
  byte-for-byte current behavior). Eviction must checkpoint + requeue the victim via the
  EXISTING fenced-requeue path (bump leaseEpoch; the zombie's late report is fenced).
TESTS: weight override changes ranking; defaults reproduce all prior picks; preemption
evicts only a strictly-lower-priority running job, never an equal/higher; victim is
requeued with checkpoint + bumped epoch and its stale report is fenced; flag OFF = no
preemption. VERIFY GATE: pnpm --filter @lysnrai/platform-service exec vitest run
src/modules/fleet && build && (pnpm build && pnpm test).

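One possible shape for the Slice 1 weight resolver and the pure preemption decision is sketched here. Everything beyond the names `FleetWeightConfig` and `selectWithPreemption` is an assumption made for illustration: the weight fields, the default values, the `Priority` levels, and the `freeSlots` capacity model are placeholders, and the real types in the fleet module may differ.

```typescript
// Illustrative sketch only; field names and defaults are assumptions.
type Priority = "low" | "normal" | "high" | "critical";

interface FleetWeightConfig {
  priority: number;
  age: number;
  affinity: number;
}

// Placeholder defaults, NOT the Phase-2 values from the roadmap.
const DEFAULT_WEIGHTS: FleetWeightConfig = { priority: 10, age: 1, affinity: 5 };

// defaults -> optional product override. Pure: no env reads, no I/O.
// Undefined override fields are ignored so they never erase a default.
function resolveWeights(override: Partial<FleetWeightConfig> = {}): FleetWeightConfig {
  const resolved: FleetWeightConfig = { ...DEFAULT_WEIGHTS };
  for (const key of Object.keys(DEFAULT_WEIGHTS) as (keyof FleetWeightConfig)[]) {
    const value = override[key];
    if (value !== undefined) resolved[key] = value;
  }
  return resolved;
}

interface JobRef { id: string; priority: Priority }
interface FactoryState { freeSlots: number } // hypothetical capacity model

interface PreemptionDecision {
  evict: string;
  reason: string;
  breakdown: { candidate: string; candidateRank: number; victimRank: number };
}

const RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, critical: 3 };

// Returns a decision only when a critical candidate cannot be placed and a
// strictly lower-priority job is running. ctx and weights are unused here.
function selectWithPreemption(
  candidates: readonly JobRef[],
  runningJobs: readonly JobRef[],
  factory: FactoryState,
  _ctx: { now: number },
  _weights: FleetWeightConfig = resolveWeights(),
): PreemptionDecision | null {
  const critical = candidates.find((c) => c.priority === "critical");
  if (critical === undefined || factory.freeSlots > 0) return null;

  // Pick the lowest-ranked running job; equal or higher ranks are never evicted.
  let victim: JobRef | undefined;
  for (const job of runningJobs) {
    if (RANK[job.priority] >= RANK[critical.priority]) continue;
    if (victim === undefined || RANK[job.priority] < RANK[victim.priority]) victim = job;
  }
  if (victim === undefined) return null;

  return {
    evict: victim.id,
    reason: `critical job ${critical.id} cannot be placed`,
    breakdown: {
      candidate: critical.id,
      candidateRank: RANK[critical.priority],
      victimRank: RANK[victim.priority],
    },
  };
}
```

The sketch evicts whenever some strictly lower-priority job is running; whether the presence of an equal-priority running job should block preemption entirely is a decision the spec above leaves to the implementation.
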
--------------------------------------------------------------------------------
SLICE 2 — DAG job decomposition (backend; depends on: nothing; independent of S1)
--------------------------------------------------------------------------------
Parent/child jobs with dependency-gated execution (§5/§6):
- types: a job may declare children (subtasks) and dependsOn[] (sibling/child ids). Reuse
  existing kind ('leaf'|...) + parentId; add child submission + a DAG edge model. Cycle
  detection at submit (extend the existing submit-time cycle check).
- coordinator: a parent is not claimable until its children reach a terminal state (or its
  declared deps are satisfied); completing the last child unblocks the parent. claimNextJob
  only returns deps-satisfied jobs (extend the existing predicate). Fan-out: submitting a
  parent with children atomically creates the children.
- routes (additive): POST /fleet/jobs/:id/children (submit children), GET /fleet/jobs/:id/dag
  (return the subtree + per-node stage). productId-scoped.
TESTS: parent blocked until children done; last child completion unblocks parent; cycle at
submit -> rejected; capability/priority still respected per node; DAG endpoint returns the
correct subtree; all prior fleet tests green. VERIFY GATE as in Slice 1 (+ items unaffected).

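The two checks Slice 2 relies on, cycle rejection at submit and a deps-satisfied claim predicate, could look roughly like this. It is a sketch under assumptions: `JobNode`, the stage names, and the choice to treat both `done` and `failed` as terminal are hypothetical, and the existing submit-time cycle check in the repo may be structured differently.

```typescript
// Illustrative sketch only; stage names and shapes are assumptions.
type Stage = "queued" | "running" | "done" | "failed";

interface JobNode {
  id: string;
  dependsOn: string[];
  stage: Stage;
}

// Assumption: both outcomes count as "terminal" for unblocking purposes.
const TERMINAL: ReadonlySet<Stage> = new Set<Stage>(["done", "failed"]);

// Depth-first search with visiting/visited marks: meeting a node that is
// still "visiting" means the dependency edges loop back on themselves.
function hasCycle(jobs: readonly JobNode[]): boolean {
  const deps = new Map(jobs.map((j) => [j.id, j.dependsOn] as const));
  const state = new Map<string, "visiting" | "visited">();

  const visit = (id: string): boolean => {
    const seen = state.get(id);
    if (seen === "visiting") return true;
    if (seen === "visited") return false;
    state.set(id, "visiting");
    for (const dep of deps.get(id) ?? []) {
      if (visit(dep)) return true;
    }
    state.set(id, "visited");
    return false;
  };

  return jobs.some((j) => visit(j.id));
}

// A job is claimable only when every declared dependency exists and is terminal.
function depsSatisfied(job: JobNode, byId: ReadonlyMap<string, JobNode>): boolean {
  return job.dependsOn.every((id) => {
    const dep = byId.get(id);
    return dep !== undefined && TERMINAL.has(dep.stage);
  });
}
```

Modelling a parent's children as entries in its `dependsOn` list lets one predicate cover both the "children terminal" and "declared deps satisfied" cases, which is a simplification chosen for this example.
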
--------------------------------------------------------------------------------
SLICE 3 — Per-product budgets + pause/resume (backend; depends on: nothing)
--------------------------------------------------------------------------------
Cost ceilings that pause routing (§11/§13):
- A FleetBudgetDoc per productId (ceilingUsd, window, spentUsd, status active|paused).
  Spend accrues from job run cost (reuse run/insights cost if present; else estimate from
  budget.usd at completion). Container partitioned by /productId.
- Enforcement in claimNextJob: if the product's budget is paused or the next job would
  exceed the ceiling, that product's jobs are NOT claimed (other products unaffected).
  Behind FLEET_BUDGETS (default OFF = unchanged).
- routes (additive): GET/PUT /fleet/budgets/:productId, POST /fleet/budgets/:productId/pause,
  POST /fleet/budgets/:productId/resume.
TESTS: under ceiling -> claims proceed; crossing ceiling -> that product pauses, others
still claim; manual pause blocks claims; resume restores; flag OFF = no enforcement;
spend accounting is monotonic + idempotent per run. VERIFY GATE as above.

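A rough sketch of the Slice 3 budget gate and spend accounting follows. Only `productId`, `ceilingUsd`, `spentUsd`, and `status` come from the spec above; the `countedRunIds` field used for idempotency, the function names, and the choice to pause once spend reaches the ceiling are assumptions made for this example, and the `window` field is left out for brevity.

```typescript
// Illustrative sketch only; countedRunIds and both function names are hypothetical.
interface FleetBudgetDoc {
  productId: string;
  ceilingUsd: number;
  spentUsd: number;
  status: "active" | "paused";
  countedRunIds: string[]; // assumption: run ids already charged, for idempotency
}

// Claim gate for one product. With the flag OFF, or with no budget doc,
// behaviour is unchanged and the claim proceeds.
function canClaim(
  budget: FleetBudgetDoc | undefined,
  estimatedCostUsd: number,
  budgetsEnabled: boolean,
): boolean {
  if (!budgetsEnabled || budget === undefined) return true;
  if (budget.status === "paused") return false;
  return budget.spentUsd + estimatedCostUsd <= budget.ceilingUsd;
}

// Charge a run exactly once. Spend never decreases, and reaching the ceiling
// pauses the product (assumption: ">=" rather than ">").
function recordSpend(budget: FleetBudgetDoc, runId: string, costUsd: number): FleetBudgetDoc {
  if (budget.countedRunIds.includes(runId)) return budget; // replayed report: no-op
  const spentUsd = budget.spentUsd + Math.max(0, costUsd);
  const status = spentUsd >= budget.ceilingUsd ? "paused" : budget.status;
  return { ...budget, spentUsd, status, countedRunIds: [...budget.countedRunIds, runId] };
}
```

Because each product has its own document, a paused budget only affects claims evaluated against that document; other products pass through `canClaim` with their own state.
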
--------------------------------------------------------------------------------
SLICE 4 — tracker-web Fleet Control Plane UI (frontend; depends on: S1-S3 endpoints,
but build defensively: feature-detect/degrade if an endpoint is absent)
--------------------------------------------------------------------------------
A new `/fleet` section in dashboards/tracker-web (App Router), matching existing patterns:
- Typed fleet API client (src/lib or src/app/api proxy as the repo does it) wrapping the
  fleet endpoints with auth token injection (reuse the existing auth/client pattern).
- Pages/components (use @bytelyst/ui + --*-tokens; every interactive element has an
  aria-label or visible label):
    * Fleet map: factories (id, caps, health, load, lease state) as live cards.
    * Job table: filter by product/stage/priority; submit-job modal; row -> job detail.
    * Job detail: stage timeline from /events, runs from /runs, artifacts list, a SHIP
      action (PATCH stage), and the DAG subtree (from /dag) when present.
    * Budget panel: per-product ceiling + spent + pause/resume controls.
- Live updates via polling (simple, robust) unless an SSE/stream endpoint exists.
TESTS: vitest component/unit tests for the client + key components (render, actions call
the right endpoint, error/empty/degraded states); Playwright e2e for the core flow
(see fleet map -> open a job -> ship; pause a budget -> resume). VERIFY GATE:
the tracker-web `verify` script (typecheck + lint + test + e2e) green. Run exactly what
its package.json defines (e.g. pnpm --filter <tracker-web> run verify, or the documented
equivalent). Do not weaken its lint/e2e config.

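The "feature-detect/degrade" requirement for the Slice 4 client might be handled with a result type that separates an absent endpoint from a real error. This is a minimal sketch, not the tracker-web pattern: `createFleetClient`, `Fetcher`, `Result`, the Bearer header, and the treatment of 404/501 as "absent" are all assumptions, and the real client should reuse whatever auth/client pattern the repo already has.

```typescript
// Illustrative sketch only; names and the auth header format are assumptions.
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

// Injected so tests can supply a fake instead of the global fetch.
type Fetcher = (
  url: string,
  init?: { method?: string; headers?: Record<string, string> },
) => Promise<MinimalResponse>;

type Result<T> =
  | { kind: "ok"; data: T }
  | { kind: "absent" } // endpoint not deployed: the UI hides or degrades the panel
  | { kind: "error"; status: number };

function createFleetClient(baseUrl: string, getToken: () => string, fetcher: Fetcher) {
  async function get(path: string): Promise<Result<unknown>> {
    const res = await fetcher(`${baseUrl}${path}`, {
      method: "GET",
      headers: { authorization: `Bearer ${getToken()}` },
    });
    // Assumption: 404/501 means the route is missing. A 404 can also mean
    // "no such job", so a real client may need a dedicated capability probe.
    if (res.status === 404 || res.status === 501) return { kind: "absent" };
    if (!res.ok) return { kind: "error", status: res.status };
    return { kind: "ok", data: await res.json() };
  }

  return {
    getJobDag: (jobId: string) => get(`/fleet/jobs/${encodeURIComponent(jobId)}/dag`),
    getBudget: (productId: string) => get(`/fleet/budgets/${encodeURIComponent(productId)}`),
  };
}
```

Responses are kept as `unknown` so callers must validate the payload before use, which fits the "no `any`" convention listed in the guardrails.
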
--------------------------------------------------------------------------------
SLICE 5 — Docs + roadmap + Phase-3 exit criteria (depends on: S1-S4 outcomes)
--------------------------------------------------------------------------------
- Update ../learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY_ROADMAP.md §14 Phase-3
  checkboxes for every box you actually completed, with a one-line note + the flag names
  (FLEET_PREEMPTION/FLEET_BUDGETS) and which are default-OFF. Tick the Phase-3 Exit-criteria
  line ONLY if its conditions are genuinely met; otherwise note the exact remaining %.
  (This is a docs edit in the OTHER repo: make it as a separate small commit/PR in
  learning_ai_devops_tools, OR include the roadmap delta as a patch file under
  docs/ in THIS branch and note it for the operator. Do NOT entangle the two repos'
  git history. Prefer the patch-file note if a clean cross-repo PR isn't trivial.)
- Update dashboards/tracker-web/README + a short docs/FLEET_CONTROL_PLANE.md (how to use
  the new UI, the flags, the endpoints consumed).
- Finalize docs/gigafactory-phase3-progress.md with the end-state of every slice.

FINAL OUTPUT — print ONE consolidated report in EXACTLY this format:
## Implementation Report — Phase 3 (overnight)
### Branch & PR
### Per-slice results
| slice | status (DONE/WIP/FAILED) | commit | verify gate | notes |
### What was implemented (per slice: key files, flags, endpoints, UI surfaces)
### Tests added (counts per area + the final verify-gate output per slice)
### Deviations / assumptions (weight defaults, budget accounting source, polling vs SSE, any degraded UI paths)
### Phase 3 status (which §14 boxes now complete; exit criteria met Y/N; remaining %)
### Anything that needs a human decision (esp. risky majors, cross-repo roadmap tick)
### Suggested next phase (Phase 4 — message bus + autoscaling + capability marketplace)