Commit Graph

475 Commits

Author SHA1 Message Date
saravanakumardb1
e4c84acf29 fix(fleet): bump the M0 queue gate when a draft/queued job is edited
updateDraft changes caps/deps/priority (and body) without a stage change, so it
did not bump fleet_queue_state — gated factories (AQ_FLEET_GATE=1) would not
re-evaluate claimability until the safety interval. Bump the gate on edit so an
edit that makes a job claimable wakes factories promptly.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:58:45 -07:00
saravanakumardb1
b7df779e1d feat(fleet): draft jobs + editable prompt (save-as-draft, submit, lock on pickup)
Backend (platform-service):
- New `draft` stage (not claimable; scheduler only takes queued/blocked).
- submitJob accepts `draft: true` → parks a new/superseded job as a draft.
- updateDraft(): edit prompt/config in place while draft/queued/blocked;
  recomputes contentHash; rejected (conflict) once picked up (assigned+).
- submitDraft(): promote draft → queued (or blocked on unmet deps); idempotent.
- Routes: PATCH /fleet/jobs/:id/draft, POST /fleet/jobs/:id/submit.
- tracker-bridge: map draft → item status `open`. Tests + FLEET_STAGES updated.

Frontend (tracker-web):
- New-Job form: add "Save as draft" alongside "Submit".
- Job detail: edit the prompt + Save while draft/queued/blocked, "Submit" a
  draft, and lock it read-only once a factory picks it up.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:44:10 -07:00
saravanakumardb1
78c4e47460 docs(gigafactory): fix stale/incorrect fleet docs
- fleet module README: add fleet_queue_state container + GET /fleet/queue-state
  and /fleet/metrics; note the heartbeat cadence must stay under the 90s stale
  threshold (AQ_FLEET_LEASE_RENEW_SEC).
- FLEET_CONTROL_PLANE: correct wrong endpoint paths (/fleet/claim and
  /fleet/factories/heartbeat were documented as /fleet/jobs/:id/claim and
  /fleet/factories/:id/heartbeat; removed a non-existent GET /fleet/factories);
  add enroll, metrics, and the M0 queue-state endpoint.
- ROADMAP_COMPLETION_AUDIT: dated status banner — roadmap §0 now reconciled and
  Phase-4 M0 shipped, superseding the older "stale §0 / not started" findings.
- README: point to FLEET_DISPATCH_REDESIGN.md + the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:03:05 -07:00
saravanakumardb1
ba7db0008d feat(fleet): M0 RU gate — cheap per-product queue version + skip-claim
Adds fleet_queue_state (monotonic version per product), bumped on job create +
every stage change in the repository layer (best-effort, never fails a job
write), and a GET /fleet/queue-state read endpoint. Lets a polling factory
detect "work changed" with a ~1 RU point read instead of a full listJobs scan
on every claim. Registers the container; tests cover the bump + endpoint.

See agent-queue docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md §8/§12 (M0).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 23:18:27 -07:00
Saravanakumar Dhandapani
65c7d09584 feat(auth): include displayName claim in platform access token
Adds an optional displayName claim to the platform access token so
downstream product backends can source the user's display name from the
JWT (single source of truth = platform auth), not from per-product DB
copies. verifyToken already exposes displayName; this populates it at all
token-minting sites (password login, register, refresh, SSO, OAuth,
magic-link, passkeys, QR, enterprise SAML/OIDC). Additive and backward
compatible.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-31 20:23:15 -07:00
saravanakumardb1
2bd97791c9 feat(fleet): resilient PR merge on ship (inline attempt + background retry)
The corporate proxy intermittently 407s GitHub's API, so a single gh pr merge can
fail transiently. Try once inline (fast path), then retry in the background with
backoff (3s/8s/20s/45s) without blocking the ship; mark prState=merged when one
lands. Best-effort throughout.
2026-05-31 16:13:52 -07:00
saravanakumardb1
740335a149 feat(fleet): Ship can also merge the linked PR (gh pr merge)
On ship (Ship button / operator action / autoship PATCH), when the run has an open
PR and FLEET_SHIP_MERGES_PR=1, the coordinator squash-merges it via gh (best-effort,
where gh is authed) and marks the run prState=merged. UI button reads 'Ship & merge
PR' when an open PR exists; Ship refreshes runs.
2026-05-31 16:08:28 -07:00
saravanakumardb1
696ee4189e feat(fleet): track + show PR status (open/merged) on the run
Runs now carry prState (open when the PR is opened, merged when auto-merge
succeeds), reported on lease release. Job-detail Runs table shows a status badge
next to the PR link.
2026-05-31 13:56:33 -07:00
saravanakumardb1
c239abeec9 feat(fleet): per-repo verify + auto-merge options for PR jobs
Add job-level verify (command run in the PR checkout before opening the PR) and
autoMerge (squash-merge the PR once opened). Surfaced in the New Job form as a
Verify-command field + Auto-merge checkbox (PR mode only); confirmation now shows
PR-mode/repo. More repos added to the dropdown.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 06:17:19 -07:00
saravanakumardb1
883cf329e5 feat(fleet): PR deliverables — jobs target a repo, factory opens a PR, link recorded
Make "shipped" produce a real artifact. A job can now carry an optional repo
(owner/name or clone URL) + baseBranch; the factory's PR mode runs the agent in an
isolated checkout, opens a PR, and records the link.

Backend:
- SubmitJobSchema + FleetJobDoc: optional repo/baseBranch (recorded on submit).
- FleetRunDoc: optional prUrl/branch.
- ReleaseLease report carries prUrl/branch -> stored on the run.
- +2 coordinator tests.

UI (tracker-web):
- New Job form gains optional Repo + Base branch fields (and fixes the priority
  options to the valid critical/high/medium/low; "normal" was rejected by the API).
- Job detail Runs table shows a PR ↗ link from run.prUrl.
- fleet-client: submitJob repo/baseBranch; FleetRun prUrl/branch; OperatorAction +ship.

Docs: FLEET_CONTROL_PLANE.md "PR deliverable (PR mode)" section.

Verified: tsc clean; fleet suite 182; tracker-web 230.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 05:27:11 -07:00
saravanakumardb1
bcdb5dcdf3 fix(fleet): run result mirrors a shipped job + document testing->shipped
A job can reach `shipped` via autoship PATCH, the `ship` operator action, or a
terminal lease release, but the run-level `result` was left at whatever the
factory last reported (e.g. `review`), so the dashboard showed a shipped job with
a non-terminal run result.

- Add markLatestRunShipped(): on any transition to `shipped`, set the latest run
  result to `shipped` (+ endedAt if unset). Idempotent, best-effort.
- Wire it into patchJobFenced (ungated; budget accrual stays flag-gated) and the
  `ship` operator action.
- Document the testing->shipped paths (factory autoship vs `ship` operator action)
  and the run-mirroring in docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md.

Tests: +2 (patchJobFenced->shipped and operator ship both set run.result=shipped).
Fleet suite 180 pass; tsc clean.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 04:31:46 -07:00
saravanakumardb1
8fe26027e7 chore(deps): bump @types/node 22 -> 25 (dev types)
Verified: full workspace build (tsc) green across all packages/services/dashboards;
fleet+items tests pass. Compile-time only.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 04:02:56 -07:00
saravanakumardb1
3022e634b8 feat(billing): migrate to Stripe SDK v20 (API shape changes)
Bump stripe 17 -> 20 and adapt to the breaking API changes:
- PromotionCode: coupon moved under `promotion` ({ type: 'coupon', coupon }).
  mapPromo now reads p.promotion.coupon; create now passes
  promotion: { type: 'coupon', coupon: id } instead of a top-level coupon.
- Subscription.current_period_end removed (now per subscription item). Add
  getSubscriptionPeriodEnd() = max(items[].current_period_end) and use it in the
  customer.subscription.updated webhook handler.
- Update the promos route test fixture to the new promotion.coupon shape.

Verified: platform-service build (tsc) clean; promos (14) + stripe/subscriptions/
billing tests pass; full suite 1692/1692.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 04:00:13 -07:00
saravanakumardb1
0079274bff chore(deps): bump bcryptjs 2 -> 3 (+ @types/bcryptjs 3)
Verified: auth + platform-service build (tsc, default import still resolves);
packages/auth (31) + platform-service auth/api-key (65) pass (hash/compare work).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 03:46:19 -07:00
saravanakumardb1
66a4a74aa6 chore(deps): bump @fastify/cors 10 -> 11
Verified: fastify-core + platform-service build (tsc); createServiceApp boot smoke
handles a CORS preflight (OPTIONS -> 204 with access-control-allow-origin).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 03:44:54 -07:00
saravanakumardb1
808f615124 chore(deps): bump @azure/cosmos, jose, @typescript-eslint/parser
Applied fresh on current main (the matching dependabot branches were 350-430
commits behind and would have conflicted on the lockfile):
- @azure/cosmos 4.9.1 -> 4.9.3
- jose 6.1.3 -> 6.2.3 (mcp-server stays on the 5.x line: 5.9.6 -> 5.10.0)
- @typescript-eslint/parser 8.0 -> 8.60.0

Verified: full workspace build green; platform-service suite 1684 pass (only the
pre-existing single-fork migration-isolation flake, passes isolated); tracker-web
228 pass (only the pre-existing happy-dom product-context failures). No new
regressions. Major bumps (fastify/cors, happy-dom, lint-staged, stripe,
types/node) deferred for separate review.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 03:28:40 -07:00
saravanakumardb1
1503ef2e19 fix(fleet): Phase 3 hardening — budget authz, idempotent accrual, cycle detection, artifact
Re-applies 4 defects merged with Phase 3 (still present on main), ported from the
stale reference branch and adapted to current main.

FIX 1 (BLOCKER): all 5 budget routes (GET/burndown/PUT/pause/resume) read
productId from the URL with no caller check, so a caller for product A could
read or modify product B budget. Add requireOwnProduct() -> 403 on mismatch.

FIX 2: stop tracking services/platform-service/.data/platform-events.json (the
EVENT_BUS_FILE runtime log); gitignore .data/.

FIX 3: accrueSpend was definition-only (budgets never accrued) and not
idempotent. Add accruedRunIds to FleetBudgetDoc; accrueSpend(productId, costUsd,
runId) no-ops on a seen runId; wire it into patchJobFenced on stage shipped,
flag-gated + best-effort + idempotent via jobId:leaseEpoch. Accrue the run ACTUAL
insights.costUsd so spentUsd and costBurndown agree.

FIX 4: submitChildren cycle detection is now batch-aware — rejects duplicate
child keys and walks each child declared deps across BOTH the unpersisted batch
and existing jobs, rejecting child-self, child-parent and sibling cycles.

Tests: +2 budget-authz (403 on all 5 verbs), +3 accrual (idempotent runId, ship
accrues insights.costUsd once, flag-off no accrual), +5 cycle detection. Gates
green: tsc/build, fleet+items (204), full suite (only the unrelated single-fork
migration-isolation file flakes, passes isolated). Flags stay default-OFF.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 02:45:52 -07:00
saravanakumardb1
32e426d423 feat(fleet): run-insights reporting + ship action; complete the lifecycle
Make the Agent Gigafactory fully drivable end-to-end via the API:

- lease/release now accepts `insights` (model/tokens/cost) + `result`, recorded
  on the current run with endedAt — factories report cost/token metrics on
  completion (previously no API existed; runs stayed insights:{}).
- add `ship` operator action so a job in `testing` (where the review gate left
  no lease holder) can reach the terminal `shipped` stage. Idempotent.
- operatorAction now retries on optimistic-concurrency conflict with backoff
  (mirrors submitReview) so a ship right after approve survives real-Cosmos
  read-after-write lag instead of a spurious 409.

Tests: +2 coordinator (ship idempotent, release records insights) and +2 route
integration (gated submit->...->ship->metrics; release-with-insights). 170 pass.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 01:49:59 -07:00
saravanakumardb1
e0a904c7ea fix(cosmos): support composite indexes; add fleet_jobs (priorityOrder, createdAt)
Azure Cosmos cannot serve a multi-field ORDER BY without a matching composite
index (the local emulator is lenient, real Cosmos returns HTTP 400). The fleet
listJobs() query orders by (priorityOrder, createdAt), which broke
GET /api/fleet/metrics and /api/fleet/jobs on real Cosmos.

- ContainerConfig gains an optional `compositeIndexes` field
- container init applies it on create AND reconciles it onto existing
  containers (createIfNotExists never updates an existing index policy)
- fleet_jobs declares the (priorityOrder ASC, createdAt ASC) composite index

Verified live against Azure Cosmos: both endpoints now return 200.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 23:07:06 -07:00
Saravanakumar D
20bcbd9f19 docs(gigafactory): uppercase GIGAFACTORY folder + add index README
Rename docs/gigafactory/ to docs/GIGAFACTORY/ and update the cross-repo
source-of-truth references in the fleet README and types.ts comment. Add an
index README listing the platform docs and pointing to the canonical spec in
learning_ai_devops_tools. Docs/comment only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:22:21 -07:00
Saravanakumar D
fc000c4c20 docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move ROADMAP_COMPLETION_AUDIT.md, TASKS_TO_COMPLETE.md,
gigafactory-phase3-progress.md and FLEET_CONTROL_PLANE.md under
docs/gigafactory/ so the scattered Gigafactory docs are easy to discover.
Update intra-doc and cross-repo source-of-truth references (fleet README
and types.ts comment) to the new agent-queue/docs/gigafactory/ path.
Pure docs/comment move; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:07:20 -07:00
Saravanakumar D
4ac499f301 feat: add multi-reviewer human gate (review-policy routing)
Implements the §14 Phase 3 review gate. requestReview() routes a building
job into the review stage (fencing any worker), carrying a normalized policy
(requiredApprovals + reviewer allowlist) and clearing prior decisions.
submitReview() records one decision per reviewer (last-write-wins, identity-
normalized), advances the job to testing once distinct approvals reach the
quorum, and treats any reject as a veto that returns the job to queued for
rework. Adds POST /fleet/jobs/:id/review/request and POST /fleet/jobs/:id/review,
a typed client, and a review-gate card on the job-detail page.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:04:24 -07:00
Saravanakumar D
c9c2c174db feat: add fleet metrics + alerting (GET /fleet/metrics)
Adds coordinator.fleetMetrics() computing queue depth, stage histogram,
oldest-queued age (starvation signal), factory health and seat utilization,
plus derived alerts (no_live_capacity, all_factories_down, queue_starvation,
saturated, stale_factories). Exposed via GET /fleet/metrics and surfaced as a
metrics+alerts panel on the fleet overview. Thresholds injectable for tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:51:59 -07:00
Saravanakumar D
ea42602407 feat: add resumable SSE live event stream for fleet jobs
Backend: GET /fleet/jobs/:id/events/stream emits a snapshot (seq > Last-Event-ID)
then long-polls the append-only event log, closing after a bounded window so
EventSource-style clients reconnect cleanly. Honors Last-Event-ID resume,
keepalive comments, and a terminal error frame.

Frontend: subscribeJobEvents uses fetch streaming (to send auth + product
headers) with parseSseFrames, Last-Event-ID resume, reconnect backoff, and a
fatal-on-error-frame fallback to polling. Job detail page subscribes live
(deduped by seq), falls back to 4s polling on failure, and shows a Live badge;
refresh() now merges events so a slow snapshot can't clobber streamed ones.

Tests: +3 route (snapshot, resume cursor, append-after-connect), +5 client
(parseSseFrames x2, subscribe deliver/error/resume/error-frame). fleet 150, web 222.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:38:50 -07:00
Saravanakumar D
89860e39f9 feat: add fleet cost burndown chart
- coordinator.costBurndown() aggregates completed run cost (insights.costUsd)
  by UTC day over a window, returning a gap-free cumulative series + ceiling
- repository.listRunsByProduct() cross-partition run query
- GET /fleet/budgets/:productId/burndown?days=N route
- fleet-client.getBudgetBurndown() + CostBurndown/BurndownPoint types
- BurndownChart on the budget page: cumulative daily bars with a dashed
  ceiling overlay; bars turn red past the ceiling; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 147, web 216)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:25:27 -07:00
Saravanakumar D
2d5f9be642 feat: surface scoring explainability in fleet control plane
Adds 'why does this job route here?' to the §7 scheduler:
- coordinator.explainJob() re-runs scoreCandidate against every live factory,
  returning per-factory weighted breakdown, eligibility + reasons, deps state,
  and the best eligible factory (read-only, side-effect free)
- GET /fleet/jobs/:id/explain route (404 when job missing)
- fleet-client.getJobExplain() + JobExplain/ScoreBreakdown types
- ExplainPanel on the job detail page: score table per factory with the six
  weighted terms, eligibility, and unmet-deps note; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 144 green,
  tracker-web 214 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:21:14 -07:00
Saravanakumar D
c8ab43d3ae fix: harden operator action lease release + idempotent terminal actions
- release lease with fenced epoch (leaseEpoch+1, clear holder) so a stale
  renewal cannot resurrect a held lease after operator displacement
- reject on dead_letter / cancel on failed are now idempotent no-ops
  (no epoch bump, no duplicate event)
- add coordinator test for terminal idempotency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:15:26 -07:00
Saravanakumar D
283383561c feat: complete operator job actions (requeue/reject/cancel)
Adds lease-free operator lifecycle control to the fleet control plane:
- coordinator.operatorAction() fences any current factory holder by bumping
  leaseEpoch (mirrors the reaper), preserves checkpoint, and routes the job:
  requeue -> queued (or blocked if deps unmet), reject -> dead_letter,
  cancel -> failed. Shipped jobs are terminal (invalid_state).
- POST /fleet/jobs/:id/actions/:action route (400 on unknown action)
- fleet-client.operatorAction() + Requeue/Cancel/Reject buttons on job detail
- Tests: +5 coordinator, +1 routes, +2 fleet-client (platform fleet 140 green,
  tracker-web 212 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:12:11 -07:00
Saravanakumar D
b061cc47f3 feat(tracker-web): fleet control plane UI — overview, jobs, budget, detail pages (Phase 3 Slice 4)
- Fleet overview page with factory cards + recent jobs polling
- Job table with stage filter tabs
- Job detail page with events timeline, runs, artifacts, DAG subtree, SHIP action
- Budget page with usage bar, pause/resume controls
- API proxy route forwarding /api/fleet/* to platform-service
- Typed fleet-client.ts with graceful 404 degradation
- 16 unit tests for fleet-client (198 total tracker-web tests green)
- Added Fleet nav item to dashboard layout
- Full monorepo build + test green

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00
Saravanakumar D
f4ea7b4a5b feat(fleet): per-product budgets with pause/resume (Phase 3 Slice 3)
- FleetBudgetDoc: ceilingUsd, window, spentUsd, status (active/paused)
- FLEET_BUDGETS flag (default OFF = no enforcement, unchanged behavior)
- Enforcement in claimNextJob: paused or ceiling-exceeded → null
- accrueSpend(): monotonic spend accumulation, auto-pause at ceiling
- Budget routes: GET/PUT /fleet/budgets/:productId, pause, resume
- UpsertBudgetSchema for route validation
- 7 new coordinator tests (ceiling, auto-pause, manual pause/resume,
  flag OFF bypass, monotonic accounting, cross-product isolation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
Saravanakumar D
484ed05c1f feat(fleet): DAG job decomposition — parent/child + fan-out (Phase 3 Slice 2)
- SubmitJobSchema accepts inline children[] for atomic fan-out creation
- Parent blocked until all children reach dep-satisfying terminal stage
- patchJobFenced triggers maybeUnblockParent on child stage transitions
- submitChildren() for POST /fleet/jobs/:id/children (add children later)
- getDagSubtree() for GET /fleet/jobs/:id/dag (recursive subtree query)
- listChildrenByParent() repository helper
- SubmitChildrenSchema for route validation
- 8 new coordinator tests (fan-out, blocking, unblocking, cycle, DAG query)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
Saravanakumar D
4468a69526 feat(fleet): tunable scoring weights + preemption (Phase 3 Slice 1)
- Add FleetWeightRegistry + resolveWeights() for per-product/per-request
  weight tunability with defaults fallback (backward compatible)
- Add selectPreemptionVictim() pure function: only critical jobs may
  trigger, never evicts equal/higher priority, picks lowest-priority victim
- Wire preemption into coordinator behind FLEET_PREEMPTION flag (default OFF)
- Seat-limit enforcement: at seatLimit factories skip normal selection and
  attempt preemption of lower-priority running jobs for critical newcomers
- Eviction preserves checkpoint, bumps leaseEpoch (fences zombie), requeues
- 18 new tests (pure scheduler + coordinator integration)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
ec055f6948 fix(infra,cowork): remove broken Cosmos emulator; harden IPC bridge
docker-compose:
  - Drop the cosmos-emulator service block. Both image variants we
    tried were unfit for the prototype: `:vnext-preview` returned
    plain-text PGCosmosError strings that crashed @azure/cosmos at
    JSON.parse, and `:latest` core-dumped under load. The container
    has been Exited(255) for weeks and was blocking depends_on chains.
  - Real Azure Cosmos account `cosmos-mywisprai` (db `bytelyst`,
    West US 2) is now the single source of truth; all services pick
    up COSMOS_ENDPOINT/KEY/DATABASE from `.env` (already mounted via
    `env_file: .env`).
  - extraction-service: drop hardcoded `COSMOS_ENDPOINT=…cosmos-emulator…`,
    `NODE_TLS_REJECT_UNAUTHORIZED=0`, and `depends_on: cosmos-emulator`.
  - cowork-service: same cleanup.

cowork-service IPC bridge:
  - Add `error` listeners to the spawned child's stdin/stdout/stderr.
    Without them, an EPIPE on stdin (child died mid-write) or a
    teardown-time stream error surfaced as an unhandled error and
    crashed vitest after all 140 tests had passed.
  - Removes the only failing recursive test in the workspace.

Test status after this commit:
  - 94 workspace packages, all green
  - cowork-service: 19 passed | 1 skipped (140 tests)
  - platform-service: 131 test files passed
  - extraction-service: 13 test files passed
  - All other packages: passing

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 10:27:12 +00:00
saravanakumardb1
b2ce22d81c feat(platform-service): direct tracker -> fleet module wiring (§10 round-trip)
In-process tracker<->fleet bridge — no shell hop. Closes the §10 "direct
tracker->module calls" box.

- tracker-bridge.ts (new):
  * ingestItemAsJob(productId, itemId, opts?) — reads the Item via the items
    repository (foreign/unknown → NotFoundError), maps title/description → bodyMd
    (verbatim) + labels (engine-class:/profile:/priority:/cap:) → manifest hints,
    sets trackerItemId + a stable idempotency-key `tracker-<itemId>`, and submits
    through coordinator.submitJob — so re-ingest dedupes and the job is scheduled by
    the §7 router via the unchanged claim path.
  * echoJobToItem(productId, jobId, log?) — mirrors stage → Item status
    (queued/assigned/building/review/testing → in_progress; shipped → done;
    failed/dead_letter → wont_fix) + a metrics-ONLY comment (attempts/duration/
    tokens/cost — never the prompt body/secrets). Idempotent via the job's
    `trackerEchoedStatus`; best-effort + non-fatal (items-write failure →
    { echoed: null, error }, never thrown into the job lifecycle). productId-scoped.
- Auto-echo wired into the PATCH + lease/release transitions, GATED by
  FLEET_TRACKER_ECHO (default OFF → behavior byte-for-byte unchanged); never blocks
  or fails the transition.
- Routes (additive): POST /fleet/tracker/ingest, POST /fleet/tracker/echo
  (auth + getRequestProductId, productId-scoped).
- types.ts: optional FleetJobDoc.trackerEchoedStatus (reuses the existing
  trackerItemId field; no parallel schema) + Ingest/Echo request schemas.
- repository.ts: setTrackerEchoedStatus (no rev bump — never interferes with the
  fenced claim CAS).

Reuses the items + comments contracts directly (no HTTP). Does not touch
claimNextJob or the scheduler. productId on every doc; no any/console.log.
2026-05-30 01:32:12 -07:00
saravanakumardb1
4df134ec96 Merge: Phase 2 fleet factory enrollment + scoped rotatable tokens (§12) (#32) 2026-05-30 00:18:42 -07:00
Saravanakumar D
fb47d939a7 fix: resolve test failures across workspace
- packages/llm: use nullish coalescing (??) in GeminiProvider constructor
  so explicit empty-string apiKey is not overridden by env var
- dashboards/admin-web,tracker-web: exclude .next/ from vitest test glob
  to prevent Next.js internal test files from being picked up
- services/cowork-service: use platform-safe .kill() instead of SIGTERM
  which is invalid on Windows
- packages/use-keyboard-shortcuts: add @testing-library/react devDep
- scripts/npmrc.template: use https:// for Gitea registry

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 00:16:44 -07:00
saravanakumardb1
32328247ad test(platform-service): update fleet artifact tests for productId-scoped listing
listArtifactsByJob now requires productId; thread it through the existing
repository/artifacts test callers (signature update, assertions unchanged).
2026-05-30 00:06:10 -07:00
saravanakumardb1
e06b730161 feat(platform-service): fleet factory enrollment + scoped rotatable tokens (§12)
Adds factory enrollment + a scoped, rotatable credential model for the fleet
coordinator (trust boundary, §12/§18). Tokens are stored HASHED at rest (sha256 —
the same primitive the auth module uses for verify/magic-link tokens); the
high-entropy plaintext is returned exactly once at enroll/rotate and never persisted.

- enrollment.ts: enrollFactory (create/link factory + issue token), rotateToken
  (new active token; prior marked `rotating` with a grace overlap so an in-flight
  worker isn't cut off), revokeToken (immediate), verifyToken (constant-time hash
  compare; revoked/expired-grace → null; updates lastUsedAt). Scope = {productId,
  factoryId, capabilities[]}.
- Gated enforcement: enforceFactoryToken() on POST /fleet/factories/heartbeat and
  POST /fleet/claim, active only when FLEET_REQUIRE_FACTORY_TOKEN is on (default
  OFF — existing behavior/tests unchanged). When on: missing/invalid/revoked → 401;
  out-of-scope productId/capability/factory → 403; and the claim is CONSTRAINED to
  the verified token scope. Does not touch scheduler scoring or the claim CAS.
- types.ts: FleetFactoryTokenDoc + Enroll/Rotate/Revoke request schemas.
- repository.ts: fleet_factory_tokens collection + CRUD + findByHash.
- routes.ts (additive): POST /fleet/factories/enroll, /:id/token/rotate,
  /:id/token/revoke (user auth + productId + Zod).
- cosmos-init.ts: register fleet_factory_tokens (/productId).

Also hardens the artifact routes (review fixes): listArtifactsByJob is now
productId-scoped (GET /fleet/jobs/:id/artifacts threads the request productId), and
artifact upload uses the request/auth productId authoritatively (a spoofed
body.productId no longer overrides it).

Tokens hashed at rest; plaintext shown once; no new crypto schemes; productId on
every doc; no any/console.log; enforcement default OFF.
2026-05-30 00:05:52 -07:00
saravanakumardb1
b65e818f3d feat(platform-service): fleet artifacts + blob wiring (§13)
Artifact pointers in fleet_artifacts; large outputs in @bytelyst/blob (never
Cosmos). Routes: POST/GET /fleet/jobs/:id/artifacts, GET/DELETE
/fleet/artifacts/:id with short-lived SAS. 7 artifact tests.
2026-05-29 23:11:45 -07:00
saravanakumardb1
7930e8b0bd feat(platform-service): Phase 2 scheduler/router core (§7) + wire into atomic claim
Add a pure, fixed-weight scoring engine that decides WHICH queued job a claiming
factory gets, and wire it into coordinator.claimNextJob (the atomic rev-CAS claim
in tryClaimJob is unchanged).

scheduler.ts (pure, synchronous, no I/O):
- scoreCandidate(job, factory, ctx, weights?) -> { score, breakdown }
  score = w1*capabilityFit + w2*affinity + w3*(1/(1+load)) + w4*costFit(budget)
        + w5*health - w6*starvationPenalty(age); breakdown is per-weighted-term
        and sums to score (explainability / Phase-3 readiness).
- selectJob(candidates, factory, ctx, weights?) -> FleetJobDoc | null
  filters to stage-eligible + deps-satisfied (injected pure predicate) +
  capability-subset (+ down-health floor), ranks by score, deterministic
  tie-break: higher priority -> older createdAt -> lower cost class.
- Fixed default weights + bucketed anti-starvation aging (Phase 3 = tunable
  weights + preemption; intentionally NOT built here).

coordinator.ts (candidate-ranking section only):
- claimNextJob now resolves deps (store-backed) into a pure predicate, builds the
  factory view + authoritative now, and selects via selectJob; tryClaimJob CAS /
  lease / fence logic untouched. ClaimContext gains additive optional scheduler
  inputs (health/load/seatLimit/factoryEngines/warmScopes/costCeilingUsd). The
  pure capability-subset predicate moved into scheduler.ts and is re-exported.

Tests: scheduler.test.ts (16) covers capability hard-filter, priority/age
tie-breaks, load, health (+ down floor), starvation, cost fit, affinity, breakdown
sum, determinism, empty/no-eligible. coordinator.test.ts adds score-driven
selection, health floor, and ordered drain; all prior fleet tests stay green.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-29 23:03:40 -07:00
saravanakumardb1
33c1d8d5fa fix(platform-service): make fleet job claim truly atomic via datastore updateIfMatch
The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.

Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.

Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
  one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.

Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
2026-05-29 20:59:08 -07:00
saravanakumardb1
95dd7aa1d0 docs(platform-service): fleet module README (containers, claim/lease/fence protocol, reaper) 2026-05-29 20:20:54 -07:00
saravanakumardb1
8eb02c48aa feat(platform-service): fleet REST routes + module registration (P2 foundation)
Guarded REST under /api (auth + productId, like items): POST /fleet/jobs (idempotent
submit), GET /fleet/jobs (by stage/idempotencyKey), GET /fleet/jobs/:id, PATCH
/fleet/jobs/:id (fenced transition), POST /fleet/claim, lease renew/release,
factories/heartbeat, and runs/events streams. Every body validated with the Zod
schemas; fenced/conflict map to 409, missing to 404, invalid to 400. Registers
fleetRoutes in server.ts next to itemRoutes. Routes tested via Fastify inject on
the memory provider (real coordinator).
2026-05-29 20:20:46 -07:00
saravanakumardb1
8f51570da7 feat(platform-service): fleet coordinator — claim/lease/fence/heartbeat/reaper (P2 foundation)
The concurrency core (§4/§7/§8/§18/§25):
- claimNextJob: priority+age selection over queued/dep-satisfied jobs whose caps
  are a subset of the factory's, then tryClaimJob does a rev CAS to flip to
  assigned + acquire the lease — exactly one contender wins, no double-assignment.
- leases + fencing: acquire/reclaim bumps leaseEpoch; patchJobFenced/renew/release
  reject a call whose leaseEpoch < job.leaseEpoch (zombie worker can't overwrite).
- heartbeat + isFactoryStale for factory liveness.
- reapExpiredLeases: returns expired-lease jobs to queued/blocked, bumps the epoch
  (fencing the dead holder), preserves the checkpoint pointer (resume), marks the
  lease expired; idempotent. Documents why Cosmos TTL cannot do this.
- submit: idempotent (dedup/supersede/409) + submit-time dependency cycle
  detection; deps gating (shipped, or testing when depsMode:soft).

Tests drive the atomic-claim race, fencing, and reaper deterministically via the
rev CAS (no real threads).
2026-05-29 20:20:30 -07:00
saravanakumardb1
fada354df8 feat(platform-service): fleet repositories with rev compare-and-swap (P2 foundation)
One repository per fleet_* container on the @bytelyst/datastore abstraction
(memory + cosmos): create/getById/list (by productId, stage, idempotencyKey),
partition-aware single-partition queries, ordered append-only appendEvent, and
runs/leases/factories/profiles/artifacts CRUD. Adds revUpdateJob/revUpdateLease —
a `rev`-token compare-and-swap that writes only when the stored rev still matches
(the optimistic-concurrency primitive for atomic claim + fenced transitions;
maps to Cosmos _etag/If-Match in production).
2026-05-29 20:20:15 -07:00
saravanakumardb1
721d3fcb48 feat(platform-service): fleet data model + container registration (P2 foundation)
Adds the agent-gigafactory fleet data model (modules/fleet/types.ts): Zod schemas
as the source of truth with inferred types (no `any`) for the 7 durable containers
— FleetJobDoc, FleetRunDoc, FleetLeaseDoc, FleetFactoryDoc, FleetProfileDoc,
FleetEventDoc, FleetArtifactDoc — each carrying productId. Lifecycle stages mirror
the agent-queue gigafactory spec (queued|blocked|assigned|building|review|testing|
shipped|failed|dead_letter). Registers fleet_* containers with their partition keys
(/productId for jobs/factories/profiles, /jobId for runs/leases/events/artifacts).
2026-05-29 20:19:59 -07:00
fe8338c2c5 feat(monitoring): add VM Overview Grafana dashboard
12-panel dashboard auto-provisioned via /var/lib/grafana/dashboards:
  - 4 stat tiles (disk %, RAM avail, swap used, CPU steal) with
    threshold colouring matching vm-health-check.sh
  - 4 time-series (disk %, RAM trend, steal, sda write GB/hr) — 7d default
  - 2 bargauge top-10 by RAM and CPU (cAdvisor container_memory_working_set,
    container_cpu_usage)
  - Load average (1/5/15) + network throughput (RX/TX, host interfaces)

uid: vm-overview. Picked up on next Grafana boot.

Closes Phase 5: "Add Grafana" item from VM observability roadmap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 21:26:35 +00:00
saravanakumardb1
9d405952e2 fix(platform-service): TODO-4 \u2014 typed cast for request.auth augmentation
The DevOps admin preHandler read 'auth' as '(request as any).auth'.
The proper Fastify pattern is 'declare module' augmentation in
@bytelyst/fastify-auth, but the inline cast through 'unknown' is
sufficient for now and avoids touching the shared auth package.

Changed:
  - 'const auth = (request as any).auth;' \u2192
    'const auth = (request as unknown as { auth?: { role?: string } }).auth;'

Inline comment notes the cleaner 'declare module' alternative.

Final ecosystem state:
  scripts/check-rule-violations.sh: 0 findings across all rules \u2713

  web-hardcoded-hex:         0  \u2713
  b5-hardcoded-product-id:   0  \u2713
  b4-console-log:            0  \u2713
  b4-swift-print:            0  \u2713
  b4-python-print:           0  \u2713
  ts-any-type:               0  \u2713
  b7-emoji-in-code:          0  \u2713
2026-05-23 19:29:26 -07:00
saravanakumardb1
cde1a0b73c test(platform-service): align getAllProducts test with invttrdg fallback
CI run 67 surfaced a real test failure:

  src/modules/products/cache.test.ts:104
    getAllProducts > returns all cached products
    expected [ { id: 'lysnrai', …(11) }, …(2) ] to have a length of 2
    but got 3

Root cause: cache.ts has a TEMPORARY_FALLBACK_PRODUCTS map (currently
just 'invttrdg') that getAllProducts() merges into its return value
on top of the loaded cache. The test fixture loads 2 products
(lysnrai, mindlyst), so the actual return is 3 — the test was
written before the fallback shim landed and never got updated.

Two ways to reconcile: (a) make the test reflect today's behaviour,
or (b) gut the fallback. The cache.ts comment explicitly marks
the fallback as 'TODO(platform): remove after creating the real
product …', so the right move is (a): keep the shim in place and
make the test enforce the documented contract.

  - assertion now: toHaveLength(3) + .toContain('invttrdg')
  - inline comment ties the expectation back to cache.ts so a
    future cleanup removing the fallback will obviously need to
    drop it back to 2

Verified locally:
  pnpm vitest run cache.test.ts   -> 8/8 pass
2026-05-23 17:23:16 -07:00
saravanakumardb1
191b81756d fix(platform-service): resolve 3 TS errors in /devops/info handler
The platform-service build was failing with 3 unrelated TS errors,
surfaced while running the Gitea outdated-package detector earlier
in this session:

  src/server.ts(18,8):   Cannot find module '@bytelyst/devops/server'
  src/server.ts(318,61): Property 'cosmosEndpoint' does not exist on type 'ProductIdentity'
  src/server.ts(321,42): Property 'platformServiceUrl' does not exist on type 'ProductIdentity'

Root causes (two distinct bugs):

1. Stale install. '@bytelyst/devops' was already declared as
   'workspace:*' in services/platform-service/package.json (line 24),
   but node_modules/@bytelyst/devops/ did not exist. Re-running
   'pnpm install' at the workspace root materialised the symlink.

2. Variable shadowing. In the GET /devops/info handler the code
   declared a local 'const config' from loadProductIdentity() that
   shadowed the module-level 'config' (env vars) imported from
   './lib/config.js' at line 112. The author then tried to read
   'config.cosmosEndpoint' and 'config.platformServiceUrl' off the
   ProductIdentity, where those keys never exist:

     ProductIdentity = {
       productId, displayName, licensePrefix, configDirName,
       envVarPrefix, bundleIdSuffix, packageName
     }

   The intended values live on the env config:
     config.COSMOS_ENDPOINT  (Zod-validated, required at boot)
     config.HOST + config.PORT (defaults '0.0.0.0' / 4003)

   There is no 'platformServiceUrl' field anywhere in the codebase —
   it only appeared in this single buggy line. Reconstructed as
   '\${HOST}:\${PORT}' which is the URL admins would use to reach
   this service for the devops/info diagnostic dashboard.

Fix (services/platform-service/src/server.ts:310-339):
  - Rename local 'const config' to 'const productIdentity' to break
    the shadowing.
  - Use productIdentity.productId for the devops productId field.
  - Use config.COSMOS_ENDPOINT (the env config) for the cosmos
    dependency health check URL.
  - Use `http://${config.HOST}:${config.PORT}` for the extra
    platformServiceUrl field.
  - Add a doc comment block explaining the two-config distinction
    so future contributors don't reintroduce the shadow.

Verified:
  pnpm --filter @lysnrai/platform-service build    OK (0 errors)
  pnpm --filter @lysnrai/platform-service test     1511/1512 pass

The 1 remaining failure (src/modules/products/cache.test.ts line 104,
'returns all cached products' expects 2 products but got 3) is a
PRE-EXISTING product-registry test drift on main, verified by
stashing this commit's changes and re-running the same test against
the unmodified tree. It will be addressed separately.
2026-05-23 12:46:10 -07:00