Commit Graph

1589 Commits

Author SHA1 Message Date
Saravanakumar D
932951dbaf docs: update roadmap audit to reflect completed Phase 3 slices
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:05:33 -07:00
Saravanakumar D
4ac499f301 feat: add multi-reviewer human gate (review-policy routing)
Implements the §14 Phase 3 review gate. requestReview() routes a building
job into the review stage (fencing any worker), carrying a normalized policy
(requiredApprovals + reviewer allowlist) and clearing prior decisions.
submitReview() records one decision per reviewer (last-write-wins, identity-
normalized), advances the job to testing once distinct approvals reach the
quorum, and treats any reject as a veto that returns the job to queued for
rework. Adds POST /fleet/jobs/:id/review/request and POST /fleet/jobs/:id/review,
a typed client, and a review-gate card on the job-detail page.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:04:24 -07:00
Saravanakumar D
c9c2c174db feat: add fleet metrics + alerting (GET /fleet/metrics)
Adds coordinator.fleetMetrics() computing queue depth, stage histogram,
oldest-queued age (starvation signal), factory health and seat utilization,
plus derived alerts (no_live_capacity, all_factories_down, queue_starvation,
saturated, stale_factories). Exposed via GET /fleet/metrics and surfaced as a
metrics+alerts panel on the fleet overview. Thresholds injectable for tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:51:59 -07:00
Saravanakumar D
d780739cbe test: add fleet control-plane Playwright e2e coverage
New e2e/fleet.spec.ts with a method- and URL-aware /api/fleet/** mock that
holds mutable state so operator actions and budget toggles reflect in
follow-up GETs. Covers: fleet overview (factory cards + recent jobs), jobs
table + stage filter, job detail requeue (stage building->queued) with the
SSE-driven Live badge, and budget pause/resume. All 4 specs green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:44:53 -07:00
Saravanakumar D
ea42602407 feat: add resumable SSE live event stream for fleet jobs
Backend: GET /fleet/jobs/:id/events/stream emits a snapshot (seq > Last-Event-ID)
then long-polls the append-only event log, closing after a bounded window so
EventSource-style clients reconnect cleanly. Honors Last-Event-ID resume,
keepalive comments, and a terminal error frame.

Frontend: subscribeJobEvents uses fetch streaming (to send auth + product
headers) with parseSseFrames, Last-Event-ID resume, reconnect backoff, and a
fatal-on-error-frame fallback to polling. Job detail page subscribes live
(deduped by seq), falls back to 4s polling on failure, and shows a Live badge;
refresh() now merges events so a slow snapshot can't clobber streamed ones.

Tests: +3 route (snapshot, resume cursor, append-after-connect), +5 client
(parseSseFrames x2, subscribe deliver/error/resume/error-frame). fleet 150, web 222.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:38:50 -07:00
Saravanakumar D
1ae15a7755 docs: mark cost burndown complete
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:25:46 -07:00
Saravanakumar D
89860e39f9 feat: add fleet cost burndown chart
- coordinator.costBurndown() aggregates completed run cost (insights.costUsd)
  by UTC day over a window, returning a gap-free cumulative series + ceiling
- repository.listRunsByProduct() cross-partition run query
- GET /fleet/budgets/:productId/burndown?days=N route
- fleet-client.getBudgetBurndown() + CostBurndown/BurndownPoint types
- BurndownChart on the budget page: cumulative daily bars with a dashed
  ceiling overlay; bars turn red past the ceiling; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 147, web 216)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:25:27 -07:00
Saravanakumar D
3f850b7b6f docs: mark scoring explainability complete
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:21:35 -07:00
Saravanakumar D
2d5f9be642 feat: surface scoring explainability in fleet control plane
Adds 'why does this job route here?' to the §7 scheduler:
- coordinator.explainJob() re-runs scoreCandidate against every live factory,
  returning per-factory weighted breakdown, eligibility + reasons, deps state,
  and the best eligible factory (read-only, side-effect free)
- GET /fleet/jobs/:id/explain route (404 when job missing)
- fleet-client.getJobExplain() + JobExplain/ScoreBreakdown types
- ExplainPanel on the job detail page: score table per factory with the six
  weighted terms, eligibility, and unmet-deps note; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 144 green,
  tracker-web 214 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:21:14 -07:00
Saravanakumar D
69f553d432 docs: mark operator job actions complete in TASKS_TO_COMPLETE
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:15:53 -07:00
Saravanakumar D
c8ab43d3ae fix: harden operator action lease release + idempotent terminal actions
- release lease with fenced epoch (leaseEpoch+1, clear holder) so a stale
  renewal cannot resurrect a held lease after operator displacement
- reject on dead_letter / cancel on failed are now idempotent no-ops
  (no epoch bump, no duplicate event)
- add coordinator test for terminal idempotency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:15:26 -07:00
Saravanakumar D
283383561c feat: complete operator job actions (requeue/reject/cancel)
Adds lease-free operator lifecycle control to the fleet control plane:
- coordinator.operatorAction() fences any current factory holder by bumping
  leaseEpoch (mirrors the reaper), preserves checkpoint, and routes the job:
  requeue -> queued (or blocked if deps unmet), reject -> dead_letter,
  cancel -> failed. Shipped jobs are terminal (invalid_state).
- POST /fleet/jobs/:id/actions/:action route (400 on unknown action)
- fleet-client.operatorAction() + Requeue/Cancel/Reject buttons on job detail
- Tests: +5 coordinator, +1 routes, +2 fleet-client (platform fleet 140 green,
  tracker-web 212 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:12:11 -07:00
Saravanakumar D
0f903b935a audit: document current Gigafactory completion state
- ROADMAP_COMPLETION_AUDIT.md: verified state vs GIGAFACTORY_ROADMAP source of truth
- TASKS_TO_COMPLETE.md: prioritized remaining work with acceptance criteria
- Key finding: roadmap §0 tracker is stale (P2 ~95%, P3 ~70% actual vs 80%/0% claimed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:06:33 -07:00
4777b28698 feat(dashboards): add ops cockpit and execution pipeline
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 14s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 42s
2026-05-30 23:12:06 +00:00
87acb8e414 test(admin-web): stabilize blocking e2e suite
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 13s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 44s
2026-05-30 22:20:14 +00:00
7465b21d91 style(admin-web): format dashboard sources
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 13s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 53s
2026-05-30 21:18:09 +00:00
20e1ac0e67 feat(admin-web): sync product context changes
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 13s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 52s
2026-05-30 21:04:04 +00:00
f77797881b feat(admin-web): add dashboard retry state
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 11s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 42s
2026-05-30 21:02:21 +00:00
e69ffadb4b feat(tracker-web): sync admin product context
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 13s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 48s
2026-05-30 20:27:27 +00:00
c1a88a39e2 feat(tracker-web): add dashboard stats retry state
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 13s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 44s
2026-05-30 20:23:14 +00:00
6c49296d40 test(tracker-web): stabilize dashboard e2e assertion
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 13s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 52s
2026-05-30 20:00:06 +00:00
359d99b67f feat(tracker-web): expose webhook ingestion proxy
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 12s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 47s
2026-05-30 19:56:28 +00:00
6e023f3bdc feat(tracker-web): expose agent v1 proxy
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 14s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 42s
2026-05-30 19:54:56 +00:00
ccfbfd194a feat(tracker-web): add admin settings page
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 11s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 38s
2026-05-30 19:53:16 +00:00
3a6ed3a5f8 feat(tracker-web): add public submission status page
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 12s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 42s
2026-05-30 19:50:15 +00:00
5606ccf1f7 test(tracker-web): cover platform health failures
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 17s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 49s
2026-05-30 19:38:49 +00:00
67104b8ebc test(tracker-web): isolate e2e dev server
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 12s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 48s
2026-05-30 19:36:07 +00:00
e25969d5dc fix(tracker-web): probe platform health
Some checks failed
Publish @bytelyst/* packages / publish (push) Failing after 40s
CI — Common Platform / Build, Test & Typecheck (push) Successful in 1m22s
Size limit / size-limit (push) Failing after 1m8s
2026-05-30 19:26:44 +00:00
Saravanakumar D
325dfcae8e docs(fleet): Phase 3 operational guide + progress report (Slice 5)
- Created docs/FLEET_CONTROL_PLANE.md — full operational guide covering:
  - Feature flags (FLEET_PREEMPTION, FLEET_BUDGETS)
  - Tunable scoring weights + resolution order
  - Preemption rules and behavior
  - DAG job decomposition API
  - Per-product budgets with auto-pause
  - Fleet Control Plane UI pages and configuration
  - API reference summary
  - Architecture decisions
- Updated docs/gigafactory-phase3-progress.md — all 5 slices DONE with commit SHAs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00
Saravanakumar D
b061cc47f3 feat(tracker-web): fleet control plane UI — overview, jobs, budget, detail pages (Phase 3 Slice 4)
- Fleet overview page with factory cards + recent jobs polling
- Job table with stage filter tabs
- Job detail page with events timeline, runs, artifacts, DAG subtree, SHIP action
- Budget page with usage bar, pause/resume controls
- API proxy route forwarding /api/fleet/* to platform-service
- Typed fleet-client.ts with graceful 404 degradation
- 16 unit tests for fleet-client (198 total tracker-web tests green)
- Added Fleet nav item to dashboard layout
- Full monorepo build + test green

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00
Saravanakumar D
f4ea7b4a5b feat(fleet): per-product budgets with pause/resume (Phase 3 Slice 3)
- FleetBudgetDoc: ceilingUsd, window, spentUsd, status (active/paused)
- FLEET_BUDGETS flag (default OFF = no enforcement, unchanged behavior)
- Enforcement in claimNextJob: paused or ceiling-exceeded → null
- accrueSpend(): monotonic spend accumulation, auto-pause at ceiling
- Budget routes: GET/PUT /fleet/budgets/:productId, pause, resume
- UpsertBudgetSchema for route validation
- 7 new coordinator tests (ceiling, auto-pause, manual pause/resume,
  flag OFF bypass, monotonic accounting, cross-product isolation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
Saravanakumar D
484ed05c1f feat(fleet): DAG job decomposition — parent/child + fan-out (Phase 3 Slice 2)
- SubmitJobSchema accepts inline children[] for atomic fan-out creation
- Parent blocked until all children reach dep-satisfying terminal stage
- patchJobFenced triggers maybeUnblockParent on child stage transitions
- submitChildren() for POST /fleet/jobs/:id/children (add children later)
- getDagSubtree() for GET /fleet/jobs/:id/dag (recursive subtree query)
- listChildrenByParent() repository helper
- SubmitChildrenSchema for route validation
- 8 new coordinator tests (fan-out, blocking, unblocking, cycle, DAG query)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
Saravanakumar D
4468a69526 feat(fleet): tunable scoring weights + preemption (Phase 3 Slice 1)
- Add FleetWeightRegistry + resolveWeights() for per-product/per-request
  weight tunability with defaults fallback (backward compatible)
- Add selectPreemptionVictim() pure function: only critical jobs may
  trigger, never evicts equal/higher priority, picks lowest-priority victim
- Wire preemption into coordinator behind FLEET_PREEMPTION flag (default OFF)
- Seat-limit enforcement: at seatLimit factories skip normal selection and
  attempt preemption of lower-priority running jobs for critical newcomers
- Eviction preserves checkpoint, bumps leaseEpoch (fences zombie), requeues
- 18 new tests (pure scheduler + coordinator integration)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
c0db29014b fix(infra): bind caddy to public eth0 IP only
Caddy was binding 0.0.0.0:443, which prevented tailscaled from claiming
100.87.53.10:443 for `tailscale serve --https=443`. Restricting Caddy to
the public eth0 IP (187.124.159.82) keeps the public api.bytelyst.com /
devops.bytelyst.com routing intact while freeing the Tailscale IP so the
tailnet-only dashboard URL (https://srv1491630.tailf85608.ts.net) is
reachable again.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 16:37:09 +00:00
ec055f6948 fix(infra,cowork): remove broken Cosmos emulator; harden IPC bridge
docker-compose:
  - Drop the cosmos-emulator service block. Both image variants we
    tried were unfit for the prototype: `:vnext-preview` returned
    plain-text PGCosmosError strings that crashed @azure/cosmos at
    JSON.parse, and `:latest` core-dumped under load. The container
    has been Exited(255) for weeks and was blocking depends_on chains.
  - Real Azure Cosmos account `cosmos-mywisprai` (db `bytelyst`,
    West US 2) is now the single source of truth; all services pick
    up COSMOS_ENDPOINT/KEY/DATABASE from `.env` (already mounted via
    `env_file: .env`).
  - extraction-service: drop hardcoded `COSMOS_ENDPOINT=…cosmos-emulator…`,
    `NODE_TLS_REJECT_UNAUTHORIZED=0`, and `depends_on: cosmos-emulator`.
  - cowork-service: same cleanup.

cowork-service IPC bridge:
  - Add `error` listeners to the spawned child's stdin/stdout/stderr.
    Without them, an EPIPE on stdin (child died mid-write) or a
    teardown-time stream error surfaced as an unhandled error and
    crashed vitest after all 140 tests had passed.
  - Removes the only failing recursive test in the workspace.

Test status after this commit:
  - 94 workspace packages, all green
  - cowork-service: 19 passed | 1 skipped (140 tests)
  - platform-service: 131 test files passed
  - extraction-service: 13 test files passed
  - All other packages: passing

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 10:27:12 +00:00
72fa2d297f fix(infra): switch cosmos-emulator from vnext-preview to stable :latest
The vnext-preview (Postgres-backed) image returned PGCosmosError plaintext
for cross-partition queryFeed calls, crashing @azure/cosmos at JSON.parse.
:latest is HTTPS-only with a self-signed cert, so consumers are gated by
NODE_TLS_REJECT_UNAUTHORIZED=0 (dev-prototype only). platform-service now
points at the real Azure Cosmos account (per .env), so its dependency on
the local emulator service is removed.
2026-05-30 09:59:21 +00:00
saravanakumardb1
a8538db774 Merge: Phase 2 direct tracker -> fleet module wiring (§10) (#tracker) 2026-05-30 01:52:26 -07:00
saravanakumardb1
b2ce22d81c feat(platform-service): direct tracker -> fleet module wiring (§10 round-trip)
In-process tracker<->fleet bridge — no shell hop. Closes the §10 "direct
tracker->module calls" box.

- tracker-bridge.ts (new):
  * ingestItemAsJob(productId, itemId, opts?) — reads the Item via the items
    repository (foreign/unknown → NotFoundError), maps title/description → bodyMd
    (verbatim) + labels (engine-class:/profile:/priority:/cap:) → manifest hints,
    sets trackerItemId + a stable idempotency-key `tracker-<itemId>`, and submits
    through coordinator.submitJob — so re-ingest dedupes and the job is scheduled by
    the §7 router via the unchanged claim path.
  * echoJobToItem(productId, jobId, log?) — mirrors stage → Item status
    (queued/assigned/building/review/testing → in_progress; shipped → done;
    failed/dead_letter → wont_fix) + a metrics-ONLY comment (attempts/duration/
    tokens/cost — never the prompt body/secrets). Idempotent via the job's
    `trackerEchoedStatus`; best-effort + non-fatal (items-write failure →
    { echoed: null, error }, never thrown into the job lifecycle). productId-scoped.
- Auto-echo wired into the PATCH + lease/release transitions, GATED by
  FLEET_TRACKER_ECHO (default OFF → behavior byte-for-byte unchanged); never blocks
  or fails the transition.
- Routes (additive): POST /fleet/tracker/ingest, POST /fleet/tracker/echo
  (auth + getRequestProductId, productId-scoped).
- types.ts: optional FleetJobDoc.trackerEchoedStatus (reuses the existing
  trackerItemId field; no parallel schema) + Ingest/Echo request schemas.
- repository.ts: setTrackerEchoedStatus (no rev bump — never interferes with the
  fenced claim CAS).

Reuses the items + comments contracts directly (no HTTP). Does not touch
claimNextJob or the scheduler. productId on every doc; no any/console.log.
2026-05-30 01:32:12 -07:00
Saravanakumar D
07c0d304bc fix(use-keyboard-shortcuts): guard against undefined e.key and shortcut.key
Some keyboard events (dead keys, modifier-only presses) have e.key
as undefined. Similarly, malformed shortcut definitions may lack a key.
Added early-return guards to prevent TypeError.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 01:02:15 -07:00
saravanakumardb1
4df134ec96 Merge: Phase 2 fleet factory enrollment + scoped rotatable tokens (§12) (#32) 2026-05-30 00:18:42 -07:00
Saravanakumar D
fb47d939a7 fix: resolve test failures across workspace
- packages/llm: use nullish coalescing (??) in GeminiProvider constructor
  so explicit empty-string apiKey is not overridden by env var
- dashboards/admin-web,tracker-web: exclude .next/ from vitest test glob
  to prevent Next.js internal test files from being picked up
- services/cowork-service: use platform-safe .kill() instead of SIGTERM
  which is invalid on Windows
- packages/use-keyboard-shortcuts: add @testing-library/react devDep
- scripts/npmrc.template: use https:// for Gitea registry

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 00:16:44 -07:00
saravanakumardb1
32328247ad test(platform-service): update fleet artifact tests for productId-scoped listing
listArtifactsByJob now requires productId; thread it through the existing
repository/artifacts test callers (signature update, assertions unchanged).
2026-05-30 00:06:10 -07:00
saravanakumardb1
e06b730161 feat(platform-service): fleet factory enrollment + scoped rotatable tokens (§12)
Adds factory enrollment + a scoped, rotatable credential model for the fleet
coordinator (trust boundary, §12/§18). Tokens are stored HASHED at rest (sha256 —
the same primitive the auth module uses for verify/magic-link tokens); the
high-entropy plaintext is returned exactly once at enroll/rotate and never persisted.

- enrollment.ts: enrollFactory (create/link factory + issue token), rotateToken
  (new active token; prior marked `rotating` with a grace overlap so an in-flight
  worker isn't cut off), revokeToken (immediate), verifyToken (constant-time hash
  compare; revoked/expired-grace → null; updates lastUsedAt). Scope = {productId,
  factoryId, capabilities[]}.
- Gated enforcement: enforceFactoryToken() on POST /fleet/factories/heartbeat and
  POST /fleet/claim, active only when FLEET_REQUIRE_FACTORY_TOKEN is on (default
  OFF — existing behavior/tests unchanged). When on: missing/invalid/revoked → 401;
  out-of-scope productId/capability/factory → 403; and the claim is CONSTRAINED to
  the verified token scope. Does not touch scheduler scoring or the claim CAS.
- types.ts: FleetFactoryTokenDoc + Enroll/Rotate/Revoke request schemas.
- repository.ts: fleet_factory_tokens collection + CRUD + findByHash.
- routes.ts (additive): POST /fleet/factories/enroll, /:id/token/rotate,
  /:id/token/revoke (user auth + productId + Zod).
- cosmos-init.ts: register fleet_factory_tokens (/productId).

Also hardens the artifact routes (review fixes): listArtifactsByJob is now
productId-scoped (GET /fleet/jobs/:id/artifacts threads the request productId), and
artifact upload uses the request/auth productId authoritatively (a spoofed
body.productId no longer overrides it).

Tokens hashed at rest; plaintext shown once; no new crypto schemes; productId on
every doc; no any/console.log; enforcement default OFF.
2026-05-30 00:05:52 -07:00
Saravanakumar D
e83ab9907e fix(config): remove :3300 port from Gitea npm registry URLs
The Gitea instance runs on default port 80, not 3300. Updated the
npmrc template and AGENTS.md references accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 23:29:07 -07:00
Saravanakumar D
da98c9fd3f docs(config): add GITEA_NPM_TOKEN to .env.example
Include Gitea npm registry variables (token, host, owner) so
developers know which env vars to set for @bytelyst package access.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 23:15:43 -07:00
saravanakumardb1
4f9ec3332f Merge: Phase 2 scheduler/router core (§7) + fleet artifacts/blob (§13)
Combined parallel work (file-disjoint). Scheduler: deterministic scoring router
wired into atomic claim. Artifacts: blob-backed pointers in fleet_artifacts.
fleet suite 79/79; platform-service 1591/1591; build clean.
2026-05-29 23:12:33 -07:00
saravanakumardb1
b65e818f3d feat(platform-service): fleet artifacts + blob wiring (§13)
Artifact pointers in fleet_artifacts; large outputs in @bytelyst/blob (never
Cosmos). Routes: POST/GET /fleet/jobs/:id/artifacts, GET/DELETE
/fleet/artifacts/:id with short-lived SAS. 7 artifact tests.
2026-05-29 23:11:45 -07:00
saravanakumardb1
7930e8b0bd feat(platform-service): Phase 2 scheduler/router core (§7) + wire into atomic claim
Add a pure, fixed-weight scoring engine that decides WHICH queued job a claiming
factory gets, and wire it into coordinator.claimNextJob (the atomic rev-CAS claim
in tryClaimJob is unchanged).

scheduler.ts (pure, synchronous, no I/O):
- scoreCandidate(job, factory, ctx, weights?) -> { score, breakdown }
  score = w1*capabilityFit + w2*affinity + w3*(1/(1+load)) + w4*costFit(budget)
        + w5*health - w6*starvationPenalty(age); breakdown is per-weighted-term
        and sums to score (explainability / Phase-3 readiness).
- selectJob(candidates, factory, ctx, weights?) -> FleetJobDoc | null
  filters to stage-eligible + deps-satisfied (injected pure predicate) +
  capability-subset (+ down-health floor), ranks by score, deterministic
  tie-break: higher priority -> older createdAt -> lower cost class.
- Fixed default weights + bucketed anti-starvation aging (Phase 3 = tunable
  weights + preemption; intentionally NOT built here).

coordinator.ts (candidate-ranking section only):
- claimNextJob now resolves deps (store-backed) into a pure predicate, builds the
  factory view + authoritative now, and selects via selectJob; tryClaimJob CAS /
  lease / fence logic untouched. ClaimContext gains additive optional scheduler
  inputs (health/load/seatLimit/factoryEngines/warmScopes/costCeilingUsd). The
  pure capability-subset predicate moved into scheduler.ts and is re-exported.

Tests: scheduler.test.ts (16) covers capability hard-filter, priority/age
tie-breaks, load, health (+ down floor), starvation, cost fit, affinity, breakdown
sum, determinism, empty/no-eligible. coordinator.test.ts adds score-driven
selection, health floor, and ordered drain; all prior fleet tests stay green.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-29 23:03:40 -07:00
saravanakumardb1
6f6e005114 Merge PR #29: Phase 2 atomic-claim hardening — true single-winner claim
Adds @bytelyst/datastore updateIfMatch (Cosmos If-Match/_etag + process-atomic
memory impl); rewires fleet revUpdate* to conditional writes. Concurrent claim
proven via Promise.all + N-claimer stress (fails on old read-check-write, passes
now). datastore 48 + fleet 53 green; builds clean; no consumer regressed.
2026-05-29 21:04:33 -07:00
saravanakumardb1
33c1d8d5fa fix(platform-service): make fleet job claim truly atomic via datastore updateIfMatch
The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.

Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.

Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
  one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.

Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
2026-05-29 20:59:08 -07:00