Commit Graph

331 Commits

Author SHA1 Message Date
saravanakumardb1
1503ef2e19 fix(fleet): Phase 3 hardening — budget authz, idempotent accrual, cycle detection, artifact
Re-applies 4 defects merged with Phase 3 (still present on main), ported from the
stale reference branch and adapted to current main.

FIX 1 (BLOCKER): all 5 budget routes (GET/burndown/PUT/pause/resume) read
productId from the URL with no caller check, so a caller for product A could
read or modify product B budget. Add requireOwnProduct() -> 403 on mismatch.

FIX 2: stop tracking services/platform-service/.data/platform-events.json (the
EVENT_BUS_FILE runtime log); gitignore .data/.

FIX 3: accrueSpend was definition-only (budgets never accrued) and not
idempotent. Add accruedRunIds to FleetBudgetDoc; accrueSpend(productId, costUsd,
runId) no-ops on a seen runId; wire it into patchJobFenced on stage shipped,
flag-gated + best-effort + idempotent via jobId:leaseEpoch. Accrue the run ACTUAL
insights.costUsd so spentUsd and costBurndown agree.

FIX 4: submitChildren cycle detection is now batch-aware — rejects duplicate
child keys and walks each child declared deps across BOTH the unpersisted batch
and existing jobs, rejecting child-self, child-parent and sibling cycles.

Tests: +2 budget-authz (403 on all 5 verbs), +3 accrual (idempotent runId, ship
accrues insights.costUsd once, flag-off no accrual), +5 cycle detection. Gates
green: tsc/build, fleet+items (204), full suite (only the unrelated single-fork
migration-isolation file flakes, passes isolated). Flags stay default-OFF.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 02:45:52 -07:00
saravanakumardb1
32e426d423 feat(fleet): run-insights reporting + ship action; complete the lifecycle
Make the Agent Gigafactory fully drivable end-to-end via the API:

- lease/release now accepts `insights` (model/tokens/cost) + `result`, recorded
  on the current run with endedAt — factories report cost/token metrics on
  completion (previously no API existed; runs stayed insights:{}).
- add `ship` operator action so a job in `testing` (where the review gate left
  no lease holder) can reach the terminal `shipped` stage. Idempotent.
- operatorAction now retries on optimistic-concurrency conflict with backoff
  (mirrors submitReview) so a ship right after approve survives real-Cosmos
  read-after-write lag instead of a spurious 409.

Tests: +2 coordinator (ship idempotent, release records insights) and +2 route
integration (gated submit->...->ship->metrics; release-with-insights). 170 pass.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 01:49:59 -07:00
saravanakumardb1
e0a904c7ea fix(cosmos): support composite indexes; add fleet_jobs (priorityOrder, createdAt)
Azure Cosmos cannot serve a multi-field ORDER BY without a matching composite
index (the local emulator is lenient, real Cosmos returns HTTP 400). The fleet
listJobs() query orders by (priorityOrder, createdAt), which broke
GET /api/fleet/metrics and /api/fleet/jobs on real Cosmos.

- ContainerConfig gains an optional `compositeIndexes` field
- container init applies it on create AND reconciles it onto existing
  containers (createIfNotExists never updates an existing index policy)
- fleet_jobs declares the (priorityOrder ASC, createdAt ASC) composite index

Verified live against Azure Cosmos: both endpoints now return 200.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 23:07:06 -07:00
Saravanakumar D
20bcbd9f19 docs(gigafactory): uppercase GIGAFACTORY folder + add index README
Rename docs/gigafactory/ to docs/GIGAFACTORY/ and update the cross-repo
source-of-truth references in the fleet README and types.ts comment. Add an
index README listing the platform docs and pointing to the canonical spec in
learning_ai_devops_tools. Docs/comment only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:22:21 -07:00
Saravanakumar D
fc000c4c20 docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move ROADMAP_COMPLETION_AUDIT.md, TASKS_TO_COMPLETE.md,
gigafactory-phase3-progress.md and FLEET_CONTROL_PLANE.md under
docs/gigafactory/ so the scattered Gigafactory docs are easy to discover.
Update intra-doc and cross-repo source-of-truth references (fleet README
and types.ts comment) to the new agent-queue/docs/gigafactory/ path.
Pure docs/comment move; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:07:20 -07:00
Saravanakumar D
4ac499f301 feat: add multi-reviewer human gate (review-policy routing)
Implements the §14 Phase 3 review gate. requestReview() routes a building
job into the review stage (fencing any worker), carrying a normalized policy
(requiredApprovals + reviewer allowlist) and clearing prior decisions.
submitReview() records one decision per reviewer (last-write-wins, identity-
normalized), advances the job to testing once distinct approvals reach the
quorum, and treats any reject as a veto that returns the job to queued for
rework. Adds POST /fleet/jobs/:id/review/request and POST /fleet/jobs/:id/review,
a typed client, and a review-gate card on the job-detail page.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:04:24 -07:00
Saravanakumar D
c9c2c174db feat: add fleet metrics + alerting (GET /fleet/metrics)
Adds coordinator.fleetMetrics() computing queue depth, stage histogram,
oldest-queued age (starvation signal), factory health and seat utilization,
plus derived alerts (no_live_capacity, all_factories_down, queue_starvation,
saturated, stale_factories). Exposed via GET /fleet/metrics and surfaced as a
metrics+alerts panel on the fleet overview. Thresholds injectable for tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:51:59 -07:00
Saravanakumar D
ea42602407 feat: add resumable SSE live event stream for fleet jobs
Backend: GET /fleet/jobs/:id/events/stream emits a snapshot (seq > Last-Event-ID)
then long-polls the append-only event log, closing after a bounded window so
EventSource-style clients reconnect cleanly. Honors Last-Event-ID resume,
keepalive comments, and a terminal error frame.

Frontend: subscribeJobEvents uses fetch streaming (to send auth + product
headers) with parseSseFrames, Last-Event-ID resume, reconnect backoff, and a
fatal-on-error-frame fallback to polling. Job detail page subscribes live
(deduped by seq), falls back to 4s polling on failure, and shows a Live badge;
refresh() now merges events so a slow snapshot can't clobber streamed ones.

Tests: +3 route (snapshot, resume cursor, append-after-connect), +5 client
(parseSseFrames x2, subscribe deliver/error/resume/error-frame). fleet 150, web 222.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:38:50 -07:00
Saravanakumar D
89860e39f9 feat: add fleet cost burndown chart
- coordinator.costBurndown() aggregates completed run cost (insights.costUsd)
  by UTC day over a window, returning a gap-free cumulative series + ceiling
- repository.listRunsByProduct() cross-partition run query
- GET /fleet/budgets/:productId/burndown?days=N route
- fleet-client.getBudgetBurndown() + CostBurndown/BurndownPoint types
- BurndownChart on the budget page: cumulative daily bars with a dashed
  ceiling overlay; bars turn red past the ceiling; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 147, web 216)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:25:27 -07:00
Saravanakumar D
2d5f9be642 feat: surface scoring explainability in fleet control plane
Adds 'why does this job route here?' to the §7 scheduler:
- coordinator.explainJob() re-runs scoreCandidate against every live factory,
  returning per-factory weighted breakdown, eligibility + reasons, deps state,
  and the best eligible factory (read-only, side-effect free)
- GET /fleet/jobs/:id/explain route (404 when job missing)
- fleet-client.getJobExplain() + JobExplain/ScoreBreakdown types
- ExplainPanel on the job detail page: score table per factory with the six
  weighted terms, eligibility, and unmet-deps note; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 144 green,
  tracker-web 214 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:21:14 -07:00
Saravanakumar D
c8ab43d3ae fix: harden operator action lease release + idempotent terminal actions
- release lease with fenced epoch (leaseEpoch+1, clear holder) so a stale
  renewal cannot resurrect a held lease after operator displacement
- reject on dead_letter / cancel on failed are now idempotent no-ops
  (no epoch bump, no duplicate event)
- add coordinator test for terminal idempotency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:15:26 -07:00
Saravanakumar D
283383561c feat: complete operator job actions (requeue/reject/cancel)
Adds lease-free operator lifecycle control to the fleet control plane:
- coordinator.operatorAction() fences any current factory holder by bumping
  leaseEpoch (mirrors the reaper), preserves checkpoint, and routes the job:
  requeue -> queued (or blocked if deps unmet), reject -> dead_letter,
  cancel -> failed. Shipped jobs are terminal (invalid_state).
- POST /fleet/jobs/:id/actions/:action route (400 on unknown action)
- fleet-client.operatorAction() + Requeue/Cancel/Reject buttons on job detail
- Tests: +5 coordinator, +1 routes, +2 fleet-client (platform fleet 140 green,
  tracker-web 212 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:12:11 -07:00
Saravanakumar D
b061cc47f3 feat(tracker-web): fleet control plane UI — overview, jobs, budget, detail pages (Phase 3 Slice 4)
- Fleet overview page with factory cards + recent jobs polling
- Job table with stage filter tabs
- Job detail page with events timeline, runs, artifacts, DAG subtree, SHIP action
- Budget page with usage bar, pause/resume controls
- API proxy route forwarding /api/fleet/* to platform-service
- Typed fleet-client.ts with graceful 404 degradation
- 16 unit tests for fleet-client (198 total tracker-web tests green)
- Added Fleet nav item to dashboard layout
- Full monorepo build + test green

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00
Saravanakumar D
f4ea7b4a5b feat(fleet): per-product budgets with pause/resume (Phase 3 Slice 3)
- FleetBudgetDoc: ceilingUsd, window, spentUsd, status (active/paused)
- FLEET_BUDGETS flag (default OFF = no enforcement, unchanged behavior)
- Enforcement in claimNextJob: paused or ceiling-exceeded → null
- accrueSpend(): monotonic spend accumulation, auto-pause at ceiling
- Budget routes: GET/PUT /fleet/budgets/:productId, pause, resume
- UpsertBudgetSchema for route validation
- 7 new coordinator tests (ceiling, auto-pause, manual pause/resume,
  flag OFF bypass, monotonic accounting, cross-product isolation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
Saravanakumar D
484ed05c1f feat(fleet): DAG job decomposition — parent/child + fan-out (Phase 3 Slice 2)
- SubmitJobSchema accepts inline children[] for atomic fan-out creation
- Parent blocked until all children reach dep-satisfying terminal stage
- patchJobFenced triggers maybeUnblockParent on child stage transitions
- submitChildren() for POST /fleet/jobs/:id/children (add children later)
- getDagSubtree() for GET /fleet/jobs/:id/dag (recursive subtree query)
- listChildrenByParent() repository helper
- SubmitChildrenSchema for route validation
- 8 new coordinator tests (fan-out, blocking, unblocking, cycle, DAG query)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
Saravanakumar D
4468a69526 feat(fleet): tunable scoring weights + preemption (Phase 3 Slice 1)
- Add FleetWeightRegistry + resolveWeights() for per-product/per-request
  weight tunability with defaults fallback (backward compatible)
- Add selectPreemptionVictim() pure function: only critical jobs may
  trigger, never evicts equal/higher priority, picks lowest-priority victim
- Wire preemption into coordinator behind FLEET_PREEMPTION flag (default OFF)
- Seat-limit enforcement: at seatLimit factories skip normal selection and
  attempt preemption of lower-priority running jobs for critical newcomers
- Eviction preserves checkpoint, bumps leaseEpoch (fences zombie), requeues
- 18 new tests (pure scheduler + coordinator integration)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00
saravanakumardb1
b2ce22d81c feat(platform-service): direct tracker -> fleet module wiring (§10 round-trip)
In-process tracker<->fleet bridge — no shell hop. Closes the §10 "direct
tracker->module calls" box.

- tracker-bridge.ts (new):
  * ingestItemAsJob(productId, itemId, opts?) — reads the Item via the items
    repository (foreign/unknown → NotFoundError), maps title/description → bodyMd
    (verbatim) + labels (engine-class:/profile:/priority:/cap:) → manifest hints,
    sets trackerItemId + a stable idempotency-key `tracker-<itemId>`, and submits
    through coordinator.submitJob — so re-ingest dedupes and the job is scheduled by
    the §7 router via the unchanged claim path.
  * echoJobToItem(productId, jobId, log?) — mirrors stage → Item status
    (queued/assigned/building/review/testing → in_progress; shipped → done;
    failed/dead_letter → wont_fix) + a metrics-ONLY comment (attempts/duration/
    tokens/cost — never the prompt body/secrets). Idempotent via the job's
    `trackerEchoedStatus`; best-effort + non-fatal (items-write failure →
    { echoed: null, error }, never thrown into the job lifecycle). productId-scoped.
- Auto-echo wired into the PATCH + lease/release transitions, GATED by
  FLEET_TRACKER_ECHO (default OFF → behavior byte-for-byte unchanged); never blocks
  or fails the transition.
- Routes (additive): POST /fleet/tracker/ingest, POST /fleet/tracker/echo
  (auth + getRequestProductId, productId-scoped).
- types.ts: optional FleetJobDoc.trackerEchoedStatus (reuses the existing
  trackerItemId field; no parallel schema) + Ingest/Echo request schemas.
- repository.ts: setTrackerEchoedStatus (no rev bump — never interferes with the
  fenced claim CAS).

Reuses the items + comments contracts directly (no HTTP). Does not touch
claimNextJob or the scheduler. productId on every doc; no any/console.log.
2026-05-30 01:32:12 -07:00
saravanakumardb1
32328247ad test(platform-service): update fleet artifact tests for productId-scoped listing
listArtifactsByJob now requires productId; thread it through the existing
repository/artifacts test callers (signature update, assertions unchanged).
2026-05-30 00:06:10 -07:00
saravanakumardb1
e06b730161 feat(platform-service): fleet factory enrollment + scoped rotatable tokens (§12)
Adds factory enrollment + a scoped, rotatable credential model for the fleet
coordinator (trust boundary, §12/§18). Tokens are stored HASHED at rest (sha256 —
the same primitive the auth module uses for verify/magic-link tokens); the
high-entropy plaintext is returned exactly once at enroll/rotate and never persisted.

- enrollment.ts: enrollFactory (create/link factory + issue token), rotateToken
  (new active token; prior marked `rotating` with a grace overlap so an in-flight
  worker isn't cut off), revokeToken (immediate), verifyToken (constant-time hash
  compare; revoked/expired-grace → null; updates lastUsedAt). Scope = {productId,
  factoryId, capabilities[]}.
- Gated enforcement: enforceFactoryToken() on POST /fleet/factories/heartbeat and
  POST /fleet/claim, active only when FLEET_REQUIRE_FACTORY_TOKEN is on (default
  OFF — existing behavior/tests unchanged). When on: missing/invalid/revoked → 401;
  out-of-scope productId/capability/factory → 403; and the claim is CONSTRAINED to
  the verified token scope. Does not touch scheduler scoring or the claim CAS.
- types.ts: FleetFactoryTokenDoc + Enroll/Rotate/Revoke request schemas.
- repository.ts: fleet_factory_tokens collection + CRUD + findByHash.
- routes.ts (additive): POST /fleet/factories/enroll, /:id/token/rotate,
  /:id/token/revoke (user auth + productId + Zod).
- cosmos-init.ts: register fleet_factory_tokens (/productId).

Also hardens the artifact routes (review fixes): listArtifactsByJob is now
productId-scoped (GET /fleet/jobs/:id/artifacts threads the request productId), and
artifact upload uses the request/auth productId authoritatively (a spoofed
body.productId no longer overrides it).

Tokens hashed at rest; plaintext shown once; no new crypto schemes; productId on
every doc; no any/console.log; enforcement default OFF.
2026-05-30 00:05:52 -07:00
saravanakumardb1
b65e818f3d feat(platform-service): fleet artifacts + blob wiring (§13)
Artifact pointers in fleet_artifacts; large outputs in @bytelyst/blob (never
Cosmos). Routes: POST/GET /fleet/jobs/:id/artifacts, GET/DELETE
/fleet/artifacts/:id with short-lived SAS. 7 artifact tests.
2026-05-29 23:11:45 -07:00
saravanakumardb1
7930e8b0bd feat(platform-service): Phase 2 scheduler/router core (§7) + wire into atomic claim
Add a pure, fixed-weight scoring engine that decides WHICH queued job a claiming
factory gets, and wire it into coordinator.claimNextJob (the atomic rev-CAS claim
in tryClaimJob is unchanged).

scheduler.ts (pure, synchronous, no I/O):
- scoreCandidate(job, factory, ctx, weights?) -> { score, breakdown }
  score = w1*capabilityFit + w2*affinity + w3*(1/(1+load)) + w4*costFit(budget)
        + w5*health - w6*starvationPenalty(age); breakdown is per-weighted-term
        and sums to score (explainability / Phase-3 readiness).
- selectJob(candidates, factory, ctx, weights?) -> FleetJobDoc | null
  filters to stage-eligible + deps-satisfied (injected pure predicate) +
  capability-subset (+ down-health floor), ranks by score, deterministic
  tie-break: higher priority -> older createdAt -> lower cost class.
- Fixed default weights + bucketed anti-starvation aging (Phase 3 = tunable
  weights + preemption; intentionally NOT built here).

coordinator.ts (candidate-ranking section only):
- claimNextJob now resolves deps (store-backed) into a pure predicate, builds the
  factory view + authoritative now, and selects via selectJob; tryClaimJob CAS /
  lease / fence logic untouched. ClaimContext gains additive optional scheduler
  inputs (health/load/seatLimit/factoryEngines/warmScopes/costCeilingUsd). The
  pure capability-subset predicate moved into scheduler.ts and is re-exported.

Tests: scheduler.test.ts (16) covers capability hard-filter, priority/age
tie-breaks, load, health (+ down floor), starvation, cost fit, affinity, breakdown
sum, determinism, empty/no-eligible. coordinator.test.ts adds score-driven
selection, health floor, and ordered drain; all prior fleet tests stay green.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-29 23:03:40 -07:00
saravanakumardb1
33c1d8d5fa fix(platform-service): make fleet job claim truly atomic via datastore updateIfMatch
The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.

Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.

Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
  one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.

Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
2026-05-29 20:59:08 -07:00
saravanakumardb1
95dd7aa1d0 docs(platform-service): fleet module README (containers, claim/lease/fence protocol, reaper) 2026-05-29 20:20:54 -07:00
saravanakumardb1
8eb02c48aa feat(platform-service): fleet REST routes + module registration (P2 foundation)
Guarded REST under /api (auth + productId, like items): POST /fleet/jobs (idempotent
submit), GET /fleet/jobs (by stage/idempotencyKey), GET /fleet/jobs/:id, PATCH
/fleet/jobs/:id (fenced transition), POST /fleet/claim, lease renew/release,
factories/heartbeat, and runs/events streams. Every body validated with the Zod
schemas; fenced/conflict map to 409, missing to 404, invalid to 400. Registers
fleetRoutes in server.ts next to itemRoutes. Routes tested via Fastify inject on
the memory provider (real coordinator).
2026-05-29 20:20:46 -07:00
saravanakumardb1
8f51570da7 feat(platform-service): fleet coordinator — claim/lease/fence/heartbeat/reaper (P2 foundation)
The concurrency core (§4/§7/§8/§18/§25):
- claimNextJob: priority+age selection over queued/dep-satisfied jobs whose caps
  are a subset of the factory's, then tryClaimJob does a rev CAS to flip to
  assigned + acquire the lease — exactly one contender wins, no double-assignment.
- leases + fencing: acquire/reclaim bumps leaseEpoch; patchJobFenced/renew/release
  reject a call whose leaseEpoch < job.leaseEpoch (zombie worker can't overwrite).
- heartbeat + isFactoryStale for factory liveness.
- reapExpiredLeases: returns expired-lease jobs to queued/blocked, bumps the epoch
  (fencing the dead holder), preserves the checkpoint pointer (resume), marks the
  lease expired; idempotent. Documents why Cosmos TTL cannot do this.
- submit: idempotent (dedup/supersede/409) + submit-time dependency cycle
  detection; deps gating (shipped, or testing when depsMode:soft).

Tests drive the atomic-claim race, fencing, and reaper deterministically via the
rev CAS (no real threads).
2026-05-29 20:20:30 -07:00
saravanakumardb1
fada354df8 feat(platform-service): fleet repositories with rev compare-and-swap (P2 foundation)
One repository per fleet_* container on the @bytelyst/datastore abstraction
(memory + cosmos): create/getById/list (by productId, stage, idempotencyKey),
partition-aware single-partition queries, ordered append-only appendEvent, and
runs/leases/factories/profiles/artifacts CRUD. Adds revUpdateJob/revUpdateLease —
a `rev`-token compare-and-swap that writes only when the stored rev still matches
(the optimistic-concurrency primitive for atomic claim + fenced transitions;
maps to Cosmos _etag/If-Match in production).
2026-05-29 20:20:15 -07:00
saravanakumardb1
721d3fcb48 feat(platform-service): fleet data model + container registration (P2 foundation)
Adds the agent-gigafactory fleet data model (modules/fleet/types.ts): Zod schemas
as the source of truth with inferred types (no `any`) for the 7 durable containers
— FleetJobDoc, FleetRunDoc, FleetLeaseDoc, FleetFactoryDoc, FleetProfileDoc,
FleetEventDoc, FleetArtifactDoc — each carrying productId. Lifecycle stages mirror
the agent-queue gigafactory spec (queued|blocked|assigned|building|review|testing|
shipped|failed|dead_letter). Registers fleet_* containers with their partition keys
(/productId for jobs/factories/profiles, /jobId for runs/leases/events/artifacts).
2026-05-29 20:19:59 -07:00
saravanakumardb1
9d405952e2 fix(platform-service): TODO-4 \u2014 typed cast for request.auth augmentation
The DevOps admin preHandler read 'auth' as '(request as any).auth'.
The proper Fastify pattern is 'declare module' augmentation in
@bytelyst/fastify-auth, but the inline cast through 'unknown' is
sufficient for now and avoids touching the shared auth package.

Changed:
  - 'const auth = (request as any).auth;' \u2192
    'const auth = (request as unknown as { auth?: { role?: string } }).auth;'

Inline comment notes the cleaner 'declare module' alternative.

Final ecosystem state:
  scripts/check-rule-violations.sh: 0 findings across all rules \u2713

  web-hardcoded-hex:         0  \u2713
  b5-hardcoded-product-id:   0  \u2713
  b4-console-log:            0  \u2713
  b4-swift-print:            0  \u2713
  b4-python-print:           0  \u2713
  ts-any-type:               0  \u2713
  b7-emoji-in-code:          0  \u2713
2026-05-23 19:29:26 -07:00
saravanakumardb1
cde1a0b73c test(platform-service): align getAllProducts test with invttrdg fallback
CI run 67 surfaced a real test failure:

  src/modules/products/cache.test.ts:104
    getAllProducts > returns all cached products
    expected [ { id: 'lysnrai', …(11) }, …(2) ] to have a length of 2
    but got 3

Root cause: cache.ts has a TEMPORARY_FALLBACK_PRODUCTS map (currently
just 'invttrdg') that getAllProducts() merges into its return value
on top of the loaded cache. The test fixture loads 2 products
(lysnrai, mindlyst), so the actual return is 3 — the test was
written before the fallback shim landed and never got updated.

Two ways to reconcile: (a) make the test reflect today's behaviour,
or (b) gut the fallback. The cache.ts comment explicitly marks
the fallback as 'TODO(platform): remove after creating the real
product …', so the right move is (a): keep the shim in place and
make the test enforce the documented contract.

  - assertion now: toHaveLength(3) + .toContain('invttrdg')
  - inline comment ties the expectation back to cache.ts so a
    future cleanup removing the fallback will obviously need to
    drop it back to 2

Verified locally:
  pnpm vitest run cache.test.ts   -> 8/8 pass
2026-05-23 17:23:16 -07:00
saravanakumardb1
191b81756d fix(platform-service): resolve 3 TS errors in /devops/info handler
The platform-service build was failing with 3 unrelated TS errors,
surfaced while running the Gitea outdated-package detector earlier
in this session:

  src/server.ts(18,8):   Cannot find module '@bytelyst/devops/server'
  src/server.ts(318,61): Property 'cosmosEndpoint' does not exist on type 'ProductIdentity'
  src/server.ts(321,42): Property 'platformServiceUrl' does not exist on type 'ProductIdentity'

Root causes (two distinct bugs):

1. Stale install. '@bytelyst/devops' was already declared as
   'workspace:*' in services/platform-service/package.json (line 24),
   but node_modules/@bytelyst/devops/ did not exist. Re-running
   'pnpm install' at the workspace root materialised the symlink.

2. Variable shadowing. In the GET /devops/info handler the code
   declared a local 'const config' from loadProductIdentity() that
   shadowed the module-level 'config' (env vars) imported from
   './lib/config.js' at line 112. The author then tried to read
   'config.cosmosEndpoint' and 'config.platformServiceUrl' off the
   ProductIdentity, where those keys never exist:

     ProductIdentity = {
       productId, displayName, licensePrefix, configDirName,
       envVarPrefix, bundleIdSuffix, packageName
     }

   The intended values live on the env config:
     config.COSMOS_ENDPOINT  (Zod-validated, required at boot)
     config.HOST + config.PORT (defaults '0.0.0.0' / 4003)

   There is no 'platformServiceUrl' field anywhere in the codebase —
   it only appeared in this single buggy line. Reconstructed as
   '\${HOST}:\${PORT}' which is the URL admins would use to reach
   this service for the devops/info diagnostic dashboard.

Fix (services/platform-service/src/server.ts:310-339):
  - Rename local 'const config' to 'const productIdentity' to break
    the shadowing.
  - Use productIdentity.productId for the devops productId field.
  - Use config.COSMOS_ENDPOINT (the env config) for the cosmos
    dependency health check URL.
  - Use `http://${config.HOST}:${config.PORT}` for the extra
    platformServiceUrl field.
  - Add a doc comment block explaining the two-config distinction
    so future contributors don't reintroduce the shadow.

Verified:
  pnpm --filter @lysnrai/platform-service build    OK (0 errors)
  pnpm --filter @lysnrai/platform-service test     1511/1512 pass

The 1 remaining failure (src/modules/products/cache.test.ts line 104,
'returns all cached products' expects 2 products but got 3) is a
PRE-EXISTING product-registry test drift on main, verified by
stashing this commit's changes and re-running the same test against
the unmodified tree. It will be addressed separately.
2026-05-23 12:46:10 -07:00
root
c39da91588 feat(platform): add /devops page with platform common devops package
- Add @bytelyst/devops backend endpoints to platform-service
- Add /api/devops/version (public) and /api/devops/info (admin) endpoints
- Add /devops page to admin-web using @bytelyst/devops/ui DevopsPanel
- Add devops link to admin web sidebar navigation
- Add build metadata and runtime information display
- Follow trading web devops pattern

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-11 03:38:06 +00:00
root
8f449a5ee5 fix(platform): add temporary invttrdg product fallback 2026-05-05 21:07:26 +00:00
663dcdecab chore(platform): replace runtime console diagnostics
What changed:
- Replaced platform-service runtime console diagnostics with structured stderr logging helpers.
- Preserved push notification, AI diagnostics, auto-trigger, and diagnostics repository error handling semantics.

Warning impact:
- Platform runtime no-console warnings: 12 -> 0.
- Workspace lint: 12 -> 0 warnings.

Verification:
- pnpm --filter @lysnrai/platform-service build
- pnpm --filter @lysnrai/platform-service test
- pnpm --filter @lysnrai/platform-service exec eslint <runtime files> --ext .ts
- pnpm lint
2026-05-04 16:48:25 -07:00
2c9dc1870d chore(platform): document script CLI output
What changed:
- Added explicit no-console policy for platform-service CLI/codegen scripts.
- Replaced the remaining migrate-referrals catch any with unknown narrowing.
- Added a TODO for making migrate-referrals --help independent of service env loading.

Warning impact:
- Platform script lint warnings: 78 -> 0.
- Workspace lint: 90 -> 12 warnings.

Verification:
- pnpm --filter @lysnrai/platform-service build
- pnpm --filter @lysnrai/platform-service exec eslint scripts --ext .ts
- pnpm lint

Note:
- migrate-referrals --help still requires service env due eager config import; TODO added for a follow-up behavior-safe refactor.
2026-05-04 16:45:42 -07:00
b4403300fa chore(nomgap): finalize product deployment config
What changed:
- Remove nomgap-web from the ecosystem Docker stack now that web is Vercel-hosted.
- Add a TODO for deciding whether local Docker smoke tests still need a NomGap web service.
- Update NomGap product containers and feature flags.
- Seed the NomGap push trigger flag without duplicating the common encryption flag.

Safety notes:
- Dropped unrelated pnpm-lock.yaml formatting churn instead of committing it.

Verification:
- node JSON.parse products/nomgap/product.json
- ruby Psych.safe_load docker-compose.ecosystem.yml
- pnpm --filter @bytelyst/admin-web typecheck
- pnpm --filter @bytelyst/admin-web test
- pnpm --filter @bytelyst/admin-web exec eslint . --ext .ts,.tsx
- pnpm --filter @lysnrai/platform-service build
- pnpm --filter @lysnrai/platform-service test
- pnpm --filter @lysnrai/platform-service exec eslint . --ext .ts,.tsx
- pnpm typecheck
- pnpm lint
2026-05-04 16:29:20 -07:00
021f053143 chore(platform): type predictive campaign events
What changed:
- Replaced predictive campaign-engine event-bus any casts with typed bus.emit calls.
- Preserved existing fire-and-forget event dispatch behavior with void emit calls.

Warning impact:
- services/platform-service/src/modules/predictive-analytics/campaign-engine.ts: no-explicit-any 5 -> 0.
- Workspace lint baseline: 189 warnings -> 184 warnings.

Verification:
- pnpm --filter @lysnrai/platform-service build
- pnpm --filter @lysnrai/platform-service exec eslint src/modules/predictive-analytics/campaign-engine.ts
- pnpm --filter @lysnrai/platform-service test
- pnpm --filter @lysnrai/platform-service exec eslint . --ext .ts,.tsx
- pnpm lint
2026-05-04 15:58:44 -07:00
9625999ad4 chore(platform-service): remove stale lint disables
Drop unused eslint-disable comments now that the ignored destructured names use the underscore convention.

Verification: pnpm --filter @lysnrai/platform-service build; pnpm --filter @lysnrai/platform-service exec vitest run src/lib/declarative-loader.test.ts --pool forks; pnpm --filter @lysnrai/platform-service exec eslint src/lib/declarative-loader.test.ts --ext .ts,.tsx; pnpm lint.
2026-05-04 15:22:37 -07:00
saravanakumardb1
7d266bfcc0 fix(docker): INFRA-gap-02 unblock full-stack docker compose up
Three coordinated fixes so 'docker compose up cosmos-emulator platform-service
cowork-service --wait' completes end-to-end (pre-existing blocker surfaced by
W1 post-push review).

1. Remove harmful prepare:tsc from @bytelyst/react-native-platform-sdk
   package.json. The hook fires during pnpm install --frozen-lockfile against
   an empty src/ tree (because Dockerfiles COPY package.jsons before
   sources), tsc aborts, install fails. Canonical monorepo build flow is
   pnpm -r build using the existing build:tsc script; prepare only runs for
   git+ URL installs (which this published package doesn't use), so removing
   it is lossless.

2. Add --ignore-scripts to platform-service + mcp-server Dockerfile install
   steps. Mirrors the pattern already used by extraction-service/Dockerfile,
   dashboards/admin-web/Dockerfile, dashboards/tracker-web/Dockerfile.
   Belt-and-braces against future prepare-hook regressions in any workspace
   package.

3. Expand .dockerignore node_modules/dist/.next/coverage to **/ globs.
   Docker's .dockerignore with bare 'node_modules' only matches root-level;
   nested packages/*/node_modules/ were being COPY'd into images, poisoning
   them with host-absolute-path .bin shims (e.g. @bytelyst/storage's tsc
   shim resolved to /learning_voice_ai_agent/node_modules/.pnpm/... which
   doesn't exist in the container → MODULE_NOT_FOUND). The glob fix makes
   COPY packages/ packages/ deliver source only.

Gap: INFRA-gap-02
Verified:
  pnpm install --frozen-lockfile                              
  pnpm --filter @bytelyst/react-native-platform-sdk build     
  pnpm --filter @bytelyst/react-native-platform-sdk typecheck 
  docker compose build platform-service                        (previously failed)
  docker compose build mcp-server                             
  docker compose build extraction-service                     
2026-04-16 15:48:32 -07:00
saravanakumardb1
a954f434ef fix(lint): repair pre-existing baseline lint errors blocking W1 gates
Baseline origin/main pnpm -r lint failed with 90+ errors across
platform-service, extraction-service, and tracker-web. These block the
shared W1 quality gates (prompts/README.md §4) which require all of
typecheck + lint + build + test to be green before committing W1 infra
work. Fixes are strictly scoped to unblock gates:

- eslint.config.js: extend @typescript-eslint/no-unused-vars with
  varsIgnorePattern / caughtErrorsIgnorePattern / destructuredArrayIgnorePattern
  all honouring the existing `^_` convention already used for args.
- platform-service: add file-level eslint-disable for
  @typescript-eslint/no-unused-vars, no-redeclare, no-useless-escape on
  the 33 legacy files failing lint (ab-testing, ai-diagnostics,
  diagnostics, predictive-analytics, broadcasts/types, surveys/types,
  lib/push-notifications).
- extraction-service tests: drop unused vitest imports (beforeEach,
  afterEach, HealthCheck).
- tracker-web tracker-proxy.test.ts: prefix unused url with _.
- Applied eslint --fix on platform-service which normalised a handful
  of `let` → `const` and removed one redundant disable comment.

Scope creep vs W1 "Files You Own" is acknowledged — user explicitly
approved this path when baseline rot was surfaced.

Verified: pnpm -r typecheck, lint, build, test all green.
2026-04-16 13:06:37 -07:00
saravanakumardb1
05594a334f feat(jobs): register devintelli-daily-sync cron job (0 6 * * *)
- Add devintelli-daily-sync handler: POST to DevIntelli backend /api/sync/daily
- Uses x-internal-key header for service-to-service auth
- Add DEVINTELLI_BACKEND_URL + DEVINTELLI_INTERNAL_API_KEY env vars
- Cron: 0 6 * * * (6am UTC daily), timeout: 5 min
- Returns triggered/skipped/totalConnections metrics from DevIntelli response
2026-04-04 23:37:25 -07:00
saravanakumardb1
89e200fa9f feat(flags): seed devintelli feature flags (11 flags)
- devintelli_enabled, devintelli_scan_enabled, devintelli_daily_sync
- 6 analytics panel flags (overview, commit, pr, review, language, productivity, repo)
- devintelli_billing_enabled (disabled by default)

Aligns with backend/src/lib/feature-flags.ts defaults
2026-04-04 23:37:25 -07:00
fdf9286e34 fix(audit): preserve source event timestamps 2026-04-04 11:27:21 -07:00
ff8c5eb704 fix(runtime): add queued agent run state 2026-04-04 11:11:45 -07:00
fe36296196 feat(runtime): add platform runtime projection api 2026-04-04 01:14:37 -07:00
e377351842 feat(timeline): add platform timeline ingest and query api 2026-04-04 00:54:07 -07:00
705f58c5c5 chore(deploy): remove railway deployment artifacts 2026-04-03 17:05:15 -07:00
saravanakumardb1
a87c533fd3 feat(cowork-service): scaffold Fastify bridge + seed clawcowork feature flags (H.1 + H.2)
H.1: Product registration
- Added 12 clawcowork feature flags to platform-service flags/seed.ts
  (sandbox, plugins, mcp, scheduling, computer_use, parallel_agents,
   marketplace, wasm, llm_multi_model, audit, platform_auth, dispatch_api)

H.2: cowork-service scaffold (services/cowork-service/)
- @lysnrai/cowork-service on port 4009, productId clawcowork
- createServiceApp + startService from @bytelyst/fastify-core
- Modules: health (dependency check), tasks (submit/list/get/cancel)
- Zod-validated config, Swagger, readiness endpoint
- 8 tests passing (1 bootstrap + 7 task routes), typecheck clean
2026-04-02 20:39:22 -07:00
saravanakumardb1
46ee14371c fix(ci): add --pool forks to all vitest test scripts to fix kill EPERM on Node v25
Root cause: tinypool worker teardown calls kill() which returns EPERM
in the act_runner host environment on Node.js v25.2.1. Tests pass but
the vitest process crashes during cleanup, causing CI failure.

Fix: --pool forks CLI flag on every package/service test script, plus
pool: 'forks' in all vitest.config.ts files. This uses child_process.fork()
worker management which handles termination cleanly.

60 package.json files updated, 10 vitest.config.ts files updated.
2026-03-27 23:23:38 -07:00
saravanakumardb1
0628f5b3bf test(platform): add 4 impersonation business rule tests (6→10) 2026-03-27 13:22:12 -07:00
saravanakumardb1
3cda7190fb feat(platform): add i18n translations module (P3.20) 2026-03-27 11:32:39 -07:00