The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.
Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.
Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.
Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
Guarded REST under /api (auth + productId, like items): POST /fleet/jobs (idempotent
submit), GET /fleet/jobs (by stage/idempotencyKey), GET /fleet/jobs/:id, PATCH
/fleet/jobs/:id (fenced transition), POST /fleet/claim, lease renew/release,
factories/heartbeat, and runs/events streams. Every body validated with the Zod
schemas; fenced/conflict map to 409, missing to 404, invalid to 400. Registers
fleetRoutes in server.ts next to itemRoutes. Routes tested via Fastify inject on
the memory provider (real coordinator).
The concurrency core (§4/§7/§8/§18/§25):
- claimNextJob: priority+age selection over queued/dep-satisfied jobs whose caps
are a subset of the factory's, then tryClaimJob does a rev CAS to flip to
assigned + acquire the lease — exactly one contender wins, no double-assignment.
- leases + fencing: acquire/reclaim bumps leaseEpoch; patchJobFenced/renew/release
reject a call whose leaseEpoch < job.leaseEpoch (zombie worker can't overwrite).
- heartbeat + isFactoryStale for factory liveness.
- reapExpiredLeases: returns expired-lease jobs to queued/blocked, bumps the epoch
(fencing the dead holder), preserves the checkpoint pointer (resume), marks the
lease expired; idempotent. Documents why Cosmos TTL cannot do this.
- submit: idempotent (dedup/supersede/409) + submit-time dependency cycle
detection; deps gating (shipped, or testing when depsMode:soft).
Tests drive the atomic-claim race, fencing, and reaper deterministically via the
rev CAS (no real threads).
One repository per fleet_* container on the @bytelyst/datastore abstraction
(memory + cosmos): create/getById/list (by productId, stage, idempotencyKey),
partition-aware single-partition queries, ordered append-only appendEvent, and
runs/leases/factories/profiles/artifacts CRUD. Adds revUpdateJob/revUpdateLease —
a `rev`-token compare-and-swap that writes only when the stored rev still matches
(the optimistic-concurrency primitive for atomic claim + fenced transitions;
maps to Cosmos _etag/If-Match in production).
Adds the agent-gigafactory fleet data model (modules/fleet/types.ts): Zod schemas
as the source of truth with inferred types (no `any`) for the 7 durable containers
— FleetJobDoc, FleetRunDoc, FleetLeaseDoc, FleetFactoryDoc, FleetProfileDoc,
FleetEventDoc, FleetArtifactDoc — each carrying productId. Lifecycle stages mirror
the agent-queue gigafactory spec (queued|blocked|assigned|building|review|testing|
shipped|failed|dead_letter). Registers fleet_* containers with their partition keys
(/productId for jobs/factories/profiles, /jobId for runs/leases/events/artifacts).
The DevOps admin preHandler read 'auth' as '(request as any).auth'.
The proper Fastify pattern is 'declare module' augmentation in
@bytelyst/fastify-auth, but the inline cast through 'unknown' is
sufficient for now and avoids touching the shared auth package.
Changed:
- 'const auth = (request as any).auth;' \u2192
'const auth = (request as unknown as { auth?: { role?: string } }).auth;'
Inline comment notes the cleaner 'declare module' alternative.
Final ecosystem state:
scripts/check-rule-violations.sh: 0 findings across all rules \u2713
web-hardcoded-hex: 0 \u2713
b5-hardcoded-product-id: 0 \u2713
b4-console-log: 0 \u2713
b4-swift-print: 0 \u2713
b4-python-print: 0 \u2713
ts-any-type: 0 \u2713
b7-emoji-in-code: 0 \u2713
CI run 67 surfaced a real test failure:
src/modules/products/cache.test.ts:104
getAllProducts > returns all cached products
expected [ { id: 'lysnrai', …(11) }, …(2) ] to have a length of 2
but got 3
Root cause: cache.ts has a TEMPORARY_FALLBACK_PRODUCTS map (currently
just 'invttrdg') that getAllProducts() merges into its return value
on top of the loaded cache. The test fixture loads 2 products
(lysnrai, mindlyst), so the actual return is 3 — the test was
written before the fallback shim landed and never got updated.
Two ways to reconcile: (a) make the test reflect today's behaviour,
or (b) gut the fallback. The cache.ts comment explicitly marks
the fallback as 'TODO(platform): remove after creating the real
product …', so the right move is (a): keep the shim in place and
make the test enforce the documented contract.
- assertion now: toHaveLength(3) + .toContain('invttrdg')
- inline comment ties the expectation back to cache.ts so a
future cleanup removing the fallback will obviously need to
drop it back to 2
Verified locally:
pnpm vitest run cache.test.ts -> 8/8 pass
The platform-service build was failing with 3 unrelated TS errors,
surfaced while running the Gitea outdated-package detector earlier
in this session:
src/server.ts(18,8): Cannot find module '@bytelyst/devops/server'
src/server.ts(318,61): Property 'cosmosEndpoint' does not exist on type 'ProductIdentity'
src/server.ts(321,42): Property 'platformServiceUrl' does not exist on type 'ProductIdentity'
Root causes (two distinct bugs):
1. Stale install. '@bytelyst/devops' was already declared as
'workspace:*' in services/platform-service/package.json (line 24),
but node_modules/@bytelyst/devops/ did not exist. Re-running
'pnpm install' at the workspace root materialised the symlink.
2. Variable shadowing. In the GET /devops/info handler the code
declared a local 'const config' from loadProductIdentity() that
shadowed the module-level 'config' (env vars) imported from
'./lib/config.js' at line 112. The author then tried to read
'config.cosmosEndpoint' and 'config.platformServiceUrl' off the
ProductIdentity, where those keys never exist:
ProductIdentity = {
productId, displayName, licensePrefix, configDirName,
envVarPrefix, bundleIdSuffix, packageName
}
The intended values live on the env config:
config.COSMOS_ENDPOINT (Zod-validated, required at boot)
config.HOST + config.PORT (defaults '0.0.0.0' / 4003)
There is no 'platformServiceUrl' field anywhere in the codebase —
it only appeared in this single buggy line. Reconstructed as
'\${HOST}:\${PORT}' which is the URL admins would use to reach
this service for the devops/info diagnostic dashboard.
Fix (services/platform-service/src/server.ts:310-339):
- Rename local 'const config' to 'const productIdentity' to break
the shadowing.
- Use productIdentity.productId for the devops productId field.
- Use config.COSMOS_ENDPOINT (the env config) for the cosmos
dependency health check URL.
- Use `http://${config.HOST}:${config.PORT}` for the extra
platformServiceUrl field.
- Add a doc comment block explaining the two-config distinction
so future contributors don't reintroduce the shadow.
Verified:
pnpm --filter @lysnrai/platform-service build OK (0 errors)
pnpm --filter @lysnrai/platform-service test 1511/1512 pass
The 1 remaining failure (src/modules/products/cache.test.ts line 104,
'returns all cached products' expects 2 products but got 3) is a
PRE-EXISTING product-registry test drift on main, verified by
stashing this commit's changes and re-running the same test against
the unmodified tree. It will be addressed separately.
- Add @bytelyst/devops backend endpoints to platform-service
- Add /api/devops/version (public) and /api/devops/info (admin) endpoints
- Add /devops page to admin-web using @bytelyst/devops/ui DevopsPanel
- Add devops link to admin web sidebar navigation
- Add build metadata and runtime information display
- Follow trading web devops pattern
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
What changed:
- Remove nomgap-web from the ecosystem Docker stack now that web is Vercel-hosted.
- Add a TODO for deciding whether local Docker smoke tests still need a NomGap web service.
- Update NomGap product containers and feature flags.
- Seed the NomGap push trigger flag without duplicating the common encryption flag.
Safety notes:
- Dropped unrelated pnpm-lock.yaml formatting churn instead of committing it.
Verification:
- node JSON.parse products/nomgap/product.json
- ruby Psych.safe_load docker-compose.ecosystem.yml
- pnpm --filter @bytelyst/admin-web typecheck
- pnpm --filter @bytelyst/admin-web test
- pnpm --filter @bytelyst/admin-web exec eslint . --ext .ts,.tsx
- pnpm --filter @lysnrai/platform-service build
- pnpm --filter @lysnrai/platform-service test
- pnpm --filter @lysnrai/platform-service exec eslint . --ext .ts,.tsx
- pnpm typecheck
- pnpm lint
Baseline origin/main pnpm -r lint failed with 90+ errors across
platform-service, extraction-service, and tracker-web. These block the
shared W1 quality gates (prompts/README.md §4) which require all of
typecheck + lint + build + test to be green before committing W1 infra
work. Fixes are strictly scoped to unblock gates:
- eslint.config.js: extend @typescript-eslint/no-unused-vars with
varsIgnorePattern / caughtErrorsIgnorePattern / destructuredArrayIgnorePattern
all honouring the existing `^_` convention already used for args.
- platform-service: add file-level eslint-disable for
@typescript-eslint/no-unused-vars, no-redeclare, no-useless-escape on
the 33 legacy files failing lint (ab-testing, ai-diagnostics,
diagnostics, predictive-analytics, broadcasts/types, surveys/types,
lib/push-notifications).
- extraction-service tests: drop unused vitest imports (beforeEach,
afterEach, HealthCheck).
- tracker-web tracker-proxy.test.ts: prefix unused url with _.
- Applied eslint --fix on platform-service which normalised a handful
of `let` → `const` and removed one redundant disable comment.
Scope creep vs W1 "Files You Own" is acknowledged — user explicitly
approved this path when baseline rot was surfaced.
Verified: pnpm -r typecheck, lint, build, test all green.
- exports/routes: exclude inline data from GET /exports list response
to prevent returning megabytes of serialized export data (perf+security)
- Update WORKSPACE_TODO_AUDIT.md: add post-audit review section with
9 bugs found and fixed across 2 commits (73b07c2, 841cdf3), mark
all action plan sprints complete
- Typecheck clean, 1483/1483 tests pass
- diagnostics/subscribers: wire session.created email notification to
target user using existing 'diagnostics-session-created' template
(was just logging instead of sending the email)
- events/types: add missing 'currency' field to payment.failed schema
(payment.succeeded had it, payment.failed did not — inconsistency)
- delivery/subscribers: use event.payload.currency instead of hardcoded
empty string in payment-failed email variables
- Typecheck clean, 1483/1483 tests pass
- diagnostics/subscribers: use correct template IDs
'diagnostics-session-cancelled' and 'diagnostics-session-completed'
instead of non-existent 'generic' (would throw at runtime)
- delivery/templates: add missing 'broadcast' email template used by
broadcast delivery route (dispatchEmail would throw on unknown ID)
- broadcasts/routes: replace broken dot-path 'metrics.sent' update
with proper updateBroadcastMetrics() call, add productName variable
- exports/routes: store serialized data on job doc, add download
endpoint GET /exports/:id/download with content-type headers,
exclude data payload from metadata GET endpoint
- waitlist/routes: store invitation doc ID (inv_...) instead of
code string (WL-...) in invitationCodeId field
- delivery/delivery.test.ts: update template count 12 -> 13
- Typecheck clean, 1483/1483 tests pass
- delivery/subscribers: welcome email used raw productId as productName,
now uses resolveProductName() for proper display name
- delivery/subscribers: remove redundant String(daysLeft) in trial_expiring
- surveys/routes: incentiveClaimed was set outside if(sub) block, marking
response as claimed even when user has no subscription. Moved inside
if(sub) so claims are only recorded when incentive is actually granted
Backend (delivery retry):
- Use NotFoundError (404) instead of BadRequestError (400) for missing log doc
- Add telegram + slack retry support (was email-only, threw error for others)
Frontend (delivery page):
- Add pk field to DeliveryEntry interface
- Pass pk query param in retry call so backend can look up the doc
- Fix handleRetry to accept full entry object instead of just id
Frontend (webhooks page):
- Parallelize delivery fetches with Promise.allSettled (was sequential for loop)
- Significant page load improvement for subscriptions with many deliveries
Phase 0 from DASHBOARD_UI_COVERAGE_ROADMAP:
- Register ai-diagnostics routes in server.ts (671-line module was never mounted)
- Add 6 hidden pages to admin sidebar-nav.tsx:
Debug Sessions, Health Dashboard, Extraction, Experiments,
Predictive, AI Diagnostics
- /users was already in sidebar (no change needed)
- Kill switch verified: already per-product via productId query param
- Admin sidebar now has 33 items (was 27)
EventSource API cannot set custom headers, so the SSE /flags/stream
endpoint and feature-flag-client were broken for streaming mode:
- Server: accept productId and token from query string as fallback
when x-product-id / authorization headers are absent
- Client: pass productId (and optional auth token) as query params
when constructing the EventSource URL
- repository.ts: update() and remove() now require productId as partition key
(was passing 'id' as both params — works with memory provider but fails on Cosmos DB)
- repository.ts: updateSegment() and removeSegment() also fixed
- routes.ts: all repo.update/remove calls updated to pass productId
- routes.ts: audit 'before' snapshots now use JSON deep copy instead of shallow spread
(prevents nested object mutation from corrupting audit trail)
- routes.ts: kill switch audit now uses repo.update() return value for 'after' snapshot
- evaluator.ts: anonymous users (no userId) with partial percentage (0 < pct < 100)
now correctly return 'off' instead of falling through to default variation
(can't deterministically hash without a userId)
- Auto-triage previously always set status to 'triaged', even for cases
already in in_progress, escalated, or other later states
- Now only transitions to 'triaged' if case is still in 'open' state
- Cases in later states keep their current status (only priority + tags update)
- Added regression test for in_progress case
- 10 support-cases tests passing
- Previous filter only checked e.recordedAt < currentPeriodStart
- Now also checks e.recordedAt >= prevPeriodStart (lower bound)
- Prevents entries from periods before the previous one from inflating
the spent amount, which would reduce the rollover incorrectly
- 12 ai-budgets tests passing
- supportCase.tags is optional in SupportCaseDoc schema
- Spreading undefined throws TypeError at runtime
- Fixed both [...supportCase.tags] and .includes() call with ?? [] fallback
- Added regression test for undefined tags case
- 9 support-cases tests passing