- Fleet overview page with factory cards + recent jobs polling
- Job table with stage filter tabs
- Job detail page with events timeline, runs, artifacts, DAG subtree, SHIP action
- Budget page with usage bar, pause/resume controls
- API proxy route forwarding /api/fleet/* to platform-service
- Typed fleet-client.ts with graceful 404 degradation
- 16 unit tests for fleet-client (198 total tracker-web tests green)
- Added Fleet nav item to dashboard layout
- Full monorepo build + test green
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
docker-compose:
- Drop the cosmos-emulator service block. Both image variants we
tried were unfit for the prototype: `:vnext-preview` returned
plain-text PGCosmosError strings that crashed @azure/cosmos at
JSON.parse, and `:latest` core-dumped under load. The container
has been Exited(255) for weeks and was blocking depends_on chains.
- Real Azure Cosmos account `cosmos-mywisprai` (db `bytelyst`,
West US 2) is now the single source of truth; all services pick
up COSMOS_ENDPOINT/KEY/DATABASE from `.env` (already mounted via
`env_file: .env`).
- extraction-service: drop hardcoded `COSMOS_ENDPOINT=…cosmos-emulator…`,
`NODE_TLS_REJECT_UNAUTHORIZED=0`, and `depends_on: cosmos-emulator`.
- cowork-service: same cleanup.
cowork-service IPC bridge:
- Add `error` listeners to the spawned child's stdin/stdout/stderr.
Without them, an EPIPE on stdin (child died mid-write) or a
teardown-time stream error surfaced as an unhandled error and
crashed vitest after all 140 tests had passed.
- This removes the only failure from the workspace-wide recursive test run.
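A minimal sketch of the listener wiring described above, assuming a helper named `attachStdioErrorListeners` (hypothetical; the commit does not show the real function) and a narrow `ErrorSource` interface standing in for Node's stream types:

```typescript
// Hypothetical minimal shape: anything that can emit "error" (Node streams qualify).
interface ErrorSource {
  on(event: "error", listener: (err: Error) => void): unknown;
}

interface ChildStdio {
  stdin: ErrorSource | null;
  stdout: ErrorSource | null;
  stderr: ErrorSource | null;
}

// Attach an "error" listener to every stdio stream that exists.
// A stream with no "error" listener turns an EPIPE into an uncaught
// exception, which is what took the test runner down after the tests passed.
function attachStdioErrorListeners(
  child: ChildStdio,
  onError: (streamName: string, err: Error) => void,
): void {
  const streams: Array<[string, ErrorSource | null]> = [
    ["stdin", child.stdin],
    ["stdout", child.stdout],
    ["stderr", child.stderr],
  ];
  for (const [name, stream] of streams) {
    if (stream !== null) {
      stream.on("error", (err: Error) => onError(name, err));
    }
  }
}
```

In the service this would presumably be called right after the child is spawned, with `onError` logging the failure instead of rethrowing it.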
Test status after this commit:
- 94 workspace packages, all green
- cowork-service: 19 test files passed | 1 skipped (140 tests)
- platform-service: 131 test files passed
- extraction-service: 13 test files passed
- All other packages: passing
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
In-process tracker<->fleet bridge — no shell hop. Closes the §10 "direct
tracker->module calls" box.
- tracker-bridge.ts (new):
* ingestItemAsJob(productId, itemId, opts?) — reads the Item via the items
repository (foreign/unknown → NotFoundError), maps title/description → bodyMd
(verbatim) + labels (engine-class:/profile:/priority:/cap:) → manifest hints,
sets trackerItemId + a stable idempotency-key `tracker-<itemId>`, and submits
through coordinator.submitJob — so re-ingest dedupes and the job is scheduled by
the §7 router via the unchanged claim path.
* echoJobToItem(productId, jobId, log?) — mirrors stage → Item status
(queued/assigned/building/review/testing → in_progress; shipped → done;
failed/dead_letter → wont_fix) + a metrics-ONLY comment (attempts/duration/
tokens/cost — never the prompt body/secrets). Idempotent via the job's
`trackerEchoedStatus`; best-effort + non-fatal (items-write failure →
{ echoed: null, error }, never thrown into the job lifecycle). productId-scoped.
- Auto-echo wired into the PATCH + lease/release transitions, GATED by
FLEET_TRACKER_ECHO (default OFF → behavior byte-for-byte unchanged); never blocks
or fails the transition.
- Routes (additive): POST /fleet/tracker/ingest, POST /fleet/tracker/echo
(auth + getRequestProductId, productId-scoped).
- types.ts: optional FleetJobDoc.trackerEchoedStatus (reuses the existing
trackerItemId field; no parallel schema) + Ingest/Echo request schemas.
- repository.ts: setTrackerEchoedStatus (no rev bump — never interferes with the
fenced claim CAS).
Reuses the items + comments contracts directly (no HTTP). Does not touch
claimNextJob or the scheduler. productId on every doc; no any/console.log.
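The stage → Item status mirror and its idempotency guard can be sketched as below. The stage and status names come from this message; the function names, the `EchoableJob` shape, and the choice to skip `blocked` (which the mapping above does not mention) are assumptions for illustration.

```typescript
type FleetStage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";
type ItemStatus = "in_progress" | "done" | "wont_fix";

// Mapping as listed in the commit message. "blocked" is not listed there,
// so this sketch treats it as "nothing to echo" (assumption).
function stageToItemStatus(stage: FleetStage): ItemStatus | null {
  switch (stage) {
    case "queued":
    case "assigned":
    case "building":
    case "review":
    case "testing":
      return "in_progress";
    case "shipped":
      return "done";
    case "failed":
    case "dead_letter":
      return "wont_fix";
    case "blocked":
      return null;
  }
}

// Hypothetical job shape: only the fields the echo decision needs.
interface EchoableJob {
  stage: FleetStage;
  trackerEchoedStatus?: ItemStatus;
}

// Returns the status to write, or null when the echo would be a no-op
// because the same status was already mirrored.
function planEcho(job: EchoableJob): ItemStatus | null {
  const target = stageToItemStatus(job.stage);
  if (target === null || target === job.trackerEchoedStatus) return null;
  return target;
}
```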
- packages/llm: use nullish coalescing (??) in GeminiProvider constructor
so explicit empty-string apiKey is not overridden by env var
- dashboards/admin-web,tracker-web: exclude .next/ from vitest test glob
to prevent Next.js internal test files from being picked up
- services/cowork-service: use platform-safe .kill() instead of SIGTERM
which is invalid on Windows
- packages/use-keyboard-shortcuts: add @testing-library/react devDep
- scripts/npmrc.template: use https:// for Gitea registry
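To illustrate the `??` change in the GeminiProvider constructor: `||` falls through on any falsy value, including `""`, while `??` falls through only on `null`/`undefined`. The function names below are hypothetical and not taken from packages/llm.

```typescript
// With ??, an explicit empty string is kept as-is.
function resolveApiKey(explicit: string | undefined, envValue: string | undefined): string | undefined {
  return explicit ?? envValue;
}

// With ||, an explicit empty string is silently replaced by the env value.
function resolveApiKeyWithOr(explicit: string | undefined, envValue: string | undefined): string | undefined {
  return explicit || envValue;
}
```

This matters for callers (tests in particular) that pass `""` on purpose to force a "no key configured" state.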
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds factory enrollment + a scoped, rotatable credential model for the fleet
coordinator (trust boundary, §12/§18). Tokens are stored HASHED at rest (sha256 —
the same primitive the auth module uses for verify/magic-link tokens); the
high-entropy plaintext is returned exactly once at enroll/rotate and never persisted.
- enrollment.ts: enrollFactory (create/link factory + issue token), rotateToken
(new active token; prior marked `rotating` with a grace overlap so an in-flight
worker isn't cut off), revokeToken (immediate), verifyToken (constant-time hash
compare; revoked/expired-grace → null; updates lastUsedAt). Scope = {productId,
factoryId, capabilities[]}.
- Gated enforcement: enforceFactoryToken() on POST /fleet/factories/heartbeat and
POST /fleet/claim, active only when FLEET_REQUIRE_FACTORY_TOKEN is on (default
OFF — existing behavior/tests unchanged). When on: missing/invalid/revoked → 401;
out-of-scope productId/capability/factory → 403; and the claim is CONSTRAINED to
the verified token scope. Does not touch scheduler scoring or the claim CAS.
- types.ts: FleetFactoryTokenDoc + Enroll/Rotate/Revoke request schemas.
- repository.ts: fleet_factory_tokens collection + CRUD + findByHash.
- routes.ts (additive): POST /fleet/factories/enroll, /:id/token/rotate,
/:id/token/revoke (user auth + productId + Zod).
- cosmos-init.ts: register fleet_factory_tokens (/productId).
Also hardens the artifact routes (review fixes): listArtifactsByJob is now
productId-scoped (GET /fleet/jobs/:id/artifacts threads the request productId), and
artifact upload uses the request/auth productId authoritatively (a spoofed
body.productId no longer overrides it).
Tokens hashed at rest; plaintext shown once; no new crypto schemes; productId on
every doc; no any/console.log; enforcement default OFF.
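A rough sketch of the hashed-at-rest token flow, using only Node's built-in `crypto` module. The function names and the 32-byte token length are assumptions; the real enrollment.ts is not shown in this message.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// sha256 of the plaintext, hex-encoded; only this value would be persisted.
function hashToken(plaintext: string): string {
  return createHash("sha256").update(plaintext, "utf8").digest("hex");
}

// Issue a high-entropy token. The plaintext is handed to the caller once;
// the hash is what a token document would store. (32 bytes is an assumption.)
function issueToken(): { plaintext: string; tokenHash: string } {
  const plaintext = randomBytes(32).toString("hex");
  return { plaintext, tokenHash: hashToken(plaintext) };
}

// Constant-time comparison of the presented token's hash with the stored hash.
// timingSafeEqual throws on length mismatch, so lengths are checked first.
function matchesStoredHash(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashToken(presented), "hex");
  const b = Buffer.from(storedHash, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Revocation, the `rotating` grace window, and `lastUsedAt` updates would sit on top of this check and are left out here.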
The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.
Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.
Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.
Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
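The difference between read-check-write and an indivisible compare-and-write can be sketched with an in-memory store. `MemoryStore`, its method signatures, and the `Job` shape are illustrative assumptions, not the @bytelyst/datastore API.

```typescript
interface Versioned {
  rev: number;
}

type UpdateResult<T> =
  | { ok: true; doc: T }
  | { ok: false; reason: "conflict" | "missing" };

class MemoryStore<T extends Versioned> {
  private readonly docs = new Map<string, T>();

  put(id: string, doc: T): void {
    this.docs.set(id, doc);
  }

  get(id: string): T | undefined {
    return this.docs.get(id);
  }

  // Compare and write happen in one synchronous step: there is no await
  // between the rev check and the set, so no other claimer can interleave.
  updateIfMatch(id: string, expectedRev: number, patch: Partial<T>): UpdateResult<T> {
    const current = this.docs.get(id);
    if (current === undefined) return { ok: false, reason: "missing" };
    if (current.rev !== expectedRev) return { ok: false, reason: "conflict" };
    const next = { ...current, ...patch, rev: current.rev + 1 };
    this.docs.set(id, next);
    return { ok: true, doc: next };
  }
}

// Two contenders read the same rev, then both try to claim.
interface Job extends Versioned {
  stage: string;
  factoryId?: string;
}

const store = new MemoryStore<Job>();
store.put("job-1", { rev: 0, stage: "queued" });

const revSeenByA = store.get("job-1")?.rev ?? -1;
const revSeenByB = store.get("job-1")?.rev ?? -1;

const first = store.updateIfMatch("job-1", revSeenByA, { stage: "assigned", factoryId: "factory-a" });
const second = store.updateIfMatch("job-1", revSeenByB, { stage: "assigned", factoryId: "factory-b" });
```

Here `first` succeeds and `second` is rejected as a conflict because the stored rev has already moved on; with Cosmos the same role would be played by the If-Match precondition on the write.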
Guarded REST under /api (auth + productId, like items): POST /fleet/jobs (idempotent
submit), GET /fleet/jobs (by stage/idempotencyKey), GET /fleet/jobs/:id, PATCH
/fleet/jobs/:id (fenced transition), POST /fleet/claim, lease renew/release,
factories/heartbeat, and runs/events streams. Every body validated with the Zod
schemas; fenced/conflict map to 409, missing to 404, invalid to 400. Registers
fleetRoutes in server.ts next to itemRoutes. Routes tested via Fastify inject on
the memory provider (real coordinator).
The concurrency core (§4/§7/§8/§18/§25):
- claimNextJob: priority+age selection over queued/dep-satisfied jobs whose caps
are a subset of the factory's, then tryClaimJob does a rev CAS to flip to
assigned + acquire the lease — exactly one contender wins, no double-assignment.
- leases + fencing: acquire/reclaim bumps leaseEpoch; patchJobFenced/renew/release
reject a call whose leaseEpoch < job.leaseEpoch (zombie worker can't overwrite).
- heartbeat + isFactoryStale for factory liveness.
- reapExpiredLeases: returns expired-lease jobs to queued/blocked, bumps the epoch
(fencing the dead holder), preserves the checkpoint pointer (resume), marks the
lease expired; idempotent. Documents why Cosmos TTL cannot do this.
- submit: idempotent (dedup/supersede/409) + submit-time dependency cycle
detection; deps gating (shipped, or testing when depsMode:soft).
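One way to picture the selection step in claimNextJob: filter to jobs whose required caps are a subset of the factory's, then order by priority and age. The `QueuedJob` shape and the exact ordering (higher `priority` first, older `createdAt` first on ties) are assumptions; the message only says "priority+age".

```typescript
// Hypothetical projection of a queued, dependency-satisfied job.
interface QueuedJob {
  id: string;
  priority: number;   // assumption: larger means more urgent
  createdAt: number;  // epoch millis; smaller means older
  caps: string[];     // capabilities the job requires
}

function pickNextJob(jobs: QueuedJob[], factoryCaps: string[]): QueuedJob | null {
  const available = new Set(factoryCaps);
  // Keep only jobs whose required caps are all offered by the factory.
  const eligible = jobs.filter((job) => job.caps.every((cap) => available.has(cap)));
  // Highest priority first; among equals, the oldest job first.
  eligible.sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt);
  return eligible[0] ?? null;
}
```

The selected job is only a candidate: the rev CAS in tryClaimJob still decides which contender actually gets it.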
Tests drive the atomic-claim race, fencing, and reaper deterministically via the
rev CAS (no real threads).
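The fencing rule (reject a call whose leaseEpoch is lower than the job's) can be sketched as follows. `LeasedJob`, `reclaimLease`, and `applyFencedPatch` are hypothetical names for illustration; resetting the stage to `queued` on reclaim is a simplification of the reaper behaviour described above.

```typescript
interface LeasedJob {
  leaseEpoch: number;
  stage: string;
}

type FencedResult =
  | { ok: true; job: LeasedJob }
  | { ok: false; reason: "fenced" };

// Reclaiming an expired lease bumps the epoch, which invalidates the old holder.
function reclaimLease(job: LeasedJob): LeasedJob {
  return { ...job, leaseEpoch: job.leaseEpoch + 1, stage: "queued" };
}

// A write carrying an epoch older than the job's current epoch is rejected,
// so a zombie worker cannot overwrite state after losing its lease.
function applyFencedPatch(job: LeasedJob, callerEpoch: number, stage: string): FencedResult {
  if (callerEpoch < job.leaseEpoch) return { ok: false, reason: "fenced" };
  return { ok: true, job: { ...job, stage } };
}
```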
One repository per fleet_* container on the @bytelyst/datastore abstraction
(memory + cosmos): create/getById/list (by productId, stage, idempotencyKey),
partition-aware single-partition queries, ordered append-only appendEvent, and
runs/leases/factories/profiles/artifacts CRUD. Adds revUpdateJob/revUpdateLease —
a `rev`-token compare-and-swap that writes only when the stored rev still matches
(the optimistic-concurrency primitive for atomic claim + fenced transitions;
maps to Cosmos _etag/If-Match in production).
Adds the agent-gigafactory fleet data model (modules/fleet/types.ts): Zod schemas
as the source of truth with inferred types (no `any`) for the 7 durable containers
— FleetJobDoc, FleetRunDoc, FleetLeaseDoc, FleetFactoryDoc, FleetProfileDoc,
FleetEventDoc, FleetArtifactDoc — each carrying productId. Lifecycle stages mirror
the agent-queue gigafactory spec (queued|blocked|assigned|building|review|testing|
shipped|failed|dead_letter). Registers fleet_* containers with their partition keys
(/productId for jobs/factories/profiles, /jobId for runs/leases/events/artifacts).
The DevOps admin preHandler read 'auth' as '(request as any).auth'.
The proper Fastify pattern is 'declare module' augmentation in
@bytelyst/fastify-auth, but the inline cast through 'unknown' is
sufficient for now and avoids touching the shared auth package.
Changed:
- 'const auth = (request as any).auth;' →
'const auth = (request as unknown as { auth?: { role?: string } }).auth;'
Inline comment notes the cleaner 'declare module' alternative.
Final ecosystem state:
scripts/check-rule-violations.sh: 0 findings across all rules ✓
web-hardcoded-hex: 0 ✓
b5-hardcoded-product-id: 0 ✓
b4-console-log: 0 ✓
b4-swift-print: 0 ✓
b4-python-print: 0 ✓
ts-any-type: 0 ✓
b7-emoji-in-code: 0 ✓
CI run 67 surfaced a real test failure:
src/modules/products/cache.test.ts:104
getAllProducts > returns all cached products
expected [ { id: 'lysnrai', …(11) }, …(2) ] to have a length of 2
but got 3
Root cause: cache.ts has a TEMPORARY_FALLBACK_PRODUCTS map (currently
just 'invttrdg') that getAllProducts() merges into its return value
on top of the loaded cache. The test fixture loads 2 products
(lysnrai, mindlyst), so the actual return is 3 — the test was
written before the fallback shim landed and never got updated.
Two ways to reconcile: (a) make the test reflect today's behaviour,
or (b) gut the fallback. The cache.ts comment explicitly marks
the fallback as 'TODO(platform): remove after creating the real
product …', i.e. a deliberate stopgap that must stay until that
product exists, so the right move is (a): keep the shim in place and
make the test enforce the documented contract.
- assertion now: toHaveLength(3) + .toContain('invttrdg')
- inline comment ties the expectation back to cache.ts so a
future cleanup removing the fallback will obviously need to
drop it back to 2
Verified locally:
pnpm vitest run cache.test.ts -> 8/8 pass
The platform-service build was failing with 3 unrelated TS errors,
surfaced while running the Gitea outdated-package detector earlier
in this session:
src/server.ts(18,8): Cannot find module '@bytelyst/devops/server'
src/server.ts(318,61): Property 'cosmosEndpoint' does not exist on type 'ProductIdentity'
src/server.ts(321,42): Property 'platformServiceUrl' does not exist on type 'ProductIdentity'
Root causes (two distinct bugs):
1. Stale install. '@bytelyst/devops' was already declared as
'workspace:*' in services/platform-service/package.json (line 24),
but node_modules/@bytelyst/devops/ did not exist. Re-running
'pnpm install' at the workspace root materialised the symlink.
2. Variable shadowing. In the GET /devops/info handler the code
declared a local 'const config' from loadProductIdentity() that
shadowed the module-level 'config' (env vars) imported from
'./lib/config.js' at line 112. The author then tried to read
'config.cosmosEndpoint' and 'config.platformServiceUrl' off the
ProductIdentity, where those keys never exist:
ProductIdentity = {
productId, displayName, licensePrefix, configDirName,
envVarPrefix, bundleIdSuffix, packageName
}
The intended values live on the env config:
config.COSMOS_ENDPOINT (Zod-validated, required at boot)
config.HOST + config.PORT (defaults '0.0.0.0' / 4003)
There is no 'platformServiceUrl' field anywhere in the codebase —
it only appeared in this single buggy line. Reconstructed as
'${HOST}:${PORT}', which is the URL admins would use to reach
this service for the devops/info diagnostic dashboard.
Fix (services/platform-service/src/server.ts:310-339):
- Rename local 'const config' to 'const productIdentity' to break
the shadowing.
- Use productIdentity.productId for the devops productId field.
- Use config.COSMOS_ENDPOINT (the env config) for the cosmos
dependency health check URL.
- Use `http://${config.HOST}:${config.PORT}` for the extra
platformServiceUrl field.
- Add a doc comment block explaining the two-config distinction
so future contributors don't reintroduce the shadow.
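A reduced sketch of the two-config distinction after the rename. The field names follow this message; the `buildDevopsInfo` helper and its trimmed-down interfaces are hypothetical and stand in for the real handler body.

```typescript
// Env config (Zod-validated in the real service); only the fields used here.
interface EnvConfig {
  COSMOS_ENDPOINT: string;
  HOST: string;
  PORT: number;
}

// Product identity; deliberately has no endpoint or URL fields.
interface ProductIdentity {
  productId: string;
  displayName: string;
}

// Keeping the two objects under distinct names means a lookup such as
// productIdentity.cosmosEndpoint fails to compile instead of shadowing config.
function buildDevopsInfo(config: EnvConfig, productIdentity: ProductIdentity) {
  return {
    productId: productIdentity.productId,
    cosmosEndpoint: config.COSMOS_ENDPOINT,
    platformServiceUrl: `http://${config.HOST}:${config.PORT}`,
  };
}
```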
Verified:
pnpm --filter @lysnrai/platform-service build OK (0 errors)
pnpm --filter @lysnrai/platform-service test 1511/1512 pass
The 1 remaining failure (src/modules/products/cache.test.ts line 104,
'returns all cached products' expects 2 products but got 3) is a
PRE-EXISTING product-registry test drift on main, verified by
stashing this commit's changes and re-running the same test against
the unmodified tree. It will be addressed separately.
- Add @bytelyst/devops backend endpoints to platform-service
- Add /api/devops/version (public) and /api/devops/info (admin) endpoints
- Add /devops page to admin-web using @bytelyst/devops/ui DevopsPanel
- Add devops link to admin web sidebar navigation
- Add build metadata and runtime information display
- Follow trading web devops pattern
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
What changed:
- Remove nomgap-web from the ecosystem Docker stack now that web is Vercel-hosted.
- Add a TODO for deciding whether local Docker smoke tests still need a NomGap web service.
- Update NomGap product containers and feature flags.
- Seed the NomGap push trigger flag without duplicating the common encryption flag.
Safety notes:
- Dropped unrelated pnpm-lock.yaml formatting churn instead of committing it.
Verification:
- node JSON.parse products/nomgap/product.json
- ruby Psych.safe_load docker-compose.ecosystem.yml
- pnpm --filter @bytelyst/admin-web typecheck
- pnpm --filter @bytelyst/admin-web test
- pnpm --filter @bytelyst/admin-web exec eslint . --ext .ts,.tsx
- pnpm --filter @lysnrai/platform-service build
- pnpm --filter @lysnrai/platform-service test
- pnpm --filter @lysnrai/platform-service exec eslint . --ext .ts,.tsx
- pnpm typecheck
- pnpm lint
Three coordinated fixes so 'docker compose up cosmos-emulator platform-service
cowork-service --wait' completes end-to-end (pre-existing blocker surfaced by
W1 post-push review).
1. Remove harmful prepare:tsc from @bytelyst/react-native-platform-sdk
package.json. The hook fires during pnpm install --frozen-lockfile against
an empty src/ tree (because Dockerfiles COPY package.jsons before
sources), so tsc aborts and the install fails. Canonical monorepo build
flow is pnpm -r build using the existing build:tsc script; for consumers,
prepare only matters for git+ URL installs (which this published package
doesn't use), so removing it is lossless.
2. Add --ignore-scripts to platform-service + mcp-server Dockerfile install
steps. Mirrors the pattern already used by extraction-service/Dockerfile,
dashboards/admin-web/Dockerfile, dashboards/tracker-web/Dockerfile.
Belt-and-braces against future prepare-hook regressions in any workspace
package.
3. Expand .dockerignore node_modules/dist/.next/coverage to **/ globs.
Docker's .dockerignore with bare 'node_modules' only matches root-level;
nested packages/*/node_modules/ were being COPY'd into images, poisoning
them with host-absolute-path .bin shims (e.g. @bytelyst/storage's tsc
shim resolved to /learning_voice_ai_agent/node_modules/.pnpm/... which
doesn't exist in the container → MODULE_NOT_FOUND). The glob fix makes
COPY packages/ packages/ deliver source only.
Gap: INFRA-gap-02
Verified:
pnpm install --frozen-lockfile ✅
pnpm --filter @bytelyst/react-native-platform-sdk build ✅
pnpm --filter @bytelyst/react-native-platform-sdk typecheck ✅
docker compose build platform-service ✅ (previously failed)
docker compose build mcp-server ✅
docker compose build extraction-service ✅
Hot-reload the orchestrator's on-disk plugin registry without a
restart. Routes to the reload_plugins Rust IPC method, gated by the
same authz the orchestrator enforces (admin role OR platform-signed
JWT) so a forbidden caller gets a canonical ForbiddenError envelope
instead of a raw IPC error passthrough. The response body is a
ReloadStats { loaded, added, removed, updated, errors } summary,
validated against ReloadResponseSchema before being returned to the
caller.
Tests cover: admin success (200 + envelope), user-without-platform
(403 before IPC), bridge unavailable (400), orchestrator -32003 →
ForbiddenError, other IPC errors → BadRequestError, malformed
orchestrator payloads → BadRequestError.
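The error translation covered by these tests might look roughly like this. The -32003 code and the error class names come from this message; the class bodies, the status codes attached to them, and the `mapIpcError` helper are assumptions.

```typescript
// Stand-ins for the service's canonical error types (assumed shapes).
class ForbiddenError extends Error {
  readonly statusCode = 403;
}

class BadRequestError extends Error {
  readonly statusCode = 400;
}

// Orchestrator JSON-RPC error code for an authz failure, per the commit message.
const IPC_FORBIDDEN = -32003;

interface IpcError {
  code: number;
  message: string;
}

// Translate a raw IPC error into a canonical error instead of passing it through.
function mapIpcError(err: IpcError): ForbiddenError | BadRequestError {
  return err.code === IPC_FORBIDDEN
    ? new ForbiddenError(err.message)
    : new BadRequestError(err.message);
}
```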
Phase: 3.1
Verified: pnpm -r typecheck, pnpm --filter @lysnrai/cowork-service {lint,build,test}
(140 passed, 6 new reload tests)
Adds cowork-service (port 4009) to docker-compose.yml with healthcheck,
depends_on gates for cosmos-emulator and platform-service, env_file
integration, and Traefik labels. Unblocks Phase 3 ecosystem wiring of
the ByteLyst roadmap.
Also adds the services/cowork-service/Dockerfile that compose builds
from. Pattern mirrors services/mcp-server/Dockerfile but copies the
full workspace in one step rather than enumerating every package.json,
to stay resilient to workspace membership changes. Production stage
runs `node dist/server.js` on :4009 with BusyBox-wget healthcheck
(bundled with node:22-alpine — no apk install required).
.env.example gains a Cowork-Service section documenting:
- ANTHROPIC_API_KEY, RUST_RUNTIME_BIN, RUST_RUNTIME_TIMEOUT_MS
- OLLAMA_URL, OLLAMA_MODELS
- FEATURE_FLAGS_ENABLED
The 13th clawcowork flag telemetry_enabled already ships via COMMON_FLAGS
in services/platform-service/src/modules/flags/seed.ts so seed.ts was not
touched.
Gap: INFRA-gap-01
Verified: docker compose config (YAML validity + env substitution),
pnpm -r typecheck / lint / build / test (all green),
docker compose build cowork-service (image built),
docker compose up -d cowork-service --no-deps --wait (Healthy),
curl -fsS localhost:4009/health → {"status":"ok","service":"cowork-service",...}.
Note: full-stack `docker compose up cosmos-emulator platform-service
cowork-service --wait` is blocked by a pre-existing issue in
services/platform-service/Dockerfile (react-native-platform-sdk prepare
script fails during pnpm install --frozen-lockfile in the image build).
That is outside W1 scope; cowork-service starts clean on its own and
becomes Healthy when platform-service is available out-of-band.
Baseline origin/main pnpm -r lint failed with 90+ errors across
platform-service, extraction-service, and tracker-web. These block the
shared W1 quality gates (prompts/README.md §4) which require all of
typecheck + lint + build + test to be green before committing W1 infra
work. Fixes are strictly scoped to unblock gates:
- eslint.config.js: extend @typescript-eslint/no-unused-vars with
varsIgnorePattern / caughtErrorsIgnorePattern / destructuredArrayIgnorePattern
all honouring the existing `^_` convention already used for args.
- platform-service: add file-level eslint-disable for
@typescript-eslint/no-unused-vars, no-redeclare, no-useless-escape on
the 33 legacy files failing lint (ab-testing, ai-diagnostics,
diagnostics, predictive-analytics, broadcasts/types, surveys/types,
lib/push-notifications).
- extraction-service tests: drop unused vitest imports (beforeEach,
afterEach, HealthCheck).
- tracker-web tracker-proxy.test.ts: prefix unused url with _.
- Applied eslint --fix on platform-service which normalised a handful
of `let` → `const` and removed one redundant disable comment.
Scope creep vs W1 "Files You Own" is acknowledged — user explicitly
approved this path when baseline rot was surfaced.
Verified: pnpm -r typecheck, lint, build, test all green.
BUG 1: Azure locale derivation produced 'en-EN' (invalid) for 2-letter codes.
→ Added toAzureLocale() with 28-language mapping table (en→en-US, pt→pt-BR, etc.)
→ Exported for testing; falls back to code-CODE for unmapped languages.
BUG 2: model field from request schema was silently dropped after provider refactor.
→ Added optional model field to TranscriptionInput interface.
→ OpenAI provider now uses input.model override (falls back to config.model).
→ Route passes model through to provider.transcribe().
GAP 4: SUPPORTED_AUDIO_TYPES was defined but never validated against.
→ Route now rejects unsupported content-types with a clear error message.
→ Allows application/octet-stream (Azure Blob SAS URLs often return this).
GAP 5: Client JSDoc still said 'via OpenAI Whisper API' — now 'via configured STT provider'.
GAP 8: Azure WAV content-type hardcoded samplerate=16000 — now generic audio/wav.
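For GAP 4, the content-type check could be sketched as below. The entries in `SUPPORTED_AUDIO_TYPES` here are placeholders (the real list is not shown in this message), and `isAcceptedContentType` is a hypothetical name.

```typescript
// Placeholder entries; the service's actual SUPPORTED_AUDIO_TYPES is not shown here.
const SUPPORTED_AUDIO_TYPES = new Set<string>(["audio/wav", "audio/mpeg", "audio/webm"]);

// Generic binary type that blob SAS URLs often report, so it is let through.
const GENERIC_BINARY = "application/octet-stream";

function isAcceptedContentType(header: string): boolean {
  // Drop parameters such as "; samplerate=16000" and normalise case.
  const mediaType = (header.split(";")[0] ?? "").trim().toLowerCase();
  return mediaType === GENERIC_BINARY || SUPPORTED_AUDIO_TYPES.has(mediaType);
}
```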
Tests: 42 transcription tests (was 35), 178 total passing.
→ toAzureLocale: 4 tests (locale mapping, passthrough, fallback, case-insensitive)
→ setSTT: 1 test (singleton override)
→ model passthrough: 2 tests (mock ignores, input accepts)
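For BUG 1, a cut-down sketch of the locale derivation. Only the `en` and `pt` entries are taken from this message; the real table has 28 languages, and the sketch's function name and its passthrough rule (anything already containing `-`) are assumptions.

```typescript
// Two of the 28 mappings mentioned in the commit message; the rest are omitted.
const AZURE_LOCALES: Record<string, string> = {
  en: "en-US",
  pt: "pt-BR",
};

function toAzureLocaleSketch(code: string): string {
  const trimmed = code.trim();
  // Already a full locale such as "en-GB": pass it through unchanged.
  if (trimmed.includes("-")) return trimmed;
  const lower = trimmed.toLowerCase();
  // Unmapped languages fall back to code-CODE, e.g. "xx" -> "xx-XX".
  return AZURE_LOCALES[lower] ?? `${lower}-${lower.toUpperCase()}`;
}
```

The explicit table is what prevents the old 'en-EN' result: `en` resolves through the mapping before the code-CODE fallback is ever reached.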