Backend: GET /fleet/jobs/:id/events/stream emits a snapshot (seq > Last-Event-ID)
then long-polls the append-only event log, closing after a bounded window so
EventSource-style clients reconnect cleanly. Honors Last-Event-ID resume,
keepalive comments, and a terminal error frame.
Frontend: subscribeJobEvents uses fetch streaming (to send auth + product
headers) with parseSseFrames, Last-Event-ID resume, reconnect backoff, and a
fatal-on-error-frame fallback to polling. Job detail page subscribes live
(deduped by seq), falls back to 4s polling on failure, and shows a Live badge;
refresh() now merges events so a slow snapshot can't clobber streamed ones.
Tests: +3 route (snapshot, resume cursor, append-after-connect), +5 client
(parseSseFrames x2, subscribe deliver/error/resume/error-frame). fleet 150, web 222.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- coordinator.costBurndown() aggregates completed run cost (insights.costUsd)
by UTC day over a window, returning a gap-free cumulative series + ceiling
- repository.listRunsByProduct() cross-partition run query
- GET /fleet/budgets/:productId/burndown?days=N route
- fleet-client.getBudgetBurndown() + CostBurndown/BurndownPoint types
- BurndownChart on the budget page: cumulative daily bars with a dashed
ceiling overlay; bars turn red past the ceiling; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 147, web 216)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds 'why does this job route here?' to the §7 scheduler:
- coordinator.explainJob() re-runs scoreCandidate against every live factory,
returning per-factory weighted breakdown, eligibility + reasons, deps state,
and the best eligible factory (read-only, side-effect free)
- GET /fleet/jobs/:id/explain route (404 when job missing)
- fleet-client.getJobExplain() + JobExplain/ScoreBreakdown types
- ExplainPanel on the job detail page: score table per factory with the six
weighted terms, eligibility, and unmet-deps note; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 144 green,
tracker-web 214 green)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- release lease with fenced epoch (leaseEpoch+1, clear holder) so a stale
renewal cannot resurrect a held lease after operator displacement
- reject on dead_letter / cancel on failed are now idempotent no-ops
(no epoch bump, no duplicate event)
- add coordinator test for terminal idempotency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ROADMAP_COMPLETION_AUDIT.md: verified state vs GIGAFACTORY_ROADMAP source of truth
- TASKS_TO_COMPLETE.md: prioritized remaining work with acceptance criteria
- Key finding: roadmap §0 tracker is stale (P2 ~95%, P3 ~70% actual vs 80%/0% claimed)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fleet overview page with factory cards + recent jobs polling
- Job table with stage filter tabs
- Job detail page with events timeline, runs, artifacts, DAG subtree, SHIP action
- Budget page with usage bar, pause/resume controls
- API proxy route forwarding /api/fleet/* to platform-service
- Typed fleet-client.ts with graceful 404 degradation
- 16 unit tests for fleet-client (198 total tracker-web tests green)
- Added Fleet nav item to dashboard layout
- Full monorepo build + test green
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Caddy was binding 0.0.0.0:443, which prevented tailscaled from claiming
100.87.53.10:443 for `tailscale serve --https=443`. Restricting Caddy to
the public eth0 IP (187.124.159.82) keeps the public api.bytelyst.com /
devops.bytelyst.com routing intact while freeing the Tailscale IP so the
tailnet-only dashboard URL (https://srv1491630.tailf85608.ts.net) is
reachable again.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
docker-compose:
- Drop the cosmos-emulator service block. Both image variants we
tried were unfit for the prototype: `:vnext-preview` returned
plain-text PGCosmosError strings that crashed @azure/cosmos at
JSON.parse, and `:latest` core-dumped under load. The container
has been Exited(255) for weeks and was blocking depends_on chains.
- Real Azure Cosmos account `cosmos-mywisprai` (db `bytelyst`,
West US 2) is now the single source of truth; all services pick
up COSMOS_ENDPOINT/KEY/DATABASE from `.env` (already mounted via
`env_file: .env`).
- extraction-service: drop hardcoded `COSMOS_ENDPOINT=…cosmos-emulator…`,
`NODE_TLS_REJECT_UNAUTHORIZED=0`, and `depends_on: cosmos-emulator`.
- cowork-service: same cleanup.
cowork-service IPC bridge:
- Add `error` listeners to the spawned child's stdin/stdout/stderr.
Without them, an EPIPE on stdin (child died mid-write) or a
teardown-time stream error surfaced as an unhandled error and
crashed vitest after all 140 tests had passed.
- Removes the only failing recursive test in the workspace.
Test status after this commit:
- 94 workspace packages, all green
- cowork-service: 19 passed | 1 skipped (140 tests)
- platform-service: 131 test files passed
- extraction-service: 13 test files passed
- All other packages: passing
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The vnext-preview (Postgres-backed) image returned PGCosmosError plaintext
for cross-partition queryFeed calls, crashing @azure/cosmos at JSON.parse.
:latest is HTTPS-only with a self-signed cert, so consumers are gated by
NODE_TLS_REJECT_UNAUTHORIZED=0 (dev-prototype only). platform-service now
points at the real Azure Cosmos account (per .env), so its dependency on
the local emulator service is removed.
In-process tracker<->fleet bridge — no shell hop. Closes the §10 "direct
tracker->module calls" box.
- tracker-bridge.ts (new):
* ingestItemAsJob(productId, itemId, opts?) — reads the Item via the items
repository (foreign/unknown → NotFoundError), maps title/description → bodyMd
(verbatim) + labels (engine-class:/profile:/priority:/cap:) → manifest hints,
sets trackerItemId + a stable idempotency-key `tracker-<itemId>`, and submits
through coordinator.submitJob — so re-ingest dedupes and the job is scheduled by
the §7 router via the unchanged claim path.
* echoJobToItem(productId, jobId, log?) — mirrors stage → Item status
(queued/assigned/building/review/testing → in_progress; shipped → done;
failed/dead_letter → wont_fix) + a metrics-ONLY comment (attempts/duration/
tokens/cost — never the prompt body/secrets). Idempotent via the job's
`trackerEchoedStatus`; best-effort + non-fatal (items-write failure →
{ echoed: null, error }, never thrown into the job lifecycle). productId-scoped.
- Auto-echo wired into the PATCH + lease/release transitions, GATED by
FLEET_TRACKER_ECHO (default OFF → behavior byte-for-byte unchanged); never blocks
or fails the transition.
- Routes (additive): POST /fleet/tracker/ingest, POST /fleet/tracker/echo
(auth + getRequestProductId, productId-scoped).
- types.ts: optional FleetJobDoc.trackerEchoedStatus (reuses the existing
trackerItemId field; no parallel schema) + Ingest/Echo request schemas.
- repository.ts: setTrackerEchoedStatus (no rev bump — never interferes with the
fenced claim CAS).
Reuses the items + comments contracts directly (no HTTP). Does not touch
claimNextJob or the scheduler. productId on every doc; no any/console.log.
Some keyboard events (dead keys, modifier-only presses) have e.key
as undefined. Similarly, malformed shortcut definitions may lack a key.
Added early-return guards to prevent TypeError.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- packages/llm: use nullish coalescing (??) in GeminiProvider constructor
so explicit empty-string apiKey is not overridden by env var
- dashboards/admin-web,tracker-web: exclude .next/ from vitest test glob
to prevent Next.js internal test files from being picked up
- services/cowork-service: use platform-safe .kill() instead of SIGTERM
which is invalid on Windows
- packages/use-keyboard-shortcuts: add @testing-library/react devDep
- scripts/npmrc.template: use https:// for Gitea registry
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds factory enrollment + a scoped, rotatable credential model for the fleet
coordinator (trust boundary, §12/§18). Tokens are stored HASHED at rest (sha256 —
the same primitive the auth module uses for verify/magic-link tokens); the
high-entropy plaintext is returned exactly once at enroll/rotate and never persisted.
- enrollment.ts: enrollFactory (create/link factory + issue token), rotateToken
(new active token; prior marked `rotating` with a grace overlap so an in-flight
worker isn't cut off), revokeToken (immediate), verifyToken (constant-time hash
compare; revoked/expired-grace → null; updates lastUsedAt). Scope = {productId,
factoryId, capabilities[]}.
- Gated enforcement: enforceFactoryToken() on POST /fleet/factories/heartbeat and
POST /fleet/claim, active only when FLEET_REQUIRE_FACTORY_TOKEN is on (default
OFF — existing behavior/tests unchanged). When on: missing/invalid/revoked → 401;
out-of-scope productId/capability/factory → 403; and the claim is CONSTRAINED to
the verified token scope. Does not touch scheduler scoring or the claim CAS.
- types.ts: FleetFactoryTokenDoc + Enroll/Rotate/Revoke request schemas.
- repository.ts: fleet_factory_tokens collection + CRUD + findByHash.
- routes.ts (additive): POST /fleet/factories/enroll, /:id/token/rotate,
/:id/token/revoke (user auth + productId + Zod).
- cosmos-init.ts: register fleet_factory_tokens (/productId).
Also hardens the artifact routes (review fixes): listArtifactsByJob is now
productId-scoped (GET /fleet/jobs/:id/artifacts threads the request productId), and
artifact upload uses the request/auth productId authoritatively (a spoofed
body.productId no longer overrides it).
Tokens hashed at rest; plaintext shown once; no new crypto schemes; productId on
every doc; no any/console.log; enforcement default OFF.
The Gitea instance runs on default port 80, not 3300. Updated the
npmrc template and AGENTS.md references accordingly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Include Gitea npm registry variables (token, host, owner) so
developers know which env vars to set for @bytelyst package access.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.
Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.
Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.
Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
Adds an additive, backward-compatible conditional write to the datastore abstraction
so consumers can do true single-winner compare-and-swap:
updateIfMatch(id, partitionKey, expected: { etag?, rev? }, patch)
-> { ok: true, doc } | { ok: false, reason: 'conflict' | 'not_found' }
- types: ConcurrencyToken + UpdateIfMatchResult; optional _etag on BaseDocument
(provider-managed, surfaced on reads); new method on DocumentCollection; exported.
- memory provider: get -> compare -> set in ONE synchronous block (no await/yield),
so two concurrent callers cannot interleave under the single-threaded event loop —
the first wins and bumps rev + _etag, the rest get conflict. True in-process atomicity.
- cosmos provider: conditional replace with accessCondition { type: 'IfMatch',
condition: _etag }; translates Cosmos 412 -> conflict, 404 -> not_found; also
compares/bumps rev for parity.
Existing method signatures are unchanged (additive only). Tests: memory match/stale/
missing + an N-concurrent Promise.all atomicity proof; cosmos If-Match mapping via a
fake @azure/cosmos (match writes, stale etag -> 412 conflict, missing -> not_found).
Foundation review: 50 fleet tests green, build clean, no regressions. NOTE: the
atomic-claim uses an in-module rev-CAS over an unconditional datastore write —
true cross-process atomicity requires an If-Match/_etag-conditional update in
@bytelyst/datastore (tracked P0 hardening follow-up before P2-S3).
Guarded REST under /api (auth + productId, like items): POST /fleet/jobs (idempotent
submit), GET /fleet/jobs (by stage/idempotencyKey), GET /fleet/jobs/:id, PATCH
/fleet/jobs/:id (fenced transition), POST /fleet/claim, lease renew/release,
factories/heartbeat, and runs/events streams. Every body validated with the Zod
schemas; fenced/conflict map to 409, missing to 404, invalid to 400. Registers
fleetRoutes in server.ts next to itemRoutes. Routes tested via Fastify inject on
the memory provider (real coordinator).