Make the Agent Gigafactory fully drivable end-to-end via the API:
- lease/release now accepts `insights` (model/tokens/cost) + `result`, recorded
on the current run with endedAt — factories report cost/token metrics on
completion (previously no API existed; runs stayed insights:{}).
- add `ship` operator action so a job in `testing` (where the review gate left
no lease holder) can reach the terminal `shipped` stage. Idempotent.
- operatorAction now retries on optimistic-concurrency conflict with backoff
(mirrors submitReview) so a ship right after approve survives real-Cosmos
read-after-write lag instead of a spurious 409.
Tests: +2 coordinator (ship idempotent, release records insights) and +2 route
integration (gated submit->...->ship->metrics; release-with-insights). 170 pass.
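The retry-on-conflict behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual operatorAction code: `retryOnConflict`, the `Attempt` result shape, and the default attempt count and delay are all assumptions made for the example.

```typescript
// Assumed result shape, modelled on the ok/conflict outcome of a rev CAS.
type Attempt<T> = { ok: true; value: T } | { ok: false; reason: "conflict" };

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Re-run `attempt` while it reports a conflict, doubling the delay each time.
// After `maxAttempts` tries the last conflict is returned to the caller,
// which can then map it to a 409.
async function retryOnConflict<T>(
  attempt: () => Promise<Attempt<T>>,
  maxAttempts: number = 4, // hypothetical default
  baseDelayMs: number = 5, // hypothetical default
): Promise<Attempt<T>> {
  let last: Attempt<T> = { ok: false, reason: "conflict" };
  for (let i = 0; i < maxAttempts; i++) {
    last = await attempt();
    if (last.ok) return last;
    if (i + 1 < maxAttempts) await sleep(baseDelayMs * 2 ** i);
  }
  return last;
}
```

Each retry is expected to re-read the job inside `attempt`, so a ship issued right after approve eventually sees the post-approve revision.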
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Rename docs/gigafactory/ to docs/GIGAFACTORY/ and update the cross-repo
source-of-truth references in the fleet README and types.ts comment. Add an
index README listing the platform docs and pointing to the canonical spec in
learning_ai_devops_tools. Docs/comment only; no behavior change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move ROADMAP_COMPLETION_AUDIT.md, TASKS_TO_COMPLETE.md,
gigafactory-phase3-progress.md and FLEET_CONTROL_PLANE.md under
docs/gigafactory/ so the scattered Gigafactory docs are easy to discover.
Update intra-doc and cross-repo source-of-truth references (fleet README
and types.ts comment) to the new agent-queue/docs/gigafactory/ path.
Pure docs/comment move; no behavior change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements the §14 Phase 3 review gate. requestReview() routes a building
job into the review stage (fencing any worker), carrying a normalized policy
(requiredApprovals + reviewer allowlist) and clearing prior decisions.
submitReview() records one decision per reviewer (last-write-wins, identity-
normalized), advances the job to testing once distinct approvals reach the
quorum, and treats any reject as a veto that returns the job to queued for
rework. Adds POST /fleet/jobs/:id/review/request and POST /fleet/jobs/:id/review,
a typed client, and a review-gate card on the job-detail page.
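A rough sketch of the quorum and veto rule described above. The function name `submitDecision`, the `Outcome` labels, the lower-casing used as identity normalization, and the "empty allowlist means anyone may review" rule are assumptions for illustration; the real submitReview() may differ.

```typescript
type Decision = "approve" | "reject";
interface ReviewPolicy {
  requiredApprovals: number;
  reviewers: string[]; // allowlist; empty means anyone may review (assumption)
}
type Outcome = "pending" | "testing" | "queued";

// Assumed normalization: trim and lower-case the reviewer identity.
const normalize = (id: string): string => id.trim().toLowerCase();

function submitDecision(
  decisions: Map<string, Decision>,
  policy: ReviewPolicy,
  reviewer: string,
  decision: Decision,
): Outcome {
  const who = normalize(reviewer);
  const allow = policy.reviewers.map(normalize);
  if (allow.length > 0 && !allow.some((r) => r === who)) {
    throw new Error("reviewer not on allowlist: " + who);
  }
  decisions.set(who, decision); // one decision per reviewer, last write wins
  if (decision === "reject") return "queued"; // any reject is a veto
  const approvals = Array.from(decisions.values()).filter((d) => d === "approve").length;
  return approvals >= policy.requiredApprovals ? "testing" : "pending";
}
```

Because decisions are keyed by the normalized identity, the same reviewer approving twice still counts as one approval toward the quorum.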
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds coordinator.fleetMetrics() computing queue depth, stage histogram,
oldest-queued age (starvation signal), factory health and seat utilization,
plus derived alerts (no_live_capacity, all_factories_down, queue_starvation,
saturated, stale_factories). Exposed via GET /fleet/metrics and surfaced as a
metrics+alerts panel on the fleet overview. Thresholds injectable for tests.
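One way the derived alerts could be computed from the raw metrics is sketched below. The alert names come from the commit; the exact trigger conditions, the `FactoryHealth` and `Thresholds` shapes, and the `deriveAlerts` name are assumptions, since the commit does not spell them out.

```typescript
type Alert =
  | "no_live_capacity"
  | "all_factories_down"
  | "queue_starvation"
  | "saturated"
  | "stale_factories";

interface FactoryHealth { id: string; live: boolean; seatsTotal: number; seatsBusy: number }
interface Thresholds { starvationMs: number; saturation: number } // injectable, e.g. for tests

function deriveAlerts(
  queueDepth: number,
  oldestQueuedAgeMs: number | null,
  factories: FactoryHealth[],
  t: Thresholds,
): Alert[] {
  const live = factories.filter((f) => f.live);
  const total = live.reduce((n, f) => n + f.seatsTotal, 0);
  const busy = live.reduce((n, f) => n + f.seatsBusy, 0);
  const alerts: Alert[] = [];
  // Assumed conditions below; each one is a guess at the intent of the alert name.
  if (factories.length > 0 && live.length === 0) alerts.push("all_factories_down");
  if (queueDepth > 0 && total - busy <= 0) alerts.push("no_live_capacity");
  if (oldestQueuedAgeMs !== null && oldestQueuedAgeMs > t.starvationMs) alerts.push("queue_starvation");
  if (total > 0 && busy / total >= t.saturation) alerts.push("saturated");
  if (live.length > 0 && live.length < factories.length) alerts.push("stale_factories");
  return alerts;
}
```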
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Backend: GET /fleet/jobs/:id/events/stream emits a snapshot (seq > Last-Event-ID)
then long-polls the append-only event log, closing after a bounded window so
EventSource-style clients reconnect cleanly. Honors Last-Event-ID resume,
keepalive comments, and a terminal error frame.
Frontend: subscribeJobEvents uses fetch streaming (to send auth + product
headers) with parseSseFrames, Last-Event-ID resume, reconnect backoff, and a
fatal-on-error-frame fallback to polling. Job detail page subscribes live
(deduped by seq), falls back to 4s polling on failure, and shows a Live badge;
refresh() now merges events so a slow snapshot can't clobber streamed ones.
Tests: +3 route (snapshot, resume cursor, append-after-connect), +5 client
(parseSseFrames x2, subscribe deliver/error/resume/error-frame). fleet 150, web 222.
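For readers unfamiliar with SSE framing, a frame parser along these lines is one plausible shape for parseSseFrames. The signature and the `{ frames, rest }` return value are assumptions; only the function name comes from the commit. Frames are separated by a blank line, lines starting with `:` are keepalive comments, and an incomplete trailing frame is handed back so the caller can prepend it to the next chunk.

```typescript
interface SseFrame { id?: string; event?: string; data: string }

function parseSseFrames(buffer: string): { frames: SseFrame[]; rest: string } {
  const chunks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = chunks.pop() ?? ""; // last piece has no terminating blank line yet
  const frames: SseFrame[] = [];
  for (const chunk of chunks) {
    const frame: SseFrame = { data: "" };
    const data: string[] = [];
    for (const line of chunk.split("\n")) {
      if (line === "" || line.startsWith(":")) continue; // keepalive comment
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
      if (field === "id") frame.id = value;
      else if (field === "event") frame.event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length === 0) continue; // nothing to deliver (e.g. comment-only frame)
    frame.data = data.join("\n");
    frames.push(frame);
  }
  return { frames, rest };
}
```

The `id` of the last delivered frame is what a client would send back as Last-Event-ID when it reconnects.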
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- coordinator.costBurndown() aggregates completed run cost (insights.costUsd)
by UTC day over a window, returning a gap-free cumulative series + ceiling
- repository.listRunsByProduct() cross-partition run query
- GET /fleet/budgets/:productId/burndown?days=N route
- fleet-client.getBudgetBurndown() + CostBurndown/BurndownPoint types
- BurndownChart on the budget page: cumulative daily bars with a dashed
ceiling overlay; bars turn red past the ceiling; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 147, web 216)
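The gap-free cumulative series can be illustrated with a small sketch. The `RunCost` input shape and the explicit `now` parameter are assumptions for the example; `BurndownPoint` is named in the commit, but its fields here are guessed.

```typescript
interface RunCost { endedAt: string; costUsd: number } // endedAt as an ISO-8601 timestamp
interface BurndownPoint { day: string; costUsd: number; cumulativeUsd: number }

const DAY_MS = 86400000;
const utcDay = (ms: number): string => new Date(ms).toISOString().slice(0, 10);

function costBurndown(runs: RunCost[], days: number, now: Date): BurndownPoint[] {
  // Sum completed-run cost per UTC day.
  const perDay = new Map<string, number>();
  for (const run of runs) {
    const day = utcDay(Date.parse(run.endedAt));
    perDay.set(day, (perDay.get(day) ?? 0) + run.costUsd);
  }
  // Walk every day in the window so days without runs still get a point.
  const todayStart = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  const points: BurndownPoint[] = [];
  let cumulative = 0;
  for (let i = days - 1; i >= 0; i--) {
    const day = utcDay(todayStart - i * DAY_MS);
    const costUsd = perDay.get(day) ?? 0;
    cumulative += costUsd;
    points.push({ day, costUsd, cumulativeUsd: cumulative });
  }
  return points;
}
```

Iterating over the window rather than over the runs is what keeps the series gap-free: a day with no completed runs contributes a zero-cost point and carries the cumulative total forward.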
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a 'why does this job route here?' explanation to the §7 scheduler:
- coordinator.explainJob() re-runs scoreCandidate against every live factory,
returning per-factory weighted breakdown, eligibility + reasons, deps state,
and the best eligible factory (read-only, side-effect free)
- GET /fleet/jobs/:id/explain route (404 when job missing)
- fleet-client.getJobExplain() + JobExplain/ScoreBreakdown types
- ExplainPanel on the job detail page: score table per factory with the six
weighted terms, eligibility, and unmet-deps note; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 144 green,
tracker-web 214 green)
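As a loose illustration of a per-factory weighted breakdown, the sketch below scores each candidate and picks the best eligible one. The six real scoring terms are not listed in the commit, so term names are left to the caller; `explain`, `Candidate`, and `Explained` are hypothetical names, not the actual explainJob()/scoreCandidate API.

```typescript
type Terms = Record<string, number>;

interface Candidate {
  factoryId: string;
  terms: Terms;      // raw (unweighted) term values for this factory
  reasons: string[]; // ineligibility reasons; empty means eligible
}

interface Explained {
  factoryId: string;
  eligible: boolean;
  reasons: string[];
  weighted: Terms;
  total: number;
}

// Pure function: it reads its inputs and returns a report, with no side effects.
function explain(cands: Candidate[], weights: Terms): { factories: Explained[]; best: string | null } {
  const factories: Explained[] = cands.map((c) => {
    const weighted: Terms = {};
    let total = 0;
    for (const name of Object.keys(weights)) {
      const w = (weights[name] ?? 0) * (c.terms[name] ?? 0);
      weighted[name] = w;
      total += w;
    }
    return { factoryId: c.factoryId, eligible: c.reasons.length === 0, reasons: c.reasons, weighted, total };
  });
  let best: Explained | null = null;
  for (const f of factories) {
    if (f.eligible && (best === null || f.total > best.total)) best = f;
  }
  return { factories, best: best ? best.factoryId : null };
}
```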
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- release lease with fenced epoch (leaseEpoch+1, clear holder) so a stale
renewal cannot resurrect the lease after an operator displaces its holder
- reject on dead_letter / cancel on failed are now idempotent no-ops
(no epoch bump, no duplicate event)
- add coordinator test for terminal idempotency
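The fenced release can be pictured with a minimal sketch. The `Lease` shape and the `release`/`renew` helpers are simplified stand-ins, not the coordinator's real functions.

```typescript
interface Lease { holder: string | null; leaseEpoch: number }

// Releasing bumps the epoch and clears the holder, so any caller still
// carrying the old epoch is fenced out from then on.
function release(lease: Lease): Lease {
  return { holder: null, leaseEpoch: lease.leaseEpoch + 1 };
}

// A renewal succeeds only for the current holder presenting the current epoch.
function renew(lease: Lease, worker: string, callerEpoch: number): "ok" | "fenced" {
  if (callerEpoch < lease.leaseEpoch || lease.holder !== worker) return "fenced";
  return "ok";
}
```

Without the epoch bump, a renewal carrying the old epoch would still match and could bring the lease back after the operator had taken it away.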
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In-process tracker<->fleet bridge — no shell hop. Closes the §10 "direct
tracker->module calls" box.
- tracker-bridge.ts (new):
* ingestItemAsJob(productId, itemId, opts?) — reads the Item via the items
repository (foreign/unknown → NotFoundError), maps title/description → bodyMd
(verbatim) + labels (engine-class:/profile:/priority:/cap:) → manifest hints,
sets trackerItemId + a stable idempotency-key `tracker-<itemId>`, and submits
through coordinator.submitJob — so re-ingest dedupes and the job is scheduled by
the §7 router via the unchanged claim path.
* echoJobToItem(productId, jobId, log?) — mirrors stage → Item status
(queued/assigned/building/review/testing → in_progress; shipped → done;
failed/dead_letter → wont_fix) + a metrics-ONLY comment (attempts/duration/
tokens/cost — never the prompt body/secrets). Idempotent via the job's
`trackerEchoedStatus`; best-effort + non-fatal (items-write failure →
{ echoed: null, error }, never thrown into the job lifecycle). productId-scoped.
- Auto-echo wired into the PATCH + lease/release transitions, GATED by
FLEET_TRACKER_ECHO (default OFF → behavior byte-for-byte unchanged); never blocks
or fails the transition.
- Routes (additive): POST /fleet/tracker/ingest, POST /fleet/tracker/echo
(auth + getRequestProductId, productId-scoped).
- types.ts: optional FleetJobDoc.trackerEchoedStatus (reuses the existing
trackerItemId field; no parallel schema) + Ingest/Echo request schemas.
- repository.ts: setTrackerEchoedStatus (no rev bump — never interferes with the
fenced claim CAS).
Reuses the items + comments contracts directly (no HTTP). Does not touch
claimNextJob or the scheduler. productId on every doc; no any/console.log.
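The stage-to-status mirroring and its idempotency check could look roughly like this. The mapping itself is taken from the commit text; the helper names, the `JobEchoState` shape, and the treatment of `blocked` (not listed in the mapping, so no echo here) are assumptions.

```typescript
type Stage =
  | "queued" | "blocked" | "assigned" | "building" | "review"
  | "testing" | "shipped" | "failed" | "dead_letter";
type ItemStatus = "in_progress" | "done" | "wont_fix";

function stageToItemStatus(stage: Stage): ItemStatus | null {
  switch (stage) {
    case "queued":
    case "assigned":
    case "building":
    case "review":
    case "testing":
      return "in_progress";
    case "shipped":
      return "done";
    case "failed":
    case "dead_letter":
      return "wont_fix";
    default:
      return null; // "blocked" is not in the documented mapping (assumed: no echo)
  }
}

interface JobEchoState { stage: Stage; trackerEchoedStatus?: ItemStatus }

// Returns the status to write to the Item, or null when there is nothing new
// to echo. This is what makes a repeated echo a no-op.
function nextEcho(job: JobEchoState): ItemStatus | null {
  const status = stageToItemStatus(job.stage);
  return status !== null && status !== job.trackerEchoedStatus ? status : null;
}
```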
Adds factory enrollment + a scoped, rotatable credential model for the fleet
coordinator (trust boundary, §12/§18). Tokens are stored HASHED at rest (sha256 —
the same primitive the auth module uses for verify/magic-link tokens); the
high-entropy plaintext is returned exactly once at enroll/rotate and never persisted.
- enrollment.ts: enrollFactory (create/link factory + issue token), rotateToken
(new active token; prior marked `rotating` with a grace overlap so an in-flight
worker isn't cut off), revokeToken (immediate), verifyToken (constant-time hash
compare; revoked/expired-grace → null; updates lastUsedAt). Scope = {productId,
factoryId, capabilities[]}.
- Gated enforcement: enforceFactoryToken() on POST /fleet/factories/heartbeat and
POST /fleet/claim, active only when FLEET_REQUIRE_FACTORY_TOKEN is on (default
OFF — existing behavior/tests unchanged). When on: missing/invalid/revoked → 401;
out-of-scope productId/capability/factory → 403; and the claim is CONSTRAINED to
the verified token scope. Does not touch scheduler scoring or the claim CAS.
- types.ts: FleetFactoryTokenDoc + Enroll/Rotate/Revoke request schemas.
- repository.ts: fleet_factory_tokens collection + CRUD + findByHash.
- routes.ts (additive): POST /fleet/factories/enroll, /:id/token/rotate,
/:id/token/revoke (user auth + productId + Zod).
- cosmos-init.ts: register fleet_factory_tokens (/productId).
Also hardens the artifact routes (review fixes): listArtifactsByJob is now
productId-scoped (GET /fleet/jobs/:id/artifacts threads the request productId), and
artifact upload uses the request/auth productId authoritatively (a spoofed
body.productId no longer overrides it).
Tokens hashed at rest; plaintext shown once; no new crypto schemes; productId on
every doc; no any/console.log; enforcement default OFF.
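A minimal sketch of the hashed-at-rest token scheme using Node's built-in crypto module. `issueToken` and `matchesStoredHash` are hypothetical helper names, and the 32-byte token length and hex encoding are assumptions; only the sha256 hashing, the show-once plaintext, and the constant-time compare come from the commit.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

const sha256Hex = (s: string): string => createHash("sha256").update(s).digest("hex");

// The plaintext is returned to the caller exactly once; only the hash is stored.
function issueToken(): { plaintext: string; hash: string } {
  const plaintext = randomBytes(32).toString("hex"); // assumed length and encoding
  return { plaintext, hash: sha256Hex(plaintext) };
}

// Hash the presented token and compare it to the stored hash in constant time.
function matchesStoredHash(presented: string, storedHash: string): boolean {
  const a = Buffer.from(sha256Hex(presented), "hex");
  const b = Buffer.from(storedHash, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

A full verifyToken would additionally check the revoked and grace-expiry state of the token document and update lastUsedAt, which this sketch leaves out.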
The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.
Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.
Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.
Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
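The difference between read-check-write and an indivisible compare-and-set can be shown with a small in-memory sketch. `MemoryStore`, `Doc`, and this `tryClaim` are simplified stand-ins for the datastore and coordinator, not their real APIs. The point is that `updateIfMatch` contains no `await` between the compare and the write, so on a single-threaded event loop nothing can interleave between them.

```typescript
interface Doc { id: string; rev: number; assignedTo: string | null }

class MemoryStore {
  private docs = new Map<string, Doc>();

  put(doc: Doc): void {
    this.docs.set(doc.id, { ...doc });
  }

  async get(id: string): Promise<Doc | undefined> {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : undefined;
  }

  // Compare and write happen in one synchronous step (no await in between).
  async updateIfMatch(id: string, expectedRev: number, patch: Partial<Doc>): Promise<"ok" | "conflict"> {
    const current = this.docs.get(id);
    if (!current || current.rev !== expectedRev) return "conflict";
    this.docs.set(id, { ...current, ...patch, rev: current.rev + 1 });
    return "ok";
  }
}

async function tryClaim(store: MemoryStore, jobId: string, factoryId: string): Promise<"ok" | "conflict"> {
  const job = await store.get(jobId); // every concurrent contender may read the same rev here
  if (!job || job.assignedTo !== null) return "conflict";
  return store.updateIfMatch(jobId, job.rev, { assignedTo: factoryId });
}
```

Run under `Promise.all`, every contender reads rev 0, but only the first `updateIfMatch` finds the stored rev still at 0; the rest see rev 1 and report a conflict.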
Guarded REST under /api (auth + productId, like items): POST /fleet/jobs (idempotent
submit), GET /fleet/jobs (by stage/idempotencyKey), GET /fleet/jobs/:id, PATCH
/fleet/jobs/:id (fenced transition), POST /fleet/claim, lease renew/release,
factories/heartbeat, and runs/events streams. Every body validated with the Zod
schemas; fenced/conflict map to 409, missing to 404, invalid to 400. Registers
fleetRoutes in server.ts next to itemRoutes. Routes tested via Fastify inject on
the memory provider (real coordinator).
The concurrency core (§4/§7/§8/§18/§25):
- claimNextJob: priority+age selection over queued/dep-satisfied jobs whose caps
are a subset of the factory's, then tryClaimJob does a rev CAS to flip to
assigned + acquire the lease — exactly one contender wins, no double-assignment.
- leases + fencing: acquire/reclaim bumps leaseEpoch; patchJobFenced/renew/release
reject a call whose leaseEpoch < job.leaseEpoch (zombie worker can't overwrite).
- heartbeat + isFactoryStale for factory liveness.
- reapExpiredLeases: returns expired-lease jobs to queued/blocked, bumps the epoch
(fencing the dead holder), preserves the checkpoint pointer (resume), marks the
lease expired; idempotent. Documents why Cosmos TTL cannot do this.
- submit: idempotent (dedup/supersede/409) + submit-time dependency cycle
detection; deps gating (shipped, or testing when depsMode:soft).
Tests drive the atomic-claim race, fencing, and reaper deterministically via the
rev CAS (no real threads).
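The candidate selection step of claimNextJob can be sketched as a filter plus a sort. The `Job` shape, the `pickNext` name, and the ordering convention (higher `priority` first, then older `createdAt`) are assumptions; the commit only says selection is by priority and age over queued, dependency-satisfied jobs whose caps are a subset of the factory's.

```typescript
interface Job {
  id: string;
  stage: string;
  priority: number;  // assumed: larger means more urgent
  createdAt: number; // epoch milliseconds
  caps: string[];    // capabilities the job requires
  deps: string[];    // ids of jobs that must be satisfied first
}

function pickNext(jobs: Job[], factoryCaps: string[], satisfiedDeps: Set<string>): Job | null {
  const caps = new Set(factoryCaps);
  const eligible = jobs.filter(
    (j) =>
      j.stage === "queued" &&
      j.caps.every((c) => caps.has(c)) &&       // job caps must be a subset of factory caps
      j.deps.every((d) => satisfiedDeps.has(d)), // all dependencies satisfied
  );
  eligible.sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt);
  return eligible[0] ?? null;
}
```

The selected job is only a candidate: the rev CAS in tryClaimJob still decides which contender actually wins it.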
One repository per fleet_* container on the @bytelyst/datastore abstraction
(memory + cosmos): create/getById/list (by productId, stage, idempotencyKey),
partition-aware single-partition queries, ordered append-only appendEvent, and
runs/leases/factories/profiles/artifacts CRUD. Adds revUpdateJob/revUpdateLease —
a `rev`-token compare-and-swap that writes only when the stored rev still matches
(the optimistic-concurrency primitive for atomic claim + fenced transitions;
maps to Cosmos _etag/If-Match in production).
Adds the agent-gigafactory fleet data model (modules/fleet/types.ts): Zod schemas
as the source of truth with inferred types (no `any`) for the 7 durable containers
— FleetJobDoc, FleetRunDoc, FleetLeaseDoc, FleetFactoryDoc, FleetProfileDoc,
FleetEventDoc, FleetArtifactDoc — each carrying productId. Lifecycle stages mirror
the agent-queue gigafactory spec (queued|blocked|assigned|building|review|testing|
shipped|failed|dead_letter). Registers fleet_* containers with their partition keys
(/productId for jobs/factories/profiles, /jobId for runs/leases/events/artifacts).
CI run 67 surfaced a real test failure:
src/modules/products/cache.test.ts:104
getAllProducts > returns all cached products
expected [ { id: 'lysnrai', …(11) }, …(2) ] to have a length of 2
but got 3
Root cause: cache.ts has a TEMPORARY_FALLBACK_PRODUCTS map (currently
just 'invttrdg') that getAllProducts() merges into its return value
on top of the loaded cache. The test fixture loads 2 products
(lysnrai, mindlyst), so the actual return is 3 — the test was
written before the fallback shim landed and never got updated.
Two ways to reconcile: (a) make the test reflect today's behaviour,
or (b) gut the fallback. The cache.ts comment explicitly marks
the fallback as 'TODO(platform): remove after creating the real
product …', so the shim is intentional until that product exists.
The right move is therefore (a): keep the shim in place and
make the test enforce the documented contract.
- assertion now: toHaveLength(3) + .toContain('invttrdg')
- inline comment ties the expectation back to cache.ts so a
future cleanup removing the fallback will obviously need to
drop it back to 2
Verified locally:
pnpm vitest run cache.test.ts -> 8/8 pass
What changed:
- Remove nomgap-web from the ecosystem Docker stack now that web is Vercel-hosted.
- Add a TODO for deciding whether local Docker smoke tests still need a NomGap web service.
- Update NomGap product containers and feature flags.
- Seed the NomGap push trigger flag without duplicating the common encryption flag.
Safety notes:
- Dropped unrelated pnpm-lock.yaml formatting churn instead of committing it.
Verification:
- node JSON.parse products/nomgap/product.json
- ruby Psych.safe_load docker-compose.ecosystem.yml
- pnpm --filter @bytelyst/admin-web typecheck
- pnpm --filter @bytelyst/admin-web test
- pnpm --filter @bytelyst/admin-web exec eslint . --ext .ts,.tsx
- pnpm --filter @lysnrai/platform-service build
- pnpm --filter @lysnrai/platform-service test
- pnpm --filter @lysnrai/platform-service exec eslint . --ext .ts,.tsx
- pnpm typecheck
- pnpm lint
Baseline origin/main pnpm -r lint failed with 90+ errors across
platform-service, extraction-service, and tracker-web. These block the
shared W1 quality gates (prompts/README.md §4) which require all of
typecheck + lint + build + test to be green before committing W1 infra
work. Fixes are strictly scoped to unblock gates:
- eslint.config.js: extend @typescript-eslint/no-unused-vars with
varsIgnorePattern / caughtErrorsIgnorePattern / destructuredArrayIgnorePattern
all honouring the existing `^_` convention already used for args.
- platform-service: add file-level eslint-disable for
@typescript-eslint/no-unused-vars, no-redeclare, no-useless-escape on
the 33 legacy files failing lint (ab-testing, ai-diagnostics,
diagnostics, predictive-analytics, broadcasts/types, surveys/types,
lib/push-notifications).
- extraction-service tests: drop unused vitest imports (beforeEach,
afterEach, HealthCheck).
- tracker-web tracker-proxy.test.ts: prefix unused url with _.
- Applied eslint --fix on platform-service which normalised a handful
of `let` → `const` and removed one redundant disable comment.
Scope creep vs W1 "Files You Own" is acknowledged — user explicitly
approved this path when baseline rot was surfaced.
Verified: pnpm -r typecheck, lint, build, test all green.
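For reference, the extended no-unused-vars rule entry plausibly looks like the fragment below. The option names are the rule's real options; the `"error"` severity and the surrounding object are assumptions, since the actual eslint.config.js is not shown in the commit.

```typescript
// Sketch of the rules fragment; severity and placement are assumed.
const unusedVarsRules = {
  "@typescript-eslint/no-unused-vars": [
    "error",
    {
      argsIgnorePattern: "^_",              // convention already in use for args
      varsIgnorePattern: "^_",              // newly honoured for local variables
      caughtErrorsIgnorePattern: "^_",      // newly honoured for catch bindings
      destructuredArrayIgnorePattern: "^_", // newly honoured for array destructuring
    },
  ],
};
```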
- exports/routes: exclude inline data from GET /exports list response
to prevent returning megabytes of serialized export data (perf+security)
- Update WORKSPACE_TODO_AUDIT.md: add post-audit review section with
9 bugs found and fixed across 2 commits (73b07c2, 841cdf3), mark
all action plan sprints complete
- Typecheck clean, 1483/1483 tests pass
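Excluding the inline payload from a list response can be as small as the sketch below. The `ExportJob` shape and the `toSummary` helper are hypothetical; the real route may project fields differently.

```typescript
interface ExportJob { id: string; status: string; data?: string } // data = serialized export
type ExportSummary = Omit<ExportJob, "data">;

// Drop the potentially large payload and keep only the metadata.
function toSummary(job: ExportJob): ExportSummary {
  const { data: _data, ...summary } = job;
  return summary;
}

function listExports(jobs: ExportJob[]): ExportSummary[] {
  return jobs.map(toSummary);
}
```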
- diagnostics/subscribers: wire session.created email notification to
target user using existing 'diagnostics-session-created' template
(was just logging instead of sending the email)
- events/types: add missing 'currency' field to payment.failed schema
(payment.succeeded had it, payment.failed did not — inconsistency)
- delivery/subscribers: use event.payload.currency instead of hardcoded
empty string in payment-failed email variables
- Typecheck clean, 1483/1483 tests pass
- diagnostics/subscribers: use correct template IDs
'diagnostics-session-cancelled' and 'diagnostics-session-completed'
instead of non-existent 'generic' (would throw at runtime)
- delivery/templates: add missing 'broadcast' email template used by
broadcast delivery route (dispatchEmail would throw on unknown ID)
- broadcasts/routes: replace broken dot-path 'metrics.sent' update
with proper updateBroadcastMetrics() call, add productName variable
- exports/routes: store serialized data on job doc, add download
endpoint GET /exports/:id/download with content-type headers,
exclude data payload from metadata GET endpoint
- waitlist/routes: store invitation doc ID (inv_...) instead of
code string (WL-...) in invitationCodeId field
- delivery/delivery.test.ts: update template count 12 -> 13
- Typecheck clean, 1483/1483 tests pass
- delivery/subscribers: welcome email used the raw productId as productName;
it now uses resolveProductName() for the proper display name
- delivery/subscribers: remove redundant String(daysLeft) in trial_expiring
- surveys/routes: incentiveClaimed was set outside if(sub) block, marking
response as claimed even when user has no subscription. Moved inside
if(sub) so claims are only recorded when incentive is actually granted
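The incentiveClaimed fix amounts to moving one assignment inside the guard. The sketch below uses hypothetical `Subscription` and `SurveyResponse` shapes and a made-up `bonusDays` incentive purely to show the control flow.

```typescript
interface Subscription { id: string; bonusDays: number } // hypothetical incentive field
interface SurveyResponse { incentiveClaimed: boolean }

function grantIncentive(response: SurveyResponse, sub: Subscription | null, days: number): void {
  if (sub) {
    sub.bonusDays += days;
    // Recorded only when the incentive was actually granted. Before the fix,
    // the equivalent assignment sat after this block and ran unconditionally.
    response.incentiveClaimed = true;
  }
}
```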