job.retry (max / on / backoff) was persisted but never enforced: a failed
attempt just went to `failed` and required a manual operator requeue. Now, when
a factory releases a lease reporting a failure, the coordinator applies the policy:
- retryable result (matches a retry.on class) with attempts remaining ⇒ requeue
(queued, or blocked if deps are now unmet) with a retry backoff;
- retryable but attempts exhausted ⇒ dead_letter;
- no policy or non-retryable result (capability_mismatch/no_engine) ⇒ failed,
exactly as before (behavior-preserving).
Backoff is honored via a new job.retryNotBefore timestamp; the scheduler skips a
queued job until it elapses (new pure isAwaitingRetryBackoff gate in selectJob).
parseBackoffMs supports "<n>", "<n>s|m|h", "<n>ms", and "exp" (30s·2^(n-1), capped
1h). retry_scheduled / dead_letter audit events are emitted. decideFailureOutcome
and parseBackoffMs are pure and unit-tested (retry.test.ts), plus scheduler-gate
and end-to-end releaseLease coverage.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two recovery/cleanup gaps left the coordinator's containers growing without
bound and jobs stuck longer than necessary:
- reclaimStaleFactoryLeases: a crashed/partitioned factory stops heartbeating
~90s before its 900s lease TTL expires; the reaper now reclaims held leases of
stale (or vanished) holders within one stale window, via the same fence +
checkpoint-preserving path as the expiry reaper (refactored into reclaimLeaseJob).
- sweepFleetGarbage: deletes ephemeral coordination state on by default (finished
expired/released leases past a 24h TTL; factory docs with no heartbeat for 7d —
a live host just re-registers). Terminal-job retention (jobs + their runs/events/
artifacts+blobs) is OPT-IN only via FLEET_GC_RETENTION_DAYS (default 0 = never
delete history). Every delete is best-effort so one failure can't stall the sweep.
Both are wired into the existing reaper loop: recovery scans run every 30s, the
deletion sweep is throttled to hourly. New repo helpers (listHeldLeases,
listFinishedLeasesOlderThan, deleteLease, listAllFactories, deleteFactory,
listTerminalJobsOlderThan, deleteRun, deleteEvent) back the new coordinator
functions. Covered by cleanup.test.ts + expanded reaper.test.ts.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
reapExpiredLeases implements the full section-25 recovery (fence the zombie
holder via a leaseEpoch bump, return the job to queued/blocked, preserve the
checkpoint) but nothing ever called it: no route, no cron, no timer. So when a
factory crashed, lost network, or shut down, its in-flight job stayed stuck in
an active stage forever and was never requeued — the recovery code was dormant.
Add a process-wide background reaper (leases are queried across all products)
that runs reapExpiredLeases every 30s, started at server boot and stopped on
graceful shutdown, mirroring the diagnostics trigger-job pattern. A failing pass
is logged and retried on the next tick rather than crashing the service.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The engine picker only constrained the UI; the router never gated on the
chosen engine, so a job pinned to e.g. engine:codex could be claimed by a
factory that doesn't run codex (the runner's resolve_engine honors an explicit
engine with no availability check, so it would then fail at execution time).
Add a pure engineEligible(job, caps) hard gate to the section-7 scheduler filter
(and preemption): a concrete-engine job runs only on a factory advertising the
matching engine:<e> cap. Gated only against engine-aware factories (those that
advertise any engine:* token); engine-agnostic/legacy factories stay eligible,
mirroring the picker's "engine set unknown => offer all" fallback. explainJob now
surfaces the mismatch reason. No DB migration; behavior-preserving for legacy.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add GET /fleet/factories (lists a product's factory docs with capabilities) —
also fixes the fleet map's empty factory cards (listFactories had no route and
silently returned []). The New-Job form now loads the selected factory's
engine:* capabilities and constrains the engine dropdown to those (e.g. hides
codex when the host doesn't have it), keeping the current pick valid; falls back
to all engines when capabilities are unknown.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add an optional concrete `engine` to a job (overrides engineClass; resolved by
the runner's resolve_engine where an explicit engine wins). All additive +
optional, so existing engineless jobs keep falling back to the factory default.
- types: FLEET_ENGINES enum; engine on SubmitJob/FleetJobDoc/UpdateDraft.
- coordinator: store engine on create/supersede/updateDraft; run.engine at claim
prefers job.engine, then engineClass, then 'unknown'.
- tracker-web: Engine dropdown on the New-Job form (default devin) + editable on
draft/queued jobs; shown in the detail config grid.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Backend: insights now carry engine + sessionId/sessionUrl; releaseLease promotes
the reported engine onto the run (was created with the abstract engineClass,
usually 'unknown').
tracker-web job detail:
- Runs: show the concrete engine (insights.engine, falls back off 'unknown') and
the agent session (Devin session id with a `devin --resume <id>` hint, or a
link when a sessionUrl is present).
- PromptCard: edit repo/baseBranch/verify/autoMerge (not just the prompt) while
draft/queued/blocked.
- Timeline: filter by event type (default collapses heartbeat runs).
- Show a "no PR — needs verify / not PR mode" hint when parked in review.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
updateDraft changes caps/deps/priority (and body) without a stage change, so it
did not bump fleet_queue_state — gated factories (AQ_FLEET_GATE=1) would not
re-evaluate claimability until the safety interval. Bump the gate on edit so an
edit that makes a job claimable wakes factories promptly.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Backend (platform-service):
- New `draft` stage (not claimable; scheduler only takes queued/blocked).
- submitJob accepts `draft: true` → parks a new/superseded job as a draft.
- updateDraft(): edit prompt/config in place while draft/queued/blocked;
recomputes contentHash; rejected (conflict) once picked up (assigned+).
- submitDraft(): promote draft → queued (or blocked on unmet deps); idempotent.
- Routes: PATCH /fleet/jobs/:id/draft, POST /fleet/jobs/:id/submit.
- tracker-bridge: map draft → item status `open`. Tests + FLEET_STAGES updated.
Frontend (tracker-web):
- New-Job form: add "Save as draft" alongside "Submit".
- Job detail: edit the prompt + Save while draft/queued/blocked, "Submit" a
draft, and lock it read-only once a factory picks it up.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- fleet module README: add fleet_queue_state container + GET /fleet/queue-state
and /fleet/metrics; note the heartbeat cadence must stay under the 90s stale
threshold (AQ_FLEET_LEASE_RENEW_SEC).
- FLEET_CONTROL_PLANE: correct wrong endpoint paths (/fleet/claim and
/fleet/factories/heartbeat were documented as /fleet/jobs/:id/claim and
/fleet/factories/:id/heartbeat; removed a non-existent GET /fleet/factories);
add enroll, metrics, and the M0 queue-state endpoint.
- ROADMAP_COMPLETION_AUDIT: dated status banner — roadmap §0 now reconciled and
Phase-4 M0 shipped, superseding the older "stale §0 / not started" findings.
- README: point to FLEET_DISPATCH_REDESIGN.md + the M0 gate.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Adds fleet_queue_state (monotonic version per product), bumped on job create +
every stage change in the repository layer (best-effort, never fails a job
write), and a GET /fleet/queue-state read endpoint. Lets a polling factory
detect "work changed" with a ~1 RU point read instead of a full listJobs scan
on every claim. Registers the container; tests cover the bump + endpoint.
See agent-queue docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md §8/§12 (M0).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Adds an optional displayName claim to the platform access token so
downstream product backends can source the user's display name from the
JWT (single source of truth = platform auth), not from per-product DB
copies. verifyToken already exposes displayName; this populates it at all
token-minting sites (password login, register, refresh, SSO, OAuth,
magic-link, passkeys, QR, enterprise SAML/OIDC). Additive and backward
compatible.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The corporate proxy intermittently 407s GitHub's API, so a single gh pr merge can
fail transiently. Try once inline (fast path), then retry in the background with
backoff (3s/8s/20s/45s) without blocking the ship; mark prState=merged when one
lands. Best-effort throughout.
On ship (Ship button / operator action / autoship PATCH), when the run has an open
PR and FLEET_SHIP_MERGES_PR=1, the coordinator squash-merges it via gh (best-effort,
where gh is authed) and marks the run prState=merged. UI button reads 'Ship & merge
PR' when an open PR exists; Ship refreshes runs.
Runs now carry prState (open when the PR is opened, merged when auto-merge
succeeds), reported on lease release. Job-detail Runs table shows a status badge
next to the PR link.
Add job-level verify (command run in the PR checkout before opening the PR) and
autoMerge (squash-merge the PR once opened). Surfaced in the New Job form as a
Verify-command field + Auto-merge checkbox (PR mode only); confirmation now shows
PR-mode/repo. More repos added to the dropdown.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Make "shipped" produce a real artifact. A job can now carry an optional repo
(owner/name or clone URL) + baseBranch; the factory's PR mode runs the agent in an
isolated checkout, opens a PR, and records the link.
Backend:
- SubmitJobSchema + FleetJobDoc: optional repo/baseBranch (recorded on submit).
- FleetRunDoc: optional prUrl/branch.
- ReleaseLease report carries prUrl/branch -> stored on the run.
- +2 coordinator tests.
UI (tracker-web):
- New Job form gains optional Repo + Base branch fields (and fixes the priority
options to the valid critical/high/medium/low; "normal" was rejected by the API).
- Job detail Runs table shows a PR ↗ link from run.prUrl.
- fleet-client: submitJob repo/baseBranch; FleetRun prUrl/branch; OperatorAction +ship.
Docs: FLEET_CONTROL_PLANE.md "PR deliverable (PR mode)" section.
Verified: tsc clean; fleet suite 182; tracker-web 230.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
A job can reach `shipped` via autoship PATCH, the `ship` operator action, or a
terminal lease release, but the run-level `result` was left at whatever the
factory last reported (e.g. `review`), so the dashboard showed a shipped job with
a non-terminal run result.
- Add markLatestRunShipped(): on any transition to `shipped`, set the latest run
result to `shipped` (+ endedAt if unset). Idempotent, best-effort.
- Wire it into patchJobFenced (ungated; budget accrual stays flag-gated) and the
`ship` operator action.
- Document the testing->shipped paths (factory autoship vs `ship` operator action)
and the run-mirroring in docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md.
Tests: +2 (patchJobFenced->shipped and operator ship both set run.result=shipped).
Fleet suite 180 pass; tsc clean.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Verified: full workspace build (tsc) green across all packages/services/dashboards;
fleet+items tests pass. Compile-time only.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Bump stripe 17 -> 20 and adapt to the breaking API changes:
- PromotionCode: coupon moved under `promotion` ({ type: 'coupon', coupon }).
mapPromo now reads p.promotion.coupon; create now passes
promotion: { type: 'coupon', coupon: id } instead of a top-level coupon.
- Subscription.current_period_end removed (now per subscription item). Add
getSubscriptionPeriodEnd() = max(items[].current_period_end) and use it in the
customer.subscription.updated webhook handler.
- Update the promos route test fixture to the new promotion.coupon shape.
Verified: platform-service build (tsc) clean; promos (14) + stripe/subscriptions/
billing tests pass; full suite 1692/1692.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Applied fresh on current main (the matching dependabot branches were 350-430
commits behind and would have conflicted on the lockfile):
- @azure/cosmos 4.9.1 -> 4.9.3
- jose 6.1.3 -> 6.2.3 (mcp-server stays on the 5.x line: 5.9.6 -> 5.10.0)
- @typescript-eslint/parser 8.0 -> 8.60.0
Verified: full workspace build green; platform-service suite 1684 pass (only the
pre-existing single-fork migration-isolation flake, passes isolated); tracker-web
228 pass (only the pre-existing happy-dom product-context failures). No new
regressions. Major bumps (fastify/cors, happy-dom, lint-staged, stripe,
types/node) deferred for separate review.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Re-applies 4 defects merged with Phase 3 (still present on main), ported from the
stale reference branch and adapted to current main.
FIX 1 (BLOCKER): all 5 budget routes (GET/burndown/PUT/pause/resume) read
productId from the URL with no caller check, so a caller for product A could
read or modify product B budget. Add requireOwnProduct() -> 403 on mismatch.
FIX 2: stop tracking services/platform-service/.data/platform-events.json (the
EVENT_BUS_FILE runtime log); gitignore .data/.
FIX 3: accrueSpend was definition-only (budgets never accrued) and not
idempotent. Add accruedRunIds to FleetBudgetDoc; accrueSpend(productId, costUsd,
runId) no-ops on a seen runId; wire it into patchJobFenced on stage shipped,
flag-gated + best-effort + idempotent via jobId:leaseEpoch. Accrue the run ACTUAL
insights.costUsd so spentUsd and costBurndown agree.
FIX 4: submitChildren cycle detection is now batch-aware — rejects duplicate
child keys and walks each child declared deps across BOTH the unpersisted batch
and existing jobs, rejecting child-self, child-parent and sibling cycles.
Tests: +2 budget-authz (403 on all 5 verbs), +3 accrual (idempotent runId, ship
accrues insights.costUsd once, flag-off no accrual), +5 cycle detection. Gates
green: tsc/build, fleet+items (204), full suite (only the unrelated single-fork
migration-isolation file flakes, passes isolated). Flags stay default-OFF.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Make the Agent Gigafactory fully drivable end-to-end via the API:
- lease/release now accepts `insights` (model/tokens/cost) + `result`, recorded
on the current run with endedAt — factories report cost/token metrics on
completion (previously no API existed; runs stayed insights:{}).
- add `ship` operator action so a job in `testing` (where the review gate left
no lease holder) can reach the terminal `shipped` stage. Idempotent.
- operatorAction now retries on optimistic-concurrency conflict with backoff
(mirrors submitReview) so a ship right after approve survives real-Cosmos
read-after-write lag instead of a spurious 409.
Tests: +2 coordinator (ship idempotent, release records insights) and +2 route
integration (gated submit->...->ship->metrics; release-with-insights). 170 pass.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Azure Cosmos cannot serve a multi-field ORDER BY without a matching composite
index (the local emulator is lenient, real Cosmos returns HTTP 400). The fleet
listJobs() query orders by (priorityOrder, createdAt), which broke
GET /api/fleet/metrics and /api/fleet/jobs on real Cosmos.
- ContainerConfig gains an optional `compositeIndexes` field
- container init applies it on create AND reconciles it onto existing
containers (createIfNotExists never updates an existing index policy)
- fleet_jobs declares the (priorityOrder ASC, createdAt ASC) composite index
Verified live against Azure Cosmos: both endpoints now return 200.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Rename docs/gigafactory/ to docs/GIGAFACTORY/ and update the cross-repo
source-of-truth references in the fleet README and types.ts comment. Add an
index README listing the platform docs and pointing to the canonical spec in
learning_ai_devops_tools. Docs/comment only; no behavior change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move ROADMAP_COMPLETION_AUDIT.md, TASKS_TO_COMPLETE.md,
gigafactory-phase3-progress.md and FLEET_CONTROL_PLANE.md under
docs/gigafactory/ so the scattered Gigafactory docs are easy to discover.
Update intra-doc and cross-repo source-of-truth references (fleet README
and types.ts comment) to the new agent-queue/docs/gigafactory/ path.
Pure docs/comment move; no behavior change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements the §14 Phase 3 review gate. requestReview() routes a building
job into the review stage (fencing any worker), carrying a normalized policy
(requiredApprovals + reviewer allowlist) and clearing prior decisions.
submitReview() records one decision per reviewer (last-write-wins, identity-
normalized), advances the job to testing once distinct approvals reach the
quorum, and treats any reject as a veto that returns the job to queued for
rework. Adds POST /fleet/jobs/:id/review/request and POST /fleet/jobs/:id/review,
a typed client, and a review-gate card on the job-detail page.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds coordinator.fleetMetrics() computing queue depth, stage histogram,
oldest-queued age (starvation signal), factory health and seat utilization,
plus derived alerts (no_live_capacity, all_factories_down, queue_starvation,
saturated, stale_factories). Exposed via GET /fleet/metrics and surfaced as a
metrics+alerts panel on the fleet overview. Thresholds injectable for tests.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Backend: GET /fleet/jobs/:id/events/stream emits a snapshot (seq > Last-Event-ID)
then long-polls the append-only event log, closing after a bounded window so
EventSource-style clients reconnect cleanly. Honors Last-Event-ID resume,
keepalive comments, and a terminal error frame.
Frontend: subscribeJobEvents uses fetch streaming (to send auth + product
headers) with parseSseFrames, Last-Event-ID resume, reconnect backoff, and a
fatal-on-error-frame fallback to polling. Job detail page subscribes live
(deduped by seq), falls back to 4s polling on failure, and shows a Live badge;
refresh() now merges events so a slow snapshot can't clobber streamed ones.
Tests: +3 route (snapshot, resume cursor, append-after-connect), +5 client
(parseSseFrames x2, subscribe deliver/error/resume/error-frame). fleet 150, web 222.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- coordinator.costBurndown() aggregates completed run cost (insights.costUsd)
by UTC day over a window, returning a gap-free cumulative series + ceiling
- repository.listRunsByProduct() cross-partition run query
- GET /fleet/budgets/:productId/burndown?days=N route
- fleet-client.getBudgetBurndown() + CostBurndown/BurndownPoint types
- BurndownChart on the budget page: cumulative daily bars with a dashed
ceiling overlay; bars turn red past the ceiling; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 147, web 216)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds 'why does this job route here?' to the §7 scheduler:
- coordinator.explainJob() re-runs scoreCandidate against every live factory,
returning per-factory weighted breakdown, eligibility + reasons, deps state,
and the best eligible factory (read-only, side-effect free)
- GET /fleet/jobs/:id/explain route (404 when job missing)
- fleet-client.getJobExplain() + JobExplain/ScoreBreakdown types
- ExplainPanel on the job detail page: score table per factory with the six
weighted terms, eligibility, and unmet-deps note; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 144 green,
tracker-web 214 green)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- release lease with fenced epoch (leaseEpoch+1, clear holder) so a stale
renewal cannot resurrect a held lease after operator displacement
- reject on dead_letter / cancel on failed are now idempotent no-ops
(no epoch bump, no duplicate event)
- add coordinator test for terminal idempotency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fleet overview page with factory cards + recent jobs polling
- Job table with stage filter tabs
- Job detail page with events timeline, runs, artifacts, DAG subtree, SHIP action
- Budget page with usage bar, pause/resume controls
- API proxy route forwarding /api/fleet/* to platform-service
- Typed fleet-client.ts with graceful 404 degradation
- 16 unit tests for fleet-client (198 total tracker-web tests green)
- Added Fleet nav item to dashboard layout
- Full monorepo build + test green
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
docker-compose:
- Drop the cosmos-emulator service block. Both image variants we
tried were unfit for the prototype: `:vnext-preview` returned
plain-text PGCosmosError strings that crashed @azure/cosmos at
JSON.parse, and `:latest` core-dumped under load. The container
has been Exited(255) for weeks and was blocking depends_on chains.
- Real Azure Cosmos account `cosmos-mywisprai` (db `bytelyst`,
West US 2) is now the single source of truth; all services pick
up COSMOS_ENDPOINT/KEY/DATABASE from `.env` (already mounted via
`env_file: .env`).
- extraction-service: drop hardcoded `COSMOS_ENDPOINT=…cosmos-emulator…`,
`NODE_TLS_REJECT_UNAUTHORIZED=0`, and `depends_on: cosmos-emulator`.
- cowork-service: same cleanup.
cowork-service IPC bridge:
- Add `error` listeners to the spawned child's stdin/stdout/stderr.
Without them, an EPIPE on stdin (child died mid-write) or a
teardown-time stream error surfaced as an unhandled error and
crashed vitest after all 140 tests had passed.
- Removes the only failing recursive test in the workspace.
Test status after this commit:
- 94 workspace packages, all green
- cowork-service: 19 passed | 1 skipped (140 tests)
- platform-service: 131 test files passed
- extraction-service: 13 test files passed
- All other packages: passing
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
In-process tracker<->fleet bridge — no shell hop. Closes the §10 "direct
tracker->module calls" box.
- tracker-bridge.ts (new):
* ingestItemAsJob(productId, itemId, opts?) — reads the Item via the items
repository (foreign/unknown → NotFoundError), maps title/description → bodyMd
(verbatim) + labels (engine-class:/profile:/priority:/cap:) → manifest hints,
sets trackerItemId + a stable idempotency-key `tracker-<itemId>`, and submits
through coordinator.submitJob — so re-ingest dedupes and the job is scheduled by
the §7 router via the unchanged claim path.
* echoJobToItem(productId, jobId, log?) — mirrors stage → Item status
(queued/assigned/building/review/testing → in_progress; shipped → done;
failed/dead_letter → wont_fix) + a metrics-ONLY comment (attempts/duration/
tokens/cost — never the prompt body/secrets). Idempotent via the job's
`trackerEchoedStatus`; best-effort + non-fatal (items-write failure →
{ echoed: null, error }, never thrown into the job lifecycle). productId-scoped.
- Auto-echo wired into the PATCH + lease/release transitions, GATED by
FLEET_TRACKER_ECHO (default OFF → behavior byte-for-byte unchanged); never blocks
or fails the transition.
- Routes (additive): POST /fleet/tracker/ingest, POST /fleet/tracker/echo
(auth + getRequestProductId, productId-scoped).
- types.ts: optional FleetJobDoc.trackerEchoedStatus (reuses the existing
trackerItemId field; no parallel schema) + Ingest/Echo request schemas.
- repository.ts: setTrackerEchoedStatus (no rev bump — never interferes with the
fenced claim CAS).
Reuses the items + comments contracts directly (no HTTP). Does not touch
claimNextJob or the scheduler. productId on every doc; no any/console.log.
- packages/llm: use nullish coalescing (??) in GeminiProvider constructor
so explicit empty-string apiKey is not overridden by env var
- dashboards/admin-web,tracker-web: exclude .next/ from vitest test glob
to prevent Next.js internal test files from being picked up
- services/cowork-service: use platform-safe .kill() instead of SIGTERM
which is invalid on Windows
- packages/use-keyboard-shortcuts: add @testing-library/react devDep
- scripts/npmrc.template: use https:// for Gitea registry
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds factory enrollment + a scoped, rotatable credential model for the fleet
coordinator (trust boundary, §12/§18). Tokens are stored HASHED at rest (sha256 —
the same primitive the auth module uses for verify/magic-link tokens); the
high-entropy plaintext is returned exactly once at enroll/rotate and never persisted.
- enrollment.ts: enrollFactory (create/link factory + issue token), rotateToken
(new active token; prior marked `rotating` with a grace overlap so an in-flight
worker isn't cut off), revokeToken (immediate), verifyToken (constant-time hash
compare; revoked/expired-grace → null; updates lastUsedAt). Scope = {productId,
factoryId, capabilities[]}.
- Gated enforcement: enforceFactoryToken() on POST /fleet/factories/heartbeat and
POST /fleet/claim, active only when FLEET_REQUIRE_FACTORY_TOKEN is on (default
OFF — existing behavior/tests unchanged). When on: missing/invalid/revoked → 401;
out-of-scope productId/capability/factory → 403; and the claim is CONSTRAINED to
the verified token scope. Does not touch scheduler scoring or the claim CAS.
- types.ts: FleetFactoryTokenDoc + Enroll/Rotate/Revoke request schemas.
- repository.ts: fleet_factory_tokens collection + CRUD + findByHash.
- routes.ts (additive): POST /fleet/factories/enroll, /:id/token/rotate,
/:id/token/revoke (user auth + productId + Zod).
- cosmos-init.ts: register fleet_factory_tokens (/productId).
Also hardens the artifact routes (review fixes): listArtifactsByJob is now
productId-scoped (GET /fleet/jobs/:id/artifacts threads the request productId), and
artifact upload uses the request/auth productId authoritatively (a spoofed
body.productId no longer overrides it).
Tokens hashed at rest; plaintext shown once; no new crypto schemes; productId on
every doc; no any/console.log; enforcement default OFF.
The foundation's revUpdateJob/revUpdateLease did a read -> rev-check -> write with
await points between them, so two CONCURRENT claims could both read the same rev,
both pass the check, and both write — a double-assignment the old (sequential) race
test could not catch.
Rewire revUpdateJob/revUpdateLease to delegate to the datastore's updateIfMatch,
which performs the compare and the write as one indivisible operation (Cosmos
If-Match; synchronous compare-set on memory). The coordinator's tryClaimJob keeps
identical external behavior (ok/conflict) but is now genuinely single-winner.
Upgrades the coordinator tests to prove atomicity under TRUE concurrency:
- two contenders via Promise.all -> exactly one ok, one conflict; assigned once;
one run; one lease; leaseEpoch 1.
- N-claimer (15) stress via Promise.all -> one ok, N-1 conflicts, no double-assignment.
- N concurrent claimNextJob for one job -> exactly one non-null claim.
- N concurrent lease renewals -> exactly one wins.
Verified these concurrent tests FAIL against the old read-check-write (double-assign)
and pass after the fix.
Guarded REST under /api (auth + productId, like items): POST /fleet/jobs (idempotent
submit), GET /fleet/jobs (by stage/idempotencyKey), GET /fleet/jobs/:id, PATCH
/fleet/jobs/:id (fenced transition), POST /fleet/claim, lease renew/release,
factories/heartbeat, and runs/events streams. Every body validated with the Zod
schemas; fenced/conflict map to 409, missing to 404, invalid to 400. Registers
fleetRoutes in server.ts next to itemRoutes. Routes tested via Fastify inject on
the memory provider (real coordinator).