Thread a trace-context correlation id across the coordinator<->runner boundary
so a logical work-unit (job -> claim -> run -> ship) is stitchable end to end,
and add an advisory capacity autoscaling signal an external scaler can consume.
Tracing (#4):
- Mint/propagate a correlationId at submit from the inbound
x-correlation-id/traceparent/x-request-id (else generate ftr_<uuid>); persist
it on the job, inherit onto the run + lease at claim, and stamp every
lifecycle event (submitted/assigned/transition/lease_renewed/lease_released/
retry_scheduled/dead_letter). Children of a composite job share the parent id.
- Echo it back on the x-correlation-id response header (submit/claim/renew/
release/patch) so a factory can carry it forward, and bind it to req.log.
- New pure trace.ts (header resolution incl. W3C traceparent trace-id).
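The header resolution described above could look roughly like the sketch below. It is illustrative only, not the actual trace.ts: the function names, the `HeaderBag` type, and the precedence order (x-correlation-id, then the traceparent trace-id, then x-request-id) are assumptions read off the order the headers are listed in. The traceparent parsing only handles the four-field W3C layout.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical header shape; Node-style headers may be a string or an array.
type HeaderBag = Record<string, string | string[] | undefined>;

// W3C traceparent: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex).
const TRACEPARENT_RE = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/;

// Return the first non-empty, trimmed value of a header, if any.
function first(value: string | string[] | undefined): string | undefined {
  const v = Array.isArray(value) ? value[0] : value;
  const trimmed = v?.trim();
  return trimmed ? trimmed : undefined;
}

// Extract the trace-id from a traceparent header; undefined when malformed.
export function traceIdFromTraceparent(header: string | undefined): string | undefined {
  if (!header) return undefined;
  // Lowercasing is a leniency of this sketch; the spec requires lowercase hex.
  const match = TRACEPARENT_RE.exec(header.trim().toLowerCase());
  const traceId = match?.[1];
  if (!traceId) return undefined;
  // An all-zero trace-id is invalid under W3C Trace Context.
  return /^0+$/.test(traceId) ? undefined : traceId;
}

// Assumed precedence: explicit correlation id, trace-id, request id, then mint one.
export function resolveCorrelationId(headers: HeaderBag): string {
  return (
    first(headers["x-correlation-id"]) ??
    traceIdFromTraceparent(first(headers["traceparent"])) ??
    first(headers["x-request-id"]) ??
    `ftr_${randomUUID()}`
  );
}
```

Keeping the resolver free of I/O and framework types means it can be unit-tested with plain objects and reused by every route that echoes the id back.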
Autoscaling signal (#5):
- New pure autoscaler.ts turns a product's FleetMetrics + saturation alerts
(no_live_capacity/saturated/queue_starvation) into an auditable scale
recommendation (action/recommendedSeats/delta/urgency/signals).
budget_exhausted suppresses scale-out; idle slack reclaims down to a floor.
Thresholds tunable via FLEET_AUTOSCALE_* env.
- GET /fleet/autoscale (per-product) + GET /fleet/autoscale/all (global, admin
or scrape token). Documented the env vars in .env.example.
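As a rough illustration of the recommendation logic described above, a pure function along these lines would cover the three cases (scale out on a saturation alert, suppress when budget_exhausted, reclaim idle slack down to a floor). Everything here is an assumption for the sake of the sketch: the `FleetSnapshot` fields, the `Thresholds` names, the action and urgency values, and the sizing heuristic are hypothetical and not taken from autoscaler.ts.

```typescript
type Alert = "no_live_capacity" | "saturated" | "queue_starvation" | "budget_exhausted";
type Action = "scale_out" | "scale_in" | "hold";
type Urgency = "none" | "low" | "high";

// Hypothetical stand-in for the real FleetMetrics shape.
interface FleetSnapshot { liveSeats: number; busySeats: number; queueDepth: number; }

// Hypothetical thresholds; the real ones come from FLEET_AUTOSCALE_* env vars.
interface Thresholds { minSeats: number; maxSeats: number; idleUtilization: number; }

interface Recommendation {
  action: Action;
  recommendedSeats: number;
  delta: number;
  urgency: Urgency;
  signals: string[];
}

function recommend(m: FleetSnapshot, alerts: readonly Alert[], t: Thresholds): Recommendation {
  const signals: string[] = [...alerts];
  const has = (a: Alert): boolean => alerts.includes(a);
  const wantsMore = has("no_live_capacity") || has("saturated") || has("queue_starvation");
  let target = m.liveSeats;
  let urgency: Urgency = "none";

  if (wantsMore && has("budget_exhausted")) {
    // Budget is spent: record why we are holding instead of scaling out.
    signals.push("scale_out_suppressed");
  } else if (wantsMore) {
    // Assumed heuristic: cover busy plus queued work, capped at maxSeats,
    // and never recommend fewer seats than are already live.
    const desired = Math.max(m.liveSeats + 1, m.busySeats + m.queueDepth);
    target = Math.max(m.liveSeats, Math.min(t.maxSeats, desired));
    urgency = has("no_live_capacity") ? "high" : "low";
  } else if (m.queueDepth === 0 && m.liveSeats > 0 && m.busySeats / m.liveSeats < t.idleUtilization) {
    // Idle slack: shrink toward busy seats, but not below the floor.
    target = Math.min(m.liveSeats, Math.max(t.minSeats, m.busySeats));
    if (target < m.liveSeats) {
      signals.push("idle_slack");
      urgency = "low";
    }
  }

  const delta = target - m.liveSeats;
  const action: Action = delta > 0 ? "scale_out" : delta < 0 ? "scale_in" : "hold";
  return { action, recommendedSeats: target, delta, urgency, signals };
}
```

Returning the contributing `signals` alongside the numbers is what makes the recommendation auditable: an external scaler can log why a given delta was proposed without re-deriving it.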
Tests: +29 (trace 10, tracing 7, autoscaler 12); full suite 1846 green; lint + tsc clean.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Exports fleet observability to Prometheus/Grafana (previously JSON-only).
- GET /api/fleet/metrics/prom: global, product-labelled Prometheus exposition
(queue depth, blocked/active, per-stage histogram, factory health/seats/
utilization, active alerts, budget spent/ceiling/projected) plus process-wide
reaper/GC counters and engine circuit-breaker state. Pure renderer
(renderFleetMetricsProm) is unit-tested; route auth accepts a FLEET_METRICS_TOKEN
bearer (scrape path) or an admin JWT — never world-readable by default.
- Infra: add a prometheus container to docker-compose + a platform-service-fleet
scrape job; pin the Prometheus Grafana datasource uid; add a provisioned
"Fleet Overview" dashboard (breakers, dead-letter, stale factories, alerts,
queue depth, utilization, budget burn, reaper rate) with a product template var.
- Document FLEET_METRICS_TOKEN + the fleet feature flags in .env.example.
No default behavior change: the endpoint is additive and the new container is
opt-in via the compose stack.
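For readers unfamiliar with the Prometheus text exposition format, a pure renderer for product-labelled gauges might be sketched as follows. This is not renderFleetMetricsProm itself: the metric names (`fleet_queue_depth`, `fleet_active_runs`), the `ProductGauges` shape, and the helper names are placeholders chosen for illustration.

```typescript
// Hypothetical per-product input; the real renderer consumes richer fleet metrics.
interface ProductGauges { product: string; queueDepth: number; activeRuns: number; }

// Escape a label value per the text exposition format: backslash, quote, newline.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one gauge family: HELP line, TYPE line, then one sample per product.
function renderGauge(
  name: string,
  help: string,
  samples: Array<{ product: string; value: number }>,
): string[] {
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} gauge`,
    ...samples.map((s) => `${name}{product="${escapeLabel(s.product)}"} ${s.value}`),
  ];
}

// Pure function: same input always yields the same exposition text.
function renderFleetSketch(rows: ProductGauges[]): string {
  const lines = [
    ...renderGauge(
      "fleet_queue_depth",
      "Jobs waiting to be claimed.",
      rows.map((r) => ({ product: r.product, value: r.queueDepth })),
    ),
    ...renderGauge(
      "fleet_active_runs",
      "Runs currently executing.",
      rows.map((r) => ({ product: r.product, value: r.activeRuns })),
    ),
  ];
  // The exposition format expects a trailing newline after the last sample.
  return lines.join("\n") + "\n";
}
```

Because the renderer takes plain data and returns a string, it can be unit-tested without a running server, and the route handler only has to deal with auth and the content type.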
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Include Gitea npm registry variables (token, host, owner) so
developers know which env vars to set for @bytelyst package access.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds cowork-service (port 4009) to docker-compose.yml with healthcheck,
depends_on gates for cosmos-emulator and platform-service, env_file
integration, and Traefik labels. Unblocks Phase 3 ecosystem wiring of
the ByteLyst roadmap.
Also adds the services/cowork-service/Dockerfile that compose builds
from. Pattern mirrors services/mcp-server/Dockerfile but copies the
full workspace in one step rather than enumerating every package.json,
to stay resilient to workspace membership changes. Production stage
runs `node dist/server.js` on :4009 with a BusyBox wget healthcheck
(wget is bundled with node:22-alpine, so no apk install is required).
.env.example gains a Cowork-Service section documenting:
- ANTHROPIC_API_KEY, RUST_RUNTIME_BIN, RUST_RUNTIME_TIMEOUT_MS
- OLLAMA_URL, OLLAMA_MODELS
- FEATURE_FLAGS_ENABLED
The 13th clawcowork flag telemetry_enabled already ships via COMMON_FLAGS
in services/platform-service/src/modules/flags/seed.ts so seed.ts was not
touched.
Gap: INFRA-gap-01
Verified: docker compose config (YAML validity + env substitution),
pnpm -r typecheck / lint / build / test (all green),
docker compose build cowork-service (image built),
docker compose up -d cowork-service --no-deps --wait (Healthy),
curl -fsS localhost:4009/health → {"status":"ok","service":"cowork-service",...}.
Note: full-stack `docker compose up cosmos-emulator platform-service
cowork-service --wait` is blocked by a pre-existing issue in
services/platform-service/Dockerfile (react-native-platform-sdk prepare
script fails during pnpm install --frozen-lockfile in the image build).
That is outside W1 scope; cowork-service starts clean on its own and
becomes Healthy when platform-service is available out-of-band.
- docker-compose.yml: extraction-service on port 4005 with Traefik, Loki, healthcheck
- .env.example: PYTHON_SIDECAR_URL, DEFAULT_MODEL_ID, GEMINI_API_KEY
Rewired all 4 services:
- lib/errors.ts → re-exports from @bytelyst/errors
- lib/cosmos.ts → re-exports from @bytelyst/cosmos
- lib/product-config.ts → uses loadProductIdentity()/getProductId() from @bytelyst/config
- lib/config.ts → kept self-contained (zod v3/v4 type mismatch with loadConfig)
Added workspace deps (@bytelyst/errors, @bytelyst/cosmos, @bytelyst/config) to all 4 services.
Added docker-compose.yml with Loki, Grafana, Traefik, and all 4 services.
Added .env.example with required env vars.
Added passWithNoTests to vitest.config.ts.
Pinned root zod to ^3.24.0 to match service zod versions.
All 12 projects build. 175 tests passing.