diff --git a/docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md b/docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md new file mode 100644 index 00000000..a0ed0bc8 --- /dev/null +++ b/docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md @@ -0,0 +1,505 @@ +# Runbook — Run a Devin Fleet Job End‑to‑End (local) + +> **Audience:** developers and coding agents. +> **Goal:** stand up `platform-service` + `tracker-web` + a **fleet factory** (the +> `agent-queue` runner) so a submitted job is claimed and executed **autonomously +> by the Devin CLI** against a target repo (worked example: `learning_ai_notes`), +> pushing a branch and opening a **real pull request**. + +> ⚠️ **This is a real, cost‑incurring, side‑effecting operation.** The factory runs +> an autonomous coding agent (Devin) that consumes API credits, can run for a long +> time, pushes a branch, and opens a **real PR** on GitHub. Read [§9 Safety & +> cost](#9-safety--cost) before launching. For unattended local prototyping only — +> not a production deployment guide. + +--- + +## 1. Architecture (what talks to what) + +``` + you ──▶ tracker-web (:3003) ─┐ + │ REST + SSE (/api/fleet/*) + coding agent / curl ────────┼─▶ platform-service (:4003) ──▶ Azure Cosmos (jobs/runs/leases/events) + │ ▲ ▲ + Prometheus (:9090)┘ │ │ claim / lease-renew / report (Bearer JWT + X-Product-Id) + Grafana (:3000) ───────────┘ │ + │ + agent-queue FACTORY (fleet mode) ──▶ Devin CLI ──▶ git push + gh pr create + (learning_ai_devops_tools/agent-queue) (target repo, e.g. learning_ai_notes) +``` + +- **platform-service** — the fleet **coordinator**. Owns the job lifecycle + (`queued → assigned → building → review → testing → shipped|failed|dead_letter`), + atomic claim, leases, events, budgets, metrics. Code: `services/platform-service/src/modules/fleet/`. +- **tracker-web** (`:3003`) — submit/inspect jobs (`/dashboard/fleet/jobs/...`). +- **factory** — `learning_ai_devops_tools/agent-queue` in **fleet mode**. 
Polls + `POST /api/fleet/claim`, runs the agent CLI in an isolated checkout, reports back, + and (PR mode) opens the PR. +- **Prometheus/Grafana** — fleet metrics + the "Fleet Overview" dashboard. + +Lifecycle the factory drives: + +``` +queued ─▶ assigned ─▶ building ─▶ review ─▶ testing ─▶ shipped + (claim) (agent (rc=0) (verify (manual/auto ship) + running) passed) + └─ agent rc≠0 / timeout / verify fail ─▶ failed ─▶ (retry|dead_letter) +``` + +--- + +## 2. Prerequisites + +| Tool | Why | Check | +| ----------------------------------------------- | ---------------------------------------------------------------------------- | -------------------- | +| Node ≥ 20 + `pnpm` (corepack) | host-run service, scripts, tracker-web, build | `node -v && pnpm -v` | +| `git` + `gh` (authenticated) | factory clones target repo, pushes branch, opens PR; `gh pr merge`/reconcile | `gh auth status` | +| `devin` CLI (authenticated) | the agent the factory runs | `devin --version` | +| Both repos cloned side‑by‑side | coordinator/dashboards + the factory | see below | +| repo `.env` (root of `learning_ai_common_plat`) | `JWT_SECRET`, Cosmos creds, `FLEET_METRICS_TOKEN` | `test -f .env` | +| Docker | **optional** — only for the Docker path (§3 Option B) / Grafana+Prometheus | `docker info` | + +> **Node version:** the Docker image pins **node 22**; for the host path any **Node ≥ 20** +> works. Use one Node (nvm/asdf) for both repos to avoid native-module surprises. 
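The per-tool checks in the prerequisites table can be bundled into one preflight pass. This is only a sketch: the helper names (`node_major`, `missing_tools`, `preflight`) are made up for illustration and do not exist in either repo, and the tool list simply mirrors the table above.

```bash
# Preflight sketch for the prerequisites table (all helper names are hypothetical).

# Print the major version from a `node -v` style string, e.g. "v22.3.0" -> 22.
node_major() {
  v="${1#v}"                 # drop a leading "v" if present
  printf '%s\n' "${v%%.*}"   # keep everything before the first dot
}

# Print each named tool that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report problems without exiting, so it is safe to paste into an interactive shell.
preflight() {
  missing="$(missing_tools node pnpm git gh devin)"
  [ -n "$missing" ] && printf 'missing: %s\n' $missing
  if command -v node >/dev/null 2>&1; then
    [ "$(node_major "$(node -v)")" -ge 20 ] || echo "Node >= 20 required"
  fi
  return 0
}
```

Run `preflight` after sourcing the snippet; no output means every tool was found and Node meets the version floor. It does not check authentication, so `gh auth status` and the Devin login still need to be verified separately.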
+ +### 2.1 First‑time setup (fresh machine) + +Clone both repos as **siblings** (the factory clones targets relative to a shared parent): + +```bash +mkdir -p ~/code && cd ~/code +git clone /learning_ai_common_plat.git +git clone /learning_ai_devops_tools.git # contains agent-queue (the factory) +``` + +Create and fill `.env` at the **root of `learning_ai_common_plat`**: + +```bash +cd ~/code/learning_ai_common_plat +cp .env.example .env +# then edit .env — minimum for the fleet flow: +# JWT_SECRET= +# FLEET_METRICS_TOKEN=changeme-fleet-metrics-token # only needed for Prometheus +# COSMOS_* / connection vars -> see note below +``` + +- `JWT_SECRET` — HS256 secret platform-service verifies tokens with. Any strong value; + it only needs to be **internally consistent on this machine** (the token you mint in + §5 and the running service must share it). **Required.** +- **Cosmos** — the default prototype talks to a **real Azure Cosmos account** (no emulator + in the default compose). On a new machine you must either (a) point `.env` at the **same + Cosmos account** (to see/share existing jobs) or (b) point at your own DB and set + `COSMOS_AUTO_INIT=true` so containers are created on boot. Without valid Cosmos creds the + service starts but every fleet call fails. +- `FLEET_METRICS_TOKEN` — only needed if you run Prometheus (§4); must match + `services/monitoring/prometheus/prometheus.yml` (`credentials:`). + +### 2.2 Install + build the workspace (required for the host path) + +Host-run resolves `@bytelyst/*` workspace packages from their **`dist/`** (the `exports` +field points at `dist`), so you must build them once before `tsx`/Next can import them: + +```bash +cd ~/code/learning_ai_common_plat +pnpm install +pnpm -r build # builds all workspace packages (incl. @bytelyst/* → dist/) +# (faster, just the platform-service closure:) +# pnpm -r --filter @lysnrai/platform-service... 
build +``` + +> Skipping this is the #1 fresh-machine failure: `tsx watch` crashes with +> `Cannot find module '@bytelyst/.../dist/index.js'`. Re-run `pnpm -r build` after pulling +> changes to shared packages. + +--- + +## 3. Bring up platform-service + tracker-web + +Two ways. **Option A (all localhost, no Docker)** is recommended for a single dev Mac / +WSL box — everything runs on the host, so `gh`-backed features work out of the box. +**Option B (Docker)** is for when you also want the Grafana/Prometheus stack. + +### Option A — all localhost, no Docker (recommended) + +Two long‑lived processes, each in its own terminal. Both assume §2.1/§2.2 are done +(`.env` filled, `pnpm -r build` run). + +**Terminal 1 — coordinator (platform-service, :4003):** + +```bash +cd ~/code/learning_ai_common_plat/services/platform-service +pnpm exec tsx watch --env-file=../../.env src/server.ts +``` + +`tsx watch` hot-reloads on source changes. Use the explicit `--env-file=../../.env` +(the bare `pnpm dev` script does **not** load the root `.env`, so `JWT_SECRET`/Cosmos +would be missing). `FLEET_METRICS_TOKEN` is already in `.env` if you set it in §2.1. + +**Terminal 2 — dashboard (tracker-web, :3003):** + +```bash +cd ~/code/learning_ai_common_plat/dashboards/tracker-web +pnpm dev # serves http://localhost:3003 (proxies /api → :4003) +``` + +That's the whole coordinator + UI. **Monitoring (Grafana/Prometheus) is optional** on +the host path — `GET /api/fleet/metrics` (JSON), `GET /api/fleet/autoscale`, and the +tracker-web job pages cover observability without it. To get the Grafana "Fleet Overview" +dashboard you need Prometheus + Grafana (run them via Docker — Option B — or Homebrew +binaries pointed at `services/monitoring/...`). + +Because everything is on the host, `gh` is on `PATH` → the PR‑state **reconcile** (§8) +and ship‑time `gh pr merge` work (unlike the Docker container, which has no `gh`).
+ +Health checks: + +```bash +curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4003/health # 200 +curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3003 # 200 +``` + +### Option B — Docker (adds Grafana + Prometheus) + +```bash +cd ~/code/learning_ai_common_plat +# targeted fleet subset that always builds cleanly: +docker compose up -d --build platform-service prometheus grafana +# (full stack: bash scripts/prototype-up.sh) +``` + +Starts `platform-service` (`:4003`), `prometheus` (`:9090`), `grafana` (`:3000`, +admin/`lysnrai`) + deps. Still run **tracker-web from source** (Option A, Terminal 2). + +> **Docker caveats:** +> +> - `prototype-up.sh` may fail building the **dashboard** images when +> `corepack prepare pnpm@…` can't fetch pnpm on a restricted network → use the targeted +> subset above. +> - **`gh` is NOT in the container** → coordinator‑side `gh pr merge` and PR‑reconcile (§8) +> are no‑ops in Docker. Use the host path (Option A) if you need them. +> - Don't run both: the container and a host `tsx` both bind `:4003` +> (`docker compose stop platform-service` before host‑running). + +--- + +## 4. Make Prometheus auth work (only if running Prometheus) + +Skip this on the host path unless you also run Prometheus. `prometheus.yml` scrapes +`/api/fleet/metrics/prom` with a bearer, so the running `platform-service` must see the +same `FLEET_METRICS_TOKEN`: + +```bash +cd ~/code/learning_ai_common_plat +grep -q '^FLEET_METRICS_TOKEN=' .env || \ + printf '\nFLEET_METRICS_TOKEN=changeme-fleet-metrics-token\n' >> .env +# host path: restart Terminal-1 tsx so it re-reads .env +# docker path: docker compose up -d platform-service +``` + +Verify (if Prometheus is up): `http://localhost:9090/api/v1/targets` → +`platform-service-fleet` is `up`. The value must equal `credentials:` in +`services/monitoring/prometheus/prometheus.yml`. + +--- + +## 5. 
Mint a local API token (dev only) + +`platform-service` verifies HS256 JWTs signed with `JWT_SECRET` and requires +`type: "access"`. The tracker-web UI obtains one via login; for scripts/agents and +the factory, mint one directly. **Local dev only — never commit tokens or the secret.** + +Save `mint-token.mjs` (resolve `jose` from the workspace): + +```js +import { readFileSync } from 'node:fs'; +// adjust the jose path to your checkout if needed: +import { SignJWT } from '/ABS/PATH/learning_ai_common_plat/node_modules/.pnpm/jose@5.10.0/node_modules/jose/dist/node/esm/index.js'; + +const env = readFileSync('/ABS/PATH/learning_ai_common_plat/.env', 'utf8'); +const secret = new TextEncoder().encode(env.match(/^JWT_SECRET=(.*)$/m)[1].trim()); +const ttl = process.argv[2] || '15m'; // e.g. '15m' for scripts, '24h' for a factory +process.stdout.write( + await new SignJWT({ sub: 'local-dev', role: 'admin', type: 'access' }) + .setProtectedHeader({ alg: 'HS256' }) + .setIssuedAt() + .setExpirationTime(ttl) + .sign(secret) +); +``` + +```bash +node mint-token.mjs 15m > /tmp/tok # short-lived, for API calls +node mint-token.mjs 24h > /tmp/factok # longer-lived, for the factory daemon +``` + +> Find the jose path with: +> `find . -path '*/node_modules/jose/dist/*/esm/index.js' | head -1`. + +Requests must also carry the **product**: header `X-Product-Id: ` +(e.g. `notelett`). `role: admin` bypasses tenant ownership checks when +`FLEET_TENANT_ENFORCEMENT` is on (it's off by default). + +--- + +## 6. Submit a job + +### Via tracker-web (preferred) + +Open `http://localhost:3003/dashboard/fleet/jobs`, "New job". **Set the correct +product first** (the product selector) — a job is partitioned by `productId`, and +submitting under the wrong product misattributes cost/metrics/ownership and the +factory won't see it under the product it polls. + +PR‑mode fields that matter: + +- **`repo`** — must be `owner/name` (e.g. 
`saravanakumardb1/learning_ai_notes`) or a + clone URL, **not** a bare name (the factory feeds it to `gh`). +- **`baseBranch`** — e.g. `main`. +- **`engine`** — `devin` (pins the agent; otherwise the factory's default/engineClass). +- **`autoMerge`** — leave **`false`** for a human merge gate (recommended for large PRs). + +### Via API + +```bash +JOB=$(curl -s -X POST http://localhost:4003/api/fleet/jobs \ + -H "Authorization: Bearer $(cat /tmp/tok)" \ + -H "X-Product-Id: notelett" -H 'Content-Type: application/json' \ + -d '{ + "idempotencyKey": "notelett-demo-1", + "bodyMd": "# Task\n…full prompt…", + "priority": "high", + "engine": "devin", + "repo": "saravanakumardb1/learning_ai_notes", + "baseBranch": "main", + "autoMerge": false + }') +echo "$JOB" # → { outcome: "created", job: { id: "fjob_…", stage: "queued", ... } } +``` + +The job is now `queued` and claimable. It will **not run** until a factory polls +for its product (next step). + +--- + +## 7. Start the factory (agent-queue, fleet mode) + +The factory lives in a **separate repo**: `learning_ai_devops_tools/agent-queue`. +Run it on the **host** (needs `devin` + `gh`). Read its `docs/RUN_POLICY.md` first. + +### 7a. Sanity‑check connectivity (safe — registers + heartbeats only) + +```bash +cd learning_ai_devops_tools/agent-queue +./agent-queue.sh init # idempotent + +AQ_FLEET=1 AQ_FLEET_ROUTE=1 \ +AQ_FLEET_API=http://localhost:4003/api \ +AQ_PRODUCT_ID=notelett \ +AQ_FLEET_TOKEN="$(cat /tmp/factok)" \ +AQ_FACTORY_ID=mac-local-1 \ +AQ_FLEET_CAPS=engine:devin \ +AQ_FLEET_LEASE_RENEW_SEC=60 \ + ./agent-queue.sh fleet-status # → "heartbeat OK (registered)." +``` + +### 7b. 
Launch the run loop (claims + runs the agent) + +```bash +cd learning_ai_devops_tools/agent-queue +AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 \ +AQ_FLEET_API=http://localhost:4003/api \ +AQ_PRODUCT_ID=notelett \ +AQ_FLEET_TOKEN="$(cat /tmp/factok)" \ +AQ_FACTORY_ID=mac-local-1 \ +AQ_FLEET_CAPS=engine:devin \ +AQ_FLEET_LEASE_RENEW_SEC=60 \ + ./agent-queue.sh run --max 1 +``` + +> ⚠️ **Set `AQ_FLEET_LEASE_RENEW_SEC` below 90 (e.g. 60).** This is the heartbeat/ +> lease‑renew cadence. The coordinator's reaper marks a factory **stale after 90s** +> (`DEFAULT_STALE_FACTORY_MS`, a constant — no env knob) and **reclaims its in‑flight +> lease**. The default cadence is **300s**, so a busy single‑slot factory looks stale +> for most of every cycle and its running job gets requeued mid‑run (`leaseEpoch` +> climbs, stage flips back to `queued`, and the final report is **fenced** so the job +> never tidies to `review`/`shipped`). 60s keeps it comfortably live. (Add the same env +> to the §7a `fleet-status` check for consistency.) + +Key fleet env vars (see `lib/fleet-client.sh`): + +| Var | Meaning | +| -------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `AQ_FLEET=1` | master switch — enable coordinator calls (0 = pure offline) | +| `AQ_FLEET_ROUTE=1` | coordinator is **authoritative** for claim (pulls work from platform-service) | +| `AQ_FLEET_PR=1` | PR mode — open a PR for jobs that target a `repo` | +| `AQ_FLEET_API` | base URL **including `/api`** (`http://localhost:4003/api`) | +| `AQ_FLEET_TOKEN` | **Bearer JWT** (mint per §5; ≥ run duration, e.g. 24h) | +| `AQ_PRODUCT_ID` | product to poll — sent as `X-Product-Id` (must match the job's product) | +| `AQ_FACTORY_ID` | this factory's id (registered/heartbeated) | +| `AQ_FLEET_CAPS` | advertised capabilities, e.g. `engine:devin` | +| `AQ_FLEET_LEASE_RENEW_SEC` | **set `<90`** (e.g. 
`60`) — heartbeat/renew cadence vs the 90s stale window (see warning) | +| `AQ_FLEET_REPO_BASE` | _(optional)_ dir of local checkouts; if `…//.git` exists it uses a **git worktree**, else it `git clone`s `https://github.com/.git` into its cache | +| `AQ_FLEET_AUTOSHIP=1` | _(optional)_ auto-advance to `shipped` (skips the manual gate) | + +The run loop `claim → assigned → building`, runs Devin in an isolated checkout, +heartbeats + renews the lease (`lease_renewed` events) so the reaper doesn't reclaim it, +then on agent exit moves to `review` and (PR mode) opens the PR. With `autoMerge:false` +it **stops at the human merge gate**. + +> **Repo checkout:** the job's `repo` is `owner/name`, so by default the factory +> `git clone`s `https://github.com//.git` into its own cache +> (`queue/.state/repos/…`) — clean isolation, nothing touches your working copies. To +> reuse an existing local clone via a **git worktree** instead, set +> `AQ_FLEET_REPO_BASE=` where `///.git` exists. + +--- + +## 8. Observe progress + +- **tracker-web:** `http://localhost:3003/dashboard/fleet/jobs/` — live event + stream (SSE), runs, PR link + state. +- **Events/API:** + ```bash + curl -s http://localhost:4003/api/fleet/jobs//events \ + -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett" + ``` +- **Metrics:** `GET /api/fleet/metrics` (JSON, per product) · `GET /api/fleet/metrics/prom` + (Prometheus, all products; needs `FLEET_METRICS_TOKEN`) · Grafana **Fleet Overview** + (`http://localhost:3000/d/fleet-overview`). +- **Autoscale signal:** `GET /api/fleet/autoscale` (this product) / `…/autoscale/all`. + +### PR‑state reconcile (externally‑merged PRs) + +If you merge the PR in the GitHub UI, the coordinator doesn't know until told. Trigger +a reconcile (flips run `prState → merged` when `gh pr view` reports MERGED): + +- UI: **"Refresh PR status"** button on the job's PR section, or +- API: `POST /api/fleet/jobs//pr/reconcile`. 
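The API variant of the reconcile trigger could be wrapped roughly as follows. Treat it as a sketch under assumptions: `FLEET_API`, `PRODUCT_ID`, `reconcile_url`, and `reconcile_pr` are illustrative names that are not part of either repo, the token file is the short-lived `/tmp/tok` from §5, and the job id is whatever `fjob_…` value the submit call in §6 returned. The request carries no body, so `Content-Type` is deliberately left off, on the assumption that this route treats empty JSON bodies the same way as the `actions` routes described in §11.

```bash
# Sketch: trigger a PR-state reconcile for one job (names below are hypothetical).
FLEET_API="${FLEET_API:-http://localhost:4003/api}"   # base URL including /api
PRODUCT_ID="${PRODUCT_ID:-notelett}"                  # must match the job's product

# Build the reconcile URL for a given job id.
reconcile_url() {
  printf '%s/fleet/jobs/%s/pr/reconcile\n' "$FLEET_API" "$1"
}

# POST to the reconcile endpoint with the bearer token and product header.
# No Content-Type header is sent because there is no request body.
reconcile_pr() {
  curl -s -X POST "$(reconcile_url "$1")" \
    -H "Authorization: Bearer $(cat /tmp/tok)" \
    -H "X-Product-Id: $PRODUCT_ID"
}
```

Call it as `reconcile_pr "$JOB_ID"` with `JOB_ID` set to the job's id. The function prints whatever the coordinator returns; the response shape is not assumed here.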
+ +> Requires `gh` where platform-service runs → use the **host path** (§3 Option A); +> it's a no‑op in the Docker container (no `gh`). + +--- + +## 9. Safety & cost + +- **Billable + autonomous + long‑running.** Each run consumes Devin credits and can + run for a long time unattended. Scope jobs deliberately; very large multi‑workstream + specs are better split into several jobs. +- **Real PR.** PR mode pushes a branch and opens a PR on the target repo. Keep + `autoMerge:false` so a human reviews/merges; `gh pr merge` (auto) only fires when the + job opts in or `FLEET_SHIP_MERGES_PR=1`. +- **Isolation.** The factory works in an isolated worktree/clone, never your main + checkout (per `agent-queue/docs/RUN_POLICY.md`). Avoid blanket `--yolo` on live trees. +- **Stopping the daemon** mid‑run lets the lease expire; the coordinator's reaper then + reclaims and requeues the job (so partial work may be retried). Stop intentionally. +- **Tokens/secrets:** the minted JWT and `JWT_SECRET` are sensitive — never commit them + or paste into shared logs. `.env` is git‑ignored; keep it that way. + +--- + +## 10. Teardown + +```bash +# stop the factory: Ctrl-C the run loop +# host path: Ctrl-C the tsx (Terminal 1) and pnpm dev (Terminal 2) +# docker path: +# cd ~/code/learning_ai_common_plat && docker compose down # keep volumes +# docker compose down -v # also drop volumes +rm -f /tmp/tok /tmp/factok # discard minted tokens +``` + +--- + +## 11. 
Troubleshooting + +| Symptom | Cause → Fix | +| -------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `Cannot find module '@bytelyst/…/dist/index.js'` on `tsx`/Next start | workspace packages not built → `pnpm -r build` (§2.2). | +| `401 {"error":"Invalid or expired token"}` | JWT expired/mis‑signed → re‑mint (§5); ensure same `JWT_SECRET` as the running service. | +| Job claimed then flips back to `queued` mid‑run; `leaseEpoch` keeps climbing; final report **fenced**; PR opens but job never reaches `review`/`shipped` | factory heartbeat cadence (`AQ_FLEET_LEASE_RENEW_SEC`, default **300s**) > reaper stale window (**90s**) → set `AQ_FLEET_LEASE_RENEW_SEC=60` (§7). To recover the record after the fact, reconcile PR state (§8). | +| Job stays `queued`, never claimed | No factory for that product → `fleet-status` shows it registered? `AQ_PRODUCT_ID` must equal the job's product. Check `GET /api/fleet/factories` (X‑Product‑Id) for `0 live`. | +| `POST …/pr/reconcile` or ship auto‑merge does nothing | `gh` not present where platform-service runs (Docker container) → run the host path (§3 Option A). | +| Prometheus target `platform-service-fleet` = `down (401)` | service missing `FLEET_METRICS_TOKEN` → §4 (restart host `tsx` / recreate container). | +| `prototype-up.sh` build fails on `corepack prepare pnpm` | dashboard image network issue → use the targeted subset, or just use the host path (Option A). | +| `POST …/actions/` returns 500 "Body cannot be empty" | sent `Content-Type: application/json` with no body → omit the header or send `{}`. | +| Port `4003` conflict | host `tsx watch` and a `platform-service` container both bind `4003` → run only one. 
| +| `gh pr create` fails | `repo` is a bare name → must be `owner/name` or a clone URL; confirm `gh auth status`. | +| PR/cost attributed to wrong product | job submitted under the wrong `productId` partition → resubmit under the right product and cancel the stray (`POST …/actions/cancel`). | +| `vitest` exits non‑zero with `kill EPERM` after all suites pass | worker‑pool teardown artifact (sandbox), not a test failure → re‑run; all suites already passed. | + +--- + +## 12. Copy‑paste quickstart — all localhost (notelett → learning_ai_notes) + +Assumes §2.1/§2.2 done (`.env` filled, `pnpm -r build` run). Four terminals. + +```bash +# Terminal 1 — coordinator +cd ~/code/learning_ai_common_plat/services/platform-service +pnpm exec tsx watch --env-file=../../.env src/server.ts + +# Terminal 2 — dashboard +cd ~/code/learning_ai_common_plat/dashboards/tracker-web && pnpm dev # :3003 + +# Terminal 3 — tokens + submit (save mint-token.mjs from §5; fix ABS paths) +node mint-token.mjs 15m > /tmp/tok +node mint-token.mjs 24h > /tmp/factok +curl -s -X POST http://localhost:4003/api/fleet/jobs \ + -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett" \ + -H 'Content-Type: application/json' \ + -d '{"idempotencyKey":"notelett-demo-1","bodyMd":"# Task…","priority":"high","engine":"devin","repo":"saravanakumardb1/learning_ai_notes","baseBranch":"main","autoMerge":false}' + +# Terminal 4 — factory (runs Devin → opens a real PR). NOTE the <90s heartbeat cadence. +cd ~/code/learning_ai_devops_tools/agent-queue && ./agent-queue.sh init +AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 AQ_FLEET_API=http://localhost:4003/api \ +AQ_PRODUCT_ID=notelett AQ_FLEET_TOKEN="$(cat /tmp/factok)" \ +AQ_FACTORY_ID=mac-local-1 AQ_FLEET_CAPS=engine:devin AQ_FLEET_LEASE_RENEW_SEC=60 \ + ./agent-queue.sh run --max 1 +``` + +--- + +## 13. WSL on Windows — differences to note + +The flow is identical **inside a WSL2 (Ubuntu) shell**, with these adjustments. 
Treat +WSL as "the Linux host" — install and run **everything inside WSL**, not Windows. + +- **Keep repos on the WSL filesystem, not `/mnt/c`.** Clone under e.g. `~/code` inside + WSL. On `/mnt/c` (the Windows drive over 9p) `tsx watch`/Next file‑watching is + unreliable (inotify doesn't fire) and git/pnpm are far slower. This is the single most + important difference. +- **Install the toolchain inside WSL** (Linux builds): `node`/`pnpm` (nvm), `git`, **`gh`**, + and the **`devin` CLI** — and run `gh auth login` + Devin auth **inside WSL**. A `gh`/ + `devin` installed on Windows is not visible to the WSL bash factory. +- **Line endings.** Clone inside WSL (don't reuse a Windows checkout with + `core.autocrlf=true`) so the `*.sh` scripts stay LF — CRLF breaks `agent-queue.sh` + (`bad interpreter`/`\r`). If needed: `git config --global core.autocrlf input`. +- **Reaching the UI from the Windows browser.** WSL2 forwards `localhost`, so + `http://localhost:3003` / `:4003` usually work from a Windows browser. If they don't + (older Windows / mirrored‑networking off), use the WSL IP (`hostname -I`) or set + `networkingMode=mirrored` in `.wslconfig`. +- **Ports.** Make sure nothing on the **Windows** side already binds `3003`/`4003` + (WSL2 publishes to the same localhost). Stop the Windows process or change ports. +- **Docker (Option B), if used.** Use **Docker Desktop with the WSL2 backend** and run + `docker compose` from inside the WSL shell. `host.docker.internal` resolves from + containers to the host as on Mac. +- **`/tmp` token paths** (`/tmp/tok`, `/tmp/factok`) are the WSL `/tmp` — fine; just keep + all four terminals in the same WSL distro so they share it. +- **Clock skew.** If WSL's clock drifts after sleep, JWT `iat/exp` checks can fail + (`Invalid or expired token`) — `sudo hwclock -s` (or restart WSL) to resync. + +Everything else — env vars, `pnpm -r build`, `tsx --env-file`, the factory env incl. 
+`AQ_FLEET_LEASE_RENEW_SEC=60`, token minting — is identical to the Mac host path. + +--- + +### Reference + +- Coordinator routes: `services/platform-service/src/modules/fleet/routes.ts` +- Coordinator logic: `services/platform-service/src/modules/fleet/coordinator.ts` +- Factory fleet client: `learning_ai_devops_tools/agent-queue/lib/fleet-client.sh` +- Factory runner + PR mode: `learning_ai_devops_tools/agent-queue/agent-queue.sh` +- Gigafactory spec/roadmap: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/` +- Prometheus scrape config: `services/monitoring/prometheus/prometheus.yml` +- Grafana dashboard: `services/monitoring/grafana/dashboards/fleet-overview.json`