docs(fleet): runbook to run a Devin fleet job end-to-end (local)

Add docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md: how developers and coding agents
spin up platform-service + tracker-web + an agent-queue factory so a submitted
job is claimed and run autonomously by the Devin CLI against a target repo
(worked example: learning_ai_notes), pushing a branch and opening a real PR.

Covers: architecture + lifecycle, prerequisites incl. fresh-machine setup
(clone both repos, .env/Cosmos, pnpm -r build so host-run resolves @bytelyst/*
from dist/), all-localhost (no Docker) path as primary + Docker as the
Grafana/Prometheus option, local JWT minting, job submit, factory launch, observe,
PR-state reconcile, safety/cost, teardown, troubleshooting, and a copy-paste
quickstart.

Calls out two gotchas learned in practice: set AQ_FLEET_LEASE_RENEW_SEC < 90 so
the factory heartbeat beats the coordinator's 90s stale-factory reclaim window
(else a busy single-slot factory's in-flight lease is reclaimed mid-run and the
final report is fenced), and a WSL-on-Windows differences section (run inside
WSL, repos off /mnt/c, LF endings, gh/devin/node in WSL, localhost forwarding).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
saravanakumardb1 2026-06-02 00:50:29 -07:00
parent 6bddc88f0f
commit e6611cae1a

@ -0,0 +1,505 @@
# Runbook: Run a Devin Fleet Job End-to-End (local)
> **Audience:** developers and coding agents.
> **Goal:** stand up `platform-service` + `tracker-web` + a **fleet factory** (the
> `agent-queue` runner) so a submitted job is claimed and executed **autonomously
> by the Devin CLI** against a target repo (worked example: `learning_ai_notes`),
> pushing a branch and opening a **real pull request**.
>
> ⚠️ **This is a real, cost-incurring, side-effecting operation.** The factory runs
> an autonomous coding agent (Devin) that consumes API credits, can run for a long
> time, pushes a branch, and opens a **real PR** on GitHub. Read [§9 Safety &
> cost](#9-safety--cost) before launching. For unattended local prototyping only —
> not a production deployment guide.
---
## 1. Architecture (what talks to what)
```
you ──▶ tracker-web (:3003) ─┐
                             │ REST + SSE (/api/fleet/*)
coding agent / curl ─────────┼─▶ platform-service (:4003) ──▶ Azure Cosmos (jobs/runs/leases/events)
                             │        ▲         ▲
           Prometheus (:9090)┘        │         │ claim / lease-renew / report (Bearer JWT + X-Product-Id)
           Grafana (:3000) ───────────┘         │
                                agent-queue FACTORY (fleet mode) ──▶ Devin CLI ──▶ git push + gh pr create
                             (learning_ai_devops_tools/agent-queue)  (target repo, e.g. learning_ai_notes)
```
- **platform-service** — the fleet **coordinator**. Owns the job lifecycle
(`queued → assigned → building → review → testing → shipped|failed|dead_letter`),
atomic claim, leases, events, budgets, metrics. Code: `services/platform-service/src/modules/fleet/`.
- **tracker-web** (`:3003`) — submit/inspect jobs (`/dashboard/fleet/jobs/...`).
- **factory**: `learning_ai_devops_tools/agent-queue` in **fleet mode**. Polls
`POST /api/fleet/claim`, runs the agent CLI in an isolated checkout, reports back,
and (PR mode) opens the PR.
- **Prometheus/Grafana** — fleet metrics + the "Fleet Overview" dashboard.
Lifecycle the factory drives:
```
queued ─▶ assigned ─▶ building ─▶ review ─▶ testing ─▶ shipped
          (claim)     (agent      (rc=0)    (verify    (manual/auto ship)
                      running)              passed)
   └─ agent rc≠0 / timeout / verify fail ─▶ failed ─▶ (retry|dead_letter)
```
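Read as a transition table, the lifecycle above can be sketched as a small lookup. This is only an illustration: `fleet_next_stages` is a hypothetical helper, not the coordinator's code, and it assumes that "retry" means the job goes back to `queued`.

```bash
# Hypothetical helper: print the stages the diagram allows after a given stage.
# Illustrative only; the authoritative rules live in the coordinator.
fleet_next_stages() {
  case "$1" in
    queued)   echo "assigned" ;;             # a factory claims the job
    assigned) echo "building" ;;             # the agent starts running
    building) echo "review failed" ;;        # rc=0 vs rc!=0 / timeout
    review)   echo "testing failed" ;;       # verify passed vs verify failed
    testing)  echo "shipped" ;;              # manual or auto ship
    failed)   echo "queued dead_letter" ;;   # ASSUMPTION: retry re-queues the job
    shipped|dead_letter) : ;;                # terminal stages: nothing follows
    *) echo "unknown stage: $1" >&2; return 1 ;;
  esac
}
```

For example, `fleet_next_stages building` lists the two outcomes of an agent run, which is a quick way to sanity-check an event stream against the diagram.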
---
## 2. Prerequisites
| Tool | Why | Check |
| ----------------------------------------------- | ---------------------------------------------------------------------------- | -------------------- |
| Node ≥ 20 + `pnpm` (corepack) | host-run service, scripts, tracker-web, build | `node -v && pnpm -v` |
| `git` + `gh` (authenticated) | factory clones target repo, pushes branch, opens PR; `gh pr merge`/reconcile | `gh auth status` |
| `devin` CLI (authenticated) | the agent the factory runs | `devin --version` |
| Both repos cloned side-by-side                  | coordinator/dashboards + the factory                                         | see below            |
| repo `.env` (root of `learning_ai_common_plat`) | `JWT_SECRET`, Cosmos creds, `FLEET_METRICS_TOKEN` | `test -f .env` |
| Docker | **optional** — only for the Docker path (§3 Option B) / Grafana+Prometheus | `docker info` |
> **Node version:** the Docker image pins **node 22**; for the host path any **Node ≥ 20**
> works. Use one Node (nvm/asdf) for both repos to avoid native-module surprises.
### 2.1 First-time setup (fresh machine)
Clone both repos as **siblings** (the factory clones targets relative to a shared parent):
```bash
mkdir -p ~/code && cd ~/code
git clone <host>/learning_ai_common_plat.git
git clone <host>/learning_ai_devops_tools.git # contains agent-queue (the factory)
```
Create and fill `.env` at the **root of `learning_ai_common_plat`**:
```bash
cd ~/code/learning_ai_common_plat
cp .env.example .env
# then edit .env — minimum for the fleet flow:
# JWT_SECRET=<any strong secret; tokens are minted+verified with THIS value>
# FLEET_METRICS_TOKEN=changeme-fleet-metrics-token # only needed for Prometheus
# COSMOS_* / connection vars -> see note below
```
- `JWT_SECRET` — HS256 secret platform-service verifies tokens with. Any strong value;
it only needs to be **internally consistent on this machine** (the token you mint in
§5 and the running service must share it). **Required.**
- **Cosmos** — the default prototype talks to a **real Azure Cosmos account** (no emulator
in the default compose). On a new machine you must either (a) point `.env` at the **same
Cosmos account** (to see/share existing jobs) or (b) point at your own DB and set
`COSMOS_AUTO_INIT=true` so containers are created on boot. Without valid Cosmos creds the
service starts but every fleet call fails.
- `FLEET_METRICS_TOKEN` — only needed if you run Prometheus (§4); must match
`services/monitoring/prometheus/prometheus.yml` (`credentials:`).
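A quick preflight can confirm the required keys are actually filled in before anything is started. The sketch below is an assumption-laden convenience, not part of the repo: `check_env_keys` is a hypothetical helper, and it only checks for `KEY=<non-empty>` lines without validating the values.

```bash
# check_env_keys FILE KEY...: succeed only if every KEY has a non-empty value in FILE.
# Hypothetical helper; it does not validate the values themselves.
check_env_keys() (
  file=$1; shift
  missing=0
  for key in "$@"; do
    # A key counts as set when a line reads KEY=<at least one character>.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)
```

From the repo root, `check_env_keys .env JWT_SECRET` covers the one key that is always required; add `FLEET_METRICS_TOKEN` and your Cosmos variable names as they apply to your setup.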
### 2.2 Install + build the workspace (required for the host path)
Host-run resolves `@bytelyst/*` workspace packages from their **`dist/`** (the `exports`
field points at `dist`), so you must build them once before `tsx`/Next can import them:
```bash
cd ~/code/learning_ai_common_plat
pnpm install
pnpm -r build # builds all workspace packages (incl. @bytelyst/* → dist/)
# (faster, just the platform-service closure:)
# pnpm -r --filter @lysnrai/platform-service... build
```
> Skipping this is the #1 fresh-machine failure: `tsx watch` crashes with
> `Cannot find module '@bytelyst/...'/dist/index.js`. Re-run `pnpm -r build` after pulling
> changes to shared packages.
---
## 3. Bring up platform-service + tracker-web
Two ways. **Option A (all localhost, no Docker)** is recommended for a single dev Mac /
WSL box — everything runs on the host, so `gh`-backed features work out of the box.
**Option B (Docker)** is for when you also want the Grafana/Prometheus stack.
### Option A — all localhost, no Docker (recommended)
Two long-lived processes, each in its own terminal. Both assume §2.1/§2.2 are done
(`.env` filled, `pnpm -r build` run).
**Terminal 1 — coordinator (platform-service, :4003):**
```bash
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts
```
`tsx watch` hot-reloads on source changes. Use the explicit `--env-file=../../.env`
(the bare `pnpm dev` script does **not** load the root `.env`, so `JWT_SECRET`/Cosmos
would be missing). `FLEET_METRICS_TOKEN` is already in `.env` if you set it in §2.1.
**Terminal 2 — dashboard (tracker-web, :3003):**
```bash
cd ~/code/learning_ai_common_plat/dashboards/tracker-web
pnpm dev # serves http://localhost:3003 (proxies /api → :4003)
```
That's the whole coordinator + UI. **Monitoring (Grafana/Prometheus) is optional** on
the host path — `GET /api/fleet/metrics` (JSON), `GET /api/fleet/autoscale`, and the
tracker-web job pages cover observability without it. To get the Grafana "Fleet Overview"
dashboard you need Prometheus + Grafana (run them via Docker — Option B — or Homebrew
binaries pointed at `services/monitoring/...`).
Because everything is on the host, `gh` is on `PATH` → the PR-state **reconcile** (§8)
and ship-time `gh pr merge` work (unlike the Docker container, which has no `gh`).
Health checks:
```bash
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4003/health # 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3003 # 200
```
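Both services take a moment to start, so a one-shot `curl` can report a false failure. A minimal retry wrapper, assuming `curl` is installed, might look like the following; `retry` and `is_200` are hypothetical helper names, not part of either repo.

```bash
# retry TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts.
retry() (
  tries=$1; delay=$2; shift 2
  i=1
  while :; do
    "$@" && exit 0                    # success: leave the subshell with status 0
    [ "$i" -ge "$tries" ] && exit 1   # out of attempts
    i=$((i + 1))
    sleep "$delay"
  done
)

# is_200 URL: succeed when the URL answers with HTTP 200.
is_200() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' "$1")" = "200" ]
}
```

For instance, `retry 30 1 is_200 http://localhost:4003/health` waits up to roughly 30 seconds for the coordinator before giving up.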
### Option B — Docker (adds Grafana + Prometheus)
```bash
cd ~/code/learning_ai_common_plat
# targeted fleet subset that always builds cleanly:
docker compose up -d --build platform-service prometheus grafana
# (full stack: bash scripts/prototype-up.sh)
```
Starts `platform-service` (`:4003`), `prometheus` (`:9090`), `grafana` (`:3000`,
admin/`lysnrai`) + deps. Still run **tracker-web from source** (Option A, Terminal 2).
> **Docker caveats:**
>
> - `prototype-up.sh` may fail building the **dashboard** images when
> `corepack prepare pnpm@…` can't fetch pnpm on a restricted network → use the targeted
> subset above.
> - **`gh` is NOT in the container** → coordinator-side `gh pr merge` and PR-reconcile (§8)
> are no-ops in Docker. Use the host path (Option A) if you need them.
> - Don't run both: the container and a host `tsx` both bind `:4003`
> (`docker compose stop platform-service` before host-running).
---
## 4. Make Prometheus auth work (only if running Prometheus)
Skip this on the host path unless you also run Prometheus. `prometheus.yml` scrapes
`/api/fleet/metrics/prom` with a bearer, so the running `platform-service` must see the
same `FLEET_METRICS_TOKEN`:
```bash
cd ~/code/learning_ai_common_plat
grep -q '^FLEET_METRICS_TOKEN=' .env || \
printf '\nFLEET_METRICS_TOKEN=changeme-fleet-metrics-token\n' >> .env
# host path: restart Terminal-1 tsx so it re-reads .env
# docker path: docker compose up -d platform-service
```
Verify (if Prometheus is up): at `http://localhost:9090/api/v1/targets`, the
`platform-service-fleet` target should be `up`. The token value must equal `credentials:` in
`services/monitoring/prometheus/prometheus.yml`.
---
## 5. Mint a local API token (dev only)
`platform-service` verifies HS256 JWTs signed with `JWT_SECRET` and requires
`type: "access"`. The tracker-web UI obtains one via login; for scripts/agents and
the factory, mint one directly. **Local dev only — never commit tokens or the secret.**
Save `mint-token.mjs` (resolve `jose` from the workspace):
```js
import { readFileSync } from 'node:fs';
// adjust the jose path to your checkout if needed:
import { SignJWT } from '/ABS/PATH/learning_ai_common_plat/node_modules/.pnpm/jose@5.10.0/node_modules/jose/dist/node/esm/index.js';
const env = readFileSync('/ABS/PATH/learning_ai_common_plat/.env', 'utf8');
const secret = new TextEncoder().encode(env.match(/^JWT_SECRET=(.*)$/m)[1].trim());
const ttl = process.argv[2] || '15m'; // e.g. '15m' for scripts, '24h' for a factory
process.stdout.write(
await new SignJWT({ sub: 'local-dev', role: 'admin', type: 'access' })
.setProtectedHeader({ alg: 'HS256' })
.setIssuedAt()
.setExpirationTime(ttl)
.sign(secret)
);
```
```bash
node mint-token.mjs 15m > /tmp/tok # short-lived, for API calls
node mint-token.mjs 24h > /tmp/factok # longer-lived, for the factory daemon
```
> Find the jose path with:
> `find . -path '*/node_modules/jose/dist/*/esm/index.js' | head -1`.
Requests must also carry the **product**: header `X-Product-Id: <productId>`
(e.g. `notelett`). `role: admin` bypasses tenant ownership checks when
`FLEET_TENANT_ENFORCEMENT` is on (it's off by default).
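When a request is rejected, it helps to look at what a minted token actually claims (`type`, `role`, `exp`). The sketch below decodes the payload segment of a JWT without verifying the signature. `jwt_payload` is a hypothetical helper, and it assumes a `base64` that accepts `-d` (GNU coreutils and recent macOS; older macOS uses `-D`).

```bash
# jwt_payload TOKEN: print the decoded JSON payload of a JWT.
# Debugging aid only: it does NOT verify the signature.
jwt_payload() (
  # The second dot-separated segment is the payload, base64url-encoded without padding.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url strips.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
)
```

Running `jwt_payload "$(cat /tmp/tok)"` should show `"type":"access"` along with the `iat`/`exp` timestamps set by the mint script.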
---
## 6. Submit a job
### Via tracker-web (preferred)
Open `http://localhost:3003/dashboard/fleet/jobs` and choose "New job". **Set the correct
product first** (the product selector) — a job is partitioned by `productId`, and
submitting under the wrong product misattributes cost/metrics/ownership and the
factory won't see it under the product it polls.
PR-mode fields that matter:
- **`repo`** — must be `owner/name` (e.g. `saravanakumardb1/learning_ai_notes`) or a
clone URL, **not** a bare name (the factory feeds it to `gh`).
- **`baseBranch`** — e.g. `main`.
- **`engine`** — `devin` (pins the agent; otherwise the factory's default/engineClass).
- **`autoMerge`** — leave **`false`** for a human merge gate (recommended for large PRs).
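Because a bare repo name only fails later, at `gh pr create` time, it can be worth checking the `repo` value before submitting. The pattern match below is a rough sketch: `is_valid_fleet_repo` is a hypothetical helper, and the set of clone-URL forms it accepts is an assumption rather than what the factory enforces.

```bash
# is_valid_fleet_repo VALUE: accept owner/name or a clone URL, reject a bare name.
# ASSUMPTION: https://, ssh:// and git@host:owner/name are the URL forms of interest.
is_valid_fleet_repo() {
  case "$1" in
    https://*|ssh://*|git@*:*/*) return 0 ;;   # clone URL forms
    */*/*|/*|*/|"") return 1 ;;                # extra segments or an empty side
    */*) return 0 ;;                           # owner/name
    *) return 1 ;;                             # bare name
  esac
}
```

`is_valid_fleet_repo saravanakumardb1/learning_ai_notes` succeeds, while `is_valid_fleet_repo learning_ai_notes` fails.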
### Via API
```bash
JOB=$(curl -s -X POST http://localhost:4003/api/fleet/jobs \
-H "Authorization: Bearer $(cat /tmp/tok)" \
-H "X-Product-Id: notelett" -H 'Content-Type: application/json' \
-d '{
"idempotencyKey": "notelett-demo-1",
"bodyMd": "# Task\n…full prompt…",
"priority": "high",
"engine": "devin",
"repo": "saravanakumardb1/learning_ai_notes",
"baseBranch": "main",
"autoMerge": false
}')
echo "$JOB" # → { outcome: "created", job: { id: "fjob_…", stage: "queued", ... } }
```
The job is now `queued` and claimable. It will **not run** until a factory polls
for its product (next step).
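Later steps need the job id from that response. If `jq` is not available, a text-based extraction is enough for local use. The sketch below assumes the id keeps the `fjob_` prefix shown in the example output; `job_id_from_response` is a hypothetical helper.

```bash
# job_id_from_response JSON: print the first "id" value that starts with fjob_.
# ASSUMPTION: job ids keep the fjob_ prefix shown in the example response.
job_id_from_response() {
  printf '%s' "$1" \
    | grep -Eo '"id"[[:space:]]*:[[:space:]]*"fjob_[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\(fjob_[^"]*\)"$/\1/'
}
```

With the submit command above, `JOB_ID=$(job_id_from_response "$JOB")` gives a value to use in the `/api/fleet/jobs/<jobId>/...` calls.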
---
## 7. Start the factory (agent-queue, fleet mode)
The factory lives in a **separate repo**: `learning_ai_devops_tools/agent-queue`.
Run it on the **host** (needs `devin` + `gh`). Read its `docs/RUN_POLICY.md` first.
### 7a. Sanity-check connectivity (safe: registers + heartbeats only)
```bash
cd ~/code/learning_ai_devops_tools/agent-queue
./agent-queue.sh init # idempotent
AQ_FLEET=1 AQ_FLEET_ROUTE=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
./agent-queue.sh fleet-status # → "heartbeat OK (registered)."
```
### 7b. Launch the run loop (claims + runs the agent)
```bash
cd ~/code/learning_ai_devops_tools/agent-queue
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
./agent-queue.sh run --max 1
```
> ⚠️ **Set `AQ_FLEET_LEASE_RENEW_SEC` below 90 (e.g. 60).** This is the heartbeat/
> lease-renew cadence. The coordinator's reaper marks a factory **stale after 90s**
> (`DEFAULT_STALE_FACTORY_MS`, a constant with no env knob) and **reclaims its in-flight
> lease**. The default cadence is **300s**, so a busy single-slot factory looks stale
> for most of every cycle and its running job gets requeued mid-run (`leaseEpoch`
> climbs, stage flips back to `queued`, and the final report is **fenced**, so the job
> never advances to `review`/`shipped`). 60s keeps it comfortably live. (The §7a
> `fleet-status` check above sets the same value for consistency.)
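The rule is simple arithmetic (renew cadence must be strictly below the 90s stale window), so it is easy to guard against before launching. A minimal sketch follows; `check_lease_renew` is a hypothetical helper, and the 90 and 300 values are taken from the warning above.

```bash
STALE_WINDOW_SEC=90   # coordinator's stale-factory window, per the warning above

# check_lease_renew [SECONDS]: succeed only if the cadence beats the stale window.
# An empty or missing argument falls back to the 300s default cadence.
check_lease_renew() {
  renew=${1:-300}
  if [ "$renew" -lt "$STALE_WINDOW_SEC" ]; then
    echo "ok: renew every ${renew}s < ${STALE_WINDOW_SEC}s stale window"
  else
    echo "unsafe: renew every ${renew}s >= ${STALE_WINDOW_SEC}s stale window" >&2
    return 1
  fi
}
```

Running `check_lease_renew "$AQ_FLEET_LEASE_RENEW_SEC" || exit 1` just before `./agent-queue.sh run` stops a launch that would otherwise be reclaimed mid-run.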
Key fleet env vars (see `lib/fleet-client.sh`):
| Var | Meaning |
| -------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `AQ_FLEET=1` | master switch — enable coordinator calls (0 = pure offline) |
| `AQ_FLEET_ROUTE=1` | coordinator is **authoritative** for claim (pulls work from platform-service) |
| `AQ_FLEET_PR=1` | PR mode — open a PR for jobs that target a `repo` |
| `AQ_FLEET_API` | base URL **including `/api`** (`http://localhost:4003/api`) |
| `AQ_FLEET_TOKEN` | **Bearer JWT** (mint per §5; ≥ run duration, e.g. 24h) |
| `AQ_PRODUCT_ID` | product to poll — sent as `X-Product-Id` (must match the job's product) |
| `AQ_FACTORY_ID` | this factory's id (registered/heartbeated) |
| `AQ_FLEET_CAPS` | advertised capabilities, e.g. `engine:devin` |
| `AQ_FLEET_LEASE_RENEW_SEC` | **set `<90`** (e.g. `60`) — heartbeat/renew cadence vs the 90s stale window (see warning) |
| `AQ_FLEET_REPO_BASE` | _(optional)_ dir of local checkouts; if `…/<repo>/.git` exists it uses a **git worktree**, else it `git clone`s `https://github.com/<repo>.git` into its cache |
| `AQ_FLEET_AUTOSHIP=1` | _(optional)_ auto-advance to `shipped` (skips the manual gate) |
The run loop claims a job (`claim → assigned → building`), runs Devin in an isolated checkout,
heartbeats + renews the lease (`lease_renewed` events) so the reaper doesn't reclaim it,
then on agent exit moves to `review` and (PR mode) opens the PR. With `autoMerge:false`
it **stops at the human merge gate**.
> **Repo checkout:** the job's `repo` is `owner/name`, so by default the factory
> `git clone`s `https://github.com/<owner>/<name>.git` into its own cache
> (`queue/.state/repos/…`) — clean isolation, nothing touches your working copies. To
> reuse an existing local clone via a **git worktree** instead, set
> `AQ_FLEET_REPO_BASE=<parent>` where `<parent>/<owner>/<name>/.git` exists.
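The worktree-versus-clone choice described above comes down to one filesystem check. The sketch below mirrors that description so you can predict what the factory will do for a given `repo`. `fleet_checkout_plan` is a hypothetical helper written for this runbook, not the factory's actual code.

```bash
# fleet_checkout_plan REPO [BASE]: print how a job's repo would be checked out,
# following the rule described above. REPO is owner/name; BASE is the optional
# AQ_FLEET_REPO_BASE directory.
fleet_checkout_plan() {
  repo=$1
  base=${2:-}
  if [ -n "$base" ] && [ -e "$base/$repo/.git" ]; then
    echo "worktree $base/$repo"                # reuse the existing local clone
  else
    echo "clone https://github.com/$repo.git"  # fresh clone into the factory cache
  fi
}
```

For example, `fleet_checkout_plan saravanakumardb1/learning_ai_notes "$AQ_FLEET_REPO_BASE"` reports `clone ...` unless a matching checkout exists under the base directory.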
---
## 8. Observe progress
- **tracker-web:** `http://localhost:3003/dashboard/fleet/jobs/<jobId>` — live event
stream (SSE), runs, PR link + state.
- **Events/API:**
```bash
curl -s http://localhost:4003/api/fleet/jobs/<jobId>/events \
-H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett"
```
- **Metrics:** `GET /api/fleet/metrics` (JSON, per product) · `GET /api/fleet/metrics/prom`
(Prometheus, all products; needs `FLEET_METRICS_TOKEN`) · Grafana **Fleet Overview**
(`http://localhost:3000/d/fleet-overview`).
- **Autoscale signal:** `GET /api/fleet/autoscale` (this product) / `…/autoscale/all`.
### PR-state reconcile (externally-merged PRs)
If you merge the PR in the GitHub UI, the coordinator doesn't know until told. Trigger
a reconcile (flips run `prState → merged` when `gh pr view` reports MERGED):
- UI: **"Refresh PR status"** button on the job's PR section, or
- API: `POST /api/fleet/jobs/<jobId>/pr/reconcile`.
> Requires `gh` where platform-service runs → use the **host path** (§3 Option A);
> it's a no-op in the Docker container (no `gh`).
---
## 9. Safety & cost
- **Billable + autonomous + long-running.** Each run consumes Devin credits and can
run for a long time unattended. Scope jobs deliberately; very large multi-workstream
specs are better split into several jobs.
- **Real PR.** PR mode pushes a branch and opens a PR on the target repo. Keep
`autoMerge:false` so a human reviews/merges; `gh pr merge` (auto) only fires when the
job opts in or `FLEET_SHIP_MERGES_PR=1`.
- **Isolation.** The factory works in an isolated worktree/clone, never your main
checkout (per `agent-queue/docs/RUN_POLICY.md`). Avoid blanket `--yolo` on live trees.
- **Stopping the daemon** mid-run lets the lease expire; the coordinator's reaper then
reclaims and requeues the job (so partial work may be retried). Stop intentionally.
- **Tokens/secrets:** the minted JWT and `JWT_SECRET` are sensitive — never commit them
or paste into shared logs. `.env` is gitignored; keep it that way.
---
## 10. Teardown
```bash
# stop the factory: Ctrl-C the run loop
# host path: Ctrl-C the tsx (Terminal 1) and pnpm dev (Terminal 2)
# docker path:
# cd ~/code/learning_ai_common_plat && docker compose down # keep volumes
# docker compose down -v # also drop volumes
rm -f /tmp/tok /tmp/factok # discard minted tokens
```
---
## 11. Troubleshooting
| Symptom | Cause → Fix |
| -------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Cannot find module '@bytelyst/…/dist/index.js'` on `tsx`/Next start | workspace packages not built → `pnpm -r build` (§2.2). |
| `401 {"error":"Invalid or expired token"}`                                                                                                               | JWT expired/mis-signed → re-mint (§5); ensure same `JWT_SECRET` as the running service.                                                                                                                             |
| Job claimed then flips back to `queued` mid-run; `leaseEpoch` keeps climbing; final report **fenced**; PR opens but job never reaches `review`/`shipped` | factory heartbeat cadence (`AQ_FLEET_LEASE_RENEW_SEC`, default **300s**) > reaper stale window (**90s**) → set `AQ_FLEET_LEASE_RENEW_SEC=60` (§7). To recover the record after the fact, reconcile PR state (§8). |
| Job stays `queued`, never claimed                                                                                                                        | No factory for that product → `fleet-status` shows it registered? `AQ_PRODUCT_ID` must equal the job's product. Check `GET /api/fleet/factories` (X-Product-Id) for `0 live`.                                      |
| `POST …/pr/reconcile` or ship auto-merge does nothing                                                                                                     | `gh` not present where platform-service runs (Docker container) → run the host path (§3 Option A).                                                                                                                |
| Prometheus target `platform-service-fleet` = `down (401)` | service missing `FLEET_METRICS_TOKEN` → §4 (restart host `tsx` / recreate container). |
| `prototype-up.sh` build fails on `corepack prepare pnpm` | dashboard image network issue → use the targeted subset, or just use the host path (Option A). |
| `POST …/actions/<x>` returns 500 "Body cannot be empty" | sent `Content-Type: application/json` with no body → omit the header or send `{}`. |
| Port `4003` conflict | host `tsx watch` and a `platform-service` container both bind `4003` → run only one. |
| `gh pr create` fails | `repo` is a bare name → must be `owner/name` or a clone URL; confirm `gh auth status`. |
| PR/cost attributed to wrong product | job submitted under the wrong `productId` partition → resubmit under the right product and cancel the stray (`POST …/actions/cancel`). |
| `vitest` exits non-zero with `kill EPERM` after all suites pass                                                                                           | worker-pool teardown artifact (sandbox), not a test failure → re-run; all suites already passed.                                                                                                                    |
---
## 12. Copy-paste quickstart: all localhost (notelett → learning_ai_notes)
Assumes §2.1/§2.2 done (`.env` filled, `pnpm -r build` run). Four terminals.
```bash
# Terminal 1 — coordinator
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts
# Terminal 2 — dashboard
cd ~/code/learning_ai_common_plat/dashboards/tracker-web && pnpm dev # :3003
# Terminal 3 — tokens + submit (save mint-token.mjs from §5; fix ABS paths)
node mint-token.mjs 15m > /tmp/tok
node mint-token.mjs 24h > /tmp/factok
curl -s -X POST http://localhost:4003/api/fleet/jobs \
-H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett" \
-H 'Content-Type: application/json' \
-d '{"idempotencyKey":"notelett-demo-1","bodyMd":"# Task…","priority":"high","engine":"devin","repo":"saravanakumardb1/learning_ai_notes","baseBranch":"main","autoMerge":false}'
# Terminal 4 — factory (runs Devin → opens a real PR). NOTE the <90s heartbeat cadence.
cd ~/code/learning_ai_devops_tools/agent-queue && ./agent-queue.sh init
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 AQ_FLEET_CAPS=engine:devin AQ_FLEET_LEASE_RENEW_SEC=60 \
./agent-queue.sh run --max 1
```
---
## 13. WSL on Windows — differences to note
The flow is identical **inside a WSL2 (Ubuntu) shell**, with these adjustments. Treat
WSL as "the Linux host" — install and run **everything inside WSL**, not Windows.
- **Keep repos on the WSL filesystem, not `/mnt/c`.** Clone under e.g. `~/code` inside
WSL. On `/mnt/c` (the Windows drive over 9p) `tsx watch`/Next file-watching is
unreliable (inotify doesn't fire) and git/pnpm are far slower. This is the single most
important difference.
- **Install the toolchain inside WSL** (Linux builds): `node`/`pnpm` (nvm), `git`, **`gh`**,
and the **`devin` CLI** — and run `gh auth login` + Devin auth **inside WSL**. A `gh`/
`devin` installed on Windows is not visible to the WSL bash factory.
- **Line endings.** Clone inside WSL (don't reuse a Windows checkout with
`core.autocrlf=true`) so the `*.sh` scripts stay LF — CRLF breaks `agent-queue.sh`
(`bad interpreter`/`\r`). If needed: `git config --global core.autocrlf input`.
- **Reaching the UI from the Windows browser.** WSL2 forwards `localhost`, so
`http://localhost:3003` / `:4003` usually work from a Windows browser. If they don't
(older Windows / mirrored-networking off), use the WSL IP (`hostname -I`) or set
`networkingMode=mirrored` in `.wslconfig`.
- **Ports.** Make sure nothing on the **Windows** side already binds `3003`/`4003`
(WSL2 publishes to the same localhost). Stop the Windows process or change ports.
- **Docker (Option B), if used.** Use **Docker Desktop with the WSL2 backend** and run
`docker compose` from inside the WSL shell. `host.docker.internal` resolves from
containers to the host as on Mac.
- **`/tmp` token paths** (`/tmp/tok`, `/tmp/factok`) are the WSL `/tmp` — fine; just keep
all four terminals in the same WSL distro so they share it.
- **Clock skew.** If WSL's clock drifts after sleep, JWT `iat/exp` checks can fail
(`Invalid or expired token`) — `sudo hwclock -s` (or restart WSL) to resync.
Everything else — env vars, `pnpm -r build`, `tsx --env-file`, the factory env incl.
`AQ_FLEET_LEASE_RENEW_SEC=60`, token minting — is identical to the Mac host path.
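Two of the WSL pitfalls above (repos on `/mnt/c`, CRLF line endings) are easy to detect mechanically. The helpers below are a small sketch with hypothetical names. They only inspect a path string and a file's bytes, so they are safe to run anywhere.

```bash
# on_windows_mount PATH: succeed when PATH lives under /mnt/<drive-letter>,
# i.e. on the Windows drive rather than the WSL filesystem.
on_windows_mount() {
  case "$1" in
    /mnt/[a-z]|/mnt/[a-z]/*) return 0 ;;
    *) return 1 ;;
  esac
}

# has_crlf FILE: succeed when FILE contains a carriage return (CRLF endings).
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}
```

From inside the factory checkout, `on_windows_mount "$PWD" && echo "move this repo off /mnt"` and `has_crlf agent-queue.sh && echo "re-clone with LF endings"` flag both problems before the first run.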
---
### Reference
- Coordinator routes: `services/platform-service/src/modules/fleet/routes.ts`
- Coordinator logic: `services/platform-service/src/modules/fleet/coordinator.ts`
- Factory fleet client: `learning_ai_devops_tools/agent-queue/lib/fleet-client.sh`
- Factory runner + PR mode: `learning_ai_devops_tools/agent-queue/agent-queue.sh`
- Gigafactory spec/roadmap: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/`
- Prometheus scrape config: `services/monitoring/prometheus/prometheus.yml`
- Grafana dashboard: `services/monitoring/grafana/dashboards/fleet-overview.json`