docs(fleet): runbook to run a Devin fleet job end-to-end (local)

Add docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md: how developers and coding agents spin up platform-service + tracker-web + an agent-queue factory so a submitted job is claimed and run autonomously by the Devin CLI against a target repo (worked example: learning_ai_notes), pushing a branch and opening a real PR.

Covers: architecture + lifecycle, prerequisites incl. fresh-machine setup (clone both repos, .env/Cosmos, pnpm -r build so host-run resolves @bytelyst/* from dist/), all-localhost (no Docker) path as primary + Docker as the Grafana/Prometheus option, local JWT minting, job submit, factory launch, observe, PR-state reconcile, safety/cost, teardown, troubleshooting, and a copy-paste quickstart.

Calls out two gotchas learned in practice:
- set AQ_FLEET_LEASE_RENEW_SEC < 90 so the factory heartbeat beats the coordinator's 90s stale-factory reclaim window (else a busy single-slot factory's in-flight lease is reclaimed mid-run and the final report is fenced)
- a WSL-on-Windows differences section (run inside WSL, repos off /mnt/c, LF endings, gh/devin/node in WSL, localhost forwarding)

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

parent 6bddc88f0f
commit e6611cae1a

docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md (new file, 505 lines)

# Runbook — Run a Devin Fleet Job End‑to‑End (local)

> **Audience:** developers and coding agents.
> **Goal:** stand up `platform-service` + `tracker-web` + a **fleet factory** (the
> `agent-queue` runner) so a submitted job is claimed and executed **autonomously
> by the Devin CLI** against a target repo (worked example: `learning_ai_notes`),
> pushing a branch and opening a **real pull request**.

> ⚠️ **This is a real, cost‑incurring, side‑effecting operation.** The factory runs
> an autonomous coding agent (Devin) that consumes API credits, can run for a long
> time, pushes a branch, and opens a **real PR** on GitHub. Read [§9 Safety &
> cost](#9-safety--cost) before launching. For unattended local prototyping only —
> not a production deployment guide.

---

## 1. Architecture (what talks to what)

```
you ──▶ tracker-web (:3003) ─┐
                             │  REST + SSE (/api/fleet/*)
coding agent / curl ─────────┼─▶ platform-service (:4003) ──▶ Azure Cosmos (jobs/runs/leases/events)
                             │      ▲       ▲
           Prometheus (:9090)┘      │       │ claim / lease-renew / report (Bearer JWT + X-Product-Id)
Grafana (:3000) ────────────────────┘       │
                                            │
                            agent-queue FACTORY (fleet mode) ──▶ Devin CLI ──▶ git push + gh pr create
                            (learning_ai_devops_tools/agent-queue)             (target repo, e.g. learning_ai_notes)
```

- **platform-service** — the fleet **coordinator**. Owns the job lifecycle
  (`queued → assigned → building → review → testing → shipped|failed|dead_letter`),
  atomic claim, leases, events, budgets, metrics. Code: `services/platform-service/src/modules/fleet/`.
- **tracker-web** (`:3003`) — submit/inspect jobs (`/dashboard/fleet/jobs/...`).
- **factory** — `learning_ai_devops_tools/agent-queue` in **fleet mode**. Polls
  `POST /api/fleet/claim`, runs the agent CLI in an isolated checkout, reports back,
  and (PR mode) opens the PR.
- **Prometheus/Grafana** — fleet metrics + the "Fleet Overview" dashboard.

Lifecycle the factory drives:

```
queued ─▶ assigned ─▶ building ─▶ review ─▶ testing ─▶ shipped
          (claim)     (agent      (rc=0)    (verify    (manual/auto ship)
                      running)              passed)
             └─ agent rc≠0 / timeout / verify fail ─▶ failed ─▶ (retry|dead_letter)
```
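
The lifecycle diagram can also be read as a small transition table. The sketch below is illustrative only: the real rules live in the coordinator (`coordinator.ts`), the names `TRANSITIONS`, `canTransition`, and `isTerminal` are hypothetical, and two details are assumptions rather than facts from the diagram: which stage each failure arrow leaves from, and that a retry sends a `failed` job back to `queued`. The lease-reclaim requeue described in §7 is not modeled.

```javascript
// Stage transitions as read off the lifecycle diagram (hypothetical helper,
// not the coordinator's implementation).
const TRANSITIONS = {
  queued: ['assigned'],              // claim
  assigned: ['building'],            // agent starts running
  building: ['review', 'failed'],    // rc=0 -> review; rc != 0 or timeout -> failed (assumed source stage)
  review: ['testing', 'failed'],     // verify passed -> testing; verify fail -> failed (assumed source stage)
  testing: ['shipped'],              // manual or auto ship
  failed: ['queued', 'dead_letter'], // retry (assumed to requeue) or give up
  shipped: [],
  dead_letter: [],
};

// True when the diagram draws an arrow from `from` to `to`.
function canTransition(from, to) {
  return Object.hasOwn(TRANSITIONS, from) && TRANSITIONS[from].includes(to);
}

// True for known stages that have no outgoing arrows.
function isTerminal(stage) {
  return Object.hasOwn(TRANSITIONS, stage) && TRANSITIONS[stage].length === 0;
}

console.log(canTransition('building', 'review')); // true
console.log(canTransition('queued', 'shipped'));  // false
console.log(isTerminal('dead_letter'));           // true
```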

---

## 2. Prerequisites

| Tool | Why | Check |
| --- | --- | --- |
| Node ≥ 20 + `pnpm` (corepack) | host-run service, scripts, tracker-web, build | `node -v && pnpm -v` |
| `git` + `gh` (authenticated) | factory clones target repo, pushes branch, opens PR; `gh pr merge`/reconcile | `gh auth status` |
| `devin` CLI (authenticated) | the agent the factory runs | `devin --version` |
| Both repos cloned side‑by‑side | coordinator/dashboards + the factory | see below |
| repo `.env` (root of `learning_ai_common_plat`) | `JWT_SECRET`, Cosmos creds, `FLEET_METRICS_TOKEN` | `test -f .env` |
| Docker | **optional** — only for the Docker path (§3 Option B) / Grafana+Prometheus | `docker info` |

> **Node version:** the Docker image pins **node 22**; for the host path any **Node ≥ 20**
> works, with one caveat: the `--env-file` flag used in §3 was added in Node 20.6, so use
> 20.6 or newer. Use one Node version (nvm/asdf) for both repos to avoid native-module surprises.

### 2.1 First‑time setup (fresh machine)

Clone both repos as **siblings** (the factory clones targets relative to a shared parent):

```bash
mkdir -p ~/code && cd ~/code
git clone <host>/learning_ai_common_plat.git
git clone <host>/learning_ai_devops_tools.git   # contains agent-queue (the factory)
```

Create and fill `.env` at the **root of `learning_ai_common_plat`**:

```bash
cd ~/code/learning_ai_common_plat
cp .env.example .env
# then edit .env — minimum for the fleet flow:
#   JWT_SECRET=<any strong secret; tokens are minted+verified with THIS value>
#   FLEET_METRICS_TOKEN=changeme-fleet-metrics-token   # only needed for Prometheus
#   COSMOS_* / connection vars -> see note below
```

- `JWT_SECRET` — HS256 secret platform-service verifies tokens with. Any strong value;
  it only needs to be **internally consistent on this machine** (the token you mint in
  §5 and the running service must share it). **Required.**
- **Cosmos** — the default prototype talks to a **real Azure Cosmos account** (no emulator
  in the default compose). On a new machine you must either (a) point `.env` at the **same
  Cosmos account** (to see/share existing jobs) or (b) point at your own DB and set
  `COSMOS_AUTO_INIT=true` so containers are created on boot. Without valid Cosmos creds the
  service starts but every fleet call fails.
- `FLEET_METRICS_TOKEN` — only needed if you run Prometheus (§4); must match
  `services/monitoring/prometheus/prometheus.yml` (`credentials:`).
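
A quick pre-flight over `.env` can catch a missing `JWT_SECRET` before the service starts. The following is a minimal sketch, not part of either repo: `parseEnv` and `checkFleetEnv` are hypothetical helpers, the parser handles only plain `KEY=VALUE` lines, and because the runbook does not list the exact Cosmos variable names, `hasCosmos` only reports whether any `COSMOS_`-prefixed key is present. It cannot tell whether the credentials are valid.

```javascript
// Parse simple KEY=VALUE lines; blank lines and # comments are skipped.
// Matching surrounding quotes are stripped from values.
function parseEnv(text) {
  const out = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (!line || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq <= 0) continue; // no key before '=' (or no '=' at all)
    const key = line.slice(0, eq).trim();
    out[key] = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, '$2');
  }
  return out;
}

// Report which keys this runbook marks as required are missing or empty.
// FLEET_METRICS_TOKEN is only required when Prometheus is in use (§4).
function checkFleetEnv(text, { withPrometheus = false } = {}) {
  const env = parseEnv(text);
  const required = ['JWT_SECRET'];
  if (withPrometheus) required.push('FLEET_METRICS_TOKEN');
  return {
    missing: required.filter((key) => !env[key]),
    // Presence check only: any COSMOS_* key counts (assumption, see lead-in).
    hasCosmos: Object.keys(env).some((key) => key.startsWith('COSMOS_')),
  };
}

// Example with inline text; in practice pass readFileSync('.env', 'utf8').
const sample = '# local dev\nJWT_SECRET="s3cret"\nCOSMOS_AUTO_INIT=true\n';
console.log(checkFleetEnv(sample, { withPrometheus: true }));
// { missing: [ 'FLEET_METRICS_TOKEN' ], hasCosmos: true }
```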

### 2.2 Install + build the workspace (required for the host path)

Host-run resolves `@bytelyst/*` workspace packages from their **`dist/`** (the `exports`
field points at `dist`), so you must build them once before `tsx`/Next can import them:

```bash
cd ~/code/learning_ai_common_plat
pnpm install
pnpm -r build                     # builds all workspace packages (incl. @bytelyst/* → dist/)
# (faster, just the platform-service closure; the trailing "..." is pnpm filter
#  syntax for "this package plus its dependencies", not an elision:)
# pnpm -r --filter @lysnrai/platform-service... build
```

> Skipping this is the #1 fresh-machine failure: `tsx watch` crashes with
> `Cannot find module '@bytelyst/…/dist/index.js'`. Re-run `pnpm -r build` after pulling
> changes to shared packages.

---

## 3. Bring up platform-service + tracker-web

Two ways. **Option A (all localhost, no Docker)** is recommended for a single dev Mac /
WSL box — everything runs on the host, so `gh`-backed features work out of the box.
**Option B (Docker)** is for when you also want the Grafana/Prometheus stack.

### Option A — all localhost, no Docker (recommended)

Two long‑lived processes, each in its own terminal. Both assume §2.1/§2.2 are done
(`.env` filled, `pnpm -r build` run).

**Terminal 1 — coordinator (platform-service, :4003):**

```bash
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts
```

`tsx watch` hot-reloads on source changes. Use the explicit `--env-file=../../.env`
(the bare `pnpm dev` script does **not** load the root `.env`, so `JWT_SECRET`/Cosmos
would be missing). `FLEET_METRICS_TOKEN` is already in `.env` if you set it in §2.1.

**Terminal 2 — dashboard (tracker-web, :3003):**

```bash
cd ~/code/learning_ai_common_plat/dashboards/tracker-web
pnpm dev          # serves http://localhost:3003 (proxies /api → :4003)
```

That's the whole coordinator + UI. **Monitoring (Grafana/Prometheus) is optional** on
the host path — `GET /api/fleet/metrics` (JSON), `GET /api/fleet/autoscale`, and the
tracker-web job pages cover observability without it. To get the Grafana "Fleet Overview"
dashboard you need Prometheus + Grafana (run them via Docker — Option B — or Homebrew
binaries pointed at `services/monitoring/...`).

Because everything is on the host, `gh` is on `PATH` → the PR‑state **reconcile** (§8)
and ship‑time `gh pr merge` work (unlike the Docker container, which has no `gh`).

Health checks:

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4003/health   # 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3003          # 200
```
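
Both processes take a few seconds to boot, so a single `curl` right after launch can fail spuriously. The health checks above can be wrapped in a small retry loop. This is a sketch under stated assumptions: `waitForHealthy` is a hypothetical helper, it relies on the global `fetch` in Node 18+, and it treats only HTTP 200 as healthy, mirroring the expected `200` in the comments above. To use it, call `await waitForHealthy('http://localhost:4003/health')` and then the same for `http://localhost:3003` before submitting a job.

```javascript
// Poll `url` until it answers 200 or the attempts run out.
// `fetchImpl` is injectable so the loop can be exercised without a network.
async function waitForHealthy(
  url,
  { attempts = 30, delayMs = 1000, fetchImpl = globalThis.fetch } = {},
) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetchImpl(url);
      if (res.status === 200) return true;
    } catch {
      // Connection refused while the process is still booting: retry.
    }
    // Sleep between attempts, but not after the last one.
    if (i < attempts) await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```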

### Option B — Docker (adds Grafana + Prometheus)

```bash
cd ~/code/learning_ai_common_plat
# targeted fleet subset that always builds cleanly:
docker compose up -d --build platform-service prometheus grafana
# (full stack: bash scripts/prototype-up.sh)
```

Starts `platform-service` (`:4003`), `prometheus` (`:9090`), `grafana` (`:3000`,
admin/`lysnrai`) + deps. Still run **tracker-web from source** (Option A, Terminal 2).

> **Docker caveats:**
>
> - `prototype-up.sh` may fail building the **dashboard** images when
>   `corepack prepare pnpm@…` can't fetch pnpm on a restricted network → use the targeted
>   subset above.
> - **`gh` is NOT in the container** → coordinator‑side `gh pr merge` and PR‑reconcile (§8)
>   are no‑ops in Docker. Use the host path (Option A) if you need them.
> - Don't run both: the container and a host `tsx` both bind `:4003`
>   (`docker compose stop platform-service` before host‑running).

---

## 4. Make Prometheus auth work (only if running Prometheus)

Skip this on the host path unless you also run Prometheus. `prometheus.yml` scrapes
`/api/fleet/metrics/prom` with a bearer token, so the running `platform-service` must see
the same `FLEET_METRICS_TOKEN`:

```bash
cd ~/code/learning_ai_common_plat
grep -q '^FLEET_METRICS_TOKEN=' .env || \
  printf '\nFLEET_METRICS_TOKEN=changeme-fleet-metrics-token\n' >> .env
# host path:   restart Terminal-1 tsx so it re-reads .env
# docker path: docker compose up -d platform-service
```

Verify (if Prometheus is up): `http://localhost:9090/api/v1/targets` →
`platform-service-fleet` is `up`. The token value must equal `credentials:` in
`services/monitoring/prometheus/prometheus.yml`.

---

## 5. Mint a local API token (dev only)

`platform-service` verifies HS256 JWTs signed with `JWT_SECRET` and requires
`type: "access"`. The tracker-web UI obtains one via login; for scripts/agents and
the factory, mint one directly. **Local dev only — never commit tokens or the secret.**

Save `mint-token.mjs` (resolve `jose` from the workspace):

```js
import { readFileSync } from 'node:fs';
// adjust the jose path to your checkout if needed:
import { SignJWT } from '/ABS/PATH/learning_ai_common_plat/node_modules/.pnpm/jose@5.10.0/node_modules/jose/dist/node/esm/index.js';

const env = readFileSync('/ABS/PATH/learning_ai_common_plat/.env', 'utf8');
// Fail with a clear message if JWT_SECRET is absent instead of a TypeError on null.
const match = env.match(/^JWT_SECRET=(.*)$/m);
if (!match) throw new Error('JWT_SECRET not found in .env');
// Strip optional surrounding quotes, assuming the service's env loader strips them too,
// so the signing bytes match what the running service verifies with.
const secret = new TextEncoder().encode(match[1].trim().replace(/^(['"])(.*)\1$/, '$2'));
const ttl = process.argv[2] || '15m'; // e.g. '15m' for scripts, '24h' for a factory
process.stdout.write(
  await new SignJWT({ sub: 'local-dev', role: 'admin', type: 'access' })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime(ttl)
    .sign(secret)
);
```

```bash
node mint-token.mjs 15m > /tmp/tok       # short-lived, for API calls
node mint-token.mjs 24h > /tmp/factok    # longer-lived, for the factory daemon
```

> Find the jose path with:
> `find . -path '*/node_modules/jose/dist/*/esm/index.js' | head -1`.
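
Before handing a token to the factory, it can help to confirm that it carries `type: "access"` and has enough lifetime left for the run. The sketch below only decodes the payload; it does not verify the signature, so it cannot replace the service's own check. `inspectToken` is a hypothetical helper, and the `type`, `role`, and `exp` fields are the ones set by `mint-token.mjs` above. A typical call would be `inspectToken(readFileSync('/tmp/factok', 'utf8'))`.

```javascript
// Decode (NOT verify) a compact JWT and summarize what matters for this runbook.
// `nowSec` is injectable so the expiry math can be checked deterministically.
function inspectToken(token, nowSec = Math.floor(Date.now() / 1000)) {
  const parts = token.trim().split('.');
  if (parts.length !== 3) {
    throw new Error('not a compact JWT (expected 3 dot-separated parts)');
  }
  // The payload is the second segment, base64url-encoded JSON.
  const payload = JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
  const hasExp = typeof payload.exp === 'number';
  return {
    type: payload.type,
    role: payload.role,
    secondsLeft: hasExp ? payload.exp - nowSec : null,
    // Usable here means: an access token that has not yet expired.
    usable: payload.type === 'access' && hasExp && payload.exp > nowSec,
  };
}

// Build a throwaway unsigned token just to show the shape of the result.
const b64 = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const demo = [b64({ alg: 'HS256' }), b64({ type: 'access', role: 'admin', exp: 2000 }), 'sig'].join('.');
console.log(inspectToken(demo, 1000));
// { type: 'access', role: 'admin', secondsLeft: 1000, usable: true }
```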

Requests must also carry the **product**: header `X-Product-Id: <productId>`
(e.g. `notelett`). `role: admin` bypasses tenant ownership checks when
`FLEET_TENANT_ENFORCEMENT` is on (it's off by default).

---

## 6. Submit a job

### Via tracker-web (preferred)

Open `http://localhost:3003/dashboard/fleet/jobs` and click "New job". **Set the correct
product first** (the product selector) — a job is partitioned by `productId`, and
submitting under the wrong product misattributes cost/metrics/ownership and the
factory won't see it under the product it polls.

PR‑mode fields that matter:

- **`repo`** — must be `owner/name` (e.g. `saravanakumardb1/learning_ai_notes`) or a
  clone URL, **not** a bare name (the factory feeds it to `gh`).
- **`baseBranch`** — e.g. `main`.
- **`engine`** — `devin` (pins the agent; otherwise the factory's default/engineClass is used).
- **`autoMerge`** — leave **`false`** for a human merge gate (recommended for large PRs).
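
Since a bare repo name only fails later, at `gh pr create`, a client-side check on `repo` before submitting can save a wasted run. This is a rough sketch: `isValidRepoRef` is a hypothetical helper, and the three patterns (`owner/name`, an HTTPS clone URL, an SSH clone URL) are assumptions about what "owner/name or a clone URL" covers, not the factory's actual validation.

```javascript
// Accepts "owner/name", an https clone URL, or an ssh clone URL.
// Rejects bare names such as "learning_ai_notes".
function isValidRepoRef(repo) {
  const ownerName = /^[A-Za-z0-9][A-Za-z0-9-]*\/[A-Za-z0-9._-]+$/;
  const httpsUrl = /^https:\/\/[^\s/]+\/[^\s/]+\/[^\s/]+?(\.git)?$/;
  const sshUrl = /^git@[^\s:]+:[^\s/]+\/[^\s/]+?(\.git)?$/;
  return ownerName.test(repo) || httpsUrl.test(repo) || sshUrl.test(repo);
}

console.log(isValidRepoRef('saravanakumardb1/learning_ai_notes')); // true
console.log(isValidRepoRef('learning_ai_notes'));                  // false
```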

### Via API

```bash
JOB=$(curl -s -X POST http://localhost:4003/api/fleet/jobs \
  -H "Authorization: Bearer $(cat /tmp/tok)" \
  -H "X-Product-Id: notelett" -H 'Content-Type: application/json' \
  -d '{
    "idempotencyKey": "notelett-demo-1",
    "bodyMd": "# Task\n…full prompt…",
    "priority": "high",
    "engine": "devin",
    "repo": "saravanakumardb1/learning_ai_notes",
    "baseBranch": "main",
    "autoMerge": false
  }')
echo "$JOB"   # → { outcome: "created", job: { id: "fjob_…", stage: "queued", ... } }
```

The job is now `queued` and claimable. It will **not run** until a factory polls
for its product (next step).

---

## 7. Start the factory (agent-queue, fleet mode)

The factory lives in a **separate repo**: `learning_ai_devops_tools/agent-queue`.
Run it on the **host** (needs `devin` + `gh`). Read its `docs/RUN_POLICY.md` first.

### 7a. Sanity‑check connectivity (safe — registers + heartbeats only)

```bash
cd ~/code/learning_ai_devops_tools/agent-queue
./agent-queue.sh init                      # idempotent

AQ_FLEET=1 AQ_FLEET_ROUTE=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh fleet-status            # → "heartbeat OK (registered)."
```

### 7b. Launch the run loop (claims + runs the agent)

```bash
cd ~/code/learning_ai_devops_tools/agent-queue
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh run --max 1
```

> ⚠️ **Set `AQ_FLEET_LEASE_RENEW_SEC` below 90 (e.g. 60).** This is the heartbeat/
> lease‑renew cadence. The coordinator's reaper marks a factory **stale after 90s**
> (`DEFAULT_STALE_FACTORY_MS`, a constant — no env knob) and **reclaims its in‑flight
> lease**. The default cadence is **300s**, so a busy single‑slot factory looks stale
> for most of every cycle and its running job gets requeued mid‑run (`leaseEpoch`
> climbs, stage flips back to `queued`, and the final report is **fenced** so the job
> never advances to `review`/`shipped`). 60s keeps it comfortably live. The §7a
> `fleet-status` check above sets the same value for consistency.
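
The "stale for most of every cycle" claim is simple arithmetic: with a 300s cadence and a 90s stale window, the factory looks stale for the last 210s of each 300s cycle, i.e. 70% of the time. The sketch below makes that check explicit. It assumes one heartbeat per cycle and a stale timer that restarts at each heartbeat; `staleFraction` and `checkLeaseRenew` are hypothetical helpers, and the 90s constant is taken from the warning above.

```javascript
// Coordinator's stale-factory window, per the warning above (seconds).
const STALE_WINDOW_SEC = 90;

// Fraction of each heartbeat cycle during which the factory looks stale,
// assuming one heartbeat per cycle resets the stale timer.
function staleFraction(renewSec, staleWindowSec = STALE_WINDOW_SEC) {
  if (!(renewSec > 0)) throw new RangeError('renewSec must be a positive number');
  return Math.max(0, renewSec - staleWindowSec) / renewSec;
}

// ok is true only when the cadence is strictly below the stale window.
function checkLeaseRenew(renewSec, staleWindowSec = STALE_WINDOW_SEC) {
  return {
    ok: renewSec < staleWindowSec,
    staleFraction: staleFraction(renewSec, staleWindowSec),
  };
}

console.log(checkLeaseRenew(300)); // { ok: false, staleFraction: 0.7 }
console.log(checkLeaseRenew(60));  // { ok: true, staleFraction: 0 }
```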

Key fleet env vars (see `lib/fleet-client.sh`):

| Var | Meaning |
| --- | --- |
| `AQ_FLEET=1` | master switch — enable coordinator calls (0 = pure offline) |
| `AQ_FLEET_ROUTE=1` | coordinator is **authoritative** for claim (pulls work from platform-service) |
| `AQ_FLEET_PR=1` | PR mode — open a PR for jobs that target a `repo` |
| `AQ_FLEET_API` | base URL **including `/api`** (`http://localhost:4003/api`) |
| `AQ_FLEET_TOKEN` | **Bearer JWT** (mint per §5; TTL ≥ run duration, e.g. 24h) |
| `AQ_PRODUCT_ID` | product to poll — sent as `X-Product-Id` (must match the job's product) |
| `AQ_FACTORY_ID` | this factory's id (registered/heartbeated) |
| `AQ_FLEET_CAPS` | advertised capabilities, e.g. `engine:devin` |
| `AQ_FLEET_LEASE_RENEW_SEC` | **set `<90`** (e.g. `60`) — heartbeat/renew cadence vs the 90s stale window (see warning) |
| `AQ_FLEET_REPO_BASE` | _(optional)_ dir of local checkouts; if `…/<repo>/.git` exists it uses a **git worktree**, else it `git clone`s `https://github.com/<repo>.git` into its cache |
| `AQ_FLEET_AUTOSHIP=1` | _(optional)_ auto-advance to `shipped` (skips the manual gate) |

The run loop claims a job (`claim → assigned → building`), runs Devin in an isolated checkout,
heartbeats + renews the lease (`lease_renewed` events) so the reaper doesn't reclaim it,
then on agent exit moves to `review` and (PR mode) opens the PR. With `autoMerge:false`
it **stops at the human merge gate**.

> **Repo checkout:** the job's `repo` is `owner/name`, so by default the factory
> `git clone`s `https://github.com/<owner>/<name>.git` into its own cache
> (`queue/.state/repos/…`) — clean isolation, nothing touches your working copies. To
> reuse an existing local clone via a **git worktree** instead, set
> `AQ_FLEET_REPO_BASE=<parent>` where `<parent>/<owner>/<name>/.git` exists.

---

## 8. Observe progress

- **tracker-web:** `http://localhost:3003/dashboard/fleet/jobs/<jobId>` — live event
  stream (SSE), runs, PR link + state.
- **Events/API:**
  ```bash
  curl -s http://localhost:4003/api/fleet/jobs/<jobId>/events \
    -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett"
  ```
- **Metrics:** `GET /api/fleet/metrics` (JSON, per product) · `GET /api/fleet/metrics/prom`
  (Prometheus, all products; needs `FLEET_METRICS_TOKEN`) · Grafana **Fleet Overview**
  (`http://localhost:3000/d/fleet-overview`).
- **Autoscale signal:** `GET /api/fleet/autoscale` (this product) / `…/autoscale/all`.

### PR‑state reconcile (externally‑merged PRs)

If you merge the PR in the GitHub UI, the coordinator doesn't know until told. Trigger
a reconcile (flips run `prState → merged` when `gh pr view` reports MERGED):

- UI: **"Refresh PR status"** button on the job's PR section, or
- API: `POST /api/fleet/jobs/<jobId>/pr/reconcile`.

> Requires `gh` where platform-service runs → use the **host path** (§3 Option A);
> it's a no‑op in the Docker container (no `gh`).
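
For scripted callers, the reconcile endpoint and the other `/api/fleet/*` routes all need the same two headers, and body-less POSTs should not send `Content-Type: application/json` (see the "Body cannot be empty" row in §11). A sketch of a request builder that encodes both rules follows. `buildFleetRequest` is a hypothetical helper and the default base URL is the local one used throughout this runbook; pass the result to `fetch(url, init)`, for example with `path` set to `/fleet/jobs/<jobId>/pr/reconcile` and `method: 'POST'`.

```javascript
// Build { url, init } for fetch(). Content-Type is only set when a body is
// present, so body-less POSTs (reconcile, actions) avoid the empty-body 500.
function buildFleetRequest(
  path,
  { token, productId, method = 'GET', body, base = 'http://localhost:4003/api' } = {},
) {
  const headers = {
    Authorization: `Bearer ${token}`,
    'X-Product-Id': productId,
  };
  const init = { method, headers };
  if (body !== undefined) {
    headers['Content-Type'] = 'application/json';
    init.body = JSON.stringify(body);
  }
  return { url: `${base}${path}`, init };
}

// "fjob_example" is a placeholder id, not a real job.
const req = buildFleetRequest('/fleet/jobs/fjob_example/pr/reconcile', {
  token: 'TOKEN',
  productId: 'notelett',
  method: 'POST',
});
console.log(req.url);                           // http://localhost:4003/api/fleet/jobs/fjob_example/pr/reconcile
console.log('Content-Type' in req.init.headers); // false
```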

---

## 9. Safety & cost

- **Billable + autonomous + long‑running.** Each run consumes Devin credits and can
  run for a long time unattended. Scope jobs deliberately; very large multi‑workstream
  specs are better split into several jobs.
- **Real PR.** PR mode pushes a branch and opens a PR on the target repo. Keep
  `autoMerge:false` so a human reviews/merges; `gh pr merge` (auto) only fires when the
  job opts in or `FLEET_SHIP_MERGES_PR=1` is set.
- **Isolation.** The factory works in an isolated worktree/clone, never your main
  checkout (per `agent-queue/docs/RUN_POLICY.md`). Avoid blanket `--yolo` on live trees.
- **Stopping the daemon** mid‑run lets the lease expire; the coordinator's reaper then
  reclaims and requeues the job (so partial work may be retried). Stop it only when you
  mean to.
- **Tokens/secrets:** the minted JWT and `JWT_SECRET` are sensitive — never commit them
  or paste into shared logs. `.env` is git‑ignored; keep it that way.

---

## 10. Teardown

```bash
# stop the factory: Ctrl-C the run loop
# host path:   Ctrl-C the tsx (Terminal 1) and pnpm dev (Terminal 2)
# docker path:
#   cd ~/code/learning_ai_common_plat && docker compose down      # keep volumes
#   docker compose down -v                                         # also drop volumes
rm -f /tmp/tok /tmp/factok                                         # discard minted tokens
```

---

## 11. Troubleshooting

| Symptom | Cause → Fix |
| --- | --- |
| `Cannot find module '@bytelyst/…/dist/index.js'` on `tsx`/Next start | workspace packages not built → `pnpm -r build` (§2.2). |
| `401 {"error":"Invalid or expired token"}` | JWT expired/mis‑signed → re‑mint (§5); ensure same `JWT_SECRET` as the running service. |
| Job claimed then flips back to `queued` mid‑run; `leaseEpoch` keeps climbing; final report **fenced**; PR opens but job never reaches `review`/`shipped` | factory heartbeat cadence (`AQ_FLEET_LEASE_RENEW_SEC`, default **300s**) > reaper stale window (**90s**) → set `AQ_FLEET_LEASE_RENEW_SEC=60` (§7). To recover the record after the fact, reconcile PR state (§8). |
| Job stays `queued`, never claimed | no live factory for that product → confirm `fleet-status` shows it registered; `AQ_PRODUCT_ID` must equal the job's product. Check `GET /api/fleet/factories` (with `X‑Product‑Id`) for `0 live`. |
| `POST …/pr/reconcile` or ship auto‑merge does nothing | `gh` not present where platform-service runs (Docker container) → run the host path (§3 Option A). |
| Prometheus target `platform-service-fleet` = `down (401)` | service missing `FLEET_METRICS_TOKEN` → §4 (restart host `tsx` / recreate container). |
| `prototype-up.sh` build fails on `corepack prepare pnpm` | dashboard image network issue → use the targeted subset, or just use the host path (Option A). |
| `POST …/actions/<x>` returns 500 "Body cannot be empty" | sent `Content-Type: application/json` with no body → omit the header or send `{}`. |
| Port `4003` conflict | host `tsx watch` and a `platform-service` container both bind `4003` → run only one. |
| `gh pr create` fails | `repo` is a bare name → must be `owner/name` or a clone URL; confirm `gh auth status`. |
| PR/cost attributed to wrong product | job submitted under the wrong `productId` partition → resubmit under the right product and cancel the stray (`POST …/actions/cancel`). |
| `vitest` exits non‑zero with `kill EPERM` after all suites pass | worker‑pool teardown artifact (sandbox), not a test failure → re‑run; all suites already passed. |

---

## 12. Copy‑paste quickstart — all localhost (notelett → learning_ai_notes)

Assumes §2.1/§2.2 done (`.env` filled, `pnpm -r build` run). Four terminals.

```bash
# Terminal 1 — coordinator
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts

# Terminal 2 — dashboard
cd ~/code/learning_ai_common_plat/dashboards/tracker-web && pnpm dev   # :3003

# Terminal 3 — tokens + submit (save mint-token.mjs from §5; fix ABS paths)
node mint-token.mjs 15m > /tmp/tok
node mint-token.mjs 24h > /tmp/factok
curl -s -X POST http://localhost:4003/api/fleet/jobs \
  -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett" \
  -H 'Content-Type: application/json' \
  -d '{"idempotencyKey":"notelett-demo-1","bodyMd":"# Task…","priority":"high","engine":"devin","repo":"saravanakumardb1/learning_ai_notes","baseBranch":"main","autoMerge":false}'

# Terminal 4 — factory (runs Devin → opens a real PR). NOTE the <90s heartbeat cadence.
cd ~/code/learning_ai_devops_tools/agent-queue && ./agent-queue.sh init
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 AQ_FLEET_CAPS=engine:devin AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh run --max 1
```

---

## 13. WSL on Windows — differences to note

The flow is identical **inside a WSL2 (Ubuntu) shell**, with these adjustments. Treat
WSL as "the Linux host" — install and run **everything inside WSL**, not Windows.

- **Keep repos on the WSL filesystem, not `/mnt/c`.** Clone under e.g. `~/code` inside
  WSL. On `/mnt/c` (the Windows drive over 9p) `tsx watch`/Next file‑watching is
  unreliable (inotify doesn't fire) and git/pnpm are far slower. This is the single most
  important difference.
- **Install the toolchain inside WSL** (Linux builds): `node`/`pnpm` (nvm), `git`, **`gh`**,
  and the **`devin` CLI** — and run `gh auth login` + Devin auth **inside WSL**. A `gh`/
  `devin` installed on Windows is not visible to the WSL bash factory.
- **Line endings.** Clone inside WSL (don't reuse a Windows checkout with
  `core.autocrlf=true`) so the `*.sh` scripts stay LF — CRLF breaks `agent-queue.sh`
  (`bad interpreter`/`\r`). If needed: `git config --global core.autocrlf input`.
- **Reaching the UI from the Windows browser.** WSL2 forwards `localhost`, so
  `http://localhost:3003` / `:4003` usually work from a Windows browser. If they don't
  (older Windows / mirrored‑networking off), use the WSL IP (`hostname -I`) or set
  `networkingMode=mirrored` in `.wslconfig`.
- **Ports.** Make sure nothing on the **Windows** side already binds `3003`/`4003`
  (WSL2 publishes to the same localhost). Stop the Windows process or change ports.
- **Docker (Option B), if used.** Use **Docker Desktop with the WSL2 backend** and run
  `docker compose` from inside the WSL shell. `host.docker.internal` resolves from
  containers to the host as on Mac.
- **`/tmp` token paths** (`/tmp/tok`, `/tmp/factok`) are the WSL `/tmp` — fine; just keep
  all four terminals in the same WSL distro so they share it.
- **Clock skew.** If WSL's clock drifts after sleep, JWT `iat/exp` checks can fail
  (`Invalid or expired token`) — `sudo hwclock -s` (or restart WSL) to resync.
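
Two of the differences above (repos under `/mnt/c` and CRLF line endings) are easy to check mechanically before launching anything. The sketch below shows one way to do it. Both helpers are hypothetical, and the `/mnt/<drive-letter>` pattern assumes WSL's default automount location. In practice you would call `isOnWindowsMount(process.cwd())` from the checkout and `hasCrlf(readFileSync('agent-queue.sh', 'utf8'))` on the factory script.

```javascript
// True when `path` sits on a Windows drive mounted at WSL's default
// /mnt/<letter> location (e.g. /mnt/c/Users/me/code).
function isOnWindowsMount(path) {
  return /^\/mnt\/[a-z](\/|$)/i.test(path);
}

// True when the text contains Windows (CRLF) line endings, which break
// shell scripts with "bad interpreter" errors.
function hasCrlf(text) {
  return text.includes('\r\n');
}

console.log(isOnWindowsMount('/mnt/c/Users/me/code')); // true
console.log(isOnWindowsMount('/home/me/code'));        // false
console.log(hasCrlf('#!/usr/bin/env bash\r\necho hi\r\n')); // true
```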

Everything else — env vars, `pnpm -r build`, `tsx --env-file`, the factory env incl.
`AQ_FLEET_LEASE_RENEW_SEC=60`, token minting — is identical to the Mac host path.

---

### Reference

- Coordinator routes: `services/platform-service/src/modules/fleet/routes.ts`
- Coordinator logic: `services/platform-service/src/modules/fleet/coordinator.ts`
- Factory fleet client: `learning_ai_devops_tools/agent-queue/lib/fleet-client.sh`
- Factory runner + PR mode: `learning_ai_devops_tools/agent-queue/agent-queue.sh`
- Gigafactory spec/roadmap: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/`
- Prometheus scrape config: `services/monitoring/prometheus/prometheus.yml`
- Grafana dashboard: `services/monitoring/grafana/dashboards/fleet-overview.json`