docs(fleet): runbook to run a Devin fleet job end-to-end (local)

Add docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md: how developers and coding agents spin up platform-service + tracker-web + an agent-queue factory so a submitted job is claimed and run autonomously by the Devin CLI against a target repo (worked example: learning_ai_notes), pushing a branch and opening a real PR.

Covers: architecture + lifecycle, prerequisites incl. fresh-machine setup (clone both repos, .env/Cosmos, pnpm -r build so host-run resolves @bytelyst/* from dist/), all-localhost (no Docker) path as primary + Docker as the Grafana/Prometheus option, local JWT minting, job submit, factory launch, observe, PR-state reconcile, safety/cost, teardown, troubleshooting, and a copy-paste quickstart.

Calls out two gotchas learned in practice: set AQ_FLEET_LEASE_RENEW_SEC < 90 so the factory heartbeat beats the coordinator's 90s stale-factory reclaim window (else a busy single-slot factory's in-flight lease is reclaimed mid-run and the final report is fenced), and a WSL-on-Windows differences section (run inside WSL, repos off /mnt/c, LF endings, gh/devin/node in WSL, localhost forwarding).

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

parent 6bddc88f0f
commit e6611cae1a

docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md (normal file, 505 lines added)
@@ -0,0 +1,505 @@

# Runbook — Run a Devin Fleet Job End‑to‑End (local)

> **Audience:** developers and coding agents.
> **Goal:** stand up `platform-service` + `tracker-web` + a **fleet factory** (the
> `agent-queue` runner) so a submitted job is claimed and executed **autonomously
> by the Devin CLI** against a target repo (worked example: `learning_ai_notes`),
> pushing a branch and opening a **real pull request**.

> ⚠️ **This is a real, cost‑incurring, side‑effecting operation.** The factory runs
> an autonomous coding agent (Devin) that consumes API credits, can run for a long
> time, pushes a branch, and opens a **real PR** on GitHub. Read [§9 Safety &
> cost](#9-safety--cost) before launching. For unattended local prototyping only —
> not a production deployment guide.

---

## 1. Architecture (what talks to what)

```
you ──▶ tracker-web (:3003) ─┐
                             │  REST + SSE (/api/fleet/*)
 coding agent / curl ────────┼─▶ platform-service (:4003) ──▶ Azure Cosmos (jobs/runs/leases/events)
                             │        ▲     ▲
           Prometheus (:9090)┘        │     │  claim / lease-renew / report (Bearer JWT + X-Product-Id)
           Grafana (:3000) ───────────┘     │
                                            │
                            agent-queue FACTORY (fleet mode) ──▶ Devin CLI ──▶ git push + gh pr create
                            (learning_ai_devops_tools/agent-queue)  (target repo, e.g. learning_ai_notes)
```

- **platform-service** — the fleet **coordinator**. Owns the job lifecycle
  (`queued → assigned → building → review → testing → shipped|failed|dead_letter`),
  atomic claim, leases, events, budgets, metrics. Code: `services/platform-service/src/modules/fleet/`.
- **tracker-web** (`:3003`) — submit/inspect jobs (`/dashboard/fleet/jobs/...`).
- **factory** — `learning_ai_devops_tools/agent-queue` in **fleet mode**. Polls
  `POST /api/fleet/claim`, runs the agent CLI in an isolated checkout, reports back,
  and (PR mode) opens the PR.
- **Prometheus/Grafana** — fleet metrics + the "Fleet Overview" dashboard.

Lifecycle the factory drives:

```
queued ─▶ assigned ─▶ building ─▶ review ─▶ testing ─▶ shipped
          (claim)     (agent      (rc=0)    (verify    (manual/auto ship)
                       running)              passed)
   └─ agent rc≠0 / timeout / verify fail ─▶ failed ─▶ (retry|dead_letter)
```
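
The lifecycle above can be read as a small transition table. The sketch below is illustrative only: the stage names come from this runbook, but `ALLOWED`, `canTransition`, and `isTerminal` are hypothetical names, and modelling "retry" as `failed → queued` is an assumption. The coordinator's real rules live in `services/platform-service/src/modules/fleet/` and may differ.

```javascript
// Illustrative transition table derived from the diagram above.
// This is NOT the coordinator's code; names and the retry edge are assumptions.
const ALLOWED = {
  queued: ['assigned'],              // claim
  assigned: ['building'],            // agent starts running
  building: ['review', 'failed'],    // rc=0 vs rc≠0 / timeout
  review: ['testing', 'failed'],     // verify passed vs verify fail
  testing: ['shipped'],              // manual or auto ship
  failed: ['queued', 'dead_letter'], // assumed: retry requeues, else dead-letter
  shipped: [],
  dead_letter: [],
};

// True when the diagram allows moving directly from one stage to another.
function canTransition(from, to) {
  return (ALLOWED[from] || []).includes(to);
}

// True when a stage has no outgoing edges in the table above.
function isTerminal(stage) {
  return stage in ALLOWED && ALLOWED[stage].length === 0;
}

console.log(canTransition('queued', 'assigned')); // true
console.log(canTransition('queued', 'shipped'));  // false
console.log(isTerminal('dead_letter'));           // true
```

A table like this is handy when reading an event stream: any stage change it does not allow (for example `building → queued`) points at a lease reclaim rather than normal progress.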

---

## 2. Prerequisites

| Tool                                            | Why                                                                          | Check                |
| ----------------------------------------------- | ---------------------------------------------------------------------------- | -------------------- |
| Node ≥ 20 + `pnpm` (corepack)                   | host-run service, scripts, tracker-web, build                                | `node -v && pnpm -v` |
| `git` + `gh` (authenticated)                    | factory clones target repo, pushes branch, opens PR; `gh pr merge`/reconcile | `gh auth status`     |
| `devin` CLI (authenticated)                     | the agent the factory runs                                                   | `devin --version`    |
| Both repos cloned side‑by‑side                  | coordinator/dashboards + the factory                                         | see below            |
| repo `.env` (root of `learning_ai_common_plat`) | `JWT_SECRET`, Cosmos creds, `FLEET_METRICS_TOKEN`                            | `test -f .env`       |
| Docker                                          | **optional** — only for the Docker path (§3 Option B) / Grafana+Prometheus   | `docker info`        |

> **Node version:** the Docker image pins **node 22**; for the host path any **Node ≥ 20**
> works. Use one Node (nvm/asdf) for both repos to avoid native-module surprises.

### 2.1 First‑time setup (fresh machine)

Clone both repos as **siblings** (the factory clones targets relative to a shared parent):

```bash
mkdir -p ~/code && cd ~/code
git clone <host>/learning_ai_common_plat.git
git clone <host>/learning_ai_devops_tools.git   # contains agent-queue (the factory)
```

Create and fill `.env` at the **root of `learning_ai_common_plat`**:

```bash
cd ~/code/learning_ai_common_plat
cp .env.example .env
# then edit .env — minimum for the fleet flow:
#   JWT_SECRET=<any strong secret; tokens are minted+verified with THIS value>
#   FLEET_METRICS_TOKEN=changeme-fleet-metrics-token   # only needed for Prometheus
#   COSMOS_* / connection vars  -> see note below
```

- `JWT_SECRET` — HS256 secret platform-service verifies tokens with. Any strong value;
  it only needs to be **internally consistent on this machine** (the token you mint in
  §5 and the running service must share it). **Required.**
- **Cosmos** — the default prototype talks to a **real Azure Cosmos account** (no emulator
  in the default compose). On a new machine you must either (a) point `.env` at the **same
  Cosmos account** (to see/share existing jobs) or (b) point at your own DB and set
  `COSMOS_AUTO_INIT=true` so containers are created on boot. Without valid Cosmos creds the
  service starts but every fleet call fails.
- `FLEET_METRICS_TOKEN` — only needed if you run Prometheus (§4); must match
  `services/monitoring/prometheus/prometheus.yml` (`credentials:`).
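
Before starting the service it can help to confirm that `.env` actually defines the keys listed above. The following is a minimal sketch, not part of the repo: `missingEnvKeys` is a hypothetical helper that does simple `KEY=value` parsing (no quoting or `export` handling). The exact Cosmos variable names are not spelled out in this runbook, so pass whatever your `.env.example` lists.

```javascript
// Hypothetical helper: returns the required keys that are absent or empty
// in dotenv-style text. Simple KEY=value parsing only (an assumption).
function missingEnvKeys(envText, requiredKeys) {
  const present = new Set();
  for (const rawLine of envText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blanks and comments
    const eq = line.indexOf('=');
    if (eq <= 0) continue;                             // not a KEY=value line
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim();
    if (value !== '') present.add(key);                // empty values count as missing
  }
  return requiredKeys.filter((key) => !present.has(key));
}

const sample = '# fleet\nJWT_SECRET=s3cret\nFLEET_METRICS_TOKEN=\n';
console.log(missingEnvKeys(sample, ['JWT_SECRET', 'FLEET_METRICS_TOKEN']));
// → [ 'FLEET_METRICS_TOKEN' ]
```

In practice you would read the real file's text and pass it in; an empty result means every key you asked about has a value.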

### 2.2 Install + build the workspace (required for the host path)

Host-run resolves `@bytelyst/*` workspace packages from their **`dist/`** (the `exports`
field points at `dist`), so you must build them once before `tsx`/Next can import them:

```bash
cd ~/code/learning_ai_common_plat
pnpm install
pnpm -r build            # builds all workspace packages (incl. @bytelyst/* → dist/)
# (faster, just the platform-service closure:)
#   pnpm -r --filter @lysnrai/platform-service... build
```

> Skipping this is the #1 fresh-machine failure: `tsx watch` crashes with
> `Cannot find module '@bytelyst/…/dist/index.js'`. Re-run `pnpm -r build` after pulling
> changes to shared packages.

---

## 3. Bring up platform-service + tracker-web

Two ways. **Option A (all localhost, no Docker)** is recommended for a single dev Mac /
WSL box — everything runs on the host, so `gh`-backed features work out of the box.
**Option B (Docker)** is for when you also want the Grafana/Prometheus stack.

### Option A — all localhost, no Docker (recommended)

Two long‑lived processes, each in its own terminal. Both assume §2.1/§2.2 are done
(`.env` filled, `pnpm -r build` run).

**Terminal 1 — coordinator (platform-service, :4003):**

```bash
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts
```

`tsx watch` hot-reloads on source changes. Use the explicit `--env-file=../../.env`
(the bare `pnpm dev` script does **not** load the root `.env`, so `JWT_SECRET`/Cosmos
would be missing). `FLEET_METRICS_TOKEN` is already in `.env` if you set it in §2.1.

**Terminal 2 — dashboard (tracker-web, :3003):**

```bash
cd ~/code/learning_ai_common_plat/dashboards/tracker-web
pnpm dev        # serves http://localhost:3003 (proxies /api → :4003)
```

That's the whole coordinator + UI. **Monitoring (Grafana/Prometheus) is optional** on
the host path — `GET /api/fleet/metrics` (JSON), `GET /api/fleet/autoscale`, and the
tracker-web job pages cover observability without it. To get the Grafana "Fleet Overview"
dashboard you need Prometheus + Grafana (run them via Docker — Option B — or Homebrew
binaries pointed at `services/monitoring/...`).

Because everything is on the host, `gh` is on `PATH` → the PR‑state **reconcile** (§8)
and ship‑time `gh pr merge` work (unlike the Docker container, which has no `gh`).

Health checks:

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4003/health   # 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3003          # 200
```

### Option B — Docker (adds Grafana + Prometheus)

```bash
cd ~/code/learning_ai_common_plat
# targeted fleet subset that always builds cleanly:
docker compose up -d --build platform-service prometheus grafana
# (full stack: bash scripts/prototype-up.sh)
```

Starts `platform-service` (`:4003`), `prometheus` (`:9090`), `grafana` (`:3000`,
admin/`lysnrai`) + deps. Still run **tracker-web from source** (Option A, Terminal 2).

> **Docker caveats:**
>
> - `prototype-up.sh` may fail building the **dashboard** images when
>   `corepack prepare pnpm@…` can't fetch pnpm on a restricted network → use the targeted
>   subset above.
> - **`gh` is NOT in the container** → coordinator‑side `gh pr merge` and PR‑reconcile (§8)
>   are no‑ops in Docker. Use the host path (Option A) if you need them.
> - Don't run both: the container and a host `tsx` both bind `:4003`
>   (`docker compose stop platform-service` before host‑running).

---

## 4. Make Prometheus auth work (only if running Prometheus)

Skip this on the host path unless you also run Prometheus. `prometheus.yml` scrapes
`/api/fleet/metrics/prom` with a bearer, so the running `platform-service` must see the
same `FLEET_METRICS_TOKEN`:

```bash
cd ~/code/learning_ai_common_plat
grep -q '^FLEET_METRICS_TOKEN=' .env || \
  printf '\nFLEET_METRICS_TOKEN=changeme-fleet-metrics-token\n' >> .env
# host path:   restart Terminal-1 tsx so it re-reads .env
# docker path: docker compose up -d platform-service
```

Verify (if Prometheus is up): `http://localhost:9090/api/v1/targets` →
`platform-service-fleet` is `up`. The token value must equal `credentials:` in
`services/monitoring/prometheus/prometheus.yml`.

---

## 5. Mint a local API token (dev only)

`platform-service` verifies HS256 JWTs signed with `JWT_SECRET` and requires
`type: "access"`. The tracker-web UI obtains one via login; for scripts/agents and
the factory, mint one directly. **Local dev only — never commit tokens or the secret.**

Save `mint-token.mjs` (resolve `jose` from the workspace):

```js
import { readFileSync } from 'node:fs';
// adjust the jose path to your checkout if needed:
import { SignJWT } from '/ABS/PATH/learning_ai_common_plat/node_modules/.pnpm/jose@5.10.0/node_modules/jose/dist/node/esm/index.js';

// read JWT_SECRET straight from the repo-root .env (raw value; keep it unquoted there)
const env = readFileSync('/ABS/PATH/learning_ai_common_plat/.env', 'utf8');
const match = env.match(/^JWT_SECRET=(.*)$/m);
if (!match || !match[1].trim()) {
  console.error('JWT_SECRET is missing or empty in .env');
  process.exit(1);
}
const secret = new TextEncoder().encode(match[1].trim());
const ttl = process.argv[2] || '15m'; // e.g. '15m' for scripts, '24h' for a factory

// sign an HS256 access token and print it without a trailing newline
process.stdout.write(
  await new SignJWT({ sub: 'local-dev', role: 'admin', type: 'access' })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime(ttl)
    .sign(secret)
);
```

```bash
node mint-token.mjs 15m  > /tmp/tok      # short-lived, for API calls
node mint-token.mjs 24h  > /tmp/factok   # longer-lived, for the factory daemon
```

> Find the jose path (run from the repo root; `"$PWD"` makes the result absolute):
> `find "$PWD" -path '*/node_modules/jose/dist/*/esm/index.js' | head -1`.

Every request must also name the **product** via the header `X-Product-Id: <productId>`
(e.g. `notelett`). `role: admin` bypasses tenant ownership checks when
`FLEET_TENANT_ENFORCEMENT` is on (it's off by default).
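
When a call returns `401`, it is often quicker to inspect the token locally than to re-mint blindly. The sketch below decodes the payload segment of a JWT **without verifying the signature**, so it only tells you what the token claims, not whether the service will accept it. `decodeJwtPayload` and `describeToken` are hypothetical helper names, not part of either repo.

```javascript
// Decode (without verifying!) the payload segment of a JWT.
function decodeJwtPayload(token) {
  const parts = token.trim().split('.');
  if (parts.length !== 3) throw new Error('not a three-part JWT');
  // JWT segments are base64url-encoded JSON
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
}

// Summarise the claims this runbook cares about: type, role, remaining lifetime.
function describeToken(token, nowSec = Math.floor(Date.now() / 1000)) {
  const claims = decodeJwtPayload(token);
  return {
    type: claims.type,
    role: claims.role,
    secondsLeft: typeof claims.exp === 'number' ? claims.exp - nowSec : null,
  };
}
```

Pass it the contents of `/tmp/tok` or `/tmp/factok`: a `type` other than `access`, or a negative `secondsLeft`, explains a rejection; a token that looks fine but still fails suggests it was signed with a different `JWT_SECRET` than the running service uses.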

---

## 6. Submit a job

### Via tracker-web (preferred)

Open `http://localhost:3003/dashboard/fleet/jobs` and click "New job". **Set the correct
product first** (the product selector) — a job is partitioned by `productId`, and
submitting under the wrong product misattributes cost/metrics/ownership and the
factory won't see it under the product it polls.

PR‑mode fields that matter:

- **`repo`** — must be `owner/name` (e.g. `saravanakumardb1/learning_ai_notes`) or a
  clone URL, **not** a bare name (the factory feeds it to `gh`).
- **`baseBranch`** — e.g. `main`.
- **`engine`** — `devin` (pins the agent; otherwise the factory's default/engineClass).
- **`autoMerge`** — leave **`false`** for a human merge gate (recommended for large PRs).
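
The `repo` rule above is easy to get wrong when scripting submissions, so a client-side check can catch a bare name before the job is queued. This is a sketch of the rule as stated in this runbook, not the factory's own validation (which is not shown here); `isValidRepoRef` and its regular expressions are assumptions.

```javascript
// Hypothetical pre-submit check mirroring the rule above:
// accept "owner/name" or a clone URL, reject a bare repo name.
function isValidRepoRef(repo) {
  const ownerName = /^[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+$/;  // e.g. owner/name
  const cloneUrl = /^(https:\/\/|ssh:\/\/|git@)[^\s]+$/;   // https, ssh, or scp-style git URL
  return ownerName.test(repo) || cloneUrl.test(repo);
}

console.log(isValidRepoRef('saravanakumardb1/learning_ai_notes')); // true
console.log(isValidRepoRef('learning_ai_notes'));                  // false
```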

### Via API

```bash
JOB=$(curl -s -X POST http://localhost:4003/api/fleet/jobs \
  -H "Authorization: Bearer $(cat /tmp/tok)" \
  -H "X-Product-Id: notelett" -H 'Content-Type: application/json' \
  -d '{
    "idempotencyKey": "notelett-demo-1",
    "bodyMd": "# Task\n…full prompt…",
    "priority": "high",
    "engine": "devin",
    "repo": "saravanakumardb1/learning_ai_notes",
    "baseBranch": "main",
    "autoMerge": false
  }')
echo "$JOB"   # → { outcome: "created", job: { id: "fjob_…", stage: "queued", ... } }
```

The job is now `queued` and claimable. It will **not run** until a factory polls
for its product (next step).
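
The later steps (events, reconcile) need the job id from this response. A minimal sketch of pulling it out, assuming only the fields shown in the comment above (`outcome`, `job.id`, `job.stage`); `parseSubmitResponse` is a hypothetical helper, `fjob_example` is a made-up id, and error responses may have a different shape that this runbook does not document.

```javascript
// Hypothetical helper: extract the fields later steps need from the submit response.
// Assumes the { outcome, job: { id, stage } } shape shown in the curl example.
function parseSubmitResponse(jsonText) {
  const body = JSON.parse(jsonText);
  if (!body.job || typeof body.job.id !== 'string') {
    throw new Error('unexpected submit response: ' + jsonText.slice(0, 200));
  }
  return { outcome: body.outcome, jobId: body.job.id, stage: body.job.stage };
}

// Made-up sample response for illustration only.
const sampleResponse = '{"outcome":"created","job":{"id":"fjob_example","stage":"queued"}}';
console.log(parseSubmitResponse(sampleResponse).jobId); // fjob_example
```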

---

## 7. Start the factory (agent-queue, fleet mode)

The factory lives in a **separate repo**: `learning_ai_devops_tools/agent-queue`.
Run it on the **host** (needs `devin` + `gh`). Read its `docs/RUN_POLICY.md` first.

### 7a. Sanity‑check connectivity (safe — registers + heartbeats only)

```bash
cd ~/code/learning_ai_devops_tools/agent-queue
./agent-queue.sh init            # idempotent

AQ_FLEET=1 AQ_FLEET_ROUTE=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh fleet-status    # → "heartbeat OK (registered)."
```

### 7b. Launch the run loop (claims + runs the agent)

```bash
cd ~/code/learning_ai_devops_tools/agent-queue
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh run --max 1
```

> ⚠️ **Set `AQ_FLEET_LEASE_RENEW_SEC` below 90 (e.g. 60).** This is the heartbeat/
> lease‑renew cadence. The coordinator's reaper marks a factory **stale after 90s**
> (`DEFAULT_STALE_FACTORY_MS`, a constant — no env knob) and **reclaims its in‑flight
> lease**. The default cadence is **300s**, so a busy single‑slot factory looks stale
> for most of every cycle and its running job gets requeued mid‑run (`leaseEpoch`
> climbs, stage flips back to `queued`, and the final report is **fenced** so the job
> never advances to `review`/`shipped`). 60s keeps it comfortably live. (The §7a
> `fleet-status` check sets the same value for consistency.)
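
The arithmetic behind that warning is worth making explicit. The sketch below assumes a simplified model: the factory heartbeats exactly once every `renewSec` seconds, and the coordinator treats it as stale once `staleWindowSec` (90s here) has passed since the last heartbeat. `staleExposure` is a hypothetical name, and the real reaper timing may differ in detail.

```javascript
// Simplified model (an assumption): one heartbeat every renewSec seconds,
// stale once staleWindowSec has elapsed since the last heartbeat.
function staleExposure(renewSec, staleWindowSec = 90) {
  const staleSec = Math.max(0, renewSec - staleWindowSec); // seconds per cycle spent looking stale
  return {
    staleSec,
    fraction: staleSec / renewSec,    // share of each cycle spent stale
    safe: renewSec < staleWindowSec,  // the runbook's rule: cadence must be below the window
  };
}

console.log(staleExposure(300)); // { staleSec: 210, fraction: 0.7, safe: false }
console.log(staleExposure(60));  // { staleSec: 0, fraction: 0, safe: true }
```

Under this model the 300s default leaves the factory reclaimable for 210 of every 300 seconds, which is why the job is requeued mid-run, while 60s never crosses the 90s window.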

Key fleet env vars (see `lib/fleet-client.sh`):

| Var                        | Meaning                                                                                                                                                        |
| -------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `AQ_FLEET=1`               | master switch — enable coordinator calls (0 = pure offline)                                                                                                    |
| `AQ_FLEET_ROUTE=1`         | coordinator is **authoritative** for claim (pulls work from platform-service)                                                                                  |
| `AQ_FLEET_PR=1`            | PR mode — open a PR for jobs that target a `repo`                                                                                                              |
| `AQ_FLEET_API`             | base URL **including `/api`** (`http://localhost:4003/api`)                                                                                                    |
| `AQ_FLEET_TOKEN`           | **Bearer JWT** (mint per §5; TTL ≥ run duration, e.g. 24h)                                                                                                     |
| `AQ_PRODUCT_ID`            | product to poll — sent as `X-Product-Id` (must match the job's product)                                                                                        |
| `AQ_FACTORY_ID`            | this factory's id (registered/heartbeated)                                                                                                                     |
| `AQ_FLEET_CAPS`            | advertised capabilities, e.g. `engine:devin`                                                                                                                   |
| `AQ_FLEET_LEASE_RENEW_SEC` | **set `<90`** (e.g. `60`) — heartbeat/renew cadence vs the 90s stale window (see warning)                                                                      |
| `AQ_FLEET_REPO_BASE`       | _(optional)_ dir of local checkouts; if `…/<repo>/.git` exists it uses a **git worktree**, else it `git clone`s `https://github.com/<repo>.git` into its cache |
| `AQ_FLEET_AUTOSHIP=1`      | _(optional)_ auto-advance to `shipped` (skips the manual gate)                                                                                                 |

The run loop claims a job (`claim → assigned → building`), runs Devin in an isolated checkout,
heartbeats + renews the lease (`lease_renewed` events) so the reaper doesn't reclaim it,
then on agent exit moves to `review` and (PR mode) opens the PR. With `autoMerge:false`
it **stops at the human merge gate**.

> **Repo checkout:** the job's `repo` is `owner/name`, so by default the factory
> `git clone`s `https://github.com/<owner>/<name>.git` into its own cache
> (`queue/.state/repos/…`) — clean isolation, nothing touches your working copies. To
> reuse an existing local clone via a **git worktree** instead, set
> `AQ_FLEET_REPO_BASE=<parent>` where `<parent>/<owner>/<name>/.git` exists.

---

## 8. Observe progress

- **tracker-web:** `http://localhost:3003/dashboard/fleet/jobs/<jobId>` — live event
  stream (SSE), runs, PR link + state.
- **Events/API:**
  ```bash
  curl -s http://localhost:4003/api/fleet/jobs/<jobId>/events \
    -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett"
  ```
- **Metrics:** `GET /api/fleet/metrics` (JSON, per product) · `GET /api/fleet/metrics/prom`
  (Prometheus, all products; needs `FLEET_METRICS_TOKEN`) · Grafana **Fleet Overview**
  (`http://localhost:3000/d/fleet-overview`).
- **Autoscale signal:** `GET /api/fleet/autoscale` (this product) / `…/autoscale/all`.

### PR‑state reconcile (externally‑merged PRs)

If you merge the PR in the GitHub UI, the coordinator doesn't know until told. Trigger
a reconcile (flips run `prState → merged` when `gh pr view` reports MERGED):

- UI: **"Refresh PR status"** button on the job's PR section, or
- API: `POST /api/fleet/jobs/<jobId>/pr/reconcile`.

> Requires `gh` where platform-service runs → use the **host path** (§3 Option A);
> it's a no‑op in the Docker container (no `gh`).

---

## 9. Safety & cost

- **Billable + autonomous + long‑running.** Each run consumes Devin credits and can
  run for a long time unattended. Scope jobs deliberately; very large multi‑workstream
  specs are better split into several jobs.
- **Real PR.** PR mode pushes a branch and opens a PR on the target repo. Keep
  `autoMerge:false` so a human reviews/merges; `gh pr merge` (auto) only fires when the
  job opts in or `FLEET_SHIP_MERGES_PR=1`.
- **Isolation.** The factory works in an isolated worktree/clone, never your main
  checkout (per `agent-queue/docs/RUN_POLICY.md`). Avoid blanket `--yolo` on live trees.
- **Stopping the daemon** mid‑run lets the lease expire; the coordinator's reaper then
  reclaims and requeues the job (so partial work may be retried). Stop intentionally.
- **Tokens/secrets:** the minted JWT and `JWT_SECRET` are sensitive — never commit them
  or paste into shared logs. `.env` is git‑ignored; keep it that way.

---

## 10. Teardown

```bash
# stop the factory: Ctrl-C the run loop
# host path: Ctrl-C the tsx (Terminal 1) and pnpm dev (Terminal 2)
# docker path:
#   cd ~/code/learning_ai_common_plat && docker compose down       # keep volumes
#   docker compose down -v                                         # also drop volumes
rm -f /tmp/tok /tmp/factok     # discard minted tokens
```

---

## 11. Troubleshooting

| Symptom | Cause → Fix |
| ------- | ----------- |
| `Cannot find module '@bytelyst/…/dist/index.js'` on `tsx`/Next start | workspace packages not built → `pnpm -r build` (§2.2). |
| `401 {"error":"Invalid or expired token"}` | JWT expired/mis‑signed → re‑mint (§5); ensure same `JWT_SECRET` as the running service. |
| Job claimed then flips back to `queued` mid‑run; `leaseEpoch` keeps climbing; final report **fenced**; PR opens but job never reaches `review`/`shipped` | factory heartbeat cadence (`AQ_FLEET_LEASE_RENEW_SEC`, default **300s**) > reaper stale window (**90s**) → set `AQ_FLEET_LEASE_RENEW_SEC=60` (§7). To recover the record after the fact, reconcile PR state (§8). |
| Job stays `queued`, never claimed | no live factory for that product → confirm `fleet-status` reports the factory as registered and that `AQ_PRODUCT_ID` equals the job's product. `GET /api/fleet/factories` (with `X‑Product‑Id`) showing `0 live` confirms it. |
| `POST …/pr/reconcile` or ship auto‑merge does nothing | `gh` not present where platform-service runs (Docker container) → run the host path (§3 Option A). |
| Prometheus target `platform-service-fleet` = `down (401)` | service missing `FLEET_METRICS_TOKEN` → §4 (restart host `tsx` / recreate container). |
| `prototype-up.sh` build fails on `corepack prepare pnpm` | dashboard image network issue → use the targeted subset, or just use the host path (Option A). |
| `POST …/actions/<x>` returns 500 "Body cannot be empty" | sent `Content-Type: application/json` with no body → omit the header or send `{}`. |
| Port `4003` conflict | host `tsx watch` and a `platform-service` container both bind `4003` → run only one. |
| `gh pr create` fails | `repo` is a bare name → must be `owner/name` or a clone URL; confirm `gh auth status`. |
| PR/cost attributed to wrong product | job submitted under the wrong `productId` partition → resubmit under the right product and cancel the stray (`POST …/actions/cancel`). |
| `vitest` exits non‑zero with `kill EPERM` after all suites pass | worker‑pool teardown artifact (sandbox), not a test failure → re‑run; all suites already passed. |

---

## 12. Copy‑paste quickstart — all localhost (notelett → learning_ai_notes)

Assumes §2.1/§2.2 done (`.env` filled, `pnpm -r build` run). Four terminals.

```bash
# Terminal 1 — coordinator
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts

# Terminal 2 — dashboard
cd ~/code/learning_ai_common_plat/dashboards/tracker-web && pnpm dev    # :3003

# Terminal 3 — tokens + submit (save mint-token.mjs from §5; fix ABS paths)
node mint-token.mjs 15m > /tmp/tok
node mint-token.mjs 24h > /tmp/factok
curl -s -X POST http://localhost:4003/api/fleet/jobs \
  -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett" \
  -H 'Content-Type: application/json' \
  -d '{"idempotencyKey":"notelett-demo-1","bodyMd":"# Task…","priority":"high","engine":"devin","repo":"saravanakumardb1/learning_ai_notes","baseBranch":"main","autoMerge":false}'

# Terminal 4 — factory (runs Devin → opens a real PR). NOTE the <90s heartbeat cadence.
cd ~/code/learning_ai_devops_tools/agent-queue && ./agent-queue.sh init
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 AQ_FLEET_CAPS=engine:devin AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh run --max 1
```

---

## 13. WSL on Windows — differences to note

The flow is identical **inside a WSL2 (Ubuntu) shell**, with these adjustments. Treat
WSL as "the Linux host" — install and run **everything inside WSL**, not Windows.

- **Keep repos on the WSL filesystem, not `/mnt/c`.** Clone under e.g. `~/code` inside
  WSL. On `/mnt/c` (the Windows drive over 9p) `tsx watch`/Next file‑watching is
  unreliable (inotify doesn't fire) and git/pnpm are far slower. This is the single most
  important difference.
- **Install the toolchain inside WSL** (Linux builds): `node`/`pnpm` (nvm), `git`, **`gh`**,
  and the **`devin` CLI** — and run `gh auth login` + Devin auth **inside WSL**. A `gh`/
  `devin` installed on Windows is not visible to the WSL bash factory.
- **Line endings.** Clone inside WSL (don't reuse a Windows checkout with
  `core.autocrlf=true`) so the `*.sh` scripts stay LF — CRLF breaks `agent-queue.sh`
  (`bad interpreter`/`\r`). If needed: `git config --global core.autocrlf input`.
- **Reaching the UI from the Windows browser.** WSL2 forwards `localhost`, so
  `http://localhost:3003` / `:4003` usually work from a Windows browser. If they don't
  (older Windows / mirrored‑networking off), use the WSL IP (`hostname -I`) or set
  `networkingMode=mirrored` in `.wslconfig`.
- **Ports.** Make sure nothing on the **Windows** side already binds `3003`/`4003`
  (WSL2 publishes to the same localhost). Stop the Windows process or change ports.
- **Docker (Option B), if used.** Use **Docker Desktop with the WSL2 backend** and run
  `docker compose` from inside the WSL shell. `host.docker.internal` resolves from
  containers to the host as on Mac.
- **`/tmp` token paths** (`/tmp/tok`, `/tmp/factok`) are the WSL `/tmp` — fine; just keep
  all four terminals in the same WSL distro so they share it.
- **Clock skew.** If WSL's clock drifts after sleep, JWT `iat/exp` checks can fail
  (`Invalid or expired token`) — `sudo hwclock -s` (or restart WSL) to resync.
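
Two of the WSL pitfalls above, a checkout on a Windows mount and CRLF line endings, are mechanical enough to check before launching anything. The helpers below are a hypothetical preflight sketch, not part of either repo; the `/mnt/<drive-letter>` pattern assumes WSL's default automount location.

```javascript
// True when a path lives on a Windows drive mounted into WSL (e.g. /mnt/c/...).
// Assumes the default automount root of /mnt/<single lowercase drive letter>.
function isOnWindowsMount(path) {
  return /^\/mnt\/[a-z](\/|$)/.test(path);
}

// True when text contains CRLF line endings, which break bash scripts
// such as agent-queue.sh.
function hasCrlf(text) {
  return text.includes('\r\n');
}

console.log(isOnWindowsMount('/mnt/c/Users/me/code')); // true
console.log(isOnWindowsMount('/home/me/code'));        // false
console.log(hasCrlf('#!/usr/bin/env bash\r\n'));       // true
```

Calling `isOnWindowsMount(process.cwd())` from the repo root, and `hasCrlf` on the contents of `agent-queue.sh`, covers the two checks most likely to save a confusing first run.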

Everything else — env vars, `pnpm -r build`, `tsx --env-file`, the factory env incl.
`AQ_FLEET_LEASE_RENEW_SEC=60`, token minting — is identical to the Mac host path.

---

### Reference

- Coordinator routes: `services/platform-service/src/modules/fleet/routes.ts`
- Coordinator logic: `services/platform-service/src/modules/fleet/coordinator.ts`
- Factory fleet client: `learning_ai_devops_tools/agent-queue/lib/fleet-client.sh`
- Factory runner + PR mode: `learning_ai_devops_tools/agent-queue/agent-queue.sh`
- Gigafactory spec/roadmap: `learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/`
- Prometheus scrape config: `services/monitoring/prometheus/prometheus.yml`
- Grafana dashboard: `services/monitoring/grafana/dashboards/fleet-overview.json`