Runbook — Run a Devin Fleet Job End‑to‑End (local)
Audience: developers and coding agents. Goal: stand up
platform-service + tracker-web + a fleet factory (the agent-queue runner) so a submitted job is claimed and executed autonomously by the Devin CLI against a target repo (worked example: learning_ai_notes), pushing a branch and opening a real pull request.
⚠️ This is a real, cost‑incurring, side‑effecting operation. The factory runs an autonomous coding agent (Devin) that consumes API credits, can run for a long time, pushes a branch, and opens a real PR on GitHub. Read §9 Safety & cost before launching. For unattended local prototyping only — not a production deployment guide.
1. Architecture (what talks to what)
you ──▶ tracker-web (:3003) ─┐
                             │  REST + SSE (/api/fleet/*)
 coding agent / curl ────────┼─▶ platform-service (:4003) ──▶ Azure Cosmos (jobs/runs/leases/events)
                             │         ▲           ▲
           Prometheus (:9090)┘         │           │  claim / lease-renew / report (Bearer JWT + X-Product-Id)
            Grafana (:3000) ───────────┘           │
                                                   │
                          agent-queue FACTORY (fleet mode) ──▶ Devin CLI ──▶ git push + gh pr create
                          (learning_ai_devops_tools/agent-queue)             (target repo, e.g. learning_ai_notes)
- platform-service: the fleet coordinator. Owns the job lifecycle (queued → assigned → building → review → testing → shipped|failed|dead_letter), atomic claim, leases, events, budgets, metrics. Code: services/platform-service/src/modules/fleet/.
- tracker-web (:3003): submit/inspect jobs (/dashboard/fleet/jobs/...).
- factory: learning_ai_devops_tools/agent-queue in fleet mode. Polls POST /api/fleet/claim, runs the agent CLI in an isolated checkout, reports back, and (PR mode) opens the PR.
- Prometheus/Grafana: fleet metrics + the "Fleet Overview" dashboard.
Lifecycle the factory drives:
queued ─▶ assigned ─▶ building ─▶ review ─▶ testing ─▶ shipped
          (claim)     (agent      (rc=0)    (verify    (manual/auto ship)
                      running)              passed)
                      └─ agent rc≠0 / timeout / verify fail ─▶ failed ─▶ (retry|dead_letter)
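The lifecycle above reads as a small state machine. The sketch below is illustrative only: `next_stage` and its event names (`claim`, `start`, `agent_ok`, `verify_ok`, `ship`, `fail`) are hypothetical and are not the coordinator's code or API. The diagram does not say which stage each failure leaves from, so the sketch assumes failures can happen while assigned, building, or in review.

```shell
#!/bin/sh
# Illustrative only: maps (current stage, event) to the next stage,
# following the lifecycle diagram. Event names are made up for this sketch.
next_stage() {
  stage=$1 event=$2
  case "$stage:$event" in
    queued:claim)       echo assigned ;;   # a factory claimed the job
    assigned:start)     echo building ;;   # the agent is running
    building:agent_ok)  echo review ;;     # agent exited with rc=0
    review:verify_ok)   echo testing ;;    # verification passed
    testing:ship)       echo shipped ;;    # manual or auto ship
    assigned:fail|building:fail|review:fail)
                        echo failed ;;     # rc != 0, timeout, or verify failure
    *)                  echo "invalid transition: $stage + $event" >&2
                        return 1 ;;
  esac
}
```

For example, `next_stage building agent_ok` prints `review`, while `next_stage shipped claim` returns non-zero because nothing leaves `shipped` in the diagram.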
2. Prerequisites
| Tool | Why | Check |
|---|---|---|
| Node ≥ 20 + pnpm (corepack) | host-run service, scripts, tracker-web, build | node -v && pnpm -v |
| git + gh (authenticated) | factory clones target repo, pushes branch, opens PR; gh pr merge/reconcile | gh auth status |
| devin CLI (authenticated) | the agent the factory runs | devin --version |
| Both repos cloned side‑by‑side | coordinator/dashboards + the factory | see below |
| repo .env (root of learning_ai_common_plat) | JWT_SECRET, Cosmos creds, FLEET_METRICS_TOKEN | test -f .env |
| Docker | optional — only for the Docker path (§3 Option B) / Grafana+Prometheus | docker info |
Node version: the Docker image pins node 22; for the host path any Node ≥ 20 works. Use one Node (nvm/asdf) for both repos to avoid native-module surprises.
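The "Check" column can be rolled into one preflight helper. This is a minimal sketch, assuming a POSIX shell; `missing_tools` is a hypothetical name, not part of either repo. It only confirms the binaries are on PATH, so `gh auth status` and the Devin login still need to be checked separately.

```shell
#!/bin/sh
# Print the name of every command in "$@" that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (run from the root of learning_ai_common_plat):
#   missing=$(missing_tools node pnpm git gh devin)
#   [ -z "$missing" ] || echo "missing: $missing"
#   test -f .env || echo "missing: .env"
```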
2.1 First‑time setup (fresh machine)
Clone both repos as siblings (the factory clones targets relative to a shared parent):
mkdir -p ~/code && cd ~/code
git clone <host>/learning_ai_common_plat.git
git clone <host>/learning_ai_devops_tools.git # contains agent-queue (the factory)
Create and fill .env at the root of learning_ai_common_plat:
cd ~/code/learning_ai_common_plat
cp .env.example .env
# then edit .env — minimum for the fleet flow:
# JWT_SECRET=<any strong secret; tokens are minted+verified with THIS value>
# FLEET_METRICS_TOKEN=changeme-fleet-metrics-token # only needed for Prometheus
# COSMOS_* / connection vars -> see note below
- JWT_SECRET: HS256 secret platform-service verifies tokens with. Any strong value; it only needs to be internally consistent on this machine (the token you mint in §5 and the running service must share it). Required.
- Cosmos: the default prototype talks to a real Azure Cosmos account (no emulator in the default compose). On a new machine you must either (a) point .env at the same Cosmos account (to see/share existing jobs) or (b) point at your own DB and set COSMOS_AUTO_INIT=true so containers are created on boot. Without valid Cosmos creds the service starts but every fleet call fails.
- FLEET_METRICS_TOKEN: only needed if you run Prometheus (§4); must match services/monitoring/prometheus/prometheus.yml (credentials:).
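Before starting the service it can help to confirm that the keys above are actually filled in. A minimal sketch, assuming a plain KEY=value .env file; `env_missing_keys` is a hypothetical helper. The Cosmos variable names are not listed in this runbook, so pass whichever names your .env.example uses.

```shell
#!/bin/sh
# Print each key (arguments after the file) that has no non-empty
# KEY=value line in the dotenv file given as the first argument.
env_missing_keys() {
  file=$1; shift
  for key in "$@"; do
    grep -Eq "^${key}=.+" "$file" 2>/dev/null || echo "$key"
  done
}

# Usage:
#   env_missing_keys .env JWT_SECRET FLEET_METRICS_TOKEN
```

An empty result means every listed key has a value; it does not prove the values are valid credentials.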
2.2 Install + build the workspace (required for the host path)
Host-run resolves @bytelyst/* workspace packages from their dist/ (the exports
field points at dist), so you must build them once before tsx/Next can import them:
cd ~/code/learning_ai_common_plat
pnpm install
pnpm -r build # builds all workspace packages (incl. @bytelyst/* → dist/)
# (faster, just the platform-service closure:)
# pnpm -r --filter @lysnrai/platform-service... build
Skipping this is the #1 fresh-machine failure:
tsx watch crashes with Cannot find module '@bytelyst/.../dist/index.js'. Re-run pnpm -r build after pulling changes to shared packages.
3. Bring up platform-service + tracker-web
Two ways. Option A (all localhost, no Docker) is recommended for a single dev Mac /
WSL box — everything runs on the host, so gh-backed features work out of the box.
Option B (Docker) is for when you also want the Grafana/Prometheus stack.
Option A — all localhost, no Docker (recommended)
Two long‑lived processes, each in its own terminal. Both assume §2.1/§2.2 are done
(.env filled, pnpm -r build run).
Terminal 1 — coordinator (platform-service, :4003):
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts
tsx watch hot-reloads on source changes. Use the explicit --env-file=../../.env
(the bare pnpm dev script does not load the root .env, so JWT_SECRET/Cosmos
would be missing). FLEET_METRICS_TOKEN is already in .env if you set it in §2.1.
Terminal 2 — dashboard (tracker-web, :3003):
cd ~/code/learning_ai_common_plat/dashboards/tracker-web
pnpm dev # serves http://localhost:3003 (proxies /api → :4003)
That's the whole coordinator + UI. Monitoring (Grafana/Prometheus) is optional on
the host path — GET /api/fleet/metrics (JSON), GET /api/fleet/autoscale, and the
tracker-web job pages cover observability without it. To get the Grafana "Fleet Overview"
dashboard you need Prometheus + Grafana (run them via Docker — Option B — or Homebrew
binaries pointed at services/monitoring/...).
Because everything is on the host, gh is on PATH → the PR‑state reconcile (§8)
and ship‑time gh pr merge work (unlike the Docker container, which has no gh).
Health checks:
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4003/health # 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3003 # 200
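Both services take a few seconds to come up, so a one-shot health check can fail spuriously. A minimal retry sketch, assuming a POSIX shell; `wait_until` is a hypothetical helper and the retry counts in the usage lines are arbitrary.

```shell
#!/bin/sh
# Retry a command until it succeeds: wait_until <tries> <delay_sec> <cmd...>
wait_until() {
  tries=$1; delay=$2; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# Usage: block until both services answer (curl -f fails on HTTP >= 400).
#   wait_until 30 2 curl -fs -o /dev/null http://localhost:4003/health
#   wait_until 30 2 curl -fs -o /dev/null http://localhost:3003
```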
Option B — Docker (adds Grafana + Prometheus)
cd ~/code/learning_ai_common_plat
# targeted fleet subset that always builds cleanly:
docker compose up -d --build platform-service prometheus grafana
# (full stack: bash scripts/prototype-up.sh)
Starts platform-service (:4003), prometheus (:9090), grafana (:3000,
admin/lysnrai) + deps. Still run tracker-web from source (Option A, Terminal 2).
Docker caveats:
- prototype-up.sh may fail building the dashboard images when corepack prepare pnpm@… can't fetch pnpm on a restricted network → use the targeted subset above.
- gh is NOT in the container → coordinator‑side gh pr merge and PR‑reconcile (§8) are no‑ops in Docker. Use the host path (Option A) if you need them.
- Don't run both: the container and a host tsx both bind :4003 (docker compose stop platform-service before host‑running).
4. Make Prometheus auth work (only if running Prometheus)
Skip this on the host path unless you also run Prometheus. prometheus.yml scrapes
/api/fleet/metrics/prom with a bearer, so the running platform-service must see the
same FLEET_METRICS_TOKEN:
cd ~/code/learning_ai_common_plat
grep -q '^FLEET_METRICS_TOKEN=' .env || \
printf '\nFLEET_METRICS_TOKEN=changeme-fleet-metrics-token\n' >> .env
# host path: restart Terminal-1 tsx so it re-reads .env
# docker path: docker compose up -d platform-service
Verify (if Prometheus is up): http://localhost:9090/api/v1/targets →
platform-service-fleet is up. The value must equal credentials: in
services/monitoring/prometheus/prometheus.yml.
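The token comparison can be scripted instead of eyeballed. This is a sketch under assumptions: it expects an unquoted FLEET_METRICS_TOKEN=value line in .env and a single `credentials:` line in prometheus.yml (the usual Prometheus `authorization` layout, not confirmed against this repo's file). All three function names are hypothetical.

```shell
#!/bin/sh
# Value of KEY ($2) from a dotenv-style file ($1); the last assignment wins.
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# First "credentials:" value in a Prometheus config, surrounding quotes stripped.
prom_credentials() {
  sed -n 's/^[[:space:]]*credentials:[[:space:]]*//p' "$1" | head -n 1 | tr -d "\"'"
}

# Succeeds when .env ($1) and prometheus.yml ($2) agree on a non-empty token.
metrics_token_matches() {
  [ -n "$(env_value "$1" FLEET_METRICS_TOKEN)" ] &&
    [ "$(env_value "$1" FLEET_METRICS_TOKEN)" = "$(prom_credentials "$2")" ]
}

# Usage (from the root of learning_ai_common_plat):
#   metrics_token_matches .env services/monitoring/prometheus/prometheus.yml \
#     || echo "FLEET_METRICS_TOKEN does not match prometheus.yml"
```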
5. Mint a local API token (dev only)
platform-service verifies HS256 JWTs signed with JWT_SECRET and requires
type: "access". The tracker-web UI obtains one via login; for scripts/agents and
the factory, mint one directly. Local dev only — never commit tokens or the secret.
Save mint-token.mjs (resolve jose from the workspace):
import { readFileSync } from 'node:fs';
// adjust the jose path to your checkout if needed:
import { SignJWT } from '/ABS/PATH/learning_ai_common_plat/node_modules/.pnpm/jose@5.10.0/node_modules/jose/dist/node/esm/index.js';
const env = readFileSync('/ABS/PATH/learning_ai_common_plat/.env', 'utf8');
const secret = new TextEncoder().encode(env.match(/^JWT_SECRET=(.*)$/m)[1].trim());
const ttl = process.argv[2] || '15m'; // e.g. '15m' for scripts, '24h' for a factory
process.stdout.write(
  await new SignJWT({ sub: 'local-dev', role: 'admin', type: 'access' })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime(ttl)
    .sign(secret)
);
node mint-token.mjs 15m > /tmp/tok # short-lived, for API calls
node mint-token.mjs 24h > /tmp/factok # longer-lived, for the factory daemon
Find the jose path with:
find . -path '*/node_modules/jose/dist/*/esm/index.js' | head -1.
Requests must also carry the product header X-Product-Id: <productId>
(e.g. notelett). role: admin bypasses tenant ownership checks when
FLEET_TENANT_ENFORCEMENT is on (it's off by default).
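When a call returns 401, it is often quicker to look inside the token than to re-mint blindly. The sketch below decodes the payload segment of a JWT so you can read `type` and `exp`. It does not verify the signature. `jwt_payload` is a hypothetical helper, and it assumes GNU-style `base64 -d`.

```shell
#!/bin/sh
# Print the decoded JSON payload (second dot-separated segment) of a JWT.
# Decoding only: the signature is NOT checked.
jwt_payload() {
  # base64url -> base64: '_' becomes '/', '-' becomes '+'
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # restore the '=' padding that JWTs strip
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Usage:
#   jwt_payload "$(cat /tmp/tok)"
```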
6. Submit a job
Via tracker-web (preferred)
Open http://localhost:3003/dashboard/fleet/jobs and click "New job". Set the correct
product first (the product selector) — a job is partitioned by productId, and
submitting under the wrong product misattributes cost/metrics/ownership and the
factory won't see it under the product it polls.
PR‑mode fields that matter:
- repo: must be owner/name (e.g. saravanakumardb1/learning_ai_notes) or a clone URL, not a bare name (the factory feeds it to gh).
- baseBranch: e.g. main.
- engine: devin (pins the agent; otherwise the factory's default/engineClass).
- autoMerge: leave false for a human merge gate (recommended for large PRs).
Via API
JOB=$(curl -s -X POST http://localhost:4003/api/fleet/jobs \
  -H "Authorization: Bearer $(cat /tmp/tok)" \
  -H "X-Product-Id: notelett" -H 'Content-Type: application/json' \
  -d '{
    "idempotencyKey": "notelett-demo-1",
    "bodyMd": "# Task\n…full prompt…",
    "priority": "high",
    "engine": "devin",
    "repo": "saravanakumardb1/learning_ai_notes",
    "baseBranch": "main",
    "autoMerge": false
  }')
echo "$JOB" # → { outcome: "created", job: { id: "fjob_…", stage: "queued", ... } }
The job is now queued and claimable. It will not run until a factory polls
for its product (next step).
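The observe and reconcile steps need the job id from the submit response. A minimal sed sketch for machines without jq, assuming the response shape shown in the comment above (a job id starting with fjob_); `job_id_from_response` is a hypothetical helper. If a single line holds several fjob_ ids, it returns the last one on that line.

```shell
#!/bin/sh
# Read a JSON response on stdin and print the job id (the "id" value
# that starts with fjob_). Assumes the id contains no escaped quotes.
job_id_from_response() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\(fjob_[^"]*\)".*/\1/p' | head -n 1
}

# Usage, with $JOB holding the response from the POST above:
#   JOB_ID=$(printf '%s' "$JOB" | job_id_from_response)
#   echo "http://localhost:3003/dashboard/fleet/jobs/$JOB_ID"
```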
7. Start the factory (agent-queue, fleet mode)
The factory lives in a separate repo: learning_ai_devops_tools/agent-queue.
Run it on the host (needs devin + gh). Read its docs/RUN_POLICY.md first.
7a. Sanity‑check connectivity (safe — registers + heartbeats only)
cd learning_ai_devops_tools/agent-queue
./agent-queue.sh init # idempotent
AQ_FLEET=1 AQ_FLEET_ROUTE=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
./agent-queue.sh fleet-status # → "heartbeat OK (registered)."
7b. Launch the run loop (claims + runs the agent)
cd learning_ai_devops_tools/agent-queue
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
./agent-queue.sh run --max 1
⚠️ Set AQ_FLEET_LEASE_RENEW_SEC below 90 (e.g. 60). This is the heartbeat/lease‑renew cadence. The coordinator's reaper marks a factory stale after 90s (DEFAULT_STALE_FACTORY_MS, a constant with no env knob) and reclaims its in‑flight lease. The default cadence is 300s, so a busy single‑slot factory looks stale for most of every cycle and its running job gets requeued mid‑run (leaseEpoch climbs, stage flips back to queued, and the final report is fenced, so the job never advances to review/shipped). 60s keeps it comfortably live. Use the same value in the §7a fleet-status check for consistency.
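The arithmetic behind "stale for most of every cycle" can be checked directly. This sketch uses a simplified model: the factory heartbeats at t=0, becomes stale once the 90s window elapses, and stays stale until its next heartbeat. It ignores how often the reaper actually runs. The function names are hypothetical.

```shell
#!/bin/sh
STALE_WINDOW_SEC=90   # coordinator's stale-factory window, per the warning above

# Seconds per heartbeat cycle during which the factory already looks stale.
stale_seconds_per_cycle() {
  renew=$1
  if [ "$renew" -gt "$STALE_WINDOW_SEC" ]; then
    echo $((renew - STALE_WINDOW_SEC))
  else
    echo 0
  fi
}

# Succeeds only when the cadence is strictly below the stale window.
lease_cadence_ok() {
  [ "$1" -lt "$STALE_WINDOW_SEC" ]
}

stale_seconds_per_cycle 300   # default cadence: prints 210
stale_seconds_per_cycle 60    # recommended cadence: prints 0
```

With the 300s default the factory is reclaimable for 210 of every 300 seconds (70% of the cycle); at 60s it never crosses the window.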
Key fleet env vars (see lib/fleet-client.sh):
| Var | Meaning |
|---|---|
| AQ_FLEET=1 | master switch: enable coordinator calls (0 = pure offline) |
| AQ_FLEET_ROUTE=1 | coordinator is authoritative for claim (pulls work from platform-service) |
| AQ_FLEET_PR=1 | PR mode: open a PR for jobs that target a repo |
| AQ_FLEET_API | base URL including /api (http://localhost:4003/api) |
| AQ_FLEET_TOKEN | Bearer JWT (mint per §5; TTL ≥ run duration, e.g. 24h) |
| AQ_PRODUCT_ID | product to poll, sent as X-Product-Id (must match the job's product) |
| AQ_FACTORY_ID | this factory's id (registered/heartbeated) |
| AQ_FLEET_CAPS | advertised capabilities, e.g. engine:devin |
| AQ_FLEET_LEASE_RENEW_SEC | set <90 (e.g. 60): heartbeat/renew cadence vs the 90s stale window (see warning) |
| AQ_FLEET_REPO_BASE | (optional) dir of local checkouts; if …/<repo>/.git exists it uses a git worktree, else it git clones https://github.com/<repo>.git into its cache |
| AQ_FLEET_AUTOSHIP=1 | (optional) auto-advance to shipped (skips the manual gate) |
The run loop claims a job (queued → assigned → building), runs Devin in an isolated checkout,
heartbeats and renews the lease (lease_renewed events) so the reaper doesn't reclaim it,
then when the agent exits cleanly (rc=0) moves the job to review and (PR mode) opens the PR.
With autoMerge:false it stops at the human merge gate.
Repo checkout: the job's repo is owner/name, so by default the factory git clones https://github.com/<owner>/<name>.git into its own cache (queue/.state/repos/…). That gives clean isolation: nothing touches your working copies. To reuse an existing local clone via a git worktree instead, set AQ_FLEET_REPO_BASE=<parent> where <parent>/<owner>/<name>/.git exists.
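To predict which path the factory will take for a given job, the rule above can be mirrored in a few lines. This is a sketch of the described behavior, not the factory's actual code; `checkout_mode` is a hypothetical name.

```shell
#!/bin/sh
# Print "worktree" when <base>/<owner>/<name>/.git exists, else "clone".
# $1 = value of AQ_FLEET_REPO_BASE (may be empty), $2 = owner/name
checkout_mode() {
  base=$1 repo=$2
  if [ -n "$base" ] && [ -e "$base/$repo/.git" ]; then
    echo worktree
  else
    echo clone
  fi
}

# Usage:
#   checkout_mode "$AQ_FLEET_REPO_BASE" saravanakumardb1/learning_ai_notes
```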
8. Observe progress
- tracker-web: http://localhost:3003/dashboard/fleet/jobs/<jobId> shows the live event stream (SSE), runs, PR link + state.
- Events/API:
  curl -s http://localhost:4003/api/fleet/jobs/<jobId>/events \
    -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett"
- Metrics: GET /api/fleet/metrics (JSON, per product) · GET /api/fleet/metrics/prom (Prometheus, all products; needs FLEET_METRICS_TOKEN) · Grafana Fleet Overview (http://localhost:3000/d/fleet-overview).
- Autoscale signal: GET /api/fleet/autoscale (this product) / …/autoscale/all.
PR‑state reconcile (externally‑merged PRs)
If you merge the PR in the GitHub UI, the coordinator doesn't know until told. Trigger
a reconcile (flips run prState → merged when gh pr view reports MERGED):
- UI: "Refresh PR status" button on the job's PR section, or
- API: POST /api/fleet/jobs/<jobId>/pr/reconcile.
Requires gh where platform-service runs → use the host path (§3 Option A); it's a no‑op in the Docker container (no gh).
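The API variant can be wrapped in a small helper. A minimal sketch: `reconcile_pr` and the FLEET_API, TOKEN_FILE and PRODUCT_ID variables are hypothetical names with defaults taken from this runbook. No Content-Type header is sent because the request has no body.

```shell
#!/bin/sh
# POST the PR-state reconcile request for one job id.
# Defaults follow this runbook; override via FLEET_API, TOKEN_FILE, PRODUCT_ID.
reconcile_pr() {
  job_id=$1
  curl -s -X POST "${FLEET_API:-http://localhost:4003/api}/fleet/jobs/$job_id/pr/reconcile" \
    -H "Authorization: Bearer $(cat "${TOKEN_FILE:-/tmp/tok}")" \
    -H "X-Product-Id: ${PRODUCT_ID:-notelett}"
}

# Usage:
#   reconcile_pr "<jobId>"
```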
9. Safety & cost
- Billable + autonomous + long‑running. Each run consumes Devin credits and can run for a long time unattended. Scope jobs deliberately; very large multi‑workstream specs are better split into several jobs.
- Real PR. PR mode pushes a branch and opens a PR on the target repo. Keep autoMerge:false so a human reviews/merges; gh pr merge (auto) only fires when the job opts in or FLEET_SHIP_MERGES_PR=1.
- Isolation. The factory works in an isolated worktree/clone, never your main checkout (per agent-queue/docs/RUN_POLICY.md). Avoid blanket --yolo on live trees.
- Stopping the daemon mid‑run lets the lease expire; the coordinator's reaper then reclaims and requeues the job (so partial work may be retried). Stop intentionally.
- Tokens/secrets: the minted JWT and JWT_SECRET are sensitive; never commit them or paste them into shared logs. .env is git‑ignored; keep it that way.
10. Teardown
# stop the factory: Ctrl-C the run loop
# host path: Ctrl-C the tsx (Terminal 1) and pnpm dev (Terminal 2)
# docker path:
# cd ~/code/learning_ai_common_plat && docker compose down # keep volumes
# docker compose down -v # also drop volumes
rm -f /tmp/tok /tmp/factok # discard minted tokens
11. Troubleshooting
| Symptom | Cause → Fix |
|---|---|
| Cannot find module '@bytelyst/…/dist/index.js' on tsx/Next start | workspace packages not built → pnpm -r build (§2.2). |
| 401 {"error":"Invalid or expired token"} | JWT expired/mis‑signed → re‑mint (§5); ensure same JWT_SECRET as the running service. |
| Job claimed then flips back to queued mid‑run; leaseEpoch keeps climbing; final report fenced; PR opens but job never reaches review/shipped | factory heartbeat cadence (AQ_FLEET_LEASE_RENEW_SEC, default 300s) > reaper stale window (90s) → set AQ_FLEET_LEASE_RENEW_SEC=60 (§7). To recover the record after the fact, reconcile PR state (§8). |
| Job stays queued, never claimed | No factory for that product → fleet-status shows it registered? AQ_PRODUCT_ID must equal the job's product. Check GET /api/fleet/factories (X‑Product‑Id) for 0 live. |
| POST …/pr/reconcile or ship auto‑merge does nothing | gh not present where platform-service runs (Docker container) → run the host path (§3 Option A). |
| Prometheus target platform-service-fleet = down (401) | service missing FLEET_METRICS_TOKEN → §4 (restart host tsx / recreate container). |
| prototype-up.sh build fails on corepack prepare pnpm | dashboard image network issue → use the targeted subset, or just use the host path (Option A). |
| POST …/actions/<x> returns 500 "Body cannot be empty" | sent Content-Type: application/json with no body → omit the header or send {}. |
| Port 4003 conflict | host tsx watch and a platform-service container both bind 4003 → run only one. |
| gh pr create fails | repo is a bare name → must be owner/name or a clone URL; confirm gh auth status. |
| PR/cost attributed to wrong product | job submitted under the wrong productId partition → resubmit under the right product and cancel the stray (POST …/actions/cancel). |
| vitest exits non‑zero with kill EPERM after all suites pass | worker‑pool teardown artifact (sandbox), not a test failure → re‑run; all suites already passed. |
12. Copy‑paste quickstart — all localhost (notelett → learning_ai_notes)
Assumes §2.1/§2.2 done (.env filled, pnpm -r build run). Four terminals.
# Terminal 1 — coordinator
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts
# Terminal 2 — dashboard
cd ~/code/learning_ai_common_plat/dashboards/tracker-web && pnpm dev # :3003
# Terminal 3 — tokens + submit (save mint-token.mjs from §5; fix ABS paths)
node mint-token.mjs 15m > /tmp/tok
node mint-token.mjs 24h > /tmp/factok
curl -s -X POST http://localhost:4003/api/fleet/jobs \
-H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett" \
-H 'Content-Type: application/json' \
-d '{"idempotencyKey":"notelett-demo-1","bodyMd":"# Task…","priority":"high","engine":"devin","repo":"saravanakumardb1/learning_ai_notes","baseBranch":"main","autoMerge":false}'
# Terminal 4 — factory (runs Devin → opens a real PR). NOTE the <90s heartbeat cadence.
cd ~/code/learning_ai_devops_tools/agent-queue && ./agent-queue.sh init
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 AQ_FLEET_CAPS=engine:devin AQ_FLEET_LEASE_RENEW_SEC=60 \
./agent-queue.sh run --max 1
13. WSL on Windows — differences to note
The flow is identical inside a WSL2 (Ubuntu) shell, with these adjustments. Treat WSL as "the Linux host" — install and run everything inside WSL, not Windows.
- Keep repos on the WSL filesystem, not /mnt/c. Clone under e.g. ~/code inside WSL. On /mnt/c (the Windows drive over 9p) tsx watch/Next file‑watching is unreliable (inotify doesn't fire) and git/pnpm are far slower. This is the single most important difference.
- Install the toolchain inside WSL (Linux builds): node/pnpm (nvm), git, gh, and the devin CLI, and run gh auth login + Devin auth inside WSL. A gh/devin installed on Windows is not visible to the WSL bash factory.
- Line endings. Clone inside WSL (don't reuse a Windows checkout with core.autocrlf=true) so the *.sh scripts stay LF; CRLF breaks agent-queue.sh (bad interpreter/\r). If needed: git config --global core.autocrlf input.
- Reaching the UI from the Windows browser. WSL2 forwards localhost, so http://localhost:3003 / :4003 usually work from a Windows browser. If they don't (older Windows / mirrored‑networking off), use the WSL IP (hostname -I) or set networkingMode=mirrored in .wslconfig.
- Ports. Make sure nothing on the Windows side already binds 3003/4003 (WSL2 publishes to the same localhost). Stop the Windows process or change ports.
- Docker (Option B), if used. Use Docker Desktop with the WSL2 backend and run docker compose from inside the WSL shell. host.docker.internal resolves from containers to the host as on Mac.
- /tmp token paths (/tmp/tok, /tmp/factok) are the WSL /tmp, which is fine; just keep all four terminals in the same WSL distro so they share it.
- Clock skew. If WSL's clock drifts after sleep, JWT iat/exp checks can fail (Invalid or expired token); run sudo hwclock -s (or restart WSL) to resync.
Everything else — env vars, pnpm -r build, tsx --env-file, the factory env incl.
AQ_FLEET_LEASE_RENEW_SEC=60, token minting — is identical to the Mac host path.
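Two of the WSL pitfalls above (a checkout on a Windows drive, CRLF line endings in the scripts) are easy to detect up front. A minimal sketch, assuming a POSIX shell inside WSL; both function names are hypothetical.

```shell
#!/bin/sh
# Succeeds when the path sits on a Windows drive mounted under /mnt/<letter>.
on_windows_drive() {
  case "$1" in
    /mnt/[a-z]|/mnt/[a-z]/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds when the file contains at least one carriage return (CRLF endings).
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}

# Usage (from the agent-queue directory):
#   on_windows_drive "$PWD" && echo "move this checkout off /mnt"
#   has_crlf agent-queue.sh && echo "agent-queue.sh has CRLF line endings"
```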
Reference
- Coordinator routes: services/platform-service/src/modules/fleet/routes.ts
- Coordinator logic: services/platform-service/src/modules/fleet/coordinator.ts
- Factory fleet client: learning_ai_devops_tools/agent-queue/lib/fleet-client.sh
- Factory runner + PR mode: learning_ai_devops_tools/agent-queue/agent-queue.sh
- Gigafactory spec/roadmap: learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/
- Prometheus scrape config: services/monitoring/prometheus/prometheus.yml
- Grafana dashboard: services/monitoring/grafana/dashboards/fleet-overview.json