learning_ai_common_plat/docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md


Runbook: Run a Devin Fleet Job End-to-End (local)

Audience: developers and coding agents. Goal: stand up platform-service + tracker-web + a fleet factory (the agent-queue runner) so a submitted job is claimed and executed autonomously by the Devin CLI against a target repo (worked example: learning_ai_notes), pushing a branch and opening a real pull request.

⚠️ This is a real, cost-incurring, side-effecting operation. The factory runs an autonomous coding agent (Devin) that consumes API credits, can run for a long time, pushes a branch, and opens a real PR on GitHub. Read §9 Safety & cost before launching. For unattended local prototyping only, not a production deployment guide.


1. Architecture (what talks to what)

 you ──▶ tracker-web (:3003) ─┐
                              │  REST + SSE (/api/fleet/*)
 coding agent / curl ────────┼─▶ platform-service (:4003) ──▶ Azure Cosmos (jobs/runs/leases/events)
                              │        ▲   ▲
            Prometheus (:9090)┘        │   │ claim / lease-renew / report  (Bearer JWT + X-Product-Id)
            Grafana (:3000) ───────────┘   │
                                           │
                       agent-queue FACTORY (fleet mode) ──▶ Devin CLI ──▶ git push + gh pr create
                       (learning_ai_devops_tools/agent-queue)             (target repo, e.g. learning_ai_notes)
  • platform-service — the fleet coordinator. Owns the job lifecycle (queued → assigned → building → review → testing → shipped|failed|dead_letter), atomic claim, leases, events, budgets, metrics. Code: services/platform-service/src/modules/fleet/.
  • tracker-web (:3003) — submit/inspect jobs (/dashboard/fleet/jobs/...).
  • factory: learning_ai_devops_tools/agent-queue in fleet mode. Polls POST /api/fleet/claim, runs the agent CLI in an isolated checkout, reports back, and (PR mode) opens the PR.
  • Prometheus/Grafana — fleet metrics + the "Fleet Overview" dashboard.

Lifecycle the factory drives:

queued ─▶ assigned ─▶ building ─▶ review ─▶ testing ─▶ shipped
            (claim)    (agent     (rc=0)    (verify    (manual/auto ship)
                        running)             passed)
                            └─ agent rc≠0 / timeout / verify fail ─▶ failed ─▶ (retry|dead_letter)

2. Prerequisites

| Tool | Why | Check |
| --- | --- | --- |
| Node ≥ 20 + pnpm (corepack) | host-run service, scripts, tracker-web, build | node -v && pnpm -v |
| git + gh (authenticated) | factory clones target repo, pushes branch, opens PR; gh pr merge/reconcile | gh auth status |
| devin CLI (authenticated) | the agent the factory runs | devin --version |
| Both repos cloned side-by-side | coordinator/dashboards + the factory | see below |
| repo .env (root of learning_ai_common_plat) | JWT_SECRET, Cosmos creds, FLEET_METRICS_TOKEN | test -f .env |
| Docker | optional; only for the Docker path (§3 Option B) / Grafana+Prometheus | docker info |

Node version: the Docker image pins node 22; for the host path any Node ≥ 20 works. Use one Node (nvm/asdf) for both repos to avoid native-module surprises.
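
The presence checks in the table can be run in one pass. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, not something shipped in either repo, and it only checks that each binary is on PATH (authentication still needs `gh auth status` and the Devin login).

```shell
#!/bin/sh
# Print every tool from the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of either repo.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools the host path needs, per the prerequisites table.
missing=$(missing_tools node pnpm git gh devin)
if [ -n "$missing" ]; then
  echo "missing tools:" $missing
else
  echo "all required tools found"
fi
```

Add `docker` to the argument list only if you plan to use the Docker path.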

2.1 First-time setup (fresh machine)

Clone both repos as siblings (the factory clones targets relative to a shared parent):

mkdir -p ~/code && cd ~/code
git clone <host>/learning_ai_common_plat.git
git clone <host>/learning_ai_devops_tools.git      # contains agent-queue (the factory)

Create and fill .env at the root of learning_ai_common_plat:

cd ~/code/learning_ai_common_plat
cp .env.example .env
# then edit .env — minimum for the fleet flow:
#   JWT_SECRET=<any strong secret; tokens are minted+verified with THIS value>
#   FLEET_METRICS_TOKEN=changeme-fleet-metrics-token      # only needed for Prometheus
#   COSMOS_*  / connection vars  -> see note below
  • JWT_SECRET — HS256 secret platform-service verifies tokens with. Any strong value; it only needs to be internally consistent on this machine (the token you mint in §5 and the running service must share it). Required.
  • Cosmos — the default prototype talks to a real Azure Cosmos account (no emulator in the default compose). On a new machine you must either (a) point .env at the same Cosmos account (to see/share existing jobs) or (b) point at your own DB and set COSMOS_AUTO_INIT=true so containers are created on boot. Without valid Cosmos creds the service starts but every fleet call fails.
  • FLEET_METRICS_TOKEN — only needed if you run Prometheus (§4); must match services/monitoring/prometheus/prometheus.yml (credentials:).
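
Before starting the service it can help to confirm that the keys named above are actually set. This is a sketch under assumptions: `missing_env_keys` is a hypothetical helper, and it only looks for non-empty `KEY=value` lines. The Cosmos variable names are not listed here, so take them from `.env.example`.

```shell
#!/bin/sh
# Report which of the given keys have no non-empty KEY=value line in an env file.
# missing_env_keys is a hypothetical helper name.
missing_env_keys() {
  env_file=$1
  shift
  for key in "$@"; do
    grep -Eq "^${key}=.+" "$env_file" 2>/dev/null || printf '%s\n' "$key"
  done
}

# Usage (run from the root of learning_ai_common_plat):
#   missing_env_keys .env JWT_SECRET FLEET_METRICS_TOKEN
```

An empty result means every listed key has a value; any printed name still needs filling in.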

2.2 Install + build the workspace (required for the host path)

Host-run resolves @bytelyst/* workspace packages from their dist/ (the exports field points at dist), so you must build them once before tsx/Next can import them:

cd ~/code/learning_ai_common_plat
pnpm install
pnpm -r build          # builds all workspace packages (incl. @bytelyst/* → dist/)
# (faster, just the platform-service closure:)
# pnpm -r --filter @lysnrai/platform-service... build

Skipping this is the #1 fresh-machine failure: tsx watch crashes with Cannot find module '@bytelyst/.../dist/index.js'. Re-run pnpm -r build after pulling changes to shared packages.


3. Bring up platform-service + tracker-web

Two ways. Option A (all localhost, no Docker) is recommended for a single dev Mac / WSL box — everything runs on the host, so gh-backed features work out of the box. Option B (Docker) is for when you also want the Grafana/Prometheus stack.

Option A (all localhost, no Docker)

Two long-lived processes, each in its own terminal. Both assume §2.1/§2.2 are done (.env filled, pnpm -r build run).

Terminal 1 — coordinator (platform-service, :4003):

cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts

tsx watch hot-reloads on source changes. Use the explicit --env-file=../../.env (the bare pnpm dev script does not load the root .env, so JWT_SECRET/Cosmos would be missing). FLEET_METRICS_TOKEN is already in .env if you set it in §2.1.

Terminal 2 — dashboard (tracker-web, :3003):

cd ~/code/learning_ai_common_plat/dashboards/tracker-web
pnpm dev          # serves http://localhost:3003 (proxies /api → :4003)

That's the whole coordinator + UI. Monitoring (Grafana/Prometheus) is optional on the host path — GET /api/fleet/metrics (JSON), GET /api/fleet/autoscale, and the tracker-web job pages cover observability without it. To get the Grafana "Fleet Overview" dashboard you need Prometheus + Grafana (run them via Docker — Option B — or Homebrew binaries pointed at services/monitoring/...).

Because everything is on the host, gh is on PATH → the PR-state reconcile (§8) and ship-time gh pr merge work (unlike the Docker container, which has no gh).

Health checks:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4003/health   # 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3003          # 200

Option B — Docker (adds Grafana + Prometheus)

cd ~/code/learning_ai_common_plat
# targeted fleet subset that always builds cleanly:
docker compose up -d --build platform-service prometheus grafana
# (full stack: bash scripts/prototype-up.sh)

Starts platform-service (:4003), prometheus (:9090), grafana (:3000, admin/lysnrai) + deps. Still run tracker-web from source (Option A, Terminal 2).

Docker caveats:

  • prototype-up.sh may fail building the dashboard images when corepack prepare pnpm@… can't fetch pnpm on a restricted network → use the targeted subset above.
  • gh is NOT in the container → coordinator-side gh pr merge and PR reconcile (§8) are no-ops in Docker. Use the host path (Option A) if you need them.
  • Don't run both: the container and a host tsx both bind :4003 (docker compose stop platform-service before host-running).

4. Make Prometheus auth work (only if running Prometheus)

Skip this on the host path unless you also run Prometheus. prometheus.yml scrapes /api/fleet/metrics/prom with a bearer, so the running platform-service must see the same FLEET_METRICS_TOKEN:

cd ~/code/learning_ai_common_plat
grep -q '^FLEET_METRICS_TOKEN=' .env || \
  printf '\nFLEET_METRICS_TOKEN=changeme-fleet-metrics-token\n' >> .env
# host path: restart Terminal-1 tsx so it re-reads .env
# docker path: docker compose up -d platform-service

Verify (if Prometheus is up): http://localhost:9090/api/v1/targets should show platform-service-fleet as up. The token value must equal credentials: in services/monitoring/prometheus/prometheus.yml.
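
The token comparison can also be done without starting Prometheus. The sketch below is an assumption-laden illustration: `metrics_token_matches` is a hypothetical helper, and it assumes the token appears in the config as a single `credentials: <token>` line (optionally quoted).

```shell
#!/bin/sh
# Compare FLEET_METRICS_TOKEN in an env file with the credentials: value in a
# Prometheus config. Assumes "credentials: <token>" sits on one line; quotes
# around the token are stripped. metrics_token_matches is a hypothetical helper.
metrics_token_matches() {
  env_file=$1
  prom_file=$2
  env_tok=$(sed -n 's/^FLEET_METRICS_TOKEN=//p' "$env_file" | head -n 1)
  prom_tok=$(sed -n 's/^[[:space:]]*credentials:[[:space:]]*//p' "$prom_file" \
    | head -n 1 | tr -d "\"'")
  [ -n "$env_tok" ] && [ "$env_tok" = "$prom_tok" ]
}

# Usage (from the root of learning_ai_common_plat):
#   metrics_token_matches .env services/monitoring/prometheus/prometheus.yml \
#     && echo "tokens match" || echo "tokens differ"
```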


5. Mint a local API token (dev only)

platform-service verifies HS256 JWTs signed with JWT_SECRET and requires type: "access". The tracker-web UI obtains one via login; for scripts/agents and the factory, mint one directly. Local dev only — never commit tokens or the secret.

Save mint-token.mjs (resolve jose from the workspace):

import { readFileSync } from 'node:fs';
// adjust the jose path to your checkout if needed:
import { SignJWT } from '/ABS/PATH/learning_ai_common_plat/node_modules/.pnpm/jose@5.10.0/node_modules/jose/dist/node/esm/index.js';

const env = readFileSync('/ABS/PATH/learning_ai_common_plat/.env', 'utf8');
const secret = new TextEncoder().encode(env.match(/^JWT_SECRET=(.*)$/m)[1].trim());
const ttl = process.argv[2] || '15m'; // e.g. '15m' for scripts, '24h' for a factory
process.stdout.write(
  await new SignJWT({ sub: 'local-dev', role: 'admin', type: 'access' })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime(ttl)
    .sign(secret)
);

Then mint the tokens:

node mint-token.mjs 15m > /tmp/tok        # short-lived, for API calls
node mint-token.mjs 24h > /tmp/factok     # longer-lived, for the factory daemon

Find the jose path with: find . -path '*/node_modules/jose/dist/*/esm/index.js' | head -1.

Requests must also carry the product header X-Product-Id: <productId> (e.g. notelett). role: admin bypasses tenant ownership checks when FLEET_TENANT_ENFORCEMENT is on (it's off by default).
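
When a call returns 401, inspecting the claims in the minted token is a quick first check. The following is a sketch, not part of the repo: `jwt_payload` is a hypothetical helper that decodes the middle segment without verifying the signature, and it assumes a `base64` that accepts `-d` (GNU coreutils, recent macOS).

```shell
#!/bin/sh
# Print the JSON payload (middle segment) of a JWT without verifying the signature.
# jwt_payload is a hypothetical helper; assumes base64 supports -d.
jwt_payload() {
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops padding; restore it so the length is a multiple of 4.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Usage: jwt_payload "$(cat /tmp/tok)"
```

In the output, confirm that type is "access" and that exp (seconds since the epoch) is still in the future.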


6. Submit a job

Via tracker-web (preferred)

Open http://localhost:3003/dashboard/fleet/jobs and click "New job". Set the correct product first (the product selector): a job is partitioned by productId, so submitting under the wrong product misattributes cost/metrics/ownership, and the factory won't see the job under the product it polls.

PR-mode fields that matter:

  • repo — must be owner/name (e.g. saravanakumardb1/learning_ai_notes) or a clone URL, not a bare name (the factory feeds it to gh).
  • baseBranch — e.g. main.
  • engine: devin (pins the agent; otherwise the factory's default/engineClass applies).
  • autoMerge — leave false for a human merge gate (recommended for large PRs).
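
The repo rule above (owner/name or a clone URL, never a bare name) can be checked before submitting. This is a minimal sketch; `is_valid_fleet_repo` is a hypothetical helper, and the coordinator's own validation may accept or reject more forms than this.

```shell
#!/bin/sh
# Return 0 when the repo value is owner/name or looks like a clone URL,
# and non-zero for a bare name. is_valid_fleet_repo is a hypothetical helper;
# the coordinator's own validation may differ.
is_valid_fleet_repo() {
  case $1 in
    https://*|http://*|ssh://*|git@*:*) return 0 ;;   # clone URL forms
    */*/*|*/|/*) return 1 ;;                          # too many or empty segments
    */*) return 0 ;;                                  # owner/name
    *) return 1 ;;                                    # bare name
  esac
}

# Usage:
#   is_valid_fleet_repo saravanakumardb1/learning_ai_notes && echo "ok to submit"
```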

Via API

JOB=$(curl -s -X POST http://localhost:4003/api/fleet/jobs \
  -H "Authorization: Bearer $(cat /tmp/tok)" \
  -H "X-Product-Id: notelett" -H 'Content-Type: application/json' \
  -d '{
    "idempotencyKey": "notelett-demo-1",
    "bodyMd": "# Task\n…full prompt…",
    "priority": "high",
    "engine": "devin",
    "repo": "saravanakumardb1/learning_ai_notes",
    "baseBranch": "main",
    "autoMerge": false
  }')
echo "$JOB"   # → { outcome: "created", job: { id: "fjob_…", stage: "queued", ... } }

The job is now queued and claimable. It will not run until a factory polls for its product (next step).


7. Start the factory (agent-queue, fleet mode)

The factory lives in a separate repo: learning_ai_devops_tools/agent-queue. Run it on the host (needs devin + gh). Read its docs/RUN_POLICY.md first.

7a. Sanity-check connectivity (safe: registers + heartbeats only)

cd ~/code/learning_ai_devops_tools/agent-queue
./agent-queue.sh init   # idempotent

AQ_FLEET=1 AQ_FLEET_ROUTE=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh fleet-status      # → "heartbeat OK (registered)."

7b. Launch the run loop (claims + runs the agent)

cd ~/code/learning_ai_devops_tools/agent-queue
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 \
AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett \
AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 \
AQ_FLEET_CAPS=engine:devin \
AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh run --max 1

⚠️ Set AQ_FLEET_LEASE_RENEW_SEC below 90 (e.g. 60). This is the heartbeat/lease-renew cadence. The coordinator's reaper marks a factory stale after 90s (DEFAULT_STALE_FACTORY_MS, a constant with no env knob) and reclaims its in-flight lease. The default cadence is 300s, so a busy single-slot factory looks stale for most of every cycle and its running job gets requeued mid-run (leaseEpoch climbs, stage flips back to queued, and the final report is fenced, so the job never settles into review/shipped). 60s keeps it comfortably live. (The §7a fleet-status check uses the same value for consistency.)
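
The arithmetic behind the warning can be made explicit. The sketch below uses a simplified model (heartbeats land exactly on the cadence, and the factory becomes stale 90s after each one); the function names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Stale window from the text above: the reaper marks a factory stale after 90s.
STALE_WINDOW_SEC=90

# Seconds per heartbeat cycle during which the factory looks stale.
# stale_seconds_per_cycle is a hypothetical helper for illustration.
stale_seconds_per_cycle() {
  cadence=$1
  if [ "$cadence" -gt "$STALE_WINDOW_SEC" ]; then
    echo $(( cadence - STALE_WINDOW_SEC ))
  else
    echo 0
  fi
}

# Guard to run before launching the factory: succeeds only below the window.
# An unset or empty value falls back to the 300s default cadence.
lease_renew_ok() { [ "${1:-300}" -lt "$STALE_WINDOW_SEC" ]; }

stale_seconds_per_cycle 300   # prints 210 (stale for 210s of every 300s cycle)
stale_seconds_per_cycle 60    # prints 0
```

A launch script could call `lease_renew_ok "$AQ_FLEET_LEASE_RENEW_SEC" || echo "set AQ_FLEET_LEASE_RENEW_SEC below 90"` before starting the run loop.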

Key fleet env vars (see lib/fleet-client.sh):

| Var | Meaning |
| --- | --- |
| AQ_FLEET=1 | master switch: enable coordinator calls (0 = pure offline) |
| AQ_FLEET_ROUTE=1 | coordinator is authoritative for claim (pulls work from platform-service) |
| AQ_FLEET_PR=1 | PR mode: open a PR for jobs that target a repo |
| AQ_FLEET_API | base URL including /api (http://localhost:4003/api) |
| AQ_FLEET_TOKEN | Bearer JWT (mint per §5; TTL ≥ run duration, e.g. 24h) |
| AQ_PRODUCT_ID | product to poll, sent as X-Product-Id (must match the job's product) |
| AQ_FACTORY_ID | this factory's id (registered/heartbeated) |
| AQ_FLEET_CAPS | advertised capabilities, e.g. engine:devin |
| AQ_FLEET_LEASE_RENEW_SEC | set <90 (e.g. 60); heartbeat/renew cadence vs the 90s stale window (see warning) |
| AQ_FLEET_REPO_BASE | (optional) dir of local checkouts; if …/<repo>/.git exists it uses a git worktree, else it git clones https://github.com/<repo>.git into its cache |
| AQ_FLEET_AUTOSHIP=1 | (optional) auto-advance to shipped (skips the manual gate) |

The run loop claims a job (assigned → building), runs Devin in an isolated checkout, heartbeats and renews the lease (lease_renewed events) so the reaper doesn't reclaim it, then on agent exit moves the job to review and (in PR mode) opens the PR. With autoMerge:false it stops at the human merge gate.

Repo checkout: the job's repo is owner/name, so by default the factory git clones https://github.com/<owner>/<name>.git into its own cache (queue/.state/repos/…) — clean isolation, nothing touches your working copies. To reuse an existing local clone via a git worktree instead, set AQ_FLEET_REPO_BASE=<parent> where <parent>/<owner>/<name>/.git exists.
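
To predict which path a given job will take, the decision described above can be mirrored in a few lines. This is a sketch of the rule as stated here, not the factory's actual code; `checkout_mode` is a hypothetical name.

```shell
#!/bin/sh
# Mirror the decision described above: worktree when a local clone exists
# under the base dir, otherwise a fresh clone. A sketch, not the factory's code.
checkout_mode() {
  base=$1   # value of AQ_FLEET_REPO_BASE (may be empty)
  repo=$2   # owner/name from the job
  if [ -n "$base" ] && [ -e "$base/$repo/.git" ]; then
    echo "worktree"
  else
    echo "clone https://github.com/$repo.git"
  fi
}

# Usage:
#   checkout_mode "$AQ_FLEET_REPO_BASE" saravanakumardb1/learning_ai_notes
```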


8. Observe progress

  • Factory/agent logs (the live Devin transcript): use the helper scripts/fleet-logs.sh (auto-finds the agent-queue logs; takes a full or partial job id, defaults to the newest job):
    scripts/fleet-logs.sh ls                 # list jobs: slot + step count
    scripts/fleet-logs.sh status 3c0586ce    # steps count + slot + latest step
    scripts/fleet-logs.sh steps  3c0586ce 20 # last 20 transcript steps
    scripts/fleet-logs.sh watch  3c0586ce    # live-refresh the tail
    scripts/fleet-logs.sh tail   3c0586ce    # follow the runner lifecycle .log
    scripts/fleet-logs.sh full   3c0586ce    # all agent messages in your pager
    
    (Override the factory location with AQ=/path/to/agent-queue. Needs jq for the transcript commands.)
  • tracker-web: http://localhost:3003/dashboard/fleet/jobs/<jobId> — live event stream (SSE), runs, PR link + state. (Select the job's product in the UI first, or it shows "job does not exist" — every call is scoped by X-Product-Id.)
  • Events/API:
    curl -s http://localhost:4003/api/fleet/jobs/<jobId>/events \
      -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett"
    
  • Metrics: GET /api/fleet/metrics (JSON, per product) · GET /api/fleet/metrics/prom (Prometheus, all products; needs FLEET_METRICS_TOKEN) · Grafana Fleet Overview (http://localhost:3000/d/fleet-overview).
  • Autoscale signal: GET /api/fleet/autoscale (this product) / …/autoscale/all.

PR-state reconcile (externally merged PRs)

If you merge the PR in the GitHub UI, the coordinator doesn't know until told. Trigger a reconcile (flips run prState → merged when gh pr view reports MERGED):

  • UI: "Refresh PR status" button on the job's PR section, or
  • API: POST /api/fleet/jobs/<jobId>/pr/reconcile.

Requires gh where platform-service runs → use the host path (§3 Option A); it's a no-op in the Docker container (no gh).
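
The API route can be wrapped in a small function that reuses the token file and product header from the earlier sections. This is a sketch under assumptions: `fleet_reconcile` is a hypothetical helper, the product defaults to notelett as in the worked example, and the response shape is not documented here, so the output is printed as-is.

```shell
#!/bin/sh
# POST a PR-state reconcile for one job. Reuses /tmp/tok and the X-Product-Id
# header from the earlier sections; fleet_reconcile is a hypothetical helper.
# No body is sent, so no Content-Type header is set (see the troubleshooting
# row about JSON requests with an empty body).
fleet_reconcile() {
  job_id=$1
  product=${2:-notelett}
  curl -s -X POST "http://localhost:4003/api/fleet/jobs/$job_id/pr/reconcile" \
    -H "Authorization: Bearer $(cat /tmp/tok)" \
    -H "X-Product-Id: $product"
}

# Usage: fleet_reconcile <jobId>
```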


9. Safety & cost

  • Billable + autonomous + long-running. Each run consumes Devin credits and can run for a long time unattended. Scope jobs deliberately; very large multi-workstream specs are better split into several jobs.
  • Real PR. PR mode pushes a branch and opens a PR on the target repo. Keep autoMerge:false so a human reviews/merges; gh pr merge (auto) only fires when the job opts in or FLEET_SHIP_MERGES_PR=1.
  • Isolation. The factory works in an isolated worktree/clone, never your main checkout (per agent-queue/docs/RUN_POLICY.md). Avoid blanket --yolo on live trees.
  • Stopping the daemon mid-run lets the lease expire; the coordinator's reaper then reclaims and requeues the job (so partial work may be retried). Stop intentionally.
  • Tokens/secrets: the minted JWT and JWT_SECRET are sensitive — never commit them or paste into shared logs. .env is gitignored; keep it that way.

10. Teardown

# stop the factory: Ctrl-C the run loop
# host path: Ctrl-C the tsx (Terminal 1) and pnpm dev (Terminal 2)
# docker path:
#   cd ~/code/learning_ai_common_plat && docker compose down     # keep volumes
#   docker compose down -v                                       # also drop volumes
rm -f /tmp/tok /tmp/factok     # discard minted tokens

11. Troubleshooting

| Symptom | Cause → Fix |
| --- | --- |
| Cannot find module '@bytelyst/…/dist/index.js' on tsx/Next start | workspace packages not built → pnpm -r build (§2.2). |
| 401 {"error":"Invalid or expired token"} | JWT expired/mis-signed → re-mint (§5); ensure same JWT_SECRET as the running service. |
| Job claimed then flips back to queued mid-run; leaseEpoch keeps climbing; final report fenced; PR opens but job never reaches review/shipped | factory heartbeat cadence (AQ_FLEET_LEASE_RENEW_SEC, default 300s) > reaper stale window (90s) → set AQ_FLEET_LEASE_RENEW_SEC=60 (§7). To recover the record after the fact, reconcile PR state (§8). |
| Job stays queued, never claimed | No factory for that product → fleet-status shows it registered? AQ_PRODUCT_ID must equal the job's product. Check GET /api/fleet/factories (X-Product-Id) for 0 live. |
| POST …/pr/reconcile or ship auto-merge does nothing | gh not present where platform-service runs (Docker container) → run the host path (§3 Option A). |
| Prometheus target platform-service-fleet = down (401) | service missing FLEET_METRICS_TOKEN → §4 (restart host tsx / recreate container). |
| prototype-up.sh build fails on corepack prepare pnpm | dashboard image network issue → use the targeted subset, or just use the host path (Option A). |
| POST …/actions/<x> returns 500 "Body cannot be empty" | sent Content-Type: application/json with no body → omit the header or send {}. |
| Port 4003 conflict | host tsx watch and a platform-service container both bind 4003 → run only one. |
| gh pr create fails | repo is a bare name → must be owner/name or a clone URL; confirm gh auth status. |
| PR/cost attributed to wrong product | job submitted under the wrong productId partition → resubmit under the right product and cancel the stray (POST …/actions/cancel). |
| vitest exits non-zero with kill EPERM after all suites pass | worker-pool teardown artifact (sandbox), not a test failure → re-run; all suites already passed. |

12. Copy-paste quickstart: all localhost (notelett → learning_ai_notes)

Assumes §2.1/§2.2 done (.env filled, pnpm -r build run). Four terminals.

# Terminal 1 — coordinator
cd ~/code/learning_ai_common_plat/services/platform-service
pnpm exec tsx watch --env-file=../../.env src/server.ts

# Terminal 2 — dashboard
cd ~/code/learning_ai_common_plat/dashboards/tracker-web && pnpm dev   # :3003

# Terminal 3 — tokens + submit (save mint-token.mjs from §5; fix ABS paths)
node mint-token.mjs 15m > /tmp/tok
node mint-token.mjs 24h > /tmp/factok
curl -s -X POST http://localhost:4003/api/fleet/jobs \
  -H "Authorization: Bearer $(cat /tmp/tok)" -H "X-Product-Id: notelett" \
  -H 'Content-Type: application/json' \
  -d '{"idempotencyKey":"notelett-demo-1","bodyMd":"# Task…","priority":"high","engine":"devin","repo":"saravanakumardb1/learning_ai_notes","baseBranch":"main","autoMerge":false}'

# Terminal 4 — factory (runs Devin → opens a real PR). NOTE the <90s heartbeat cadence.
cd ~/code/learning_ai_devops_tools/agent-queue && ./agent-queue.sh init
AQ_FLEET=1 AQ_FLEET_ROUTE=1 AQ_FLEET_PR=1 AQ_FLEET_API=http://localhost:4003/api \
AQ_PRODUCT_ID=notelett AQ_FLEET_TOKEN="$(cat /tmp/factok)" \
AQ_FACTORY_ID=mac-local-1 AQ_FLEET_CAPS=engine:devin AQ_FLEET_LEASE_RENEW_SEC=60 \
  ./agent-queue.sh run --max 1

13. WSL on Windows — differences to note

The flow is identical inside a WSL2 (Ubuntu) shell, with these adjustments. Treat WSL as "the Linux host" — install and run everything inside WSL, not Windows.

  • Keep repos on the WSL filesystem, not /mnt/c. Clone under e.g. ~/code inside WSL. On /mnt/c (the Windows drive over 9p) tsx watch/Next file-watching is unreliable (inotify doesn't fire) and git/pnpm are far slower. This is the single most important difference.
  • Install the toolchain inside WSL (Linux builds): node/pnpm (nvm), git, gh, and the devin CLI, and run gh auth login + Devin auth inside WSL. A gh/devin installed on Windows is not visible to the WSL bash factory.
  • Line endings. Clone inside WSL (don't reuse a Windows checkout with core.autocrlf=true) so the *.sh scripts stay LF — CRLF breaks agent-queue.sh (bad interpreter/\r). If needed: git config --global core.autocrlf input.
  • Reaching the UI from the Windows browser. WSL2 forwards localhost, so http://localhost:3003 / :4003 usually work from a Windows browser. If they don't (older Windows / mirrored networking off), use the WSL IP (hostname -I) or set networkingMode=mirrored in .wslconfig.
  • Ports. Make sure nothing on the Windows side already binds 3003/4003 (WSL2 publishes to the same localhost). Stop the Windows process or change ports.
  • Docker (Option B), if used. Use Docker Desktop with the WSL2 backend and run docker compose from inside the WSL shell. host.docker.internal resolves from containers to the host as on Mac.
  • /tmp token paths (/tmp/tok, /tmp/factok) are the WSL /tmp — fine; just keep all four terminals in the same WSL distro so they share it.
  • Clock skew. If WSL's clock drifts after sleep, JWT iat/exp checks can fail (Invalid or expired token) — sudo hwclock -s (or restart WSL) to resync.

Everything else — env vars, pnpm -r build, tsx --env-file, the factory env incl. AQ_FLEET_LEASE_RENEW_SEC=60, token minting — is identical to the Mac host path.


Reference

  • Coordinator routes: services/platform-service/src/modules/fleet/routes.ts
  • Coordinator logic: services/platform-service/src/modules/fleet/coordinator.ts
  • Factory fleet client: learning_ai_devops_tools/agent-queue/lib/fleet-client.sh
  • Factory runner + PR mode: learning_ai_devops_tools/agent-queue/agent-queue.sh
  • Gigafactory spec/roadmap: learning_ai_devops_tools/agent-queue/docs/GIGAFACTORY/
  • Prometheus scrape config: services/monitoring/prometheus/prometheus.yml
  • Grafana dashboard: services/monitoring/grafana/dashboards/fleet-overview.json