245 Commits

Author SHA1 Message Date
Hermes VM
1e64d75fd4 feat(dashboard): Phase 5 P2 — structured pino logging with redaction
First half of Phase 5 P2 (the "structured backend logging" piece;
E2E-in-CI lands separately so the diff stays reviewable).

Adds `lib/logger.ts` exporting a singleton pino instance shared between
Fastify (via `loggerInstance`) and any non-request code path. One
configured logger across the backend means uniform formatting,
redaction, and log-level control:

  - LOG_LEVEL env knob (defaults to info when NODE_ENV=production, debug
    otherwise). Documented in `.env.example`.
  - Built-in redaction for Authorization / Cookie headers and the
    common secret-shaped field names (password, token, refreshToken,
    accessToken, csrfToken, JWT_SECRET, CSRF_SECRET, ENCRYPTION_KEY,
    COSMOS_KEY, AZURE_CLIENT_SECRET) so an accidental
    `req.log.info(req.body)` or `logger.error({ err, config }, …)`
    won't dump credentials. This is a backstop, not the primary
    defense — call sites should still avoid logging raw config/req.
  - JSON to stdout in every environment. Pipe through `pino-pretty`
    locally if you want pretty output; we deliberately don't bundle
    pino-pretty as a runtime dep.
  - `childLogger(module)` helper tags log lines with their origin so
    repositories/background workers don't have to repeat the module
    name on every line.
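
A minimal sketch of how the level default and the redaction list above could be assembled. The option shape (`level`, `redact.paths`, `redact.censor`) follows pino's documented options; the helper names `resolveLogLevel` and `buildLoggerOptions`, the exact path strings, and the censor text are assumptions and are not taken from `lib/logger.ts`.

```typescript
// Secret-shaped field names listed in the commit message.
const SECRET_FIELDS = [
  "password", "token", "refreshToken", "accessToken", "csrfToken",
  "JWT_SECRET", "CSRF_SECRET", "ENCRYPTION_KEY", "COSMOS_KEY",
  "AZURE_CLIENT_SECRET",
];

interface LoggerEnv {
  LOG_LEVEL?: string;
  NODE_ENV?: string;
}

// An explicit LOG_LEVEL wins; otherwise info in production, debug elsewhere.
export function resolveLogLevel(env: LoggerEnv): string {
  if (env.LOG_LEVEL) return env.LOG_LEVEL;
  return env.NODE_ENV === "production" ? "info" : "debug";
}

// Builds an options object that could be passed to pino(...).
// Hypothetical helper: the real module may structure this differently.
export function buildLoggerOptions(env: LoggerEnv) {
  const paths = ["req.headers.authorization", "req.headers.cookie"];
  for (const field of SECRET_FIELDS) {
    // Cover the field at the top level and one level down (e.g. config.password).
    paths.push(field, `*.${field}`);
  }
  return {
    level: resolveLogLevel(env),
    redact: { paths, censor: "[REDACTED]" },
  };
}
```

Passing the environment in as an argument, rather than reading it inside the helper, keeps the defaulting rule easy to unit-test.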

Sweeps the runtime `console.error` sites that lose request context
(deployment orchestrator background fire-and-forget, system docker
stats/cleanup, backup CRUD, vm getAllContainers) onto the structured
logger. CLI-only modules (`scripts/run-migrations.ts`,
`migrations/index.ts`, `cosmos-init.ts` startup, `azure-keyvault.ts`,
`config.ts` env warnings, `lib/migrations.ts` no-op message) keep
`console.*` for now — they run before Fastify is up and are queued for
a separate cleanup pass.

Tests, typecheck, lint (0 errors), build green. Coverage gate still
passing (≥95% lines on every gated file).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 07:18:44 +00:00
Hermes VM
c6ec1a06ea docs(dashboard): Phase 5 P1 — document privilege surface; gate /code-quality/check
Closes the final Phase 5 P1 checkbox and REVIEW_ACTIONS #6.

The backend container has root-equivalent host access via the docker
socket, host log mounts, and the VM scripts mount, but until now the
"who can do what to the host?" answer was scattered across compose
files and route handlers. This commit centralizes it.

DEPLOYMENT.md gains a "Privilege Surface" section that lists:

  - every host mount + container path + mode + purpose
  - every shell-outing route, the actual commands it runs, and the
    auth gate on each
  - what an admin token can do today (≈ host shell)
  - five known sharp edges (un-allow-listed container names, unvalidated
    projectPath, no per-route audit-log on shell-outs, container runs
    as root, global rate-limit only)
  - a P1 → P3 mitigation roadmap (allow-list wrapper around shell-outs,
    projectPath validation, audit-logging shell-outs, drop root in
    container, replace docker.sock with a verb-restricted proxy)

Concurrent code fix: `POST /code-quality/check` was reachable
**unauthenticated** despite shelling out to `npm run typecheck/lint/
build/test:run` in a caller-supplied `projectPath`. Added
`preHandler: requireAdmin` to bring it in line with every other
shell-outing route in the dashboard. Same commit because the
documentation table promises this gate exists.

REVIEW_ACTIONS #6 marked RESOLVED with the rationale; roadmap checkbox
ticked. Tests, typecheck, lint (0 errors), build, and coverage gate
(≥95% lines on every gated file) all stay green.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 07:05:51 +00:00
Hermes VM
824f31586a docs(dashboard): Phase 5 P1 — fix port/endpoint drift, dedupe deployment docs
Closes the Phase 5 P1 doc-drift checkbox and REVIEW_ACTIONS #5.

The 3000-vs-3049 confusion came from prose claims in three docs that
each picked a different "right" answer. The truth is: the web container
listens on :3000; docker-compose maps `127.0.0.1:3049:3000`; production
is fronted by Traefik on `https://devops.bytelyst.com`. Encoding that
explicitly so future readers don't have to dig through compose files:

  - DEPLOYMENT.md becomes canonical. Its content is now the (more
    accurate) old DEPLOYMENT_GUIDE.md merged with a "Ports — quick
    reference" table covering Local dev / Docker Compose / Production
    Traefik, plus a Local-development section for `pnpm dev`.
  - DEPLOYMENT_GUIDE.md → 5-line redirect stub pointing at
    DEPLOYMENT.md (kept for `deploy.sh` and any external links).
  - deploy.sh updated to point at DEPLOYMENT.md.
  - README.md "Web port: 3000" line rewritten to spell out container
    vs Compose-host vs dev-mode and link to the port table.
  - ENDPOINTS.md gets a top-of-file note: every `localhost:3000` URL
    in that file is the `pnpm dev` workflow; substitute `:3049` for
    the Dockerized stack.
  - REVIEW_ACTIONS.md #5 marked RESOLVED with the rationale.

No code, behavior, lint, or test changes.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 07:03:05 +00:00
Hermes VM
3fc471e880 chore(dashboard): Phase 5 P1 — remove dead SSE log-stream claim
Closes the long-standing SSE TODO. The previous attempt with
`fastify-sse-v2 ^4` was incompatible with Fastify 5 and was never wired
in; the README/DEPLOYMENT.md kept advertising "real-time log streaming"
that didn't exist. The web client never used EventSource — `web/src/
lib/api.ts` already polls `/deployments/:id/logs` via the normal
`apiRequest` helper.

Resolution: remove the claim rather than ship the feature.

  - drop `fastify-sse-v2` dep from `backend/package.json` + lockfile
  - delete the commented-out plugin import + register in `server.ts`,
    replace with a NOTE explaining the JSON-polling decision and how
    to add a stream later (`reply.raw`)
  - remove the `TODO: Re-enable SSE` comment in `deployments/routes.ts`;
    the endpoint already returns JSON, document that explicitly
  - rewrite the README "Deployment Log Streaming" section as
    "Deployment Logs" (JSON-polled, no SSE); fix the endpoint table
  - flip the DEPLOYMENT.md bullet from "Real-time log streaming (SSE)"
    to "Deployment log retrieval (JSON polling — no SSE)"
  - mark REVIEW_ACTIONS #4 RESOLVED with the reasoning
  - tick the roadmap checkbox

If a real-time stream is wanted later, ship it explicitly via
`reply.raw` and update README/DEPLOYMENT.md/the route comment in the
same change. Don't reintroduce a half-disabled plugin.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 07:00:07 +00:00
Hermes VM
18180aab78 test(dashboard): Phase 5 P1 — auth/csrf/health/orchestrator tests + coverage gate
Closes the Phase 5 P1 testing checkbox. Adds 35 new unit tests across the
modules called out in the roadmap and wires a v8 coverage gate into CI.

Coverage of newly-tested files (lines / branches):
  lib/auth.ts                          94.4% / 100%
  lib/csrf.ts                          95.1% /  90%
  modules/health/repository.ts          100% /  92%
  modules/deployments/orchestrator.ts  95.2% /  74%
  modules/services/repository.ts        100% / 100%
  modules/hermes-ops/repository.ts     95.2% /  68%

Threshold (lines/funcs/stmts ≥85%, branches ≥65%) is scoped to those six
files via `coverage.include` so untested legacy modules (vm, system,
audit, route handlers) report but don't gate. Add files there as they
gain real tests — ratchet up, never relax.

Test approach mirrors the existing services/hermes-ops suites: hoisted
mocks for I/O (fetch, child_process, fs/promises, cosmos-init), real
JOSE-signed JWTs for the auth path, fake timers for cache TTL and CSRF
expiry assertions.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 06:56:16 +00:00
saravanakumardb1
5c0ae020c0 docs(agent-queue): draft P2 prompts — factory enrollment+tokens (§12) + feature flags/shadow-dualrun
2026-05-29 23:52:14 -07:00
Hermes VM
cf5428acd1 feat(dashboard): Phase 1 — harden hermes-ops backend + tests
- Short-TTL (30s) snapshot cache + in-flight coalescing so the panel poll and
  concurrent refreshes don't fan out ~20 systemctl/git/ps/du subprocesses each
  time; snapshot carries a `cached` flag and `getHermesOpsSnapshot({force})`.
- Distinguish "unit inactive" (down) from "probe couldn't run" (unknown): a new
  exec() wrapper reports whether the command actually ran (ENOENT/timeout =
  unknown) vs exited non-zero with output (e.g. systemctl is-active -> inactive).
  Per-field ProbeStatus on gateway/dashboard/timer/repo; warnings differentiate
  "is not active" from "status could not be determined".
- Robust Bheem/Uma checks: `runuser -u uma -- systemctl --user is-active/
  is-enabled` with a ps / existsSync fallback so a failed probe degrades to the
  legacy check instead of a false "down".
- Zod schema (HermesOpsSnapshotSchema) as the stable typed contract; the route
  validates output before sending. New status fields are additive (active/
  enabled/url/etc. preserved) so the existing web client is unaffected.
- Unit tests (mock execFile/fs): healthy snapshot, down vs unknown mapping,
  runuser->ps fallback, unreadable repo, cache hit + force bypass, request
  coalescing. Backend: 16 tests green.
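
The short-TTL cache with in-flight coalescing from the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the code in the hermes-ops repository: `createSnapshotCache`, the injected `now` clock, and the choice to let a forced call join an already running probe are all hypothetical.

```typescript
interface Snapshot<T> {
  data: T;
  cached: boolean; // true when served from the TTL cache
}

export function createSnapshotCache<T>(
  probe: () => Promise<T>,       // the expensive work (e.g. many subprocess calls)
  ttlMs: number = 30000,         // 30s TTL, as stated in the commit message
  now: () => number = Date.now,  // injectable clock for tests (assumption)
) {
  let value: { data: T; at: number } | undefined;
  let inFlight: Promise<Snapshot<T>> | undefined;

  return function getSnapshot(opts: { force?: boolean } = {}): Promise<Snapshot<T>> {
    // Fresh cache entry and no force flag: answer without probing.
    if (!opts.force && value && now() - value.at < ttlMs) {
      return Promise.resolve({ data: value.data, cached: true });
    }
    // Coalesce: concurrent callers share the single running probe.
    if (inFlight) return inFlight;

    inFlight = probe()
      .then((data) => {
        value = { data, at: now() };
        return { data, cached: false };
      })
      .finally(() => {
        inFlight = undefined; // allow the next refresh after this one settles
      });
    return inFlight;
  };
}
```

Because `getSnapshot` is not declared `async`, concurrent callers receive the same promise object, which makes the coalescing behaviour directly observable in a unit test.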

Roadmap: check off Phase 1 items and Phase 5 P0 in hermes_dashboard_v2_roadmap.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:50:32 +00:00
Hermes VM
3ee4e7104e fix(dashboard): Phase 5 P0 — correct CI workspace path + real ESLint
- ci.yml: actions/checkout into the runner workspace instead of cd-ing into a
  hard-coded host path and `git reset --hard origin/main` on the live checkout;
  install via `pnpm install:gitea` (self-contained, no sibling common-plat
  checkout); E2E step left as a TODO pointer (ci-e2e-hardening, Phase 5 P2).
- Fix the same stale /opt/bytelyst/bytelyst-devops-tools path in deploy.sh,
  scripts/deploy-hotcopy.sh, DEPLOYMENT.md, DEPLOYMENT_GUIDE.md.
- Replace the no-op `lint` echoes with real ESLint 9 flat configs (js +
  typescript-eslint recommended) for backend and web; add a root `pnpm lint`.
- Fix the 10 errors lint surfaced, incl. require('os') in an ESM backend
  (system/repository.ts -> import * as os), prefer-const x4, and a ternary
  expression-statement in web vm/page.tsx.

Verified locally: secret-scan, lint (0 errors; correctly fails on bad code),
typecheck, unit tests (backend 9 / web 11), and build all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:50:32 +00:00
saravanakumardb1
51e4c5f271 Merge PR #5: Phase 2 Slice 3 — factory-agent integration
Wires agent-queue.sh to the fleet coordinator behind AQ_FLEET=1 (offline path
unchanged when off). Fencing-aware (stale leaseEpoch -> self-abort + quarantine),
offline-degrade, tracker echo via fleet_events. selftest 60/60 (53 prior + 7 new);
token env-sourced; no bodyMd/token leak (asserted).
2026-05-29 22:50:56 -07:00
saravanakumardb1
21ebf8b1b7 docs(agent-queue): fleet integration section + roadmap P2-S3 ticks
README: "Fleet integration (Phase 2)" — AQ_FLEET flag, env table, claim/heartbeat/
report/fence/renew protocol, offline-degrade + quarantine, offline-vs-fleet explainer.
Roadmap: tick the Phase-2 §14 factory-agent item, add a P2-S3 slice note, bump §0
Phase 2 -> 55%.
2026-05-29 22:45:44 -07:00
saravanakumardb1
064dbf3d8f test(agent-queue): fleet integration selftest cases (P2-S3)
Adds 7 stub-driven fleet cases (AQ_FLEET_API_CMD stub, no live coordinator); never
weakens the prior 53 (full suite now 60 green):
- flag OFF (default): zero coordinator calls; offline job completes unchanged
- register(heartbeat)+claim -> coordinator job materialized + executed to review/
- report+checkpoint: PATCH carries stage+leaseEpoch (+ wipBranch on building)
- FENCING: stale-epoch 409 -> self-abort + quarantine (never shipped)
- lease renew (unit): POST .../lease/renew with current leaseEpoch
- offline-degrade: coordinator 5xx -> job completes locally (degraded), not quarantined
- no-leak: bodyMd/token never appear in report payloads
2026-05-29 22:45:44 -07:00
saravanakumardb1
1d84712b47 feat(agent-queue): wire runner to fleet coordinator at minimal hook points (P2-S3)
Sources lib/fleet-client.sh and adds a few fleet_enabled-gated hooks so the offline
git-queue path is byte-for-byte unchanged when AQ_FLEET is unset/0:

- cmd_run: register at loop start; per-iteration heartbeat (cadence) + lease renew
  for in-flight fleet jobs + claim one coordinator job into inbox when capacity.
- meta: persist fleet_job_id + fleet_lease_epoch (from claim frontmatter).
- run_worker: report `building` (with WIP checkpoint) after WIP setup and `review`
  before accepting the agent's output — a FENCED (stale-epoch/409) report self-aborts
  and quarantines (never ships); 5xx/unreachable degrades (finish locally).
- _auto_echo: for fleet jobs route the outcome echo through the coordinator
  (fleet_events) instead of the direct tracker echo; offline jobs unchanged.
- cmd_ship: fence-check before shipping a fleet job; release lease after.
- status: show factory id + per-job fleet=<id>@e<epoch>; insights lists fleet_* fields.
- dispatch + help: `fleet-status` command + a FLEET env section.
2026-05-29 22:45:44 -07:00
saravanakumardb1
a10d4003e6 feat(agent-queue): fleet coordinator client library (lib/fleet-client.sh, P2-S3)
New sourced library implementing the factory side of the Phase-2 `fleet`
coordinator contract — curl-only + POSIX awk, reusing the Slice-4 HTTP/JSON
helper patterns, no new deps. Every function is a no-op unless AQ_FLEET=1.

- fleet_enabled / fleet_api (AQ_FLEET_API_CMD test seam) / _fleet_call
- fleet_detect_caps (reuses detect_capabilities) -> JSON caps array
- fleet_heartbeat (+ _maybe cadence): registration == first heartbeat
- fleet_claim: POST /fleet/claim, parse job id/bodyMd/leaseEpoch, materialize a
  transient local .md (fleet-job-id + fleet-lease-epoch in frontmatter)
- fleet_report: PATCH fenced stage transition {stage, leaseEpoch, checkpoint?};
  returns ok / FENCED(2, stale epoch -> self-abort) / degraded(1, unreachable)
- fleet_lease_renew / fleet_lease_release / fleet_renew_active (fenced)
- fleet_quarantine: park a reclaimed (fenced) job in failed/ for human triage
- cmd_fleet_status: register + print factory identity/caps
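
As a rough illustration of the three `fleet_report` outcomes listed above, the mapping from a coordinator response to the runner's next step might look like the sketch below. The real client is a bash library; this TypeScript version only mirrors the decision table. The names are hypothetical, exit code 0 for `ok` is an assumption (the message only gives 2 for FENCED and 1 for degraded), and treating every other non-2xx status as degraded is also an assumption.

```typescript
type ReportOutcome = "ok" | "fenced" | "degraded";

// httpStatus is undefined when the coordinator could not be reached at all.
export function classifyReport(httpStatus: number | undefined): ReportOutcome {
  if (httpStatus === undefined) return "degraded";
  if (httpStatus >= 200 && httpStatus < 300) return "ok";
  if (httpStatus === 409) return "fenced"; // stale leaseEpoch: another factory owns the job
  return "degraded";                       // assumption: handled like 5xx/unreachable
}

// Return codes as described in the commit message (0 for ok is assumed).
export const REPORT_EXIT_CODE: Record<ReportOutcome, number> = {
  ok: 0,
  fenced: 2,
  degraded: 1,
};

// What the runner does next for each outcome.
export function nextAction(outcome: ReportOutcome): string {
  switch (outcome) {
    case "ok":
      return "continue";
    case "fenced":
      return "self-abort and quarantine"; // never ship a fenced job
    case "degraded":
      return "finish locally";            // the coordinator is not required to complete
  }
}
```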

Report payloads carry only stage/epoch/checkpoint — never prompt/bodyMd/token.
2026-05-29 22:45:44 -07:00
saravanakumardb1
10395983e7 docs(agent-queue): draft parallel P2 prompts — scheduler/router core (§7) + fleet artifacts blob wiring (§13)
2026-05-29 22:32:41 -07:00
Hermes VM
a8dd166108 docs: add Hermes dashboard v2 roadmap + CI/E2E delegation brief
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
13a105ba23 feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert
vm-health-check.sh:
- check_gpu(): nvidia-smi probe; "CPU-only" OK on this VM (no GPU)
- check_image_freshness(): flag containers running images >30d old.
  Skips third-party images (gitea, grafana, prom, mcr.microsoft, axllent,
  caddy, traefik, valkey, cadvisor) — they have their own rebuild cadence.
  Currently flags 19 stale product images (~60d old).
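
The freshness rule described above (skip third-party images, flag the rest once the image is more than 30 days old) could be expressed roughly like this. The actual check lives in a shell script; this sketch only restates the rule. The `ContainerImage` shape and the substring matching against the listed names are assumptions.

```typescript
// Names taken from the skip list in the commit message.
const THIRD_PARTY_MARKERS = [
  "gitea", "grafana", "prom", "mcr.microsoft", "axllent",
  "caddy", "traefik", "valkey", "cadvisor",
];

const MAX_AGE_DAYS = 30;
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical record; the script reads this information from docker.
interface ContainerImage {
  name: string;       // container name
  image: string;      // image reference, e.g. "registry/app:tag"
  imageCreated: Date; // image build time
}

// Assumption: a simple substring match against the image reference.
export function isThirdParty(image: string): boolean {
  return THIRD_PARTY_MARKERS.some((marker) => image.includes(marker));
}

// Returns the names of containers whose first-party image is older than 30 days.
export function staleContainers(containers: ContainerImage[], now: Date): string[] {
  return containers
    .filter((c) => !isThirdParty(c.image))
    .filter((c) => now.getTime() - c.imageCreated.getTime() > MAX_AGE_DAYS * DAY_MS)
    .map((c) => c.name);
}
```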

chaos-validation.sh:
- Monthly chaos test: kill PID 1 in chronomind-web, wait up to 35 min
  for docker-health-watchdog to detect + restart. Telegram pass/fail.
- Refuses to run if target not healthy. systemd timer fires 1st of month
  at 10:00 UTC (after 08:00 weekly digest).

vm-io-anomaly-check.sh:
- 6h avg sda write rate; transition alerts at WARN (1 GB/hr) /
  CRIT (2.5 GB/hr). De-dupes via /var/log/vm-io-anomaly-state so the
  alert fires once per transition, not every 6h. Current baseline:
  ~1.94 GB/hr (orphan-container state-file writes; see Phase 0.3).
- Reports recovery to OK when rate drops back.
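
The transition-only alerting described above can be sketched as a small state machine. The thresholds come from the commit message; the function names, the use of `>=` at the boundaries, and the message wording are assumptions. In the script the previous level is read from `/var/log/vm-io-anomaly-state`; here it is passed in as an argument.

```typescript
type Level = "OK" | "WARN" | "CRIT";

const WARN_GB_PER_HR = 1;
const CRIT_GB_PER_HR = 2.5;

export function classifyRate(gbPerHr: number): Level {
  if (gbPerHr >= CRIT_GB_PER_HR) return "CRIT";
  if (gbPerHr >= WARN_GB_PER_HR) return "WARN";
  return "OK";
}

// Compares the new level with the stored one and alerts only on a change,
// so a steady WARN produces one message rather than one per 6h run.
export function nextAlert(
  previous: Level,
  gbPerHr: number,
): { level: Level; alert: string | null } {
  const level = classifyRate(gbPerHr);
  if (level === previous) return { level, alert: null };
  const rate = gbPerHr.toFixed(2);
  const alert =
    level === "OK"
      ? `recovered to OK (${rate} GB/hr)`
      : `${level}: sda write rate ${rate} GB/hr`;
  return { level, alert };
}
```

With the stated baseline of about 1.94 GB/hr, the first run after an OK state would report WARN once, and later runs at the same rate would stay silent.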

vm/page.tsx: gpu + image_freshness added to CHECK_META so they render
with proper icon/label and slot into CHECK_ORDER.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
76ef17f26b feat(vm): Phase 2.3 closure — OOM watchdog + orphan-container docs
OOM watchdog:
- vm-oom-watchdog.sh — scans journalctl -k since cursor for oom-kill,
  killed-process, and "out of memory ... killed" entries; maps cgroup
  hits back to container names via docker inspect; posts a single
  Telegram alert per scan window (no dedupe needed — cursor advances
  on every run). Cursor at /var/log/vm-oom-cursor, log at
  /var/log/vm-oom-watchdog.log.
- Systemd: OnBootSec=10min, OnUnitActiveSec=1h, Persistent=true.

Orphan containers (no compose file on disk):
- trading-backend → docker update --memory=768m (high-I/O bot)
- gitea-npm-registry → docker update --memory=512m
- orphan-containers.md captures canonical configs for recovery
  (env, mounts, networks, restart policy, memory limits).

Closes Phase 2.3 (post-monitoring) and Phase 3.3 (orphan limits).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
253e888a24 feat(infra): Phase 2.3 — memory limits across all active Docker stacks
Apply deploy.resources.limits.memory to 45 services across 5 compose files.
Limits take effect on next docker compose up (no running containers affected).

Limits derived from 2-day Prometheus RSS baseline (avg over 2026-05-27 to 2026-05-29):

  common_plat ecosystem (37 services):
    cosmos-emulator: 1g   (319 MiB baseline, can spike on writes)
    loki:           384m  (75 MiB)
    prometheus:     384m  (91 MiB, grows with series cardinality)
    node-exporter:  128m  (21 MiB, very stable)
    cadvisor:       256m  (38 MiB)
    valkey:         128m  (tiny)
    caddy:          256m  (35 MiB)
    platform-service: 512m (61 MiB)
    extraction-service: 512m (99 MiB, Python sidecar)
    mcp-server:     384m  (21 MiB)
    product backends: 512m (30-65 MiB each)
    product webs:   512m  (35-93 MiB each)
    llmlab-dashboard: 512m (Ollama proxy, larger cache budget)

  dashboard (2 services): backend 512m, web 512m
  invttrdg (2 services): backend 768m (159 MiB + heavy state writes),
                          web 256m (nginx SPA)
  clock/chronomind (2 services): backend 512m, web 512m
  notes/notelett (2 services): backend 512m, web 512m

Ollama host process has NO limit (model load unpredictable, up to 8 GB).
trading-backend compose file not on disk — limit not applied.
gitea-npm-registry started manually — limit not applied.

Monitor OOMKill for 48h after next stack restart:
  dmesg | grep -i oom

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
42c3b9cdd5 feat(dashboard/vm): Phase 3.3 — All Containers panel with CPU/RAM, logs, bulk restart
- repository.ts: getAllContainers() — batch docker inspect + docker stats
  --no-stream merged by container name; returns state, health, uptime,
  CPU%, RAM, memLimitMiB (0=no limit), restart count, stack from compose
  label; getContainerLogs() — docker logs --tail --timestamps
- routes.ts: GET /api/vm/containers (all, with stats; ~3s for 38
  containers), GET /api/vm/containers/:name/logs?lines=N
- api.ts: ContainerInfo interface; vmApi.getAllContainers(),
  vmApi.getContainerLogs()
- vm/page.tsx: ContainersPanel — collapsible (lazy-loads on first open);
  filter chips (All/Running/Unhealthy/No Limit) + stack dropdown;
  per-row log viewer (inline pre, dark bg, 50-line tail); per-row
  restart button; bulk "Restart N unhealthy" with confirmation modal;
  Fragment key pattern for row+log-row pairs

I/O anomaly (Phase 0.3) root cause identified: invttrdg-backend and
trading-backend write bot_state.json + .bak on every market tick
(5×/min and 2×/min respectively) into container overlay layer →
~6 GB/day — intentional bot behaviour, no fix needed, trend chart
already in place to monitor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
Hermes VM
8d32cb7980 feat(dashboard/vm): Phases 4.1-4.3 — Prometheus trends, sparklines, weekly digest
- prometheus.ts: new Prometheus client with 7d/30d range queries for disk,
  memory, swap, CPU steal, and disk I/O (GB/hr); getWeeklyDigestData()
  aggregates all metrics for digest and API endpoint
- routes.ts: GET /api/vm/metrics/trend?metric=…&range=… and
  GET /api/vm/weekly-digest endpoints
- api.ts: TrendPoint/TrendSeries types; getTrend() and getMemoryTrend()
  added to vmApi
- vm/page.tsx: Sparkline (pure SVG polyline+fill), TrendCard with
  latest/avg/peak and threshold colouring, TrendsPanel with lazy load
  on first open; Promise.allSettled() isolation for all 5 data panels
- vm-weekly-digest.sh: weekly Telegram digest via docker exec into
  devops-backend to reach Prometheus; emoji severity indicators; cron
  summary from /var/log/vm-cleanup.log
- systemd timer: Mon 08:00 UTC, Persistent=true (fires on next boot
  if missed); first trigger 2026-06-02
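
For the "pure SVG polyline" sparkline mentioned above, the core step is turning a numeric series into the `points` attribute of a `<polyline>`. The sketch below shows one way to do that; `sparklinePoints` and its scaling choices (min-to-max normalisation, one decimal place) are assumptions and not the component's actual code.

```typescript
// Maps a series onto a width x height box and returns an SVG polyline
// `points` string such as "0.0,20.0 50.0,10.0 100.0,0.0".
export function sparklinePoints(values: number[], width: number, height: number): string {
  if (values.length === 0) return "";
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min || 1; // flat series: avoid dividing by zero
  const stepX = values.length > 1 ? width / (values.length - 1) : 0;
  return values
    .map((v, i) => {
      const x = i * stepX;
      // SVG y grows downward, so the largest value maps to y = 0.
      const y = height - ((v - min) / span) * height;
      return `${x.toFixed(1)},${y.toFixed(1)}`;
    })
    .join(" ");
}
```

The returned string can be dropped into `<polyline points="..." />`; a filled area would reuse the same points with two extra corners on the baseline.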

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00
saravanakumardb1
9a073ef225 docs(agent-queue): draft P2-S3 factory-agent integration prompt (claim/heartbeat/report/fence behind AQ_FLEET)
2026-05-29 22:03:12 -07:00
saravanakumardb1
7d6275f935 Merge origin/main (CI + installers) into Slice 4 merge
2026-05-29 21:59:57 -07:00
saravanakumardb1
2ad5c6dee5 Merge PR #4: Phase 1 Slice 4 — tracker adapter (task <-> job round-trip)
Closes Phase 1. selftest 53/53 (46 regression + 7 new); one-way echo sends
metrics only (no prompt/secrets — asserted); token env-sourced; Items API
contract (GET/PATCH status/POST comments, bearer + X-Product-Id).
2026-05-29 21:58:24 -07:00
saravanakumardb1
8ae504ca30 docs(agent-queue): tracker integration + close Phase 1 §10/§14 adapter (P1-S4)
README: Tracker integration section (from-tracker/to-tracker, env config,
label->manifest table, one-way-echo rule, AQ_TRACKER_AUTO, real-use note).
Roadmap: tick §10 Phase-1 adapter items + the §14 tracker-adapter item; add
P1-S4 slice note; §0 Phase 1 -> 95% (remaining: budget.wall + Node dash
surfacing).
2026-05-29 21:35:16 -07:00
saravanakumardb1
1e0a17bbc0 test(agent-queue): tracker adapter selftest cases (P1-S4)
Adds (never weakens) 7 stub-driven cases (AQ_TRACKER_API_CMD stub, no live service): from-tracker create + label mapping + idempotent; to-tracker shipped echo (PATCH done + metrics comment, asserts NO prompt body sent) + idempotent; HTTP 500 non-fatal; AQ_TRACKER_AUTO auto-echo on run. Full suite green (53 checks).
2026-05-29 21:35:16 -07:00
saravanakumardb1
b7a9ea1b7a feat(agent-queue): tracker adapter — task <-> job round-trip (P1-S4)
Implements §10 single-host tracker integration, closing the last Phase-1 §14 item:

- tracker_api: one curl-only HTTP wrapper (base URL + bearer + productId header),
  overridable via AQ_TRACKER_API_CMD so tests need no live service. Emits the
  response body + a trailing HTTP-code line; _api_call splits into API_BODY/API_CODE.
- aq from-tracker <ITEM_ID>: GET the Item, map title/description -> job body,
  labels (engine-class:/profile:/priority:/cap:) + Item priority -> frontmatter,
  and stamp tracker-item + a stable idempotency-key tracker-<id>. Materializes a
  .md into inbox/ via cmd_add; idempotent (Slice 1 dedupe) so a re-pull never dups.
  JSON parsed with POSIX awk (no jq) — mac + linux safe.
- aq to-tracker <job>: one-way echo (child -> tracker, §24.5). PATCHes the Item
  status (building/review/testing->in_progress, shipped->done, failures->wont_fix,
  all overridable) and posts a metrics-only comment (result/attempts/duration/
  tokens/cost/diff — NEVER prompt content or secrets). Idempotent via meta
  tracker_echoed; an echo failure (e.g. HTTP 500) is logged and non-fatal — the
  tracker is downstream, never authoritative for execution.
- Opt-in auto-echo (AQ_TRACKER_AUTO=1, default OFF): the worker echoes on each
  transition (building via cmd_run, review/testing/failed via run_worker, shipped
  via ship/promote); never blocks or fails a job.
- status + insights surface tracker-item and the last echoed status.
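
The one-way echo above has two parts that are easy to show in isolation: the stage-to-status mapping and the metrics-only comment. The sketch below is an illustration, not the bash implementation. The stage name `failed`, the override mechanism, and the `key=value` comment format are assumptions; the status values and the list of metric fields come from the commit message.

```typescript
// Default mapping from job stage to tracker Item status.
const DEFAULT_STATUS_MAP: Record<string, string> = {
  building: "in_progress",
  review: "in_progress",
  testing: "in_progress",
  shipped: "done",
  failed: "wont_fix",
};

// Overrides win over defaults ("all overridable").
export function trackerStatusFor(
  stage: string,
  overrides: Record<string, string> = {},
): string | undefined {
  const map: Record<string, string> = { ...DEFAULT_STATUS_MAP, ...overrides };
  return map[stage];
}

// Only these fields may leave the host; prompt content and secrets never do.
const METRIC_FIELDS = ["result", "attempts", "duration", "tokens", "cost", "diff"] as const;

// Builds the comment from an allow-list, so unexpected keys are dropped
// instead of being filtered by a deny-list that could miss something.
export function metricsComment(meta: Record<string, unknown>): string {
  return METRIC_FIELDS
    .filter((key) => meta[key] !== undefined)
    .map((key) => `${key}=${String(meta[key])}`)
    .join(" ");
}
```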

curl-only HTTP; no new runtime deps; conventional + backward-compatible.
2026-05-29 21:35:06 -07:00
Saravanakumar D
5a278ad119 ci: add GitHub Actions CI (shellcheck, syntax, preview)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:31:00 -07:00
Saravanakumar D
efe0da3169 chore(devops): add cross-platform runners and README; normalize EOLs
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:26:47 -07:00
Saravanakumar D
b6dc0768e3 chore(devops): finalize CLI install report and helper
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:20:52 -07:00
Saravanakumar D
9e28d85a64 chore(devops): update CLI install report and add symlink helper
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-29 21:20:52 -07:00
saravanakumardb1
d0348f23de docs(agent-queue): P0 atomic-claim resolved (PR #29) — tick §4/§13/§14 fleet items
2026-05-29 21:05:38 -07:00
saravanakumardb1
2e9bd4dd1e docs(agent-queue): record P2 Foundation merged + track P0 atomic-claim hardening (§4)
- §4: implementation-status note — fleet module merged (PR #28); atomic claim NOT
  yet concurrency-safe (rev-CAS over unconditional write, sequential-only test)
- add phase2-atomic-claim-hardening.md: updateIfMatch in @bytelyst/datastore
  (Cosmos If-Match + process-atomic memory) + concurrent claim tests
2026-05-29 20:43:28 -07:00
saravanakumardb1
0e94705ab7 docs(agent-queue): draft Phase 2 Foundation long-run prompt (fleet module + coordinator: claim/lease/fencing/reaper)
2026-05-29 19:54:33 -07:00
saravanakumardb1
d43cab8afe Merge PR: Phase 1 Slice 2 — profiles (persona/caps/scope inheritance) + deps/DAG (blocked, cycle detection)
Reviewed against §5/§6 single-host scope; selftest 46/46 (34 regression + 12 new);
profile resolution precedence, persona injection, warn-only scope, deps soft/cycle verified.
2026-05-29 19:49:28 -07:00
saravanakumardb1
e183919c60 docs(agent-queue): profiles + deps docs; tick §5/§6 + bump Phase 1 to 80% (P1-S2)
README: Profiles & deps section (resolution precedence, persona, allowed-scope
warn-only, deps/blocked + cycle detection); manifest table moves
profile/deps/deps-mode to active.
Roadmap: tick §6 catalog/persona/inheritance/allowed-scope and §5 deps + the
§14 profile/deps/scope boxes; add P1-S2 slice note; §0 Phase 1 -> 80%.
2026-05-29 19:26:33 -07:00
saravanakumardb1
71d8a7cd4e test(agent-queue): profiles + deps/DAG selftest cases (P1-S2)
Adds (never weakens) temp-catalog + temp-git cases: profile verify inheritance + job-override precedence, persona-injection golden, profile capability inheritance, allowed-scope warn-only + path_in_scope unit, deps block->run, deps-mode soft (testing/), and submit-time cycle rejection. Full suite green (46 checks).
2026-05-29 19:26:26 -07:00
saravanakumardb1
f2dabdeb81 feat(agent-queue): starter profile catalog (P1-S2)
profiles/<name>.md presets (name, persona, capabilities, default-verify, engine-class, prefers-engine, allowed-scope, review-policy) for developer, backend-engineer, frontend-engineer, ux-designer, ui-designer, qa, reviewer, docs-writer, and a reserved planner.
2026-05-29 19:26:26 -07:00
saravanakumardb1
3d99f04427 feat(agent-queue): profiles (persona + presets) and single-host deps/DAG (P1-S2)
Implements roadmap §6 (profiles) and §5 deps on the bash runner, backward-compatible
(jobs without profile/deps behave exactly as before).

Profiles (§6):
- profile_get / profile_persona / fm_eff helpers + PROFILES_DIR (AGENT_QUEUE_PROFILES
  override). A job's `profile:` inherits verify (<- default-verify), capabilities,
  engine-class, prefers-engine, allowed-scope, review-policy when the job omits them;
  job fields always override (precedence job > profile > default). Resolution runs via
  fm_eff inside the capability gate and resolve_engine, so inherited caps/engine-class
  take effect before launch.
- persona injection: the profile's persona block is prepended to the stripped body
  fed to the engine (job .md unchanged on disk; nothing secret logged).
- allowed-scope guardrail (WARN-ONLY): scope_check logs a non-blocking WARNING +
  records scope_warning= for changed paths outside the globs; path_in_scope is a
  pure, unit-testable matcher (`dir/**` = subtree).
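
Two of the rules above lend themselves to a short sketch: the `job > profile > default` precedence and the warn-only scope matcher where `dir/**` means the whole subtree. The runner itself is bash, so the TypeScript below only mirrors the logic. The function names are hypothetical, and treating any pattern other than `dir/**` as an exact path match is an assumption.

```typescript
type Fields = Record<string, string | undefined>;

// Effective value of a manifest field: the job wins, then the profile, then the default.
export function effectiveField(
  field: string,
  job: Fields,
  profile: Fields,
  defaults: Fields,
): string | undefined {
  return job[field] ?? profile[field] ?? defaults[field];
}

// True when `path` falls under at least one allowed-scope glob.
export function pathInScope(path: string, globs: string[]): boolean {
  return globs.some((glob) => {
    if (glob.endsWith("/**")) {
      const dir = glob.slice(0, -3);     // strip the trailing "/**"
      return path.startsWith(dir + "/"); // anything inside the subtree
    }
    return path === glob;                // assumption: other patterns match exactly
  });
}

// Warn-only: report out-of-scope paths, never block the job.
export function scopeWarnings(changedPaths: string[], globs: string[]): string[] {
  return changedPaths.filter((p) => !pathInScope(p, globs));
}
```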

deps / DAG, single host (§5):
- deps reference other jobs by idempotency-key. dep_satisfied: shipped/ (hard) or
  shipped/+testing/ (deps-mode: soft). deps_unmet drives a block-with-reason skip in
  inbox selection (never launched/failed); cmd_status surfaces "blocked (waiting on
  <keys>)". deps_would_cycle rejects cyclic submits on `add`.
- _drain_pending: `--once` drains past dep-blocked jobs (idle can't satisfy them)
  while still waiting on retry/recovery backoff timers.
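
The submit-time cycle check and the hard/soft satisfaction rule above can be sketched as follows. This is an illustration of the idea behind `deps_would_cycle` and `dep_satisfied`, not a port of the bash code. The in-memory `Map` of existing jobs and the function signatures are assumptions; the folder names come from the commit message.

```typescript
// existing: idempotency-key -> keys it depends on, for jobs already in the queue.
// Returns true when adding `key` with `deps` would close a dependency loop.
export function depsWouldCycle(
  key: string,
  deps: string[],
  existing: Map<string, string[]>,
): boolean {
  const seen = new Set<string>();
  const stack = [...deps];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (current === key) return true; // walked back to the job being added
    if (seen.has(current)) continue;
    seen.add(current);
    stack.push(...(existing.get(current) ?? []));
  }
  return false;
}

// hard: the dependency must be in shipped/; soft: testing/ also counts.
export function depSatisfied(folder: string, mode: "hard" | "soft" = "hard"): boolean {
  return folder === "shipped" || (mode === "soft" && folder === "testing");
}
```

An iterative walk with a `seen` set keeps the check linear in the number of dependency edges and avoids deep recursion on long chains.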

Meta now records effective (inherited) capabilities/engine-class/prefers-engine/
review-policy/allowed-scope so `status` reflects resolved config.
2026-05-29 19:26:16 -07:00
saravanakumardb1
7c4f5bc9b0 docs(agent-queue): draft Slice 4 (tracker adapter) + Phase 2 Slice 1 (fleet data model)
2026-05-29 19:11:09 -07:00
saravanakumardb1
0443590ce4 docs(agent-queue): update Slice 2 prompt — branch off main (Slice 1+3 merged)
2026-05-29 19:05:34 -07:00
saravanakumardb1
41f91d7ea1 Merge PR: Phase 1 Slice 3 — single-host resilience (crash recovery, WIP checkpoint/resume, retry) + execution insights
Reviewed against §11/§25/§26 single-host scope; selftest 34/34 (regression intact);
WIP protects current branch + PID-reuse guard on orphan recovery verified.
2026-05-29 19:00:13 -07:00
saravanakumardb1
87a4bf591a docs(agent-queue): Resilience + Insights docs; tick §11/§25/§26 single-host (P1-S3)
README: Resilience + Insights sections, retry frontmatter active (manifest
table), retries_exhausted/recovered result values, recover/insights commands,
honest token caveat.
Roadmap: tick fully-completed single-host boxes in §11/§25/§26 with
annotations; bump §0 Phase 1 to 55%.
2026-05-29 18:43:38 -07:00
saravanakumardb1
f46dd38adb test(agent-queue): resilience + insights selftest cases (P1-S3)
Adds (never weakens) temp-git-repo + stub cases: orphan recovery (+idempotent), WIP checkpoint/numstat, non-git skip, WIP resume, retry on verify_failed and crash (incl. no-retry when class absent), parse_usage extraction, per-engine aggregate. Inbox-empty-safe counts; avoids the pipefail+grep -q SIGPIPE trap.
2026-05-29 18:43:30 -07:00
saravanakumardb1
679d8b72cd feat(agent-queue): dashboard insights column for finished jobs (P1-S3)
Read-only from meta: tokens or cost + attempts + line deltas + duration; recognizes the new retries_exhausted result. agent-queue.sh stays the source of truth.
2026-05-29 18:43:30 -07:00
saravanakumardb1
1758bc1ab1 feat(agent-queue): single-host crash recovery, WIP checkpoint/resume, retry + insights (P1-S3)
Implements the single-host bash equivalents of roadmap §25 (durability/crash
recovery) and §26 (execution insights), plus §11 retry/dead-letter stand-in.

Resilience (A1-A4):
- recover_orphans + `recover` command: building/ jobs with a dead worker (dead
  pid, pidstart reuse-guard) are moved back to inbox/ with attempts incremented,
  on `run` startup and each loop. Idempotent (folder location is the guard).
- WIP checkpointing: for a git cwd, _wip_start creates/checks out aq/wip/<job>
  and _wip_checkpoint commits changes on every exit path via an EXIT/INT/TERM
  trap; never commits to main/current branch; non-git cwd skipped. RESUME: a
  relaunch whose aq/wip/<job> exists checks it out first (continue from
  checkpoint). wip_base persisted in a write-once sidecar.
- retry policy (now functional): retry { max, backoff, on } requeues failures
  whose class (timeout|verify_failed|crash) is in `on`, honoring backoff via
  next_eligible (selection skips until eligible), up to max attempts; exhaustion
  -> failed/ result=retries_exhausted with the WIP branch + full log preserved.
- state integrity: all meta writes stay append-only; attempts/next_eligible/wip_*
  are re-derivable; recovery is crash-safe.
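
The retry decision described in the resilience bullets can be summarised as a small pure function. This sketch makes several assumptions that the commit message does not settle: `backoff` is treated as a fixed delay in seconds, `max` is compared against the attempts already made, and a failure outside `on` keeps its failure class as the result. Only `retries_exhausted`, `next_eligible`, and the three failure classes come from the source.

```typescript
type FailureClass = "timeout" | "verify_failed" | "crash";

interface RetryPolicy {
  max: number;        // maximum attempts
  backoff: number;    // delay before the next attempt, in seconds (assumption)
  on: FailureClass[]; // failure classes that are eligible for retry
}

type RetryDecision =
  | { action: "requeue"; nextEligible: number } // epoch seconds; selection skips until then
  | { action: "fail"; result: string };

export function decideRetry(
  policy: RetryPolicy | undefined,
  failure: FailureClass,
  attempts: number, // attempts made so far, including the one that just failed
  nowSec: number,
): RetryDecision {
  // No policy, or this failure class is not listed: no retry.
  if (!policy || !policy.on.includes(failure)) {
    return { action: "fail", result: failure };
  }
  // Budget used up: park in failed/ with the WIP branch and log preserved.
  if (attempts >= policy.max) {
    return { action: "fail", result: "retries_exhausted" };
  }
  return { action: "requeue", nextEligible: nowSec + policy.backoff };
}
```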

Insights (B1-B6):
- per-run metrics into meta: duration_s, exit, result, attempts, and (git cwd)
  files_changed/lines_added/lines_deleted from numstat wip_base..HEAD.
- parse_usage(engine, log) adapter: generic AQ_USAGE line + Claude/Codex token
  heuristics; Devin/Copilot TODO; usage_estimated flag; never fabricates numbers.
- status insights sub-line; new `insights [job]` command (per-job metrics or a
  recent table + per-engine token/cost/success/duration rollup).
- privacy: only metrics are recorded, never prompt content or secrets.

Backward-compatible: legacy .md and non-git cwd behave exactly as before.
2026-05-29 18:43:21 -07:00
saravanakumardb1
bc0c0e263c Merge PR #1: Phase 1 Slice 1 — evolved manifest, priority, capabilities, engine-class, idempotency
Reviewed against the diff (capability gate before launch, 3-pass idempotency,
priority+age selection, engine-class resolution, timeout/flock launch). selftest 18/18.
2026-05-29 18:12:41 -07:00
saravanakumardb1
1f18f5d7a3 docs(agent-queue): add Phase 1 Slice 3 prompt (resilience & insights, single host)
2026-05-29 18:10:43 -07:00
saravanakumardb1
beb225162a docs(agent-queue): add durability/crash-recovery (§25) + execution insights/token accounting (§26)
- §13: fleet_jobs stores instruction bodyMd (durable md SoT) + checkpoint;
  fleet_runs carries token/cost/model/tool/diff metrics
- §25: instructions durable in Cosmos md, WIP checkpoint branch aq/wip/<jobId>,
  orphan detection, resume-vs-restart, fencing, retry->dead_letter, crash taxonomy
- §26: per-run token/cost/latency/tool insights, honest metered-vs-estimated
  capture, rollups, control-plane surfacing, secret redaction
- feature-catalog rows for §25 and §26
2026-05-29 18:09:32 -07:00
saravanakumardb1
470b2ce8d0 docs(agent-queue): version Phase 1 slice prompts (slice1, slice2)
Track the delegated agent task prompts under docs/jobs/ so the slice
decomposition of the gigafactory roadmap is reproducible and reviewable.
2026-05-29 18:05:06 -07:00
saravanakumardb1
67d8aa5766 docs(agent-queue): add work hierarchy & composite delegation (roadmap/epic)
New §24 + feature-catalog row:
- two delegation modes: atomic (leaf bug/feature/task) vs composite (roadmap/epic)
- introduce job kind (leaf|composite); composite routes to a planner/orchestrator
  that fans out child leaf jobs as a DAG across factories/agents/profiles
- parentId hierarchy + rollup semantics (status/budget/verify/phase-gates) +
  idempotent re-run (skip shipped children)
- source-of-truth/sync discipline (one record referenced by many; one-way echo)
- HYBRID decision recorded: model kind/parentId/rollup in the fleet layer now,
  keep shared tracker ITEM_TYPES unchanged (label kind:roadmap), promote to a
  first-class epic type later via additive migration once proven
- phasing: leaf-only P1-P2; manual composite P3; auto-decomposition planner P3->P5
2026-05-29 18:02:10 -07:00