bytelyst-devops-tools

Author	SHA1	Message	Date
Hermes VM	a8cf61a281	docs: Phase 8 — Telegram convention + delegation brief Closes the Phase 8 line that's actually a docs/codebase change. The other two Phase 8 items are VM-ops work (bot tokens + watchdog extensions) and live as a delegation brief. What's in this repo - `docs/hermes-operations.md` gains a "Telegram Notification Convention" section codifying: * routing per instance (Vijay → root chat, Bheem → Uma chat, cross-cutting → root) * silent-on-healthy + post-on-recovery * the numbered-emoji progress convention (`1️⃣`, `2️⃣`, …) and why it survives Telegram client rendering * approval-prompt UI expectation * "don't paste secrets" pointer back to `lib/logger.ts`'s redaction path-list - `docs/prompts/phase8-telegram-loop.md` — full delegation brief for the VM-side implementation. Design: dashboard backend writes new warnings (with `instance=<id>` tag, deduped over 1h) to an append-only log; both watchdogs tail it and route through the existing Telegram delivery path. Avoids splitting the delivery code into two places that would each need rate-limit + token-rotation handling. Brief is gated on Phase 4 — Uma's watchdog must exist first. - Roadmap Phase 8 ticked for "preserve numbered-emoji convention" (codified in operations doc); the other two items have notes pointing at the brief. Phase 8 doesn't fully close in this repo because the delivery loop needs real bot tokens and the Phase 4 Uma watchdog before it can be end-to-end validated. The codebase's contribution is everything that doesn't need a token: the convention, the design, and the delegation brief. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 08:05:52 +00:00
Hermes VM	14c7a8f59a	feat(dashboard): Phase 6 — severity-tagged alerts + per-instance actions + deep links Closes Phase 6 (the items that don't need a backend change). Three threads, all on the Hermes Mission Control overview: 1. Severity-tagged alerts on the ops panel New `RecentAlerts` component classifies each `recentAlerts` string into critical / warn / info by leading token (CRITICAL/ERROR/FATAL → critical; INFO/OK → info; default → warn — most ops alerts are warnings) and renders a colour-coded badge per alert. A per-severity radiogroup filter sits in the panel header with live counts. Pure UI — no backend contract change. The watchdog log tailer in `hermes-telemetry/repository.ts` already emits structured severities for the future migration off of leading-token parsing. 2. Per-instance action row on each `InstanceCard` Adds three buttons next to "Open dashboard" / "Copy URL": - "Copy SSH command": Tailscale-scoped only — never raw `ssh` — and per-instance user (`tailscale ssh root@<ts-ip>` for Vijay, `tailscale ssh uma@<ts-ip>` for Bheem). Disabled when the snapshot has no Tailscale IP. - "View tasks": deep link into the Task Ledger pre-filtered by instance via `/hermes/tasks?instance=<id>`. - "Open runbook": link to `docs/hermes-operations.md`. "How to restart this gateway" is intentionally a runbook link, not a button — restarting is privileged and should go through the documented procedure, not the dashboard UI. 3. URL-param hydration of the instance switcher `HermesInstanceProvider` now reads `?instance=` from the URL on mount (and on subsequent navigations to a different value). The URL value wins over the persisted localStorage selection so deep links from the ops panel land on a pre-filtered pane. The param is intentionally not auto-stripped — back/forward and copy-paste stay meaningful. 
Roadmap status: Phase 6 ticked except trend cards (deferred — needs client-side history persistence) and theme toggle (deferred — shell doesn't expose a switch primitive yet). Unified-alerts-feed bullet partially achieved by the new severity filter; the per-instance roll-up will land when a UI consumer is built for the Phase 3 telemetry endpoint. Verified: typecheck ✅, build ✅, 7/7 E2E ✅ (the existing switcher test exercises the new context code path; URL hydration is covered indirectly by the deep-link button → Task Ledger pre-filter). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 08:03:57 +00:00
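The leading-token classification described in commit 14c7a8f59a could be sketched roughly as below. This is an illustration, not the `RecentAlerts` component itself: the names `classifyAlert` and `countBySeverity`, the letter-run regex used to find the leading token, and the case-insensitive match are all assumptions. Only the token-to-severity mapping (CRITICAL/ERROR/FATAL → critical, INFO/OK → info, everything else → warn) and the per-severity counts come from the commit message.

```typescript
type Severity = "critical" | "warn" | "info";

// Mapping taken from the commit message; the set names are hypothetical.
const CRITICAL_TOKENS = new Set(["CRITICAL", "ERROR", "FATAL"]);
const INFO_TOKENS = new Set(["INFO", "OK"]);

// Classify one alert string by its leading token.
// Assumption: the "leading token" is the first run of letters, compared case-insensitively.
function classifyAlert(alert: string): Severity {
  const match = alert.trim().match(/^[A-Za-z]+/);
  const token = match ? match[0].toUpperCase() : "";
  if (CRITICAL_TOKENS.has(token)) return "critical";
  if (INFO_TOKENS.has(token)) return "info";
  // Default bucket: most ops alerts are warnings.
  return "warn";
}

// Live counts for a per-severity filter header.
function countBySeverity(alerts: string[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { critical: 0, warn: 0, info: 0 };
  for (const alert of alerts) {
    counts[classifyAlert(alert)] += 1;
  }
  return counts;
}
```

Because unknown prefixes fall through to `warn`, a malformed or untagged alert is never silently downgraded to `info`.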
Hermes VM	efdf41f2bb	feat(dashboard): Phase 7 — gate /hermes/ops on requireAdmin + Phase 4 brief Two threads, one commit because they're both about closing dashboard-side roadmap items that don't need their own slice. Phase 7 — auth coverage on hermes routes: - `/api/hermes/ops` was the last unauthenticated Hermes endpoint — despite revealing instance / gateway / Tailscale-IP / backup-repo / warnings state. Now gated on `requireAdmin`, matching the new `/api/hermes/telemetry/:instance` from the previous slice and every other privileged route in this backend. - Privilege-surface table in `dashboard/DEPLOYMENT.md` updated to show `requireAdmin` for both Hermes routes; the previous "no auth, read-only ops snapshot" carve-out is gone. - Roadmap Phase 7 ticks for "require auth on hermes routes" + "keep hermes data private-only" with verification notes. Phase 4 — Bheem/Uma parity (delegation brief): - Phase 4 is VM ops, not codebase work — it requires sudo on the Hostinger VM, Uma-owned GitHub credentials, and Telegram bot tokens. None of it is editable in this repo. Wrote `docs/prompts/phase4-bheem-uma-parity.md` as a self-contained delegation brief covering: Uma persistent-backup repo + timer, Uma health watchdog, first restore rehearsal, quarterly drill reminder, and the dashboard-side verification (the /hermes/ops + /hermes/telemetry/bheem outputs that confirm the gap is closed). - Phase 4 section header in the roadmap now points at the brief and explains why the checkboxes stay open in this repo. Verified: backend 57/57 unit tests ✅, web 7/7 E2E ✅ (Playwright mocks bypass requireAdmin since they fulfill before the request reaches Fastify; real auth'd users get the same flow as every other admin route). Lint 0 errors, build green. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:59:48 +00:00
saravanakumardb1	08d8d715a1	docs(agent-queue): add Dependabot dependency-triage prompt for common-plat	2026-05-30 00:56:55 -07:00
Hermes VM	62c0cd60e0	feat(dashboard): Phase 3 slice 2 — Products pane on real service registry Closes the "drop the fabricated 50-item mock" Phase 3 line. The Mission Control Products pane now renders the real deployment registry as its primary view, sourced from `backend/src/modules/services` (the Cosmos-backed service registry) joined with the health module. Page layout: - Top "Live services" SectionCard: real services from `api.getServices()` joined with `api.getHealth()`. Per-card: status (up / degraded / down derived from the most recent health probe), version, health URL, repo path, last deploy, last health check, response time. Refresh button (busts the 30s health cache via `clearHealthCache`). Loading / empty / error states. Health-check poll loop is intentionally not added on this page — the home dashboard already runs one and our cache layer dedupes. - Bottom "Planned products (seed data)" SectionCard: the previous 50-item seed view, now clearly labelled `Seed` and demoted below the live data. Kept until manual entries for not-yet-deployed products are wired in (per the Phase 3 roadmap note). E2E: - `hermes.spec.ts` `beforeEach` now mocks `/api/services`, `/api/health`, `/api/health/cache` so the products page renders deterministically without a live backend (the dashboard spec already does the same for the home page). Verified: typecheck ✅, 13/13 web unit tests ✅, 7/7 E2E ✅, lint 0 errors, build green. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:56:51 +00:00
Hermes VM	ad16b1308e	feat(dashboard): Phase 3 slice 1 — hermes telemetry contract + backend endpoint First slice of Phase 3 ("real per-instance telemetry"). Defines the read-only artifact contract from Decision #1 (sessions, cron, memory, skills, watchdog alerts, backup history) and ships an admin-gated backend endpoint that probes the live Hermes instance, gracefully degrading to status:'unknown' wherever the source isn't readable. What's new - `backend/src/modules/hermes-telemetry/types.ts` — Zod schemas for every section of the snapshot, plus a `HermesProbeStatus` reused from hermes-ops so the UI can distinguish "definitely empty" from "couldn't read the source" for each section independently. - `backend/src/modules/hermes-telemetry/repository.ts` — implementation that: * shells out via `runuser -u <user> --` for cross-user instances (Bheem/uma) the same way `hermes-ops/repository.ts` does; * parses `hermes sessions stats / cron list / memory list / skills list --json` when the CLI is present, otherwise reports status:'unknown'; * tails the watchdog log and buckets each line by severity (critical / warn / info); * pulls `git -C <repo> log` against the instance's backup repo for backup history; * caches per-instance with a 30s TTL + in-flight coalescing, same pattern as hermes-ops. - `backend/src/modules/hermes-telemetry/routes.ts` — admin-only GET `/api/hermes/telemetry/:instance` (the `instance` path param is Zod-validated; the response is validated against `HermesTelemetrySnapshotSchema` before send so a shape regression surfaces here, not in the UI). - `backend/src/modules/hermes-telemetry/hermes-telemetry.test.ts` — 6 unit tests: ENOENT-on-everything case validates against the schema, JSON-parse path for sessions/cron/memory/skills, watchdog log severity bucketing, backup-history `git log` parsing, cache hit, per-instance cache isolation. Coverage: 95.17% lines on the new repository module. 
- `backend/vitest.config.ts` — telemetry repository added to the coverage gate's `include` list (ratchet). - `web/src/lib/api.ts` — typed surface for the new endpoint: `HermesTelemetrySnapshot` + sub-types + `api.getHermesTelemetry`. What's NOT in this slice - UI consumption. The Task Ledger / Agents / History panes still render mock data; converting them is queued for the next slices. This slice ships the contract + the backend so those slices can build on a stable shape. - Backward-compat replacement of `/api/hermes/ops` (which is unauthenticated today). That comes with the Phase 7 auth pass. Verified: backend typecheck ✅, 57/57 unit tests ✅, web typecheck ✅, lint 0 errors, coverage gate ≥95% lines on every gated file. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:53:37 +00:00
Hermes VM	ecd1f20d59	feat(dashboard): Phase 2 — instance dimension across Mission Control Closes Phase 2. Every entity in `web/src/lib/hermes` now carries an `instanceId: 'vijay' \| 'bheem'` (with `'all'` allowed for cross-cutting agents like Hermes Core / GitHub link), and a global instance switcher above every Mission Control pane filters them. Library changes (`web/src/lib/hermes.ts`): - New `HermesInstanceId` / `HermesInstanceFilter` types + `HERMES_INSTANCES` metadata array. - `instanceId` added to `HermesProduct`, `HermesTask`, `HermesEvent`, `HermesRun`, `HermesAgentStatus`. Seed data deterministically split ~50/50 across instances; agents tagged per-scope (Local VM runner → bheem, CLI runner / Scheduler → vijay, Hermes Core / GitHub / OpenClaw / deployment / notifications → all). - `getHermesTasks({instance})`, `getHermesProducts(view, instance)`, `getHermesAgents(instance)`, `getHermesHistory(instance)`, `getHermesOverview(instance)` all accept the filter; helper `instanceMatches(scope, filter)` keeps the semantics consistent (always-match for `'all'` on either side). UI changes: - New `HermesInstanceProvider` (React context, localStorage-backed under `hermes.instanceFilter.v1`, SSR-safe default to avoid hydration mismatch) mounted in `app/hermes/layout.tsx`. - New `HermesInstanceSwitcher` segmented control (radiogroup with aria-checked) rendered in the layout header above every pane. - New `HermesInstanceBadge` shown on task rows (Active Missions + Task Ledger), product cards (overview minicards + portfolio cards), and agent cards. - `/hermes` overview gains a "Per-instance roll-up" section that always shows Vijay vs Bheem side-by-side regardless of the active filter — that's the always-cross-instance comparison view, while the eight metric cards above it are filtered by the switcher. Tests: - 2 new unit tests in `lib/hermes.test.ts` (instance tagging on seed data + filter semantics across tasks/products/agents/overview). 
- 1 new E2E test asserting the switcher's radiogroup, default selection, and persistence-friendly state change. - All green: 13/13 web unit tests, 7/7 E2E. `web/test-results/` and `web/playwright-report/` added to `.gitignore` since they're regenerated per run. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:43:55 +00:00
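The filter semantics that commit ecd1f20d59 attributes to `instanceMatches(scope, filter)` (always match when either side is `'all'`) might look something like the sketch below. `HermesInstanceId`, `HermesInstanceFilter`, and `instanceMatches` are named in the commit; the `HermesInstanceScope` alias, the `TaggedEntity` shape, and the `filterByInstance` helper are hypothetical and exist only to make the example self-contained.

```typescript
type HermesInstanceId = "vijay" | "bheem";
// Hypothetical alias: entities may be tagged 'all' when they are cross-cutting.
type HermesInstanceScope = HermesInstanceId | "all";
type HermesInstanceFilter = HermesInstanceId | "all";

// 'all' on either side always matches; otherwise the ids must be equal.
function instanceMatches(
  scope: HermesInstanceScope,
  filter: HermesInstanceFilter,
): boolean {
  if (scope === "all" || filter === "all") return true;
  return scope === filter;
}

// Hypothetical minimal entity shape (tasks, products, agents all carry instanceId).
interface TaggedEntity {
  id: string;
  instanceId: HermesInstanceScope;
}

// Hypothetical helper showing how a getter could apply the filter.
function filterByInstance<T extends TaggedEntity>(
  items: T[],
  filter: HermesInstanceFilter,
): T[] {
  return items.filter((item) => instanceMatches(item.instanceId, filter));
}
```

With this rule, a cross-cutting agent tagged `'all'` stays visible whichever instance the switcher selects, which matches the behaviour the commit describes for Hermes Core and the GitHub link.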
saravanakumardb1	24fe1567f6	docs(agent-queue): draft Phase 2 next prompts — direct tracker->module wiring (§10) + two-factory parallel demo (exit criteria)	2026-05-30 00:40:21 -07:00
saravanakumardb1	1b3b7320c2	Merge: Phase 2 fleet feature flags + shadow/dual-run (#6)	2026-05-30 00:31:24 -07:00
Hermes VM	13e5e1c551	ci(dashboard): Phase 5 P2 — wire Playwright E2E into Gitea CI Closes the Phase 5 P2 checkbox (second half — first half: pino logging in `1e64d75`). Phase 5 is now fully green. Two changes: 1. `web/e2e/hermes.spec.ts` now intercepts `/api/hermes/ops` with a fixture snapshot. The backend's hermes-ops endpoint shells out to `systemctl` / `git` / `ps` / `du` on the live VM and is therefore neither available nor deterministic in CI. Mocking it lets the suite run against the web stack alone (no backend, no live VM). Fixture shape mirrors the Zod schema in `backend/src/modules/hermes-ops/types.ts`. 2. `.gitea/workflows/ci.yml` re-enables the previously-commented-out E2E step. Adds a preceding `playwright install --with-deps chromium` step so the runner pulls the browser fresh per run. The web suite starts its own Next dev server via Playwright's `webServer` config (`pnpm exec next dev -p 3200`), so we do NOT start the backend in CI — every backend route used by the suite is mocked via `page.route` (auth, csrf, services, deployments, health/cache, seed, hermes-ops). Verified locally: `pnpm exec playwright test` → 6 passed in 19.5s (2 hermes specs + 4 dashboard/login specs across desktop + mobile). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:28:50 +00:00
saravanakumardb1	fbecbe82b6	feat(agent-queue): fleet feature flags + shadow/dual-run (Phase 2) Add a safe, reversible path to validate the fleet coordinator against the proven single-host path BEFORE cutover, via three independently-toggleable flags: AQ_FLEET=0 pure offline (zero coordinator calls; offline path unchanged) AQ_FLEET_ROUTE=1 route_via_service: coordinator authoritative for claim (default = P2-S3) AQ_FLEET_ROUTE=0 local inbox authoritative (coordinator not used to source work) AQ_FLEET_SHADOW=1 dual-run (needs AQ_FLEET=1 + ROUTE=0): query coordinator in parallel, record divergence, NEVER act on it Precedence: SHADOW only when ROUTE=0; if ROUTE=1 + SHADOW=1, ROUTE wins (one-shot warning). lib/fleet-client.sh: fleet_route_enabled / fleet_shadow_enabled / fleet_flags_warn_once / fleet_flags_state; fleet_shadow_claim (read-only — isolated `-shadow` factoryId + dryRun, releases any real lease, never materializes), fleet_shadow_compare (AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY → .state/fleet-shadow.log), fleet_shadow_report (shadow:true, response never acted on), cmd_fleet_shadow_report (counts + agreement rate). agent-queue.sh: ROUTE-gate claim sourcing (claim only when route_via_service); shadow hook after the local authoritative decision each iteration (best-effort, error-swallowed — shadow can never fail a real job); `fleet-shadow-report` subcommand + help; resolved flags surfaced in `status`/`fleet-status`. tryClaim/fence/offline paths unchanged. Strictly side-effect-free on real job state: shadow never ships, quarantines, or mutates real jobs. Offline path byte-for-byte unchanged when AQ_FLEET=0. selftest.sh: +8 checks (shadow AGREE/DIVERGE/COORD_EMPTY, non-fatal 5xx, ROUTE precedence, ROUTE=0 local-authoritative, fleet-shadow-report summary, shadow_report unit). 60 prior checks unchanged → 68 total green. README + GIGAFACTORY_ROADMAP document the flag model + cutover ladder. 
Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 00:22:48 -07:00
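The flag precedence in commit fbecbe82b6 can be restated as a small decision function. The real logic lives in shell (`lib/fleet-client.sh`); the TypeScript below is only a transcription for illustration, and `resolveFleetFlags`, `FleetMode`, and the warning text are hypothetical names. It assumes that only the literal value `1` enables `AQ_FLEET` and `AQ_FLEET_SHADOW`, and that `AQ_FLEET_ROUTE` counts as on unless it is exactly `0` (the commit says routing is the default).

```typescript
type FleetMode =
  | "offline" // AQ_FLEET=0: zero coordinator calls
  | "route"   // coordinator authoritative for claim
  | "local"   // local inbox authoritative, coordinator not queried
  | "shadow"; // local authoritative, coordinator queried in parallel, never acted on

interface FleetFlagState {
  mode: FleetMode;
  warning?: string;
}

function resolveFleetFlags(
  env: Record<string, string | undefined>,
): FleetFlagState {
  // Assumption: anything other than "1" leaves the fleet path off.
  if (env.AQ_FLEET !== "1") return { mode: "offline" };

  // Assumption: ROUTE defaults to on; only an explicit "0" disables it.
  const route = (env.AQ_FLEET_ROUTE ?? "1") !== "0";
  const shadow = env.AQ_FLEET_SHADOW === "1";

  if (route) {
    // ROUTE wins over SHADOW; the commit describes a one-shot warning here.
    return shadow
      ? { mode: "route", warning: "AQ_FLEET_SHADOW ignored: AQ_FLEET_ROUTE=1 takes precedence" }
      : { mode: "route" };
  }
  return { mode: shadow ? "shadow" : "local" };
}
```

Read this way, the cutover ladder is a walk through the modes: `offline`, then `shadow` to record divergence, then `route` once agreement is acceptable.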
Hermes VM	1e64d75fd4	feat(dashboard): Phase 5 P2 — structured pino logging with redaction First half of Phase 5 P2 (the "structured backend logging" piece; E2E-in-CI lands separately so the diff stays reviewable). Adds `lib/logger.ts` exporting a singleton pino instance shared between Fastify (via `loggerInstance`) and any non-request code path. One configured logger across the backend means uniform formatting, redaction, and log-level control: - LOG_LEVEL env knob (defaults: debug in non-prod, info in prod when NODE_ENV=production). Documented in `.env.example`. - Built-in redaction for Authorization / Cookie headers and the common secret-shaped field names (password, token, refreshToken, accessToken, csrfToken, JWT_SECRET, CSRF_SECRET, ENCRYPTION_KEY, COSMOS_KEY, AZURE_CLIENT_SECRET) so an accidental `req.log.info(req.body)` or `logger.error({ err, config }, …)` won't dump credentials. This is a backstop, not the primary defense — call sites should still avoid logging raw config/req. - JSON to stdout in every environment. Pipe through `pino-pretty` locally if you want pretty output; we deliberately don't bundle pino-pretty as a runtime dep. - `childLogger(module)` helper tags log lines with their origin so repositories/background workers don't have to repeat the module name on every line. Sweeps the runtime `console.error` sites that lose request context (deployment orchestrator background fire-and-forget, system docker stats/cleanup, backup CRUD, vm getAllContainers) onto the structured logger. CLI-only modules (`scripts/run-migrations.ts`, `migrations/index.ts`, `cosmos-init.ts` startup, `azure-keyvault.ts`, `config.ts` env warnings, `lib/migrations.ts` no-op message) keep `console.*` for now — they run before Fastify is up and are queued for a separate cleanup pass. Tests, typecheck, lint (0 errors), build green. Coverage gate still passing (≥95% lines on every gated file). 
Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:18:44 +00:00
Hermes VM	c6ec1a06ea	docs(dashboard): Phase 5 P1 — document privilege surface; gate /code-quality/check Closes the final Phase 5 P1 checkbox and REVIEW_ACTIONS #6. The backend container has root-equivalent host access via the docker socket, host log mounts, and the VM scripts mount, but until now the "who can do what to the host?" answer was scattered across compose files and route handlers. This commit centralizes it. DEPLOYMENT.md gains a "Privilege Surface" section that lists: - every host mount + container path + mode + purpose - every shell-outing route, the actual commands it runs, and the auth gate on each - what an admin token can do today (≈ host shell) - five known sharp edges (un-allow-listed container names, unvalidated projectPath, no per-route audit-log on shell-outs, container runs as root, global rate-limit only) - a P1 → P3 mitigation roadmap (allow-list wrapper around shell-outs, projectPath validation, audit-logging shell-outs, drop root in container, replace docker.sock with a verb-restricted proxy) Concurrent code fix: `POST /code-quality/check` was reachable unauthenticated despite shelling out to `npm run typecheck/lint/build/test:run` in a caller-supplied `projectPath`. Added `preHandler: requireAdmin` to bring it in line with every other shell-outing route in the dashboard. Same commit because the documentation table promises this gate exists. REVIEW_ACTIONS #6 marked RESOLVED with the rationale; roadmap checkbox ticked. Tests, typecheck, lint (0 errors), build, and coverage gate (≥95% lines on every gated file) all stay green. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:05:51 +00:00
Hermes VM	824f31586a	docs(dashboard): Phase 5 P1 — fix port/endpoint drift, dedupe deployment docs Closes the Phase 5 P1 doc-drift checkbox and REVIEW_ACTIONS #5. The 3000-vs-3049 confusion came from prose claims in three docs that each picked a different "right" answer. The truth is: the web container listens on :3000; docker-compose maps `127.0.0.1:3049:3000`; production is fronted by Traefik on `https://devops.bytelyst.com`. Encoding that explicitly so future readers don't have to dig through compose files: - DEPLOYMENT.md becomes canonical. Its content is now the (more accurate) old DEPLOYMENT_GUIDE.md merged with a "Ports — quick reference" table covering Local dev / Docker Compose / Production Traefik, plus a Local-development section for `pnpm dev`. - DEPLOYMENT_GUIDE.md → 5-line redirect stub pointing at DEPLOYMENT.md (kept for `deploy.sh` and any external links). - deploy.sh updated to point at DEPLOYMENT.md. - README.md "Web port: 3000" line rewritten to spell out container vs Compose-host vs dev-mode and link to the port table. - ENDPOINTS.md gets a top-of-file note: every `localhost:3000` URL in that file is the `pnpm dev` workflow; substitute `:3049` for the Dockerized stack. - REVIEW_ACTIONS.md #5 marked RESOLVED with the rationale. No code, behavior, lint, or test changes. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:03:05 +00:00
Hermes VM	3fc471e880	chore(dashboard): Phase 5 P1 — remove dead SSE log-stream claim Closes the long-standing SSE TODO. The previous attempt with `fastify-sse-v2 ^4` was incompatible with Fastify 5 and was never wired in; the README/DEPLOYMENT.md kept advertising "real-time log streaming" that didn't exist. The web client never used EventSource — `web/src/lib/api.ts` already polls `/deployments/:id/logs` via the normal `apiRequest` helper. Resolution: remove the claim, not ship the feature. - drop `fastify-sse-v2` dep from `backend/package.json` + lockfile - delete the commented-out plugin import + register in `server.ts`, replace with a NOTE explaining the JSON-polling decision and how to add a stream later (`reply.raw`) - remove the `TODO: Re-enable SSE` comment in `deployments/routes.ts`; the endpoint already returns JSON, document that explicitly - rewrite the README "Deployment Log Streaming" section as "Deployment Logs" (JSON-polled, no SSE); fix the endpoint table - flip the DEPLOYMENT.md bullet from "Real-time log streaming (SSE)" to "Deployment log retrieval (JSON polling — no SSE)" - mark REVIEW_ACTIONS #4 RESOLVED with the reasoning - tick the roadmap checkbox If a real-time stream is wanted later, ship it explicitly via `reply.raw` and update README/DEPLOYMENT.md/the route comment in the same change. Don't reintroduce a half-disabled plugin. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 07:00:07 +00:00
Hermes VM	18180aab78	test(dashboard): Phase 5 P1 — auth/csrf/health/orchestrator tests + coverage gate Closes the Phase 5 P1 testing checkbox. Adds 35 new unit tests across the modules called out in the roadmap and wires a v8 coverage gate into CI. Coverage of newly-tested files (lines / branches): lib/auth.ts 94.4% / 100% lib/csrf.ts 95.1% / 90% modules/health/repository.ts 100% / 92% modules/deployments/orchestrator.ts 95.2% / 74% modules/services/repository.ts 100% / 100% modules/hermes-ops/repository.ts 95.2% / 68% Threshold (lines/funcs/stmts ≥85%, branches ≥65%) is scoped to those six files via `coverage.include` so untested legacy modules (vm, system, audit, route handlers) report but don't gate. Add files there as they gain real tests — ratchet up, never relax. Test approach mirrors the existing services/hermes-ops suites: hoisted mocks for I/O (fetch, child_process, fs/promises, cosmos-init), real JOSE-signed JWTs for the auth path, fake timers for cache TTL and CSRF expiry assertions. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 06:56:16 +00:00
saravanakumardb1	5c0ae020c0	docs(agent-queue): draft P2 prompts — factory enrollment+tokens (§12) + feature flags/shadow-dualrun	2026-05-29 23:52:14 -07:00
Hermes VM	cf5428acd1	feat(dashboard): Phase 1 — harden hermes-ops backend + tests - Short-TTL (30s) snapshot cache + in-flight coalescing so the panel poll and concurrent refreshes don't fan out ~20 systemctl/git/ps/du subprocesses each time; snapshot carries a `cached` flag and `getHermesOpsSnapshot({force})`. - Distinguish "unit inactive" (down) from "probe couldn't run" (unknown): a new exec() wrapper reports whether the command actually ran (ENOENT/timeout = unknown) vs exited non-zero with output (e.g. systemctl is-active -> inactive). Per-field ProbeStatus on gateway/dashboard/timer/repo; warnings differentiate "is not active" from "status could not be determined". - Robust Bheem/Uma checks: `runuser -u uma -- systemctl --user is-active/is-enabled` with a ps / existsSync fallback so a failed probe degrades to the legacy check instead of a false "down". - Zod schema (HermesOpsSnapshotSchema) as the stable typed contract; the route validates output before sending. New status fields are additive (active/enabled/url/etc. preserved) so the existing web client is unaffected. - Unit tests (mock execFile/fs): healthy snapshot, down vs unknown mapping, runuser->ps fallback, unreadable repo, cache hit + force bypass, request coalescing. Backend: 16 tests green. Roadmap: check off Phase 1 items and Phase 5 P0 in hermes_dashboard_v2_roadmap.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 06:50:32 +00:00
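The "short-TTL cache + in-flight coalescing" pattern from commit cf5428acd1 (reused per instance by the telemetry endpoint) can be sketched as follows. This is a generic illustration under stated assumptions, not the code in `hermes-ops/repository.ts`: the `CoalescingCache` class, its constructor parameters, and the injectable `now` clock are hypothetical. Only the 30s TTL, the `force` bypass, and the idea of sharing one pending probe among concurrent callers come from the commit message.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class CoalescingCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly inFlight = new Map<string, Promise<T>>();
  private readonly load: (key: string) => Promise<T>;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(
    load: (key: string) => Promise<T>,
    ttlMs: number = 30_000, // 30s TTL, as in the commit
    now: () => number = Date.now, // injectable clock, handy with fake timers
  ) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string, opts: { force?: boolean } = {}): Promise<T> {
    const hit = this.entries.get(key);
    if (!opts.force && hit !== undefined && hit.expiresAt > this.now()) {
      return Promise.resolve(hit.value); // fresh cache hit
    }
    // Coalesce: concurrent callers share the probe that is already running.
    // Assumption: a forced refresh also joins an in-flight probe, since that result is fresh.
    const pending = this.inFlight.get(key);
    if (pending !== undefined) return pending;

    const started = this.load(key)
      .then((value) => {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key); // cleared on success and on failure
      });
    this.inFlight.set(key, started);
    return started;
  }
}
```

Keying the maps by instance id gives the per-instance isolation the telemetry tests mention, and failed loads are never cached, so the next caller retries the probe.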
Hermes VM	3ee4e7104e	fix(dashboard): Phase 5 P0 — correct CI workspace path + real ESLint - ci.yml: actions/checkout into the runner workspace instead of cd-ing into a hard-coded host path and `git reset --hard origin/main` on the live checkout; install via `pnpm install:gitea` (self-contained, no sibling common-plat checkout); E2E step left as a TODO pointer (ci-e2e-hardening, Phase 5 P2). - Fix the same stale /opt/bytelyst/bytelyst-devops-tools path in deploy.sh, scripts/deploy-hotcopy.sh, DEPLOYMENT.md, DEPLOYMENT_GUIDE.md. - Replace the no-op `lint` echoes with real ESLint 9 flat configs (js + typescript-eslint recommended) for backend and web; add a root `pnpm lint`. - Fix the 10 errors lint surfaced, incl. require('os') in an ESM backend (system/repository.ts -> import * as os), prefer-const x4, and a ternary expression-statement in web vm/page.tsx. Verified locally: secret-scan, lint (0 errors; correctly fails on bad code), typecheck, unit tests (backend 9 / web 11), and build all green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 06:50:32 +00:00
saravanakumardb1	51e4c5f271	Merge PR #5: Phase 2 Slice 3 — factory-agent integration Wires agent-queue.sh to the fleet coordinator behind AQ_FLEET=1 (offline path unchanged when off). Fencing-aware (stale leaseEpoch -> self-abort + quarantine), offline-degrade, tracker echo via fleet_events. selftest 60/60 (53 prior + 7 new); token env-sourced; no bodyMd/token leak (asserted).	2026-05-29 22:50:56 -07:00
saravanakumardb1	21ebf8b1b7	docs(agent-queue): fleet integration section + roadmap P2-S3 ticks README: "Fleet integration (Phase 2)" — AQ_FLEET flag, env table, claim/heartbeat/report/fence/renew protocol, offline-degrade + quarantine, offline-vs-fleet explainer. Roadmap: tick the Phase-2 §14 factory-agent item, add a P2-S3 slice note, bump §0 Phase 2 -> 55%.	2026-05-29 22:45:44 -07:00
saravanakumardb1	064dbf3d8f	test(agent-queue): fleet integration selftest cases (P2-S3) Adds 7 stub-driven fleet cases (AQ_FLEET_API_CMD stub, no live coordinator); never weakens the prior 53 (full suite now 60 green): - flag OFF (default): zero coordinator calls; offline job completes unchanged - register(heartbeat)+claim -> coordinator job materialized + executed to review/ - report+checkpoint: PATCH carries stage+leaseEpoch (+ wipBranch on building) - FENCING: stale-epoch 409 -> self-abort + quarantine (never shipped) - lease renew (unit): POST .../lease/renew with current leaseEpoch - offline-degrade: coordinator 5xx -> job completes locally (degraded), not quarantined - no-leak: bodyMd/token never appear in report payloads	2026-05-29 22:45:44 -07:00
saravanakumardb1	1d84712b47	feat(agent-queue): wire runner to fleet coordinator at minimal hook points (P2-S3) Sources lib/fleet-client.sh and adds a few fleet_enabled-gated hooks so the offline git-queue path is byte-for-byte unchanged when AQ_FLEET is unset/0: - cmd_run: register at loop start; per-iteration heartbeat (cadence) + lease renew for in-flight fleet jobs + claim one coordinator job into inbox when capacity. - meta: persist fleet_job_id + fleet_lease_epoch (from claim frontmatter). - run_worker: report `building` (with WIP checkpoint) after WIP setup and `review` before accepting the agent's output — a FENCED (stale-epoch/409) report self-aborts and quarantines (never ships); 5xx/unreachable degrades (finish locally). - _auto_echo: for fleet jobs route the outcome echo through the coordinator (fleet_events) instead of the direct tracker echo; offline jobs unchanged. - cmd_ship: fence-check before shipping a fleet job; release lease after. - status: show factory id + per-job fleet=<id>@e<epoch>; insights lists fleet_* fields. - dispatch + help: `fleet-status` command + a FLEET env section.	2026-05-29 22:45:44 -07:00
saravanakumardb1	a10d4003e6	feat(agent-queue): fleet coordinator client library (lib/fleet-client.sh, P2-S3) New sourced library implementing the factory side of the Phase-2 `fleet` coordinator contract — curl-only + POSIX awk, reusing the Slice-4 HTTP/JSON helper patterns, no new deps. Every function is a no-op unless AQ_FLEET=1. - fleet_enabled / fleet_api (AQ_FLEET_API_CMD test seam) / _fleet_call - fleet_detect_caps (reuses detect_capabilities) -> JSON caps array - fleet_heartbeat (+ _maybe cadence): registration == first heartbeat - fleet_claim: POST /fleet/claim, parse job id/bodyMd/leaseEpoch, materialize a transient local .md (fleet-job-id + fleet-lease-epoch in frontmatter) - fleet_report: PATCH fenced stage transition {stage, leaseEpoch, checkpoint?}; returns ok / FENCED(2, stale epoch -> self-abort) / degraded(1, unreachable) - fleet_lease_renew / fleet_lease_release / fleet_renew_active (fenced) - fleet_quarantine: park a reclaimed (fenced) job in failed/ for human triage - cmd_fleet_status: register + print factory identity/caps Report payloads carry only stage/epoch/checkpoint — never prompt/bodyMd/token.	2026-05-29 22:45:44 -07:00
saravanakumardb1	10395983e7	docs(agent-queue): draft parallel P2 prompts — scheduler/router core (§7) + fleet artifacts blob wiring (§13)	2026-05-29 22:32:41 -07:00
Hermes VM	a8dd166108	docs: add Hermes dashboard v2 roadmap + CI/E2E delegation brief Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
Hermes VM	13a105ba23	feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert vm-health-check.sh: - check_gpu(): nvidia-smi probe; "CPU-only" OK on this VM (no GPU) - check_image_freshness(): flag containers running images >30d old. Skips third-party images (gitea, grafana, prom, mcr.microsoft, axllent, caddy, traefik, valkey, cadvisor) — they have their own rebuild cadence. Currently flags 19 stale product images (~60d old). chaos-validation.sh: - Monthly chaos test: kill PID 1 in chronomind-web, wait up to 35 min for docker-health-watchdog to detect + restart. Telegram pass/fail. - Refuses to run if target not healthy. systemd timer fires 1st of month at 10:00 UTC (after 08:00 weekly digest). vm-io-anomaly-check.sh: - 6h avg sda write rate; transition alerts at WARN (1 GB/hr) / CRIT (2.5 GB/hr). De-dupes via /var/log/vm-io-anomaly-state so the alert fires once per transition, not every 6h. Current baseline: ~1.94 GB/hr (orphan-container state-file writes; see Phase 0.3). - Reports recovery to OK when rate drops back. vm/page.tsx: gpu + image_freshness added to CHECK_META so they render with proper icon/label and slot into CHECK_ORDER. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
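The transition-only alerting that commit 13a105ba23 describes for `vm-io-anomaly-check.sh` (alert once when the level changes, including recovery to OK, instead of on every 6h run) can be sketched as a small state transition. The actual check is a shell script that persists its state in `/var/log/vm-io-anomaly-state`; the TypeScript below is illustrative only. `levelFor`, `evaluateIoRate`, and the message wording are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption.

```typescript
type IoLevel = "OK" | "WARN" | "CRIT";

// Thresholds from the commit message, in GB/hr of averaged sda writes.
const WARN_GB_PER_HR = 1;
const CRIT_GB_PER_HR = 2.5;

function levelFor(rateGbPerHr: number): IoLevel {
  if (rateGbPerHr >= CRIT_GB_PER_HR) return "CRIT"; // assumption: inclusive bound
  if (rateGbPerHr >= WARN_GB_PER_HR) return "WARN";
  return "OK";
}

interface IoCheckResult {
  level: IoLevel;       // new level to persist as state
  alert: string | null; // message to send, or null when the level is unchanged
}

// Compare the new level with the previously persisted one and alert only on change.
function evaluateIoRate(previous: IoLevel, rateGbPerHr: number): IoCheckResult {
  const level = levelFor(rateGbPerHr);
  if (level === previous) return { level, alert: null };
  const rate = rateGbPerHr.toFixed(2);
  const alert =
    level === "OK"
      ? `I/O recovered to OK (${rate} GB/hr)`
      : `I/O ${previous} -> ${level} (${rate} GB/hr)`;
  return { level, alert };
}
```

At the ~1.94 GB/hr baseline quoted in the commit, the first run from `OK` would alert at `WARN`, and later runs at the same rate would stay silent.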
Hermes VM	76ef17f26b	feat(vm): Phase 2.3 closure — OOM watchdog + orphan-container docs OOM watchdog: - vm-oom-watchdog.sh — scans journalctl -k since cursor for oom-kill, killed-process, and "out of memory ... killed" entries; maps cgroup hits back to container names via docker inspect; posts a single Telegram alert per scan window (no dedupe needed — cursor advances on every run). Cursor at /var/log/vm-oom-cursor, log at /var/log/vm-oom-watchdog.log. - Systemd: OnBootSec=10min, OnUnitActiveSec=1h, Persistent=true. Orphan containers (no compose file on disk): - trading-backend → docker update --memory=768m (high-I/O bot) - gitea-npm-registry → docker update --memory=512m - orphan-containers.md captures canonical configs for recovery (env, mounts, networks, restart policy, memory limits). Closes Phase 2.3 (post-monitoring) and Phase 3.3 (orphan limits). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
Hermes VM	253e888a24	feat(infra): Phase 2.3 — memory limits across all active Docker stacks Apply deploy.resources.limits.memory to 45 services across 5 compose files. Limits take effect on next docker compose up (no running containers affected). Limits derived from 2-day Prometheus RSS baseline (avg of 2026-05-27-29): common_plat ecosystem (37 services): cosmos-emulator: 1g (319 MiB baseline, can spike on writes) loki: 384m (75 MiB) prometheus: 384m (91 MiB, grows with series cardinality) node-exporter: 128m (21 MiB, very stable) cadvisor: 256m (38 MiB) valkey: 128m (tiny) caddy: 256m (35 MiB) platform-service: 512m (61 MiB) extraction-service: 512m (99 MiB, Python sidecar) mcp-server: 384m (21 MiB) product backends: 512m (30-65 MiB each) product webs: 512m (35-93 MiB each) llmlab-dashboard: 512m (Ollama proxy, larger cache budget) dashboard (2 services): backend 512m, web 512m invttrdg (2 services): backend 768m (159 MiB + heavy state writes), web 256m (nginx SPA) clock/chronomind (2 services): backend 512m, web 512m notes/notelett (2 services): backend 512m, web 512m Ollama host process has NO limit (model load unpredictable, up to 8 GB). trading-backend compose file not on disk — limit not applied. gitea-npm-registry started manually — limit not applied. Monitor OOMKill for 48h after next stack restart: dmesg | grep -i oom Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
Hermes VM	42c3b9cdd5	feat(dashboard/vm): Phase 3.3 — All Containers panel with CPU/RAM, logs, bulk restart - repository.ts: getAllContainers() — batch docker inspect + docker stats --no-stream merged by container name; returns state, health, uptime, CPU%, RAM, memLimitMiB (0=no limit), restart count, stack from compose label; getContainerLogs() — docker logs --tail --timestamps - routes.ts: GET /api/vm/containers (all, with stats; ~3s for 38 containers), GET /api/vm/containers/:name/logs?lines=N - api.ts: ContainerInfo interface; vmApi.getAllContainers(), vmApi.getContainerLogs() - vm/page.tsx: ContainersPanel — collapsible (lazy-loads on first open); filter chips (All/Running/Unhealthy/No Limit) + stack dropdown; per-row log viewer (inline pre, dark bg, 50-line tail); per-row restart button; bulk "Restart N unhealthy" with confirmation modal; Fragment key pattern for row+log-row pairs I/O anomaly (Phase 0.3) root cause identified: invttrdg-backend and trading-backend write bot_state.json + .bak on every market tick (5×/min and 2×/min respectively) into container overlay layer → ~6 GB/day — intentional bot behaviour, no fix needed, trend chart already in place to monitor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
Hermes VM	8d32cb7980	feat(dashboard/vm): Phases 4.1-4.3 — Prometheus trends, sparklines, weekly digest - prometheus.ts: new Prometheus client with 7d/30d range queries for disk, memory, swap, CPU steal, and disk I/O (GB/hr); getWeeklyDigestData() aggregates all metrics for digest and API endpoint - routes.ts: GET /api/vm/metrics/trend?metric=…&range=… and GET /api/vm/weekly-digest endpoints - api.ts: TrendPoint/TrendSeries types; getTrend() and getMemoryTrend() added to vmApi - vm/page.tsx: Sparkline (pure SVG polyline+fill), TrendCard with latest/avg/peak and threshold colouring, TrendsPanel with lazy load on first open; Promise.allSettled() isolation for all 5 data panels - vm-weekly-digest.sh: weekly Telegram digest via docker exec into devops-backend to reach Prometheus; emoji severity indicators; cron summary from /var/log/vm-cleanup.log - systemd timer: Mon 08:00 UTC, Persistent=true (fires on next boot if missed); first trigger 2026-06-02 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
saravanakumardb1	9a073ef225	docs(agent-queue): draft P2-S3 factory-agent integration prompt (claim/heartbeat/report/fence behind AQ_FLEET)	2026-05-29 22:03:12 -07:00
saravanakumardb1	7d6275f935	Merge origin/main (CI + installers) into Slice 4 merge	2026-05-29 21:59:57 -07:00
saravanakumardb1	2ad5c6dee5	Merge PR #4: Phase 1 Slice 4 — tracker adapter (task <-> job round-trip) Closes Phase 1. selftest 53/53 (46 regression + 7 new); one-way echo sends metrics only (no prompt/secrets — asserted); token env-sourced; Items API contract (GET/PATCH status/POST comments, bearer + X-Product-Id).	2026-05-29 21:58:24 -07:00
saravanakumardb1	8ae504ca30	docs(agent-queue): tracker integration + close Phase 1 §10/§14 adapter (P1-S4) README: Tracker integration section (from-tracker/to-tracker, env config, label->manifest table, one-way-echo rule, AQ_TRACKER_AUTO, real-use note). Roadmap: tick §10 Phase-1 adapter items + the §14 tracker-adapter item; add P1-S4 slice note; §0 Phase 1 -> 95% (remaining: budget.wall + Node dash surfacing).	2026-05-29 21:35:16 -07:00
saravanakumardb1	1e0a17bbc0	test(agent-queue): tracker adapter selftest cases (P1-S4) Adds (never weakens) 7 stub-driven cases (AQ_TRACKER_API_CMD stub, no live service): from-tracker create + label mapping + idempotent; to-tracker shipped echo (PATCH done + metrics comment, asserts NO prompt body sent) + idempotent; HTTP 500 non-fatal; AQ_TRACKER_AUTO auto-echo on run. Full suite green (53 checks).	2026-05-29 21:35:16 -07:00
saravanakumardb1	b7a9ea1b7a	feat(agent-queue): tracker adapter — task <-> job round-trip (P1-S4) Implements §10 single-host tracker integration, closing the last Phase-1 §14 item: - tracker_api: one curl-only HTTP wrapper (base URL + bearer + productId header), overridable via AQ_TRACKER_API_CMD so tests need no live service. Emits the response body + a trailing HTTP-code line; _api_call splits into API_BODY/API_CODE. - aq from-tracker <ITEM_ID>: GET the Item, map title/description -> job body, labels (engine-class:/profile:/priority:/cap:) + Item priority -> frontmatter, and stamp tracker-item + a stable idempotency-key tracker-<id>. Materializes a .md into inbox/ via cmd_add; idempotent (Slice 1 dedupe) so a re-pull never dups. JSON parsed with POSIX awk (no jq) — mac + linux safe. - aq to-tracker <job>: one-way echo (child -> tracker, §24.5). PATCHes the Item status (building/review/testing->in_progress, shipped->done, failures->wont_fix, all overridable) and posts a metrics-only comment (result/attempts/duration/ tokens/cost/diff — NEVER prompt content or secrets). Idempotent via meta tracker_echoed; an echo failure (e.g. HTTP 500) is logged and non-fatal — the tracker is downstream, never authoritative for execution. - Opt-in auto-echo (AQ_TRACKER_AUTO=1, default OFF): the worker echoes on each transition (building via cmd_run, review/testing/failed via run_worker, shipped via ship/promote); never blocks or fails a job. - status + insights surface tracker-item and the last echoed status. curl-only HTTP; no new runtime deps; conventional + backward-compatible.	2026-05-29 21:35:06 -07:00
Saravanakumar D	5a278ad119	ci: add GitHub Actions CI (shellcheck, syntax, preview) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-29 21:31:00 -07:00
Saravanakumar D	efe0da3169	chore(devops): add cross-platform runners and README; normalize EOLs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-29 21:26:47 -07:00
Saravanakumar D	b6dc0768e3	chore(devops): finalize CLI install report and helper Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-29 21:20:52 -07:00
Saravanakumar D	9e28d85a64	chore(devops): update CLI install report and add symlink helper Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-29 21:20:52 -07:00
saravanakumardb1	d0348f23de	docs(agent-queue): P0 atomic-claim resolved (PR #29) — tick §4/§13/§14 fleet items	2026-05-29 21:05:38 -07:00
saravanakumardb1	2e9bd4dd1e	docs(agent-queue): record P2 Foundation merged + track P0 atomic-claim hardening (§4) - §4: implementation-status note — fleet module merged (PR #28); atomic claim NOT yet concurrency-safe (rev-CAS over unconditional write, sequential-only test) - add phase2-atomic-claim-hardening.md: updateIfMatch in @bytelyst/datastore (Cosmos If-Match + process-atomic memory) + concurrent claim tests	2026-05-29 20:43:28 -07:00
saravanakumardb1	0e94705ab7	docs(agent-queue): draft Phase 2 Foundation long-run prompt (fleet module + coordinator: claim/lease/fencing/reaper)	2026-05-29 19:54:33 -07:00
saravanakumardb1	d43cab8afe	Merge PR: Phase 1 Slice 2 — profiles (persona/caps/scope inheritance) + deps/DAG (blocked, cycle detection) Reviewed against §5/§6 single-host scope; selftest 46/46 (34 regression + 12 new); profile resolution precedence, persona injection, warn-only scope, deps soft/cycle verified.	2026-05-29 19:49:28 -07:00
saravanakumardb1	e183919c60	docs(agent-queue): profiles + deps docs; tick §5/§6 + bump Phase 1 to 80% (P1-S2) README: Profiles & deps section (resolution precedence, persona, allowed-scope warn-only, deps/blocked + cycle detection); manifest table moves profile/deps/deps-mode to active. Roadmap: tick §6 catalog/persona/inheritance/allowed-scope and §5 deps + the §14 profile/deps/scope boxes; add P1-S2 slice note; §0 Phase 1 -> 80%.	2026-05-29 19:26:33 -07:00
saravanakumardb1	71d8a7cd4e	test(agent-queue): profiles + deps/DAG selftest cases (P1-S2) Adds (never weakens) temp-catalog + temp-git cases: profile verify inheritance + job-override precedence, persona-injection golden, profile capability inheritance, allowed-scope warn-only + path_in_scope unit, deps block->run, deps-mode soft (testing/), and submit-time cycle rejection. Full suite green (46 checks).	2026-05-29 19:26:26 -07:00
saravanakumardb1	f2dabdeb81	feat(agent-queue): starter profile catalog (P1-S2) profiles/<name>.md presets (name, persona, capabilities, default-verify, engine-class, prefers-engine, allowed-scope, review-policy) for developer, backend-engineer, frontend-engineer, ux-designer, ui-designer, qa, reviewer, docs-writer, and a reserved planner.	2026-05-29 19:26:26 -07:00
saravanakumardb1	3d99f04427	feat(agent-queue): profiles (persona + presets) and single-host deps/DAG (P1-S2) Implements roadmap §6 (profiles) and §5 deps on the bash runner, backward-compatible (jobs without profile/deps behave exactly as before). Profiles (§6): - profile_get / profile_persona / fm_eff helpers + PROFILES_DIR (AGENT_QUEUE_PROFILES override). A job's `profile:` inherits verify (<- default-verify), capabilities, engine-class, prefers-engine, allowed-scope, review-policy when the job omits them; job fields always override (precedence job > profile > default). Resolution runs via fm_eff inside the capability gate and resolve_engine, so inherited caps/engine-class take effect before launch. - persona injection: the profile's persona block is prepended to the stripped body fed to the engine (job .md unchanged on disk; nothing secret logged). - allowed-scope guardrail (WARN-ONLY): scope_check logs a non-blocking WARNING + records scope_warning= for changed paths outside the globs; path_in_scope is a pure, unit-testable matcher (`dir/**` = subtree). deps / DAG, single host (§5): - deps reference other jobs by idempotency-key. dep_satisfied: shipped/ (hard) or shipped/+testing/ (deps-mode: soft). deps_unmet drives a block-with-reason skip in inbox selection (never launched/failed); cmd_status surfaces "blocked (waiting on <keys>)". deps_would_cycle rejects cyclic submits on `add`. - _drain_pending: `--once` drains past dep-blocked jobs (idle can't satisfy them) while still waiting on retry/recovery backoff timers. Meta now records effective (inherited) capabilities/engine-class/prefers-engine/ review-policy/allowed-scope so `status` reflects resolved config.	2026-05-29 19:26:16 -07:00
saravanakumardb1	7c4f5bc9b0	docs(agent-queue): draft Slice 4 (tracker adapter) + Phase 2 Slice 1 (fleet data model)	2026-05-29 19:11:09 -07:00
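
Commit 3d99f04427 above describes `path_in_scope` as a pure, unit-testable matcher in which `dir/**` means the whole subtree. The following is a minimal sketch of what such a matcher could look like in POSIX shell. It is not the repository's implementation: the argument order (path first, then globs), the handling of globs that do not end in `/**`, and the 0/1 return codes are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a pure scope matcher (assumed interface, not the repo's code).
# Usage: path_in_scope <path> <glob>...
# Returns 0 if <path> matches any glob, 1 otherwise.
path_in_scope() {
  path=$1
  shift
  for glob in "$@"; do
    case $glob in
      */\*\*)
        # `dir/**` form: match anything underneath dir/ (the subtree).
        prefix=${glob%/\*\*}
        case $path in
          "$prefix"/*) return 0 ;;
        esac
        ;;
      *)
        # Any other glob is used as a plain shell pattern.
        # Note: in a `case` pattern, `*` also matches across `/`.
        case $path in
          $glob) return 0 ;;
        esac
        ;;
    esac
  done
  return 1
}
```

Because the function only inspects its arguments, it can be exercised without a git checkout, for example `path_in_scope src/a/b.ts 'src/**' && echo in-scope`. A warn-only guardrail like the one the commit describes would log a warning when the function returns 1 rather than fail the job.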

306 Commits