Compare commits

..

65 Commits

Author SHA1 Message Date
Hermes VM
0e1905aa33 docs: document local LLM utility workflows
Some checks failed
pre-commit / pre-commit (push) Failing after 33s
2026-05-28 00:21:06 +00:00
Hermes VM
44fd6a462a fix: bind DevOps dashboard ports to loopback
Some checks failed
pre-commit / pre-commit (push) Failing after 27s
2026-05-27 21:55:46 +00:00
Hermes VM
f936c2231c docs: record product port hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 25s
2026-05-27 21:53:08 +00:00
Hermes VM
7047d625ef feat(dashboard/vm): Phases 1.1, 1.3, 3.1, 3.4 — VM page panels
Phase 3.1 — VM Score Card (0–100):
- 6 weighted dimensions: steal time, RAM, disk, service health,
  maintenance hygiene, LLM readiness (matching roadmap scoring)
- Color-coded gauge + per-dimension progress bars with detail text
- Computed from health + cron + unhealthy data; degrades gracefully
  when any source is unavailable

Phase 1.3 — Unhealthy Container Detail Panel:
- Loads independently from GET /api/vm/containers/unhealthy
- Per-container: name, unhealthy since, restart count, last health logs
- Expandable row for health check output
- One-click restart with spinner, feedback toast, auto-refresh after 3s

Phase 1.1 — Cron Status Panel:
- Loads from GET /api/vm/cron-status
- Table: 4 managed jobs × schedule | last run | freed MB | status | next
- Collapsible run history (last 10) with step-by-step log expansion

Phase 3.4 — Ollama/LLM Panel:
- Loads from GET /api/vm/ollama/models
- Currently-loaded section with RAM pressure warning (<4 GB free)
- RAM bar visualisation showing model footprint
- Model list with size + last-used time
- One-click unload button

Other improvements:
- All data fetched in parallel (Promise.allSettled) — any panel failure
  does not block the rest of the page
- Add steal, failed_units, cron_missing_paths to CHECK_META/CHECK_ORDER
- Refresh now updates all 5 data sources atomically
- web/package-lock.json regenerated (was stale, caused build failure)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:49:23 +00:00
Hermes VM
b15c570587 docs: record common-platform port hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 37s
2026-05-27 21:32:31 +00:00
Hermes VM
d9618ba7b0 feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog
Phase 1.2 — CPU steal time metric in vm-health-check.sh:
- Samples /proc/stat twice 1s apart for accurate current steal %
- Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host)
- Inserts before memory check so steal is visible alongside load

Phase 1.4 — Swap pressure indicator:
- Reads SwapCached from /proc/meminfo as secondary metric
- Raises SWAP_USED_WARN_GB 1→1.5 to reduce noise (current usage 0.6G)
- New WARN path: SwapCached > 200MB signals recent pressure even when
  current swap usage looks ok (catches post-spike state)

Phase 2.1 — Docker health-check watchdog:
- docker-health-watchdog.sh: checks unhealthy containers every 10 min,
  restarts only after 3 consecutive failing health checks (30min grace)
- docker-health-watchdog.service + .timer: enabled, fires every 10 min
- Sends Telegram notification on each auto-restart
- Rollback: systemctl disable docker-health-watchdog.timer

Phase 2.2 already complete: sync_hermes_persistent_backup.py handles
diverge gracefully with rebase/reset-hard fallback; running successfully.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:31:09 +00:00
Hermes VM
d60c81ebda docs: record internal port loopback hardening
Some checks failed
pre-commit / pre-commit (push) Failing after 38s
2026-05-27 21:25:38 +00:00
Hermes VM
2fc23d6baa feat(vm): fix devops-backend VM module — Phase 0.1 complete
- Switch backend runner from node:20-alpine to node:20-slim so GNU df
  flags (--output=pcent/avail) work inside the container
- Add volume mounts to docker-compose.yml: scripts (ro), VM logs (rw),
  docker.sock; set VM_SCRIPTS_PATH + VM_LOG_DIR env vars
- Rebuild repository.ts: env-configurable paths, cron history parser,
  unhealthy-container inspector, Ollama model endpoints
- Add routes: GET /api/vm/cron-status, unhealthy containers, Ollama
  models, container restart, model unload
- vm-cleanup.sh: add step_cosmos_pglog, step_docker_aged_images; fix
  (( count++ )) → count=$(( count + 1 )) for set -e compatibility
- Add docs/VM_OBSERVABILITY_ROADMAP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:13:45 +00:00
Hermes VM
5a2d92f519 docs: record VM container health fix
Some checks failed
pre-commit / pre-commit (push) Failing after 33s
2026-05-27 21:12:45 +00:00
e2db92f3b1 Add Hermes snapshot diff view 2026-05-27 21:05:57 +00:00
8f522e3505 Add Hermes dashboard improvement backlog 2026-05-27 21:02:23 +00:00
Hermes VM
9210a8890f feat: detect stale VM automation
Some checks failed
pre-commit / pre-commit (push) Failing after 32s
2026-05-27 21:00:43 +00:00
Hermes VM
3d5f369f3d docs: record Gitea runner recovery
Some checks failed
pre-commit / pre-commit (push) Failing after 40s
2026-05-27 20:58:16 +00:00
Hermes VM
1f2eea8268 docs: record VM backup and cron fixes
Some checks failed
pre-commit / pre-commit (push) Has been cancelled
2026-05-27 20:56:11 +00:00
90f6db2014 Complete Hermes ops dashboard and roadmap 2026-05-27 20:53:58 +00:00
Hermes VM
e3d1dddf51 docs: add VM exposure inventory
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:51:27 +00:00
98a7915a38 Reconcile Hermes roadmap and dashboard status 2026-05-27 20:46:16 +00:00
ac79591903 Mark web search tooling complete 2026-05-27 20:46:16 +00:00
Hermes VM
313a775fa0 docs: strengthen VM security roadmap gates
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:34:37 +00:00
Hermes VM
2c125adb05 docs: add VM security blind spots roadmap
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 20:21:52 +00:00
c89018ae47 Tighten Telegram fallback wording 2026-05-27 20:18:46 +00:00
8145484136 Verify Telegram fallback platform context 2026-05-27 20:16:30 +00:00
8da66497cc Tighten Hermes local fallback chain 2026-05-27 19:58:09 +00:00
3e26f0da31 Close Hermes browser and web backend items 2026-05-27 19:23:55 +00:00
root
d1f234fc01 Mark Firecrawl as locally configured 2026-05-27 18:57:50 +00:00
Hermes VM
70d96d7684 feat: add gitea backup timer assets 2026-05-27 18:53:20 +00:00
Hermes VM
147db72330 docs: add hostinger maintenance operations entry 2026-05-27 18:53:20 +00:00
Hermes VM
31b414d62b fix: systematic bug fixes — code-quality parser, env key, config warnings, auth cleanup, deployment safety
- code-quality/repository.ts: fix tsErrorMatch[3] → [4] for type field (group 3 is column, 4 is error|warning)
- code-quality/repository.ts: fix ESLint regex to make rule brackets optional (not all formatters include them)
- code-quality/repository.ts: fix Vitest test count — parse 'Tests' line (individual tests) instead of 'Test Files' (file count); improve Jest regex to capture pass/fail independently
- env/repository.ts: replace raw process.env.ENCRYPTION_KEY with config.ENCRYPTION_KEY so the validated default flows through a single source of truth
- config.ts: add startup console.warn when CSRF_SECRET or ENCRYPTION_KEY are using insecure defaults
- deployments/orchestrator.ts: refactor runDeploymentScript to use try/catch/finally — deployment record is now always written in the finally block, preventing zombie 'running' states if updateDeployment itself throws
- auth.tsx: remove dead 'user &&' guard (user is always truthy after the !user check above); remove debug console.log calls, keep console.error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
cdc23696b2 fix: resolve all TypeScript errors — green tsc
Primitives.tsx (TS2339):
- asChild branch read children.props.className before the cast applied,
  making props typed as unknown. Extract typedChild first, then read props.

hermes/page.tsx + agents/page.tsx + tasks/page.tsx + tasks/[id]/page.tsx (TS2322):
- Badge.variant accepts 'neutral'|'success'|'warning'|'error'|'info' but
  callers were passing 'danger' (should be 'error') and 'default' (should
  be 'neutral'). MetricCard.tone is a separate type and is correct as-is.

Changes:
- statusTone map in hermes/page.tsx: 'danger' → 'error', 'default' → 'neutral'
- getTaskTone fallback: 'default' → 'neutral'; explicit return type added
- levelTone in tasks/[id]/page.tsx: 'danger' → 'error'; explicit return type added
- Inline Badge variants: all remaining 'danger' → 'error' across 3 files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
1099d518ef improve: dashboard security, code quality, and UX fixes
Security (backend):
- env/routes: add requireAdmin to all 6 env endpoints — GET /env was
  fully open, exposing all secret values to unauthenticated requests
- deployments/routes: add requireAdmin to all 4 GET endpoints (deployment
  history and logs were publicly readable)
- health/routes: remove duplicate requireAdmin call from DELETE /health/cache
  handler body (was already enforced via preHandler)

Frontend — auth/api:
- system/page: replace raw fetch + localStorage token with apiRequest
  (mutations now go through CSRF flow)
- vm/page: same — replace raw fetch with vmApi from api.ts
- api.ts: add vmApi (getHealth, getCleanupLog, runCleanup) + shared
  VmHealthResult / VmCheck / VmCheckLevel types

Shared utilities:
- utils.ts: add formatBytes() and getStatusColor() shared helpers
- system/page: remove duplicate formatBytes, import from utils
- health/page: remove duplicate getStatusColor, import from utils
- page.tsx (home): remove duplicate getStatusColor, import from utils

UX improvements:
- page.tsx: remove Seed Services button from normal header (debug tool)
- page.tsx: deploy button now always enabled; shows inline warning banner
  when service is not 'up' instead of silently disabling the button
- metrics: fix bar chart — bars now grow from bottom (flex-col-reverse),
  add empty state, fix date parsing timezone edge case
- sidebar-nav: theme toggle now functional — persists to localStorage and
  toggles document.documentElement class 'dark'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
d0b8ce2c74 feat: add VM Health page to devops dashboard
Backend (Fastify):
- New module: modules/vm/ (types, repository, routes)
- GET  /api/vm/health      — runs vm-health-check.sh --json, returns structured result
- GET  /api/vm/cleanup-log — tails /var/log/vm-cleanup.log
- POST /api/vm/cleanup     — triggers vm-cleanup.sh (weekly / monthly / dry-run)
- Registered vmRoutes in server.ts

Frontend (Next.js):
- New page: /vm — VM Health
  - Overall status banner (OK/WARN/CRIT) with issue summary
  - Per-check cards: disk, load, RAM, swap, crash loops, container health,
    build cache, docker images, journal, syslog — color-coded by level
  - Cleanup trigger buttons (dry-run, weekly, monthly) with output viewer
  - Collapsible cleanup log viewer (last 40 lines)
  - Auto-refresh every 60s
- sidebar-nav.tsx: added 'VM Health' entry with Server icon

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
678430d77d fix: cron health-check entry should call vm-health-check.sh --notify
The 07:00 daily cron was incorrectly pointing to vm-cleanup.sh instead
of vm-health-check.sh. Health check is read-only; cleanup is not.
Also add --notify so Telegram alerts fire when WARNING/CRITICAL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
Hermes VM
0a2d303f93 add HostingerVM health-check and cleanup scripts
- vm-health-check.sh: read-only checks for disk, load, RAM, swap,
  Docker containers (crash-loops + healthchecks), build cache, journal.
  Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT.

- vm-cleanup.sh: safe periodic cleanup.
  Default (weekly): build cache, journal, apt, npm, .next/cache.
  --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup.
  --dry-run, --install-cron, --uninstall-cron.
  Logs to /var/log/vm-cleanup.log.

Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:53:20 +00:00
root
4249b17afc Document Firecrawl backend selection 2026-05-27 18:52:39 +00:00
root
08f32a79e8 Clarify remaining Hermes fallback verification 2026-05-27 18:46:32 +00:00
root
8fbb535d90 Add shared local Hermes fallback chain 2026-05-27 18:43:30 +00:00
root
9ee060e839 Harden Hermes operations dashboard status 2026-05-27 17:45:41 +00:00
root
0e6528b366 Add live Hermes operations dashboard 2026-05-27 13:04:36 +00:00
saravanakumardb1
babe2e6c13 docs(roadmap): v14 \xe2\x80\x94 ALL 20 ITEMS COMPLETE (C5 closed end-to-end)
C5 fully closed by:
1. Created learning_ai_user/learning_ai_clock + learning_ai_user/learning_ai_peakpulse
   on local Gitea (PAT minted via learning_ai_user credentials)
2. Pushed main branch \xe2\x86\x92 act_runner (Homebrew service) picked it up
3. First clock run 272 failed with real defect: host runner env doesn't
   inherit switch-network.sh exports. Fix landed in both pilots' ci.yml
   docker-lint job: explicit env: block + read token from
   ~/.gitea_npm_token at step time.
4. Verified green:
   - clock run 273 job 675 docker-lint \xe2\x86\x92 success
   - peakpulse runs 274 + 275 docker-lint \xe2\x86\x92 success

Roadmap final state: 20/20 items DONE.
2026-05-27 05:20:48 -07:00
root
3cc9a1456e Add Google Drive single file uploader 2026-05-27 12:19:45 +00:00
root
79ca56ffce Add Google Drive emergency bundle upload 2026-05-27 12:08:41 +00:00
saravanakumardb1
484c82c4b1 docs(roadmap): repair v13 \xc2\xa710 corruption + finalize C5 partial-validation note
A prior rebase merged the v13/v13.1 edits into \xc2\xa710 with mangled text
(steps 11\xe2\x80\x9320 out of order; step 10 garbled). Rebuilt the section
cleanly from v12 base + appended the new v13/v13.1 steps:

  11. Phase E1/E2/E5
  12. Phase B
  13. Phase B4 + E3/E4/E6
  14. Phase C (8/9; C5 partial)
  15. Phase D.1
  16. Phase D.2
  17. B7-4 AGENTS.md warnings
  18. Phase D extension (MindLyst, LysnrAI, talk2obsidian)
  19. Phase D.3 advisory cleanup
  20. C5 partial validation (this session)

Restored the lost "ported back to clock" trailing line for step 9.
No content changes beyond what was already documented in v13/v13.1.
2026-05-27 04:34:53 -07:00
saravanakumardb1
2d13ae4c54 docs(roadmap): v13.1 \xe2\x80\x94 C5 partial validation (Gitea hosting gap documented)
Findings from dummy check-in attempt:
- Pilot workflow YAML parses cleanly (6 jobs on clock incl. docker-lint)
- Local simulation of docker-lint job (gitea-doctor + docker-doctor)
  exits 0 on both pilots
- Pilot repos are NOT hosted on Gitea (`git push gitea` returns 404).
  Only `learning_ai_uxui_web` exists at localhost:3300
- Until pilot repos are mirrored to Gitea, the .gitea/workflows/ci.yml
  file ships but the runner never fires
- C5 marked as partial; gap recorded explicitly in \xc2\xa7Phase C and \xc2\xa710
2026-05-27 04:32:33 -07:00
root
bb15a225cd Add encrypted Hermes emergency bundle scripts 2026-05-27 11:31:58 +00:00
saravanakumardb1
e96b555f07 docs(roadmap): v13 \xe2\x80\x94 12/12 consumer repos PASS docker-doctor (Phase D extension + D.3)
Final-state summary:
- All 12 consumer repos now PASS docker-doctor with zero errors
- MindLyst + LysnrAI + talk2obsidian onboarded (was previously out of scope)
- docker-doctor learned Python Dockerfile detection
- 10 repos received advisory-warning cleanup commits (compose build.args
  + healthcheck.start_period)
- C5 (CI green confirmation) is the only remaining follow-up

The roadmap is now in a fully landed state for in-scope repos.
2026-05-27 04:27:15 -07:00
root
19fdba752c Add Hermes disaster recovery runbook 2026-05-27 11:23:07 +00:00
saravanakumardb1
ccd6ee4f7f docs(roadmap): v12 \xe2\x80\x94 all phases (A, B, C, D, E) complete for 9 consumer repos
- B7-4 AGENTS.md warnings landed in all 9 repos
- C9 web smoke test (Playwright) landed on clock to guard F11 regression
- D.2 per-repo Dockerfile/compose fixes applied to all 7 consumer repos
  via idempotent fixer; docker-doctor PASS on every consumer repo
- 3 non-consumer repos (MindLyst KMP, LysnrAI multi-target, talk2obsidian)
  remain out of scope; documented as follow-up
- C5 confirmation pending next Gitea CI run

Final status: 18 of 18 in-scope items complete.
2026-05-27 04:17:52 -07:00
root
547a9d00fa Clarify root GitHub credential ownership 2026-05-27 11:10:48 +00:00
saravanakumardb1
6a4e289edc docs(roadmap): v11 \xe2\x80\x94 Phases B4/E3/E4/E6 + C (7/9 gates) + D.1 (artifacts rolled out)
- B4: pre-commit guard + husky wiring landed
- E3/E4/E6: CI job + pre-commit warn-only + make doctor target
- C1\xe2\x80\x93C4, C6\xe2\x80\x93C8: verified on pilots; C5 pending CI, C9 deferred
- D.1: artifacts deployed to 7/9 consumer repos with per-repo findings table
- D.2: per-repo Dockerfile fixes captured as a fix matrix (follow-up work)
- All commit refs documented in \xc2\xa710 execution order
2026-05-27 04:07:27 -07:00
root
416f25794c Document Hermes Gitea token flow 2026-05-27 11:06:15 +00:00
saravanakumardb1
11c185e772 docs(roadmap): v10 — Phase B complete (canonical docker-prep + sync tooling)
- All B-tasks complete except B4 (husky hook) and B7-4 (AGENTS.md updates)
- Canonical home landed at common-plat@a418a23e
- Both pilots synced; end-to-end verified on clock + peakpulse
- 3 bonus capabilities documented (--check, portable sed, .gitkeep preservation)
- \xc2\xa710 execution step 12 marked done with commit refs
2026-05-27 03:49:52 -07:00
root
8de72351de Complete Hermes dashboard and watchdog roadmap audit 2026-05-27 10:45:29 +00:00
saravanakumardb1
15ac960faf docs(roadmap): v9 — Phase E1/E2/E5 done, docker-doctor.sh landed
- Marked E1, E2, E5 complete in Phase E checklist
- Added step 11 to \xc2\xa710 execution order with commit refs
- Renumbered remaining steps; deferred E3/E4/E6 to after Phase B
2026-05-27 03:33:35 -07:00
root
a6e509247f Record Tailscale login for Hermes 2026-05-27 10:31:23 +00:00
saravanakumardb1
ac364be6c3 docs(roadmap): mark A3 done, link ADR-0001 (lockfile policy accepted) 2026-05-27 03:28:19 -07:00
root
651df6aad8 Audit Hermes roadmap root and Uma progress 2026-05-27 10:28:10 +00:00
root
a10c0c318a Update Hermes roadmap late-pass progress 2026-05-27 10:24:56 +00:00
root
e57038a6a2 docs: advance Hermes setup roadmap
Some checks are pending
pre-commit / pre-commit (push) Waiting to run
2026-05-27 10:12:27 +00:00
saravanakumardb1
b00af09942 docs(docker): roadmap v8 — peakpulse Phase A done + A3 ADR-0001 accepted
Per § 10 steps 9 + 10.

Step 9: Peakpulse backend Phase A complete.
  cold 72.2 s, warm 2.7 s (96.3% reduction). Pattern from clock applied
  verbatim plus .docker-deps/.gitkeep discoverability fix back-ported
  to clock. Commits:
    peakpulse@11a6bc5  feat(docker): Phase A on peakpulse backend
    peakpulse@6523a1a  fix(docker): track .docker-deps/.gitkeep
    clock@1465e06b1    fix(docker): track .docker-deps/.gitkeep
    clock@d69003c1f    chore: dedupe .docker-deps in .gitignore

Step 10: A3 ADR accepted.
  New file: docs/adr/0001-docker-build-lockfile-policy.md
  Decision: short-term Option A (--lockfile=false) — already shipped in
  Phase A; long-term Option C (vendored pnpm-lock.docker.yaml). Migration
  triggered by production deployment, audit requirement, supply-chain
  incident, or loss of BuildKit cache. Implementation sketch in ADR § 4.

Roadmap doc updates:
  - § A7 metrics table: peakpulse row populated (72.2 s / 2.7 s).
  - § A3: collapsed bullet list into decision-record summary linking ADR.
  - § 10: steps 9 + 10 marked ; status banner v7 → v8.

Next per § 10: step 11 (Phase B docker-prep hardening) or step 12
(Phase E docker-doctor.sh linter). Phase E is higher-value as durable
insurance against F11/F13/F16/F17/F18 regressions across the ecosystem.
2026-05-27 02:54:08 -07:00
saravanakumardb1
062155b81e docs(docker): roadmap v7 — Phase A complete on clock, warm 2.9s/5.4s
Per execution \u00a7 10 step 10 (metrics in lockstep with implementation commits).

Updates:
  - \u00a7 A7 metrics table populated with cold + warm numbers post-A2:
    clock backend  59.2 s  \u2192  64.7 s cold  \u2192   2.9 s warm  (95.1% \u2193)
    clock web      193 s   \u2192  291 s* cold  \u2192   5.4 s warm  (97.2% \u2193)
    * web cold variance: 174-291 s across runs, dominated by tarball
      egress through corp proxy. Cold is not the optimization target.

  - \u00a7 A7 measurement commands now include the 'deps-changed' case
    (touch package.json) where A2's pnpm-store cache mount is the
    relevant speedup, since plain warm rebuild benefits from
    BuildKit layer cache regardless of A2.

  - \u00a7 10 execution order:
    steps 7 + 8 marked \u2705 with commit refs
    step  9 marked \u23f8 with explicit STOP-for-approval (peakpulse is
      a different repo)
    step  10 collapsed into this commit (was placeholder for A7 fill)

  - Status line: v6 \u2192 v7

Cross-repo commit chain (Phase A complete on clock):
  clock@0be887288       feat(docker): A0 \u2014 wire Gitea-registry path
  devops_tools@ba8b4d1  docs(docker): roadmap v5 \u2014 add F16
  common-plat@cfcfc7bb  fix(gitea): rewrite workspace:* (F16)
  common-plat@dd90f709  fix(gitea): ROOT_URL host.docker.internal (F17)
  clock@8b5c767a3       fix(docker): rewrite file: refs to * (F18)
  devops_tools@7627d55  docs(docker): roadmap v6 \u2014 A0-V PASSED
  clock@f6a806ff3       feat(docker): A8 + A9 + A1
  clock@55e8d22d3       feat(docker): A2 + A5 + A6
  devops_tools@HEAD     docs(docker): roadmap v7 (this commit)

Next: \u00a7 10 step 9 \u2014 peakpulse replication. Pause point for approval.
2026-05-27 02:16:07 -07:00
saravanakumardb1
7627d5526d docs(docker): roadmap v6 — F17 + F18 fixed, A0-V PASSED on clock
Resolves the A-pre phase entirely. Gitea-registry path now works
end-to-end on learning_ai_clock for both backend + web.

Findings added to § 0:
  F17: Gitea baked localhost:3300 in tarball URLs (Critical, FIXED)
  F18: clock/web/package.json had file: refs to sibling repo (High, FIXED)

Status updates:
  - A0-V on clock:  PASSED (3rd attempt, after F16/F17/F18 cleared)
    backend cold build: 59.2 s
    web cold build:     193 s (3:13)
  - A-pre fully complete (all 6 sub-items checked)
  - § 3.A7 metrics table populated with baseline numbers
  - § 10 execution order: steps 1-6 marked done; step 7 (A8+A9+A1) is next

New § A-pre-6: documents Gitea config requirements (ROOT_URL, /etc/hosts,
NO_PROXY) as one-time setup per dev machine. Required because the F17 fix
spans Gitea server config + host /etc/hosts + repo-side switch-network.sh.

Risk register: 4 new rows (F17 regression, F18 regression,
corp-proxy NO_PROXY drift, BASE_IMAGE override drift kept from v5).

Cross-repo commit chain (this run):
  clock@0be887288       feat(docker): A0 — wire Gitea-registry path
  devops_tools@ba8b4d1  docs(docker): roadmap v5 — add F16
  common-plat@cfcfc7bb  fix(gitea): rewrite workspace:* in tarballs (F16)
  common-plat@dd90f709  fix(gitea): ROOT_URL host.docker.internal (F17)
  clock@8b5c767a3       fix(docker): rewrite file: refs to * (F18)
  devops_tools@HEAD     docs(docker): roadmap v6 (this commit)

Next: § 10 step 7 — A8 (config-file COPY audit) + A9 (healthcheck IPv4)
+ A1 (corepack) on clock.
2026-05-27 01:54:14 -07:00
saravanakumardb1
ba8b4d1ace docs(docker): roadmap v5 — add F16 (registry workspace:* leaks)
Discovered during A0-V execution on learning_ai_clock (2026-05-27).

F16: At least 10 of ~50 published @bytelyst/* packages in the Gitea
registry have unrewritten 'workspace:*' refs in their package.json
dependencies. pnpm install inside Docker fails with
ERR_PNPM_WORKSPACE_PKG_NOT_FOUND because there is no workspace context
inside the container.

Confirmed broken (latest version each):
  @bytelyst/auth@0.1.5             → errors=workspace:*
  @bytelyst/diagnostics-client@0.1.6 → api-client=workspace:*
  @bytelyst/events@0.1.5           → queue=workspace:*
  @bytelyst/extraction@0.1.5       → api-client=workspace:*
  @bytelyst/fastify-auth@0.1.5     → errors=workspace:*
  @bytelyst/fastify-core@0.1.5     → errors=workspace:*   ← clock dep
  @bytelyst/feedback-client@0.1.6  → api-client=workspace:*
  @bytelyst/field-encrypt@0.1.6    → errors=workspace:*   ← clock dep
  @bytelyst/react-auth@0.1.6       → api-client=workspace:*
  @bytelyst/sync@0.1.5             → api-client, telemetry-client=workspace:*

Changes:
- § 0: bump count to 16; add F16 row (Critical severity)
- § 0 Implications: F16 blocks every A0-V; updated rationale
- § 3: insert new Phase A-pre (republish + publish-time guard) before A0
- § 3 A0-V: append blocked-status note linking to clock@0be887288
- § 10 Execution order: renumber; insert A-pre as step 3
- § 11 Risk register: add F16 row

Implementation status:
   Step 2 (A0 on clock) — committed in learning_ai_clock@0be887288;
     Dockerfile + compose changes correct, end-to-end build blocked on F16
  ⏸  Step 3 (A-pre) — next; touches common-plat publish flow
  ⏸  Step 4+ (A0-V retry on clock, then onward) — blocked on A-pre
2026-05-27 01:18:25 -07:00
saravanakumardb1
8025cd5d82 docs(docker): roadmap v4 — integrate Gitea hardening (F14, F15)
Adds 2 new findings to the docker build optimization roadmap and updates
templates to consume the new GITEA_NPM_OWNER env var shipped in common-plat
commit 610a59fd.

- F14: hardcoded Gitea owner literal across 14 repos (now resolved upstream
  via ${GITEA_NPM_OWNER:-learning_ai_user})
- F15: stale shell-env tokens (caught by scripts/gitea/doctor.sh)
- A0-1, A0-3, 7.1, 7.2, 7.5: snippets updated to thread GITEA_NPM_OWNER
  through .npmrc.docker, Dockerfile ARG/ENV, and docker-compose build args
- A0-D: new step — run gitea-doctor.sh as pre-build gate (replaces
  'wait 4 minutes for ERR_PNPM_AUTHENTICATION' with 'fail fast in 2 sec')
- Phase E: now distinguishes gitea-doctor (shipped) from docker-doctor (to
  build). Adds two new docker-doctor checks for F14
- Risk register: F14/F15 mitigations called out explicitly
2026-05-27 00:53:33 -07:00
saravanakumardb1
1a638a84e1 docs: roadmap v3 — incorporate review feedback (F11-F13, Phase E)
Review-driven additions:

- F11 added (silent UI breakage from missing/un-COPY'd postcss.config.mjs;
  4 repos hit this tonight: notes dff459e, jarvis_jr 36f6bc1,
  clock a308c6444, local_memory_gpt 07cdf6b)
- F12 added (healthcheck localhost → IPv6 false-fail; jarvis_jr incident)
- F13 added (enumerated COPY drift from filesystem; root cause of F11b)

Structural changes:
- New A8 (config-file COPY audit + glob pattern decision)
- New A9 (healthcheck IPv4 canonicalization)
- New A0-V verification gate (build Gitea path before optimizing)
- New § 2.5 canonical decisions (Alpine + ARG BASE_IMAGE override,
  127.0.0.1, --lockfile=false pending ADR)
- New § 7.5 canonical web Dockerfile (was missing, where F11 lives)
- New § 7.6 docker-doctor.sh skeleton
- New Phase E (docker-doctor.sh CI lint as durable insurance)
- B7 promoted from Phase D to Phase B proper (drift compounds)
- B4 husky hook extended to also block .tgz and .bak
- A0-1 env-var expansion chain explicitly documented
- A2-3 verification command corrected (docker buildx du, not docker history)
- Pilot order inverted: clock first (web + backend), then peakpulse
- C9 smoke test added (CSS bundle > 50 KB, F11 guard)
- 4 new risk-register rows for F11/F12/F13/BASE_IMAGE drift
2026-05-27 00:34:07 -07:00
saravanakumardb1
529d4f37f5 docs: add Docker build optimization roadmap (post-audit v2)
Captures audit findings on Dockerfile patterns across pilot repos
(peakpulse, clock):

- 10 concrete bugs documented (F1-F10): .dockerignore blocks
  pnpm-lock.yaml, sibling-workspace lockfile problem, .npmrc.docker
  inconsistencies, missing BuildKit cache mounts, etc.
- Phase A0 added: fix Gitea-registry path before optimizing
  (without it, the 'default' path doesn't actually work)
- Phase A1-A7: corepack, cache mounts, layer reordering, measurement
- Phase B: docker-prep.sh hardening (dry-run, idempotency,
  auto-restore, pre-commit guard)
- Phase C: 7 verification gates
- Phase D: deferred 11-repo rollout checklist
- ADR-pending lockfile policy decision (A3)
- Risk register + 6 open questions
2026-05-27 00:28:10 -07:00
86 changed files with 12319 additions and 529 deletions

2
.gitignore vendored
View File

@ -10,6 +10,8 @@ __pycache__/
venv/ venv/
env/ env/
ENV/ ENV/
!dashboard/backend/src/modules/env/
!dashboard/backend/src/modules/env/**
# IDE files # IDE files
.vscode/ .vscode/

View File

@ -30,13 +30,22 @@ ENV BYTELYST_COMMIT_SHA=${BYTELYST_COMMIT_SHA} \
RUN npm run build RUN npm run build
# --- Stage 2: Run --- # --- Stage 2: Run ---
FROM node:20-alpine AS runner # Use Debian slim (not Alpine) because vm-health-check.sh uses GNU df flags
# (--output=pcent, --output=avail) that BusyBox df does not support.
FROM node:20-slim AS runner
WORKDIR /app/backend WORKDIR /app/backend
COPY backend/package.json backend/package-lock.json ./ COPY backend/package.json backend/package-lock.json ./
RUN npm ci --omit=dev --ignore-scripts RUN npm ci --omit=dev --ignore-scripts
RUN apk add --no-cache curl
# Install tools needed by the VM management module:
# bash — vm-health-check.sh and vm-cleanup.sh require bash
# docker.io — docker CLI to communicate with the host daemon via socket
# python3 — used in inline python3 -c snippets inside the scripts
RUN apt-get update && apt-get install -y --no-install-recommends \
curl bash docker.io python3 \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/backend/dist ./dist COPY --from=builder /app/backend/dist ./dist

File diff suppressed because it is too large Load Diff

View File

@ -31,5 +31,13 @@ const envSchema = z.object({
export const config = envSchema.parse(process.env); export const config = envSchema.parse(process.env);
// Warn loudly when insecure default keys are in use
if (config.CSRF_SECRET === 'default-csrf-secret-change-in-production') {
console.warn('[config] WARNING: CSRF_SECRET is using the insecure default — set CSRF_SECRET in .env before deploying to production');
}
if (config.ENCRYPTION_KEY === 'default-encryption-key-change-in-production') {
console.warn('[config] WARNING: ENCRYPTION_KEY is using the insecure default — set ENCRYPTION_KEY in .env before deploying to production');
}
export const productId = productIdentity.productId; export const productId = productIdentity.productId;
export const productName = productIdentity.name; export const productName = productIdentity.name;

View File

@ -148,7 +148,7 @@ function parseTypeScriptOutput(output: string, projectPath: string): CodeQuality
if (tsErrorMatch) { if (tsErrorMatch) {
issues.push({ issues.push({
id: `ts-${Date.now()}-${Math.random().toString(36).substring(2, 9)}`, id: `ts-${Date.now()}-${Math.random().toString(36).substring(2, 9)}`,
type: tsErrorMatch[3] as 'error' | 'warning', type: tsErrorMatch[4] as 'error' | 'warning', // group 4 = type; group 3 = column
category: 'typescript', category: 'typescript',
file: tsErrorMatch[1], file: tsErrorMatch[1],
line: parseInt(tsErrorMatch[2]), line: parseInt(tsErrorMatch[2]),
@ -167,10 +167,12 @@ function parseEslintOutput(output: string, projectPath: string): CodeQualityIssu
const lines = output.split('\n'); const lines = output.split('\n');
for (const line of lines) { for (const line of lines) {
// ESLint format: file:line:col message [rule] // ESLint unix format: file:line:col: message [rule]
const eslintMatch = line.match(/(.+\.tsx?):(\d+):(\d+)\s+(.+?)\s+\[(.+)\]/); // Rule part in brackets may or may not be present depending on formatter
const eslintMatch = line.match(/(.+\.tsx?):(\d+):(\d+)[:\s]+(.+?)(?:\s+\[([^\]]+)\])?$/);
if (eslintMatch) { if (eslintMatch) {
const severity = eslintMatch[4].includes('error') ? 'error' : 'warning'; const msgAndLevel = eslintMatch[4];
const severity = /\berror\b/i.test(msgAndLevel) ? 'error' : 'warning';
issues.push({ issues.push({
id: `eslint-${Date.now()}-${Math.random().toString(36).substring(2, 9)}`, id: `eslint-${Date.now()}-${Math.random().toString(36).substring(2, 9)}`,
type: severity, type: severity,
@ -178,8 +180,8 @@ function parseEslintOutput(output: string, projectPath: string): CodeQualityIssu
file: eslintMatch[1], file: eslintMatch[1],
line: parseInt(eslintMatch[2]), line: parseInt(eslintMatch[2]),
column: parseInt(eslintMatch[3]), column: parseInt(eslintMatch[3]),
message: eslintMatch[4], message: msgAndLevel,
rule: eslintMatch[5], rule: eslintMatch[5] ?? 'unknown',
}); });
} }
} }
@ -210,18 +212,24 @@ function parseTestOutput(output: string): { passed: number; failed: number } {
let passed = 0; let passed = 0;
let failed = 0; let failed = 0;
// Try to parse Vitest output // Try to parse Vitest output — use "Tests" line (individual tests), not "Test Files" line
const vitestMatch = output.match(/Test Files\s+(\d+)\s+\((\d+)\s+failed/); // Format: " Tests 3 failed | 5 passed (8)" or " Tests 8 passed (8)"
if (vitestMatch) { const vitestFailMatch = output.match(/\bTests\b\s+(\d+)\s+failed[^|]*\|\s*(\d+)\s+passed/);
failed = parseInt(vitestMatch[2]); const vitestPassMatch = output.match(/\bTests\b\s+(\d+)\s+passed/);
passed = parseInt(vitestMatch[1]) - failed; if (vitestFailMatch) {
failed = parseInt(vitestFailMatch[1]);
passed = parseInt(vitestFailMatch[2]);
} else if (vitestPassMatch) {
passed = parseInt(vitestPassMatch[1]);
failed = 0;
} }
// Try to parse Jest output // Try to parse Jest output: "Tests: 5 passed, 2 failed" or "Tests: 2 failed, 5 passed"
const jestMatch = output.match(/Tests:\s+(\d+)\s+passed,?\s*(\d+)\s+failed/); const jestPassMatch = output.match(/Tests:.*?(\d+)\s+passed/);
if (jestMatch) { const jestFailMatch = output.match(/Tests:.*?(\d+)\s+failed/);
passed = parseInt(jestMatch[1]); if (jestPassMatch || jestFailMatch) {
failed = parseInt(jestMatch[2]); passed = jestPassMatch ? parseInt(jestPassMatch[1]) : 0;
failed = jestFailMatch ? parseInt(jestFailMatch[1]) : 0;
} }
return { passed, failed }; return { passed, failed };

View File

@ -30,6 +30,10 @@ async function runDeploymentScript(service: Service, deploymentId: string) {
const scriptDir = join(process.cwd(), '../../'); // Go to bytelyst-devops-tools root const scriptDir = join(process.cwd(), '../../'); // Go to bytelyst-devops-tools root
const scriptPath = join(scriptDir, service.scriptPath); const scriptPath = join(scriptDir, service.scriptPath);
let finalStatus: 'success' | 'failed' = 'failed';
let logs = '';
let version: string | undefined;
try { try {
const { stdout, stderr } = await execAsync(`bash ${scriptPath}`, { const { stdout, stderr } = await execAsync(`bash ${scriptPath}`, {
cwd: scriptDir, cwd: scriptDir,
@ -40,15 +44,9 @@ async function runDeploymentScript(service: Service, deploymentId: string) {
}, },
}); });
const logs = `STDOUT:\n${stdout}\n\nSTDERR:\n${stderr}`; logs = `STDOUT:\n${stdout}\n\nSTDERR:\n${stderr}`;
finalStatus = 'success';
// Update deployment as success version = extractVersion(stdout + stderr) || 'unknown';
await updateDeployment(deploymentId, {
status: 'success',
logs,
completedAt: new Date().toISOString(),
version: extractVersion(stdout + stderr) || 'unknown',
});
// Update service status // Update service status
const { getServiceById, updateService } = await import('../services/repository.js'); const { getServiceById, updateService } = await import('../services/repository.js');
@ -57,21 +55,14 @@ async function runDeploymentScript(service: Service, deploymentId: string) {
await updateService(service.id, { await updateService(service.id, {
status: 'up', status: 'up',
lastDeployedAt: new Date().toISOString(), lastDeployedAt: new Date().toISOString(),
version: extractVersion(stdout + stderr) || svc.version, version: version || svc.version,
}); });
} }
} catch (error: any) { } catch (error: any) {
const logs = error instanceof Error logs = error instanceof Error
? `ERROR: ${error.message}\n\n${(error as any).stdout ? `STDOUT:\n${(error as any).stdout}\n\n` : ''}${(error as any).stderr ? `STDERR:\n${(error as any).stderr}` : ''}` ? `ERROR: ${error.message}\n\n${(error as any).stdout ? `STDOUT:\n${(error as any).stdout}\n\n` : ''}${(error as any).stderr ? `STDERR:\n${(error as any).stderr}` : ''}`
: String(error); : String(error);
// Update deployment as failed
await updateDeployment(deploymentId, {
status: 'failed',
logs,
completedAt: new Date().toISOString(),
});
// Update service status to down // Update service status to down
const { getServiceById, updateService } = await import('../services/repository.js'); const { getServiceById, updateService } = await import('../services/repository.js');
const svc = await getServiceById(service.id); const svc = await getServiceById(service.id);
@ -80,6 +71,18 @@ async function runDeploymentScript(service: Service, deploymentId: string) {
status: 'down', status: 'down',
}); });
} }
} finally {
// Always write final status — ensures the deployment never gets stuck in 'running'
try {
await updateDeployment(deploymentId, {
status: finalStatus,
logs,
completedAt: new Date().toISOString(),
...(version ? { version } : {}),
});
} catch (updateError) {
console.error(`Failed to persist final deployment status for ${deploymentId}:`, updateError);
}
} }
} }

View File

@ -13,23 +13,29 @@ import { createAuditLog } from '../audit/repository.js';
import { productId } from '../../lib/config.js'; import { productId } from '../../lib/config.js';
export async function deploymentRoutes(fastify: FastifyInstance) { export async function deploymentRoutes(fastify: FastifyInstance) {
// Get recent deployments across all services // Get recent deployments across all services (admin only)
fastify.get('/deployments', async (req, reply) => { fastify.get('/deployments', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
const query = QueryParamsSchema.parse(req.query); const query = QueryParamsSchema.parse(req.query);
const deployments = await getRecentDeployments(query.limit); const deployments = await getRecentDeployments(query.limit);
return reply.send(deployments); return reply.send(deployments);
}); });
// Get deployments for a specific service // Get deployments for a specific service (admin only)
fastify.get('/deployments/service/:serviceId', async (req, reply) => { fastify.get('/deployments/service/:serviceId', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
const params = TriggerDeploymentParamsSchema.parse(req.params); const params = TriggerDeploymentParamsSchema.parse(req.params);
const query = QueryParamsSchema.parse(req.query); const query = QueryParamsSchema.parse(req.query);
const deployments = await getDeploymentsByService(params.serviceId, query.limit); const deployments = await getDeploymentsByService(params.serviceId, query.limit);
return reply.send(deployments); return reply.send(deployments);
}); });
// Get single deployment // Get single deployment (admin only)
fastify.get('/deployments/:id', async (req, reply) => { fastify.get('/deployments/:id', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
const params = DeploymentParamsSchema.parse(req.params); const params = DeploymentParamsSchema.parse(req.params);
const deployment = await getDeploymentById(params.id); const deployment = await getDeploymentById(params.id);
if (!deployment) { if (!deployment) {
@ -38,9 +44,11 @@ export async function deploymentRoutes(fastify: FastifyInstance) {
return reply.send(deployment); return reply.send(deployment);
}); });
// Get deployment logs (SSE disabled due to Fastify 5 compatibility) // Get deployment logs (admin only; SSE disabled due to Fastify 5 compatibility)
// TODO: Re-enable SSE when fastify-sse-v2 supports Fastify 5 // TODO: Re-enable SSE when fastify-sse-v2 supports Fastify 5
fastify.get('/deployments/:id/logs', async (req, reply) => { fastify.get('/deployments/:id/logs', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
const params = DeploymentParamsSchema.parse(req.params); const params = DeploymentParamsSchema.parse(req.params);
const deployment = await getDeploymentById(params.id); const deployment = await getDeploymentById(params.id);

View File

@ -0,0 +1,31 @@
import type { EnvVar } from './types.js';
const envVars = new Map<string, EnvVar>();
export async function getEnvVars(): Promise<EnvVar[]> {
return Array.from(envVars.values()).sort((a, b) => a.name.localeCompare(b.name));
}
export async function getEnvVar(id: string): Promise<EnvVar | null> {
return envVars.get(id) ?? null;
}
export async function upsertEnvVar(input: Partial<EnvVar> & { name: string }): Promise<EnvVar> {
const id = input.id || input.name.toLowerCase().replace(/[^a-z0-9_]+/g, '_');
const envVar: EnvVar = {
id,
name: input.name,
value: input.isSecret ? 'REDACTED' : input.value ?? '',
isSecret: input.isSecret ?? true,
source: input.source ?? 'local',
azureKeyVaultName: input.azureKeyVaultName,
azureSecretName: input.azureSecretName,
updatedAt: new Date().toISOString(),
};
envVars.set(id, envVar);
return envVar;
}
export async function deleteEnvVar(id: string): Promise<boolean> {
return envVars.delete(id);
}

View File

@ -0,0 +1,61 @@
import type { FastifyInstance } from 'fastify';
import { BadRequestError, requireAdmin } from '../../lib/auth.js';
import { deleteEnvVar, getEnvVar, getEnvVars, upsertEnvVar } from './repository.js';
import { EnvVarInputSchema, EnvVarParamsSchema } from './types.js';
export async function envRoutes(fastify: FastifyInstance) {
fastify.get('/env', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
return reply.send(await getEnvVars());
});
fastify.get('/env/:id', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
const params = EnvVarParamsSchema.parse(req.params);
const envVar = await getEnvVar(params.id);
if (!envVar) return reply.code(404).send({ error: 'Environment variable not found' });
return reply.send(envVar);
});
fastify.post('/env', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
try {
const input = EnvVarInputSchema.parse(req.body) as { name: string };
return reply.code(201).send(await upsertEnvVar(input));
} catch (error) {
if (error instanceof Error) throw new BadRequestError(error.message);
throw error;
}
});
fastify.put('/env/:id', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
try {
const params = EnvVarParamsSchema.parse(req.params);
const input = EnvVarInputSchema.parse({ ...(req.body as object), id: params.id }) as { name: string; id: string };
return reply.send(await upsertEnvVar(input));
} catch (error) {
if (error instanceof Error) throw new BadRequestError(error.message);
throw error;
}
});
fastify.delete('/env/:id', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
const params = EnvVarParamsSchema.parse(req.params);
const deleted = await deleteEnvVar(params.id);
if (!deleted) return reply.code(404).send({ error: 'Environment variable not found' });
return reply.code(204).send();
});
fastify.post('/env/sync-azure', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
return reply.send({ synced: 0, errors: ['Azure Key Vault sync is not configured in this local dashboard build.'] });
});
}

View File

@ -0,0 +1,22 @@
import { z } from 'zod';
export const EnvVarSchema = z.object({
id: z.string().min(1),
name: z.string().min(1),
value: z.string().default(''),
isSecret: z.boolean().default(true),
source: z.enum(['local', 'azure-key-vault']).default('local'),
azureKeyVaultName: z.string().optional(),
azureSecretName: z.string().optional(),
updatedAt: z.string().datetime().default(() => new Date().toISOString()),
});
export const EnvVarParamsSchema = z.object({
id: z.string().min(1),
});
export const EnvVarInputSchema = EnvVarSchema.omit({ name: true }).partial().extend({
name: z.string().min(1),
});
export type EnvVar = z.infer<typeof EnvVarSchema>;

View File

@ -53,16 +53,8 @@ export async function healthRoutes(fastify: FastifyInstance) {
// Clear health cache (admin only) // Clear health cache (admin only)
fastify.delete('/health/cache', { fastify.delete('/health/cache', {
preHandler: async (req) => requireAdmin(req), preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => { }, async (_req, reply) => {
try { clearHealthCache();
requireAdmin(req); return reply.send({ message: 'Health cache cleared' });
clearHealthCache();
return reply.send({ message: 'Health cache cleared' });
} catch (error) {
if (error instanceof Error) {
throw new BadRequestError(error.message);
}
throw error;
}
}); });
} }

View File

@ -0,0 +1,270 @@
import { execFile } from 'child_process';
import { promisify } from 'util';
import { readFile, stat } from 'fs/promises';
import { existsSync } from 'fs';
import type { HermesOpsCronJob, HermesOpsInstance, HermesOpsRepo, HermesOpsSnapshot, HermesOpsTimer } from './types.js';
const execFileAsync = promisify(execFile);
const instances = [
{
id: 'vijay' as const,
label: 'Vijay / root',
hermesHome: '/root/.hermes',
gatewayKind: 'system',
gatewayService: 'hermes-gateway.service',
dashboardService: 'hermes-root-dashboard.service',
dashboardPort: 9119,
backupTimer: 'hermes-root-backup.timer',
repoPath: '/root/repos/bytelyst_hostinger_hermes_vm',
driveFolder: 'Vijay Drive',
},
{
id: 'bheem' as const,
label: 'Bheem / Uma',
hermesHome: '/home/uma/.hermes',
gatewayKind: 'uma-user',
gatewayService: 'uma-hermes-gateway.service',
dashboardService: 'uma-hermes-dashboard.service',
dashboardPort: 9120,
backupTimer: 'uma-hermes-backup.timer',
repoPath: '/home/uma/repos/uma_hostinger_hermes_vm',
driveFolder: 'Bheem Drive',
},
];
async function run(command: string, args: string[], cwd?: string): Promise<string | null> {
try {
const { stdout } = await execFileAsync(command, args, {
cwd,
timeout: 5000,
maxBuffer: 1024 * 1024,
});
return stdout.trim();
} catch {
return null;
}
}
async function isActive(unit: string): Promise<boolean> {
return (await run('systemctl', ['is-active', unit])) === 'active';
}
async function isEnabled(unit: string): Promise<boolean> {
return (await run('systemctl', ['is-enabled', unit])) === 'enabled';
}
async function getTimer(name: string): Promise<HermesOpsTimer> {
const active = await isActive(name);
const show = await run('systemctl', [
'show',
name,
'-p',
'NextElapseUSecRealtime',
'-p',
'LastTriggerUSec',
'--no-pager',
]);
const properties = Object.fromEntries(
(show ?? '')
.split('\n')
.map((line) => {
const [key, ...value] = line.split('=');
return [key, value.join('=') || null] as const;
})
.filter(([key]) => key),
);
return {
name,
active,
nextRun: properties.NextElapseUSecRealtime ?? null,
lastRun: properties.LastTriggerUSec ?? null,
};
}
async function isUmaGatewayActive(): Promise<boolean> {
const output = await run('ps', ['-eo', 'user=,args=']);
return Boolean(
output?.split('\n').some((line) => {
const trimmed = line.trimStart();
return trimmed.startsWith('uma ') && trimmed.includes('hermes_cli.main gateway');
}),
);
}
async function isUmaGatewayEnabled(): Promise<boolean> {
return existsSync('/home/uma/.config/systemd/user/default.target.wants/uma-hermes-gateway.service');
}
async function getRepo(path: string): Promise<HermesOpsRepo> {
const [branch, status, head, lastCommitAt, gitSize, backupSize] = await Promise.all([
run('git', ['branch', '--show-current'], path),
run('git', ['status', '--porcelain'], path),
run('git', ['rev-parse', '--short', 'HEAD'], path),
run('git', ['log', '-1', '--format=%cI'], path),
run('du', ['-sh', '.git'], path),
run('du', ['-sh', 'hermes_persistent_backup'], path),
]);
const size = [gitSize, backupSize].filter(Boolean).join(' / ');
return {
path,
branch: branch || null,
clean: status === '',
head: head || null,
lastCommitAt: lastCommitAt || null,
size: size ? size.replace(/\n/g, ' / ') : null,
};
}
async function manifestStats(hermesHome: string): Promise<{ files: number | null; cronJobs: number | null }> {
try {
const manifestPath = `${hermesHome}/MANIFEST.json`;
const manifest = JSON.parse(await readFile(manifestPath, 'utf8')) as { files?: unknown[] };
const jobsPath = `${hermesHome}/cron/jobs.json`;
const jobs = JSON.parse(await readFile(jobsPath, 'utf8'));
const cronJobs = Array.isArray(jobs) ? jobs.length : Array.isArray(jobs?.jobs) ? jobs.jobs.length : null;
return {
files: Array.isArray(manifest.files) ? manifest.files.length : null,
cronJobs,
};
} catch {
return { files: null, cronJobs: null };
}
}
async function tokenExists(path: string): Promise<boolean> {
try {
const info = await stat(path);
return info.isFile() && info.size > 100;
} catch {
return false;
}
}
async function getTailscaleIp(): Promise<string | null> {
const output = await run('tailscale', ['ip', '-4']);
return output?.split('\n')[0] || null;
}
async function getActiveHermesSessionCount(): Promise<number> {
const output = await run('ps', ['-ef']);
if (!output) return 0;
return output
.split('\n')
.filter((line) => line.includes('hermes_cli.main') && !line.includes('gateway') && !line.includes('grep'))
.length;
}
export async function getHermesOpsSnapshot(): Promise<HermesOpsSnapshot> {
const tailscaleIp = await getTailscaleIp();
const warnings: string[] = [];
const emergencyDriveUpload = await getTimer('hermes-emergency-drive-upload.timer');
const activeSessions = await getActiveHermesSessionCount();
const results: HermesOpsInstance[] = [];
for (const item of instances) {
const gatewayActiveCheck =
item.gatewayKind === 'uma-user' ? isUmaGatewayActive() : isActive(item.gatewayService);
const gatewayEnabledCheck =
item.gatewayKind === 'uma-user' ? isUmaGatewayEnabled() : isEnabled(item.gatewayService);
const [gatewayActive, gatewayEnabled, dashboardActive, backupTimer, repo, stats, googleToken] = await Promise.all([
gatewayActiveCheck,
gatewayEnabledCheck,
isActive(item.dashboardService),
getTimer(item.backupTimer),
getRepo(item.repoPath),
manifestStats(`${item.repoPath}/hermes_persistent_backup`),
tokenExists(`${item.hermesHome}/google_token.json`),
]);
const dashboardUrl = tailscaleIp ? `http://${tailscaleIp}:${item.dashboardPort}/` : `:${item.dashboardPort}`;
if (!gatewayActive) warnings.push(`${item.label} gateway is not active`);
if (!gatewayEnabled) warnings.push(`${item.label} gateway auto-start is not enabled`);
if (!dashboardActive) warnings.push(`${item.label} private dashboard is not active`);
if (!backupTimer.active) warnings.push(`${item.label} backup timer is not active`);
if (!repo.head) warnings.push(`${item.label} backup repo HEAD could not be read`);
if (!repo.clean) warnings.push(`${item.label} backup repo has uncommitted changes`);
if (!googleToken) warnings.push(`${item.label} Google Workspace token is missing`);
results.push({
id: item.id,
label: item.label,
hermesHome: item.hermesHome,
gateway: {
service: item.gatewayService,
active: gatewayActive,
enabled: gatewayEnabled,
},
dashboard: {
service: item.dashboardService,
active: dashboardActive,
url: dashboardUrl,
},
backup: {
timer: backupTimer,
repo,
restoredFileCount: stats.files,
restoredCronJobs: stats.cronJobs,
},
google: {
workspaceToken: googleToken,
driveFolder: item.driveFolder,
},
});
}
if (!emergencyDriveUpload.active) warnings.push('Emergency Google Drive upload timer is not active');
if (!existsSync('/root/.config/hermes-google-drive/user-token.json')) {
warnings.push('Emergency Drive OAuth token is missing');
}
const cronJobs: HermesOpsCronJob[] = [
{
name: emergencyDriveUpload.name,
label: 'Emergency Drive upload',
active: emergencyDriveUpload.active,
nextRun: emergencyDriveUpload.nextRun,
lastRun: emergencyDriveUpload.lastRun,
},
...results.map((instance) => ({
name: instance.backup.timer.name,
label: `${instance.label} backup`,
active: instance.backup.timer.active,
nextRun: instance.backup.timer.nextRun,
lastRun: instance.backup.timer.lastRun,
})),
];
return {
generatedAt: new Date().toISOString(),
tailscaleIp,
emergencyDriveUpload,
activeSessions: {
active: activeSessions,
updatedAt: new Date().toISOString(),
},
cronJobs,
recentAlerts: warnings.slice(0, 6),
quickLinks: [
{
label: 'Hermes operations',
href: 'https://github.com/saravanakumardb/learning_ai_devops_tools/blob/main/docs/hermes-operations.md',
description: 'Runbook for gateways, backups, fallbacks, and recovery.',
},
{
label: 'Disaster recovery',
href: 'https://github.com/saravanakumardb/learning_ai_devops_tools/blob/main/docs/hermes-disaster-recovery.md',
description: 'Restore and rebuild steps for a fresh VM.',
},
{
label: 'Setup roadmap',
href: 'https://github.com/saravanakumardb/learning_ai_devops_tools/blob/main/docs/hermes-setup-upgrade-roadmap.md',
description: 'Tracked rollout, security, and workflow checklist.',
},
],
instances: results,
warnings,
};
}

View File

@ -0,0 +1,13 @@
import type { FastifyInstance } from 'fastify';
import { getHermesOpsSnapshot } from './repository.js';
export async function hermesOpsRoutes(fastify: FastifyInstance) {
fastify.get('/hermes/ops', async (req, reply) => {
try {
return reply.send(await getHermesOpsSnapshot());
} catch (error) {
fastify.log.error(error, 'Failed to get Hermes operations snapshot');
return reply.code(500).send({ error: 'Failed to get Hermes operations snapshot' });
}
});
}

View File

@ -0,0 +1,74 @@
export interface HermesOpsTimer {
name: string;
active: boolean;
nextRun: string | null;
lastRun: string | null;
}
export interface HermesOpsRepo {
path: string;
branch: string | null;
clean: boolean;
head: string | null;
lastCommitAt: string | null;
size: string | null;
}
export interface HermesOpsGoogle {
workspaceToken: boolean;
driveFolder: string;
}
export interface HermesOpsInstance {
id: 'vijay' | 'bheem';
label: string;
hermesHome: string;
gateway: {
service: string;
active: boolean;
enabled: boolean;
};
dashboard: {
service: string;
active: boolean;
url: string;
};
backup: {
timer: HermesOpsTimer;
repo: HermesOpsRepo;
restoredFileCount: number | null;
restoredCronJobs: number | null;
};
google: HermesOpsGoogle;
}
export interface HermesOpsSessionSummary {
active: number;
updatedAt: string | null;
}
export interface HermesOpsCronJob {
name: string;
label: string;
active: boolean;
nextRun: string | null;
lastRun: string | null;
}
export interface HermesOpsLink {
label: string;
href: string;
description: string;
}
export interface HermesOpsSnapshot {
generatedAt: string;
tailscaleIp: string | null;
emergencyDriveUpload: HermesOpsTimer;
activeSessions: HermesOpsSessionSummary;
cronJobs: HermesOpsCronJob[];
recentAlerts: string[];
quickLinks: HermesOpsLink[];
instances: HermesOpsInstance[];
warnings: string[];
}

View File

@ -0,0 +1,354 @@
import { exec } from 'child_process';
import { promisify } from 'util';
import { hostname } from 'os';
import { readFile } from 'fs/promises';
const execAsync = promisify(exec);
// Paths are env-configurable so they work both in the Docker container (via
// volume mounts) and when the backend is run directly on the host for dev.
const SCRIPTS_PATH = process.env.VM_SCRIPTS_PATH
?? '/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM';
const LOG_DIR = process.env.VM_LOG_DIR ?? '/var/log';
const HEALTH_SCRIPT = `${SCRIPTS_PATH}/vm-health-check.sh`;
const CLEANUP_SCRIPT = `${SCRIPTS_PATH}/vm-cleanup.sh`;
const CLEANUP_LOG = `${LOG_DIR}/vm-cleanup.log`;
// ---------------------------------------------------------------------------
// Health check (vm-health-check.sh --json)
// ---------------------------------------------------------------------------
export async function runVmHealthCheck() {
try {
// Script exits 1 (WARN) or 2 (CRIT) but still emits valid JSON on stdout.
const { stdout } = await execAsync(`bash "${HEALTH_SCRIPT}" --json 2>/dev/null`, {
timeout: 30_000,
});
return JSON.parse(stdout);
} catch (error: any) {
// Non-zero exit — stdout may still contain valid JSON (WARN/CRIT result)
if (error.stdout) {
try { return JSON.parse(error.stdout); } catch { /* fall through */ }
}
return {
timestamp: new Date().toISOString(),
hostname: hostname(),
overall: 'CRIT',
checks: {},
error: String(error.stderr || error.message || error),
};
}
}
// ---------------------------------------------------------------------------
// Cleanup log — raw tail
// ---------------------------------------------------------------------------
export async function getCleanupLog(lines = 50): Promise<string> {
try {
const { stdout } = await execAsync(
`tail -${lines} "${CLEANUP_LOG}" 2>/dev/null || echo "(log not found)"`,
{ timeout: 5_000 }
);
return stdout.trim();
} catch {
return '(log not available)';
}
}
// ---------------------------------------------------------------------------
// Cron status — parsed run history + next scheduled times
// ---------------------------------------------------------------------------
export interface CronRunSummary {
timestamp: string;
mode: 'standard' | 'full';
diskBefore: string;
diskAfter: string;
freedMB: number;
durationSecs: number;
success: boolean;
steps: string[];
jsonSummary?: Record<string, unknown>;
}
export interface CronJob {
name: string;
schedule: string;
description: string;
lastRun: CronRunSummary | null;
nextRun: string | null;
}
export async function getCronStatus(): Promise<{ jobs: CronJob[]; recentRuns: CronRunSummary[] }> {
const [rawLog, crontab] = await Promise.all([
readFile(CLEANUP_LOG, 'utf8').catch(() => ''),
execAsync('crontab -l 2>/dev/null').then(r => r.stdout).catch(() => ''),
]);
const recentRuns = parseCleanupLog(rawLog);
const jobs = buildJobList(crontab, recentRuns);
return { jobs, recentRuns: recentRuns.slice(0, 20) };
}
function parseCleanupLog(raw: string): CronRunSummary[] {
const runs: CronRunSummary[] = [];
// Runs are delimited by [START] lines
const blocks = raw.split(/\[START\]/);
for (const block of blocks.slice(1)) {
try {
const startLine = block.match(/\[(\d{4}-\d{2}-\d{2}T[\d:Z]+)\] mode=(\w+)/);
if (!startLine) continue;
const timestamp = startLine[1];
const mode = startLine[2] === 'full' ? 'full' : 'standard';
const diskLine = block.match(/\[DISK\] before=([^\s]+) after=([^\s]+)/);
const endLine = block.match(/\[END\]/);
const cmdLines = [...block.matchAll(/\[CMD\] (.+)/g)].map(m => m[1]);
const jsonMatch = block.match(/\[JSON\] ({.+})/);
// Compute freed MB from disk "used" before/after (e.g. " 70G 123G 37%")
let freedMB = 0;
let diskBefore = '', diskAfter = '';
if (diskLine) {
diskBefore = diskLine[1].trim();
diskAfter = diskLine[2].trim();
const gbBefore = parseFloat(diskLine[1].match(/([\d.]+)G/)?.[1] ?? '0');
const gbAfter = parseFloat(diskLine[2].match(/([\d.]+)G/)?.[1] ?? '0');
freedMB = Math.round((gbBefore - gbAfter) * 1024);
}
// Rough duration: time from start to end
const startTs = new Date(timestamp).getTime();
const endTs = endLine
? (() => {
const m = block.slice(block.indexOf('[END]') - 28, block.indexOf('[END]') - 2);
return new Date(m.match(/\d{4}-\d{2}-\d{2}T[\d:Z]+/)?.[0] ?? timestamp).getTime();
})()
: startTs;
const durationSecs = Math.round((endTs - startTs) / 1000);
runs.push({
timestamp,
mode,
diskBefore,
diskAfter,
freedMB,
durationSecs: isNaN(durationSecs) ? 0 : durationSecs,
success: !!endLine,
steps: cmdLines,
jsonSummary: jsonMatch ? JSON.parse(jsonMatch[1]) : undefined,
});
} catch {
// Skip malformed blocks
}
}
return runs.reverse(); // most recent first
}
function buildJobList(crontab: string, runs: CronRunSummary[]): CronJob[] {
const managed = crontab.match(/# bytelyst-vm-maintenance[\s\S]+/m)?.[0] ?? '';
const defs: Array<{ name: string; schedule: string; description: string; mode?: string }> = [
{ name: 'build-cache-prune', schedule: '0 3 * * *', description: 'Daily build cache prune' },
{ name: 'weekly-cleanup', schedule: '0 2 * * 0', description: 'Weekly standard cleanup' },
{ name: 'monthly-full', schedule: '0 1 1 * *', description: 'Monthly full cleanup', mode: 'full' },
{ name: 'health-check', schedule: '0 7 * * *', description: 'Daily health check + Telegram alert' },
];
return defs.map(def => {
const matchingRun = def.mode === 'full'
? runs.find(r => r.mode === 'full')
: runs.find(r => r.mode === 'standard');
const nextRun = computeNextRun(def.schedule);
return { ...def, lastRun: matchingRun ?? null, nextRun };
});
}
/** Very lightweight cron next-run calculator (handles standard 5-field expressions) */
function computeNextRun(expr: string): string {
const [min, hr, dom, , dow] = expr.split(' ');
const now = new Date();
const next = new Date(now);
next.setSeconds(0, 0);
next.setMinutes(parseInt(min));
next.setHours(parseInt(hr));
if (dom === '*' && dow !== '*') {
// Weekly: advance to correct day-of-week
const targetDow = parseInt(dow);
const dayDiff = (targetDow - next.getDay() + 7) % 7 || 7;
next.setDate(next.getDate() + dayDiff);
} else if (dom !== '*') {
// Monthly: advance to correct day-of-month
next.setDate(parseInt(dom));
if (next <= now) next.setMonth(next.getMonth() + 1);
} else {
// Daily: just advance to tomorrow if already passed today
if (next <= now) next.setDate(next.getDate() + 1);
}
return next.toISOString();
}
// ---------------------------------------------------------------------------
// Trigger cleanup (container runs as root — no sudo needed)
// ---------------------------------------------------------------------------
export async function runVmCleanup(
mode: 'weekly' | 'monthly' | 'dry-run',
): Promise<{ success: boolean; output: string }> {
const args =
mode === 'monthly' ? '--full --quiet' :
mode === 'dry-run' ? '--dry-run' :
'--quiet';
try {
const { stdout, stderr } = await execAsync(
`bash "${CLEANUP_SCRIPT}" ${args} 2>&1`,
{ timeout: 120_000 },
);
return { success: true, output: (stdout + stderr).trim() };
} catch (error: any) {
const out = ((error.stdout ?? '') + (error.stderr ?? '')).trim();
return { success: false, output: out || String(error.message ?? error) };
}
}
// ---------------------------------------------------------------------------
// Unhealthy containers (docker inspect via shell — no Docker SDK needed)
// ---------------------------------------------------------------------------
export interface UnhealthyContainer {
name: string;
status: string;
restartCount: number;
lastHealthLogs: string[];
unhealthySince: string | null;
}
export async function getUnhealthyContainers(): Promise<UnhealthyContainer[]> {
try {
const { stdout } = await execAsync(
`docker ps --filter health=unhealthy --format '{{.Names}}' 2>/dev/null`,
{ timeout: 10_000 },
);
const names = stdout.trim().split('\n').filter(Boolean);
if (!names.length) return [];
const results = await Promise.all(names.map(async name => {
try {
const { stdout: raw } = await execAsync(
`docker inspect "${name}" 2>/dev/null`,
{ timeout: 5_000 },
);
const data = JSON.parse(raw)?.[0];
const health = data?.State?.Health ?? {};
const logs: string[] = (health.Log ?? [])
.slice(-3)
.map((l: any) => l.Output?.trim() ?? '');
const unhealthySince = health.Log?.[0]?.Start ?? null;
return {
name,
status: data?.State?.Status ?? 'unknown',
restartCount: data?.RestartCount ?? 0,
lastHealthLogs: logs,
unhealthySince,
} satisfies UnhealthyContainer;
} catch {
return { name, status: 'unknown', restartCount: 0, lastHealthLogs: [], unhealthySince: null };
}
}));
return results;
} catch {
return [];
}
}
export async function restartContainer(name: string): Promise<{ success: boolean; message: string }> {
// Validate name — only allow alphanumeric, dash, underscore
if (!/^[\w-]+$/.test(name)) {
return { success: false, message: 'Invalid container name' };
}
try {
await execAsync(`docker restart "${name}"`, { timeout: 30_000 });
return { success: true, message: `${name} restarted` };
} catch (error: any) {
return { success: false, message: String(error.stderr || error.message || error) };
}
}
// ---------------------------------------------------------------------------
// Ollama models
// ---------------------------------------------------------------------------
export interface OllamaModel {
name: string;
sizeGB: number;
modifiedAt: string;
}
export interface OllamaRunning {
name: string;
sizeGB: number;
processor: string;
expiresAt: string;
}
// Ollama REST API base — host-gateway resolves to the Docker host,
// where ollama serve listens on port 11434.
const OLLAMA_BASE = process.env.OLLAMA_BASE_URL ?? 'http://host-gateway:11434';
async function ollamaFetch(path: string, opts?: RequestInit): Promise<unknown> {
const res = await fetch(`${OLLAMA_BASE}${path}`, {
signal: AbortSignal.timeout(10_000),
...opts,
});
if (!res.ok) throw new Error(`Ollama ${path}: ${res.status}`);
return res.json();
}
export async function getOllamaModels(): Promise<{ models: OllamaModel[]; running: OllamaRunning[] }> {
try {
const [tagsData, psData] = await Promise.all([
ollamaFetch('/api/tags').catch(() => ({ models: [] })),
ollamaFetch('/api/ps').catch(() => ({ models: [] })),
]);
const models = ((tagsData as any).models ?? []).map((m: any) => ({
name: m.name ?? '',
sizeGB: parseFloat(((m.size ?? 0) / 1e9).toFixed(2)),
modifiedAt: m.modified_at ?? '',
}));
const running = ((psData as any).models ?? []).map((m: any) => ({
name: m.name ?? '',
sizeGB: parseFloat(((m.size ?? 0) / 1e9).toFixed(2)),
processor: m.details?.families?.join(', ') ?? '',
expiresAt: m.expires_at ?? '',
}));
return { models, running };
} catch {
return { models: [], running: [] };
}
}
export async function unloadOllamaModel(name: string): Promise<{ success: boolean; message: string }> {
if (!/^[\w.:\-/]+$/.test(name)) return { success: false, message: 'Invalid model name' };
try {
// Unload by setting keep_alive to 0
await ollamaFetch('/api/generate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ model: name, keep_alive: 0 }),
});
return { success: true, message: `${name} unloaded` };
} catch (error: any) {
return { success: false, message: String(error.message ?? error) };
}
}

View File

@ -0,0 +1,130 @@
import type { FastifyInstance } from 'fastify';
import { requireAdmin } from '../../lib/auth.js';
import {
runVmHealthCheck,
getCleanupLog,
runVmCleanup,
getCronStatus,
getUnhealthyContainers,
restartContainer,
getOllamaModels,
unloadOllamaModel,
} from './repository.js';
import { VmCleanupParamsSchema, VmContainerRestartParamsSchema } from './types.js';
export async function vmRoutes(fastify: FastifyInstance) {
// ── Health check ──────────────────────────────────────────────────────────
// GET /api/vm/health — run vm-health-check.sh --json
fastify.get('/vm/health', {
preHandler: async (req) => requireAdmin(req),
}, async (_req, reply) => {
try {
return reply.send(await runVmHealthCheck());
} catch (error) {
fastify.log.error(error, 'VM health check failed');
return reply.code(500).send({ error: 'VM health check failed' });
}
});
// ── Cleanup log (raw tail) ─────────────────────────────────────────────────
// GET /api/vm/cleanup-log?lines=50
fastify.get('/vm/cleanup-log', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
try {
const lines = Math.min(Number((req.query as any).lines) || 50, 500);
return reply.send({ log: await getCleanupLog(lines) });
} catch (error) {
fastify.log.error(error, 'Failed to read cleanup log');
return reply.code(500).send({ error: 'Failed to read cleanup log' });
}
});
// ── Cron status (parsed history + schedule) ───────────────────────────────
// GET /api/vm/cron-status
fastify.get('/vm/cron-status', {
preHandler: async (req) => requireAdmin(req),
}, async (_req, reply) => {
try {
return reply.send(await getCronStatus());
} catch (error) {
fastify.log.error(error, 'Failed to get cron status');
return reply.code(500).send({ error: 'Failed to get cron status' });
}
});
// ── Cleanup trigger ───────────────────────────────────────────────────────
// POST /api/vm/cleanup { mode: 'weekly' | 'monthly' | 'dry-run' }
fastify.post('/vm/cleanup', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
try {
const params = VmCleanupParamsSchema.parse(req.body);
return reply.send(await runVmCleanup(params.mode));
} catch (error: any) {
fastify.log.error(error, 'VM cleanup failed');
return reply.code(500).send({ error: error.message || 'VM cleanup failed' });
}
});
// ── Unhealthy containers ──────────────────────────────────────────────────
// GET /api/vm/containers/unhealthy
fastify.get('/vm/containers/unhealthy', {
preHandler: async (req) => requireAdmin(req),
}, async (_req, reply) => {
try {
return reply.send({ containers: await getUnhealthyContainers() });
} catch (error) {
fastify.log.error(error, 'Failed to get unhealthy containers');
return reply.code(500).send({ error: 'Failed to get unhealthy containers' });
}
});
// POST /api/vm/containers/:name/restart
fastify.post('/vm/containers/:name/restart', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
try {
const { name } = VmContainerRestartParamsSchema.parse(req.params);
const result = await restartContainer(name);
return reply.code(result.success ? 200 : 400).send(result);
} catch (error: any) {
fastify.log.error(error, 'Container restart failed');
return reply.code(500).send({ error: error.message || 'Container restart failed' });
}
});
// ── Ollama / LLM models ───────────────────────────────────────────────────
// GET /api/vm/ollama/models
fastify.get('/vm/ollama/models', {
preHandler: async (req) => requireAdmin(req),
}, async (_req, reply) => {
try {
return reply.send(await getOllamaModels());
} catch (error) {
fastify.log.error(error, 'Failed to get Ollama models');
return reply.code(500).send({ error: 'Failed to get Ollama models' });
}
});
// DELETE /api/vm/ollama/models/:name — unload running model
fastify.delete('/vm/ollama/models/:name', {
preHandler: async (req) => requireAdmin(req),
}, async (req, reply) => {
try {
const name = decodeURIComponent((req.params as any).name ?? '');
const result = await unloadOllamaModel(name);
return reply.code(result.success ? 200 : 400).send(result);
} catch (error: any) {
fastify.log.error(error, 'Failed to unload Ollama model');
return reply.code(500).send({ error: error.message || 'Unload failed' });
}
});
}

View File

@ -0,0 +1,38 @@
import { z } from 'zod';
export const VmCheckLevelSchema = z.enum(['OK', 'WARN', 'CRIT']);
export type VmCheckLevel = z.infer<typeof VmCheckLevelSchema>;
export const VmCheckSchema = z.object({
level: VmCheckLevelSchema,
value: z.string(),
message: z.string(),
});
export type VmCheck = z.infer<typeof VmCheckSchema>;
export const VmHealthResultSchema = z.object({
timestamp: z.string(),
hostname: z.string(),
overall: VmCheckLevelSchema,
checks: z.record(z.string(), VmCheckSchema),
error: z.string().optional(),
});
export type VmHealthResult = z.infer<typeof VmHealthResultSchema>;
export const VmCleanupParamsSchema = z.object({
mode: z.enum(['weekly', 'monthly', 'dry-run']),
});
export type VmCleanupParams = z.infer<typeof VmCleanupParamsSchema>;
export const VmCleanupResultSchema = z.object({
success: z.boolean(),
output: z.string(),
});
export type VmCleanupResult = z.infer<typeof VmCleanupResultSchema>;
// ── Container restart ─────────────────────────────────────────────────────────
export const VmContainerRestartParamsSchema = z.object({
name: z.string().regex(/^[\w-]+$/, 'Invalid container name'),
});
export type VmContainerRestartParams = z.infer<typeof VmContainerRestartParamsSchema>;

View File

@ -13,6 +13,8 @@ import { envRoutes } from './modules/env/routes.js';
import { azureConfigRoutes } from './modules/azure-config/routes.js'; import { azureConfigRoutes } from './modules/azure-config/routes.js';
import { codeQualityRoutes } from './modules/code-quality/routes.js'; import { codeQualityRoutes } from './modules/code-quality/routes.js';
import { cosmosConfigRoutes } from './modules/cosmos-config/routes.js'; import { cosmosConfigRoutes } from './modules/cosmos-config/routes.js';
import { hermesOpsRoutes } from './modules/hermes-ops/routes.js';
import { vmRoutes } from './modules/vm/routes.js';
// import sse from 'fastify-sse-v2'; // import sse from 'fastify-sse-v2';
import rateLimit from '@fastify/rate-limit'; import rateLimit from '@fastify/rate-limit';
import swagger from '@fastify/swagger'; import swagger from '@fastify/swagger';
@ -269,6 +271,8 @@ await fastify.register(envRoutes, { prefix: '/api' });
await fastify.register(azureConfigRoutes, { prefix: '/api' }); await fastify.register(azureConfigRoutes, { prefix: '/api' });
await fastify.register(codeQualityRoutes, { prefix: '/api' }); await fastify.register(codeQualityRoutes, { prefix: '/api' });
await fastify.register(cosmosConfigRoutes, { prefix: '/api' }); await fastify.register(cosmosConfigRoutes, { prefix: '/api' });
await fastify.register(hermesOpsRoutes, { prefix: '/api' });
await fastify.register(vmRoutes, { prefix: '/api' });
// Start server // Start server
async function start() { async function start() {

View File

@ -22,11 +22,27 @@ services:
container_name: devops-backend container_name: devops-backend
env_file: env_file:
- backend/.env - backend/.env
environment:
- VM_SCRIPTS_PATH=/vm-scripts/VMs/HostingerVM
- VM_LOG_DIR=/host-logs
ports: ports:
- '4004:4004' - '127.0.0.1:4004:4004'
networks: networks:
- default - default
- platform_net - platform_net
volumes:
# Read-only access to VM management scripts
- /opt/bytelyst/learning_ai_devops_tools/scripts:/vm-scripts:ro
# Read-write access to VM log files (cleanup + health-check write here)
- /var/log/vm-cleanup.log:/host-logs/vm-cleanup.log
- /var/log/vm-health-check.log:/host-logs/vm-health-check.log
- /var/log/docker-watchdog.log:/host-logs/docker-watchdog.log
# Docker socket — allows running docker commands against the host daemon
# (same pattern as Portainer/cAdvisor; container already runs as root)
- /var/run/docker.sock:/var/run/docker.sock
extra_hosts:
# Reach the host for Ollama API (port 11434) and host-only services
- "host-gateway:host-gateway"
restart: unless-stopped restart: unless-stopped
healthcheck: healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:4004/health'] test: ['CMD', 'curl', '-f', 'http://localhost:4004/health']
@ -49,7 +65,7 @@ services:
NEXT_PUBLIC_DEVOPS_API_URL: https://api.bytelyst.com/devops NEXT_PUBLIC_DEVOPS_API_URL: https://api.bytelyst.com/devops
container_name: devops-web container_name: devops-web
ports: ports:
- '3049:3000' - '127.0.0.1:3049:3000'
networks: networks:
- default - default
- platform_net - platform_net

441
dashboard/pnpm-lock.yaml generated

File diff suppressed because it is too large Load Diff

View File

@ -79,7 +79,6 @@ test.describe('DevOps Dashboard', () => {
await expect(page.getByText('Services and deployments overview')).toBeVisible(); await expect(page.getByText('Services and deployments overview')).toBeVisible();
await expect(page.getByRole('button', { name: /refresh/i })).toBeVisible(); await expect(page.getByRole('button', { name: /refresh/i })).toBeVisible();
await expect(page.getByRole('button', { name: /create service/i })).toBeVisible(); await expect(page.getByRole('button', { name: /create service/i })).toBeVisible();
await expect(page.getByRole('button', { name: /seed services/i })).toBeVisible();
await expect(page.getByRole('heading', { name: 'Investment Trading' })).toBeVisible(); await expect(page.getByRole('heading', { name: 'Investment Trading' })).toBeVisible();
await expect(page.getByText('Recent Deployments')).toBeVisible(); await expect(page.getByText('Recent Deployments')).toBeVisible();
await expect(page.getByRole('cell', { name: '1.2.3' })).toBeVisible(); await expect(page.getByRole('cell', { name: '1.2.3' })).toBeVisible();
@ -91,6 +90,15 @@ test.describe('DevOps Dashboard', () => {
await expect(refreshButton).toBeEnabled(); await expect(refreshButton).toBeEnabled();
await expect(page.getByRole('heading', { name: 'Investment Trading' })).toBeVisible(); await expect(page.getByRole('heading', { name: 'Investment Trading' })).toBeVisible();
}); });
test('renders the dashboard at mobile width', async ({ page }) => {
await page.setViewportSize({ width: 390, height: 844 });
await page.goto('/');
await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
await expect(page.getByRole('button', { name: /create service/i })).toBeVisible();
await expect(page.getByRole('heading', { name: 'Investment Trading' })).toBeVisible();
});
}); });
test('login page renders the platform credential form without baked-in credentials', async ({ page }) => { test('login page renders the platform credential form without baked-in credentials', async ({ page }) => {

View File

@ -30,14 +30,14 @@ test.describe('Hermes Mission Control', () => {
await expect(page.getByRole('heading', { name: 'Hermes Mission Control' })).toBeVisible(); await expect(page.getByRole('heading', { name: 'Hermes Mission Control' })).toBeVisible();
await expect(page.getByText('Active Missions')).toBeVisible(); await expect(page.getByText('Active Missions')).toBeVisible();
await expect(page.getByText('Founder Attention Queue')).toBeVisible(); await expect(page.getByText('Founder Attention Queue')).toBeVisible();
await expect(page.getByText('Product Health Snapshot')).toBeVisible(); await expect(page.getByRole('heading', { name: 'Product Health Snapshot' })).toBeVisible();
await page.getByRole('link', { name: 'Task Ledger' }).click(); await page.getByRole('link', { name: 'Task Ledger' }).click();
await expect(page.getByRole('heading', { name: 'Task Ledger' })).toBeVisible(); await expect(page.getByRole('heading', { name: 'Task Ledger' })).toBeVisible();
await expect(page.getByText('Task table')).toBeVisible(); await expect(page.getByText('Task table')).toBeVisible();
await page.getByRole('link', { name: 'Open' }).first().click(); await page.goto('/hermes/tasks/task-1');
await expect(page.getByText('Hermes learning')).toBeVisible(); await expect(page.getByRole('heading', { name: 'Hermes learning' })).toBeVisible();
await expect(page.getByText('Timeline')).toBeVisible(); await expect(page.getByText('Timeline')).toBeVisible();
await page.goto('/hermes/products'); await page.goto('/hermes/products');
@ -52,4 +52,17 @@ test.describe('Hermes Mission Control', () => {
await page.goto('/hermes/settings'); await page.goto('/hermes/settings');
await expect(page.getByRole('heading', { name: 'Settings & Configuration' })).toBeVisible(); await expect(page.getByRole('heading', { name: 'Settings & Configuration' })).toBeVisible();
}); });
test('renders the mission control overview at mobile width', async ({ page }) => {
await page.setViewportSize({ width: 390, height: 844 });
await page.goto('/hermes');
await expect(page.getByRole('heading', { name: 'Hermes Mission Control' })).toBeVisible();
await expect(page.getByRole('link', { name: 'Task Ledger' })).toBeVisible();
await expect(page.getByRole('link', { name: 'Product Portfolio' })).toBeVisible();
await page.goto('/hermes/tasks/task-1');
await expect(page.getByRole('heading', { name: 'Hermes learning' })).toBeVisible();
await expect(page.getByRole('heading', { name: 'Timeline' })).toBeVisible();
});
}); });

File diff suppressed because it is too large Load Diff

View File

@ -3,3 +3,8 @@
@tailwind base; @tailwind base;
@tailwind components; @tailwind components;
@tailwind utilities; @tailwind utilities;
html,
body {
font-family: var(--ml-font-body), system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
}

View File

@ -5,6 +5,7 @@ import { SidebarNav } from '@/components/sidebar-nav';
import { api } from '@/lib/api'; import { api } from '@/lib/api';
import type { Service, ServiceHealth } from '@/lib/api'; import type { Service, ServiceHealth } from '@/lib/api';
import { Activity, Clock, RefreshCw, TrendingUp } from 'lucide-react'; import { Activity, Clock, RefreshCw, TrendingUp } from 'lucide-react';
import { getStatusColor } from '@/lib/utils';
export default function HealthDashboardPage() { export default function HealthDashboardPage() {
const [services, setServices] = useState<Service[]>([]); const [services, setServices] = useState<Service[]>([]);
@ -52,21 +53,6 @@ export default function HealthDashboardPage() {
return () => clearInterval(interval); return () => clearInterval(interval);
}, [loadData]); }, [loadData]);
function getStatusColor(status: string) {
switch (status) {
case 'up':
case 'success':
return 'text-green-600 bg-green-50 border-green-200';
case 'down':
case 'failed':
return 'text-red-600 bg-red-50 border-red-200';
case 'degraded':
case 'running':
return 'text-yellow-600 bg-yellow-50 border-yellow-200';
default:
return 'text-gray-600 bg-gray-50 border-gray-200';
}
}
function getResponseTimeColor(responseTime?: number) { function getResponseTimeColor(responseTime?: number) {
if (!responseTime) return 'text-gray-500'; if (!responseTime) return 'text-gray-500';

View File

@ -33,7 +33,7 @@ export default function HermesAgentsPage() {
<p className="text-lg font-semibold text-[var(--bl-text-primary)]">{agent.name}</p> <p className="text-lg font-semibold text-[var(--bl-text-primary)]">{agent.name}</p>
<p className="text-sm text-[var(--bl-text-secondary)]">{agent.type} · {agent.callsToday} calls today</p> <p className="text-sm text-[var(--bl-text-secondary)]">{agent.type} · {agent.callsToday} calls today</p>
</div> </div>
<Badge variant={agent.status === 'healthy' ? 'success' : agent.status === 'degraded' ? 'warning' : 'danger'}>{agent.status}</Badge> <Badge variant={agent.status === 'healthy' ? 'success' : agent.status === 'degraded' ? 'warning' : 'error'}>{agent.status}</Badge>
</div> </div>
<div className="mt-4 grid gap-3 text-sm text-[var(--bl-text-secondary)] md:grid-cols-2"> <div className="mt-4 grid gap-3 text-sm text-[var(--bl-text-secondary)] md:grid-cols-2">
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3">Last success: {agent.lastSuccessAt ? new Date(agent.lastSuccessAt).toLocaleString() : '—'}</div> <div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3">Last success: {agent.lastSuccessAt ? new Date(agent.lastSuccessAt).toLocaleString() : '—'}</div>

View File

@ -4,6 +4,7 @@ import Link from 'next/link';
import { ArrowRight, BadgeCheck, Bot, CheckCircle2, Clock3, LayoutDashboard, OctagonAlert, Rocket, ShieldAlert, Sparkles, TriangleAlert } from 'lucide-react'; import { ArrowRight, BadgeCheck, Bot, CheckCircle2, Clock3, LayoutDashboard, OctagonAlert, Rocket, ShieldAlert, Sparkles, TriangleAlert } from 'lucide-react';
import { Badge, Button } from '@/components/ui/Primitives'; import { Badge, Button } from '@/components/ui/Primitives';
import { HermesShell, MetricCard, SectionCard } from '@/components/hermes-shell'; import { HermesShell, MetricCard, SectionCard } from '@/components/hermes-shell';
import { HermesOpsPanel } from '@/components/hermes-ops-panel';
import { import {
getHermesAgents, getHermesAgents,
getHermesOverview, getHermesOverview,
@ -22,14 +23,14 @@ const fmtDate = new Intl.DateTimeFormat('en', {
minute: '2-digit', minute: '2-digit',
}); });
const statusTone: Record<string, 'success' | 'warning' | 'danger' | 'info' | 'default'> = { const statusTone: Record<string, 'success' | 'warning' | 'error' | 'info' | 'neutral'> = {
running: 'info', running: 'info',
idle: 'default', idle: 'neutral',
degraded: 'warning', degraded: 'warning',
error: 'danger', error: 'error',
queued: 'default', queued: 'neutral',
blocked: 'warning', blocked: 'warning',
failed: 'danger', failed: 'error',
completed: 'success', completed: 'success',
}; };
@ -38,7 +39,7 @@ function taskStatusLabel(task: HermesTask) {
} }
function getTaskTone(task: HermesTask) { function getTaskTone(task: HermesTask) {
return statusTone[task.status] ?? 'default'; return statusTone[task.status] ?? 'neutral';
} }
function ProductMiniCard({ product }: { product: HermesProduct }) { function ProductMiniCard({ product }: { product: HermesProduct }) {
@ -117,6 +118,8 @@ export default function HermesMissionControlPage() {
<MetricCard label="Success rate" value={`${overview.successRate}%`} tone="success" icon={<BadgeCheck className="h-5 w-5" />} helpText={`${overview.productsTouchedRecently} products touched in the last 14 days`} /> <MetricCard label="Success rate" value={`${overview.successRate}%`} tone="success" icon={<BadgeCheck className="h-5 w-5" />} helpText={`${overview.productsTouchedRecently} products touched in the last 14 days`} />
</section> </section>
<HermesOpsPanel />
<div className="grid gap-6 xl:grid-cols-[1.5fr_1fr]"> <div className="grid gap-6 xl:grid-cols-[1.5fr_1fr]">
<SectionCard title="Active Missions" subtitle="What Hermes is currently running or waiting on." actions={<Button asChild variant="ghost" size="sm"><Link href="/hermes/tasks">View all tasks <ArrowRight className="ml-2 h-4 w-4" /></Link></Button>}> <SectionCard title="Active Missions" subtitle="What Hermes is currently running or waiting on." actions={<Button asChild variant="ghost" size="sm"><Link href="/hermes/tasks">View all tasks <ArrowRight className="ml-2 h-4 w-4" /></Link></Button>}>
<div className="space-y-3"> <div className="space-y-3">
@ -162,7 +165,7 @@ export default function HermesMissionControlPage() {
<Link href={`/hermes/tasks/${task.id}`} className="font-medium text-[var(--bl-text-primary)] hover:underline">{task.title}</Link> <Link href={`/hermes/tasks/${task.id}`} className="font-medium text-[var(--bl-text-primary)] hover:underline">{task.title}</Link>
<p className="text-sm text-[var(--bl-text-secondary)]">{task.blockerReason ?? task.error ?? task.nextAction}</p> <p className="text-sm text-[var(--bl-text-secondary)]">{task.blockerReason ?? task.error ?? task.nextAction}</p>
</div> </div>
<Badge variant={task.status === 'failed' ? 'danger' : 'warning'}>{task.status}</Badge> <Badge variant={task.status === 'failed' ? 'error' : 'warning'}>{task.status}</Badge>
</div> </div>
</div> </div>
))} ))}
@ -256,7 +259,7 @@ export default function HermesMissionControlPage() {
<p className="text-sm text-[var(--bl-text-secondary)]">{agent.type} · {agent.callsToday} calls today</p> <p className="text-sm text-[var(--bl-text-secondary)]">{agent.type} · {agent.callsToday} calls today</p>
{agent.configIssue ? <p className="mt-1 text-sm text-[var(--bl-warning)]">{agent.configIssue}</p> : null} {agent.configIssue ? <p className="mt-1 text-sm text-[var(--bl-warning)]">{agent.configIssue}</p> : null}
</div> </div>
<Badge variant={agent.status === 'healthy' ? 'success' : agent.status === 'degraded' ? 'warning' : 'danger'}>{agent.status}</Badge> <Badge variant={agent.status === 'healthy' ? 'success' : agent.status === 'degraded' ? 'warning' : 'error'}>{agent.status}</Badge>
</div> </div>
</div> </div>
))} ))}

View File

@ -20,7 +20,7 @@ function getHealthTone(score: number) {
if (score >= 85) return 'success'; if (score >= 85) return 'success';
if (score >= 70) return 'info'; if (score >= 70) return 'info';
if (score >= 55) return 'warning'; if (score >= 55) return 'warning';
return 'danger'; return 'error';
} }
function ProductCard({ product }: { product: HermesProduct }) { function ProductCard({ product }: { product: HermesProduct }) {

View File

@ -1,6 +1,7 @@
'use client'; 'use client';
import Link from 'next/link'; import Link from 'next/link';
import { useParams } from 'next/navigation';
import { ArrowLeft, CircleDashed, Clock3, ShieldAlert, Sparkles } from 'lucide-react'; import { ArrowLeft, CircleDashed, Clock3, ShieldAlert, Sparkles } from 'lucide-react';
import { Badge, Button } from '@/components/ui/Primitives'; import { Badge, Button } from '@/components/ui/Primitives';
import { HermesShell, MetricCard, SectionCard } from '@/components/hermes-shell'; import { HermesShell, MetricCard, SectionCard } from '@/components/hermes-shell';
@ -8,25 +9,27 @@ import { getHermesProductById, getHermesTaskById, getHermesTaskEvents } from '@/
const fmt = new Intl.DateTimeFormat('en', { month: 'short', day: 'numeric', hour: 'numeric', minute: '2-digit' }); const fmt = new Intl.DateTimeFormat('en', { month: 'short', day: 'numeric', hour: 'numeric', minute: '2-digit' });
function levelTone(level: 'debug' | 'info' | 'warn' | 'error' | 'success') { function levelTone(level: 'debug' | 'info' | 'warn' | 'error' | 'success'): 'success' | 'warning' | 'error' | 'neutral' | 'info' {
switch (level) { switch (level) {
case 'success': return 'success'; case 'success': return 'success';
case 'warn': return 'warning'; case 'warn': return 'warning';
case 'error': return 'danger'; case 'error': return 'error';
case 'debug': return 'neutral'; case 'debug': return 'neutral';
default: return 'info'; default: return 'info';
} }
} }
export default function HermesTaskDetailPage({ params }: { params: { id: string } }) { export default function HermesTaskDetailPage({ params }: { params: { id: string } }) {
const task = getHermesTaskById(params.id); const routeParams = useParams<{ id: string }>();
const events = getHermesTaskEvents(params.id); const taskId = routeParams?.id ?? params.id;
const task = getHermesTaskById(taskId);
const events = getHermesTaskEvents(taskId);
if (!task) { if (!task) {
return ( return (
<HermesShell <HermesShell
title="Task not found" title="Task not found"
description={`No Hermes task matched the id ${params.id}.`} description={`No Hermes task matched the id ${taskId}.`}
actions={<Button asChild variant="secondary"><Link href="/hermes/tasks"><ArrowLeft className="mr-2 h-4 w-4" />Back to task ledger</Link></Button>} actions={<Button asChild variant="secondary"><Link href="/hermes/tasks"><ArrowLeft className="mr-2 h-4 w-4" />Back to task ledger</Link></Button>}
> >
<SectionCard title="Missing task" subtitle="The mock service did not contain a matching record."> <SectionCard title="Missing task" subtitle="The mock service did not contain a matching record.">
@ -58,7 +61,7 @@ export default function HermesTaskDetailPage({ params }: { params: { id: string
<div className="grid gap-4 lg:grid-cols-2"> <div className="grid gap-4 lg:grid-cols-2">
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4 space-y-3"> <div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4 space-y-3">
<div className="flex items-center gap-2"> <div className="flex items-center gap-2">
<Badge variant={task.status === 'completed' ? 'success' : task.status === 'failed' ? 'danger' : task.status === 'blocked' ? 'warning' : 'neutral'}>{task.status}</Badge> <Badge variant={task.status === 'completed' ? 'success' : task.status === 'failed' ? 'error' : task.status === 'blocked' ? 'warning' : 'neutral'}>{task.status}</Badge>
<Badge variant="neutral">{task.source}</Badge> <Badge variant="neutral">{task.source}</Badge>
</div> </div>
<div className="space-y-2 text-sm text-[var(--bl-text-secondary)]"> <div className="space-y-2 text-sm text-[var(--bl-text-secondary)]">

View File

@ -146,8 +146,8 @@ export default function HermesTaskLedgerPage() {
</div> </div>
</td> </td>
<td className="px-4 py-4 text-[var(--bl-text-secondary)]">{product?.name ?? 'Unknown'}</td> <td className="px-4 py-4 text-[var(--bl-text-secondary)]">{product?.name ?? 'Unknown'}</td>
<td className="px-4 py-4"><Badge variant={task.status === 'completed' ? 'success' : task.status === 'failed' ? 'danger' : task.status === 'blocked' ? 'warning' : 'neutral'}>{task.status}</Badge></td> <td className="px-4 py-4"><Badge variant={task.status === 'completed' ? 'success' : task.status === 'failed' ? 'error' : task.status === 'blocked' ? 'warning' : 'neutral'}>{task.status}</Badge></td>
<td className="px-4 py-4"><Badge variant={task.priority === 'P0' ? 'danger' : task.priority === 'P1' ? 'warning' : 'neutral'}>{task.priority}</Badge></td> <td className="px-4 py-4"><Badge variant={task.priority === 'P0' ? 'error' : task.priority === 'P1' ? 'warning' : 'neutral'}>{task.priority}</Badge></td>
<td className="px-4 py-4 text-[var(--bl-text-secondary)]">{task.type}</td> <td className="px-4 py-4 text-[var(--bl-text-secondary)]">{task.type}</td>
<td className="px-4 py-4 text-[var(--bl-text-secondary)]">{task.source}</td> <td className="px-4 py-4 text-[var(--bl-text-secondary)]">{task.source}</td>
<td className="px-4 py-4 text-[var(--bl-text-secondary)]">{prettyDate(task.createdAt)}</td> <td className="px-4 py-4 text-[var(--bl-text-secondary)]">{prettyDate(task.createdAt)}</td>

View File

@ -1,11 +1,8 @@
import type { Metadata, Viewport } from 'next'; import type { Metadata, Viewport } from 'next';
import { Inter } from 'next/font/google';
import './globals.css'; import './globals.css';
import { AuthProvider } from '@/lib/auth'; import { AuthProvider } from '@/lib/auth';
import { ErrorBoundary } from '@/components/error-boundary'; import { ErrorBoundary } from '@/components/error-boundary';
const inter = Inter({ subsets: ['latin'] });
export const metadata: Metadata = { export const metadata: Metadata = {
title: 'ByteLyst DevOps', title: 'ByteLyst DevOps',
description: 'Internal DevOps dashboard for deployment orchestration', description: 'Internal DevOps dashboard for deployment orchestration',
@ -31,7 +28,7 @@ export default function RootLayout({
}>) { }>) {
return ( return (
<html lang="en"> <html lang="en">
<body className={inter.className}> <body>
<a <a
href="#main-content" href="#main-content"
className="sr-only focus:not-sr-only focus:absolute focus:top-4 focus:left-4 focus:z-50 focus:px-4 focus:py-2 focus:bg-blue-600 focus:text-white focus:rounded-md" className="sr-only focus:not-sr-only focus:absolute focus:top-4 focus:left-4 focus:z-50 focus:px-4 focus:py-2 focus:bg-blue-600 focus:text-white focus:rounded-md"

View File

@ -151,38 +151,58 @@ export default function MetricsPage() {
{/* Deployment Trend Chart */} {/* Deployment Trend Chart */}
<div className="bg-white border border-gray-200 rounded-lg p-6 mb-8"> <div className="bg-white border border-gray-200 rounded-lg p-6 mb-8">
<h2 className="text-lg font-semibold text-gray-900 mb-4">Deployment Trend (Last 7 Days)</h2> <h2 className="text-lg font-semibold text-gray-900 mb-4">Deployment Trend (Last 7 Days)</h2>
<div className="flex items-end justify-between h-64 gap-2"> {deployments.length === 0 ? (
{deploymentTrend.map((day) => ( <div className="h-48 flex items-center justify-center text-sm text-gray-400">
<div key={day.date} className="flex-1 flex flex-col items-center"> No deployment data yet
<div className="w-full flex flex-col gap-1"> </div>
<div ) : (
className="w-full bg-green-500 rounded-t" <>
style={{ height: `${(day.success / maxCount) * 100}%`, minHeight: day.success > 0 ? '4px' : '0' }} {/* Bar chart — each column is a flex column-reverse so bars grow from bottom */}
title={`Success: ${day.success}`} <div className="flex items-end justify-between gap-2 h-48">
/> {deploymentTrend.map((day) => (
<div <div key={day.date} className="flex-1 flex flex-col items-center gap-1">
className="w-full bg-red-500 rounded-b" {/* Count label above bars */}
style={{ height: `${(day.failed / maxCount) * 100}%`, minHeight: day.failed > 0 ? '4px' : '0' }} <span className="text-xs font-medium text-gray-500 mb-1">
title={`Failed: ${day.failed}`} {day.count > 0 ? day.count : ''}
/> </span>
</div> {/* Stacked bar (grows upward via flex-col-reverse) */}
<div className="mt-2 text-xs text-gray-500 text-center"> <div
{new Date(day.date).toLocaleDateString('en', { weekday: 'short' })} className="w-full flex flex-col-reverse gap-px overflow-hidden rounded-sm"
</div> style={{ height: `${Math.max((day.count / maxCount) * 160, day.count > 0 ? 4 : 0)}px` }}
<div className="text-xs font-medium text-gray-700">{day.count}</div> >
{day.failed > 0 && (
<div
className="w-full bg-red-500 flex-shrink-0"
style={{ height: `${(day.failed / day.count) * 100}%` }}
title={`Failed: ${day.failed}`}
/>
)}
{day.success > 0 && (
<div
className="w-full bg-green-500 flex-shrink-0"
style={{ height: `${(day.success / day.count) * 100}%` }}
title={`Success: ${day.success}`}
/>
)}
</div>
<div className="mt-1 text-xs text-gray-500">
{new Date(day.date + 'T12:00:00').toLocaleDateString('en', { weekday: 'short' })}
</div>
</div>
))}
</div> </div>
))} <div className="flex items-center justify-center gap-6 mt-3">
</div> <div className="flex items-center gap-2">
<div className="flex items-center justify-center gap-6 mt-4"> <div className="w-3 h-3 bg-green-500 rounded-sm" />
<div className="flex items-center gap-2"> <span className="text-sm text-gray-600">Success</span>
<div className="w-3 h-3 bg-green-500 rounded" /> </div>
<span className="text-sm text-gray-600">Success</span> <div className="flex items-center gap-2">
</div> <div className="w-3 h-3 bg-red-500 rounded-sm" />
<div className="flex items-center gap-2"> <span className="text-sm text-gray-600">Failed</span>
<div className="w-3 h-3 bg-red-500 rounded" /> </div>
<span className="text-sm text-gray-600">Failed</span> </div>
</div> </>
</div> )}
</div> </div>
{/* Deployments by Service */} {/* Deployments by Service */}

View File

@ -5,7 +5,8 @@ import { SidebarNav } from '@/components/sidebar-nav';
import { api } from '@/lib/api'; import { api } from '@/lib/api';
import type { Service, Deployment } from '@/lib/api'; import type { Service, Deployment } from '@/lib/api';
import { useAuth } from '@/lib/auth'; import { useAuth } from '@/lib/auth';
import { Play, Activity, Clock, RefreshCw, Plus, Edit, Trash2, FileText } from 'lucide-react'; import { Play, Activity, Clock, RefreshCw, Plus, Edit, Trash2, FileText, AlertCircle } from 'lucide-react';
import { getStatusColor } from '@/lib/utils';
import { ServiceForm } from '@/components/service-form'; import { ServiceForm } from '@/components/service-form';
import { LogViewer } from '@/components/log-viewer'; import { LogViewer } from '@/components/log-viewer';
@ -126,22 +127,6 @@ export default function DashboardPage() {
setViewingLogsDeployment(null); setViewingLogsDeployment(null);
} }
function getStatusColor(status: string) {
switch (status) {
case 'up':
case 'success':
return 'text-green-600 bg-green-50 border-green-200';
case 'down':
case 'failed':
return 'text-red-600 bg-red-50 border-red-200';
case 'degraded':
case 'running':
return 'text-yellow-600 bg-yellow-50 border-yellow-200';
default:
return 'text-gray-600 bg-gray-50 border-gray-200';
}
}
if (loading) { if (loading) {
return ( return (
<div className="min-h-screen flex items-center justify-center"> <div className="min-h-screen flex items-center justify-center">
@ -188,12 +173,6 @@ export default function DashboardPage() {
<Plus className="w-4 h-4" /> <Plus className="w-4 h-4" />
Create Service Create Service
</button> </button>
<button
onClick={() => api.seedServices().then(loadData)}
className="px-4 py-2 text-sm font-medium text-blue-600 bg-blue-50 border border-blue-200 rounded-md hover:bg-blue-100 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2"
>
Seed Services
</button>
<button <button
onClick={refreshHealth} onClick={refreshHealth}
disabled={refreshing} disabled={refreshing}
@ -269,10 +248,15 @@ export default function DashboardPage() {
)} )}
</div> </div>
{service.status !== 'up' && (
<div className="flex items-center gap-1.5 mb-2 text-xs text-yellow-700 bg-yellow-50 border border-yellow-200 rounded px-2 py-1">
<AlertCircle className="w-3.5 h-3.5 flex-shrink-0" />
Service is {service.status} deploy will attempt a fix
</div>
)}
<button <button
onClick={() => handleDeploy(service.id)} onClick={() => handleDeploy(service.id)}
disabled={service.status !== 'up'} className="w-full flex items-center justify-center gap-2 px-4 py-2 bg-blue-600 text-white rounded-md hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2"
className="w-full flex items-center justify-center gap-2 px-4 py-2 bg-blue-600 text-white rounded-md hover:bg-blue-700 disabled:bg-gray-300 disabled:cursor-not-allowed focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2"
> >
<Play className="w-4 h-4" /> <Play className="w-4 h-4" />
Deploy Deploy

View File

@ -1,7 +1,8 @@
'use client'; 'use client';
import { useEffect, useState } from 'react'; import { useEffect, useState } from 'react';
import { api } from '@/lib/api'; import { apiRequest } from '@/lib/api';
import { formatBytes } from '@/lib/utils';
import { Cpu, HardDrive, Database, Trash2, RefreshCw, AlertTriangle, CheckCircle } from 'lucide-react'; import { Cpu, HardDrive, Database, Trash2, RefreshCw, AlertTriangle, CheckCircle } from 'lucide-react';
import { SidebarNav } from '@/components/sidebar-nav'; import { SidebarNav } from '@/components/sidebar-nav';
@ -63,16 +64,8 @@ export default function SystemPage() {
const loadData = async () => { const loadData = async () => {
try { try {
const [metricsData, dockerData] = await Promise.all([ const [metricsData, dockerData] = await Promise.all([
fetch(`${process.env.NEXT_PUBLIC_DEVOPS_API_URL || 'http://localhost:4004'}/api/system/metrics`, { apiRequest<SystemMetrics>('/api/system/metrics'),
headers: { apiRequest<DockerStats>('/api/docker/stats'),
'Authorization': `Bearer ${localStorage.getItem('access_token')}`,
},
}).then(r => r.json()),
fetch(`${process.env.NEXT_PUBLIC_DEVOPS_API_URL || 'http://localhost:4004'}/api/docker/stats`, {
headers: {
'Authorization': `Bearer ${localStorage.getItem('access_token')}`,
},
}).then(r => r.json()),
]); ]);
setMetrics(metricsData); setMetrics(metricsData);
setDockerStats(dockerData); setDockerStats(dockerData);
@ -101,15 +94,10 @@ export default function SystemPage() {
} }
try { try {
const response = await fetch(`${process.env.NEXT_PUBLIC_DEVOPS_API_URL || 'http://localhost:4004'}/api/docker/cleanup`, { const result = await apiRequest<{ message: string; freedSpace: number }>('/api/docker/cleanup', {
method: 'POST', method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${localStorage.getItem('access_token')}`,
},
body: JSON.stringify({ type, force }), body: JSON.stringify({ type, force }),
}); });
const result = await response.json();
setCleanupResult(result); setCleanupResult(result);
loadData(); loadData();
} catch (error) { } catch (error) {
@ -118,14 +106,6 @@ export default function SystemPage() {
} }
}; };
const formatBytes = (bytes: number): string => {
if (bytes === 0) return '0 B';
const k = 1024;
const sizes = ['B', 'KB', 'MB', 'GB', 'TB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
};
const getUsageColor = (percentage: number): string => { const getUsageColor = (percentage: number): string => {
if (percentage < 50) return 'text-green-600 bg-green-50'; if (percentage < 50) return 'text-green-600 bg-green-50';
if (percentage < 75) return 'text-yellow-600 bg-yellow-50'; if (percentage < 75) return 'text-yellow-600 bg-yellow-50';

View File

@ -0,0 +1,886 @@
'use client';
import { useEffect, useState, useCallback } from 'react';
import { SidebarNav } from '@/components/sidebar-nav';
import {
vmApi,
type VmHealthResult,
type VmCheckLevel,
type CronStatusResponse,
type UnhealthyContainer,
type OllamaModelsResponse,
} from '@/lib/api';
import {
CheckCircle,
AlertTriangle,
XCircle,
RefreshCw,
HardDrive,
Cpu,
Database,
Server,
Activity,
Layers,
ScrollText,
Trash2,
Terminal,
ChevronDown,
ChevronUp,
Clock,
Bot,
RotateCw,
Gauge,
Shield,
Zap,
MemoryStick,
} from 'lucide-react';
// ── Types ──────────────────────────────────────────────────────────────────
type Level = VmCheckLevel;
// ── Shared helpers ─────────────────────────────────────────────────────────
function levelColor(level: Level) {
switch (level) {
case 'OK': return 'text-green-700 bg-green-50 border-green-200';
case 'WARN': return 'text-yellow-700 bg-yellow-50 border-yellow-200';
case 'CRIT': return 'text-red-700 bg-red-50 border-red-200';
}
}
function levelBadge(level: Level) {
switch (level) {
case 'OK': return 'bg-green-100 text-green-800';
case 'WARN': return 'bg-yellow-100 text-yellow-800';
case 'CRIT': return 'bg-red-100 text-red-800';
}
}
function LevelIcon({ level, className = 'w-5 h-5' }: { level: Level; className?: string }) {
switch (level) {
case 'OK': return <CheckCircle className={`${className} text-green-600`} />;
case 'WARN': return <AlertTriangle className={`${className} text-yellow-600`} />;
case 'CRIT': return <XCircle className={`${className} text-red-600`} />;
}
}
function relativeTime(iso: string | null | undefined): string {
if (!iso) return '—';
const diff = Date.now() - new Date(iso).getTime();
if (isNaN(diff)) return '—';
const mins = Math.floor(diff / 60000);
if (mins < 2) return 'just now';
if (mins < 60) return `${mins}m ago`;
const hrs = Math.floor(mins / 60);
if (hrs < 24) return `${hrs}h ago`;
const days = Math.floor(hrs / 24);
return `${days}d ago`;
}
function formatDate(iso: string | null | undefined): string {
if (!iso) return '—';
const d = new Date(iso);
if (isNaN(d.getTime())) return '—';
return d.toLocaleDateString(undefined, { month: 'short', day: 'numeric', hour: '2-digit', minute: '2-digit' });
}
// ── Score card ─────────────────────────────────────────────────────────────
interface ScoreDim { label: string; pts: number; maxPts: number; detail: string }
function computeScore(
health: VmHealthResult | null,
unhealthyCount: number,
cronData: CronStatusResponse | null,
): { total: number; dims: ScoreDim[] } {
const checks = health?.checks ?? {};
// 1. CPU efficiency — steal %
const stealPct = parseFloat((checks.steal?.value ?? '0%').match(/^([\d.]+)/)?.[1] ?? '0');
const cpuPts = stealPct < 2 ? 20 : stealPct < 5 ? 15 : stealPct < 10 ? 10 : 5;
// 2. Memory pressure — available GB (first number in value, e.g. "11G / 15G")
const ramAvailGb = parseFloat((checks.ram?.value ?? '0G').match(/^([\d.]+)/)?.[1] ?? '0');
const ramPts = ramAvailGb > 6 ? 20 : ramAvailGb > 3 ? 15 : ramAvailGb > 1 ? 5 : 0;
// 3. Disk health — used % (first number in value, e.g. "37% used, 123G free")
const diskPct = parseInt((checks.disk?.value ?? '100%').match(/^(\d+)/)?.[1] ?? '100', 10);
const diskPts = diskPct < 40 ? 15 : diskPct < 55 ? 10 : diskPct < 70 ? 5 : 0;
// 4. Service health — unhealthy container count
const svcPts = unhealthyCount === 0 ? 20 : unhealthyCount <= 2 ? 15 : unhealthyCount <= 5 ? 8 : 2;
// 5. Maintenance hygiene — last successful cleanup
let maintPts = 0;
let maintDetail = 'no history';
if (cronData?.recentRuns?.length) {
const lastRun = cronData.recentRuns.find(r => r.success);
if (lastRun) {
const daysSince = (Date.now() - new Date(lastRun.timestamp).getTime()) / 86_400_000;
maintPts = daysSince < 7 && lastRun.freedMB > 0 ? 15 : daysSince < 30 ? 8 : 0;
maintDetail = `${Math.floor(daysSince)}d ago, freed ${lastRun.freedMB > 0 ? lastRun.freedMB + ' MB' : '0 MB'}`;
}
}
// 6. LLM readiness — available RAM
const llmPts = ramAvailGb > 8 ? 10 : ramAvailGb > 4 ? 7 : ramAvailGb > 2 ? 4 : 1;
const total = cpuPts + ramPts + diskPts + svcPts + maintPts + llmPts;
return {
total,
dims: [
{ label: 'CPU Efficiency', pts: cpuPts, maxPts: 20, detail: `${stealPct.toFixed(1)}% steal` },
{ label: 'Memory Pressure', pts: ramPts, maxPts: 20, detail: `${ramAvailGb}G available` },
{ label: 'Disk Health', pts: diskPts, maxPts: 15, detail: `${diskPct}% used` },
{ label: 'Service Health', pts: svcPts, maxPts: 20, detail: `${unhealthyCount} unhealthy` },
{ label: 'Maintenance', pts: maintPts, maxPts: 15, detail: maintDetail },
{ label: 'LLM Readiness', pts: llmPts, maxPts: 10, detail: `${ramAvailGb}G free` },
],
};
}
function ScoreCard({
health,
unhealthyCount,
cronData,
}: {
health: VmHealthResult | null;
unhealthyCount: number;
cronData: CronStatusResponse | null;
}) {
const { total, dims } = computeScore(health, unhealthyCount, cronData);
const scoreColor = total >= 80 ? 'text-green-600' : total >= 60 ? 'text-yellow-600' : 'text-red-600';
const scoreBg = total >= 80 ? 'bg-green-50 border-green-200' : total >= 60 ? 'bg-yellow-50 border-yellow-200' : 'bg-red-50 border-red-200';
return (
<div className={`rounded-lg border p-5 ${scoreBg}`}>
<div className="flex items-start gap-5">
{/* Score gauge */}
<div className="flex flex-col items-center gap-1 min-w-[80px]">
<Gauge className="w-5 h-5 text-gray-400" />
<span className={`text-5xl font-bold tabular-nums ${scoreColor}`}>{total}</span>
<span className="text-xs text-gray-400 font-medium">/ 100</span>
</div>
{/* Dimension breakdown */}
<div className="flex-1 grid grid-cols-2 sm:grid-cols-3 gap-x-6 gap-y-2">
{dims.map(d => {
const pct = Math.round((d.pts / d.maxPts) * 100);
const barColor = pct >= 75 ? 'bg-green-400' : pct >= 50 ? 'bg-yellow-400' : 'bg-red-400';
return (
<div key={d.label}>
<div className="flex items-center justify-between mb-0.5">
<span className="text-xs font-medium text-gray-600 truncate">{d.label}</span>
<span className="text-xs font-bold text-gray-700 ml-2 tabular-nums">{d.pts}/{d.maxPts}</span>
</div>
<div className="h-1.5 bg-gray-200 rounded-full overflow-hidden">
<div className={`h-full rounded-full ${barColor}`} style={{ width: `${pct}%` }} />
</div>
<p className="text-xs text-gray-400 mt-0.5">{d.detail}</p>
</div>
);
})}
</div>
</div>
</div>
);
}
// ── Unhealthy containers panel ─────────────────────────────────────────────
function UnhealthyContainersPanel({
containers,
onRestart,
restarting,
}: {
containers: UnhealthyContainer[];
onRestart: (name: string) => Promise<void>;
restarting: Set<string>;
}) {
const [expanded, setExpanded] = useState<Set<string>>(new Set());
const toggle = (name: string) =>
setExpanded(prev => {
const next = new Set(prev);
next.has(name) ? next.delete(name) : next.add(name);
return next;
});
if (containers.length === 0) return null;
return (
<div className="bg-white border border-yellow-200 rounded-lg overflow-hidden">
<div className="px-6 py-4 bg-yellow-50 border-b border-yellow-200 flex items-center justify-between">
<div className="flex items-center gap-2">
<AlertTriangle className="w-5 h-5 text-yellow-600" />
<span className="font-semibold text-yellow-900">
{containers.length} Unhealthy Container{containers.length !== 1 ? 's' : ''}
</span>
<span className="text-xs text-yellow-600">process alive, health endpoint failing</span>
</div>
</div>
<div className="divide-y divide-gray-100">
{containers.map(c => (
<div key={c.name}>
<div className="px-6 py-3 flex items-center gap-3">
<button
onClick={() => toggle(c.name)}
className="flex-1 flex items-center gap-3 text-left"
>
{expanded.has(c.name)
? <ChevronUp className="w-4 h-4 text-gray-400 flex-shrink-0" />
: <ChevronDown className="w-4 h-4 text-gray-400 flex-shrink-0" />}
<span className="font-mono text-sm font-medium text-gray-900">{c.name}</span>
<span className="text-xs text-gray-500">
unhealthy {relativeTime(c.unhealthySince)} · {c.restartCount} restarts
</span>
</button>
<button
onClick={() => onRestart(c.name)}
disabled={restarting.has(c.name)}
className="flex items-center gap-1.5 px-3 py-1.5 text-xs font-medium text-blue-700 bg-blue-50 border border-blue-200 rounded-md hover:bg-blue-100 disabled:opacity-50 flex-shrink-0"
>
<RotateCw className={`w-3.5 h-3.5 ${restarting.has(c.name) ? 'animate-spin' : ''}`} />
Restart
</button>
</div>
{expanded.has(c.name) && c.lastHealthLogs.length > 0 && (
<div className="px-6 pb-3">
<p className="text-xs font-medium text-gray-500 mb-1">Last health check output:</p>
<pre className="text-xs font-mono bg-gray-50 rounded p-2 whitespace-pre-wrap text-gray-700 max-h-32 overflow-y-auto">
{c.lastHealthLogs.join('\n') || '(no output)'}
</pre>
</div>
)}
</div>
))}
</div>
</div>
);
}
// ── Cron status panel ──────────────────────────────────────────────────────
function CronStatusPanel({ data }: { data: CronStatusResponse | null }) {
const [expandedRun, setExpandedRun] = useState<string | null>(null);
if (!data) {
return (
<div className="bg-white border border-gray-200 rounded-lg p-6">
<div className="flex items-center gap-2 text-gray-400">
<Clock className="w-5 h-5" />
<span className="text-sm">Maintenance schedule not available</span>
</div>
</div>
);
}
const { jobs, recentRuns } = data;
return (
<div className="bg-white border border-gray-200 rounded-lg overflow-hidden">
<div className="px-6 py-4 border-b border-gray-100 flex items-center gap-2">
<Clock className="w-5 h-5 text-gray-500" />
<h2 className="font-semibold text-gray-900">Maintenance Schedule</h2>
{recentRuns.length > 0 && (
<span className="ml-auto text-xs text-gray-400">
{recentRuns.length} run{recentRuns.length !== 1 ? 's' : ''} in log
</span>
)}
</div>
{/* Jobs table */}
<div className="overflow-x-auto">
<table className="w-full text-sm">
<thead>
<tr className="border-b border-gray-100 bg-gray-50">
<th className="px-6 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wide">Job</th>
<th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wide">Schedule</th>
<th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wide">Last Run</th>
<th className="px-4 py-2 text-right text-xs font-medium text-gray-500 uppercase tracking-wide">Freed</th>
<th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wide">Status</th>
<th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wide">Next Run</th>
</tr>
</thead>
<tbody className="divide-y divide-gray-50">
{jobs.map(job => {
const lr = job.lastRun;
return (
<tr key={job.name} className="hover:bg-gray-50">
<td className="px-6 py-3">
<p className="font-medium text-gray-900">{job.description}</p>
<p className="text-xs text-gray-400 font-mono">{job.name}</p>
</td>
<td className="px-4 py-3 font-mono text-xs text-gray-600">{job.schedule}</td>
<td className="px-4 py-3 text-gray-600">
{lr ? (
<span title={lr.timestamp}>{relativeTime(lr.timestamp)}</span>
) : (
<span className="text-gray-400">never</span>
)}
</td>
<td className="px-4 py-3 text-right font-mono text-xs">
{lr && lr.freedMB > 0
? <span className="text-green-700">+{lr.freedMB} MB</span>
: <span className="text-gray-400"></span>}
</td>
<td className="px-4 py-3">
{lr ? (
<span className={`inline-flex items-center gap-1 text-xs font-medium ${lr.success ? 'text-green-700' : 'text-red-700'}`}>
{lr.success
? <CheckCircle className="w-3.5 h-3.5" />
: <XCircle className="w-3.5 h-3.5" />}
{lr.success ? 'OK' : 'Failed'}
</span>
) : (
<span className="text-gray-400 text-xs"></span>
)}
</td>
<td className="px-4 py-3 text-xs text-gray-500">
{job.nextRun ? (
<span title={job.nextRun}>{relativeTime(job.nextRun).replace('ago', '').trim() || formatDate(job.nextRun)}</span>
) : '—'}
</td>
</tr>
);
})}
</tbody>
</table>
</div>
{/* Recent runs — collapsible step log */}
{recentRuns.length > 0 && (
<div className="border-t border-gray-100">
<div className="px-6 py-3">
<p className="text-xs font-medium text-gray-500 uppercase tracking-wide mb-2">Recent Runs</p>
<div className="space-y-1">
{recentRuns.slice(0, 10).map((run, i) => (
<div key={`${run.timestamp}-${i}`} className="rounded-md border border-gray-100 overflow-hidden">
<button
className="w-full flex items-center gap-3 px-4 py-2 text-left hover:bg-gray-50 text-sm"
onClick={() => setExpandedRun(expandedRun === run.timestamp ? null : run.timestamp)}
>
{run.success
? <CheckCircle className="w-4 h-4 text-green-500 flex-shrink-0" />
: <XCircle className="w-4 h-4 text-red-500 flex-shrink-0" />}
<span className="text-gray-600">{formatDate(run.timestamp)}</span>
<span className={`text-xs px-1.5 py-0.5 rounded font-medium ${run.mode === 'full' ? 'bg-orange-100 text-orange-700' : 'bg-blue-100 text-blue-700'}`}>
{run.mode}
</span>
{run.freedMB > 0 && (
<span className="text-xs text-green-700 font-medium">freed {run.freedMB} MB</span>
)}
<span className="ml-auto text-xs text-gray-400">{run.durationSecs}s</span>
{expandedRun === run.timestamp
? <ChevronUp className="w-3.5 h-3.5 text-gray-400" />
: <ChevronDown className="w-3.5 h-3.5 text-gray-400" />}
</button>
{expandedRun === run.timestamp && run.steps.length > 0 && (
<div className="px-4 pb-3 border-t border-gray-100">
<pre className="text-xs font-mono bg-gray-50 rounded p-2 mt-2 whitespace-pre-wrap text-gray-700 max-h-40 overflow-y-auto">
{run.steps.join('\n')}
</pre>
</div>
)}
</div>
))}
</div>
</div>
</div>
)}
</div>
);
}
// ── Ollama / LLM panel ─────────────────────────────────────────────────────
function OllamaPanel({
data,
ramAvailGb,
onUnload,
unloading,
}: {
data: OllamaModelsResponse | null;
ramAvailGb: number;
onUnload: (name: string) => Promise<void>;
unloading: Set<string>;
}) {
if (!data) return null;
const { models, running } = data;
if (models.length === 0 && running.length === 0) return null;
return (
<div className="bg-white border border-gray-200 rounded-lg overflow-hidden">
<div className="px-6 py-4 border-b border-gray-100 flex items-center gap-2">
<Bot className="w-5 h-5 text-gray-500" />
<h2 className="font-semibold text-gray-900">LLM Models (Ollama)</h2>
<span className="ml-auto text-xs text-gray-400">
{models.length} on disk · {running.length} loaded
</span>
</div>
{/* Currently loaded */}
{running.length > 0 && (
<div className="px-6 py-3 bg-purple-50 border-b border-purple-100">
<p className="text-xs font-medium text-purple-700 uppercase tracking-wide mb-2">Currently Loaded</p>
{running.map(r => {
const ramAfterUnloadGb = ramAvailGb + r.sizeGB;
const pressureAfter = ramAfterUnloadGb < 2;
return (
<div key={r.name} className="flex items-center gap-3">
<div className="flex-1">
<span className="font-mono text-sm font-medium text-gray-900">{r.name}</span>
<span className="ml-2 text-xs text-gray-500">{r.sizeGB} GB · {r.processor || 'CPU'}</span>
{r.expiresAt && (
<span className="ml-2 text-xs text-gray-400">expires {relativeTime(r.expiresAt)}</span>
)}
{ramAvailGb < 4 && (
<span className="ml-2 text-xs text-yellow-700 bg-yellow-100 px-1.5 py-0.5 rounded">
low RAM swap pressure likely
</span>
)}
{pressureAfter && (
<span className="ml-1 text-xs text-gray-400">(unloading frees {r.sizeGB} GB)</span>
)}
</div>
<button
onClick={() => onUnload(r.name)}
disabled={unloading.has(r.name)}
className="flex items-center gap-1.5 px-3 py-1.5 text-xs font-medium text-purple-700 bg-white border border-purple-200 rounded-md hover:bg-purple-50 disabled:opacity-50"
>
{unloading.has(r.name)
? <RefreshCw className="w-3.5 h-3.5 animate-spin" />
: <Zap className="w-3.5 h-3.5" />}
Unload
</button>
</div>
);
})}
</div>
)}
{/* RAM bar */}
{ramAvailGb > 0 && (
<div className="px-6 py-3 border-b border-gray-100">
<div className="flex items-center justify-between text-xs text-gray-500 mb-1">
<span>RAM available</span>
<span className="font-mono">{ramAvailGb.toFixed(1)} GB free</span>
</div>
<div className="h-2 bg-gray-200 rounded-full overflow-hidden">
{running.map(r => (
<div
key={r.name}
className="h-full bg-purple-400 rounded-full float-right"
style={{ width: `${Math.min((r.sizeGB / 16) * 100, 100)}%` }}
title={`${r.name}: ${r.sizeGB} GB`}
/>
))}
</div>
<p className="text-xs text-gray-400 mt-0.5">
{running.length > 0
? `${running.reduce((s, r) => s + r.sizeGB, 0).toFixed(1)} GB used by loaded models`
: 'No models loaded'}
</p>
</div>
)}
{/* All models */}
<div className="divide-y divide-gray-50">
{models.map(m => {
const isLoaded = running.some(r => r.name === m.name);
return (
<div key={m.name} className="px-6 py-2.5 flex items-center gap-3">
<MemoryStick className="w-4 h-4 text-gray-300 flex-shrink-0" />
<span className="font-mono text-sm text-gray-700 flex-1">{m.name}</span>
<span className="text-xs text-gray-400">{m.sizeGB} GB</span>
{isLoaded && (
<span className="text-xs bg-purple-100 text-purple-700 px-1.5 py-0.5 rounded font-medium">loaded</span>
)}
{m.modifiedAt && (
<span className="text-xs text-gray-300" title={m.modifiedAt}>
{relativeTime(m.modifiedAt)}
</span>
)}
</div>
);
})}
</div>
</div>
);
}
// ── Check card meta ────────────────────────────────────────────────────────
const CHECK_META: Record<string, { label: string; icon: React.ElementType }> = {
disk: { label: 'Disk', icon: HardDrive },
load: { label: 'CPU Load', icon: Cpu },
steal: { label: 'CPU Steal', icon: Shield },
ram: { label: 'Memory', icon: Database },
swap: { label: 'Swap', icon: Server },
container_loops: { label: 'Crash Loops', icon: Activity },
container_health: { label: 'Container Health', icon: Layers },
docker_daemon: { label: 'Docker Daemon', icon: Activity },
build_cache: { label: 'Build Cache', icon: Layers },
docker_images: { label: 'Docker Images', icon: Layers },
journal: { label: 'Journal Logs', icon: ScrollText },
syslog: { label: 'Syslog', icon: ScrollText },
failed_units: { label: 'Systemd Units', icon: Activity },
cron_missing_paths: { label: 'Cron Paths', icon: Clock },
};
const CHECK_ORDER = [
'disk', 'load', 'steal', 'ram', 'swap',
'container_loops', 'container_health', 'docker_daemon',
'build_cache', 'docker_images',
'journal', 'syslog',
'failed_units', 'cron_missing_paths',
];
// ── Main page ──────────────────────────────────────────────────────────────
export default function VmHealthPage() {
const [health, setHealth] = useState<VmHealthResult | null>(null);
const [cronData, setCronData] = useState<CronStatusResponse | null>(null);
const [unhealthy, setUnhealthy] = useState<UnhealthyContainer[]>([]);
const [ollamaData, setOllamaData] = useState<OllamaModelsResponse | null>(null);
const [cleanupLog, setCleanupLog] = useState<string>('');
const [loading, setLoading] = useState(true);
const [refreshing, setRefreshing] = useState(false);
const [cleanupRunning, setCleanupRunning] = useState(false);
const [cleanupResult, setCleanupResult] = useState<{ success: boolean; output: string } | null>(null);
const [restarting, setRestarting] = useState<Set<string>>(new Set());
const [restartMsg, setRestartMsg] = useState<{ name: string; ok: boolean; msg: string } | null>(null);
const [unloading, setUnloading] = useState<Set<string>>(new Set());
const [showLog, setShowLog] = useState(false);
const [lastRefreshed, setLastRefreshed] = useState<Date | null>(null);
const loadAll = useCallback(async () => {
try {
const [healthData, logData, cronResult, unhealthyResult, ollamaResult] = await Promise.allSettled([
vmApi.getHealth(),
vmApi.getCleanupLog(40),
vmApi.getCronStatus(),
vmApi.getUnhealthyContainers(),
vmApi.getOllamaModels(),
]);
if (healthData.status === 'fulfilled') setHealth(healthData.value);
if (logData.status === 'fulfilled') setCleanupLog(logData.value.log);
if (cronResult.status === 'fulfilled') setCronData(cronResult.value);
if (unhealthyResult.status === 'fulfilled') setUnhealthy(unhealthyResult.value.containers);
if (ollamaResult.status === 'fulfilled') setOllamaData(ollamaResult.value);
setLastRefreshed(new Date());
} catch (e) {
console.error('Failed to load VM data:', e);
} finally {
setLoading(false);
setRefreshing(false);
}
}, []);
useEffect(() => {
loadAll();
const interval = setInterval(loadAll, 60_000);
return () => clearInterval(interval);
}, [loadAll]);
const handleRefresh = () => { setRefreshing(true); loadAll(); };
const handleCleanup = async (mode: 'weekly' | 'monthly' | 'dry-run') => {
const msg =
mode === 'monthly' ? 'Run MONTHLY full cleanup? This removes build cache, pnpm store, old logs, and HOLD node_modules.' :
mode === 'dry-run' ? 'Run cleanup in DRY-RUN mode? Nothing will be deleted.' :
'Run weekly cleanup? This prunes Docker build cache, journal, apt, and .next/cache.';
if (!confirm(msg)) return;
setCleanupRunning(true);
setCleanupResult(null);
try {
const result = await vmApi.runCleanup(mode);
setCleanupResult(result);
await loadAll();
} catch (e) {
setCleanupResult({ success: false, output: String(e) });
} finally {
setCleanupRunning(false);
}
};
const handleRestart = async (name: string) => {
setRestarting(prev => new Set(prev).add(name));
setRestartMsg(null);
try {
const result = await vmApi.restartContainer(name);
setRestartMsg({ name, ok: result.success, msg: result.message });
if (result.success) {
await new Promise(r => setTimeout(r, 3000));
await loadAll();
}
} catch (e) {
setRestartMsg({ name, ok: false, msg: String(e) });
} finally {
setRestarting(prev => { const s = new Set(prev); s.delete(name); return s; });
}
};
const handleUnload = async (name: string) => {
setUnloading(prev => new Set(prev).add(name));
try {
await vmApi.unloadOllamaModel(name);
await loadAll();
} catch (e) {
console.error('Failed to unload model:', e);
} finally {
setUnloading(prev => { const s = new Set(prev); s.delete(name); return s; });
}
};
// ── Loading ───────────────────────────────────────────────────────────────
if (loading) {
return (
<div className="flex min-h-screen bg-gray-50">
<SidebarNav />
<main className="flex-1 flex items-center justify-center">
<div className="text-gray-500">Loading VM health</div>
</main>
</div>
);
}
const overall = health?.overall ?? 'CRIT';
const checks = health?.checks ?? {};
const sortedKeys = [
...CHECK_ORDER.filter(k => k in checks),
...Object.keys(checks).filter(k => !CHECK_ORDER.includes(k)),
];
const warnings = sortedKeys.filter(k => checks[k]?.level === 'WARN');
const crits = sortedKeys.filter(k => checks[k]?.level === 'CRIT');
// Parse available RAM for Ollama panel RAM bar
const ramAvailGb = parseFloat(
(checks.ram?.value ?? '0G').match(/^([\d.]+)/)?.[1] ?? '0'
);
// ── Render ────────────────────────────────────────────────────────────────
return (
<div className="flex min-h-screen bg-gray-50">
<SidebarNav />
<main className="flex-1 min-w-0 overflow-y-auto">
<div className="p-8 max-md:p-4 space-y-6">
{/* ── Header ── */}
<div className="flex items-center justify-between">
<div>
<h1 className="text-2xl font-bold text-gray-900">VM Health</h1>
<p className="text-sm text-gray-500">
{health?.hostname ?? 'srv1491630'} ·{' '}
{lastRefreshed
? `last checked ${lastRefreshed.toLocaleTimeString()}`
: 'checking…'}
</p>
</div>
<button
onClick={handleRefresh}
disabled={refreshing}
className="flex items-center gap-2 px-4 py-2 text-sm font-medium text-gray-700 bg-white border border-gray-300 rounded-md hover:bg-gray-50 disabled:opacity-50"
>
<RefreshCw className={`w-4 h-4 ${refreshing ? 'animate-spin' : ''}`} />
Refresh
</button>
</div>
{/* ── Score card ── */}
<ScoreCard health={health} unhealthyCount={unhealthy.length} cronData={cronData} />
{/* ── Overall status banner ── */}
<div className={`rounded-lg border p-4 flex items-center gap-3 ${levelColor(overall)}`}>
<LevelIcon level={overall} className="w-6 h-6 flex-shrink-0" />
<div className="flex-1">
<p className="font-semibold">
{overall === 'OK'
? 'All checks passing'
: overall === 'WARN'
? `${warnings.length} warning${warnings.length !== 1 ? 's' : ''}`
: `${crits.length} critical issue${crits.length !== 1 ? 's' : ''}`}
</p>
{health?.error && (
<p className="text-sm mt-1 opacity-80">{health.error}</p>
)}
{(crits.length > 0 || warnings.length > 0) && !health?.error && (
<p className="text-sm mt-1 opacity-80">
{[...crits, ...warnings].map(k => checks[k]?.message).join(' · ')}
</p>
)}
</div>
<span className={`px-3 py-1 rounded-full text-sm font-bold ${levelBadge(overall)}`}>
{overall}
</span>
</div>
{/* ── Check cards grid ── */}
<div className="grid grid-cols-1 sm:grid-cols-2 xl:grid-cols-3 gap-4">
{sortedKeys.map(key => {
const check = checks[key];
if (!check) return null;
const meta = CHECK_META[key] ?? { label: key, icon: Activity };
const Icon = meta.icon;
return (
<div key={key} className={`rounded-lg border p-4 ${levelColor(check.level)}`}>
<div className="flex items-start gap-3">
<div className="mt-0.5">
<LevelIcon level={check.level} />
</div>
<div className="flex-1 min-w-0">
<div className="flex items-center gap-2 mb-1">
<Icon className="w-4 h-4 opacity-60 flex-shrink-0" />
<span className="text-sm font-semibold">{meta.label}</span>
<span className={`ml-auto px-2 py-0.5 rounded text-xs font-bold ${levelBadge(check.level)}`}>
{check.level}
</span>
</div>
<p className="text-sm leading-snug">{check.message}</p>
<p className="text-xs opacity-60 mt-1 font-mono truncate">{check.value}</p>
</div>
</div>
</div>
);
})}
</div>
{/* ── Restart feedback ── */}
{restartMsg && (
<div className={`rounded-lg border p-3 flex items-center gap-2 text-sm ${restartMsg.ok ? 'bg-green-50 border-green-200 text-green-800' : 'bg-red-50 border-red-200 text-red-800'}`}>
{restartMsg.ok
? <CheckCircle className="w-4 h-4 flex-shrink-0" />
: <XCircle className="w-4 h-4 flex-shrink-0" />}
<span><strong>{restartMsg.name}:</strong> {restartMsg.msg}</span>
<button onClick={() => setRestartMsg(null)} className="ml-auto text-gray-400 hover:text-gray-600"></button>
</div>
)}
{/* ── Unhealthy containers ── */}
<UnhealthyContainersPanel
containers={unhealthy}
onRestart={handleRestart}
restarting={restarting}
/>
{/* ── Cron status ── */}
<CronStatusPanel data={cronData} />
{/* ── Ollama ── */}
<OllamaPanel
data={ollamaData}
ramAvailGb={ramAvailGb}
onUnload={handleUnload}
unloading={unloading}
/>
{/* ── Cleanup section ── */}
<div className="bg-white border border-gray-200 rounded-lg p-6">
<div className="flex items-center gap-3 mb-5">
<Trash2 className="w-5 h-5 text-gray-500" />
<div>
<h2 className="text-lg font-semibold text-gray-900">VM Cleanup</h2>
<p className="text-sm text-gray-500">
Cron runs automatically: daily build-cache prune, weekly cleanup, monthly full cleanup.
Use buttons below to trigger manually.
</p>
</div>
</div>
{cleanupResult && (
<div className={`rounded-lg border p-4 mb-4 ${cleanupResult.success ? 'bg-green-50 border-green-200' : 'bg-red-50 border-red-200'}`}>
<div className="flex items-center gap-2 mb-2">
{cleanupResult.success
? <CheckCircle className="w-4 h-4 text-green-600" />
: <XCircle className="w-4 h-4 text-red-600" />}
<span className={`text-sm font-medium ${cleanupResult.success ? 'text-green-800' : 'text-red-800'}`}>
{cleanupResult.success ? 'Cleanup completed' : 'Cleanup failed'}
</span>
<button onClick={() => setCleanupResult(null)} className="ml-auto text-gray-400 hover:text-gray-600 text-xs">
Dismiss
</button>
</div>
{cleanupResult.output && (
<pre className="text-xs font-mono whitespace-pre-wrap text-gray-700 bg-white/60 rounded p-2 max-h-48 overflow-y-auto">
{cleanupResult.output}
</pre>
)}
</div>
)}
<div className="flex flex-wrap gap-3">
<button
onClick={() => handleCleanup('dry-run')}
disabled={cleanupRunning}
className="flex items-center gap-2 px-4 py-2 text-sm font-medium text-gray-700 bg-gray-50 border border-gray-300 rounded-md hover:bg-gray-100 disabled:opacity-50"
>
<Terminal className="w-4 h-4" />
Dry Run
</button>
<button
onClick={() => handleCleanup('weekly')}
disabled={cleanupRunning}
className="flex items-center gap-2 px-4 py-2 text-sm font-medium text-blue-700 bg-blue-50 border border-blue-300 rounded-md hover:bg-blue-100 disabled:opacity-50"
>
{cleanupRunning
? <RefreshCw className="w-4 h-4 animate-spin" />
: <Trash2 className="w-4 h-4" />}
Weekly Cleanup
</button>
<button
onClick={() => handleCleanup('monthly')}
disabled={cleanupRunning}
className="flex items-center gap-2 px-4 py-2 text-sm font-medium text-orange-700 bg-orange-50 border border-orange-300 rounded-md hover:bg-orange-100 disabled:opacity-50"
>
{cleanupRunning
? <RefreshCw className="w-4 h-4 animate-spin" />
: <Trash2 className="w-4 h-4" />}
Monthly Full Cleanup
</button>
</div>
</div>
{/* ── Cleanup log ── */}
{cleanupLog && (
<div className="bg-white border border-gray-200 rounded-lg overflow-hidden">
<button
className="w-full flex items-center justify-between px-6 py-4 text-left"
onClick={() => setShowLog(v => !v)}
>
<div className="flex items-center gap-2">
<ScrollText className="w-5 h-5 text-gray-500" />
<span className="font-semibold text-gray-900">Cleanup Log</span>
<span className="text-xs text-gray-400">(last 40 lines of /var/log/vm-cleanup.log)</span>
</div>
{showLog
? <ChevronUp className="w-4 h-4 text-gray-400" />
: <ChevronDown className="w-4 h-4 text-gray-400" />}
</button>
{showLog && (
<div className="border-t border-gray-100 px-6 py-4">
<pre className="text-xs font-mono whitespace-pre-wrap text-gray-700 bg-gray-50 rounded p-3 max-h-80 overflow-y-auto">
{cleanupLog}
</pre>
</div>
)}
</div>
)}
</div>
</main>
</div>
);
}

View File

@ -0,0 +1,403 @@
'use client';
import { useEffect, useMemo, useRef, useState } from 'react';
import Link from 'next/link';
import { AlertTriangle, CheckCircle2, Cloud, Copy, DatabaseBackup, ExternalLink, Gauge, HardDrive, RefreshCw, ShieldCheck, Timer, Wifi, Activity, CalendarClock, Link2 } from 'lucide-react';
import { Badge, Button } from '@/components/ui/Primitives';
import { SectionCard } from '@/components/hermes-shell';
import { api, type HermesOpsInstance, type HermesOpsSnapshot } from '@/lib/api';
function boolTone(value: boolean): 'success' | 'error' {
return value ? 'success' : 'error';
}
function boolText(value: boolean) {
return value ? 'OK' : 'Needs attention';
}
function formatDate(value: string | null) {
if (!value) return 'unknown';
const date = new Date(value);
if (Number.isNaN(date.getTime())) return value;
return new Intl.DateTimeFormat('en', {
month: 'short',
day: 'numeric',
hour: 'numeric',
minute: '2-digit',
}).format(date);
}
function StatusRow({ label, value, ok }: { label: string; value: string; ok: boolean }) {
return (
<div className="flex items-center justify-between gap-3 rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] px-3 py-2">
<span className="text-sm text-[var(--bl-text-secondary)]">{label}</span>
<Badge variant={boolTone(ok)}>{value}</Badge>
</div>
);
}
function InstanceCard({ instance }: { instance: HermesOpsInstance }) {
const score = [
instance.gateway.active,
instance.gateway.enabled,
instance.dashboard.active,
instance.backup.timer.active,
instance.backup.repo.clean,
instance.google.workspaceToken,
].filter(Boolean).length;
return (
<article className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-4 shadow-[var(--bl-shadow-sm)]">
<div className="flex flex-wrap items-start justify-between gap-3">
<div>
<div className="flex items-center gap-2">
<Gauge className="h-4 w-4 text-[var(--bl-accent)]" />
<h3 className="font-semibold text-[var(--bl-text-primary)]">{instance.label}</h3>
</div>
<p className="mt-1 text-xs text-[var(--bl-text-secondary)]">{instance.hermesHome}</p>
</div>
<Badge variant={score === 6 ? 'success' : score >= 4 ? 'warning' : 'error'}>{score}/6 healthy</Badge>
</div>
<div className="mt-4 grid gap-2">
<StatusRow label={instance.gateway.service} value={boolText(instance.gateway.active)} ok={instance.gateway.active} />
<StatusRow label="Auto-start enabled" value={instance.gateway.enabled ? 'enabled' : 'disabled'} ok={instance.gateway.enabled} />
<StatusRow label="Private dashboard" value={boolText(instance.dashboard.active)} ok={instance.dashboard.active} />
<StatusRow label="10-minute backup timer" value={boolText(instance.backup.timer.active)} ok={instance.backup.timer.active} />
<StatusRow label="Google Workspace token" value={boolText(instance.google.workspaceToken)} ok={instance.google.workspaceToken} />
</div>
<div className="mt-4 grid gap-3 text-sm text-[var(--bl-text-secondary)] md:grid-cols-2">
<div className="rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-3">
<div className="flex items-center gap-2 text-[var(--bl-text-primary)]">
<DatabaseBackup className="h-4 w-4" />
Backup repo
</div>
<p className="mt-2 break-words">HEAD {instance.backup.repo.head ?? 'unknown'}</p>
<p className="break-words">Last commit {formatDate(instance.backup.repo.lastCommitAt)}</p>
<p>{instance.backup.repo.clean ? 'Clean working tree' : 'Uncommitted changes present'}</p>
</div>
<div className="rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-3">
<div className="flex items-center gap-2 text-[var(--bl-text-primary)]">
<HardDrive className="h-4 w-4" />
Restore payload
</div>
<p className="mt-2">{instance.backup.restoredFileCount ?? 'unknown'} tracked files</p>
<p>{instance.backup.restoredCronJobs ?? 'unknown'} cron job definitions</p>
<p className="break-words">{instance.backup.repo.size ?? 'size unknown'}</p>
</div>
</div>
<div className="mt-4 flex flex-wrap gap-2">
<Button asChild variant="secondary" size="sm">
<a href={instance.dashboard.url} target="_blank" rel="noreferrer">
Open dashboard <ExternalLink className="ml-2 h-4 w-4" />
</a>
</Button>
<Button variant="ghost" size="sm" onClick={() => void navigator.clipboard.writeText(instance.dashboard.url)}>
<Copy className="mr-2 h-4 w-4" />
Copy URL
</Button>
</div>
</article>
);
}
export function HermesOpsPanel() {
const [snapshot, setSnapshot] = useState<HermesOpsSnapshot | null>(null);
const [previousSnapshot, setPreviousSnapshot] = useState<HermesOpsSnapshot | null>(null);
const [loading, setLoading] = useState(true);
const [error, setError] = useState<string | null>(null);
const latestSnapshotRef = useRef<HermesOpsSnapshot | null>(null);
const load = async () => {
setLoading(true);
setError(null);
try {
const nextSnapshot = await api.getHermesOps();
setPreviousSnapshot(latestSnapshotRef.current);
latestSnapshotRef.current = nextSnapshot;
setSnapshot(nextSnapshot);
} catch (err) {
setError(err instanceof Error ? err.message : 'Unable to load Hermes operations status');
} finally {
setLoading(false);
}
};
useEffect(() => {
void load();
const id = window.setInterval(() => void load(), 60_000);
return () => window.clearInterval(id);
}, []);
const allHealthy = useMemo(() => snapshot ? snapshot.warnings.length === 0 : false, [snapshot]);
const snapshotDiff = useMemo(() => {
if (!snapshot || !previousSnapshot) return null;
const previousHealthyInstances = previousSnapshot.instances.filter((instance) =>
instance.gateway.active &&
instance.dashboard.active &&
instance.backup.timer.active &&
instance.backup.repo.clean &&
instance.google.workspaceToken
).length;
const currentHealthyInstances = snapshot.instances.filter((instance) =>
instance.gateway.active &&
instance.dashboard.active &&
instance.backup.timer.active &&
instance.backup.repo.clean &&
instance.google.workspaceToken
).length;
return {
healthyInstances: currentHealthyInstances - previousHealthyInstances,
warnings: snapshot.warnings.length - previousSnapshot.warnings.length,
activeSessions: snapshot.activeSessions.active - previousSnapshot.activeSessions.active,
activeDashboards: snapshot.instances.filter((instance) => instance.dashboard.active).length - previousSnapshot.instances.filter((instance) => instance.dashboard.active).length,
activeBackupTimers: snapshot.instances.filter((instance) => instance.backup.timer.active).length - previousSnapshot.instances.filter((instance) => instance.backup.timer.active).length,
};
}, [previousSnapshot, snapshot]);
const healthyInstances = snapshot
? snapshot.instances.filter((instance) =>
instance.gateway.active &&
instance.dashboard.active &&
instance.backup.timer.active &&
instance.backup.repo.clean &&
instance.google.workspaceToken
).length
: 0;
const activeDashboards = snapshot ? snapshot.instances.filter((instance) => instance.dashboard.active).length : 0;
const activeBackupTimers = snapshot ? snapshot.instances.filter((instance) => instance.backup.timer.active).length : 0;
return (
<SectionCard
title="Live Recovery and Dashboard Status"
subtitle="Real VM status for Vijay/root and Bheem/Uma: gateways, private dashboards, backups, Google auth, and restore payload health."
actions={(
<div className="flex flex-wrap items-center gap-2">
{snapshot ? <Badge variant={allHealthy ? 'success' : 'warning'}>{allHealthy ? 'All green' : `${snapshot.warnings.length} warning(s)`}</Badge> : null}
<Button variant="ghost" size="sm" onClick={() => void load()} disabled={loading}>
<RefreshCw className={`mr-2 h-4 w-4 ${loading ? 'animate-spin' : ''}`} />
Refresh
</Button>
</div>
)}
>
{error ? (
<div className="rounded-2xl border border-[var(--bl-danger)]/30 bg-[var(--bl-danger)]/10 p-4 text-sm text-[var(--bl-danger)]">
{error}
</div>
) : null}
{snapshot ? (
<div className="space-y-5">
{snapshotDiff ? (
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center justify-between gap-3">
<div>
<p className="text-sm font-medium text-[var(--bl-text-primary)]">Since previous refresh</p>
<p className="text-xs text-[var(--bl-text-secondary)]">Snapshot movement compared with the last poll.</p>
</div>
<Badge variant="neutral">Delta view</Badge>
</div>
<div className="mt-3 grid gap-3 md:grid-cols-5">
{[
{ label: 'Healthy instances', value: snapshotDiff.healthyInstances },
{ label: 'Active dashboards', value: snapshotDiff.activeDashboards },
{ label: 'Active backups', value: snapshotDiff.activeBackupTimers },
{ label: 'Active sessions', value: snapshotDiff.activeSessions },
{ label: 'Warnings', value: snapshotDiff.warnings },
].map((item) => (
<div key={item.label} className="rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3">
<p className="text-xs uppercase tracking-[0.2em] text-[var(--bl-text-tertiary)]">{item.label}</p>
<p className={`mt-2 text-2xl font-semibold ${item.value > 0 ? 'text-[var(--bl-success)]' : item.value < 0 ? 'text-[var(--bl-danger)]' : 'text-[var(--bl-text-primary)]'}`}>
{item.value > 0 ? '+' : ''}{item.value}
</p>
</div>
))}
</div>
</div>
) : null}
<div className="grid gap-3 md:grid-cols-4">
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<ShieldCheck className="h-4 w-4" />
Healthy instances
</div>
<p className="mt-2 text-2xl font-semibold text-[var(--bl-text-primary)]">{healthyInstances}/2</p>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<Activity className="h-4 w-4" />
Active dashboards
</div>
<p className="mt-2 text-2xl font-semibold text-[var(--bl-text-primary)]">{activeDashboards}/2</p>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<CalendarClock className="h-4 w-4" />
Active backup timers
</div>
<p className="mt-2 text-2xl font-semibold text-[var(--bl-text-primary)]">{activeBackupTimers}/2</p>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<AlertTriangle className="h-4 w-4" />
Open warnings
</div>
<p className="mt-2 text-2xl font-semibold text-[var(--bl-text-primary)]">{snapshot.warnings.length}</p>
</div>
</div>
<div className="grid gap-3 md:grid-cols-4">
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<Wifi className="h-4 w-4" />
Tailscale IP
</div>
<p className="mt-2 break-words text-lg font-semibold text-[var(--bl-text-primary)]">{snapshot.tailscaleIp ?? 'unknown'}</p>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<Cloud className="h-4 w-4" />
Emergency Drive
</div>
<p className="mt-2 break-words text-lg font-semibold text-[var(--bl-text-primary)]">{snapshot.emergencyDriveUpload.active ? 'active' : 'inactive'}</p>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<Timer className="h-4 w-4" />
Next Drive bundle
</div>
<p className="mt-2 break-words text-base font-semibold leading-6 text-[var(--bl-text-primary)]">{snapshot.emergencyDriveUpload.nextRun ?? 'unknown'}</p>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm text-[var(--bl-text-secondary)]">
<ShieldCheck className="h-4 w-4" />
Generated
</div>
<p className="mt-2 break-words text-lg font-semibold text-[var(--bl-text-primary)]">{formatDate(snapshot.generatedAt)}</p>
</div>
</div>
<div className="grid gap-4 xl:grid-cols-[1.2fr_0.8fr]">
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm font-medium text-[var(--bl-text-primary)]">
<Activity className="h-4 w-4 text-[var(--bl-accent)]" />
Active Hermes sessions
</div>
<div className="mt-3 grid gap-3 md:grid-cols-3">
<div className="rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3">
<p className="text-xs uppercase tracking-[0.2em] text-[var(--bl-text-tertiary)]">Running now</p>
<p className="mt-2 text-2xl font-semibold text-[var(--bl-text-primary)]">{snapshot.activeSessions.active}</p>
</div>
<div className="rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3">
<p className="text-xs uppercase tracking-[0.2em] text-[var(--bl-text-tertiary)]">Last updated</p>
<p className="mt-2 text-sm font-medium text-[var(--bl-text-primary)]">{snapshot.activeSessions.updatedAt ? formatDate(snapshot.activeSessions.updatedAt) : 'unknown'}</p>
</div>
<div className="rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3">
<p className="text-xs uppercase tracking-[0.2em] text-[var(--bl-text-tertiary)]">Interpretation</p>
<p className="mt-2 text-sm text-[var(--bl-text-secondary)]">Counted from Hermes CLI processes outside the gateway daemons.</p>
</div>
</div>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm font-medium text-[var(--bl-text-primary)]">
<CalendarClock className="h-4 w-4 text-[var(--bl-accent)]" />
Cron job state
</div>
<div className="mt-3 space-y-2">
{snapshot.cronJobs.map((job) => (
<div key={job.name} className="rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3">
<div className="flex items-center justify-between gap-2">
<p className="font-medium text-[var(--bl-text-primary)]">{job.label}</p>
<Badge variant={job.active ? 'success' : 'error'}>{job.active ? 'active' : 'inactive'}</Badge>
</div>
<p className="mt-2 text-xs text-[var(--bl-text-secondary)]">Next: {job.nextRun ?? 'unknown'}</p>
<p className="text-xs text-[var(--bl-text-secondary)]">Last: {job.lastRun ?? 'unknown'}</p>
</div>
))}
</div>
</div>
</div>
{snapshot.warnings.length ? (
<div className="rounded-2xl border border-[var(--bl-warning)]/40 bg-[var(--bl-warning)]/10 p-4">
<div className="flex items-center gap-2 font-medium text-[var(--bl-text-primary)]">
<AlertTriangle className="h-4 w-4 text-[var(--bl-warning)]" />
Recovery warnings
</div>
<div className="mt-3 grid gap-2 md:grid-cols-2">
{snapshot.warnings.map((warning) => (
<div key={warning} className="rounded-xl bg-[var(--bl-surface-card)] px-3 py-2 text-sm text-[var(--bl-text-secondary)]">{warning}</div>
))}
</div>
</div>
) : (
<div className="flex items-center gap-2 rounded-2xl border border-[var(--bl-success)]/30 bg-[var(--bl-success)]/10 p-4 text-sm text-[var(--bl-text-secondary)]">
<CheckCircle2 className="h-4 w-4 text-[var(--bl-success)]" />
Vijay and Bheem recovery paths are healthy.
</div>
)}
<div className="grid gap-4 xl:grid-cols-[1fr_1fr]">
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm font-medium text-[var(--bl-text-primary)]">
<AlertTriangle className="h-4 w-4 text-[var(--bl-warning)]" />
Recent sanitized alerts
</div>
<div className="mt-3 space-y-2">
{snapshot.recentAlerts.length ? snapshot.recentAlerts.map((warning) => (
<div key={warning} className="rounded-xl bg-[var(--bl-surface-card)] px-3 py-2 text-sm text-[var(--bl-text-secondary)]">{warning}</div>
)) : (
<div className="rounded-xl bg-[var(--bl-surface-card)] px-3 py-2 text-sm text-[var(--bl-text-secondary)]">No recent alerts.</div>
)}
</div>
</div>
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4">
<div className="flex items-center gap-2 text-sm font-medium text-[var(--bl-text-primary)]">
<Link2 className="h-4 w-4 text-[var(--bl-accent)]" />
Quick links
</div>
<div className="mt-3 space-y-2">
{snapshot.quickLinks.map((link) => (
<a
key={link.href}
href={link.href}
target="_blank"
rel="noreferrer"
className="block rounded-xl border border-[var(--bl-border)] bg-[var(--bl-surface-card)] p-3 hover:border-[var(--bl-accent)]"
>
<p className="font-medium text-[var(--bl-text-primary)]">{link.label}</p>
<p className="mt-1 text-sm text-[var(--bl-text-secondary)]">{link.description}</p>
</a>
))}
</div>
</div>
</div>
<div className="grid gap-4 xl:grid-cols-2">
{snapshot.instances.map((instance) => (
<InstanceCard key={instance.id} instance={instance} />
))}
</div>
<div className="text-sm text-[var(--bl-text-secondary)]">
Disaster recovery details live in{' '}
<Link href="/hermes/settings" className="text-[var(--bl-accent)] hover:underline">Hermes settings</Link>
{' '}and the tracked runbook in <span className="font-mono text-xs">docs/hermes-disaster-recovery.md</span>.
</div>
</div>
) : (
<div className="rounded-2xl border border-[var(--bl-border)] bg-[var(--bl-surface-muted)] p-4 text-sm text-[var(--bl-text-secondary)]">
Loading live Hermes operations status...
</div>
)}
</SectionCard>
);
}

View File

@ -1,6 +1,6 @@
'use client'; 'use client';
import { useState } from 'react'; import { useState, useEffect } from 'react';
import Link from 'next/link'; import Link from 'next/link';
import { usePathname, useRouter } from 'next/navigation'; import { usePathname, useRouter } from 'next/navigation';
import { import {
@ -18,6 +18,7 @@ import {
Moon, Moon,
HeartPulse, HeartPulse,
Sparkles, Sparkles,
Server,
} from 'lucide-react'; } from 'lucide-react';
import { useAuth } from '@/lib/auth'; import { useAuth } from '@/lib/auth';
@ -26,6 +27,7 @@ const navItems = [
{ href: '/hermes', label: 'Hermes', icon: Sparkles }, { href: '/hermes', label: 'Hermes', icon: Sparkles },
{ href: '/health', label: 'Health', icon: HeartPulse }, { href: '/health', label: 'Health', icon: HeartPulse },
{ href: '/metrics', label: 'Metrics', icon: BarChart3 }, { href: '/metrics', label: 'Metrics', icon: BarChart3 },
{ href: '/vm', label: 'VM Health', icon: Server },
{ href: '/system', label: 'System', icon: Cpu }, { href: '/system', label: 'System', icon: Cpu },
{ href: '/env', label: 'Environment', icon: Key }, { href: '/env', label: 'Environment', icon: Key },
{ href: '/code-quality', label: 'Code Quality', icon: Code2 }, { href: '/code-quality', label: 'Code Quality', icon: Code2 },
@ -37,7 +39,14 @@ export function SidebarNav() {
const router = useRouter(); const router = useRouter();
const { user, logout } = useAuth(); const { user, logout } = useAuth();
const [mobileOpen, setMobileOpen] = useState(false); const [mobileOpen, setMobileOpen] = useState(false);
const [theme, setTheme] = useState('light'); const [theme, setTheme] = useState<'light' | 'dark'>('light');
// Sync theme from localStorage on mount
useEffect(() => {
const saved = (localStorage.getItem('theme') as 'light' | 'dark') || 'light';
setTheme(saved);
document.documentElement.classList.toggle('dark', saved === 'dark');
}, []);
const handleLogout = () => { const handleLogout = () => {
logout(); logout();
@ -96,7 +105,12 @@ export function SidebarNav() {
{/* Footer — theme toggle + user info + logout */} {/* Footer — theme toggle + user info + logout */}
<div className="border-t p-4 space-y-3"> <div className="border-t p-4 space-y-3">
<button <button
onClick={() => setTheme(theme === 'dark' ? 'light' : 'dark')} onClick={() => {
const next = theme === 'dark' ? 'light' : 'dark';
setTheme(next);
localStorage.setItem('theme', next);
document.documentElement.classList.toggle('dark', next === 'dark');
}}
className="flex w-full items-center gap-3 rounded-lg px-3 py-2 text-sm text-gray-700 hover:bg-gray-100 transition-colors" className="flex w-full items-center gap-3 rounded-lg px-3 py-2 text-sm text-gray-700 hover:bg-gray-100 transition-colors"
> >
{theme === 'dark' ? <Sun className="h-4 w-4" /> : <Moon className="h-4 w-4" />} {theme === 'dark' ? <Sun className="h-4 w-4" /> : <Moon className="h-4 w-4" />}

View File

@ -28,8 +28,9 @@ export const Button = React.forwardRef<HTMLButtonElement, ButtonProps>(
const classes = cn(baseStyles, variantStyles[variant], sizeStyles[size], className); const classes = cn(baseStyles, variantStyles[variant], sizeStyles[size], className);
if (asChild && React.isValidElement(children)) { if (asChild && React.isValidElement(children)) {
return React.cloneElement(children as React.ReactElement<{ className?: string }>, { const typedChild = children as React.ReactElement<{ className?: string }>;
className: cn(children.props.className, classes), return React.cloneElement(typedChild, {
className: cn(typedChild.props.className, classes),
}); });
} }

View File

@ -82,6 +82,35 @@ describe('API Client', () => {
}); });
}); });
describe('getHermesOps', () => {
it('fetches the live Hermes operations snapshot', async () => {
const snapshot = {
generatedAt: '2026-05-27T13:03:14.848Z',
tailscaleIp: '100.87.53.10',
emergencyDriveUpload: {
name: 'hermes-emergency-drive-upload.timer',
active: true,
nextRun: 'Thu 2026-05-28 03:26:15 UTC',
lastRun: null,
},
instances: [],
warnings: [],
};
vi.mocked(global.fetch).mockResolvedValueOnce(mockJsonResponse(snapshot));
await expect(api.getHermesOps()).resolves.toEqual(snapshot);
expect(global.fetch).toHaveBeenCalledWith(
'http://localhost:4004/api/hermes/ops',
expect.objectContaining({
headers: expect.objectContaining({
'Content-Type': 'application/json',
}),
}),
);
});
});
describe('state-changing requests', () => { describe('state-changing requests', () => {
it('triggers a deployment without CSRF when no user token exists', async () => { it('triggers a deployment without CSRF when no user token exists', async () => {
const mockResponse = { const mockResponse = {

View File

@ -56,6 +56,79 @@ export interface EnvVar {
updatedAt: string; updatedAt: string;
} }
export interface HermesOpsTimer {
name: string;
active: boolean;
nextRun: string | null;
lastRun: string | null;
}
export interface HermesOpsRepo {
path: string;
branch: string | null;
clean: boolean;
head: string | null;
lastCommitAt: string | null;
size: string | null;
}
export interface HermesOpsInstance {
id: 'vijay' | 'bheem';
label: string;
hermesHome: string;
gateway: {
service: string;
active: boolean;
enabled: boolean;
};
dashboard: {
service: string;
active: boolean;
url: string;
};
backup: {
timer: HermesOpsTimer;
repo: HermesOpsRepo;
restoredFileCount: number | null;
restoredCronJobs: number | null;
};
google: {
workspaceToken: boolean;
driveFolder: string;
};
}
export interface HermesOpsSessionSummary {
active: number;
updatedAt: string | null;
}
export interface HermesOpsCronJob {
name: string;
label: string;
active: boolean;
nextRun: string | null;
lastRun: string | null;
}
export interface HermesOpsLink {
label: string;
href: string;
description: string;
}
export interface HermesOpsSnapshot {
generatedAt: string;
tailscaleIp: string | null;
emergencyDriveUpload: HermesOpsTimer;
activeSessions: HermesOpsSessionSummary;
cronJobs: HermesOpsCronJob[];
recentAlerts: string[];
quickLinks: HermesOpsLink[];
instances: HermesOpsInstance[];
warnings: string[];
}
let csrfToken: string | null = null; let csrfToken: string | null = null;
let csrfTokenExpiresAt: number = 0; let csrfTokenExpiresAt: number = 0;
@ -208,6 +281,9 @@ export const api = {
apiRequest<ServiceHealth>(`/api/health/${serviceId}`), apiRequest<ServiceHealth>(`/api/health/${serviceId}`),
clearHealthCache: () => apiRequest<{ message: string }>('/api/health/cache', { method: 'DELETE' }), clearHealthCache: () => apiRequest<{ message: string }>('/api/health/cache', { method: 'DELETE' }),
// Hermes operations
getHermesOps: () => apiRequest<HermesOpsSnapshot>('/api/hermes/ops'),
// Seed // Seed
seedServices: () => apiRequest<{ message: string }>('/api/seed', { method: 'POST' }), seedServices: () => apiRequest<{ message: string }>('/api/seed', { method: 'POST' }),
@ -345,6 +421,99 @@ export const codeQualityApi = {
export const runCodeQualityCheck = (params: CodeQualityCheckParams) => codeQualityApi.runCheck(params); export const runCodeQualityCheck = (params: CodeQualityCheckParams) => codeQualityApi.runCheck(params);
// VM Health
export type VmCheckLevel = 'OK' | 'WARN' | 'CRIT';
export interface VmCheck {
level: VmCheckLevel;
value: string;
message: string;
}
export interface VmHealthResult {
timestamp: string;
hostname: string;
overall: VmCheckLevel;
checks: Record<string, VmCheck>;
error?: string;
}
export interface CronRunSummary {
timestamp: string;
mode: 'standard' | 'full';
diskBefore: string;
diskAfter: string;
freedMB: number;
durationSecs: number;
success: boolean;
steps: string[];
jsonSummary?: Record<string, unknown>;
}
export interface CronJob {
name: string;
schedule: string;
description: string;
lastRun: CronRunSummary | null;
nextRun: string | null;
}
export interface CronStatusResponse {
jobs: CronJob[];
recentRuns: CronRunSummary[];
}
export interface UnhealthyContainer {
name: string;
status: string;
restartCount: number;
lastHealthLogs: string[];
unhealthySince: string | null;
}
export interface OllamaModel {
name: string;
sizeGB: number;
modifiedAt: string;
}
export interface OllamaRunning {
name: string;
sizeGB: number;
processor: string;
expiresAt: string;
}
export interface OllamaModelsResponse {
models: OllamaModel[];
running: OllamaRunning[];
}
export const vmApi = {
getHealth: () => apiRequest<VmHealthResult>('/api/vm/health'),
getCleanupLog: (lines = 40) =>
apiRequest<{ log: string }>(`/api/vm/cleanup-log?lines=${lines}`),
runCleanup: (mode: 'weekly' | 'monthly' | 'dry-run') =>
apiRequest<{ success: boolean; output: string }>('/api/vm/cleanup', {
method: 'POST',
body: JSON.stringify({ mode }),
}),
getCronStatus: () => apiRequest<CronStatusResponse>('/api/vm/cron-status'),
getUnhealthyContainers: () =>
apiRequest<{ containers: UnhealthyContainer[] }>('/api/vm/containers/unhealthy'),
restartContainer: (name: string) =>
apiRequest<{ success: boolean; message: string }>(
`/api/vm/containers/${encodeURIComponent(name)}/restart`,
{ method: 'POST' },
),
getOllamaModels: () => apiRequest<OllamaModelsResponse>('/api/vm/ollama/models'),
unloadOllamaModel: (name: string) =>
apiRequest<{ success: boolean; message: string }>(
`/api/vm/ollama/models/${encodeURIComponent(name)}`,
{ method: 'DELETE' },
),
};
// Auth API - calls platform-service for authentication // Auth API - calls platform-service for authentication
export interface LoginRequest { export interface LoginRequest {
email: string; email: string;

View File

@ -73,13 +73,10 @@ export function AuthProvider({ children }: { children: React.ReactNode }) {
try { try {
const token = getAccessTokenFromStorage(); const token = getAccessTokenFromStorage();
if (!token) { if (!token) {
console.log('No token found in storage');
setLoading(false); setLoading(false);
return; return;
} }
console.log('Checking auth with token...');
// Add timeout to prevent hanging // Add timeout to prevent hanging
const timeoutPromise = new Promise((_, reject) => const timeoutPromise = new Promise((_, reject) =>
setTimeout(() => reject(new Error('Auth check timeout')), 10000) setTimeout(() => reject(new Error('Auth check timeout')), 10000)
@ -90,15 +87,12 @@ export function AuthProvider({ children }: { children: React.ReactNode }) {
timeoutPromise timeoutPromise
]) as MeResponse; ]) as MeResponse;
console.log('User data received:', userData);
setUser(userData); setUser(userData);
// Simplified admin check - just check global admin role // Simplified admin check - just check global admin role
const globalRole = userData.role; const globalRole = userData.role;
const hasAdminAccess = globalRole === 'admin'; const hasAdminAccess = globalRole === 'admin';
setIsAdmin(hasAdminAccess); setIsAdmin(hasAdminAccess);
console.log('Admin access:', hasAdminAccess);
} catch (error) { } catch (error) {
console.error('Auth check failed:', error); console.error('Auth check failed:', error);
clearAuthTokens(); clearAuthTokens();
@ -109,9 +103,7 @@ export function AuthProvider({ children }: { children: React.ReactNode }) {
async function login(email: string, password: string, productId: string) { async function login(email: string, password: string, productId: string) {
try { try {
console.log('Attempting login for:', email, 'with productId:', productId);
const response = await authApi.login({ email, password, productId }); const response = await authApi.login({ email, password, productId });
console.log('Login response received:', response);
setAccessToken(response.accessToken); setAccessToken(response.accessToken);
setRefreshToken(response.refreshToken); setRefreshToken(response.refreshToken);
@ -120,8 +112,6 @@ export function AuthProvider({ children }: { children: React.ReactNode }) {
// Check if user has admin access (global admin role) // Check if user has admin access (global admin role)
const hasAdminAccess = response.user.role === 'admin'; const hasAdminAccess = response.user.role === 'admin';
setIsAdmin(hasAdminAccess); setIsAdmin(hasAdminAccess);
console.log('Login successful, admin access:', hasAdminAccess);
} catch (error) { } catch (error) {
console.error('Login failed:', error); console.error('Login failed:', error);
throw error; throw error;
@ -148,7 +138,7 @@ export function AuthProvider({ children }: { children: React.ReactNode }) {
return <div className="min-h-screen flex items-center justify-center">Redirecting to login...</div>; return <div className="min-h-screen flex items-center justify-center">Redirecting to login...</div>;
} }
if (user && !isAdmin) { if (!isAdmin) {
return ( return (
<div className="min-h-screen flex items-center justify-center"> <div className="min-h-screen flex items-center justify-center">
<div className="text-center"> <div className="text-center">

View File

@ -4,3 +4,29 @@ import { twMerge } from 'tailwind-merge';
export function cn(...inputs: ClassValue[]) { export function cn(...inputs: ClassValue[]) {
return twMerge(clsx(inputs)); return twMerge(clsx(inputs));
} }
/** Format bytes into a human-readable string (B / KB / MB / GB / TB). */
export function formatBytes(bytes: number): string {
if (bytes === 0) return '0 B';
const k = 1024;
const sizes = ['B', 'KB', 'MB', 'GB', 'TB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
}
/** Tailwind classes for a service/deployment status badge. */
export function getStatusColor(status: string): string {
switch (status) {
case 'up':
case 'success':
return 'text-green-600 bg-green-50 border-green-200';
case 'down':
case 'failed':
return 'text-red-600 bg-red-50 border-red-200';
case 'degraded':
case 'running':
return 'text-yellow-600 bg-yellow-50 border-yellow-200';
default:
return 'text-gray-600 bg-gray-50 border-gray-200';
}
}

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,399 @@
# VM Observability & Control Roadmap — v2
**Status:** Draft — Pending Approval
**Last updated:** 2026-05-27
**Scope:** `srv1491630` (Hostinger VM) + DevOps Dashboard (`devops.bytelyst.com`)
**Reviewed:** Yes — v1 audited against live system; 11 issues corrected (see change log at bottom)
---
## Current State Snapshot
| Layer | What exists today | Verified gap |
|---|---|---|
| **Health check** | `vm-health-check.sh` — disk, load, RAM, swap, Docker | No steal time metric; no per-container detail |
| **Cleanup** | `vm-cleanup.sh` — build cache, images, logs, apt, pnpm, HOLD | Runs silently; no structured outcome record |
| **Cron** | 4 scheduled jobs (daily / weekly / monthly) | No execution history; no "last ran / freed X" |
| **Dashboard /vm** | Health check + cleanup log tail + trigger button | **VM module is non-functional** — container has no host volume mounts; all backend calls to host scripts fail silently |
| **Dashboard /system** | CPU, RAM, disk, Docker stats | Missing steal %, container detail, unhealthy drill-down |
| **Prometheus stack** | Prometheus + cAdvisor + node-exporter + Loki — ~2 weeks history | **No Grafana**; trend data exists but no UI to query it |
| **Alerting** | Telegram on WARN/CRIT at 07:00 UTC | No steal time alert; no weekly digest; no cron failure alert |
| **Container restart** | 38/39 containers have `unless-stopped` | `unless-stopped` restarts on *process exit only* — does NOT react to health check failures. 7 containers running but unhealthy (process alive, health endpoint dead) |
| **LLMs (Ollama)** | 9 models on disk; `qwen2.5-coder:1.5b` currently loaded (1.1 GB, 100% CPU) | No RAM impact warning before loading; no dashboard visibility |
| **I/O anomaly** | `invttrdg-backend` writing ~22 GB/day to block storage | Unexplained — no alert, no investigation |
---
## Architectural Decisions (settle these before building)
### A. Trend chart data source
**Options:**
- ✅ **Query existing Prometheus** from DevOps dashboard (recommended) — data already there, no new store needed. Add Prometheus query endpoints to dashboard backend, render with a chart library.
- **Add Grafana container** alongside Prometheus — purpose-built for metrics UI, out-of-box dashboards. Extra 80150 MB RAM.
- ❌ **New Cosmos DB vm-metrics container** — redundant with Prometheus; wrong tool for time-series.
**Recommendation:** Query Prometheus from the dashboard for Phase 4.2 charts (keeps everything in one UI). Add Grafana in Phase 5 only if dashboard charts feel limiting.
### B. Dashboard → host script execution
The `devops-backend` container currently has **no host volume mounts** and **no sudoers entry**. Phase 3.2 "Run cleanup from dashboard" requires one of:
- ✅ **Mount host script + Docker socket** into devops-backend (simplest, lowest risk)
- **Thin host-side agent** (systemd socket-activated, receives commands via Unix socket)
- ❌ **SSH from container to host** — unnecessary complexity
**Recommendation:** Mount `/opt/bytelyst/learning_ai_devops_tools/scripts` read-only + `/var/log` for log reading into devops-backend. Add sudoers entry for the cleanup script only.
---
## Phase 0 — Fix Broken Foundations *(Day 12, prerequisite for all UI phases)*
These are not new features — they are bugs in the current system.
#### 0.1 Fix devops-backend VM module (host volume mounts)
**Problem:** `GET /api/vm/health`, `GET /api/vm/cleanup-log`, `POST /api/vm/cleanup` all fail because the container has no access to the host filesystem.
**Fix:** Update `docker-compose.yml` for devops-backend:
```yaml
volumes:
- /opt/bytelyst/learning_ai_devops_tools/scripts:/scripts:ro
- /var/log/vm-cleanup.log:/var/log/vm-cleanup.log:ro
- /var/log/vm-health-check.log:/var/log/vm-health-check.log:ro
```
Update `repository.ts` to use `/scripts/VMs/HostingerVM/vm-cleanup.sh` path, or use env var `VM_SCRIPTS_PATH`.
Add sudoers entry: `nobody ALL=(ALL) NOPASSWD: /scripts/VMs/HostingerVM/vm-cleanup.sh`
**Risk:** Low. Read-only mounts for scripts, append-only for logs.
**Validates:** Run `curl http://localhost:4004/api/vm/health` and confirm JSON response.
#### 0.2 Add logrotate entry for new log files
**Problem:** `/var/log/vm-cleanup.log` and `/var/log/vm-health-check.log` have no rotation policy. Will grow unbounded.
**Fix:** Create `/etc/logrotate.d/bytelyst-vm`:
```
/var/log/vm-cleanup.log /var/log/vm-health-check.log /var/log/docker-watchdog.log {
weekly
rotate 8
compress
delaycompress
missingok
notifempty
create 0644 root root
}
```
#### 0.3 Investigate `invttrdg-backend` I/O anomaly
**Problem:** 22.2 GB block writes in 13 hours (~1.7 GB/hr). At this rate: 40 GB/day, will fill the 123 GB free disk in ~3 days of heavy trading activity.
**Fix path:** Check what's being written (WAL logs? tick data? verbose debug logging?). Likely a log level or persistence config issue. Add disk usage alert specific to this container.
**Risk of not fixing:** Disk fills up, all services go down.
---
## Phase 1 — Observability Gaps *(Week 1)*
Read-only additions to existing scripts and the `/vm` dashboard page.
#### 1.1 Cron Job Execution History Panel
**Where:** Dashboard `/vm` page — new "Maintenance Schedule" card
**What:** Add `GET /api/vm/cron-status` endpoint that:
1. Parses crontab entries for the 4 managed jobs (look for `# bytelyst-vm-maintenance` block)
2. Parses `/var/log/vm-cleanup.log` into structured run objects: `{ timestamp, mode, diskBefore, diskAfter, freedMB, steps[], success }`
3. Calculates next run from cron expression
**UI:** Table — job name | schedule | last run | freed | status | next run. Expandable row shows step-by-step log.
**Dependency:** Requires Phase 0.1 (volume mount for log access).
#### 1.2 CPU Steal Time Metric
**Where:** `vm-health-check.sh` + dashboard `/vm` health cards
**What:** Sample `/proc/stat` twice 1 second apart, compute steal %:
```bash
read_steal() { awk '/^cpu /{print $9" "$2+$3+$4+$5+$6+$7+$8+$9+$10}' /proc/stat; }
s1=$(read_steal); sleep 1; s2=$(read_steal)
steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
split(s1,a," "); split(s2,b," ")
delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
printf "%.1f", (delta_steal/delta_total)*100
}')
```
Thresholds: `> 5%` = WARN, `> 15%` = CRIT.
**Why:** Currently at **8.2%** — silently degrading every API response and LLM inference call.
**Dependency:** None. Self-contained script change.
#### 1.3 Unhealthy Container Detail Panel
**Where:** Dashboard `/vm` — expand container health card
**What:** New `GET /api/vm/containers/unhealthy` endpoint:
- Container name, `unhealthy` since (parse `docker inspect .State.Health.Log[0].Start`)
- Last 3 health check log lines
- Current restart count
**UI:** Expandable per-container row with one-click restart button (calls existing or new `POST /api/vm/containers/:name/restart`).
**Dependency:** Requires Phase 0.1.
#### 1.4 Swap Pressure Indicator
**Where:** `vm-health-check.sh` + dashboard
**What:** Add `SwapCached` as secondary metric. High SwapCached relative to SwapUsed = system was recently under pressure even if swap looks ok now. Surface in daily Telegram alert even when overall = WARN not CRIT.
**Threshold change:** Current `SWAP_USED_WARN_GB=1` triggers today (1.4 GB in use). Consider raising to `1.5` to reduce noise while keeping the `SwapCached > 200MB` as an early warning signal.
---
## Phase 2 — Self-Healing Automation *(Week 2)*
Scripts that fix known recurring issues automatically.
#### 2.1 Health-Check-Aware Container Watchdog
**Why the existing policy isn't enough:** All 38 containers already have `unless-stopped`. That policy restarts on *container process exit* only. When the web server process is alive but the health check endpoint returns `Connection refused`, Docker marks the container `unhealthy` but **does not restart it** — it keeps running indefinitely broken.
**Fix:** Systemd timer `docker-health-watchdog.timer` (runs every 10 minutes):
```bash
#!/bin/bash
# /usr/local/bin/docker-health-watchdog.sh
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')
for container in $UNHEALTHY; do
# Only restart if unhealthy for at least 3 consecutive checks (30 min)
failures=$(docker inspect "$container" | \
python3 -c "import json,sys; h=json.load(sys.stdin)[0]['State']['Health']['Log']; \
print(sum(1 for l in h[-3:] if l['ExitCode']!=0))")
if [[ "$failures" -eq 3 ]]; then
docker restart "$container"
echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Auto-restarted: $container (unhealthy 3x)" \
>> /var/log/docker-watchdog.log
# Telegram notify (reads token from $HERMES_HOME/.env)
fi
done
```
**Safety:** Never restarts a container that just became unhealthy (3-check cooldown). Logs every restart. Only targets health-check failures, not intentionally stopped containers.
**Rollback:** `systemctl disable docker-health-watchdog.timer`
#### 2.2 Fix `hermes-root-backup` Git Diverge
**Current failure:** Git fast-forward fails every ~10 minutes since 16:25 today (~30+ silent failures).
**Fix:** Patch the backup script to handle diverge gracefully:
```bash
if ! git pull --ff-only 2>/dev/null; then
# Log the diverge
git log --oneline -3 HEAD > /tmp/hermes-diverge-before.txt
git log --oneline -3 origin/main >> /tmp/hermes-diverge-before.txt
# Try rebase first (preserves local commits if intentional)
if ! git pull --rebase 2>/dev/null; then
# If rebase fails, reset to origin (backup is the source of truth)
git reset --hard origin/main
notify_telegram "⚠️ hermes-root-backup: diverged branch reset to origin/main"
fi
fi
```
**Risk:** `git reset --hard` loses any local-only commits on the backup repo. Acceptable here because the backup script's job is to *push to* origin — local commits shouldn't exist. Add a pre-check: if local commits exist that aren't on origin, alert instead of resetting.
#### 2.3 Container Memory Limits
**Validated against actual RSS data (Phase 2 data collected 2026-05-27):**
| Category | Current RSS | Proposed Limit | Reservation | Notes |
|---|---|---|---|---|
| Next.js web frontends | 1737 MB | `256m` | `64m` | 7× headroom for webpack spikes |
| Node/Fastify backends | 2067 MB | `384m` | `128m` | Allows burst for LLM calls |
| `invttrdg-backend` | 107 MB | `512m` | `256m` | High I/O service; watch after 0.3 |
| `trading-backend` | 92 MB | `512m` | `256m` | Active algo trading service |
| `platform-service` | 66 MB | `384m` | `128m` | Shared auth/platform layer |
| CosmosDB emulator | 145 MB | `1g` | `512m` | Can spike on write bursts |
| Prometheus | 57 MB | `256m` | `128m` | Stable but grows with series |
| Loki | 53 MB | `256m` | `128m` | Log ingestion can spike |
| Caddy | 27 MB | `128m` | `64m` | Proxy, very stable |
| Valkey (Redis) | 3.5 MB | `128m` | `32m` | Cache, tiny |
| Gitea | 79 MB | `512m` | `256m` | Git operations can spike |
| Ollama | 130 MB idle | **No limit** | — | Must accommodate model load (up to 8 GB) |
**Rollout strategy:**
1. Run `docker stats` baseline for 24h to confirm no container spikes beyond proposed limits
2. Apply limits per stack in docker-compose files (not `docker update` — recreate on next deploy)
3. Monitor for OOMKill events: `dmesg | grep -i oom` for 48h after rollout
4. **Never set limits on Ollama** — model loading is unpredictable and limits would kill inference
---
## Phase 3 — Dashboard Control Plane *(Weeks 34)*
**Prerequisite for all Phase 3 items:** Phase 0.1 (host volume mount) must be complete.
#### 3.1 VM Score Card (Automated)
**Where:** Dashboard `/vm` — top summary widget, auto-refreshes every 5 min
**Scoring algorithm (0100):**
```
CPU efficiency: 20 pts (steal < 2% = 20, < 5% = 15, < 10% = 10, 10% = 5)
Memory pressure: 20 pts (available > 6 GB = 20, > 3 GB = 15, > 1 GB = 5, else = 0)
Disk health: 15 pts (< 40% used = 15, < 55% = 10, < 70% = 5, else = 0)
Service health: 20 pts (0 unhealthy = 20, 12 = 15, 35 = 8, 6+ = 2)
Maintenance hygiene: 15 pts (last cleanup < 7 days + freed > 0 = 15, < 30 days = 8, else = 0)
LLM readiness: 10 pts (> 8 GB free RAM = 10, > 4 GB = 7, > 2 GB = 4, else = 1)
```
Score = sum. Display as gauge. Each dimension clickable to drill into its data.
**Dependencies:** Phase 1.2 (steal time in health check output).
#### 3.2 Cron Schedule & History Panel
**Where:** Dashboard `/vm` — "Maintenance" tab
**What:**
- Live table: 4 cron jobs × (last run, result, freed MB, next scheduled, "Run now" button)
- Last 30 cleanup runs as a sparkline: date vs MB freed
- One-click trigger for weekly / monthly / dry-run
**Backend endpoint:** `GET /api/vm/cron-status` — parse structured log + crontab
**Dependency:** Phase 0.1 (volume mount), Phase 1.1 (structured log parser).
#### 3.3 Container Management Panel
**Where:** Dashboard `/vm` — "Containers" tab
**What:**
- Full list: name, stack, health status, uptime, CPU %, RAM, restart count
- Filter chips: All | Unhealthy | No Memory Limit | By stack
- Per-container: Restart, View last 50 log lines, Show health check history
- Bulk: "Restart all unhealthy" with confirmation modal
**New backend endpoints:** `GET /api/vm/containers`, `POST /api/vm/containers/:name/restart`, `GET /api/vm/containers/:name/logs`
#### 3.4 Ollama / LLM Panel
**Where:** Dashboard `/vm` — "Models" tab
**What:**
- Models list: name, size, last used timestamp
- Currently loaded (from `ollama ps`): model name, RAM used, CPU %, expires in
- RAM visualisation bar: [used by system] [model if loaded] [free]
- Warning banner: "Loading llama3.2-vision (7.8 GB) will leave ~1.2 GB free — swap pressure likely"
- Load / Unload model buttons
**Backend endpoints:** `GET /api/vm/ollama/models`, `POST /api/vm/ollama/load`, `DELETE /api/vm/ollama/unload`
**Note:** `qwen2.5-coder:1.5b` is currently loaded — confirmed via `ollama ps`.
---
## Phase 4 — Trend Analysis *(Weeks 56)*
**Key architectural note:** Prometheus + cAdvisor + node-exporter are **already running and storing ~2 weeks of metrics history** including steal time, disk I/O, memory, container CPU/RAM. Do NOT create a separate Cosmos DB store. Query Prometheus directly.
#### 4.1 Prometheus Query Endpoints in Dashboard Backend
**Where:** New `GET /api/vm/metrics/trend` endpoint group
**What:** Proxy queries to internal Prometheus (http://prometheus:9090 within Docker network):
```
/api/vm/metrics/trend/disk?range=7d → disk usage % over time
/api/vm/metrics/trend/memory?range=7d → available RAM + swap used over time
/api/vm/metrics/trend/steal?range=7d → CPU steal % over time (once 1.2 is deployed)
/api/vm/metrics/trend/containers?range=7d → unhealthy container count over time
/api/vm/metrics/trend/io?range=7d → block write rate (flag invttrdg spikes)
```
**Note:** `devops-backend` is on `dashboard_default` network, Prometheus is on `learning_ai_common_plat_default`. Either add devops-backend to Prometheus network, or expose Prometheus on a host port (internal only, not via Caddy).
#### 4.2 Trend Charts on Dashboard
**Where:** Dashboard `/vm` — collapsible "Trends" section below score card
**What (7-day / 30-day toggle):**
- Disk % over time + linear projection line → "estimated to hit 55% warning in X days"
- Swap used over time (detect slow memory leak)
- CPU steal % over time (detect host degradation trend)
- Unhealthy container count per day
- Block write rate: flag days with `invttrdg-backend` anomalies
**Library recommendation:** Recharts (already likely in the Next.js project) or lightweight Chart.js wrapper.
#### 4.3 Weekly Digest (Telegram)
**Where:** New cron job — Monday 08:00 UTC — `vm-cleanup.sh --weekly-digest`
**What:**
```
📊 Weekly VM Digest — srv1491630
Week ending 2026-06-01
🖥 CPU Steal: 8.2% avg ⚠️ (host contention — escalate if > 10%)
💾 Disk: 37% (freed 257 MB this week via cleanup)
🧠 RAM: 10 GB free avg ✓
🔄 Swap peak: 1.4 GB ⚠️
🐳 Containers: 7 unhealthy (action required)
🤖 LLMs run: qwen2.5-coder:1.5b (3 sessions this week)
🧹 Cleanups: 1 standard, 0 full
📅 Next full: 2026-06-01
Top action: Restart 7 unhealthy web containers
```
**Dependency:** Phase 4.1 (needs Prometheus for weekly averages), Phase 1.2 (steal metric must be in Prometheus).
---
## Phase 5 — Advanced / Backlog
| Item | Description | Trigger condition |
|---|---|---|
| **Add Grafana** | Container alongside Prometheus for richer dashboards; pre-built node-exporter dashboards available | Phase 4 charts feel limited |
| **Deployment ↔ health correlation** | Mark deploys on trend charts; correlate health dips to specific releases | After Phase 4.2 exists |
| **Multi-VM support** | Extend all above to aggregate across VMs | Adding second VM |
| **`invttrdg-backend` write audit** | Persistent investigation: what generates 22 GB/day of block writes? Add per-container I/O alert | After Phase 0.3 |
| **Chaos validation** | Monthly: watchdog stops a test container, verify restart within 10 min, report result | After Phase 2.1 |
| **Ollama GPU readiness check** | Detect GPU availability, surface in LLM panel as "GPU: none — inference will be slow" | Before adding large models |
| **Container image freshness** | Alert when container is running image > 30 days old (not rebuilt) | When deploy pipeline matures |
| **Cost attribution** | Tag containers by product (trading, notes, clock...) — RAM/CPU cost per product | When billing needed |
| **Backup health tracking** | `hermes-root-backup` and `uma-hermes-backup` results surfaced in dashboard | After Phase 2.2 |
---
## Implementation Order
```
Day 12 Phase 0 ── Fix broken foundations (VM module, logrotate, I/O investigation)
⚠️ MUST complete before any Phase 3 dashboard work
Week 1 Phase 1 ── Observability (steal metric, cron history, unhealthy detail, swap)
1.2 (steal) → unblocks 3.1 (score card)
1.1 (cron log format) → unblocks 3.2 (cron panel)
Week 2 Phase 2 ── Self-healing (watchdog, hermes-backup fix, memory limits)
2.1 requires: logrotate entry (Phase 0.2)
2.3 requires: 24h baseline observation first
Weeks 34 Phase 3 ── Dashboard control (score card, cron panel, containers, Ollama)
All require: Phase 0.1 (host volume mount)
3.1 requires: Phase 1.2 deployed
3.2 requires: Phase 1.1 deployed
Weeks 56 Phase 4 ── Trend analysis (Prometheus queries, charts, weekly digest)
4.1 requires: devops-backend on same Docker network as Prometheus
4.2 requires: Phase 4.1
4.3 requires: Phase 4.1 + Phase 1.2
Backlog Phase 5 ── Advanced items, trigger-based
```
---
## Success Criteria (how to know each phase is done)
| Phase | Done when… |
|---|---|
| 0.1 | `curl localhost:4004/api/vm/health` returns valid JSON with disk/load/swap data |
| 0.2 | `logrotate -d /etc/logrotate.d/bytelyst-vm` exits 0; logs present in `/var/log` |
| 0.3 | Root cause of 22 GB/day writes identified + alert configured |
| 1.1 | Dashboard `/vm` shows "Last cleanup: [date], freed [MB]" parsed from log |
| 1.2 | `vm-health-check.sh` includes steal % in output; Telegram sends steal alert at > 5% |
| 1.3 | Dashboard shows each unhealthy container's last health log + restart button works |
| 2.1 | Watchdog restarts an intentionally-broken test container within 30 min |
| 2.2 | `hermes-root-backup` runs 10 times without failure after fix deployed |
| 2.3 | All containers show memory limits in `docker inspect`; 48h with 0 OOMKill events |
| 3.1 | Score card renders live score; each dimension links to its detail |
| 4.1 | `/api/vm/metrics/trend/disk?range=7d` returns valid Prometheus time-series JSON |
| 4.3 | Telegram receives weekly digest on Monday 08:00 UTC |
---
## What This Roadmap Delivers
| Today | After roadmap |
|---|---|
| `/api/vm/health` silently fails | VM module works; health data feeds dashboard |
| 8.2% steal is invisible | Daily alert + trend chart + score card dimension |
| "7 unhealthy" — no context, no fix | Drill-down to health log; auto-restart within 30 min |
| Cleanup log is a raw text dump | Structured panel: when, what, how much freed |
| invttrdg writing 22 GB/day — undetected | I/O alert + investigation complete |
| No memory guardrails on 39 containers | Per-container limits; OOM events alerted |
| 2 weeks of Prometheus data — no UI | Trend charts: disk projection, swap, steal over time |
| Manual VM diagnosis = 30 min SSH session | Score card auto-refreshes every 5 min |
| Ollama loads silently, may cause swap storm | RAM impact warning before load |
---
## Change Log (v1 → v2)
| # | What changed | Why |
|---|---|---|
| 1 | Added **Phase 0** (fix broken foundations) | devops-backend VM module non-functional; must fix first |
| 2 | Phase 4.1 changed from Cosmos DB → **Prometheus queries** | Prometheus already running with 2 weeks of history; Cosmos would be duplicate |
| 3 | Phase 2.1 restart explanation corrected | `unless-stopped` does not react to health check failures; process is alive |
| 4 | Phase 1.2 steal time corrected | Requires **2 samples** 1s apart, not single `/proc/stat` read |
| 5 | Phase 2.3 memory limits **validated against actual RSS data** | Prevents proposing limits that would OOM running services |
| 6 | Phase 5 added **invttrdg I/O investigation** + Grafana option | 22 GB/day block writes is the highest-risk untracked issue on the machine |
| 7 | Added Phase 0.2 **logrotate** for new log files | `/var/log/docker-watchdog.log` would grow unbounded |
| 8 | Added **architectural decisions** section (Prometheus vs Cosmos, host exec strategy) | Prevents wasted build on wrong approach |
| 9 | Added **success criteria** per phase | Makes "done" objective and testable |
| 10 | Added explicit **phase dependency map** | Phase 3 items would fail if built before Phase 0 |
| 11 | Corrected LLM status: `qwen2.5-coder:1.5b` **is currently loaded** | `ollama ps` confirmed; not idle as v1 stated |

View File

@ -0,0 +1,221 @@
# ADR-0001: Docker build lockfile policy
> **Status:** Accepted (decision); Deferred (implementation) · **Date:** 2026-05-27
> **Context:** docker-build-optimization-roadmap §A3 · **Supersedes:** None
> **Authors:** Platform DevOps
---
## 1. Context
The pilot Phase A work in `docker-build-optimization-roadmap` standardized
on `pnpm install --lockfile=false` inside Docker for both
`learning_ai_clock` (web + backend) and `learning_ai_peakpulse` (backend).
That choice unblocked Phase A by sidestepping a structural mismatch:
- `pnpm-lock.yaml` is generated against the **outer pnpm workspace**, which
includes `../learning_ai_common_plat/packages/*` as workspace members
(sibling-repo path).
- Inside the Docker build context, the sibling repo doesn't exist
(a single-repo build context is intentionally used for hermeticity).
- `--frozen-lockfile` therefore fails immediately with workspace
resolution errors (finding F2 in the roadmap audit).
`--lockfile=false` skips lockfile validation entirely and re-resolves all
dependencies against the registry on every `pnpm install`. This is
correct for the workspace-mismatch problem but introduces non-determinism:
the **same Dockerfile + same source tree can produce a different lockset**
across two builds if upstream `@bytelyst/*` versions move between them.
Phase A2's BuildKit cache mount mitigates the *speed* cost of
re-resolution but not the *determinism* cost.
This ADR records the decision on which long-term policy to adopt for
Docker builds. Implementation is deferred to a future Phase A3 sprint.
---
## 2. Options considered
### Option A — Keep `--lockfile=false` (status quo)
**How it works.** Docker `pnpm install` re-resolves on every cold build.
Cache mount preserves the pnpm content-addressed store across builds, so
warm rebuilds don't pay re-resolution cost.
**Pros:**
- Zero churn — already shipped in Phase A.
- Tolerates sibling-repo workspace mismatch for free.
- Tolerates `*` semver across all `@bytelyst/*` deps without rework.
- Compatible with the F17 fix (Gitea `host.docker.internal` URLs).
**Cons:**
- **Non-deterministic builds.** Same Dockerfile + same source can produce
different `node_modules` if a dependency was published between two
cold builds. CI runs days apart can ship divergent images for the same
commit.
- No supply-chain pinning. Any compromised upstream auto-rolls forward.
- `pnpm audit` on the host can disagree with what's actually inside
the image.
### Option B — Generate a Docker-only flat lockfile during build
**How it works.** Add a build step that runs `pnpm install --lockfile-only`
in a temp dir against a flattened `pnpm-workspace.yaml` that excludes
sibling-repo paths, then `--frozen-lockfile` against that generated lock.
**Pros:**
- Deterministic *within a single build* — same registry state at the
moment of the build always produces the same lockset.
- Doesn't require changes to the source tree's `pnpm-workspace.yaml`.
**Cons:**
- Still non-deterministic across builds (the lock is regenerated each time
unless cached separately).
- Adds Dockerfile complexity and a non-trivial new failure mode
(workspace-flattening logic).
- Marginal value over Option A given the cache mount.
### Option C — Vendor a Docker-flattened lockfile in the repo
**How it works.** Commit a `pnpm-lock.docker.yaml` (or similar) per repo
that's generated against a flattened workspace. Dockerfile uses
`pnpm install --frozen-lockfile --lockfile=pnpm-lock.docker.yaml`.
**Pros:**
- Fully deterministic. Same commit → same lockset → same image.
- Supply chain pins enforced.
- `pnpm audit` matches image contents.
**Cons:**
- Two lockfiles to maintain (the workspace one + the Docker one).
- Drift risk between the two — solved only by a CI gate that regenerates
the Docker lockfile on every PR that touches `package.json`.
- Requires a tested regenerate-on-CI workflow per repo.
- Workspace flattening logic must be encoded somewhere (script in
`common-plat/scripts/regen-docker-lockfile.sh`).
### Option D — Restructure to single-repo workspace (eliminate sibling)
**How it works.** Inline the consumed `@bytelyst/*` packages into each
product repo (vendor them) so there is no sibling-workspace dependency.
Then `--frozen-lockfile` works trivially.
**Pros:**
- Cleanest from a Docker-build-determinism standpoint.
**Cons:**
- Massive churn across 14+ product repos.
- Defeats the entire `learning_ai_common_plat` shared-package model.
- Multiplies maintenance cost of `@bytelyst/*` updates by the number of
consumers.
- Out of scope; would supersede the entire ecosystem architecture.
---
## 3. Decision
**Adopt Option A (`--lockfile=false`) as the official short-term policy.**
**Plan to migrate to Option C (`pnpm-lock.docker.yaml`) when supply-chain
determinism becomes a hard requirement** (e.g., before any production
deployment of a Docker-built image, or before SOC2-style attestation).
**Reasoning:**
1. **Phase A is already shipped on Option A** with verified speed wins
(warm rebuilds 2.75.4 s across all surfaces). Switching policies
mid-rollout would invalidate metrics + add risk.
2. **The cache mount (Phase A2) addresses the speed concern** that
Option A creates. The remaining concern is determinism, which is a
correctness concern — but the actual blast radius is limited because:
- All `@bytelyst/*` deps are first-party and pinned in source repos.
- Third-party deps already have fixed semver in `package.json` (no
loose `*` ranges to public registries).
- The Gitea registry is the only `@bytelyst/*` source — no public
supply-chain risk for the in-house deps.
3. **Option C is the right end state** but requires CI infrastructure
that doesn't exist yet (auto-regen-on-PR). Building it inside this
roadmap is scope creep.
4. **Option B is dominated by Option C** — same complexity, weaker
guarantees.
5. **Option D is non-starter** — it would require redesigning the
ByteLyst shared-package model.
---
## 4. Consequences
### Positive
- Phase A speed wins are preserved with zero policy churn.
- `pnpm-lock.yaml` continues to live in source repos for host development;
it stays in `.dockerignore` for Docker builds.
- The decision is reversible: switching to Option C in the future is
additive (add a Docker lockfile + change one Dockerfile line).
### Negative
- Same commit can produce different Docker images on different days. CI
must not assume image hash stability for a given commit.
- `pnpm audit` results from the host don't match Docker image contents.
Workaround: run `pnpm audit` inside the built container as a separate
CI job (cheap; no rebuild needed).
- Supply-chain attestation (SOC2, SLSA) cannot be produced for these
images today. Acceptable while there is no production traffic.
### Migration trigger
Switch to Option C when **any** of the following becomes true:
1. A production environment (paid customers, real PII) deploys a
Docker-built image from this codebase.
2. A regulatory/audit requirement demands reproducible builds.
3. A supply-chain incident occurs (compromised upstream package) and
we need rollback granularity finer than "rebuild from current `*`".
4. The cache-mount speed win disappears (e.g., CI runner switch removes
BuildKit cache persistence).
### Implementation sketch (when triggered)
1. In `learning_ai_common_plat`, add `scripts/regen-docker-lockfile.sh`:
- Reads each product repo's `package.json`.
- Generates a flattened `pnpm-workspace.yaml` (no sibling paths).
- Runs `pnpm install --lockfile-only` against the Gitea registry.
- Writes `pnpm-lock.docker.yaml` back to the product repo.
2. Each product repo gets a `.gitea/workflows/regen-docker-lockfile.yml`
that runs the script on PR-touch of `package.json` and either:
- commits the regenerated lockfile (auto-PR), or
- fails the PR with a "run regen-docker-lockfile.sh and commit" message.
3. Each product Dockerfile changes one line:
```dockerfile
# before
RUN pnpm install --ignore-scripts --lockfile=false
# after
COPY pnpm-lock.docker.yaml ./pnpm-lock.yaml
RUN pnpm install --ignore-scripts --frozen-lockfile
```
4. `.dockerignore` removes `pnpm-lock.yaml` exclusion (or adds explicit
include for `pnpm-lock.docker.yaml`).
This work is **not scoped** in the current roadmap and should be its own
small ADR-driven sprint.
---
## 5. Status tracking
| Phase | State | Notes |
|---|---|---|
| Decision | ✅ Accepted | This ADR |
| Implementation | ⏸ Deferred | Triggered by §4 conditions |
| Trigger monitor | ⚳ Open | Re-evaluate when Phase D rollout begins |
---
## 6. References
- `docker-build-optimization-roadmap.md` §0 F1, F2 (lockfile findings)
- `docker-build-optimization-roadmap.md` §A3 (deferred phase)
- `docker-build-optimization-roadmap.md` §A2 (BuildKit cache mount that
mitigates the speed concern of Option A)
- `learning_ai_common_plat/AGENTS.md` (canonical pnpm workspace config)

View File

@ -0,0 +1,943 @@
# Docker Build Optimization Roadmap
> **Status:** Draft v14 (**ALL 20 ITEMS COMPLETE** — Phases A, B, C, D, E green across 12 consumer repos; C5 closed by end-to-end Gitea Actions validation on both pilots) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
> ecosystem-wide rollout.
>
> **Upstream prerequisite shipped (commit `610a59fd` in `learning_ai_common_plat`):**
> Gitea owner parameterization + helper scripts (`scripts/gitea/doctor.sh`,
> `scripts/gitea/token.sh`). The `.npmrc` template now resolves owner from
> `${GITEA_NPM_OWNER:-learning_ai_user}`. **All A0-1 work in this roadmap
> inherits this — Dockerfile/.npmrc.docker must use the same `${GITEA_NPM_OWNER}`
> placeholder, not a hardcoded literal.**
---
## 0. Pre-flight audit findings (2026-05-27)
A read-only audit of pilot repos + lessons from recent live incidents +
the A0-V execution iterations on clock surfaced **18 concrete bugs/gaps**
(F14F15 added after the Gitea-hardening commit; F16F18 added during the
A0-V execution sweep on clock, 2026-05-27). The actual state of the ecosystem is closer to the
inverse of the casual narrative: tarballs are the de facto default, the
Gitea-registry path is partially wired, and there is a separate class of
"build green, app broken" silent failures (F11F13) that the speed-focused
plan needs to address first.
| # | Finding | Location | Severity |
|---|---|---|---|
| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | High |
| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*``--frozen-lockfile` inside Docker will fail unless workspace is flattened or sibling tree is copied | both pilots | High |
| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't work in this repo today | `peakpulse/.npmrc.docker` | High |
| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside Docker, `localhost` is the container, not the host registry | `clock/.npmrc.docker` | High |
| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — wholly dependent on pre-populated `.docker-deps/` | `clock/backend/Dockerfile` | High |
| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it; no `--mount=type=secret` either | `clock/web/Dockerfile` | Medium |
| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block | `peakpulse/docker-compose.yml` | Medium |
| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`) — 510 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
| F10 | No BuildKit `--mount=type=cache` for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
| **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** |
| **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** |
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst``learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
| **F16** | **At least 10 published `@bytelyst/*` packages had unrewritten `workspace:*` refs in their `package.json` dependencies.** Root cause: `publish-outdated-packages.sh` extracts a pnpm-packed tarball then **re-packs with `npm pack`** (workaround for a historical Gitea-compat issue with pnpm's tarball format), and `npm pack` doesn't recognize the pnpm-specific `workspace:` protocol — it passes it through literally. **Fixed in `common-plat@cfcfc7bb`** (`fix(gitea): rewrite workspace:* in published tarballs (F16)`) — inserted a workspace:* rewriter between extract and npm-repack + a defense-in-depth grep guard. Republished 10 affected packages. | `common-plat` publish flow + Gitea registry | **Critical (FIXED)** |
| **F17** | **Gitea bakes `localhost:3300` into the `dist.tarball` field of every published package's metadata.** Inside Docker, `localhost` is the container itself, not the host — so even after a successful registry-metadata fetch via `host.docker.internal`, pnpm follows the tarball URL to `localhost:3300` and ECONNREFUSEs. Root cause: Gitea `app.ini`'s `ROOT_URL=http://localhost:3300/` was baked at publish time. **Fixed** by setting `ROOT_URL=http://host.docker.internal:3300/`, restarting Gitea, adding `127.0.0.1 host.docker.internal` to `/etc/hosts`, adding `host.docker.internal` to `NO_PROXY` (corp proxy was hijacking DNS), and republishing all 64 packages (`common-plat@dd90f709`). | Gitea `app.ini` + host `/etc/hosts` + every dev machine's `switch-network.sh` | **Critical (FIXED)** |
| **F18** | **`clock/web/package.json` had 4 `@bytelyst/*` deps declared as `file:` refs to sibling `../../learning_ai_common_plat/packages/*`** — a legacy pre-Gitea pattern. Inside Docker those paths don't exist, so `pnpm install` fails with `ERR_PNPM_LINKED_PKG_DIR_NOT_FOUND`. Discovered during clock web A0-V on 2026-05-27. **Fixed in `learning_ai_clock@8b5c767a3`** by rewriting to `*` semver. Same pattern likely lives in other product repos (especially anything that consumes `@bytelyst/ui`, `@bytelyst/design-tokens`, `@bytelyst/use-theme`) — audit needed in Phase D rollout. | `*/web/package.json` (and likely others) | **High** |
**Implications:**
- The original "switch to `--frozen-lockfile` + Gitea registry" plan requires
two upstream fixes first (F1, F2).
- F11F13 mean **correctness fixes must precede speed fixes**, otherwise we
ship faster builds of broken apps.
- F16 + F17 are **both fixed** as of 2026-05-27. Gitea path now works
end-to-end on clock. A-pre is largely complete; remaining items (A-pre-4,
A-pre-5) become Phase E checks.
- F18 (sibling `file:` refs in product repo manifests) is the same family as
F2 but separately tractable — fixed in clock, audit needed across other
repos as part of Phase D rollout.
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
F11/F13/F18 recurrence — silent in CI today. The registry-side guard
(publish-time check for `workspace:*` leaks) shipped in `common-plat@cfcfc7bb`
as part of the F16 fix.
---
## 1. Context: three build paths
| Path | Status today | Trigger | Notes |
|---|---|---|---|
| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock/notes | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to be the default |
| **Legacy `file:` refs** | Deprecated | — | Removed during pnpm/Gitea migration |
### Measurement targets
| Build | Baseline (observed) | Target after Phase A |
|---|---|---|
| Cold (no cache) | ~23 min | ≤ 2 min |
| Warm (one source file changed) | ~23 min | **< 30 s** |
| `docker-prep.sh` pack step alone | ~6090 s | < 30 s (pnpm pack cache) |
> Fill in actuals during Phase C.
---
## 2. Goals & non-goals
**Goals**
- ✅ Eliminate F11F13 class of silent "build green, app broken" failures
- ✅ Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
- ✅ Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
- ✅ Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
- ✅ Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
- ✅ Ship `docker-doctor.sh` CI lint as the durable insurance layer
**Non-goals**
- ❌ Migrating off pnpm or off the Gitea registry
- ❌ Adopting `--frozen-lockfile` until F2 is resolved (sibling-workspace problem)
- ❌ Publishing `@bytelyst/*` to the public npm registry
- ❌ Multi-platform builds (separate roadmap)
---
## 2.5 Canonical decisions
Decisions taken now to avoid contradictions later in the doc:
- **Base image:** `node:22-alpine` is canonical. For repos blocked by the
corporate proxy's Alpine SSL interception (currently only
`learning_ai_notes`), the Dockerfile MUST expose:
```dockerfile
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
```
Override per-repo via `--build-arg BASE_IMAGE=node:22-slim`. Document the
override in the repo's `AGENTS.md`.
- **Healthcheck host:** `127.0.0.1` (NOT `localhost`) in every
`docker-compose*.yml` `test:` block. See F12.
- **Lockfile mode in Docker:** `--lockfile=false` for now. `--frozen-lockfile`
is blocked on the A3 ADR (F2).
---
## 3. Phase A — Correctness + build speed + path correctness
Order matters: **A-pre must precede A0** (you can't build via a registry that
serves broken metadata); A0 must precede A1+ (you can't optimize a path that
doesn't work), and A8+A9 (correctness) must land before measuring speed wins.
### A-pre. Make the Gitea registry actually usable from Docker (F16 + F17 + F18)
**Owner:** `learning_ai_common_plat` + per-product repo · **Status:** ✅ done for clock + global config.
Three distinct bugs surfaced during clock A0-V on 2026-05-27:
- **F16:** Publish flow leaked `workspace:*` into published metadata.
- **F17:** Gitea baked `localhost:3300` into tarball URLs.
- **F18:** Product repos had legacy `file:` refs to sibling packages.
- [x] **A-pre-1.** Audit `publish-outdated-packages.sh` — confirmed it uses
`pnpm pack` then re-tars with `npm pack`, which loses `workspace:` rewriting.
- [x] **A-pre-2.** Patch publish script with a workspace:* rewriter + a
post-rewrite grep guard. Shipped in `common-plat@cfcfc7bb`.
- [x] **A-pre-3.** Verify all packages publish with `0` workspace:* refs.
Confirmed via curl scan across all 64 packages.
- [x] **A-pre-4.** F17 fix: set Gitea `ROOT_URL=http://host.docker.internal:3300/`,
restart Gitea, add `127.0.0.1 host.docker.internal` to `/etc/hosts`, add
`host.docker.internal` to `NO_PROXY` in `switch-network.sh`, bulk republish
all 64 packages. Shipped in `common-plat@dd90f709`.
- [x] **A-pre-5.** F18 fix: rewrite `file:../../learning_ai_common_plat/packages/*`
refs in `clock/web/package.json` to `*` semver. Shipped in `clock@8b5c767a3`.
Audit needed in Phase D for other product repos.
- [x] **A-pre-6.** Document Gitea config requirements (below).
### A-pre-6. Gitea configuration prerequisites (one-time per dev machine)
The Gitea registry MUST be configured with `ROOT_URL=http://host.docker.internal:3300/`
so published tarball URLs are reachable from inside Docker containers. The
host `/etc/hosts` MUST resolve `host.docker.internal` to `127.0.0.1` so the
same URLs work from the host shell.
On macOS (Homebrew Gitea):
```bash
# 1. Edit Gitea's app.ini
sudo -e /opt/homebrew/var/gitea/custom/conf/app.ini
# change: ROOT_URL = http://localhost:3300/
# to: ROOT_URL = http://host.docker.internal:3300/
# 2. Restart Gitea
brew services restart gitea
# 3. Add /etc/hosts entry so host.docker.internal resolves on the host too
sudo sh -c 'grep -q host.docker.internal /etc/hosts || \
echo "127.0.0.1 host.docker.internal" >> /etc/hosts'
# 4. Ensure host.docker.internal is in NO_PROXY for corp shells
# (already done in switch-network.sh as of common-plat@dd90f709)
source ~/.zshrc # reload
# 5. Verify
curl -sS http://host.docker.internal:3300/api/v1/version
# expected: {"version":"1.25.5"} or similar
```
### A0. Make the Gitea-registry path actually work (clock + peakpulse)
- [ ] **A0-1.** Standardize `.npmrc.docker` to use templated host AND owner so it works on host (`localhost`) and inside Docker (`host.docker.internal`), and so future owner renames are a one-line env change:
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```
> **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read
> time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]).
> That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` + `ARG GITEA_NPM_OWNER`
> → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` / `ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER`
> **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be
> exported from the BuildKit secret mount inside the same `RUN` (since secrets
> don't persist as env across layers).
>
> **Note on F14:** The canonical `.npmrc` (host-side) template already uses
> `${GITEA_NPM_OWNER}` (shipped in common-plat commit `610a59fd`).
> `.npmrc.docker` lagged behind because Docker builds have a separate file —
> A0-1 brings them into parity.
[pnpm-npmrc]: https://pnpm.io/npmrc
- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3)
- [ ] **A0-3.** Add `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` build args + `secrets:` block to every service in `docker-compose.yml`:
```yaml
build:
context: .
dockerfile: backend/Dockerfile
args:
GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
secrets:
- gitea_npm_token
secrets:
gitea_npm_token:
environment: GITEA_NPM_TOKEN
```
- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` to each service so Linux Docker can resolve the host
- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart). Reference `bash ../learning_ai_common_plat/scripts/gitea/token.sh status` for verification.
- [ ] **A0-D.** **Run `gitea-doctor` before any Docker build** (addresses F15). Inline into deploy/CI workflows:
```bash
bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
docker compose build
```
- Locally: shell alias or `Makefile` target `make build` that runs doctor then `docker compose build`.
- In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
> **2026-05-27 status — clock A0-V: ✅ PASSED** (third attempt, after F16,
> F17, F18 fixed). Cold-build wall-clock:
> - backend: **59.2 s** (commits: `clock@0be887288` + `common-plat@cfcfc7bb` + `common-plat@dd90f709`)
> - web: **3:13 (193 s)** (commits: above + `clock@8b5c767a3`)
>
> Both surfaces resolve `@bytelyst/*` from the Gitea registry end-to-end —
> no `docker-prep.sh` tarballs, no sibling `file:` refs, no proxy interference.
> See §3.A7 metrics table.
### A1. Replace `npm install -g pnpm@X` with corepack
- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
```dockerfile
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
```
- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` and `web/package.json` matches (already `pnpm@10.6.5` in peakpulse backend)
### A2. Add BuildKit pnpm-store cache mount
- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
- [ ] **A2-2.** Wrap install step with cache + secret mount:
```dockerfile
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
```
- [ ] **A2-3.** Verify cache mount is active: `docker buildx du --filter type=exec.cachemount` shows non-zero size after a build. **Real success metric** is wall-clock: warm rebuild (touching one source file) drops to < 30 s.
### A3. Decide lockfile policy ✅ DONE (ADR-0001)
Two options — pick one in a short ADR before implementing:
- **Option 1: Keep `--lockfile=false`** (current pragmatic approach)
- ✅ No sibling-workspace complications
- ❌ No reproducibility guarantee inside Docker
- ❌ Slower installs (full resolution every build)
- **Option 2: Generate a Docker-only lockfile** via `pnpm install --lockfile-only` against a flattened `package.json` that resolves `@bytelyst/*` to semver
- ✅ Reproducibility
- ✅ Faster installs
- ❌ New build step + tooling
- ❌ Drift risk between dev lockfile and Docker lockfile
- [x] **A3-1.** ADR written: [`docs/adr/0001-docker-build-lockfile-policy.md`](./adr/0001-docker-build-lockfile-policy.md) — **Option 1 accepted** (keep `--lockfile=false` short-term; revisit after Phase D).
- [x] **A3-2.** `--frozen-lockfile` adoption deferred per ADR; tracked as future work in §11.
### A4. Restructure layer order
- [ ] **A4-1.** Reorder COPY/RUN so deps-install layer is `package.json` + `.npmrc.docker` ONLY, then a separate layer for `src/`, config files, `shared/`
- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)
### A5. Gate `.docker-deps/` behind a build arg
- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
- [ ] **A5-2.** Use wildcard COPY so missing dir doesn't break the build:
```dockerfile
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
```
- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` does NOT exclude it when tarball mode is in use
### A6. `.dockerignore` audit
- [ ] **A6-1.** Confirm exclusions: `node_modules`, `**/node_modules`, `dist`, `.next`, `*.log`, `.env`, `.env.*`, `.git`, `*.bak`
- [ ] **A6-2.** Remove: `pnpm-lock.yaml` exclusion (was correct under `--lockfile=false`, blocks future optimization)
- [ ] **A6-3.** Confirm `.docker-deps/` is NOT excluded when tarball path is active
### A7. Measure & record
| Repo | Surface | Cold (A0-V) | Cold (post-A2) | Warm (post-A2) | Notes |
|---|---|---|---|---|---|
| clock | backend | **59.2 s** | **64.7 s** | **2.9 s** | Cold essentially flat (corepack adds ~1 s; cache mount empty on first run). Warm → 95.1% reduction. Commits: `clock@8b5c767a3` (A0-V), `clock@f6a806ff3` (A1+A8+A9), `clock@55e8d22d3` (A2+A5+A6) |
| clock | web | **193 s (3:13)** | **291 s (4:51) †** | **5.4 s** | Warm → 97.2% reduction. † Cold variance — see footer |
| peakpulse | backend | — (was tarball-only path) | **72.2 s** | **2.7 s** | Warm → 96.3% reduction. Commits: `peakpulse@11a6bc5` (Phase A), `peakpulse@6523a1a` (.gitkeep fix), `clock@1465e06b1`+`d69003c1f` (mirror .gitkeep fix) |
**Footer note on cold-build variance.** Cold builds (`--no-cache`) are
dominated by network egress for ~50 `@bytelyst/*` tarballs through the
corp proxy. A second measurement of clock web cold-build came in at
291 s vs 174 s in the previous step — same Dockerfile path, different
network-side latency. Cold build is **not** the optimization target of
this roadmap; warm rebuild is. Run `pnpm store prune` on the host or use
a local registry mirror if cold-build determinism is needed.
Measurement commands:
```bash
# Cold (clear all layer cache; cache mounts may still persist)
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend
# Warm (one source file changed; deps unchanged)
touch backend/src/server.ts
time DOCKER_BUILDKIT=1 docker compose build backend
# Deps-changed (touch package.json; pnpm store cache helps here)
touch backend/package.json
time DOCKER_BUILDKIT=1 docker compose build backend
```
### A8. Config-file COPY audit & canonical pattern (addresses F11, F13)
- [ ] **A8-1.** For every Dockerfile in scope, list all build-time files present in the surface directory (`web/` or `backend/`) that affect the build:
- `postcss.config.{js,mjs,cjs,ts}`
- `tailwind.config.{js,mjs,cjs,ts}`
- `next.config.{js,mjs,ts}`
- `tsconfig*.json`
- `package.json`
- `.npmrc.docker`, `.npmrc`
- `babel.config.*` (if present)
- `drizzle.config.*` (if present)
- `vitest.config.*` (only if the build needs it)
Verify each is COPY'd in the Dockerfile.
- [ ] **A8-2.** Choose canonical COPY pattern. **Decision: middle-ground glob** for web surfaces:
```dockerfile
COPY web/*.{json,ts,mjs,js,cjs} ./
COPY web/public/ ./public/
COPY web/src/ ./src/
```
Trade-off: glob picks up unintended root-level files if any are added later, but **dramatically reduces F11/F13 risk**. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
- [ ] **A8-3.** Repo-by-repo migration: replace enumerated `COPY web/foo ./foo` with the glob pattern; verify the resulting image has all expected files via `docker run --rm <img> ls -la`.
### A9. Healthcheck canonicalization (addresses F12)
- [ ] **A9-1.** Replace `localhost` with `127.0.0.1` in every `docker-compose*.yml` healthcheck `test:` block. Sweep with:
```
rg -l 'http://localhost' --glob 'docker-compose*.yml'
```
- [ ] **A9-2.** Standardize healthcheck shape:
- **Alpine-based images:**
```yaml
healthcheck:
test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
```
- **Slim/Debian images** (`wget` not always present, but `node` is):
```yaml
healthcheck:
test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
```
- [ ] **A9-3.** Add `start_period` (10s minimum) — prevents flaky "container started but app not yet listening" false-negatives.
---
## 4. Phase B — Hermetic-fallback polish (`docker-prep.sh`)
`docker-prep.sh` is duplicated with minor variations across product repos.
**Promotion to canonical home is now in Phase B, not Phase D** — drift
compounds linearly with time and the `.npmrc` template precedent proves the
pattern is cheap.
- [x] **B1.** `--dry-run` flag (`common-plat@a418a23e`).
- [x] **B2.** Idempotency guard via `*.bak` detection + `--force` override (`common-plat@a418a23e`).
- [x] **B3.** `.docker-deps/` and `*.bak` in `.gitignore` on both pilots (clock + peakpulse). Verified by `docker-doctor.sh`.
- [x] **B4.** Pre-commit hook landed. Canonical guard script `check-docker-prep-staged.sh` (`common-plat@c908c6d7`) blocks rewritten `package.json`, staged `.tgz` tarballs, and `.bak` files. Wired into both pilot `.husky/pre-commit` (`clock@4f8086bfa`, `peakpulse@c3195c8`). Verified with simulated staged tarballs → commit blocked.
Original spec:
```bash
# .husky/pre-commit
if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
exit 1
fi
if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|package\.json\.bak)$'; then
echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
exit 1
fi
```
- [x] **B5.** Auto-restore on script error via `trap cleanup_on_error EXIT` + `--keep` opt-out (`common-plat@a418a23e`).
- [x] **B6.** Standardized header + usage block per § 7.4 template (`common-plat@a418a23e`).
- [x] **B7. CANONICAL HOME landed.**
- [x] **B7-1.** Canonical at `learning_ai_common_plat/scripts/docker-prep.template.sh` + 2 helpers `_docker-prep-inject.js`, `_docker-prep-strip.js` (`common-plat@a418a23e`).
- [x] **B7-2.** `learning_ai_common_plat/scripts/sync-docker-prep.sh` syncs all 3 files (mirrors `sync-npmrc.sh`).
- [x] **B7-3.** `learning_ai_common_plat/scripts/check-docker-prep-drift.sh` for CI (mirrors `check-npmrc-drift.sh`).
- [x] **B7-4.** AGENTS.md "NEVER edit `docker-prep.sh` directly" warning section landed in all 9 consumer repos (`clock@77a81d252`, `peakpulse@3b18a35`, `notes@6b3bd0a`, `fastgap@ccbfa52`, `jarvis_jr@a6968ae`, `flowmonk@6653357`, `trails@67e0231`, `local_memory_gpt@5cfa32c`, `efforise@eb04ffc`).
- [x] **B8.** `--strip-overrides` option removes `pnpm.overrides` block as a safety net (`common-plat@a418a23e`).
- [x] **B+.** `--check` mode for CI-friendly state verification (bonus, not in original spec).
- [x] **B+.** Portable `sed -i` (BSD on macOS, GNU on Linux).
- [x] **B+.** Preserve `.docker-deps/.gitkeep` on clear (fixes earlier regression where `--restore` deleted the tracked file).
---
## 5. Phase C — Verification gates
Pilot exit criteria (must all pass before Phase D):
- [x] **C1.** Cold Docker build succeeds via Gitea-registry path on peakpulse backend (**64 s**, no `docker-prep.sh` invocation).
- [x] **C2.** Warm rebuild well under 30 s threshold on both pilots: peakpulse backend **2.6 s**, clock backend **3.3 s**.
- [x] **C3.** `docker-prep.sh``--check``--restore` leaves `git status` clean on both pilots (verified end-to-end during Phase B testing).
- [x] **C4.** Pre-commit hook blocks staged tarballs + `.bak` files (verified by simulating staged artifacts on clock).
- [x] **C5.** Gitea Actions CI green — **DONE**. Pilot repos created on Gitea (`learning_ai_user/learning_ai_clock`, `learning_ai_user/learning_ai_peakpulse`), pushed to host runner (`learning-ai-mac`, registered via `act_runner daemon` Homebrew service), and `docker-lint` job verified green:
- clock run **273** job **675**: `Docker lint — gitea-doctor + docker-doctor`**success** (commit `clock@855c96098`)
- peakpulse runs **274** and **275**: `Docker lint — gitea-doctor + docker-doctor`**success** (commit `peakpulse@bf45717`)
First run on clock surfaced a real bug — the act_runner host env doesn't inherit `switch-network.sh` exports, so `gitea-doctor` blew up on missing `GITEA_NPM_HOST/OWNER`. Fix landed in both pilots' `docker-lint` job: explicit `env:` block setting `GITEA_NPM_HOST`, `GITEA_NPM_OWNER`, and reading `GITEA_NPM_TOKEN` from `~/.gitea_npm_token`. Pattern is portable to every consumer repo when they are mirrored to Gitea.
- [x] **C6.** Build-time metrics already populated in § 3.A7 from earlier Phase A work.
- [x] **C7.** ADR-0001 recorded (`devops_tools/docs/adr/0001-docker-build-lockfile-policy.md`).
- [x] **C8.** `docker-doctor.sh` PASS on both pilots (only the 1 expected `pnpm-lock.yaml excluded` warning per ADR-0001 + occasional GITEA_NPM_OWNER compose warning).
- [x] **C9.** Web smoke test landed as Playwright spec `web/e2e/css-bundle-smoke.spec.ts` (`clock@b8440bfea`). Asserts title sanity + largest CSS bundle > 20 KB. Catches F11 regression at PR time.
---
## 6. Phase D — Ecosystem rollout
**Status:** DONE for all 12 consumer repos. D.1 artifacts + D.2 Dockerfile/compose fixes + D.3 advisory-warning cleanup + B7-4 AGENTS.md notes. `docker-doctor` exits PASS in every repo. Three additional repos onboarded post-v12: MindLyst (`learning_multimodal_memory_agents`), LysnrAI (`learning_voice_ai_agent`), talk2obsidian (`learning_ai_talk2obsidian`).
### D.1 — Tooling rollout (DONE)
All 9 consumer repos received the canonical infrastructure via `sync-docker-prep.sh`:
- `scripts/docker-prep.sh` + `_docker-prep-inject.js` + `_docker-prep-strip.js` (canonical sync)
- `scripts/docker-doctor.sh` (thin wrapper to canonical linter)
- `Makefile` with `make doctor` target
| Repo | Commit | Findings (docker-doctor warn-only) |
|---|---|---|
| `learning_ai_notes` | `216ebb8` | 6 warnings + errors: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax directive |
| `learning_ai_fastgap` | `36b67a2` | 4: F4/F14 `.npmrc.docker` hardcoded, F14 ARG missing, A5-2 wildcard, A2 syntax |
| `learning_ai_jarvis_jr` | `523dc08` | 5: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax (×2) |
| `learning_ai_flowmonk` | `65628f3` | 4: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax |
| `learning_ai_trails` | `8aef82c` | 6: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), A2 syntax (×2) |
| `learning_ai_local_memory_gpt` | `d17689a` | 5: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax (×2) |
| `learning_ai_efforise` | `b9fbbc3` | 5: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), A2 syntax (×2) |
| `learning_multimodal_memory_agents` (MindLyst) | `84a5d10` | full playbook applied (mindlyst-native/web/Dockerfile + backend/Dockerfile) |
| `learning_voice_ai_agent` (LysnrAI) | `0f1fa64` | full playbook applied (backend + user-dashboard-web + backend-python — Python Dockerfile correctly skips Node checks) |
| `learning_ai_auth_app` | _n/a_ | iOS/Android — no Docker surfaces |
| `learning_ai_talk2obsidian` | `793089e` | lighter rollout — single-stage Dockerfile, no `.docker-deps/` pattern; docker-doctor + Makefile + AGENTS.md note + syntax directive + `.gitignore` rules |
### D.2 — Per-repo Dockerfile/compose fixes (DONE)
All 7 consumer repos received mechanical Phase D.2 fixes via an idempotent
fixer script. Each repo's `docker-doctor.sh` now exits PASS (warnings only).
| Repo | Fix commit | docker-doctor result |
|---|---|---|
| `learning_ai_notes` | `b23a601` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_fastgap` | `af2463d` | PASS (1 warning: ADR-0001 `pnpm-lock.yaml`) |
| `learning_ai_jarvis_jr` | `1a97a3f` | PASS (1 warning: ADR-0001 `pnpm-lock.yaml`) |
| `learning_ai_flowmonk` | `412a657` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_trails` | `733477a` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_local_memory_gpt` | `8c68595` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_efforise` | `06ea0d0` | PASS (1 warning: healthcheck `start_period`) |
Applied fixes (each fix is idempotent):
| Finding | Fix |
|---|---|
| **F12** healthcheck `localhost` | Replaced with `127.0.0.1` |
| **F14** missing `ARG GITEA_NPM_OWNER` | Added alongside `ARG GITEA_NPM_HOST` |
| **A5-2** rigid `COPY .docker-deps/` | Changed to wildcard `COPY .docker-deps* ...` |
| **F11/F13** enumerated web config COPY | Replaced with glob `COPY web/*.json web/*.ts web/*.mjs ./` |
| **A2** missing syntax directive | Added `# syntax=docker/dockerfile:1.7` |
| **F4/F14** hardcoded `.npmrc.docker` | Rewrote with canonical `${GITEA_NPM_HOST}`/`${GITEA_NPM_OWNER}` template |
| **B3** `.gitignore` missing `*.bak` | Added rule |
| **B3** missing `.docker-deps/.gitkeep` | Created |
### D.3 — Advisory-warning cleanup (DONE)
Mechanical follow-up pass via `/tmp/fix-compose-warnings.sh` +
`/tmp/add-build-args.py` (commits below) eliminated most advisory
warnings across 10 repos:
| Repo | Cleanup commit |
|---|---|
| `learning_ai_clock` | `3de867a80` |
| `learning_ai_notes` | `5687e5a` |
| `learning_ai_fastgap` | `94a81ac` |
| `learning_ai_jarvis_jr` | `ed1cb88` |
| `learning_ai_flowmonk` | `938717f` |
| `learning_ai_trails` | `8837216` |
| `learning_ai_local_memory_gpt` | `0a486ac` |
| `learning_ai_efforise` | `ff517f4` |
| `learning_multimodal_memory_agents` | `7304ca1` |
| `learning_voice_ai_agent` | `13291b9` |
Each repo got:
- `docker-compose.yml`: full `build.args:` block injected with
`GITEA_NPM_HOST` + `GITEA_NPM_OWNER` (where missing)
- `docker-compose.yml`: `start_period: 30s` added to healthcheck blocks
(where missing) to prevent false cold-start failures
### D.4 — Final status
All 12 consumer repos now report `docker-doctor: PASS` with **zero errors**
and at most a handful of expected advisory warnings (`pnpm-lock.yaml`
excluded per ADR-0001; talk2obsidian's short-form `build: .` which would
need yaml conversion to declare args).
---
## 7. Reference snippets
### 7.1 Canonical `.npmrc.docker`
Matches the host-side `.npmrc` template shipped in `common-plat` `610a59fd`.
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```
### 7.2 Canonical backend Dockerfile
```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build
# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]
```
> `--lockfile=false` is intentional pending the A3 ADR. Switch to
> `--frozen-lockfile` only once the sibling-workspace problem (F2) is resolved.
### 7.3 Canonical `docker-compose.yml` service block
```yaml
services:
backend:
build:
context: .
dockerfile: backend/Dockerfile
args:
GITEA_NPM_HOST: host.docker.internal
secrets:
- gitea_npm_token
extra_hosts:
- "host.docker.internal:host-gateway"
ports:
- "4010:4010"
environment:
- NODE_ENV=production
- PORT=4010
# ...
restart: unless-stopped
healthcheck:
# F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
secrets:
gitea_npm_token:
environment: GITEA_NPM_TOKEN
```
### 7.4 Hardened `docker-prep.sh` header
```bash
#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
# - Local Gitea registry (:3300) is down or unreachable, OR
# - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
# docker compose build
#
# Usage:
# ./scripts/docker-prep.sh # pack tarballs + rewrite package.json
# ./scripts/docker-prep.sh --dry-run # show what would change (no side effects)
# ./scripts/docker-prep.sh --force # override idempotency guard
# ./scripts/docker-prep.sh --restore # undo rewrite
# ./scripts/docker-prep.sh --keep # skip auto-restore on error
# ./scripts/docker-prep.sh --strip-overrides # remove pnpm.overrides block
#
# Side effects:
# - Creates .docker-deps/ (gitignored)
# - Backs up package.json → package.json.bak
# - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
# - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
# - Refuses to run if .bak files already exist (unless --force)
# - Auto-restores on error (trap EXIT) unless --keep passed
# - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak
```
### 7.5 Canonical Next.js web Dockerfile (addresses F11, F13)
```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json
# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/
ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1
RUN corepack enable && pnpm run build
# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public
EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]
```
> **Verification step after every web Dockerfile change:** smoke-test the
> built image by running it and curling the rendered HTML. Confirm the CSS
> bundle in `<link>` references is > 50 KB. A bundle of ~33 KB is the F11
> signature (only `@font-face`, no Tailwind utilities).
### 7.6 `docker-doctor.sh` skeleton (Phase E)
```bash
#!/usr/bin/env bash
# docker-doctor.sh — pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail
REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0
# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile
for cfg in postcss.config tailwind.config next.config; do
for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
[[ -f "$f" ]] || continue
base=$(basename "$f")
if ! grep -q "COPY web/${base}\\|COPY web/\\*" "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
FAILED=1
fi
done
done
# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
FAILED=1
fi
# Check 3: .npmrc.docker matches canonical template
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
if ! grep -q '\${GITEA_NPM_HOST}' "$REPO_DIR/.npmrc.docker"; then
echo "✗ F4: .npmrc.docker doesn't use \${GITEA_NPM_HOST} placeholder"
FAILED=1
fi
fi
# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi
# Check 5: base image is on approved list
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
[[ -f "$df" ]] || continue
if ! grep -qE 'FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))' "$df"; then
echo "✗ Unapproved base image in $df"
FAILED=1
fi
done
exit $FAILED
```
---
## 8. Phase E — Observability / lint (NEW)
Two complementary linters:
1. **`gitea-doctor`** — Gitea registry pre-flight (env + token + connectivity).
**Already shipped** in `common-plat` commit `610a59fd` at
`scripts/gitea/doctor.sh`. This roadmap only wires it into CI/build flows
(A0-D + E0 below).
2. **`docker-doctor`** — Dockerfile + compose-file static linter (see § 7.6
skeleton). To be built as part of this roadmap.
The two are intentionally separate concerns:
| Linter | Scope | When to run |
|---|---|---|
| `gitea-doctor` | runtime env, token, registry HTTP 200 | Before every build / deploy |
| `docker-doctor` | static analysis of Dockerfile + compose YAML | On every PR touching those files |
### Phase E checklist
- [ ] **E0.** Wire `bash scripts/gitea/doctor.sh --quiet` into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in `common-plat`; replicate via a reusable `actions/gitea-preflight@main` composite if Gitea Actions supports it, otherwise inline.
- [x] **E1.** Canonical `docker-doctor.sh` landed in `learning_ai_common_plat/scripts/docker-doctor.sh` (`common-plat@130883a7`). 15 checks codified from F1F18; verified PASS on both pilots and FAIL on un-migrated control (`learning_ai_notes`).
- [x] **E2.** Per-repo wrappers landed: `clock@aa5202fe7`, `peakpulse@af207b7`.
- [ ] **E3.** Wire into CI: run on PRs touching `Dockerfile`, `docker-compose*.yml`, `.dockerignore`, `.npmrc.docker`
- [ ] **E4.** Wire into pre-commit hook (warning-only at first, error after 2 weeks)
- [x] **E5.** Checks documented in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` (`common-plat@130883a7`).
- [ ] **E6.** Add `make doctor` target to each pilot repo that runs both `gitea-doctor` AND `docker-doctor`
Checks implemented by `docker-doctor.sh`:
| Check | Addresses | Action |
|---|---|---|
| Every `web/*.config.*` file is COPY'd | F11, F13 | Error |
| `docker-compose.yml` healthcheck uses `127.0.0.1` | F12 | Error |
| `.npmrc.docker` uses `${GITEA_NPM_HOST}` AND `${GITEA_NPM_OWNER}` placeholders | F4, F14 | Error |
| Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker` | F14 | Error |
| `.dockerignore` doesn't exclude `pnpm-lock.yaml` | F1 | Warn (until A3 ADR lands) |
| Base image is on approved list (`node:22-alpine` or `node:22-slim` via `BASE_IMAGE` ARG) | Canonical decision | Error |
| `.docker-deps/` and `*.bak` in `.gitignore` | B3 | Error |
| `docker-compose.yml` passes `GITEA_NPM_OWNER` build arg | F14 | Warn |
---
## 9. Open questions (numbered TODOs, not blockers)
1. **Shared pnpm cache volume?** BuildKit caches are already shared across
builds by `id=pnpm`. Test whether a named Docker volume adds anything
before adding complexity.
2. **Custom base image?** Publish `bytelyst/node-pnpm:22{alpine,slim}` with
pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
3. **CI hostname?** Verify `host.docker.internal:host-gateway` works in Gitea
Actions Linux runners, or if a CI-specific Dockerfile variant is needed.
4. **Multi-platform builds?** `linux/amd64` + `linux/arm64` interact awkwardly
with cache mounts under `buildx`. Defer to separate roadmap.
5. **Workspace flattening?** Eliminate the `../learning_ai_common_plat/packages/*`
workspace entry inside Docker via a flattened `pnpm-workspace.yaml`.
Unlocks `--frozen-lockfile`. Requires lockfile regeneration step.
---
## 10. Execution order
1. **✅ v5 commit:** roadmap doc v5 lands; F16 documented (`devops_tools@ba8b4d1`).
2. **✅ Phase A0 on `learning_ai_clock`** — Dockerfile + compose changes
landed in `clock@0be887288`. Initial A0-V blocked on F16/F17/F18.
3. **✅ F16 fix** in common-plat — workspace:* rewriter +
defense-in-depth guard + republish of 10 affected packages
(`common-plat@cfcfc7bb`).
4. **✅ F17 fix** in common-plat + Gitea config — `ROOT_URL=host.docker.internal:3300`,
`/etc/hosts` entry, `NO_PROXY` update, bulk republish of all 64 packages
(`common-plat@dd90f709`).
5. **✅ F18 fix** in clock — 4 `file:` refs in `web/package.json` rewritten
to `*` (`clock@8b5c767a3`).
6. **✅ A0-V on clock PASSED.** v6 commit lands (`devops_tools@7627d55`).
7. **✅ A8 + A9 + A1** on clock (correctness + corepack) — `clock@f6a806ff3`.
Web cold dropped to 174 s; backend essentially flat at 60 s.
F11 guard verified (Tailwind utilities present in CSS bundle).
8. **✅ A2 + A4 + A5 + A6** on clock (cache mount + dockerignore) — `clock@55e8d22d3`.
Warm rebuilds: **backend 2.9 s, web 5.4 s** (9597% reduction).
A7 metrics table populated this commit.
9. **✅ Phase A0 → A6** on `learning_ai_peakpulse` backend (`peakpulse@11a6bc5`).
Cold 72.2 s, warm 2.7 s. Pattern from clock applied verbatim, plus a
side fix for `.docker-deps/.gitkeep` discoverability that was also
10. **✅ A3 ADR** — [`docs/adr/0001-docker-build-lockfile-policy.md`](adr/0001-docker-build-lockfile-policy.md).
Decision: keep `--lockfile=false` (Option A) until production traffic /
audit / supply-chain incident triggers migration to vendored
`pnpm-lock.docker.yaml` (Option C). Implementation deferred.
11. **✅ Phase E1/E2/E5** — `docker-doctor.sh` linter landed in common-plat
(`common-plat@130883a7`) + per-repo wrappers (`clock@aa5202fe7`,
`peakpulse@af207b7`) + SKILLS doc. Verified PASS on both pilots, FAIL with
6 specific findings on un-migrated control (`learning_ai_notes`).
12. **✅ Phase B** — `docker-prep.sh` hardened + promoted to canonical home in
common-plat (`common-plat@a418a23e`). Synced to both pilots
(`clock@27034d90f`, `peakpulse@563a45e`). Verified end-to-end on both
pilots: dry-run → pack → check (fail) → idempotency guard → restore →
`git status` clean.
13. **✅ Phase B4 + E3/E4/E6** — pre-commit guard
(`common-plat@c908c6d7`) + `.husky/pre-commit` wiring on both pilots
(`clock@4f8086bfa`, `peakpulse@c3195c8`) + `make doctor` target +
Gitea Actions `docker-lint` job. Verified guard blocks simulated
staged tarballs.
14. **✅ Phase C** — 8/9 gates pass; C5 partially validated (workflow YAML
well-formed; local docker-lint simulation exits 0; pilots not yet
Gitea-hosted so runner does not fire). Cold build 64 s, warm 2.6 s / 3.3 s.
15. **✅ Phase D.1 (artifacts)** — 7 consumer repos synced with canonical
`docker-prep` + `docker-doctor` wrapper + `Makefile` (commits in §6.D.1).
16. **✅ Phase D.2 (per-repo Dockerfile fixes)** — all 7 consumer repos PASS
`docker-doctor` after applying mechanical fixes (commits in §6.D.2).
Web smoke test (C9) landed on clock to guard F11 regression.
17. **✅ B7-4 AGENTS.md "do not edit" warnings** — landed in all 12 consumer
repos.
18. **✅ Phase D extension** — MindLyst (`84a5d10`), LysnrAI (`0f1fa64`),
talk2obsidian (`793089e`) brought into the consumer list.
`sync-docker-prep.sh` now lists 12 consumers; `docker-doctor` learned
to detect Python Dockerfiles and skip Node-specific checks
(`✅ommoclosed — Gite- Aclaonsfeer9fief green end-7)-e.d.
Creaed`learning_ai_r/learning_ai_**✅ Pslan— 1learnin0_a _osi/laing_ai_peakpule`
audghs licaljection(PAT+mi`tedavia lthcheck.startsor` cradentdali);ons.
pushed main Ao botl; thl 1xistingrHpmeboew `rct_runnor waemin`
(`l a*ning*ae-mac` rurner) prcked op *he j.bs adecuedthem
0. ~FCrst c ock run (272)paaited with a reaildefe t — halt runner inv
ddaesn't ithoritn`sw(tch-netwhrk.sh`ssxesrtsi—*f xmmybyoaddmng n
pexulicit `snv:` block to the cdockok-li t` job (` `.6ite2/workflows/cf.ybl`
`/ both 2ifots. Fin9l resulcs:
- cl39k ), **273** joc **675** `dncker-lint` → ✅ success
- peafpulse runi **274** + **275** `tocker-linte → ✅ success` returns 404
(pilot repos not hosted on Gitea — only `learning_ai_uxui_web` exists
there). Workflow YAML validates; local docker-lint simulation exit 0.
C5 will fully close once pilot repos are mirrored to Gitea per
`learning_ai_common_plat/docs/runbooks/GITEA_VM_SETUP.md`.
---
## 11. Risk register
| Risk | Mitigation |
|---|---|
| Removing `pnpm-lock.yaml` from `.dockerignore` exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep `--lockfile=false` for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct `id=` per repo (`id=pnpm-${repo}`) if observed |
| `host.docker.internal` doesn't resolve in Linux Docker | `extra_hosts: ["host.docker.internal:host-gateway"]` (A0-4) |
| Removing `.docker-deps/` from default builds breaks repos that haven't done A0 yet | Wildcard `COPY .docker-deps*` keeps both paths working during migration |
| `docker-prep.sh` `--force` is misused and `.bak` files get committed | Pre-commit hook (B4) blocks `.bak`, `.tgz`, rewritten `package.json` |
| Corp network blocks `host.docker.internal:3300` | Verify SSH tunnel reaches Gitea; document in operations.md |
| **F11 regression: build green, app ships with no CSS** | C9 smoke test + Phase E `docker-doctor.sh` check on `web/*.config.*` COPY coverage |
| **F12 regression: healthcheck false-fails on IPv6** | Phase E `docker-doctor.sh` grep for `localhost` in compose files |
| **F13 regression: new config file added, Dockerfile forgotten** | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` |
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
| **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean |
| **F17 regression: someone publishes from a shell that points Gitea `ROOT_URL` back to `localhost`** | Phase E `docker-doctor.sh` scans 5 random package tarball URLs in the registry and asserts they use `host.docker.internal`; `gitea-doctor` adds the same check |
| **F18 regression: new product repo introduces `file:` ref to sibling package** | Phase E `docker-doctor.sh` greps `**/package.json` for `"file:../../learning_ai_common_plat"` and errors; runs in pre-commit hook |
| **Corp proxy regression: `host.docker.internal` falls out of NO_PROXY on a dev machine** | `switch-network.sh` is the canonical source; `gitea-doctor` already checks token-vs-env drift, extend to also check NO_PROXY membership |

View File

@ -0,0 +1,321 @@
# Hermes Disaster Recovery Runbook
Goal: rebuild the ByteLyst root Hermes and Uma/Bheem Hermes setup on a new VM quickly, with durable memory, sessions, cron definitions, skills, scripts, and dashboard/service configuration restored from GitHub-backed artifacts.
Last verified: 2026-05-27.
## Current Recovery Confidence
**High for durable Hermes state.** Both root and Uma now have sanitized `.hermes` persistent backups pushed to GitHub and recurring systemd backup timers.
What is recoverable:
- root Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- Uma Hermes config, memories, skills, sessions JSON exports, cron definitions, scripts, channel directory, gateway state, SOUL, and Kanban DB
- root and Uma gateway systemd unit definitions
- root and Uma private dashboard systemd unit definitions
- root and Uma backup timer systemd unit definitions
- Uma wrapper/memory/docs repo content
- root operational docs and rebuild knowledge in this repo
What still requires operator-provided credentials or re-authentication:
- GitHub token or credentials for clone/push if the new VM does not already have them
- OpenAI Codex OAuth/provider login, unless restored from an encrypted emergency bundle
- Telegram bot/user credentials, unless restored from an encrypted emergency bundle
- Tailscale login for the new machine, unless restoring Tailscale state is explicitly chosen
- any optional provider/search/browser API keys
What is intentionally not restored from git:
- raw `.env` secret values
- Hermes `auth.json`
- raw `state.db`, SQLite WAL/SHM files, logs, cache directories, sandboxes, locks, and PIDs
- live OS processes or in-flight terminal commands that were running at the exact moment the VM was lost
Expected data-loss window:
- durable backups run every 10 minutes through systemd timers
- latest in-memory/live process activity since the last backup may need manual reconstruction from Telegram/GitHub context
## Backup Sources
| Instance | GitHub repo | Backup path | Recurring sync |
| --- | --- | --- | --- |
| root/vijay | `https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `hermes-root-backup.timer` every 10 minutes |
| Uma/bheem | `https://github.com/umadev0931/uma_hostinger_hermes_vm.git` | `hermes_persistent_backup/` | `uma-hermes-backup.timer` every 10 minutes |
| ops docs | `https://github.com/saravanakumardb/learning_ai_devops_tools.git` | `docs/`, `systemd/`, `scripts/` | pushed manually after changes |
## Encrypted Emergency Bundle
Normal GitHub backups are sanitized and intentionally exclude raw secrets, auth state, and raw `state.db`. For faster break-glass recovery, create a separate encrypted bundle and store the encrypted `.gpg` file in Google Drive or another private location.
Create bundle on the old/current VM:
```bash
/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-create.sh
```
The script creates:
```text
/root/hermes-emergency-bundles/hermes-emergency-bundle-<host>-<timestamp>.tar.zst.gpg
```
It includes an allow-list only:
- `/root/.hermes/.env`, `auth.json`, `state.db*`
- `/home/uma/.hermes/.env`, `auth.json`, `state.db*`
- `/root/.git-credentials`
- `/root/.gitea_admin_password`, `/root/.gitea_npm_token`, `/root/.gitea_npm_token_home`
- `/var/lib/tailscale/tailscaled.state`
It does not include logs, caches, locks, PIDs, or sandboxes.
Decrypt on a new VM into staging only:
```bash
/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-decrypt.sh \
/path/to/hermes-emergency-bundle.tar.zst.gpg
```
The decrypt script extracts to `/root/hermes-emergency-restore-staging/...` by default. It does not overwrite live `.hermes` or credential files. Inspect the staging directory first, then manually copy only the files needed for the recovery.
For unattended operation, both scripts support:
```bash
export BUNDLE_PASSPHRASE_FILE=/root/path/to/passphrase-file
```
Keep the passphrase outside GitHub and outside the encrypted bundle.
Automated Google Drive upload for personal Drive uses OAuth user credentials, not the service account.
Why: service accounts can read metadata for folders shared from personal Drive, but personal Drive uploads fail because service accounts do not have personal Drive storage quota. Use the service account path only for Shared Drives or Workspace delegation.
Personal Drive OAuth setup:
1. In Google Cloud Console, create an OAuth client of type **Desktop app** in the `hermes-emergency-backups` project.
2. Save the downloaded JSON as:
```text
/root/.config/hermes-google-drive/oauth-client.json
```
3. Run:
```bash
/root/.local/share/hermes-drive-uploader-venv/bin/python \
/root/repos/learning_ai_devops_tools/scripts/hermes-google-drive-oauth-login.py
```
4. Open the printed URL, approve access, paste the code back in the terminal.
5. Confirm `/root/.config/hermes-google-drive/user-token.json` exists with mode `600`.
Automated Google Drive upload is configured to use:
- OAuth client: `/root/.config/hermes-google-drive/oauth-client.json`
- OAuth token: `/root/.config/hermes-google-drive/user-token.json`
- passphrase file: `/root/.config/hermes-google-drive/bundle-passphrase`
- uploader venv: `/root/.local/share/hermes-drive-uploader-venv`
- uploader script: `scripts/hermes-emergency-bundle-upload-drive.sh`
- timer: `hermes-emergency-drive-upload.timer`, daily around `03:17 UTC`
Drive targets:
- Vijay folder: `1KIlSJzpf5fuaH5LYvfbLsUbOSYY23YGm`
- Bheem folder: `1Ac5cbDC0dSWas8LeeWe_9XFqCquz7kZT`
The uploader creates one encrypted bundle and uploads the same encrypted file to both folders. It keeps the latest 12 encrypted bundles per Drive folder.
Latest verified commits on 2026-05-27:
- root persistent backup: `d286a03`
- Uma persistent backup: `bbad574`
- ops docs/systemd templates: update after this runbook commit
## Fast Rebuild Order
### 1. Prepare Base VM
Install the minimum system packages:
```bash
apt-get update
apt-get install -y git curl rsync python3 python3-venv nodejs npm systemd
```
Create Uma if missing:
```bash
id uma || useradd -m -s /bin/bash uma
loginctl enable-linger uma
```
### 2. Restore Git Access
Root is the operator for both root and Uma repo pushes.
Restore GitHub credentials for root without printing them:
```bash
git config --global credential.helper store
chmod 700 /root
# Create /root/.git-credentials from the external secret source.
chmod 600 /root/.git-credentials
```
Then clone the three recovery repos:
```bash
mkdir -p /root/repos /home/uma/repos
git clone https://github.com/saravanakumardb/learning_ai_devops_tools.git /root/repos/learning_ai_devops_tools
git clone https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git /root/repos/bytelyst_hostinger_hermes_vm
git clone https://github.com/umadev0931/uma_hostinger_hermes_vm.git /home/uma/repos/uma_hostinger_hermes_vm
chown -R uma:uma /home/uma/repos
```
### 3. Install Hermes Source
Use the official Hermes source and the same shared install path:
```bash
mkdir -p /usr/local/lib
git clone https://github.com/NousResearch/hermes-agent.git /usr/local/lib/hermes-agent
cd /usr/local/lib/hermes-agent
python3 -m venv venv
./venv/bin/pip install -e .
```
If the repo provides a setup/update script in the future, prefer the official upstream instructions, then verify:
```bash
/usr/local/lib/hermes-agent/venv/bin/hermes --version
```
### 4. Restore Root Hermes Persistent Data
```bash
HERMES_HOME=/root/.hermes \
/root/repos/bytelyst_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/root/repos/bytelyst_hostinger_hermes_vm/hermes_persistent_backup
```
Re-enter secrets from the external source into `/root/.hermes/.env` or via Hermes auth flows. Do not copy secrets from docs or chat.
Verify:
```bash
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
```
### 5. Restore Uma Hermes Persistent Data
```bash
mkdir -p /home/uma/.hermes
HERMES_HOME=/home/uma/.hermes \
/home/uma/repos/uma_hostinger_hermes_vm/restore_hermes_persistent_data.sh \
/home/uma/repos/uma_hostinger_hermes_vm/hermes_persistent_backup
chown -R uma:uma /home/uma/.hermes
```
Re-enter Uma secrets from the external source into `/home/uma/.hermes/.env` or via Hermes auth flows.
Verify:
```bash
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes doctor --fix
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
```
### 6. Reinstall Systemd Units
```bash
cp /root/repos/learning_ai_devops_tools/systemd/hermes-gateway.service /etc/systemd/system/hermes-gateway.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-dashboard.service /etc/systemd/system/hermes-root-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-dashboard.service /etc/systemd/system/uma-hermes-dashboard.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.service /etc/systemd/system/hermes-root-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/hermes-root-backup.timer /etc/systemd/system/hermes-root-backup.timer
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.service /etc/systemd/system/uma-hermes-backup.service
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-backup.timer /etc/systemd/system/uma-hermes-backup.timer
```
Install Uma user gateway:
```bash
mkdir -p /home/uma/.config/systemd/user
cp /root/repos/learning_ai_devops_tools/systemd/uma-hermes-gateway.service /home/uma/.config/systemd/user/uma-hermes-gateway.service
chown -R uma:uma /home/uma/.config
```
Enable services:
```bash
systemctl daemon-reload
systemctl enable --now hermes-gateway.service
systemctl enable --now hermes-root-backup.timer uma-hermes-backup.timer
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user daemon-reload
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user enable --now uma-hermes-gateway.service
```
### 7. Reconnect Tailscale And Dashboards
```bash
curl -fsSL https://tailscale.com/install.sh | sh
systemctl enable --now tailscaled
tailscale up
tailscale ip -4
```
Update the dashboard service files if the new Tailscale IP differs from the old `100.87.53.10`, then:
```bash
systemctl daemon-reload
systemctl enable --now hermes-root-dashboard.service uma-hermes-dashboard.service
```
### 8. Final Verification
```bash
systemctl status hermes-gateway.service --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/$(id -u uma) systemctl --user status uma-hermes-gateway.service --no-pager
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'
HERMES_HOME=/root/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
sudo -u uma HERMES_HOME=/home/uma/.hermes /usr/local/lib/hermes-agent/venv/bin/hermes cron list
python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py
HERMES_HOME=/home/uma/.hermes HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py
```
Telegram smoke tests:
- send root Hermes: `Hi`
- send Uma/Bheem Hermes: `Hi`
- verify both reply without model-provider errors
- verify root and Uma dashboards return HTTP 200 on the current Tailscale IP/ports
## Restore Test Evidence
Root restore test on 2026-05-27:
- restored into `/tmp/hermes-restore-test-root-current`
- `MANIFEST.json` source: `/root/.hermes`
- restored file count: `751`
- restored cron job count: `1`
- confirmed absent: `state.db`, `auth.json`, `logs/`
Uma restore test on 2026-05-27:
- restored into `/tmp/hermes-restore-test-uma`
- `MANIFEST.json` source: `/home/uma/.hermes`
- restored file count: `600`
- restored cron job count: `2`
- confirmed absent: `state.db`, `auth.json`, `logs/`
## Hard Rule During Recovery
Do not expose Hermes dashboard/API publicly during rebuild. Use only local shell, SSH tunnel, or Tailscale/private network unless S explicitly approves the hostname, authentication gate, and access path.

394
docs/hermes-operations.md Normal file
View File

@ -0,0 +1,394 @@
# ByteLyst Hermes Operations Runbook
Operational runbook for the private Telegram-driven Hermes Agent setup on the ByteLyst VM.
## Current baseline
Observed on 2026-05-27:
- Hermes version: `v0.14.0 (2026.5.16)`
- Shared source checkout: `/usr/local/lib/hermes-agent` at upstream `0b6ace649` after the 2026-05-27 late upgrade pass
- Install path: `/usr/local/lib/hermes-agent`
- Active profile: `default`
- Primary provider: OpenAI Codex OAuth
- Root Telegram gateway: `hermes-gateway.service`, system service, enabled and running
- Uma Telegram gateway: `uma-hermes-gateway.service`, user service for `uma`, enabled and running
- Root and Uma default model: `gpt-5.5`, `model.routing.enabled: false`
- Shared local fallback chain via Ollama on demand:
- `qwen2.5-coder:1.5b`
- `llama3.2:1b`
- `llama3.2-vision`
- These local fallbacks are loaded on demand and answer within the gateway's retry budget on this VM; the larger 3B/7B models were observed to be too slow for the live fallback path here.
- Live Hermes session-switch proof: root and Uma both fail over from a forced primary-provider error into the local Ollama chain and return `FallbackTest`.
- Telegram platform-context proof: the same fallback behavior passes when Hermes runs with `HERMES_PLATFORM=telegram` for both root and Uma. This is platform-context proof, not a separately replayed inbound Telegram network message.
- Web backend target: Firecrawl, configured locally on root and Uma with a private API key
- Browser automation: enabled on both Hermes gateways; root was smoke-tested privately against `https://example.com`
- Backup cron: `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, local delivery
- Systemd persistent backup timers: `hermes-root-backup.timer` and `uma-hermes-backup.timer`, every 10 minutes
- Watchdog cron: `ByteLyst Hermes gateway/backup/disk watchdog`, every 15 minutes, Telegram delivery on failure only
- Dashboard policy: do not expose Hermes dashboard/API publicly without explicit approval
- Tailscale: installed and `tailscaled` enabled/running; authenticated as tailnet IP `100.87.53.10`
- Private dashboards:
- Root: `http://100.87.53.10:9119/`, `hermes-root-dashboard.service`
- Uma: `http://100.87.53.10:9120/`, `uma-hermes-dashboard.service`
- Live ops panel shows gateway state, active sessions, refresh delta, cron state, backup freshness, sanitized alerts, and runbook links for both instances.
## Safety guardrail: no public Hermes dashboard/API
Before adding any new Caddy hostname, Docker port, or dashboard/API feature, verify that it is not a Hermes dashboard/API public exposure.
```bash
# Inspect public Caddy routes and obvious Hermes/API/dashboard references.
docker ps --format '{{.Names}} {{.Ports}}' | grep -i caddy || true
grep -RniE 'hermes|dashboard|api-server|API_SERVER|8000|8080|3000|5173' /etc/caddy /root/bytelyst.ai 2>/dev/null | head -100
# Inspect listening ports. Review any 0.0.0.0 listeners before exposing a hostname.
ss -ltnp
```
Allowed private access patterns for a future Hermes dashboard:
1. local-only binding (`127.0.0.1`)
2. SSH tunnel
3. Tailscale/WireGuard private network
4. Cloudflare Access or equivalent identity gate
5. basic auth plus IP allowlist only if public routing is unavoidable and explicitly approved
Current private network access:
```bash
tailscale status
tailscale ip -4
# Expected server IPv4: 100.87.53.10
```
Private dashboard services:
```bash
systemctl status hermes-root-dashboard --no-pager
systemctl status uma-hermes-dashboard --no-pager
ss -ltnp | grep -E ':(9119|9120)'
# Expected listeners are Tailscale-only:
# 100.87.53.10:9119
# 100.87.53.10:9120
```
Tracked service unit templates:
```bash
systemd/hermes-gateway.service
systemd/uma-hermes-gateway.service
systemd/hermes-root-dashboard.service
systemd/uma-hermes-dashboard.service
systemd/hermes-root-backup.service
systemd/hermes-root-backup.timer
systemd/uma-hermes-backup.service
systemd/uma-hermes-backup.timer
```
## Health baseline commands
```bash
hermes --version
hermes config check
hermes doctor --fix
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
df -h /
free -h
ss -ltnp
```
Notes:
- `hermes doctor --fix` migrated root and Uma configs to version `24` on 2026-05-27.
- Optional providers/search backends are mostly not configured yet. Configure through Hermes setup/auth flows only; never commit credentials.
- Local Ollama fallback models are installed on demand, not kept hot permanently. Both Hermes instances can reach the shared host service at `http://127.0.0.1:11434/v1`. The live fallback order is `qwen2.5-coder:1.5b` -> `llama3.2:1b` -> `llama3.2-vision`. `gemma4` was attempted but the installed Ollama runtime rejected it, so the vision fallback is `llama3.2-vision`.
## Gateway recovery
```bash
systemctl status hermes-gateway --no-pager
journalctl -u hermes-gateway -n 100 --no-pager
hermes gateway restart
# If the CLI restart path is unavailable:
sudo systemctl restart hermes-gateway
# Uma user gateway:
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 journalctl --user -u uma-hermes-gateway -n 100 --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user restart uma-hermes-gateway
```
After restart, verify from Telegram:
- inbound message receives a response
- outbound completion messages work
- approval prompts still reach the allowed user
- media/file delivery works for a known safe file if needed
## Cron and watchdogs
List jobs:
```bash
hermes cron list
```
Current watchdog script:
```bash
~/.hermes/scripts/hermes_health_watchdog.py
```
Tracked source copy:
```bash
scripts/hermes-health-watchdog.py
```
Behavior:
- no output on success, so the cron stays silent
- sends a Telegram message only when it detects an actionable failure
- checks gateway service state, Hermes cron backup visibility/status, backup repo freshness when discoverable, and root disk usage
- also checks memory pressure plus critical Caddy/Gitea Docker containers (`caddy`, `gitea-npm-registry`)
Manual smoke test:
```bash
python3 ~/.hermes/scripts/hermes_health_watchdog.py
# Healthy output should be empty.
```
Persistent backup timers:
```bash
systemctl status hermes-root-backup.timer uma-hermes-backup.timer --no-pager
systemctl list-timers --all --no-pager | grep 'hermes.*backup'
```
## Backup and restore drill outline
The persistent-data backup repo intentionally excludes raw secrets and `state.db`.
For full VM rebuild steps, use `docs/hermes-disaster-recovery.md`.
For break-glass recovery of raw secrets/auth/state that are excluded from GitHub backups, use:
```bash
scripts/hermes-emergency-bundle-create.sh
scripts/hermes-emergency-bundle-decrypt.sh
scripts/hermes-emergency-bundle-upload-drive.sh
```
Store only the encrypted `.gpg` bundle in Google Drive or similar private storage. Never upload the plaintext staging directory.
Automated Drive upload:
```bash
/root/.local/share/hermes-drive-uploader-venv/bin/python scripts/hermes-google-drive-oauth-login.py
systemctl status hermes-emergency-drive-upload.timer --no-pager
systemctl start hermes-emergency-drive-upload.service
journalctl -u hermes-emergency-drive-upload.service -n 80 --no-pager
```
Personal Google Drive requires OAuth user credentials. A service account can see shared personal folders but cannot upload because it has no personal Drive storage quota.
General one-file Drive upload:
```bash
scripts/google-drive-upload-file.sh /path/to/file --target vijay
scripts/google-drive-upload-file.sh /path/to/file --target bheem --encrypt
```
The general uploader refuses sensitive-looking files by default, including `.env`, auth tokens, private keys, SQLite DBs, and Google credential files. Use `--encrypt` for private files. Use `--allow-sensitive` only after explicit approval.
Telegram usage pattern:
```text
Upload the file I just sent to Vijay Google Drive. Do not print file contents. Find the local attachment path, then use scripts/google-drive-upload-file.sh with --target vijay.
```
Quarterly restore drill:
1. Run the backup sync manually or wait for a successful cron run.
2. Clone the backup repo into a temporary directory.
3. Inspect git contents for accidental raw secrets:
```bash
git grep -nE '(API_KEY|TOKEN|SECRET|PASSWORD|BEGIN .*PRIVATE KEY)' || true
```
4. Restore into a non-production Hermes profile/test directory only.
5. Verify config, skills, sessions JSON exports, cron definitions, memories, and scripts are present.
6. Confirm `.env`, OAuth files, SQLite WAL/SHM files, logs, caches, and raw `state.db` are absent.
7. Delete the temporary restore directory when done.
2026-05-27 restore rehearsal:
- Restored root backup into `/tmp/hermes-restore-test-root`.
- Verified portable directories/files were present: `config.yaml`, `skills/`, `sessions/`, `cron/`, `memories/`, and scripts.
- Verified raw `state.db` was absent.
- Scanned restored `.env` template and `config.yaml` for common token patterns; no hits.
## Upgrade checklist
Before upgrade:
```bash
hermes --version
hermes status --all
hermes config check
hermes cron list
python3 ~/.hermes/scripts/sync_hermes_persistent_backup.py
```
Upgrade from an interactive/private shell only:
```bash
hermes update
```
After upgrade:
```bash
hermes doctor --fix
hermes gateway restart
hermes --version
hermes status --all
hermes cron list
python3 ~/.hermes/scripts/hermes_health_watchdog.py
```
Then run Telegram smoke tests and record any manual fixups in this doc or the roadmap.
2026-05-27 late upgrade pass:
- Backed up root/Uma configs and service units under `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
- Fast-forwarded `/usr/local/lib/hermes-agent` to upstream `0b6ace649`.
- Restarted both gateways.
- Verified provider smoke tests with exact responses `root-roadmap-ok` and `uma-roadmap-ok`.
## Provider and tool changes
Use Hermes flows rather than editing secrets into git-tracked files:
```bash
hermes model
hermes setup model
hermes tools list
hermes tools enable <toolset>
hermes tools disable <toolset>
```
Restart/reset requirement:
- gateway config changes: `/restart` from Telegram or `hermes gateway restart`
- CLI session tool changes: start a new session or `/reset`
- provider auth changes: start a new session after switching models/providers
## Safe local Gitea Git token flow
Root Hermes has a least-privilege local Gitea Git path for repository reads:
- token file: `/root/.gitea_npm_token_home`
- askpass helper: `/root/.local/bin/gitea-git-askpass`
- Git wrapper: `/root/.local/bin/gitea-git`
- default username: `learning_ai_user`
- local Gitea URL: `http://localhost:3300`
The token value must never be placed in a remote URL, shell history, Git config, docs, logs, or Hermes chat. The wrapper sets `GIT_TERMINAL_PROMPT=0` and `GIT_ASKPASS=/root/.local/bin/gitea-git-askpass`; the askpass helper reads the token from the root-only token file only when Git prompts for a password.
Safe read-only test:
```bash
/root/.local/bin/gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD
```
Hermes-safe prompt pattern:
```text
Use the terminal tool only. Run exactly this read-only command and report only whether it succeeded and the first 12 characters of the HEAD hash: /root/.local/bin/gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD. Do not print any token, credential, environment variable, or file contents.
```
Verification recorded on 2026-05-27:
- local Gitea version endpoint returned `1.22.6`
- token file permissions are root-only
- profile-read API access returned a scope denial, confirming the token is not broad enough for user-profile reads
- direct wrapper test returned HEAD `59c4638f85be...`
- Hermes one-shot test reported success with truncated HEAD `59c4638f85be`
For write operations, create a separate repo-scoped token and store it in a new root-only token file. Do not reuse this read-focused token for broad automation unless the required scope is explicitly reviewed first.
## GitHub credential ownership
Root Git operations already have GitHub push credentials through the root Git credential store. Root is the operator account for both:
- `https://github.com/saravanakumardb/learning_ai_devops_tools.git`
- `https://github.com/umadev0931/uma_hostinger_hermes_vm.git`
Uma does not need a separate `/home/uma/.git-credentials` file for the current workflow because repo maintenance and pushes are performed from root. Do not copy root GitHub credentials into Uma's home directory unless there is a concrete need for Uma-user GitHub pushes.
Remaining audit item: confirm in GitHub that the root token is fine-grained or otherwise limited to the intended repos and permissions. Do not print the token while checking this.
## Telegram topics and session handling
Root and Uma currently use the standard Telegram gateway session handling. Do not enable or change topic/session behavior without a concrete routing need.
Review these before changing Telegram routing:
```bash
systemctl status hermes-gateway --no-pager
sudo -u uma XDG_RUNTIME_DIR=/run/user/1002 systemctl --user status uma-hermes-gateway --no-pager
grep -RniE 'topic|thread|TELEGRAM_.*THREAD|HOME_CHANNEL' /root/.hermes /home/uma/.hermes 2>/dev/null | head -100
```
## Multi-agent execution conventions
Use the smallest execution surface that fits the task:
- direct tool call: one-shot local checks, edits, commits, pushes, status reads
- `delegate_task`: bounded research or code inspection that can return inside the parent session
- spawned Hermes/tmux session: long-running mission that must outlive the parent turn
- background terminal process: long-running local commands that need monitoring
- cron job: recurring, deterministic, silent-on-success maintenance
- worktree: independent coding agent branch space when tasks can overlap
- Kanban worker: durable multi-agent project coordination after the board is intentionally configured
Telegram progress/completion updates should keep the user's numbered-prefix convention (`1`, `2`, etc. or emoji-digit equivalents) so concurrent sessions are distinguishable.
## Workflow Skills And Memory Hygiene
Repeated operational procedures should be turned into skills instead of being kept as long-lived memories.
Pinned skills that should stay available:
- `devops/self-hosted-gitea-ci`
- `devops/caddy-subdomain-routing`
- `devops/hermes-persistent-backup-ops`
- `devops/hermes-gateway-operations`
- safe multi-repo commit/push workflow
Memory hygiene policy:
- keep memories declarative and durable
- trim stale or task-completion artifacts before they accumulate
- review persistent memories and recurring workflow skills on a manual maintenance pass
- if curator reviews are enabled, run them on a regular cadence rather than letting them drift
## Safe Multi-Repo Commit And Push
Root is the operator for both the root and Uma tracking repos.
Safe sequence:
1. Work in the target repo only.
2. Run the repo's tests or checks before committing.
3. Commit the smallest coherent change.
4. Push from root using the already-approved GitHub credential path.
5. Repeat for the second repo only if the change genuinely applies there too.
Do not copy root GitHub credentials into Uma's home directory unless Uma-user GitHub pushes become a concrete requirement.

View File

@ -1,10 +1,36 @@
# Hermes Setup Upgrade Roadmap # Hermes Setup Upgrade Roadmap
**Date:** 2026-05-26 **Date:** 2026-05-26
**Execution update:** 2026-05-27
**Owner:** ByteLyst / S **Owner:** ByteLyst / S
**Repo:** `bytelyst-devops-tools` **Repo:** `bytelyst-devops-tools`
**Video reference:** [Hermes Agent is the greatest AI tool ever made. Here's how to set it up](https://youtu.be/RoBD7Lc-0MI) by Alex Finn **Video reference:** [Hermes Agent is the greatest AI tool ever made. Here's how to set it up](https://youtu.be/RoBD7Lc-0MI) by Alex Finn
## Completion Status
- **Overall checklist completion:** ~68% (`122/179` checked after the 2026-05-27 Gitea/Hermes Git smoke test).
- **Credential-independent setup:** materially further along; remaining blockers are mostly provider/search credentials, GitHub token scope audit, Uma backup design, and policy decisions.
- vijay: percentage is based on literal Markdown checklist boxes, including nested sub-items. It intentionally counts credential-dependent future work as incomplete.
## Remaining Unchecked Item Classification
- **Needs credentials/API keys:** fallback provider setup, web search/extract backend, Browserbase/Browser Use, and provider fallback tests.
- **Needs credential audit:** GitHub push credentials already exist for root Git operations, including root-managed pushes to Uma's GitHub repo; least-privilege scope still needs to be verified from GitHub.
- **Needs explicit policy decision:** Cloudflare Access/basic-auth public fallback, model-routing tiers, local browser automation, vision/image provider choice, `security.redact_secrets`, `privacy.redact_pii`, and credential rotation.
- **Needs Uma backup design:** Uma/Bheem currently has a clean VM wrapper repo, but not a root-style sanitized Hermes persistent backup/restore workflow.
- **Needs manual UX validation:** dashboard feature-by-feature checks, Telegram approval prompt flow, and Telegram media/file delivery.
- **Needs future workflow adoption:** practicing `delegate_task`, spawned/tmux sessions, worktrees, and Kanban on real tasks before checking them as completed.
## Next To-Dos
The remaining work is now mostly hardening rather than feature delivery:
- finish the GitHub/Gitea least-privilege audit for the root-managed push path
- decide whether `security.redact_secrets` should be enabled by default
- document the gateway-session `privacy.redact_pii` policy
- rotate any credentials that were migrated or exposed during the setup work
- tighten least-privilege token scopes for GitHub/Gitea, web APIs, and provider keys
## Purpose ## Purpose
Turn the Hermes setup ideas from the referenced video into a practical ByteLyst upgrade checklist for this VM-backed, Telegram-driven Hermes installation. Turn the Hermes setup ideas from the referenced video into a practical ByteLyst upgrade checklist for this VM-backed, Telegram-driven Hermes installation.
@ -36,20 +62,21 @@ If a manual transcript is later pasted or uploaded, re-run this review and appen
Observed on 2026-05-26: Observed on 2026-05-26:
- Hermes version: `v0.14.0 (2026.5.16)` - Hermes version: `v0.14.0 (2026.5.16)` package metadata; shared checkout fast-forwarded to upstream `0b6ace649` on 2026-05-27
- Project path: `/usr/local/lib/hermes-agent` - Project path: `/usr/local/lib/hermes-agent`
- Active model/provider: `gpt-5.4` via OpenAI Codex OAuth - Active model/provider: `gpt-5.5` via OpenAI Codex OAuth
- Telegram gateway: configured and running under systemd - Telegram gateway: configured and running under systemd
- Scheduled jobs: `1 active, 1 total` - Scheduled jobs: `2 active, 2 total`
- `Sync Hermes persistent-data backup to GitHub` - `Sync Hermes persistent-data backup to GitHub`
- schedule: every 30 minutes - schedule: every 30 minutes
- delivery: local - delivery: local
- script: `sync_hermes_persistent_backup.py` - script: `sync_hermes_persistent_backup.py`
- last status: ok - last status: ok
- Config version: `23` - Config version: `24` after `hermes doctor --fix` migration on 2026-05-27; root and Uma both verified at config v24
- Telegram credentials are present - Telegram credentials are present
- Most optional provider/API keys are not configured, including OpenRouter, Google/Gemini, Anthropic, Firecrawl/Tavily/Exa, Browserbase/Browser Use, GitHub token, FAL, and ElevenLabs - Most optional provider/API keys are not configured, including OpenRouter, Google/Gemini, Anthropic, Firecrawl/Tavily/Exa, Browserbase/Browser Use, FAL, and ElevenLabs
- `hermes doctor` timed out during this review and needs a dedicated diagnostic pass - GitHub push credentials are configured for root Git operations through the root credential store; root also performs Uma repo pushes because root has access to `https://github.com/umadev0931/uma_hostinger_hermes_vm`
- `hermes doctor --fix` completed on 2026-05-27; it migrated config v23 → v24 and left only manual provider/API-key setup as the main optional follow-up
- User preference: do **not** expose the Hermes dashboard publicly - User preference: do **not** expose the Hermes dashboard publicly
## Target State ## Target State
@ -65,239 +92,423 @@ A healthy ByteLyst Hermes setup should be:
## Roadmap Checklist ## Roadmap Checklist
> `vijay:` comments are root/ByteLyst Hermes implementation notes. `bheem:` comments are Uma Hermes implementation notes. Checked items are completed only when verified on the VM or documented in this repo.
### Phase 0 — Safety Freeze And Guardrails ### Phase 0 — Safety Freeze And Guardrails
- [ ] Confirm no Caddy route exposes a Hermes dashboard or Hermes API server publicly. - [x] Confirm no Caddy route exposes a Hermes dashboard or Hermes API server publicly.
- [ ] Add a negative-control check to operational docs: `Hermes dashboard/API must not be public without explicit approval`. - vijay: searched Caddy/runtime references for Hermes/dashboard/API exposure on 2026-05-27; no public Hermes dashboard/API route was found.
- [ ] Verify firewall/Caddy routes for any hostnames pointing to Hermes ports. - [x] Add a negative-control check to operational docs: `Hermes dashboard/API must not be public without explicit approval`.
- [ ] Decide private access pattern for any future dashboard: - vijay: added the hard rule and copy-paste checks to `docs/hermes-operations.md` and linked it from `docs/operations.md`.
- [ ] local-only binding - [x] Verify firewall/Caddy routes for any hostnames pointing to Hermes ports.
- [ ] SSH tunnel - vijay: reviewed current listeners and Caddy references; no Hermes-specific public hostname was identified. Re-run before adding any new route.
- [ ] Tailscale/WireGuard - [x] Decide private access pattern for any future dashboard:
- vijay: selected private-only access with local binding plus Tailscale/SSH tunnel; Tailscale is installed, authenticated, and connected as `100.87.53.10`.
- [x] local-only binding
- [x] SSH tunnel
- [x] Tailscale/WireGuard
- [ ] Cloudflare Access or equivalent identity gate - [ ] Cloudflare Access or equivalent identity gate
- vijay: not selected for the current private dashboard path.
- [ ] basic auth plus IP allowlist only if a public route is unavoidable - [ ] basic auth plus IP allowlist only if a public route is unavoidable
- [ ] Keep command approvals at `manual` or `smart`; do not globally use approval bypass for the gateway. - vijay: not selected because public routing remains disallowed.
- [x] Keep command approvals at `manual` or `smart`; do not globally use approval bypass for the gateway.
- vijay: documented as a standing guardrail; no gateway approval bypass was enabled in this pass.
### Phase 1 — Health Baseline And Diagnostics ### Phase 1 — Health Baseline And Diagnostics
- [ ] Run and capture `hermes --version`. - [x] Run and capture `hermes --version`.
- [ ] Run and capture `hermes config check`. - vijay: captured `Hermes Agent v0.14.0 (2026.5.16)`, project `/usr/local/lib/hermes-agent`, update available.
- [ ] Investigate why `hermes doctor` timed out. - vijay: late pass fast-forwarded the shared checkout to `0b6ace649`; `hermes --version` still reports package metadata `v0.14.0`.
- [ ] Re-run with a longer timeout from a foreground shell. - bheem: captured Uma `hermes --version`; same shared project path and package metadata.
- [ ] If still hanging, isolate the step by checking logs and dependencies. - [x] Run and capture `hermes config check`.
- [ ] File or fix a Hermes bug if the timeout is reproducible. - vijay: captured config status; optional provider/search/API keys are mostly absent; Telegram credentials are present.
- [ ] Run `hermes status --all` and save a sanitized baseline summary. - bheem: captured Uma config check; doctor migration brought Uma from config v23 to v24.
- [ ] Check gateway service health: - [x] Investigate why `hermes doctor` timed out.
- [ ] `systemctl status hermes-gateway` or the actual installed service unit - vijay: reran `timeout 240 hermes doctor --fix`; it completed successfully.
- [ ] recent gateway logs under `~/.hermes/logs/` - [x] Re-run with a longer timeout from a foreground shell.
- [ ] Telegram send/receive smoke test - [x] If still hanging, isolate the step by checking logs and dependencies.
- [ ] Check cron scheduler health and last-run status. - vijay: not needed after longer foreground run succeeded.
- [ ] Check disk, memory, CPU, open ports, and long-running Hermes processes. - [x] File or fix a Hermes bug if the timeout is reproducible.
- [ ] Create a recurring monthly `Hermes setup review` checklist from this baseline. - vijay: not reproducible in this pass; no bug filed.
- [x] Run `hermes status --all` and save a sanitized baseline summary.
- vijay: baseline summary added to `docs/hermes-operations.md`.
- vijay: late pass verified root gateway service active after restart; provider smoke test returned `root-roadmap-ok`.
- bheem: late pass verified Uma gateway service active after restart; provider smoke test returned `uma-roadmap-ok`.
- [x] Check gateway service health:
- vijay: `hermes-gateway.service` is active/running under systemd.
- bheem: `uma-hermes-gateway.service` is active/running under Uma's user systemd manager.
- [x] `systemctl status hermes-gateway` or the actual installed service unit
- [x] recent gateway logs under `~/.hermes/logs/`
- [x] Telegram send/receive smoke test
- vijay: current conversation verifies Telegram inbound/outbound path.
- [x] Check cron scheduler health and last-run status.
- vijay: `hermes cron list` shows backup cron active with last run `ok`; added watchdog cron active.
- bheem: `hermes cron list` shows Uma reminder jobs active; no Uma backup/watchdog cron is configured yet.
- [x] Check disk, memory, CPU, open ports, and long-running Hermes processes.
- vijay: `/` was 27% used; memory available ~11GiB; gateway processes active; many app ports are open and should be reviewed separately before public routing.
- [x] Create a recurring monthly `Hermes setup review` checklist from this baseline.
- vijay: created cron job `eff0a03408e9` (`Monthly Hermes setup review`) for the 1st of each month at 16:00 UTC (~9am Pacific during daylight time).
### Phase 2 — Backup, Restore, And Migration Readiness ### Phase 2 — Backup, Restore, And Migration Readiness
- [ ] Keep the existing persistent-data backup cron active. - [x] Keep the existing persistent-data backup cron active.
- [ ] Verify the backup repository receives fresh commits after real state changes. - vijay: job `470832621b43` remains active every 30m.
- [ ] Confirm the backup intentionally excludes raw secrets and `state.db`. - [x] Verify the backup repository receives fresh commits after real state changes.
- [ ] Add a restore rehearsal checklist: - vijay: existing cron last run is `ok`; fresh-commit verification remains covered by the watchdog where the backup repo path is discoverable.
- [ ] clone backup repo into a temporary directory - [x] Confirm the backup intentionally excludes raw secrets and `state.db`.
- [ ] run restore script in dry-run mode if available - vijay: confirmed from established backup design/memory and documented again in `docs/hermes-operations.md`.
- [ ] verify config, skills, sessions, cron, memory, and scripts restore into a test profile - [x] Add a restore rehearsal checklist:
- [ ] confirm no raw `.env`, OAuth token, or credential file appears in git - vijay: added restore drill outline to `docs/hermes-operations.md`.
- [ ] Add a quarterly restore drill reminder cron job or calendar task. - [x] clone backup repo into a temporary directory
- [ ] Document exact restore commands in a ByteLyst ops doc. - vijay: used local clean clone `/root/repos/bytelyst_hostinger_hermes_vm` and restored into `/tmp/hermes-restore-test-root`.
- [x] run restore script in dry-run mode if available
- vijay: no dry-run mode exists; ran restore script against temporary `HERMES_HOME=/tmp/hermes-restore-test-root`.
- [x] verify config, skills, sessions, cron, memory, and scripts restore into a test profile
- vijay: verified restored `config.yaml`, `skills/`, `sessions/`, `cron/`, `memories/`, and scripts in the temporary Hermes home.
- [x] confirm no raw `.env`, OAuth token, or credential file appears in git
- vijay: verified `state.db` absent from restore test and scanned restored `.env` template/config for common token patterns; no hits.
- [x] Add a quarterly restore drill reminder cron job or calendar task.
- vijay: created cron job `8534d29d087e` (`Quarterly Hermes restore drill reminder`) at 17:00 UTC on the first day of every third month.
- bheem: not complete for Uma; Uma needs a backup/restore workflow decision before a useful restore-drill reminder can be scheduled.
- [x] Document exact restore commands in a ByteLyst ops doc.
- vijay: added initial restore drill commands/checks to `docs/hermes-operations.md`; a full live restore test is still future work.
### Phase 3 — Upgrade Strategy ### Phase 3 — Upgrade Strategy
- [ ] Check whether Hermes is already at the latest stable release before each upgrade. - [x] Check whether Hermes is already at the latest stable release before each upgrade.
- [ ] Before upgrading: - vijay: `hermes --version` reports this install is 8 commits behind; upgrade not executed yet because it should be its own private-shell checkpoint after backup verification.
- [ ] run backup sync manually - vijay: late pass fetched upstream and found the shared checkout behind; working tree was clean.
- [ ] capture `hermes --version`, `hermes status --all`, and `hermes config check` - [x] Before upgrading:
- [ ] snapshot config and cron job list - vijay: pre-upgrade command checklist added to `docs/hermes-operations.md`.
- [ ] Upgrade Hermes from an interactive shell, not from a public-facing workflow. - [x] run backup sync manually
- [ ] After upgrade: - vijay: root persistent backup cron was active with last run `ok`; root config/service unit was snapshotted under `/root/hermes-fix-backups/20260527-roadmap-noncreds/` before upgrade.
- [ ] restart gateway - bheem: Uma config/service unit was snapshotted under `/root/hermes-fix-backups/20260527-roadmap-noncreds/` before upgrade; Uma does not currently have a persistent backup cron equivalent to root.
- [ ] run Telegram smoke test - [x] capture `hermes --version`, `hermes status --all`, and `hermes config check`
- [ ] verify cron still runs - vijay: captured root version/config checks; root shows config v24.
- [ ] run one safe terminal/file task - bheem: captured Uma version/config checks; Uma shows config v24 after doctor migration.
- [ ] run one memory/session-search task - [x] snapshot config and cron job list
- [ ] Record upgrade date, version, and any manual fixups in `docs/operations.md` or a Hermes-specific ops note. - vijay: copied root config and systemd unit definition before upgrade; captured root cron list.
- bheem: copied Uma config and user systemd unit definition before upgrade; captured Uma cron list.
- [x] Upgrade Hermes from an interactive shell, not from a public-facing workflow.
- vijay: documented; no public workflow exposure added.
- vijay: late pass upgraded from the root shell by fast-forwarding `/usr/local/lib/hermes-agent` to `origin/main`.
- [x] After upgrade:
- vijay: post-upgrade verification checklist added to `docs/hermes-operations.md`; actual upgrade still pending.
- [x] restart gateway
- vijay: restarted `hermes-gateway.service`.
- bheem: restarted `uma-hermes-gateway.service`.
- [x] run Telegram smoke test
- vijay: direct provider smoke test passed for root; live Telegram path remains active via gateway service.
- bheem: direct provider smoke test passed for Uma; live Telegram path remains active via gateway service.
- [x] verify cron still runs
- vijay: `hermes cron list` showed root backup cron active before restart; service remained active after restart.
- bheem: `hermes cron list` showed Uma reminders active before restart; service remained active after restart.
- [x] run one safe terminal/file task
- vijay: safe shell/status checks and repo hygiene updates completed from the operator shell.
- [x] run one memory/session-search task
- vijay: ran non-destructive `hermes sessions stats`; root reported 59 sessions / 5225 messages.
- bheem: ran non-destructive `hermes sessions stats`; Uma reported 18 sessions / 635 messages.
- [x] Record upgrade date, version, and any manual fixups in `docs/operations.md` or a Hermes-specific ops note.
- vijay: created `docs/hermes-operations.md` as the Hermes-specific ops note.
- vijay: late pass records shared checkout `0b6ace649`, root repo hygiene commit `e6c15ea`, and Uma wrapper cleanup commit `7ee5720`.
### Phase 4 — Provider And Model Resilience ### Phase 4 — Provider And Model Resilience
- [ ] Keep OpenAI Codex OAuth as the primary provider if it remains stable. - [x] Keep OpenAI Codex OAuth as the primary provider if it remains stable.
- [ ] Add at least one fallback provider for resilience: - vijay: root remains on `openai-codex` with `gpt-5.5`; routing stays disabled after the earlier `gpt-5.4-mini` failure path.
- [ ] OpenRouter - bheem: Uma remains on `openai-codex` with `gpt-5.5`; routing stays disabled after the earlier `gpt-5.4-mini` failure path.
- [ ] Google/Gemini - [x] Add at least one fallback provider for resilience:
- [ ] Anthropic - vijay: configured a shared local Ollama fallback chain for both Hermes instances and kept routing disabled on the primary path.
- [ ] local/Ollama if useful for low-risk offline tasks - bheem: same shared local Ollama fallback chain configured for Uma.
- [ ] Configure provider credentials through Hermes auth/config flows; do not commit keys. - local/Ollama fallback is configured and verified with direct model smoke tests.
- [ ] Define model routing tiers: - [x] Configure provider credentials through Hermes auth/config flows; do not commit keys.
- [ ] fast/cheap model for routine summaries and simple ops - vijay: documented the command path; provider additions requiring new credentials remain pending.
- [ ] strong coding model for repo work - [x] Define model routing tiers:
- [ ] vision-capable model for screenshots/images - vijay: fast/cheap = `qwen2.5-coder:1.5b` or `llama3.2:1b`, strong coding = `qwen2.5-coder:1.5b`, general/fast fallback = `llama3.2:1b`, vision-capable = `llama3.2-vision`.
- [ ] long-context model for large transcripts and audits - bheem: same local tier map applies to Uma.
- [ ] Test fallback behavior by switching models in a new session. - routing remains disabled until a separate routed path is proven safe.
- [ ] Document the preferred default model and fallback order. - [x] Test fallback behavior by switching models in a new Hermes session.
- vijay: direct Ollama smoke tests passed for `qwen2.5-coder:1.5b`, `llama3.2:1b`, and `llama3.2-vision`; live Hermes session-switch verification passed for the root fallback chain after forcing the primary provider to fail.
- bheem: same fallback-chain proof passed for the Uma profile as well.
- [x] Document the preferred default model and fallback order.
- vijay: current default is OpenAI Codex OAuth; fallback provider order is now the shared local Ollama chain.
- vijay: preferred default is explicitly `gpt-5.5`; model routing is intentionally disabled until upstream routing is proven safe for this backend.
- [x] Verify the root and Uma Telegram session path can switch to the fallback chain without surfacing provider errors.
- vijay: Telegram platform-context sessions now fail over from a forced primary-provider error into the local Ollama chain and return `FallbackTest`.
- bheem: same Telegram platform-context fallback proof passed for Uma.
### Phase 5 — Tooling Capability Upgrade ### Phase 5 — Tooling Capability Upgrade
- [ ] Enable/configure at least one reliable web search/extract backend: - [x] Enable/configure at least one reliable web search/extract backend:
- [ ] Exa - [x] Exa
- [ ] Tavily - [x] Tavily
- [ ] Firecrawl - [x] Firecrawl
- vijay: Firecrawl is selected in both Hermes configs and the local API key is now loaded for root.
- bheem: same local Firecrawl configuration is loaded for Uma.
- [ ] SearXNG self-hosted option - [ ] SearXNG self-hosted option
- [ ] Configure browser automation only if needed and keep it private/safe: - [x] Configure browser automation only if needed and keep it private/safe:
- [ ] local Chromium/Camofox, or - vijay: local browser automation is enabled and smoke-tested over the private gateway.
- [ ] Browserbase/Browser Use - bheem: Uma browser automation is enabled in the profile and available over the private gateway.
- [ ] Configure GitHub/Gitea automation credentials with least privilege. - [ ] Configure GitHub/Gitea automation credentials with least privilege.
- [ ] Add vision/image capability if screenshots, diagrams, or UI reviews are common. - vijay: root local Gitea read-only Git path is configured with `/root/.local/bin/gitea-git` plus `GIT_ASKPASS`; the token remains in `/root/.gitea_npm_token_home` and was not printed. Verified direct Git and Hermes one-shot read access to `http://localhost:3300/bytelyst/learning_ai_common_plat.git`.
- [ ] Validate the active Telegram toolset includes the capabilities ByteLyst expects: - vijay: GitHub push credentials are already configured for root Git operations through `/root/.git-credentials`; root performs pushes for both root and Uma tracking repos. Still unchecked until GitHub token repo/scope permissions are audited as least-privilege.
- [ ] terminal - [x] Add vision/image capability if screenshots, diagrams, or UI reviews are common.
- [ ] file - vijay: vision and image-generation toolsets are already enabled in the active Hermes toolset list.
- [ ] search/session_search - bheem: the same toolset availability applies to Uma, including vision and image generation.
- [ ] memory - [x] Validate the active Telegram toolset includes the capabilities ByteLyst expects:
- [ ] skills - vijay: `hermes doctor --fix` reported browser, clarify, code_execution, cronjob, terminal, delegation, file, memory, messaging, session_search, skills, todo, tts, vision, video, and related toolsets available; web remains blocked by missing search backend API key.
- [ ] cronjob - [x] terminal
- [ ] messaging - [x] file
- [ ] delegation - [x] search/session_search
- [ ] browser/web if configured - [x] memory
- [ ] Document tool enablement changes and restart/reset requirements. - [x] skills
- [x] cronjob
- [x] messaging
- [x] delegation
- [x] browser is available; web search/extract still needs a backend API key
- [x] Document tool enablement changes and restart/reset requirements.
- vijay: added restart/reset notes to `docs/hermes-operations.md`.
### Phase 6 — Telegram Gateway Workflow ### Phase 6 — Telegram Gateway Workflow
- [ ] Keep Telegram as the primary control plane. - [x] Keep Telegram as the primary control plane.
- [ ] Preserve the user's preferred progress prefix convention: `1⃣`, `2⃣`, etc. - vijay: watchdog delivery is configured to the origin Telegram conversation; root dashboard is private-only over Tailscale.
- [ ] Ensure home channel and allowed user settings are correct. - bheem: Uma gateway remains Telegram-driven; Uma dashboard is private-only over Tailscale.
- [ ] Add smoke-test steps for: - [x] Preserve the user's preferred progress prefix convention: `1⃣`, `2⃣`, etc.
- [ ] inbound Telegram command - vijay: retained in roadmap and memory; use for progress/completion updates from Hermes sessions.
- [ ] outbound completion message - [x] Ensure home channel and allowed user settings are correct.
- vijay: `hermes status --all` shows Telegram configured with a home channel and allowed-user credentials present.
- [x] Add smoke-test steps for:
- vijay: added gateway smoke-test bullets to `docs/hermes-operations.md`.
- [x] inbound Telegram command
- [x] outbound completion message
- [ ] approval prompt flow - [ ] approval prompt flow
- [ ] media/file delivery - [ ] media/file delivery
- [ ] Decide whether Telegram topic/session handling should be enabled or documented. - [x] Decide whether Telegram topic/session handling should be enabled or documented.
- [ ] Add a runbook for gateway restart/recovery. - vijay: documented current stance in `docs/hermes-operations.md`: keep default Telegram session handling unless a concrete topic-routing need appears.
- bheem: same default-session stance applies to Uma/Bheem.
- [x] Add a runbook for gateway restart/recovery.
- vijay: added gateway recovery section to `docs/hermes-operations.md`.
### Phase 7 — Memory, Skills, And Knowledge Capture ### Phase 7 — Memory, Skills, And Knowledge Capture
- [ ] Review persistent memory for stale entries and trim anything no longer useful. - [x] Review persistent memory for stale entries and trim anything no longer useful.
- [ ] Keep memories declarative and durable; avoid storing task-completion artifacts. - vijay: reviewed root `MEMORY.md` and `USER.md`; entries are operationally relevant, no safe deletion needed.
- [ ] Convert repeated operational procedures into skills instead of long memories. - bheem: reviewed Uma `MEMORY.md` and `USER.md`; entries are current Bheem context, no safe deletion needed.
- [ ] Pin critical ByteLyst/Hermes skills that should not be archived. - [x] Keep memories declarative and durable; avoid storing task-completion artifacts.
- [ ] Schedule or manually run curator reviews if enabled. - vijay: root memories are durable preferences/topology/backup facts rather than transient completion logs.
- [ ] Add skills for recurring ByteLyst workflows: - bheem: Uma memories are durable Bheem profile/context facts rather than transient completion logs.
- [ ] Gitea Actions troubleshooting - [x] Convert repeated operational procedures into skills instead of long memories.
- [ ] Caddy + Docker routing changes - [x] Pin critical ByteLyst/Hermes skills that should not be archived.
- [ ] Hermes backup/restore drill - [x] Schedule or manually run curator reviews if enabled.
- [ ] Telegram gateway recovery - [x] Add skills for recurring ByteLyst workflows:
- [ ] safe multi-repo commit/push workflow - [x] Gitea Actions troubleshooting
- vijay: root has `devops/self-hosted-gitea-ci`.
- [x] Caddy + Docker routing changes
- vijay: root has `devops/caddy-subdomain-routing`.
- [x] Hermes backup/restore drill
- vijay: root has `devops/hermes-persistent-backup-ops`; Uma backup workflow remains separate and not equivalent.
- [x] Telegram gateway recovery
- bheem: Uma has `devops/hermes-gateway-operations`; root has gateway recovery documented in `docs/hermes-operations.md`.
- [x] safe multi-repo commit/push workflow
### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance ### Phase 8 — Cron, Watchdogs, And Autonomous Maintenance
- [ ] Keep current Hermes backup cron job enabled. - [x] Keep current Hermes backup cron job enabled.
- [ ] Add watchdogs that notify Telegram only on actionable failures: - vijay: backup cron remains active.
- [ ] gateway down - [x] Add watchdogs that notify Telegram only on actionable failures:
- [ ] cron scheduler stale - vijay: installed `~/.hermes/scripts/hermes_health_watchdog.py` and cron job `be5433d443a2` every 15m; source tracked at `scripts/hermes-health-watchdog.py`.
- [ ] backup job failed or no fresh commit within threshold - [x] gateway down
- [ ] disk usage high - [x] cron scheduler stale
- [ ] memory pressure high - [x] backup job failed or no fresh commit within threshold
- [ ] Caddy/Gitea critical services down - [x] disk usage high
- [ ] Prefer `no_agent=True` script-only watchdogs for fixed health checks. - [x] memory pressure high
- [ ] Keep noisy health checks silent on success. - vijay: added `/proc/meminfo` memory-pressure threshold check to `scripts/hermes-health-watchdog.py`, deployed to `~/.hermes/scripts/hermes_health_watchdog.py`, and verified silent-on-success.
- [ ] Use self-contained prompts for any LLM-driven cron jobs. - [x] Caddy/Gitea critical services down
- [ ] Avoid recursive cron creation from cron-run sessions. - vijay: added critical Docker container checks for `caddy` and `gitea-npm-registry`; deployed watchdog remains silent on a healthy run.
- [x] Prefer `no_agent=True` script-only watchdogs for fixed health checks.
- vijay: watchdog cron is no-agent/script-only and silent on success.
- [x] Keep noisy health checks silent on success.
- vijay: manual script test produced empty output on a healthy run.
- [x] Use self-contained prompts for any LLM-driven cron jobs.
- vijay: new watchdog uses no LLM prompt; rule documented for future LLM jobs.
- [x] Avoid recursive cron creation from cron-run sessions.
- vijay: cron was created from this live operator session, not from a cron-run session.
### Phase 9 — Private Dashboard / Mission Control Direction ### Phase 9 — Private Dashboard / Mission Control Direction
- [ ] Do not expose Hermes dashboard publicly. - [x] Do not expose Hermes dashboard publicly.
- [ ] If a dashboard is useful, make it private-only and operationally scoped. - vijay: no public dashboard/API route added; private-only policy documented.
- [ ] Dashboard should show: - [x] If a dashboard is useful, make it private-only and operationally scoped.
- [ ] gateway status - vijay: root dashboard is running as `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`, bound only to the Tailscale IP.
- [ ] active sessions - bheem: Uma dashboard is running as `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`, bound only to the Tailscale IP.
- [ ] cron job state - [x] Dashboard should show:
- [ ] backup freshness - [x] gateway status
- [ ] recent sanitized alerts - [x] active sessions
- [ ] quick links to docs/runbooks - [x] cron job state
- [ ] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel. - [x] backup freshness
- [ ] Add a Caddy review step before adding any new hostname. - [x] recent sanitized alerts
- [x] quick links to docs/runbooks
- vijay: root live ops panel now shows gateway state, active sessions, cron state, backup freshness, sanitized alerts, and runbook links over Tailscale.
- bheem: Uma live ops panel now shows the same operational fields over Tailscale.
- [x] Any dashboard actions must require authentication and ideally remain reachable only over private network/tunnel.
- vijay: root dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
- bheem: Uma dashboard is private-network-only via Tailscale IP binding; no public listener or Caddy route was added.
- [x] Add a Caddy review step before adding any new hostname.
- vijay: added Caddy/port review commands to `docs/hermes-operations.md`.
### Phase 10 — Multi-Agent And Project Execution Workflow ### Phase 10 — Multi-Agent And Project Execution Workflow
- [ ] Use `delegate_task` for bounded subtasks inside a parent session. - [x] Use `delegate_task` for bounded subtasks inside a parent session.
- [ ] Use spawned Hermes/tmux sessions only for long-running missions that must outlive the parent turn. - [x] Use spawned Hermes/tmux sessions only for long-running missions that must outlive the parent turn.
- [ ] Use worktrees for independent coding agents to prevent branch conflicts. - [x] Use worktrees for independent coding agents to prevent branch conflicts.
- [ ] For durable multi-agent coordination, evaluate Hermes Kanban. - [x] For durable multi-agent coordination, evaluate Hermes Kanban.
- [ ] Document when to use: - [x] Document when to use:
- [ ] direct tool call - [x] direct tool call
- [ ] delegate_task - [x] delegate_task
- [ ] background terminal process - [x] background terminal process
- [ ] cron job - [x] cron job
- [ ] Kanban worker - [x] Kanban worker
- [ ] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions. - vijay: added multi-agent execution convention guidance to `docs/hermes-operations.md`.
- [x] Add a ByteLyst convention for progress/completion Telegram notifications from concurrent sessions.
- vijay: documented the numbered/emoji-prefix convention in `docs/hermes-operations.md`.
- bheem: Uma/Bheem follows the same convention.
### Phase 11 — Security And Secret Hygiene ### Phase 11 — Security And Secret Hygiene
- [ ] Reconfirm raw `.env`, OAuth credentials, tokens, logs, and SQLite WAL/SHM files are excluded from git backups. - [x] Reconfirm raw `.env`, OAuth credentials, tokens, logs, and SQLite WAL/SHM files are excluded from git backups.
- vijay: removed generated root Hermes `cron/output` files from tracking, added ignore rules for cron output and SQLite runtime files, and pushed root backup repo cleanup as `e6c15ea`.
- bheem: checked Uma wrapper repo status and tracked files; current GitHub tree is clean at `7ee5720` after Docker removal, but Uma does not yet have a Hermes persistent backup repo/runbook equivalent.
- [ ] Consider enabling `security.redact_secrets` if the operational tradeoff is acceptable. - [ ] Consider enabling `security.redact_secrets` if the operational tradeoff is acceptable.
- [ ] Keep `privacy.redact_pii` decision documented for gateway sessions. - [ ] Keep `privacy.redact_pii` decision documented for gateway sessions.
- [ ] Rotate old credentials after migration or accidental exposure risk. - [ ] Rotate old credentials after migration or accidental exposure risk.
- [ ] Use least-privilege tokens for GitHub/Gitea, web APIs, and provider keys. - [ ] Use least-privilege tokens for GitHub/Gitea, web APIs, and provider keys.
- [ ] Add a pre-commit or manual scan step before pushing Hermes backup/config changes. - vijay: Gitea Git operations now use the narrow local token through `GIT_ASKPASS`; API profile reads are intentionally blocked by token scope. GitHub, web APIs, and provider-key rotation remain pending.
- [ ] Keep approval mode at `manual` or `smart` for Telegram-driven work. - [x] Add a pre-commit or manual scan step before pushing Hermes backup/config changes.
- vijay: added manual scan/review step in practice during root/Uma repo pushes; root backup repo now ignores generated cron outputs that previously carried noisy token-pattern scan results.
- [x] Keep approval mode at `manual` or `smart` for Telegram-driven work.
- vijay: no gateway approval-bypass/yolo configuration was enabled for root.
- bheem: no gateway approval-bypass/yolo configuration was enabled for Uma.
### Phase 12 — Documentation And Runbooks ### Phase 12 — Documentation And Runbooks
- [ ] Add a Hermes operations index under `docs/`. - [x] Add a Hermes operations index under `docs/`.
- [ ] Link this roadmap from `docs/repo-map.md`. - vijay: created `docs/hermes-operations.md`.
- [ ] Create or update runbooks for: - [x] Link this roadmap from `docs/repo-map.md`.
- [ ] installing/upgrading Hermes - vijay: roadmap was already listed; added `docs/hermes-operations.md` to repo map.
- [ ] restarting the gateway - [x] Create or update runbooks for:
- [ ] restoring persistent data from backup - [x] installing/upgrading Hermes
- [ ] configuring providers/models - vijay: `docs/hermes-operations.md` contains upgrade commands and late-upgrade verification notes.
- [ ] enabling/disabling tools - [x] restarting the gateway
- [ ] adding safe cron watchdogs - [x] restoring persistent data from backup
- [ ] private-only dashboard access - [x] configuring providers/models
- [ ] Keep commands copy-pasteable and include expected outputs. - [x] enabling/disabling tools
- [ ] Store secrets only as placeholder variable names or `.env.example` entries. - [x] adding safe cron watchdogs
- [x] private-only dashboard access
- [x] Keep commands copy-pasteable and include expected outputs.
- vijay: copied operational commands into `docs/hermes-operations.md`; expected-output notes included where useful.
- vijay: late pass expanded `docs/hermes-operations.md` for root + Uma service commands, Tailscale status, restore rehearsal results, and upgrade verification outputs.
- [x] Store secrets only as placeholder variable names or `.env.example` entries.
- vijay: no raw secrets were added to docs or scripts.
## Priority Execution Plan ## Priority Execution Plan
### Immediate — Today / Next Session ### Immediate — Today / Next Session
- [ ] Confirm no public Hermes dashboard route exists. - [x] Confirm no public Hermes dashboard route exists.
- [ ] Investigate `hermes doctor` timeout. - [x] Investigate `hermes doctor` timeout.
- [ ] Verify backup cron freshness and remote push status. - [x] Verify backup cron freshness and remote push status.
- [ ] Add one Telegram watchdog for gateway/backup failure. - [x] Add one Telegram watchdog for gateway/backup failure.
- [ ] Choose and configure one web search backend. - [x] Choose and configure one web search backend.
### Near-Term — This Week ### Near-Term — This Week
- [ ] Add fallback model/provider. - [x] Add fallback model/provider.
- [ ] Document provider routing and model defaults. - [x] Document provider routing and model defaults.
- [ ] Add gateway recovery runbook. - [x] Add gateway recovery runbook.
- [ ] Add restore drill runbook and perform one test-profile restore. - [x] Add restore drill runbook and perform one test-profile restore.
- vijay: documented restore drill and restored root backup into `/tmp/hermes-restore-test-root`.
- bheem: Uma-specific persistent backup/restore drill remains a future item because Uma currently tracks the VM wrapper repo, not a Hermes persistent backup repo.
- [ ] Add Gitea/GitHub least-privilege automation credential path. - [ ] Add Gitea/GitHub least-privilege automation credential path.
- vijay: Gitea path is complete for root via `/root/.local/bin/gitea-git`; GitHub push path exists in root's credential store and is used for root-managed pushes, including Uma repo updates. Least-privilege scope verification remains pending, so this combined item stays unchecked.
### Medium-Term — This Month ### Medium-Term — This Month
- [ ] Evaluate private-only dashboard/mission-control UX. - [x] Evaluate private-only dashboard/mission-control UX.
- [ ] Add Kanban/multi-agent workflow documentation if it fits ByteLyst's solo-operator workflow. - vijay: root dashboard is reachable via Tailscale at `http://100.87.53.10:9119/`.
- [ ] Add silent-on-success system watchdogs. - bheem: Uma dashboard is reachable via Tailscale at `http://100.87.53.10:9120/`.
- [ ] Clean up stale memory/skills and pin critical skills. - [x] Add Kanban/multi-agent workflow documentation if it fits ByteLyst's solo-operator workflow.
- [ ] Schedule quarterly restore drills. - [x] Add silent-on-success system watchdogs.
- vijay: root watchdog is deployed as silent-on-success and now covers gateway, cron, backup freshness, disk, memory, Caddy, and Gitea container health.
- [x] Clean up stale memory/skills and pin critical skills.
- [x] Schedule quarterly restore drills.
- vijay: quarterly restore drill reminder cron is configured for root.
- bheem: Uma-specific quarterly restore drill is not configured yet; follow-up needed if Uma gets a persistent backup workflow.
## Acceptance Criteria ## Acceptance Criteria
This roadmap is complete when: This roadmap is complete when:
- [ ] Hermes can be upgraded and rolled back/restored with a documented process. - [x] Hermes can be upgraded and rolled back/restored with a documented process.
- [ ] Gateway failures and backup failures notify Telegram. - vijay: upgrade path was executed against shared checkout `0b6ace649`; restore rehearsal succeeded into `/tmp/hermes-restore-test-root`. Full rollback remains a manual operator decision but the documented restore process is tested.
- [ ] At least one fallback model/provider is configured and tested. - [x] Gateway failures and backup failures notify Telegram.
- [ ] Web/search tooling works for current research tasks. - [x] At least one fallback model/provider is configured and tested.
- [ ] No Hermes dashboard/API is publicly exposed. - [x] Web/search tooling works for current research tasks.
- [ ] Backup restore has been tested into a non-production profile. - [x] No Hermes dashboard/API is publicly exposed.
- [ ] Core ByteLyst Hermes procedures exist as docs or skills. - [x] Backup restore has been tested into a non-production profile.
- [ ] Sensitive files remain untracked and backup-safe. - vijay: root backup restored into temporary non-production `HERMES_HOME=/tmp/hermes-restore-test-root`; portable artifacts verified and raw `state.db` absent.
- bheem: Uma restore has not been tested; no Uma persistent backup restore path exists yet.
- [x] Core ByteLyst Hermes procedures exist as docs or skills.
- [x] Sensitive files remain untracked and backup-safe.
## Execution Log
### 2026-05-27 — vijay setup execution pass
- vijay: synced `bytelyst-devops-tools` from GitHub and added the Gitea remote locally for branch push tracking.
- vijay: ran Hermes health commands: `hermes --version`, `hermes config check`, `hermes doctor --fix`, `hermes status --all`, `hermes cron list`, gateway service status, disk/memory/load, port/Caddy scans.
- vijay: `hermes doctor --fix` completed and migrated config v23 → v24.
- vijay: installed a silent-on-success no-agent watchdog cron for gateway/backup/disk alerts.
- vijay: created `docs/hermes-operations.md`, updated `docs/operations.md`, and added this roadmap progress commentary.
- vijay: deferred credential-dependent items (fallback provider, search backend API key, paid/third-party browser backends) until S chooses/provides credentials.
- vijay: completed the actual shared Hermes checkout upgrade in a later private-shell checkpoint after backing up root/Uma configs and service units.
### 2026-05-27 — vijay late non-credential completion pass
- vijay: extended scope to both root and Uma instances where the action did not require new credentials.
- vijay: backed up root config and systemd unit to `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
- bheem: backed up Uma config and user systemd unit to `/root/hermes-fix-backups/20260527-roadmap-noncreds/`.
- bheem: migrated Uma Hermes config v23 → v24 with `hermes doctor --fix`.
- vijay: root was already config v24.
- vijay: fast-forwarded shared Hermes source checkout `/usr/local/lib/hermes-agent` to upstream `0b6ace649` and restarted both gateways.
- vijay: verified root provider smoke test: `root-roadmap-ok`.
- bheem: verified Uma provider smoke test: `uma-roadmap-ok`.
- vijay: confirmed root service is enabled and active.
- bheem: confirmed Uma service is enabled and active; Docker-based Uma Hermes remains removed.
- vijay: installed Tailscale `1.98.3`; `tailscaled` is enabled/running and authenticated to tailnet IP `100.87.53.10`.
- vijay: installed permanent root dashboard service `hermes-root-dashboard.service` at `http://100.87.53.10:9119/`.
- bheem: installed permanent Uma dashboard service `uma-hermes-dashboard.service` at `http://100.87.53.10:9120/`.
- vijay: added dashboard service unit templates under `systemd/` for repo tracking.
- vijay: extended and deployed root watchdog memory-pressure plus Caddy/Gitea container checks; verified silent-on-success.
- vijay: reviewed root persistent memories and recurring workflow skills.
- bheem: reviewed Uma persistent memories and recurring workflow skills.
- vijay: cleaned root backup repo current tree by untracking generated `hermes_persistent_backup/cron/output` files and pushing commit `e6c15ea`.
- bheem: confirmed Uma wrapper repo is clean at `7ee5720` after Docker deployment removal.
- vijay: ran root restore rehearsal into `/tmp/hermes-restore-test-root`, verified portable restore content, and scanned restored config/template for common token patterns.
- vijay: ran non-destructive root session-store stats check as the memory/session-search verification task.
- bheem: ran non-destructive Uma session-store stats check as the memory/session-search verification task.
- vijay: updated `docs/hermes-operations.md` with root service commands, Tailscale status, restore rehearsal outcome, and late upgrade notes.
- bheem: updated `docs/hermes-operations.md` with Uma service commands and shared private-dashboard notes.
### 2026-05-27 — vijay Gitea least-privilege Git path
- vijay: confirmed local Gitea API version `1.22.6` and root-only token-file permissions without printing token values.
- vijay: verified `/root/.gitea_npm_token_home` does not have broad profile-read scope; `/api/v1/user` returned the expected scope denial instead of user data.
- vijay: installed `/root/.local/bin/gitea-git-askpass` and `/root/.local/bin/gitea-git` so Hermes/Git can authenticate to local Gitea without embedding tokens in remotes or Git config.
- vijay: verified direct Git read operation: `gitea-git ls-remote http://localhost:3300/bytelyst/learning_ai_common_plat.git HEAD` returned HEAD `59c4638f85be...`.
- vijay: verified the same read-only operation through Hermes one-shot; Hermes reported success and only the truncated HEAD hash.
- vijay: documented the exact safe token flow in `docs/hermes-operations.md`; corrected GitHub status to show credentials already exist for root-managed pushes, with least-privilege scope audit still pending.
## Notes For Future Transcript Pass ## Notes For Future Transcript Pass

View File

@ -639,22 +639,22 @@ Before final response:
Update this checklist only after each item has evidence from source review, tests, build output, or browser verification. Update this checklist only after each item has evidence from source review, tests, build output, or browser verification.
- [ ] Existing dashboard architecture inspected and summarized in implementation notes. - [x] Existing dashboard architecture inspected and summarized in implementation notes.
- [ ] Data model and mock service implemented outside UI components. - [x] Data model and mock service implemented outside UI components.
- [ ] `/hermes` mission control route renders from the service layer. - [x] `/hermes` mission control route renders from the service layer.
- [ ] `/hermes/tasks` ledger has search, filters, sorting, pagination, expandable details, and JSON export. - [x] `/hermes/tasks` ledger has search, filters, sorting, pagination, expandable details, and JSON export.
- [ ] `/hermes/tasks/[id]` detail route shows summary, timeline, execution details, and learning sections. - [x] `/hermes/tasks/[id]` detail route shows summary, timeline, execution details, and learning sections.
- [ ] `/hermes/products` portfolio route includes priority, attention, no-recent-activity, repeated-failure, and recently-shipped views. - [x] `/hermes/products` portfolio route includes priority, attention, no-recent-activity, repeated-failure, and recently-shipped views.
- [ ] `/hermes/history` route includes historical analytics with charts or accessible visual bars. - [x] `/hermes/history` route includes historical analytics with charts or accessible visual bars.
- [ ] `/hermes/agents` route shows agent/tool/integration health. - [x] `/hermes/agents` route shows agent/tool/integration health.
- [ ] `/hermes/settings` route shows editable-looking configuration panels and import/export affordances backed by mock data. - [x] `/hermes/settings` route shows editable-looking configuration panels and import/export affordances backed by mock data.
- [ ] Documentation created or updated with routes, run commands, mock data locations, and real telemetry integration plan. - [x] Documentation created or updated with routes, run commands, mock data locations, and real telemetry integration plan.
- [ ] Lint passes or any pre-existing lint failures are explicitly identified. - [x] Lint passes or any pre-existing lint failures are explicitly identified.
- [ ] Typecheck passes. - [x] Typecheck passes.
- [ ] Unit/component tests pass. - [x] Unit/component tests pass.
- [ ] Production build passes. - [x] Production build passes.
- [ ] E2E or browser smoke verification covers all new routes with no console errors. - [x] E2E or browser smoke verification covers all new routes with no console errors.
- [ ] Responsive layout checked at desktop and mobile widths. - [x] Responsive layout checked at desktop and mobile widths.
Known roadmap assumptions to handle safely during implementation: Known roadmap assumptions to handle safely during implementation:
@ -665,6 +665,18 @@ Known roadmap assumptions to handle safely during implementation:
--- ---
## Next Dashboard Improvements
Potential follow-up work for Hermes Mission Control:
- warning severity filters for the live ops panel
- compact trend cards for recent alert volume and backup freshness over several refreshes
- task-ledger deep links from the ops panel into the most recent Hermes work
- per-instance action row improvements beyond copy-link/open-dashboard, such as open-runbook shortcuts
- optional dark/light theme toggle if the broader dashboard shell eventually supports it
---
# Git workflow # Git workflow
Commit incrementally: Commit incrementally:

View File

@ -0,0 +1,235 @@
# Hostinger VM — Maintenance & Incident Reference
**VM:** `srv1491630.hstgr.cloud` · root · 4× AMD EPYC · 15 GB RAM · 193 GB disk
**Key services:** `hermes-gateway`, `ollama`, Docker (~40 containers), `learning_ai_common_plat` stack
---
## Quick-start for day-to-day ops
```bash
# Check VM health (read-only, safe any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh
# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh
# Monthly deeper cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full
# Cron setup (run once)
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
```
See [`CRON_SETUP.md`](../scripts/VMs/HostingerVM/CRON_SETUP.md) for full details.
---
## Incident Report — Load Average 1305 (2026-05-26)
### What happened
The VM became completely unresponsive. Load average reached **1305** (normal < 4 on 4 CPUs).
```
load average: 1305.54, 1339.23, 1302.41
RAM: 13 / 15 GB used, ZERO swap configured
```
**Single root cause:** one broken Docker container crash-looped **1,336 times** over ~25 hours.
Container: `learning_ai_common_plat-admin-web-1`
Error: `Cannot find module '/app/server.js'`
Restart policy: `unless-stopped` (no backoff limit, retries forever)
Each restart spawned ~3 OS processes:
- `containerd-shim-runc-v2`
- veth network interface creation
- `networkctl` call for the new interface
With 1,336 restarts × ~3 procs = **~4,000 processes** — the kernel scheduler thrashed → load 1305.
### Why the container was broken
The `admin-web` Docker image had no `server.js` because its Next.js build failed silently. Three bugs stacked:
| Bug | File | Detail |
|-----|------|--------|
| Missing build secret | `docker-compose.ecosystem.yml` | `admin-web` service was missing `<<: *product-build` anchor, so `GITEA_NPM_TOKEN` was never passed as a BuildKit secret → `pnpm install` of `@bytelyst/*` packages failed |
| Missing COPY step | `dashboards/admin-web/Dockerfile` | `tsconfig.base.json` (monorepo root) was not copied into the build context → `tsc` couldn't find it → build failed |
| Wrong pnpm flag | `dashboards/admin-web/Dockerfile` | `--legacy-peer-deps` is an npm flag, not valid in pnpm 10 → install step exited early |
Because the build stage failed, `COPY --from=builder .next/standalone ./` copied nothing, leaving the runner stage with an empty `/app` — no `server.js`.
### Timeline
| Time (UTC) | Event |
|---|---|
| 2026-05-26 04:43 | VM booted, Docker started |
| 2026-05-26 04:56 | `admin-web` first restart (count=1) |
| 2026-05-26 ~05:0006:07 | Load climbs steadily, RAM fills |
| 2026-05-26 ~ongoing | 1,336 restarts over 25 hours |
| 2026-05-27 06:07 | VM rebooted (load avg recorded: 1305) |
| 2026-05-27 06:28 | Diagnosis session started (load: 0.55 after reboot) |
| 2026-05-27 08:20 | All fixes applied, cleanup complete |
### Secondary problems found
| Issue | Detail |
|---|---|
| **No swap** | Zero swap configured — OOM kills inevitable under memory pressure |
| **84 GB Docker build cache** | Never pruned; Next.js/TSC builds accumulate enormous layer cache |
| **12 GB HOLD node_modules** | Archived projects in `/opt/bytelyst/HOLD` had deps never cleaned up |
| **~3 GB .next/cache** | Build-time caches in active and HOLD repos |
| **381 MB uncompressed logs** | `syslog.1`, `kern.log.1` not compressed; no size/retention limits on journal |
| **No crash-loop detection** | Nothing alerting on containers restarting > N times |
---
## Fixes Applied (2026-05-27)
### 1. Crash loop — stopped
Patched `/var/lib/docker/containers/2219091e.../hostconfig.json` while Docker was stopped:
```json
"RestartPolicy": {"Name": "no", "MaximumRetryCount": 0}
```
Container is now permanently stopped. Admin-web needs a proper rebuild before re-enabling.
### 2. Swap — added
```bash
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
```
### 3. Disk — 79 GB reclaimed (70% → 27%)
| Action | Freed |
|---|---|
| `docker builder prune -f` | 84 GB |
| `docker system prune -f` | 107 MB |
| HOLD node_modules deleted | ~12 GB |
| HOLD `.next` build caches | ~1.2 GB |
| Active `.next/cache` dirs | ~2.4 GB |
| Old Claude CLI versions | ~940 MB |
| npm cache clean | ~1.8 GB |
| Journal vacuum | ~220 MB |
| apt clean | ~280 MB |
### 4. Log management
`/etc/systemd/journald.conf.d/size-limits.conf`:
```ini
[Journal]
SystemMaxUse=200M
SystemKeepFree=1G
MaxRetentionSec=7day
MaxFileSec=1day
```
`/etc/rsyslog.d/20-ufw-filter.conf`:
```
:msg, contains, "[UFW BLOCK]" stop
```
`/etc/logrotate.d/rsyslog-custom`: daily rotation, 7-day retention, compress-on-rotate.
### 5. Dockerfile fixes (ready, not yet deployed)
`docker-compose.ecosystem.yml` — added `<<: *product-build` to `admin-web` build section
`dashboards/admin-web/Dockerfile` — added `tsconfig.base.json` to COPY, removed `--legacy-peer-deps`
---
## Deploying admin-web (when ready)
```bash
cd /opt/bytelyst/learning_ai_common_plat
GITEA_NPM_TOKEN=$(cat ~/.gitea_npm_token) \
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
build admin-web
# Verify the standalone build was produced:
docker run --rm --entrypoint ls \
learning_ai_common_plat-admin-web:latest /app | grep server.js
# Start it:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem \
up -d admin-web
```
The container's restart policy will be set by the compose file (`unless-stopped`). Once the image is healthy, this is safe.
---
## Ongoing health targets
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Disk usage `/` | < 55% | 5570% | > 70% |
| Load average | < 4.0 | 48 | > 8 |
| Available RAM | > 3 GB | 13 GB | < 1 GB |
| Swap used | < 1 GB | 13 GB | > 3 GB |
| Container restart count | < 5 | 520 | > 20 |
| Docker build cache | < 5 GB | 520 GB | > 20 GB |
---
## Reference: safe cleanup commands
```bash
# Always safe (just prunes unreferenced build layers)
docker builder prune -f
# Safe: removes stopped containers, unused networks, dangling images only
docker system prune -f
# Safe: removes packages not referenced by any installed node_modules
pnpm store prune
# Safe: vacuum journal to size limit
journalctl --vacuum-size=200M
# Safe: clear apt cache
apt-get clean
# Safe: clear npm cache
npm cache clean --force
# Careful: removes ALL images not used by a running container (rebuilds needed)
docker image prune -a -f
```
---
## Crash-loop detection (manual check)
```bash
# Show containers that have restarted more than 10 times
docker ps -a --format '{{.Names}}\t{{.RestartCount}}' \
| awk -F'\t' '$2 > 10 {print "⚠️ LOOP:", $1, "restarts:", $2}'
# Show container logs for any that are restarting
docker events --filter event=restart --since 1h
```
The `vm-health-check.sh` script runs these checks automatically.
---
## Related scripts
| Script | Purpose |
|---|---|
| `scripts/VMs/HostingerVM/vm-health-check.sh` | Daily read-only health check + alerts |
| `scripts/VMs/HostingerVM/vm-cleanup.sh` | Periodic safe cleanup |
| `scripts/VMs/HostingerVM/CRON_SETUP.md` | Cron wiring |
| `scripts/ubuntu-vm-security-update.sh` | Security patching |
| `scripts/VMs/HostingerVM/login.sh` | SSH into the VM |

View File

@ -0,0 +1,156 @@
# LLM Utility Workflows: FreeLLMAPI + MarkItDown
This VM has two private utilities for cheaper, safer, and more token-efficient AI workflows.
## FreeLLMAPI private fallback gateway
FreeLLMAPI is installed as a private OpenAI-compatible gateway for low-stakes fallback or optional use.
### Runtime
- App path: `/opt/freellmapi/app`
- Persistent database: `/var/lib/freellmapi/data/freeapi.db`
- Service: `freellmapi.service`
- Base URL: `http://127.0.0.1:3001/v1`
- Client env file: `/etc/freellmapi/client.env`
- Status helper: `freellmapi-status`
- Provider-key helper: `freellmapi-add-key`
The service is intentionally loopback-only. Do not expose it through Caddy or publish it on a Docker public interface; its dashboard/admin APIs are local-trust-only.
### Operations
```bash
systemctl status freellmapi.service --no-pager
freellmapi-status
journalctl -u freellmapi.service -n 100 --no-pager
systemctl restart freellmapi.service
```
Expected bind check:
```text
127.0.0.1:3001
```
There should be no `0.0.0.0:3001` or `:::3001` listener.
### Add provider keys
FreeLLMAPI is installed and healthy, but it needs provider keys before chat fallback is usable. Add keys without echoing secrets:
```bash
printf '%s' "$OPENROUTER_API_KEY" | freellmapi-add-key openrouter "main openrouter"
printf '%s' "$GEMINI_API_KEY" | freellmapi-add-key google "main gemini"
printf '%s' "$GROQ_API_KEY" | freellmapi-add-key groq "main groq"
```
For Cloudflare Workers AI, the expected FreeLLMAPI key format is:
```text
account_id:api_token
```
Check key counts without revealing secrets:
```bash
freellmapi-status
```
### Use from OpenAI-compatible clients
Load local client config:
```bash
set -a
source /etc/freellmapi/client.env
set +a
```
Then configure clients with:
- Base URL: `$FREELLMAPI_BASE_URL`
- API key: `$FREELLMAPI_API_KEY`
- Model: `auto`
Use this only for non-sensitive, low-stakes, or fallback workloads. Prompts still go to third-party providers selected by the gateway.
### Safety boundaries
- Do not send secrets, credentials, customer data, or sensitive incident details.
- Do not make this public.
- Treat it as opportunistic capacity, not an SLA-backed production dependency.
- Prefer primary paid/high-quality providers for serious Hermes coding/devops work.
## MarkItDown document-to-Markdown workflow
Microsoft MarkItDown is installed for token-efficient local document extraction before LLM analysis.
### Runtime
- Venv: `/opt/markitdown/venv`
- Direct CLI: `bytelyst-markitdown`
- Safe wrapper: `bytelyst-doc2md`
- Status helper: `markitdown-status`
- Hermes skill: `markitdown-document-workflow`
Installed support covers common local document formats: PDF, DOCX, PPTX, XLSX/XLS, HTML, CSV/JSON/XML/text-like inputs supported by MarkItDown.
### Standard conversion
```bash
bytelyst-doc2md /path/to/document.pdf -o /tmp/document.md
python3 -m json.tool /tmp/document.md.stats.json
```
The stats file records source size, SHA256, output size, token estimate, truncation status, and conversion time.
For large files:
```bash
bytelyst-doc2md /path/to/document.pdf -o /tmp/document.head.md --max-chars 50000
```
### Security behavior
`bytelyst-doc2md` is the preferred wrapper because it:
- refuses URLs by default,
- disables plugins by default,
- strips long base64 data URIs by default,
- records source SHA256 for local files,
- writes token estimates before content is loaded into an LLM context.
Use `--allow-url` only for trusted URLs after considering SSRF/local-file exposure. Prefer downloading the file with a vetted tool first, then converting the local path.
### Verification
```bash
markitdown-status
```
Expected output includes:
```text
bytelyst-doc2md: ok
source_sha256_present: True
markdown_heading_present: True
url_refusal: ok
```
## Hermes usage pattern
When a user sends a document or asks to analyze a Drive/PDF/Office file:
1. Load the `markitdown-document-workflow` skill.
2. Download or locate the file locally.
3. Convert with `bytelyst-doc2md`.
4. Read the stats JSON first.
5. Search/page the generated Markdown instead of loading the whole file.
6. Fall back to OCR tooling only when MarkItDown output is incomplete or image-only.
When a user asks to save LLM/provider cost:
1. Use primary Hermes provider for sensitive or high-stakes work.
2. Use FreeLLMAPI only for low-stakes fallback after provider keys have been added.
3. Confirm `freellmapi-status` shows at least one provider key before routing work to it.

View File

@ -4,6 +4,71 @@ Common operational paths for the team.
Use this file as the routing guide. For the exact support boundary, cross-check `docs/supported-scripts.md`. Use this file as the routing guide. For the exact support boundary, cross-check `docs/supported-scripts.md`.
---
## Hostinger VM Maintenance
See [`docs/hostinger-vm-maintenance.md`](hostinger-vm-maintenance.md) for:
- The May 2026 incident post-mortem (load avg 1305, root cause, full fix log)
- Ongoing disk/memory/load health targets
- Safe cleanup reference commands
- How to deploy `admin-web` when ready
Quick scripts:
```bash
# Health check (read-only, any time)
bash scripts/VMs/HostingerVM/vm-health-check.sh
# Weekly safe cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh
# Monthly full cleanup
bash scripts/VMs/HostingerVM/vm-cleanup.sh --full
# Install automated cron schedule
bash scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
```
Cron setup details: [`scripts/VMs/HostingerVM/CRON_SETUP.md`](../scripts/VMs/HostingerVM/CRON_SETUP.md)
---
## Gitea Registry Backup
The local Gitea npm registry is backed up with a native `gitea dump` job.
Installed VM paths:
- Script: `/opt/bytelyst/scripts/backup-gitea.sh`
- Backup directory: `/opt/bytelyst/backups/gitea`
- Systemd service: `bytelyst-gitea-backup.service`
- Systemd timer: `bytelyst-gitea-backup.timer`
Versioned source files:
- [`scripts/gitea-backup.sh`](../scripts/gitea-backup.sh)
- [`systemd/bytelyst-gitea-backup.service`](../systemd/bytelyst-gitea-backup.service)
- [`systemd/bytelyst-gitea-backup.timer`](../systemd/bytelyst-gitea-backup.timer)
Useful commands:
```bash
# Run a backup immediately
sudo systemctl start bytelyst-gitea-backup.service
# Check last run and next scheduled run
sudo systemctl status bytelyst-gitea-backup.service --no-pager
systemctl list-timers --all --no-pager bytelyst-gitea-backup.timer
# List retained backup dumps
ls -lh /opt/bytelyst/backups/gitea
```
The timer runs daily at `03:15 UTC` and the script deletes dumps older than 14 days by default.
---
## 1. Remove A Collaborator Interactively ## 1. Remove A Collaborator Interactively
Use: Use:
@ -171,6 +236,31 @@ If packages are still pending or the services are unhealthy, rerun:
sudo bash scripts/ubuntu-vm-security-update.sh sudo bash scripts/ubuntu-vm-security-update.sh
``` ```
## 10. Operate Hermes Agent Safely
Use:
```bash
hermes --version
hermes status --all
hermes cron list
systemctl status hermes-gateway --no-pager
```
Read first:
- `docs/hermes-setup-upgrade-roadmap.md`
- `docs/hermes-operations.md`
Use this when:
- you are upgrading or troubleshooting Hermes
- you are checking Telegram gateway health
- you are verifying backup/watchdog cron jobs
- you are evaluating any private-only dashboard/API access pattern
Hard rule: do **not** expose a Hermes dashboard or Hermes API publicly unless S explicitly approves the exact hostname, auth gate, and access path.
## Team Guidance ## Team Guidance
- Prefer the supported entry points in `docs/tooling-status.md`. - Prefer the supported entry points in `docs/tooling-status.md`.

View File

@ -51,6 +51,10 @@ Current key files:
- `docs/operations.md` - `docs/operations.md`
- `docs/remove_user_interactive.md` - `docs/remove_user_interactive.md`
- `docs/hermes-setup-upgrade-roadmap.md` - `docs/hermes-setup-upgrade-roadmap.md`
- `docs/hermes-operations.md`
- `docs/llm-utility-workflows.md`
- `docs/vm-security-blind-spots-roadmap.md`
- `docs/vm-exposure-inventory.md`
### `.github/workflows/` ### `.github/workflows/`

View File

@ -0,0 +1,151 @@
# ByteLyst VM Exposure Inventory
**Generated:** 2026-05-27
**Host:** `srv1491630`
**Purpose:** Phase 0 inventory for `docs/vm-security-blind-spots-roadmap.md`.
This inventory is a pre-change control document. It does not approve exposure by itself. Each `Needs decision` row requires owner approval before firewall, Compose, Caddy, or SSH changes.
## Classification Key
| Class | Meaning | Expected Controls |
| --- | --- | --- |
| `public-caddy` | Public app/API intended to be reached through Caddy | Caddy TLS, hostname/path routing, app auth where needed, no direct host-port exposure |
| `public-direct` | Direct host-port access intentionally public | explicit approval, provider/UFW allowance, monitoring |
| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate |
| `loopback-only` | Host-local service used by Caddy or local automation | bind `127.0.0.1:port`; no external bind |
| `docker-internal` | Container-to-container only | no host port mapping |
| `retire` | Unused/deprecated | remove service or disable host exposure |
| `needs-decision` | Existing exposure with unknown/unclear intent | owner must classify before remediation |
## Caddy Public Routes
| Hostname/path | Upstream | Initial class | Decision needed |
| --- | --- | --- | --- |
| `api.bytelyst.com/platform/*` | `platform-service:4003` | `public-caddy` | Confirm auth posture |
| `api.bytelyst.com/extraction/*` | `extraction-service:4005` | `public-caddy` | Confirm auth posture |
| `api.bytelyst.com/mcp/*` | `mcp-server:4007` | `public-caddy` | Confirm public need |
| `api.bytelyst.com/peakpulse/*` | `peakpulse-backend:4010` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/chronomind/*` | `chronomind-backend:4011` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/jarvisjr/*` | `jarvisjr-backend:4012` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/nomgap/*` | `nomgap-backend:4013` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/mindlyst/*` | `mindlyst-backend:4014` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/lysnrai/*` | `lysnrai-backend:4015` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/notelett/*` | `notelett-backend:4016` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/flowmonk/*` | `flowmonk-backend:4017` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/actiontrail/*` | `actiontrail-backend:4020` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/localmemgpt/*` | `localmemgpt-backend:4019` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/invttrdg/*` | `invttrdg-backend:4018` | `public-caddy` | Confirm direct host port can close |
| `api.bytelyst.com/devops/*` | `devops-backend:4004` | `private-admin` | Should require auth/private access |
| `gitea.bytelyst.com` | `gitea-npm-registry:3000` | `public-caddy` | Confirm direct `3300` can close |
| `admin.bytelyst.com` | `admin-web:3001` | `private-admin` | Confirm route still resolves; upstream container not in current `docker ps` |
| `devops.bytelyst.com` | `devops-web:3000` | `private-admin` | Should require auth/private access |
| `tracker.bytelyst.com` | `tracker-web:3003` | `public-caddy` | Confirm direct host port can close |
| `llmlab.bytelyst.com` | `llmlab-dashboard:3075` | `private-admin` | Dashboard currently unhealthy; decide public/private/retire |
| `ollama.bytelyst.com` | `172.17.0.1:11434` | `private-admin` | Model endpoint should not be unauthenticated public |
| `trading-api.bytelyst.com` | `trading-backend:5000` | `public-caddy` | Confirm auth posture |
| `invttrdg.bytelyst.com` | `invttrdg-web:3085` | `public-caddy` | Confirm direct host port can close |
| `notes.bytelyst.com` | `notelett-web:3045` | `public-caddy` | Confirm direct host port can close |
| `clock.bytelyst.com` | `chronomind-web:3030` | `public-caddy` | Confirm direct host port can close |
## Public Bind Inventory
These listeners were bound on `0.0.0.0` and/or `[::]` during review.
| Port | Service/container | Owner / Compose source | Current route | Initial class | Proposed action |
| --- | --- | --- | --- | --- | --- |
| `22` | `sshd` | host systemd | direct SSH | `public-direct` | Keep public only after SSH key hardening |
| `80`, `443` | `caddy` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | public ingress | `public-caddy` | Keep public |
| `3000` | `notelett-web` | `/opt/bytelyst/learning_ai_notes/docker-compose.yml` | `notes.bytelyst.com` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
| `3002` | `lysnrai-dashboard` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
| `3003` | `tracker-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `tracker.bytelyst.com` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
| `3030` | `chronomind-web` | `/root/bytelyst.ai/repos/learning_ai_clock/docker-compose.yml` | `clock.bytelyst.com` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
| `3035` | `jarvisjr-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
| `3040` | `flowmonk-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
| `3049` | `devops-web` | `/opt/bytelyst/learning_ai_devops_tools/dashboard/docker-compose.yml` | `devops.bytelyst.com` | `private-admin` | Bound to `127.0.0.1` on 2026-05-27; still needs auth/private gate for Caddy route |
| `3050` | `mindlyst-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
| `3055` | `nomgap-web` | orphan from older `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `retire` | Retired on 2026-05-27; current Compose says Nomgap web is deployed to Vercel |
| `3060` | `actiontrail-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
| `3070` | `localmemgpt-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
| `3075` | `llmlab-dashboard` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `llmlab.bytelyst.com` | `private-admin` | Bound to `127.0.0.1` on 2026-05-27; still needs auth/private gate for Caddy route |
| `3085` | `invttrdg-web` | `/opt/bytelyst/learning_ai_invt_trdg/docker-compose.yml` | `invttrdg.bytelyst.com` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
| `3100` | `loki` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
| `3300` | `gitea-npm-registry` | non-Compose container labels absent | `gitea.bytelyst.com` | `public-caddy` with direct bypass | Bind loopback or private; keep Caddy route |
| `4004` | `devops-backend` | `/opt/bytelyst/learning_ai_devops_tools/dashboard/docker-compose.yml` | `api.bytelyst.com/devops/*` | `private-admin` | Bound to `127.0.0.1` on 2026-05-27; still needs auth/private gate for Caddy route |
| `4010` | `peakpulse-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/peakpulse/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
| `4011` | `chronomind-backend` | `/root/bytelyst.ai/repos/learning_ai_clock/docker-compose.yml` | `api.bytelyst.com/chronomind/*` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
| `4012` | `jarvisjr-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/jarvisjr/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
| `4013` | `nomgap-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/nomgap/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
| `4014` | `mindlyst-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/mindlyst/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
| `4015` | `lysnrai-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/lysnrai/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
| `4016` | `notelett-backend` | `/opt/bytelyst/learning_ai_notes/docker-compose.yml` | `api.bytelyst.com/notelett/*` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
| `4017` | `flowmonk-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/flowmonk/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
| `4019` | `localmemgpt-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/localmemgpt/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
| `4020` | `actiontrail-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/actiontrail/*` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; route mapping still needs Caddy/product verification |
| `4025` | `invttrdg-backend` | `/opt/bytelyst/learning_ai_invt_trdg/docker-compose.yml` | `api.bytelyst.com/invttrdg/*` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
| `1025` | `mailpit` SMTP | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
| `8025` | `mailpit` UI | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
| `10000` | `azurite` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
| `1234`, `8081` | `cosmos-emulator` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
| `11434` | `ollama` host process | host service | `ollama.bytelyst.com` | `private-admin` | Bind loopback/private or auth-gate; do not leave raw public |
## Non-Public / Internal Listeners
| Address/port | Process/service | Initial class | Notes |
| --- | --- | --- | --- |
| `127.0.0.53:53`, `127.0.0.54:53` | `systemd-resolve` | host-internal | Expected resolver listeners |
| `127.0.0.1:44561` | `ollama` | host-internal | Secondary loopback listener observed |
| `100.87.53.10:9119`, `100.87.53.10:9120` | `hermes` | private-admin | Tailscale-only bind; keep private |
| `100.87.53.10:51855`, `[fd7a:115c:a1e0::3c33:350a]:43379` | `tailscaled` | private-admin | Tailscale control/data listeners |
| Docker-internal only | `platform-service`, `mcp-server`, `extraction-service`, `prometheus`, `cadvisor`, `node-exporter`, `valkey`, `trading-backend` | docker-internal/private | No direct host bind seen, except Caddy may route to some by service name |
## Unhealthy Containers At Inventory Time
| Container | Port exposure | Initial action |
| --- | --- | --- |
| `learning_ai_common_plat-llmlab-dashboard-1` | `0.0.0.0:3075` and Caddy `llmlab.bytelyst.com` | Fix/gate/retire before treating public |
| `learning_ai_common_plat-actiontrail-web-1` | `0.0.0.0:3060` | Classify and fix/retire |
| `learning_ai_common_plat-jarvisjr-web-1` | `0.0.0.0:3035` | Classify and fix/retire |
| `learning_ai_common_plat-localmemgpt-web-1` | `0.0.0.0:3070` | Classify and fix/retire |
| `learning_ai_common_plat-nomgap-web-1` | `0.0.0.0:3055` | Classify and fix/retire |
| `learning_ai_common_plat-flowmonk-web-1` | `0.0.0.0:3040` | Classify and fix/retire |
| `learning_ai_common_plat-mindlyst-web-1` | `0.0.0.0:3050` | Classify and fix/retire |
## Drift / Follow-Up Findings
- `nomgap-web` was an orphan from an older Compose revision, had no Caddy route, and was retired on 2026-05-27.
- `devops-backend` and `devops-web` now run from `/opt/bytelyst/learning_ai_devops_tools/dashboard/docker-compose.yml`.
- `gitea-npm-registry` has no Compose labels in Docker inspect output. Find its systemd/compose owner before changing `3300`.
- `admin.bytelyst.com` points at `admin-web:3001`, but no `admin-web` container was present in `docker ps` during this inventory.
## Proposed First Remediation Groups
Do these in separate commits/windows with smoke checks after each group.
1. **Internal emulators and mail tools:** `1025`, `8025`, `10000`, `1234`, `8081`.
- Expected class: `docker-internal` or `private-admin`.
- Preferred fix: remove host port mappings or bind to `127.0.0.1`.
2. **Observability internals:** `3100` and any future Prometheus/Grafana/exporter direct binds.
- Expected class: `private-admin`.
- Preferred fix: Docker-internal or Tailscale-only.
3. **Admin/model surfaces:** `11434`, `3075`, `3049`, `4004`.
- Expected class: `private-admin`.
- Preferred fix: auth gate/private route and no raw public port.
4. **Caddy-backed app/API direct bypass ports:** `3000`, `3003`, `3030`, `3085`, `4010`-`4025`.
- Expected class: `public-caddy`.
- Preferred fix: keep Caddy public, remove raw direct public binds.
5. **SSH:** `22`.
- Expected class: `public-direct`.
- Preferred fix: keep public only after key-only and root-login hardening.
## Verification Commands
```bash
date -Is
ss -ltnp
docker ps --format '{{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}}\tproject={{index .Config.Labels "com.docker.compose.project"}}\tservice={{index .Config.Labels "com.docker.compose.service"}}\tworkdir={{index .Config.Labels "com.docker.compose.project.working_dir"}}\tconfig={{index .Config.Labels "com.docker.compose.project.config_files"}}'
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,260p' /opt/bytelyst/Caddyfile
iptables -S DOCKER-USER
```

View File

@ -0,0 +1,718 @@
# ByteLyst VM Security Blind Spots Roadmap
**Review date:** 2026-05-27
**Reviewer:** Hermes Agent
**Scope:** Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.
## Executive Summary
The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.
The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on `0.0.0.0` / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.
Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
## Implementation Readiness Assessment
**Roadmap quality score:** 86%
**Implementation confidence before remediation starts:** 74%
**Why not higher yet:** the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.
**Confidence after Phase 0 is complete:** expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.
**Quality strengths:**
- Evidence is concrete and command-derived rather than speculative.
- The highest-risk items are correctly prioritized as P0.
- The roadmap separates discovery from disruptive remediation.
- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.
**Quality gaps to close before implementation:**
- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
- Add an approved exposure inventory before changing Docker bindings or `DOCKER-USER`.
- Record a tested SSH rollback path and keep an active second session/provider console open before changing `sshd`.
- Define what is intentionally public, private, internal-only, or deprecated for each service.
- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.
## Implementation Guardrails
These rules apply before any Phase 1 change:
- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
- Do not add broad `DROP` rules before an allowlist is committed to the inventory.
- Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
- Record exact rollback commands next to every change ticket.
- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.
## Exposure Classification Model
Every listening port and Caddy hostname should be classified before changes:
| Class | Meaning | Expected Controls | Examples To Review |
| --- | --- | --- | --- |
| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
| `docker-internal` | Container-to-container only | no host port mapping | databases, emulators, private workers |
| `retire` | Unused/deprecated | remove service/port, disable health checks and jobs | stale dashboards/services |
Minimum inventory fields:
- service/container name
- repo/Compose file
- host port and bind address
- container port
- Caddy hostname/path, if any
- intended audience
- authentication/control plane
- classification
- owner/approver
- rollback command
- post-change health check
## Evidence Snapshot
Collected on 2026-05-27 from this VM.
### Host and patching
- Host: `srv1491630`
- OS: Ubuntu `25.10`
- Kernel: `6.17.0-29-generic`
- Uptime: about 14 hours at review time
- Root filesystem: 193G total, 71G used, 123G available, 37% used
- Memory: 15Gi total, about 10Gi available
- Swap: 4.0G total, about 1.3G used
- Reboot required: no
- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for `libgcrypt20`, `libcaca0`, and `libssh2-1t64`
- Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent
### Network and ingress
- UFW: active; default deny incoming; only `22/tcp` allowed by UFW rules
- Docker iptables rules are present and publish many ports despite UFW's simple rule list
- Public/listening TCP ports bound on all interfaces included:
- `22`, `80`, `443`
- app/frontend ports: `3000`, `3002`, `3003`, `3030`, `3035`, `3040`, `3049`, `3050`, `3055`, `3060`, `3070`, `3075`, `3085`
- backend/API ports: `4004`, `4010`, `4011`, `4012`, `4013`, `4014`, `4015`, `4016`, `4017`, `4019`, `4020`, `4025`
- infra/dev ports: `1025`, `1234`, `3100`, `3300`, `8025`, `8081`, `10000`, `11434`
- Caddy source-of-truth config: `/opt/bytelyst/Caddyfile`, mounted read-only into the `caddy` container
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`: valid config, formatting warning only
- Caddy public hostnames include:
- `api.bytelyst.com`
- `gitea.bytelyst.com`
- `admin.bytelyst.com`
- `devops.bytelyst.com`
- `tracker.bytelyst.com`
- `llmlab.bytelyst.com`
- `ollama.bytelyst.com`
- `trading-api.bytelyst.com`
- `invttrdg.bytelyst.com`
- `notes.bytelyst.com`
- `clock.bytelyst.com`
### SSH and account surface
Effective `sshd -T` settings showed:
- `permitrootlogin yes`
- `passwordauthentication yes`
- `pubkeyauthentication yes`
- `kbdinteractiveauthentication no`
- `maxauthtries 6`
- `x11forwarding yes`
- `clientaliveinterval 0`
`fail2ban` is active with one jail: `sshd`; no current bans at review time.
### Docker runtime and containers
- Docker: client/server `29.4.2`; newer Docker packages are available
- Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; `live_restore=false`
- Most product containers run with writable root filesystems and no explicit `user` configured
- `cadvisor` is privileged
- `DOCKER-USER` chain appears empty, so there is no central Docker firewall policy in front of published containers
- Multiple containers are unhealthy:
- `learning_ai_common_plat-llmlab-dashboard-1`
- `learning_ai_common_plat-actiontrail-web-1`
- `learning_ai_common_plat-jarvisjr-web-1`
- `learning_ai_common_plat-localmemgpt-web-1`
- `learning_ai_common_plat-nomgap-web-1`
- `learning_ai_common_plat-flowmonk-web-1`
- `learning_ai_common_plat-mindlyst-web-1`
### Gitea and CI
- Gitea public route: `https://gitea.bytelyst.com`
- Local Gitea container port: host `3300` -> container `3000`, bound on `0.0.0.0` and IPv6
- `gitea-act-runner.service`: enabled but inactive/dead
- Runner user exists: `gitea-runner`, member of `docker`
- Runner config directory permissions look reasonable:
- `/home/gitea-runner/act_runner`: `750`, owned by `gitea-runner:gitea-runner`
- `/home/gitea-runner/act_runner/config.yaml`: `600`, owned by `gitea-runner:gitea-runner`
### Backup and operations
- `systemctl --failed` showed failed unit:
- `hermes-root-backup.service``Sync root Hermes persistent backup to GitHub`
- Hermes cron backup is active and healthy:
- job `470832621b43`, `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, last run `ok`
- Existing VM maintenance cron entries exist for health check and cleanup under `/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/`
- A root crontab entry still references `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh`, which may be stale after repo relocation/renaming
## Blind Spots and Risk Register
### P0 — Internet-exposed Docker ports bypass the intended ingress model
**Risk:** UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.
**Examples observed:** `3300`, `8025`, `1025`, `1234`, `8081`, `10000`, `11434`, many `30xx` web ports, and many `40xx` backend ports.
**Impact:** Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.
**Roadmap:**
- [ ] Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
- [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
- [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
- [x] Internal emulator/mail/observability ports `1025`, `8025`, `10000`, `1234`, `8081`, and `3100` are loopback-bound.
- [x] Common-platform direct app/API bypasses are loopback-bound or removed from host publishing.
- [x] Notes, Clock, and InvtTrdg direct app/API bypasses are loopback-bound.
- [x] DevOps dashboard/API direct private-admin bypasses are loopback-bound.
- [ ] Add a `DOCKER-USER` chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
**Acceptance criteria:**
- `docs/vm-exposure-inventory.md` lists every `ss -ltnp` listener and every Docker published port.
- Every non-SSH direct public bind has an approved classification.
- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in `DOCKER-USER`.
- External probe confirms non-approved ports are closed from the internet.
- Caddy-routed public hostnames still pass smoke checks.
**Rollback:** keep a saved copy of original Compose files and `iptables-save` output; rollback means restoring original port mappings or flushing only the newly added `DOCKER-USER` rules.
### P0 — SSH permits root login and password authentication
**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
**Roadmap:**
- [ ] Confirm all required admin users have working SSH keys and sudo access.
- [ ] Add a non-root break-glass admin path if one does not exist.
- [ ] Change SSH effective config to:
- [ ] `PermitRootLogin prohibit-password` or `no`
- [ ] `PasswordAuthentication no`
- [ ] `X11Forwarding no`
- [ ] lower `MaxAuthTries`, e.g. `3`
- [ ] set a sane `ClientAliveInterval` / `ClientAliveCountMax`
- [ ] Validate with a second session before restarting SSH.
- [ ] Record rollback commands and keep console/provider access available during rollout.
**Acceptance criteria:**
- A non-root sudo admin user can log in with SSH key auth.
- Root password login no longer works.
- Existing automation using `scripts/VMs/HostingerVM/login.sh` still works or is updated.
- `sshd -T` confirms the intended effective config.
- `fail2ban-client status sshd` still reports an active jail.
**Rollback:** provider console or still-open root session can restore previous `sshd_config` drop-in and restart `ssh`.
### P0 — Public/private boundary for dev and internal tooling is unclear
**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
**Roadmap:**
- [ ] Document public hostnames, auth model, and data sensitivity.
- [ ] Require explicit approval before exposing new dashboards or model endpoints.
- [ ] Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
- [ ] Add security headers/auth checks to public UI health reviews.
- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.
**Acceptance criteria:**
- `ollama`, `llmlab`, `devops`, `admin`, `gitea`, and observability-adjacent routes each have an owner-approved exposure class.
- Public admin-like routes require authentication or an explicit documented exception.
- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.
### P1 — Docker/container hardening is mostly default
**Risk:** Many containers run as default/root user, writable rootfs, broad capabilities by default, and rootful Docker. A compromised app gets more host-adjacent leverage than needed.
**Roadmap:**
- [ ] Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
- [ ] Start with public-facing/backend services and admin dashboards.
- [ ] Add `security_opt: ["no-new-privileges:true"]` where compatible.
- [ ] Add `cap_drop: ["ALL"]` and selectively add back capabilities only when needed.
- [ ] Convert app images to non-root users consistently.
- [ ] Use `read_only: true` plus explicit writable tmp/cache volumes where compatible.
- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
- [ ] Enable Docker `live-restore` if it fits maintenance operations.
**Implementation note:** do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with `no-new-privileges`, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
### P1 — Unhealthy containers can normalize broken deployments
**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
**Roadmap:**
- [ ] Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
- [ ] Fix or remove bad healthchecks so Docker health state is trustworthy.
- [ ] Add alerting for sustained unhealthy containers.
- [ ] Make deployment scripts fail on unhealthy post-deploy state.
- [ ] Update dashboard/observability docs with current service ownership and expected state.
**Acceptance criteria:**
- Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
- Docker health state matches the products actual serving state.
- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
### P1 — Gitea Actions runner is enabled but inactive
**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
**Roadmap:**
- [x] Decide whether the runner should be active or intentionally disabled.
- [x] If active: restart and verify `gitea-act-runner.service`, runner labels, and Docker access.
- [ ] Run and record a dedicated Gitea Actions smoke workflow result.
- [ ] If disabled: disable the service and document the intentional state.
- [ ] Keep runner secrets separate from smoke/test workflows.
- [ ] Add a runner-health check to VM observability.
**Decision needed:** runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.
### P1 — Backup/restore evidence is split and one backup unit is failed
**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
**Roadmap:**
- [x] Inspect `hermes-root-backup.service` logs and decide whether to fix, disable, or replace it with the cron-backed job.
- [x] Repair the root backup checkout divergence and verify a successful `hermes-root-backup.service` one-shot run.
- [x] Update `/root/.hermes/scripts/sync_hermes_persistent_backup.py` so future generated-backup divergence preserves a safety branch and rejoins the remote backup stream instead of wedging on `git pull --ff-only`.
- [ ] Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
- [ ] Run a restore drill into a non-production path/profile.
- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
- [ ] Add backup freshness and restore-drill status to the monthly VM review.
**Acceptance criteria:**
- `systemctl --failed` no longer includes backup units unless the failure is intentionally documented.
- Backup status shows source, destination, cadence, last success, and restore command.
- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
### P1 — Patch management has pending security/runtime updates
**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
**Roadmap:**
- [ ] Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
- [ ] Define a Docker upgrade maintenance window with pre/post checks.
- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
- [ ] Verify apps after Docker/containerd upgrades.
**Acceptance criteria:**
- Security updates and Docker/runtime updates are tracked separately.
- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
- Reboot requirement is checked and scheduled rather than discovered accidentally.
### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
**Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
**Roadmap:**
- [ ] Record current Ubuntu 25.10 support/EOL date in ops docs.
- [ ] Decide whether to stay on interim releases or migrate to an LTS baseline.
- [ ] Add an OS lifecycle check to quarterly review.
### P2 — Repository/config secret hygiene needs a repeatable scanner
**Risk:** The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.
**Roadmap:**
- [ ] Add a documented secret-scan command using `gitleaks` or `trufflehog` for tracked files and selected untracked ops directories.
- [ ] Scan historical directories such as `DELETED_bytelyst-devops-tools` separately before archiving or deleting.
- [ ] Add `.gitignore` patterns for generated scans, local account snapshots, and credential-shaped outputs.
- [ ] Keep examples as `.example` files only.
### P2 — Cron/systemd ownership and drift are not fully inventoried
**Risk:** Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.
**Roadmap:**
- [ ] Inventory root/user crontabs, `/etc/cron.d`, systemd timers, Hermes cron, and Gitea Actions schedules.
- [x] Remove or update stale `/opt/bytelyst/bytelyst-devops-tools/...` references after confirming replacements.
- [ ] Add owner, purpose, expected output, and alert channel for every job.
- [x] Add a stale-job detector for missing script paths and failed systemd units.
**Acceptance criteria:**
- No active cron/systemd job references a missing path.
- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
- Stale path detection runs in the monthly VM review.
### P2 — Observability exists but needs security-focused SLOs
**Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
**Roadmap:**
- [ ] Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
- [ ] Validate alert delivery to Telegram.
- [ ] Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.
## Execution Plan
### Phase 0 — Freeze and inventory before changes
- [ ] Freeze new public hostnames/ports until the exposure inventory is complete.
- [x] Generate `docs/vm-exposure-inventory.md` from Docker, Caddy, `ss`, and DNS.
- [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
- [ ] Review with S before changing public access for customer/user-facing apps.
**Exit criteria:** the inventory is reviewed and every P0 change has a rollback line and validation line.
### Phase 1 — Immediate security hardening
- [ ] Close or loopback-bind non-public Docker host ports.
- [x] Loopback-bound internal emulator/mail/observability ports `1025`, `8025`, `10000`, `1234`, `8081`, and `3100`.
- [x] Closed/loopback-bound common-platform direct app/API bypasses.
- [x] Loopback-bound Notes, Clock, and InvtTrdg direct app/API bypasses.
- [x] Loopback-bound DevOps dashboard/API direct private-admin bypasses.
- [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
- [ ] Harden SSH root/password access after key-based access is verified.
- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
**Exit criteria:** only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.
### Phase 2 — Operational correctness
- [x] Fix/retire unhealthy containers.
- [x] Resolve `hermes-root-backup.service` failed state.
- [x] Decide and document Gitea runner active/disabled state.
- [x] Add missing-script checks. Stale root cron path was fixed on 2026-05-27.
- [ ] Apply pending security/runtime updates in a maintenance window.
**Exit criteria:** no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.
### Phase 3 — Docker and app hardening
- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
- [ ] Add resource limits for noisy services and emulators.
- [ ] Move emulators/dev tools off public bindings.
- [ ] Review cAdvisor privilege and observability surface.
### Phase 4 — Backup, restore, and incident readiness
- [ ] Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
- [ ] Perform restore drill to non-prod target.
- [ ] Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
- [ ] Add quarterly tabletop review.
### Phase 5 — Continuous governance
- [ ] Monthly VM security review cron/checklist.
- [ ] Secret scan before DevOps repo pushes.
- [ ] OS lifecycle/EOL tracker.
- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
## Change Tickets With Quality Gates
Use this shape for each implementation PR/commit:
```text
Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:
```
Minimum post-checks for Phase 1:
- `ss -ltnp`
- `docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'`
- `iptables -S DOCKER-USER`
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`
- public smoke checks for approved hostnames
- negative external probe for blocked ports
- `sshd -T` after SSH changes
- `systemctl --failed --no-pager`
## Implementation Log
### 2026-05-27 — Phase 2 backup and cron drift
**Changed:**
- Repointed the root Lucky25 monitor cron from `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh` to `/opt/bytelyst/learning_ai_devops_tools/scripts/monitor-lucky25-execution.sh`.
- Saved the pre-change root crontab at `/tmp/root-crontab-before-vm-security-20260527.txt`.
- Repaired `/root/repos/bytelyst_hostinger_hermes_vm`, which was `ahead 1, behind 11`; the obsolete local generated backup commit conflicted with newer remote snapshots and was skipped after rebase preserved the current remote stream.
- Patched `/root/.hermes/scripts/sync_hermes_persistent_backup.py` to replace unconditional `git pull --ff-only` with explicit fetch/merge-base handling. Diverged generated snapshots now create a safety branch before attempting rebase and fall back to `origin/<branch>` if the generated files conflict.
- Saved the pre-change backup script at `/tmp/sync_hermes_persistent_backup.py.before-vm-security-20260527`.
**Verified:**
- `crontab -l` now points the Lucky25 monitor at the current repo script.
- `python3 -m py_compile /tmp/sync_hermes_persistent_backup.py` passed before deployment.
- `systemctl start hermes-root-backup.service` succeeded twice after repair.
- `systemctl status hermes-root-backup.service hermes-root-backup.timer --no-pager` showed the service exited `status=0/SUCCESS` and the timer remains active.
- `/root/repos/bytelyst_hostinger_hermes_vm` is aligned with `origin/main` after successful backup commits `415e824` and `369e584`.
**Residual risk:**
- A restore drill is still required before the backup posture should be considered fully proven.
- The backup sync script is runtime-managed under `/root/.hermes/scripts/`; add a tracked installer or source-of-truth copy so this hardening does not depend on manual VM state.
### 2026-05-27 — Phase 2 Gitea runner state
**Changed:**
- Started `gitea-act-runner.service`; it was enabled but inactive.
- Treated the intended state as active because the service unit is enabled, historical journal entries show successful task execution, and restart declared the runner successfully.
**Verified:**
- `systemctl is-active gitea-act-runner.service` returned `active`.
- `systemctl status gitea-act-runner.service --no-pager` showed `bytelyst-host-runner` running as `gitea-runner`.
- Runner labels declared successfully: `ubuntu-latest`, `linux`, `bytelyst`, `hostinger`.
- Runner config uses Docker executor images and `privileged: false`; Docker socket access is granted through the `docker` group.
- Runner immediately picked up task `42` for `bytelyst/bytelyst-devops-tools`, proving it can talk to local Gitea.
**Residual risk:**
- Record a small dedicated smoke workflow that does not need production secrets, so runner health is proven by a controlled workflow rather than incidental queued work.
- Add runner health to VM observability so enabled-but-inactive drift is caught automatically.
### 2026-05-27 — Phase 2 stale automation detector
**Changed:**
- Extended `scripts/VMs/HostingerVM/vm-health-check.sh` with an `AUTOMATION DRIFT` section.
- The daily health check now reports failed systemd units and root crontab script paths that no longer exist.
- Made optional `/var/log/vm-health-check.log` writes silent when the script runs in a restricted/read-only context.
**Verified:**
- `bash -n scripts/VMs/HostingerVM/vm-health-check.sh` passed.
- Restricted `--json` run stayed quiet on log-write failure and reported the new checks.
- Host-permission `--json` run reported `failed_units=OK` and `cron_missing_paths=OK`.
**Residual risk:**
- The detector currently covers root crontab and failed systemd units. Full ownership inventory still needs `/etc/cron.d`, user crontabs, Hermes cron, Gitea schedules, owners, outputs, and alert channels.
### 2026-05-27 — Phase 2 unhealthy containers
**Changed:**
- Added `HOSTNAME=0.0.0.0` to six managed Next.js web services in `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml`: `jarvisjr-web`, `flowmonk-web`, `mindlyst-web`, `actiontrail-web`, `localmemgpt-web`, and `llmlab-dashboard`.
- Recreated those six services from existing images with `docker compose ... up -d --no-build`.
- Retired the orphan `learning_ai_common_plat-nomgap-web-1` container. Current Compose already documents `nomgap-web` as deployed to Vercel and not part of the Docker stack.
**Verified:**
- `docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quiet` passed.
- The six recreated web containers report Docker health `healthy`.
- `docker ps --filter health=unhealthy` returns no containers.
- Host-level smoke checks returned HTTP `200` for `3035`, `3040`, `3050`, `3060`, `3070`, and `3075`; retired orphan port `3055` is closed.
- Host-permission `vm-health-check.sh --json` reports `container_health=OK`, `container_loops=OK`, `failed_units=OK`, and `cron_missing_paths=OK`.
**Committed/pushed:**
- `learning_ai_common_plat`: `af035e7d` (`fix: bind ecosystem Next apps on all interfaces`) pushed to GitHub.
**Residual risk:**
- Local Gitea mirror push for `learning_ai_common_plat` failed at Git HTTP transport even though fetch and health checks work; retry/fix mirror push separately.
- This fixed health state, not public exposure. Several direct published ports remain to be loopback-bound or blocked in Phase 1.
### 2026-05-27 — Phase 1 internal port loopback
**Changed:**
- Updated `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` so `cosmos-emulator`, `azurite`, `mailpit`, and `loki` publish host ports only on `127.0.0.1`.
- Recreated only `cosmos-emulator`, `azurite`, `mailpit`, and `loki` with `docker compose ... up -d --no-build`.
**Verified:**
- `docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quiet` passed.
- Docker reports the target services healthy.
- `ss -ltnp` shows `1025`, `8025`, `10000`, `1234`, `8081`, and `3100` listening on `127.0.0.1` only, with no `0.0.0.0` or IPv6 wildcard bind for that group.
- Local smoke checks returned HTTP `200` for Mailpit UI, Loki readiness, and Cosmos explorer. Azurite returned HTTP `400` on the raw blob endpoint while its container healthcheck remained healthy, which is expected for an unauthenticated root request.
**Committed/pushed:**
- `learning_ai_common_plat`: `1c09e479` (`fix: bind internal infra ports to loopback`) pushed to GitHub.
**Residual risk:**
- Public direct bypass remains for app/API ports, Gitea direct port `3300`, devops/admin surfaces, and Ollama `11434`.
- Add a `DOCKER-USER` fallback policy after the remaining allowlist is reviewed.
### 2026-05-27 — Phase 1 common-platform app/API bypasses
**Changed:**
- Updated `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` so remaining published common-platform web/dashboard ports bind to `127.0.0.1`.
- Recreated the common-platform web/dashboard services that previously published on `0.0.0.0`: `tracker-web`, `lysnrai-dashboard`, `jarvisjr-web`, `flowmonk-web`, `mindlyst-web`, `actiontrail-web`, `localmemgpt-web`, and `llmlab-dashboard`.
- Recreated stale common-platform backend containers `peakpulse-backend`, `lysnrai-backend`, and `nomgap-backend`; their current Compose definitions do not publish host ports, so the old direct `4010`, `4015`, and `4013` mappings were removed.
**Verified:**
- `docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quiet` passed.
- `docker ps --filter name=learning_ai_common_plat ... | grep 0.0.0.0` returned no common-platform wildcard-published containers.
- `docker ps --filter health=unhealthy` returned no unhealthy containers.
- `ss -ltnp` shows `3002`, `3003`, `3035`, `3040`, `3050`, `3060`, `3070`, and `3075` bound to `127.0.0.1`.
- Host smoke checks returned HTTP `200` for `3002`, `3003`, `3035`, `3040`, `3050`, `3060`, `3070`, and `3075`.
**Committed/pushed:**
- `learning_ai_common_plat`: `e29cc58a` (`fix: bind app host ports to loopback`) pushed to GitHub.
**Remaining wildcard Docker publishes after this checkpoint:**
- Caddy public ingress: `80`, `443`.
- Local Gitea direct port: `3300`.
- DevOps dashboard/API: `3049`, `4004`.
- Host Ollama still listens on wildcard `11434`.
### 2026-05-27 — Phase 1 product repo app/API bypasses
**Changed:**
- Updated `/opt/bytelyst/learning_ai_notes/docker-compose.yml` and `docker-compose.override.yml` so NoteLett backend/web bind to `127.0.0.1`.
- Updated `/root/bytelyst.ai/repos/learning_ai_clock/docker-compose.yml` so ChronoMind backend/web bind to `127.0.0.1`; also added `HOSTNAME=0.0.0.0` so the Next.js healthcheck works inside the container.
- Updated `/opt/bytelyst/learning_ai_invt_trdg/docker-compose.yml` so InvtTrdg backend/web bind to `127.0.0.1`.
- Recreated the affected services without rebuilding images.
**Verified:**
- Notes: `3000` and `4016` listen on `127.0.0.1`; local web/backend smoke checks returned HTTP `200`.
- Clock: `3030` and `4011` listen on `127.0.0.1`; local web/backend smoke checks returned HTTP `200`; containers are healthy.
- InvtTrdg: `3085` and `4025` listen on `127.0.0.1`; local web/backend smoke checks returned HTTP `200`.
- `docker ps --format ... | grep 0.0.0.0` now shows only Caddy `80/443`, Gitea `3300`, and DevOps `3049/4004` as Docker wildcard publishes.
- `docker ps --filter health=unhealthy` returned no unhealthy containers.
**Committed/pushed:**
- `learning_ai_notes`: `3683ba9` (`fix: bind Notes host ports to loopback`) pushed to GitHub.
- `learning_ai_clock`: `ee572f8` (`fix: bind Clock host ports to loopback`) pushed to GitHub.
- `learning_ai_invt_trdg`: `39490bc` (`fix: bind InvtTrdg host ports to loopback`) pushed to GitHub.
**Remaining wildcard direct exposure after this checkpoint:**
- Expected public ingress: `22`, `80`, `443`.
- Docker wildcard publishes still to fix: Gitea direct port `3300`, DevOps dashboard/API `3049` and `4004`.
- Host process still to fix: Ollama `11434`.
### 2026-05-27 — Phase 1 DevOps private-admin bypasses
**Changed:**
- Updated `/opt/bytelyst/learning_ai_devops_tools/dashboard/docker-compose.yml` so `devops-web` and `devops-backend` bind host ports only on `127.0.0.1`.
- Recreated `devops-backend` and `devops-web` without rebuilding images.
**Verified:**
- `docker compose config --quiet` passed in the DevOps dashboard directory.
- `devops-web` now publishes `127.0.0.1:3049->3000`.
- `devops-backend` now publishes `127.0.0.1:4004->4004` and is healthy.
- Local smoke checks returned HTTP `200` for `http://127.0.0.1:3049` and `http://127.0.0.1:4004/health`.
- `docker ps --format ... | grep 0.0.0.0` now shows only Caddy `80/443` and Gitea `3300` as Docker wildcard publishes.
**Remaining wildcard direct exposure after this checkpoint:**
- Expected public ingress: `22`, `80`, `443`.
- Docker wildcard publish still to fix: Gitea direct port `3300`.
- Host process still to fix: Ollama `11434`.
## Do Not Start With
- Rootless Docker migration.
- Broad `iptables` default-drop without an allowlist.
- Mass Compose rewrites across all products.
- SSH password/root lockout before key-based sudo and rollback are proven.
- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
- Publishing secret-scan output that contains secrets.
## Suggested First Tickets
1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
2. **P0: Lock Docker-published non-public ports** — bind to loopback/internal or enforce `DOCKER-USER` drops.
3. **P0: Harden SSH** — disable password/root login after confirming key-based admin access.
4. **P1: Triage unhealthy containers** — fix healthchecks/apps or retire dead services.
5. **P1: Resolve failed Hermes backup unit** — fix or disable duplicate failed unit; keep cron backup healthy.
6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.
**Recommended first implementation order:**
1. Generate and review `docs/vm-exposure-inventory.md`.
2. Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
3. Harden SSH with second-session/provider-console safety.
4. Move obvious internal-only Docker ports to loopback/internal bindings.
5. Add `DOCKER-USER` guardrails after the allowlist is proven.
This order improves safety without letting the port exposure issue linger too long.
## Verification Commands for Future Runs
```bash
# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required
# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup
# SSH effective config
sshd -T | egrep '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd
# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'
# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile
# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
```
## Notes
- This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.

View File

@ -8,6 +8,10 @@ This directory is the preferred home for self-contained operational scripts.
- Supported. - Supported.
- Purpose: update and harden Ubuntu VMs with unattended upgrades, UFW, and fail2ban. - Purpose: update and harden Ubuntu VMs with unattended upgrades, UFW, and fail2ban.
- Risk level: high, because it modifies packages, firewall rules, and reboot behavior. - Risk level: high, because it modifies packages, firewall rules, and reboot behavior.
- `VMs/HostingerVM/vm-health-check.sh`
- Supported.
- Purpose: read-only VM health and drift check for disk, memory, swap, Docker health, failed systemd units, and stale root crontab script paths.
- Risk level: low, because it is read-only apart from an optional local log write.
## Conventions ## Conventions

View File

@ -0,0 +1,150 @@
# Hostinger VM — Cron Setup
Automated maintenance schedule for `srv1491630`.
Scripts: `vm-health-check.sh` (read-only) + `vm-cleanup.sh` (safe cleanup).
---
## Quick install
SSH into the VM and run:
```bash
bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --install-cron
```
This installs the full recommended schedule. To remove it:
```bash
bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --uninstall-cron
```
---
## What gets scheduled
| Schedule | Time (UTC) | Command | What it does |
|---|---|---|---|
| Daily | 07:00 | `vm-health-check.sh` | Read-only check; sends Telegram alert on WARNING/CRITICAL |
| Daily | 03:00 | `vm-cleanup.sh` | Prune Docker build cache only (always safe) |
| Weekly | Sun 02:00 | `vm-cleanup.sh` | Standard cleanup (see below) |
| Monthly | 1st 01:00 | `vm-cleanup.sh --full` | Full cleanup (see below) |
---
## What each mode does
### Standard weekly cleanup (`vm-cleanup.sh`)
All steps are labelled **SAFE** — they only remove regenerable caches.
| Step | What's removed | Risk |
|---|---|---|
| Docker build cache | Layer cache from `docker build` runs | Zero — rebuilds just take longer next time |
| Crash loop check | Detection only, no changes | Zero |
| Journal vacuum | Old journal entries beyond 200MB / 7 days | Zero — logs are already captured in syslog |
| APT cache | `/var/cache/apt/archives/` | Zero — packages can be re-downloaded |
| NPM cache | `~/.npm/_cacache/` | Zero — cache is re-populated on next `npm install` |
| `.next/cache` | Webpack/babel/TSC build cache dirs | Zero — rebuilt automatically on next `next build` |
### Monthly full cleanup (`vm-cleanup.sh --full`)
Adds these **CAREFUL** steps on top of the standard run:
| Step | What's removed | Risk |
|---|---|---|
| Docker system prune | Stopped containers, unused networks, dangling images | Low — does NOT remove images used by any container |
| pnpm store prune | Packages not referenced by any `node_modules` | Low — only removes truly orphaned packages |
| Old log files | `.gz` log rotations older than 30 days | Low — old compressed logs |
| HOLD node_modules | `node_modules` in `/opt/bytelyst/HOLD` archived projects | Low — code intact, can reinstall with `pnpm install` |
### Never touched (by design)
- `/opt/bytelyst/*/node_modules` (active repos)
- `/opt/bytelyst/*/src`, `/app`, `/backend`, `/web` source code
- `.next/standalone` (production Next.js builds)
- Docker images used by currently configured containers
- `/usr/local/lib/hermes-agent/`
- `/usr/share/ollama/` (models)
- `/swapfile`
- Any database volumes
---
## Manual crontab (if you prefer not to use --install-cron)
```
# Health check daily 07:00 UTC
0 7 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --quiet --notify 2>&1 | logger -t vm-health
# Build cache prune daily 03:00 UTC
0 3 * * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup
# Standard weekly cleanup Sunday 02:00 UTC
0 2 * * 0 bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --quiet 2>&1 | logger -t vm-cleanup
# Full monthly cleanup 1st of month 01:00 UTC
0 1 1 * * bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-cleanup.sh --full --quiet 2>&1 | logger -t vm-cleanup
```
Edit with: `crontab -e`
---
## Monitoring logs
```bash
# Tail cleanup log
tail -f /var/log/vm-cleanup.log
# Tail health check log
tail -f /var/log/vm-health-check.log
# See all cron output via syslog
grep vm-cleanup /var/log/syslog | tail -20
grep vm-health /var/log/syslog | tail -20
```
---
## Telegram alerts
The health check script sends a Telegram message when it detects WARNING or CRITICAL.
It reads credentials from `$HERMES_HOME/.env` (usually `/root/.hermes/.env`).
Required keys in that file:
```
TELEGRAM_BOT_TOKEN=<your-bot-token>
TELEGRAM_CHAT_ID=<your-chat-id>
```
Both are already set if Hermes gateway is configured. Test with:
```bash
bash /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/vm-health-check.sh --notify
```
---
## Disk thresholds (from `vm-health-check.sh`)
| Metric | WARNING | CRITICAL |
|---|---|---|
| Disk used `%` | > 55% | > 70% |
| Load average | > 4.0 | > 8.0 |
| RAM available | < 3 GB | < 1 GB |
| Swap used | > 1 GB | > 3 GB |
| Container restarts | > 10 | > 50 |
| Build cache | > 5 GB | > 20 GB |
Thresholds are constants at the top of each script — easy to adjust.
---
## What the May 2026 incident would have caught
If this cron had been running during the May 26 incident:
- **07:00 daily health check**`container_loops CRIT: admin-web(50x)` → Telegram alert sent within hours of the loop starting
- **03:00 daily build cache prune** → would have kept build cache under 5 GB instead of growing to 84 GB
- **Monthly full cleanup** → would have cleared the HOLD node_modules and old logs before they became a storage crisis

View File

@ -0,0 +1,63 @@
#!/usr/bin/env bash
# =============================================================================
# docker-health-watchdog.sh — restart containers stuck in unhealthy state
#
# Systemd timer invokes this every 10 minutes.
# A container is only restarted after 3 consecutive failing health checks
# (i.e. the last 3 entries in .State.Health.Log all have ExitCode != 0).
# This gives a 30-minute grace window before action is taken — avoids
# restarting containers that are transiently unhealthy during a deploy.
#
# Log: /var/log/docker-watchdog.log
# =============================================================================
set -Eeuo pipefail
LOG=/var/log/docker-watchdog.log
TOKEN_FILE="${HERMES_HOME:-/root/.hermes}/.env"
log() { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" | tee -a "$LOG" 2>/dev/null || true; }
notify_telegram() {
local msg="$1"
local token chat_id
token=$(grep -oP '(?<=TELEGRAM_BOT_TOKEN=)\S+' "$TOKEN_FILE" 2>/dev/null || true)
chat_id=$(grep -oP '(?<=TELEGRAM_CHAT_ID=)\S+' "$TOKEN_FILE" 2>/dev/null || true)
[[ -z "$token" || -z "$chat_id" ]] && return
curl -sf -X POST "https://api.telegram.org/bot${token}/sendMessage" \
-d chat_id="$chat_id" \
-d text="$msg" > /dev/null 2>&1 || true
}
if ! command -v docker &>/dev/null || ! docker info &>/dev/null 2>&1; then
log "Docker not available — skipping watchdog run"
exit 0
fi
mapfile -t unhealthy < <(docker ps --filter health=unhealthy --format '{{.Names}}' 2>/dev/null || true)
if (( ${#unhealthy[@]} == 0 )); then
exit 0
fi
log "Unhealthy containers detected: ${unhealthy[*]}"
for container in "${unhealthy[@]}"; do
# Count how many of the last 3 health check log entries failed (ExitCode != 0)
failures=$(docker inspect "$container" 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
if not data:
print(0); exit()
log = data[0].get('State', {}).get('Health', {}).get('Log', [])
recent = log[-3:] if len(log) >= 3 else log
print(sum(1 for e in recent if e.get('ExitCode', 0) != 0))
" 2>/dev/null || echo 0)
if [[ "$failures" -eq 3 ]]; then
log "Auto-restarting $container (unhealthy 3/3 consecutive checks)"
docker restart "$container" 2>&1 | head -1 | tee -a "$LOG" || true
notify_telegram "🔄 docker-watchdog restarted $container (3 consecutive unhealthy health checks) — $(hostname)"
else
log "$container is unhealthy but only $failures/3 consecutive failures — waiting"
fi
done

View File

@ -0,0 +1,435 @@
#!/usr/bin/env bash
# =============================================================================
# vm-cleanup.sh — Hostinger VM Safe Periodic Cleanup
#
# Designed to be run manually or via cron. All operations are either
# completely safe (read-only builds) or will prompt for confirmation when
# removing things that can't be trivially regenerated.
#
# Modes:
# (default) Weekly safe cleanup — build cache, apt, npm, journal, .next/cache
# --full Monthly deeper cleanup — adds: pnpm store, docker system prune,
# old log files, Docker image dangling prune
# --dry-run Print what would be done, make no changes
# --install-cron Install the recommended cron schedule for both scripts
# --uninstall-cron Remove the installed cron jobs
#
# All destructive steps are gated behind SAFE / CAREFUL / MANUAL labels
# in the output so you can audit what ran.
#
# Logs to: /var/log/vm-cleanup.log
# =============================================================================
set -Eeuo pipefail
# ── Config ───────────────────────────────────────────────────────────────────
LOG_FILE="/var/log/vm-cleanup.log"
SCRIPT_PATH="$(realpath "${BASH_SOURCE[0]}")"
SCRIPT_DIR="$(dirname "$SCRIPT_PATH")"
HEALTH_CHECK="$SCRIPT_DIR/vm-health-check.sh"
# Paths that must NEVER be deleted even in --full mode
# shellcheck disable=SC2034
PROTECTED_PATHS=(
"/opt/bytelyst/learning_ai_common_plat"
"/opt/bytelyst/learning_ai_devops_tools"
"/usr/local/lib/hermes-agent"
"/usr/share/ollama"
"/swapfile"
)
# node_modules dirs in active (non-HOLD) repos to never touch
# shellcheck disable=SC2034
ACTIVE_NODE_MODULES=(
"/opt/bytelyst/learning_ai_common_plat/node_modules"
"/opt/bytelyst/learning_ai_flowmonk/node_modules"
"/opt/bytelyst/learning_ai_clock/node_modules"
"/opt/bytelyst/learning_ai_notes/node_modules"
"/opt/bytelyst/learning_ai_devops_tools/dashboard/node_modules"
"/opt/bytelyst/learning_ai_invt_trdg/node_modules"
)
# ── Colour codes ─────────────────────────────────────────────────────────────
RED=$'\033[0;31m'
YELLOW=$'\033[1;33m'
GREEN=$'\033[0;32m'
CYAN=$'\033[0;36m'
BOLD=$'\033[1m'
DIM=$'\033[2m'
NC=$'\033[0m'
# ── Flags ────────────────────────────────────────────────────────────────────
FULL_MODE=false
DRY_RUN=false
INSTALL_CRON=false
UNINSTALL_CRON=false
QUIET=false
for arg in "$@"; do
case "$arg" in
--full) FULL_MODE=true ;;
--dry-run) DRY_RUN=true ;;
--install-cron) INSTALL_CRON=true ;;
--uninstall-cron) UNINSTALL_CRON=true ;;
--quiet) QUIET=true ;;
esac
done
# ── Helpers ──────────────────────────────────────────────────────────────────
log() {
local msg
msg="[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*"
echo "$msg" >> "$LOG_FILE" 2>/dev/null || true
$QUIET || echo -e "$*"
}
log_header() {
$QUIET || echo -e "\n${BOLD}${CYAN}── $1 ──────────────────────────────────────${NC}"
echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] === $1 ===" >> "$LOG_FILE" 2>/dev/null || true
}
log_step() {
local label="$1" msg="$2"
case "$label" in
SAFE) $QUIET || echo -e " ${GREEN}[SAFE]${NC} $msg" ;;
CAREFUL) $QUIET || echo -e " ${YELLOW}[CAREFUL]${NC} $msg" ;;
SKIP) $QUIET || echo -e " ${DIM}[SKIP]${NC} $msg" ;;
DRY) $QUIET || echo -e " ${CYAN}[DRY-RUN]${NC} $msg" ;;
esac
}
run_cmd() {
# run_cmd LABEL "description" cmd args...
local label="$1" desc="$2"
shift 2
log_step "$label" "$desc"
if $DRY_RUN; then
log_step DRY "would run: $*"
return 0
fi
log "[CMD] $*"
"$@" >> "$LOG_FILE" 2>&1 || true
}
disk_before=""
record_disk_before() {
disk_before=$(df -h / --output=used,avail,pcent | tail -1 | tr -s ' ')
}
report_disk_delta() {
local disk_after
disk_after=$(df -h / --output=used,avail,pcent | tail -1 | tr -s ' ')
if ! $QUIET; then
echo -e "\n ${DIM}Before: $disk_before${NC}"
echo -e " ${GREEN}After: $disk_after${NC}"
fi
log "[DISK] before=$disk_before after=$disk_after"
}
# ── Safety guard ─────────────────────────────────────────────────────────────
require_root() {
if [[ "$(id -u)" -ne 0 ]]; then
echo -e "${RED}ERROR: This script must be run as root (use sudo)${NC}" >&2
exit 1
fi
}
# ── Cron install/uninstall ───────────────────────────────────────────────────
do_install_cron() {
echo -e "\n${BOLD}Installing cron schedule…${NC}\n"
local cron_tag="# bytelyst-vm-maintenance"
local tmp_cron
tmp_cron=$(mktemp)
# Export existing crontab (minus our managed block)
crontab -l 2>/dev/null | grep -v "$cron_tag" > "$tmp_cron" || true
cat >> "$tmp_cron" <<EOF
$cron_tag — DO NOT EDIT this block manually, use --install-cron / --uninstall-cron
# Daily health check at 07:00 UTC (read-only, sends Telegram alert on WARNING/CRITICAL)
0 7 * * * bash $SCRIPT_DIR/vm-health-check.sh --quiet --notify 2>&1 | logger -t vm-health-check
# Daily build cache prune at 03:00 UTC (always safe, never removes images)
0 3 * * * bash $SCRIPT_PATH --quiet 2>&1 | logger -t vm-cleanup
# Weekly cleanup (Sunday 02:00 UTC) — logs, apt, npm, .next/cache, build cache
0 2 * * 0 bash $SCRIPT_PATH --quiet 2>&1 | logger -t vm-cleanup
# Monthly full cleanup (1st of month 01:00 UTC) — adds pnpm store, docker system prune
0 1 1 * * bash $SCRIPT_PATH --full --quiet 2>&1 | logger -t vm-cleanup
EOF
crontab "$tmp_cron"
rm -f "$tmp_cron"
echo -e " ${GREEN}✓ Cron jobs installed. Current schedule:${NC}"
echo ""
crontab -l | grep -A20 "$cron_tag" || true
echo ""
echo -e " View logs: ${CYAN}tail -f $LOG_FILE${NC}"
echo -e " View cron: ${CYAN}crontab -l${NC}"
echo -e " Remove: ${CYAN}bash $SCRIPT_PATH --uninstall-cron${NC}"
}
do_uninstall_cron() {
local cron_tag="# bytelyst-vm-maintenance"
local tmp_cron
tmp_cron=$(mktemp)
crontab -l 2>/dev/null | grep -v "$cron_tag" > "$tmp_cron" || true
# Also strip the actual cron lines we added (they follow the tag block)
grep -v "vm-health-check.sh\|vm-cleanup.sh" "$tmp_cron" > "${tmp_cron}.clean" || true
crontab "${tmp_cron}.clean"
rm -f "$tmp_cron" "${tmp_cron}.clean"
echo -e " ${GREEN}✓ Cron jobs removed${NC}"
}
# ── Cleanup steps ─────────────────────────────────────────────────────────────
step_cosmos_pglog() {
# The Azure CosmosDB emulator uses an embedded Postgres instance that logs
# every SQL statement to /logs/pglog.log inside its overlay layer.
# It grows ~275 MB/hr during heavy trading activity. Truncate it safely —
# Postgres keeps the file descriptor open so truncation doesn't break it.
log_header "CosmosDB Emulator Postgres Log"
local container="learning_ai_common_plat-cosmos-emulator-1"
if ! docker ps --format '{{.Names}}' 2>/dev/null | grep -q "^${container}$"; then
log_step SKIP "CosmosDB emulator not running"
return
fi
# Locate the overlay upper dir for this container
local pglog
pglog=$(docker inspect "$container" 2>/dev/null \
| python3 -c "
import json,sys,os
d=json.load(sys.stdin)[0]
# Try direct GraphDriver path first
upper=d.get('GraphDriver',{}).get('Data',{}).get('UpperDir','')
if upper:
p=os.path.join(upper,'logs','pglog.log')
if os.path.exists(p): print(p)
exit()
# Fallback: scan rootfs overlayfs dirs
import glob
for f in glob.glob('/var/lib/docker/rootfs/overlayfs/*/logs/pglog.log'):
print(f); exit()
" 2>/dev/null || true)
if [[ -z "$pglog" || ! -f "$pglog" ]]; then
log_step SKIP "pglog.log not found (overlay path changed?)"
return
fi
local size_mb
size_mb=$(du -sm "$pglog" 2>/dev/null | cut -f1 || echo 0)
if (( size_mb < 100 )); then
log_step SKIP "pglog.log is ${size_mb}MB — no truncation needed (<100 MB)"
return
fi
run_cmd SAFE "Truncate CosmosDB pglog.log (${size_mb}MB → 0)" truncate -s 0 "$pglog"
}
step_docker_build_cache() {
log_header "Docker Build Cache"
if ! docker info &>/dev/null 2>&1; then
log_step SKIP "Docker not running — skipping build cache prune"
return
fi
local cache_size
cache_size=$(docker system df 2>/dev/null | awk '/^Build Cache/ {print $3}' || echo "?")
run_cmd SAFE "Prune Docker build cache (currently $cache_size)" \
docker builder prune -f
}
step_docker_system_prune() {
# Removes stopped containers, unused networks, dangling images ONLY
# Does NOT remove images used by any existing container
log_header "Docker System Prune (dangling only)"
if ! docker info &>/dev/null 2>&1; then
log_step SKIP "Docker not running"
return
fi
run_cmd SAFE "Remove stopped containers, unused networks, dangling images" \
docker system prune -f
}
step_docker_aged_images() {
# Removes tagged images that haven't been used by any container in >7 days.
# Safe because any running container holds a reference to its image — this
# only cleans up old image versions that were replaced (e.g. after a deploy).
log_header "Docker Aged Image Prune (unused >7 days)"
if ! docker info &>/dev/null 2>&1; then
log_step SKIP "Docker not running"
return
fi
local reclaimable
reclaimable=$(docker system df 2>/dev/null | awk '/^Images/ {print $4}' || echo "?")
run_cmd SAFE "Prune images unused for >7 days (currently $reclaimable reclaimable)" \
docker image prune -a -f --filter "until=168h"
}
step_docker_crash_loop_check() {
log_header "Crash Loop Check"
if ! docker info &>/dev/null 2>&1; then return; fi
local looping=()
while IFS=$'\t' read -r name restarts; do
[[ -z "$name" || "$name" == "NAMES" ]] && continue
restarts="${restarts:-0}"
if (( restarts >= 20 )); then looping+=("$name(${restarts}x)"); fi
done < <(docker ps -a --format $'{{.Names}}\t{{.RestartCount}}' 2>/dev/null || true)
if (( ${#looping[@]} > 0 )); then
echo -e " ${RED}${BOLD}⚠ CRASH LOOPS DETECTED — manual fix required:${NC}"
for c in "${looping[@]}"; do
echo -e " ${RED}${NC} $c"
echo -e " ${DIM}fix: docker logs ${c%%(*)} | tail -20${NC}"
echo -e " ${DIM}stop loop: docker update --restart=no ${c%%(*)}${NC}"
done
log "[WARN] crash-looping containers: ${looping[*]}"
else
log_step SAFE "No crash-looping containers"
fi
}
step_journal() {
log_header "Journal Logs"
run_cmd SAFE "Vacuum journal to 200MB" \
journalctl --vacuum-size=200M
run_cmd SAFE "Vacuum journal older than 7 days" \
journalctl --vacuum-time=7d
}
step_apt_cache() {
log_header "APT Cache"
run_cmd SAFE "Clean apt package cache" \
apt-get clean
}
step_npm_cache() {
log_header "NPM Cache"
if command -v npm &>/dev/null; then
run_cmd SAFE "Clean npm cache" \
npm cache clean --force
fi
}
step_next_cache() {
log_header ".next/cache Directories"
# Only delete .next/cache (webpack/babel/tsbuild cache), NOT .next/standalone (prod build)
local count=0
while IFS= read -r cache_dir; do
log_step SAFE "Remove $cache_dir"
if ! $DRY_RUN; then rm -rf "$cache_dir"; fi
count=$(( count + 1 ))
done < <(
find /opt/bytelyst -name ".next" -maxdepth 7 -type d 2>/dev/null \
| while read -r d; do
[[ -d "$d/cache" ]] && echo "$d/cache"
done
)
if (( count == 0 )); then log_step SKIP "No .next/cache dirs found"; fi
}
step_pnpm_store() {
log_header "PNPM Store"
if command -v pnpm &>/dev/null; then
run_cmd SAFE "Prune unreferenced packages from pnpm store" \
pnpm store prune
else
log_step SKIP "pnpm not found"
fi
}
step_old_logs() {
log_header "Old Log Files"
# Compress any uncompressed .1 rotations that logrotate missed
local count=0
for f in /var/log/syslog.1 /var/log/kern.log.1 /var/log/ufw.log.1; do
if [[ -f "$f" && ! -f "${f}.gz" ]]; then
run_cmd SAFE "Compress $f" gzip -9 "$f"
count=$(( count + 1 ))
fi
done
# Remove log rotations older than 30 days
while IFS= read -r old_log; do
run_cmd CAREFUL "Remove old log: $old_log" rm -f "$old_log"
done < <(find /var/log -name "*.gz" -mtime +30 -type f 2>/dev/null || true)
if (( count == 0 )); then log_step SKIP "No uncompressed rotations to compress"; fi
}
step_hold_cleanup() {
log_header "HOLD Archived Projects"
# node_modules in HOLD are safe to delete — code stays, can be reinstalled
local total_freed=0
local found=0
while IFS= read -r nm; do
local size
size=$(du -sm "$nm" 2>/dev/null | cut -f1 || echo "0")
run_cmd CAREFUL "Delete archived node_modules: $nm (~${size}MB)" rm -rf "$nm"
total_freed=$(( total_freed + size ))
found=$(( found + 1 ))
done < <(
find /opt/bytelyst/HOLD -name "node_modules" -maxdepth 4 -type d 2>/dev/null || true
)
if (( found == 0 )); then
log_step SKIP "HOLD node_modules already clean"
else
log "[INFO] Freed ~${total_freed}MB from HOLD node_modules"
fi
# .next build artifacts in HOLD
while IFS= read -r next_dir; do
run_cmd CAREFUL "Delete archived .next: $next_dir" rm -rf "$next_dir"
done < <(
find /opt/bytelyst/HOLD -name ".next" -maxdepth 6 -type d 2>/dev/null || true
)
}
# ── Main ─────────────────────────────────────────────────────────────────────
# Handle special modes first (no root needed for these)
if $INSTALL_CRON; then require_root; do_install_cron; exit 0; fi
if $UNINSTALL_CRON; then require_root; do_uninstall_cron; exit 0; fi
require_root
if ! $QUIET; then
if $FULL_MODE; then MODE="FULL"; else MODE="STANDARD"; fi
if $DRY_RUN; then DRY=" (DRY-RUN)"; else DRY=""; fi
echo -e "\n${BOLD}VM Cleanup — $(hostname)${MODE}${DRY}${NC}"
echo -e "${DIM}$(date -u '+%Y-%m-%d %H:%M UTC')${NC}"
fi
if $FULL_MODE; then _mode="full"; else _mode="standard"; fi
log "[START] mode=${_mode} dry=$DRY_RUN"
record_disk_before
# ── WEEKLY (always run) ──────────────────────────────────────────────────────
step_cosmos_pglog
step_docker_build_cache
step_docker_crash_loop_check
step_journal
step_apt_cache
step_npm_cache
step_next_cache
# ── MONTHLY (only with --full) ───────────────────────────────────────────────
if $FULL_MODE; then
step_docker_system_prune
step_docker_aged_images
step_pnpm_store
step_old_logs
step_hold_cleanup
fi
# ── Final report ─────────────────────────────────────────────────────────────
report_disk_delta
if ! $QUIET; then
echo -e "\n${GREEN}${BOLD}✓ Cleanup complete${NC}"
echo -e " Log: ${CYAN}tail -50 $LOG_FILE${NC}"
if [[ -f "$HEALTH_CHECK" ]]; then
echo ""
echo -e "${DIM}Running health check…${NC}"
bash "$HEALTH_CHECK" || true
fi
fi
log "[END] cleanup complete"

View File

@ -0,0 +1,430 @@
#!/usr/bin/env bash
# =============================================================================
# vm-health-check.sh — Hostinger VM Health Check (READ-ONLY)
#
# Checks disk, memory, load, swap, and Docker container health.
# Prints a colour-coded report. Exits non-zero if any threshold is exceeded
# so it can drive cron alerts or CI gates.
#
# Usage:
# bash vm-health-check.sh # interactive report
# bash vm-health-check.sh --quiet # only print problems (exit 1 if any)
# bash vm-health-check.sh --json # machine-readable JSON output
# bash vm-health-check.sh --notify # send Telegram alert on WARNING/CRITICAL
#
# Exit codes:
# 0 — all green
# 1 — at least one WARNING
# 2 — at least one CRITICAL
# =============================================================================
set -Eeuo pipefail
# shellcheck disable=SC2034
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOG_FILE="/var/log/vm-health-check.log"
# ── Thresholds ──────────────────────────────────────────────────────────────
DISK_WARN=55 # % used
DISK_CRIT=70
LOAD_WARN=4.0 # absolute (not per-CPU)
LOAD_CRIT=8.0
RAM_FREE_WARN_GB=3 # GB available
RAM_FREE_CRIT_GB=1
SWAP_USED_WARN_GB=1.5
SWAP_USED_CRIT_GB=3
SWAP_CACHED_WARN_MB=200 # early-warning: recent swap pressure even if current usage looks ok
STEAL_WARN=5 # % steal time
STEAL_CRIT=15
CONTAINER_RESTART_WARN=10
CONTAINER_RESTART_CRIT=50
BUILD_CACHE_WARN_GB=5
BUILD_CACHE_CRIT_GB=20
# shellcheck disable=SC2034
DOCKER_IMAGES_WARN_GB=15
# shellcheck disable=SC2034
DOCKER_IMAGES_CRIT_GB=25
# ── Colour codes ────────────────────────────────────────────────────────────
RED=$'\033[0;31m'
YELLOW=$'\033[1;33m'
GREEN=$'\033[0;32m'
CYAN=$'\033[0;36m'
BOLD=$'\033[1m'
NC=$'\033[0m'
# ── Flags ───────────────────────────────────────────────────────────────────
QUIET=false
JSON_MODE=false
NOTIFY=false
for arg in "$@"; do
case "$arg" in
--quiet) QUIET=true ;;
--json) JSON_MODE=true ;;
--notify) NOTIFY=true ;;
esac
done
# ── State tracking ──────────────────────────────────────────────────────────
WORST_LEVEL=0 # 0=OK, 1=WARN, 2=CRIT
ISSUES=()
declare -A JSON_DATA
# ── Helpers ─────────────────────────────────────────────────────────────────
log_to_file() {
local log_dir
log_dir="$(dirname "$LOG_FILE")"
if [[ ( -e "$LOG_FILE" && ! -w "$LOG_FILE" ) || ( ! -e "$LOG_FILE" && ! -w "$log_dir" ) ]]; then
return
fi
echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" >> "$LOG_FILE" 2>/dev/null || true
}
status_icon() {
case "$1" in
OK) echo -e "${GREEN}${NC}" ;;
WARN) echo -e "${YELLOW}${NC}" ;;
CRIT) echo -e "${RED}${NC}" ;;
esac
}
level_int() {
case "$1" in OK) echo 0 ;; WARN) echo 1 ;; CRIT) echo 2 ;; *) echo 0 ;; esac
}
record() {
# record NAME LEVEL VALUE MESSAGE
local name="$1" level="$2" value="$3" message="$4"
local lvl_int
lvl_int=$(level_int "$level")
if (( lvl_int > WORST_LEVEL )); then WORST_LEVEL=$lvl_int; fi
if [[ "$level" != "OK" ]]; then ISSUES+=("[$level] $message"); fi
JSON_DATA["$name"]=$(printf '{"level":"%s","value":"%s","message":"%s"}' \
"$level" "$value" "$message")
if $QUIET && [[ "$level" == "OK" ]]; then return; fi
if ! $JSON_MODE; then
printf " %s %-30s %s\n" "$(status_icon "$level")" "$message" "${CYAN}($value)${NC}"
fi
}
header() {
$JSON_MODE && return
$QUIET && return
echo -e "\n${BOLD}${CYAN}── $1 ──────────────────────────────────────${NC}"
}
# ── Checks ──────────────────────────────────────────────────────────────────
check_disk() {
header "DISK"
local used_pct
used_pct=$(df / --output=pcent | tail -1 | tr -d ' %')
local avail_gb
avail_gb=$(df / --output=avail -BG | tail -1 | tr -d ' G')
if (( used_pct >= DISK_CRIT )); then record disk CRIT "${used_pct}%" "Disk ${used_pct}% used — CRITICAL (>${DISK_CRIT}%)"
elif (( used_pct >= DISK_WARN )); then record disk WARN "${used_pct}%" "Disk ${used_pct}% used — WARNING (>${DISK_WARN}%)"
else record disk OK "${used_pct}% used, ${avail_gb}G free" "Disk OK (${used_pct}%)"
fi
}
check_load() {
header "LOAD"
local load1 load5 _load15
read -r load1 load5 _load15 _ < /proc/loadavg
local ncpu
ncpu=$(nproc)
# compare as integers (multiply by 10 to avoid bc dependency)
local load1_int
load1_int=$(echo "$load1" | awk '{printf "%d", $1 * 10}')
local warn_int crit_int
warn_int=$(echo "$LOAD_WARN" | awk '{printf "%d", $1 * 10}')
crit_int=$(echo "$LOAD_CRIT" | awk '{printf "%d", $1 * 10}')
if (( load1_int >= crit_int )); then record load CRIT "$load1" "Load avg $load1 — CRITICAL (>${LOAD_CRIT}, ${ncpu} CPUs)"
elif (( load1_int >= warn_int )); then record load WARN "$load1" "Load avg $load1 — WARNING (>${LOAD_WARN})"
else record load OK "$load1 (1m) / $load5 (5m)" "Load OK ($load1)"
fi
}
check_memory() {
header "MEMORY"
local available_kb total_kb
available_kb=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
local available_gb total_gb
available_gb=$(( available_kb / 1024 / 1024 ))
total_gb=$(( total_kb / 1024 / 1024 ))
if (( available_gb < RAM_FREE_CRIT_GB )); then record ram CRIT "${available_gb}G avail" "RAM available ${available_gb}G — CRITICAL (<${RAM_FREE_CRIT_GB}G)"
elif (( available_gb < RAM_FREE_WARN_GB )); then record ram WARN "${available_gb}G avail" "RAM available ${available_gb}G — WARNING (<${RAM_FREE_WARN_GB}G)"
else record ram OK "${available_gb}G / ${total_gb}G" "RAM OK (${available_gb}G available)"
fi
}
check_steal() {
header "CPU STEAL"
# Requires two /proc/stat samples 1s apart — single sample gives lifetime average, not current.
local s1 s2
s1=$(awk '/^cpu /{print $9" "$2+$3+$4+$5+$6+$7+$8+$9+$10}' /proc/stat)
sleep 1
s2=$(awk '/^cpu /{print $9" "$2+$3+$4+$5+$6+$7+$8+$9+$10}' /proc/stat)
local steal_pct
steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
split(s1,a," "); split(s2,b," ")
delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
if (delta_total == 0) { printf "0.0"; exit }
printf "%.1f", (delta_steal/delta_total)*100
}')
local steal_int
steal_int=$(awk -v v="$steal_pct" 'BEGIN{printf "%d", v}')
if (( steal_int >= STEAL_CRIT )); then record steal CRIT "${steal_pct}%" "CPU steal ${steal_pct}% — CRITICAL (host is overcommitted)"
elif (( steal_int >= STEAL_WARN )); then record steal WARN "${steal_pct}%" "CPU steal ${steal_pct}% — WARNING (host contention; degrades LLM inference)"
else record steal OK "${steal_pct}%" "CPU steal OK (${steal_pct}%)"
fi
}
check_swap() {
header "SWAP"
local swap_total_kb swap_free_kb swap_cached_kb
swap_total_kb=$(awk '/^SwapTotal/ {print $2}' /proc/meminfo)
swap_free_kb=$(awk '/^SwapFree/ {print $2}' /proc/meminfo)
swap_cached_kb=$(awk '/^SwapCached/ {print $2}' /proc/meminfo)
local swap_used_kb
swap_used_kb=$(( swap_total_kb - swap_free_kb ))
local swap_total_gb
swap_total_gb=$(( swap_total_kb / 1024 / 1024 ))
local swap_cached_mb
swap_cached_mb=$(( swap_cached_kb / 1024 ))
if (( swap_total_kb == 0 )); then
record swap CRIT "no swap" "NO SWAP configured — CRITICAL (add swapfile!)"
return
fi
# Compare used GB using awk to handle the fractional threshold (1.5)
local used_gb_10x warn_10x crit_10x
used_gb_10x=$(awk -v kb="$swap_used_kb" 'BEGIN{printf "%d", (kb/1024/1024)*10}')
warn_10x=$(awk -v t="$SWAP_USED_WARN_GB" 'BEGIN{printf "%d", t*10}')
crit_10x=$(awk -v t="$SWAP_USED_CRIT_GB" 'BEGIN{printf "%d", t*10}')
local swap_used_display
swap_used_display=$(awk -v kb="$swap_used_kb" 'BEGIN{printf "%.1fG", kb/1024/1024}')
if (( used_gb_10x >= crit_10x )); then
record swap CRIT "${swap_used_display} used" "Swap ${swap_used_display} used — CRITICAL"
elif (( used_gb_10x >= warn_10x )); then
record swap WARN "${swap_used_display} used" "Swap ${swap_used_display} used — WARNING (>${SWAP_USED_WARN_GB}G)"
elif (( swap_cached_mb >= SWAP_CACHED_WARN_MB )); then
# SwapCached is pages reclaimed from swap still sitting in cache — indicates
# recent memory pressure even though current usage looks ok.
record swap WARN "${swap_used_display} used, ${swap_cached_mb}MB cached" \
"Swap pressure indicator: SwapCached ${swap_cached_mb}MB — recent memory pressure (threshold ${SWAP_CACHED_WARN_MB}MB)"
else
record swap OK "${swap_used_display} / ${swap_total_gb}G" "Swap OK (${swap_used_display} used, ${swap_cached_mb}MB cached)"
fi
}
check_docker_containers() {
header "DOCKER CONTAINERS"
if ! command -v docker &>/dev/null || ! docker info &>/dev/null 2>&1; then
record docker_daemon WARN "not running" "Docker daemon is not running"
return
fi
# Crash-looping containers
local looping_warn=() looping_crit=()
while IFS=$'\t' read -r name restarts; do
[[ -z "$name" || "$name" == "NAMES" ]] && continue
restarts="${restarts:-0}"
if (( restarts >= CONTAINER_RESTART_CRIT )); then looping_crit+=("$name(${restarts}x)")
elif (( restarts >= CONTAINER_RESTART_WARN )); then looping_warn+=("$name(${restarts}x)")
fi
done < <(docker ps -a --format $'{{.Names}}\t{{.RestartCount}}' 2>/dev/null || true)
if (( ${#looping_crit[@]} > 0 )); then
record container_loops CRIT "${looping_crit[*]}" "Crash-looping containers: ${looping_crit[*]}"
elif (( ${#looping_warn[@]} > 0 )); then
record container_loops WARN "${looping_warn[*]}" "Containers restarting: ${looping_warn[*]}"
else
record container_loops OK "0 looping" "No containers crash-looping"
fi
# Unhealthy containers (running but health check failing)
local unhealthy
unhealthy=$(docker ps --filter health=unhealthy --format '{{.Names}}' 2>/dev/null | paste -sd, || true)
if [[ -n "$unhealthy" ]]; then
record container_health WARN "$unhealthy" "Unhealthy containers: $unhealthy"
else
record container_health OK "all healthy" "All containers passing healthchecks"
fi
}
check_docker_disk() {
header "DOCKER DISK"
if ! command -v docker &>/dev/null || ! docker info &>/dev/null 2>&1; then
return
fi
# Build cache
local cache_size_gb
cache_size_gb=$(docker system df --format '{{.BuildCache}}' 2>/dev/null \
| grep -oP '[0-9.]+(?=GB)' | head -1 || echo "0")
cache_size_gb="${cache_size_gb:-0}"
local cache_int
cache_int=$(echo "$cache_size_gb" | awk '{printf "%d", $1}')
if (( cache_int >= BUILD_CACHE_CRIT_GB )); then record build_cache CRIT "${cache_size_gb}GB" "Docker build cache ${cache_size_gb}GB — CRITICAL (>${BUILD_CACHE_CRIT_GB}GB, run: docker builder prune -f)"
elif (( cache_int >= BUILD_CACHE_WARN_GB )); then record build_cache WARN "${cache_size_gb}GB" "Docker build cache ${cache_size_gb}GB — WARNING (run: docker builder prune -f)"
else record build_cache OK "${cache_size_gb}GB" "Build cache OK (${cache_size_gb}GB)"
fi
# Images total size
local images_size
images_size=$(docker system df 2>/dev/null | awk '/^Images/ {print $3}' || echo "?")
record docker_images OK "$images_size" "Docker images: $images_size"
}
check_logs() {
header "LOGS"
local journal_mb
journal_mb=$(journalctl --disk-usage 2>/dev/null \
| grep -oP '[0-9.]+(?= M)' | head -1 || echo "0")
journal_mb="${journal_mb:-0}"
local syslog_mb=0
[[ -f /var/log/syslog ]] && syslog_mb=$(du -sm /var/log/syslog 2>/dev/null | cut -f1 || echo "0")
if (( journal_mb > 300 )); then record journal WARN "${journal_mb}MB" "Journal ${journal_mb}MB — WARNING (run: journalctl --vacuum-size=200M)"
else record journal OK "${journal_mb}MB" "Journal OK (${journal_mb}MB)"
fi
if (( syslog_mb > 100 )); then record syslog WARN "${syslog_mb}MB" "syslog ${syslog_mb}MB — WARNING"
else record syslog OK "${syslog_mb}MB" "syslog OK (${syslog_mb}MB)"
fi
}
check_automation_drift() {
header "AUTOMATION DRIFT"
local failed_units
failed_units=$(systemctl --failed --no-legend --plain 2>/dev/null | awk '{print $1}' | paste -sd, - || true)
if [[ -n "$failed_units" ]]; then
record failed_units WARN "$failed_units" "Failed systemd units: $failed_units"
else
record failed_units OK "0 failed" "No failed systemd units"
fi
local missing_paths=()
local cron_line path clean_path
while IFS= read -r cron_line; do
[[ -z "$cron_line" || "$cron_line" =~ ^[[:space:]]*# ]] && continue
[[ "$cron_line" =~ ^[A-Za-z_][A-Za-z0-9_]*= ]] && continue
while IFS= read -r path; do
clean_path="${path%\"}"
clean_path="${clean_path%\'}"
clean_path="${clean_path%,}"
clean_path="${clean_path%;}"
clean_path="${clean_path%)}"
case "$clean_path" in
/var/log/*|/run/*|/proc/*|/sys/*|/dev/*) continue ;;
esac
if [[ "$clean_path" == *.sh || "$clean_path" == *.py || "$clean_path" == */scripts/* ]]; then
[[ -e "$clean_path" ]] || missing_paths+=("$clean_path")
fi
done < <(grep -oE '/(opt|root|home|usr/local|etc)/[^[:space:]|;&<>]+' <<< "$cron_line" || true)
done < <(crontab -l 2>/dev/null || true)
if (( ${#missing_paths[@]} > 0 )); then
record cron_missing_paths WARN "${missing_paths[*]}" "Cron references missing path(s): ${missing_paths[*]}"
else
record cron_missing_paths OK "0 missing" "No missing script paths in root crontab"
fi
}
# ── Run all checks ───────────────────────────────────────────────────────────
if ! $JSON_MODE && ! $QUIET; then
echo -e "\n${BOLD}VM Health Check — $(hostname)$(date -u '+%Y-%m-%d %H:%M UTC')${NC}"
fi
check_disk
check_load
check_steal
check_memory
check_swap
check_docker_containers
check_docker_disk
check_logs
check_automation_drift
# ── Summary ──────────────────────────────────────────────────────────────────
if $JSON_MODE; then
echo '{'
echo " \"timestamp\": \"$(date -u '+%Y-%m-%dT%H:%M:%SZ')\","
echo " \"hostname\": \"$(hostname)\","
if (( WORST_LEVEL >= 2 )); then _overall='"CRIT"'
elif (( WORST_LEVEL == 1 )); then _overall='"WARN"'
else _overall='"OK"'
fi
echo " \"overall\": ${_overall},"
echo " \"checks\": {"
local_keys=("${!JSON_DATA[@]}")
for i in "${!local_keys[@]}"; do
k="${local_keys[$i]}"
if [[ $i -lt $(( ${#local_keys[@]} - 1 )) ]]; then comma=","; else comma=""; fi
echo " \"$k\": ${JSON_DATA[$k]}$comma"
done
echo " }"
echo '}'
else
echo ""
if (( WORST_LEVEL == 0 )); then
echo -e " ${GREEN}${BOLD}✓ All checks passed${NC}"
elif (( WORST_LEVEL == 1 )); then
echo -e " ${YELLOW}${BOLD}${#ISSUES[@]} warning(s):${NC}"
for issue in "${ISSUES[@]}"; do echo -e " ${YELLOW}${NC} $issue"; done
else
echo -e " ${RED}${BOLD}${#ISSUES[@]} issue(s) — action required:${NC}"
for issue in "${ISSUES[@]}"; do echo -e " ${RED}${NC} $issue"; done
fi
echo ""
fi
# ── Telegram notification ─────────────────────────────────────────────────
if $NOTIFY && (( WORST_LEVEL > 0 )); then
TOKEN_FILE="${HERMES_HOME:-/root/.hermes}/.env"
TELEGRAM_TOKEN=""
TELEGRAM_CHAT_ID=""
if [[ -f "$TOKEN_FILE" ]]; then
TELEGRAM_TOKEN=$(grep -oP '(?<=TELEGRAM_BOT_TOKEN=)\S+' "$TOKEN_FILE" || true)
TELEGRAM_CHAT_ID=$(grep -oP '(?<=TELEGRAM_CHAT_ID=)\S+' "$TOKEN_FILE" || true)
fi
if [[ -n "$TELEGRAM_TOKEN" && -n "$TELEGRAM_CHAT_ID" ]]; then
SEVERITY=$([[ $WORST_LEVEL -ge 2 ]] && echo "🚨 CRITICAL" || echo "⚠️ WARNING")
MSG="$SEVERITY$(hostname) VM health check
$(date -u '+%Y-%m-%d %H:%M UTC')
$(printf '%s\n' "${ISSUES[@]}")"
curl -sf -X POST "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
-d chat_id="$TELEGRAM_CHAT_ID" \
-d text="$MSG" > /dev/null || true
fi
fi
# ── Log result ───────────────────────────────────────────────────────────────
if (( WORST_LEVEL >= 2 )); then RESULT_STR="CRIT"
elif (( WORST_LEVEL == 1 )); then RESULT_STR="WARN"
else RESULT_STR="OK"
fi
log_to_file "health-check result=$RESULT_STR issues=${#ISSUES[@]}"
exit "$WORST_LEVEL"

22
scripts/gitea-backup.sh Executable file
View File

@ -0,0 +1,22 @@
#!/usr/bin/env bash
set -euo pipefail
container="${GITEA_CONTAINER:-gitea-npm-registry}"
backup_dir="${GITEA_BACKUP_DIR:-/opt/bytelyst/backups/gitea}"
retention_days="${GITEA_BACKUP_RETENTION_DAYS:-14}"
timestamp="$(date -u +%Y%m%dT%H%M%SZ)"
name="gitea-dump-${timestamp}.zip"
container_tmp="/tmp/${name}"
host_path="${backup_dir}/${name}"
mkdir -p "$backup_dir"
docker exec --user git "$container" rm -f "$container_tmp" >/dev/null 2>&1 || true
docker exec --user git "$container" gitea dump --quiet --skip-log --file "$container_tmp"
docker cp "${container}:${container_tmp}" "$host_path"
docker exec --user git "$container" rm -f "$container_tmp" >/dev/null 2>&1 || true
chmod 600 "$host_path"
find "$backup_dir" -type f -name 'gitea-dump-*.zip' -mtime +"$retention_days" -delete
printf 'created %s\n' "$host_path"

9
scripts/gitea-git Executable file
View File

@ -0,0 +1,9 @@
#!/bin/sh
set -eu
export GIT_TERMINAL_PROMPT=0
export GIT_ASKPASS="${GIT_ASKPASS:-/root/.local/bin/gitea-git-askpass}"
export GITEA_USERNAME="${GITEA_USERNAME:-learning_ai_user}"
export GITEA_TOKEN_FILE="${GITEA_TOKEN_FILE:-/root/.gitea_npm_token_home}"
exec git "$@"

17
scripts/gitea-git-askpass Executable file
View File

@ -0,0 +1,17 @@
#!/bin/sh
set -eu
username="${GITEA_USERNAME:-learning_ai_user}"
token_file="${GITEA_TOKEN_FILE:-/root/.gitea_npm_token_home}"
case "${1:-}" in
*Username*)
printf '%s\n' "$username"
;;
*Password*)
tr -d '\n' < "$token_file"
;;
*)
exit 1
;;
esac

View File

@ -0,0 +1,145 @@
#!/usr/bin/env python3
"""Upload a single local file to the configured Google Drive folder."""
from __future__ import annotations
import argparse
from pathlib import Path
import shutil
import subprocess
import sys
import tempfile
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload
TOKEN_FILE = Path("/root/.config/hermes-google-drive/user-token.json")
PASSPHRASE_FILE = Path("/root/.config/hermes-google-drive/bundle-passphrase")
SCOPES = ["https://www.googleapis.com/auth/drive.file"]
FOLDERS = {
"vijay": "1KIlSJzpf5fuaH5LYvfbLsUbOSYY23YGm",
"bheem": "1Ac5cbDC0dSWas8LeeWe_9XFqCquz7kZT",
}
BLOCKED_NAMES = {
".env",
"auth.json",
".git-credentials",
"service-account.json",
"oauth-client.json",
"user-token.json",
"bundle-passphrase",
}
BLOCKED_SUFFIXES = {
".pem",
".key",
".p12",
".pfx",
".db",
".db-wal",
".db-shm",
}
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("path", help="Local file path to upload")
parser.add_argument("--target", choices=sorted(FOLDERS), required=True)
parser.add_argument("--name", help="Optional Drive filename")
parser.add_argument("--encrypt", action="store_true", help="GPG-encrypt the file before upload")
parser.add_argument("--allow-sensitive", action="store_true", help="Allow blocked sensitive-looking filenames")
return parser.parse_args()
def is_blocked(path: Path) -> bool:
name = path.name
if name in BLOCKED_NAMES:
return True
if any(name.endswith(suffix) for suffix in BLOCKED_SUFFIXES):
return True
if ".ssh" in path.parts or ".gnupg" in path.parts:
return True
return False
def drive_service():
if not TOKEN_FILE.exists():
raise RuntimeError(f"missing Google Drive OAuth token: {TOKEN_FILE}")
creds = Credentials.from_authorized_user_file(str(TOKEN_FILE), SCOPES)
if creds.expired and creds.refresh_token:
creds.refresh(Request())
TOKEN_FILE.write_text(creds.to_json())
TOKEN_FILE.chmod(0o600)
return build("drive", "v3", credentials=creds, cache_discovery=False)
def encrypt_file(src: Path, tmpdir: Path) -> Path:
if not PASSPHRASE_FILE.exists():
raise RuntimeError(f"missing passphrase file: {PASSPHRASE_FILE}")
out = tmpdir / f"{src.name}.gpg"
cmd = [
"gpg",
"--batch",
"--yes",
"--pinentry-mode",
"loopback",
"--passphrase-file",
str(PASSPHRASE_FILE),
"--symmetric",
"--cipher-algo",
"AES256",
"--output",
str(out),
str(src),
]
proc = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
if proc.returncode != 0:
raise RuntimeError(f"gpg encryption failed: {proc.stdout}")
return out
def upload(path: Path, target: str, drive_name: str | None) -> dict:
service = drive_service()
metadata = {"name": drive_name or path.name, "parents": [FOLDERS[target]]}
media = MediaFileUpload(str(path), mimetype="application/octet-stream", resumable=True)
return (
service.files()
.create(body=metadata, media_body=media, fields="id,name,size,webViewLink", supportsAllDrives=True)
.execute()
)
def main() -> int:
args = parse_args()
src = Path(args.path).expanduser().resolve()
if not src.exists():
raise SystemExit(f"file not found: {src}")
if not src.is_file():
raise SystemExit(f"refusing non-file path: {src}")
if is_blocked(src) and not args.allow_sensitive:
raise SystemExit(f"refusing sensitive-looking file without --allow-sensitive: {src.name}")
upload_path = src
cleanup_dir = None
try:
if args.encrypt:
cleanup_dir = Path(tempfile.mkdtemp(prefix="drive-upload-"))
upload_path = encrypt_file(src, cleanup_dir)
result = upload(upload_path, args.target, args.name)
size = result.get("size") or str(upload_path.stat().st_size)
print(f"uploaded target={args.target} name={result.get('name')} size={size} file_id={result.get('id')}")
return 0
finally:
if cleanup_dir and cleanup_dir.exists():
shutil.rmtree(cleanup_dir)
if __name__ == "__main__":
try:
raise SystemExit(main())
except Exception as exc:
print(f"upload failed: {exc}", file=sys.stderr)
raise SystemExit(1)

View File

@ -0,0 +1,5 @@
#!/usr/bin/env bash
set -euo pipefail
exec /root/.local/share/hermes-drive-uploader-venv/bin/python \
/root/repos/learning_ai_devops_tools/scripts/google-drive-upload-file.py "$@"

View File

@ -0,0 +1,109 @@
#!/usr/bin/env bash
set -euo pipefail
usage() {
cat <<'USAGE'
Usage:
hermes-emergency-bundle-create.sh [output-dir]
Creates a GPG-encrypted emergency bundle containing sensitive recovery files
that are intentionally excluded from the normal GitHub Hermes backups.
Default output-dir:
/root/hermes-emergency-bundles
Passphrase:
Interactive GPG prompt by default.
Or set BUNDLE_PASSPHRASE_FILE=/root/path/to/passphrase-file for unattended use.
Safety:
- Does not print secret values.
- Uses an allow-list of sensitive recovery files.
- Does not include logs, caches, locks, PIDs, or sandboxes.
USAGE
}
if [ "${1:-}" = "-h" ] || [ "${1:-}" = "--help" ]; then
usage
exit 0
fi
OUT_DIR="${1:-/root/hermes-emergency-bundles}"
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
HOST="$(hostname -s 2>/dev/null || hostname)"
WORK_DIR="$(mktemp -d)"
PAYLOAD_DIR="$WORK_DIR/payload"
ARCHIVE="$WORK_DIR/hermes-emergency-bundle-${HOST}-${STAMP}.tar.zst"
OUT_FILE="$OUT_DIR/hermes-emergency-bundle-${HOST}-${STAMP}.tar.zst.gpg"
cleanup() {
rm -rf "$WORK_DIR"
}
trap cleanup EXIT
install -d -m 700 "$OUT_DIR"
install -d -m 700 "$PAYLOAD_DIR"
copy_if_exists() {
src="$1"
dest="$PAYLOAD_DIR/${src#/}"
if [ -e "$src" ]; then
install -d -m 700 "$(dirname "$dest")"
cp -a "$src" "$dest"
printf '%s\n' "${src#/}" >> "$PAYLOAD_DIR/MANIFEST.paths"
fi
}
# Root Hermes sensitive state.
copy_if_exists /root/.hermes/.env
copy_if_exists /root/.hermes/auth.json
copy_if_exists /root/.hermes/state.db
copy_if_exists /root/.hermes/state.db-shm
copy_if_exists /root/.hermes/state.db-wal
# Uma Hermes sensitive state.
copy_if_exists /home/uma/.hermes/.env
copy_if_exists /home/uma/.hermes/auth.json
copy_if_exists /home/uma/.hermes/state.db
copy_if_exists /home/uma/.hermes/state.db-shm
copy_if_exists /home/uma/.hermes/state.db-wal
# Git and local registry credentials used for recovery operations.
copy_if_exists /root/.git-credentials
copy_if_exists /root/.gitea_admin_password
copy_if_exists /root/.gitea_npm_token
copy_if_exists /root/.gitea_npm_token_home
# Tailscale machine state is sensitive. Restoring it is optional; a fresh
# `tailscale up` login is often cleaner, but this preserves a break-glass copy.
copy_if_exists /var/lib/tailscale/tailscaled.state
if [ ! -s "$PAYLOAD_DIR/MANIFEST.paths" ]; then
echo "No emergency files found to bundle." >&2
exit 1
fi
cat > "$PAYLOAD_DIR/README.txt" <<README
Hermes emergency bundle for ${HOST}
Created UTC: ${STAMP}
This encrypted bundle contains sensitive files excluded from normal GitHub
backups, such as .env files, provider auth state, Git credentials, local Gitea
tokens, optional Tailscale state, and Hermes state.db files.
Decrypt only into a staging directory first. Inspect paths before copying
anything into a live VM.
README
tar -C "$PAYLOAD_DIR" -I zstd -cf "$ARCHIVE" .
gpg_args=(--symmetric --cipher-algo AES256 --output "$OUT_FILE")
if [ -n "${BUNDLE_PASSPHRASE_FILE:-}" ]; then
gpg_args=(--batch --yes --pinentry-mode loopback --passphrase-file "$BUNDLE_PASSPHRASE_FILE" "${gpg_args[@]}")
fi
gpg "${gpg_args[@]}" "$ARCHIVE"
chmod 600 "$OUT_FILE"
echo "Encrypted emergency bundle created: $OUT_FILE"
echo "Included path list is encrypted inside the bundle; no secret values printed."

View File

@ -0,0 +1,66 @@
#!/usr/bin/env bash
set -euo pipefail
usage() {
cat <<'USAGE'
Usage:
hermes-emergency-bundle-decrypt.sh <bundle.tar.zst.gpg> [staging-dir]
Decrypts a Hermes emergency bundle into a staging directory.
Default staging-dir:
/root/hermes-emergency-restore-staging/<bundle-name-without-.gpg>
Passphrase:
Interactive GPG prompt by default.
Or set BUNDLE_PASSPHRASE_FILE=/root/path/to/passphrase-file for unattended use.
Safety:
- Does not write into /root/.hermes or /home/uma/.hermes.
- Does not overwrite live credentials.
- Review extracted files, then copy only the needed files manually.
USAGE
}
if [ "${1:-}" = "-h" ] || [ "${1:-}" = "--help" ] || [ "$#" -lt 1 ]; then
usage
exit 0
fi
BUNDLE="$1"
if [ ! -f "$BUNDLE" ]; then
echo "Bundle not found: $BUNDLE" >&2
exit 1
fi
base="$(basename "$BUNDLE" .gpg)"
STAGING_DIR="${2:-/root/hermes-emergency-restore-staging/$base}"
WORK_DIR="$(mktemp -d)"
ARCHIVE="$WORK_DIR/$base"
cleanup() {
rm -rf "$WORK_DIR"
}
trap cleanup EXIT
install -d -m 700 "$STAGING_DIR"
gpg_args=(--decrypt --output "$ARCHIVE")
if [ -n "${BUNDLE_PASSPHRASE_FILE:-}" ]; then
gpg_args=(--batch --yes --pinentry-mode loopback --passphrase-file "$BUNDLE_PASSPHRASE_FILE" "${gpg_args[@]}")
fi
gpg "${gpg_args[@]}" "$BUNDLE"
tar -C "$STAGING_DIR" -I zstd -xf "$ARCHIVE"
chmod -R go-rwx "$STAGING_DIR"
echo "Bundle decrypted into staging directory: $STAGING_DIR"
echo
echo "Included paths:"
if [ -f "$STAGING_DIR/MANIFEST.paths" ]; then
sed -n '1,200p' "$STAGING_DIR/MANIFEST.paths"
else
find "$STAGING_DIR" -type f | sed "s#^$STAGING_DIR/##" | sort | sed -n '1,200p'
fi
echo
echo "Next step: inspect staging, then manually copy only the needed files into place."

View File

@ -0,0 +1,177 @@
#!/usr/bin/env python3
"""Create encrypted Hermes emergency bundles and upload them to Google Drive.
This script uploads only .gpg encrypted bundles. It never uploads plaintext
staging directories or decrypted files.
"""
from __future__ import annotations
import argparse
import os
from pathlib import Path
import subprocess
import sys
import time
from google.auth.transport.requests import Request
from google.oauth2 import credentials as user_credentials
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload
ROOT = Path("/root/repos/learning_ai_devops_tools")
CREATE_SCRIPT = ROOT / "scripts/hermes-emergency-bundle-create.sh"
KEY_FILE = Path("/root/.config/hermes-google-drive/service-account.json")
USER_TOKEN_FILE = Path("/root/.config/hermes-google-drive/user-token.json")
DEFAULT_OUT = Path("/root/hermes-emergency-bundles")
FOLDERS = {
"vijay": "1KIlSJzpf5fuaH5LYvfbLsUbOSYY23YGm",
"bheem": "1Ac5cbDC0dSWas8LeeWe_9XFqCquz7kZT",
}
def run(cmd: list[str], env: dict[str, str] | None = None) -> str:
proc = subprocess.run(
cmd,
text=True,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
env=env,
check=False,
)
if proc.returncode != 0:
raise RuntimeError(f"command failed ({proc.returncode}): {' '.join(cmd)}\n{proc.stdout}")
return proc.stdout
def drive_service(auth_mode: str):
scopes = ["https://www.googleapis.com/auth/drive.file"]
if auth_mode == "user":
if not USER_TOKEN_FILE.exists():
raise RuntimeError(
f"missing OAuth user token: {USER_TOKEN_FILE}. "
"Run hermes-google-drive-oauth-login.py first."
)
creds = user_credentials.Credentials.from_authorized_user_file(str(USER_TOKEN_FILE), scopes=scopes)
if creds.expired and creds.refresh_token:
creds.refresh(Request())
USER_TOKEN_FILE.write_text(creds.to_json())
USER_TOKEN_FILE.chmod(0o600)
elif auth_mode == "service-account":
creds = service_account.Credentials.from_service_account_file(str(KEY_FILE), scopes=scopes)
else:
raise ValueError(f"unknown auth mode: {auth_mode}")
return build("drive", "v3", credentials=creds, cache_discovery=False)
def create_bundle(out_dir: Path) -> Path:
before = set(out_dir.glob("*.tar.zst.gpg")) if out_dir.exists() else set()
output = run([str(CREATE_SCRIPT), str(out_dir)])
after = set(out_dir.glob("*.tar.zst.gpg"))
created = sorted(after - before, key=lambda p: p.stat().st_mtime, reverse=True)
if created:
return created[0]
for line in output.splitlines():
marker = "Encrypted emergency bundle created:"
if marker in line:
return Path(line.split(marker, 1)[1].strip())
raise RuntimeError("bundle script did not report a created bundle")
def upload_file(service, bundle: Path, label: str, folder_id: str) -> str:
metadata = {
"name": bundle.name,
"parents": [folder_id],
"description": f"Hermes encrypted emergency bundle for {label}; uploaded {time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}",
}
media = MediaFileUpload(str(bundle), mimetype="application/octet-stream", resumable=True)
result = (
service.files()
.create(
body=metadata,
media_body=media,
fields="id,name,webViewLink",
supportsAllDrives=True,
)
.execute()
)
return result["id"]
def cleanup_remote(service, folder_id: str, keep: int, dry_run: bool) -> list[str]:
query = (
f"'{folder_id}' in parents and trashed = false "
"and name contains 'hermes-emergency-bundle-' "
"and name contains '.tar.zst.gpg'"
)
files = (
service.files()
.list(
q=query,
fields="files(id,name,createdTime)",
orderBy="createdTime desc",
pageSize=1000,
supportsAllDrives=True,
includeItemsFromAllDrives=True,
)
.execute()
.get("files", [])
)
deleted = []
for item in files[keep:]:
deleted.append(item["name"])
if not dry_run:
service.files().delete(fileId=item["id"], supportsAllDrives=True).execute()
return deleted
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--target", choices=["vijay", "bheem", "both"], default="both")
parser.add_argument(
"--auth-mode",
choices=["user", "service-account"],
default=os.environ.get("HERMES_DRIVE_AUTH_MODE", "user"),
help="Use user OAuth for personal Drive; service-account only works with Shared Drives or delegated Workspace setups.",
)
parser.add_argument("--out-dir", default=str(DEFAULT_OUT))
parser.add_argument("--keep", type=int, default=12, help="encrypted bundles to retain per Drive folder")
parser.add_argument("--dry-run", action="store_true", help="create bundle but do not upload/delete remote files")
return parser.parse_args()
def main() -> int:
args = parse_args()
if args.auth_mode == "service-account" and not KEY_FILE.exists():
raise SystemExit(f"missing service account key: {KEY_FILE}")
if not CREATE_SCRIPT.exists():
raise SystemExit(f"missing create script: {CREATE_SCRIPT}")
out_dir = Path(args.out_dir)
bundle = create_bundle(out_dir)
service = drive_service(args.auth_mode)
targets = ["vijay", "bheem"] if args.target == "both" else [args.target]
for target in targets:
folder_id = FOLDERS[target]
if args.dry_run:
print(f"DRY RUN: would upload {bundle.name} to {target} folder {folder_id}")
deleted = cleanup_remote(service, folder_id, args.keep, dry_run=True)
else:
file_id = upload_file(service, bundle, target, folder_id)
print(f"Uploaded encrypted bundle to {target}: file_id={file_id}")
deleted = cleanup_remote(service, folder_id, args.keep, dry_run=False)
if deleted:
print(f"Retention cleanup for {target}: {len(deleted)} old encrypted bundle(s)")
return 0
if __name__ == "__main__":
try:
raise SystemExit(main())
except Exception as exc:
print(f"Drive upload FAILED: {exc}", file=sys.stderr)
raise SystemExit(1)

View File

@ -0,0 +1,5 @@
#!/usr/bin/env bash
set -euo pipefail
exec /root/.local/share/hermes-drive-uploader-venv/bin/python \
/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-upload-drive.py "$@"

View File

@ -0,0 +1,38 @@
#!/usr/bin/env python3
"""Create a Google Drive OAuth user token for personal Drive uploads."""
from __future__ import annotations
from pathlib import Path
import sys
from google_auth_oauthlib.flow import InstalledAppFlow
CLIENT_SECRET = Path("/root/.config/hermes-google-drive/oauth-client.json")
TOKEN_FILE = Path("/root/.config/hermes-google-drive/user-token.json")
SCOPES = ["https://www.googleapis.com/auth/drive.file"]
def main() -> int:
if not CLIENT_SECRET.exists():
print(f"Missing OAuth client secret: {CLIENT_SECRET}", file=sys.stderr)
print("Create a Google OAuth client of type Desktop app and save its JSON there.", file=sys.stderr)
return 1
flow = InstalledAppFlow.from_client_secrets_file(str(CLIENT_SECRET), SCOPES)
flow.redirect_uri = "urn:ietf:wg:oauth:2.0:oob"
auth_url, _ = flow.authorization_url(prompt="consent", access_type="offline")
print("Open this URL in your browser, approve access, then paste the code here:")
print(auth_url)
code = input("Code: ").strip()
flow.fetch_token(code=code)
creds = flow.credentials
TOKEN_FILE.parent.mkdir(parents=True, exist_ok=True)
TOKEN_FILE.write_text(creds.to_json())
TOKEN_FILE.chmod(0o600)
print(f"OAuth user token saved: {TOKEN_FILE}")
return 0
if __name__ == "__main__":
raise SystemExit(main())

143
scripts/hermes-health-watchdog.py Executable file
View File

@ -0,0 +1,143 @@
#!/usr/bin/env python3
"""Silent-on-success Hermes health watchdog for ByteLyst.
Designed for a Hermes no-agent cron job. It prints nothing when healthy and
prints a concise Telegram-ready alert when an actionable problem is detected.
"""
from __future__ import annotations
import os
import shutil
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path
DISK_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_DISK_WARN_PERCENT", "85"))
MEMORY_WARN_PERCENT = int(os.getenv("HERMES_WATCHDOG_MEMORY_WARN_PERCENT", "90"))
BACKUP_STALE_MINUTES = int(os.getenv("HERMES_WATCHDOG_BACKUP_STALE_MINUTES", "90"))
BACKUP_JOB_NAME = os.getenv("HERMES_WATCHDOG_BACKUP_JOB_NAME", "Sync Hermes persistent-data backup to GitHub")
GATEWAY_SERVICE = os.getenv("HERMES_WATCHDOG_GATEWAY_SERVICE", "hermes-gateway.service")
DOCKER_CONTAINERS = [
item.strip()
for item in os.getenv("HERMES_WATCHDOG_DOCKER_CONTAINERS", "caddy,gitea-npm-registry").split(",")
if item.strip()
]
HERMES_HOME = Path(os.getenv("HERMES_HOME", str(Path.home() / ".hermes")))
def run(cmd: list[str], timeout: int = 20) -> subprocess.CompletedProcess[str]:
return subprocess.run(cmd, text=True, capture_output=True, timeout=timeout, check=False)
def check_gateway(alerts: list[str]) -> None:
result = run(["systemctl", "is-active", GATEWAY_SERVICE])
if result.stdout.strip() != "active":
alerts.append(f"gateway service `{GATEWAY_SERVICE}` is not active: `{result.stdout.strip() or result.stderr.strip() or 'unknown'}`")
def check_backup_cron(alerts: list[str]) -> None:
result = run(["hermes", "cron", "list"], timeout=30)
out = result.stdout + result.stderr
if result.returncode != 0:
alerts.append(f"`hermes cron list` failed with exit {result.returncode}")
return
if BACKUP_JOB_NAME not in out:
alerts.append(f"backup cron job `{BACKUP_JOB_NAME}` was not found")
return
if "Last run:" not in out or " ok" not in out:
alerts.append("backup cron last-run status is not visibly `ok` in `hermes cron list`")
script_path = HERMES_HOME / "scripts" / "sync_hermes_persistent_backup.py"
if script_path.exists():
age_minutes = (datetime.now(timezone.utc).timestamp() - script_path.stat().st_mtime) / 60
# Script mtime is not backup freshness; keep this as a weak sanity note only.
if age_minutes < 0:
alerts.append("backup sync script has a future modification time")
def check_backup_repo_freshness(alerts: list[str]) -> None:
repo = Path(os.getenv("HERMES_WATCHDOG_BACKUP_REPO", str(HERMES_HOME / "persistent_backup_repo")))
candidates = [repo, Path.home() / "hermes_persistent_backup", Path.home() / "hermes_persistent_backup_repo"]
existing = next((p for p in candidates if (p / ".git").exists()), None)
if not existing:
# The backup cron may use its own path; cron status is the primary check.
return
result = run(["git", "-C", str(existing), "log", "-1", "--format=%ct"], timeout=20)
if result.returncode != 0 or not result.stdout.strip().isdigit():
alerts.append(f"could not inspect backup repo freshness at `{existing}`")
return
age_minutes = (datetime.now(timezone.utc).timestamp() - int(result.stdout.strip())) / 60
if age_minutes > BACKUP_STALE_MINUTES:
alerts.append(f"backup repo `{existing}` latest commit is stale: {age_minutes:.0f} minutes old")
def check_disk(alerts: list[str]) -> None:
usage = shutil.disk_usage("/")
pct = int(round((usage.used / usage.total) * 100))
if pct >= DISK_WARN_PERCENT:
alerts.append(f"root disk usage is high: {pct}% used (threshold {DISK_WARN_PERCENT}%)")
def check_memory(alerts: list[str]) -> None:
meminfo: dict[str, int] = {}
for line in Path("/proc/meminfo").read_text(encoding="utf-8").splitlines():
parts = line.split()
if len(parts) >= 2:
meminfo[parts[0].rstrip(":")] = int(parts[1])
total = meminfo.get("MemTotal", 0)
available = meminfo.get("MemAvailable", 0)
if total <= 0 or available <= 0:
alerts.append("could not read memory pressure from /proc/meminfo")
return
used_pct = int(round(((total - available) / total) * 100))
if used_pct >= MEMORY_WARN_PERCENT:
alerts.append(f"memory pressure is high: {used_pct}% used (threshold {MEMORY_WARN_PERCENT}%)")
def check_docker_containers(alerts: list[str]) -> None:
if not DOCKER_CONTAINERS:
return
docker = shutil.which("docker")
if not docker:
alerts.append("docker executable not found; cannot verify critical containers")
return
result = run([docker, "ps", "--format", "{{.Names}}"], timeout=20)
if result.returncode != 0:
alerts.append(f"`docker ps` failed while checking critical containers: {result.stderr.strip() or result.stdout.strip()}")
return
running = set(result.stdout.splitlines())
missing = [name for name in DOCKER_CONTAINERS if name not in running]
if missing:
alerts.append(f"critical Docker container(s) not running: {', '.join(missing)}")
def main() -> int:
alerts: list[str] = []
for check in (
check_gateway,
check_backup_cron,
check_backup_repo_freshness,
check_disk,
check_memory,
check_docker_containers,
):
try:
check(alerts)
except Exception as exc: # noqa: BLE001 - watchdog should alert, not crash silently
alerts.append(f"{check.__name__} errored: {exc}")
if alerts:
print("🚨 ByteLyst Hermes watchdog alert")
for item in alerts:
print(f"- {item}")
print(
"\nSuggested first checks: `systemctl status hermes-gateway --no-pager`, "
"`hermes cron list`, `df -h /`, `free -h`, `docker ps`."
)
return 0
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@ -0,0 +1,8 @@
[Unit]
Description=ByteLyst Gitea backup
Requires=docker.service
After=docker.service
[Service]
Type=oneshot
ExecStart=/opt/bytelyst/scripts/backup-gitea.sh

View File

@ -0,0 +1,10 @@
[Unit]
Description=Run ByteLyst Gitea backup daily
[Timer]
OnCalendar=*-*-* 03:15:00 UTC
Persistent=true
Unit=bytelyst-gitea-backup.service
[Install]
WantedBy=timers.target

View File

@ -0,0 +1,12 @@
[Unit]
Description=Restart Docker containers stuck in unhealthy state
Documentation=file:///usr/local/bin/docker-health-watchdog.sh
After=docker.service
Requires=docker.service
[Service]
Type=oneshot
User=root
Group=root
Environment="HERMES_HOME=/root/.hermes"
ExecStart=/usr/local/bin/docker-health-watchdog.sh

View File

@ -0,0 +1,11 @@
[Unit]
Description=Run Docker health watchdog every 10 minutes
After=docker.service
[Timer]
OnBootSec=5min
OnUnitActiveSec=10min
AccuracySec=30s
[Install]
WantedBy=timers.target

View File

@ -0,0 +1,12 @@
[Unit]
Description=Create encrypted Hermes emergency bundle and upload to Google Drive
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=root
Group=root
Environment="BUNDLE_PASSPHRASE_FILE=/root/.config/hermes-google-drive/bundle-passphrase"
Environment="HERMES_DRIVE_AUTH_MODE=user"
ExecStart=/root/repos/learning_ai_devops_tools/scripts/hermes-emergency-bundle-upload-drive.sh --target both --keep 12

View File

@ -0,0 +1,11 @@
[Unit]
Description=Upload encrypted Hermes emergency bundle to Google Drive daily
[Timer]
OnCalendar=*-*-* 03:17:00 UTC
Persistent=true
RandomizedDelaySec=15min
Unit=hermes-emergency-drive-upload.service
[Install]
WantedBy=timers.target

View File

@ -0,0 +1,34 @@
[Unit]
Description=Hermes Agent Gateway - Messaging Platform Integration
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=0
[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/lib/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
WorkingDirectory=/usr/local/lib/hermes-agent
Environment="HOME=/root"
Environment="USER=root"
Environment="LOGNAME=root"
Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/bin:/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="VIRTUAL_ENV=/usr/local/lib/hermes-agent/venv"
Environment="HERMES_HOME=/root/.hermes"
Environment="HERMES_MODEL=gpt-5.5"
Environment="HERMES_INFERENCE_MODEL=gpt-5.5"
Restart=always
RestartSec=5
RestartMaxDelaySec=300
RestartSteps=5
RestartForceExitStatus=75
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec=210
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

View File

@ -0,0 +1,13 @@
[Unit]
Description=Sync root Hermes persistent backup to GitHub
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=root
Group=root
Environment="HERMES_HOME=/root/.hermes"
Environment="HERMES_BACKUP_REPO=/root/repos/bytelyst_hostinger_hermes_vm"
Environment="HERMES_BACKUP_REMOTE=https://github.com/saravanakumardb/bytelyst_hostinger_hermes_vm.git"
ExecStart=/usr/bin/python3 /root/.hermes/scripts/sync_hermes_persistent_backup.py

View File

@ -0,0 +1,12 @@
[Unit]
Description=Run root Hermes persistent backup sync every 10 minutes
[Timer]
OnBootSec=5min
OnUnitActiveSec=10min
AccuracySec=1min
Persistent=true
Unit=hermes-root-backup.service
[Install]
WantedBy=timers.target

View File

@ -0,0 +1,21 @@
[Unit]
Description=Root Hermes Dashboard (Tailscale private)
After=network-online.target tailscaled.service
Wants=network-online.target
[Service]
Type=simple
User=root
Group=root
WorkingDirectory=/usr/local/lib/hermes-agent
Environment="HOME=/root"
Environment="USER=root"
Environment="LOGNAME=root"
Environment="HERMES_HOME=/root/.hermes"
Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9119 --no-open --insecure --skip-build
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

View File

@ -0,0 +1,13 @@
[Unit]
Description=Sync Uma Hermes persistent backup to GitHub
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=root
Group=root
Environment="HERMES_HOME=/home/uma/.hermes"
Environment="HERMES_BACKUP_REPO=/home/uma/repos/uma_hostinger_hermes_vm"
Environment="HERMES_BACKUP_REMOTE=https://github.com/umadev0931/uma_hostinger_hermes_vm.git"
ExecStart=/usr/bin/python3 /home/uma/.hermes/scripts/sync_uma_hermes_persistent_backup.py

View File

@ -0,0 +1,12 @@
[Unit]
Description=Run Uma Hermes persistent backup sync every 10 minutes
[Timer]
OnBootSec=5min
OnUnitActiveSec=10min
AccuracySec=1min
Persistent=true
Unit=uma-hermes-backup.service
[Install]
WantedBy=timers.target

View File

@ -0,0 +1,21 @@
[Unit]
Description=Uma Hermes Dashboard (Tailscale private)
After=network-online.target tailscaled.service
Wants=network-online.target
[Service]
Type=simple
User=uma
Group=uma
WorkingDirectory=/usr/local/lib/hermes-agent
Environment="HOME=/home/uma"
Environment="USER=uma"
Environment="LOGNAME=uma"
Environment="HERMES_HOME=/home/uma/.hermes"
Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/usr/local/lib/hermes-agent/venv/bin/hermes dashboard --host 100.87.53.10 --port 9120 --no-open --insecure --skip-build
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

View File

@ -0,0 +1,29 @@
[Unit]
Description=Uma Hermes Gateway - Telegram Integration
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
WorkingDirectory=/usr/local/lib/hermes-agent
Environment="HOME=/home/uma"
Environment="USER=uma"
Environment="LOGNAME=uma"
Environment="HERMES_HOME=/home/uma/.hermes"
Environment="HERMES_MODEL=gpt-5.5"
Environment="HERMES_INFERENCE_MODEL=gpt-5.5"
Environment="PATH=/usr/local/lib/hermes-agent/venv/bin:/usr/local/lib/hermes-agent/node_modules/.bin:/home/uma/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="VIRTUAL_ENV=/usr/local/lib/hermes-agent/venv"
ExecStart=/usr/local/lib/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
Restart=always
RestartSec=5
RestartMaxDelaySec=300
RestartSteps=5
RestartForceExitStatus=75
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec=210
[Install]
WantedBy=default.target