bytelyst-devops-tools

Author	SHA1	Message	Date
Hermes VM	02b362399b	feat: complete hermes telemetry dashboard wiring	2026-05-31 08:28:26 +00:00
saravanakumardb1	abc8a0f517	fix(tracker-seed): cap dedupe list at limit=100 + auto-register products Two bugs caused duplicate items on re-run: the dedupe list used limit=500 (server caps at 100 -> 400 -> silent empty set -> dupes), and meta productIds weren't registered so GET /items 400'd ("Unknown product"). Now registers every referenced product first (idempotent) and lists with limit=100; dedupe failures are logged loudly. Verified idempotent: re-run skips all 16. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 23:45:16 -07:00
saravanakumardb1	ae7909018a	feat(scripts): one-shot gigafactory deploy + product registration deploy-gigafactory.sh loads platform-service/.env, starts the fleet backend, waits for /health, and registers the ecosystem products (idempotent) so live /api/fleet/* calls resolve. Supports --stop / --register-only / --no-register. Registered the 11 ecosystem products against the configured Cosmos during a live run; note fleet metrics needs a composite index on real Azure Cosmos. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 22:51:27 -07:00
saravanakumardb1	eb4e755c5f	feat(tracker-seed): seed script + payloads for engineering-review work items Files the ENGINEERING_REVIEW_SCORECARD.md P0-P3 action plan as tracker items (one per affected product) via the platform-service POST /api/items API. Dependency-free Node seeder mints an HS256 token from $JWT_SECRET, dedupes by title, and supports --dry-run. No live writes performed (stack is down); run the script once the platform stack is up. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-30 21:14:12 -07:00
Hermes VM	13a105ba23	feat(vm): Phase 5 closure — GPU/freshness checks, chaos validation, I/O alert vm-health-check.sh: - check_gpu(): nvidia-smi probe; "CPU-only" OK on this VM (no GPU) - check_image_freshness(): flag containers running images >30d old. Skips third-party images (gitea, grafana, prom, mcr.microsoft, axllent, caddy, traefik, valkey, cadvisor) — they have their own rebuild cadence. Currently flags 19 stale product images (~60d old). chaos-validation.sh: - Monthly chaos test: kill PID 1 in chronomind-web, wait up to 35 min for docker-health-watchdog to detect + restart. Telegram pass/fail. - Refuses to run if target not healthy. systemd timer fires 1st of month at 10:00 UTC (after 08:00 weekly digest). vm-io-anomaly-check.sh: - 6h avg sda write rate; transition alerts at WARN (1 GB/hr) / CRIT (2.5 GB/hr). De-dupes via /var/log/vm-io-anomaly-state so the alert fires once per transition, not every 6h. Current baseline: ~1.94 GB/hr (orphan-container state-file writes; see Phase 0.3). - Reports recovery to OK when rate drops back. vm/page.tsx: gpu + image_freshness added to CHECK_META so they render with proper icon/label and slot into CHECK_ORDER. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
Hermes VM	76ef17f26b	feat(vm): Phase 2.3 closure — OOM watchdog + orphan-container docs OOM watchdog: - vm-oom-watchdog.sh — scans journalctl -k since cursor for oom-kill, killed-process, and "out of memory ... killed" entries; maps cgroup hits back to container names via docker inspect; posts a single Telegram alert per scan window (no dedupe needed — cursor advances on every run). Cursor at /var/log/vm-oom-cursor, log at /var/log/vm-oom-watchdog.log. - Systemd: OnBootSec=10min, OnUnitActiveSec=1h, Persistent=true. Orphan containers (no compose file on disk): - trading-backend → docker update --memory=768m (high-I/O bot) - gitea-npm-registry → docker update --memory=512m - orphan-containers.md captures canonical configs for recovery (env, mounts, networks, restart policy, memory limits). Closes Phase 2.3 (post-monitoring) and Phase 3.3 (orphan limits). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
Hermes VM	8d32cb7980	feat(dashboard/vm): Phases 4.1-4.3 — Prometheus trends, sparklines, weekly digest - prometheus.ts: new Prometheus client with 7d/30d range queries for disk, memory, swap, CPU steal, and disk I/O (GB/hr); getWeeklyDigestData() aggregates all metrics for digest and API endpoint - routes.ts: GET /api/vm/metrics/trend?metric=…&range=… and GET /api/vm/weekly-digest endpoints - api.ts: TrendPoint/TrendSeries types; getTrend() and getMemoryTrend() added to vmApi - vm/page.tsx: Sparkline (pure SVG polyline+fill), TrendCard with latest/avg/peak and threshold colouring, TrendsPanel with lazy load on first open; Promise.allSettled() isolation for all 5 data panels - vm-weekly-digest.sh: weekly Telegram digest via docker exec into devops-backend to reach Prometheus; emoji severity indicators; cron summary from /var/log/vm-cleanup.log - systemd timer: Mon 08:00 UTC, Persistent=true (fires on next boot if missed); first trigger 2026-06-02 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 05:26:49 +00:00
Hermes VM	d9618ba7b0	feat(vm): Phases 1.2, 1.4, 2.1 — steal time, swap pressure, health watchdog Phase 1.2 — CPU steal time metric in vm-health-check.sh: - Samples /proc/stat twice 1s apart for accurate current steal % - Thresholds: >5% WARN, >15% CRIT (currently 0.8% on this host) - Inserts before memory check so steal is visible alongside load Phase 1.4 — Swap pressure indicator: - Reads SwapCached from /proc/meminfo as secondary metric - Raises SWAP_USED_WARN_GB 1→1.5 to reduce noise (current usage 0.6G) - New WARN path: SwapCached > 200MB signals recent pressure even when current swap usage looks ok (catches post-spike state) Phase 2.1 — Docker health-check watchdog: - docker-health-watchdog.sh: checks unhealthy containers every 10 min, restarts only after 3 consecutive failing health checks (30min grace) - docker-health-watchdog.service + .timer: enabled, fires every 10 min - Sends Telegram notification on each auto-restart - Rollback: systemctl disable docker-health-watchdog.timer Phase 2.2 already complete: sync_hermes_persistent_backup.py handles diverge gracefully with rebase/reset-hard fallback; running successfully. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 21:31:09 +00:00
Hermes VM	2fc23d6baa	feat(vm): fix devops-backend VM module — Phase 0.1 complete - Switch backend runner from node:20-alpine to node:20-slim so GNU df flags (--output=pcent/avail) work inside the container - Add volume mounts to docker-compose.yml: scripts (ro), VM logs (rw), docker.sock; set VM_SCRIPTS_PATH + VM_LOG_DIR env vars - Rebuild repository.ts: env-configurable paths, cron history parser, unhealthy-container inspector, Ollama model endpoints - Add routes: GET /api/vm/cron-status, unhealthy containers, Ollama models, container restart, model unload - vm-cleanup.sh: add step_cosmos_pglog, step_docker_aged_images; fix (( count++ )) → count=$(( count + 1 )) for set -e compatibility - Add docs/VM_OBSERVABILITY_ROADMAP.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 21:13:45 +00:00
Hermes VM	9210a8890f	feat: detect stale VM automation	2026-05-27 21:00:43 +00:00
Hermes VM	70d96d7684	feat: add gitea backup timer assets	2026-05-27 18:53:20 +00:00
Hermes VM	678430d77d	fix: cron health-check entry should call vm-health-check.sh --notify The 07:00 daily cron was incorrectly pointing to vm-cleanup.sh instead of vm-health-check.sh. Health check is read-only; cleanup is not. Also add --notify so Telegram alerts fire when WARNING/CRITICAL. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 18:53:20 +00:00
Hermes VM	0a2d303f93	add HostingerVM health-check and cleanup scripts - vm-health-check.sh: read-only checks for disk, load, RAM, swap, Docker containers (crash-loops + healthchecks), build cache, journal. Flags: --quiet, --json, --notify (Telegram). Exit 0/1/2 = OK/WARN/CRIT. - vm-cleanup.sh: safe periodic cleanup. Default (weekly): build cache, journal, apt, npm, .next/cache. --full (monthly): adds docker system prune, pnpm store, old logs, HOLD cleanup. --dry-run, --install-cron, --uninstall-cron. Logs to /var/log/vm-cleanup.log. Related: docs/hostinger-vm-maintenance.md, scripts/VMs/HostingerVM/CRON_SETUP.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 18:53:20 +00:00
root	3cc9a1456e	Add Google Drive single file uploader	2026-05-27 12:19:45 +00:00
root	79ca56ffce	Add Google Drive emergency bundle upload	2026-05-27 12:08:41 +00:00
root	bb15a225cd	Add encrypted Hermes emergency bundle scripts	2026-05-27 11:31:58 +00:00
root	416f25794c	Document Hermes Gitea token flow	2026-05-27 11:06:15 +00:00
root	8de72351de	Complete Hermes dashboard and watchdog roadmap audit	2026-05-27 10:45:29 +00:00
root	e57038a6a2	docs: advance Hermes setup roadmap	2026-05-27 10:12:27 +00:00
root	1deb832b1a	chore(devops): tighten deployment scripts	2026-05-18 09:01:03 +00:00
Saravana Achu Mac	59aa981914	chore: move login sh into scripts folder	2026-05-09 15:58:08 -07:00
root	38118bb445	Harden Ubuntu VM update script readiness checks	2026-05-05 03:09:57 +00:00
root	14d1b566d6	Add safe templates and tooling adoption docs	2026-05-05 01:16:27 +00:00
root	013a27069b	Harden Ubuntu VM security update script	2026-05-05 01:07:30 +00:00
sarvana7	8708ae55fe	Add Ubuntu VM security update automation script This script automates security updates for Ubuntu VMs, including unattended upgrades, SSH protection, and package integrity checks.	2026-05-04 18:03:47 -07:00

25 Commits
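
Commit 2fc23d6baa changes `(( count++ ))` to `count=$(( count + 1 ))` in vm-cleanup.sh "for set -e compatibility". The sketch below illustrates why that matters; it is not taken from the repository. In bash, `(( expr ))` returns exit status 1 when the expression evaluates to 0, and a post-increment evaluates to the old value, so the first increment from 0 aborts a script running under `set -e`. The function name `count_items` is hypothetical.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper: count the arguments using the set -e safe form.
count_items() {
  local count=0
  local item
  for item in "$@"; do
    # An assignment from arithmetic expansion always returns status 0,
    # so it cannot trip `set -e`, even when count starts at 0.
    count=$(( count + 1 ))
  done
  echo "$count"
}

count_items a b c   # prints 3

# The unsafe form, run in a separate bash process so this script survives:
# (( n++ )) evaluates to 0 on the first pass, returns status 1, and
# `set -e` exits before the echo is reached.
bash -c 'set -e; n=0; (( n++ )); echo "unreachable"' \
  || echo "aborted with status $?"   # prints: aborted with status 1
```

The failing case is run through `bash -c` rather than a subshell because `set -e` is ignored for commands on the left side of `||`, which would hide the abort.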