From 2c125adb053eb324948ab72922840695593b5d81 Mon Sep 17 00:00:00 2001
From: Hermes VM
Date: Wed, 27 May 2026 20:21:51 +0000
Subject: [PATCH] docs: add VM security blind spots roadmap

---
 docs/repo-map.md                        |   1 +
 docs/vm-security-blind-spots-roadmap.md | 350 ++++++++++++++++++++++++
 2 files changed, 351 insertions(+)
 create mode 100644 docs/vm-security-blind-spots-roadmap.md

diff --git a/docs/repo-map.md b/docs/repo-map.md
index 06d5216..3fd76af 100644
--- a/docs/repo-map.md
+++ b/docs/repo-map.md
@@ -52,6 +52,7 @@ Current key files:
 - `docs/remove_user_interactive.md`
 - `docs/hermes-setup-upgrade-roadmap.md`
 - `docs/hermes-operations.md`
+- `docs/vm-security-blind-spots-roadmap.md`
 
 ### `.github/workflows/`
 
diff --git a/docs/vm-security-blind-spots-roadmap.md b/docs/vm-security-blind-spots-roadmap.md
new file mode 100644
index 0000000..d1ba4c1
--- /dev/null
+++ b/docs/vm-security-blind-spots-roadmap.md
@@ -0,0 +1,350 @@
+# ByteLyst VM Security Blind Spots Roadmap
+
+**Review date:** 2026-05-27
+**Reviewer:** Hermes Agent
+**Scope:** Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.
+
+## Executive Summary
+
+The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.
+
+The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on `0.0.0.0` / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.
+
+Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
+
+## Evidence Snapshot
+
+Collected on 2026-05-27 from this VM.
+
+### Host and patching
+
+- Host: `srv1491630`
+- OS: Ubuntu `25.10`
+- Kernel: `6.17.0-29-generic`
+- Uptime: about 14 hours at review time
+- Root filesystem: 193G total, 71G used, 123G available, 37% used
+- Memory: 15Gi total, about 10Gi available
+- Swap: 4.0G total, about 1.3G used
+- Reboot required: no
+- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for `libgcrypt20`, `libcaca0`, and `libssh2-1t64`
+- Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent
+
+### Network and ingress
+
+- UFW: active; default deny incoming; only `22/tcp` allowed by UFW rules
+- Docker iptables rules are present and publish many ports despite UFW's simple rule list
+- Public/listening TCP ports bound on all interfaces included:
+  - `22`, `80`, `443`
+  - app/frontend ports: `3000`, `3002`, `3003`, `3030`, `3035`, `3040`, `3049`, `3050`, `3055`, `3060`, `3070`, `3075`, `3085`
+  - backend/API ports: `4004`, `4010`, `4011`, `4012`, `4013`, `4014`, `4015`, `4016`, `4017`, `4019`, `4020`, `4025`
+  - infra/dev ports: `1025`, `1234`, `3100`, `3300`, `8025`, `8081`, `10000`, `11434`
+- Caddy source-of-truth config: `/opt/bytelyst/Caddyfile`, mounted read-only into the `caddy` container
+- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`: valid config, formatting warning only
+- Caddy public hostnames include:
+  - `api.bytelyst.com`
+  - `gitea.bytelyst.com`
+  - `admin.bytelyst.com`
+  - `devops.bytelyst.com`
+  - `tracker.bytelyst.com`
+  - `llmlab.bytelyst.com`
+  - `ollama.bytelyst.com`
+  - `trading-api.bytelyst.com`
+  - `invttrdg.bytelyst.com`
+  - `notes.bytelyst.com`
+  - `clock.bytelyst.com`
+
+### SSH and account surface
+
+Effective `sshd -T` settings showed:
+
+- `permitrootlogin yes`
+- `passwordauthentication yes`
+- `pubkeyauthentication yes`
+- `kbdinteractiveauthentication no`
+- `maxauthtries 6`
+- `x11forwarding yes`
+- `clientaliveinterval 0`
+
+`fail2ban` is active with one jail: `sshd`; no current bans at review time.
+
+### Docker runtime and containers
+
+- Docker: client/server `29.4.2`; newer Docker packages are available
+- Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; `live_restore=false`
+- Most product containers run with writable root filesystems and no explicit `user` configured
+- `cadvisor` is privileged
+- `DOCKER-USER` chain appears empty, so there is no central Docker firewall policy in front of published containers
+- Multiple containers are unhealthy:
+  - `learning_ai_common_plat-llmlab-dashboard-1`
+  - `learning_ai_common_plat-actiontrail-web-1`
+  - `learning_ai_common_plat-jarvisjr-web-1`
+  - `learning_ai_common_plat-localmemgpt-web-1`
+  - `learning_ai_common_plat-nomgap-web-1`
+  - `learning_ai_common_plat-flowmonk-web-1`
+  - `learning_ai_common_plat-mindlyst-web-1`
+
+### Gitea and CI
+
+- Gitea public route: `https://gitea.bytelyst.com`
+- Local Gitea container port: host `3300` -> container `3000`, bound on `0.0.0.0` and IPv6
+- `gitea-act-runner.service`: enabled but inactive/dead
+- Runner user exists: `gitea-runner`, member of `docker`
+- Runner config directory permissions look reasonable:
+  - `/home/gitea-runner/act_runner`: `750`, owned by `gitea-runner:gitea-runner`
+  - `/home/gitea-runner/act_runner/config.yaml`: `600`, owned by `gitea-runner:gitea-runner`
+
+### Backup and operations
+
+- `systemctl --failed` showed failed unit:
+  - `hermes-root-backup.service` — `Sync root Hermes persistent backup to GitHub`
+- Hermes cron backup is active and healthy:
+  - job `470832621b43`, `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, last run `ok`
+- Existing VM maintenance cron entries exist for health check and cleanup under `/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/`
+- A root crontab entry still references `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh`, which may be stale after repo relocation/renaming
+
+## Blind Spots and Risk Register
+
+### P0 — Internet-exposed Docker ports bypass the intended ingress model
+
+**Risk:** UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.
+
+**Examples observed:** `3300`, `8025`, `1025`, `1234`, `8081`, `10000`, `11434`, many `30xx` web ports, and many `40xx` backend ports.
+
+**Impact:** Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.
+
+**Roadmap:**
+
+- [ ] Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
+- [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
+- [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
+- [ ] Add a `DOCKER-USER` chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
+- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
+- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
+
+### P0 — SSH permits root login and password authentication
+
+**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
+
+**Roadmap:**
+
+- [ ] Confirm all required admin users have working SSH keys and sudo access.
+- [ ] Add a non-root break-glass admin path if one does not exist.
+- [ ] Change SSH effective config to:
+  - [ ] `PermitRootLogin prohibit-password` or `no`
+  - [ ] `PasswordAuthentication no`
+  - [ ] `X11Forwarding no`
+  - [ ] lower `MaxAuthTries`, e.g. `3`
+  - [ ] set a sane `ClientAliveInterval` / `ClientAliveCountMax`
+- [ ] Validate with a second session before restarting SSH.
+- [ ] Record rollback commands and keep console/provider access available during rollout.
+
+### P0 — Public/private boundary for dev and internal tooling is unclear
+
+**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
+
+**Roadmap:**
+
+- [ ] Document public hostnames, auth model, and data sensitivity.
+- [ ] Require explicit approval before exposing new dashboards or model endpoints.
+- [ ] Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
+- [ ] Add security headers/auth checks to public UI health reviews.
+- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.
+
+### P1 — Docker/container hardening is mostly default
+
+**Risk:** Many containers run as default/root user, writable rootfs, broad capabilities by default, and rootful Docker. A compromised app gets more host-adjacent leverage than needed.
+
+**Roadmap:**
+
+- [ ] Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
+- [ ] Start with public-facing/backend services and admin dashboards.
+- [ ] Add `security_opt: ["no-new-privileges:true"]` where compatible.
+- [ ] Add `cap_drop: ["ALL"]` and selectively add back capabilities only when needed.
+- [ ] Convert app images to non-root users consistently.
+- [ ] Use `read_only: true` plus explicit writable tmp/cache volumes where compatible.
+- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
+- [ ] Enable Docker `live-restore` if it fits maintenance operations.
+
+### P1 — Unhealthy containers can normalize broken deployments
+
+**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
+
+**Roadmap:**
+
+- [ ] Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
+- [ ] Fix or remove bad healthchecks so Docker health state is trustworthy.
+- [ ] Add alerting for sustained unhealthy containers.
+- [ ] Make deployment scripts fail on unhealthy post-deploy state.
+- [ ] Update dashboard/observability docs with current service ownership and expected state.
+
+### P1 — Gitea Actions runner is enabled but inactive
+
+**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
+
+**Roadmap:**
+
+- [ ] Decide whether the runner should be active or intentionally disabled.
+- [ ] If active: restart and verify `gitea-act-runner.service`, runner labels, Docker access, and a smoke workflow.
+- [ ] If disabled: disable the service and document the intentional state.
+- [ ] Keep runner secrets separate from smoke/test workflows.
+- [ ] Add a runner-health check to VM observability.
+
+### P1 — Backup/restore evidence is split and one backup unit is failed
+
+**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
+
+**Roadmap:**
+
+- [ ] Inspect `hermes-root-backup.service` logs and decide whether to fix, disable, or replace it with the cron-backed job.
+- [ ] Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
+- [ ] Run a restore drill into a non-production path/profile.
+- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
+- [ ] Add backup freshness and restore-drill status to the monthly VM review.
+
+### P1 — Patch management has pending security/runtime updates
+
+**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
+
+**Roadmap:**
+
+- [ ] Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
+- [ ] Define a Docker upgrade maintenance window with pre/post checks.
+- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
+- [ ] Verify apps after Docker/containerd upgrades.
+
+### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
+
+**Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
+
+**Roadmap:**
+
+- [ ] Record current Ubuntu 25.10 support/EOL date in ops docs.
+- [ ] Decide whether to stay on interim releases or migrate to an LTS baseline.
+- [ ] Add an OS lifecycle check to quarterly review.
+
+### P2 — Repository/config secret hygiene needs a repeatable scanner
+
+**Risk:** The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.
+
+**Roadmap:**
+
+- [ ] Add a documented secret-scan command using `gitleaks` or `trufflehog` for tracked files and selected untracked ops directories.
+- [ ] Scan historical directories such as `DELETED_bytelyst-devops-tools` separately before archiving or deleting.
+- [ ] Add `.gitignore` patterns for generated scans, local account snapshots, and credential-shaped outputs.
+- [ ] Keep examples as `.example` files only.
+
+### P2 — Cron/systemd ownership and drift are not fully inventoried
+
+**Risk:** Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.
+
+**Roadmap:**
+
+- [ ] Inventory root/user crontabs, `/etc/cron.d`, systemd timers, Hermes cron, and Gitea Actions schedules.
+- [ ] Remove or update stale `/opt/bytelyst/bytelyst-devops-tools/...` references after confirming replacements.
+- [ ] Add owner, purpose, expected output, and alert channel for every job.
+- [ ] Add a stale-job detector for missing script paths and failed systemd units.
+
+### P2 — Observability exists but needs security-focused SLOs
+
+**Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
+
+**Roadmap:**
+
+- [ ] Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
+- [ ] Validate alert delivery to Telegram.
+- [ ] Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.
+
+## Execution Plan
+
+### Phase 0 — Freeze and inventory before changes
+
+- [ ] Freeze new public hostnames/ports until the exposure inventory is complete.
+- [ ] Generate `docs/vm-exposure-inventory.md` from Docker, Caddy, `ss`, and DNS.
+- [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
+- [ ] Review with S before changing public access for customer/user-facing apps.
+
+### Phase 1 — Immediate security hardening
+
+- [ ] Close or loopback-bind non-public Docker host ports.
+- [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
+- [ ] Harden SSH root/password access after key-based access is verified.
+- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
+
+### Phase 2 — Operational correctness
+
+- [ ] Fix/retire unhealthy containers.
+- [ ] Resolve `hermes-root-backup.service` failed state.
+- [ ] Decide and document Gitea runner active/disabled state.
+- [ ] Remove stale cron paths and add missing-script checks.
+- [ ] Apply pending security/runtime updates in a maintenance window.
+
+### Phase 3 — Docker and app hardening
+
+- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
+- [ ] Add resource limits for noisy services and emulators.
+- [ ] Move emulators/dev tools off public bindings.
+- [ ] Review cAdvisor privilege and observability surface.
+
+### Phase 4 — Backup, restore, and incident readiness
+
+- [ ] Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
+- [ ] Perform restore drill to non-prod target.
+- [ ] Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
+- [ ] Add quarterly tabletop review.
+
+### Phase 5 — Continuous governance
+
+- [ ] Monthly VM security review cron/checklist.
+- [ ] Secret scan before DevOps repo pushes.
+- [ ] OS lifecycle/EOL tracker.
+- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
+
+## Suggested First Tickets
+
+1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
+2. **P0: Lock Docker-published non-public ports** — bind to loopback/internal or enforce `DOCKER-USER` drops.
+3. **P0: Harden SSH** — disable password/root login after confirming key-based admin access.
+4. **P1: Triage unhealthy containers** — fix healthchecks/apps or retire dead services.
+5. **P1: Resolve failed Hermes backup unit** — fix or disable duplicate failed unit; keep cron backup healthy.
+6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
+7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.
+
+## Verification Commands for Future Runs
+
+```bash
+# Host/security baseline
+date -Is
+uname -a
+. /etc/os-release && echo "$PRETTY_NAME"
+apt-get -s upgrade | awk '/^Inst /{print}'
+test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required
+
+# Firewall and public bind inventory
+ufw status verbose
+iptables -S DOCKER-USER
+ss -ltnup
+
+# SSH effective config (grep -E replaces the obsolescent egrep alias)
+sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
+fail2ban-client status sshd
+
+# Docker health/security
+docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
+docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'
+
+# Caddy and ingress
+docker exec caddy caddy validate --config /etc/caddy/Caddyfile
+sed -n '1,220p' /opt/bytelyst/Caddyfile
+
+# Backup/cron/systemd drift
+systemctl --failed --no-pager
+hermes cron list
+crontab -l
+for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
+```
+
+## Notes
+
+- This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
+- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
+- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.
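
Review sketch, kept outside the patch hunk above: the roadmap asks for a recurring check that compares `ss -ltn` against an approved inventory, and for a stale-job detector for missing script paths. The two functions below are a minimal sketch of those checks, not an existing script in this repo. The names `unapproved_public_ports`, `missing_cron_scripts`, and the `APPROVED_PORTS` variable are hypothetical, and the default allowlist of `22 80 443` is an assumption taken from the "keep only `80/443` and intentionally public SSH" item; the real list would come from the exposure inventory.

```bash
#!/usr/bin/env bash
# Hypothetical allowlist; replace with the ports approved in the exposure inventory.
APPROVED_PORTS="${APPROVED_PORTS:-22 80 443}"

# Reads `ss -ltnH` output on stdin and prints, once each and in numeric order,
# every port bound on a wildcard address (0.0.0.0, *, [::]) that is not approved.
unapproved_public_ports() {
  awk -v approved="$APPROVED_PORTS" '
    BEGIN { n = split(approved, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    {
      addr = $4                       # "Local Address:Port" column
      port = addr
      sub(/.*:/, "", port)            # keep only the text after the last colon
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if ((host == "0.0.0.0" || host == "*" || host == "[::]") && !(port in ok))
        print port
    }' | sort -un
}

# Reads crontab text on stdin, skips comment lines, and prints each absolute
# *.sh path that is referenced but does not exist on disk.
missing_cron_scripts() {
  grep -v '^[[:space:]]*#' \
    | grep -oE '/[^[:space:];|&]+\.sh' \
    | sort -u \
    | while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
      done
}
```

Assuming the block is sourced into a shell, `ss -ltnH | unapproved_public_ports` lists candidate ports for loopback binding or `DOCKER-USER` drops, and `crontab -l | missing_cron_scripts` would surface a reference such as the `monitor-lucky25-execution.sh` path if that script is no longer present. Both functions only read input, so they fit the review's no-change posture.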