docs: add VM security blind spots roadmap

2026-05-27 20:21:51 +00:00
commit 2c125adb05
parent c89018ae47
2 changed files with 351 additions and 0 deletions
--- a/docs/repo-map.md
+++ b/docs/repo-map.md
@@ -52,6 +52,7 @@ Current key files:
 - `docs/remove_user_interactive.md`
 - `docs/hermes-setup-upgrade-roadmap.md`
 - `docs/hermes-operations.md`
+- `docs/vm-security-blind-spots-roadmap.md`

 ### `.github/workflows/`

--- a/docs/vm-security-blind-spots-roadmap.md
+++ b/docs/vm-security-blind-spots-roadmap.md
@@ -0,0 +1,350 @@
+# ByteLyst VM Security Blind Spots Roadmap
+
+**Review date:** 2026-05-27
+**Reviewer:** Hermes Agent
+**Scope:** Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.
+
+## Executive Summary
+
+The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.
+
+The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on `0.0.0.0` / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.
+
+Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
+
+## Evidence Snapshot
+
+Collected on 2026-05-27 from this VM.
+
+### Host and patching
+
+- Host: `srv1491630`
+- OS: Ubuntu `25.10`
+- Kernel: `6.17.0-29-generic`
+- Uptime: about 14 hours at review time
+- Root filesystem: 193G total, 71G used, 123G available, 37% used
+- Memory: 15Gi total, about 10Gi available
+- Swap: 4.0G total, about 1.3G used
+- Reboot required: no
+- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for `libgcrypt20`, `libcaca0`, and `libssh2-1t64`
+- Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent
+
+### Network and ingress
+
+- UFW: active; default deny incoming; only `22/tcp` allowed by UFW rules
+- Docker iptables rules are present and publish many ports despite UFW's simple rule list
+- Public/listening TCP ports bound on all interfaces included:
+  - `22`, `80`, `443`
+  - app/frontend ports: `3000`, `3002`, `3003`, `3030`, `3035`, `3040`, `3049`, `3050`, `3055`, `3060`, `3070`, `3075`, `3085`
+  - backend/API ports: `4004`, `4010`, `4011`, `4012`, `4013`, `4014`, `4015`, `4016`, `4017`, `4019`, `4020`, `4025`
+  - infra/dev ports: `1025`, `1234`, `3100`, `3300`, `8025`, `8081`, `10000`, `11434`
+- Caddy source-of-truth config: `/opt/bytelyst/Caddyfile`, mounted read-only into the `caddy` container
+- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`: valid config, formatting warning only
+- Caddy public hostnames include:
+  - `api.bytelyst.com`
+  - `gitea.bytelyst.com`
+  - `admin.bytelyst.com`
+  - `devops.bytelyst.com`
+  - `tracker.bytelyst.com`
+  - `llmlab.bytelyst.com`
+  - `ollama.bytelyst.com`
+  - `trading-api.bytelyst.com`
+  - `invttrdg.bytelyst.com`
+  - `notes.bytelyst.com`
+  - `clock.bytelyst.com`
+
+### SSH and account surface
+
+Effective `sshd -T` settings showed:
+
+- `permitrootlogin yes`
+- `passwordauthentication yes`
+- `pubkeyauthentication yes`
+- `kbdinteractiveauthentication no`
+- `maxauthtries 6`
+- `x11forwarding yes`
+- `clientaliveinterval 0`
+
+`fail2ban` is active with one jail: `sshd`; no current bans at review time.
+
+### Docker runtime and containers
+
+- Docker: client/server `29.4.2`; newer Docker packages are available
+- Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; `live_restore=false`
+- Most product containers run with writable root filesystems and no explicit `user` configured
+- `cadvisor` is privileged
+- `DOCKER-USER` chain appears empty, so there is no central Docker firewall policy in front of published containers
+- Multiple containers are unhealthy:
+  - `learning_ai_common_plat-llmlab-dashboard-1`
+  - `learning_ai_common_plat-actiontrail-web-1`
+  - `learning_ai_common_plat-jarvisjr-web-1`
+  - `learning_ai_common_plat-localmemgpt-web-1`
+  - `learning_ai_common_plat-nomgap-web-1`
+  - `learning_ai_common_plat-flowmonk-web-1`
+  - `learning_ai_common_plat-mindlyst-web-1`
+
+### Gitea and CI
+
+- Gitea public route: `https://gitea.bytelyst.com`
+- Local Gitea container port: host `3300` -> container `3000`, bound on `0.0.0.0` and IPv6
+- `gitea-act-runner.service`: enabled but inactive/dead
+- Runner user exists: `gitea-runner`, member of `docker`
+- Runner config directory permissions look reasonable:
+  - `/home/gitea-runner/act_runner`: `750`, owned by `gitea-runner:gitea-runner`
+  - `/home/gitea-runner/act_runner/config.yaml`: `600`, owned by `gitea-runner:gitea-runner`
+
+### Backup and operations
+
+- `systemctl --failed` showed failed unit:
+  - `hermes-root-backup.service` — `Sync root Hermes persistent backup to GitHub`
+- Hermes cron backup is active and healthy:
+  - job `470832621b43`, `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, last run `ok`
+- Existing VM maintenance cron entries exist for health check and cleanup under `/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/`
+- A root crontab entry still references `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh`, which may be stale after repo relocation/renaming
+
+## Blind Spots and Risk Register
+
+### P0 — Internet-exposed Docker ports bypass the intended ingress model
+
+**Risk:** UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.
+
+**Examples observed:** `3300`, `8025`, `1025`, `1234`, `8081`, `10000`, `11434`, many `30xx` web ports, and many `40xx` backend ports.
+
+**Impact:** Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.
+
+**Roadmap:**
+
+- [ ] Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
+- [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
+- [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
+- [ ] Add a `DOCKER-USER` chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
+- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
+- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
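+
+The recurring check in the last item could start as a small filter over `ss` output. This is a minimal sketch, not existing tooling: the function name `unapproved_ports` is hypothetical, and the approved list used in the example (`22 80 443`) is an assumption based on the intended ingress model described above.
+
+```bash
+# Print TCP ports that listen on a wildcard address but are not approved.
+# stdin: output of `ss -ltnH`; arguments: approved port numbers.
+unapproved_ports() {
+  local approved=" $* " port
+  awk '{print $4}' \
+    | grep -E '^(0\.0\.0\.0|\[::\]|\*):[0-9]+$' \
+    | sed -E 's/.*:([0-9]+)$/\1/' \
+    | sort -un \
+    | while read -r port; do
+        case "$approved" in
+          *" $port "*) ;;          # approved, stay quiet
+          *) echo "$port" ;;       # listening on all interfaces, not approved
+        esac
+      done
+}
+```
+
+For example, `ss -ltnH | unapproved_ports 22 80 443` prints one port per line and nothing when the host matches the list. Loopback-only binds such as `127.0.0.1:3300` are deliberately ignored, so rebinding a service to loopback clears it from the report.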
+
+### P0 — SSH permits root login and password authentication
+
+**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` leave the primary admin surface broadly exposed. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
+
+**Roadmap:**
+
+- [ ] Confirm all required admin users have working SSH keys and sudo access.
+- [ ] Add a non-root break-glass admin path if one does not exist.
+- [ ] Change SSH effective config to:
+  - [ ] `PermitRootLogin prohibit-password` or `no`
+  - [ ] `PasswordAuthentication no`
+  - [ ] `X11Forwarding no`
+  - [ ] lower `MaxAuthTries`, e.g. `3`
+  - [ ] set a sane `ClientAliveInterval` / `ClientAliveCountMax`
+- [ ] Validate with a second session before restarting SSH.
+- [ ] Record rollback commands and keep console/provider access available during rollout.
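+
+The target settings above could be delivered as a drop-in file rather than by editing `sshd_config` directly. A minimal sketch under stated assumptions: the function name `write_sshd_hardening`, the file name `10-hardening.conf`, and the `ClientAlive*` values are illustrative, and it presumes the stock Ubuntu `Include /etc/ssh/sshd_config.d/*.conf` line is still in place.
+
+```bash
+# Write a hardening drop-in to the path given as $1.
+# The values mirror the checklist above; the ClientAlive* numbers are placeholders.
+write_sshd_hardening() {
+  local target="$1"
+  cat > "$target" <<'EOF'
+PermitRootLogin prohibit-password
+PasswordAuthentication no
+X11Forwarding no
+MaxAuthTries 3
+ClientAliveInterval 300
+ClientAliveCountMax 2
+EOF
+}
+```
+
+As root, one might run `write_sshd_hardening /etc/ssh/sshd_config.d/10-hardening.conf && sshd -t`, then re-check `sshd -T` before reloading. sshd keeps the first value it reads for each keyword, so an earlier drop-in that sets `PasswordAuthentication yes` would still win and should be checked for.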
+
+### P0 — Public/private boundary for dev and internal tooling is unclear
+
+**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but no explicit auth/access decision is documented for any of them.
+
+**Roadmap:**
+
+- [ ] Document public hostnames, auth model, and data sensitivity.
+- [ ] Require explicit approval before exposing new dashboards or model endpoints.
+- [ ] Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
+- [ ] Add security headers/auth checks to public UI health reviews.
+- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.
+
+### P1 — Docker/container hardening is mostly default
+
+**Risk:** Many containers run as the default/root user with a writable rootfs and Docker's default capability set, all under a rootful Docker daemon. A compromised app gets more host-adjacent leverage than needed.
+
+**Roadmap:**
+
+- [ ] Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
+- [ ] Start with public-facing/backend services and admin dashboards.
+- [ ] Add `security_opt: ["no-new-privileges:true"]` where compatible.
+- [ ] Add `cap_drop: ["ALL"]` and selectively add back capabilities only when needed.
+- [ ] Convert app images to non-root users consistently.
+- [ ] Use `read_only: true` plus explicit writable tmp/cache volumes where compatible.
+- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
+- [ ] Enable Docker `live-restore` if it fits maintenance operations.
+
+### P1 — Unhealthy containers can normalize broken deployments
+
+**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
+
+**Roadmap:**
+
+- [ ] Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
+- [ ] Fix or remove bad healthchecks so Docker health state is trustworthy.
+- [ ] Add alerting for sustained unhealthy containers.
+- [ ] Make deployment scripts fail on unhealthy post-deploy state.
+- [ ] Update dashboard/observability docs with current service ownership and expected state.
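+
+A deploy script could enforce the "fail on unhealthy" item with a gate like the one below. It is a sketch only: the function name `fail_if_unhealthy` is hypothetical, and it relies on Docker's `health=unhealthy` filter for `docker ps`.
+
+```bash
+# Read container names from stdin; fail when at least one name is present.
+fail_if_unhealthy() {
+  local names
+  names="$(cat)"
+  if [ -n "$names" ]; then
+    echo "unhealthy containers:" >&2
+    echo "$names" >&2
+    return 1
+  fi
+}
+```
+
+A post-deploy step might then run `docker ps --filter health=unhealthy --format '{{.Names}}' | fail_if_unhealthy`, ideally after waiting out each healthcheck's start period so slow-starting apps are not reported early.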
+
+### P1 — Gitea Actions runner is enabled but inactive
+
+**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
+
+**Roadmap:**
+
+- [ ] Decide whether the runner should be active or intentionally disabled.
+- [ ] If active: restart and verify `gitea-act-runner.service`, runner labels, Docker access, and a smoke workflow.
+- [ ] If disabled: disable the service and document the intentional state.
+- [ ] Keep runner secrets separate from smoke/test workflows.
+- [ ] Add a runner-health check to VM observability.
+
+### P1 — Backup/restore evidence is split and one backup unit is failed
+
+**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
+
+**Roadmap:**
+
+- [ ] Inspect `hermes-root-backup.service` logs and decide whether to fix, disable, or replace it with the cron-backed job.
+- [ ] Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
+- [ ] Run a restore drill into a non-production path/profile.
+- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
+- [ ] Add backup freshness and restore-drill status to the monthly VM review.
+
+### P1 — Patch management has pending security/runtime updates
+
+**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
+
+**Roadmap:**
+
+- [ ] Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
+- [ ] Define a Docker upgrade maintenance window with pre/post checks.
+- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
+- [ ] Verify apps after Docker/containerd upgrades.
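+
+To report Docker and security updates separately, the simulated upgrade output can be bucketed by package name and archive. The sketch below is an assumption-laden starting point: `classify_upgrades` is a hypothetical name, Docker packages are recognized only by a `docker-` or `containerd` name prefix, and a security update is assumed to show a `-security` archive on its `Inst` line.
+
+```bash
+# Classify `apt-get -s upgrade` output read from stdin.
+# Prints "<class><TAB><package>" with class docker, security, or other.
+classify_upgrades() {
+  awk '/^Inst / {
+    pkg = $2
+    if (pkg ~ /^(docker-|containerd)/) class = "docker"
+    else if ($0 ~ /-security/)          class = "security"
+    else                                class = "other"
+    print class "\t" pkg
+  }' | LC_ALL=C sort
+}
+```
+
+Usage would be `apt-get -s upgrade | classify_upgrades`, optionally followed by `cut -f1 | uniq -c` for a per-class count that keeps the weekly report short.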
+
+### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
+
+**Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
+
+**Roadmap:**
+
+- [ ] Record current Ubuntu 25.10 support/EOL date in ops docs.
+- [ ] Decide whether to stay on interim releases or migrate to an LTS baseline.
+- [ ] Add an OS lifecycle check to quarterly review.
+
+### P2 — Repository/config secret hygiene needs a repeatable scanner
+
+**Risk:** The DevOps repo contains operational inputs, and historical/deleted repo copies still exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.
+
+**Roadmap:**
+
+- [ ] Add a documented secret-scan command using `gitleaks` or `trufflehog` for tracked files and selected untracked ops directories.
+- [ ] Scan historical directories such as `DELETED_bytelyst-devops-tools` separately before archiving or deleting.
+- [ ] Add `.gitignore` patterns for generated scans, local account snapshots, and credential-shaped outputs.
+- [ ] Keep examples as `.example` files only.
+
+### P2 — Cron/systemd ownership and drift are not fully inventoried
+
+**Risk:** Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.
+
+**Roadmap:**
+
+- [ ] Inventory root/user crontabs, `/etc/cron.d`, systemd timers, Hermes cron, and Gitea Actions schedules.
+- [ ] Remove or update stale `/opt/bytelyst/bytelyst-devops-tools/...` references after confirming replacements.
+- [ ] Add owner, purpose, expected output, and alert channel for every job.
+- [ ] Add a stale-job detector for missing script paths and failed systemd units.
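+
+The stale-job detector could begin with a check for cron entries whose script no longer exists, such as the relocated `bytelyst-devops-tools` path noted earlier. A minimal sketch: the function name `missing_cron_scripts` is hypothetical, and it only recognizes absolute paths ending in `.sh`, so other job types would need their own pattern.
+
+```bash
+# Read crontab text from stdin and print referenced *.sh paths that do not exist.
+missing_cron_scripts() {
+  local path
+  grep -v '^[[:space:]]*#' \
+    | grep -oE '/[^[:space:];|&>]+\.sh' \
+    | sort -u \
+    | while read -r path; do
+        [ -e "$path" ] || echo "$path"
+      done
+}
+```
+
+For example, `crontab -l | missing_cron_scripts` covers the current user's jobs and `cat /etc/cron.d/* | missing_cron_scripts` covers system jobs; pairing it with `systemctl --failed` gives the second half of the checklist item.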
+
+### P2 — Observability exists but needs security-focused SLOs
+
+**Risk:** Prometheus/Grafana/Loki/exporters are present, but this review did not verify that any security-focused alerts exist or fire.
+
+**Roadmap:**
+
+- [ ] Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
+- [ ] Validate alert delivery to Telegram.
+- [ ] Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.
+
+## Execution Plan
+
+### Phase 0 — Freeze and inventory before changes
+
+- [ ] Freeze new public hostnames/ports until the exposure inventory is complete.
+- [ ] Generate `docs/vm-exposure-inventory.md` from Docker, Caddy, `ss`, and DNS.
+- [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
+- [ ] Review with S before changing public access for customer/user-facing apps.
+
+### Phase 1 — Immediate security hardening
+
+- [ ] Close or loopback-bind non-public Docker host ports.
+- [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
+- [ ] Harden SSH root/password access after key-based access is verified.
+- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
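+
+One possible shape for the `DOCKER-USER` default-deny step is sketched below as a dry run that only prints commands. The function name `docker_user_rules`, the interface name `eth0`, and the approved ports `80 443` are all assumptions to be replaced from the exposure inventory. Docker evaluates `DOCKER-USER` after rewriting the destination to the container address, so the sketch matches the originally dialed port through the conntrack match instead of `--dport`.
+
+```bash
+# Dry run: print (do not apply) a default-deny policy for the DOCKER-USER chain.
+# $1: external interface (assumed name); remaining args: approved published TCP ports.
+docker_user_rules() {
+  local ext_if="$1" port
+  shift
+  # Start from an empty chain so rule order is predictable.
+  echo "iptables -F DOCKER-USER"
+  # Let replies to container-initiated connections through.
+  echo "iptables -A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN"
+  for port in "$@"; do
+    # Match the port the client originally dialed, before Docker's DNAT.
+    echo "iptables -A DOCKER-USER -i $ext_if -p tcp -m conntrack --ctorigdstport $port --ctdir ORIGINAL -j RETURN"
+  done
+  # Any other new traffic arriving on the external interface is dropped.
+  echo "iptables -A DOCKER-USER -i $ext_if -j DROP"
+}
+
+docker_user_rules eth0 80 443
+```
+
+The printed rules should be reviewed before anyone applies them as root, with console access available for rollback. The sketch covers IPv4 TCP only: IPv6 binds would need equivalent `ip6tables` rules, UDP listeners such as HTTP/3 would need their own accept rule, and persistence across reboots is left open.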
+
+### Phase 2 — Operational correctness
+
+- [ ] Fix/retire unhealthy containers.
+- [ ] Resolve `hermes-root-backup.service` failed state.
+- [ ] Decide and document Gitea runner active/disabled state.
+- [ ] Remove stale cron paths and add missing-script checks.
+- [ ] Apply pending security/runtime updates in a maintenance window.
+
+### Phase 3 — Docker and app hardening
+
+- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
+- [ ] Add resource limits for noisy services and emulators.
+- [ ] Move emulators/dev tools off public bindings.
+- [ ] Review cAdvisor privilege and observability surface.
+
+### Phase 4 — Backup, restore, and incident readiness
+
+- [ ] Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
+- [ ] Perform restore drill to non-prod target.
+- [ ] Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
+- [ ] Add quarterly tabletop review.
+
+### Phase 5 — Continuous governance
+
+- [ ] Monthly VM security review cron/checklist.
+- [ ] Secret scan before DevOps repo pushes.
+- [ ] OS lifecycle/EOL tracker.
+- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
+
+## Suggested First Tickets
+
+1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
+2. **P0: Lock Docker-published non-public ports** — bind to loopback/internal or enforce `DOCKER-USER` drops.
+3. **P0: Harden SSH** — disable password/root login after confirming key-based admin access.
+4. **P1: Triage unhealthy containers** — fix healthchecks/apps or retire dead services.
+5. **P1: Resolve failed Hermes backup unit** — fix or disable duplicate failed unit; keep cron backup healthy.
+6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
+7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.
+
+## Verification Commands for Future Runs
+
+```bash
+# Host/security baseline
+date -Is
+uname -a
+. /etc/os-release && echo "$PRETTY_NAME"
+apt-get -s upgrade | awk '/^Inst /{print}'
+test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required
+
+# Firewall and public bind inventory
+ufw status verbose
+iptables -S DOCKER-USER
+ss -ltnup
+
+# SSH effective config
+sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
+fail2ban-client status sshd
+
+# Docker health/security
+docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
+docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'
+
+# Caddy and ingress
+docker exec caddy caddy validate --config /etc/caddy/Caddyfile
+sed -n '1,220p' /opt/bytelyst/Caddyfile
+
+# Backup/cron/systemd drift
+systemctl --failed --no-pager
+hermes cron list
+crontab -l
+for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
+```
+
+## Notes
+
+- This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
+- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
+- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.