docs: add VM security blind spots roadmap

Hermes VM 2026-05-27 20:21:51 +00:00
parent c89018ae47
commit 2c125adb05
2 changed files with 351 additions and 0 deletions


@@ -52,6 +52,7 @@ Current key files:
- `docs/remove_user_interactive.md`
- `docs/hermes-setup-upgrade-roadmap.md`
- `docs/hermes-operations.md`
- `docs/vm-security-blind-spots-roadmap.md`
### `.github/workflows/`


@@ -0,0 +1,350 @@
# ByteLyst VM Security Blind Spots Roadmap
**Review date:** 2026-05-27
**Reviewer:** Hermes Agent
**Scope:** Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.
## Executive Summary
The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.
The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on `0.0.0.0` / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.
Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
## Evidence Snapshot
Collected on 2026-05-27 from this VM.
### Host and patching
- Host: `srv1491630`
- OS: Ubuntu `25.10`
- Kernel: `6.17.0-29-generic`
- Uptime: about 14 hours at review time
- Root filesystem: 193G total, 71G used, 123G available, 37% used
- Memory: 15Gi total, about 10Gi available
- Swap: 4.0G total, about 1.3G used
- Reboot required: no
- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for `libgcrypt20`, `libcaca0`, and `libssh2-1t64`
- Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent
### Network and ingress
- UFW: active; default deny incoming; only `22/tcp` allowed by UFW rules
- Docker iptables rules are present and publish many ports despite UFW's simple rule list
- Public/listening TCP ports bound on all interfaces included:
  - `22`, `80`, `443`
  - app/frontend ports: `3000`, `3002`, `3003`, `3030`, `3035`, `3040`, `3049`, `3050`, `3055`, `3060`, `3070`, `3075`, `3085`
  - backend/API ports: `4004`, `4010`, `4011`, `4012`, `4013`, `4014`, `4015`, `4016`, `4017`, `4019`, `4020`, `4025`
  - infra/dev ports: `1025`, `1234`, `3100`, `3300`, `8025`, `8081`, `10000`, `11434`
- Caddy source-of-truth config: `/opt/bytelyst/Caddyfile`, mounted read-only into the `caddy` container
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`: valid config, formatting warning only
- Caddy public hostnames include:
  - `api.bytelyst.com`
  - `gitea.bytelyst.com`
  - `admin.bytelyst.com`
  - `devops.bytelyst.com`
  - `tracker.bytelyst.com`
  - `llmlab.bytelyst.com`
  - `ollama.bytelyst.com`
  - `trading-api.bytelyst.com`
  - `invttrdg.bytelyst.com`
  - `notes.bytelyst.com`
  - `clock.bytelyst.com`
### SSH and account surface
Effective `sshd -T` settings showed:
- `permitrootlogin yes`
- `passwordauthentication yes`
- `pubkeyauthentication yes`
- `kbdinteractiveauthentication no`
- `maxauthtries 6`
- `x11forwarding yes`
- `clientaliveinterval 0`
`fail2ban` is active with one jail: `sshd`; no current bans at review time.
### Docker runtime and containers
- Docker: client/server `29.4.2`; newer Docker packages are available
- Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; `live_restore=false`
- Most product containers run with writable root filesystems and no explicit `user` configured
- `cadvisor` is privileged
- `DOCKER-USER` chain appears empty, so there is no central Docker firewall policy in front of published containers
- Multiple containers are unhealthy:
  - `learning_ai_common_plat-llmlab-dashboard-1`
  - `learning_ai_common_plat-actiontrail-web-1`
  - `learning_ai_common_plat-jarvisjr-web-1`
  - `learning_ai_common_plat-localmemgpt-web-1`
  - `learning_ai_common_plat-nomgap-web-1`
  - `learning_ai_common_plat-flowmonk-web-1`
  - `learning_ai_common_plat-mindlyst-web-1`
### Gitea and CI
- Gitea public route: `https://gitea.bytelyst.com`
- Local Gitea container port: host `3300` -> container `3000`, bound on `0.0.0.0` and IPv6
- `gitea-act-runner.service`: enabled but inactive/dead
- Runner user exists: `gitea-runner`, member of `docker`
- Runner config directory permissions look reasonable:
  - `/home/gitea-runner/act_runner`: `750`, owned by `gitea-runner:gitea-runner`
  - `/home/gitea-runner/act_runner/config.yaml`: `600`, owned by `gitea-runner:gitea-runner`
### Backup and operations
- `systemctl --failed` showed failed unit:
  - `hermes-root-backup.service`: `Sync root Hermes persistent backup to GitHub`
- Hermes cron backup is active and healthy:
  - job `470832621b43`, `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, last run `ok`
- Existing VM maintenance cron entries exist for health check and cleanup under `/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/`
- A root crontab entry still references `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh`, which may be stale after repo relocation/renaming
## Blind Spots and Risk Register
### P0 — Internet-exposed Docker ports bypass the intended ingress model
**Risk:** UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.
**Examples observed:** `3300`, `8025`, `1025`, `1234`, `8081`, `10000`, `11434`, many `30xx` web ports, and many `40xx` backend ports.
**Impact:** Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.
**Roadmap:**
- [ ] Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
- [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
- [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
- [ ] Add `DOCKER-USER` chain rules that drop unsolicited traffic to non-approved published ports before Docker's own accept rules are evaluated.
- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
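The recurring check in the last item could start from something like the sketch below. It is a minimal example, not the reviewed tooling: the `APPROVED_PORTS` allowlist and the `unapproved_ports` function name are hypothetical, and it assumes the column layout of `ss -ltnH` (local address in the fourth field).

```bash
#!/usr/bin/env bash
# Sketch: flag wildcard-bound listening TCP ports that are not in an approved list.
# APPROVED_PORTS is a hypothetical allowlist; replace it with the reviewed inventory.
APPROVED_PORTS="${APPROVED_PORTS:-22 80 443}"

# Reads `ss -ltnH` output on stdin and prints each unapproved port once, sorted.
unapproved_ports() {
  awk -v approved="$APPROVED_PORTS" '
    BEGIN { n = split(approved, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    {
      local_addr = $4                      # e.g. 0.0.0.0:3300, [::]:3300, *:80
      port = local_addr; sub(/.*:/, "", port)
      host = substr(local_addr, 1, length(local_addr) - length(port) - 1)
      # This sketch only checks wildcard binds; a service bound to a specific
      # public address would need an extra case here.
      if (host != "0.0.0.0" && host != "[::]" && host != "*") next
      if (!(port in ok)) print port
    }
  ' | sort -nu
}

# Usage on the VM (not executed here):
#   ss -ltnH | unapproved_ports
```

An empty result would mean every wildcard-bound port is on the allowlist. Any output is a candidate for loopback binding, removal, or an explicit approval entry.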
### P0 — SSH permits root login and password authentication
**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
**Roadmap:**
- [ ] Confirm all required admin users have working SSH keys and sudo access.
- [ ] Add a non-root break-glass admin path if one does not exist.
- [ ] Change SSH effective config to:
  - [ ] `PermitRootLogin prohibit-password` or `no`
  - [ ] `PasswordAuthentication no`
  - [ ] `X11Forwarding no`
  - [ ] lower `MaxAuthTries`, e.g. `3`
  - [ ] set a sane `ClientAliveInterval` / `ClientAliveCountMax`
- [ ] Validate with a second session before restarting SSH.
- [ ] Record rollback commands and keep console/provider access available during rollout.
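To confirm the targets above actually took effect, the effective configuration can be checked rather than any single config file. The following is a rough sketch: the function name `ssh_hardening_gaps` is hypothetical, and the `MaxAuthTries` threshold of 3 simply mirrors the example value in the list.

```bash
#!/usr/bin/env bash
# Sketch: report effective sshd settings that still differ from the roadmap targets.
# Reads `sshd -T` output (lowercase "keyword value" lines) on stdin.
ssh_hardening_gaps() {
  awk '
    $1 == "permitrootlogin"        && $2 == "yes" { print "permitrootlogin=yes" }
    $1 == "passwordauthentication" && $2 == "yes" { print "passwordauthentication=yes" }
    $1 == "x11forwarding"          && $2 == "yes" { print "x11forwarding=yes" }
    # The threshold 3 follows the roadmap example; adjust if another limit is chosen.
    $1 == "maxauthtries"           && $2 + 0 > 3  { print "maxauthtries=" $2 }
    # 0 means no keepalive probing is configured.
    $1 == "clientaliveinterval"    && $2 + 0 == 0 { print "clientaliveinterval=0" }
  '
}

# Usage on the VM (not executed here, needs root):
#   sshd -T | ssh_hardening_gaps
```

Because `sshd` keeps the first value it obtains for each keyword, an earlier drop-in file can silently override a later hardening file. Checking `sshd -T` output avoids being misled by that ordering.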
### P0 — Public/private boundary for dev and internal tooling is unclear
**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but there is no explicit, documented auth/access decision for each.
**Roadmap:**
- [ ] Document public hostnames, auth model, and data sensitivity.
- [ ] Require explicit approval before exposing new dashboards or model endpoints.
- [ ] Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
- [ ] Add security headers/auth checks to public UI health reviews.
- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.
### P1 — Docker/container hardening is mostly default
**Risk:** Many containers run as the default (root) user with a writable root filesystem and Docker's default capability set, and the Docker daemon itself is rootful. A compromised app gets more host-adjacent leverage than needed.
**Roadmap:**
- [ ] Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
- [ ] Start with public-facing/backend services and admin dashboards.
- [ ] Add `security_opt: ["no-new-privileges:true"]` where compatible.
- [ ] Add `cap_drop: ["ALL"]` and selectively add back capabilities only when needed.
- [ ] Convert app images to non-root users consistently.
- [ ] Use `read_only: true` plus explicit writable tmp/cache volumes where compatible.
- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
- [ ] Enable Docker `live-restore` if it fits maintenance operations.
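A first pass at the hardening matrix could be generated from `docker inspect` output. The sketch below is illustrative only: `hardening_gaps` is a hypothetical helper, it covers just three of the matrix columns, and it assumes one line per container in the `name user=… privileged=… readonly=…` shape shown in the usage comment.

```bash
#!/usr/bin/env bash
# Sketch: list containers that still run as root, privileged, or with a writable rootfs.
# Reads one line per container on stdin: "/name user=<u> privileged=<bool> readonly=<bool>".
hardening_gaps() {
  awk '
    {
      name = $1; sub(/^\//, "", name); gaps = ""
      for (i = 2; i <= NF; i++) {
        # An empty user field means the container runs as the image default, root.
        if ($i ~ /^user=((0|root)(:.*)?)?$/) gaps = gaps " root-user"
        if ($i == "privileged=true")         gaps = gaps " privileged"
        if ($i == "readonly=false")          gaps = gaps " writable-rootfs"
      }
      if (gaps != "") print name ":" gaps
    }
  '
}

# Usage on the VM (not executed here):
#   docker ps -q | xargs -r docker inspect --format \
#     '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}}' \
#     | hardening_gaps
```

Dropped capabilities and `no-new-privileges` are not covered here. They would need additional `HostConfig` fields in the format string.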
### P1 — Unhealthy containers can normalize broken deployments
**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
**Roadmap:**
- [ ] Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
- [ ] Fix or remove bad healthchecks so Docker health state is trustworthy.
- [ ] Add alerting for sustained unhealthy containers.
- [ ] Make deployment scripts fail on unhealthy post-deploy state.
- [ ] Update dashboard/observability docs with current service ownership and expected state.
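Making a deploy script fail on unhealthy state could look roughly like the sketch below. The function name `fail_on_unhealthy` is hypothetical, and it assumes the tab-separated `name<TAB>status` lines that `docker ps --format '{{.Names}}\t{{.Status}}'` produces, where Docker appends `(unhealthy)` to the status text.

```bash
#!/usr/bin/env bash
# Sketch: exit non-zero when any container reports an unhealthy status.
# Reads "name<TAB>status" lines on stdin and prints the unhealthy names.
fail_on_unhealthy() {
  local bad
  bad="$(awk -F'\t' '$2 ~ /\(unhealthy\)/ { print $1 }')"
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  return 0
}

# Usage at the end of a deploy script (not executed here):
#   docker ps --format '{{.Names}}\t{{.Status}}' | fail_on_unhealthy || exit 1
```

Containers still in `(health: starting)` are not treated as failures here, so a real post-deploy gate would probably wait for the start period before running this check. `docker ps --filter health=unhealthy` is an alternative way to get the same list.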
### P1 — Gitea Actions runner is enabled but inactive
**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
**Roadmap:**
- [ ] Decide whether the runner should be active or intentionally disabled.
- [ ] If active: restart and verify `gitea-act-runner.service`, runner labels, Docker access, and a smoke workflow.
- [ ] If disabled: disable the service and document the intentional state.
- [ ] Keep runner secrets separate from smoke/test workflows.
- [ ] Add a runner-health check to VM observability.
### P1 — Backup/restore evidence is split and one backup unit is failed
**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is in a failed state, and this review found no evidence of a recent full restore drill. A backup that has never been restored is only an assumption.
**Roadmap:**
- [ ] Inspect `hermes-root-backup.service` logs and decide whether to fix, disable, or replace it with the cron-backed job.
- [ ] Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
- [ ] Run a restore drill into a non-production path/profile.
- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
- [ ] Add backup freshness and restore-drill status to the monthly VM review.
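For the "no raw secrets committed" check, a filename-level filter over `git ls-files` is one cheap first line of defence. This is a sketch under assumptions: the patterns are illustrative rather than exhaustive, `risky_tracked_files` is a hypothetical name, and it inspects paths only, not file contents.

```bash
#!/usr/bin/env bash
# Sketch: flag tracked paths whose names suggest secrets or live database state.
# Reads one path per line on stdin; paths ending in .example are allowed.
risky_tracked_files() {
  grep -E '(^|/)\.env($|\.)|\.(pem|key)$|\.(sqlite|db)-(wal|shm)$|(^|/)id_(rsa|ed25519)$' \
    | grep -Ev '\.example$' || true
}

# Usage inside the backup repository (not executed here):
#   git ls-files | risky_tracked_files
```

A name-based filter will miss tokens embedded in otherwise ordinary files, so it complements a content scanner rather than replacing one.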
### P1 — Patch management has pending security/runtime updates
**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
**Roadmap:**
- [ ] Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
- [ ] Define a Docker upgrade maintenance window with pre/post checks.
- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
- [ ] Verify apps after Docker/containerd upgrades.
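Reporting Docker and security updates separately could be approximated by classifying the `Inst` lines from a simulated upgrade. The sketch below is illustrative: `classify_pending_upgrades` is a hypothetical name, the package-name prefixes are a guess at the Docker package set, and it assumes security updates carry a `-security` suite name on the `Inst` line.

```bash
#!/usr/bin/env bash
# Sketch: split pending upgrades into docker / security / other buckets.
# Reads `apt-get -s upgrade` output on stdin; prints "class<TAB>package".
classify_pending_upgrades() {
  awk '
    $1 != "Inst"                 { next }
    $2 ~ /^(docker-|containerd)/ { print "docker\t" $2; next }
    /-security/                  { print "security\t" $2; next }
                                 { print "other\t" $2 }
  '
}

# Usage on the VM (not executed here):
#   apt-get -s upgrade | classify_pending_upgrades | sort
```

Keeping the Docker bucket separate makes it easier to schedule those packages into a maintenance window while letting routine security updates proceed.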
### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
**Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
**Roadmap:**
- [ ] Record current Ubuntu 25.10 support/EOL date in ops docs.
- [ ] Decide whether to stay on interim releases or migrate to an LTS baseline.
- [ ] Add an OS lifecycle check to quarterly review.
### P2 — Repository/config secret hygiene needs a repeatable scanner
**Risk:** The DevOps repo contains operational inputs, and historical/deleted repo copies still exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.
**Roadmap:**
- [ ] Add a documented secret-scan command using `gitleaks` or `trufflehog` for tracked files and selected untracked ops directories.
- [ ] Scan historical directories such as `DELETED_bytelyst-devops-tools` separately before archiving or deleting.
- [ ] Add `.gitignore` patterns for generated scans, local account snapshots, and credential-shaped outputs.
- [ ] Keep examples as `.example` files only.
### P2 — Cron/systemd ownership and drift are not fully inventoried
**Risk:** Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.
**Roadmap:**
- [ ] Inventory root/user crontabs, `/etc/cron.d`, systemd timers, Hermes cron, and Gitea Actions schedules.
- [ ] Remove or update stale `/opt/bytelyst/bytelyst-devops-tools/...` references after confirming replacements.
- [ ] Add owner, purpose, expected output, and alert channel for every job.
- [ ] Add a stale-job detector for missing script paths and failed systemd units.
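The missing-script half of the stale-job detector might be sketched as follows. It is deliberately narrow: `missing_cron_paths` is a hypothetical name, and it only looks for absolute paths ending in `.sh` on uncommented crontab lines.

```bash
#!/usr/bin/env bash
# Sketch: print script paths referenced by cron entries that no longer exist on disk.
# Reads crontab-style lines on stdin; only absolute *.sh paths are considered.
missing_cron_paths() {
  grep -v '^[[:space:]]*#' \
    | grep -oE '/[A-Za-z0-9._/-]+\.sh' \
    | sort -u \
    | while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
      done
}

# Usage on the VM (not executed here):
#   crontab -l | missing_cron_paths
#   cat /etc/cron.d/* | missing_cron_paths
```

Jobs that call binaries or scripts without a `.sh` suffix are not covered, and failed systemd units would still need a separate `systemctl --failed` check.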
### P2 — Observability exists but needs security-focused SLOs
**Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
**Roadmap:**
- [ ] Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
- [ ] Validate alert delivery to Telegram.
- [ ] Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.
## Execution Plan
### Phase 0 — Freeze and inventory before changes
- [ ] Freeze new public hostnames/ports until the exposure inventory is complete.
- [ ] Generate `docs/vm-exposure-inventory.md` from Docker, Caddy, `ss`, and DNS.
- [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
- [ ] Review with S before changing public access for customer/user-facing apps.
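As a starting point for the Docker part of the exposure inventory, published ports can be flattened into one row per binding. The sketch below is a rough example: `published_ports` is a hypothetical helper, the `scope` labels are placeholders for the `public`/`private`/`internal-only`/`retire` decision that still has to be made by a person, and it assumes the comma-separated port text that `docker ps` prints.

```bash
#!/usr/bin/env bash
# Sketch: turn `docker ps` port text into one row per published binding.
# Reads "name<TAB>ports" lines on stdin; prints name, bind host, host port, target, scope.
published_ports() {
  awk -F'\t' '
    {
      n = split($2, parts, ", ")
      for (i = 1; i <= n; i++) {
        p = parts[i]
        if (index(p, "->") == 0) continue    # exposed in the image but not published
        split(p, side, "->")
        bind = side[1]; target = side[2]
        port = bind; sub(/.*:/, "", port)
        host = substr(bind, 1, length(bind) - length(port) - 1)
        scope = (host == "127.0.0.1" || host == "[::1]" || host == "::1") ? "loopback" : "all-or-external"
        printf "%s\t%s\t%s\t%s\t%s\n", $1, host, port, target, scope
      }
    }
  '
}

# Usage on the VM (not executed here):
#   docker ps --format '{{.Names}}\t{{.Ports}}' | published_ports
```

Rows marked `all-or-external` are the ones to reconcile against Caddy routes and DNS before anything is approved as public.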
### Phase 1 — Immediate security hardening
- [ ] Close or loopback-bind non-public Docker host ports.
- [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
- [ ] Harden SSH root/password access after key-based access is verified.
- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
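The `DOCKER-USER` step could be prototyped as a dry run before anything touches the live firewall. The sketch below only prints rules for review. The function name `docker_user_rules`, the interface name, and the port list are all assumptions, and it covers IPv4 only. It matches on the original destination port via conntrack because packets reaching `DOCKER-USER` have already been through DNAT.

```bash
#!/usr/bin/env bash
# Sketch: print (do not run) DOCKER-USER rules that let only approved published
# ports in from the external interface. Interface name and ports are assumptions.
docker_user_rules() {
  local ext_if="$1"; shift
  local pos=1 port
  # Replies to connections that are already established keep flowing.
  echo "iptables -I DOCKER-USER $pos -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN"
  for port in "$@"; do
    pos=$((pos + 1))
    # DNAT has already rewritten the destination, so match the original port.
    echo "iptables -I DOCKER-USER $pos -i $ext_if -p tcp -m conntrack --ctorigdstport $port -j RETURN"
  done
  pos=$((pos + 1))
  # Anything else arriving on the external interface for a container is dropped.
  echo "iptables -I DOCKER-USER $pos -i $ext_if -j DROP"
}

# Example (prints four rules for review):
#   docker_user_rules eth0 80 443
```

`RETURN` is used instead of `ACCEPT` so that Docker's own chains still evaluate approved traffic. Equivalent `ip6tables` rules and a persistence mechanism would be needed separately, and the rules should be applied with console access available in case of lockout.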
### Phase 2 — Operational correctness
- [ ] Fix/retire unhealthy containers.
- [ ] Resolve `hermes-root-backup.service` failed state.
- [ ] Decide and document Gitea runner active/disabled state.
- [ ] Remove stale cron paths and add missing-script checks.
- [ ] Apply pending security/runtime updates in a maintenance window.
### Phase 3 — Docker and app hardening
- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
- [ ] Add resource limits for noisy services and emulators.
- [ ] Move emulators/dev tools off public bindings.
- [ ] Review cAdvisor privilege and observability surface.
### Phase 4 — Backup, restore, and incident readiness
- [ ] Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
- [ ] Perform restore drill to non-prod target.
- [ ] Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
- [ ] Add quarterly tabletop review.
### Phase 5 — Continuous governance
- [ ] Monthly VM security review cron/checklist.
- [ ] Secret scan before DevOps repo pushes.
- [ ] OS lifecycle/EOL tracker.
- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
## Suggested First Tickets
1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
2. **P0: Lock Docker-published non-public ports** — bind to loopback/internal or enforce `DOCKER-USER` drops.
3. **P0: Harden SSH** — disable password/root login after confirming key-based admin access.
4. **P1: Triage unhealthy containers** — fix healthchecks/apps or retire dead services.
5. **P1: Resolve failed Hermes backup unit** — fix or disable duplicate failed unit; keep cron backup healthy.
6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.
## Verification Commands for Future Runs
```bash
# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required
# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup
# SSH effective config
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd
# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'
# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile
# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
```
## Notes
- This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.