docs: strengthen VM security roadmap gates

2026-05-27 20:34:37 +00:00 · 2026-05-27 20:34:37 +00:00 · 313a775fa0
commit 313a775fa0
parent 2c125adb05
1 changed file with 169 additions and 0 deletions
--- a/docs/vm-security-blind-spots-roadmap.md
+++ b/docs/vm-security-blind-spots-roadmap.md
@ -12,6 +12,70 @@ The biggest blind spot is that the apparent firewall posture is misleading: UFW

 Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.

+## Implementation Readiness Assessment
+
+**Roadmap quality score:** 86%
+
+**Implementation confidence before remediation starts:** 74%
+
+**Why not higher yet:** the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.
+
+**Confidence after Phase 0 is complete:** expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.
+
+**Quality strengths:**
+
+- Evidence is concrete and command-derived rather than speculative.
+- The highest-risk items are correctly prioritized as P0.
+- The roadmap separates discovery from disruptive remediation.
+- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.
+
+**Quality gaps to close before implementation:**
+
+- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
+- Add an approved exposure inventory before changing Docker bindings or `DOCKER-USER`.
+- Record a tested SSH rollback path and keep an active second session/provider console open before changing `sshd`.
+- Define what is intentionally public, private, internal-only, or deprecated for each service.
+- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.
+
+## Implementation Guardrails
+
+These rules apply before any Phase 1 change:
+
+- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
+- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
+- Do not add broad `DROP` rules before an allowlist is committed to the inventory.
+- Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
+- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
+- Record exact rollback commands next to every change ticket.
+- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.
+
+## Exposure Classification Model
+
+Every listening port and Caddy hostname should be classified before changes:
+
+| Class | Meaning | Expected Controls | Examples To Review |
+| --- | --- | --- | --- |
+| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
+| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
+| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
+| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
+| `docker-internal` | Container-to-container only | No host port mapping | databases, emulators, private workers |
+| `retire` | Unused/deprecated | Remove service/port, disable health checks and jobs | stale dashboards/services |
+
+Minimum inventory fields:
+
+- service/container name
+- repo/Compose file
+- host port and bind address
+- container port
+- Caddy hostname/path, if any
+- intended audience
+- authentication/control plane
+- classification
+- owner/approver
+- rollback command
+- post-change health check
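
The "host port and bind address" field is what separates a `loopback-only` service from one that listens on every interface. A minimal sketch of that distinction, assuming addresses in the `addr:port` form that `ss -ltn` prints; the function name `bind_scope` and its three labels are hypothetical and not part of the roadmap:

```bash
# Hypothetical helper: map a listener's local address to a coarse bind scope.
# Input is one "addr:port" string as printed by `ss -ltn` (for example
# 127.0.0.1:5432, 0.0.0.0:22, [::]:443, *:80).
bind_scope() {
  local addr="${1%:*}"   # drop the trailing :port
  case "$addr" in
    127.*|"[::1]")       echo "loopback" ;;
    0.0.0.0|"*"|"[::]")  echo "all-interfaces" ;;
    *)                   echo "specific-address" ;;
  esac
}
```

A listener reported as `all-interfaces` still needs an owner decision before it is given a class; the scope alone does not say whether the port is intentionally public.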
+
 ## Evidence Snapshot

 Collected on 2026-05-27 from this VM.
@ -121,6 +185,16 @@ Effective `sshd -T` settings showed:
 - [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
 - [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.

+**Acceptance criteria:**
+
+- `docs/vm-exposure-inventory.md` lists every `ss -ltnp` listener and every Docker published port.
+- Every non-SSH direct public bind has an approved classification.
+- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in `DOCKER-USER`.
+- External probe confirms non-approved ports are closed from the internet.
+- Caddy-routed public hostnames still pass smoke checks.
+
+**Rollback:** keep a saved copy of the original Compose files and the `iptables-save` output. Rolling back means restoring the original port mappings or deleting only the newly added `DOCKER-USER` rules, not flushing the whole chain.
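
The recurring comparison of listeners against the approved inventory could look roughly like the sketch below. It assumes the approved list is a plain file with one port number per line, which is a hypothetical format rather than the layout of `docs/vm-exposure-inventory.md`, and the function name is illustrative:

```bash
# Reads `ss -ltn` output on stdin; $1 is a file with one approved port per
# line (hypothetical format). Prints each non-loopback listening port that
# is missing from that file, sorted numerically and de-duplicated.
unapproved_listeners() {
  local approved_file="$1"
  awk '$1 == "LISTEN" { print $4 }' |
    while read -r local_addr; do
      case "${local_addr%:*}" in
        127.*|"[::1]") continue ;;   # loopback binds are not internet reachable
      esac
      port="${local_addr##*:}"
      grep -qx -- "$port" "$approved_file" || echo "$port"
    done | sort -un
}
```

On the VM this would be fed as `ss -ltn | unapproved_listeners approved-ports.txt`, where `approved-ports.txt` is a placeholder name. An empty result means every non-loopback listener is on the list; it does not prove the provider firewall agrees, so the external probe is still required.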
+
 ### P0 — SSH permits root login and password authentication

 **Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
@ -138,6 +212,16 @@ Effective `sshd -T` settings showed:
 - [ ] Validate with a second session before restarting SSH.
 - [ ] Record rollback commands and keep console/provider access available during rollout.

+**Acceptance criteria:**
+
+- A non-root sudo admin user can log in with SSH key auth.
+- Root password login no longer works.
+- Existing automation using `scripts/VMs/HostingerVM/login.sh` still works or is updated.
+- `sshd -T` confirms the intended effective config.
+- `fail2ban-client status sshd` still reports an active jail.
+
+**Rollback:** the provider console or a still-open root session can restore the previous `sshd_config` drop-in and restart `ssh`.
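
Checking the effective config can be reduced to a filter over `sshd -T` output, which prints keywords in lowercase. This sketch only flags the two settings named in the risk statement; treating any value other than `yes` as acceptable is an assumption, and the function name is hypothetical:

```bash
# Reads `sshd -T` output on stdin and prints one line per finding.
# Exits non-zero when at least one finding is printed.
check_sshd_effective() {
  awk '
    $1 == "permitrootlogin" && $2 == "yes"        { print "root login enabled"; bad = 1 }
    $1 == "passwordauthentication" && $2 == "yes" { print "password auth enabled"; bad = 1 }
    END { exit bad + 0 }
  '
}
```

Run as root it would be `sshd -T | check_sshd_effective`. Because `Match` blocks can change the result per user or address, a clean run here covers only the default context.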
+
 ### P0 — Public/private boundary for dev and internal tooling is unclear

 **Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
@ -150,6 +234,12 @@ Effective `sshd -T` settings showed:
 - [ ] Add security headers/auth checks to public UI health reviews.
 - [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.

+**Acceptance criteria:**
+
+- `ollama`, `llmlab`, `devops`, `admin`, `gitea`, and observability-adjacent routes each have an owner-approved exposure class.
+- Public admin-like routes require authentication or an explicit documented exception.
+- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.
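
A negative reachability probe for the last criterion can be sketched with bash's built-in `/dev/tcp` redirection and coreutils `timeout`. It is only meaningful when run from a machine outside the VM and outside any private network; the function name and the 3-second default are assumptions:

```bash
# Succeeds (exit 0) when a TCP connect to host:port is refused or times out.
# Usage: port_is_closed HOST PORT [TIMEOUT_SECONDS]
port_is_closed() {
  local host="$1" port="$2" wait="${3:-3}"
  if timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    return 1   # the connection opened, so the port is reachable
  fi
  return 0
}
```

The sketch does not distinguish a refused connection from a silently dropped one; both count as closed, which matches the acceptance criterion but hides which layer did the blocking.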
+
 ### P1 — Docker/container hardening is mostly default

 **Risk:** Many containers run as the default (root) user, with a writable rootfs and Docker's default capability set, under rootful Docker. A compromised app gets more host-adjacent leverage than needed.
@ -165,6 +255,8 @@ Effective `sshd -T` settings showed:
 - [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
 - [ ] Enable Docker `live-restore` if it fits maintenance operations.

+**Implementation note:** do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with `no-new-privileges`, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
+
 ### P1 — Unhealthy containers can normalize broken deployments

 **Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
@ -177,6 +269,12 @@ Effective `sshd -T` settings showed:
 - [ ] Make deployment scripts fail on unhealthy post-deploy state.
 - [ ] Update dashboard/observability docs with current service ownership and expected state.

+**Acceptance criteria:**
+
+- Every unhealthy container is resolved in one of four ways: the app is fixed, the healthcheck is fixed, the container is intentionally disabled, or the container is retired.
+- Docker health state matches the product’s actual serving state.
+- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
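
A post-deploy gate with a grace period might be sketched as follows. It assumes tab-separated `name<TAB>status` lines such as those from `docker ps --format '{{.Names}}\t{{.Status}}'`, where Docker appends `(unhealthy)` to the status; both function names are hypothetical:

```bash
# Print the names of containers whose status column reports "(unhealthy)".
# Expects tab-separated "name<TAB>status" lines on stdin.
unhealthy_containers() {
  awk -F '\t' '$2 ~ /\(unhealthy\)/ { print $1 }'
}

# Poll a status-producing command until nothing is unhealthy or the grace
# period runs out. Returns 1 if unhealthy containers remain.
# Usage: wait_until_healthy GRACE_SECONDS INTERVAL_SECONDS CMD [ARGS...]
wait_until_healthy() {
  local grace="$1" interval="$2"; shift 2
  local waited=0 bad
  while :; do
    bad="$("$@" | unhealthy_containers)"
    [ -z "$bad" ] && return 0
    [ "$waited" -ge "$grace" ] && break
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "still unhealthy after ${grace}s: $(echo "$bad" | tr '\n' ' ')" >&2
  return 1
}
```

A deploy script could end with `wait_until_healthy 120 10 docker ps --format '{{.Names}}\t{{.Status}}'`, with the 120 and 10 chosen per service. Containers still in `(health: starting)` are not counted as unhealthy, so the grace period should be at least as long as the slowest healthcheck start period.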
+
 ### P1 — Gitea Actions runner is enabled but inactive

 **Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
@ -189,6 +287,8 @@ Effective `sshd -T` settings showed:
 - [ ] Keep runner secrets separate from smoke/test workflows.
 - [ ] Add a runner-health check to VM observability.

+**Decision needed:** the runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not be left as the steady state.
+
 ### P1 — Backup/restore evidence is split and one backup unit is failed

 **Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
@ -201,6 +301,12 @@ Effective `sshd -T` settings showed:
 - [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
 - [ ] Add backup freshness and restore-drill status to the monthly VM review.

+**Acceptance criteria:**
+
+- `systemctl --failed` no longer includes backup units unless the failure is intentionally documented.
+- Backup status shows source, destination, cadence, last success, and restore command.
+- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
+
 ### P1 — Patch management has pending security/runtime updates

 **Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
@ -212,6 +318,12 @@ Effective `sshd -T` settings showed:
 - [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
 - [ ] Verify apps after Docker/containerd upgrades.

+**Acceptance criteria:**
+
+- Security updates and Docker/runtime updates are tracked separately.
+- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
+- Reboot requirement is checked and scheduled rather than discovered accidentally.
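
Two of these criteria lend themselves to small checks. The sketch below splits `apt list --upgradable` output into runtime, security, and other packages, and reads the reboot flag file that Ubuntu writes at `/var/run/reboot-required`. Matching runtime packages by a `docker` or `containerd` name prefix and security updates by a `-security` suite name are assumptions, as are the function names:

```bash
# Reads `apt list --upgradable` output on stdin and prints "class<TAB>package".
# Lines without a "/" (such as the "Listing..." banner) are skipped.
classify_upgradable() {
  awk -F '/' 'NF > 1 {
    pkg = $1
    if (pkg ~ /^(docker|containerd)/) class = "runtime"
    else if ($2 ~ /-security/)        class = "security"
    else                              class = "other"
    print class "\t" pkg
  }'
}

# Report whether a reboot is pending. $1 overrides the flag path for testing.
reboot_status() {
  local flag="${1:-/var/run/reboot-required}"
  if [ -e "$flag" ]; then
    echo "reboot required"
    [ -r "${flag}.pkgs" ] && sort -u "${flag}.pkgs"
    return 0
  fi
  echo "no reboot pending"
}
```

Keeping the `runtime` rows in a separate list makes it easier to schedule Docker and containerd upgrades inside a maintenance window while the remaining security updates continue to flow through unattended upgrades.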
+
 ### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking

 **Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
@ -244,6 +356,12 @@ Effective `sshd -T` settings showed:
 - [ ] Add owner, purpose, expected output, and alert channel for every job.
 - [ ] Add a stale-job detector for missing script paths and failed systemd units.

+**Acceptance criteria:**
+
+- No active cron/systemd job references a missing path.
+- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
+- Stale path detection runs in the monthly VM review.
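
A stale-path detector for cron can start as a heuristic over crontab text. This sketch treats every whitespace-separated token after the five schedule fields that begins with `/` as a path and reports the ones that do not exist; the function name is hypothetical, and `@reboot`-style entries and paths containing spaces are not handled:

```bash
# Reads crontab text on stdin and prints absolute paths that do not exist.
# Comments, blank lines, and NAME=value environment lines are ignored.
missing_cron_paths() {
  grep -Ev '^[[:space:]]*(#|$)|^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*=' |
    awk '{ for (i = 6; i <= NF; i++) if ($i ~ /^\//) print $i }' |
    sort -u |
    while read -r path; do
      [ -e "$path" ] || echo "$path"
    done
}
```

It can be fed from `crontab -l` for a user crontab or from the files under `/etc/cron.d`; systemd units would need a separate check, for example against `systemctl --failed`.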
+
 ### P2 — Observability exists but needs security-focused SLOs

 **Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
@ -263,6 +381,8 @@ Effective `sshd -T` settings showed:
 - [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
 - [ ] Review with S before changing public access for customer/user-facing apps.

+**Exit criteria:** the inventory is reviewed and every P0 change has a rollback line and validation line.
+
 ### Phase 1 — Immediate security hardening

 - [ ] Close or loopback-bind non-public Docker host ports.
@ -270,6 +390,8 @@ Effective `sshd -T` settings showed:
 - [ ] Harden SSH root/password access after key-based access is verified.
 - [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.

+**Exit criteria:** only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.
+
 ### Phase 2 — Operational correctness

 - [ ] Fix/retire unhealthy containers.
@ -278,6 +400,8 @@ Effective `sshd -T` settings showed:
 - [ ] Remove stale cron paths and add missing-script checks.
 - [ ] Apply pending security/runtime updates in a maintenance window.

+**Exit criteria:** no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.
+
 ### Phase 3 — Docker and app hardening

 - [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
@ -299,6 +423,41 @@ Effective `sshd -T` settings showed:
 - [ ] OS lifecycle/EOL tracker.
 - [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.

+## Change Tickets With Quality Gates
+
+Use this shape for each implementation PR/commit:
+
+```text
+Ticket:
+Risk:
+Files/services changed:
+Pre-checks:
+Change:
+Rollback:
+Post-checks:
+Residual risk:
+```
+
+Minimum post-checks for Phase 1:
+
+- `ss -ltnp`
+- `docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'`
+- `iptables -S DOCKER-USER`
+- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`
+- public smoke checks for approved hostnames
+- negative external probe for blocked ports
+- `sshd -T` after SSH changes
+- `systemctl --failed --no-pager`
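
These checks are easier to repeat when they run as one batch with a pass/fail line per command. A minimal sketch, assuming one command per line on stdin; the function name and the optional log-file argument are hypothetical:

```bash
# Run each command line from stdin, print PASS or FAIL per command, and
# return non-zero if any command failed. Command output is appended to the
# log file given as $1 (default: discarded).
run_post_checks() {
  local log="${1:-/dev/null}" failed=0 cmd
  while IFS= read -r cmd; do
    [ -z "$cmd" ] && continue
    if bash -c "$cmd" >>"$log" 2>&1 </dev/null; then
      echo "PASS  $cmd"
    else
      echo "FAIL  $cmd"
      failed=1
    fi
  done
  return "$failed"
}
```

Exit status alone is a weak signal for commands such as `ss -ltnp` or `iptables -S DOCKER-USER`, which succeed whatever they print, so the captured log still has to be compared against the approved inventory.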
+
+## Do Not Start With
+
+- Rootless Docker migration.
+- Broad `iptables` default-drop without an allowlist.
+- Mass Compose rewrites across all products.
+- SSH password/root lockout before key-based sudo and rollback are proven.
+- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
+- Publishing secret-scan output that contains secrets.
+
 ## Suggested First Tickets

 1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
@ -309,6 +468,16 @@ Effective `sshd -T` settings showed:
 6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
 7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.

+**Recommended first implementation order:**
+
+1. Generate and review `docs/vm-exposure-inventory.md`.
+2. Fix the stale cron path and the failed backup unit, because both have a lower blast radius and improve rollback confidence.
+3. Harden SSH with second-session/provider-console safety.
+4. Move obvious internal-only Docker ports to loopback/internal bindings.
+5. Add `DOCKER-USER` guardrails after the allowlist is proven.
+
+This order improves safety without letting the port exposure issue linger too long.
+
 ## Verification Commands for Future Runs

 ```bash