diff --git a/docs/vm-security-blind-spots-roadmap.md b/docs/vm-security-blind-spots-roadmap.md
index d1ba4c1..d360f5a 100644
--- a/docs/vm-security-blind-spots-roadmap.md
+++ b/docs/vm-security-blind-spots-roadmap.md
@@ -12,6 +12,70 @@ The biggest blind spot is that the apparent firewall posture is misleading: UFW

 Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.

+## Implementation Readiness Assessment
+
+**Roadmap quality score:** 86%
+
+**Implementation confidence before remediation starts:** 74%
+
+**Why not higher yet:** the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.
+
+**Confidence after Phase 0 is complete:** expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.
+
+**Quality strengths:**
+
+- Evidence is concrete and command-derived rather than speculative.
+- The highest-risk items are correctly prioritized as P0.
+- The roadmap separates discovery from disruptive remediation.
+- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.
+
+**Quality gaps to close before implementation:**
+
+- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
+- Add an approved exposure inventory before changing Docker bindings or `DOCKER-USER`.
+- Record a tested SSH rollback path and keep an active second session/provider console open before changing `sshd`.
+- Define what is intentionally public, private, internal-only, or deprecated for each service.
+- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.
+
+## Implementation Guardrails
+
+These rules apply before any Phase 1 change:
+
+- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
+- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
+- Do not add broad `DROP` rules before an allowlist is committed to the inventory.
+- Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
+- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
+- Record exact rollback commands next to every change ticket.
+- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.
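The loopback-binding guardrail above can be spot-checked before and after each service group is changed. The following is a minimal sketch, not part of the roadmap itself: the function name `non_loopback_listeners` is hypothetical, and it assumes the usual `ss -ltn` column layout in which the fourth field is the local `address:port`.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and print every
# listening socket that is NOT bound to a loopback address.
# Assumption: field 4 is the local address:port, as in standard `ss -ltn` output.
non_loopback_listeners() {
  # Keep LISTEN rows only (this also skips the header row), then drop
  # anything bound to 127.0.0.0/8 or [::1].
  awk '$1 == "LISTEN" && $4 !~ /^(127\.|\[::1\])/ { print $4 }'
}
```

A typical invocation would be `ss -ltn | non_loopback_listeners`, with the result compared by hand against the approved inventory. The sketch only reports bind addresses; it does not prove that a port is reachable from the internet, which still needs the external probe.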
+
+## Exposure Classification Model
+
+Every listening port and Caddy hostname should be classified before changes:
+
+| Class | Meaning | Expected Controls | Examples To Review |
+| --- | --- | --- | --- |
+| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
+| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
+| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
+| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
+| `docker-internal` | Container-to-container only | no host port mapping | databases, emulators, private workers |
+| `retire` | Unused/deprecated | remove service/port, disable health checks and jobs | stale dashboards/services |
+
+Minimum inventory fields:
+
+- service/container name
+- repo/Compose file
+- host port and bind address
+- container port
+- Caddy hostname/path, if any
+- intended audience
+- authentication/control plane
+- classification
+- owner/approver
+- rollback command
+- post-change health check
+
 ## Evidence Snapshot

 Collected on 2026-05-27 from this VM.
@@ -121,6 +185,16 @@ Effective `sshd -T` settings showed:
 - [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
 - [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.

+**Acceptance criteria:**
+
+- `docs/vm-exposure-inventory.md` lists every `ss -ltnp` listener and every Docker published port.
+- Every non-SSH direct public bind has an approved classification.
+- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in `DOCKER-USER`.
+- External probe confirms non-approved ports are closed from the internet.
+- Caddy-routed public hostnames still pass smoke checks.
+
+**Rollback:** keep a saved copy of original Compose files and `iptables-save` output; rollback means restoring original port mappings or flushing only the newly added `DOCKER-USER` rules.
+
 ### P0 — SSH permits root login and password authentication

 **Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
@@ -138,6 +212,16 @@ Effective `sshd -T` settings showed:
 - [ ] Validate with a second session before restarting SSH.
 - [ ] Record rollback commands and keep console/provider access available during rollout.

+**Acceptance criteria:**
+
+- A non-root sudo admin user can log in with SSH key auth.
+- Root password login no longer works.
+- Existing automation using `scripts/VMs/HostingerVM/login.sh` still works or is updated.
+- `sshd -T` confirms the intended effective config.
+- `fail2ban-client status sshd` still reports an active jail.
+
+**Rollback:** provider console or still-open root session can restore previous `sshd_config` drop-in and restart `ssh`.
+
 ### P0 — Public/private boundary for dev and internal tooling is unclear

 **Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
@@ -150,6 +234,12 @@ Effective `sshd -T` settings showed:
 - [ ] Add security headers/auth checks to public UI health reviews.
 - [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.

+**Acceptance criteria:**
+
+- `ollama`, `llmlab`, `devops`, `admin`, `gitea`, and observability-adjacent routes each have an owner-approved exposure class.
+- Public admin-like routes require authentication or an explicit documented exception.
+- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.
+
 ### P1 — Docker/container hardening is mostly default

 **Risk:** Many containers run as default/root user, writable rootfs, broad capabilities by default, and rootful Docker. A compromised app gets more host-adjacent leverage than needed.
@@ -165,6 +255,8 @@ Effective `sshd -T` settings showed:
 - [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
 - [ ] Enable Docker `live-restore` if it fits maintenance operations.

+**Implementation note:** do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with `no-new-privileges`, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
+
 ### P1 — Unhealthy containers can normalize broken deployments

 **Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
@@ -177,6 +269,12 @@ Effective `sshd -T` settings showed:
 - [ ] Make deployment scripts fail on unhealthy post-deploy state.
 - [ ] Update dashboard/observability docs with current service ownership and expected state.

+**Acceptance criteria:**
+
+- Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
+- Docker health state matches the product’s actual serving state.
+- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
+
 ### P1 — Gitea Actions runner is enabled but inactive

 **Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
@@ -189,6 +287,8 @@ Effective `sshd -T` settings showed:
 - [ ] Keep runner secrets separate from smoke/test workflows.
 - [ ] Add a runner-health check to VM observability.

+**Decision needed:** runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.
+
 ### P1 — Backup/restore evidence is split and one backup unit is failed

 **Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
@@ -201,6 +301,12 @@ Effective `sshd -T` settings showed:
 - [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
 - [ ] Add backup freshness and restore-drill status to the monthly VM review.

+**Acceptance criteria:**
+
+- `systemctl --failed` no longer includes backup units unless the failure is intentionally documented.
+- Backup status shows source, destination, cadence, last success, and restore command.
+- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
+
 ### P1 — Patch management has pending security/runtime updates

 **Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
@@ -212,6 +318,12 @@ Effective `sshd -T` settings showed:
 - [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
 - [ ] Verify apps after Docker/containerd upgrades.

+**Acceptance criteria:**
+
+- Security updates and Docker/runtime updates are tracked separately.
+- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
+- Reboot requirement is checked and scheduled rather than discovered accidentally.
+
 ### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking

 **Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
@@ -244,6 +356,12 @@ Effective `sshd -T` settings showed:
 - [ ] Add owner, purpose, expected output, and alert channel for every job.
 - [ ] Add a stale-job detector for missing script paths and failed systemd units.

+**Acceptance criteria:**
+
+- No active cron/systemd job references a missing path.
+- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
+- Stale path detection runs in the monthly VM review.
+
 ### P2 — Observability exists but needs security-focused SLOs

 **Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
@@ -263,6 +381,8 @@ Effective `sshd -T` settings showed:
 - [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
 - [ ] Review with S before changing public access for customer/user-facing apps.

+**Exit criteria:** the inventory is reviewed and every P0 change has a rollback line and validation line.
+
 ### Phase 1 — Immediate security hardening

 - [ ] Close or loopback-bind non-public Docker host ports.
@@ -270,6 +390,8 @@ Effective `sshd -T` settings showed:
 - [ ] Harden SSH root/password access after key-based access is verified.
 - [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.

+**Exit criteria:** only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.
+
 ### Phase 2 — Operational correctness

 - [ ] Fix/retire unhealthy containers.
@@ -278,6 +400,8 @@ Effective `sshd -T` settings showed:
 - [ ] Remove stale cron paths and add missing-script checks.
 - [ ] Apply pending security/runtime updates in a maintenance window.

+**Exit criteria:** no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.
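The "no stale cron paths" part of the Phase 2 exit criteria could be checked with a small filter like the one below. This is a rough sketch under stated assumptions, not the detector the roadmap plans to build: the function name `stale_cron_paths` is hypothetical, and it only inspects whitespace-separated tokens that begin with `/`, so paths hidden inside quotes, variables, or redirections are not covered.

```shell
# Hypothetical helper: read crontab text on stdin and print every absolute
# path token that does not exist on this host.
stale_cron_paths() (
  # Runs in a subshell so `set -f` does not leak into the caller.
  set -f  # disable globbing so cron "*" schedule fields are not expanded
  while IFS= read -r line; do
    for token in $line; do
      case "$token" in
        '#'*) break ;;  # stop at a comment marker; skip the rest of the line
        /*) [ -e "$token" ] || printf '%s\n' "$token" ;;
      esac
    done
  done
)
```

One way to use it is `crontab -l | stale_cron_paths` for the current user's jobs; an empty result means no referenced absolute path is missing. Failed systemd units still need the separate `systemctl --failed --no-pager` check.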
+
 ### Phase 3 — Docker and app hardening

 - [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
@@ -299,6 +423,41 @@ Effective `sshd -T` settings showed:
 - [ ] OS lifecycle/EOL tracker.
 - [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.

+## Change Tickets With Quality Gates
+
+Use this shape for each implementation PR/commit:
+
+```text
+Ticket:
+Risk:
+Files/services changed:
+Pre-checks:
+Change:
+Rollback:
+Post-checks:
+Residual risk:
+```
+
+Minimum post-checks for Phase 1:
+
+- `ss -ltnp`
+- `docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'`
+- `iptables -S DOCKER-USER`
+- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`
+- public smoke checks for approved hostnames
+- negative external probe for blocked ports
+- `sshd -T` after SSH changes
+- `systemctl --failed --no-pager`
+
+## Do Not Start With
+
+- Rootless Docker migration.
+- Broad `iptables` default-drop without an allowlist.
+- Mass Compose rewrites across all products.
+- SSH password/root lockout before key-based sudo and rollback are proven.
+- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
+- Publishing secret-scan output that contains secrets.
+
 ## Suggested First Tickets

 1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
@@ -309,6 +468,16 @@ Effective `sshd -T` settings showed:
 6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
 7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.

+**Recommended first implementation order:**
+
+1. Generate and review `docs/vm-exposure-inventory.md`.
+2. Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
+3. Harden SSH with second-session/provider-console safety.
+4. Move obvious internal-only Docker ports to loopback/internal bindings.
+5. Add `DOCKER-USER` guardrails after the allowlist is proven.
+
+This order improves safety without letting the port exposure issue linger too long.
+
 ## Verification Commands for Future Runs

 ```bash