docs: strengthen VM security roadmap gates

Hermes VM 2026-05-27 20:34:37 +00:00
parent 2c125adb05
commit 313a775fa0


@ -12,6 +12,70 @@ The biggest blind spot is that the apparent firewall posture is misleading: UFW
Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
## Implementation Readiness Assessment
**Roadmap quality score:** 86%
**Implementation confidence before remediation starts:** 74%
**Why not higher yet:** the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.
**Confidence after Phase 0 is complete:** expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.
**Quality strengths:**
- Evidence is concrete and command-derived rather than speculative.
- The highest-risk items are correctly prioritized as P0.
- The roadmap separates discovery from disruptive remediation.
- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.
**Quality gaps to close before implementation:**
- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
- Add an approved exposure inventory before changing Docker bindings or `DOCKER-USER`.
- Record a tested SSH rollback path and keep an active second session/provider console open before changing `sshd`.
- Define what is intentionally public, private, internal-only, or deprecated for each service.
- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.
## Implementation Guardrails
These rules apply before any Phase 1 change:
- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
- Do not add broad `DROP` rules before an allowlist is committed to the inventory.
- Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
- Record exact rollback commands next to every change ticket.
- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.
## Exposure Classification Model
Every listening port and Caddy hostname should be classified before changes:
| Class | Meaning | Expected Controls | Examples To Review |
| --- | --- | --- | --- |
| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
| `docker-internal` | Container-to-container only | no host port mapping | databases, emulators, private workers |
| `retire` | Unused/deprecated | remove service/port, disable health checks and jobs | stale dashboards/services |
Minimum inventory fields:
- service/container name
- repo/Compose file
- host port and bind address
- container port
- Caddy hostname/path, if any
- intended audience
- authentication/control plane
- classification
- owner/approver
- rollback command
- post-change health check
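A first draft of the inventory's bind-address and port columns can be generated rather than typed by hand. The sketch below is only a starting point under stated assumptions: it expects `ss -ltnH` output on stdin, the function name `listeners_to_rows` is hypothetical, and the `needs-classification` hint is a placeholder label, not one of the classes in the table above.

```bash
#!/usr/bin/env bash
# Sketch: convert `ss -ltnH` output (stdin) into tab-separated rows:
#   bind address <TAB> port <TAB> exposure hint
# The hint is a review aid only; a human still assigns the final class.
listeners_to_rows() {
  awk '{
    local_addr = $4                          # e.g. 0.0.0.0:22 or [::]:80
    n = match(local_addr, /:[0-9]+$/)        # position of the trailing :port
    if (n == 0) next                         # skip lines without a numeric port
    addr = substr(local_addr, 1, n - 1)
    port = substr(local_addr, n + 1)
    hint = (addr ~ /^127\./ || addr == "[::1]") ? "loopback-only" : "needs-classification"
    printf "%s\t%s\t%s\n", addr, port, hint
  }' | sort -k2,2n -k1,1                     # order by port, then address
}
```

On the VM this would be run as `ss -ltnH | listeners_to_rows`; the remaining fields (owner, rollback command, health check) still have to be filled in per service.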
## Evidence Snapshot
Collected on 2026-05-27 from this VM.
@ -121,6 +185,16 @@ Effective `sshd -T` settings showed:
- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
**Acceptance criteria:**
- `docs/vm-exposure-inventory.md` lists every `ss -ltnp` listener and every Docker published port.
- Every non-SSH direct public bind has an approved classification.
- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in `DOCKER-USER`.
- External probe confirms non-approved ports are closed from the internet.
- Caddy-routed public hostnames still pass smoke checks.
**Rollback:** keep a saved copy of original Compose files and `iptables-save` output; rollback means restoring original port mappings or flushing only the newly added `DOCKER-USER` rules.
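One way to keep the `DOCKER-USER` rollback narrow is to generate each rule and its exact inverse from the same function, so that removal never requires flushing the chain. The sketch below prints commands for review instead of executing them. The interface name `eth0`, the comment tag, the function name `guard_rule`, and any port passed in are assumptions for illustration, not values from this VM.

```bash
#!/usr/bin/env bash
# Sketch only: prints iptables commands for review; it does not run them.
# EXT_IF and TAG are hypothetical placeholders and must be set per host.
EXT_IF="${EXT_IF:-eth0}"
TAG="${TAG:-vm-exposure-guard}"

# guard_rule I <port> prints the insert command.
# guard_rule D <port> prints the matching delete for rollback.
guard_rule() {
  local action="$1" port="$2"
  # --ctorigdstport matches the host port the client originally dialed,
  # which still identifies the flow after Docker's DNAT rewrites the destination.
  printf 'iptables -%s DOCKER-USER -i %s -p tcp -m conntrack --ctorigdstport %s -m comment --comment %s -j DROP\n' \
    "$action" "$EXT_IF" "$port" "$TAG"
}
```

A typical sequence would be to save `iptables-save` output to a file first, review the output of `guard_rule I <port>` before running it as root, and record the output of `guard_rule D <port>` in the ticket as the rollback line.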
### P0 — SSH permits root login and password authentication
**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
@ -138,6 +212,16 @@ Effective `sshd -T` settings showed:
- [ ] Validate with a second session before restarting SSH.
- [ ] Record rollback commands and keep console/provider access available during rollout.
**Acceptance criteria:**
- A non-root sudo admin user can log in with SSH key auth.
- Root password login no longer works.
- Existing automation using `scripts/VMs/HostingerVM/login.sh` still works or is updated.
- `sshd -T` confirms the intended effective config.
- `fail2ban-client status sshd` still reports an active jail.
**Rollback:** use the provider console or a still-open root session to restore the previous `sshd_config` drop-in and restart `ssh`.
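The `sshd -T` acceptance check can be scripted so that it fails loudly instead of relying on a visual scan. This is a minimal sketch, assuming the target policy is "no password authentication and no root password login"; the function name `check_sshd_effective` is hypothetical, and both `without-password` and `prohibit-password` are accepted because different OpenSSH versions may print either spelling.

```bash
#!/usr/bin/env bash
# Sketch: read `sshd -T` output on stdin and exit non-zero if root password
# login or password authentication is still allowed.
# The accepted values are assumptions; align them with the approved SSH policy.
check_sshd_effective() {
  awk '
    $1 == "permitrootlogin"        { root = $2 }
    $1 == "passwordauthentication" { pass = $2 }
    END {
      bad = 0
      if (root != "no" && root != "prohibit-password" && root != "without-password") {
        print "permitrootlogin is " root; bad = 1
      }
      if (pass != "no") { print "passwordauthentication is " pass; bad = 1 }
      exit bad
    }'
}
```

From the second session, something like `sudo sshd -t && sudo sshd -T | check_sshd_effective` validates the config syntax first and then the effective values, before the first session is closed.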
### P0 — Public/private boundary for dev and internal tooling is unclear
**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
@ -150,6 +234,12 @@ Effective `sshd -T` settings showed:
- [ ] Add security headers/auth checks to public UI health reviews.
- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.
**Acceptance criteria:**
- `ollama`, `llmlab`, `devops`, `admin`, `gitea`, and observability-adjacent routes each have an owner-approved exposure class.
- Public admin-like routes require authentication or an explicit documented exception.
- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.
### P1 — Docker/container hardening is mostly default
**Risk:** Many containers run as the default (root) user with a writable root filesystem and Docker's default capability set, all under rootful Docker. A compromised app gets more host-adjacent leverage than needed.
@ -165,6 +255,8 @@ Effective `sshd -T` settings showed:
- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
- [ ] Enable Docker `live-restore` if it fits maintenance operations.
**Implementation note:** do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with `no-new-privileges`, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
### P1 — Unhealthy containers can normalize broken deployments
**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
@ -177,6 +269,12 @@ Effective `sshd -T` settings showed:
- [ ] Make deployment scripts fail on unhealthy post-deploy state.
- [ ] Update dashboard/observability docs with current service ownership and expected state.
**Acceptance criteria:**
- Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
- Docker health state matches the product's actual serving state.
- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
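The grace-period criterion could be enforced with a small gate at the end of a deploy script. The sketch below is one possible shape, not the existing deployment tooling: `unhealthy_names` and `wait_for_healthy` are hypothetical names, and the gate only looks at containers Docker reports as `(unhealthy)`, so containers without a healthcheck are not covered.

```bash
#!/usr/bin/env bash
# Sketch: post-deploy health gate.

# Reads `docker ps --format '{{.Names}}\t{{.Status}}'` output on stdin and
# prints the names of containers whose status contains "(unhealthy)".
unhealthy_names() {
  awk -F '\t' '$2 ~ /\(unhealthy\)/ { print $1 }'
}

# Polls until no container is unhealthy, or fails once the grace period is used up.
# Usage: wait_for_healthy <grace_seconds> [poll_interval_seconds]
wait_for_healthy() {
  local grace_seconds="$1" interval="${2:-5}" waited=0 bad
  while :; do
    bad="$(docker ps --format '{{.Names}}\t{{.Status}}' | unhealthy_names)"
    [ -z "$bad" ] && return 0
    if [ "$waited" -ge "$grace_seconds" ]; then
      printf 'still unhealthy after %ss:\n%s\n' "$grace_seconds" "$bad" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

A deploy script would end with a call such as `wait_for_healthy 120 || exit 1`, where the 120-second grace period is an illustrative value to be tuned per service.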
### P1 — Gitea Actions runner is enabled but inactive
**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
@ -189,6 +287,8 @@ Effective `sshd -T` settings showed:
- [ ] Keep runner secrets separate from smoke/test workflows.
- [ ] Add a runner-health check to VM observability.
**Decision needed:** the runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.
### P1 — Backup/restore evidence is split and one backup unit is failed
**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
@ -201,6 +301,12 @@ Effective `sshd -T` settings showed:
- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
- [ ] Add backup freshness and restore-drill status to the monthly VM review.
**Acceptance criteria:**
- `systemctl --failed` no longer includes backup units unless the failure is intentionally documented.
- Backup status shows source, destination, cadence, last success, and restore command.
- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
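Backup freshness for the monthly review can be reduced to a single yes/no check. The sketch below assumes backups land as regular files under one directory; the function name `backup_is_fresh`, the directory, and the age threshold are hypothetical and would come from the backup status record.

```bash
#!/usr/bin/env bash
# Sketch: succeed if <dir> contains at least one regular file modified
# within the last <max_age_minutes> minutes.
# Usage: backup_is_fresh <dir> <max_age_minutes>
backup_is_fresh() {
  local dir="$1" max_age_minutes="$2" newest
  # -mmin -N selects files modified less than N minutes ago.
  newest="$(find "$dir" -type f -mmin "-$max_age_minutes" 2>/dev/null | head -n 1)"
  [ -n "$newest" ]
}
```

This only proves that a recent file exists. It says nothing about whether the backup restores, which is why the restore-drill artifact remains a separate criterion.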
### P1 — Patch management has pending security/runtime updates
**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
@ -212,6 +318,12 @@ Effective `sshd -T` settings showed:
- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
- [ ] Verify apps after Docker/containerd upgrades.
**Acceptance criteria:**
- Security updates and Docker/runtime updates are tracked separately.
- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
- Reboot requirement is checked and scheduled rather than discovered accidentally.
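On Ubuntu, a pending reboot is normally signalled by the flag file `/var/run/reboot-required`, with the triggering packages listed in `/var/run/reboot-required.pkgs`. A minimal sketch of a check that could feed the maintenance plan follows; the function name `reboot_status` and its optional path argument are assumptions added for testability.

```bash
#!/usr/bin/env bash
# Sketch: report whether the host has flagged a pending reboot.
# Usage: reboot_status [flag_file]   (defaults to the Ubuntu flag file)
reboot_status() {
  local flag="${1:-/var/run/reboot-required}"
  if [ -f "$flag" ]; then
    echo "reboot required"
    # The .pkgs file, when present, lists the packages that requested the reboot.
    if [ -f "$flag.pkgs" ]; then
      sort -u "$flag.pkgs"
    fi
  else
    echo "no reboot required"
  fi
}
```

Running `reboot_status` before and after applying updates makes the reboot a scheduled item rather than a surprise.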
### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
**Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
@ -244,6 +356,12 @@ Effective `sshd -T` settings showed:
- [ ] Add owner, purpose, expected output, and alert channel for every job.
- [ ] Add a stale-job detector for missing script paths and failed systemd units.
**Acceptance criteria:**
- No active cron/systemd job references a missing path.
- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
- Stale path detection runs in the monthly VM review.
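Stale path detection can start as a simple filter over crontab text. The sketch below is deliberately narrow and makes several assumptions: it only recognizes unquoted absolute paths ending in `.sh` or `.py`, it ignores comment lines, and the function name `missing_cron_paths` is hypothetical. Jobs that invoke scripts through relative paths, quoting, or wrappers would need a fuller parser.

```bash
#!/usr/bin/env bash
# Sketch: read crontab-style lines on stdin and print referenced script paths
# that do not exist on this host.
missing_cron_paths() {
  grep -v '^[[:space:]]*#' \
    | tr -s '[:space:]' '\n' \
    | grep -E '^/[A-Za-z0-9_./-]+\.(sh|py)$' \
    | sort -u \
    | while IFS= read -r path; do
        # Report only paths that are missing.
        [ -e "$path" ] || printf '%s\n' "$path"
      done
}
```

For a user crontab this could be fed with `crontab -l | missing_cron_paths`; any output is a candidate for the stale-job list.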
### P2 — Observability exists but needs security-focused SLOs
**Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
@ -263,6 +381,8 @@ Effective `sshd -T` settings showed:
- [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
- [ ] Review with S before changing public access for customer/user-facing apps.
**Exit criteria:** the inventory is reviewed and every P0 change has a rollback line and validation line.
### Phase 1 — Immediate security hardening
- [ ] Close or loopback-bind non-public Docker host ports.
@ -270,6 +390,8 @@ Effective `sshd -T` settings showed:
- [ ] Harden SSH root/password access after key-based access is verified.
- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
**Exit criteria:** only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.
### Phase 2 — Operational correctness
- [ ] Fix/retire unhealthy containers.
@ -278,6 +400,8 @@ Effective `sshd -T` settings showed:
- [ ] Remove stale cron paths and add missing-script checks.
- [ ] Apply pending security/runtime updates in a maintenance window.
**Exit criteria:** no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.
### Phase 3 — Docker and app hardening
- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
@ -299,6 +423,41 @@ Effective `sshd -T` settings showed:
- [ ] OS lifecycle/EOL tracker.
- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
## Change Tickets With Quality Gates
Use this shape for each implementation PR/commit:
```text
Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:
```
Minimum post-checks for Phase 1:
- `ss -ltnp`
- `docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'`
- `iptables -S DOCKER-USER`
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`
- public smoke checks for approved hostnames
- negative external probe for blocked ports
- `sshd -T` after SSH changes
- `systemctl --failed --no-pager`
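Several of these post-checks reduce to "nothing is listening that the inventory does not approve". A minimal drift check along those lines is sketched below. It assumes an approved-ports file with one port number per line, which is a hypothetical format rather than the layout of `docs/vm-exposure-inventory.md`, and it only covers host listeners as seen from inside the VM, so it complements the external negative probe rather than replacing it.

```bash
#!/usr/bin/env bash
# Sketch: print non-loopback listening ports that are missing from an
# approved-ports file (one port number per line).
# Usage: ss -ltnH | unapproved_ports <approved_file>
unapproved_ports() {
  local approved_file="$1"
  # Drop loopback listeners, then keep only the port after the last colon.
  awk '$4 !~ /^(127\.|\[::1\])/ { n = split($4, parts, ":"); print parts[n] }' \
    | sort -un \
    | { grep -vxF -f "$approved_file" || true; }
}
```

Empty output means no drift was detected. Any port printed should either be added to the inventory with an approved class or closed.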
## Do Not Start With
- Rootless Docker migration.
- Broad `iptables` default-drop without an allowlist.
- Mass Compose rewrites across all products.
- SSH password/root lockout before key-based sudo and rollback are proven.
- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
- Publishing secret-scan output that contains secrets.
## Suggested First Tickets
1. **P0: Build and review exposure inventory** — produce an exact approved/blocked list for all currently bound ports.
@ -309,6 +468,16 @@ Effective `sshd -T` settings showed:
6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.
**Recommended first implementation order:**
1. Generate and review `docs/vm-exposure-inventory.md`.
2. Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
3. Harden SSH with second-session/provider-console safety.
4. Move obvious internal-only Docker ports to loopback/internal bindings.
5. Add `DOCKER-USER` guardrails after the allowlist is proven.
This order improves safety without letting the port exposure issue linger too long.
## Verification Commands for Future Runs
```bash