# ByteLyst VM Security Blind Spots Roadmap
**Review date:** 2026-05-27

**Reviewer:** Hermes Agent

**Scope:** Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.
## Executive Summary
The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.
The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on `0.0.0.0` / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.
Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
## Implementation Readiness Assessment
**Roadmap quality score:** 86%
**Implementation confidence before remediation starts:** 74%
**Why not higher yet:** the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.
**Confidence after Phase 0 is complete:** expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.
**Quality strengths:**
- Evidence is concrete and command-derived rather than speculative.
- The highest-risk items are correctly prioritized as P0.
- The roadmap separates discovery from disruptive remediation.
- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.
**Quality gaps to close before implementation:**
- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
- Add an approved exposure inventory before changing Docker bindings or `DOCKER-USER`.
- Record a tested SSH rollback path and keep an active second session/provider console open before changing `sshd`.
- Define what is intentionally public, private, internal-only, or deprecated for each service.
- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.
## Implementation Guardrails
These rules apply before any Phase 1 change:
- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
- Do not add broad `DROP` rules before an allowlist is committed to the inventory.
- Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
- Record exact rollback commands next to every change ticket.
- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.
## Exposure Classification Model
Every listening port and Caddy hostname should be classified before changes:
| Class | Meaning | Expected Controls | Examples To Review |
| --- | --- | --- | --- |
| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
| `docker-internal` | Container-to-container only | no host port mapping | databases, emulators, private workers |
| `retire` | Unused/deprecated | remove service/port, disable health checks and jobs | stale dashboards/services |

Minimum inventory fields:
- service/container name
- repo/Compose file
- host port and bind address
- container port
- Caddy hostname/path, if any
- intended audience
- authentication/control plane
- classification
- owner/approver
- rollback command
- post-change health check
## Evidence Snapshot
Collected on 2026-05-27 from this VM.
### Host and patching
- Host: `srv1491630`
- OS: Ubuntu `25.10`
- Kernel: `6.17.0-29-generic`
- Uptime: about 14 hours at review time
- Root filesystem: 193G total, 71G used, 123G available, 37% used
- Memory: 15Gi total, about 10Gi available
- Swap: 4.0G total, about 1.3G used
- Reboot required: no
- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for `libgcrypt20`, `libcaca0`, and `libssh2-1t64`
- Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent
### Network and ingress
- UFW: active; default deny incoming; only `22/tcp` allowed by UFW rules
- Docker iptables rules are present and publish many ports despite UFW's simple rule list
- Public/listening TCP ports bound on all interfaces included:
  - `22`, `80`, `443`
  - app/frontend ports: `3000`, `3002`, `3003`, `3030`, `3035`, `3040`, `3049`, `3050`, `3055`, `3060`, `3070`, `3075`, `3085`
  - backend/API ports: `4004`, `4010`, `4011`, `4012`, `4013`, `4014`, `4015`, `4016`, `4017`, `4019`, `4020`, `4025`
  - infra/dev ports: `1025`, `1234`, `3100`, `3300`, `8025`, `8081`, `10000`, `11434`
- Caddy source-of-truth config: `/opt/bytelyst/Caddyfile`, mounted read-only into the `caddy` container
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`: valid config, formatting warning only
- Caddy public hostnames include:
  - `api.bytelyst.com`
  - `gitea.bytelyst.com`
  - `admin.bytelyst.com`
  - `devops.bytelyst.com`
  - `tracker.bytelyst.com`
  - `llmlab.bytelyst.com`
  - `ollama.bytelyst.com`
  - `trading-api.bytelyst.com`
  - `invttrdg.bytelyst.com`
  - `notes.bytelyst.com`
  - `clock.bytelyst.com`
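
The all-interface bind list above can be reproduced from `ss` output. The following is a minimal sketch, assuming `ss -ltnH` prints the local `address:port` in the fourth column; the function name `wildcard_ports` is hypothetical and not part of any existing ByteLyst script.

```shell
# Read `ss -ltnH` output on stdin and print the unique TCP ports that are
# bound on all interfaces (0.0.0.0, [::], or *), sorted numerically.
wildcard_ports() {
  awk '
    NF >= 4 {
      addr = $4
      n = split(addr, parts, ":")
      port = parts[n]
      # Everything before the final ":port" is the bind host.
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if (host == "0.0.0.0" || host == "[::]" || host == "*") print port
    }
  ' | sort -n | uniq
}
```

A typical call would be `ss -ltnH | wildcard_ports`. Loopback and interface-scoped binds such as `127.0.0.53%lo:53` are deliberately left out, because they are not reachable from other hosts.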
### SSH and account surface
Effective `sshd -T` settings showed:
- `permitrootlogin yes`
- `passwordauthentication yes`
- `pubkeyauthentication yes`
- `kbdinteractiveauthentication no`
- `maxauthtries 6`
- `x11forwarding yes`
- `clientaliveinterval 0`
`fail2ban` is active with one jail: `sshd`; no current bans at review time.
### Docker runtime and containers
- Docker: client/server `29.4.2`; newer Docker packages are available
- Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; `live_restore=false`
- Most product containers run with writable root filesystems and no explicit `user` configured
- `cadvisor` is privileged
- `DOCKER-USER` chain appears empty, so there is no central Docker firewall policy in front of published containers
- Multiple containers are unhealthy:
  - `learning_ai_common_plat-llmlab-dashboard-1`
  - `learning_ai_common_plat-actiontrail-web-1`
  - `learning_ai_common_plat-jarvisjr-web-1`
  - `learning_ai_common_plat-localmemgpt-web-1`
  - `learning_ai_common_plat-nomgap-web-1`
  - `learning_ai_common_plat-flowmonk-web-1`
  - `learning_ai_common_plat-mindlyst-web-1`
### Gitea and CI
- Gitea public route: `https://gitea.bytelyst.com`
- Local Gitea container port: host `3300` -> container `3000`, bound on `0.0.0.0` and IPv6
- `gitea-act-runner.service`: enabled but inactive/dead
- Runner user exists: `gitea-runner`, member of `docker`
- Runner config directory permissions look reasonable:
  - `/home/gitea-runner/act_runner`: `750`, owned by `gitea-runner:gitea-runner`
  - `/home/gitea-runner/act_runner/config.yaml`: `600`, owned by `gitea-runner:gitea-runner`
### Backup and operations
- `systemctl --failed` showed failed unit:
  - `hermes-root-backup.service` (`Sync root Hermes persistent backup to GitHub`)
- Hermes cron backup is active and healthy:
  - job `470832621b43`, `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, last run `ok`
- Existing VM maintenance cron entries exist for health check and cleanup under `/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/`
- A root crontab entry still references `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh`, which may be stale after repo relocation/renaming
## Blind Spots and Risk Register
### P0 — Internet-exposed Docker ports bypass the intended ingress model
**Risk:** UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.
**Examples observed:** `3300`, `8025`, `1025`, `1234`, `8081`, `10000`, `11434`, many `30xx` web ports, and many `40xx` backend ports.
**Impact:** Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.
**Roadmap:**
- [ ] Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
- [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
- [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
- [ ] Add a `DOCKER-USER` chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
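
One detail matters for the `DOCKER-USER` item: Docker rewrites the destination of a published port before the filter chains see the packet, so a plain `--dport` match in `DOCKER-USER` would match the container port, not the host port. Docker's packet-filtering documentation matches the original destination port through conntrack instead. The sketch below only prints candidate rules for review and applies nothing. The function name `docker_user_drop_rules` and the interface name `eth0` are assumptions, and IPv6 would need equivalent `ip6tables` rules. In line with the guardrails, the printed rules should be checked against the approved inventory before anyone runs them.

```shell
# Print (do NOT apply) DOCKER-USER rules that drop connections arriving on an
# external interface whose ORIGINAL destination port is one of the given
# published host ports.
docker_user_drop_rules() {
  ext_if=$1   # external interface name (assumption: confirm with `ip -br addr`)
  shift
  for port in "$@"; do
    printf 'iptables -I DOCKER-USER -i %s -p tcp -m conntrack --ctorigdstport %s --ctdir ORIGINAL -j DROP\n' \
      "$ext_if" "$port"
  done
}

# Example (prints two rules for review):
#   docker_user_drop_rules eth0 8025 11434
```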
**Acceptance criteria:**
- `docs/vm-exposure-inventory.md` lists every `ss -ltnp` listener and every Docker published port.
- Every non-SSH direct public bind has an approved classification.
- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in `DOCKER-USER`.
- External probe confirms non-approved ports are closed from the internet.
- Caddy-routed public hostnames still pass smoke checks.
**Rollback:** keep a saved copy of original Compose files and `iptables-save` output; rollback means restoring original port mappings or flushing only the newly added `DOCKER-USER` rules.
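
The recurring inventory comparison from the roadmap could look roughly like this. It is a sketch under stated assumptions: the allowlist is a plain text file with one approved port per line and optional `#` comments, and both the file format and the function name `unapproved_ports` are hypothetical.

```shell
# Usage: <observed ports, one per line> | unapproved_ports ALLOWLIST_FILE
# Prints every observed port that is not present in the allowlist.
unapproved_ports() {
  approved_file=$1
  awk -v allow="$approved_file" '
    BEGIN {
      # Load the allowlist, ignoring comments and whitespace.
      while ((getline line < allow) > 0) {
        sub(/#.*/, "", line)
        gsub(/[ \t]/, "", line)
        if (line != "") ok[line] = 1
      }
    }
    {
      gsub(/[ \t]/, "")
      if ($0 != "" && !($0 in ok)) print
    }
  '
}
```

Any non-empty output means a listener exists that the inventory has not approved, which makes the function easy to wire into a cron job or alert.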
### P0 — SSH permits root login and password authentication
**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
**Roadmap:**
- [ ] Confirm all required admin users have working SSH keys and sudo access.
- [ ] Add a non-root break-glass admin path if one does not exist.
- [ ] Change SSH effective config to:
  - [ ] `PermitRootLogin prohibit-password` or `no`
  - [ ] `PasswordAuthentication no`
  - [ ] `X11Forwarding no`
  - [ ] lower `MaxAuthTries`, e.g. `3`
  - [ ] set a sane `ClientAliveInterval` / `ClientAliveCountMax`
- [ ] Validate with a second session before restarting SSH.
- [ ] Record rollback commands and keep console/provider access available during rollout.
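
The target settings could be staged as an `sshd_config` drop-in. The sketch below writes the file to a path the caller chooses so it can be reviewed first. The function name, the keepalive values (`300` / `2`), and the suggested file name are assumptions. Because `sshd` uses the first value it reads for each keyword, the drop-in has to sort before any existing file that sets `PasswordAuthentication yes`. After installing it, `sshd -t` checks the syntax and `sshd -T` shows the effective result, both run from a second session.

```shell
# Write a hardening drop-in to the given path (stage it somewhere safe first).
write_ssh_hardening_dropin() {
  target=$1
  cat > "$target" <<'EOF'
# Staged SSH hardening drop-in; review before installing.
PermitRootLogin prohibit-password
PasswordAuthentication no
X11Forwarding no
MaxAuthTries 3
# Keepalive values below are illustrative assumptions.
ClientAliveInterval 300
ClientAliveCountMax 2
EOF
}

# Example (hypothetical file name chosen to sort first):
#   write_ssh_hardening_dropin /etc/ssh/sshd_config.d/00-hardening.conf
```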
**Acceptance criteria:**
- A non-root sudo admin user can log in with SSH key auth.
- Root password login no longer works.
- Existing automation using `scripts/VMs/HostingerVM/login.sh` still works or is updated.
- `sshd -T` confirms the intended effective config.
- `fail2ban-client status sshd` still reports an active jail.
**Rollback:** provider console or still-open root session can restore previous `sshd_config` drop-in and restart `ssh`.
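
The `sshd -T` acceptance check can be automated with a small filter. This is a sketch, not an existing script: the function name `check_sshd_effective` and the `MaxAuthTries` ceiling of `3` are assumptions. It flags settings that are still `yes` and does not compare against one exact value, so that both `PermitRootLogin no` and `prohibit-password` (which some OpenSSH versions print as `without-password`) pass.

```shell
# Read `sshd -T` output on stdin; print findings and return non-zero if any
# of the reviewed settings is still in its risky state.
check_sshd_effective() {
  awk '
    ($1 == "permitrootlogin" || $1 == "passwordauthentication" || $1 == "x11forwarding") && $2 == "yes" {
      print "STILL ENABLED: " $1
      bad = 1
    }
    $1 == "maxauthtries" && $2 + 0 > 3 {
      print "TOO HIGH: maxauthtries " $2
      bad = 1
    }
    END { exit bad + 0 }
  '
}
```

A typical call would be `sshd -T | check_sshd_effective`, run as root after the change and again after any package upgrade that touches OpenSSH.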
### P0 — Public/private boundary for dev and internal tooling is unclear
**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
**Roadmap:**
- [ ] Document public hostnames, auth model, and data sensitivity.
- [ ] Require explicit approval before exposing new dashboards or model endpoints.
- [ ] Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
- [ ] Add security headers/auth checks to public UI health reviews.
- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.
**Acceptance criteria:**
- `ollama`, `llmlab`, `devops`, `admin`, `gitea`, and observability-adjacent routes each have an owner-approved exposure class.
- Public admin-like routes require authentication or an explicit documented exception.
- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.
### P1 — Docker/container hardening is mostly default
**Risk:** Many containers run as the default (root) user with a writable root filesystem and Docker's default capability set, all under a rootful Docker daemon. A compromised app gets more host-adjacent leverage than needed.
**Roadmap:**
- [ ] Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
- [ ] Start with public-facing/backend services and admin dashboards.
- [ ] Add `security_opt: ["no-new-privileges:true"]` where compatible.
- [ ] Add `cap_drop: ["ALL"]` and selectively add back capabilities only when needed.
- [ ] Convert app images to non-root users consistently.
- [ ] Use `read_only: true` plus explicit writable tmp/cache volumes where compatible.
- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
- [ ] Enable Docker `live-restore` if it fits maintenance operations.
**Implementation note:** do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with `no-new-privileges`, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
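
A first pass at the hardening matrix can be derived from the `docker inspect` line format used in the verification commands at the end of this document (`<name> user=... privileged=... readonly=... ports=...`). The sketch below assumes exactly that format; the function name `hardening_gaps` and the gap labels are hypothetical.

```shell
# Read inspect lines on stdin and print one line per container that has at
# least one hardening gap. Containers with no gaps produce no output.
hardening_gaps() {
  awk '
    {
      name = $1
      gaps = ""
      # Empty, 0, or root (optionally with :group) all mean the container runs as root.
      if ($0 ~ / user=((0|root)(:[^ ]*)?)? /) gaps = gaps " root-or-default-user"
      if ($0 ~ / privileged=true /)           gaps = gaps " privileged"
      if ($0 ~ / readonly=false /)            gaps = gaps " writable-rootfs"
      if (gaps != "") print name ":" gaps
    }
  '
}
```

Piping the existing inspect command into `hardening_gaps` gives a quick per-container list. It does not cover capabilities or `no-new-privileges`, which would need extra fields in the inspect format.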
### P1 — Unhealthy containers can normalize broken deployments
**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
**Roadmap:**
- [ ] Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
- [ ] Fix or remove bad healthchecks so Docker health state is trustworthy.
- [ ] Add alerting for sustained unhealthy containers.
- [ ] Make deployment scripts fail on unhealthy post-deploy state.
- [ ] Update dashboard/observability docs with current service ownership and expected state.
**Acceptance criteria:**
- Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
- Docker health state matches the product's actual serving state.
- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
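
A post-deploy gate along these lines could enforce the last criterion. It is a minimal sketch that assumes the input comes from `docker ps --format '{{.Names}}\t{{.Status}}'`, where Docker appends `(unhealthy)` to the status text; the function name `unhealthy_containers` is hypothetical.

```shell
# Read tab-separated "name<TAB>status" lines on stdin, print the names of
# unhealthy containers, and return non-zero if there are any.
unhealthy_containers() {
  awk -F '\t' '
    $2 ~ /\(unhealthy\)/ { print $1; bad = 1 }
    END { exit bad + 0 }
  '
}
```

A deploy script would wait out the grace period, then run `docker ps --format '{{.Names}}\t{{.Status}}' | unhealthy_containers` and abort on a non-zero exit. Filtering the input to required containers only is left to the caller.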
### P1 — Gitea Actions runner is enabled but inactive
**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
**Roadmap:**
- [x] Decide whether the runner should be active or intentionally disabled.
- [x] If active: restart and verify `gitea-act-runner.service`, runner labels, and Docker access.
- [ ] Run and record a dedicated Gitea Actions smoke workflow result.
- [ ] If disabled: disable the service and document the intentional state.
- [ ] Keep runner secrets separate from smoke/test workflows.
- [ ] Add a runner-health check to VM observability.
**Decision needed:** runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.
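
The enabled-but-dead condition is easy to detect mechanically. The sketch below classifies the two strings returned by `systemctl is-enabled` and `systemctl is-active`; the function name `runner_state` and the wording of the messages are assumptions.

```shell
# Args: $1 = `systemctl is-enabled` output, $2 = `systemctl is-active` output.
# Prints a verdict and returns non-zero when the state looks unintentional.
runner_state() {
  enabled=$1
  active=$2
  case "$enabled/$active" in
    enabled/active)
      echo "ok: running as intended" ;;
    enabled/*)
      echo "drift: enabled but $active"; return 1 ;;
    disabled/active)
      echo "drift: running but disabled at boot"; return 1 ;;
    disabled/inactive|masked/inactive)
      echo "ok: intentionally off" ;;
    *)
      echo "unknown: $enabled/$active"; return 2 ;;
  esac
}
```

For the runner this would be called as `runner_state "$(systemctl is-enabled gitea-act-runner.service)" "$(systemctl is-active gitea-act-runner.service)"`.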
### P1 — Backup/restore evidence is split and one backup unit is failed
**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
**Roadmap:**
- [x] Inspect `hermes-root-backup.service` logs and decide whether to fix, disable, or replace it with the cron-backed job.
- [x] Repair the root backup checkout divergence and verify a successful `hermes-root-backup.service` one-shot run.
- [x] Update `/root/.hermes/scripts/sync_hermes_persistent_backup.py` so future generated-backup divergence preserves a safety branch and rejoins the remote backup stream instead of wedging on `git pull --ff-only`.
- [ ] Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
- [ ] Run a restore drill into a non-production path/profile.
- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
- [ ] Add backup freshness and restore-drill status to the monthly VM review.
**Acceptance criteria:**
- `systemctl --failed` no longer includes backup units unless the failure is intentionally documented.
- Backup status shows source, destination, cadence, last success, and restore command.
- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
### P1 — Patch management has pending security/runtime updates
**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
**Roadmap:**
- [ ] Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
- [ ] Define a Docker upgrade maintenance window with pre/post checks.
- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
- [ ] Verify apps after Docker/containerd upgrades.
**Acceptance criteria:**
- Security updates and Docker/runtime updates are tracked separately.
- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
- Reboot requirement is checked and scheduled rather than discovered accidentally.
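
Separating the two update classes could be done with a filter over `apt list --upgradable`. This sketch assumes each package line has the shape `name/suite[,suite...] version arch [...]`, that security updates come from a suite whose name contains `-security`, and that Docker runtime packages start with `docker-` or `containerd`. The function name `classify_upgrades` is hypothetical.

```shell
# Read `apt list --upgradable` output on stdin and print "class<TAB>package",
# where class is runtime, security, or other.
classify_upgrades() {
  awk -F '/' '
    NF < 2 { next }   # skip the "Listing..." header and blank lines
    {
      pkg = $1
      if (pkg ~ /^(docker-|containerd)/) class = "runtime"
      else if ($2 ~ /-security/)         class = "security"
      else                               class = "other"
      print class "\t" pkg
    }
  ' | sort
}
```

Running `apt list --upgradable 2>/dev/null | classify_upgrades` gives a short, sorted list that a weekly checkpoint can report without the full apt output.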
### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
**Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
**Roadmap:**
- [ ] Record current Ubuntu 25.10 support/EOL date in ops docs.
- [ ] Decide whether to stay on interim releases or migrate to an LTS baseline.
- [ ] Add an OS lifecycle check to quarterly review.
### P2 — Repository/config secret hygiene needs a repeatable scanner
**Risk:** The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.
**Roadmap:**
- [ ] Add a documented secret-scan command using `gitleaks` or `trufflehog` for tracked files and selected untracked ops directories.
- [ ] Scan historical directories such as `DELETED_bytelyst-devops-tools` separately before archiving or deleting.
- [ ] Add `.gitignore` patterns for generated scans, local account snapshots, and credential-shaped outputs.
- [ ] Keep examples as `.example` files only.
### P2 — Cron/systemd ownership and drift are not fully inventoried
**Risk:** Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.
**Roadmap:**
- [ ] Inventory root/user crontabs, `/etc/cron.d`, systemd timers, Hermes cron, and Gitea Actions schedules.
- [x] Remove or update stale `/opt/bytelyst/bytelyst-devops-tools/...` references after confirming replacements.
- [ ] Add owner, purpose, expected output, and alert channel for every job.
- [ ] Add a stale-job detector for missing script paths and failed systemd units.
**Acceptance criteria:**
- No active cron/systemd job references a missing path.
- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
- Stale path detection runs in the monthly VM review.
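
A stale-path detector for crontabs could start as simply as the sketch below. It is a heuristic under stated assumptions: it only looks at whitespace-separated tokens that are absolute paths ending in `.sh` or `.py`, so quoted paths and other interpreters are missed. The function name `missing_cron_paths` is hypothetical.

```shell
# Read crontab lines on stdin and report referenced script paths that do not
# exist on this host. Commented-out entries are ignored.
missing_cron_paths() {
  grep -v '^[[:space:]]*#' |
    tr -s '[:space:]' '\n' |
    grep -E '^/.*\.(sh|py)$' |
    sort -u |
    while IFS= read -r path; do
      [ -e "$path" ] || printf 'MISSING %s\n' "$path"
    done
}
```

It can be fed from `crontab -l` or from `cat /etc/cron.d/*`. The stale Lucky25 monitor path recorded in the evidence snapshot is the kind of entry it is meant to surface.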
### P2 — Observability exists but needs security-focused SLOs
**Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
**Roadmap:**
- [ ] Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
- [ ] Validate alert delivery to Telegram.
- [ ] Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.
## Execution Plan
### Phase 0 — Freeze and inventory before changes
- [ ] Freeze new public hostnames/ports until the exposure inventory is complete.
- [x] Generate `docs/vm-exposure-inventory.md` from Docker, Caddy, `ss`, and DNS.
- [ ] Assign each exposed service a class from the Exposure Classification Model: `public-caddy`, `public-direct`, `private-admin`, `loopback-only`, `docker-internal`, or `retire`.
- [ ] Review with S before changing public access for customer/user-facing apps.
**Exit criteria:** the inventory is reviewed and every P0 change has a rollback line and validation line.
### Phase 1 — Immediate security hardening
- [ ] Close or loopback-bind non-public Docker host ports.
- [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
- [ ] Harden SSH root/password access after key-based access is verified.
- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
**Exit criteria:** only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.
### Phase 2 — Operational correctness
- [ ] Fix/retire unhealthy containers.
- [x] Resolve `hermes-root-backup.service` failed state.
- [x] Decide and document Gitea runner active/disabled state.
- [ ] Add missing-script checks. Stale root cron path was fixed on 2026-05-27.
- [ ] Apply pending security/runtime updates in a maintenance window.
**Exit criteria:** no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.
### Phase 3 — Docker and app hardening
- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
- [ ] Add resource limits for noisy services and emulators.
- [ ] Move emulators/dev tools off public bindings.
- [ ] Review cAdvisor privilege and observability surface.
### Phase 4 — Backup, restore, and incident readiness
- [ ] Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
- [ ] Perform restore drill to non-prod target.
- [ ] Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
- [ ] Add quarterly tabletop review.
### Phase 5 — Continuous governance
- [ ] Monthly VM security review cron/checklist.
- [ ] Secret scan before DevOps repo pushes.
- [ ] OS lifecycle/EOL tracker.
- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
## Change Tickets With Quality Gates
Use this shape for each implementation PR/commit:
```text
Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:
```
Minimum post-checks for Phase 1:
- `ss -ltnp`
- `docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'`
- `iptables -S DOCKER-USER`
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`
- public smoke checks for approved hostnames
- negative external probe for blocked ports
- `sshd -T` after SSH changes
- `systemctl --failed --no-pager`
## Implementation Log
### 2026-05-27 — Phase 2 backup and cron drift
**Changed:**
- Repointed the root Lucky25 monitor cron from `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh` to `/opt/bytelyst/learning_ai_devops_tools/scripts/monitor-lucky25-execution.sh`.
- Saved the pre-change root crontab at `/tmp/root-crontab-before-vm-security-20260527.txt`.
- Repaired `/root/repos/bytelyst_hostinger_hermes_vm`, which was `ahead 1, behind 11`. The obsolete local generated backup commit conflicted with newer remote snapshots, so it was skipped during the rebase, which preserved the current remote stream.
- Patched `/root/.hermes/scripts/sync_hermes_persistent_backup.py` to replace unconditional `git pull --ff-only` with explicit fetch/merge-base handling. Diverged generated snapshots now create a safety branch before attempting rebase and fall back to `origin/<branch>` if the generated files conflict.
- Saved the pre-change backup script at `/tmp/sync_hermes_persistent_backup.py.before-vm-security-20260527`.
**Verified:**
- `crontab -l` now points the Lucky25 monitor at the current repo script.
- `python3 -m py_compile /tmp/sync_hermes_persistent_backup.py` passed before deployment.
- `systemctl start hermes-root-backup.service` succeeded twice after repair.
- `systemctl status hermes-root-backup.service hermes-root-backup.timer --no-pager` showed the service exited `status=0/SUCCESS` and the timer remains active.
- `/root/repos/bytelyst_hostinger_hermes_vm` is aligned with `origin/main` after successful backup commits `415e824` and `369e584`.
**Residual risk:**
- A restore drill is still required before the backup posture should be considered fully proven.
- The backup sync script is runtime-managed under `/root/.hermes/scripts/`; add a tracked installer or source-of-truth copy so this hardening does not depend on manual VM state.
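
The divergence handling described under **Changed** starts from knowing how the local checkout relates to the remote. The sketch below shows only that decision step in shell. It is not the contents of `sync_hermes_persistent_backup.py`, and the function name `sync_action` is hypothetical. The two counts are the ahead and behind numbers that `git rev-list --left-right --count HEAD...origin/<branch>` prints.

```shell
# Args: $1 = commits only on the local branch (ahead),
#       $2 = commits only on the remote branch (behind).
sync_action() {
  ahead=$1
  behind=$2
  if [ "$ahead" -eq 0 ] && [ "$behind" -eq 0 ]; then
    echo "in-sync"
  elif [ "$ahead" -eq 0 ]; then
    echo "fast-forward"   # safe to pull with --ff-only
  elif [ "$behind" -eq 0 ]; then
    echo "push"           # local snapshots only need pushing
  else
    echo "diverged"       # keep a safety branch before rebasing or resetting
  fi
}
```

The repaired checkout's state, `ahead 1, behind 11`, falls into the `diverged` case, which is the one an unconditional `git pull --ff-only` cannot resolve.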
### 2026-05-27 — Phase 2 Gitea runner state
**Changed:**
- Started `gitea-act-runner.service`; it was enabled but inactive.
- Treated the intended state as active because the service unit is enabled, historical journal entries show successful task execution, and the runner declared itself successfully after the restart.
**Verified:**
- `systemctl is-active gitea-act-runner.service` returned `active`.
- `systemctl status gitea-act-runner.service --no-pager` showed `bytelyst-host-runner` running as `gitea-runner`.
- Runner labels declared successfully: `ubuntu-latest`, `linux`, `bytelyst`, `hostinger`.
- Runner config uses Docker executor images and `privileged: false`; Docker socket access is granted through the `docker` group.
- Runner immediately picked up task `42` for `bytelyst/bytelyst-devops-tools`, proving it can talk to local Gitea.
**Residual risk:**
- Record a small dedicated smoke workflow that does not need production secrets, so runner health is proven by a controlled workflow rather than incidental queued work.
- Add runner health to VM observability so enabled-but-inactive drift is caught automatically.
## Do Not Start With
- Rootless Docker migration.
- Broad `iptables` default-drop without an allowlist.
- Mass Compose rewrites across all products.
- SSH password/root lockout before key-based sudo and rollback are proven.
- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
- Publishing secret-scan output that contains secrets.
## Suggested First Tickets
1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
2. **P0: Lock Docker-published non-public ports** — bind to loopback/internal or enforce `DOCKER-USER` drops.
3. **P0: Harden SSH** — disable password/root login after confirming key-based admin access.
4. **P1: Triage unhealthy containers** — fix healthchecks/apps or retire dead services.
5. **P1: Resolve failed Hermes backup unit** — fix or disable duplicate failed unit; keep cron backup healthy.
6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.
**Recommended first implementation order:**
1. Generate and review `docs/vm-exposure-inventory.md`.
2. Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
3. Harden SSH with second-session/provider-console safety.
4. Move obvious internal-only Docker ports to loopback/internal bindings.
5. Add `DOCKER-USER` guardrails after the allowlist is proven.
This order improves safety without letting the port exposure issue linger too long.
## Verification Commands for Future Runs
```bash
# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required
# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup
# SSH effective config
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd
# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'
# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile
# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
```
## Notes
- The review itself did not change firewall, SSH, Docker, or Caddy settings; the cron, backup, and runner changes made afterwards are recorded in the Implementation Log. Potentially disruptive security changes are intentionally deferred until the risk and remediation order are documented.
- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.