bytelyst-devops-tools/docs/vm-security-blind-spots-roadmap.md
Hermes VM 313a775fa0
docs: strengthen VM security roadmap gates
2026-05-27 20:34:37 +00:00


ByteLyst VM Security Blind Spots Roadmap

Review date: 2026-05-27
Reviewer: Hermes Agent
Scope: Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.

Executive Summary

The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.

The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on 0.0.0.0 / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.

Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.

Implementation Readiness Assessment

Roadmap quality score: 86%

Implementation confidence before remediation starts: 74%

Why not higher yet: the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.

Confidence after Phase 0 is complete: expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.

Quality strengths:

  • Evidence is concrete and command-derived rather than speculative.
  • The highest-risk items are correctly prioritized as P0.
  • The roadmap separates discovery from disruptive remediation.
  • It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.

Quality gaps to close before implementation:

  • Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
  • Add an approved exposure inventory before changing Docker bindings or DOCKER-USER.
  • Record a tested SSH rollback path and keep an active second session/provider console open before changing sshd.
  • Define what is intentionally public, private, internal-only, or deprecated for each service.
  • Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.

Implementation Guardrails

These rules apply before any Phase 1 change:

  • Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
  • Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
  • Do not add broad DROP rules before an allowlist is committed to the inventory.
  • Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
  • Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
  • Record exact rollback commands next to every change ticket.
  • Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.

Exposure Classification Model

Every listening port and Caddy hostname should be classified before changes:

| Class | Meaning | Expected Controls | Examples To Review |
| --- | --- | --- | --- |
| public-caddy | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
| public-direct | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
| private-admin | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
| loopback-only | Host-local service used by Caddy or local automation | Bind 127.0.0.1:port, no external bind | internal APIs behind Caddy |
| docker-internal | Container-to-container only | No host port mapping | databases, emulators, private workers |
| retire | Unused/deprecated | Remove service/port, disable health checks and jobs | stale dashboards/services |

Minimum inventory fields:

  • service/container name
  • repo/Compose file
  • host port and bind address
  • container port
  • Caddy hostname/path, if any
  • intended audience
  • authentication/control plane
  • classification
  • owner/approver
  • rollback command
  • post-change health check
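A first draft of the port-related inventory fields can be derived from Docker's own port listing. The sketch below is one way to do that, assuming the `docker ps` column layout shown in the comments. The function name `inventory_rows` and the scope labels are illustrative and not part of the reviewed tooling; owner, classification, rollback, and health-check fields still have to be filled in by hand.

```shell
# Sketch: turn `docker ps` port listings into draft inventory rows.
# Reads lines of "<name><TAB><ports>" on stdin and prints one row per
# published port: name, bind address, host port, container port, scope.
# The scope labels (public/loopback/specific) are illustrative only and
# are not the approved exposure classes.
inventory_rows() {
  awk -F '\t' '
    {
      n = split($2, entries, ", ")
      for (i = 1; i <= n; i++) {
        e = entries[i]
        arrow = index(e, "->")
        if (arrow == 0) continue          # exposed but not published
        left = substr(e, 1, arrow - 1)    # e.g. 0.0.0.0:3300
        right = substr(e, arrow + 2)      # e.g. 3000/tcp
        if (!match(left, /:[^:]*$/)) continue
        bind = substr(left, 1, RSTART - 1)
        hport = substr(left, RSTART + 1)
        if (bind ~ /^127\./ || bind == "[::1]" || bind == "::1") scope = "loopback"
        else if (bind == "0.0.0.0" || bind == "[::]" || bind == "::") scope = "public"
        else scope = "specific"
        printf "%s\t%s\t%s\t%s\t%s\n", $1, bind, hport, right, scope
      }
    }'
}

# Usage on the VM:
#   docker ps --format '{{.Names}}\t{{.Ports}}' | inventory_rows
```

Rows marked `public` are the ones that need an explicit classification decision; `loopback` rows already match the loopback-only class.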

Evidence Snapshot

Collected on 2026-05-27 from this VM.

Host and patching

  • Host: srv1491630
  • OS: Ubuntu 25.10
  • Kernel: 6.17.0-29-generic
  • Uptime: about 14 hours at review time
  • Root filesystem: 193G total, 71G used, 123G available, 37% used
  • Memory: 15Gi total, about 10Gi available
  • Swap: 4.0G total, about 1.3G used
  • Reboot required: no
  • Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for libgcrypt20, libcaca0, and libssh2-1t64
  • Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent

Network and ingress

  • UFW: active; default deny incoming; only 22/tcp allowed by UFW rules
  • Docker iptables rules are present and publish many ports despite UFW's simple rule list
  • Public/listening TCP ports bound on all interfaces included:
    • 22, 80, 443
    • app/frontend ports: 3000, 3002, 3003, 3030, 3035, 3040, 3049, 3050, 3055, 3060, 3070, 3075, 3085
    • backend/API ports: 4004, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4019, 4020, 4025
    • infra/dev ports: 1025, 1234, 3100, 3300, 8025, 8081, 10000, 11434
  • Caddy source-of-truth config: /opt/bytelyst/Caddyfile, mounted read-only into the caddy container
  • docker exec caddy caddy validate --config /etc/caddy/Caddyfile: valid config, formatting warning only
  • Caddy public hostnames include:
    • api.bytelyst.com
    • gitea.bytelyst.com
    • admin.bytelyst.com
    • devops.bytelyst.com
    • tracker.bytelyst.com
    • llmlab.bytelyst.com
    • ollama.bytelyst.com
    • trading-api.bytelyst.com
    • invttrdg.bytelyst.com
    • notes.bytelyst.com
    • clock.bytelyst.com

SSH and account surface

Effective sshd -T settings showed:

  • permitrootlogin yes
  • passwordauthentication yes
  • pubkeyauthentication yes
  • kbdinteractiveauthentication no
  • maxauthtries 6
  • x11forwarding yes
  • clientaliveinterval 0

fail2ban is active with one jail: sshd; no current bans at review time.

Docker runtime and containers

  • Docker: client/server 29.4.2; newer Docker packages are available
  • Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; live_restore=false
  • Most product containers run with writable root filesystems and no explicit user configured
  • cadvisor is privileged
  • DOCKER-USER chain appears empty, so there is no central Docker firewall policy in front of published containers
  • Multiple containers are unhealthy:
    • learning_ai_common_plat-llmlab-dashboard-1
    • learning_ai_common_plat-actiontrail-web-1
    • learning_ai_common_plat-jarvisjr-web-1
    • learning_ai_common_plat-localmemgpt-web-1
    • learning_ai_common_plat-nomgap-web-1
    • learning_ai_common_plat-flowmonk-web-1
    • learning_ai_common_plat-mindlyst-web-1

Gitea and CI

  • Gitea public route: https://gitea.bytelyst.com
  • Local Gitea container port: host 3300 -> container 3000, bound on 0.0.0.0 and IPv6
  • gitea-act-runner.service: enabled but inactive/dead
  • Runner user exists: gitea-runner, member of docker
  • Runner config directory permissions look reasonable:
    • /home/gitea-runner/act_runner: 750, owned by gitea-runner:gitea-runner
    • /home/gitea-runner/act_runner/config.yaml: 600, owned by gitea-runner:gitea-runner

Backup and operations

  • systemctl --failed showed failed unit:
    • hermes-root-backup.service: Sync root Hermes persistent backup to GitHub
  • Hermes cron backup is active and healthy:
    • job 470832621b43, Sync Hermes persistent-data backup to GitHub, every 30 minutes, last run ok
  • Existing VM maintenance cron entries exist for health check and cleanup under /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/
  • A root crontab entry still references /opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh, which may be stale after repo relocation/renaming

Blind Spots and Risk Register

P0 — Internet-exposed Docker ports bypass the intended ingress model

Risk: UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.

Examples observed: 3300, 8025, 1025, 1234, 8081, 10000, 11434, many 30xx web ports, and many 40xx backend ports.

Impact: Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.

Roadmap:

  • Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
  • For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
  • Bind non-public Compose ports to 127.0.0.1 or remove host port mapping entirely.
  • Add a DOCKER-USER chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
  • Keep only 80/443 and intentionally public SSH exposed at the provider/firewall layer.
  • Add a recurring check that compares ss -ltn and Docker published ports against the approved inventory.

Acceptance criteria:

  • docs/vm-exposure-inventory.md lists every ss -ltnp listener and every Docker published port.
  • Every non-SSH direct public bind has an approved classification.
  • Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in DOCKER-USER.
  • External probe confirms non-approved ports are closed from the internet.
  • Caddy-routed public hostnames still pass smoke checks.

Rollback: keep a saved copy of original Compose files and iptables-save output; rollback means restoring original port mappings or flushing only the newly added DOCKER-USER rules.
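The DOCKER-USER policy could take roughly the following shape. This is a dry-run sketch that only prints commands: the function name `docker_user_rules` is hypothetical, and the external interface name (for example `eth0`) and the allowlist are assumptions that must come from the approved inventory. It matches on the conntrack original destination port because, by the time a packet reaches DOCKER-USER, DNAT has already rewritten the destination to the container port.

```shell
# Sketch: print (not run) DOCKER-USER rules for an approved port allowlist.
# Rules are inserted at explicit positions so the RETURN rules (which hand
# the packet back to Docker's own accept rules) land before the final DROP.
# Assumption: the first argument is the external interface, the remaining
# arguments are the approved published TCP ports.
docker_user_rules() {
  ext_if=$1; shift
  pos=1
  # Keep replies to container-initiated and already-open connections working.
  echo "iptables -I DOCKER-USER $pos -i $ext_if -m conntrack --ctstate ESTABLISHED,RELATED -j RETURN"
  for port in "$@"; do
    pos=$((pos + 1))
    echo "iptables -I DOCKER-USER $pos -i $ext_if -p tcp -m conntrack --ctorigdstport $port -j RETURN"
  done
  pos=$((pos + 1))
  # Everything else arriving on the external interface is dropped.
  echo "iptables -I DOCKER-USER $pos -i $ext_if -j DROP"
}

# Usage: review the output before running any of it.
#   docker_user_rules eth0 80 443
```

Replacing `-I DOCKER-USER <n>` with `-D DOCKER-USER` in a printed line gives the matching delete command, which fits the rollback approach of removing only the newly added rules. IPv6 traffic would need equivalent ip6tables rules, which this sketch does not cover.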

P0 — SSH permits root login and password authentication

Risk: PermitRootLogin yes and PasswordAuthentication yes keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.

Roadmap:

  • Confirm all required admin users have working SSH keys and sudo access.
  • Add a non-root break-glass admin path if one does not exist.
  • Change SSH effective config to:
    • PermitRootLogin prohibit-password or no
    • PasswordAuthentication no
    • X11Forwarding no
    • lower MaxAuthTries, e.g. 3
    • set a sane ClientAliveInterval / ClientAliveCountMax
  • Validate with a second session before restarting SSH.
  • Record rollback commands and keep console/provider access available during rollout.

Acceptance criteria:

  • A non-root sudo admin user can log in with SSH key auth.
  • Root password login no longer works.
  • Existing automation using scripts/VMs/HostingerVM/login.sh still works or is updated.
  • sshd -T confirms the intended effective config.
  • fail2ban-client status sshd still reports an active jail.

Rollback: provider console or still-open root session can restore previous sshd_config drop-in and restart ssh.
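The `sshd -T` acceptance check can be made mechanical. The sketch below compares effective settings against the targets listed in this section and prints whatever still deviates; the function name `sshd_deviations` is hypothetical, and the MaxAuthTries ceiling of 3 simply follows the "e.g. 3" suggestion above. It accepts `without-password` as well, since `sshd -T` may report `prohibit-password` under that older name.

```shell
# Sketch: read `sshd -T` output on stdin and print one "key=value" line
# for every setting that does not yet match the hardened target.
# No output means all checked settings match.
sshd_deviations() {
  awk '
    $1 == "permitrootlogin" && $2 != "no" && $2 != "prohibit-password" && $2 != "without-password" { print $1 "=" $2 }
    $1 == "passwordauthentication" && $2 != "no" { print $1 "=" $2 }
    $1 == "x11forwarding" && $2 != "no" { print $1 "=" $2 }
    $1 == "maxauthtries" && $2 + 0 > 3 { print $1 "=" $2 }
    $1 == "clientaliveinterval" && $2 + 0 == 0 { print $1 "=" $2 }
  '
}

# Usage on the VM (as root):
#   sshd -T | sshd_deviations
```

Run against the settings recorded in the evidence snapshot, this would flag permitrootlogin, passwordauthentication, maxauthtries, x11forwarding, and clientaliveinterval.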

P0 — Public/private boundary for dev and internal tooling is unclear

Risk: Caddy publishes ollama.bytelyst.com, llmlab.bytelyst.com, devops.bytelyst.com, admin.bytelyst.com, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.

Roadmap:

  • Document public hostnames, auth model, and data sensitivity.
  • Require explicit approval before exposing new dashboards or model endpoints.
  • Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
  • Add security headers/auth checks to public UI health reviews.
  • Confirm ollama.bytelyst.com should be publicly reachable at all; if not, move behind private network or auth gate.

Acceptance criteria:

  • ollama, llmlab, devops, admin, gitea, and observability-adjacent routes each have an owner-approved exposure class.
  • Public admin-like routes require authentication or an explicit documented exception.
  • No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.

P1 — Docker/container hardening is mostly default

Risk: Many containers run as the default/root user with a writable root filesystem and Docker's default capability set, under a rootful Docker daemon. A compromised app gets more host-adjacent leverage than needed.

Roadmap:

  • Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
  • Start with public-facing/backend services and admin dashboards.
  • Add security_opt: ["no-new-privileges:true"] where compatible.
  • Add cap_drop: ["ALL"] and selectively add back capabilities only when needed.
  • Convert app images to non-root users consistently.
  • Use read_only: true plus explicit writable tmp/cache volumes where compatible.
  • Review cadvisor privileged mode and replace/restrict if possible.
  • Enable Docker live-restore if it fits maintenance operations.

Implementation note: do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with no-new-privileges, non-root app users where images already support it, and targeted capability drops for public-facing app containers.

P1 — Unhealthy containers can normalize broken deployments

Risk: Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.

Roadmap:

  • Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
  • Fix or remove bad healthchecks so Docker health state is trustworthy.
  • Add alerting for sustained unhealthy containers.
  • Make deployment scripts fail on unhealthy post-deploy state.
  • Update dashboard/observability docs with current service ownership and expected state.

Acceptance criteria:

  • Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
  • Docker health state matches the product's actual serving state.
  • Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
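A post-deploy gate along these lines could enforce the last criterion. It is a minimal sketch: the function names are hypothetical, the 120-second grace period and 10-second poll interval are placeholders rather than values from this review, and it checks every running container instead of a curated list of required ones.

```shell
# Sketch: list containers whose status contains "(unhealthy)".
# Input: lines of "<name><TAB><status>" as produced by
#   docker ps --format '{{.Names}}\t{{.Status}}'
unhealthy_names() {
  awk -F '\t' '$2 ~ /\(unhealthy\)/ { print $1 }'
}

# Poll until no container is unhealthy or the grace period runs out.
# Returns 0 when nothing is unhealthy, 1 otherwise.
wait_for_healthy() {
  grace=${1:-120}   # seconds; placeholder default
  waited=0
  while :; do
    bad=$(docker ps --format '{{.Names}}\t{{.Status}}' | unhealthy_names)
    [ -z "$bad" ] && return 0
    if [ "$waited" -ge "$grace" ]; then
      echo "still unhealthy: $bad" >&2
      return 1
    fi
    sleep 10
    waited=$((waited + 10))
  done
}

# Usage at the end of a deploy script:
#   wait_for_healthy 180 || exit 1
```

Containers still reporting `health: starting` are not treated as failures here, so the grace period should be longer than the slowest healthcheck start period.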

P1 — Gitea Actions runner is enabled but inactive

Risk: CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.

Roadmap:

  • Decide whether the runner should be active or intentionally disabled.
  • If active: restart and verify gitea-act-runner.service, runner labels, Docker access, and a smoke workflow.
  • If disabled: disable the service and document the intentional state.
  • Keep runner secrets separate from smoke/test workflows.
  • Add a runner-health check to VM observability.

Decision needed: runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.

P1 — Backup/restore evidence is split and one backup unit is failed

Risk: Hermes cron backup works, but hermes-root-backup.service is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.

Roadmap:

  • Inspect hermes-root-backup.service logs and decide whether to fix, disable, or replace it with the cron-backed job.
  • Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
  • Run a restore drill into a non-production path/profile.
  • Verify no raw .env, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
  • Add backup freshness and restore-drill status to the monthly VM review.

Acceptance criteria:

  • systemctl --failed no longer includes backup units unless the failure is intentionally documented.
  • Backup status shows source, destination, cadence, last success, and restore command.
  • A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.

P1 — Patch management has pending security/runtime updates

Risk: Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.

Roadmap:

  • Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
  • Define a Docker upgrade maintenance window with pre/post checks.
  • Run apt list --upgradable and capture package classes without dumping noise.
  • Verify apps after Docker/containerd upgrades.

Acceptance criteria:

  • Security updates and Docker/runtime updates are tracked separately.
  • Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
  • Reboot requirement is checked and scheduled rather than discovered accidentally.
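Tracking the two update classes separately can start from the simulated upgrade output. The sketch below buckets `Inst` lines into runtime, security, and other; the function name `classify_upgrades` is hypothetical, and the package-name pattern is an assumption based on the Docker CE/containerd packages named in this review.

```shell
# Sketch: split `apt-get -s upgrade` output (stdin) into three buckets.
#   runtime  = Docker/containerd packages (name pattern is an assumption)
#   security = anything offered from a *-security pocket
#   other    = everything else
# Prints "<bucket><TAB><package>" per pending package.
classify_upgrades() {
  awk '
    $1 != "Inst" { next }
    $2 ~ /^(docker-|containerd)/ { print "runtime\t" $2; next }
    /-security/ { print "security\t" $2; next }
    { print "other\t" $2 }
  '
}

# Usage on the VM:
#   apt-get -s upgrade | classify_upgrades | sort
```

Keeping the runtime bucket separate makes it easier to hold Docker upgrades for a maintenance window while security updates proceed on their normal cadence.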

P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking

Risk: Ubuntu interim (non-LTS) releases such as 25.10 are supported for only nine months. If this VM is long-lived production infrastructure, lifecycle tracking matters.

Roadmap:

  • Record current Ubuntu 25.10 support/EOL date in ops docs.
  • Decide whether to stay on interim releases or migrate to an LTS baseline.
  • Add an OS lifecycle check to quarterly review.

P2 — Repository/config secret hygiene needs a repeatable scanner

Risk: The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.

Roadmap:

  • Add a documented secret-scan command using gitleaks or trufflehog for tracked files and selected untracked ops directories.
  • Scan historical directories such as DELETED_bytelyst-devops-tools separately before archiving or deleting.
  • Add .gitignore patterns for generated scans, local account snapshots, and credential-shaped outputs.
  • Keep examples as .example files only.

P2 — Cron/systemd ownership and drift are not fully inventoried

Risk: Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.

Roadmap:

  • Inventory root/user crontabs, /etc/cron.d, systemd timers, Hermes cron, and Gitea Actions schedules.
  • Remove or update stale /opt/bytelyst/bytelyst-devops-tools/... references after confirming replacements.
  • Add owner, purpose, expected output, and alert channel for every job.
  • Add a stale-job detector for missing script paths and failed systemd units.

Acceptance criteria:

  • No active cron/systemd job references a missing path.
  • Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
  • Stale path detection runs in the monthly VM review.
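A stale-path detector for crontab entries can be quite small. The following is a rough sketch under one simplifying assumption: the first token on a job line that starts with `/` is the command or script to check. The function name `missing_cron_paths` is hypothetical, and jobs that invoke scripts through an interpreter or a `cd` prefix would need a smarter parser.

```shell
# Sketch: report crontab commands whose path no longer exists.
# Reads crontab-style lines on stdin, skips comments, blank lines, and
# VAR=value lines, takes the first token starting with "/" on each job
# line, and prints it if nothing exists at that path.
missing_cron_paths() {
  awk '
    /^[ \t]*(#|$)/ { next }
    /^[ \t]*[A-Za-z_]+=/ { next }
    { for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }
  ' | sort -u | while IFS= read -r path; do
    [ -e "$path" ] || echo "$path"
  done
}

# Usage on the VM:
#   crontab -l | missing_cron_paths
#   cat /etc/cron.d/* | missing_cron_paths
```

Any output is a candidate stale job, such as the root crontab entry that still points at the old repo path.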

P2 — Observability exists but needs security-focused SLOs

Risk: Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.

Roadmap:

  • Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
  • Validate alert delivery to Telegram.
  • Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.

Execution Plan

Phase 0 — Freeze and inventory before changes

  • Freeze new public hostnames/ports until the exposure inventory is complete.
  • Generate docs/vm-exposure-inventory.md from Docker, Caddy, ss, and DNS.
  • Mark each exposed service as public, private, internal-only, or retire.
  • Review with S before changing public access for customer/user-facing apps.

Exit criteria: the inventory is reviewed and every P0 change has a rollback line and validation line.

Phase 1 — Immediate security hardening

  • Close or loopback-bind non-public Docker host ports.
  • Add DOCKER-USER default-deny rules for non-approved ports.
  • Harden SSH root/password access after key-based access is verified.
  • Put ollama.bytelyst.com, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.

Exit criteria: only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.

Phase 2 — Operational correctness

  • Fix/retire unhealthy containers.
  • Resolve hermes-root-backup.service failed state.
  • Decide and document Gitea runner active/disabled state.
  • Remove stale cron paths and add missing-script checks.
  • Apply pending security/runtime updates in a maintenance window.

Exit criteria: no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.

Phase 3 — Docker and app hardening

  • Add non-root users, no-new-privileges, cap drops, and read-only rootfs by service.
  • Add resource limits for noisy services and emulators.
  • Move emulators/dev tools off public bindings.
  • Review cAdvisor privilege and observability surface.

Phase 4 — Backup, restore, and incident readiness

  • Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
  • Perform restore drill to non-prod target.
  • Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
  • Add quarterly tabletop review.

Phase 5 — Continuous governance

  • Monthly VM security review cron/checklist.
  • Secret scan before DevOps repo pushes.
  • OS lifecycle/EOL tracker.
  • Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
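Port drift detection could be approximated by comparing current listeners against the approved list. This sketch assumes the column layout of `ss -Hltn` (local address in the fourth column) and takes the approved ports as arguments; the function name `unapproved_listeners` is hypothetical, and in practice the allowlist would be read from the exposure inventory rather than typed by hand.

```shell
# Sketch: print listening TCP ports that are bound beyond loopback and are
# not in the approved list. Input is `ss -Hltn` output on stdin; approved
# ports are passed as arguments.
unapproved_listeners() {
  approved=" $* "
  awk '
    {
      if (!match($4, /:[^:]*$/)) next
      addr = substr($4, 1, RSTART - 1)
      port = substr($4, RSTART + 1)
      if (addr ~ /^127\./ || addr == "[::1]") next   # loopback binds are fine
      print port
    }
  ' | sort -un | while IFS= read -r port; do
    case "$approved" in
      *" $port "*) ;;          # approved, say nothing
      *) echo "$port" ;;       # drift: listener without approval
    esac
  done
}

# Usage on the VM:
#   ss -Hltn | unapproved_listeners 22 80 443
```

An empty result means every non-loopback listener is approved; any printed port is a candidate for an alert or a new inventory entry. This only sees host-level listeners, so it complements an external probe and does not replace one.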

Change Tickets With Quality Gates

Use this shape for each implementation PR/commit:

Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:

Minimum post-checks for Phase 1:

  • ss -ltnp
  • docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
  • iptables -S DOCKER-USER
  • docker exec caddy caddy validate --config /etc/caddy/Caddyfile
  • public smoke checks for approved hostnames
  • negative external probe for blocked ports
  • sshd -T after SSH changes
  • systemctl --failed --no-pager

Do Not Start With

  • Rootless Docker migration.
  • Broad iptables default-drop without an allowlist.
  • Mass Compose rewrites across all products.
  • SSH password/root lockout before key-based sudo and rollback are proven.
  • Removing unhealthy containers before confirming whether they are deprecated or broken required services.
  • Publishing secret-scan output that contains secrets.

Suggested First Tickets

  1. P0: Build and review exposure inventory — produce exact approved/blocked list for all currently bound ports.
  2. P0: Lock Docker-published non-public ports — bind to loopback/internal or enforce DOCKER-USER drops.
  3. P0: Harden SSH — disable password/root login after confirming key-based admin access.
  4. P1: Triage unhealthy containers — fix healthchecks/apps or retire dead services.
  5. P1: Resolve failed Hermes backup unit — fix or disable duplicate failed unit; keep cron backup healthy.
  6. P1: Decide Gitea runner state — active smoke-tested runner or documented disabled service.
  7. P2: Add secret scanner and stale-job scanner — prevent silent credential and automation drift.

Recommended first implementation order:

  1. Generate and review docs/vm-exposure-inventory.md.
  2. Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
  3. Harden SSH with second-session/provider-console safety.
  4. Move obvious internal-only Docker ports to loopback/internal bindings.
  5. Add DOCKER-USER guardrails after the allowlist is proven.

This order improves safety without letting the port exposure issue linger too long.

Verification Commands for Future Runs

# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required

# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup

# SSH effective config
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd

# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'

# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile

# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done

Notes

  • This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
  • Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
  • The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.