bytelyst/bytelyst-devops-tools

Hermes VM 2c125adb05

docs: add VM security blind spots roadmap

2026-05-27 20:21:52 +00:00

ByteLyst VM Security Blind Spots Roadmap

Review date: 2026-05-27
Reviewer: Hermes Agent
Scope: Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.

Executive Summary

The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.

The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on 0.0.0.0 / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.

Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.

Evidence Snapshot

Collected on 2026-05-27 from this VM.

Host and patching

Host: srv1491630
OS: Ubuntu 25.10
Kernel: 6.17.0-29-generic
Uptime: about 14 hours at review time
Root filesystem: 193G total, 71G used, 123G available, 37% used
Memory: 15Gi total, about 10Gi available
Swap: 4.0G total, about 1.3G used
Reboot required: no
Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for libgcrypt20, libcaca0, and libssh2-1t64
Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent

Network and ingress

UFW: active; default deny incoming; only 22/tcp allowed by UFW rules
Docker iptables rules are present and publish many ports despite UFW's simple rule list
Public/listening TCP ports bound on all interfaces included:
- 22, 80, 443
- app/frontend ports: 3000, 3002, 3003, 3030, 3035, 3040, 3049, 3050, 3055, 3060, 3070, 3075, 3085
- backend/API ports: 4004, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4019, 4020, 4025
- infra/dev ports: 1025, 1234, 3100, 3300, 8025, 8081, 10000, 11434
Caddy source-of-truth config: /opt/bytelyst/Caddyfile, mounted read-only into the caddy container
docker exec caddy caddy validate --config /etc/caddy/Caddyfile: valid config, formatting warning only
Caddy public hostnames include:
- api.bytelyst.com
- gitea.bytelyst.com
- admin.bytelyst.com
- devops.bytelyst.com
- tracker.bytelyst.com
- llmlab.bytelyst.com
- ollama.bytelyst.com
- trading-api.bytelyst.com
- invttrdg.bytelyst.com
- notes.bytelyst.com
- clock.bytelyst.com

SSH and account surface

Effective sshd -T settings showed:

permitrootlogin yes
passwordauthentication yes
pubkeyauthentication yes
kbdinteractiveauthentication no
maxauthtries 6
x11forwarding yes
clientaliveinterval 0

fail2ban is active with one jail: sshd; no current bans at review time.

Docker runtime and containers

Docker: client/server 29.4.2; newer Docker packages are available
Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; live_restore=false
Most product containers run with writable root filesystems and no explicit user configured
cadvisor is privileged
DOCKER-USER chain appears empty, so there is no central Docker firewall policy in front of published containers
Multiple containers are unhealthy:
- learning_ai_common_plat-llmlab-dashboard-1
- learning_ai_common_plat-actiontrail-web-1
- learning_ai_common_plat-jarvisjr-web-1
- learning_ai_common_plat-localmemgpt-web-1
- learning_ai_common_plat-nomgap-web-1
- learning_ai_common_plat-flowmonk-web-1
- learning_ai_common_plat-mindlyst-web-1

Gitea and CI

Gitea public route: https://gitea.bytelyst.com
Local Gitea container port: host 3300 -> container 3000, bound on 0.0.0.0 and IPv6
gitea-act-runner.service: enabled but inactive/dead
Runner user exists: gitea-runner, member of docker
Runner config directory permissions look reasonable:
- /home/gitea-runner/act_runner: 750, owned by gitea-runner:gitea-runner
- /home/gitea-runner/act_runner/config.yaml: 600, owned by gitea-runner:gitea-runner

Backup and operations

systemctl --failed showed failed unit:
- hermes-root-backup.service — Sync root Hermes persistent backup to GitHub
Hermes cron backup is active and healthy:
- job 470832621b43, Sync Hermes persistent-data backup to GitHub, every 30 minutes, last run ok
Existing VM maintenance cron entries exist for health check and cleanup under /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/
A root crontab entry still references /opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh, which may be stale after repo relocation/renaming

Blind Spots and Risk Register

P0 — Internet-exposed Docker ports bypass the intended ingress model

Risk: UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.

Examples observed: 3300, 8025, 1025, 1234, 8081, 10000, 11434, many 30xx web ports, and many 40xx backend ports.

Impact: Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.

Roadmap:

Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
Bind non-public Compose ports to 127.0.0.1 or remove host port mapping entirely.
Add a DOCKER-USER chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
Keep only 80/443 and intentionally public SSH exposed at the provider/firewall layer.
Add a recurring check that compares ss -ltn and Docker published ports against the approved inventory.
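The recurring check in the last roadmap item could look roughly like the sketch below. It is a minimal illustration, not the project's tooling: the function name `unapproved_ports` and the idea of passing the approved ports as a space-separated argument are assumptions, and the real inventory would come from the exposure inventory document.

```shell
# Sketch: report listening TCP ports that are not on an approved list.
# Input (stdin): output of `ss -ltnH`; column 4 is "Local Address:Port".
# Argument 1: space-separated approved ports (hypothetical allowlist format).
unapproved_ports() {
  awk -v approved="$1" '
    BEGIN { n = split(approved, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    {
      addr = $4
      port = addr; sub(/.*:/, "", port)       # text after the last colon
      host = addr; sub(/:[^:]*$/, "", host)   # text before the last colon
      # Loopback binds are not reachable from outside the host; skip them.
      if (host ~ /^127\./ || host == "[::1]") next
      if (!(port in ok)) print port
    }
  ' | sort -n -u
}
```

A run against the live host would be `ss -ltnH | unapproved_ports "22 80 443"`; any port it prints is bound on a non-loopback address without being approved. It only covers TCP listeners seen by `ss`, so it complements rather than replaces a review of Docker's published ports.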

P0 — SSH permits root login and password authentication

Risk: PermitRootLogin yes and PasswordAuthentication yes keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.

Roadmap:

Confirm all required admin users have working SSH keys and sudo access.
Add a non-root break-glass admin path if one does not exist.
Change SSH effective config to:
- PermitRootLogin prohibit-password or no
- PasswordAuthentication no
- X11Forwarding no
- lower MaxAuthTries, e.g. 3
- set a sane ClientAliveInterval / ClientAliveCountMax
Validate with a second session before restarting SSH.
Record rollback commands and keep console/provider access available during rollout.
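To make the target state checkable before and after the change, the effective settings can be compared against the values listed above. The sketch below is one possible shape, assuming the function name `sshd_deviations` and treating `maxauthtries 3` (the example value from the roadmap) as the target; it reads `sshd -T` output and does not modify any configuration.

```shell
# Sketch: compare effective sshd settings against the hardened target.
# Input (stdin): output of `sshd -T` (lowercase "key value" lines).
# Exit status: 0 when no deviation is found, 1 otherwise.
sshd_deviations() {
  awk '
    BEGIN {
      want["passwordauthentication"] = "no"
      want["x11forwarding"] = "no"
      want["maxauthtries"] = "3"   # example target taken from the roadmap
    }
    # Root login is acceptable only when disabled or key-only.
    $1 == "permitrootlogin" && $2 != "no" && $2 != "prohibit-password" && $2 != "without-password" {
      print $1 " is " $2 ", want no or prohibit-password"; bad = 1
    }
    ($1 in want) && $2 != want[$1] { print $1 " is " $2 ", want " want[$1]; bad = 1 }
    $1 == "clientaliveinterval" && $2 == 0 { print $1 " is 0, want a non-zero interval"; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

Run as root with `sshd -T | sshd_deviations`. With the settings recorded in the evidence snapshot it would report five deviations; after hardening it should print nothing and exit 0.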

P0 — Public/private boundary for dev and internal tooling is unclear

Risk: Caddy publishes ollama.bytelyst.com, llmlab.bytelyst.com, devops.bytelyst.com, admin.bytelyst.com, and gitea.bytelyst.com. Some of these may be intentionally public, but no explicit auth/access decision is documented for each.

Roadmap:

Document public hostnames, auth model, and data sensitivity.
Require explicit approval before exposing new dashboards or model endpoints.
Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
Add security headers/auth checks to public UI health reviews.
Confirm ollama.bytelyst.com should be publicly reachable at all; if not, move behind private network or auth gate.

P1 — Docker/container hardening is mostly default

Risk: Many containers run as the default/root user with a writable root filesystem and Docker's default capability set, all under a rootful Docker daemon. A compromised app gets more host-adjacent leverage than needed.

Roadmap:

Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
Start with public-facing/backend services and admin dashboards.
Add security_opt: ["no-new-privileges:true"] where compatible.
Add cap_drop: ["ALL"] and selectively add back capabilities only when needed.
Convert app images to non-root users consistently.
Use read_only: true plus explicit writable tmp/cache volumes where compatible.
Review cadvisor privileged mode and replace/restrict if possible.
Enable Docker live-restore if it fits maintenance operations.
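A first pass at the hardening matrix can be generated from `docker inspect` output rather than compiled by hand. The sketch below is an illustration under assumptions: the function name `hardening_gaps` and the gap labels are invented here, and it covers only three of the matrix columns (user, privileged mode, read-only rootfs).

```shell
# Sketch: flag basic hardening gaps per container.
# Input (stdin): one line per container in the form
#   /name user=<value> privileged=<true|false> readonly=<true|false>
# An empty user= means no user is set, so the process typically runs as root.
hardening_gaps() {
  awk '
    {
      gaps = ""
      for (i = 2; i <= NF; i++) {
        if ($i == "user=" || $i == "user=root" || $i == "user=0") gaps = gaps " default-user"
        if ($i == "privileged=true") gaps = gaps " privileged"
        if ($i == "readonly=false") gaps = gaps " writable-rootfs"
      }
      if (gaps != "") print $1 ":" gaps
    }
  '
}
```

The input can be produced with `docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}}' | hardening_gaps`. Capabilities, no-new-privileges, and resource limits are not covered and would need additional inspect fields.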

P1 — Unhealthy containers can normalize broken deployments

Risk: Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.

Roadmap:

Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
Fix or remove bad healthchecks so Docker health state is trustworthy.
Add alerting for sustained unhealthy containers.
Make deployment scripts fail on unhealthy post-deploy state.
Update dashboard/observability docs with current service ownership and expected state.
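Making a deployment script fail on an unhealthy post-deploy state can be as small as the gate sketched below. The function name `unhealthy_containers` is an assumption, and the sketch relies on the `(unhealthy)` marker that `docker ps` shows in its status column.

```shell
# Sketch: list unhealthy containers and fail if any exist.
# Input (stdin): "name<TAB>status" lines, as produced by
#   docker ps --format '{{.Names}}\t{{.Status}}'
# Exit status: 1 when at least one container reports (unhealthy), else 0.
unhealthy_containers() {
  awk -F '\t' '
    $2 ~ /\(unhealthy\)/ { print $1; found = 1 }
    END { exit (found ? 1 : 0) }
  '
}
```

In a deploy script this might be used as `docker ps --format '{{.Names}}\t{{.Status}}' | unhealthy_containers || exit 1`. Containers still in the `(health: starting)` state are not flagged, so a real gate would probably wait for healthchecks to settle first; how long to wait is left open here.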

P1 — Gitea Actions runner is enabled but inactive

Risk: CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.

Roadmap:

Decide whether the runner should be active or intentionally disabled.
If active: restart and verify gitea-act-runner.service, runner labels, Docker access, and a smoke workflow.
If disabled: disable the service and document the intentional state.
Keep runner secrets separate from smoke/test workflows.
Add a runner-health check to VM observability.

P1 — Backup/restore evidence is split and one backup unit is failed

Risk: Hermes cron backup works, but hermes-root-backup.service is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.

Roadmap:

Inspect hermes-root-backup.service logs and decide whether to fix, disable, or replace it with the cron-backed job.
Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
Run a restore drill into a non-production path/profile.
Verify no raw .env, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
Add backup freshness and restore-drill status to the monthly VM review.
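The check for committed raw secrets and database sidecar files can start with a filename filter over the tracked file list. The sketch below is a rough first line of defense only: the function name `risky_tracked_files` and the exact patterns are assumptions, and it matches names, not file contents, so it does not replace a real secret scanner.

```shell
# Sketch: print tracked paths whose names suggest secrets or raw DB state.
# Input (stdin): one path per line, e.g. from `git ls-files`.
# Matches .env files, .pem/.key files, common SSH private key names,
# and SQLite -wal/-shm sidecars; *.example files are allowed through.
risky_tracked_files() {
  grep -E '(^|/)\.env(\..*)?$|\.(pem|key)$|(^|/)id_(rsa|ed25519)$|\.(db|sqlite3?)-(wal|shm)$' |
    grep -vE '\.example$'
}
```

Inside the backup repository, `git ls-files | risky_tracked_files` prints any suspicious paths and prints nothing when the listing is clean.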

P1 — Patch management has pending security/runtime updates

Risk: Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.

Roadmap:

Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
Define a Docker upgrade maintenance window with pre/post checks.
Run apt list --upgradable and capture package classes without dumping noise.
Verify apps after Docker/containerd upgrades.
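Reporting Docker and security updates separately can be done by classifying the `apt list --upgradable` output. The sketch below assumes the usual `package/suite[,suite] version arch [upgradable from: old]` line shape, treats any suite containing `-security` as a security update, and groups packages starting with `docker-` or `containerd` as the Docker runtime; the function name `classify_upgrades` is invented for illustration.

```shell
# Sketch: classify pending upgrades as docker, security, or other.
# Input (stdin): output of `apt list --upgradable`.
# Output: "class<TAB>package" lines, sorted.
classify_upgrades() {
  awk -F '[/ ]' '
    /\[upgradable from:/ {
      pkg = $1; suites = $2
      if (pkg ~ /^(docker-|containerd)/) class = "docker"
      else if (suites ~ /-security/) class = "security"
      else class = "other"
      print class "\t" pkg
    }
  ' | LC_ALL=C sort
}
```

A weekly report could run `apt list --upgradable 2>/dev/null | classify_upgrades`, which keeps the output to one short line per package instead of the full listing.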

P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking

Risk: Ubuntu interim (non-LTS) releases such as 25.10 are supported for only nine months. If this VM is long-lived production infrastructure, lifecycle tracking matters.

Roadmap:

Record current Ubuntu 25.10 support/EOL date in ops docs.
Decide whether to stay on interim releases or migrate to an LTS baseline.
Add an OS lifecycle check to quarterly review.

P2 — Repository/config secret hygiene needs a repeatable scanner

Risk: The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.

Roadmap:

Add a documented secret-scan command using gitleaks or trufflehog for tracked files and selected untracked ops directories.
Scan historical directories such as DELETED_bytelyst-devops-tools separately before archiving or deleting.
Add .gitignore patterns for generated scans, local account snapshots, and credential-shaped outputs.
Keep examples as .example files only.

P2 — Cron/systemd ownership and drift are not fully inventoried

Risk: Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.

Roadmap:

Inventory root/user crontabs, /etc/cron.d, systemd timers, Hermes cron, and Gitea Actions schedules.
Remove or update stale /opt/bytelyst/bytelyst-devops-tools/... references after confirming replacements.
Add owner, purpose, expected output, and alert channel for every job.
Add a stale-job detector for missing script paths and failed systemd units.
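The missing-script half of the stale-job detector could be sketched as follows. This is a simplified illustration: the function name `missing_cron_paths` is an assumption, it only recognizes absolute paths ending in `.sh`, and it relies on `grep -o`, which GNU and BSD grep both provide.

```shell
# Sketch: report script paths referenced by cron entries that no longer exist.
# Input (stdin): crontab lines, e.g. from `crontab -l` or files in /etc/cron.d.
# Commented-out lines are ignored; only absolute *.sh paths are checked.
missing_cron_paths() {
  grep -v '^[[:space:]]*#' |
    grep -oE '/[A-Za-z0-9._/-]+\.sh' |
    sort -u |
    while IFS= read -r path; do
      [ -e "$path" ] || echo "$path"
    done
}
```

For the root crontab this would be `crontab -l | missing_cron_paths`; pairing it with `systemctl --failed --no-pager` covers the failed-unit half of the same item.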

P2 — Observability exists but needs security-focused SLOs

Risk: Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.

Roadmap:

Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
Validate alert delivery to Telegram.
Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.

Execution Plan

Phase 0 — Freeze and inventory before changes

Freeze new public hostnames/ports until the exposure inventory is complete.
Generate docs/vm-exposure-inventory.md from Docker, Caddy, ss, and DNS.
Mark each exposed service as public, private, internal-only, or retire.
Review with S before changing public access for customer/user-facing apps.

Phase 1 — Immediate security hardening

Close or loopback-bind non-public Docker host ports.
Add DOCKER-USER default-deny rules for non-approved ports.
Harden SSH root/password access after key-based access is verified.
Put ollama.bytelyst.com, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.

Phase 2 — Operational correctness

Fix/retire unhealthy containers.
Resolve hermes-root-backup.service failed state.
Decide and document Gitea runner active/disabled state.
Remove stale cron paths and add missing-script checks.
Apply pending security/runtime updates in a maintenance window.

Phase 3 — Docker and app hardening

Add non-root users, no-new-privileges, cap drops, and read-only rootfs by service.
Add resource limits for noisy services and emulators.
Move emulators/dev tools off public bindings.
Review cAdvisor privilege and observability surface.

Phase 4 — Backup, restore, and incident readiness

Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
Perform restore drill to non-prod target.
Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
Add quarterly tabletop review.

Phase 5 — Continuous governance

Monthly VM security review cron/checklist.
Secret scan before DevOps repo pushes.
OS lifecycle/EOL tracker.
Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.

Suggested First Tickets

P0: Build and review exposure inventory — produce exact approved/blocked list for all currently bound ports.
P0: Lock Docker-published non-public ports — bind to loopback/internal or enforce DOCKER-USER drops.
P0: Harden SSH — disable password/root login after confirming key-based admin access.
P1: Triage unhealthy containers — fix healthchecks/apps or retire dead services.
P1: Resolve failed Hermes backup unit — fix or disable duplicate failed unit; keep cron backup healthy.
P1: Decide Gitea runner state — active smoke-tested runner or documented disabled service.
P2: Add secret scanner and stale-job scanner — prevent silent credential and automation drift.

Verification Commands for Future Runs

# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required

# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup

# SSH effective config
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd

# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'

# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile

# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done

Notes

This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.
