bytelyst/bytelyst-devops-tools

Hermes VM 313a775fa0

docs: strengthen VM security roadmap gates

2026-05-27 20:34:37 +00:00

ByteLyst VM Security Blind Spots Roadmap

Review date: 2026-05-27
Reviewer: Hermes Agent
Scope: Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.

Executive Summary

The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.

The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on 0.0.0.0 / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.

Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.

Implementation Readiness Assessment

Roadmap quality score: 86%

Implementation confidence before remediation starts: 74%

Why not higher yet: the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.

Confidence after Phase 0 is complete: expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.

Quality strengths:

Evidence is concrete and command-derived rather than speculative.
The highest-risk items are correctly prioritized as P0.
The roadmap separates discovery from disruptive remediation.
It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.

Quality gaps to close before implementation:

Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
Add an approved exposure inventory before changing Docker bindings or DOCKER-USER.
Record a tested SSH rollback path and keep an active second session/provider console open before changing sshd.
Define what is intentionally public, private, internal-only, or deprecated for each service.
Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.

Implementation Guardrails

These rules apply before any Phase 1 change:

Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
Do not add broad DROP rules before an allowlist is committed to the inventory.
Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
Record exact rollback commands next to every change ticket.
Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.

Exposure Classification Model

Every listening port and Caddy hostname should be classified before changes:

| Class | Meaning | Expected Controls | Examples To Review |
| --- | --- | --- | --- |
| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
| `docker-internal` | Container-to-container only | No host port mapping | databases, emulators, private workers |
| `retire` | Unused/deprecated | Remove service/port, disable health checks and jobs | stale dashboards/services |

Minimum inventory fields:

service/container name
repo/Compose file
host port and bind address
container port
Caddy hostname/path, if any
intended audience
authentication/control plane
classification
owner/approver
rollback command
post-change health check
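
The host-port fields of the inventory can be seeded from Docker's own listing. The following is a minimal sketch, assuming `docker ps` prints mappings in the usual `0.0.0.0:3300->3000/tcp` form; the function name `ports_to_rows` and the tab-separated row layout are illustrative and not part of the existing tooling.

```shell
# Turn `docker ps --format '{{.Names}}\t{{.Ports}}'` lines (on stdin) into one
# row per published port: name, bind address, host port, container port/proto.
# Function name and output layout are illustrative assumptions.
ports_to_rows() {
  awk -F'\t' '
    {
      n = split($2, maps, ", ")
      for (i = 1; i <= n; i++) {
        # Only mappings with "->" are published to the host.
        if (maps[i] !~ /->/) continue
        split(maps[i], side, "->")
        host = side[1]
        # Host port is the text after the last colon; the rest is the bind address.
        port = host; sub(/.*:/, "", port)
        bind = host; sub(/:[^:]*$/, "", bind)
        printf "%s\t%s\t%s\t%s\n", $1, bind, port, side[2]
      }
    }'
}

# Usage on the VM (not executed here):
#   docker ps --format '{{.Names}}\t{{.Ports}}' | ports_to_rows
```

The remaining fields (audience, classification, owner, rollback, health check) still need to be filled in by a person; the sketch only removes the transcription step for bind addresses and ports.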

Evidence Snapshot

Collected on 2026-05-27 from this VM.

Host and patching

Host: srv1491630
OS: Ubuntu 25.10
Kernel: 6.17.0-29-generic
Uptime: about 14 hours at review time
Root filesystem: 193G total, 71G used, 123G available, 37% used
Memory: 15Gi total, about 10Gi available
Swap: 4.0G total, about 1.3G used
Reboot required: no
Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for libgcrypt20, libcaca0, and libssh2-1t64
Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent

Network and ingress

UFW: active; default deny incoming; only 22/tcp allowed by UFW rules
Docker iptables rules are present and publish many ports despite UFW's simple rule list
Public/listening TCP ports bound on all interfaces included:
- 22, 80, 443
- app/frontend ports: 3000, 3002, 3003, 3030, 3035, 3040, 3049, 3050, 3055, 3060, 3070, 3075, 3085
- backend/API ports: 4004, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4019, 4020, 4025
- infra/dev ports: 1025, 1234, 3100, 3300, 8025, 8081, 10000, 11434
Caddy source-of-truth config: /opt/bytelyst/Caddyfile, mounted read-only into the caddy container
docker exec caddy caddy validate --config /etc/caddy/Caddyfile: valid config, formatting warning only
Caddy public hostnames include:
- api.bytelyst.com
- gitea.bytelyst.com
- admin.bytelyst.com
- devops.bytelyst.com
- tracker.bytelyst.com
- llmlab.bytelyst.com
- ollama.bytelyst.com
- trading-api.bytelyst.com
- invttrdg.bytelyst.com
- notes.bytelyst.com
- clock.bytelyst.com

SSH and account surface

Effective sshd -T settings showed:

permitrootlogin yes
passwordauthentication yes
pubkeyauthentication yes
kbdinteractiveauthentication no
maxauthtries 6
x11forwarding yes
clientaliveinterval 0

fail2ban is active with one jail: sshd; no current bans at review time.

Docker runtime and containers

Docker: client/server 29.4.2; newer Docker packages are available
Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; live_restore=false
Most product containers run with writable root filesystems and no explicit user configured
cadvisor is privileged
DOCKER-USER chain appears empty, so there is no central Docker firewall policy in front of published containers
Multiple containers are unhealthy:
- learning_ai_common_plat-llmlab-dashboard-1
- learning_ai_common_plat-actiontrail-web-1
- learning_ai_common_plat-jarvisjr-web-1
- learning_ai_common_plat-localmemgpt-web-1
- learning_ai_common_plat-nomgap-web-1
- learning_ai_common_plat-flowmonk-web-1
- learning_ai_common_plat-mindlyst-web-1

Gitea and CI

Gitea public route: https://gitea.bytelyst.com
Local Gitea container port: host 3300 -> container 3000, bound on 0.0.0.0 and IPv6
gitea-act-runner.service: enabled but inactive/dead
Runner user exists: gitea-runner, member of docker
Runner config directory permissions look reasonable:
- /home/gitea-runner/act_runner: 750, owned by gitea-runner:gitea-runner
- /home/gitea-runner/act_runner/config.yaml: 600, owned by gitea-runner:gitea-runner

Backup and operations

systemctl --failed showed failed unit:
- hermes-root-backup.service — Sync root Hermes persistent backup to GitHub
Hermes cron backup is active and healthy:
- job 470832621b43, Sync Hermes persistent-data backup to GitHub, every 30 minutes, last run ok
Existing VM maintenance cron entries exist for health check and cleanup under /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/
A root crontab entry still references /opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh, which may be stale after repo relocation/renaming

Blind Spots and Risk Register

P0 — Internet-exposed Docker ports bypass the intended ingress model

Risk: UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.

Examples observed: 3300, 8025, 1025, 1234, 8081, 10000, 11434, many 30xx web ports, and many 40xx backend ports.

Impact: Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.

Roadmap:

Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
Bind non-public Compose ports to 127.0.0.1 or remove host port mapping entirely.
Add a DOCKER-USER chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
Keep only 80/443 and intentionally public SSH exposed at the provider/firewall layer.
Add a recurring check that compares ss -ltn and Docker published ports against the approved inventory.
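
The recurring check in the last item could look roughly like this. It is a sketch under stated assumptions: the approved ports are passed as arguments (in practice they would come from the reviewed inventory), input is `ss -Hltn` output, and the function name `unapproved_listeners` is hypothetical.

```shell
# Report TCP listener ports that are bound beyond loopback and are not in the
# approved list. Reads `ss -Hltn` output on stdin; approved ports are arguments.
unapproved_listeners() {
  local approved=" $* "
  awk -v approved="$approved" '
    {
      addr = $4                          # "Local Address:Port" column
      port = addr; sub(/.*:/, "", port)
      host = addr; sub(/:[^:]*$/, "", host)
      # Loopback binds are not reachable from other hosts.
      if (host ~ /^127\./ || host == "[::1]") next
      if (index(approved, " " port " ") == 0) print port
    }' | sort -nu
}

# Usage on the VM (not executed here):
#   ss -Hltn | unapproved_listeners 22 80 443
```

An empty result means every non-loopback listener is on the approved list. Any output is a candidate for an alert or a failed monthly review item.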

Acceptance criteria:

docs/vm-exposure-inventory.md lists every ss -ltnp listener and every Docker published port.
Every non-SSH direct public bind has an approved classification.
Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in DOCKER-USER.
External probe confirms non-approved ports are closed from the internet.
Caddy-routed public hostnames still pass smoke checks.

Rollback: keep a saved copy of original Compose files and iptables-save output; rollback means restoring original port mappings or flushing only the newly added DOCKER-USER rules.
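
For the DOCKER-USER step, a dry-run generator keeps the change and its rollback reviewable before anything is applied. This is a sketch, not the adopted policy: the interface name and the port list are assumptions to be replaced from the inventory, and both function names are hypothetical. Packets reach DOCKER-USER after DNAT, so the original host port is matched through conntrack rather than `--dport`.

```shell
# Print (do not apply) DOCKER-USER rules that drop inbound traffic to specific
# published host ports arriving on the external interface.
# Interface name and ports are assumptions to replace from the inventory.
docker_user_drop_rules() {
  local iface="$1"; shift
  local port
  for port in "$@"; do
    printf 'iptables -I DOCKER-USER -i %s -p tcp -m conntrack --ctorigdstport %s --ctdir ORIGINAL -j DROP\n' \
      "$iface" "$port"
  done
}

# Matching rollback: the same rule specs with -D instead of -I.
docker_user_rollback_rules() {
  docker_user_drop_rules "$@" | sed 's/ -I DOCKER-USER / -D DOCKER-USER /'
}

# Usage (prints commands for review; nothing is executed):
#   docker_user_drop_rules eth0 8025 1025 11434
#   docker_user_rollback_rules eth0 8025 1025 11434
```

Per-port drops fit the guardrail against broad DROP rules before an allowlist exists. Rules added this way do not survive a reboot unless they are persisted separately, and IPv6 would need equivalent `ip6tables` rules if Docker manages IPv6 filtering on this host.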

P0 — SSH permits root login and password authentication

Risk: PermitRootLogin yes and PasswordAuthentication yes keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.

Roadmap:

Confirm all required admin users have working SSH keys and sudo access.
Add a non-root break-glass admin path if one does not exist.
Change SSH effective config to:
- PermitRootLogin prohibit-password or no
- PasswordAuthentication no
- X11Forwarding no
- lower MaxAuthTries, e.g. 3
- set a sane ClientAliveInterval / ClientAliveCountMax
Validate with a second session before restarting SSH.
Record rollback commands and keep console/provider access available during rollout.

Acceptance criteria:

A non-root sudo admin user can log in with SSH key auth.
Root password login no longer works.
Existing automation using scripts/VMs/HostingerVM/login.sh still works or is updated.
sshd -T confirms the intended effective config.
fail2ban-client status sshd still reports an active jail.

Rollback: provider console or still-open root session can restore previous sshd_config drop-in and restart ssh.
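
The SSH change and its verification can be kept together so the intended values are defined once. The sketch below assumes `PermitRootLogin no` (the roadmap also allows `prohibit-password`) and illustrative keepalive values; the function names and the drop-in file name in the usage comment are hypothetical.

```shell
# Print a candidate sshd drop-in. Values follow the roadmap; the keepalive
# numbers and the choice of "no" for root login are assumptions to confirm.
render_sshd_dropin() {
  cat <<'EOF'
PermitRootLogin no
PasswordAuthentication no
X11Forwarding no
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
EOF
}

# Compare `sshd -T` output (stdin) with the intended values and list mismatches.
# `sshd -T` prints keywords in lowercase, so each intended line is lowercased.
sshd_mismatches() {
  local effective want key value
  effective="$(cat)"
  render_sshd_dropin | while read -r key value; do
    want="$(printf '%s %s' "$key" "$value" | tr '[:upper:]' '[:lower:]')"
    printf '%s\n' "$effective" | grep -qxF -- "$want" || printf 'mismatch: %s\n' "$want"
  done
}

# Usage on the VM, with a second session open (not executed here):
#   render_sshd_dropin > /etc/ssh/sshd_config.d/00-hardening.conf   # hypothetical name
#   sshd -t && systemctl reload ssh
#   sshd -T | sshd_mismatches
```

sshd keeps the first value it reads for each keyword, so the drop-in needs to sort ahead of any other file that sets the same keywords; checking the effective config with `sshd -T` afterwards is what confirms the change took effect.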

P0 — Public/private boundary for dev and internal tooling is unclear

Risk: Caddy publishes ollama.bytelyst.com, llmlab.bytelyst.com, devops.bytelyst.com, admin.bytelyst.com, and Gitea. Some of these may be intentionally public, but no explicit auth/access decision is recorded for each hostname.

Roadmap:

Document public hostnames, auth model, and data sensitivity.
Require explicit approval before exposing new dashboards or model endpoints.
Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
Add security headers/auth checks to public UI health reviews.
Confirm ollama.bytelyst.com should be publicly reachable at all; if not, move behind private network or auth gate.

Acceptance criteria:

ollama, llmlab, devops, admin, gitea, and observability-adjacent routes each have an owner-approved exposure class.
Public admin-like routes require authentication or an explicit documented exception.
No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.

P1 — Docker/container hardening is mostly default

Risk: Many containers run as the default (root) user with a writable root filesystem and Docker's default capability set, under a rootful Docker daemon. A compromised app gets more host-adjacent leverage than it needs.

Roadmap:

Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
Start with public-facing/backend services and admin dashboards.
Add security_opt: ["no-new-privileges:true"] where compatible.
Add cap_drop: ["ALL"] and selectively add back capabilities only when needed.
Convert app images to non-root users consistently.
Use read_only: true plus explicit writable tmp/cache volumes where compatible.
Review cadvisor privileged mode and replace/restrict if possible.
Enable Docker live-restore if it fits maintenance operations.

Implementation note: do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with no-new-privileges, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
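
A first pass at the hardening matrix can be derived from `docker inspect`. This is a rough sketch: it only covers user, privileged mode, and read-only rootfs, and the function name `hardening_gaps` and the gap labels are illustrative.

```shell
# Read lines of the form "/name user=<u> privileged=<bool> readonly=<bool>"
# on stdin and print the hardening gaps per container.
hardening_gaps() {
  awk '
    {
      name = $1; sub(/^\//, "", name)
      gaps = ""
      for (i = 2; i <= NF; i++) {
        # An empty user, "root", or uid 0 all mean the process runs as root.
        if ($i ~ /^user=(root|0)?(:.*)?$/) gaps = gaps " runs-as-root"
        if ($i == "privileged=true")       gaps = gaps " privileged"
        if ($i == "readonly=false")        gaps = gaps " writable-rootfs"
      }
      if (gaps != "") print name ":" gaps
    }'
}

# Usage on the VM (not executed here):
#   docker ps -q | xargs -r docker inspect --format \
#     '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}}' \
#     | hardening_gaps
```

Capabilities, `no-new-privileges`, and resource limits are not covered here and would need additional inspect fields.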

P1 — Unhealthy containers can normalize broken deployments

Risk: Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.

Roadmap:

Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
Fix or remove bad healthchecks so Docker health state is trustworthy.
Add alerting for sustained unhealthy containers.
Make deployment scripts fail on unhealthy post-deploy state.
Update dashboard/observability docs with current service ownership and expected state.

Acceptance criteria:

Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
Docker health state matches the product’s actual serving state.
Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
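
A post-deploy gate along these lines would make the last criterion enforceable. It is a sketch only: the grace period and poll interval are assumed defaults, it checks all running containers rather than a list of required ones, and both function names are hypothetical.

```shell
# List containers whose status (from `docker ps --format '{{.Names}}\t{{.Status}}'`
# on stdin) is marked unhealthy.
unhealthy_names() {
  awk -F'\t' '$2 ~ /\(unhealthy\)/ { print $1 }'
}

# Poll until no container is unhealthy or the grace period runs out.
# Grace period and poll interval are assumed defaults, in seconds.
wait_for_healthy() {
  local grace="${1:-120}" interval="${2:-10}" waited=0 bad
  while :; do
    bad="$(docker ps --format '{{.Names}}\t{{.Status}}' | unhealthy_names)"
    [ -z "$bad" ] && return 0
    if [ "$waited" -ge "$grace" ]; then
      printf 'still unhealthy after %ss:\n%s\n' "$grace" "$bad" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage at the end of a deploy script (not executed here):
#   wait_for_healthy 180 15 || exit 1
```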

P1 — Gitea Actions runner is enabled but inactive

Risk: CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.

Roadmap:

Decide whether the runner should be active or intentionally disabled.
If active: restart and verify gitea-act-runner.service, runner labels, Docker access, and a smoke workflow.
If disabled: disable the service and document the intentional state.
Keep runner secrets separate from smoke/test workflows.
Add a runner-health check to VM observability.

Decision needed: runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.
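
The "enabled but dead" condition is easy to detect mechanically. A minimal sketch, assuming the two states come from `systemctl is-enabled` and `systemctl is-active`; the function name `runner_verdict` and the message wording are illustrative.

```shell
# Decide whether the runner's enablement and activity states agree.
# Arguments: output of `systemctl is-enabled` and `systemctl is-active`.
runner_verdict() {
  local enabled="$1" active="$2"
  case "$enabled/$active" in
    enabled/active)
      echo "ok: runner enabled and running" ;;
    disabled/inactive|masked/inactive)
      echo "ok: runner intentionally off" ;;
    *)
      echo "drift: enabled=$enabled active=$active"
      return 1 ;;
  esac
}

# Usage on the VM (not executed here):
#   runner_verdict "$(systemctl is-enabled gitea-act-runner.service)" \
#                  "$(systemctl is-active gitea-act-runner.service)"
```

With the state observed in this review (enabled, inactive), the function reports drift and returns non-zero, which is suitable for a health check or alert hook.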

P1 — Backup/restore evidence is split and one backup unit is failed

Risk: Hermes cron backup works, but hermes-root-backup.service is failed. There is no recent full restore drill evidence in this review. A backup that has never been restored is only an assumption.

Roadmap:

Inspect hermes-root-backup.service logs and decide whether to fix, disable, or replace it with the cron-backed job.
Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
Run a restore drill into a non-production path/profile.
Verify no raw .env, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
Add backup freshness and restore-drill status to the monthly VM review.

Acceptance criteria:

systemctl --failed no longer includes backup units unless the failure is intentionally documented.
Backup status shows source, destination, cadence, last success, and restore command.
A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
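
Backup freshness can be reduced to a timestamp comparison. The sketch below assumes the last-success time is available as a Unix epoch; the two-hour threshold in the usage comment is an assumption relative to the 30-minute cadence, and the backup checkout path is a placeholder, not a known location.

```shell
# Succeeds (exit 0) when the last successful backup is older than max_age.
# All three arguments are in seconds; now and last_success are Unix epochs.
backup_is_stale() {
  local now="$1" last_success="$2" max_age="$3"
  [ $((now - last_success)) -gt "$max_age" ]
}

# Usage for a git-backed backup (path is a placeholder; not executed here):
#   last="$(git -C /path/to/backup-checkout log -1 --format=%ct)"
#   if backup_is_stale "$(date +%s)" "$last" 7200; then echo "backup stale"; fi
```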

P1 — Patch management has pending security/runtime updates

Risk: Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.

Roadmap:

Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
Define a Docker upgrade maintenance window with pre/post checks.
Run apt list --upgradable and capture package classes without dumping noise.
Verify apps after Docker/containerd upgrades.

Acceptance criteria:

Security updates and Docker/runtime updates are tracked separately.
Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
Reboot requirement is checked and scheduled rather than discovered accidentally.
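
To track security and runtime updates separately, the `apt list --upgradable` output can be bucketed. This is a sketch under assumptions: Docker/containerd packages are recognized by name prefix, security updates by a `-security` suite, and the function name `classify_upgrades` is hypothetical.

```shell
# Classify `apt list --upgradable` lines (stdin) as runtime, security, or other.
# Input lines look like: "pkg/suite[,suite] version arch [upgradable from: old]".
classify_upgrades() {
  awk -F'[/ ]' '
    NF < 3 { next }                        # skip the "Listing..." banner and blanks
    {
      pkg = $1; suite = $2
      if (pkg ~ /^(docker-|containerd)/) class = "runtime"
      else if (suite ~ /-security/)      class = "security"
      else                               class = "other"
      print class "\t" pkg
    }' | sort
}

# Usage on the VM (not executed here):
#   apt list --upgradable 2>/dev/null | classify_upgrades
```

The weekly checkpoint can then report the `security` and `runtime` groups under separate headings, with `runtime` items routed to the Docker maintenance window.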

P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking

Risk: Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.

Roadmap:

Record current Ubuntu 25.10 support/EOL date in ops docs.
Decide whether to stay on interim releases or migrate to an LTS baseline.
Add an OS lifecycle check to quarterly review.

P2 — Repository/config secret hygiene needs a repeatable scanner

Risk: The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.

Roadmap:

Add a documented secret-scan command using gitleaks or trufflehog for tracked files and selected untracked ops directories.
Scan historical directories such as DELETED_bytelyst-devops-tools separately before archiving or deleting.
Add .gitignore patterns for generated scans, local account snapshots, and credential-shaped outputs.
Keep examples as .example files only.

P2 — Cron/systemd ownership and drift are not fully inventoried

Risk: Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.

Roadmap:

Inventory root/user crontabs, /etc/cron.d, systemd timers, Hermes cron, and Gitea Actions schedules.
Remove or update stale /opt/bytelyst/bytelyst-devops-tools/... references after confirming replacements.
Add owner, purpose, expected output, and alert channel for every job.
Add a stale-job detector for missing script paths and failed systemd units.

Acceptance criteria:

No active cron/systemd job references a missing path.
Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
Stale path detection runs in the monthly VM review.
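
The stale-path detector can start as a small filter over crontab text. A minimal sketch: it only treats whitespace-delimited tokens that start with `/` and end in `.sh` or `.py` as scripts, which is a simplifying assumption, and the function name `missing_cron_paths` is illustrative.

```shell
# Read crontab lines on stdin and report referenced script paths that do not
# exist. Comment lines are ignored.
missing_cron_paths() {
  awk '!/^[[:space:]]*#/ {
         for (i = 1; i <= NF; i++) if ($i ~ /^\/.+\.(sh|py)$/) print $i
       }' \
    | sort -u \
    | while read -r path; do
        [ -e "$path" ] || printf 'missing: %s\n' "$path"
      done
}

# Usage on the VM (not executed here):
#   crontab -l | missing_cron_paths
#   cat /etc/cron.d/* | missing_cron_paths
```

Run against the root crontab described in this review, the entry pointing at /opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh would be reported if that path no longer exists.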

P2 — Observability exists but needs security-focused SLOs

Risk: Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.

Roadmap:

Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
Validate alert delivery to Telegram.
Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.

Execution Plan

Phase 0 — Freeze and inventory before changes

Freeze new public hostnames/ports until the exposure inventory is complete.
Generate docs/vm-exposure-inventory.md from Docker, Caddy, ss, and DNS.
Mark each exposed service as public, private, internal-only, or retire.
Review with S before changing public access for customer/user-facing apps.

Exit criteria: the inventory is reviewed and every P0 change has a rollback line and validation line.

Phase 1 — Immediate security hardening

Close or loopback-bind non-public Docker host ports.
Add DOCKER-USER default-deny rules for non-approved ports.
Harden SSH root/password access after key-based access is verified.
Put ollama.bytelyst.com, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.

Exit criteria: only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.

Phase 2 — Operational correctness

Fix/retire unhealthy containers.
Resolve hermes-root-backup.service failed state.
Decide and document Gitea runner active/disabled state.
Remove stale cron paths and add missing-script checks.
Apply pending security/runtime updates in a maintenance window.

Exit criteria: no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.

Phase 3 — Docker and app hardening

Add non-root users, no-new-privileges, cap drops, and read-only rootfs by service.
Add resource limits for noisy services and emulators.
Move emulators/dev tools off public bindings.
Review cAdvisor privilege and observability surface.

Phase 4 — Backup, restore, and incident readiness

Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
Perform restore drill to non-prod target.
Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
Add quarterly tabletop review.

Phase 5 — Continuous governance

Monthly VM security review cron/checklist.
Secret scan before DevOps repo pushes.
OS lifecycle/EOL tracker.
Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.

Change Tickets With Quality Gates

Use this shape for each implementation PR/commit:

Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:

Minimum post-checks for Phase 1:

ss -ltnp
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
iptables -S DOCKER-USER
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
public smoke checks for approved hostnames
negative external probe for blocked ports
sshd -T after SSH changes
systemctl --failed --no-pager
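
For the negative external probe, a small TCP connect test run from a machine outside the VM is usually enough. This is a sketch: it relies on bash's `/dev/tcp` redirection and coreutils `timeout`, the 3-second timeout is an assumption, and `VM_PUBLIC_IP` in the usage comment is a placeholder.

```shell
# Report whether a TCP port accepts connections from wherever this runs.
# A refused or timed-out connection is reported as closed-or-filtered.
probe_port() {
  local host="$1" port="$2"
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "open $host:$port"
  else
    echo "closed-or-filtered $host:$port"
  fi
}

# Usage from an external host (not executed here):
#   for p in 3300 8025 11434; do probe_port VM_PUBLIC_IP "$p"; done
```

Running the probe on the VM itself does not prove anything about internet reachability, since local traffic takes a different path through the firewall rules.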

Do Not Start With

Rootless Docker migration.
Broad iptables default-drop without an allowlist.
Mass Compose rewrites across all products.
SSH password/root lockout before key-based sudo and rollback are proven.
Removing unhealthy containers before confirming whether they are deprecated or broken required services.
Publishing secret-scan output that contains secrets.

Suggested First Tickets

P0: Build and review exposure inventory — produce exact approved/blocked list for all currently bound ports.
P0: Lock Docker-published non-public ports — bind to loopback/internal or enforce DOCKER-USER drops.
P0: Harden SSH — disable password/root login after confirming key-based admin access.
P1: Triage unhealthy containers — fix healthchecks/apps or retire dead services.
P1: Resolve failed Hermes backup unit — fix or disable duplicate failed unit; keep cron backup healthy.
P1: Decide Gitea runner state — active smoke-tested runner or documented disabled service.
P2: Add secret scanner and stale-job scanner — prevent silent credential and automation drift.

Recommended first implementation order:

Generate and review docs/vm-exposure-inventory.md.
Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
Harden SSH with second-session/provider-console safety.
Move obvious internal-only Docker ports to loopback/internal bindings.
Add DOCKER-USER guardrails after the allowlist is proven.

This order improves safety without letting the port exposure issue linger too long.

Verification Commands for Future Runs

# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required

# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup

# SSH effective config
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd

# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'

# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile

# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done

Notes

This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.
