ByteLyst VM Security Blind Spots Roadmap

Review date: 2026-05-27
Reviewer: Hermes Agent
Scope: Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.

Executive Summary

The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.

The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on 0.0.0.0 / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.

Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.

Implementation Readiness Assessment

Roadmap quality score: 86%

Implementation confidence before remediation starts: 74%

Why not higher yet: the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.

Confidence after Phase 0 is complete: expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.

Quality strengths:

  • Evidence is concrete and command-derived rather than speculative.
  • The highest-risk items are correctly prioritized as P0.
  • The roadmap separates discovery from disruptive remediation.
  • It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.

Quality gaps to close before implementation:

  • Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
  • Add an approved exposure inventory before changing Docker bindings or DOCKER-USER.
  • Record a tested SSH rollback path and keep an active second session/provider console open before changing sshd.
  • Define what is intentionally public, private, internal-only, or deprecated for each service.
  • Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.

Implementation Guardrails

These rules apply before any Phase 1 change:

  • Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
  • Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
  • Do not add broad DROP rules before an allowlist is committed to the inventory.
  • Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
  • Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
  • Record exact rollback commands next to every change ticket.
  • Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.

Exposure Classification Model

Every listening port and Caddy hostname should be classified before changes:

| Class | Meaning | Expected Controls | Examples To Review |
| --- | --- | --- | --- |
| public-caddy | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
| public-direct | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
| private-admin | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
| loopback-only | Host-local service used by Caddy or local automation | Bind 127.0.0.1:port, no external bind | internal APIs behind Caddy |
| docker-internal | Container-to-container only | No host port mapping | databases, emulators, private workers |
| retire | Unused/deprecated | Remove service/port, disable health checks and jobs | stale dashboards/services |

Minimum inventory fields:

  • service/container name
  • repo/Compose file
  • host port and bind address
  • container port
  • Caddy hostname/path, if any
  • intended audience
  • authentication/control plane
  • classification
  • owner/approver
  • rollback command
  • post-change health check
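
The classification model and minimum fields above can be captured as a small record type with a validator. This is a minimal Python sketch, not an existing ByteLyst tool: the `InventoryEntry` and `validate_entry` names and the specific consistency rules are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Exposure classes taken from the classification table above.
EXPOSURE_CLASSES = {
    "public-caddy", "public-direct", "private-admin",
    "loopback-only", "docker-internal", "retire",
}
LOOPBACK_ADDRESSES = {"127.0.0.1", "::1"}


@dataclass
class InventoryEntry:
    """One row of the exposure inventory (hypothetical shape)."""
    service: str
    compose_file: str
    host_port: Optional[int]       # None when no host port is published
    bind_address: Optional[str]    # e.g. "127.0.0.1" or "0.0.0.0"
    container_port: Optional[int]
    caddy_route: Optional[str]     # hostname/path, if any
    audience: str
    auth_control: str
    classification: str
    owner: str
    rollback_command: str
    health_check: str


def validate_entry(entry: InventoryEntry) -> List[str]:
    """Return readable problems; an empty list means the row is consistent."""
    problems = []
    if entry.classification not in EXPOSURE_CLASSES:
        problems.append(f"unknown classification: {entry.classification}")
    for field_name in ("owner", "rollback_command", "health_check"):
        if not getattr(entry, field_name).strip():
            problems.append(f"missing {field_name}")
    if (entry.classification == "loopback-only"
            and entry.bind_address not in LOOPBACK_ADDRESSES):
        problems.append("loopback-only service is not bound to a loopback address")
    if entry.classification == "docker-internal" and entry.host_port is not None:
        problems.append("docker-internal service still publishes a host port")
    if entry.classification == "public-caddy" and not entry.caddy_route:
        problems.append("public-caddy service has no Caddy hostname/path")
    return problems
```

Running the validator over every row before Phase 1 would surface entries that lack an owner or rollback command, which is the same gap the guardrails call out.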

Evidence Snapshot

Collected on 2026-05-27 from this VM.

Host and patching

  • Host: srv1491630
  • OS: Ubuntu 25.10
  • Kernel: 6.17.0-29-generic
  • Uptime: about 14 hours at review time
  • Root filesystem: 193G total, 71G used, 123G available, 37% used
  • Memory: 15Gi total, about 10Gi available
  • Swap: 4.0G total, about 1.3G used
  • Reboot required: no
  • Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for libgcrypt20, libcaca0, and libssh2-1t64
  • Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent

Network and ingress

  • UFW: active; default deny incoming; only 22/tcp allowed by UFW rules
  • Docker iptables rules are present and publish many ports despite UFW's simple rule list
  • Public/listening TCP ports bound on all interfaces included:
    • 22, 80, 443
    • app/frontend ports: 3000, 3002, 3003, 3030, 3035, 3040, 3049, 3050, 3055, 3060, 3070, 3075, 3085
    • backend/API ports: 4004, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4019, 4020, 4025
    • infra/dev ports: 1025, 1234, 3100, 3300, 8025, 8081, 10000, 11434
  • Caddy source-of-truth config: /opt/bytelyst/Caddyfile, mounted read-only into the caddy container
  • docker exec caddy caddy validate --config /etc/caddy/Caddyfile: valid config, formatting warning only
  • Caddy public hostnames include:
    • api.bytelyst.com
    • gitea.bytelyst.com
    • admin.bytelyst.com
    • devops.bytelyst.com
    • tracker.bytelyst.com
    • llmlab.bytelyst.com
    • ollama.bytelyst.com
    • trading-api.bytelyst.com
    • invttrdg.bytelyst.com
    • notes.bytelyst.com
    • clock.bytelyst.com

SSH and account surface

Effective sshd -T settings showed:

  • permitrootlogin yes
  • passwordauthentication yes
  • pubkeyauthentication yes
  • kbdinteractiveauthentication no
  • maxauthtries 6
  • x11forwarding yes
  • clientaliveinterval 0

fail2ban is active with one jail: sshd; no current bans at review time.

Docker runtime and containers

  • Docker: client/server 29.4.2; newer Docker packages are available
  • Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; live_restore=false
  • Most product containers run with writable root filesystems and no explicit user configured
  • cadvisor is privileged
  • DOCKER-USER chain appears empty, so there is no central Docker firewall policy in front of published containers
  • Multiple containers are unhealthy:
    • learning_ai_common_plat-llmlab-dashboard-1
    • learning_ai_common_plat-actiontrail-web-1
    • learning_ai_common_plat-jarvisjr-web-1
    • learning_ai_common_plat-localmemgpt-web-1
    • learning_ai_common_plat-nomgap-web-1
    • learning_ai_common_plat-flowmonk-web-1
    • learning_ai_common_plat-mindlyst-web-1

Gitea and CI

  • Gitea public route: https://gitea.bytelyst.com
  • Local Gitea container port: host 3300 -> container 3000, bound on 0.0.0.0 and IPv6
  • gitea-act-runner.service: enabled but inactive/dead
  • Runner user exists: gitea-runner, member of docker
  • Runner config directory permissions look reasonable:
    • /home/gitea-runner/act_runner: 750, owned by gitea-runner:gitea-runner
    • /home/gitea-runner/act_runner/config.yaml: 600, owned by gitea-runner:gitea-runner

Backup and operations

  • systemctl --failed showed failed unit:
    • hermes-root-backup.service: Sync root Hermes persistent backup to GitHub
  • Hermes cron backup is active and healthy:
    • job 470832621b43, Sync Hermes persistent-data backup to GitHub, every 30 minutes, last run ok
  • Existing VM maintenance cron entries exist for health check and cleanup under /opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/
  • A root crontab entry still references /opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh, which may be stale after repo relocation/renaming

Blind Spots and Risk Register

P0 — Internet-exposed Docker ports bypass the intended ingress model

Risk: UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.

Examples observed: 3300, 8025, 1025, 1234, 8081, 10000, 11434, many 30xx web ports, and many 40xx backend ports.

Impact: Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.

Roadmap:

  • Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
  • For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
  • Bind non-public Compose ports to 127.0.0.1 or remove host port mapping entirely.
    • Internal emulator/mail/observability ports 1025, 8025, 10000, 1234, 8081, and 3100 are loopback-bound.
    • Common-platform direct app/API bypasses are loopback-bound or removed from host publishing.
    • Notes, Clock, and InvtTrdg direct app/API bypasses are loopback-bound.
    • DevOps dashboard/API direct private-admin bypasses are loopback-bound.
  • Add a DOCKER-USER chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
  • Keep only 80/443 and intentionally public SSH exposed at the provider/firewall layer.
  • Add a recurring check that compares ss -ltn and Docker published ports against the approved inventory.

Acceptance criteria:

  • docs/vm-exposure-inventory.md lists every ss -ltnp listener and every Docker published port.
  • Every non-SSH direct public bind has an approved classification.
  • Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in DOCKER-USER.
  • External probe confirms non-approved ports are closed from the internet.
  • Caddy-routed public hostnames still pass smoke checks.

Rollback: keep a saved copy of original Compose files and iptables-save output; rollback means restoring original port mappings or flushing only the newly added DOCKER-USER rules.
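
The recurring check in the roadmap (comparing `ss -ltn` listeners against the approved inventory) could look roughly like the sketch below. The function names are hypothetical, and the parser assumes the usual `ss -ltn` column layout where the fourth field is `Local Address:Port`.

```python
from typing import Iterable, List, Set


def exposed_ports(ss_output: str) -> Set[int]:
    """Ports from `ss -ltn` output that listen on a non-loopback address."""
    ports = set()
    for line in ss_output.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] != "LISTEN":
            continue  # skips the header row and anything unexpected
        host, _, port = parts[3].rpartition(":")
        host = host.split("%", 1)[0]  # drop an interface suffix such as %lo
        if host.startswith("127.") or host in ("[::1]", "::1"):
            continue  # loopback-only listeners are not externally reachable
        if port.isdigit():
            ports.add(int(port))
    return ports


def unapproved_ports(ss_output: str, approved: Iterable[int]) -> List[int]:
    """Non-loopback listening ports that are missing from the approved set."""
    return sorted(exposed_ports(ss_output) - set(approved))
```

One way to use it is to pass the output of `subprocess.run(["ss", "-ltn"], capture_output=True, text=True, check=True).stdout` together with the approved ports from the inventory, and alert when the returned list is non-empty.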

P0 — SSH permits root login and password authentication

Risk: PermitRootLogin yes and PasswordAuthentication yes keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.

Roadmap:

  • Confirm all required admin users have working SSH keys and sudo access.
  • Add a non-root break-glass admin path if one does not exist.
  • Change SSH effective config to:
    • PermitRootLogin prohibit-password or no
    • PasswordAuthentication no
    • X11Forwarding no
    • lower MaxAuthTries, e.g. 3
    • set a sane ClientAliveInterval / ClientAliveCountMax
  • Validate with a second session before restarting SSH.
  • Record rollback commands and keep console/provider access available during rollout.

Acceptance criteria:

  • A non-root sudo admin user can log in with SSH key auth.
  • Root password login no longer works.
  • Existing automation using scripts/VMs/HostingerVM/login.sh still works or is updated.
  • sshd -T confirms the intended effective config.
  • fail2ban-client status sshd still reports an active jail.

Rollback: use the provider console or a still-open root session to restore the previous sshd_config drop-in and restart ssh.
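
The `sshd -T` acceptance check can be automated. The sketch below is an assumption-laden example rather than an existing script: `CHECKS` encodes the target values listed in the roadmap (with `MaxAuthTries` at most 3 and any positive `ClientAliveInterval`), and it accepts `without-password` because `sshd -T` may report `prohibit-password` under that legacy alias.

```python
from typing import Dict, List


def parse_sshd_t(output: str) -> Dict[str, str]:
    """Parse `sshd -T` output ("key value" per line) into a dict."""
    settings = {}
    for line in output.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            settings[key.lower()] = value.strip()
    return settings


# Target values from the roadmap; thresholds are illustrative assumptions.
CHECKS = {
    "permitrootlogin": lambda v: v in {"no", "prohibit-password", "without-password"},
    "passwordauthentication": lambda v: v == "no",
    "x11forwarding": lambda v: v == "no",
    "maxauthtries": lambda v: v.isdigit() and int(v) <= 3,
    "clientaliveinterval": lambda v: v.isdigit() and int(v) > 0,
}


def sshd_violations(output: str) -> List[str]:
    """Setting names whose effective value does not meet the target."""
    settings = parse_sshd_t(output)
    return sorted(
        name for name, is_ok in CHECKS.items()
        if not is_ok(settings.get(name, ""))
    )
```

Against the effective settings recorded in the evidence snapshot, every one of the five checks would be reported; after hardening, the list should be empty.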

P0 — Public/private boundary for dev and internal tooling is unclear

Risk: Caddy publishes ollama.bytelyst.com, llmlab.bytelyst.com, devops.bytelyst.com, admin.bytelyst.com, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.

Roadmap:

  • Document public hostnames, auth model, and data sensitivity.
  • Require explicit approval before exposing new dashboards or model endpoints.
  • Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
  • Add security headers/auth checks to public UI health reviews.
  • Confirm ollama.bytelyst.com should be publicly reachable at all; if not, move behind private network or auth gate.

Acceptance criteria:

  • ollama, llmlab, devops, admin, gitea, and observability-adjacent routes each have an owner-approved exposure class.
  • Public admin-like routes require authentication or an explicit documented exception.
  • No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.

P1 — Docker/container hardening is mostly default

Risk: Many containers run as the default (root) user with a writable root filesystem and Docker's default capability set, all under a rootful Docker daemon. A compromised app gets more host-adjacent leverage than needed.

Roadmap:

  • Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
  • Start with public-facing/backend services and admin dashboards.
  • Add security_opt: ["no-new-privileges:true"] where compatible.
  • Add cap_drop: ["ALL"] and selectively add back capabilities only when needed.
  • Convert app images to non-root users consistently.
  • Use read_only: true plus explicit writable tmp/cache volumes where compatible.
  • Review cadvisor privileged mode and replace/restrict if possible.
  • Enable Docker live-restore if it fits maintenance operations.

Implementation note: do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with no-new-privileges, non-root app users where images already support it, and targeted capability drops for public-facing app containers.

P1 — Unhealthy containers can normalize broken deployments

Risk: Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.

Roadmap:

  • Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
  • Fix or remove bad healthchecks so Docker health state is trustworthy.
  • Add alerting for sustained unhealthy containers.
  • Make deployment scripts fail on unhealthy post-deploy state.
  • Update dashboard/observability docs with current service ownership and expected state.

Acceptance criteria:

  • Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
  • Docker health state matches the product's actual serving state.
  • Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
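
A post-deploy gate with a grace period might be sketched as follows. The function names, the 120-second default, and the polling interval are assumptions; only the `docker ps --filter health=unhealthy` query comes from this document.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def docker_unhealthy() -> List[str]:
    """Names of containers Docker currently reports as unhealthy."""
    result = subprocess.run(
        ["docker", "ps", "--filter", "health=unhealthy", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def failing_after_grace(
    required: Iterable[str],
    get_unhealthy: Callable[[], List[str]] = docker_unhealthy,
    grace_seconds: float = 120.0,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until required containers are healthy or the grace period ends.

    Returns the required containers that are still unhealthy at the deadline;
    an empty list means the deploy may proceed.
    """
    deadline = time.monotonic() + grace_seconds
    while True:
        failing = sorted(set(required) & set(get_unhealthy()))
        if not failing:
            return []
        if time.monotonic() >= deadline:
            return failing
        sleep(poll_seconds)
```

A deployment script could exit non-zero when the returned list is non-empty. Passing `get_unhealthy` as a parameter keeps the gate testable without a Docker daemon.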

P1 — Gitea Actions runner is enabled but inactive

Risk: CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.

Roadmap:

  • Decide whether the runner should be active or intentionally disabled.
  • If active: restart and verify gitea-act-runner.service, runner labels, and Docker access.
  • Run and record a dedicated Gitea Actions smoke workflow result.
  • If disabled: disable the service and document the intentional state.
  • Keep runner secrets separate from smoke/test workflows.
  • Add a runner-health check to VM observability.

Decision needed: runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.

P1 — Backup/restore evidence is split and one backup unit is failed

Risk: Hermes cron backup works, but hermes-root-backup.service is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.

Roadmap:

  • Inspect hermes-root-backup.service logs and decide whether to fix, disable, or replace it with the cron-backed job.
  • Repair the root backup checkout divergence and verify a successful hermes-root-backup.service one-shot run.
  • Update /root/.hermes/scripts/sync_hermes_persistent_backup.py so future generated-backup divergence preserves a safety branch and rejoins the remote backup stream instead of wedging on git pull --ff-only.
  • Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
  • Run a restore drill into a non-production path/profile.
  • Verify no raw .env, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
  • Add backup freshness and restore-drill status to the monthly VM review.

Acceptance criteria:

  • systemctl --failed no longer includes backup units unless the failure is intentionally documented.
  • Backup status shows source, destination, cadence, last success, and restore command.
  • A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.

P1 — Patch management has pending security/runtime updates

Risk: Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.

Roadmap:

  • Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
  • Define a Docker upgrade maintenance window with pre/post checks.
  • Run apt list --upgradable and capture package classes without dumping noise.
  • Verify apps after Docker/containerd upgrades.

Acceptance criteria:

  • Security updates and Docker/runtime updates are tracked separately.
  • Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
  • Reboot requirement is checked and scheduled rather than discovered accidentally.
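
To report security and Docker/runtime updates separately, the output of `apt list --upgradable` can be bucketed. This is a rough sketch: the `RUNTIME_PACKAGES` set is an assumption about what counts as runtime on this VM, and treating a `-security` suite as the security marker will miss fixes that are published only through other pockets.

```python
from typing import Dict, List

# Assumption: these package names are treated as "runtime" for this VM.
RUNTIME_PACKAGES = {
    "docker-ce", "docker-ce-cli", "containerd.io",
    "docker-buildx-plugin", "docker-compose-plugin",
}


def classify_upgrades(apt_output: str) -> Dict[str, List[str]]:
    """Bucket `apt list --upgradable` lines into runtime/security/other."""
    buckets: Dict[str, List[str]] = {"runtime": [], "security": [], "other": []}
    for line in apt_output.splitlines():
        if "/" not in line or "[upgradable" not in line:
            continue  # skips the "Listing..." banner and blank lines
        name, _, rest = line.partition("/")
        suites = rest.split(" ", 1)[0]  # e.g. "questing-updates,questing-security"
        if name in RUNTIME_PACKAGES:
            buckets["runtime"].append(name)
        elif "-security" in suites:
            buckets["security"].append(name)
        else:
            buckets["other"].append(name)
    return buckets
```

The weekly checkpoint could print only the bucket names and counts, which keeps the report short while still separating the two tracks.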

P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking

Risk: Ubuntu interim (non-LTS) releases such as 25.10 are supported for only nine months. If this VM is long-lived production infrastructure, lifecycle tracking matters.

Roadmap:

  • Record current Ubuntu 25.10 support/EOL date in ops docs.
  • Decide whether to stay on interim releases or migrate to an LTS baseline.
  • Add an OS lifecycle check to quarterly review.

P2 — Repository/config secret hygiene needs a repeatable scanner

Risk: The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.

Roadmap:

  • Add a documented secret-scan command using gitleaks or trufflehog for tracked files and selected untracked ops directories.
  • Scan historical directories such as DELETED_bytelyst-devops-tools separately before archiving or deleting.
  • Add .gitignore patterns for generated scans, local account snapshots, and credential-shaped outputs.
  • Keep examples as .example files only.

P2 — Cron/systemd ownership and drift are not fully inventoried

Risk: Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.

Roadmap:

  • Inventory root/user crontabs, /etc/cron.d, systemd timers, Hermes cron, and Gitea Actions schedules.
  • Remove or update stale /opt/bytelyst/bytelyst-devops-tools/... references after confirming replacements.
  • Add owner, purpose, expected output, and alert channel for every job.
  • Add a stale-job detector for missing script paths and failed systemd units.

Acceptance criteria:

  • No active cron/systemd job references a missing path.
  • Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
  • Stale path detection runs in the monthly VM review.
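
A stale-path detector for crontab content can be quite small. The sketch below is illustrative and is not the logic in `vm-health-check.sh`: it only looks at absolute paths ending in `.sh` or `.py` (an assumption about how jobs are written here), so that log redirect targets are ignored.

```python
import os
import re
from typing import Callable, List

# Absolute paths that look like scripts; the extension list is an assumption.
SCRIPT_PATH = re.compile(r"(?<![\w.-])/[\w./-]+\.(?:sh|py)\b")


def missing_cron_paths(
    crontab_text: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Script paths referenced by active crontab lines that do not exist."""
    missing: List[str] = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are not active jobs
        for path in SCRIPT_PATH.findall(stripped):
            if not exists(path) and path not in missing:
                missing.append(path)
    return missing
```

The same function works on the output of `crontab -l` for any user and on files under `/etc/cron.d`, which would extend coverage beyond the root crontab.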

P2 — Observability exists but needs security-focused SLOs

Risk: Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.

Roadmap:

  • Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
  • Validate alert delivery to Telegram.
  • Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.

Execution Plan

Phase 0 — Freeze and inventory before changes

  • Freeze new public hostnames/ports until the exposure inventory is complete.
  • Generate docs/vm-exposure-inventory.md from Docker, Caddy, ss, and DNS.
  • Mark each exposed service as public, private, internal-only, or to be retired.
  • Review with S before changing public access for customer/user-facing apps.

Exit criteria: the inventory is reviewed and every P0 change has a rollback line and validation line.

Phase 1 — Immediate security hardening

  • Close or loopback-bind non-public Docker host ports.
    • Loopback-bound internal emulator/mail/observability ports 1025, 8025, 10000, 1234, 8081, and 3100.
    • Closed/loopback-bound common-platform direct app/API bypasses.
    • Loopback-bound Notes, Clock, and InvtTrdg direct app/API bypasses.
    • Loopback-bound DevOps dashboard/API direct private-admin bypasses.
  • Add DOCKER-USER default-deny rules for non-approved ports.
  • Harden SSH root/password access after key-based access is verified.
  • Put ollama.bytelyst.com, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.

Exit criteria: only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.

Phase 2 — Operational correctness

  • Fix/retire unhealthy containers.
  • Resolve hermes-root-backup.service failed state.
  • Decide and document Gitea runner active/disabled state.
  • Add missing-script checks. Stale root cron path was fixed on 2026-05-27.
  • Apply pending security/runtime updates in a maintenance window.

Exit criteria: no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.

Phase 3 — Docker and app hardening

  • Add non-root users, no-new-privileges, cap drops, and read-only rootfs by service.
  • Add resource limits for noisy services and emulators.
  • Move emulators/dev tools off public bindings.
  • Review cAdvisor privilege and observability surface.

Phase 4 — Backup, restore, and incident readiness

  • Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
  • Perform restore drill to non-prod target.
  • Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
  • Add quarterly tabletop review.

Phase 5 — Continuous governance

  • Monthly VM security review cron/checklist.
  • Secret scan before DevOps repo pushes.
  • OS lifecycle/EOL tracker.
  • Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.

Change Tickets With Quality Gates

Use this shape for each implementation PR/commit:

Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:

Minimum post-checks for Phase 1:

  • ss -ltnp
  • docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
  • iptables -S DOCKER-USER
  • docker exec caddy caddy validate --config /etc/caddy/Caddyfile
  • public smoke checks for approved hostnames
  • negative external probe for blocked ports
  • sshd -T after SSH changes
  • systemctl --failed --no-pager
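
The negative external probe in the list above could be scripted along these lines. It is a minimal sketch with hypothetical function names, and it only gives a meaningful result when run from a machine outside the VM and outside any private network that can reach it.

```python
import socket
from typing import Iterable, List


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or DNS failure
        return False


def probe_failures(
    host: str,
    must_be_closed: Iterable[int],
    must_be_open: Iterable[int] = (),
    timeout: float = 3.0,
) -> List[str]:
    """Describe every port whose reachability differs from the expectation."""
    failures = []
    for port in must_be_closed:
        if is_port_open(host, port, timeout):
            failures.append(f"{port} is reachable but should be closed")
    for port in must_be_open:
        if not is_port_open(host, port, timeout):
            failures.append(f"{port} is unreachable but should be open")
    return failures
```

A filtered port (silently dropped) shows up as a timeout and is treated as closed, which matches the intent of the check. The port lists would come from the approved exposure inventory.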

Implementation Log

2026-05-27 — Phase 2 backup and cron drift

Changed:

  • Repointed the root Lucky25 monitor cron from /opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh to /opt/bytelyst/learning_ai_devops_tools/scripts/monitor-lucky25-execution.sh.
  • Saved the pre-change root crontab at /tmp/root-crontab-before-vm-security-20260527.txt.
  • Repaired /root/repos/bytelyst_hostinger_hermes_vm, which was 1 commit ahead of and 11 commits behind its remote. The obsolete local generated backup commit conflicted with newer remote snapshots, so the rebase skipped it and kept the current remote stream.
  • Patched /root/.hermes/scripts/sync_hermes_persistent_backup.py to replace unconditional git pull --ff-only with explicit fetch/merge-base handling. Diverged generated snapshots now create a safety branch before attempting rebase and fall back to origin/<branch> if the generated files conflict.
  • Saved the pre-change backup script at /tmp/sync_hermes_persistent_backup.py.before-vm-security-20260527.

Verified:

  • crontab -l now points the Lucky25 monitor at the current repo script.
  • python3 -m py_compile /tmp/sync_hermes_persistent_backup.py passed before deployment.
  • systemctl start hermes-root-backup.service succeeded twice after repair.
  • systemctl status hermes-root-backup.service hermes-root-backup.timer --no-pager showed the service exited status=0/SUCCESS and the timer remains active.
  • /root/repos/bytelyst_hostinger_hermes_vm is aligned with origin/main after successful backup commits 415e824 and 369e584.

Residual risk:

  • A restore drill is still required before the backup posture should be considered fully proven.
  • The backup sync script is runtime-managed under /root/.hermes/scripts/; add a tracked installer or source-of-truth copy so this hardening does not depend on manual VM state.
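
For readers without access to the VM, the divergence handling described in this entry can be pictured as a small decision function. This is a hedged reconstruction from the description above, not the contents of `sync_hermes_persistent_backup.py`: the function names, the `backup-safety` branch name, and the exact command sequence are assumptions.

```python
from typing import List


def sync_plan(
    ahead: int,
    behind: int,
    branch: str = "main",
    safety_branch: str = "backup-safety",
) -> List[List[str]]:
    """Git commands to run after `git fetch origin`, given ahead/behind counts.

    The counts can be read from
    `git rev-list --left-right --count HEAD...origin/<branch>`,
    which prints the ahead count followed by the behind count.
    """
    remote = f"origin/{branch}"
    if behind == 0:
        return []  # nothing to integrate; local commits can simply be pushed
    if ahead == 0:
        return [["git", "merge", "--ff-only", remote]]
    # Diverged: keep the local commits reachable, then try to replay them.
    return [
        ["git", "branch", "--force", safety_branch, "HEAD"],
        ["git", "rebase", remote],
    ]


def conflict_fallback(branch: str = "main") -> List[List[str]]:
    """Commands to rejoin the remote stream if the rebase hits conflicts."""
    return [
        ["git", "rebase", "--abort"],
        ["git", "reset", "--hard", f"origin/{branch}"],
    ]
```

With the state recorded here (ahead 1, behind 11), `sync_plan(1, 11)` yields the safety-branch step followed by the rebase, and `conflict_fallback()` would apply only if that rebase stops on a conflict.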

2026-05-27 — Phase 2 Gitea runner state

Changed:

  • Started gitea-act-runner.service; it was enabled but inactive.
  • Treated the intended state as active because the service unit is enabled, historical journal entries show successful task execution, and the runner declared itself successfully on restart.

Verified:

  • systemctl is-active gitea-act-runner.service returned active.
  • systemctl status gitea-act-runner.service --no-pager showed bytelyst-host-runner running as gitea-runner.
  • Runner labels declared successfully: ubuntu-latest, linux, bytelyst, hostinger.
  • Runner config uses Docker executor images and privileged: false; Docker socket access is granted through the docker group.
  • Runner immediately picked up task 42 for bytelyst/bytelyst-devops-tools, proving it can talk to local Gitea.

Residual risk:

  • Record a small dedicated smoke workflow that does not need production secrets, so runner health is proven by a controlled workflow rather than incidental queued work.
  • Add runner health to VM observability so enabled-but-inactive drift is caught automatically.

2026-05-27 — Phase 2 stale automation detector

Changed:

  • Extended scripts/VMs/HostingerVM/vm-health-check.sh with an AUTOMATION DRIFT section.
  • The daily health check now reports failed systemd units and root crontab script paths that no longer exist.
  • Made optional /var/log/vm-health-check.log writes silent when the script runs in a restricted/read-only context.

Verified:

  • bash -n scripts/VMs/HostingerVM/vm-health-check.sh passed.
  • Restricted --json run stayed quiet on log-write failure and reported the new checks.
  • Host-permission --json run reported failed_units=OK and cron_missing_paths=OK.

Residual risk:

  • The detector currently covers root crontab and failed systemd units. Full ownership inventory still needs /etc/cron.d, user crontabs, Hermes cron, Gitea schedules, owners, outputs, and alert channels.

2026-05-27 — Phase 2 unhealthy containers

Changed:

  • Added HOSTNAME=0.0.0.0 to six managed Next.js web services in /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml: jarvisjr-web, flowmonk-web, mindlyst-web, actiontrail-web, localmemgpt-web, and llmlab-dashboard.
  • Recreated those six services from existing images with docker compose ... up -d --no-build.
  • Retired the orphan learning_ai_common_plat-nomgap-web-1 container. Current Compose already documents nomgap-web as deployed to Vercel and not part of the Docker stack.

Verified:

  • docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quiet passed.
  • The six recreated web containers report Docker health healthy.
  • docker ps --filter health=unhealthy returns no containers.
  • Host-level smoke checks returned HTTP 200 for 3035, 3040, 3050, 3060, 3070, and 3075; retired orphan port 3055 is closed.
  • Host-permission vm-health-check.sh --json reports container_health=OK, container_loops=OK, failed_units=OK, and cron_missing_paths=OK.

Committed/pushed:

  • learning_ai_common_plat: af035e7d (fix: bind ecosystem Next apps on all interfaces) pushed to GitHub.

Residual risk:

  • Local Gitea mirror push for learning_ai_common_plat failed at Git HTTP transport even though fetch and health checks work; retry/fix mirror push separately.
  • This fixed health state, not public exposure. Several direct published ports remain to be loopback-bound or blocked in Phase 1.

2026-05-27 — Phase 1 internal port loopback

Changed:

  • Updated /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml so cosmos-emulator, azurite, mailpit, and loki publish host ports only on 127.0.0.1.
  • Recreated only cosmos-emulator, azurite, mailpit, and loki with docker compose ... up -d --no-build.

Verified:

  • docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quiet passed.
  • Docker reports the target services healthy.
  • ss -ltnp shows 1025, 8025, 10000, 1234, 8081, and 3100 listening on 127.0.0.1 only, with no 0.0.0.0 or IPv6 wildcard bind for that group.
  • Local smoke checks returned HTTP 200 for Mailpit UI, Loki readiness, and Cosmos explorer. Azurite returned HTTP 400 on the raw blob endpoint while its container healthcheck remained healthy, which is expected for an unauthenticated root request.

Committed/pushed:

  • learning_ai_common_plat: 1c09e479 (fix: bind internal infra ports to loopback) pushed to GitHub.

Residual risk:

  • Public direct bypass remains for app/API ports, Gitea direct port 3300, devops/admin surfaces, and Ollama 11434.
  • Add a DOCKER-USER fallback policy after the remaining allowlist is reviewed.
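
The loopback change above amounts to prefixing each Compose short-syntax port mapping with `127.0.0.1`. A minimal sketch of that rewrite follows; the helper name is hypothetical and it assumes IPv4-style mappings without bracketed IPv6 addresses.

```python
def loopback_bind(mapping: str) -> str:
    """Rewrite a Compose short-syntax port mapping to bind on 127.0.0.1.

    "3300:3000"             -> "127.0.0.1:3300:3000"
    "0.0.0.0:8025:8025/tcp" -> "127.0.0.1:8025:8025/tcp"
    A container-only mapping such as "3000" is returned unchanged, because
    it has no fixed host port to pin and needs a manual decision.
    """
    spec, slash, protocol = mapping.partition("/")
    parts = spec.split(":")
    if len(parts) == 1:
        return mapping
    host_port, container_port = parts[-2], parts[-1]
    return f"127.0.0.1:{host_port}:{container_port}{slash}{protocol}"
```

Applying the rewrite only to services classified as loopback-only or private-admin keeps the Caddy 80/443 mappings untouched.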

2026-05-27 — Phase 1 common-platform app/API bypasses

Changed:

  • Updated /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml so remaining published common-platform web/dashboard ports bind to 127.0.0.1.
  • Recreated the common-platform web/dashboard services that previously published on 0.0.0.0: tracker-web, lysnrai-dashboard, jarvisjr-web, flowmonk-web, mindlyst-web, actiontrail-web, localmemgpt-web, and llmlab-dashboard.
  • Recreated stale common-platform backend containers peakpulse-backend, lysnrai-backend, and nomgap-backend; their current Compose definitions do not publish host ports, so the old direct 4010, 4015, and 4013 mappings were removed.

Verified:

  • docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quiet passed.
  • docker ps --filter name=learning_ai_common_plat ... | grep 0.0.0.0 returned no common-platform wildcard-published containers.
  • docker ps --filter health=unhealthy returned no unhealthy containers.
  • ss -ltnp shows 3002, 3003, 3035, 3040, 3050, 3060, 3070, and 3075 bound to 127.0.0.1.
  • Host smoke checks returned HTTP 200 for 3002, 3003, 3035, 3040, 3050, 3060, 3070, and 3075.

Committed/pushed:

  • learning_ai_common_plat: e29cc58a (fix: bind app host ports to loopback) pushed to GitHub.

Remaining wildcard Docker publishes after this checkpoint:

  • Caddy public ingress: 80, 443.
  • Local Gitea direct port: 3300.
  • DevOps dashboard/API: 3049, 4004.
  • Host process (not a Docker publish): Ollama still listens on wildcard 11434.

2026-05-27 — Phase 1 product repo app/API bypasses

Changed:

  • Updated /opt/bytelyst/learning_ai_notes/docker-compose.yml and docker-compose.override.yml so NoteLett backend/web bind to 127.0.0.1.
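  • Updated /root/bytelyst.ai/repos/learning_ai_clock/docker-compose.yml so ChronoMind backend/web bind to 127.0.0.1 on the host; also added HOSTNAME=0.0.0.0 to the container environment so the Next.js server listens on all interfaces inside the container and the in-container healthcheck can reach it. This setting does not widen host exposure.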
  • Updated /opt/bytelyst/learning_ai_invt_trdg/docker-compose.yml so InvtTrdg backend/web bind to 127.0.0.1.
  • Recreated the affected services without rebuilding images.

Verified:

  • Notes: 3000 and 4016 listen on 127.0.0.1; local web/backend smoke checks returned HTTP 200.
  • Clock: 3030 and 4011 listen on 127.0.0.1; local web/backend smoke checks returned HTTP 200; containers are healthy.
  • InvtTrdg: 3085 and 4025 listen on 127.0.0.1; local web/backend smoke checks returned HTTP 200.
  • docker ps --format ... | grep 0.0.0.0 now shows only Caddy 80/443, Gitea 3300, and DevOps 3049/4004 as Docker wildcard publishes.
  • docker ps --filter health=unhealthy returned no unhealthy containers.
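
The wildcard-publish check can be made allowlist-aware so that expected public ingress does not show up as noise. The sketch below is illustrative only: wildcard_publishes is a hypothetical name, and the container name caddy is taken from the docker exec command later in this document, while any other names are assumptions.

```shell
# Hypothetical helper: reads "<name><TAB><ports>" lines, as produced by
#   docker ps --format '{{.Names}}\t{{.Ports}}'
# and prints containers that publish on 0.0.0.0 or an IPv6 wildcard,
# skipping container names passed as arguments (the allowlist).
wildcard_publishes() {
  local allow=" $* "
  awk -F '\t' -v allow="$allow" '
    index(allow, " " $1 " ") { next }     # allowlisted container, ignore
    $2 ~ /(^|[ ,])(0\.0\.0\.0|\[::\]|::):[0-9]+->/ { print $1 "\t" $2 }
  '
}
```

Usage would be `docker ps --format '{{.Names}}\t{{.Ports}}' | wildcard_publishes caddy`. At this checkpoint the expected output is the Gitea and DevOps containers only; once those are fixed, the output should be empty.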

Committed/pushed:

  • learning_ai_notes: 3683ba9 (fix: bind Notes host ports to loopback) pushed to GitHub.
  • learning_ai_clock: ee572f8 (fix: bind Clock host ports to loopback) pushed to GitHub.
  • learning_ai_invt_trdg: 39490bc (fix: bind InvtTrdg host ports to loopback) pushed to GitHub.

Remaining wildcard direct exposure after this checkpoint:

  • Expected public ingress: 22, 80, 443.
  • Docker wildcard publishes still to fix: Gitea direct port 3300, DevOps dashboard/API 3049 and 4004.
  • Host process still to fix: Ollama 11434.

2026-05-27 — Phase 1 DevOps private-admin bypasses

Changed:

  • Updated /opt/bytelyst/learning_ai_devops_tools/dashboard/docker-compose.yml so devops-web and devops-backend bind host ports only on 127.0.0.1.
  • Recreated devops-backend and devops-web without rebuilding images.

Verified:

  • docker compose config --quiet passed in the DevOps dashboard directory.
  • devops-web now publishes 127.0.0.1:3049->3000.
  • devops-backend now publishes 127.0.0.1:4004->4004 and is healthy.
  • Local smoke checks returned HTTP 200 for http://127.0.0.1:3049 and http://127.0.0.1:4004/health.
  • docker ps --format ... | grep 0.0.0.0 now shows only Caddy 80/443 and Gitea 3300 as Docker wildcard publishes.

Remaining wildcard direct exposure after this checkpoint:

  • Expected public ingress: 22, 80, 443.
  • Docker wildcard publish still to fix: Gitea direct port 3300.
  • Host process still to fix: Ollama 11434.
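
For the remaining Ollama item, one possible approach is a systemd drop-in that pins the listener to loopback. This is a sketch under several assumptions that the review does not confirm: that Ollama runs as a systemd unit named ollama.service, that the wildcard bind comes from an OLLAMA_HOST setting, and that no container needs to reach Ollama over the Docker bridge. The function name and the file name loopback.conf are hypothetical.

```shell
# Hypothetical helper: writes a systemd drop-in that sets OLLAMA_HOST to
# loopback. $1 is the drop-in directory, for example
# /etc/systemd/system/ollama.service.d (unit name is an assumption).
# Any existing OLLAMA_HOST override should be removed separately so that
# this value is the effective one.
write_ollama_loopback_dropin() {
  local dir="$1"
  mkdir -p "$dir"
  printf '[Service]\nEnvironment="OLLAMA_HOST=127.0.0.1:11434"\n' > "${dir}/loopback.conf"
}
```

After writing the file, `systemctl daemon-reload` and a restart of the unit would be needed, followed by a re-check with ss -ltnp. If containers currently reach Ollama through the host's bridge address, a loopback bind would break them, so that dependency should be confirmed first.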

Do Not Start With

  • Rootless Docker migration.
  • Broad iptables default-drop without an allowlist.
  • Mass Compose rewrites across all products.
  • SSH password/root lockout before key-based sudo and rollback are proven.
  • Removing unhealthy containers before confirming whether they are deprecated services or required services that are broken.
  • Publishing secret-scan output that contains secrets.

Suggested First Tickets

  1. P0: Build and review exposure inventory — produce exact approved/blocked list for all currently bound ports.
  2. P0: Lock Docker-published non-public ports — bind to loopback/internal or enforce DOCKER-USER drops.
  3. P0: Harden SSH — disable password/root login after confirming key-based admin access.
  4. P1: Triage unhealthy containers — fix healthchecks/apps or retire dead services.
  5. P1: Resolve failed Hermes backup unit — fix or disable duplicate failed unit; keep cron backup healthy.
  6. P1: Decide Gitea runner state — active smoke-tested runner or documented disabled service.
  7. P2: Add secret scanner and stale-job scanner — prevent silent credential and automation drift.

Recommended first implementation order:

  1. Generate and review docs/vm-exposure-inventory.md.
  2. Fix the stale cron path and failed backup unit, because both have a lower blast radius and improve rollback confidence.
  3. Harden SSH with second-session/provider-console safety.
  4. Move obvious internal-only Docker ports to loopback/internal bindings.
  5. Add DOCKER-USER guardrails after the allowlist is proven.

This order improves safety without letting the port exposure issue linger too long.
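
To make the final DOCKER-USER step concrete, the sketch below prints candidate rules for review and applies nothing. It rests on assumptions not stated in this roadmap: the external interface name (eth0 is a placeholder) and an allowlist consisting only of the Caddy ports 80 and 443 over TCP. Because DOCKER-USER sees packets after DNAT, the rules match the original destination port through conntrack rather than --dport. The function name docker_user_rules is hypothetical.

```shell
# Hypothetical helper: prints iptables commands for a DOCKER-USER allowlist.
# $1 = external interface (assumption), remaining args = public TCP ports.
# Output is for review only; nothing is executed here.
docker_user_rules() {
  local ext_if="$1"; shift
  local pos=1 port
  # Allow replies to connections that containers initiated.
  echo "iptables -I DOCKER-USER ${pos} -i ${ext_if} -m conntrack --ctstate ESTABLISHED,RELATED -j RETURN"
  for port in "$@"; do
    pos=$((pos + 1))
    # Match the host port the client originally connected to (pre-DNAT).
    echo "iptables -I DOCKER-USER ${pos} -i ${ext_if} -p tcp -m conntrack --ctorigdstport ${port} -j RETURN"
  done
  pos=$((pos + 1))
  # Everything else arriving on the external interface for a container is dropped.
  echo "iptables -I DOCKER-USER ${pos} -i ${ext_if} -j DROP"
}
```

Running `docker_user_rules eth0 80 443` prints four rules. These rules affect only forwarded traffic to containers, so host services such as SSH are governed by UFW as before. IPv6 would need equivalent ip6tables rules if Docker publishes over IPv6, and a rule set like this would also block the direct Gitea port 3300, which is why it belongs after the allowlist is proven.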

Verification Commands for Future Runs

# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required

# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup

# SSH effective config
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd

# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'

# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile

# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
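
The SSH check above prints the effective settings but leaves interpretation to the reader. As a hedged sketch, the output can be compared against the target in the SSH hardening ticket (root login and password login disabled, key authentication enabled). The function name ssh_findings is hypothetical, and the strict target of permitrootlogin no is an interpretation of that ticket.

```shell
# Hypothetical helper: reads `sshd -T` output on stdin and prints each
# setting that differs from the assumed hardened target. No output means
# the three checked settings match the target.
ssh_findings() {
  awk '
    $1 == "permitrootlogin"        && $2 != "no"  { print "permitrootlogin=" $2 }
    $1 == "passwordauthentication" && $2 != "no"  { print "passwordauthentication=" $2 }
    $1 == "pubkeyauthentication"   && $2 != "yes" { print "pubkeyauthentication=" $2 }
  '
}
```

Usage would be `sshd -T | ssh_findings`. This only reports drift; changing sshd settings should still follow the second-session and provider-console precautions described in the implementation order.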

Notes

  • The initial review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documented the risk and remediation order before any potentially disruptive security changes; the dated Phase 1 entries above record the Compose port-binding changes made afterward.
  • Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
  • The Caddyfile validates today; normalizing its formatting (for example with caddy fmt) can be handled in a separate low-risk docs/ops cleanup if desired.