37 KiB
ByteLyst VM Security Blind Spots Roadmap
Review date: 2026-05-27 Reviewer: Hermes Agent Scope: Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.
Executive Summary
The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.
The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on 0.0.0.0 / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.
Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
Implementation Readiness Assessment
Roadmap quality score: 86%
Implementation confidence before remediation starts: 74%
Why not higher yet: the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.
Confidence after Phase 0 is complete: expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.
Quality strengths:
- Evidence is concrete and command-derived rather than speculative.
- The highest-risk items are correctly prioritized as P0.
- The roadmap separates discovery from disruptive remediation.
- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.
Quality gaps to close before implementation:
- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
- Add an approved exposure inventory before changing Docker bindings or
DOCKER-USER. - Record a tested SSH rollback path and keep an active second session/provider console open before changing
sshd. - Define what is intentionally public, private, internal-only, or deprecated for each service.
- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.
Implementation Guardrails
These rules apply before any Phase 1 change:
- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
- Do not add broad
DROPrules before an allowlist is committed to the inventory. - Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
- Record exact rollback commands next to every change ticket.
- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.
Exposure Classification Model
Every listening port and Caddy hostname should be classified before changes:
| Class | Meaning | Expected Controls | Examples To Review |
|---|---|---|---|
public-caddy |
Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
public-direct |
Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
private-admin |
Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
loopback-only |
Host-local service used by Caddy or local automation | Bind 127.0.0.1:port, no external bind |
internal APIs behind Caddy |
docker-internal |
Container-to-container only | no host port mapping | databases, emulators, private workers |
retire |
Unused/deprecated | remove service/port, disable health checks and jobs | stale dashboards/services |
Minimum inventory fields:
- service/container name
- repo/Compose file
- host port and bind address
- container port
- Caddy hostname/path, if any
- intended audience
- authentication/control plane
- classification
- owner/approver
- rollback command
- post-change health check
Evidence Snapshot
Collected on 2026-05-27 from this VM.
Host and patching
- Host:
srv1491630 - OS: Ubuntu
25.10 - Kernel:
6.17.0-29-generic - Uptime: about 14 hours at review time
- Root filesystem: 193G total, 71G used, 123G available, 37% used
- Memory: 15Gi total, about 10Gi available
- Swap: 4.0G total, about 1.3G used
- Reboot required: no
- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for
libgcrypt20,libcaca0, andlibssh2-1t64 - Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent
Network and ingress
- UFW: active; default deny incoming; only
22/tcpallowed by UFW rules - Docker iptables rules are present and publish many ports despite UFW's simple rule list
- Public/listening TCP ports bound on all interfaces included:
22,80,443- app/frontend ports:
3000,3002,3003,3030,3035,3040,3049,3050,3055,3060,3070,3075,3085 - backend/API ports:
4004,4010,4011,4012,4013,4014,4015,4016,4017,4019,4020,4025 - infra/dev ports:
1025,1234,3100,3300,8025,8081,10000,11434
- Caddy source-of-truth config:
/opt/bytelyst/Caddyfile, mounted read-only into thecaddycontainer docker exec caddy caddy validate --config /etc/caddy/Caddyfile: valid config, formatting warning only- Caddy public hostnames include:
api.bytelyst.comgitea.bytelyst.comadmin.bytelyst.comdevops.bytelyst.comtracker.bytelyst.comllmlab.bytelyst.comollama.bytelyst.comtrading-api.bytelyst.cominvttrdg.bytelyst.comnotes.bytelyst.comclock.bytelyst.com
SSH and account surface
Effective sshd -T settings showed:
permitrootlogin yespasswordauthentication yespubkeyauthentication yeskbdinteractiveauthentication nomaxauthtries 6x11forwarding yesclientaliveinterval 0
fail2ban is active with one jail: sshd; no current bans at review time.
Docker runtime and containers
- Docker: client/server
29.4.2; newer Docker packages are available - Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces;
live_restore=false - Most product containers run with writable root filesystems and no explicit
userconfigured cadvisoris privilegedDOCKER-USERchain appears empty, so there is no central Docker firewall policy in front of published containers- Multiple containers are unhealthy:
learning_ai_common_plat-llmlab-dashboard-1learning_ai_common_plat-actiontrail-web-1learning_ai_common_plat-jarvisjr-web-1learning_ai_common_plat-localmemgpt-web-1learning_ai_common_plat-nomgap-web-1learning_ai_common_plat-flowmonk-web-1learning_ai_common_plat-mindlyst-web-1
Gitea and CI
- Gitea public route:
https://gitea.bytelyst.com - Local Gitea container port: host
3300-> container3000, bound on0.0.0.0and IPv6 gitea-act-runner.service: enabled but inactive/dead- Runner user exists:
gitea-runner, member ofdocker - Runner config directory permissions look reasonable:
/home/gitea-runner/act_runner:750, owned bygitea-runner:gitea-runner/home/gitea-runner/act_runner/config.yaml:600, owned bygitea-runner:gitea-runner
Backup and operations
systemctl --failedshowed failed unit:hermes-root-backup.service—Sync root Hermes persistent backup to GitHub
- Hermes cron backup is active and healthy:
- job
470832621b43,Sync Hermes persistent-data backup to GitHub, every 30 minutes, last runok
- job
- Existing VM maintenance cron entries exist for health check and cleanup under
/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/ - A root crontab entry still references
/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh, which may be stale after repo relocation/renaming
Blind Spots and Risk Register
P0 — Internet-exposed Docker ports bypass the intended ingress model
Risk: UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.
Examples observed: 3300, 8025, 1025, 1234, 8081, 10000, 11434, many 30xx web ports, and many 40xx backend ports.
Impact: Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.
Roadmap:
- Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
- For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
- Bind non-public Compose ports to
127.0.0.1or remove host port mapping entirely.- Internal emulator/mail/observability ports
1025,8025,10000,1234,8081, and3100are loopback-bound. - Common-platform direct app/API bypasses are loopback-bound or removed from host publishing.
- Notes, Clock, and InvtTrdg direct app/API bypasses are loopback-bound.
- Internal emulator/mail/observability ports
- Add a
DOCKER-USERchain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules. - Keep only
80/443and intentionally public SSH exposed at the provider/firewall layer. - Add a recurring check that compares
ss -ltnand Docker published ports against the approved inventory.
Acceptance criteria:
docs/vm-exposure-inventory.mdlists everyss -ltnplistener and every Docker published port.- Every non-SSH direct public bind has an approved classification.
- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in
DOCKER-USER. - External probe confirms non-approved ports are closed from the internet.
- Caddy-routed public hostnames still pass smoke checks.
Rollback: keep a saved copy of original Compose files and iptables-save output; rollback means restoring original port mappings or flushing only the newly added DOCKER-USER rules.
P0 — SSH permits root login and password authentication
Risk: PermitRootLogin yes and PasswordAuthentication yes keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
Roadmap:
- Confirm all required admin users have working SSH keys and sudo access.
- Add a non-root break-glass admin path if one does not exist.
- Change SSH effective config to:
PermitRootLogin prohibit-passwordornoPasswordAuthentication noX11Forwarding no- lower
MaxAuthTries, e.g.3 - set a sane
ClientAliveInterval/ClientAliveCountMax
- Validate with a second session before restarting SSH.
- Record rollback commands and keep console/provider access available during rollout.
Acceptance criteria:
- A non-root sudo admin user can log in with SSH key auth.
- Root password login no longer works.
- Existing automation using
scripts/VMs/HostingerVM/login.shstill works or is updated. sshd -Tconfirms the intended effective config.fail2ban-client status sshdstill reports an active jail.
Rollback: provider console or still-open root session can restore previous sshd_config drop-in and restart ssh.
P0 — Public/private boundary for dev and internal tooling is unclear
Risk: Caddy publishes ollama.bytelyst.com, llmlab.bytelyst.com, devops.bytelyst.com, admin.bytelyst.com, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
Roadmap:
- Document public hostnames, auth model, and data sensitivity.
- Require explicit approval before exposing new dashboards or model endpoints.
- Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
- Add security headers/auth checks to public UI health reviews.
- Confirm
ollama.bytelyst.comshould be publicly reachable at all; if not, move behind private network or auth gate.
Acceptance criteria:
ollama,llmlab,devops,admin,gitea, and observability-adjacent routes each have an owner-approved exposure class.- Public admin-like routes require authentication or an explicit documented exception.
- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.
P1 — Docker/container hardening is mostly default
Risk: Many containers run as default/root user, writable rootfs, broad capabilities by default, and rootful Docker. A compromised app gets more host-adjacent leverage than needed.
Roadmap:
- Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
- Start with public-facing/backend services and admin dashboards.
- Add
security_opt: ["no-new-privileges:true"]where compatible. - Add
cap_drop: ["ALL"]and selectively add back capabilities only when needed. - Convert app images to non-root users consistently.
- Use
read_only: trueplus explicit writable tmp/cache volumes where compatible. - Review
cadvisorprivileged mode and replace/restrict if possible. - Enable Docker
live-restoreif it fits maintenance operations.
Implementation note: do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with no-new-privileges, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
P1 — Unhealthy containers can normalize broken deployments
Risk: Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
Roadmap:
- Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
- Fix or remove bad healthchecks so Docker health state is trustworthy.
- Add alerting for sustained unhealthy containers.
- Make deployment scripts fail on unhealthy post-deploy state.
- Update dashboard/observability docs with current service ownership and expected state.
Acceptance criteria:
- Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
- Docker health state matches the product’s actual serving state.
- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
P1 — Gitea Actions runner is enabled but inactive
Risk: CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
Roadmap:
- Decide whether the runner should be active or intentionally disabled.
- If active: restart and verify
gitea-act-runner.service, runner labels, and Docker access. - Run and record a dedicated Gitea Actions smoke workflow result.
- If disabled: disable the service and document the intentional state.
- Keep runner secrets separate from smoke/test workflows.
- Add a runner-health check to VM observability.
Decision needed: runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.
P1 — Backup/restore evidence is split and one backup unit is failed
Risk: Hermes cron backup works, but hermes-root-backup.service is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
Roadmap:
- Inspect
hermes-root-backup.servicelogs and decide whether to fix, disable, or replace it with the cron-backed job. - Repair the root backup checkout divergence and verify a successful
hermes-root-backup.serviceone-shot run. - Update
/root/.hermes/scripts/sync_hermes_persistent_backup.pyso future generated-backup divergence preserves a safety branch and rejoins the remote backup stream instead of wedging ongit pull --ff-only. - Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
- Run a restore drill into a non-production path/profile.
- Verify no raw
.env, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed. - Add backup freshness and restore-drill status to the monthly VM review.
Acceptance criteria:
systemctl --failedno longer includes backup units unless the failure is intentionally documented.- Backup status shows source, destination, cadence, last success, and restore command.
- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
P1 — Patch management has pending security/runtime updates
Risk: Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
Roadmap:
- Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
- Define a Docker upgrade maintenance window with pre/post checks.
- Run
apt list --upgradableand capture package classes without dumping noise. - Verify apps after Docker/containerd upgrades.
Acceptance criteria:
- Security updates and Docker/runtime updates are tracked separately.
- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
- Reboot requirement is checked and scheduled rather than discovered accidentally.
P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
Risk: Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
Roadmap:
- Record current Ubuntu 25.10 support/EOL date in ops docs.
- Decide whether to stay on interim releases or migrate to an LTS baseline.
- Add an OS lifecycle check to quarterly review.
P2 — Repository/config secret hygiene needs a repeatable scanner
Risk: The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.
Roadmap:
- Add a documented secret-scan command using
gitleaksortrufflehogfor tracked files and selected untracked ops directories. - Scan historical directories such as
DELETED_bytelyst-devops-toolsseparately before archiving or deleting. - Add
.gitignorepatterns for generated scans, local account snapshots, and credential-shaped outputs. - Keep examples as
.examplefiles only.
P2 — Cron/systemd ownership and drift are not fully inventoried
Risk: Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.
Roadmap:
- Inventory root/user crontabs,
/etc/cron.d, systemd timers, Hermes cron, and Gitea Actions schedules. - Remove or update stale
/opt/bytelyst/bytelyst-devops-tools/...references after confirming replacements. - Add owner, purpose, expected output, and alert channel for every job.
- Add a stale-job detector for missing script paths and failed systemd units.
Acceptance criteria:
- No active cron/systemd job references a missing path.
- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
- Stale path detection runs in the monthly VM review.
P2 — Observability exists but needs security-focused SLOs
Risk: Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
Roadmap:
- Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
- Validate alert delivery to Telegram.
- Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.
Execution Plan
Phase 0 — Freeze and inventory before changes
- Freeze new public hostnames/ports until the exposure inventory is complete.
- Generate
docs/vm-exposure-inventory.mdfrom Docker, Caddy,ss, and DNS. - Mark each exposed service as
public,private,internal-only, orretire. - Review with S before changing public access for customer/user-facing apps.
Exit criteria: the inventory is reviewed and every P0 change has a rollback line and validation line.
Phase 1 — Immediate security hardening
- Close or loopback-bind non-public Docker host ports.
- Loopback-bound internal emulator/mail/observability ports
1025,8025,10000,1234,8081, and3100. - Closed/loopback-bound common-platform direct app/API bypasses.
- Loopback-bound Notes, Clock, and InvtTrdg direct app/API bypasses.
- Loopback-bound internal emulator/mail/observability ports
- Add
DOCKER-USERdefault-deny rules for non-approved ports. - Harden SSH root/password access after key-based access is verified.
- Put
ollama.bytelyst.com, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
Exit criteria: only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.
Phase 2 — Operational correctness
- Fix/retire unhealthy containers.
- Resolve
hermes-root-backup.servicefailed state. - Decide and document Gitea runner active/disabled state.
- Add missing-script checks. Stale root cron path was fixed on 2026-05-27.
- Apply pending security/runtime updates in a maintenance window.
Exit criteria: no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.
Phase 3 — Docker and app hardening
- Add non-root users,
no-new-privileges, cap drops, and read-only rootfs by service. - Add resource limits for noisy services and emulators.
- Move emulators/dev tools off public bindings.
- Review cAdvisor privilege and observability surface.
Phase 4 — Backup, restore, and incident readiness
- Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
- Perform restore drill to non-prod target.
- Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
- Add quarterly tabletop review.
Phase 5 — Continuous governance
- Monthly VM security review cron/checklist.
- Secret scan before DevOps repo pushes.
- OS lifecycle/EOL tracker.
- Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
Change Tickets With Quality Gates
Use this shape for each implementation PR/commit:
Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:
Minimum post-checks for Phase 1:
ss -ltnpdocker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'iptables -S DOCKER-USERdocker exec caddy caddy validate --config /etc/caddy/Caddyfile- public smoke checks for approved hostnames
- negative external probe for blocked ports
sshd -Tafter SSH changessystemctl --failed --no-pager
Implementation Log
2026-05-27 — Phase 2 backup and cron drift
Changed:
- Repointed the root Lucky25 monitor cron from
/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.shto/opt/bytelyst/learning_ai_devops_tools/scripts/monitor-lucky25-execution.sh. - Saved the pre-change root crontab at
/tmp/root-crontab-before-vm-security-20260527.txt. - Repaired
/root/repos/bytelyst_hostinger_hermes_vm, which wasahead 1, behind 11; the obsolete local generated backup commit conflicted with newer remote snapshots and was skipped after rebase preserved the current remote stream. - Patched
/root/.hermes/scripts/sync_hermes_persistent_backup.pyto replace unconditionalgit pull --ff-onlywith explicit fetch/merge-base handling. Diverged generated snapshots now create a safety branch before attempting rebase and fall back toorigin/<branch>if the generated files conflict. - Saved the pre-change backup script at
/tmp/sync_hermes_persistent_backup.py.before-vm-security-20260527.
Verified:
crontab -lnow points the Lucky25 monitor at the current repo script.python3 -m py_compile /tmp/sync_hermes_persistent_backup.pypassed before deployment.systemctl start hermes-root-backup.servicesucceeded twice after repair.systemctl status hermes-root-backup.service hermes-root-backup.timer --no-pagershowed the service exitedstatus=0/SUCCESSand the timer remains active./root/repos/bytelyst_hostinger_hermes_vmis aligned withorigin/mainafter successful backup commits415e824and369e584.
Residual risk:
- A restore drill is still required before the backup posture should be considered fully proven.
- The backup sync script is runtime-managed under
/root/.hermes/scripts/; add a tracked installer or source-of-truth copy so this hardening does not depend on manual VM state.
2026-05-27 — Phase 2 Gitea runner state
Changed:
- Started
gitea-act-runner.service; it was enabled but inactive. - Treated the intended state as active because the service unit is enabled, historical journal entries show successful task execution, and restart declared the runner successfully.
Verified:
systemctl is-active gitea-act-runner.servicereturnedactive.systemctl status gitea-act-runner.service --no-pagershowedbytelyst-host-runnerrunning asgitea-runner.- Runner labels declared successfully:
ubuntu-latest,linux,bytelyst,hostinger. - Runner config uses Docker executor images and
privileged: false; Docker socket access is granted through thedockergroup. - Runner immediately picked up task
42forbytelyst/bytelyst-devops-tools, proving it can talk to local Gitea.
Residual risk:
- Record a small dedicated smoke workflow that does not need production secrets, so runner health is proven by a controlled workflow rather than incidental queued work.
- Add runner health to VM observability so enabled-but-inactive drift is caught automatically.
2026-05-27 — Phase 2 stale automation detector
Changed:
- Extended
scripts/VMs/HostingerVM/vm-health-check.shwith anAUTOMATION DRIFTsection. - The daily health check now reports failed systemd units and root crontab script paths that no longer exist.
- Made optional
/var/log/vm-health-check.logwrites silent when the script runs in a restricted/read-only context.
Verified:
bash -n scripts/VMs/HostingerVM/vm-health-check.shpassed.- Restricted
--jsonrun stayed quiet on log-write failure and reported the new checks. - Host-permission
--jsonrun reportedfailed_units=OKandcron_missing_paths=OK.
Residual risk:
- The detector currently covers root crontab and failed systemd units. Full ownership inventory still needs
/etc/cron.d, user crontabs, Hermes cron, Gitea schedules, owners, outputs, and alert channels.
2026-05-27 — Phase 2 unhealthy containers
Changed:
- Added
HOSTNAME=0.0.0.0to six managed Next.js web services in/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml:jarvisjr-web,flowmonk-web,mindlyst-web,actiontrail-web,localmemgpt-web, andllmlab-dashboard. - Recreated those six services from existing images with
docker compose ... up -d --no-build. - Retired the orphan
learning_ai_common_plat-nomgap-web-1container. Current Compose already documentsnomgap-webas deployed to Vercel and not part of the Docker stack.
Verified:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quietpassed.- The six recreated web containers report Docker health
healthy. docker ps --filter health=unhealthyreturns no containers.- Host-level smoke checks returned HTTP
200for3035,3040,3050,3060,3070, and3075; retired orphan port3055is closed. - Host-permission
vm-health-check.sh --jsonreportscontainer_health=OK,container_loops=OK,failed_units=OK, andcron_missing_paths=OK.
Committed/pushed:
learning_ai_common_plat:af035e7d(fix: bind ecosystem Next apps on all interfaces) pushed to GitHub.
Residual risk:
- Local Gitea mirror push for
learning_ai_common_platfailed at Git HTTP transport even though fetch and health checks work; retry/fix mirror push separately. - This fixed health state, not public exposure. Several direct published ports remain to be loopback-bound or blocked in Phase 1.
2026-05-27 — Phase 1 internal port loopback
Changed:
- Updated
/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.ymlsocosmos-emulator,azurite,mailpit, andlokipublish host ports only on127.0.0.1. - Recreated only
cosmos-emulator,azurite,mailpit, andlokiwithdocker compose ... up -d --no-build.
Verified:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quietpassed.- Docker reports the target services healthy.
ss -ltnpshows1025,8025,10000,1234,8081, and3100listening on127.0.0.1only, with no0.0.0.0or IPv6 wildcard bind for that group.- Local smoke checks returned HTTP
200for Mailpit UI, Loki readiness, and Cosmos explorer. Azurite returned HTTP400on the raw blob endpoint while its container healthcheck remained healthy, which is expected for an unauthenticated root request.
Committed/pushed:
learning_ai_common_plat:1c09e479(fix: bind internal infra ports to loopback) pushed to GitHub.
Residual risk:
- Public direct bypass remains for app/API ports, Gitea direct port
3300, devops/admin surfaces, and Ollama11434. - Add a
DOCKER-USERfallback policy after the remaining allowlist is reviewed.
2026-05-27 — Phase 1 common-platform app/API bypasses
Changed:
- Updated
/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.ymlso remaining published common-platform web/dashboard ports bind to127.0.0.1. - Recreated the common-platform web/dashboard services that previously published on
0.0.0.0:tracker-web,lysnrai-dashboard,jarvisjr-web,flowmonk-web,mindlyst-web,actiontrail-web,localmemgpt-web, andllmlab-dashboard. - Recreated stale common-platform backend containers
peakpulse-backend,lysnrai-backend, andnomgap-backend; their current Compose definitions do not publish host ports, so the old direct4010,4015, and4013mappings were removed.
Verified:
docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quietpassed.docker ps --filter name=learning_ai_common_plat ... | grep 0.0.0.0returned no common-platform wildcard-published containers.docker ps --filter health=unhealthyreturned no unhealthy containers.ss -ltnpshows3002,3003,3035,3040,3050,3060,3070, and3075bound to127.0.0.1.- Host smoke checks returned HTTP
200for3002,3003,3035,3040,3050,3060,3070, and3075.
Committed/pushed:
learning_ai_common_plat:e29cc58a(fix: bind app host ports to loopback) pushed to GitHub.
Remaining wildcard Docker publishes after this checkpoint:
- Caddy public ingress:
80,443. - Local Gitea direct port:
3300. - DevOps dashboard/API:
3049,4004. - Host Ollama still listens on wildcard
11434.
2026-05-27 — Phase 1 product repo app/API bypasses
Changed:
- Updated
/opt/bytelyst/learning_ai_notes/docker-compose.ymlanddocker-compose.override.ymlso NoteLett backend/web bind to127.0.0.1. - Updated
/root/bytelyst.ai/repos/learning_ai_clock/docker-compose.ymlso ChronoMind backend/web bind to127.0.0.1; also addedHOSTNAME=0.0.0.0so the Next.js healthcheck works inside the container. - Updated
/opt/bytelyst/learning_ai_invt_trdg/docker-compose.ymlso InvtTrdg backend/web bind to127.0.0.1. - Recreated the affected services without rebuilding images.
Verified:
- Notes:
3000and4016listen on127.0.0.1; local web/backend smoke checks returned HTTP200. - Clock:
3030and4011listen on127.0.0.1; local web/backend smoke checks returned HTTP200; containers are healthy. - InvtTrdg:
3085and4025listen on127.0.0.1; local web/backend smoke checks returned HTTP200. docker ps --format ... | grep 0.0.0.0now shows only Caddy80/443, Gitea3300, and DevOps3049/4004as Docker wildcard publishes.docker ps --filter health=unhealthyreturned no unhealthy containers.
Committed/pushed:
learning_ai_notes:3683ba9(fix: bind Notes host ports to loopback) pushed to GitHub.learning_ai_clock:ee572f8(fix: bind Clock host ports to loopback) pushed to GitHub.learning_ai_invt_trdg:39490bc(fix: bind InvtTrdg host ports to loopback) pushed to GitHub.
Remaining wildcard direct exposure after this checkpoint:
- Expected public ingress:
22,80,443. - Docker wildcard publishes still to fix: Gitea direct port
3300, DevOps dashboard/API3049and4004. - Host process still to fix: Ollama
11434.
Do Not Start With
- Rootless Docker migration.
- Broad
iptablesdefault-drop without an allowlist. - Mass Compose rewrites across all products.
- SSH password/root lockout before key-based sudo and rollback are proven.
- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
- Publishing secret-scan output that contains secrets.
Suggested First Tickets
- P0: Build and review exposure inventory — produce exact approved/blocked list for all currently bound ports.
- P0: Lock Docker-published non-public ports — bind to loopback/internal or enforce
DOCKER-USERdrops. - P0: Harden SSH — disable password/root login after confirming key-based admin access.
- P1: Triage unhealthy containers — fix healthchecks/apps or retire dead services.
- P1: Resolve failed Hermes backup unit — fix or disable duplicate failed unit; keep cron backup healthy.
- P1: Decide Gitea runner state — active smoke-tested runner or documented disabled service.
- P2: Add secret scanner and stale-job scanner — prevent silent credential and automation drift.
Recommended first implementation order:
- Generate and review
docs/vm-exposure-inventory.md. - Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
- Harden SSH with second-session/provider-console safety.
- Move obvious internal-only Docker ports to loopback/internal bindings.
- Add
DOCKER-USERguardrails after the allowlist is proven.
This order improves safety without letting the port exposure issue linger too long.
Verification Commands for Future Runs
# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required
# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup
# SSH effective config
sshd -T | egrep '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
fail2ban-client status sshd
# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'
# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile
# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
Notes
- This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.