bytelyst-devops-tools/docs/vm-security-blind-spots-roadmap.md

# ByteLyst VM Security Blind Spots Roadmap

**Review date:** 2026-05-27
**Reviewer:** Hermes Agent
**Scope:** Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.

## Executive Summary

The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.

The biggest blind spot is that the apparent firewall posture is misleading: UFW allows only SSH, but Docker-published ports are handled by Docker's own iptables rules (NAT and forwarding), which UFW's rule list does not cover. As a result, many application, database/emulator, observability, registry, and development ports can be reachable on `0.0.0.0` and IPv6. Several of those services should be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.

Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.

## Implementation Readiness Assessment

**Roadmap quality score:** 86%

**Implementation confidence before remediation starts:** 74%

**Why not higher yet:** the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.

**Confidence after Phase 0 is complete:** expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.

**Quality strengths:**

- Evidence is concrete and command-derived rather than speculative.
- The highest-risk items are correctly prioritized as P0.
- The roadmap separates discovery from disruptive remediation.
- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.

**Quality gaps to close before implementation:**

- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
- Add an approved exposure inventory before changing Docker bindings or `DOCKER-USER`.
- Record a tested SSH rollback path and keep an active second session/provider console open before changing `sshd`.
- Define what is intentionally public, private, internal-only, or deprecated for each service.
- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.

## Implementation Guardrails

These rules apply before any Phase 1 change:

- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
- Do not add broad `DROP` rules before an allowlist is committed to the inventory.
- Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
- Record exact rollback commands next to every change ticket.
- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.

## Exposure Classification Model

Every listening port and Caddy hostname should be classified before changes:

| Class | Meaning | Expected Controls | Examples To Review |
| --- | --- | --- | --- |
| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
| `docker-internal` | Container-to-container only | no host port mapping | databases, emulators, private workers |
| `retire` | Unused/deprecated | remove service/port, disable health checks and jobs | stale dashboards/services |

Minimum inventory fields:

- service/container name
- repo/Compose file
- host port and bind address
- container port
- Caddy hostname/path, if any
- intended audience
- authentication/control plane
- classification
- owner/approver
- rollback command
- post-change health check
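
As a starting point for the `host port and bind address` field, the sketch below turns `ss -ltnH` output into tab-separated rows. The function name `listeners_to_rows` and the scope labels are illustrative assumptions, not part of any existing ByteLyst script; the remaining fields (owner, classification, rollback) still have to be filled in by hand.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert `ss -ltnH` lines into "port<TAB>bind<TAB>scope" rows.
# Expects TCP-only output (no Netid column), i.e. `ss -ltnH`, not `ss -ltnup`.
listeners_to_rows() {
  awk '
    $1 == "LISTEN" {
      addr = $4                          # local address:port column
      n = match(addr, /:[0-9]+$/)        # locate the trailing ":port"
      if (n == 0) next
      port = substr(addr, n + 1)
      host = substr(addr, 1, n - 1)
      # 127.0.0.0/8 and [::1] are loopback; everything else needs a decision.
      scope = (host ~ /^127\./ || host == "[::1]") ? "loopback" : "non-loopback"
      print port "\t" host "\t" scope
    }
  ' | sort -u
}

# Usage on the VM:
#   ss -ltnH | listeners_to_rows
```

Rows marked `non-loopback` are the ones that need an explicit class from the table above; `loopback` rows only need confirming that nothing else publishes the same service.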

## Evidence Snapshot

Collected on 2026-05-27 from this VM.

### Host and patching

- Host: `srv1491630`
- OS: Ubuntu `25.10`
- Kernel: `6.17.0-29-generic`
- Uptime: about 14 hours at review time
- Root filesystem: 193G total, 71G used, 123G available, 37% used
- Memory: 15Gi total, about 10Gi available
- Swap: 4.0G total, about 1.3G used
- Reboot required: no
- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for `libgcrypt20`, `libcaca0`, and `libssh2-1t64`
- Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent

### Network and ingress

- UFW: active; default deny incoming; only `22/tcp` allowed by UFW rules
- Docker iptables rules are present and publish many ports despite UFW's simple rule list
- Public/listening TCP ports bound on all interfaces included:
  - `22`, `80`, `443`
  - app/frontend ports: `3000`, `3002`, `3003`, `3030`, `3035`, `3040`, `3049`, `3050`, `3055`, `3060`, `3070`, `3075`, `3085`
  - backend/API ports: `4004`, `4010`, `4011`, `4012`, `4013`, `4014`, `4015`, `4016`, `4017`, `4019`, `4020`, `4025`
  - infra/dev ports: `1025`, `1234`, `3100`, `3300`, `8025`, `8081`, `10000`, `11434`
- Caddy source-of-truth config: `/opt/bytelyst/Caddyfile`, mounted read-only into the `caddy` container
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`: valid config, formatting warning only
- Caddy public hostnames include:
  - `api.bytelyst.com`
  - `gitea.bytelyst.com`
  - `admin.bytelyst.com`
  - `devops.bytelyst.com`
  - `tracker.bytelyst.com`
  - `llmlab.bytelyst.com`
  - `ollama.bytelyst.com`
  - `trading-api.bytelyst.com`
  - `invttrdg.bytelyst.com`
  - `notes.bytelyst.com`
  - `clock.bytelyst.com`

### SSH and account surface

Effective `sshd -T` settings showed:

- `permitrootlogin yes`
- `passwordauthentication yes`
- `pubkeyauthentication yes`
- `kbdinteractiveauthentication no`
- `maxauthtries 6`
- `x11forwarding yes`
- `clientaliveinterval 0`

`fail2ban` is active with one jail: `sshd`; no current bans at review time.

### Docker runtime and containers

- Docker: client/server `29.4.2`; newer Docker packages are available
- Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; `live_restore=false`
- Most product containers run with writable root filesystems and no explicit `user` configured
- `cadvisor` is privileged
- `DOCKER-USER` chain appears empty, so there is no central Docker firewall policy in front of published containers
- Multiple containers are unhealthy:
  - `learning_ai_common_plat-llmlab-dashboard-1`
  - `learning_ai_common_plat-actiontrail-web-1`
  - `learning_ai_common_plat-jarvisjr-web-1`
  - `learning_ai_common_plat-localmemgpt-web-1`
  - `learning_ai_common_plat-nomgap-web-1`
  - `learning_ai_common_plat-flowmonk-web-1`
  - `learning_ai_common_plat-mindlyst-web-1`

### Gitea and CI

- Gitea public route: `https://gitea.bytelyst.com`
- Local Gitea container port: host `3300` -> container `3000`, bound on `0.0.0.0` and IPv6
- `gitea-act-runner.service`: enabled but inactive/dead
- Runner user exists: `gitea-runner`, member of `docker`
- Runner config directory permissions look reasonable:
  - `/home/gitea-runner/act_runner`: `750`, owned by `gitea-runner:gitea-runner`
  - `/home/gitea-runner/act_runner/config.yaml`: `600`, owned by `gitea-runner:gitea-runner`

### Backup and operations

- `systemctl --failed` showed failed unit:
  - `hermes-root-backup.service` — `Sync root Hermes persistent backup to GitHub`
- Hermes cron backup is active and healthy:
  - job `470832621b43`, `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, last run `ok`
- Existing VM maintenance cron entries exist for health check and cleanup under `/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/`
- A root crontab entry still references `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh`, which may be stale after repo relocation/renaming

## Blind Spots and Risk Register

### P0 — Internet-exposed Docker ports bypass the intended ingress model

**Risk:** UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.

**Examples observed:** `3300`, `8025`, `1025`, `1234`, `8081`, `10000`, `11434`, many `30xx` web ports, and many `40xx` backend ports.

**Impact:** Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.

**Roadmap:**

- [ ] Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
- [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
- [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
- [ ] Add a `DOCKER-USER` chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
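
The recurring check in the last item could look roughly like this. It is a minimal sketch: the allowlist file name and format (one port per line, optional trailing comment) are assumptions, and the function only reports differences, it does not change anything.

```bash
#!/usr/bin/env bash
# Hypothetical drift check: print observed ports that are not in the allowlist.
# Allowlist format (assumed): one port per line, optional trailing "# comment".
unapproved_ports() {
  local approved_file="$1"
  # Keep only the leading port number of each allowlist line, then drop
  # every observed port (read from stdin) that matches one of them exactly.
  # grep exits 1 when nothing is left to print, which is the good case here.
  grep -vxFf <(grep -Eo '^[0-9]+' "$approved_file") || true
}

# Usage on the VM, assuming an allowlist at ./approved-ports.txt:
#   ss -ltnH | awk '{sub(/.*:/, "", $4); print $4}' | sort -un \
#     | unapproved_ports ./approved-ports.txt
```

Any output means a listener exists that the inventory does not know about. Loopback-only listeners are included as well, so the allowlist needs to cover them or the input should be filtered first.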

**Acceptance criteria:**

- `docs/vm-exposure-inventory.md` lists every `ss -ltnp` listener and every Docker published port.
- Every non-SSH direct public bind has an approved classification.
- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in `DOCKER-USER`.
- External probe confirms non-approved ports are closed from the internet.
- Caddy-routed public hostnames still pass smoke checks.

**Rollback:** keep a saved copy of the original Compose files and `iptables-save` output; rollback means restoring the original port mappings or deleting only the newly added `DOCKER-USER` rules, not flushing the whole chain.
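
To keep `DOCKER-USER` changes reviewable and reversible, one option is to generate the rules as text first. The sketch below only prints commands and applies nothing. The interface name, the rule tag, and both function names are assumptions. Because `DOCKER-USER` is evaluated after DNAT, the rules match the originally published port through the conntrack match rather than `--dport`.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run generator: prints iptables commands, applies nothing.
# Rules carry a comment tag so they can be identified and removed later.
RULE_TAG="bytelyst-exposure"   # assumed tag name

docker_user_drop_rules() {
  local ext_if="$1" port
  shift
  for port in "$@"; do
    # Match the original (published) destination port via conntrack,
    # and only on the external interface, so container-to-container
    # traffic is left alone.
    printf 'iptables -I DOCKER-USER -i %s -p tcp -m conntrack --ctorigdstport %s --ctdir ORIGINAL -m comment --comment %s -j DROP\n' \
      "$ext_if" "$port" "$RULE_TAG"
  done
}

docker_user_rollback_rules() {
  # Same rule specs with -D (delete) instead of -I (insert).
  docker_user_drop_rules "$@" | sed 's/^iptables -I /iptables -D /'
}

# Review the output before running any of it, for example:
#   docker_user_drop_rules eth0 3300 8025 11434
```

Saving the output of both functions into the change ticket gives the exact apply and rollback commands side by side. IPv6 publishing would need equivalent `ip6tables` rules, which this sketch does not generate.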

### P0 — SSH permits root login and password authentication

**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.

**Roadmap:**

- [ ] Confirm all required admin users have working SSH keys and sudo access.
- [ ] Add a non-root break-glass admin path if one does not exist.
- [ ] Change SSH effective config to:
  - [ ] `PermitRootLogin prohibit-password` or `no`
  - [ ] `PasswordAuthentication no`
  - [ ] `X11Forwarding no`
  - [ ] lower `MaxAuthTries`, e.g. `3`
  - [ ] set a sane `ClientAliveInterval` / `ClientAliveCountMax`
- [ ] Validate with a second session before restarting SSH.
- [ ] Record rollback commands and keep console/provider access available during rollout.
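
A drop-in file keeps the SSH change small and easy to revert. The sketch below writes the settings listed above to a path passed in by the caller. The file name, the `ClientAliveInterval`/`ClientAliveCountMax` values, and the function name are assumptions, and it presumes the stock Ubuntu layout in which `sshd_config` includes `/etc/ssh/sshd_config.d/*.conf`; `sshd -T` remains the authority on what is actually effective.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write the proposed sshd settings to a drop-in file.
# It only writes the file; validation and restart stay manual on purpose.
write_sshd_hardening() {
  local target="$1"
  cat > "$target" <<'EOF'
# Proposed hardening from the VM security roadmap (values are a starting point)
PermitRootLogin prohibit-password
PasswordAuthentication no
X11Forwarding no
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
EOF
}

# Suggested manual sequence on the VM, with a second session already open:
#   write_sshd_hardening /etc/ssh/sshd_config.d/10-bytelyst-hardening.conf
#   sshd -t                       # syntax check; must exit 0
#   sshd -T | grep -E '^(permitrootlogin|passwordauthentication|maxauthtries)'
#   systemctl restart ssh         # only after both checks pass
# Rollback: remove the drop-in file and restart ssh again.
```

sshd uses the first value it reads for each keyword, so a low-numbered file name is chosen here to sort ahead of other drop-ins that might re-enable password authentication.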

**Acceptance criteria:**

- A non-root sudo admin user can log in with SSH key auth.
- Root password login no longer works.
- Existing automation using `scripts/VMs/HostingerVM/login.sh` still works or is updated.
- `sshd -T` confirms the intended effective config.
- `fail2ban-client status sshd` still reports an active jail.

**Rollback:** provider console or still-open root session can restore previous `sshd_config` drop-in and restart `ssh`.

### P0 — Public/private boundary for dev and internal tooling is unclear

**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some of these may be intentionally public, but no explicit auth/access decision is recorded for any of them.

**Roadmap:**

- [ ] Document public hostnames, auth model, and data sensitivity.
- [ ] Require explicit approval before exposing new dashboards or model endpoints.
- [ ] Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
- [ ] Add security headers/auth checks to public UI health reviews.
- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.

**Acceptance criteria:**

- `ollama`, `llmlab`, `devops`, `admin`, `gitea`, and observability-adjacent routes each have an owner-approved exposure class.
- Public admin-like routes require authentication or an explicit documented exception.
- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.

### P1 — Docker/container hardening is mostly default

**Risk:** Many containers run as the default (root) user with a writable root filesystem and Docker's default capability set, under a rootful Docker daemon. A compromised app gets more host-adjacent leverage than needed.

**Roadmap:**

- [ ] Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
- [ ] Start with public-facing/backend services and admin dashboards.
- [ ] Add `security_opt: ["no-new-privileges:true"]` where compatible.
- [ ] Add `cap_drop: ["ALL"]` and selectively add back capabilities only when needed.
- [ ] Convert app images to non-root users consistently.
- [ ] Use `read_only: true` plus explicit writable tmp/cache volumes where compatible.
- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
- [ ] Enable Docker `live-restore` if it fits maintenance operations.

**Implementation note:** do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with `no-new-privileges`, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
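
To seed the hardening matrix, the output of the `docker inspect` one-liner from the verification commands can be reduced to a list of gaps per container. This is a rough sketch: the function name and gap labels are made up for illustration, and an empty `user=` is treated as root, which is the usual meaning but worth confirming per image.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read lines shaped like
#   /name user=<u> privileged=<bool> readonly=<bool> ports=<json>
# and print the hardening gaps found for each container.
flag_hardening_gaps() {
  awk '
    {
      name = $1; gaps = ""
      for (i = 2; i <= NF; i++) {
        # Empty user normally means the container runs as root.
        if ($i == "user=" || $i == "user=root" || $i == "user=0") gaps = gaps " runs-as-root"
        if ($i == "privileged=true") gaps = gaps " privileged"
        if ($i == "readonly=false")  gaps = gaps " writable-rootfs"
      }
      if (gaps != "") print name ":" gaps
    }
  '
}

# Usage on the VM:
#   docker ps -q | xargs -r docker inspect --format \
#     '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}}' \
#     | flag_hardening_gaps
```

The result is a triage list, not a verdict: `writable-rootfs` will show up for nearly every container today, which is why the note above recommends starting with `no-new-privileges` and non-root users.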

### P1 — Unhealthy containers can normalize broken deployments

**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.

**Roadmap:**

- [ ] Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
- [ ] Fix or remove bad healthchecks so Docker health state is trustworthy.
- [ ] Add alerting for sustained unhealthy containers.
- [ ] Make deployment scripts fail on unhealthy post-deploy state.
- [ ] Update dashboard/observability docs with current service ownership and expected state.

**Acceptance criteria:**

- Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
- Docker health state matches the product’s actual serving state.
- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
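
A post-deploy gate with a grace period can be sketched as follows. The function names and default timings are assumptions; the only Docker call is `docker inspect` with a format template, isolated in `container_health` so the polling logic can be exercised without Docker.

```bash
#!/usr/bin/env bash
# Hypothetical post-deploy gate: fail if a container is not healthy
# within a grace period.

# Prints healthy/unhealthy/starting, or "none" when no healthcheck is defined.
container_health() {
  docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$1"
}

# wait_healthy NAME [GRACE_SECONDS] [POLL_INTERVAL_SECONDS]
wait_healthy() {
  local name="$1" grace="${2:-120}" interval="${3:-5}" waited=0 status
  while :; do
    status="$(container_health "$name" 2>/dev/null || echo missing)"
    [ "$status" = "healthy" ] && return 0
    if [ "$waited" -ge "$grace" ]; then
      echo "$name: still '$status' after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage in a deploy script (container name is an example):
#   wait_healthy learning_ai_common_plat-nomgap-web-1 180 10 || exit 1
```

Containers without a healthcheck report `none` and therefore fail the gate, which surfaces missing healthchecks instead of silently passing them.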

### P1 — Gitea Actions runner is enabled but inactive

**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.

**Roadmap:**

- [x] Decide whether the runner should be active or intentionally disabled.
- [x] If active: restart and verify `gitea-act-runner.service`, runner labels, and Docker access.
- [ ] Run and record a dedicated Gitea Actions smoke workflow result.
- [ ] If disabled: disable the service and document the intentional state.
- [ ] Keep runner secrets separate from smoke/test workflows.
- [ ] Add a runner-health check to VM observability.

**Decision needed:** runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.

### P1 — Backup/restore evidence is split and one backup unit is failed

**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.

**Roadmap:**

- [x] Inspect `hermes-root-backup.service` logs and decide whether to fix, disable, or replace it with the cron-backed job.
- [x] Repair the root backup checkout divergence and verify a successful `hermes-root-backup.service` one-shot run.
- [x] Update `/root/.hermes/scripts/sync_hermes_persistent_backup.py` so future generated-backup divergence preserves a safety branch and rejoins the remote backup stream instead of wedging on `git pull --ff-only`.
- [ ] Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
- [ ] Run a restore drill into a non-production path/profile.
- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
- [ ] Add backup freshness and restore-drill status to the monthly VM review.

**Acceptance criteria:**

- `systemctl --failed` no longer includes backup units unless the failure is intentionally documented.
- Backup status shows source, destination, cadence, last success, and restore command.
- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.

### P1 — Patch management has pending security/runtime updates

**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.

**Roadmap:**

- [ ] Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
- [ ] Define a Docker upgrade maintenance window with pre/post checks.
- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
- [ ] Verify apps after Docker/containerd upgrades.

**Acceptance criteria:**

- Security updates and Docker/runtime updates are tracked separately.
- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
- Reboot requirement is checked and scheduled rather than discovered accidentally.
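
Tracking the two update classes separately can start from `apt list --upgradable` output. The sketch below is illustrative only: the function name, the class labels, and the package-name pattern used for "Docker/runtime" are assumptions to adjust.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify `apt list --upgradable` lines into
# docker-runtime / security / other, printed as "class<TAB>package".
classify_upgrades() {
  awk -F'[/ ]' '
    NF < 2 || /^Listing/ { next }          # skip the header and blank lines
    {
      pkg = $1; suite = $2                 # "pkg/suite[,suite...] version arch ..."
      if (pkg ~ /^(docker-|containerd)/)   class = "docker-runtime"
      else if (suite ~ /-security/)        class = "security"
      else                                 class = "other"
      print class "\t" pkg
    }
  ' | sort
}

# Usage on the VM (stderr dropped to hide apt's CLI-stability warning):
#   apt list --upgradable 2>/dev/null | classify_upgrades
```

Docker packages are classified first on purpose, so a Docker update is always routed to the maintenance-window path even when it also arrives through a security pocket.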

### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking

**Risk:** Ubuntu interim (non-LTS) releases such as 25.10 are supported for only nine months. If this VM is long-lived production infrastructure, lifecycle tracking matters.

**Roadmap:**

- [ ] Record current Ubuntu 25.10 support/EOL date in ops docs.
- [ ] Decide whether to stay on interim releases or migrate to an LTS baseline.
- [ ] Add an OS lifecycle check to quarterly review.

### P2 — Repository/config secret hygiene needs a repeatable scanner

**Risk:** The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.

**Roadmap:**

- [ ] Add a documented secret-scan command using `gitleaks` or `trufflehog` for tracked files and selected untracked ops directories.
- [ ] Scan historical directories such as `DELETED_bytelyst-devops-tools` separately before archiving or deleting.
- [ ] Add `.gitignore` patterns for generated scans, local account snapshots, and credential-shaped outputs.
- [ ] Keep examples as `.example` files only.

### P2 — Cron/systemd ownership and drift are not fully inventoried

**Risk:** Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.

**Roadmap:**

- [ ] Inventory root/user crontabs, `/etc/cron.d`, systemd timers, Hermes cron, and Gitea Actions schedules.
- [x] Remove or update stale `/opt/bytelyst/bytelyst-devops-tools/...` references after confirming replacements.
- [ ] Add owner, purpose, expected output, and alert channel for every job.
- [ ] Add a stale-job detector for missing script paths and failed systemd units.

**Acceptance criteria:**

- No active cron/systemd job references a missing path.
- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
- Stale path detection runs in the monthly VM review.
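
The missing-path half of the stale-job detector can be approximated with a small filter over crontab text. This is a sketch under simple assumptions: it treats any whitespace-separated token starting with `/` as a path, ignores everything after the first output redirection, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical stale-job detector: read crontab text on stdin and print
# absolute paths referenced by jobs that do not exist on this host.
missing_cron_paths() {
  # 1) drop comment lines, 2) cut everything from the first redirection
  #    (e.g. " >> /var/log/x.log 2>&1") so log targets are not checked,
  # 3) extract tokens that start with "/" after whitespace or line start.
  sed -e '/^[[:space:]]*#/d' -e 's/[[:space:]][0-9]*>.*$//' \
    | grep -Eo '(^|[[:space:]])/[^[:space:]]+' \
    | tr -d ' \t' \
    | sort -u \
    | while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
      done
}

# Usage on the VM:
#   crontab -l | missing_cron_paths
#   cat /etc/cron.d/* | missing_cron_paths
```

Schedule fields such as `*/5` and assignments such as `PATH=/usr/bin` are not matched because the slash does not follow whitespace. Failed systemd units still need the separate `systemctl --failed` check.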

### P2 — Observability exists but needs security-focused SLOs

**Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.

**Roadmap:**

- [ ] Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
- [ ] Validate alert delivery to Telegram.
- [ ] Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.

## Execution Plan

### Phase 0 — Freeze and inventory before changes

- [ ] Freeze new public hostnames/ports until the exposure inventory is complete.
- [x] Generate `docs/vm-exposure-inventory.md` from Docker, Caddy, `ss`, and DNS.
- [ ] Mark each exposed service with a class from the Exposure Classification Model: `public-caddy`, `public-direct`, `private-admin`, `loopback-only`, `docker-internal`, or `retire`.
- [ ] Review with S before changing public access for customer/user-facing apps.

**Exit criteria:** the inventory is reviewed and every P0 change has a rollback line and validation line.

### Phase 1 — Immediate security hardening

- [ ] Close or loopback-bind non-public Docker host ports.
- [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
- [ ] Harden SSH root/password access after key-based access is verified.
- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.

**Exit criteria:** only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.

### Phase 2 — Operational correctness

- [ ] Fix/retire unhealthy containers.
- [x] Resolve `hermes-root-backup.service` failed state.
- [x] Decide and document Gitea runner active/disabled state.
- [ ] Add missing-script checks. Stale root cron path was fixed on 2026-05-27.
- [ ] Apply pending security/runtime updates in a maintenance window.

**Exit criteria:** no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.

### Phase 3 — Docker and app hardening

- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
- [ ] Add resource limits for noisy services and emulators.
- [ ] Move emulators/dev tools off public bindings.
- [ ] Review cAdvisor privilege and observability surface.

### Phase 4 — Backup, restore, and incident readiness

- [ ] Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
- [ ] Perform restore drill to non-prod target.
- [ ] Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
- [ ] Add quarterly tabletop review.

### Phase 5 — Continuous governance

- [ ] Monthly VM security review cron/checklist.
- [ ] Secret scan before DevOps repo pushes.
- [ ] OS lifecycle/EOL tracker.
- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.

## Change Tickets With Quality Gates

Use this shape for each implementation PR/commit:

```text
Ticket:
Risk:
Files/services changed:
Pre-checks:
Change:
Rollback:
Post-checks:
Residual risk:
```

Minimum post-checks for Phase 1:

- `ss -ltnp`
- `docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'`
- `iptables -S DOCKER-USER`
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`
- public smoke checks for approved hostnames
- negative external probe for blocked ports
- `sshd -T` after SSH changes
- `systemctl --failed --no-pager`
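
The negative external probe can be done without extra tooling by using bash's `/dev/tcp` redirection. The sketch below must be run from a machine outside the VM to be meaningful; the function names and the 3-second timeout are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical external probe: report which TCP ports accept a connection.

# Returns 0 if a TCP connection to host:port succeeds within 3 seconds.
port_is_open() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# report_open_ports HOST PORT...
# Prints one line per port and returns 1 if any port was open.
report_open_ports() {
  local host="$1" port rc=0
  shift
  for port in "$@"; do
    if port_is_open "$host" "$port"; then
      echo "OPEN   $host:$port"
      rc=1
    else
      echo "closed $host:$port"
    fi
  done
  return "$rc"
}

# Usage from an external host (host name is a placeholder):
#   report_open_ports vm.example.com 3300 8025 1025 11434
```

Both refused and silently dropped connections count as closed here; a dropped port simply takes the full timeout to report.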

## Implementation Log

### 2026-05-27 — Phase 2 backup and cron drift

**Changed:**

- Repointed the root Lucky25 monitor cron from `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh` to `/opt/bytelyst/learning_ai_devops_tools/scripts/monitor-lucky25-execution.sh`.
- Saved the pre-change root crontab at `/tmp/root-crontab-before-vm-security-20260527.txt`.
- Repaired `/root/repos/bytelyst_hostinger_hermes_vm`, which was `ahead 1, behind 11`; the obsolete local generated backup commit conflicted with newer remote snapshots and was skipped during the rebase so that the current remote stream was preserved.
- Patched `/root/.hermes/scripts/sync_hermes_persistent_backup.py` to replace unconditional `git pull --ff-only` with explicit fetch/merge-base handling. Diverged generated snapshots now create a safety branch before attempting rebase and fall back to `origin/<branch>` if the generated files conflict.
- Saved the pre-change backup script at `/tmp/sync_hermes_persistent_backup.py.before-vm-security-20260527`.
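
The divergence handling described above boils down to a decision on the ahead/behind counts. The following is a shell illustration of that decision only, not the patched Python script; the function name and action labels are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical decision helper: map ahead/behind counts to a sync action.
# Counts can be obtained with:
#   git rev-list --left-right --count HEAD...origin/main
# which prints "<ahead><TAB><behind>" (for example "1	11").
sync_action() {
  local ahead="$1" behind="$2"
  if   [ "$ahead" -eq 0 ] && [ "$behind" -eq 0 ]; then echo "in-sync"
  elif [ "$ahead" -eq 0 ]; then echo "fast-forward"   # plain pull --ff-only is enough
  elif [ "$behind" -eq 0 ]; then echo "push"          # local commits only
  else echo "diverged"                                # safety branch first, then rebase
  fi
}

# Usage inside a repo after `git fetch origin`:
#   read -r ahead behind < <(git rev-list --left-right --count HEAD...origin/main)
#   sync_action "$ahead" "$behind"
```

The `ahead 1, behind 11` state recorded above maps to `diverged`, the case in which an unconditional `git pull --ff-only` refuses to proceed.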

**Verified:**

- `crontab -l` now points the Lucky25 monitor at the current repo script.
- `python3 -m py_compile /tmp/sync_hermes_persistent_backup.py` passed before deployment.
- `systemctl start hermes-root-backup.service` succeeded twice after repair.
- `systemctl status hermes-root-backup.service hermes-root-backup.timer --no-pager` showed the service exited `status=0/SUCCESS` and the timer remains active.
- `/root/repos/bytelyst_hostinger_hermes_vm` is aligned with `origin/main` after successful backup commits `415e824` and `369e584`.

**Residual risk:**

- A restore drill is still required before the backup posture should be considered fully proven.
- The backup sync script is runtime-managed under `/root/.hermes/scripts/`; add a tracked installer or source-of-truth copy so this hardening does not depend on manual VM state.

### 2026-05-27 — Phase 2 Gitea runner state

**Changed:**

- Started `gitea-act-runner.service`; it was enabled but inactive.
- Treated the intended state as active because the service unit is enabled, historical journal entries show successful task execution, and restart declared the runner successfully.

**Verified:**

- `systemctl is-active gitea-act-runner.service` returned `active`.
- `systemctl status gitea-act-runner.service --no-pager` showed `bytelyst-host-runner` running as `gitea-runner`.
- Runner labels declared successfully: `ubuntu-latest`, `linux`, `bytelyst`, `hostinger`.
- Runner config uses Docker executor images and `privileged: false`; Docker socket access is granted through the `docker` group.
- Runner immediately picked up task `42` for `bytelyst/bytelyst-devops-tools`, proving it can talk to local Gitea.

**Residual risk:**

- Record a small dedicated smoke workflow that does not need production secrets, so runner health is proven by a controlled workflow rather than incidental queued work.
- Add runner health to VM observability so enabled-but-inactive drift is caught automatically.

## Do Not Start With

- Rootless Docker migration.
- Broad `iptables` default-drop without an allowlist.
- Mass Compose rewrites across all products.
- SSH password/root lockout before key-based sudo and rollback are proven.
- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
- Publishing secret-scan output that contains secrets.

## Suggested First Tickets

1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
2. **P0: Lock Docker-published non-public ports** — bind to loopback/internal or enforce `DOCKER-USER` drops.
3. **P0: Harden SSH** — disable password/root login after confirming key-based admin access.
4. **P1: Triage unhealthy containers** — fix healthchecks/apps or retire dead services.
5. **P1: Resolve failed Hermes backup unit** — fix or disable duplicate failed unit; keep cron backup healthy.
6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.

**Recommended first implementation order:**

1. Generate and review `docs/vm-exposure-inventory.md`.
2. Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
3. Harden SSH with second-session/provider-console safety.
4. Move obvious internal-only Docker ports to loopback/internal bindings.
5. Add `DOCKER-USER` guardrails after the allowlist is proven.

This order improves safety without letting the port exposure issue linger too long.

## Verification Commands for Future Runs

```bash
# Host/security baseline
date -Is
uname -a
. /etc/os-release && echo "$PRETTY_NAME"
apt-get -s upgrade | awk '/^Inst /{print}'
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required

# Firewall and public bind inventory
ufw status verbose
iptables -S DOCKER-USER
ss -ltnup

# SSH effective config
sshd -T | grep -E '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval|clientalivecountmax)'
fail2ban-client status sshd

# Docker health/security
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'

# Caddy and ingress
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
sed -n '1,220p' /opt/bytelyst/Caddyfile

# Backup/cron/systemd drift
systemctl --failed --no-pager
hermes cron list
crontab -l
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
```

## Notes

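- The initial review did not change firewall, SSH, Docker, Caddy, or service settings; it intentionally documents the risk and remediation order before making potentially disruptive security changes. The only changes applied since then are the Phase 2 cron, backup, and Gitea runner fixes recorded in the Implementation Log.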
- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.