568 lines
30 KiB
Markdown
568 lines
30 KiB
Markdown
# ByteLyst VM Security Blind Spots Roadmap
|
||
|
||
**Review date:** 2026-05-27
|
||
**Reviewer:** Hermes Agent
|
||
**Scope:** Hostinger ByteLyst VM, Docker-hosted product stack, Caddy ingress, Gitea/CI, Hermes backup/ops, VM maintenance posture.
|
||
|
||
## Executive Summary
|
||
|
||
The VM is operational and has several good foundations already in place: UFW is active, fail2ban is running for SSH, unattended upgrades are enabled, Caddy config validates, disk/memory headroom is acceptable, and Hermes persistent-data backup cron is healthy.
|
||
|
||
The biggest blind spot is that the apparent firewall posture is misleading: UFW only allows SSH, but Docker-published ports create iptables rules that can expose many application, database/emulator, observability, registry, and development ports on `0.0.0.0` / IPv6. Several of those services should either be private-only, routed only through Caddy with auth, or bound to loopback/internal Docker networks.
|
||
|
||
Second-order risks are SSH hardening gaps, rootful Docker/container hardening gaps, unhealthy apps that can hide failed deploys, an inactive Gitea Actions runner, a failed Hermes backup systemd unit despite cron backup success, and incomplete evidence for restore drills, secret scans, and off-host recovery.
|
||
|
||
## Implementation Readiness Assessment
|
||
|
||
**Roadmap quality score:** 86%
|
||
|
||
**Implementation confidence before remediation starts:** 74%
|
||
|
||
**Why not higher yet:** the review has good evidence for the major blind spots, but safe remediation still depends on a service-by-service exposure inventory, owner approval for public/private intent, and verified rollback paths for SSH and Docker firewall changes. The highest-risk changes are not technically hard; they are risky because this VM hosts many ByteLyst apps and several public ports may be relied on by legacy workflows.
|
||
|
||
**Confidence after Phase 0 is complete:** expected to rise to about 88% if every public hostname/host port has an approved disposition and rollback commands are tested.
|
||
|
||
**Quality strengths:**
|
||
|
||
- Evidence is concrete and command-derived rather than speculative.
|
||
- The highest-risk items are correctly prioritized as P0.
|
||
- The roadmap separates discovery from disruptive remediation.
|
||
- It captures operational debt outside pure security, including unhealthy containers, backup state, runner drift, and cron drift.
|
||
|
||
**Quality gaps to close before implementation:**
|
||
|
||
- Convert broad remediation bullets into small tickets with owner, rollback, validation, and maintenance window requirements.
|
||
- Add an approved exposure inventory before changing Docker bindings or `DOCKER-USER`.
|
||
- Record a tested SSH rollback path and keep an active second session/provider console open before changing `sshd`.
|
||
- Define what is intentionally public, private, internal-only, or deprecated for each service.
|
||
- Add post-change verification commands that prove public apps still work and private services are no longer internet reachable.
|
||
|
||
## Implementation Guardrails
|
||
|
||
These rules apply before any Phase 1 change:
|
||
|
||
- Do not bulk-close ports. Change one service group at a time and verify public app health after each group.
|
||
- Do not restart SSH from a single session. Keep a second key-based session open and provider console access available.
|
||
- Do not add broad `DROP` rules before an allowlist is committed to the inventory.
|
||
- Prefer loopback/internal Compose bindings over firewall-only hiding when a service does not need direct public access.
|
||
- Preserve Caddy as the public ingress path for web/API services unless a service is explicitly approved for direct exposure.
|
||
- Record exact rollback commands next to every change ticket.
|
||
- Treat Docker, SSH, Caddy, and backup changes as maintenance-window work.
|
||
|
||
## Exposure Classification Model
|
||
|
||
Every listening port and Caddy hostname should be classified before changes:
|
||
|
||
| Class | Meaning | Expected Controls | Examples To Review |
|
||
| --- | --- | --- | --- |
|
||
| `public-caddy` | Public app/API reached only through Caddy | TLS, hostname routing, app auth where needed, no direct host-port access | product web/API hostnames |
|
||
| `public-direct` | Direct host-port access is intentionally public | Explicit business reason, provider firewall allow, monitoring | SSH only unless approved otherwise |
|
||
| `private-admin` | Admin/dev/internal tool | Tailscale/VPN, SSH tunnel, IP allowlist, or auth gate | admin dashboards, devops tools |
|
||
| `loopback-only` | Host-local service used by Caddy or local automation | Bind `127.0.0.1:port`, no external bind | internal APIs behind Caddy |
|
||
| `docker-internal` | Container-to-container only | no host port mapping | databases, emulators, private workers |
|
||
| `retire` | Unused/deprecated | remove service/port, disable health checks and jobs | stale dashboards/services |
|
||
|
||
Minimum inventory fields:
|
||
|
||
- service/container name
|
||
- repo/Compose file
|
||
- host port and bind address
|
||
- container port
|
||
- Caddy hostname/path, if any
|
||
- intended audience
|
||
- authentication/control plane
|
||
- classification
|
||
- owner/approver
|
||
- rollback command
|
||
- post-change health check
|
||
|
||
## Evidence Snapshot
|
||
|
||
Collected on 2026-05-27 from this VM.
|
||
|
||
### Host and patching
|
||
|
||
- Host: `srv1491630`
|
||
- OS: Ubuntu `25.10`
|
||
- Kernel: `6.17.0-29-generic`
|
||
- Uptime: about 14 hours at review time
|
||
- Root filesystem: 193G total, 71G used, 123G available, 37% used
|
||
- Memory: 15Gi total, about 10Gi available
|
||
- Swap: 4.0G total, about 1.3G used
|
||
- Reboot required: no
|
||
- Pending package upgrades included Docker CE/containerd/buildx/compose and security updates for `libgcrypt20`, `libcaca0`, and `libssh2-1t64`
|
||
- Unattended upgrades: active and configured for automatic reboot at 04:00 with users absent
|
||
|
||
### Network and ingress
|
||
|
||
- UFW: active; default deny incoming; only `22/tcp` allowed by UFW rules
|
||
- Docker iptables rules are present and publish many ports despite UFW's simple rule list
|
||
- Public/listening TCP ports bound on all interfaces included:
|
||
- `22`, `80`, `443`
|
||
- app/frontend ports: `3000`, `3002`, `3003`, `3030`, `3035`, `3040`, `3049`, `3050`, `3055`, `3060`, `3070`, `3075`, `3085`
|
||
- backend/API ports: `4004`, `4010`, `4011`, `4012`, `4013`, `4014`, `4015`, `4016`, `4017`, `4019`, `4020`, `4025`
|
||
- infra/dev ports: `1025`, `1234`, `3100`, `3300`, `8025`, `8081`, `10000`, `11434`
|
||
- Caddy source-of-truth config: `/opt/bytelyst/Caddyfile`, mounted read-only into the `caddy` container
|
||
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`: valid config, formatting warning only
|
||
- Caddy public hostnames include:
|
||
- `api.bytelyst.com`
|
||
- `gitea.bytelyst.com`
|
||
- `admin.bytelyst.com`
|
||
- `devops.bytelyst.com`
|
||
- `tracker.bytelyst.com`
|
||
- `llmlab.bytelyst.com`
|
||
- `ollama.bytelyst.com`
|
||
- `trading-api.bytelyst.com`
|
||
- `invttrdg.bytelyst.com`
|
||
- `notes.bytelyst.com`
|
||
- `clock.bytelyst.com`
|
||
|
||
### SSH and account surface
|
||
|
||
Effective `sshd -T` settings showed:
|
||
|
||
- `permitrootlogin yes`
|
||
- `passwordauthentication yes`
|
||
- `pubkeyauthentication yes`
|
||
- `kbdinteractiveauthentication no`
|
||
- `maxauthtries 6`
|
||
- `x11forwarding yes`
|
||
- `clientaliveinterval 0`
|
||
|
||
`fail2ban` is active with one jail: `sshd`; no current bans at review time.
|
||
|
||
### Docker runtime and containers
|
||
|
||
- Docker: client/server `29.4.2`; newer Docker packages are available
|
||
- Docker daemon is rootful; security options showed AppArmor, seccomp builtin, and cgroup namespaces; `live_restore=false`
|
||
- Most product containers run with writable root filesystems and no explicit `user` configured
|
||
- `cadvisor` is privileged
|
||
- `DOCKER-USER` chain appears empty, so there is no central Docker firewall policy in front of published containers
|
||
- Multiple containers are unhealthy:
|
||
- `learning_ai_common_plat-llmlab-dashboard-1`
|
||
- `learning_ai_common_plat-actiontrail-web-1`
|
||
- `learning_ai_common_plat-jarvisjr-web-1`
|
||
- `learning_ai_common_plat-localmemgpt-web-1`
|
||
- `learning_ai_common_plat-nomgap-web-1`
|
||
- `learning_ai_common_plat-flowmonk-web-1`
|
||
- `learning_ai_common_plat-mindlyst-web-1`
|
||
|
||
### Gitea and CI
|
||
|
||
- Gitea public route: `https://gitea.bytelyst.com`
|
||
- Local Gitea container port: host `3300` -> container `3000`, bound on `0.0.0.0` and IPv6
|
||
- `gitea-act-runner.service`: enabled but inactive/dead
|
||
- Runner user exists: `gitea-runner`, member of `docker`
|
||
- Runner config directory permissions look reasonable:
|
||
- `/home/gitea-runner/act_runner`: `750`, owned by `gitea-runner:gitea-runner`
|
||
- `/home/gitea-runner/act_runner/config.yaml`: `600`, owned by `gitea-runner:gitea-runner`
|
||
|
||
### Backup and operations
|
||
|
||
- `systemctl --failed` showed failed unit:
|
||
- `hermes-root-backup.service` — `Sync root Hermes persistent backup to GitHub`
|
||
- Hermes cron backup is active and healthy:
|
||
- job `470832621b43`, `Sync Hermes persistent-data backup to GitHub`, every 30 minutes, last run `ok`
|
||
- Existing VM maintenance cron entries exist for health check and cleanup under `/opt/bytelyst/learning_ai_devops_tools/scripts/VMs/HostingerVM/`
|
||
- A root crontab entry still references `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh`, which may be stale after repo relocation/renaming
|
||
|
||
## Blind Spots and Risk Register
|
||
|
||
### P0 — Internet-exposed Docker ports bypass the intended ingress model
|
||
|
||
**Risk:** UFW suggests only SSH is allowed, but Docker-published ports expose many services directly on all interfaces. This can bypass Caddy, TLS, auth, logging, rate limiting, and hostname/path controls.
|
||
|
||
**Examples observed:** `3300`, `8025`, `1025`, `1234`, `8081`, `10000`, `11434`, many `30xx` web ports, and many `40xx` backend ports.
|
||
|
||
**Impact:** Direct access to dev/infra services, internal APIs, emulators, mail tooling, dashboards, or model endpoints if upstream firewall/provider rules do not block them.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Create a canonical exposure inventory: service, container, host port, public hostname, required audience, auth requirement.
|
||
- [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
|
||
- [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
|
||
- [ ] Add a `DOCKER-USER` chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
|
||
- [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
|
||
- [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
|
||
|
||
**Acceptance criteria:**
|
||
|
||
- `docs/vm-exposure-inventory.md` lists every `ss -ltnp` listener and every Docker published port.
|
||
- Every non-SSH direct public bind has an approved classification.
|
||
- Non-public services are either loopback-bound, Docker-internal, provider-firewalled, or blocked in `DOCKER-USER`.
|
||
- External probe confirms non-approved ports are closed from the internet.
|
||
- Caddy-routed public hostnames still pass smoke checks.
|
||
|
||
**Rollback:** keep a saved copy of original Compose files and `iptables-save` output; rollback means restoring original port mappings or flushing only the newly added `DOCKER-USER` rules.
|
||
|
||
### P0 — SSH permits root login and password authentication
|
||
|
||
**Risk:** `PermitRootLogin yes` and `PasswordAuthentication yes` keep the primary admin surface broad. fail2ban helps, but password-enabled root SSH is still high-risk for an internet-facing VM.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Confirm all required admin users have working SSH keys and sudo access.
|
||
- [ ] Add a non-root break-glass admin path if one does not exist.
|
||
- [ ] Change SSH effective config to:
|
||
- [ ] `PermitRootLogin prohibit-password` or `no`
|
||
- [ ] `PasswordAuthentication no`
|
||
- [ ] `X11Forwarding no`
|
||
- [ ] lower `MaxAuthTries`, e.g. `3`
|
||
- [ ] set a sane `ClientAliveInterval` / `ClientAliveCountMax`
|
||
- [ ] Validate with a second session before restarting SSH.
|
||
- [ ] Record rollback commands and keep console/provider access available during rollout.
|
||
|
||
**Acceptance criteria:**
|
||
|
||
- A non-root sudo admin user can log in with SSH key auth.
|
||
- Root password login no longer works.
|
||
- Existing automation using `scripts/VMs/HostingerVM/login.sh` still works or is updated.
|
||
- `sshd -T` confirms the intended effective config.
|
||
- `fail2ban-client status sshd` still reports an active jail.
|
||
|
||
**Rollback:** provider console or still-open root session can restore previous `sshd_config` drop-in and restart `ssh`.
|
||
|
||
### P0 — Public/private boundary for dev and internal tooling is unclear
|
||
|
||
**Risk:** Caddy publishes `ollama.bytelyst.com`, `llmlab.bytelyst.com`, `devops.bytelyst.com`, `admin.bytelyst.com`, and Gitea. Some may be intended, but the roadmap lacks an explicit auth/access decision for each.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Document public hostnames, auth model, and data sensitivity.
|
||
- [ ] Require explicit approval before exposing new dashboards or model endpoints.
|
||
- [ ] Add Caddy auth/IP allowlist/Tailscale-only strategy for admin-like surfaces.
|
||
- [ ] Add security headers/auth checks to public UI health reviews.
|
||
- [ ] Confirm `ollama.bytelyst.com` should be publicly reachable at all; if not, move behind private network or auth gate.
|
||
|
||
**Acceptance criteria:**
|
||
|
||
- `ollama`, `llmlab`, `devops`, `admin`, `gitea`, and observability-adjacent routes each have an owner-approved exposure class.
|
||
- Public admin-like routes require authentication or an explicit documented exception.
|
||
- No emulator, mail, model, or raw dashboard port is directly internet reachable unless explicitly approved.
|
||
|
||
### P1 — Docker/container hardening is mostly default
|
||
|
||
**Risk:** Many containers run as default/root user, writable rootfs, broad capabilities by default, and rootful Docker. A compromised app gets more host-adjacent leverage than needed.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Create a per-service Docker hardening matrix: user, read-only rootfs, dropped capabilities, no-new-privileges, resource limits, healthcheck, restart policy, secrets handling.
|
||
- [ ] Start with public-facing/backend services and admin dashboards.
|
||
- [ ] Add `security_opt: ["no-new-privileges:true"]` where compatible.
|
||
- [ ] Add `cap_drop: ["ALL"]` and selectively add back capabilities only when needed.
|
||
- [ ] Convert app images to non-root users consistently.
|
||
- [ ] Use `read_only: true` plus explicit writable tmp/cache volumes where compatible.
|
||
- [ ] Review `cadvisor` privileged mode and replace/restrict if possible.
|
||
- [ ] Enable Docker `live-restore` if it fits maintenance operations.
|
||
|
||
**Implementation note:** do not attempt rootless Docker or read-only rootfs as the first hardening step. Start with `no-new-privileges`, non-root app users where images already support it, and targeted capability drops for public-facing app containers.
|
||
|
||
### P1 — Unhealthy containers can normalize broken deployments
|
||
|
||
**Risk:** Multiple app web containers are unhealthy while still running. If unhealthy states are ignored, deploy regressions and broken public pages can persist unnoticed.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Triage each unhealthy container and classify: real app failure, bad healthcheck, intentionally unused, or deprecated.
|
||
- [ ] Fix or remove bad healthchecks so Docker health state is trustworthy.
|
||
- [ ] Add alerting for sustained unhealthy containers.
|
||
- [ ] Make deployment scripts fail on unhealthy post-deploy state.
|
||
- [ ] Update dashboard/observability docs with current service ownership and expected state.
|
||
|
||
**Acceptance criteria:**
|
||
|
||
- Every unhealthy container has one of: fixed app, fixed healthcheck, intentionally disabled, or retired.
|
||
- Docker health state matches the product’s actual serving state.
|
||
- Post-deploy checks fail if required containers remain unhealthy beyond a grace period.
|
||
|
||
### P1 — Gitea Actions runner is enabled but inactive
|
||
|
||
**Risk:** CI/deploy assumptions may be wrong. If a runner is expected to deploy or publish packages, inactive runner state blocks automation and may cause manual drift.
|
||
|
||
**Roadmap:**
|
||
|
||
- [x] Decide whether the runner should be active or intentionally disabled.
|
||
- [x] If active: restart and verify `gitea-act-runner.service`, runner labels, and Docker access.
|
||
- [ ] Run and record a dedicated Gitea Actions smoke workflow result.
|
||
- [ ] If disabled: disable the service and document the intentional state.
|
||
- [ ] Keep runner secrets separate from smoke/test workflows.
|
||
- [ ] Add a runner-health check to VM observability.
|
||
|
||
**Decision needed:** runner should be either actively smoke-tested or disabled. An enabled-but-dead runner should not remain a steady state.
|
||
|
||
### P1 — Backup/restore evidence is split and one backup unit is failed
|
||
|
||
**Risk:** Hermes cron backup works, but `hermes-root-backup.service` is failed. There is no recent full restore drill evidence in this review. A backup that cannot be restored is only an assumption.
|
||
|
||
**Roadmap:**
|
||
|
||
- [x] Inspect `hermes-root-backup.service` logs and decide whether to fix, disable, or replace it with the cron-backed job.
|
||
- [x] Repair the root backup checkout divergence and verify a successful `hermes-root-backup.service` one-shot run.
|
||
- [x] Update `/root/.hermes/scripts/sync_hermes_persistent_backup.py` so future generated-backup divergence preserves a safety branch and rejoins the remote backup stream instead of wedging on `git pull --ff-only`.
|
||
- [ ] Document all backup mechanisms: Hermes, Gitea data, Docker volumes, app data, Caddy certs/config, environment/secrets escrow.
|
||
- [ ] Run a restore drill into a non-production path/profile.
|
||
- [ ] Verify no raw `.env`, OAuth tokens, private keys, SQLite WAL/SHM, or raw transcript DBs are committed.
|
||
- [ ] Add backup freshness and restore-drill status to the monthly VM review.
|
||
|
||
**Acceptance criteria:**
|
||
|
||
- `systemctl --failed` no longer includes backup units unless the failure is intentionally documented.
|
||
- Backup status shows source, destination, cadence, last success, and restore command.
|
||
- A restore drill has an artifact: date, target path/profile, commands run, result, and gaps found.
|
||
|
||
### P1 — Patch management has pending security/runtime updates
|
||
|
||
**Risk:** Unattended upgrades are on, but Docker and security package updates were pending at review time. Docker updates may need controlled restart/redeploy planning.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Add a weekly patch review checkpoint that reports pending security and Docker updates separately.
|
||
- [ ] Define a Docker upgrade maintenance window with pre/post checks.
|
||
- [ ] Run `apt list --upgradable` and capture package classes without dumping noise.
|
||
- [ ] Verify apps after Docker/containerd upgrades.
|
||
|
||
**Acceptance criteria:**
|
||
|
||
- Security updates and Docker/runtime updates are tracked separately.
|
||
- Docker upgrade has pre/post container health, Caddy validation, and public smoke checks.
|
||
- Reboot requirement is checked and scheduled rather than discovered accidentally.
|
||
|
||
### P2 — Ubuntu 25.10 lifecycle risk needs explicit tracking
|
||
|
||
**Risk:** Ubuntu interim releases have short support windows. If this VM is long-lived production infrastructure, lifecycle tracking matters.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Record current Ubuntu 25.10 support/EOL date in ops docs.
|
||
- [ ] Decide whether to stay on interim releases or migrate to an LTS baseline.
|
||
- [ ] Add an OS lifecycle check to quarterly review.
|
||
|
||
### P2 — Repository/config secret hygiene needs a repeatable scanner
|
||
|
||
**Risk:** The DevOps repo contains operational inputs and historical/deleted repo copies exist on disk. Manual review can miss tokens in old files, generated JSON, logs, backups, or abandoned directories.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Add a documented secret-scan command using `gitleaks` or `trufflehog` for tracked files and selected untracked ops directories.
|
||
- [ ] Scan historical directories such as `DELETED_bytelyst-devops-tools` separately before archiving or deleting.
|
||
- [ ] Add `.gitignore` patterns for generated scans, local account snapshots, and credential-shaped outputs.
|
||
- [ ] Keep examples as `.example` files only.
|
||
|
||
### P2 — Cron/systemd ownership and drift are not fully inventoried
|
||
|
||
**Risk:** Root crontab references old repo paths and there are multiple cron/systemd sources. Stale jobs can fail silently or mutate production unexpectedly.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Inventory root/user crontabs, `/etc/cron.d`, systemd timers, Hermes cron, and Gitea Actions schedules.
|
||
- [x] Remove or update stale `/opt/bytelyst/bytelyst-devops-tools/...` references after confirming replacements.
|
||
- [ ] Add owner, purpose, expected output, and alert channel for every job.
|
||
- [ ] Add a stale-job detector for missing script paths and failed systemd units.
|
||
|
||
**Acceptance criteria:**
|
||
|
||
- No active cron/systemd job references a missing path.
|
||
- Every recurring job has an owner, purpose, schedule, expected output, and alert destination.
|
||
- Stale path detection runs in the monthly VM review.
|
||
|
||
### P2 — Observability exists but needs security-focused SLOs
|
||
|
||
**Risk:** Prometheus/Grafana/Loki/exporters are present, but security-focused alerts are not yet proven from this review.
|
||
|
||
**Roadmap:**
|
||
|
||
- [ ] Add alerts for unexpected public ports, failed units, unhealthy containers, high disk/swap, backup staleness, Gitea runner inactive, and SSH auth spikes.
|
||
- [ ] Validate alert delivery to Telegram.
|
||
- [ ] Keep internal observability endpoints private; do not publish Prometheus/Loki/node-exporter/cAdvisor directly.
|
||
|
||
## Execution Plan
|
||
|
||
### Phase 0 — Freeze and inventory before changes
|
||
|
||
- [ ] Freeze new public hostnames/ports until the exposure inventory is complete.
|
||
- [x] Generate `docs/vm-exposure-inventory.md` from Docker, Caddy, `ss`, and DNS.
|
||
- [ ] Mark each exposed service as `public`, `private`, `internal-only`, or `retire`.
|
||
- [ ] Review with S before changing public access for customer/user-facing apps.
|
||
|
||
**Exit criteria:** the inventory is reviewed and every P0 change has a rollback line and validation line.
|
||
|
||
### Phase 1 — Immediate security hardening
|
||
|
||
- [ ] Close or loopback-bind non-public Docker host ports.
|
||
- [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
|
||
- [ ] Harden SSH root/password access after key-based access is verified.
|
||
- [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
|
||
|
||
**Exit criteria:** only approved public ports are externally reachable, SSH effective config is hardened, and public apps still pass smoke checks.
|
||
|
||
### Phase 2 — Operational correctness
|
||
|
||
- [ ] Fix/retire unhealthy containers.
|
||
- [x] Resolve `hermes-root-backup.service` failed state.
|
||
- [x] Decide and document Gitea runner active/disabled state.
|
||
- [ ] Add missing-script checks. Stale root cron path was fixed on 2026-05-27.
|
||
- [ ] Apply pending security/runtime updates in a maintenance window.
|
||
|
||
**Exit criteria:** no unexpected failed units, no ignored unhealthy required containers, no stale cron paths, and runner state is intentional.
|
||
|
||
### Phase 3 — Docker and app hardening
|
||
|
||
- [ ] Add non-root users, `no-new-privileges`, cap drops, and read-only rootfs by service.
|
||
- [ ] Add resource limits for noisy services and emulators.
|
||
- [ ] Move emulators/dev tools off public bindings.
|
||
- [ ] Review cAdvisor privilege and observability surface.
|
||
|
||
### Phase 4 — Backup, restore, and incident readiness
|
||
|
||
- [ ] Define full backup map: Hermes, Gitea, Caddy, Docker volumes, app DB/state, secrets escrow.
|
||
- [ ] Perform restore drill to non-prod target.
|
||
- [ ] Add incident runbooks: compromised container, leaked token, SSH brute force, disk full, failed Docker upgrade.
|
||
- [ ] Add quarterly tabletop review.
|
||
|
||
### Phase 5 — Continuous governance
|
||
|
||
- [ ] Monthly VM security review cron/checklist.
|
||
- [ ] Secret scan before DevOps repo pushes.
|
||
- [ ] OS lifecycle/EOL tracker.
|
||
- [ ] Drift detection for ports, Caddy routes, Docker health, systemd failures, and cron paths.
|
||
|
||
## Change Tickets With Quality Gates
|
||
|
||
Use this shape for each implementation PR/commit:
|
||
|
||
```text
|
||
Ticket:
|
||
Risk:
|
||
Files/services changed:
|
||
Pre-checks:
|
||
Change:
|
||
Rollback:
|
||
Post-checks:
|
||
Residual risk:
|
||
```
|
||
|
||
Minimum post-checks for Phase 1:
|
||
|
||
- `ss -ltnp`
|
||
- `docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'`
|
||
- `iptables -S DOCKER-USER`
|
||
- `docker exec caddy caddy validate --config /etc/caddy/Caddyfile`
|
||
- public smoke checks for approved hostnames
|
||
- negative external probe for blocked ports
|
||
- `sshd -T` after SSH changes
|
||
- `systemctl --failed --no-pager`
|
||
|
||
## Implementation Log
|
||
|
||
### 2026-05-27 — Phase 2 backup and cron drift
|
||
|
||
**Changed:**
|
||
|
||
- Repointed the root Lucky25 monitor cron from `/opt/bytelyst/bytelyst-devops-tools/monitor-lucky25-execution.sh` to `/opt/bytelyst/learning_ai_devops_tools/scripts/monitor-lucky25-execution.sh`.
|
||
- Saved the pre-change root crontab at `/tmp/root-crontab-before-vm-security-20260527.txt`.
|
||
- Repaired `/root/repos/bytelyst_hostinger_hermes_vm`, which was `ahead 1, behind 11`; the obsolete local generated backup commit conflicted with newer remote snapshots and was skipped after rebase preserved the current remote stream.
|
||
- Patched `/root/.hermes/scripts/sync_hermes_persistent_backup.py` to replace unconditional `git pull --ff-only` with explicit fetch/merge-base handling. Diverged generated snapshots now create a safety branch before attempting rebase and fall back to `origin/<branch>` if the generated files conflict.
|
||
- Saved the pre-change backup script at `/tmp/sync_hermes_persistent_backup.py.before-vm-security-20260527`.
|
||
|
||
**Verified:**
|
||
|
||
- `crontab -l` now points the Lucky25 monitor at the current repo script.
|
||
- `python3 -m py_compile /tmp/sync_hermes_persistent_backup.py` passed before deployment.
|
||
- `systemctl start hermes-root-backup.service` succeeded twice after repair.
|
||
- `systemctl status hermes-root-backup.service hermes-root-backup.timer --no-pager` showed the service exited `status=0/SUCCESS` and the timer remains active.
|
||
- `/root/repos/bytelyst_hostinger_hermes_vm` is aligned with `origin/main` after successful backup commits `415e824` and `369e584`.
|
||
|
||
**Residual risk:**
|
||
|
||
- A restore drill is still required before the backup posture should be considered fully proven.
|
||
- The backup sync script is runtime-managed under `/root/.hermes/scripts/`; add a tracked installer or source-of-truth copy so this hardening does not depend on manual VM state.
|
||
|
||
### 2026-05-27 — Phase 2 Gitea runner state
|
||
|
||
**Changed:**
|
||
|
||
- Started `gitea-act-runner.service`; it was enabled but inactive.
|
||
- Treated the intended state as active because the service unit is enabled, historical journal entries show successful task execution, and restart declared the runner successfully.
|
||
|
||
**Verified:**
|
||
|
||
- `systemctl is-active gitea-act-runner.service` returned `active`.
|
||
- `systemctl status gitea-act-runner.service --no-pager` showed `bytelyst-host-runner` running as `gitea-runner`.
|
||
- Runner labels declared successfully: `ubuntu-latest`, `linux`, `bytelyst`, `hostinger`.
|
||
- Runner config uses Docker executor images and `privileged: false`; Docker socket access is granted through the `docker` group.
|
||
- Runner immediately picked up task `42` for `bytelyst/bytelyst-devops-tools`, proving it can talk to local Gitea.
|
||
|
||
**Residual risk:**
|
||
|
||
- Record a small dedicated smoke workflow that does not need production secrets, so runner health is proven by a controlled workflow rather than incidental queued work.
|
||
- Add runner health to VM observability so enabled-but-inactive drift is caught automatically.
|
||
|
||
## Do Not Start With
|
||
|
||
- Rootless Docker migration.
|
||
- Broad `iptables` default-drop without an allowlist.
|
||
- Mass Compose rewrites across all products.
|
||
- SSH password/root lockout before key-based sudo and rollback are proven.
|
||
- Removing unhealthy containers before confirming whether they are deprecated or broken required services.
|
||
- Publishing secret-scan output that contains secrets.
|
||
|
||
## Suggested First Tickets
|
||
|
||
1. **P0: Build and review exposure inventory** — produce exact approved/blocked list for all currently bound ports.
|
||
2. **P0: Lock Docker-published non-public ports** — bind to loopback/internal or enforce `DOCKER-USER` drops.
|
||
3. **P0: Harden SSH** — disable password/root login after confirming key-based admin access.
|
||
4. **P1: Triage unhealthy containers** — fix healthchecks/apps or retire dead services.
|
||
5. **P1: Resolve failed Hermes backup unit** — fix or disable duplicate failed unit; keep cron backup healthy.
|
||
6. **P1: Decide Gitea runner state** — active smoke-tested runner or documented disabled service.
|
||
7. **P2: Add secret scanner and stale-job scanner** — prevent silent credential and automation drift.
|
||
|
||
**Recommended first implementation order:**
|
||
|
||
1. Generate and review `docs/vm-exposure-inventory.md`.
|
||
2. Fix the stale cron path and failed backup unit, because both are lower blast-radius and improve rollback confidence.
|
||
3. Harden SSH with second-session/provider-console safety.
|
||
4. Move obvious internal-only Docker ports to loopback/internal bindings.
|
||
5. Add `DOCKER-USER` guardrails after the allowlist is proven.
|
||
|
||
This order improves safety without letting the port exposure issue linger too long.
|
||
|
||
## Verification Commands for Future Runs
|
||
|
||
```bash
|
||
# Host/security baseline
|
||
date -Is
|
||
uname -a
|
||
. /etc/os-release && echo "$PRETTY_NAME"
|
||
apt-get -s upgrade | awk '/^Inst /{print}'
|
||
test -f /var/run/reboot-required && cat /var/run/reboot-required || echo no-reboot-required
|
||
|
||
# Firewall and public bind inventory
|
||
ufw status verbose
|
||
iptables -S DOCKER-USER
|
||
ss -ltnup
|
||
|
||
# SSH effective config
|
||
sshd -T | egrep '^(permitrootlogin|passwordauthentication|pubkeyauthentication|kbdinteractiveauthentication|maxauthtries|x11forwarding|clientaliveinterval)'
|
||
fail2ban-client status sshd
|
||
|
||
# Docker health/security
|
||
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
|
||
docker ps -q | xargs -r docker inspect --format '{{.Name}} user={{.Config.User}} privileged={{.HostConfig.Privileged}} readonly={{.HostConfig.ReadonlyRootfs}} ports={{json .NetworkSettings.Ports}}'
|
||
|
||
# Caddy and ingress
|
||
docker exec caddy caddy validate --config /etc/caddy/Caddyfile
|
||
sed -n '1,220p' /opt/bytelyst/Caddyfile
|
||
|
||
# Backup/cron/systemd drift
|
||
systemctl --failed --no-pager
|
||
hermes cron list
|
||
crontab -l
|
||
for f in /etc/cron.d/*; do echo "--- $f"; sed -n '1,80p' "$f"; done
|
||
```
|
||
|
||
## Notes
|
||
|
||
- This review did not change firewall, SSH, Docker, Caddy, or service settings. It intentionally documents the risk and remediation order before making potentially disruptive security changes.
|
||
- Public exposure changes should be handled in small maintenance windows with pre/post health checks because this VM hosts multiple ByteLyst apps.
|
||
- The Caddyfile validates today, but Caddy formatting should be normalized in a separate low-risk docs/ops cleanup if desired.
|