diff --git a/dashboard/DEPLOYMENT.md b/dashboard/DEPLOYMENT.md index 9362487..b7ca70e 100644 --- a/dashboard/DEPLOYMENT.md +++ b/dashboard/DEPLOYMENT.md @@ -342,6 +342,107 @@ docker logs -f gateway | Traefik | Yes | Yes | Yes | | Auth | Platform | Platform | Platform | +## Privilege Surface (Docker socket + host mounts) + +The `devops-backend` container has root-equivalent access to the host. This +section documents exactly what is mounted, which routes use each mount, and +what the blast radius looks like if an admin token leaks. It exists so reviewers +don't have to reverse-engineer this from `docker-compose.yml` and the route +handlers — and so any future change to the mount set is reviewed against this +list rather than slipped in. + +### Mounts (from `docker-compose.yml`) + +| Host path | Container path | Mode | Purpose | +|------------------------------------|-----------------------------------|------|-------------------------------------------------------------------------| +| `/var/run/docker.sock` | `/var/run/docker.sock` | rw | Allows `docker` CLI inside the container to control the host daemon. Used by the `system` and `vm` modules. **Equivalent to root on the host.** | +| `/opt/bytelyst/learning_ai_devops_tools/scripts` | `/vm-scripts` | ro | Bash scripts the `vm` module shells out to (`HostingerVM/*.sh`). Read-only mount; the container cannot modify the script set. | +| `/var/log/vm-cleanup.log` | `/host-logs/vm-cleanup.log` | rw | The `vm` cleanup script appends here; backend reads it via `/api/vm/cleanup-log`. | +| `/var/log/vm-health-check.log` | `/host-logs/vm-health-check.log` | rw | Health-check probe output; backend reads it via `/api/vm/health`. | +| `/var/log/docker-watchdog.log` | `/host-logs/docker-watchdog.log` | rw | Watchdog tail used by the VM panel. | +| `extra_hosts: host-gateway` | `host.docker.internal`-equivalent | — | Lets the container reach `host:11434` (Ollama) and other host-only services. Not a filesystem mount, but a privilege-relevant capability — the container can talk to anything bound to `127.0.0.1` on the host. | + +The container's listening port (`4004`) is bound to `127.0.0.1` only, so the +API is **not** exposed to the public internet by this compose file — access is +expected via Tailscale or an SSH tunnel. Any reverse proxy in front of it +(Traefik in production) is responsible for its own auth + TLS. + +### What shells out + which routes (auth column = effective gate) + +| Route | Handler module | What it executes | Auth | +|--------------------------------------------------|-------------------------------|-----------------------------------------------------------------------------------|-------------| +| `GET /system/metrics` | `system/repository.ts` | `df -h ...` | `requireAdmin` | +| `GET /docker/stats` | `system/repository.ts` | `docker images / ps / volume ls / system df` (read-only) | `requireAdmin` | +| `POST /docker/cleanup` | `system/repository.ts` | `docker container prune -f`, `docker image prune -a -f`, `docker volume prune -f`, `docker builder prune -f` (a fixed allow-list — request body picks one of the four "types") | `requireAdmin` | +| `GET /vm/health` | `vm/repository.ts` | `bash $VM_SCRIPTS_PATH/vm-health-check.sh --json` | `requireAdmin` | +| `GET /vm/cleanup-log` | `vm/repository.ts` | reads `/host-logs/vm-cleanup.log` | `requireAdmin` | +| `GET /vm/cron-status` | `vm/repository.ts` | `crontab -l` | `requireAdmin` | +| `POST /vm/cleanup` | `vm/repository.ts` | `bash $VM_SCRIPTS_PATH/vm-cleanup.sh` | `requireAdmin` | +| `GET /vm/containers`, `.../unhealthy`, `.../:name/logs` | `vm/repository.ts` | `docker ps`, `docker inspect`, `docker stats`, `docker logs` | `requireAdmin` | +| `POST /vm/containers/:name/restart` | `vm/repository.ts` | `docker restart ""` (name is a path param — see "Known sharp edges" below) | `requireAdmin` | +| `GET /vm/ollama/models`, `DELETE /vm/ollama/models/:name` | `vm/repository.ts` | HTTP-only (talks to host Ollama via `host-gateway`). No shell-out. | `requireAdmin` | +| `POST /code-quality/check` | `code-quality/repository.ts` | `npm run typecheck`, `npm run lint`, `npm run build`, `npm run test:run` in the request-supplied `projectPath`. | `requireAdmin` *(added concurrently with this doc; previously unauthenticated — see the Phase 5 P1 commit)* | +| `POST /deployments/trigger/:serviceId` | `deployments/orchestrator.ts` | `bash ` from the registered service registry (paths are stored at create-time, not request-time). | `requireAdmin` | +| `/hermes/ops` (snapshot) | `hermes-ops/repository.ts` | Read-only probes: `systemctl is-active/is-enabled`, `git status`, `du -sh`, `ps`, `tailscale ip`, `runuser -u uma -- systemctl --user ...`. No state-changing commands. | none (read-only ops snapshot) | + +### Blast radius if an admin token is leaked + +Anyone holding a valid admin JWT for this product can, today: + +- Run any of the four pre-defined `docker prune` commands (data loss for + containers/images/volumes), restart any container, read any container's logs. +- Trigger the host VM cleanup script and crontab listing. +- Trigger any deployment script registered in the service registry. +- Run `npm run` lifecycle scripts in any directory the container can read + (since `code-quality/check` takes a caller-supplied `projectPath`). +- Read the three host logs that are mounted in. + +In other words, an admin token is **equivalent to a host shell**, modulo the +specific commands the codebase chooses to wrap. There is currently **no +allow-list wrapper** between the backend and the docker socket; the backend +constructs `docker ...` shell strings directly with `execAsync`. + +### Known sharp edges (track and shrink) + +1. **Container name is interpolated into a shell string.** `docker restart + "${name}"` and similar paths in `vm/repository.ts` use `execAsync` with a + template literal. The `:name` path parameter is admin-only but is not + validated against a `^[a-zA-Z0-9._-]+$` allow-list. Lock this down before + exposing the dashboard to a wider admin pool. +2. **`projectPath` for `/code-quality/check` is unvalidated.** The handler + passes the caller-supplied path straight into `execAsync({ cwd })`. Even + with `requireAdmin` added, this should be constrained to a known set of + project roots (or rejected if it escapes the workspace). +3. **No per-route audit-log on shell-outs.** `audit/repository.ts` records + deployment triggers but not `/docker/cleanup` or `/vm/cleanup`. A leaked + token's actions are reconstructable only from container stdout + host logs. +4. **The container runs as root.** Both the backend `Dockerfile` and the bind- + mounts assume root. A non-root user with `docker` group membership would + shrink the in-container blast radius without losing functionality (the + socket is still root on the host); revisit when ready. +5. **`fastify-rate-limit` is global, not per-route.** A leaked admin token + currently isn't slowed down on the destructive endpoints any more than it + is on read-only ones. + +### Mitigation roadmap (incremental, not all at once) + +- [ ] **P1 (next):** Allow-list wrapper around shell-outs. Centralize + `docker`/`bash`/`npm` invocations behind a small `lib/shell.ts` that + validates the arg vector against a per-command schema (e.g. + `docker.restart(name)` rejects names not matching `^[a-zA-Z0-9._-]{1,128}$`). +- [ ] **P1:** Validate `/code-quality/check`'s `projectPath` against a + configured set of allowed roots. +- [ ] **P2:** Audit-log every shell-out (command + arg vector + actor + result). +- [ ] **P2:** Run the backend container as a non-root user with `docker` group + membership; rebuild the Dockerfile accordingly. +- [ ] **P3:** Move from `docker.sock` to a thin daemon (`docker-proxy`-style) + that exposes only the verbs the dashboard actually needs (`stats`, + `restart`, `logs`, the four `prune` variants). + +Operators reviewing whether to grant a new admin should read this whole section +before doing so. Adding a new shell-out path in code is a **privilege change** +and must update this table in the same commit. + ## Production Checklist - [ ] Platform stack running with Traefik. diff --git a/dashboard/REVIEW_ACTIONS.md b/dashboard/REVIEW_ACTIONS.md index 9343972..7bc23f8 100644 --- a/dashboard/REVIEW_ACTIONS.md +++ b/dashboard/REVIEW_ACTIONS.md @@ -45,10 +45,8 @@ The TODO has been resolved by **removing the SSE claim**, not by shipping it: th - `README.md` "Web port: 3000" rewritten to call out container vs Compose host vs dev-mode explicitly and link to the port table. - `ENDPOINTS.md` got a top-of-file note: every `localhost:3000` URL in the file refers to `pnpm dev`; substitute `:3049` for the Dockerized stack. The `https://api.bytelyst.com/api/devops` vs `/devops` ambiguity was already resolved by the existing "URL Note" section (kept). -### 6. Docker socket + host log mounts are very privileged -`docker-compose.yml` mounts `/var/run/docker.sock`, the host `scripts` directory, and three host log paths into `devops-backend`. This is the same risk profile as Portainer but with custom code reading/writing those mounts. There is no documentation of which backend module talks to the docker socket or what commands it issues. - -- Action: document the privilege surface (which routes shell out, which call `docker`), and consider a thin allow-list wrapper instead of mounting the raw socket. At minimum, add a section to `DEPLOYMENT.md` enumerating these mounts and their purpose so reviewers know the blast radius. +### 6. Docker socket + host log mounts are very privileged — RESOLVED (documented; allow-list wrapper queued) +`DEPLOYMENT.md` now has a **Privilege Surface** section enumerating every host mount, every shell-outing route + the exact commands it runs, the auth gate on each, and an explicit blast-radius statement ("an admin token is equivalent to a host shell, modulo what the codebase wraps"). Concurrent fix: the `/code-quality/check` endpoint was missing `requireAdmin` and was therefore reachable unauthenticated even though it shells out to `npm run typecheck/lint/build/test:run` in a caller-supplied `projectPath` — that's been gated to admin in the same commit. Two follow-up P1s remain in the doc's mitigation roadmap (allow-list wrapper around shell-outs; validate `code-quality/check` `projectPath` against an allowed root set); P2/P3 cover audit-logging shell-outs, dropping root in the container, and moving off the raw docker socket. --- diff --git a/dashboard/backend/src/modules/code-quality/routes.ts b/dashboard/backend/src/modules/code-quality/routes.ts index 5f75318..83a3fd2 100644 --- a/dashboard/backend/src/modules/code-quality/routes.ts +++ b/dashboard/backend/src/modules/code-quality/routes.ts @@ -1,10 +1,16 @@ import { FastifyInstance } from 'fastify'; import { runCodeQualityCheck } from './repository.js'; import { CodeQualityCheckParamsSchema } from './types.js'; +import { requireAdmin } from '../../lib/auth.js'; export async function codeQualityRoutes(fastify: FastifyInstance) { - // Run code quality check - fastify.post('/code-quality/check', async (request, reply) => { + // Run code quality check. + // Admin-only: this route shells out (`npm run typecheck/lint/build/test:run`) + // in a caller-supplied `projectPath` and is therefore privileged. See the + // "Privilege Surface" section in `dashboard/DEPLOYMENT.md`. + fastify.post('/code-quality/check', { + preHandler: async (req) => requireAdmin(req), + }, async (request, reply) => { try { const params = CodeQualityCheckParamsSchema.parse(request.body); const report = await runCodeQualityCheck(params); diff --git a/docs/hermes_dashboard_v2_roadmap.md b/docs/hermes_dashboard_v2_roadmap.md index 101598a..759c5ee 100644 --- a/docs/hermes_dashboard_v2_roadmap.md +++ b/docs/hermes_dashboard_v2_roadmap.md @@ -125,7 +125,7 @@ This is the biggest operational asymmetry and the reason half the ops-panel warn - [x] **P1:** Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, **and `hermes-ops`**; add `pnpm test:coverage` gate. *(35 new unit tests; v8 coverage thresholds gated on the six tested files in `backend/vitest.config.ts` (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)* - [x] **P1:** Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. *(Chose **remove**: dropped `fastify-sse-v2` dep, deleted commented-out plugin import + TODO from `server.ts` and `deployments/routes.ts`, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls `/deployments/:id/logs` via `apiRequest` — no UI change needed. If a real-time stream is wanted later, implement via `reply.raw` and update docs in the same change.)* - [x] **P1:** Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). *(`DEPLOYMENT.md` is now canonical; `DEPLOYMENT_GUIDE.md` reduced to a redirect stub; `deploy.sh` updated. Added an explicit "Ports — quick reference" table to `DEPLOYMENT.md` distinguishing container `:3000`, Compose host `:3049`, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)* -- [ ] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). +- [x] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). *(New "Privilege Surface" section in `dashboard/DEPLOYMENT.md` enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: `/code-quality/check` was reachable unauthenticated despite shelling out to `npm run` in a caller-supplied path — `requireAdmin` added. Allow-list wrapper around `docker`/`bash`/`npm` invocations and `projectPath` validation are queued as the next P1s; running the container as non-root and replacing the raw `docker.sock` with a verb-restricted proxy are P2/P3.)* - [ ] **P2:** Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack. ## Phase 6 — Mission Control UX polish (G6)