docs(dashboard): Phase 5 P1 — document privilege surface; gate /code-quality/check

Closes the final Phase 5 P1 checkbox and REVIEW_ACTIONS #6.

The backend container has root-equivalent host access via the docker
socket, host log mounts, and the VM scripts mount, but until now the
"who can do what to the host?" answer was scattered across compose
files and route handlers. This commit centralizes it.

DEPLOYMENT.md gains a "Privilege Surface" section that lists:

  - every host mount + container path + mode + purpose
  - every shell-outing route, the actual commands it runs, and the
    auth gate on each
  - what an admin token can do today (≈ host shell)
  - five known sharp edges (un-allow-listed container names, unvalidated
    projectPath, no per-route audit-log on shell-outs, container runs
    as root, global rate-limit only)
  - a P1 → P3 mitigation roadmap (allow-list wrapper around shell-outs,
    projectPath validation, audit-logging shell-outs, drop root in
    container, replace docker.sock with a verb-restricted proxy)

Concurrent code fix: `POST /code-quality/check` was reachable
**unauthenticated** despite shelling out to `npm run typecheck/lint/
build/test:run` in a caller-supplied `projectPath`. Added
`preHandler: requireAdmin` to bring it in line with every other
shell-outing route in the dashboard. Same commit because the
documentation table promises this gate exists.

REVIEW_ACTIONS #6 marked RESOLVED with the rationale; roadmap checkbox
ticked. Tests, typecheck, lint (0 errors), build, and coverage gate
(≥95% lines on every gated file) all stay green.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This commit is contained in:
Hermes VM 2026-05-30 07:05:51 +00:00
parent 824f31586a
commit c6ec1a06ea
4 changed files with 112 additions and 7 deletions

View File

@ -342,6 +342,107 @@ docker logs -f gateway
| Traefik | Yes | Yes | Yes |
| Auth | Platform | Platform | Platform |
## Privilege Surface (Docker socket + host mounts)
The `devops-backend` container has root-equivalent access to the host. This
section documents exactly what is mounted, which routes use each mount, and
what the blast radius looks like if an admin token leaks. It exists so reviewers
don't have to reverse-engineer this from `docker-compose.yml` and the route
handlers — and so any future change to the mount set is reviewed against this
list rather than slipped in.
### Mounts (from `docker-compose.yml`)
| Host path | Container path | Mode | Purpose |
|------------------------------------|-----------------------------------|------|-------------------------------------------------------------------------|
| `/var/run/docker.sock` | `/var/run/docker.sock` | rw | Allows `docker` CLI inside the container to control the host daemon. Used by the `system` and `vm` modules. **Equivalent to root on the host.** |
| `/opt/bytelyst/learning_ai_devops_tools/scripts` | `/vm-scripts` | ro | Bash scripts the `vm` module shells out to (`HostingerVM/*.sh`). Read-only mount; the container cannot modify the script set. |
| `/var/log/vm-cleanup.log` | `/host-logs/vm-cleanup.log` | rw | The `vm` cleanup script appends here; backend reads it via `/api/vm/cleanup-log`. |
| `/var/log/vm-health-check.log` | `/host-logs/vm-health-check.log` | rw | Health-check probe output; backend reads it via `/api/vm/health`. |
| `/var/log/docker-watchdog.log` | `/host-logs/docker-watchdog.log` | rw | Watchdog tail used by the VM panel. |
| `extra_hosts: host-gateway` | `host.docker.internal`-equivalent | — | Lets the container reach `host:11434` (Ollama) and other host-only services. Not a filesystem mount, but a privilege-relevant capability — the container can talk to anything bound to `127.0.0.1` on the host. |
The container's listening port (`4004`) is bound to `127.0.0.1` only, so the
API is **not** exposed to the public internet by this compose file — access is
expected via Tailscale or an SSH tunnel. Any reverse proxy in front of it
(Traefik in production) is responsible for its own auth + TLS.
### What shells out + which routes (auth column = effective gate)
| Route | Handler module | What it executes | Auth |
|--------------------------------------------------|-------------------------------|-----------------------------------------------------------------------------------|-------------|
| `GET /system/metrics` | `system/repository.ts` | `df -h ...` | `requireAdmin` |
| `GET /docker/stats` | `system/repository.ts` | `docker images / ps / volume ls / system df` (read-only) | `requireAdmin` |
| `POST /docker/cleanup` | `system/repository.ts` | `docker container prune -f`, `docker image prune -a -f`, `docker volume prune -f`, `docker builder prune -f` (a fixed allow-list — request body picks one of the four "types") | `requireAdmin` |
| `GET /vm/health` | `vm/repository.ts` | `bash $VM_SCRIPTS_PATH/vm-health-check.sh --json` | `requireAdmin` |
| `GET /vm/cleanup-log` | `vm/repository.ts` | reads `/host-logs/vm-cleanup.log` | `requireAdmin` |
| `GET /vm/cron-status` | `vm/repository.ts` | `crontab -l` | `requireAdmin` |
| `POST /vm/cleanup` | `vm/repository.ts` | `bash $VM_SCRIPTS_PATH/vm-cleanup.sh` | `requireAdmin` |
| `GET /vm/containers`, `.../unhealthy`, `.../:name/logs` | `vm/repository.ts` | `docker ps`, `docker inspect`, `docker stats`, `docker logs` | `requireAdmin` |
| `POST /vm/containers/:name/restart` | `vm/repository.ts` | `docker restart "<name>"` (name is a path param — see "Known sharp edges" below) | `requireAdmin` |
| `GET /vm/ollama/models`, `DELETE /vm/ollama/models/:name` | `vm/repository.ts` | HTTP-only (talks to host Ollama via `host-gateway`). No shell-out. | `requireAdmin` |
| `POST /code-quality/check` | `code-quality/repository.ts` | `npm run typecheck`, `npm run lint`, `npm run build`, `npm run test:run` in the request-supplied `projectPath`. | `requireAdmin` *(added concurrently with this doc; previously unauthenticated — see the Phase 5 P1 commit)* |
| `POST /deployments/trigger/:serviceId` | `deployments/orchestrator.ts` | `bash <service.scriptPath>` from the registered service registry (paths are stored at create-time, not request-time). | `requireAdmin` |
| `/hermes/ops` (snapshot) | `hermes-ops/repository.ts` | Read-only probes: `systemctl is-active/is-enabled`, `git status`, `du -sh`, `ps`, `tailscale ip`, `runuser -u uma -- systemctl --user ...`. No state-changing commands. | none (read-only ops snapshot) |
### Blast radius if an admin token is leaked
Anyone holding a valid admin JWT for this product can, today:
- Run any of the four pre-defined `docker prune` commands (data loss for
containers/images/volumes), restart any container, read any container's logs.
- Trigger the host VM cleanup script and crontab listing.
- Trigger any deployment script registered in the service registry.
- Run `npm run` lifecycle scripts in any directory the container can read
(since `code-quality/check` takes a caller-supplied `projectPath`).
- Read the three host logs that are mounted in.
In other words, an admin token is **equivalent to a host shell**, modulo the
specific commands the codebase chooses to wrap. There is currently **no
allow-list wrapper** between the backend and the docker socket; the backend
constructs `docker ...` shell strings directly with `execAsync`.
### Known sharp edges (track and shrink)
1. **Container name is interpolated into a shell string.** `docker restart
"${name}"` and similar paths in `vm/repository.ts` use `execAsync` with a
template literal. The `:name` path parameter is admin-only but is not
validated against a `^[a-zA-Z0-9._-]+$` allow-list. Lock this down before
exposing the dashboard to a wider admin pool.
2. **`projectPath` for `/code-quality/check` is unvalidated.** The handler
passes the caller-supplied path straight into `execAsync({ cwd })`. Even
with `requireAdmin` added, this should be constrained to a known set of
project roots (or rejected if it escapes the workspace).
3. **No per-route audit-log on shell-outs.** `audit/repository.ts` records
deployment triggers but not `/docker/cleanup` or `/vm/cleanup`. A leaked
token's actions are reconstructable only from container stdout + host logs.
4. **The container runs as root.** Both the backend `Dockerfile` and the bind-
mounts assume root. A non-root user with `docker` group membership would
shrink the in-container blast radius without losing functionality (the
socket is still root on the host); revisit when ready.
5. **`fastify-rate-limit` is global, not per-route.** A leaked admin token
currently isn't slowed down on the destructive endpoints any more than it
is on read-only ones.
### Mitigation roadmap (incremental, not all at once)
- [ ] **P1 (next):** Allow-list wrapper around shell-outs. Centralize
`docker`/`bash`/`npm` invocations behind a small `lib/shell.ts` that
validates the arg vector against a per-command schema (e.g.
`docker.restart(name)` rejects names not matching `^[a-zA-Z0-9._-]{1,128}$`).
- [ ] **P1:** Validate `/code-quality/check`'s `projectPath` against a
configured set of allowed roots.
- [ ] **P2:** Audit-log every shell-out (command + arg vector + actor + result).
- [ ] **P2:** Run the backend container as a non-root user with `docker` group
membership; rebuild the Dockerfile accordingly.
- [ ] **P3:** Move from `docker.sock` to a thin daemon (`docker-proxy`-style)
that exposes only the verbs the dashboard actually needs (`stats`,
`restart`, `logs`, the four `prune` variants).
Operators reviewing whether to grant a new admin should read this whole section
before doing so. Adding a new shell-out path in code is a **privilege change**
and must update this table in the same commit.
## Production Checklist
- [ ] Platform stack running with Traefik.

View File

@ -45,10 +45,8 @@ The TODO has been resolved by **removing the SSE claim**, not by shipping it: th
- `README.md` "Web port: 3000" rewritten to call out container vs Compose host vs dev-mode explicitly and link to the port table.
- `ENDPOINTS.md` got a top-of-file note: every `localhost:3000` URL in the file refers to `pnpm dev`; substitute `:3049` for the Dockerized stack. The `https://api.bytelyst.com/api/devops` vs `/devops` ambiguity was already resolved by the existing "URL Note" section (kept).
### 6. Docker socket + host log mounts are very privileged
`docker-compose.yml` mounts `/var/run/docker.sock`, the host `scripts` directory, and three host log paths into `devops-backend`. This is the same risk profile as Portainer but with custom code reading/writing those mounts. There is no documentation of which backend module talks to the docker socket or what commands it issues.
- Action: document the privilege surface (which routes shell out, which call `docker`), and consider a thin allow-list wrapper instead of mounting the raw socket. At minimum, add a section to `DEPLOYMENT.md` enumerating these mounts and their purpose so reviewers know the blast radius.
### 6. Docker socket + host log mounts are very privileged — RESOLVED (documented; allow-list wrapper queued)
`DEPLOYMENT.md` now has a **Privilege Surface** section enumerating every host mount, every shell-outing route + the exact commands it runs, the auth gate on each, and an explicit blast-radius statement ("an admin token is equivalent to a host shell, modulo what the codebase wraps"). Concurrent fix: the `/code-quality/check` endpoint was missing `requireAdmin` and was therefore reachable unauthenticated even though it shells out to `npm run typecheck/lint/build/test:run` in a caller-supplied `projectPath` — that's been gated to admin in the same commit. Two follow-up P1s remain in the doc's mitigation roadmap (allow-list wrapper around shell-outs; validate `code-quality/check` `projectPath` against an allowed root set); P2/P3 cover audit-logging shell-outs, dropping root in the container, and moving off the raw docker socket.
---

View File

@ -1,10 +1,16 @@
import { FastifyInstance } from 'fastify';
import { runCodeQualityCheck } from './repository.js';
import { CodeQualityCheckParamsSchema } from './types.js';
import { requireAdmin } from '../../lib/auth.js';
export async function codeQualityRoutes(fastify: FastifyInstance) {
// Run code quality check
fastify.post('/code-quality/check', async (request, reply) => {
// Run code quality check.
// Admin-only: this route shells out (`npm run typecheck/lint/build/test:run`)
// in a caller-supplied `projectPath` and is therefore privileged. See the
// "Privilege Surface" section in `dashboard/DEPLOYMENT.md`.
fastify.post('/code-quality/check', {
preHandler: async (req) => requireAdmin(req),
}, async (request, reply) => {
try {
const params = CodeQualityCheckParamsSchema.parse(request.body);
const report = await runCodeQualityCheck(params);

View File

@ -125,7 +125,7 @@ This is the biggest operational asymmetry and the reason half the ops-panel warn
- [x] **P1:** Add tests for `auth`, `csrf`, `deployments/orchestrator`, `health`, **and `hermes-ops`**; add `pnpm test:coverage` gate. *(35 new unit tests; v8 coverage thresholds gated on the six tested files in `backend/vitest.config.ts` (≥85% lines/funcs/stmts, ≥65% branches), wired into Gitea CI as a dedicated step. Today's actuals: ≥95% lines on every gated file. Ratchet up as more modules get tested.)*
- [x] **P1:** Resolve the SSE TODO — either ship a Fastify-5-compatible log-stream or remove the SSE claim from docs/UI. *(Chose **remove**: dropped `fastify-sse-v2` dep, deleted commented-out plugin import + TODO from `server.ts` and `deployments/routes.ts`, rewrote the README/DEPLOYMENT.md "Log Streaming" section as "Logs (JSON-polled, no SSE)". Web client already polls `/deployments/:id/logs` via `apiRequest` — no UI change needed. If a real-time stream is wanted later, implement via `reply.raw` and update docs in the same change.)*
- [x] **P1:** Fix doc drift (web port 3000 vs 3049; endpoint URLs; merge duplicate deployment docs). *(`DEPLOYMENT.md` is now canonical; `DEPLOYMENT_GUIDE.md` reduced to a redirect stub; `deploy.sh` updated. Added an explicit "Ports — quick reference" table to `DEPLOYMENT.md` distinguishing container `:3000`, Compose host `:3049`, Traefik production. README and ENDPOINTS.md cross-link to it. Marks REVIEW_ACTIONS #5 resolved.)*
- [ ] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket).
- [x] **P1:** Document the docker-socket + host-log/script mount privilege surface (the backend reads cross-user/host paths — blast radius must be written down; consider an allow-list wrapper over the raw socket). *(New "Privilege Surface" section in `dashboard/DEPLOYMENT.md` enumerating every mount, every shell-outing route + commands + auth gate, the blast-radius if an admin token leaks, five known sharp edges, and a P1→P3 mitigation roadmap. Concurrent fix: `/code-quality/check` was reachable unauthenticated despite shelling out to `npm run` in a caller-supplied path — `requireAdmin` added. Allow-list wrapper around `docker`/`bash`/`npm` invocations and `projectPath` validation are queued as the next P1s; running the container as non-root and replacing the raw `docker.sock` with a verb-restricted proxy are P2/P3.)*
- [ ] **P2:** Structured backend logging (pino → stdout); wire E2E (`hermes.spec.ts`) into CI with a started stack.
## Phase 6 — Mission Control UX polish (G6)