Closes the three Phase 5 P2 follow-ups from the DEPLOYMENT.md
mitigation roadmap that don't need infra changes. Two P2 items remain
(non-root container, docker-proxy daemon) — both genuinely need
container/orchestration work and stay queued.
1. Allow-list shell wrapper (P1)
New `lib/shell.ts`:
- `execAllowed(cmd, args, opts)` — `execFile`-only, no shell, no
interpolation. Single escape hatch for ad-hoc invocations.
- `dockerRestart(name)` — name validated against
`[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}`; throws InvalidShellArgError
on anything else (including non-strings, shell metacharacters,
command-substitution attempts). Tests cover all of these.
- `dockerPrune(kind, {all?})` — kind constrained to
{container,image,volume,builder}; `--all` only valid for image.
- `runBashScript(path, args, {allowedRoots})` — script path AND
cwd both checked against allowed roots; rejects `..` escapes
and prefix-matching siblings (`/opt/projects-evil` vs
`/opt/projects`).
- `runNpmScript(script, {cwd, allowedRoots})` — script ∈
{typecheck,lint,build,test,test:run,start}; cwd inside roots.
17 unit tests cover every rejection path. Module added to the
coverage gate (≥95% lines).
Migrated highest-risk callers off template-literal `exec`:
- `vm/repository.ts:restartContainer` → `dockerRestart`. Was
previously `await execAsync(\`docker restart "${name}"\`)`
with only a regex check; now goes through the wrapper.
- `system/repository.ts:dockerCleanup` → `dockerPrune` per kind
+ `execAllowed` for `docker system df`. Drops the array of
template-literal command strings entirely.
- `code-quality/repository.ts` → `runNpmScript` for every
lifecycle invocation. cwd is now the resolved (normalised,
`..`-collapsed) path, not the raw input.
2. projectPath validation for /code-quality/check (P1)
`runCodeQualityCheck` now calls
`assertPathInAllowedRoots(projectPath, getAllowedRoots())` before
any subprocess spawns. `getAllowedRoots()` reads
`CODE_QUALITY_ALLOWED_ROOTS` (colon-separated env, defaults to
`/opt/bytelyst`). Rejection happens with a clear error message
listing the configured roots so operators know what to allow.
3. Audit-log every privileged shell-out (P2)
`audit/types.ts` extended: `action` now includes `'shell-exec'`,
`entityType` includes `'host'`. The migration is additive — old
audit rows still validate.
Three privileged routes now write a `shell-exec` audit row with
actor (authUserId / authRole), entity id, and a sanitized details
payload before responding:
- `POST /docker/cleanup` — `entityId: docker-cleanup:<type>`,
details include {type, force, freedSpace}.
- `POST /vm/cleanup` — `entityId: vm-cleanup:<mode>`.
- `POST /vm/containers/:name/restart` — `entityId:
container-restart:<name>`, details include {success, message}.
Audited even on failure so attempted privileged actions are
still recorded.
Audit writes are best-effort — a Cosmos hiccup logs a warn but
never fails the request the operator was running.
Verified: backend typecheck ✅, 74/74 unit tests ✅ (17 new for
shell.ts + audit changes), 7/7 E2E ✅, lint 0 errors, coverage gate
≥95% lines on every gated file (which now includes shell.ts).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
First half of Phase 5 P2 (the "structured backend logging" piece;
E2E-in-CI lands separately so the diff stays reviewable).
Adds `lib/logger.ts` exporting a singleton pino instance shared between
Fastify (via `loggerInstance`) and any non-request code path. One
configured logger across the backend means uniform formatting,
redaction, and log-level control:
- LOG_LEVEL env knob (defaults: debug in non-prod, info in prod when
NODE_ENV=production). Documented in `.env.example`.
- Built-in redaction for Authorization / Cookie headers and the
common secret-shaped field names (password, token, refreshToken,
accessToken, csrfToken, JWT_SECRET, CSRF_SECRET, ENCRYPTION_KEY,
COSMOS_KEY, AZURE_CLIENT_SECRET) so an accidental
`req.log.info(req.body)` or `logger.error({ err, config }, …)`
won't dump credentials. This is a backstop, not the primary
defense — call sites should still avoid logging raw config/req.
- JSON to stdout in every environment. Pipe through `pino-pretty`
locally if you want pretty output; we deliberately don't bundle
pino-pretty as a runtime dep.
- `childLogger(module)` helper tags log lines with their origin so
repositories/background workers don't have to repeat the module
name on every line.
Sweeps the runtime `console.error` sites that lose request context
(deployment orchestrator background fire-and-forget, system docker
stats/cleanup, backup CRUD, vm getAllContainers) onto the structured
logger. CLI-only modules (`scripts/run-migrations.ts`,
`migrations/index.ts`, `cosmos-init.ts` startup, `azure-keyvault.ts`,
`config.ts` env warnings, `lib/migrations.ts` no-op message) keep
`console.*` for now — they run before Fastify is up and are queued for
a separate cleanup pass.
Tests, typecheck, lint (0 errors), build green. Coverage gate still
passing (≥95% lines on every gated file).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- prometheus.ts: new Prometheus client with 7d/30d range queries for disk,
memory, swap, CPU steal, and disk I/O (GB/hr); getWeeklyDigestData()
aggregates all metrics for digest and API endpoint
- routes.ts: GET /api/vm/metrics/trend?metric=…&range=… and
GET /api/vm/weekly-digest endpoints
- api.ts: TrendPoint/TrendSeries types; getTrend() and getMemoryTrend()
added to vmApi
- vm/page.tsx: Sparkline (pure SVG polyline+fill), TrendCard with
latest/avg/peak and threshold colouring, TrendsPanel with lazy load
on first open; Promise.allSettled() isolation for all 5 data panels
- vm-weekly-digest.sh: weekly Telegram digest via docker exec into
devops-backend to reach Prometheus; emoji severity indicators; cron
summary from /var/log/vm-cleanup.log
- systemd timer: Mon 08:00 UTC, Persistent=true (fires on next boot
if missed); first trigger 2026-06-02
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>