diff --git a/docs/docker-build-optimization-roadmap.md b/docs/docker-build-optimization-roadmap.md
index 43bcea4..3f586a5 100644
--- a/docs/docker-build-optimization-roadmap.md
+++ b/docs/docker-build-optimization-roadmap.md
@@ -1,34 +1,46 @@
# Docker Build Optimization Roadmap
-> **Status:** Draft v2 (post-audit) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
+> **Status:** Draft v3 (post-review) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
-> Pilot Docker-build speed-ups + hermetic-fallback hardening on `learning_ai_peakpulse`
-> and `learning_ai_clock`, then capture the playbook here for ecosystem-wide rollout.
+> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
+> and `learning_ai_peakpulse` (backend), then capture the playbook here for
+> ecosystem-wide rollout.
---
## 0. Pre-flight audit findings (2026-05-27)
-A read-only audit of the two pilot repos surfaced **10 concrete bugs/gaps**
-that contradict the casual narrative that "Gitea-registry is the default and
-`docker-prep.sh` is the fallback." The actual state is closer to the inverse:
+A read-only audit of pilot repos + lessons from recent live incidents surfaced
+**13 concrete bugs/gaps**. The actual state of the ecosystem is closer to the
+inverse of the casual narrative: tarballs are the de facto default, the
+Gitea-registry path is partially wired, and there is a separate class of
+"build green, app broken" silent failures (F11–F13) that the speed-focused
+plan needs to address first.
| # | Finding | Location | Severity |
|---|---|---|---|
-| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | **High** |
-| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*` — `--frozen-lockfile` inside Docker will fail unless workspace is flattened or sibling tree is copied | `peakpulse/pnpm-workspace.yaml`, `clock/pnpm-workspace.yaml` | **High** |
-| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't actually work in this repo today | `peakpulse/.npmrc.docker` | **High** |
-| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside a Docker container `localhost` is the container itself, not the host registry | `clock/.npmrc.docker` | **High** |
-| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — it is wholly dependent on `.docker-deps/` having been pre-populated | `clock/backend/Dockerfile` | High |
-| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it and has no `--mount=type=secret` — passing the arg is a no-op | `clock/web/Dockerfile` | Medium |
-| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block, so `docker compose build` cannot use the Gitea path | `peakpulse/docker-compose.yml` | Medium |
-| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires either `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
+| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | High |
+| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*` — `--frozen-lockfile` inside Docker will fail unless workspace is flattened or sibling tree is copied | both pilots | High |
+| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't work in this repo today | `peakpulse/.npmrc.docker` | High |
+| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside Docker, `localhost` is the container, not the host registry | `clock/.npmrc.docker` | High |
+| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — wholly dependent on pre-populated `.docker-deps/` | `clock/backend/Dockerfile` | High |
+| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it; no `--mount=type=secret` either | `clock/web/Dockerfile` | Medium |
+| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block | `peakpulse/docker-compose.yml` | Medium |
+| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`) — 5–10 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
-| F10 | No BuildKit `--mount=type=cache` for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (the main speed win) |
+| F10 | No BuildKit `--mount=type=cache` for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
+| **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** |
+| **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** |
+| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
-**Implication:** the original plan to "switch to `--frozen-lockfile` + Gitea
-registry" requires two upstream fixes first (F1, F2). The roadmap below
-accounts for that.
+**Implications:**
+
+- The original "switch to `--frozen-lockfile` + Gitea registry" plan requires
+ two upstream fixes first (F1, F2).
+- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we
+ ship faster builds of broken apps.
+- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
+ F11/F13 recurrence — they are silent in CI today.
---
@@ -36,8 +48,8 @@ accounts for that.
| Path | Status today | Trigger | Notes |
|---|---|---|---|
-| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
-| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to actually be default |
+| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock/notes | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
+| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to be the default |
| **Legacy `file:` refs** | Deprecated | — | Removed during pnpm/Gitea migration |
### Measurement targets
@@ -56,11 +68,12 @@ accounts for that.
**Goals**
-- ✅ Cut warm rebuild time via BuildKit pnpm-store cache mount (the single biggest win)
-- ✅ Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean
+- ✅ Eliminate F11–F13 class of silent "build green, app broken" failures
+- ✅ Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
+- ✅ Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
- ✅ Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
- ✅ Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
-- ✅ Document which path to use when, and the trade-offs
+- ✅ Ship `docker-doctor.sh` CI lint as the durable insurance layer
**Non-goals**
@@ -71,19 +84,49 @@ accounts for that.
---
-## 3. Phase A — Build speed + path correctness
+## 2.5 Canonical decisions
-Order matters: A0 must precede A1–A5 (you can't enable a path that doesn't work).
+Decisions taken now to avoid contradictions later in the doc:
-### A0. Make the Gitea-registry path actually work (peakpulse + clock)
+- **Base image:** `node:22-alpine` is canonical. For repos blocked by the
+ corporate proxy's Alpine SSL interception (currently only
+ `learning_ai_notes`), the Dockerfile MUST expose:
+ ```dockerfile
+ ARG BASE_IMAGE=node:22-alpine
+ FROM ${BASE_IMAGE} AS builder
+ ```
+ Override per-repo via `--build-arg BASE_IMAGE=node:22-slim`. Document the
+ override in the repo's `AGENTS.md`.
+- **Healthcheck host:** `127.0.0.1` (NOT `localhost`) in every
+ `docker-compose*.yml` `test:` block. See F12.
+- **Lockfile mode in Docker:** `--lockfile=false` for now. `--frozen-lockfile`
+ is blocked on the A3 ADR (F2).
+
+---
+
+## 3. Phase A — Correctness + build speed + path correctness
+
+Order matters: A0 must precede A1+ (you can't optimize a path that doesn't
+work), and A8+A9 (correctness) must land before measuring speed wins.
+
+### A0. Make the Gitea-registry path actually work (clock + peakpulse)
- [ ] **A0-1.** Standardize `.npmrc.docker` to use a templated host so it works on host (`localhost`) and inside Docker (`host.docker.internal`):
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
+ auto-install-peers=true
```
-- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1)
+ > **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read
+ > time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]).
+ > That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST`
+ > **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be
+ > exported from the BuildKit secret mount inside the same `RUN` (since secrets
+ > don't persist as env across layers).
+
+ [pnpm-npmrc]: https://pnpm.io/npmrc
+- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3)
- [ ] **A0-3.** Add `GITEA_NPM_HOST` build arg + `secrets:` block to every service in `docker-compose.yml`:
```yaml
build:
@@ -98,27 +141,28 @@ Order matters: A0 must precede A1–A5 (you can't enable a path that doesn't wor
environment: GITEA_NPM_TOKEN
```
- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` to each service so Linux Docker can resolve the host
-- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build`
+- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart)
+- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
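
A0-5 and the env-var expansion warning both depend on `GITEA_NPM_HOST` and `GITEA_NPM_TOKEN` being present in the shell before the build starts. A minimal pre-flight sketch, assuming a wrapper script is acceptable, is shown here; the function name `require_env` and its message text are hypothetical, not something the pilot repos already ship.

```shell
# Hypothetical pre-flight helper (name and messages are assumptions).
# Returns 1 if any named variable is unset or empty, listing each one on stderr.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup in POSIX sh: read the value of the variable called $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "ERROR: $name is not set; export it before running docker compose build" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use (commented out so sourcing this file has no side effects):
# require_env GITEA_NPM_HOST GITEA_NPM_TOKEN && docker compose build
```

Failing before the build starts makes an empty token visible as a clear error rather than as a registry authentication failure partway through `pnpm install`.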
### A1. Replace `npm install -g pnpm@X` with corepack
-- [ ] **A1-1.** Replace lines `RUN npm install -g pnpm@10.6.5` with:
+- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
```dockerfile
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
```
-- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` matches (already `pnpm@10.6.5` in peakpulse)
+- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` and `web/package.json` matches (already `pnpm@10.6.5` in peakpulse backend)
### A2. Add BuildKit pnpm-store cache mount
- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
-- [ ] **A2-2.** Wrap install step with cache mount:
+- [ ] **A2-2.** Wrap install step with cache + secret mount:
```dockerfile
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
- pnpm install --ignore-scripts
+ pnpm install --ignore-scripts --lockfile=false
```
-- [ ] **A2-3.** Verify cache hit on second build via `docker buildx du` or `docker history`
+- [ ] **A2-3.** Verify cache mount is active: `docker buildx du --filter type=exec.cachemount` shows non-zero size after a build. **Real success metric** is wall-clock: warm rebuild (touching one source file) drops to < 30 s.
### A3. Decide lockfile policy (BLOCKED on F2 resolution)
@@ -139,20 +183,18 @@ Two options — pick one in a short ADR before implementing:
### A4. Restructure layer order
-- [ ] **A4-1.** Reorder COPY/RUN so deps install layer is `package.json` + `.npmrc` ONLY, then a separate layer for `src/`, `tsconfig.json`, `shared/`
-- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (clock web) closer to the build step
+- [ ] **A4-1.** Reorder COPY/RUN so deps-install layer is `package.json` + `.npmrc.docker` ONLY, then a separate layer for `src/`, config files, `shared/`
+- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)
### A5. Gate `.docker-deps/` behind a build arg
- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
-- [ ] **A5-2.** Conditionally copy:
+- [ ] **A5-2.** Use wildcard COPY so missing dir doesn't break the build:
```dockerfile
- # Always-empty placeholder so COPY doesn't fail in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
```
- (The wildcard tolerates a missing `.docker-deps/` dir; works without enabling BuildKit COPY's `--from` tricks.)
-- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` is NOT excluding it when tarball mode is in use
+- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` does NOT exclude it when tarball mode is in use
### A6. `.dockerignore` audit
@@ -164,37 +206,93 @@ Two options — pick one in a short ADR before implementing:
| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
|---|---|---|---|---|---|---|
-| peakpulse | backend | — | — | — | — | |
-| clock | backend | — | — | — | — | |
| clock | web | — | — | — | — | |
+| clock | backend | — | — | — | — | |
+| peakpulse | backend | — | — | — | — | |
Use:
```
-time DOCKER_BUILDKIT=1 docker compose build --no-cache backend # cold
+time DOCKER_BUILDKIT=1 docker compose build --no-cache backend # cold
touch backend/src/server.ts && time docker compose build backend # warm
```
+### A8. Config-file COPY audit & canonical pattern (addresses F11, F13)
+
+- [ ] **A8-1.** For every Dockerfile in scope, list all build-time files present in the surface directory (`web/` or `backend/`) that affect the build:
+ - `postcss.config.{js,mjs,cjs,ts}`
+ - `tailwind.config.{js,mjs,cjs,ts}`
+ - `next.config.{js,mjs,ts}`
+ - `tsconfig*.json`
+ - `package.json`
+ - `.npmrc.docker`, `.npmrc`
+ - `babel.config.*` (if present)
+ - `drizzle.config.*` (if present)
+ - `vitest.config.*` (only if the build needs it)
+ Verify each is COPY'd in the Dockerfile.
+- [ ] **A8-2.** Choose canonical COPY pattern. **Decision: middle-ground glob** for web surfaces:
+ ```dockerfile
+ COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
+ COPY web/public/ ./public/
+ COPY web/src/ ./src/
+ ```
+ Trade-off: glob picks up unintended root-level files if any are added later, but **dramatically reduces F11/F13 risk**. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
+- [ ] **A8-3.** Repo-by-repo migration: replace enumerated `COPY web/foo ./foo` with the glob pattern; verify the resulting image has all expected files via `docker run --rm <image> ls -la`.
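
The A8-1 audit can be partly automated. The sketch below is one possible shape for such a check, in the spirit of the planned `docker-doctor.sh`; the function name `find_uncopied_configs` is hypothetical, it covers only a subset of the file patterns listed in A8-1, and it recognizes only COPY lines that name the file literally (a glob COPY that covers the file would still be reported).

```shell
# Hypothetical check; names are assumptions, not the real docker-doctor.sh.
# Prints each build-time config file in <surface_dir> that no COPY line in
# <dockerfile> mentions by name. Returns 1 if any were printed.
find_uncopied_configs() {
  surface_dir=$1
  dockerfile=$2
  status=0
  for path in "$surface_dir"/postcss.config.* "$surface_dir"/tailwind.config.* \
              "$surface_dir"/next.config.* "$surface_dir"/tsconfig*.json; do
    [ -e "$path" ] || continue   # an unmatched glob stays literal; skip it
    file=${path##*/}
    # -F treats the filename as a fixed string, so its dots are not wildcards.
    if ! grep -E '^[[:space:]]*COPY[[:space:]]' "$dockerfile" | grep -qF -- "$file"; then
      echo "$file"
      status=1
    fi
  done
  return "$status"
}

# Example use (paths are illustrative):
# find_uncopied_configs web web/Dockerfile || echo "config files above are not COPY'd"
```

Run against the pre-migration Dockerfiles, a check like this would flag the F11(b) case, where the config file exists in the repo but never reaches the image.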
+
+### A9. Healthcheck canonicalization (addresses F12)
+
+- [ ] **A9-1.** Replace `localhost` with `127.0.0.1` in every `docker-compose*.yml` healthcheck `test:` block. Sweep with:
+ ```
+ rg -l 'http://localhost' --glob 'docker-compose*.yml'
+ ```
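+- [ ] **A9-2.** Standardize healthcheck shape. Compose substitutes `${PORT}` from the host shell or `.env` file when it parses the YAML, so `PORT` must be set there (or written as `$$PORT` to defer to the container's own environment):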
+ - **Alpine-based images:**
+ ```yaml
+ healthcheck:
+ test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
+ interval: 30s
+ timeout: 5s
+ retries: 3
+ start_period: 10s
+ ```
+ - **Slim/Debian images** (`wget` not always present, but `node` is):
+ ```yaml
+ healthcheck:
+ test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
+ ```
+- [ ] **A9-3.** Add `start_period` (10s minimum) — prevents flaky "container started but app not yet listening" false-negatives.
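
The `rg -l` sweep in A9-1 lists whole files, and those files may also contain `http://localhost` in places where it is intentional, such as host-facing URLs in `environment:` blocks. A narrower sketch, assuming healthchecks use the single-line `test:` form shown above, prints only the offending lines. The function name `find_localhost_healthchecks` is hypothetical, and plain `find` plus `grep` are used so it runs without `rg`.

```shell
# Hypothetical sweep helper; the function name is an assumption.
# Prints "file:line:text" for every single-line healthcheck `test:` entry that
# still targets localhost. Multi-line YAML list forms are not handled.
# Check the output rather than the exit status: the status is non-zero when
# grep finds nothing.
find_localhost_healthchecks() {
  dir=${1:-.}
  # /dev/null is passed as an extra file so grep always prefixes the filename.
  find "$dir" -type f -name 'docker-compose*.yml' -exec \
    grep -n -E 'test:.*https?://localhost[:/]' /dev/null {} +
}

# Example use:
# find_localhost_healthchecks .
```

Each reported line is a candidate for the `127.0.0.1` replacement; lines outside `test:` are left for manual review.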
+
---
## 4. Phase B — Hermetic-fallback polish (`docker-prep.sh`)
-The script is **duplicated with minor variations** across product repos. Pilot
-in peakpulse + clock, then propose a canonical home.
+`docker-prep.sh` is duplicated with minor variations across product repos.
+**Promotion to a canonical home is now in Phase B, not Phase D**: the longer
+the copies stay separate, the more drift there is to reconcile, and the
+`.npmrc` template precedent shows the pattern is cheap.
- [ ] **B1.** Add `--dry-run` flag — list packs/rewrites, no side effects
- [ ] **B2.** Idempotency guard — refuse to run if any `*.bak` exists unless `--force`
- [ ] **B3.** Ensure `.docker-deps/` and `*.bak` are in `.gitignore` of every pilot repo
-- [ ] **B4.** Pre-commit hook (husky) — block commits containing `"file:../.docker-deps/"` inside any `package.json`. Add to `.husky/pre-commit`:
+- [ ] **B4.** Pre-commit hook (husky) — block commits containing rewritten `package.json`, staged tarballs, OR `.bak` files:
```bash
+ # .husky/pre-commit
if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
exit 1
fi
+ if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|\.bak)$'; then
+ echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
+ exit 1
+ fi
```
- [ ] **B5.** Auto-restore on script error via `trap restore_on_error EXIT` (unless `--keep` passed)
-- [ ] **B6.** Update script header comment with explicit "use only when Gitea unreachable OR you need uncommitted common-plat changes"
-- [ ] **B7.** Propose canonical home: `learning_ai_common_plat/scripts/docker-prep.template.sh` + `sync-docker-prep.sh` (mirrors `.npmrc` template pattern). Defer execution to Phase D.
-- [ ] **B8.** Add a `--strip-overrides` option that removes `pnpm.overrides` block after build, in case `--restore` is forgotten (additional safety net)
+- [ ] **B6.** Update script header comment per § 7.4 template
+- [ ] **B7. CANONICAL HOME (was deferred — now in Phase B proper).**
+ - [ ] **B7-1.** Move script to `learning_ai_common_plat/scripts/docker-prep.template.sh`
+ - [ ] **B7-2.** Add `learning_ai_common_plat/scripts/sync-docker-prep.sh` to copy template into all product repos (mirrors `sync-npmrc.sh`)
+ - [ ] **B7-3.** Add `learning_ai_common_plat/scripts/check-docker-prep-drift.sh` for CI (mirrors `check-npmrc-drift.sh`)
+ - [ ] **B7-4.** Update every repo's `AGENTS.md` with the "NEVER edit `docker-prep.sh` directly" warning + template link
+- [ ] **B8.** Add `--strip-overrides` option that removes `pnpm.overrides` block after build — safety net in case `--restore` is forgotten
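
One detail of B5 is easy to get wrong: an `EXIT` trap fires on every exit, including successful ones, so the handler has to inspect `$?` itself if it should only restore on failure. A minimal sketch of that shape follows. `restore_backups`, `PREP_ROOT`, and the `KEEP` flag are hypothetical stand-ins for whatever `docker-prep.sh` actually does for `--restore` and `--keep`.

```shell
# Hypothetical skeleton for B5; helper and variable names are assumptions.
restore_backups() {
  # Move every package.json.bak under the given root back over its package.json.
  find "${1:-.}" -name 'package.json.bak' | while IFS= read -r bak; do
    mv -- "$bak" "${bak%.bak}"
  done
}

restore_on_error() {
  status=$?   # exit status of the script at the moment the trap fires
  if [ "$status" -ne 0 ] && [ "${KEEP:-0}" != "1" ]; then
    echo "docker-prep: failed with status $status, restoring package.json" >&2
    restore_backups "${PREP_ROOT:-.}"
  fi
  exit "$status"   # preserve the original exit status for the caller
}

# In the script body, after the helpers are defined:
# trap restore_on_error EXIT
```

Re-exiting with the saved status keeps CI and callers able to tell that the run failed, even though the working tree has been cleaned up.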
---
@@ -205,30 +303,32 @@ Pilot exit criteria (must all pass before Phase D):
- [ ] **C1.** Cold Docker build succeeds on both pilots via Gitea-registry path (no `docker-prep.sh` invocation)
- [ ] **C2.** Warm rebuild (single source file touched) < 30 s on both pilots
- [ ] **C3.** `docker-prep.sh` → `docker compose build` → `--restore` leaves `git status` clean
-- [ ] **C4.** Pre-commit hook blocks a deliberately-staged rewritten `package.json`
+- [ ] **C4.** Pre-commit hook blocks: (a) rewritten `package.json`, (b) staged `.tgz`, (c) staged `.bak`
- [ ] **C5.** Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
- [ ] **C6.** Build-time metrics filled into the table in § 3.A7
-- [ ] **C7.** Decision recorded in ADR for A3 (lockfile policy)
+- [ ] **C7.** ADR recorded for A3 (lockfile policy)
+- [ ] **C8.** `docker-doctor.sh` (Phase E) runs clean against both pilots
+- [ ] **C9.** Smoke test: render the web app, inspect `