# Docker Build Optimization Roadmap

> **Status:** Draft v4 (Gitea hardening integrated) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
> ecosystem-wide rollout.
>
> **Upstream prerequisite shipped (commit `610a59fd` in `learning_ai_common_plat`):**
> Gitea owner parameterization + helper scripts (`scripts/gitea/doctor.sh`,
> `scripts/gitea/token.sh`). The `.npmrc` template now resolves owner from
> `${GITEA_NPM_OWNER:-learning_ai_user}`. **All A0-1 work in this roadmap
> inherits this — Dockerfile/.npmrc.docker must use the same `${GITEA_NPM_OWNER}`
> placeholder, not a hardcoded literal.**

---

## 0. Pre-flight audit findings (2026-05-27)

A read-only audit of pilot repos + lessons from recent live incidents surfaced **15 concrete bugs/gaps** (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the inverse of the casual narrative: tarballs are the de facto default, the Gitea-registry path is partially wired, and there is a separate class of "build green, app broken" silent failures (F11–F13) that the speed-focused plan needs to address first.

| # | Finding | Location | Severity |
|---|---|---|---|
| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | High |
| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*` — `--frozen-lockfile` inside Docker will fail unless workspace is flattened or sibling tree is copied | both pilots | High |
| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't work in this repo today | `peakpulse/.npmrc.docker` | High |
| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside Docker, `localhost` is the container, not the host registry | `clock/.npmrc.docker` | High |
| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — wholly dependent on pre-populated `.docker-deps/` | `clock/backend/Dockerfile` | High |
| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it; no `--mount=type=secret` either | `clock/web/Dockerfile` | Medium |
| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block | `peakpulse/docker-compose.yml` | Medium |
| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`) — 5–10 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
| F10 | No BuildKit `--mount=type=cache` for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
| **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** |
| **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** |
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |

**Implications:**

- The original "switch to `--frozen-lockfile` + Gitea registry" plan requires two upstream fixes first (F1, F2).
- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we ship faster builds of broken apps.
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against F11/F13 recurrence — they are silent in CI today.

---

## 1. Context: three build paths

| Path | Status today | Trigger | Notes |
|---|---|---|---|
| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock/notes | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to be the default |
| **Legacy `file:` refs** | Deprecated | — | Removed during pnpm/Gitea migration |

### Measurement targets

| Build | Baseline (observed) | Target after Phase A |
|---|---|---|
| Cold (no cache) | ~2–3 min | ≤ 2 min |
| Warm (one source file changed) | ~2–3 min | **< 30 s** |
| `docker-prep.sh` pack step alone | ~60–90 s | < 30 s (pnpm pack cache) |

> Fill in actuals during Phase C.

---

## 2. Goals & non-goals

**Goals**

- ✅ Eliminate F11–F13 class of silent "build green, app broken" failures
- ✅ Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
- ✅ Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
- ✅ Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
- ✅ Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
- ✅ Ship `docker-doctor.sh` CI lint as the durable insurance layer

**Non-goals**

- ❌ Migrating off pnpm or off the Gitea registry
- ❌ Adopting `--frozen-lockfile` until F2 is resolved (sibling-workspace problem)
- ❌ Publishing `@bytelyst/*` to the public npm registry
- ❌ Multi-platform builds (separate roadmap)

---

## 2.5 Canonical decisions

Decisions taken now to avoid contradictions later in the doc:

- **Base image:** `node:22-alpine` is canonical. For repos blocked by the corporate proxy's Alpine SSL interception (currently only `learning_ai_notes`), the Dockerfile MUST expose:
  ```dockerfile
  ARG BASE_IMAGE=node:22-alpine
  FROM ${BASE_IMAGE} AS builder
  ```
  Override per-repo via `--build-arg BASE_IMAGE=node:22-slim`. Document the override in the repo's `AGENTS.md`.
- **Healthcheck host:** `127.0.0.1` (NOT `localhost`) in every `docker-compose*.yml` `test:` block. See F12.
- **Lockfile mode in Docker:** `--lockfile=false` for now. `--frozen-lockfile` is blocked on the A3 ADR (F2).

---

## 3. Phase A — Correctness + build speed + path correctness

Order matters: A0 must precede A1+ (you can't optimize a path that doesn't work), and A8+A9 (correctness) must land before measuring speed wins.

### A0. Make the Gitea-registry path actually work (clock + peakpulse)

- [ ] **A0-1.** Standardize `.npmrc.docker` to use templated host AND owner so it works on host (`localhost`) and inside Docker (`host.docker.internal`), and so future owner renames are a one-line env change:
  ```
  @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
  //${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
  strict-ssl=false
  auto-install-peers=true
  ```
  > **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read
  > time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]).
  > That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` + `ARG GITEA_NPM_OWNER`
  > → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` / `ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER`
  > **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be
  > exported from the BuildKit secret mount inside the same `RUN` (since secrets
  > don't persist as env across layers).
  >
  > **Note on F14:** The canonical `.npmrc` (host-side) template already uses
  > `${GITEA_NPM_OWNER}` (shipped in common-plat commit `610a59fd`).
  > `.npmrc.docker` lagged behind because Docker builds have a separate file —
  > A0-1 brings them into parity.

  [pnpm-npmrc]: https://pnpm.io/npmrc
- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3)
- [ ] **A0-3.** Add `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` build args + `secrets:` block to every service in `docker-compose.yml`:
  ```yaml
  build:
    context: .
    dockerfile: backend/Dockerfile
    args:
      GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
      GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
    secrets:
      - gitea_npm_token

  secrets:
    gitea_npm_token:
      environment: GITEA_NPM_TOKEN
  ```
- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` under each service's `build:` key so Linux Docker can resolve the host during the build (service-level `extra_hosts` only affects the running container, not `pnpm install` at build time)
- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart). Reference `bash ../learning_ai_common_plat/scripts/gitea/token.sh status` for verification.
- [ ] **A0-D.** **Run `gitea-doctor` before any Docker build** (addresses F15). Inline into deploy/CI workflows:
  ```bash
  bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
  docker compose build
  ```
  - Locally: shell alias or `Makefile` target `make build` that runs doctor then `docker compose build`.
  - In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.

### A1. Replace `npm install -g pnpm@X` with corepack

- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
  ```dockerfile
  RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
  ```
- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` and `web/package.json` matches (already `pnpm@10.6.5` in peakpulse backend)

### A2. Add BuildKit pnpm-store cache mount

- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
- [ ] **A2-2.** Wrap install step with cache + secret mount:
  ```dockerfile
  RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
      --mount=type=secret,id=gitea_npm_token \
      export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
      pnpm install --ignore-scripts --lockfile=false
  ```
- [ ] **A2-3.** Verify cache mount is active: `docker buildx du --filter type=exec.cachemount` shows non-zero size after a build. **Real success metric** is wall-clock: warm rebuild (touching one source file) drops to < 30 s.

### A3. Decide lockfile policy (BLOCKED on F2 resolution)

Two options — pick one in a short ADR before implementing:

- **Option 1: Keep `--lockfile=false`** (current pragmatic approach)
  - ✅ No sibling-workspace complications
  - ❌ No reproducibility guarantee inside Docker
  - ❌ Slower installs (full resolution every build)
- **Option 2: Generate a Docker-only lockfile** via `pnpm install --lockfile-only` against a flattened `package.json` that resolves `@bytelyst/*` to semver
  - ✅ Reproducibility
  - ✅ Faster installs
  - ❌ New build step + tooling
  - ❌ Drift risk between dev lockfile and Docker lockfile

- [ ] **A3-1.** Write 1-page ADR (`docs/decisions/0001-docker-lockfile-policy.md`) and pick Option 1 or 2
- [ ] **A3-2.** Defer `--frozen-lockfile` adoption until ADR lands

### A4. Restructure layer order

- [ ] **A4-1.** Reorder COPY/RUN so deps-install layer is `package.json` + `.npmrc.docker` ONLY, then a separate layer for `src/`, config files, `shared/`
- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)

### A5. Gate `.docker-deps/` behind a build arg

- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
- [ ] **A5-2.** Use wildcard COPY so missing dir doesn't break the build:
  ```dockerfile
  RUN mkdir -p /app/.docker-deps
  COPY .docker-deps* /app/.docker-deps/
  ```
- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` does NOT exclude it when tarball mode is in use

### A6. `.dockerignore` audit

- [ ] **A6-1.** Confirm exclusions: `node_modules`, `**/node_modules`, `dist`, `.next`, `*.log`, `.env`, `.env.*`, `.git`, `*.bak`
- [ ] **A6-2.** Remove: `pnpm-lock.yaml` exclusion (was correct under `--lockfile=false`, blocks future optimization)
- [ ] **A6-3.** Confirm `.docker-deps/` is NOT excluded when tarball path is active

### A7. Measure & record

| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
|---|---|---|---|---|---|---|
| clock | web | — | — | — | — | |
| clock | backend | — | — | — | — | |
| peakpulse | backend | — | — | — | — | |

Use:

```
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend   # cold
touch backend/src/server.ts && time docker compose build backend # warm
```

### A8. Config-file COPY audit & canonical pattern (addresses F11, F13)

- [ ] **A8-1.** For every Dockerfile in scope, list all build-time files present in the surface directory (`web/` or `backend/`) that affect the build:
  - `postcss.config.{js,mjs,cjs,ts}`
  - `tailwind.config.{js,mjs,cjs,ts}`
  - `next.config.{js,mjs,ts}`
  - `tsconfig*.json`
  - `package.json`
  - `.npmrc.docker`, `.npmrc`
  - `babel.config.*` (if present)
  - `drizzle.config.*` (if present)
  - `vitest.config.*` (only if the build needs it)

  Verify each is COPY'd in the Dockerfile.
- [ ] **A8-2.** Choose canonical COPY pattern. **Decision: middle-ground glob** for web surfaces. Dockerfile `COPY` patterns do not support shell brace expansion, so each extension is listed as its own source (same form as § 7.5):
  ```dockerfile
  COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
  COPY web/public/ ./public/
  COPY web/src/ ./src/
  ```
  Trade-off: glob picks up unintended root-level files if any are added later, but **dramatically reduces F11/F13 risk**. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
- [ ] **A8-3.** Repo-by-repo migration: replace enumerated `COPY web/foo ./foo` with the glob pattern; verify the resulting image has all expected files via `docker run --rm "$IMAGE" ls -la` (where `$IMAGE` is the tag of the freshly built image).

### A9. Healthcheck canonicalization (addresses F12)

- [ ] **A9-1.** Replace `localhost` with `127.0.0.1` in every `docker-compose*.yml` healthcheck `test:` block. Sweep with:
  ```
  rg -l 'http://localhost' --glob 'docker-compose*.yml'
  ```
- [ ] **A9-2.** Standardize healthcheck shape:
  - **Alpine-based images:**
    ```yaml
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    ```
  - **Slim/Debian images** (`wget` not always present, but `node` is):
    ```yaml
    healthcheck:
      test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
    ```
- [ ] **A9-3.** Add `start_period` (10s minimum) — prevents flaky "container started but app not yet listening" false-negatives.

---

## 4. Phase B — Hermetic-fallback polish (`docker-prep.sh`)

`docker-prep.sh` is duplicated with minor variations across product repos. **Promotion to canonical home is now in Phase B, not Phase D** — drift compounds linearly with time and the `.npmrc` template precedent proves the pattern is cheap.
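
The drift concern can be made concrete with a small comparison helper. The following is a minimal sketch, not the shipped `check-npmrc-drift.sh` or the planned `check-docker-prep-drift.sh`: the function name `report_drift` and its argument layout are hypothetical, and it assumes the canonical template and each repo's copy are plain files that should be byte-identical.

```bash
#!/usr/bin/env bash
# Sketch only: report per-repo copies that differ from a canonical template.
# report_drift is a hypothetical helper name, not part of common-plat.

# report_drift TEMPLATE COPY...
# Prints one "DRIFT: <path>" line per copy that differs or is missing.
# Returns 0 when every copy matches the template, 1 otherwise.
report_drift() {
  local template="$1"
  shift
  local copy
  local drifted=0
  for copy in "$@"; do
    # cmp -s exits non-zero when files differ or the copy does not exist.
    if ! cmp -s "$template" "$copy"; then
      echo "DRIFT: $copy"
      drifted=1
    fi
  done
  return "$drifted"
}
```

A CI job could call it with the template path followed by each product repo's `scripts/docker-prep.sh` and fail the job on a non-zero return.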

- [ ] **B1.** Add `--dry-run` flag — list packs/rewrites, no side effects
- [ ] **B2.** Idempotency guard — refuse to run if any `*.bak` exists unless `--force`
- [ ] **B3.** Ensure `.docker-deps/` and `*.bak` are in `.gitignore` of every pilot repo
- [ ] **B4.** Pre-commit hook (husky) — block commits containing rewritten `package.json`, staged tarballs, OR `.bak` files:
  ```bash
  # .husky/pre-commit
  if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
    echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
    exit 1
  fi
  if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|package\.json\.bak)$'; then
    echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
    exit 1
  fi
  ```
- [ ] **B5.** Auto-restore on script error via `trap restore_on_error EXIT` (unless `--keep` passed)
- [ ] **B6.** Update script header comment per § 7.4 template
- [ ] **B7. CANONICAL HOME (was deferred — now in Phase B proper).**
  - [ ] **B7-1.** Move script to `learning_ai_common_plat/scripts/docker-prep.template.sh`
  - [ ] **B7-2.** Add `learning_ai_common_plat/scripts/sync-docker-prep.sh` to copy template into all product repos (mirrors `sync-npmrc.sh`)
  - [ ] **B7-3.** Add `learning_ai_common_plat/scripts/check-docker-prep-drift.sh` for CI (mirrors `check-npmrc-drift.sh`)
  - [ ] **B7-4.** Update every repo's `AGENTS.md` with the "NEVER edit `docker-prep.sh` directly" warning + template link
- [ ] **B8.** Add `--strip-overrides` option that removes `pnpm.overrides` block after build — safety net in case `--restore` is forgotten

---

## 5. Phase C — Verification gates

Pilot exit criteria (must all pass before Phase D):

- [ ] **C1.** Cold Docker build succeeds on both pilots via Gitea-registry path (no `docker-prep.sh` invocation)
- [ ] **C2.** Warm rebuild (single source file touched) < 30 s on both pilots
- [ ] **C3.** `docker-prep.sh` → `docker compose build` → `--restore` leaves `git status` clean
- [ ] **C4.** Pre-commit hook blocks: (a) rewritten `package.json`, (b) staged `.tgz`, (c) staged `.bak`
- [ ] **C5.** Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
- [ ] **C6.** Build-time metrics filled into the table in § 3.A7
- [ ] **C7.** ADR recorded for A3 (lockfile policy)
- [ ] **C8.** `docker-doctor.sh` (Phase E) runs clean against both pilots
- [ ] **C9.** Smoke test: render the web app, inspect the served HTML for a non-trivial CSS bundle (> 50 KB), confirm Tailwind classes apply. Guard against F11 regression.

---

## 6. Phase D — Ecosystem rollout (deferred until § 5 passes)

Apply Phase A + B + E to remaining repos.
**Pilots excluded.**

| Repo | Backend | Web | docker-prep | Healthcheck | Notes |
|---|---|---|---|---|---|
| `learning_ai_notes` | ☐ | ☐ | ☐ | ☐ | `BASE_IMAGE=node:22-slim` override (corp proxy Alpine SSL) |
| `learning_ai_fastgap` | ☐ | ☐ | ☐ | ☐ | Mobile + web + backend |
| `learning_ai_jarvis_jr` | ☐ | ☐ | ☐ | ☐ | F12 incident already fixed; verify regression-proof |
| `learning_ai_flowmonk` | ☐ | ☐ | ☐ | ☐ | `.npmrc.docker` is tarball-only — needs A0-1 |
| `learning_ai_trails` | ☐ | ☐ | ☐ | ☐ | |
| `learning_ai_local_memory_gpt` | ☐ | ☐ | ☐ | ☐ | SQLite-based; F11(b) already fixed `07cdf6b` — verify regression-proof |
| `learning_multimodal_memory_agents` (MindLyst) | ☐ | ☐ | ☐ | ☐ | KMP repo, different layout |
| `learning_voice_ai_agent` (LysnrAI) | ☐ | ☐ | ☐ | ☐ | Python desktop + TS dashboards |
| `learning_ai_efforise` | ☐ | ☐ | ☐ | ☐ | |
| `learning_ai_auth_app` | ☐ | n/a | ☐ | n/a | iOS/Android — no Docker surfaces |
| `learning_ai_talk2obsidian` | ☐ | ☐ | ☐ | ☐ | Single-container app |

---

## 7. Reference snippets

### 7.1 Canonical `.npmrc.docker`

Matches the host-side `.npmrc` template shipped in `common-plat` `610a59fd`.

```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```

### 7.2 Canonical backend Dockerfile

```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend

ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json

# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build

# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]
```

> `--lockfile=false` is intentional pending the A3 ADR. Switch to
> `--frozen-lockfile` only once the sibling-workspace problem (F2) is resolved.
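
For builds driven outside Compose, the same Dockerfile can be invoked with plain `docker build`. The sketch below only assembles the argument list; it assumes `GITEA_NPM_TOKEN` is exported in the calling shell and that the installed Docker/BuildKit accepts `host-gateway` in `--add-host`. The helper name `backend_build_args` and the default tag `backend:dev` are hypothetical, chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: assemble `docker build` arguments for backend/Dockerfile.
# backend_build_args and the default tag are hypothetical names.

# backend_build_args [TAG]
# Prints one argument per line; host and owner fall back to the same
# defaults the roadmap uses in docker-compose.yml.
backend_build_args() {
  local tag="${1:-backend:dev}"
  printf '%s\n' \
    --file backend/Dockerfile \
    --build-arg "GITEA_NPM_HOST=${GITEA_NPM_HOST:-host.docker.internal}" \
    --build-arg "GITEA_NPM_OWNER=${GITEA_NPM_OWNER:-learning_ai_user}" \
    --secret "id=gitea_npm_token,env=GITEA_NPM_TOKEN" \
    --add-host "host.docker.internal:host-gateway" \
    --tag "$tag" \
    .
}
```

One possible invocation is `backend_build_args backend:dev | xargs docker build`, which works here because none of the arguments contain whitespace. The token is passed by reference to the environment variable, so its value never appears in the printed arguments or the image layers.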

### 7.3 Canonical `docker-compose.yml` service block

```yaml
services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
      # A0-4: build-time host mapping so pnpm install can reach the registry
      extra_hosts:
        - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      - PORT=4010
      # ...
    restart: unless-stopped
    healthcheck:
      # F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```

### 7.4 Hardened `docker-prep.sh` header

```bash
#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
#   - Local Gitea registry (:3300) is down or unreachable, OR
#   - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
#   docker compose build
#
# Usage:
#   ./scripts/docker-prep.sh                    # pack tarballs + rewrite package.json
#   ./scripts/docker-prep.sh --dry-run          # show what would change (no side effects)
#   ./scripts/docker-prep.sh --force            # override idempotency guard
#   ./scripts/docker-prep.sh --restore          # undo rewrite
#   ./scripts/docker-prep.sh --keep             # skip auto-restore on error
#   ./scripts/docker-prep.sh --strip-overrides  # remove pnpm.overrides block
#
# Side effects:
#   - Creates .docker-deps/ (gitignored)
#   - Backs up package.json → package.json.bak
#   - Rewrites @bytelyst/* deps to file:../.docker-deps/
#   - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
#   - Refuses to run if .bak files already exist (unless --force)
#   - Auto-restores on error (trap EXIT) unless --keep passed
#   - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak
```

### 7.5 Canonical Next.js web Dockerfile (addresses F11, F13)

```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web

ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json

RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json

# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/

ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1

RUN corepack enable && pnpm run build

# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1

COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public

EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]
```

> **Verification step after every web Dockerfile change:** smoke-test the
> built image by running it and curling the rendered HTML. Confirm the CSS
> bundle that the HTML references is > 50 KB. A bundle of ~33 KB is the F11
> signature (only `@font-face`, no Tailwind utilities).

### 7.6 `docker-doctor.sh` skeleton (Phase E)

```bash
#!/usr/bin/env bash
# docker-doctor.sh — pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail

REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0

# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile
for cfg in postcss.config tailwind.config next.config; do
  for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
    [[ -f "$f" ]] || continue
    base=$(basename "$f")
    if ! grep -q "COPY web/${base}\\|COPY web/\\*" "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
      echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
      FAILED=1
    fi
  done
done

# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
  echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
  FAILED=1
fi

# Check 3: .npmrc.docker matches canonical template
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
  if ! grep -q '\${GITEA_NPM_HOST}' "$REPO_DIR/.npmrc.docker"; then
    echo "✗ F4: .npmrc.docker doesn't use \${GITEA_NPM_HOST} placeholder"
    FAILED=1
  fi
fi

# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
  echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi

# Check 5: base image is on approved list
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
  [[ -f "$df" ]] || continue
  if ! grep -qE 'FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))' "$df"; then
    echo "✗ Unapproved base image in $df"
    FAILED=1
  fi
done

exit $FAILED
```

---

## 8. Phase E — Observability / lint (NEW)

Two complementary linters:

1. **`gitea-doctor`** — Gitea registry pre-flight (env + token + connectivity). **Already shipped** in `common-plat` commit `610a59fd` at `scripts/gitea/doctor.sh`. This roadmap only wires it into CI/build flows (A0-D + E0 below).
2. **`docker-doctor`** — Dockerfile + compose-file static linter (see § 7.6 skeleton). To be built as part of this roadmap.

The two are intentionally separate concerns:

| Linter | Scope | When to run |
|---|---|---|
| `gitea-doctor` | runtime env, token, registry HTTP 200 | Before every build / deploy |
| `docker-doctor` | static analysis of Dockerfile + compose YAML | On every PR touching those files |

### Phase E checklist

- [ ] **E0.** Wire `bash scripts/gitea/doctor.sh --quiet` into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in `common-plat`; replicate via a reusable `actions/gitea-preflight@main` composite if Gitea Actions supports it, otherwise inline.
- [ ] **E1.** Land `docker-doctor.sh` in `learning_ai_common_plat/scripts/` (canonical, mirrors `gitea/doctor.sh` shipped earlier)
- [ ] **E2.** Provide a thin per-repo wrapper at `scripts/docker-doctor.sh` that calls the canonical
- [ ] **E3.** Wire into CI: run on PRs touching `Dockerfile`, `docker-compose*.yml`, `.dockerignore`, `.npmrc.docker`
- [ ] **E4.** Wire into pre-commit hook (warning-only at first, error after 2 weeks)
- [ ] **E5.** Document checks in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` (sibling doc to the existing `gitea-doctor` patterns)
- [ ] **E6.** Add `make doctor` target to each pilot repo that runs both `gitea-doctor` AND `docker-doctor`

Checks implemented by `docker-doctor.sh`:

| Check | Addresses | Action |
|---|---|---|
| Every `web/*.config.*` file is COPY'd | F11, F13 | Error |
| `docker-compose.yml` healthcheck uses `127.0.0.1` | F12 | Error |
| `.npmrc.docker` uses `${GITEA_NPM_HOST}` AND `${GITEA_NPM_OWNER}` placeholders | F4, F14 | Error |
| Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker` | F14 | Error |
| `.dockerignore` doesn't exclude `pnpm-lock.yaml` | F1 | Warn (until A3 ADR lands) |
| Base image is on approved list (`node:22-alpine` or `node:22-slim` via `BASE_IMAGE` ARG) | Canonical decision | Error |
| `.docker-deps/` and `*.bak` in `.gitignore` | B3 | Error |
| `docker-compose.yml` passes `GITEA_NPM_OWNER` build arg | F14 | Warn |

---

## 9. Open questions (numbered TODOs, not blockers)

1. **Shared pnpm cache volume?** BuildKit caches are already shared across builds by `id=pnpm`. Test whether a named Docker volume adds anything before adding complexity.
2. **Custom base image?** Publish `bytelyst/node-pnpm:22{alpine,slim}` with pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
3. **CI hostname?** Verify `host.docker.internal:host-gateway` works in Gitea Actions Linux runners, or if a CI-specific Dockerfile variant is needed.
4. **Multi-platform builds?** `linux/amd64` + `linux/arm64` interact awkwardly with cache mounts under `buildx`. Defer to separate roadmap.
5. **Workspace flattening?** Eliminate the `../learning_ai_common_plat/packages/*` workspace entry inside Docker via a flattened `pnpm-workspace.yaml`. Unlocks `--frozen-lockfile`. Requires lockfile regeneration step.

---

## 10. Execution order

1. **Now (this commit):** roadmap doc v4 lands here; sign-off requested.
2. **Phase A0 on `learning_ai_clock`** (web + backend) — pilot order intentionally inverted vs. v2: web is where F11/F13 incidents lived, and clock exercises both surface types in one repo. Fix `.npmrc.docker`, `docker-compose.yml`, `.dockerignore`. Verify **A0-V** (Gitea path works end-to-end) before any speed work.
3. **A8 + A9 + A1** on clock (correctness before speed). Commit.
4. **A2 + A4 + A5 + A6** on clock. Measure. Commit.
5. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation second pass for the simpler case.
6. **A7** — fill in metrics table.
7. **A3 ADR** — decide lockfile policy (defer implementation).
8. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical home in common-plat (B7) and sync to peakpulse.
9. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
10. **Phase C** — verification gates C1–C9.
11. **Phase D** — scheduled separately, only after § 5 passes.

---

## 11. Risk register

| Risk | Mitigation |
|---|---|
| Removing `pnpm-lock.yaml` from `.dockerignore` exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep `--lockfile=false` for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct `id=` per repo (`id=pnpm-${repo}`) if observed |
| `host.docker.internal` doesn't resolve in Linux Docker | `extra_hosts: ["host.docker.internal:host-gateway"]` (A0-4) |
| Removing `.docker-deps/` from default builds breaks repos that haven't done A0 yet | Wildcard `COPY .docker-deps*` keeps both paths working during migration |
| `docker-prep.sh` `--force` is misused and `.bak` files get committed | Pre-commit hook (B4) blocks `.bak`, `.tgz`, rewritten `package.json` |
| Corp network blocks `host.docker.internal:3300` | Verify SSH tunnel reaches Gitea; document in operations.md |
| **F11 regression: build green, app ships with no CSS** | C9 smoke test + Phase E `docker-doctor.sh` check on `web/*.config.*` COPY coverage |
| **F12 regression: healthcheck false-fails on IPv6** | Phase E `docker-doctor.sh` grep for `localhost` in compose files |
| **F13 regression: new config file added, Dockerfile forgotten** | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` |
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
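
The F11 mitigation in the table leans on the C9 smoke test, and the size part of that test is easy to script. The sketch below is one possible shape, assuming the CSS files have already been copied out of the built image (or downloaded) into a local directory. The helper names `css_total_bytes` and `check_css_bundle` are hypothetical, and the default floor of 51200 bytes is an interpretation of the "> 50 KB" threshold from C9.

```bash
#!/usr/bin/env bash
# Sketch only: flag a suspiciously small CSS payload (the F11 signature).
# css_total_bytes and check_css_bundle are hypothetical helper names.

# css_total_bytes DIR
# Prints the combined size in bytes of all *.css files under DIR (0 if none).
css_total_bytes() {
  find "$1" -type f -name '*.css' -exec cat {} + | wc -c | tr -d ' '
}

# check_css_bundle DIR [MIN_BYTES]
# Returns 0 when the combined CSS size exceeds MIN_BYTES, 1 otherwise.
check_css_bundle() {
  local dir="$1"
  local min="${2:-51200}"
  local total
  total="$(css_total_bytes "$dir")"
  if [ "$total" -gt "$min" ]; then
    echo "ok: ${total} bytes of CSS"
  else
    echo "F11 suspect: only ${total} bytes of CSS (floor ${min})"
    return 1
  fi
}
```

A size floor cannot prove that Tailwind utilities are present, so it complements rather than replaces the visual check in C9.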