bytelyst-devops-tools/docs/docker-build-optimization-roadmap.md
saravanakumardb1 1a638a84e1 docs: roadmap v3 — incorporate review feedback (F11-F13, Phase E)
Review-driven additions:

- F11 added (silent UI breakage from missing/un-COPY'd postcss.config.mjs;
  4 repos hit this tonight: notes dff459e, jarvis_jr 36f6bc1,
  clock a308c6444, local_memory_gpt 07cdf6b)
- F12 added (healthcheck localhost → IPv6 false-fail; jarvis_jr incident)
- F13 added (enumerated COPY drift from filesystem; root cause of F11b)

Structural changes:
- New A8 (config-file COPY audit + glob pattern decision)
- New A9 (healthcheck IPv4 canonicalization)
- New A0-V verification gate (build Gitea path before optimizing)
- New § 2.5 canonical decisions (Alpine + ARG BASE_IMAGE override,
  127.0.0.1, --lockfile=false pending ADR)
- New § 7.5 canonical web Dockerfile (was missing, where F11 lives)
- New § 7.6 docker-doctor.sh skeleton
- New Phase E (docker-doctor.sh CI lint as durable insurance)
- B7 promoted from Phase D to Phase B proper (drift compounds)
- B4 husky hook extended to also block .tgz and .bak
- A0-1 env-var expansion chain explicitly documented
- A2-3 verification command corrected (docker buildx du, not docker history)
- Pilot order inverted: clock first (web + backend), then peakpulse
- C9 smoke test added (CSS bundle > 50 KB, F11 guard)
- 4 new risk-register rows for F11/F12/F13/BASE_IMAGE drift
2026-05-27 00:34:07 -07:00


Docker Build Optimization Roadmap

Status: Draft v3 (post-review) · Owner: Platform DevOps · Created: 2026-05-27 · Revised: 2026-05-27

Pilot Docker-build correctness + speed fixes on learning_ai_clock (web + backend) and learning_ai_peakpulse (backend), then capture the playbook here for ecosystem-wide rollout.


0. Pre-flight audit findings (2026-05-27)

A read-only audit of pilot repos + lessons from recent live incidents surfaced 13 concrete bugs/gaps. The actual state of the ecosystem is closer to the inverse of the casual narrative: tarballs are the de facto default, the Gitea-registry path is partially wired, and there is a separate class of "build green, app broken" silent failures (F11-F13) that the speed-focused plan needs to address first.

| # | Finding | Location | Severity |
| --- | --- | --- | --- |
| F1 | pnpm-lock.yaml is in .dockerignore — any lockfile-based optimization is blocked until removed | peakpulse/.dockerignore, clock/.dockerignore | High |
| F2 | pnpm-workspace.yaml references sibling ../learning_ai_common_plat/packages/*; --frozen-lockfile inside Docker will fail unless workspace is flattened or sibling tree is copied | both pilots | High |
| F3 | peakpulse/.npmrc.docker is tarball-only (no @bytelyst:registry=… line) — the "Gitea-registry" path doesn't work in this repo today | peakpulse/.npmrc.docker | High |
| F4 | clock/.npmrc.docker hardcodes http://localhost:3300 — from inside Docker, localhost is the container, not the host registry | clock/.npmrc.docker | High |
| F5 | clock/backend/Dockerfile has neither ARG GITEA_NPM_HOST nor a BuildKit secret mount — wholly dependent on pre-populated .docker-deps/ | clock/backend/Dockerfile | High |
| F6 | clock/web/Dockerfile accepts ARG GITEA_NPM_HOST but never uses it; no --mount=type=secret either | clock/web/Dockerfile | Medium |
| F7 | peakpulse/docker-compose.yml does not pass GITEA_NPM_HOST build arg or declare secrets: block | peakpulse/docker-compose.yml | Medium |
| F8 | COPY .docker-deps/ is unconditional in every backend Dockerfile — every build requires docker-prep.sh to have run OR an empty .docker-deps/ dir to pre-exist | both repos | Medium |
| F9 | npm install -g pnpm@10.6.5 runs on every build (no corepack) — 5-10 s overhead, no pinning to packageManager field | all four Dockerfiles | Low |
| F10 | No BuildKit --mount=type=cache for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
| F11 | Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: next build succeeds, container is "healthy", but CSS bundle is ~33 KB (only @font-face) and all Tailwind classes are absent → UI renders unstyled. Two sub-bugs: (a) postcss.config.mjs missing entirely while @tailwindcss/postcss is in package.json (NoteLett, JarvisJr fixes dff459e, 36f6bc1); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes a308c6444, 07cdf6b). | */web/Dockerfile, */web/postcss.config.* | High |
| F12 | Healthcheck uses localhost, resolves to IPv6 ::1, false-fails. Backend listens on 0.0.0.0 (IPv4 only). wget --spider http://localhost:.../health hits ::1, connection refused, container marked "unhealthy", web service won't start due to depends_on: condition: service_healthy. Incident: learning_ai_jarvis_jr/docker-compose.yml. | every docker-compose*.yml healthcheck | Medium |
| F13 | Enumerated COPY web/foo ./foo pattern drifts from filesystem. New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | Medium |

Implications:

  • The original "switch to --frozen-lockfile + Gitea registry" plan requires two upstream fixes first (F1, F2).
  • F11-F13 mean correctness fixes must precede speed fixes, otherwise we ship faster builds of broken apps.
  • A linter (Phase E docker-doctor.sh) is the durable insurance against F11/F13 recurrence — they are silent in CI today.

1. Context: three build paths

| Path | Status today | Trigger | Notes |
| --- | --- | --- | --- |
| docker-prep.sh tarballs | De facto default in peakpulse + flowmonk; also works in clock/notes | Run docker-prep.sh then docker compose build | Hermetic; mutates package.json; slow to repack |
| Gitea NPM registry | Partially wired in clock + notes; broken in peakpulse | docker compose build with GITEA_NPM_HOST arg + secret | Needs .npmrc.docker standardization to be the default |
| Legacy file: refs | Deprecated | | Removed during pnpm/Gitea migration |

Measurement targets

| Build | Baseline (observed) | Target after Phase A |
| --- | --- | --- |
| Cold (no cache) | ~2-3 min | ≤ 2 min |
| Warm (one source file changed) | ~2-3 min | < 30 s |
| docker-prep.sh pack step alone | ~60-90 s | < 30 s (pnpm pack cache) |

Fill in actuals during Phase C.


2. Goals & non-goals

Goals

  • Eliminate the F11-F13 class of silent "build green, app broken" failures
  • Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
  • Make docker-prep.sh idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
  • Standardize .npmrc.docker across the ecosystem so the Gitea path actually works
  • Fix docker-compose.yml to pass GITEA_NPM_HOST + secrets so the registry path is usable without manual flags
  • Ship docker-doctor.sh CI lint as the durable insurance layer

Non-goals

  • Migrating off pnpm or off the Gitea registry
  • Adopting --frozen-lockfile until F2 is resolved (sibling-workspace problem)
  • Publishing @bytelyst/* to the public npm registry
  • Multi-platform builds (separate roadmap)

2.5 Canonical decisions

Decisions taken now to avoid contradictions later in the doc:

  • Base image: node:22-alpine is canonical. For repos blocked by the corporate proxy's Alpine SSL interception (currently only learning_ai_notes), the Dockerfile MUST expose:
    ARG BASE_IMAGE=node:22-alpine
    FROM ${BASE_IMAGE} AS builder
    
    Override per-repo via --build-arg BASE_IMAGE=node:22-slim. Document the override in the repo's AGENTS.md.
  • Healthcheck host: 127.0.0.1 (NOT localhost) in every docker-compose*.yml test: block. See F12.
  • Lockfile mode in Docker: --lockfile=false for now. --frozen-lockfile is blocked on the A3 ADR (F2).

3. Phase A — Correctness + build speed + path correctness

Order matters: A0 must precede A1+ (you can't optimize a path that doesn't work), and A8+A9 (correctness) must land before measuring speed wins.

A0. Make the Gitea-registry path actually work (clock + peakpulse)

  • A0-1. Standardize .npmrc.docker to use a templated host so it works on host (localhost) and inside Docker (host.docker.internal):

    @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
    //${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
    strict-ssl=false
    auto-install-peers=true
    

    ⚠️ Env-var expansion chain: pnpm expands ${VAR} in .npmrc at read time using the current process environment (see pnpm npmrc docs). That means the Dockerfile MUST declare ARG GITEA_NPM_HOST and then ENV GITEA_NPM_HOST=$GITEA_NPM_HOST before the pnpm install RUN line, AND GITEA_NPM_TOKEN must be exported from the BuildKit secret mount inside the same RUN (since secrets don't persist as env across layers).

  • A0-2. Remove pnpm-lock.yaml from .dockerignore in both repos (fixes F1; harmless under --lockfile=false since we don't COPY it, but unblocks future A3)

  • A0-3. Add GITEA_NPM_HOST build arg + secrets: block to every service in docker-compose.yml:

    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
      secrets:
        - gitea_npm_token
    secrets:
      gitea_npm_token:
        environment: GITEA_NPM_TOKEN
    
  • A0-4. Add extra_hosts: ["host.docker.internal:host-gateway"] to each service so Linux Docker can resolve the host

  • A0-5. Document required env: GITEA_NPM_TOKEN must be exported in the shell that runs docker compose build (add to repo README.md quickstart)

  • A0-V. Verification gate (between A0 and A1): build the registry path without any cache-mount or layer optimizations. Confirm docker compose build --no-cache succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
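
The A0 prerequisites can be checked mechanically before running the A0-V build. The following is a minimal sketch, not an existing script: the function name `preflight_registry_path` and its messages are hypothetical, and it assumes `.npmrc.docker` and `.dockerignore` live at the repo root.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight for the A0-V gate; names and messages are illustrative.
preflight_registry_path() {
  local repo_dir="$1" rc=0
  local npmrc="$repo_dir/.npmrc.docker"

  # A0-5: the token must be exported in the shell that runs docker compose build.
  if [[ -z "${GITEA_NPM_TOKEN:-}" ]]; then
    echo "missing: GITEA_NPM_TOKEN is not set"
    rc=1
  fi
  # A0-1: .npmrc.docker must use the templated host, not a hardcoded one (F4).
  if ! grep -qF '${GITEA_NPM_HOST}' "$npmrc" 2>/dev/null; then
    echo "missing: \${GITEA_NPM_HOST} placeholder in $npmrc"
    rc=1
  fi
  # A0-2: the lockfile must not be excluded from the build context (F1).
  if grep -qxF 'pnpm-lock.yaml' "$repo_dir/.dockerignore" 2>/dev/null; then
    echo "present: pnpm-lock.yaml exclusion in $repo_dir/.dockerignore"
    rc=1
  fi
  return "$rc"
}
```

A possible use is `preflight_registry_path . && docker compose build --no-cache`, so the slow uncached build only starts once the cheap checks pass.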

A1. Replace npm install -g pnpm@X with corepack

  • A1-1. Replace RUN npm install -g pnpm@10.6.5 with:
    RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
    
  • A1-2. Verify packageManager field in backend/package.json and web/package.json matches (already pnpm@10.6.5 in peakpulse backend)

A2. Add BuildKit pnpm-store cache mount

  • A2-1. Set # syntax=docker/dockerfile:1.7 directive at top of every Dockerfile
  • A2-2. Wrap install step with cache + secret mount:
    RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
        --mount=type=secret,id=gitea_npm_token \
        export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
        pnpm install --ignore-scripts --lockfile=false
    
  • A2-3. Verify cache mount is active: docker buildx du --filter type=exec.cachemount shows non-zero size after a build. Real success metric is wall-clock: warm rebuild (touching one source file) drops to < 30 s.

A3. Decide lockfile policy (BLOCKED on F2 resolution)

Two options — pick one in a short ADR before implementing:

  • Option 1: Keep --lockfile=false (current pragmatic approach)

    • No sibling-workspace complications
    • No reproducibility guarantee inside Docker
    • Slower installs (full resolution every build)
  • Option 2: Generate a Docker-only lockfile via pnpm install --lockfile-only against a flattened package.json that resolves @bytelyst/* to semver

    • Reproducibility
    • Faster installs
    • New build step + tooling
    • Drift risk between dev lockfile and Docker lockfile
  • A3-1. Write 1-page ADR (docs/decisions/0001-docker-lockfile-policy.md) and pick Option 1 or 2

  • A3-2. Defer --frozen-lockfile adoption until ADR lands

A4. Restructure layer order

  • A4-1. Reorder COPY/RUN so deps-install layer is package.json + .npmrc.docker ONLY, then a separate layer for src/, config files, shared/
  • A4-2. Move all ARG lines that affect deps install before the install step; move NEXT_PUBLIC_* ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)

A5. Gate .docker-deps/ behind a build arg

  • A5-1. Add ARG USE_TARBALLS=false to Dockerfile
  • A5-2. Use wildcard COPY so missing dir doesn't break the build:
    RUN mkdir -p /app/.docker-deps
    COPY .docker-deps* /app/.docker-deps/
    
  • A5-3. Verify .docker-deps/ is in .gitignore and .dockerignore does NOT exclude it when tarball mode is in use

A6. .dockerignore audit

  • A6-1. Confirm exclusions: node_modules, **/node_modules, dist, .next, *.log, .env, .env.*, .git, *.bak
  • A6-2. Remove: pnpm-lock.yaml exclusion (was correct under --lockfile=false, blocks future optimization)
  • A6-3. Confirm .docker-deps/ is NOT excluded when tarball path is active

A7. Measure & record

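| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| clock | web | | | | | |
| clock | backend | | | | | |
| peakpulse | backend | | | | | |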

Use:

time DOCKER_BUILDKIT=1 docker compose build --no-cache backend   # cold
touch backend/src/server.ts && time docker compose build backend  # warm
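
To make the before/after numbers comparable across repos, the timing can be wrapped in a small helper. This is a sketch under assumptions: `time_cmd` and `warm_within_target` are hypothetical names, and whole-second resolution from bash's `SECONDS` counter is assumed to be precise enough for a 30 s target.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command quietly and print its wall-clock
# duration in whole seconds (bash's SECONDS counter).
time_cmd() {
  local start=$SECONDS
  "$@" >/dev/null 2>&1 || return 1
  echo $(( SECONDS - start ))
}

# Succeeds when a warm rebuild meets the target (default: under 30 s).
warm_within_target() {
  local secs="$1" target="${2:-30}"
  (( secs < target ))
}

# Usage sketch:
#   cold=$(time_cmd docker compose build --no-cache backend)
#   touch backend/src/server.ts
#   warm=$(time_cmd docker compose build backend)
#   warm_within_target "$warm" || echo "warm rebuild took ${warm}s, misses the < 30 s target"
```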

A8. Config-file COPY audit & canonical pattern (addresses F11, F13)

  • A8-1. For every Dockerfile in scope, list all build-time files present in the surface directory (web/ or backend/) that affect the build:
    • postcss.config.{js,mjs,cjs,ts}
    • tailwind.config.{js,mjs,cjs,ts}
    • next.config.{js,mjs,ts}
    • tsconfig*.json
    • package.json
    • .npmrc.docker, .npmrc
    • babel.config.* (if present)
    • drizzle.config.* (if present)
    • vitest.config.* (only if the build needs it)
    Verify each is COPY'd in the Dockerfile.
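  • A8-2. Choose canonical COPY pattern. Decision: middle-ground glob for web surfaces. COPY patterns follow Go's filepath.Match rules, which have no brace expansion, so each extension is listed separately:
    COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./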
    COPY web/public/ ./public/
    COPY web/src/ ./src/
    
    Trade-off: glob picks up unintended root-level files if any are added later, but dramatically reduces F11/F13 risk. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
  • A8-3. Repo-by-repo migration: replace enumerated COPY web/foo ./foo with the glob pattern; verify the resulting image has all expected files via docker run --rm <img> ls -la.
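
The A8-3 verification can be scripted rather than eyeballed. The sketch below is illustrative only: `list_build_configs` and `missing_from_image` are hypothetical names, and it assumes the image listing comes from the build stage (for example an image built with `docker build --target builder`), since a slimmed runtime stage is not expected to carry config files.

```shell
#!/usr/bin/env bash
# List root-level build-time config files in a surface directory (names only).
list_build_configs() {
  local dir="$1" f
  for f in "$dir"/*.json "$dir"/*.ts "$dir"/*.mjs "$dir"/*.js "$dir"/*.cjs; do
    if [[ -f "$f" ]]; then basename "$f"; fi
  done | sort
}

# Read an image file listing on stdin (one name per line) and print every
# config file that exists in the repo directory but is absent from the image.
missing_from_image() {
  local dir="$1" listing name
  listing="$(cat)"
  list_build_configs "$dir" | while IFS= read -r name; do
    grep -qxF "$name" <<<"$listing" || echo "$name"
  done
}

# Usage sketch (<img> is a placeholder for a builder-stage image tag):
#   docker run --rm <img> ls -1 /app/web | missing_from_image web
```

Empty output means every root-level config file made it into the image; any printed name is an F11(b)/F13 candidate.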

A9. Healthcheck canonicalization (addresses F12)

  • A9-1. Replace localhost with 127.0.0.1 in every docker-compose*.yml healthcheck test: block. Sweep with:
    rg -l 'http://localhost' --glob 'docker-compose*.yml'
    
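  • A9-2. Standardize healthcheck shape. Compose interpolates ${PORT} from the host shell or .env when it parses the file, so either substitute the literal port (as in § 7.3) or write $${PORT} to defer expansion to the container shell: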
    • Alpine-based images:
      healthcheck:
        test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
        interval: 30s
        timeout: 5s
        retries: 3
        start_period: 10s
      
    • Slim/Debian images (wget not always present, but node is):
      healthcheck:
        test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
      
  • A9-3. Add start_period (10s minimum) — prevents flaky "container started but app not yet listening" false-negatives.
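
The A9-1 sweep finds offenders; the rewrite itself can be sketched as a stream filter. This is an assumption-laden sketch: `canonicalize_healthcheck_host` is a hypothetical name, and it only handles the single-line `test: [...]` form shown above, leaving other `localhost` references (for example in `environment:`) untouched.

```shell
#!/usr/bin/env bash
# Rewrite localhost to 127.0.0.1, but only on healthcheck `test:` lines.
# Reads a compose file on stdin and writes the result to stdout.
canonicalize_healthcheck_host() {
  sed -E '/^[[:space:]]*test:/ s#http://localhost#http://127.0.0.1#g'
}

# Usage sketch: rewrite each compose file via a temp file, then review the diff.
#   for f in docker-compose*.yml; do
#     canonicalize_healthcheck_host < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
#   done
#   git diff -- 'docker-compose*.yml'
```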

4. Phase B — Hermetic-fallback polish (docker-prep.sh)

docker-prep.sh is duplicated with minor variations across product repos. Promotion to canonical home is now in Phase B, not Phase D — drift compounds linearly with time and the .npmrc template precedent proves the pattern is cheap.

  • B1. Add --dry-run flag — list packs/rewrites, no side effects
  • B2. Idempotency guard — refuse to run if any *.bak exists unless --force
  • B3. Ensure .docker-deps/ and *.bak are in .gitignore of every pilot repo
  • B4. Pre-commit hook (husky) — block commits containing rewritten package.json, staged tarballs, OR .bak files:
    # .husky/pre-commit
    # Block package.json files rewritten by docker-prep.sh. git grep --cached
    # inspects the staged content rather than the working-tree copy.
    if git grep --cached -l '"file:\.\./\.docker-deps/' -- '*package.json'; then
      echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
      exit 1
    fi
    # Block staged tarballs and any *.bak backup left behind by docker-prep.sh.
    if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|\.bak)$'; then
      echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
      exit 1
    fi
    
  • B5. Auto-restore on script error via trap restore_on_error EXIT (unless --keep passed); the handler must check the exit status, because an EXIT trap also fires on success
  • B6. Update script header comment per § 7.4 template
  • B7. CANONICAL HOME (was deferred — now in Phase B proper).
    • B7-1. Move script to learning_ai_common_plat/scripts/docker-prep.template.sh
    • B7-2. Add learning_ai_common_plat/scripts/sync-docker-prep.sh to copy template into all product repos (mirrors sync-npmrc.sh)
    • B7-3. Add learning_ai_common_plat/scripts/check-docker-prep-drift.sh for CI (mirrors check-npmrc-drift.sh)
    • B7-4. Update every repo's AGENTS.md with the "NEVER edit docker-prep.sh directly" warning + template link
  • B8. Add --strip-overrides option that removes pnpm.overrides block after build — safety net in case --restore is forgotten
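
The B2 guard and B5 auto-restore could look roughly like the sketch below. None of this is the current `docker-prep.sh`: the function names, the `PREP_ROOT` and `KEEP` variables, and the choice to skip `node_modules` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# B2 (sketch): refuse to run when backups from a previous run are still present.
guard_no_stale_backups() {
  local root="$1" force="${2:-false}" stale
  stale="$(find "$root" -name '*.bak' ! -path '*/node_modules/*' -print | head -n 1)"
  if [[ -n "$stale" && "$force" != "true" ]]; then
    echo "refusing to run: found $stale (use --force to override)" >&2
    return 1
  fi
}

# Move every package.json.bak back over its rewritten package.json.
restore_backups() {
  local root="$1" bak
  find "$root" -name 'package.json.bak' ! -path '*/node_modules/*' -print |
    while IFS= read -r bak; do
      mv "$bak" "${bak%.bak}"
    done
}

# B5 (sketch): EXIT fires on success too, so restore only on a non-zero status.
restore_on_error() {
  local status=$?
  if (( status != 0 )) && [[ "${KEEP:-false}" != "true" ]]; then
    restore_backups "${PREP_ROOT:-.}"
  fi
  return "$status"
}
# In the script body: trap restore_on_error EXIT
```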

5. Phase C — Verification gates

Pilot exit criteria (must all pass before Phase D):

  • C1. Cold Docker build succeeds on both pilots via Gitea-registry path (no docker-prep.sh invocation)
  • C2. Warm rebuild (single source file touched) < 30 s on both pilots
  • C3. docker-prep.sh → docker compose build → --restore leaves git status clean
  • C4. Pre-commit hook blocks: (a) rewritten package.json, (b) staged .tgz, (c) staged .bak
  • C5. Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
  • C6. Build-time metrics filled into the table in § 3.A7
  • C7. ADR recorded for A3 (lockfile policy)
  • C8. docker-doctor.sh (Phase E) runs clean against both pilots
  • C9. Smoke test: render the web app, inspect <head> for non-trivial CSS bundle (> 50 KB), confirm Tailwind classes apply. Guard against F11 regression.
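
The C9 check lends itself to automation. A minimal sketch follows, assuming the rendered HTML references its stylesheets via `href="….css"` attributes and treating 50 KB as 50 × 1024 bytes; `css_hrefs`, `css_bundle_ok`, and `BASE_URL` are hypothetical names.

```shell
#!/usr/bin/env bash
# Extract stylesheet paths from HTML on stdin, one per line.
css_hrefs() {
  grep -oE 'href="[^"]+\.css[^"]*"' | sed -E 's/^href="//; s/"$//'
}

# Succeeds when the total CSS size clears the F11 guard (> 50 KB).
css_bundle_ok() {
  local bytes="$1"
  (( bytes > 50 * 1024 ))
}

# Usage sketch against a running container (BASE_URL is an assumption):
#   total=0
#   for href in $(curl -fsS "$BASE_URL/" | css_hrefs); do
#     size=$(curl -fsS "$BASE_URL$href" | wc -c)
#     total=$(( total + size ))
#   done
#   css_bundle_ok "$total" || echo "F11 signature: CSS bundle is only ${total} bytes"
```

Summing all stylesheets is a design choice of this sketch; a stricter variant could require at least one individual file above the threshold.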

6. Phase D — Ecosystem rollout (deferred until § 5 passes)

Apply Phase A + B + E to remaining repos. Pilots excluded.

| Repo | Backend | Web | docker-prep | Healthcheck | Notes |
| --- | --- | --- | --- | --- | --- |
| learning_ai_notes | | | | | BASE_IMAGE=node:22-slim override (corp proxy Alpine SSL) |
| learning_ai_fastgap | | | | | Mobile + web + backend |
| learning_ai_jarvis_jr | | | | | F12 incident already fixed; verify regression-proof |
| learning_ai_flowmonk | | | | | .npmrc.docker is tarball-only — needs A0-1 |
| learning_ai_trails | | | | | |
| learning_ai_local_memory_gpt | | | | | SQLite-based; F11(b) already fixed 07cdf6b — verify regression-proof |
| learning_multimodal_memory_agents (MindLyst) | | | | | KMP repo, different layout |
| learning_voice_ai_agent (LysnrAI) | | | | | Python desktop + TS dashboards |
| learning_ai_efforise | | | | | |
| learning_ai_auth_app | n/a | n/a | | | iOS/Android — no Docker surfaces |
| learning_ai_talk2obsidian | | | | | Single-container app |
learning_ai_talk2obsidian Single-container app

7. Reference snippets

7.1 Canonical .npmrc.docker

@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true

7.2 Canonical backend Dockerfile

# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend

ARG GITEA_NPM_HOST=host.docker.internal
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build

# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]

--lockfile=false is intentional pending the A3 ADR. Switch to --frozen-lockfile only once the sibling-workspace problem (F2) is resolved.

7.3 Canonical docker-compose.yml service block

services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
      secrets:
        - gitea_npm_token
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      - PORT=4010
      # ...
    restart: unless-stopped
    healthcheck:
      # F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN

7.4 Hardened docker-prep.sh header

#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
#   - Local Gitea registry (:3300) is down or unreachable, OR
#   - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
#   docker compose build
#
# Usage:
#   ./scripts/docker-prep.sh             # pack tarballs + rewrite package.json
#   ./scripts/docker-prep.sh --dry-run   # show what would change (no side effects)
#   ./scripts/docker-prep.sh --force     # override idempotency guard
#   ./scripts/docker-prep.sh --restore   # undo rewrite
#   ./scripts/docker-prep.sh --keep      # skip auto-restore on error
#   ./scripts/docker-prep.sh --strip-overrides  # remove pnpm.overrides block
#
# Side effects:
#   - Creates .docker-deps/ (gitignored)
#   - Backs up package.json → package.json.bak
#   - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
#   - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
#   - Refuses to run if .bak files already exist (unless --force)
#   - Auto-restores on error (trap EXIT) unless --keep passed
#   - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak

7.5 Canonical Next.js web Dockerfile (addresses F11, F13)

# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web

ARG GITEA_NPM_HOST=host.docker.internal
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json

# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/

ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1

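# Pin the same pnpm version as the deps stage; this stage starts from a fresh base image.
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate && pnpm run build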

# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1

COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public

EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]

Verification step after every web Dockerfile change: smoke-test the built image by running it and curling the rendered HTML. Confirm the CSS bundle in <link> references is > 50 KB. A bundle of ~33 KB is the F11 signature (only @font-face, no Tailwind utilities).

7.6 docker-doctor.sh skeleton (Phase E)

#!/usr/bin/env bash
# docker-doctor.sh — pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail

REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0

# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile,
# either by name or by a glob that covers its extension (e.g. web/*.mjs).
for cfg in postcss.config tailwind.config next.config; do
  for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
    [[ -f "$f" ]] || continue
    base=$(basename "$f")
    ext="${base##*.}"
    if ! grep -qE "^COPY .*web/(${base}|\*\.${ext})( |\$)" "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
      echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
      FAILED=1
    fi
  done
done

# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
  echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
  FAILED=1
fi

# Check 3: .npmrc.docker matches canonical template
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
  if ! grep -q '\${GITEA_NPM_HOST}' "$REPO_DIR/.npmrc.docker"; then
    echo "✗ F4: .npmrc.docker doesn't use \${GITEA_NPM_HOST} placeholder"
    FAILED=1
  fi
fi

# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
  echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi

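# Check 5: every FROM line uses an approved base image.
# Note: a FROM that references an earlier stage by name would also be flagged.
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
  [[ -f "$df" ]] || continue
  bad=$(grep -E '^FROM ' "$df" | grep -vE '^FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))( |$)' || true)
  if [[ -n "$bad" ]]; then
    echo "✗ Unapproved base image in $df: $bad"
    FAILED=1
  fi
done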

exit $FAILED

8. Phase E — Observability / lint (NEW)

New phase: docker-doctor.sh (see § 7.6 skeleton) as durable insurance against the class of silent bugs behind tonight's incidents (F11, F12, F13).

  • E1. Land docker-doctor.sh in learning_ai_common_plat/scripts/ (canonical)
  • E2. Provide a thin per-repo wrapper at scripts/docker-doctor.sh that calls the canonical script
  • E3. Wire into CI: run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore, .npmrc.docker
  • E4. Wire into pre-commit hook (warning-only at first, error after 2 weeks)
  • E5. Document checks in learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md
  • E6. Add make docker-doctor target to each pilot repo

Checks implemented:

| Check | Addresses | Action |
| --- | --- | --- |
| Every web/*.config.* file is COPY'd | F11, F13 | Error |
| docker-compose.yml healthcheck uses 127.0.0.1 | F12 | Error |
| .npmrc.docker uses ${GITEA_NPM_HOST} placeholder | F4 | Error |
| .dockerignore doesn't exclude pnpm-lock.yaml | F1 | Warn (until A3 ADR lands) |
| Base image is on approved list | Canonical decision | Error |
| .docker-deps/ and *.bak in .gitignore | B3 | Error |
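
The § 7.6 skeleton covers the first five checks; the .gitignore check (B3) might be sketched as follows. The function name `check_gitignore` is hypothetical, and the exact-line match is an assumption: equivalent patterns such as `/.docker-deps` would be reported as missing, so `git check-ignore -q <path>` may be a more tolerant alternative.

```shell
#!/usr/bin/env bash
# Check 6 (B3, sketch): .docker-deps/ and *.bak must be listed in .gitignore.
check_gitignore() {
  local gitignore="$1/.gitignore" failed=0 pattern
  for pattern in '.docker-deps/' '*.bak'; do
    # -x matches the whole line, -F treats the pattern as a literal string.
    if ! grep -qxF "$pattern" "$gitignore" 2>/dev/null; then
      echo "✗ B3: $pattern missing from .gitignore"
      failed=1
    fi
  done
  return "$failed"
}

# Usage sketch inside docker-doctor.sh:
#   check_gitignore "$REPO_DIR" || FAILED=1
```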

9. Open questions (numbered TODOs, not blockers)

  1. Shared pnpm cache volume? BuildKit caches are already shared across builds by id=pnpm. Test whether a named Docker volume adds anything before adding complexity.
  2. Custom base image? Publish bytelyst/node-pnpm:22{alpine,slim} with pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
  3. CI hostname? Verify host.docker.internal:host-gateway works in Gitea Actions Linux runners, or if a CI-specific Dockerfile variant is needed.
  4. Multi-platform builds? linux/amd64 + linux/arm64 interact awkwardly with cache mounts under buildx. Defer to separate roadmap.
  5. Workspace flattening? Eliminate the ../learning_ai_common_plat/packages/* workspace entry inside Docker via a flattened pnpm-workspace.yaml. Unlocks --frozen-lockfile. Requires lockfile regeneration step.

10. Execution order

  1. Now (this commit): roadmap doc v3 lands here; sign-off requested.
  2. Phase A0 on learning_ai_clock (web + backend) — pilot order intentionally inverted vs. v2: web is where F11/F13 incidents lived, and clock exercises both surface types in one repo. Fix .npmrc.docker, docker-compose.yml, .dockerignore. Verify A0-V (Gitea path works end-to-end) before any speed work.
  3. A8 + A9 + A1 on clock (correctness before speed). Commit.
  4. A2 + A4 + A5 + A6 on clock. Measure. Commit.
  5. Phase A0 → A6 on learning_ai_peakpulse (backend only) as validation second pass for the simpler case.
  6. A7 — fill in metrics table.
  7. A3 ADR — decide lockfile policy (defer implementation).
  8. Phase B — harden docker-prep.sh on clock, then promote to canonical home in common-plat (B7) and sync to peakpulse.
  9. Phase E — land docker-doctor.sh, wire into CI as warning, then error.
  10. Phase C — verification gates C1-C9.
  11. Phase D — scheduled separately, only after § 5 passes.

11. Risk register

| Risk | Mitigation |
| --- | --- |
| Removing pnpm-lock.yaml from .dockerignore exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep --lockfile=false for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct id= per repo (id=pnpm-${repo}) if observed |
| host.docker.internal doesn't resolve in Linux Docker | extra_hosts: ["host.docker.internal:host-gateway"] (A0-4) |
| Removing .docker-deps/ from default builds breaks repos that haven't done A0 yet | Wildcard COPY .docker-deps* keeps both paths working during migration |
| docker-prep.sh --force is misused and .bak files get committed | Pre-commit hook (B4) blocks .bak, .tgz, rewritten package.json |
| Corp network blocks host.docker.internal:3300 | Verify SSH tunnel reaches Gitea; document in operations.md |
| F11 regression: build green, app ships with no CSS | C9 smoke test + Phase E docker-doctor.sh check on web/*.config.* COPY coverage |
| F12 regression: healthcheck false-fails on IPv6 | Phase E docker-doctor.sh grep for localhost in compose files |
| F13 regression: new config file added, Dockerfile forgotten | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| BASE_IMAGE override in notes diverges silently from canonical | Phase E check approved list; document override in repo AGENTS.md |