bytelyst-devops-tools/docs/docker-build-optimization-roadmap.md
saravanakumardb1 b00af09942 docs(docker): roadmap v8 — peakpulse Phase A done + A3 ADR-0001 accepted
Per § 10 steps 9 + 10.

Step 9: Peakpulse backend Phase A complete.
  cold 72.2 s, warm 2.7 s (96.3% reduction). Pattern from clock applied
  verbatim plus .docker-deps/.gitkeep discoverability fix back-ported
  to clock. Commits:
    peakpulse@11a6bc5  feat(docker): Phase A on peakpulse backend
    peakpulse@6523a1a  fix(docker): track .docker-deps/.gitkeep
    clock@1465e06b1    fix(docker): track .docker-deps/.gitkeep
    clock@d69003c1f    chore: dedupe .docker-deps in .gitignore

Step 10: A3 ADR accepted.
  New file: docs/adr/0001-docker-build-lockfile-policy.md
  Decision: short-term Option A (--lockfile=false) — already shipped in
  Phase A; long-term Option C (vendored pnpm-lock.docker.yaml). Migration
  triggered by production deployment, audit requirement, supply-chain
  incident, or loss of BuildKit cache. Implementation sketch in ADR § 4.

Roadmap doc updates:
  - § A7 metrics table: peakpulse row populated (72.2 s / 2.7 s).
  - § A3: collapsed bullet list into decision-record summary linking ADR.
  - § 10: steps 9 + 10 marked done; status banner v7 → v8.

Next per § 10: step 11 (Phase B docker-prep hardening) or step 12
(Phase E docker-doctor.sh linter). Phase E is higher-value as durable
insurance against F11/F13/F16/F17/F18 regressions across the ecosystem.
2026-05-27 02:54:08 -07:00


Docker Build Optimization Roadmap

Status: Draft v8 (Phase A complete on clock + peakpulse; A3 ADR accepted; warm rebuilds 2.7 to 5.4 s) · Owner: Platform DevOps · Created: 2026-05-27 · Revised: 2026-05-27

Pilot Docker-build correctness + speed fixes on learning_ai_clock (web + backend) and learning_ai_peakpulse (backend), then capture the playbook here for ecosystem-wide rollout.

Upstream prerequisite shipped (commit 610a59fd in learning_ai_common_plat): Gitea owner parameterization + helper scripts (scripts/gitea/doctor.sh, scripts/gitea/token.sh). The .npmrc template now resolves owner from ${GITEA_NPM_OWNER:-learning_ai_user}. All A0-1 work in this roadmap inherits this — Dockerfile/.npmrc.docker must use the same ${GITEA_NPM_OWNER} placeholder, not a hardcoded literal.


0. Pre-flight audit findings (2026-05-27)

A read-only audit of pilot repos + lessons from recent live incidents + the A0-V execution iterations on clock surfaced 18 concrete bugs/gaps (F14 and F15 added after the Gitea-hardening commit; F16 to F18 added during the A0-V execution sweep on clock, 2026-05-27). The actual state of the ecosystem is closer to the inverse of the casual narrative: tarballs are the de facto default, the Gitea-registry path is partially wired, and there is a separate class of "build green, app broken" silent failures (F11 to F13) that the speed-focused plan needs to address first.

| # | Finding | Location | Severity |
|---|---------|----------|----------|
| F1 | pnpm-lock.yaml is in .dockerignore; any lockfile-based optimization is blocked until removed | peakpulse/.dockerignore, clock/.dockerignore | High |
| F2 | pnpm-workspace.yaml references sibling ../learning_ai_common_plat/packages/*; --frozen-lockfile inside Docker will fail unless workspace is flattened or sibling tree is copied | both pilots | High |
| F3 | peakpulse/.npmrc.docker is tarball-only (no @bytelyst:registry=… line); the "Gitea-registry" path doesn't work in this repo today | peakpulse/.npmrc.docker | High |
| F4 | clock/.npmrc.docker hardcodes http://localhost:3300; from inside Docker, localhost is the container, not the host registry | clock/.npmrc.docker | High |
| F5 | clock/backend/Dockerfile has neither ARG GITEA_NPM_HOST nor a BuildKit secret mount; wholly dependent on pre-populated .docker-deps/ | clock/backend/Dockerfile | High |
| F6 | clock/web/Dockerfile accepts ARG GITEA_NPM_HOST but never uses it; no --mount=type=secret either | clock/web/Dockerfile | Medium |
| F7 | peakpulse/docker-compose.yml does not pass GITEA_NPM_HOST build arg or declare secrets: block | peakpulse/docker-compose.yml | Medium |
| F8 | COPY .docker-deps/ is unconditional in every backend Dockerfile; every build requires docker-prep.sh to have run OR an empty .docker-deps/ dir to pre-exist | both repos | Medium |
| F9 | npm install -g pnpm@10.6.5 runs on every build (no corepack): 5 to 10 s overhead, no pinning to packageManager field | all four Dockerfiles | Low |
| F10 | No BuildKit --mount=type=cache for pnpm store; cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
| F11 | Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: next build succeeds, container is "healthy", but CSS bundle is ~33 KB (only @font-face) and all Tailwind classes are absent → UI renders unstyled. Two sub-bugs: (a) postcss.config.mjs missing entirely while @tailwindcss/postcss is in package.json (NoteLett, JarvisJr fixes dff459e, 36f6bc1); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes a308c6444, 07cdf6b). | */web/Dockerfile, */web/postcss.config.* | High |
| F12 | Healthcheck uses localhost, resolves to IPv6 ::1, false-fails. Backend listens on 0.0.0.0 (IPv4 only). wget --spider http://localhost:.../health hits ::1, connection refused, container marked "unhealthy", web service won't start due to depends_on: condition: service_healthy. Incident: learning_ai_jarvis_jr/docker-compose.yml. | every docker-compose*.yml healthcheck | Medium |
| F13 | Enumerated COPY web/foo ./foo pattern drifts from filesystem. New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | Medium |
| F14 | Hardcoded Gitea owner (learning_ai_user) literally embedded in .npmrc.docker + CI workflows + publish scripts across 14 repos. When the org was renamed from bytelyst to learning_ai_user, every repo needed a manual commit. Resolved upstream in common-plat (610a59fd): owner now resolves from ${GITEA_NPM_OWNER:-learning_ai_user}; scripts/gitea/{doctor,token}.sh ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | .npmrc.docker, Dockerfile ARG/ENV, CI workflows | Medium |
| F15 | Stale shell-env tokens. ~/.gitea_npm_token rotated on disk; long-lived shells still exported the old value. Caused 401s during docker compose build until source ~/.zshrc. Mitigation shipped: bash scripts/gitea/doctor.sh detects env-vs-file drift and refuses to proceed. Action required in this roadmap: wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
| F16 | At least 10 published @bytelyst/* packages had unrewritten workspace:* refs in their package.json dependencies. Root cause: publish-outdated-packages.sh extracts a pnpm-packed tarball then re-packs with npm pack (workaround for a historical Gitea-compat issue with pnpm's tarball format), and npm pack doesn't recognize the pnpm-specific workspace: protocol, so it passes it through literally. Fixed in common-plat@cfcfc7bb (fix(gitea): rewrite workspace:* in published tarballs (F16)): inserted a workspace:* rewriter between extract and npm-repack + a defense-in-depth grep guard. Republished 10 affected packages. | common-plat publish flow + Gitea registry | Critical (FIXED) |
| F17 | Gitea bakes localhost:3300 into the dist.tarball field of every published package's metadata. Inside Docker, localhost is the container itself, not the host, so even after a successful registry-metadata fetch via host.docker.internal, pnpm follows the tarball URL to localhost:3300 and fails with ECONNREFUSED. Root cause: Gitea app.ini's ROOT_URL=http://localhost:3300/ was baked at publish time. Fixed by setting ROOT_URL=http://host.docker.internal:3300/, restarting Gitea, adding 127.0.0.1 host.docker.internal to /etc/hosts, adding host.docker.internal to NO_PROXY (corp proxy was hijacking DNS), and republishing all 64 packages (common-plat@dd90f709). | Gitea app.ini + host /etc/hosts + every dev machine's switch-network.sh | Critical (FIXED) |
| F18 | clock/web/package.json had 4 @bytelyst/* deps declared as file: refs to sibling ../../learning_ai_common_plat/packages/*, a legacy pre-Gitea pattern. Inside Docker those paths don't exist, so pnpm install fails with ERR_PNPM_LINKED_PKG_DIR_NOT_FOUND. Discovered during clock web A0-V on 2026-05-27. Fixed in learning_ai_clock@8b5c767a3 by rewriting to * semver. Same pattern likely lives in other product repos (especially anything that consumes @bytelyst/ui, @bytelyst/design-tokens, @bytelyst/use-theme); audit needed in Phase D rollout. | */web/package.json (and likely others) | High |

Implications:

  • The original "switch to --frozen-lockfile + Gitea registry" plan requires two upstream fixes first (F1, F2).
  • F11 to F13 mean correctness fixes must precede speed fixes; otherwise we ship faster builds of broken apps.
  • F16 + F17 are both fixed as of 2026-05-27. Gitea path now works end-to-end on clock. A-pre is largely complete; remaining items (A-pre-4, A-pre-5) become Phase E checks.
  • F18 (sibling file: refs in product repo manifests) is the same family as F2 but separately tractable — fixed in clock, audit needed across other repos as part of Phase D rollout.
  • A linter (Phase E docker-doctor.sh) is the durable insurance against F11/F13/F18 recurrence — silent in CI today. The registry-side guard (publish-time check for workspace:* leaks) shipped in common-plat@cfcfc7bb as part of the F16 fix.

1. Context: three build paths

| Path | Status today | Trigger | Notes |
|------|--------------|---------|-------|
| docker-prep.sh tarballs | De facto default in peakpulse + flowmonk; also works in clock/notes | Run docker-prep.sh then docker compose build | Hermetic; mutates package.json; slow to repack |
| Gitea NPM registry | Partially wired in clock + notes; broken in peakpulse | docker compose build with GITEA_NPM_HOST arg + secret | Needs .npmrc.docker standardization to be the default |
| Legacy file: refs | Deprecated | | Removed during pnpm/Gitea migration |

Measurement targets

| Build | Baseline (observed) | Target after Phase A |
|-------|---------------------|----------------------|
| Cold (no cache) | ~2 to 3 min | ≤ 2 min |
| Warm (one source file changed) | ~2 to 3 min | < 30 s |
| docker-prep.sh pack step alone | ~60 to 90 s | < 30 s (pnpm pack cache) |

Fill in actuals during Phase C.


2. Goals & non-goals

Goals

  • Eliminate the F11 to F13 class of silent "build green, app broken" failures
  • Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
  • Make docker-prep.sh idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
  • Standardize .npmrc.docker across the ecosystem so the Gitea path actually works
  • Fix docker-compose.yml to pass GITEA_NPM_HOST + secrets so the registry path is usable without manual flags
  • Ship docker-doctor.sh CI lint as the durable insurance layer

Non-goals

  • Migrating off pnpm or off the Gitea registry
  • Adopting --frozen-lockfile until F2 is resolved (sibling-workspace problem)
  • Publishing @bytelyst/* to the public npm registry
  • Multi-platform builds (separate roadmap)

2.5 Canonical decisions

Decisions taken now to avoid contradictions later in the doc:

  • Base image: node:22-alpine is canonical. For repos blocked by the corporate proxy's Alpine SSL interception (currently only learning_ai_notes), the Dockerfile MUST expose:
    ARG BASE_IMAGE=node:22-alpine
    FROM ${BASE_IMAGE} AS builder
    
    Override per-repo via --build-arg BASE_IMAGE=node:22-slim. Document the override in the repo's AGENTS.md.
  • Healthcheck host: 127.0.0.1 (NOT localhost) in every docker-compose*.yml test: block. See F12.
  • Lockfile mode in Docker: --lockfile=false for now. --frozen-lockfile is blocked on the A3 ADR (F2).

3. Phase A — Correctness + build speed + path correctness

Order matters: A-pre must precede A0 (you can't build via a registry that serves broken metadata); A0 must precede A1+ (you can't optimize a path that doesn't work), and A8+A9 (correctness) must land before measuring speed wins.

A-pre. Make the Gitea registry actually usable from Docker (F16 + F17 + F18)

Owner: learning_ai_common_plat + per-product repo · Status: done for clock + global config.

Three distinct bugs surfaced during clock A0-V on 2026-05-27:

  • F16: Publish flow leaked workspace:* into published metadata.

  • F17: Gitea baked localhost:3300 into tarball URLs.

  • F18: Product repos had legacy file: refs to sibling packages.

  • A-pre-1. Audit publish-outdated-packages.sh — confirmed it uses pnpm pack then re-tars with npm pack, which loses workspace: rewriting.

  • A-pre-2. Patch publish script with a workspace:* rewriter + a post-rewrite grep guard. Shipped in common-plat@cfcfc7bb.

  • A-pre-3. Verify all packages publish with 0 workspace:* refs. Confirmed via curl scan across all 64 packages.

  • A-pre-4. F17 fix: set Gitea ROOT_URL=http://host.docker.internal:3300/, restart Gitea, add 127.0.0.1 host.docker.internal to /etc/hosts, add host.docker.internal to NO_PROXY in switch-network.sh, bulk republish all 64 packages. Shipped in common-plat@dd90f709.

  • A-pre-5. F18 fix: rewrite file:../../learning_ai_common_plat/packages/* refs in clock/web/package.json to * semver. Shipped in clock@8b5c767a3. Audit needed in Phase D for other product repos.

  • A-pre-6. Document Gitea config requirements (below).
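
The post-rewrite grep guard from A-pre-2 could look roughly like the sketch below. It is an illustration only, not the code shipped in common-plat@cfcfc7bb: the function name assert_no_workspace_refs and the idea of pointing it at an extracted package.json are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical publish-time guard (not the shipped script).
# Fails if a package.json still contains a pnpm "workspace:" specifier,
# which npm pack would otherwise pass through literally (F16).
assert_no_workspace_refs() {
  local manifest="$1"
  # -n prints the offending line numbers to help locate the leak.
  if grep -nE '"workspace:[^"]*"' "$manifest"; then
    echo "ERROR: unrewritten workspace: refs in $manifest" >&2
    return 1
  fi
  return 0
}
```

Calling it between the rewrite and the re-pack step would abort the publish before a broken manifest reaches the registry.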

A-pre-6. Gitea configuration prerequisites (one-time per dev machine)

The Gitea registry MUST be configured with ROOT_URL=http://host.docker.internal:3300/ so published tarball URLs are reachable from inside Docker containers. The host /etc/hosts MUST resolve host.docker.internal to 127.0.0.1 so the same URLs work from the host shell.

On macOS (Homebrew Gitea):

# 1. Edit Gitea's app.ini
sudo -e /opt/homebrew/var/gitea/custom/conf/app.ini
#   change:   ROOT_URL = http://localhost:3300/
#   to:       ROOT_URL = http://host.docker.internal:3300/

# 2. Restart Gitea
brew services restart gitea

# 3. Add /etc/hosts entry so host.docker.internal resolves on the host too
sudo sh -c 'grep -q host.docker.internal /etc/hosts || \
  echo "127.0.0.1       host.docker.internal" >> /etc/hosts'

# 4. Ensure host.docker.internal is in NO_PROXY for corp shells
# (already done in switch-network.sh as of common-plat@dd90f709)
source ~/.zshrc   # reload

# 5. Verify
curl -sS http://host.docker.internal:3300/api/v1/version
# expected: {"version":"1.25.5"} or similar

A0. Make the Gitea-registry path actually work (clock + peakpulse)

  • A0-1. Standardize .npmrc.docker to use templated host AND owner so it works on host (localhost) and inside Docker (host.docker.internal), and so future owner renames are a one-line env change:

    @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
    //${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
    strict-ssl=false
    auto-install-peers=true
    

    ⚠️ Env-var expansion chain: pnpm expands ${VAR} in .npmrc at read time using the current process environment (see pnpm npmrc docs). That means the Dockerfile MUST declare ARG GITEA_NPM_HOST + ARG GITEA_NPM_OWNER, then ENV GITEA_NPM_HOST=$GITEA_NPM_HOST / ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER, before the pnpm install RUN line, AND the GITEA_NPM_TOKEN must be exported from the BuildKit secret mount inside the same RUN (since secrets don't persist as env across layers).

    Note on F14: The canonical .npmrc (host-side) template already uses ${GITEA_NPM_OWNER} (shipped in common-plat commit 610a59fd). .npmrc.docker lagged behind because Docker builds have a separate file — A0-1 brings them into parity.

  • A0-2. Remove pnpm-lock.yaml from .dockerignore in both repos (fixes F1; harmless under --lockfile=false since we don't COPY it, but unblocks future A3)

  • A0-3. Add GITEA_NPM_HOST + GITEA_NPM_OWNER build args + secrets: block to every service in docker-compose.yml:

    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
    secrets:
      gitea_npm_token:
        environment: GITEA_NPM_TOKEN
    
  • A0-4. Add extra_hosts: ["host.docker.internal:host-gateway"] to each service so Linux Docker can resolve the host

  • A0-5. Document required env: GITEA_NPM_TOKEN must be exported in the shell that runs docker compose build (add to repo README.md quickstart). Reference bash ../learning_ai_common_plat/scripts/gitea/token.sh status for verification.

  • A0-D. Run gitea-doctor before any Docker build (addresses F15). Inline into deploy/CI workflows:

    bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
    docker compose build
    
    • Locally: shell alias or Makefile target make build that runs doctor then docker compose build.
    • In Gitea Actions CI: a pre-job step. If doctor exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with ERR_PNPM_AUTHENTICATION.
  • A0-V. Verification gate (between A0 and A1): build the registry path without any cache-mount or layer optimizations. Confirm docker compose build --no-cache succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.

    2026-05-27 status — clock A0-V: PASSED (third attempt, after F16, F17, F18 fixed). Cold-build wall-clock:

    • backend: 59.2 s (commits: clock@0be887288 + common-plat@cfcfc7bb + common-plat@dd90f709)
    • web: 3:13 (193 s) (commits: above + clock@8b5c767a3)

    Both surfaces resolve @bytelyst/* from the Gitea registry end-to-end — no docker-prep.sh tarballs, no sibling file: refs, no proxy interference. See §3.A7 metrics table.
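
The A0-D gate (doctor first, build only if it passes) can be sketched as a small wrapper. This is a minimal sketch under assumptions: guarded_build, DOCTOR_CMD and BUILD_CMD are hypothetical names introduced here so the gate can be exercised without Docker; the defaults mirror the two commands shown in A0-D.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a pre-flight check and only build if it passes.
# DOCTOR_CMD and BUILD_CMD are overridable (assumption, not an existing convention).
DOCTOR_CMD="${DOCTOR_CMD:-bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet}"
BUILD_CMD="${BUILD_CMD:-docker compose build}"

guarded_build() {
  # Intentionally unquoted: the commands are simple space-separated words.
  if ! $DOCTOR_CMD; then
    echo "ERROR: gitea doctor failed; skipping build" >&2
    return 1
  fi
  $BUILD_CMD "$@"
}
```

A Makefile target or CI pre-job step could call guarded_build backend so a stale token fails fast instead of minutes into pnpm install.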

A1. Replace npm install -g pnpm@X with corepack

  • A1-1. Replace RUN npm install -g pnpm@10.6.5 with:
    RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
    
  • A1-2. Verify packageManager field in backend/package.json and web/package.json matches (already pnpm@10.6.5 in peakpulse backend)
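
The A1-2 check can be scripted without jq. The sketch below is an assumption-laden illustration: package_manager_of and check_package_manager are hypothetical helper names, and the sed expression assumes the packageManager key and value sit on one line.

```shell
#!/usr/bin/env bash
# Extract the "packageManager" value from a package.json without jq.
package_manager_of() {
  sed -n 's/.*"packageManager"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeeds only when the manifest pins exactly the expected pnpm version.
check_package_manager() {
  local manifest="$1" expected="$2" actual
  actual="$(package_manager_of "$manifest")"
  # The field may carry a "+sha..." integrity suffix; strip it before comparing.
  actual="${actual%%+*}"
  if [ "$actual" != "$expected" ]; then
    echo "MISMATCH: $manifest has '${actual:-<none>}', expected '$expected'" >&2
    return 1
  fi
}
```

Example: check_package_manager backend/package.json pnpm@10.6.5 returns non-zero when the Dockerfile pin and the manifest have drifted.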

A2. Add BuildKit pnpm-store cache mount

  • A2-1. Set # syntax=docker/dockerfile:1.7 directive at top of every Dockerfile
  • A2-2. Wrap install step with cache + secret mount:
    RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
        --mount=type=secret,id=gitea_npm_token \
        export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
        pnpm install --ignore-scripts --lockfile=false
    
  • A2-3. Verify cache mount is active: docker buildx du --filter type=exec.cachemount shows non-zero size after a build. Real success metric is wall-clock: warm rebuild (touching one source file) drops to < 30 s.

A3. Decide lockfile policy (BLOCKED on F2 resolution)

Two options — pick one in a short ADR before implementing:

  • Option 1: Keep --lockfile=false (current pragmatic approach)

    • No sibling-workspace complications
    • No reproducibility guarantee inside Docker
    • Slower installs (full resolution every build)
  • Option 2: Generate a Docker-only lockfile via pnpm install --lockfile-only against a flattened package.json that resolves @bytelyst/* to semver

    • Reproducibility
    • Faster installs
    • New build step + tooling
    • Drift risk between dev lockfile and Docker lockfile
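  • A3-1. Write 1-page ADR and pick Option 1 or 2. Status (v8): ADR-0001 is accepted and recorded at docs/adr/0001-docker-build-lockfile-policy.md; it keeps --lockfile=false short-term and targets a vendored pnpm-lock.docker.yaml long-term.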

  • A3-2. Defer --frozen-lockfile adoption until ADR lands

A4. Restructure layer order

  • A4-1. Reorder COPY/RUN so deps-install layer is package.json + .npmrc.docker ONLY, then a separate layer for src/, config files, shared/
  • A4-2. Move all ARG lines that affect deps install before the install step; move NEXT_PUBLIC_* ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)

A5. Gate .docker-deps/ behind a build arg

  • A5-1. Add ARG USE_TARBALLS=false to Dockerfile
  • A5-2. Use wildcard COPY so missing dir doesn't break the build:
    RUN mkdir -p /app/.docker-deps
    COPY .docker-deps* /app/.docker-deps/
    
  • A5-3. Verify .docker-deps/ is in .gitignore and .dockerignore does NOT exclude it when tarball mode is in use

A6. .dockerignore audit

  • A6-1. Confirm exclusions: node_modules, **/node_modules, dist, .next, *.log, .env, .env.*, .git, *.bak
  • A6-2. Remove: pnpm-lock.yaml exclusion (was correct under --lockfile=false, blocks future optimization)
  • A6-3. Confirm .docker-deps/ is NOT excluded when tarball path is active
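
The A6 audit can be approximated with exact-line matching. This is a sketch, not an existing tool: audit_dockerignore is a hypothetical name, only a subset of the A6-1 patterns is checked, and a pattern written differently (for example node_modules/ instead of node_modules) would be reported as missing.

```shell
#!/usr/bin/env bash
# Hypothetical audit: report required patterns missing from a .dockerignore
# and patterns that should no longer be excluded (A6-2).
audit_dockerignore() {
  local file="$1" rc=0 pat
  for pat in node_modules dist .next '*.log' .env .git; do
    # -x: whole-line match, -F: treat the pattern as a fixed string.
    if ! grep -qxF -- "$pat" "$file"; then
      echo "MISSING: $pat"; rc=1
    fi
  done
  for pat in pnpm-lock.yaml; do
    if grep -qxF -- "$pat" "$file"; then
      echo "SHOULD-NOT-EXCLUDE: $pat"; rc=1
    fi
  done
  return "$rc"
}
```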

A7. Measure & record

| Repo | Surface | Cold (A0-V) | Cold (post-A2) | Warm (post-A2) | Notes |
|------|---------|-------------|----------------|----------------|-------|
| clock | backend | 59.2 s | 64.7 s | 2.9 s | Cold essentially flat (corepack adds ~1 s; cache mount empty on first run). Warm → 95.1% reduction. Commits: clock@8b5c767a3 (A0-V), clock@f6a806ff3 (A1+A8+A9), clock@55e8d22d3 (A2+A5+A6) |
| clock | web | 193 s (3:13) | 291 s (4:51) † | 5.4 s | Warm → 97.2% reduction. † Cold variance, see footer |
| peakpulse | backend | n/a (was tarball-only path) | 72.2 s | 2.7 s | Warm → 96.3% reduction. Commits: peakpulse@11a6bc5 (Phase A), peakpulse@6523a1a (.gitkeep fix), clock@1465e06b1+d69003c1f (mirror .gitkeep fix) |

Footer note on cold-build variance. Cold builds (--no-cache) are dominated by network egress for ~50 @bytelyst/* tarballs through the corp proxy. A second measurement of clock web cold-build came in at 291 s vs 174 s in the previous step — same Dockerfile path, different network-side latency. Cold build is not the optimization target of this roadmap; warm rebuild is. Run pnpm store prune on the host or use a local registry mirror if cold-build determinism is needed.

Measurement commands:

# Cold (clear all layer cache; cache mounts may still persist)
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend

# Warm (one source file changed; deps unchanged)
touch backend/src/server.ts
time DOCKER_BUILDKIT=1 docker compose build backend

# Deps-changed (touch package.json; pnpm store cache helps here)
touch backend/package.json
time DOCKER_BUILDKIT=1 docker compose build backend
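
The "Warm → N% reduction" figures in the metrics table follow from (1 - warm / cold) × 100. A small helper to recompute them when new rows are added (reduction_pct is a hypothetical name; the clock rows use the A0-V cold time as the baseline, the peakpulse row uses the post-A2 cold time):

```shell
#!/usr/bin/env bash
# Percentage reduction from a cold build time to a warm one, one decimal place.
# Usage: reduction_pct <cold_seconds> <warm_seconds>
reduction_pct() {
  awk -v cold="$1" -v warm="$2" 'BEGIN { printf "%.1f\n", (1 - warm / cold) * 100 }'
}

# reduction_pct 59.2 2.9    # clock backend     -> 95.1
# reduction_pct 193 5.4     # clock web         -> 97.2
# reduction_pct 72.2 2.7    # peakpulse backend -> 96.3
```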

A8. Config-file COPY audit & canonical pattern (addresses F11, F13)

  • A8-1. For every Dockerfile in scope, list all build-time files present in the surface directory (web/ or backend/) that affect the build:
    • postcss.config.{js,mjs,cjs,ts}
    • tailwind.config.{js,mjs,cjs,ts}
    • next.config.{js,mjs,ts}
    • tsconfig*.json
    • package.json
    • .npmrc.docker, .npmrc
    • babel.config.* (if present)
    • drizzle.config.* (if present)
    • vitest.config.* (only if the build needs it)

    Verify each is COPY'd in the Dockerfile.
  • A8-2. Choose canonical COPY pattern. Decision: middle-ground glob for web surfaces:
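    # Dockerfile COPY wildcards follow Go's filepath.Match, which has no {a,b} brace expansion, so list each extension
    COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./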
    COPY web/public/ ./public/
    COPY web/src/ ./src/
    
    Trade-off: glob picks up unintended root-level files if any are added later, but dramatically reduces F11/F13 risk. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
  • A8-3. Repo-by-repo migration: replace enumerated COPY web/foo ./foo with the glob pattern; verify the resulting image has all expected files via docker run --rm <img> ls -la.
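
The A8-1 audit can be roughed out as a script. This is a sketch under assumptions: uncopied_configs is a hypothetical name, only four of the file families from A8-1 are checked, and "covered" is approximated as "some COPY line mentions the file name or a *.ext wildcard", which is looser than real Dockerfile semantics.

```shell
#!/usr/bin/env bash
# Hypothetical audit: print root-level config files in a surface directory
# that the Dockerfile neither names explicitly nor covers with a "*.ext" glob.
uncopied_configs() {
  local surface="$1" dockerfile="$2" f base ext
  for f in "$surface"/postcss.config.* "$surface"/tailwind.config.* \
           "$surface"/next.config.* "$surface"/tsconfig*.json; do
    [ -e "$f" ] || continue        # an unmatched glob stays literal; skip it
    base="$(basename "$f")"
    ext="${base##*.}"
    # Covered if any COPY line mentions the file name or a "*.<ext>" wildcard.
    if grep -E '^[[:space:]]*COPY[[:space:]]' "$dockerfile" \
         | grep -qF -e "$base" -e "*.$ext"; then
      continue
    fi
    echo "$base"
  done
}
```

An empty result for uncopied_configs web web/Dockerfile means no F11(b)/F13 candidates were found among the checked families.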

A9. Healthcheck canonicalization (addresses F12)

  • A9-1. Replace localhost with 127.0.0.1 in every docker-compose*.yml healthcheck test: block. Sweep with:
    rg -l 'http://localhost' --glob 'docker-compose*.yml'
    
  • A9-2. Standardize healthcheck shape:
    • Alpine-based images:
      healthcheck:
        test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
        interval: 30s
        timeout: 5s
        retries: 3
        start_period: 10s
      
    • Slim/Debian images (wget not always present, but node is):
      healthcheck:
        test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
      
  • A9-3. Add start_period (10s minimum) — prevents flaky "container started but app not yet listening" false-negatives.
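
Where rg is not installed (CI images, for instance), the A9-1 sweep can be approximated with plain grep. A minimal sketch: find_localhost_healthchecks is a hypothetical name, and it assumes the healthcheck URL sits on the same line as test:.

```shell
#!/usr/bin/env bash
# Hypothetical check: list healthcheck "test:" lines that still use localhost.
# Prints "file:line:content" for each offender and returns 1 if any exist.
find_localhost_healthchecks() {
  local hits
  hits="$(grep -nHE 'test:.*https?://localhost' "$@" 2>/dev/null || true)"
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  return 0
}
```

Example: find_localhost_healthchecks docker-compose*.yml as a CI step fails the job when an F12-style healthcheck is reintroduced.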

4. Phase B — Hermetic-fallback polish (docker-prep.sh)

docker-prep.sh is duplicated with minor variations across product repos. Promotion to canonical home is now in Phase B, not Phase D — drift compounds linearly with time and the .npmrc template precedent proves the pattern is cheap.

  • B1. Add --dry-run flag — list packs/rewrites, no side effects
  • B2. Idempotency guard — refuse to run if any *.bak exists unless --force
  • B3. Ensure .docker-deps/ and *.bak are in .gitignore of every pilot repo
  • B4. Pre-commit hook (husky) — block commits containing rewritten package.json, staged tarballs, OR .bak files:
    # .husky/pre-commit
    if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
      echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
      exit 1
    fi
    if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|package\.json\.bak)$'; then
      echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
      exit 1
    fi
    
  • B5. Auto-restore on script error via trap restore_on_error EXIT (unless --keep passed)
  • B6. Update script header comment per § 7.4 template
  • B7. CANONICAL HOME (was deferred — now in Phase B proper).
    • B7-1. Move script to learning_ai_common_plat/scripts/docker-prep.template.sh
    • B7-2. Add learning_ai_common_plat/scripts/sync-docker-prep.sh to copy template into all product repos (mirrors sync-npmrc.sh)
    • B7-3. Add learning_ai_common_plat/scripts/check-docker-prep-drift.sh for CI (mirrors check-npmrc-drift.sh)
    • B7-4. Update every repo's AGENTS.md with the "NEVER edit docker-prep.sh directly" warning + template link
  • B8. Add --strip-overrides option that removes pnpm.overrides block after build — safety net in case --restore is forgotten
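
B2 (idempotency guard) and B5 (auto-restore) could be structured roughly as below. This is a sketch, not the current docker-prep.sh: guard_no_backups and restore_backups are hypothetical names, and it assumes backups are written as package.json.bak next to the rewritten manifest.

```shell
#!/usr/bin/env bash
# Hypothetical B2 guard: refuse to run if any *.bak exists unless --force.
guard_no_backups() {
  local root="$1" force="${2:-}"
  if [ "$force" != "--force" ] &&
     find "$root" -name '*.bak' -not -path '*/node_modules/*' | grep -q .; then
    echo "ERROR: *.bak files exist; run --restore or pass --force" >&2
    return 1
  fi
}

# Hypothetical B5 restore: move every package.json.bak back over package.json.
restore_backups() {
  local root="$1" bak
  find "$root" -name 'package.json.bak' -not -path '*/node_modules/*' |
    while IFS= read -r bak; do
      mv -f "$bak" "${bak%.bak}"
    done
}

# In the real script the restore would hang off an EXIT trap, for example:
#   trap 'rc=$?; [ "$rc" -ne 0 ] && restore_backups "$ROOT"' EXIT
```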

5. Phase C — Verification gates

Pilot exit criteria (must all pass before Phase D):

  • C1. Cold Docker build succeeds on both pilots via Gitea-registry path (no docker-prep.sh invocation)
  • C2. Warm rebuild (single source file touched) < 30 s on both pilots
  • C3. docker-prep.sh → docker compose build → --restore leaves git status clean
  • C4. Pre-commit hook blocks: (a) rewritten package.json, (b) staged .tgz, (c) staged .bak
  • C5. Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
  • C6. Build-time metrics filled into the table in § 3.A7
  • C7. ADR recorded for A3 (lockfile policy)
  • C8. docker-doctor.sh (Phase E) runs clean against both pilots
  • C9. Smoke test: render the web app, inspect <head> for non-trivial CSS bundle (> 50 KB), confirm Tailwind classes apply. Guard against F11 regression.

6. Phase D — Ecosystem rollout (deferred until § 5 passes)

Apply Phase A + B + E to remaining repos. Pilots excluded.

| Repo | Backend | Web | docker-prep | Healthcheck | Notes |
|------|---------|-----|-------------|-------------|-------|
| learning_ai_notes | | | | | BASE_IMAGE=node:22-slim override (corp proxy Alpine SSL) |
| learning_ai_fastgap | | | | | Mobile + web + backend |
| learning_ai_jarvis_jr | | | | | F12 incident already fixed; verify regression-proof |
| learning_ai_flowmonk | | | | | .npmrc.docker is tarball-only; needs A0-1 |
| learning_ai_trails | | | | | |
| learning_ai_local_memory_gpt | | | | | SQLite-based; F11(b) already fixed 07cdf6b; verify regression-proof |
| learning_multimodal_memory_agents (MindLyst) | | | | | KMP repo, different layout |
| learning_voice_ai_agent (LysnrAI) | | | | | Python desktop + TS dashboards |
| learning_ai_efforise | | | | | |
| learning_ai_auth_app | n/a | n/a | | | iOS/Android; no Docker surfaces |
| learning_ai_talk2obsidian | | | | | Single-container app |

7. Reference snippets

7.1 Canonical .npmrc.docker

Matches the host-side .npmrc template shipped in common-plat 610a59fd.

@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
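
A lint for this template could reject the F4 and F14 patterns before a build starts. The sketch below is illustrative only: lint_npmrc_docker is a hypothetical name, and the owner check simply looks for a literal path segment between /api/packages/ and /npm/.

```shell
#!/usr/bin/env bash
# Hypothetical lint for .npmrc.docker: require the templated host and owner,
# reject a hardcoded localhost registry (F4) or a literal owner in the URL (F14).
lint_npmrc_docker() {
  local file="$1" rc=0
  grep -qF '${GITEA_NPM_HOST}' "$file" || { echo "missing \${GITEA_NPM_HOST}"; rc=1; }
  grep -qF '${GITEA_NPM_OWNER' "$file" || { echo "missing \${GITEA_NPM_OWNER}"; rc=1; }
  if grep -qE '^[^#]*//localhost(:[0-9]+)?/' "$file"; then
    echo "hardcoded localhost registry"; rc=1
  fi
  # A templated owner starts with "$", which this character class excludes.
  if grep -qE '/api/packages/[A-Za-z0-9_-]+/npm/' "$file"; then
    echo "hardcoded owner in registry URL"; rc=1
  fi
  return "$rc"
}
```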

7.2 Canonical backend Dockerfile

# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend

ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build

# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]

--lockfile=false is intentional pending the A3 ADR. Switch to --frozen-lockfile only once the sibling-workspace problem (F2) is resolved.

7.3 Canonical docker-compose.yml service block

services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      - PORT=4010
      # ...
    restart: unless-stopped
    healthcheck:
      # F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN

7.4 Hardened docker-prep.sh header

#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
#   - Local Gitea registry (:3300) is down or unreachable, OR
#   - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
#   docker compose build
#
# Usage:
#   ./scripts/docker-prep.sh             # pack tarballs + rewrite package.json
#   ./scripts/docker-prep.sh --dry-run   # show what would change (no side effects)
#   ./scripts/docker-prep.sh --force     # override idempotency guard
#   ./scripts/docker-prep.sh --restore   # undo rewrite
#   ./scripts/docker-prep.sh --keep      # skip auto-restore on error
#   ./scripts/docker-prep.sh --strip-overrides  # remove pnpm.overrides block
#
# Side effects:
#   - Creates .docker-deps/ (gitignored)
#   - Backs up package.json → package.json.bak
#   - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
#   - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
#   - Refuses to run if .bak files already exist (unless --force)
#   - Auto-restores on error (trap EXIT) unless --keep passed
#   - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak

7.5 Canonical Next.js web Dockerfile (addresses F11, F13)

# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web

ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json

# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/

ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1

RUN corepack enable && pnpm run build

# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1

COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public

EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]

Verification step after every web Dockerfile change: smoke-test the built image by running it and curling the rendered HTML. Confirm that each CSS bundle referenced by a <link> tag is larger than 50 KB. A bundle of ~33 KB is the F11 signature (only @font-face rules, no Tailwind utilities).
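
A minimal sketch of that smoke test, split into two small helpers so the logic can be exercised without a running container. The helper names (`css_hrefs`, `css_size_ok`) and the exact 51200-byte threshold are assumptions for illustration. The commented loop assumes the container is published on 127.0.0.1:3000.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the byte threshold are illustrative.

# Print every stylesheet href found in HTML read from stdin, one per line.
css_hrefs() {
  # "|| true" tolerates pages with no stylesheet when pipefail is enabled.
  grep -oE 'href="[^"]+\.css[^"]*"' | sed -E 's/^href="//; s/"$//' || true
}

# Succeed only when a byte count exceeds the minimum (default 50 KB).
css_size_ok() {
  local bytes="$1" min="${2:-51200}"
  [ "$bytes" -gt "$min" ]
}

# Example wiring against a running container (left commented out here):
#   html="$(curl -fsS http://127.0.0.1:3000/)"
#   printf '%s\n' "$html" | css_hrefs | while read -r href; do
#     bytes="$(curl -fsS "http://127.0.0.1:3000${href}" | wc -c)"
#     css_size_ok "$bytes" || echo "F11 suspect: ${href} is ${bytes} bytes"
#   done
```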

7.6 docker-doctor.sh skeleton (Phase E)

#!/usr/bin/env bash
# docker-doctor.sh — pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail

REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0

# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile
for cfg in postcss.config tailwind.config next.config; do
  for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
    [[ -f "$f" ]] || continue
    base=$(basename "$f")
    if ! grep -q "COPY web/${base}\\|COPY web/\\*" "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
      echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
      FAILED=1
    fi
  done
done

# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
  echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
  FAILED=1
fi

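# Check 3 (F4/F14): .npmrc.docker uses both registry placeholders
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
  for var in GITEA_NPM_HOST GITEA_NPM_OWNER; do
    # -F matches the literal text ${VAR} rather than treating it as a regex
    if ! grep -qF "\${${var}}" "$REPO_DIR/.npmrc.docker"; then
      echo "✗ F4/F14: .npmrc.docker doesn't use \${${var}} placeholder"
      FAILED=1
    fi
  done
fi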

# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
  echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi

# Check 5: base image is on approved list
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
  [[ -f "$df" ]] || continue
  if ! grep -qE 'FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))' "$df"; then
    echo "✗ Unapproved base image in $df"
    FAILED=1
  fi
done

exit $FAILED

8. Phase E — Observability / lint (NEW)

Two complementary linters:

  1. gitea-doctor — Gitea registry pre-flight (env + token + connectivity). Already shipped in common-plat commit 610a59fd at scripts/gitea/doctor.sh. This roadmap only wires it into CI/build flows (A0-D + E0 below).
  2. docker-doctor — Dockerfile + compose-file static linter (see § 7.6 skeleton). To be built as part of this roadmap.

The two are intentionally separate concerns:

| Linter | Scope | When to run |
|---|---|---|
| gitea-doctor | runtime env, token, registry HTTP 200 | Before every build / deploy |
| docker-doctor | static analysis of Dockerfile + compose YAML | On every PR touching those files |

Phase E checklist

  • E0. Wire bash scripts/gitea/doctor.sh --quiet into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in common-plat; replicate via a reusable actions/gitea-preflight@main composite if Gitea Actions supports it, otherwise inline.
  • E1. Land docker-doctor.sh in learning_ai_common_plat/scripts/ (canonical, mirrors gitea/doctor.sh shipped earlier)
  • E2. Provide a thin per-repo wrapper at scripts/docker-doctor.sh that calls the canonical script
  • E3. Wire into CI: run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore, .npmrc.docker
  • E4. Wire into pre-commit hook (warning-only at first, error after 2 weeks)
  • E5. Document checks in learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md (sibling doc to the existing gitea-doctor patterns)
  • E6. Add make doctor target to each pilot repo that runs both gitea-doctor AND docker-doctor
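
The "warning-only at first, error later" rollout in E4 could be handled by a small wrapper like the one below. This is a sketch under assumptions: `run_doctor` and the `DOCTOR_MODE` variable are hypothetical and appear nowhere in the existing scripts.

```shell
#!/usr/bin/env bash
# Sketch only: run_doctor and DOCTOR_MODE are hypothetical names.

# Run a doctor command. In "warn" mode a failure is reported but never
# blocks; in "error" mode (the default here) its exit code is propagated.
run_doctor() {
  local mode rc
  mode="${DOCTOR_MODE:-error}"
  rc=0
  "$@" || rc=$?
  if [ "$rc" -ne 0 ] && [ "$mode" = "warn" ]; then
    echo "⚠ doctor reported problems (warning-only mode, not blocking)" >&2
    return 0
  fi
  return "$rc"
}

# Example: a pre-commit hook during the warning-only period
#   DOCTOR_MODE=warn run_doctor ./scripts/docker-doctor.sh
```

With this shape, moving from warning to error after the grace period means changing one variable in the hook, and the linter itself stays untouched.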

Checks implemented by docker-doctor.sh:

| Check | Addresses | Action |
|---|---|---|
| Every web/*.config.* file is COPY'd | F11, F13 | Error |
| docker-compose.yml healthcheck uses 127.0.0.1 | F12 | Error |
| .npmrc.docker uses ${GITEA_NPM_HOST} AND ${GITEA_NPM_OWNER} placeholders | F4, F14 | Error |
| Dockerfile declares ARG GITEA_NPM_OWNER if it COPYs .npmrc.docker | F14 | Error |
| .dockerignore doesn't exclude pnpm-lock.yaml | F1 | Warn (until A3 ADR lands) |
| Base image is on approved list (node:22-alpine or node:22-slim via BASE_IMAGE ARG) | Canonical decision | Error |
| .docker-deps/ and *.bak in .gitignore | B3 | Error |
| docker-compose.yml passes GITEA_NPM_OWNER build arg | F14 | Warn |
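
The § 7.6 skeleton does not yet cover the "Dockerfile declares ARG GITEA_NPM_OWNER if it COPYs .npmrc.docker" row. One possible shape for that check is sketched here. The function name `check_owner_arg` is hypothetical and the regexes are a best guess at what counts as a match.

```shell
#!/usr/bin/env bash
# Sketch only: check_owner_arg is a hypothetical helper name.

# Return 0 when the Dockerfile is consistent: either it never COPYs
# .npmrc.docker, or it also declares ARG GITEA_NPM_OWNER.
check_owner_arg() {
  local df="$1"
  # No COPY of .npmrc.docker means the rule does not apply.
  grep -qE '^[[:space:]]*COPY[[:space:]].*\.npmrc\.docker' "$df" || return 0
  # Accept "ARG GITEA_NPM_OWNER" with or without a default value.
  grep -qE '^[[:space:]]*ARG[[:space:]]+GITEA_NPM_OWNER([=[:space:]]|$)' "$df"
}

# Example wiring inside docker-doctor.sh:
#   check_owner_arg "$REPO_DIR/web/Dockerfile" || {
#     echo "✗ F14: web/Dockerfile copies .npmrc.docker without ARG GITEA_NPM_OWNER"
#     FAILED=1
#   }
```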

9. Open questions (numbered TODOs, not blockers)

  1. Shared pnpm cache volume? BuildKit caches are already shared across builds by id=pnpm. Test whether a named Docker volume adds anything before adding complexity.
  2. Custom base image? Publish bytelyst/node-pnpm:22-{alpine,slim} with pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
  3. CI hostname? Verify that host.docker.internal:host-gateway works in Gitea Actions Linux runners, or whether a CI-specific Dockerfile variant is needed.
  4. Multi-platform builds? linux/amd64 + linux/arm64 interact awkwardly with cache mounts under buildx. Defer to separate roadmap.
  5. Workspace flattening? Eliminate the ../learning_ai_common_plat/packages/* workspace entry inside Docker via a flattened pnpm-workspace.yaml. Unlocks --frozen-lockfile. Requires lockfile regeneration step.

10. Execution order

  1. v5 commit: roadmap doc v5 lands; F16 documented (devops_tools@ba8b4d1).
  2. Phase A0 on learning_ai_clock — Dockerfile + compose changes landed in clock@0be887288. Initial A0-V blocked on F16/F17/F18.
  3. F16 fix in common-plat — workspace:* rewriter + defense-in-depth guard + republish of 10 affected packages (common-plat@cfcfc7bb).
  4. F17 fix in common-plat + Gitea config — ROOT_URL=host.docker.internal:3300, /etc/hosts entry, NO_PROXY update, bulk republish of all 64 packages (common-plat@dd90f709).
  5. F18 fix in clock — 4 file: refs in web/package.json rewritten to * (clock@8b5c767a3).
  6. A0-V on clock PASSED. v6 commit lands (devops_tools@7627d55).
  7. A8 + A9 + A1 on clock (correctness + corepack) — clock@f6a806ff3. Web cold dropped to 174 s; backend essentially flat at 60 s. F11 guard verified (Tailwind utilities present in CSS bundle).
  8. A2 + A4 + A5 + A6 on clock (cache mount + dockerignore), clock@55e8d22d3. Warm rebuilds: backend 2.9 s, web 5.4 s (95–97% reduction). A7 metrics table populated this commit.
  9. Phase A0 → A6 on learning_ai_peakpulse backend (peakpulse@11a6bc5). Cold 72.2 s, warm 2.7 s. Pattern from clock applied verbatim, plus a side fix for .docker-deps/.gitkeep discoverability that was also ported back to clock (peakpulse@6523a1a, clock@1465e06b1, clock@d69003c1f).
  10. A3 ADR accepted: docs/adr/0001-docker-build-lockfile-policy.md. Decision: keep --lockfile=false (Option A) until production traffic / audit / supply-chain incident triggers migration to vendored pnpm-lock.docker.yaml (Option C). Implementation deferred.
  11. ⏳ Phase B: harden docker-prep.sh on clock, then promote to canonical home in common-plat (B7) and sync to peakpulse.
  12. ⏳ Phase E: land docker-doctor.sh, wire into CI as warning, then error.
  13. ⏳ Phase C: verification gates C1–C9.
  14. ⏸ Phase D: scheduled separately, only after § 5 C-gates pass. STOP and request approval before starting.

11. Risk register

| Risk | Mitigation |
|---|---|
| Removing pnpm-lock.yaml from .dockerignore exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep --lockfile=false for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct id= per repo (id=pnpm-${repo}) if observed |
| host.docker.internal doesn't resolve in Linux Docker | extra_hosts: ["host.docker.internal:host-gateway"] (A0-4) |
| Removing .docker-deps/ from default builds breaks repos that haven't done A0 yet | Wildcard COPY .docker-deps* keeps both paths working during migration |
| docker-prep.sh --force is misused and .bak files get committed | Pre-commit hook (B4) blocks .bak, .tgz, rewritten package.json |
| Corp network blocks host.docker.internal:3300 | Verify SSH tunnel reaches Gitea; document in operations.md |
| F11 regression: build green, app ships with no CSS | C9 smoke test + Phase E docker-doctor.sh check on web/*.config.* COPY coverage |
| F12 regression: healthcheck false-fails on IPv6 | Phase E docker-doctor.sh grep for localhost in compose files |
| F13 regression: new config file added, Dockerfile forgotten | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| BASE_IMAGE override in notes diverges silently from canonical | Phase E check approved list; document override in repo AGENTS.md |
| F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile | Phase E docker-doctor.sh checks .npmrc.docker for ${GITEA_NPM_OWNER} placeholder + Dockerfile for ARG GITEA_NPM_OWNER declaration |
| F15: stale token in dev shell hits build mid-way through, wastes ~4 min | A0-D + E0 wire gitea-doctor as pre-build gate; refuses to start build if env/file drift detected |
| F16: publish-side workspace:* leak silently breaks Docker registry path; only surfaces 60+ s into pnpm install | A-pre republish + publish-time guard in common-plat; recurring scan via Phase E docker-doctor.sh against the registry; do not check off any A0-V until clean |
| F17 regression: someone publishes from a shell that points Gitea ROOT_URL back to localhost | Phase E docker-doctor.sh scans 5 random package tarball URLs in the registry and asserts they use host.docker.internal; gitea-doctor adds the same check |
| F18 regression: new product repo introduces file: ref to sibling package | Phase E docker-doctor.sh greps **/package.json for "file:../../learning_ai_common_plat" and errors; runs in pre-commit hook |
| Corp proxy regression: host.docker.internal falls out of NO_PROXY on a dev machine | switch-network.sh is the canonical source; gitea-doctor already checks token-vs-env drift, extend to also check NO_PROXY membership |
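
The F18 mitigation (grep every package.json for a file: reference into the sibling common-plat checkout) might look roughly like this. The function name `scan_file_refs` is hypothetical. The pattern is deliberately a little broader than the literal "file:../../learning_ai_common_plat" quoted in the register, so it also catches other relative depths; that widening is an assumption, not something the roadmap specifies.

```shell
#!/usr/bin/env bash
# Sketch only: scan_file_refs is a hypothetical helper name.

# List package.json files under a root that point at the sibling
# common-plat checkout through a file: specifier. Returns 1 when any
# offender is found, 0 otherwise. node_modules directories are skipped.
scan_file_refs() {
  local root="$1" hits
  hits="$(find "$root" -name node_modules -prune -o \
            -type f -name package.json \
            -exec grep -l '"file:[^"]*learning_ai_common_plat' {} + \
            2>/dev/null || true)"
  if [ -z "$hits" ]; then
    return 0
  fi
  printf '%s\n' "$hits" | sed 's/^/✗ F18: file: ref in /'
  return 1
}

# Example wiring inside docker-doctor.sh or a pre-commit hook:
#   scan_file_refs "$REPO_DIR" || FAILED=1
```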