Adds 2 new findings to the docker build optimization roadmap and updates
templates to consume the new GITEA_NPM_OWNER env var shipped in common-plat
commit 610a59fd.
- F14: hardcoded Gitea owner literal across 14 repos (now resolved upstream
via ${GITEA_NPM_OWNER:-learning_ai_user})
- F15: stale shell-env tokens (caught by scripts/gitea/doctor.sh)
- A0-1, A0-3, 7.1, 7.2, 7.5: snippets updated to thread GITEA_NPM_OWNER
through .npmrc.docker, Dockerfile ARG/ENV, and docker-compose build args
- A0-D: new step — run gitea-doctor.sh as pre-build gate (replaces
'wait 4 minutes for ERR_PNPM_AUTHENTICATION' with 'fail fast in 2 sec')
- Phase E: now distinguishes gitea-doctor (shipped) from docker-doctor (to
build). Adds two new docker-doctor checks for F14
- Risk register: F14/F15 mitigations called out explicitly
Docker Build Optimization Roadmap
Status: Draft v4 (Gitea hardening integrated) · Owner: Platform DevOps · Created: 2026-05-27 · Revised: 2026-05-27
Pilot Docker-build correctness + speed fixes on learning_ai_clock (web + backend) and learning_ai_peakpulse (backend), then capture the playbook here for ecosystem-wide rollout.
Upstream prerequisite shipped (commit 610a59fd in learning_ai_common_plat): Gitea owner parameterization + helper scripts (scripts/gitea/doctor.sh, scripts/gitea/token.sh). The .npmrc template now resolves owner from ${GITEA_NPM_OWNER:-learning_ai_user}. All A0-1 work in this roadmap inherits this: Dockerfile/.npmrc.docker must use the same ${GITEA_NPM_OWNER} placeholder, not a hardcoded literal.
0. Pre-flight audit findings (2026-05-27)
A read-only audit of pilot repos + lessons from recent live incidents surfaced 15 concrete bugs/gaps (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the inverse of the casual narrative: tarballs are the de facto default, the Gitea-registry path is partially wired, and there is a separate class of "build green, app broken" silent failures (F11–F13) that the speed-focused plan needs to address first.
| # | Finding | Location | Severity |
|---|---|---|---|
| F1 | pnpm-lock.yaml is in .dockerignore; any lockfile-based optimization is blocked until removed | peakpulse/.dockerignore, clock/.dockerignore | High |
| F2 | pnpm-workspace.yaml references sibling ../learning_ai_common_plat/packages/*; --frozen-lockfile inside Docker will fail unless workspace is flattened or sibling tree is copied | both pilots | High |
| F3 | peakpulse/.npmrc.docker is tarball-only (no @bytelyst:registry=… line); the "Gitea-registry" path doesn't work in this repo today | peakpulse/.npmrc.docker | High |
| F4 | clock/.npmrc.docker hardcodes http://localhost:3300; from inside Docker, localhost is the container, not the host registry | clock/.npmrc.docker | High |
| F5 | clock/backend/Dockerfile has neither ARG GITEA_NPM_HOST nor a BuildKit secret mount; wholly dependent on pre-populated .docker-deps/ | clock/backend/Dockerfile | High |
| F6 | clock/web/Dockerfile accepts ARG GITEA_NPM_HOST but never uses it; no --mount=type=secret either | clock/web/Dockerfile | Medium |
| F7 | peakpulse/docker-compose.yml does not pass GITEA_NPM_HOST build arg or declare secrets: block | peakpulse/docker-compose.yml | Medium |
| F8 | COPY .docker-deps/ is unconditional in every backend Dockerfile; every build requires docker-prep.sh to have run OR an empty .docker-deps/ dir to pre-exist | both repos | Medium |
| F9 | npm install -g pnpm@10.6.5 runs on every build (no corepack): 5–10 s overhead, no pinning to packageManager field | all four Dockerfiles | Low |
| F10 | No BuildKit --mount=type=cache for pnpm store; cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
| F11 | Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: next build succeeds, container is "healthy", but CSS bundle is ~33 KB (only @font-face) and all Tailwind classes are absent → UI renders unstyled. Two sub-bugs: (a) postcss.config.mjs missing entirely while @tailwindcss/postcss is in package.json (NoteLett, JarvisJr fixes dff459e, 36f6bc1); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes a308c6444, 07cdf6b). | */web/Dockerfile, */web/postcss.config.* | High |
| F12 | Healthcheck uses localhost, resolves to IPv6 ::1, false-fails. Backend listens on 0.0.0.0 (IPv4 only). wget --spider http://localhost:.../health hits ::1, connection refused, container marked "unhealthy", web service won't start due to depends_on: condition: service_healthy. Incident: learning_ai_jarvis_jr/docker-compose.yml. | every docker-compose*.yml healthcheck | Medium |
| F13 | Enumerated COPY web/foo ./foo pattern drifts from filesystem. New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | Medium |
| F14 | Hardcoded Gitea owner (learning_ai_user) literally embedded in .npmrc.docker + CI workflows + publish scripts across 14 repos. When the org was renamed from bytelyst → learning_ai_user, every repo needed a manual commit. Resolved upstream in common-plat (610a59fd): owner now resolves from ${GITEA_NPM_OWNER:-learning_ai_user}; scripts/gitea/{doctor,token}.sh ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | .npmrc.docker, Dockerfile ARG/ENV, CI workflows | Medium |
| F15 | Stale shell-env tokens. ~/.gitea_npm_token rotated on disk; long-lived shells still exported the old value. Caused 401s during docker compose build until source ~/.zshrc. Mitigation shipped: bash scripts/gitea/doctor.sh detects env-vs-file drift and refuses to proceed. Action required in this roadmap: wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
Implications:
- The original "switch to --frozen-lockfile + Gitea registry" plan requires two upstream fixes first (F1, F2).
- F11–F13 mean correctness fixes must precede speed fixes, otherwise we ship faster builds of broken apps.
- A linter (Phase E docker-doctor.sh) is the durable insurance against F11/F13 recurrence; both are silent in CI today.
1. Context: three build paths
| Path | Status today | Trigger | Notes |
|---|---|---|---|
| docker-prep.sh tarballs | De facto default in peakpulse + flowmonk; also works in clock/notes | Run docker-prep.sh then docker compose build | Hermetic; mutates package.json; slow to repack |
| Gitea NPM registry | Partially wired in clock + notes; broken in peakpulse | docker compose build with GITEA_NPM_HOST arg + secret | Needs .npmrc.docker standardization to be the default |
| Legacy file: refs | Deprecated | — | Removed during pnpm/Gitea migration |
Measurement targets
| Build | Baseline (observed) | Target after Phase A |
|---|---|---|
| Cold (no cache) | ~2–3 min | ≤ 2 min |
| Warm (one source file changed) | ~2–3 min | < 30 s |
| docker-prep.sh pack step alone | ~60–90 s | < 30 s (pnpm pack cache) |
Fill in actuals during Phase C.
2. Goals & non-goals
Goals
- ✅ Eliminate F11–F13 class of silent "build green, app broken" failures
- ✅ Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
- ✅ Make docker-prep.sh idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
- ✅ Standardize .npmrc.docker across the ecosystem so the Gitea path actually works
- ✅ Fix docker-compose.yml to pass GITEA_NPM_HOST + secrets so the registry path is usable without manual flags
- ✅ Ship docker-doctor.sh CI lint as the durable insurance layer
Non-goals
- ❌ Migrating off pnpm or off the Gitea registry
- ❌ Adopting --frozen-lockfile until F2 is resolved (sibling-workspace problem)
- ❌ Publishing @bytelyst/* to the public npm registry
- ❌ Multi-platform builds (separate roadmap)
2.5 Canonical decisions
Decisions taken now to avoid contradictions later in the doc:
- Base image: node:22-alpine is canonical. For repos blocked by the corporate proxy's Alpine SSL interception (currently only learning_ai_notes), the Dockerfile MUST expose:
      ARG BASE_IMAGE=node:22-alpine
      FROM ${BASE_IMAGE} AS builder
  Override per-repo via --build-arg BASE_IMAGE=node:22-slim. Document the override in the repo's AGENTS.md.
- Healthcheck host: 127.0.0.1 (NOT localhost) in every docker-compose*.yml test: block. See F12.
- Lockfile mode in Docker: --lockfile=false for now. --frozen-lockfile is blocked on the A3 ADR (F2).
3. Phase A — Correctness + build speed + path correctness
Order matters: A0 must precede A1+ (you can't optimize a path that doesn't work), and A8+A9 (correctness) must land before measuring speed wins.
A0. Make the Gitea-registry path actually work (clock + peakpulse)
- A0-1. Standardize .npmrc.docker to use templated host AND owner so it works on host (localhost) and inside Docker (host.docker.internal), and so future owner renames are a one-line env change:
      @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
      //${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
      strict-ssl=false
      auto-install-peers=true
  ⚠️ Env-var expansion chain: pnpm expands ${VAR} in .npmrc at read time using the current process environment (see pnpm npmrc docs). That means the Dockerfile MUST do ARG GITEA_NPM_HOST + ARG GITEA_NPM_OWNER → ENV GITEA_NPM_HOST=$GITEA_NPM_HOST / ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER before the pnpm install RUN line, AND the GITEA_NPM_TOKEN must be exported from the BuildKit secret mount inside the same RUN (since secrets don't persist as env across layers).
  Note on F14: The canonical .npmrc (host-side) template already uses ${GITEA_NPM_OWNER} (shipped in common-plat commit 610a59fd). .npmrc.docker lagged behind because Docker builds have a separate file; A0-1 brings them into parity.
- A0-2. Remove pnpm-lock.yaml from .dockerignore in both repos (fixes F1; harmless under --lockfile=false since we don't COPY it, but unblocks future A3)
- A0-3. Add GITEA_NPM_HOST + GITEA_NPM_OWNER build args + secrets: block to every service in docker-compose.yml:
      build:
        context: .
        dockerfile: backend/Dockerfile
        args:
          GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
          GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
        secrets:
          - gitea_npm_token
      secrets:
        gitea_npm_token:
          environment: GITEA_NPM_TOKEN
- A0-4. Add extra_hosts: ["host.docker.internal:host-gateway"] to each service so Linux Docker can resolve the host. Service-level extra_hosts applies only to running containers; because pnpm install reaches Gitea at build time, the same mapping must also appear under the service's build: block.
- A0-5. Document required env: GITEA_NPM_TOKEN must be exported in the shell that runs docker compose build (add to repo README.md quickstart). Reference bash ../learning_ai_common_plat/scripts/gitea/token.sh status for verification.
- A0-D. Run gitea-doctor before any Docker build (addresses F15). Inline into deploy/CI workflows:
      bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
      docker compose build
  - Locally: shell alias or Makefile target make build that runs doctor then docker compose build.
  - In Gitea Actions CI: a pre-job step. If doctor exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with ERR_PNPM_AUTHENTICATION.
- A0-V. Verification gate (between A0 and A1): build the registry path without any cache-mount or layer optimizations. Confirm docker compose build --no-cache succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
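The A0-D gate can be sketched as a small wrapper for local use (for example behind a make build target). This is a minimal sketch, not the shipped tooling: run_gated_build, DOCTOR_CMD, and BUILD_CMD are hypothetical names introduced here so the gate can be exercised without Docker; the defaults are the two commands quoted in A0-D.

```shell
#!/usr/bin/env bash
# Pre-build gate sketch: run the pre-flight check, start the build only if it passes.
# DOCTOR_CMD and BUILD_CMD are hypothetical override hooks for this sketch;
# their defaults are the commands quoted in A0-D.

run_gated_build() {
  doctor_cmd="${DOCTOR_CMD:-bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet}"
  build_cmd="${BUILD_CMD:-docker compose build}"

  # Word splitting of the command strings is intentional here.
  if ! $doctor_cmd; then
    echo "pre-flight failed: skipping build" >&2
    return 1
  fi
  $build_cmd
}
```

Calling run_gated_build with no overrides runs doctor first and only then docker compose build, so a stale token surfaces in seconds instead of partway through pnpm install.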
A1. Replace npm install -g pnpm@X with corepack
- A1-1. Replace RUN npm install -g pnpm@10.6.5 with:
      RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
- A1-2. Verify packageManager field in backend/package.json and web/package.json matches (already pnpm@10.6.5 in peakpulse backend)
A2. Add BuildKit pnpm-store cache mount
- A2-1. Set # syntax=docker/dockerfile:1.7 directive at top of every Dockerfile
- A2-2. Wrap install step with cache + secret mount:
      RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
          --mount=type=secret,id=gitea_npm_token \
          export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
          pnpm install --ignore-scripts --lockfile=false
- A2-3. Verify cache mount is active: docker buildx du --filter type=exec.cachemount shows non-zero size after a build. Real success metric is wall-clock: warm rebuild (touching one source file) drops to < 30 s.
A3. Decide lockfile policy (BLOCKED on F2 resolution)
Two options — pick one in a short ADR before implementing:
- Option 1: Keep --lockfile=false (current pragmatic approach)
  - ✅ No sibling-workspace complications
  - ❌ No reproducibility guarantee inside Docker
  - ❌ Slower installs (full resolution every build)
- Option 2: Generate a Docker-only lockfile via pnpm install --lockfile-only against a flattened package.json that resolves @bytelyst/* to semver
  - ✅ Reproducibility
  - ✅ Faster installs
  - ❌ New build step + tooling
  - ❌ Drift risk between dev lockfile and Docker lockfile
- A3-1. Write 1-page ADR (docs/decisions/0001-docker-lockfile-policy.md) and pick Option 1 or 2
- A3-2. Defer --frozen-lockfile adoption until ADR lands
A4. Restructure layer order
- A4-1. Reorder COPY/RUN so deps-install layer is package.json + .npmrc.docker ONLY, then a separate layer for src/, config files, shared/
- A4-2. Move all ARG lines that affect deps install before the install step; move NEXT_PUBLIC_* ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)
A5. Gate .docker-deps/ behind a build arg
- A5-1. Add ARG USE_TARBALLS=false to Dockerfile
- A5-2. Use wildcard COPY so missing dir doesn't break the build:
      RUN mkdir -p /app/.docker-deps
      COPY .docker-deps* /app/.docker-deps/
- A5-3. Verify .docker-deps/ is in .gitignore and .dockerignore does NOT exclude it when tarball mode is in use
A6. .dockerignore audit
- A6-1. Confirm exclusions: node_modules, **/node_modules, dist, .next, *.log, .env, .env.*, .git, *.bak
- A6-2. Remove: pnpm-lock.yaml exclusion (was correct under --lockfile=false, blocks future optimization)
- A6-3. Confirm .docker-deps/ is NOT excluded when tarball path is active
A7. Measure & record
| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
|---|---|---|---|---|---|---|
| clock | web | — | — | — | — | |
| clock | backend | — | — | — | — | |
| peakpulse | backend | — | — | — | — | |
Use:
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend # cold
touch backend/src/server.ts && time docker compose build backend # warm
A8. Config-file COPY audit & canonical pattern (addresses F11, F13)
- A8-1. For every Dockerfile in scope, list all build-time files present in the surface directory (web/ or backend/) that affect the build:
  - postcss.config.{js,mjs,cjs,ts}
  - tailwind.config.{js,mjs,cjs,ts}
  - next.config.{js,mjs,ts}
  - tsconfig*.json
  - package.json
  - .npmrc.docker, .npmrc
  - babel.config.* (if present)
  - drizzle.config.* (if present)
  - vitest.config.* (only if the build needs it)
  Verify each is COPY'd in the Dockerfile.
- A8-2. Choose canonical COPY pattern. Decision: middle-ground glob for web surfaces:
      COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
      COPY web/public/ ./public/
      COPY web/src/ ./src/
  Docker COPY does not support shell brace expansion such as web/*.{json,ts}, so each extension gets its own pattern (same form as § 7.5). Trade-off: glob picks up unintended root-level files if any are added later, but dramatically reduces F11/F13 risk. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
- A8-3. Repo-by-repo migration: replace enumerated COPY web/foo ./foo with the glob pattern; verify the resulting image has all expected files via docker run --rm <img> ls -la.
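The A8-3 verification can be made mechanical by diffing an expected file list against what the image actually contains. A minimal sketch, assuming both lists are plain text files with one name per line; missing_from_image is a hypothetical helper name, not part of any shipped script.

```shell
# Prints names from the expected list that are absent from the actual listing.
# Both arguments are files with one file name per line.
missing_from_image() {
  expected="$1"
  actual="$2"
  # -F fixed strings, -x whole-line match, -v invert:
  # keep the expected names that do not appear in the actual listing.
  grep -Fxv -f "$actual" "$expected"
}
```

One way to produce the actual listing is docker run --rm <img> ls -1 > actual.txt, with the expected list taken from the A8-1 audit. Empty output means every expected file made it into the image.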
A9. Healthcheck canonicalization (addresses F12)
- A9-1. Replace localhost with 127.0.0.1 in every docker-compose*.yml healthcheck test: block. Sweep with:
      rg -l 'http://localhost' --glob 'docker-compose*.yml'
- A9-2. Standardize healthcheck shape:
  - Alpine-based images:
        healthcheck:
          test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 10s
  - Slim/Debian images (wget not always present, but node is):
        healthcheck:
          test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
  Compose interpolates ${PORT} from the host environment or .env file when it parses the file; write $${PORT} instead if the value should come from the container's own environment.
- A9-3. Add start_period (10s minimum), which prevents flaky "container started but app not yet listening" false-negatives.
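The A9-1 sweep finds offending files; the rewrite itself can be sketched with sed. This is an illustrative helper (fix_healthcheck_host is a hypothetical name) that assumes the healthcheck URL sits on the same line as test:, as in the canonical shapes above. Multi-line test: arrays would need a YAML-aware tool instead.

```shell
# Rewrites http://localhost to http://127.0.0.1 on healthcheck "test:" lines only.
# Prints the result to stdout so the change can be reviewed before overwriting.
fix_healthcheck_host() {
  sed -E '/test:/ s#http://localhost([:/])#http://127.0.0.1\1#g' "$1"
}
```

Other localhost references (for example environment values) are left untouched, since only lines containing test: are rewritten.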
4. Phase B — Hermetic-fallback polish (docker-prep.sh)
docker-prep.sh is duplicated with minor variations across product repos.
Promotion to canonical home is now in Phase B, not Phase D — drift
compounds linearly with time and the .npmrc template precedent proves the
pattern is cheap.
- B1. Add --dry-run flag: list packs/rewrites, no side effects
- B2. Idempotency guard: refuse to run if any *.bak exists unless --force
- B3. Ensure .docker-deps/ and *.bak are in .gitignore of every pilot repo
- B4. Pre-commit hook (husky): block commits containing rewritten package.json, staged tarballs, OR .bak files:
      # .husky/pre-commit
      if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
        echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
        exit 1
      fi
      if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|package\.json\.bak)$'; then
        echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
        exit 1
      fi
- B5. Auto-restore on script error via trap restore_on_error EXIT (unless --keep passed)
- B6. Update script header comment per § 7.4 template
- B7. CANONICAL HOME (was deferred; now in Phase B proper).
  - B7-1. Move script to learning_ai_common_plat/scripts/docker-prep.template.sh
  - B7-2. Add learning_ai_common_plat/scripts/sync-docker-prep.sh to copy template into all product repos (mirrors sync-npmrc.sh)
  - B7-3. Add learning_ai_common_plat/scripts/check-docker-prep-drift.sh for CI (mirrors check-npmrc-drift.sh)
  - B7-4. Update every repo's AGENTS.md with the "NEVER edit docker-prep.sh directly" warning + template link
- B8. Add --strip-overrides option that removes pnpm.overrides block after build, as a safety net in case --restore is forgotten
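The B2 idempotency guard could look roughly like the function below. It is a sketch under assumptions: guard_no_backups is a hypothetical name, and skipping node_modules during the search is a choice made here, not something the roadmap specifies.

```shell
# Returns non-zero when leftover *.bak files exist under the given directory,
# unless the second argument is --force.
guard_no_backups() {
  dir="${1:-.}"
  force="${2:-}"
  # Assumption: backups inside node_modules are irrelevant and skipped.
  leftovers=$(find "$dir" -name '*.bak' ! -path '*/node_modules/*' -print)
  if [ -n "$leftovers" ] && [ "$force" != "--force" ]; then
    echo "ERROR: backup files from a previous run exist; run --restore or pass --force:" >&2
    echo "$leftovers" >&2
    return 1
  fi
  return 0
}
```

Called at the top of docker-prep.sh (for example guard_no_backups . "$1" || exit 1), it stops a second run from overwriting the only clean copy of package.json.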
5. Phase C — Verification gates
Pilot exit criteria (must all pass before Phase D):
- C1. Cold Docker build succeeds on both pilots via Gitea-registry path (no docker-prep.sh invocation)
- C2. Warm rebuild (single source file touched) < 30 s on both pilots
- C3. docker-prep.sh → docker compose build → --restore leaves git status clean
- C4. Pre-commit hook blocks: (a) rewritten package.json, (b) staged .tgz, (c) staged .bak
- C5. Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
- C6. Build-time metrics filled into the table in § 3.A7
- C7. ADR recorded for A3 (lockfile policy)
- C8. docker-doctor.sh (Phase E) runs clean against both pilots
- C9. Smoke test: render the web app, inspect <head> for non-trivial CSS bundle (> 50 KB), confirm Tailwind classes apply. Guard against F11 regression.
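The size half of the C9 smoke test is easy to automate once the CSS bundle has been fetched to a local file. A minimal sketch: css_bundle_ok is a hypothetical helper, and reading "50 KB" as 51200 bytes is an assumption made here.

```shell
# Succeeds when the file is larger than the threshold.
# Default threshold: 51200 bytes (assumes "50 KB" means 50 * 1024).
css_bundle_ok() {
  file="$1"
  min_bytes="${2:-51200}"
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -gt "$min_bytes" ]
}
```

A bundle near the ~33 KB F11 signature fails the check. Confirming that Tailwind classes actually apply still needs a rendered-page inspection; size alone is only a tripwire.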
6. Phase D — Ecosystem rollout (deferred until § 5 passes)
Apply Phase A + B + E to remaining repos. Pilots excluded.
| Repo | Backend | Web | docker-prep | Healthcheck | Notes |
|---|---|---|---|---|---|
| learning_ai_notes | ☐ | ☐ | ☐ | ☐ | BASE_IMAGE=node:22-slim override (corp proxy Alpine SSL) |
| learning_ai_fastgap | ☐ | ☐ | ☐ | ☐ | Mobile + web + backend |
| learning_ai_jarvis_jr | ☐ | ☐ | ☐ | ☐ | F12 incident already fixed; verify regression-proof |
| learning_ai_flowmonk | ☐ | ☐ | ☐ | ☐ | .npmrc.docker is tarball-only; needs A0-1 |
| learning_ai_trails | ☐ | ☐ | ☐ | ☐ | |
| learning_ai_local_memory_gpt | ☐ | ☐ | ☐ | ☐ | SQLite-based; F11(b) already fixed 07cdf6b; verify regression-proof |
| learning_multimodal_memory_agents (MindLyst) | ☐ | ☐ | ☐ | ☐ | KMP repo, different layout |
| learning_voice_ai_agent (LysnrAI) | ☐ | ☐ | ☐ | ☐ | Python desktop + TS dashboards |
| learning_ai_efforise | ☐ | ☐ | ☐ | ☐ | |
| learning_ai_auth_app | ☐ | n/a | ☐ | n/a | iOS/Android; no Docker surfaces |
| learning_ai_talk2obsidian | ☐ | ☐ | ☐ | ☐ | Single-container app |
7. Reference snippets
7.1 Canonical .npmrc.docker
Matches the host-side .npmrc template shipped in common-plat 610a59fd.
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
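When a build fails with a 404 or 401 against the registry, it helps to know which URL the template should resolve to for a given host and owner. The sketch below reproduces the template's placeholder semantics in plain shell (host required, owner falling back to learning_ai_user). gitea_registry_url is a hypothetical debugging helper; pnpm's own expansion of .npmrc is what actually applies during the build.

```shell
# Builds the registry URL the way the template's placeholders read:
# host is required; owner falls back to learning_ai_user when unset or empty.
gitea_registry_url() {
  host="${1:?host is required}"
  owner="${2:-learning_ai_user}"
  printf 'http://%s:3300/api/packages/%s/npm/\n' "$host" "$owner"
}
```

For example, gitea_registry_url "$GITEA_NPM_HOST" "$GITEA_NPM_OWNER" prints the URL the current shell environment would produce, which can then be checked with curl from the same network context as the build.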
7.2 Canonical backend Dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build
# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]
--lockfile=false is intentional pending the A3 ADR. Switch to --frozen-lockfile only once the sibling-workspace problem (F2) is resolved.
7.3 Canonical docker-compose.yml service block
services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
      # Build-time mapping: pnpm install reaches Gitea while the image is built
      extra_hosts:
        - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      - PORT=4010
      # ...
    restart: unless-stopped
    healthcheck:
      # F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
7.4 Hardened docker-prep.sh header
#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
# - Local Gitea registry (:3300) is down or unreachable, OR
# - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
# docker compose build
#
# Usage:
# ./scripts/docker-prep.sh # pack tarballs + rewrite package.json
# ./scripts/docker-prep.sh --dry-run # show what would change (no side effects)
# ./scripts/docker-prep.sh --force # override idempotency guard
# ./scripts/docker-prep.sh --restore # undo rewrite
# ./scripts/docker-prep.sh --keep # skip auto-restore on error
# ./scripts/docker-prep.sh --strip-overrides # remove pnpm.overrides block
#
# Side effects:
# - Creates .docker-deps/ (gitignored)
# - Backs up package.json → package.json.bak
# - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
# - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
# - Refuses to run if .bak files already exist (unless --force)
# - Auto-restores on error (trap EXIT) unless --keep passed
# - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak
7.5 Canonical Next.js web Dockerfile (addresses F11, F13)
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json
# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/
ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1
RUN corepack enable && pnpm run build
# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public
EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]
Verification step after every web Dockerfile change: smoke-test the built image by running it and curling the rendered HTML. Confirm the CSS bundle in
<link>references is > 50 KB. A bundle of ~33 KB is the F11 signature (only@font-face, no Tailwind utilities).
7.6 docker-doctor.sh skeleton (Phase E)
#!/usr/bin/env bash
# docker-doctor.sh: pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail
REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0

# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile
for cfg in postcss.config tailwind.config next.config; do
  for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
    [[ -f "$f" ]] || continue
    base=$(basename "$f")
    if ! grep -q "COPY web/${base}\\|COPY web/\\*" "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
      echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
      FAILED=1
    fi
  done
done

# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
  echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
  FAILED=1
fi

# Check 3: .npmrc.docker matches canonical template
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
  if ! grep -q '\${GITEA_NPM_HOST}' "$REPO_DIR/.npmrc.docker"; then
    echo "✗ F4: .npmrc.docker doesn't use \${GITEA_NPM_HOST} placeholder"
    FAILED=1
  fi
fi

# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml (warning only)
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
  echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi

# Check 5: base image is on approved list
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
  [[ -f "$df" ]] || continue
  if ! grep -qE 'FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))' "$df"; then
    echo "✗ Unapproved base image in $df"
    FAILED=1
  fi
done

exit $FAILED
8. Phase E — Observability / lint (NEW)
Two complementary linters:
- gitea-doctor: Gitea registry pre-flight (env + token + connectivity). Already shipped in common-plat commit 610a59fd at scripts/gitea/doctor.sh. This roadmap only wires it into CI/build flows (A0-D + E0 below).
- docker-doctor: Dockerfile + compose-file static linter (see § 7.6 skeleton). To be built as part of this roadmap.
The two are intentionally separate concerns:
| Linter | Scope | When to run |
|---|---|---|
| gitea-doctor | runtime env, token, registry HTTP 200 | Before every build / deploy |
| docker-doctor | static analysis of Dockerfile + compose YAML | On every PR touching those files |
Phase E checklist
- E0. Wire bash scripts/gitea/doctor.sh --quiet into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in common-plat; replicate via a reusable actions/gitea-preflight@main composite if Gitea Actions supports it, otherwise inline.
- E1. Land docker-doctor.sh in learning_ai_common_plat/scripts/ (canonical, mirrors gitea/doctor.sh shipped earlier)
- E2. Provide a thin per-repo wrapper at scripts/docker-doctor.sh that calls the canonical script
- E3. Wire into CI: run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore, .npmrc.docker
- E4. Wire into pre-commit hook (warning-only at first, error after 2 weeks)
- E5. Document checks in learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md (sibling doc to the existing gitea-doctor patterns)
- E6. Add make doctor target to each pilot repo that runs both gitea-doctor AND docker-doctor
Checks implemented by docker-doctor.sh:
| Check | Addresses | Action |
|---|---|---|
| Every web/*.config.* file is COPY'd | F11, F13 | Error |
| docker-compose.yml healthcheck uses 127.0.0.1 | F12 | Error |
| .npmrc.docker uses ${GITEA_NPM_HOST} AND ${GITEA_NPM_OWNER} placeholders | F4, F14 | Error |
| Dockerfile declares ARG GITEA_NPM_OWNER if it COPYs .npmrc.docker | F14 | Error |
| .dockerignore doesn't exclude pnpm-lock.yaml | F1 | Warn (until A3 ADR lands) |
| Base image is on approved list (node:22-alpine or node:22-slim via BASE_IMAGE ARG) | Canonical decision | Error |
| .docker-deps/ and *.bak in .gitignore | B3 | Error |
| docker-compose.yml passes GITEA_NPM_OWNER build arg | F14 | Warn |
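The two F14 error-level checks in the table are not yet in the § 7.6 skeleton. One possible shape for them is sketched below; check_owner_wiring is a hypothetical function name, and the grep patterns are assumptions about how the placeholder and the ARG line are written (ARG at the start of a line, COPY naming .npmrc.docker directly).

```shell
# Check A: .npmrc.docker must reference the ${GITEA_NPM_OWNER...} placeholder.
# Check B: a Dockerfile that COPYs .npmrc.docker must declare ARG GITEA_NPM_OWNER.
# Returns 1 if either check fails, 0 otherwise.
check_owner_wiring() {
  npmrc="$1"
  dockerfile="$2"
  failed=0

  if [ -f "$npmrc" ] && ! grep -q '\${GITEA_NPM_OWNER' "$npmrc"; then
    echo "✗ F14: $npmrc does not use the \${GITEA_NPM_OWNER} placeholder"
    failed=1
  fi

  if [ -f "$dockerfile" ] && grep -q 'COPY .*\.npmrc\.docker' "$dockerfile"; then
    if ! grep -qE '^ARG GITEA_NPM_OWNER([=[:space:]]|$)' "$dockerfile"; then
      echo "✗ F14: $dockerfile copies .npmrc.docker but lacks ARG GITEA_NPM_OWNER"
      failed=1
    fi
  fi

  return "$failed"
}
```

Inside docker-doctor.sh this would be called once per Dockerfile, for example check_owner_wiring "$REPO_DIR/.npmrc.docker" "$df" || FAILED=1. Matching the prefix ${GITEA_NPM_OWNER accepts both the bare placeholder and the :-default form.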
9. Open questions (numbered TODOs, not blockers)
1. Shared pnpm cache volume? BuildKit caches are already shared across builds by id=pnpm. Test whether a named Docker volume adds anything before adding complexity.
2. Custom base image? Publish bytelyst/node-pnpm:22{alpine,slim} with pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
3. CI hostname? Verify host.docker.internal:host-gateway works in Gitea Actions Linux runners, or if a CI-specific Dockerfile variant is needed.
4. Multi-platform builds? linux/amd64 + linux/arm64 interact awkwardly with cache mounts under buildx. Defer to separate roadmap.
5. Workspace flattening? Eliminate the ../learning_ai_common_plat/packages/* workspace entry inside Docker via a flattened pnpm-workspace.yaml. Unlocks --frozen-lockfile. Requires lockfile regeneration step.
10. Execution order
- Now (this commit): roadmap doc v4 lands here; sign-off requested.
- Phase A0 on learning_ai_clock (web + backend). Pilot order is intentionally inverted vs. v2: web is where F11/F13 incidents lived, and clock exercises both surface types in one repo. Fix .npmrc.docker, docker-compose.yml, .dockerignore. Verify A0-V (Gitea path works end-to-end) before any speed work.
- A8 + A9 + A1 on clock (correctness before speed). Commit.
- A2 + A4 + A5 + A6 on clock. Measure. Commit.
- Phase A0 → A6 on learning_ai_peakpulse (backend only) as validation second pass for the simpler case.
- A7: fill in metrics table.
- A3 ADR: decide lockfile policy (defer implementation).
- Phase B: harden docker-prep.sh on clock, then promote to canonical home in common-plat (B7) and sync to peakpulse.
- Phase E: land docker-doctor.sh, wire into CI as warning, then error.
- Phase C: verification gates C1–C9.
- Phase D: scheduled separately, only after § 5 passes.
11. Risk register
| Risk | Mitigation |
|---|---|
| Removing pnpm-lock.yaml from .dockerignore exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep --lockfile=false for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct id= per repo (id=pnpm-${repo}) if observed |
| host.docker.internal doesn't resolve in Linux Docker | extra_hosts: ["host.docker.internal:host-gateway"] (A0-4) |
| Removing .docker-deps/ from default builds breaks repos that haven't done A0 yet | Wildcard COPY .docker-deps* keeps both paths working during migration |
| docker-prep.sh --force is misused and .bak files get committed | Pre-commit hook (B4) blocks .bak, .tgz, rewritten package.json |
| Corp network blocks host.docker.internal:3300 | Verify SSH tunnel reaches Gitea; document in operations.md |
| F11 regression: build green, app ships with no CSS | C9 smoke test + Phase E docker-doctor.sh check on web/*.config.* COPY coverage |
| F12 regression: healthcheck false-fails on IPv6 | Phase E docker-doctor.sh grep for localhost in compose files |
| F13 regression: new config file added, Dockerfile forgotten | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| BASE_IMAGE override in notes diverges silently from canonical | Phase E check approved list; document override in repo AGENTS.md |
| F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile | Phase E docker-doctor.sh checks .npmrc.docker for ${GITEA_NPM_OWNER} placeholder + Dockerfile for ARG GITEA_NPM_OWNER declaration |
| F15: stale token in dev shell hits build mid-way through, wastes ~4 min | A0-D + E0 wire gitea-doctor as pre-build gate; refuses to start build if env/file drift detected |
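The env-vs-file drift behind F15 comes down to a string comparison. The sketch below is illustrative only and is not the shipped scripts/gitea/doctor.sh, whose internals are not shown in this roadmap; token_in_sync is a hypothetical name, and the default path follows the ~/.gitea_npm_token location named in F15.

```shell
# Succeeds when the exported token matches the token file; fails on drift.
# Return codes (assumptions of this sketch): 1 = drift, 2 = token file unreadable.
token_in_sync() {
  env_token="${1:-${GITEA_NPM_TOKEN:-}}"
  token_file="${2:-$HOME/.gitea_npm_token}"
  if [ ! -r "$token_file" ]; then
    echo "token file not readable: $token_file" >&2
    return 2
  fi
  # Strip whitespace so a trailing newline in the file does not count as drift.
  file_token=$(tr -d '[:space:]' < "$token_file")
  if [ "$env_token" != "$file_token" ]; then
    echo "token drift: shell env differs from $token_file; re-source your shell profile" >&2
    return 1
  fi
}
```

With no arguments it compares the current shell's GITEA_NPM_TOKEN against the file, which is the long-lived-shell situation F15 describes.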