From 529d4f37f54d6cd94d56f785626809d60a371687 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Wed, 27 May 2026 00:28:10 -0700
Subject: [PATCH] docs: add Docker build optimization roadmap (post-audit v2)

Captures audit findings on Dockerfile patterns across pilot repos
(peakpulse, clock):
- 10 concrete bugs documented (F1-F10): .dockerignore blocks pnpm-lock.yaml,
  sibling-workspace lockfile problem, .npmrc.docker inconsistencies,
  missing BuildKit cache mounts, etc.
- Phase A0 added: fix Gitea-registry path before optimizing (without it,
  the 'default' path doesn't actually work)
- Phase A1-A7: corepack, cache mounts, layer reordering, measurement
- Phase B: docker-prep.sh hardening (dry-run, idempotency, auto-restore,
  pre-commit guard)
- Phase C: 7 verification gates
- Phase D: deferred 11-repo rollout checklist
- ADR-pending lockfile policy decision (A3)
- Risk register + 6 open questions
---
 docs/docker-build-optimization-roadmap.md | 404 ++++++++++++++++++++++
 1 file changed, 404 insertions(+)
 create mode 100644 docs/docker-build-optimization-roadmap.md

diff --git a/docs/docker-build-optimization-roadmap.md b/docs/docker-build-optimization-roadmap.md
new file mode 100644
index 0000000..43bcea4
--- /dev/null
+++ b/docs/docker-build-optimization-roadmap.md
@@ -0,0 +1,404 @@
+# Docker Build Optimization Roadmap
+
+> **Status:** Draft v2 (post-audit) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
+>
+> Pilot Docker-build speed-ups + hermetic-fallback hardening on `learning_ai_peakpulse`
+> and `learning_ai_clock`, then capture the playbook here for ecosystem-wide rollout.
+
+---
+
+## 0. Pre-flight audit findings (2026-05-27)
+
+A read-only audit of the two pilot repos surfaced **10 concrete bugs/gaps**
+that contradict the casual narrative that "Gitea-registry is the default and
+`docker-prep.sh` is the fallback." The actual state is closer to the inverse:
+
+| # | Finding | Location | Severity |
+|---|---|---|---|
+| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | **High** |
+| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*` — `--frozen-lockfile` inside Docker will fail unless workspace is flattened or sibling tree is copied | `peakpulse/pnpm-workspace.yaml`, `clock/pnpm-workspace.yaml` | **High** |
+| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't actually work in this repo today | `peakpulse/.npmrc.docker` | **High** |
+| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside a Docker container `localhost` is the container itself, not the host registry | `clock/.npmrc.docker` | **High** |
+| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — it is wholly dependent on `.docker-deps/` having been pre-populated | `clock/backend/Dockerfile` | High |
+| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it and has no `--mount=type=secret` — passing the arg is a no-op | `clock/web/Dockerfile` | Medium |
+| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block, so `docker compose build` cannot use the Gitea path | `peakpulse/docker-compose.yml` | Medium |
+| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires either `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
+| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`) — 5–10 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
+| F10 | No BuildKit `--mount=type=cache` for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (the main speed win) |
+
+**Implication:** the original plan to "switch to `--frozen-lockfile` + Gitea
+registry" requires two upstream fixes first (F1, F2). The roadmap below
+accounts for that.
+
+---
+
+## 1. Context: three build paths
+
+| Path | Status today | Trigger | Notes |
+|---|---|---|---|
+| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
+| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to actually be default |
+| **Legacy `file:` refs** | Deprecated | — | Removed during pnpm/Gitea migration |
+
+### Measurement targets
+
+| Build | Baseline (observed) | Target after Phase A |
+|---|---|---|
+| Cold (no cache) | ~2–3 min | ≤ 2 min |
+| Warm (one source file changed) | ~2–3 min | **< 30 s** |
+| `docker-prep.sh` pack step alone | ~60–90 s | < 30 s (pnpm pack cache) |
+
+> Fill in actuals during Phase C.
+
+---
+
+## 2. Goals & non-goals
+
+**Goals**
+
+- ✅ Cut warm rebuild time via BuildKit pnpm-store cache mount (the single biggest win)
+- ✅ Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean
+- ✅ Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
+- ✅ Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
+- ✅ Document which path to use when, and the trade-offs
+
+**Non-goals**
+
+- ❌ Migrating off pnpm or off the Gitea registry
+- ❌ Adopting `--frozen-lockfile` until F2 is resolved (sibling-workspace problem)
+- ❌ Publishing `@bytelyst/*` to the public npm registry
+- ❌ Multi-platform builds (separate roadmap)
+
+---
+
+## 3. Phase A — Build speed + path correctness
+
+Order matters: A0 must precede A1–A5 (you can't enable a path that doesn't work).
+
+### A0. Make the Gitea-registry path actually work (peakpulse + clock)
+
+- [ ] **A0-1.** Standardize `.npmrc.docker` to use a templated host so it works on host (`localhost`) and inside Docker (`host.docker.internal`):
+  ```
+  @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
+  //${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
+  strict-ssl=false
+  ```
+- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1)
+- [ ] **A0-3.** Add `GITEA_NPM_HOST` build arg + `secrets:` block to every service in `docker-compose.yml`:
+  ```yaml
+  build:
+    context: .
+    dockerfile: backend/Dockerfile
+    args:
+      GITEA_NPM_HOST: host.docker.internal
+    secrets:
+      - gitea_npm_token
+  secrets:
+    gitea_npm_token:
+      environment: GITEA_NPM_TOKEN
+  ```
+- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` under each service's `build:` block so Linux Docker can resolve the host at build time (a service-level `extra_hosts` only affects the running container, not `pnpm install` during the build)
+- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build`
+
+### A1. Replace `npm install -g pnpm@X` with corepack
+
+- [ ] **A1-1.** Replace lines `RUN npm install -g pnpm@10.6.5` with:
+  ```dockerfile
+  RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
+  ```
+- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` matches (already `pnpm@10.6.5` in peakpulse)
+
+### A2. Add BuildKit pnpm-store cache mount
+
+- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
+- [ ] **A2-2.** Wrap install step with cache mount:
+  ```dockerfile
+  RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
+      --mount=type=secret,id=gitea_npm_token \
+      export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
+      pnpm install --ignore-scripts
+  ```
+- [ ] **A2-3.** Verify cache hit on second build via `docker buildx du` or `docker history`
+
+### A3. Decide lockfile policy (BLOCKED on F2 resolution)
+
+Two options — pick one in a short ADR before implementing:
+
+- **Option 1: Keep `--lockfile=false`** (current pragmatic approach)
+  - ✅ No sibling-workspace complications
+  - ❌ No reproducibility guarantee inside Docker
+  - ❌ Slower installs (full resolution every build)
+- **Option 2: Generate a Docker-only lockfile** via `pnpm install --lockfile-only` against a flattened `package.json` that resolves `@bytelyst/*` to semver
+  - ✅ Reproducibility
+  - ✅ Faster installs
+  - ❌ New build step + tooling
+  - ❌ Drift risk between dev lockfile and Docker lockfile
+
+- [ ] **A3-1.** Write 1-page ADR (`docs/decisions/0001-docker-lockfile-policy.md`) and pick Option 1 or 2
+- [ ] **A3-2.** Defer `--frozen-lockfile` adoption until ADR lands
+
+### A4. Restructure layer order
+
+- [ ] **A4-1.** Reorder COPY/RUN so deps install layer is `package.json` + `.npmrc` ONLY, then a separate layer for `src/`, `tsconfig.json`, `shared/`
+- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (clock web) closer to the build step
+
+### A5. Gate `.docker-deps/` behind a build arg
+
+- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
+- [ ] **A5-2.** Conditionally copy:
+  ```dockerfile
+  # Always-empty placeholder so COPY doesn't fail in registry mode
+  RUN mkdir -p /app/.docker-deps
+  COPY .docker-deps* /app/.docker-deps/
+  ```
+  (The wildcard tolerates a missing `.docker-deps/` dir; works without enabling BuildKit COPY's `--from` tricks.)
+- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` is NOT excluding it when tarball mode is in use
+
+### A6. `.dockerignore` audit
+
+- [ ] **A6-1.** Confirm exclusions: `node_modules`, `**/node_modules`, `dist`, `.next`, `*.log`, `.env`, `.env.*`, `.git`, `*.bak`
+- [ ] **A6-2.** Remove: `pnpm-lock.yaml` exclusion (was correct under `--lockfile=false`, blocks future optimization)
+- [ ] **A6-3.** Confirm `.docker-deps/` is NOT excluded when tarball path is active
+
+### A7. Measure & record
+
+| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
+|---|---|---|---|---|---|---|
+| peakpulse | backend | — | — | — | — | |
+| clock | backend | — | — | — | — | |
+| clock | web | — | — | — | — | |
+
+Use:
+```
+time DOCKER_BUILDKIT=1 docker compose build --no-cache backend  # cold
+touch backend/src/server.ts && time docker compose build backend  # warm
+```
+
+---
+
+## 4. Phase B — Hermetic-fallback polish (`docker-prep.sh`)
+
+The script is **duplicated with minor variations** across product repos. Pilot
+in peakpulse + clock, then propose a canonical home.
+
+- [ ] **B1.** Add `--dry-run` flag — list packs/rewrites, no side effects
+- [ ] **B2.** Idempotency guard — refuse to run if any `*.bak` exists unless `--force`
+- [ ] **B3.** Ensure `.docker-deps/` and `*.bak` are in `.gitignore` of every pilot repo
+- [ ] **B4.** Pre-commit hook (husky) — block commits containing `"file:../.docker-deps/"` inside any `package.json`. Add to `.husky/pre-commit`:
+  ```bash
+  if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
+    echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
+    exit 1
+  fi
+  ```
+- [ ] **B5.** Auto-restore on script error via `trap restore_on_error EXIT` (unless `--keep` passed)
+- [ ] **B6.** Update script header comment with explicit "use only when Gitea unreachable OR you need uncommitted common-plat changes"
+- [ ] **B7.** Propose canonical home: `learning_ai_common_plat/scripts/docker-prep.template.sh` + `sync-docker-prep.sh` (mirrors `.npmrc` template pattern). Defer execution to Phase D.
+- [ ] **B8.** Add a `--strip-overrides` option that removes `pnpm.overrides` block after build, in case `--restore` is forgotten (additional safety net)
+
+---
+
+## 5. Phase C — Verification gates
+
+Pilot exit criteria (must all pass before Phase D):
+
+- [ ] **C1.** Cold Docker build succeeds on both pilots via Gitea-registry path (no `docker-prep.sh` invocation)
+- [ ] **C2.** Warm rebuild (single source file touched) < 30 s on both pilots
+- [ ] **C3.** `docker-prep.sh` → `docker compose build` → `--restore` leaves `git status` clean
+- [ ] **C4.** Pre-commit hook blocks a deliberately-staged rewritten `package.json`
+- [ ] **C5.** Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
+- [ ] **C6.** Build-time metrics filled into the table in § 3.A7
+- [ ] **C7.** Decision recorded in ADR for A3 (lockfile policy)
+
+---
+
+## 6. Phase D — Ecosystem rollout (deferred until § 5 passes)
+
+Apply Phase A0 → A2 + A4 → A6 + B to remaining repos. **Pilots excluded.**
+
+| Repo | Backend | Web | docker-prep | Notes |
+|---|---|---|---|---|
+| `learning_ai_notes` | ☐ | ☐ | ☐ | Uses `node:22-slim` (corp proxy / Alpine SSL issue) |
+| `learning_ai_fastgap` | ☐ | ☐ | ☐ | Mobile + web + backend |
+| `learning_ai_jarvis_jr` | ☐ | ☐ | ☐ | |
+| `learning_ai_flowmonk` | ☐ | ☐ | ☐ | `.npmrc.docker` is tarball-only — needs A0-1 |
+| `learning_ai_trails` | ☐ | ☐ | ☐ | |
+| `learning_ai_local_memory_gpt` | ☐ | ☐ | ☐ | SQLite-based, no Cosmos |
+| `learning_multimodal_memory_agents` (MindLyst) | ☐ | ☐ | ☐ | KMP repo, different layout |
+| `learning_voice_ai_agent` (LysnrAI) | ☐ | ☐ | ☐ | Python desktop + TS dashboards |
+| `learning_ai_efforise` | ☐ | ☐ | ☐ | |
+| `learning_ai_auth_app` | ☐ | n/a | ☐ | iOS/Android — no backend Dockerfile |
+| `learning_ai_talk2obsidian` | ☐ | ☐ | ☐ | Single-container app |
+
+---
+
+## 7. Reference snippets
+
+### 7.1 Canonical `.npmrc.docker`
+
+```
+@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
+//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
+strict-ssl=false
+auto-install-peers=true
+```
+
+### 7.2 Canonical backend Dockerfile (post Phase A)
+
+```dockerfile
+# syntax=docker/dockerfile:1.7
+FROM node:22-alpine AS builder
+WORKDIR /app/backend
+
+ARG GITEA_NPM_HOST=host.docker.internal
+ARG USE_TARBALLS=false
+ENV NODE_TLS_REJECT_UNAUTHORIZED=0
+ENV NPM_CONFIG_STRICT_SSL=false
+ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
+
+RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
+
+# ── Deps layer (cacheable) ─────────────────────────────────────────
+COPY .npmrc.docker ./.npmrc
+COPY backend/package.json ./package.json
+# Tolerate missing .docker-deps/ when in registry mode (wildcard match)
+RUN mkdir -p /app/.docker-deps
+COPY .docker-deps* /app/.docker-deps/
+
+RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
+    --mount=type=secret,id=gitea_npm_token \
+    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
+    pnpm install --ignore-scripts --lockfile=false
+
+# ── Source layer (changes most often) ──────────────────────────────
+COPY backend/tsconfig.json ./tsconfig.json
+COPY backend/src/ ./src/
+COPY shared/ ../shared/
+RUN pnpm run build
+
+# ── Runtime ────────────────────────────────────────────────────────
+FROM node:22-alpine
+WORKDIR /app/backend
+ENV NODE_ENV=production
+COPY --from=builder /app/backend/node_modules ./node_modules
+COPY --from=builder /app/backend/package.json ./package.json
+COPY --from=builder /app/backend/dist ./dist
+COPY shared/ ../shared/
+EXPOSE 4010
+CMD ["node", "dist/server.js"]
+```
+
+> `--lockfile=false` is intentional pending the A3 ADR. Switch to
+> `--frozen-lockfile` once the sibling-workspace problem (F2) is resolved.
+
+### 7.3 Canonical `docker-compose.yml` service block
+
+```yaml
+services:
+  backend:
+    build:
+      context: .
+      dockerfile: backend/Dockerfile
+      args:
+        GITEA_NPM_HOST: host.docker.internal
+      secrets:
+        - gitea_npm_token
+      extra_hosts:
+        - "host.docker.internal:host-gateway"
+    ports:
+      - "4010:4010"
+    environment:
+      - NODE_ENV=production
+      # ...
+    restart: unless-stopped
+
+secrets:
+  gitea_npm_token:
+    environment: GITEA_NPM_TOKEN
+```
+
+### 7.4 Hardened `docker-prep.sh` header
+
+```bash
+#!/usr/bin/env bash
+# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
+# common-plat repo when the Gitea npm registry is unreachable.
+#
+# Use this ONLY when:
+#   - Local Gitea registry (:3300) is down or unreachable, OR
+#   - You need a Docker build that includes uncommitted common-plat changes.
+#
+# For normal builds (Gitea up + clean common-plat), use:
+#   docker compose build
+#
+# Usage:
+#   ./scripts/docker-prep.sh              # pack tarballs + rewrite package.json
+#   ./scripts/docker-prep.sh --dry-run    # show what would change (no side effects)
+#   ./scripts/docker-prep.sh --force      # override idempotency guard
+#   ./scripts/docker-prep.sh --restore    # undo rewrite
+#   ./scripts/docker-prep.sh --keep       # skip auto-restore on error
+#
+# Side effects:
+#   - Creates .docker-deps/ (gitignored)
+#   - Backs up package.json → package.json.bak
+#   - Rewrites @bytelyst/* deps to file:../.docker-deps/
+#   - Injects pnpm.overrides for transitive @bytelyst/* deps
+#
+# Safety:
+#   - Refuses to run if .bak files already exist (unless --force)
+#   - Auto-restores on error (trap EXIT) unless --keep passed
+#   - Pre-commit hook blocks committing rewritten package.json
+```
+
+---
+
+## 8. Open questions (numbered TODOs, not blockers)
+
+1. **Shared pnpm cache volume?** Should the BuildKit pnpm store cache be shared
+   across all 13 repos via a named Docker volume (`pnpm-store`) instead of
+   per-repo BuildKit caches keyed by `id=pnpm`? (BuildKit caches are already
+   shared by `id=` — verify before adding volume complexity.)
+2. **Custom base image?** Publish `bytelyst/node-pnpm:22` with pnpm
+   pre-installed to skip the corepack step entirely. Cost: maintenance of a
+   base image; benefit: ~5 s/build × 13 repos × N builds/day.
+3. **CI hostname?** Gitea Actions runs builds with `--add-host` to reach the
+   registry. Is `host.docker.internal:host-gateway` portable to Linux CI
+   runners, or do we need a CI-specific Dockerfile variant?
+4. **Canonical script home?** `docker-prep.sh` is currently per-repo with
+   drift. Move to `learning_ai_common_plat/scripts/docker-prep.template.sh`
+   with a `sync-docker-prep.sh` (mirrors `.npmrc` template pattern)?
+5. **Multi-platform builds?** Any need for `linux/amd64` + `linux/arm64`
+   images? If yes, BuildKit cache mounts interact awkwardly with `buildx`
+   `--platform`. Defer to separate roadmap.
+6. **Workspace flattening?** Could we eliminate the
+   `../learning_ai_common_plat/packages/*` workspace entry inside Docker by
+   building with a flattened `pnpm-workspace.yaml` (only local `backend/`)?
+   This unlocks `--frozen-lockfile`. Requires lockfile regeneration step.
+
+---
+
+## 9. Execution order
+
+1. **Now (this commit):** roadmap doc lands here; sign-off requested.
+2. **A0 first** — fix `.npmrc.docker`, `docker-compose.yml`, `.dockerignore` on both pilots. Without this, the Gitea path doesn't work and no measurement is possible.
+3. **A1 + A2** on peakpulse backend. Measure. Commit.
+4. **A1 + A2** on clock backend, then clock web. Measure. Commit.
+5. **A4 + A5 + A6** on all three surfaces. Commit.
+6. **A3 ADR** — decide lockfile policy (defer implementation).
+7. **A7** — fill in metrics table.
+8. **Phase B** — harden `docker-prep.sh` on peakpulse, then mirror to clock.
+9. **Phase C** — verification gates C1–C7.
+10. **Phase D** — scheduled separately, only after § 5 passes.
+
+---
+
+## 10. Risk register
+
+| Risk | Mitigation |
+|---|---|
+| Removing `pnpm-lock.yaml` from `.dockerignore` exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep `--lockfile=false` for now (A3 ADR); revisit after F2 resolution |
+| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct `id=` per repo (`id=pnpm-${repo}`) if observed |
+| `host.docker.internal` doesn't resolve in Linux Docker | `extra_hosts: ["host.docker.internal:host-gateway"]` (added in A0-4) |
+| Removing `.docker-deps/` from default builds breaks repos that haven't done A0 yet | Wildcard `COPY .docker-deps*` keeps both paths working during migration |
+| `docker-prep.sh` `--force` is misused and `.bak` files get committed | Pre-commit hook (B4) blocks this regardless |
+| Corp network blocks `host.docker.internal:3300` | Verify SSH tunnel (`localhost:3300` from host) reaches Gitea; document in operations.md |
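
Reviewer sketch for Phase B (B2 and B5): the idempotency guard and the auto-restore trap could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual `docker-prep.sh`: the function names `guard_no_backups`, `restore_backups`, and the `FORCE`/`KEEP` variables are hypothetical, and it assumes backups are always named `package.json.bak` next to the file they replace. One detail worth noting is that an `EXIT` trap also fires on success, so the handler has to check the exit status before restoring.

```shell
#!/usr/bin/env bash
# Sketch only: names and flag variables are hypothetical, not taken from docker-prep.sh.
set -euo pipefail

FORCE=false   # would be set to true by --force
KEEP=false    # would be set to true by --keep

# B2: refuse to run when an earlier rewrite left package.json.bak files behind.
guard_no_backups() {
  local root="$1" found
  found="$(find "$root" -name 'package.json.bak' ! -path '*/node_modules/*')"
  if [ -n "$found" ] && [ "$FORCE" != true ]; then
    echo "ERROR: backup files exist; run --restore first or pass --force:" >&2
    echo "$found" >&2
    return 1
  fi
}

# Move every package.json.bak back over the rewritten package.json.
restore_backups() {
  local root="$1" bak
  find "$root" -name 'package.json.bak' ! -path '*/node_modules/*' |
    while IFS= read -r bak; do
      mv -f "$bak" "${bak%.bak}"
    done
}

# B5: EXIT trap handler. EXIT fires on success too, so only restore on failure.
restore_on_error() {
  local status=$? root="$1"
  if [ "$status" -ne 0 ] && [ "$KEEP" != true ]; then
    echo "docker-prep failed (exit $status); restoring package.json" >&2
    restore_backups "$root"
  fi
  return "$status"
}

# Wiring inside the real script would be along these lines:
#   guard_no_backups "$REPO_ROOT"
#   trap 'restore_on_error "$REPO_ROOT"' EXIT
```

Keeping the guard and the restore logic in separate functions lets `--restore` reuse `restore_backups` directly, and lets `--dry-run` call the guard without installing the trap.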