bytelyst-devops-tools/docs/docker-build-optimization-roadmap.md
saravanakumardb1 529d4f37f5 docs: add Docker build optimization roadmap (post-audit v2)
Captures audit findings on Dockerfile patterns across pilot repos
(peakpulse, clock):

- 10 concrete bugs documented (F1-F10): .dockerignore blocks
  pnpm-lock.yaml, sibling-workspace lockfile problem, .npmrc.docker
  inconsistencies, missing BuildKit cache mounts, etc.
- Phase A0 added: fix Gitea-registry path before optimizing
  (without it, the 'default' path doesn't actually work)
- Phase A1-A7: corepack, cache mounts, layer reordering, measurement
- Phase B: docker-prep.sh hardening (dry-run, idempotency,
  auto-restore, pre-commit guard)
- Phase C: 7 verification gates
- Phase D: deferred 11-repo rollout checklist
- ADR-pending lockfile policy decision (A3)
- Risk register + 6 open questions
2026-05-27 00:28:10 -07:00


Docker Build Optimization Roadmap

Status: Draft v2 (post-audit) · Owner: Platform DevOps · Created: 2026-05-27 · Revised: 2026-05-27

Pilot Docker-build speed-ups + hermetic-fallback hardening on learning_ai_peakpulse and learning_ai_clock, then capture the playbook here for ecosystem-wide rollout.


0. Pre-flight audit findings (2026-05-27)

A read-only audit of the two pilot repos surfaced 10 concrete bugs/gaps that contradict the casual narrative that "Gitea-registry is the default and docker-prep.sh is the fallback." The actual state is closer to the inverse:

| # | Finding | Location | Severity |
|---|---------|----------|----------|
| F1 | pnpm-lock.yaml is in .dockerignore; any lockfile-based optimization is blocked until removed | peakpulse/.dockerignore, clock/.dockerignore | High |
| F2 | pnpm-workspace.yaml references sibling ../learning_ai_common_plat/packages/*; --frozen-lockfile inside Docker will fail unless workspace is flattened or sibling tree is copied | peakpulse/pnpm-workspace.yaml, clock/pnpm-workspace.yaml | High |
| F3 | peakpulse/.npmrc.docker is tarball-only (no @bytelyst:registry=… line); the "Gitea-registry" path doesn't actually work in this repo today | peakpulse/.npmrc.docker | High |
| F4 | clock/.npmrc.docker hardcodes http://localhost:3300; from inside a Docker container localhost is the container itself, not the host registry | clock/.npmrc.docker | High |
| F5 | clock/backend/Dockerfile has neither ARG GITEA_NPM_HOST nor a BuildKit secret mount; it is wholly dependent on .docker-deps/ having been pre-populated | clock/backend/Dockerfile | High |
| F6 | clock/web/Dockerfile accepts ARG GITEA_NPM_HOST but never uses it and has no --mount=type=secret; passing the arg is a no-op | clock/web/Dockerfile | Medium |
| F7 | peakpulse/docker-compose.yml does not pass GITEA_NPM_HOST build arg or declare secrets: block, so docker compose build cannot use the Gitea path | peakpulse/docker-compose.yml | Medium |
| F8 | COPY .docker-deps/ is unconditional in every backend Dockerfile; every build requires either docker-prep.sh to have run OR an empty .docker-deps/ dir to pre-exist | both repos | Medium |
| F9 | npm install -g pnpm@10.6.5 runs on every build (no corepack); 5-10 s overhead, no pinning to packageManager field | all four Dockerfiles | Low |
| F10 | No BuildKit --mount=type=cache for pnpm store; cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (the main speed win) |

Implication: the original plan to "switch to --frozen-lockfile + Gitea registry" requires two upstream fixes first (F1, F2). The roadmap below accounts for that.
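
Findings F3 and F4 are mechanical enough to re-check with a script as the other repos come up for audit. The following is a minimal sketch, not part of the existing tooling: `check_npmrc_docker` is a hypothetical name, and the checks assume the registry line begins at column one as in the canonical template in § 7.1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flags the .npmrc.docker problems described in F3 and F4.
# Prints one line per finding; prints nothing when the file looks fine.
check_npmrc_docker() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "MISSING: $file"
    return 0
  fi
  # F3: no scoped registry line means the file is tarball-only.
  if ! grep -q '^@bytelyst:registry=' "$file"; then
    echo "F3: no @bytelyst:registry line in $file"
  fi
  # F4: a hardcoded localhost points at the container itself during a build.
  if grep -Eq '^@bytelyst:registry=https?://(localhost|127\.0\.0\.1)[:/]' "$file"; then
    echo "F4: registry host is hardcoded to localhost in $file"
  fi
}

# Example (paths assume the pilot repos are checked out side by side):
#   check_npmrc_docker learning_ai_peakpulse/.npmrc.docker
#   check_npmrc_docker learning_ai_clock/.npmrc.docker
```

An empty result only means these two patterns were not found; it does not prove the registry path works end to end.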


1. Context: three build paths

| Path | Status today | Trigger | Notes |
|------|--------------|---------|-------|
| docker-prep.sh tarballs | De facto default in peakpulse + flowmonk; also works in clock | Run docker-prep.sh then docker compose build | Hermetic; mutates package.json; slow to repack |
| Gitea NPM registry | Partially wired in clock + notes; broken in peakpulse | docker compose build with GITEA_NPM_HOST arg + secret | Needs .npmrc.docker standardization to actually be default |
| Legacy file: refs | Deprecated | | Removed during pnpm/Gitea migration |

Measurement targets

| Build | Baseline (observed) | Target after Phase A |
|-------|---------------------|----------------------|
| Cold (no cache) | ~2-3 min | ≤ 2 min |
| Warm (one source file changed) | ~2-3 min | < 30 s |
| docker-prep.sh pack step alone | ~60-90 s | < 30 s (pnpm pack cache) |

Fill in actuals during Phase C.


2. Goals & non-goals

Goals

  • Cut warm rebuild time via BuildKit pnpm-store cache mount (the single biggest win)
  • Make docker-prep.sh idempotent, safe to re-run, gitignore-clean
  • Standardize .npmrc.docker across the ecosystem so the Gitea path actually works
  • Fix docker-compose.yml to pass GITEA_NPM_HOST + secrets so the registry path is usable without manual flags
  • Document which path to use when, and the trade-offs

Non-goals

  • Migrating off pnpm or off the Gitea registry
  • Adopting --frozen-lockfile until F2 is resolved (sibling-workspace problem)
  • Publishing @bytelyst/* to the public npm registry
  • Multi-platform builds (separate roadmap)

3. Phase A — Build speed + path correctness

Order matters: A0 must precede A1-A5 (you can't enable a path that doesn't work).

A0. Make the Gitea-registry path actually work (peakpulse + clock)

  • A0-1. Standardize .npmrc.docker to use a templated host so it works on host (localhost) and inside Docker (host.docker.internal):
    @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
    //${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
    strict-ssl=false
    
  • A0-2. Remove pnpm-lock.yaml from .dockerignore in both repos (fixes F1)
  • A0-3. Add GITEA_NPM_HOST build arg + secrets: block to every service in docker-compose.yml:
    services:
      backend:
        build:
          context: .
          dockerfile: backend/Dockerfile
          args:
            GITEA_NPM_HOST: host.docker.internal
          secrets:
            - gitea_npm_token
    # top-level key, declared once per compose file (not nested under a service)
    secrets:
      gitea_npm_token:
        environment: GITEA_NPM_TOKEN
    
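  • A0-4. Add extra_hosts: ["host.docker.internal:host-gateway"] under each service's build: key so Linux Docker can resolve the host during the build (a service-level extra_hosts only applies to the running container, not to build steps)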
  • A0-5. Document required env: GITEA_NPM_TOKEN must be exported in the shell that runs docker compose build
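
A small preflight can catch the most likely A0-5 mistake (token set in the shell but never exported) before a build fails halfway through an install. This is a sketch under stated assumptions: `require_token` and `registry_url` are hypothetical names, and the URL layout is copied from the template in A0-1.

```shell
#!/usr/bin/env bash
# Hypothetical preflight for A0-5.

# Succeeds only if GITEA_NPM_TOKEN is exported and non-empty.
# A child shell sees exported variables only, which is also what
# docker compose will see when it resolves the secret.
require_token() {
  if ! sh -c '[ -n "${GITEA_NPM_TOKEN:-}" ]'; then
    echo "GITEA_NPM_TOKEN is not exported; export it before running docker compose build" >&2
    return 1
  fi
}

# Prints the registry URL that the A0-1 template resolves to for a given host.
registry_url() {
  printf 'http://%s:3300/api/packages/learning_ai_user/npm/\n' "$1"
}

# Example:
#   require_token && echo "registry: $(registry_url host.docker.internal)"
```

Printing the resolved URL makes the host/container split visible: `registry_url localhost` is what a host-side install uses, while the build uses `host.docker.internal`.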

A1. Replace npm install -g pnpm@X with corepack

  • A1-1. Replace lines RUN npm install -g pnpm@10.6.5 with:
    RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
    
  • A1-2. Verify packageManager field in backend/package.json matches (already pnpm@10.6.5 in peakpulse)
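
The A1-2 check can be scripted so that the Dockerfile pin and package.json cannot drift silently. A minimal sketch, assuming no jq is available; `package_manager_of` and `check_pnpm_pin` are hypothetical names, and the sed expression assumes the field sits on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical check for A1-2.

# Prints the value of the "packageManager" field, or nothing if it is absent.
package_manager_of() {
  sed -n 's/.*"packageManager"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Fails when package.json disagrees with the version pinned in the Dockerfile.
check_pnpm_pin() {
  local pkg_json="$1" expected="$2" actual
  actual="$(package_manager_of "$pkg_json")"
  if [ "$actual" != "$expected" ]; then
    echo "mismatch: $pkg_json has '${actual:-<none>}', Dockerfile pins '$expected'" >&2
    return 1
  fi
}

# Example:
#   check_pnpm_pin backend/package.json pnpm@10.6.5
```

If a repo records an integrity suffix in packageManager (for example a `+sha512...` tail), the comparison above would need to strip it first; that case is not handled here.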

A2. Add BuildKit pnpm-store cache mount

  • A2-1. Set # syntax=docker/dockerfile:1.7 directive at top of every Dockerfile
  • A2-2. Wrap install step with cache mount:
    RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
        --mount=type=secret,id=gitea_npm_token \
        export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
        pnpm install --ignore-scripts
    
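  • A2-3. Verify cache hit on second build via docker buildx du (the pnpm store appears as a cache-mount record) and by comparing the install step's duration in docker compose build --progress=plain output; docker history will not help here because cache mounts are not stored in image layers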

A3. Decide lockfile policy (BLOCKED on F2 resolution)

Two options — pick one in a short ADR before implementing:

  • Option 1: Keep --lockfile=false (current pragmatic approach)

    • No sibling-workspace complications
    • No reproducibility guarantee inside Docker
    • Slower installs (full resolution every build)
  • Option 2: Generate a Docker-only lockfile via pnpm install --lockfile-only against a flattened package.json that resolves @bytelyst/* to semver

    • Reproducibility
    • Faster installs
    • New build step + tooling
    • Drift risk between dev lockfile and Docker lockfile
  • A3-1. Write 1-page ADR (docs/decisions/0001-docker-lockfile-policy.md) and pick Option 1 or 2

  • A3-2. Defer --frozen-lockfile adoption until ADR lands

A4. Restructure layer order

  • A4-1. Reorder COPY/RUN so deps install layer is package.json + .npmrc ONLY, then a separate layer for src/, tsconfig.json, shared/
  • A4-2. Move all ARG lines that affect deps install before the install step; move NEXT_PUBLIC_* ARGs (clock web) closer to the build step

A5. Gate .docker-deps/ behind a build arg

  • A5-1. Add ARG USE_TARBALLS=false to Dockerfile
  • A5-2. Conditionally copy:
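    # Ensure the target dir exists even when nothing is copied (registry mode)
    RUN mkdir -p /app/.docker-deps
    # Wildcard source: BuildKit accepts a pattern that matches nothing
    COPY .docker-deps* /app/.docker-deps/
    
    (The wildcard is what tolerates a missing .docker-deps/ dir. This relies on BuildKit; the legacy builder rejects a COPY whose only source pattern matches nothing. Note that this snippet does not read USE_TARBALLS: the wildcard alone decides whether anything is copied.)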
  • A5-3. Verify .docker-deps/ is in .gitignore and .dockerignore is NOT excluding it when tarball mode is in use

A6. .dockerignore audit

  • A6-1. Confirm exclusions: node_modules, **/node_modules, dist, .next, *.log, .env, .env.*, .git, *.bak
  • A6-2. Remove: pnpm-lock.yaml exclusion (was correct under --lockfile=false, blocks future optimization)
  • A6-3. Confirm .docker-deps/ is NOT excluded when tarball path is active
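
The A6 checklist can be run as a script against each repo's .dockerignore. The sketch below is illustrative only: `audit_dockerignore` is a hypothetical name, and it matches whole lines exactly, so comments, negation rules (`!pattern`), and spelling variants such as `.git/` are not interpreted.

```shell
#!/usr/bin/env bash
# Hypothetical A6 audit. Prints one line per problem and returns non-zero
# if any problem was found.
audit_dockerignore() {
  local file="$1" status=0 pattern
  # Exclusions that A6-1 expects to be present.
  for pattern in node_modules '**/node_modules' dist .next '*.log' .env '.env.*' .git '*.bak'; do
    if ! grep -Fxq -- "$pattern" "$file"; then
      echo "missing exclusion: $pattern"
      status=1
    fi
  done
  # Entries that A6-2 and A6-3 say must not be excluded.
  for pattern in pnpm-lock.yaml .docker-deps .docker-deps/; do
    if grep -Fxq -- "$pattern" "$file"; then
      echo "must not be excluded: $pattern"
      status=1
    fi
  done
  return "$status"
}

# Example:
#   audit_dockerignore learning_ai_peakpulse/.dockerignore
```

A6-3 only applies while the tarball path is active, so a `.docker-deps` hit is worth a second look rather than an automatic failure in registry-only repos.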

A7. Measure & record

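| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
|------|---------|-------------|------------|-------------|------------|-------|
| peakpulse | backend | | | | | |
| clock | backend | | | | | |
| clock | web | | | | | |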

Use:

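time DOCKER_BUILDKIT=1 docker compose build --no-cache backend  # cold
# Change file content for the warm run: a bare touch only updates mtime and may not invalidate the COPY layer
echo "// warm-build probe" >> backend/src/server.ts && time docker compose build backend  # warm
git checkout -- backend/src/server.ts  # undo the probe edit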
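
To fill the table consistently across repos, the timings can be captured in a machine-readable form. A minimal sketch: `time_build` is a hypothetical helper that wraps any command, so it carries no assumptions about the compose setup itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the A7 table: runs a command and prints
# "label<TAB>elapsed-seconds<TAB>exit-code". Command output is discarded.
time_build() {
  local label="$1" start end rc=0
  shift
  start="$(date +%s)"
  "$@" >/dev/null 2>&1 || rc=$?
  end="$(date +%s)"
  printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$rc"
  return "$rc"
}

# Example (commented out because it needs Docker and the pilot repo layout):
#   time_build cold docker compose build --no-cache backend
#   time_build warm docker compose build backend
```

Resolution is whole seconds, which is adequate for the < 30 s warm target; the exit code column guards against recording a fast time for a build that actually failed.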

4. Phase B — Hermetic-fallback polish (docker-prep.sh)

The script is duplicated with minor variations across product repos. Pilot in peakpulse + clock, then propose a canonical home.

  • B1. Add --dry-run flag — list packs/rewrites, no side effects
  • B2. Idempotency guard — refuse to run if any *.bak exists unless --force
  • B3. Ensure .docker-deps/ and *.bak are in .gitignore of every pilot repo
  • B4. Pre-commit hook (husky) — block commits containing "file:../.docker-deps/" inside any package.json. Add to .husky/pre-commit:
    # Search the staged (index) copy of every package.json, not the working tree
    if git grep --cached -q '"file:\.\./\.docker-deps/' -- '*package.json'; then
      echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
      exit 1
    fi
    
  • B5. Auto-restore on script error via trap restore_on_error EXIT (unless --keep passed); the handler must check the exit status, because an EXIT trap also fires after a successful run
  • B6. Update script header comment with explicit "use only when Gitea unreachable OR you need uncommitted common-plat changes"
  • B7. Propose canonical home: learning_ai_common_plat/scripts/docker-prep.template.sh + sync-docker-prep.sh (mirrors .npmrc template pattern). Defer execution to Phase D.
  • B8. Add a --strip-overrides option that removes pnpm.overrides block after build, in case --restore is forgotten (additional safety net)
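
The guards in B1, B2, and B5 could be structured roughly as follows. This is a sketch of the guard logic only, with packing and rewriting left out; every function name and the `DRY_RUN`, `ROOT`, and `KEEP` variables are hypothetical and not taken from the existing script.

```shell
#!/usr/bin/env bash
# Sketch of the B1/B2/B5 guards for docker-prep.sh (hypothetical names).

# B1: run a side-effecting command, or only print it in dry-run mode.
run() {
  if [ "${DRY_RUN:-false}" = "true" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# B2: refuse to continue if backups from an earlier run are still present.
guard_no_backups() {
  local root="$1" force="${2:-false}" leftovers
  leftovers="$(find "$root" -name '*.bak' ! -path '*/node_modules/*' -print)"
  if [ -n "$leftovers" ] && [ "$force" != "true" ]; then
    echo "refusing to run: backup files exist (use --force or --restore):" >&2
    echo "$leftovers" >&2
    return 1
  fi
}

# Put every package.json.bak back over its package.json.
restore_backups() {
  local root="$1" bak
  find "$root" -name 'package.json.bak' ! -path '*/node_modules/*' -print |
    while IFS= read -r bak; do
      mv -f "$bak" "${bak%.bak}"
    done
}

# B5: restore only when the script is exiting with a failure and --keep is off.
restore_on_error() {
  local status="$1" root="$2" keep="${3:-false}"
  if [ "$status" -ne 0 ] && [ "$keep" != "true" ]; then
    echo "docker-prep failed (exit $status); restoring package.json" >&2
    restore_backups "$root"
  fi
}

# In the script body, after argument parsing:
#   guard_no_backups "$ROOT" "$FORCE" || exit 1
#   trap 'restore_on_error "$?" "$ROOT" "$KEEP"' EXIT
```

Passing `$?` into the handler explicitly keeps the success case from triggering a restore, which would otherwise undo the rewrite before docker compose build has a chance to use it.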

5. Phase C — Verification gates

Pilot exit criteria (must all pass before Phase D):

  • C1. Cold Docker build succeeds on both pilots via Gitea-registry path (no docker-prep.sh invocation)
  • C2. Warm rebuild (single source file touched) < 30 s on both pilots
  • C3. docker-prep.sh → docker compose build → docker-prep.sh --restore leaves git status clean
  • C4. Pre-commit hook blocks a deliberately-staged rewritten package.json
  • C5. Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
  • C6. Build-time metrics filled into the table in § 3.A7
  • C7. Decision recorded in ADR for A3 (lockfile policy)

6. Phase D — Ecosystem rollout (deferred until § 5 passes)

Apply Phase A0 → A2 + A4 → A6 + B to remaining repos. Pilots excluded.

| Repo | Backend | Web | docker-prep | Notes |
|------|---------|-----|-------------|-------|
| learning_ai_notes | | | | Uses node:22-slim (corp proxy / Alpine SSL issue) |
| learning_ai_fastgap | | | | Mobile + web + backend |
| learning_ai_jarvis_jr | | | | |
| learning_ai_flowmonk | | | | .npmrc.docker is tarball-only; needs A0-1 |
| learning_ai_trails | | | | |
| learning_ai_local_memory_gpt | | | | SQLite-based, no Cosmos |
| learning_multimodal_memory_agents (MindLyst) | | | | KMP repo, different layout |
| learning_voice_ai_agent (LysnrAI) | | | | Python desktop + TS dashboards |
| learning_ai_efforise | | | | |
| learning_ai_auth_app | n/a | | | iOS/Android; no backend Dockerfile |
| learning_ai_talk2obsidian | | | | Single-container app |

7. Reference snippets

7.1 Canonical .npmrc.docker

@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true

7.2 Canonical backend Dockerfile (post Phase A)

# syntax=docker/dockerfile:1.7
FROM node:22-alpine AS builder
WORKDIR /app/backend

ARG GITEA_NPM_HOST=host.docker.internal
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode (wildcard match)
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build

# ── Runtime ────────────────────────────────────────────────────────
FROM node:22-alpine
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]

--lockfile=false is intentional pending the A3 ADR. Switch to --frozen-lockfile once the sibling-workspace problem (F2) is resolved.

7.3 Canonical docker-compose.yml service block

services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
      secrets:
        - gitea_npm_token
      # build-level: a service-level extra_hosts only applies to the running container
      extra_hosts:
        - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      # ...
    restart: unless-stopped

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN

7.4 Hardened docker-prep.sh header

#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
#   - Local Gitea registry (:3300) is down or unreachable, OR
#   - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
#   docker compose build
#
# Usage:
#   ./scripts/docker-prep.sh             # pack tarballs + rewrite package.json
#   ./scripts/docker-prep.sh --dry-run   # show what would change (no side effects)
#   ./scripts/docker-prep.sh --force     # override idempotency guard
#   ./scripts/docker-prep.sh --restore   # undo rewrite
#   ./scripts/docker-prep.sh --keep      # skip auto-restore on error
#
# Side effects:
#   - Creates .docker-deps/ (gitignored)
#   - Backs up package.json → package.json.bak
#   - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
#   - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
#   - Refuses to run if .bak files already exist (unless --force)
#   - Auto-restores on error (trap EXIT) unless --keep passed
#   - Pre-commit hook blocks committing rewritten package.json

8. Open questions (numbered TODOs, not blockers)

  1. Shared pnpm cache volume? Should the BuildKit pnpm store cache be shared across all 13 repos via a named Docker volume (pnpm-store) instead of per-repo BuildKit caches keyed by id=pnpm? (BuildKit caches are already shared by id= — verify before adding volume complexity.)
  2. Custom base image? Publish bytelyst/node-pnpm:22 with pnpm pre-installed to skip the corepack step entirely. Cost: maintenance of a base image; benefit: ~5 s/build × 13 repos × N builds/day.
  3. CI hostname? Gitea Actions runs builds with --add-host to reach the registry. Is host.docker.internal:host-gateway portable to Linux CI runners, or do we need a CI-specific Dockerfile variant?
  4. Canonical script home? docker-prep.sh is currently per-repo with drift. Move to learning_ai_common_plat/scripts/docker-prep.template.sh with a sync-docker-prep.sh (mirrors .npmrc template pattern)?
  5. Multi-platform builds? Any need for linux/amd64 + linux/arm64 images? If yes, BuildKit cache mounts interact awkwardly with buildx --platform. Defer to separate roadmap.
  6. Workspace flattening? Could we eliminate the ../learning_ai_common_plat/packages/* workspace entry inside Docker by building with a flattened pnpm-workspace.yaml (only local backend/)? This unlocks --frozen-lockfile. Requires lockfile regeneration step.
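
For question 6, the flattening step itself is small; the open part is the lockfile regeneration that has to follow it. A minimal sketch of the flattening only, assuming the workspace file lists one package glob per line: `flatten_workspace` is a hypothetical name, and the output filename in the example is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper for open question 6: copy pnpm-workspace.yaml while
# dropping list entries that point outside the repo (paths starting with ../).
flatten_workspace() {
  local src="$1" dest="$2"
  grep -Ev "^[[:space:]]*-[[:space:]]*[\"']?\.\./" "$src" > "$dest"
}

# Example:
#   flatten_workspace pnpm-workspace.yaml pnpm-workspace.docker.yaml
```

The Docker build would then copy the flattened file in as pnpm-workspace.yaml, and any lockfile used with --frozen-lockfile would have to be generated against that same flattened file.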

9. Execution order

  1. Now (this commit): roadmap doc lands here; sign-off requested.
  2. A0 first — fix .npmrc.docker, docker-compose.yml, .dockerignore on both pilots. Without this, the Gitea path doesn't work and no measurement is possible.
  3. A1 + A2 on peakpulse backend. Measure. Commit.
  4. A1 + A2 on clock backend, then clock web. Measure. Commit.
  5. A4 + A5 + A6 on all three surfaces. Commit.
  6. A3 ADR — decide lockfile policy (defer implementation).
  7. A7 — fill in metrics table.
  8. Phase B — harden docker-prep.sh on peakpulse, then mirror to clock.
  9. Phase C: verification gates C1-C7.
  10. Phase D — scheduled separately, only after § 5 passes.

10. Risk register

| Risk | Mitigation |
|------|------------|
| Removing pnpm-lock.yaml from .dockerignore exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep --lockfile=false for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct id= per repo (id=pnpm-${repo}) if observed |
| host.docker.internal doesn't resolve in Linux Docker | extra_hosts: ["host.docker.internal:host-gateway"] (added in A0-4) |
| Removing .docker-deps/ from default builds breaks repos that haven't done A0 yet | Wildcard COPY .docker-deps* keeps both paths working during migration |
| docker-prep.sh --force is misused and .bak files get committed | .gitignore entry (B3) keeps *.bak out of commits; pre-commit hook (B4) blocks the rewritten package.json regardless |
| Corp network blocks host.docker.internal:3300 | Verify SSH tunnel (localhost:3300 from host) reaches Gitea; document in operations.md |