bytelyst-devops-tools/docs/docker-build-optimization-roadmap.md
saravanakumardb1 529d4f37f5 docs: add Docker build optimization roadmap (post-audit v2)
Captures audit findings on Dockerfile patterns across pilot repos
(peakpulse, clock):

- 10 concrete bugs documented (F1-F10): .dockerignore blocks
  pnpm-lock.yaml, sibling-workspace lockfile problem, .npmrc.docker
  inconsistencies, missing BuildKit cache mounts, etc.
- Phase A0 added: fix Gitea-registry path before optimizing
  (without it, the 'default' path doesn't actually work)
- Phase A1-A7: corepack, cache mounts, layer reordering, measurement
- Phase B: docker-prep.sh hardening (dry-run, idempotency,
  auto-restore, pre-commit guard)
- Phase C: 7 verification gates
- Phase D: deferred 11-repo rollout checklist
- ADR-pending lockfile policy decision (A3)
- Risk register + 6 open questions
2026-05-27 00:28:10 -07:00

# Docker Build Optimization Roadmap
> **Status:** Draft v2 (post-audit) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build speed-ups + hermetic-fallback hardening on `learning_ai_peakpulse`
> and `learning_ai_clock`, then capture the playbook here for ecosystem-wide rollout.
---
## 0. Pre-flight audit findings (2026-05-27)
A read-only audit of the two pilot repos surfaced **10 concrete bugs/gaps**
that contradict the casual narrative that "Gitea-registry is the default and
`docker-prep.sh` is the fallback." The actual state is closer to the inverse:
| # | Finding | Location | Severity |
|---|---|---|---|
| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | **High** |
| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*`; `--frozen-lockfile` inside Docker will fail unless workspace is flattened or sibling tree is copied | `peakpulse/pnpm-workspace.yaml`, `clock/pnpm-workspace.yaml` | **High** |
| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't actually work in this repo today | `peakpulse/.npmrc.docker` | **High** |
| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside a Docker container `localhost` is the container itself, not the host registry | `clock/.npmrc.docker` | **High** |
| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — it is wholly dependent on `.docker-deps/` having been pre-populated | `clock/backend/Dockerfile` | High |
| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it and has no `--mount=type=secret` — passing the arg is a no-op | `clock/web/Dockerfile` | Medium |
| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block, so `docker compose build` cannot use the Gitea path | `peakpulse/docker-compose.yml` | Medium |
| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires either `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`): 5–10 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
| F10 | No BuildKit `--mount=type=cache` for pnpm store: cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (the main speed win) |

**Implication:** the original plan to "switch to `--frozen-lockfile` + Gitea
registry" requires two upstream fixes first (F1, F2). The roadmap below
accounts for that.

---
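
The F1, F3 and F4 rows above are simple text checks, so they can be re-run mechanically before and after Phase A0. The following is a minimal sketch, not an existing script in either pilot repo: the function name `audit_repo` is hypothetical, and it assumes `.dockerignore` and `.npmrc.docker` sit at the repo root as listed in the Location column.

```bash
# Hypothetical helper: re-checks findings F1, F3 and F4 for one repo checkout.
# Usage: audit_repo path/to/repo   (returns non-zero if any finding is present)
audit_repo() {
  local repo=$1 rc=0

  # F1: lockfile excluded from the Docker build context
  if grep -qE '(^|/)pnpm-lock\.yaml[[:space:]]*$' "$repo/.dockerignore" 2>/dev/null; then
    echo "F1: pnpm-lock.yaml is listed in .dockerignore"
    rc=1
  fi

  # F3: no scoped registry line (also reported when the file is missing)
  if ! grep -q '^@bytelyst:registry=' "$repo/.npmrc.docker" 2>/dev/null; then
    echo "F3: .npmrc.docker has no @bytelyst:registry= line"
    rc=1
  fi

  # F4: inside a container, localhost is the container itself
  if grep -q '//localhost[:/]' "$repo/.npmrc.docker" 2>/dev/null; then
    echo "F4: .npmrc.docker points at localhost"
    rc=1
  fi

  return "$rc"
}
```

Running it against both pilots before A0 should reproduce the findings in the table; a silent run with exit status 0 afterwards is a cheap regression check.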
## 1. Context: three build paths
| Path | Status today | Trigger | Notes |
|---|---|---|---|
| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to actually be default |
| **Legacy `file:` refs** | Deprecated | — | Removed during pnpm/Gitea migration |
### Measurement targets
| Build | Baseline (observed) | Target after Phase A |
|---|---|---|
| Cold (no cache) | ~2–3 min | ≤ 2 min |
| Warm (one source file changed) | ~2–3 min | **< 30 s** |
| `docker-prep.sh` pack step alone | ~60–90 s | < 30 s (pnpm pack cache) |
> Fill in actuals during Phase C.
---
## 2. Goals & non-goals
**Goals**
- Cut warm rebuild time via BuildKit pnpm-store cache mount (the single biggest win)
- Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean
- Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
- Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
- Document which path to use when, and the trade-offs

**Non-goals**
- Migrating off pnpm or off the Gitea registry
- Adopting `--frozen-lockfile` until F2 is resolved (sibling-workspace problem)
- Publishing `@bytelyst/*` to the public npm registry
- Multi-platform builds (separate roadmap)
---
## 3. Phase A — Build speed + path correctness
Order matters: A0 must precede A1–A5 (you can't enable a path that doesn't work).
### A0. Make the Gitea-registry path actually work (peakpulse + clock)
- [ ] **A0-1.** Standardize `.npmrc.docker` to use a templated host so it works on host (`localhost`) and inside Docker (`host.docker.internal`):
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
```
- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1)
- [ ] **A0-3.** Add `GITEA_NPM_HOST` build arg + `secrets:` block to every service in `docker-compose.yml`:
```yaml
services:
  backend:            # repeat for every service
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
      secrets:
        - gitea_npm_token

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```
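- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` under each service's `build:` block so Linux Docker can resolve the host during `docker compose build` (a service-level `extra_hosts` only applies to the running container). Verify that the installed Compose and BuildKit versions accept `host-gateway` at build time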
- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build`
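
A0-5 is easy to get wrong because a variable that is set in the shell but not exported never reaches `docker compose build`. A minimal pre-flight sketch follows; the function name `gitea_preflight` is an assumption, and the URL is simply the registry path from A0-1 with the default host filled in.

```bash
# Hypothetical pre-flight check to run before `docker compose build`.
# Prints the registry URL the build will use, or fails if the token
# is missing from the environment that child processes inherit.
gitea_preflight() {
  local host=${GITEA_NPM_HOST:-host.docker.internal}

  # printenv is a child process, so it only sees exported variables
  if [ -z "$(printenv GITEA_NPM_TOKEN)" ]; then
    echo "GITEA_NPM_TOKEN is not exported; run: export GITEA_NPM_TOKEN=..." >&2
    return 1
  fi

  echo "http://${host}:3300/api/packages/learning_ai_user/npm/"
}
```

A typical call would be `gitea_preflight && docker compose build`, so the build is skipped when the token is absent instead of failing later inside `pnpm install`.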
### A1. Replace `npm install -g pnpm@X` with corepack
- [ ] **A1-1.** Replace lines `RUN npm install -g pnpm@10.6.5` with:
```dockerfile
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
```
- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` matches (already `pnpm@10.6.5` in peakpulse)
### A2. Add BuildKit pnpm-store cache mount
- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
- [ ] **A2-2.** Wrap install step with cache mount:
```dockerfile
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts
```
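- [ ] **A2-3.** Verify the cache mount is populated after the first build via `docker buildx du --verbose` (look for `exec.cachemount` records), and that the second build's `pnpm install` reports packages as reused rather than downloaded. `docker history` lists image layers only and does not show cache mounts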
### A3. Decide lockfile policy (BLOCKED on F2 resolution)
Two options; pick one in a short ADR before implementing:
- **Option 1: Keep `--lockfile=false`** (current pragmatic approach)
  - Pro: no sibling-workspace complications
  - Con: no reproducibility guarantee inside Docker
  - Con: slower installs (full resolution every build)
- **Option 2: Generate a Docker-only lockfile** via `pnpm install --lockfile-only` against a flattened `package.json` that resolves `@bytelyst/*` to semver
  - Pro: reproducibility
  - Pro: faster installs
  - Con: new build step + tooling
  - Con: drift risk between dev lockfile and Docker lockfile
- [ ] **A3-1.** Write 1-page ADR (`docs/decisions/0001-docker-lockfile-policy.md`) and pick Option 1 or 2
- [ ] **A3-2.** Defer `--frozen-lockfile` adoption until ADR lands
### A4. Restructure layer order
- [ ] **A4-1.** Reorder COPY/RUN so deps install layer is `package.json` + `.npmrc` ONLY, then a separate layer for `src/`, `tsconfig.json`, `shared/`
- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (clock web) closer to the build step
### A5. Gate `.docker-deps/` behind a build arg
- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
- [ ] **A5-2.** Conditionally copy:
```dockerfile
# Always-empty placeholder so COPY doesn't fail in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
```
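(The wildcard tolerates a missing `.docker-deps/` dir under BuildKit, which accepts a wildcard source that matches nothing; the legacy builder rejects it. As written, `USE_TARBALLS` is declared but not referenced by this snippet, so the wildcard alone provides the gating.)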
- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` is NOT excluding it when tarball mode is in use
### A6. `.dockerignore` audit
- [ ] **A6-1.** Confirm exclusions: `node_modules`, `**/node_modules`, `dist`, `.next`, `*.log`, `.env`, `.env.*`, `.git`, `*.bak`
- [ ] **A6-2.** Remove: `pnpm-lock.yaml` exclusion (was correct under `--lockfile=false`, blocks future optimization)
- [ ] **A6-3.** Confirm `.docker-deps/` is NOT excluded when tarball path is active
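
The A6-1 list lends itself to a scripted check. This is a sketch under the assumption that each exclusion appears on its own line exactly as written above; the function name `check_dockerignore` is hypothetical.

```bash
# Hypothetical checker for the A6-1 exclusion list.
# Prints every required pattern that is absent from the given .dockerignore
# and returns non-zero if anything is missing.
check_dockerignore() {
  local file=$1 missing=0 pattern

  for pattern in node_modules '**/node_modules' dist .next '*.log' .env '.env.*' .git '*.bak'; do
    # -F: fixed string, -x: whole line, so `.env` does not satisfy `.env.*`
    if ! grep -qxF -- "$pattern" "$file"; then
      echo "missing: $pattern"
      missing=1
    fi
  done

  return "$missing"
}
```

Usage would be `check_dockerignore .dockerignore` from each repo root; equivalent patterns written differently (for example with a trailing slash) are reported as missing, which is a deliberate simplification.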
### A7. Measure & record
| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
|---|---|---|---|---|---|---|
| peakpulse | backend | | | | | |
| clock | backend | | | | | |
| clock | web | | | | | |

Use:
```
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend # cold
touch backend/src/server.ts && time docker compose build backend # warm
```
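
To keep the A7 numbers comparable across repos, the two commands above can be wrapped so each run emits one CSV row. This is a sketch only: `measure` and `build-times.csv` are hypothetical names, and timing uses whole seconds from `date +%s`, which is coarse but sufficient against a 30 s target.

```bash
# Hypothetical timing wrapper: runs a command and prints one CSV row
# "label,seconds,exit_status" on stdout. The command's own stdout is
# redirected to stderr so the CSV row stays machine-readable.
measure() {
  local label=$1; shift
  local start end status=0

  start=$(date +%s)
  "$@" >&2 || status=$?
  end=$(date +%s)

  printf '%s,%d,%d\n' "$label" "$((end - start))" "$status"
}

# Example (assumes the `backend` service used elsewhere in this roadmap):
#   export DOCKER_BUILDKIT=1
#   measure cold docker compose build --no-cache backend >> build-times.csv
#   touch backend/src/server.ts
#   measure warm docker compose build backend >> build-times.csv
```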
---
## 4. Phase B — Hermetic-fallback polish (`docker-prep.sh`)
The script is **duplicated with minor variations** across product repos. Pilot
in peakpulse + clock, then propose a canonical home.
- [ ] **B1.** Add `--dry-run` flag: list packs/rewrites, no side effects
- [ ] **B2.** Idempotency guard: refuse to run if any `*.bak` exists unless `--force`
- [ ] **B3.** Ensure `.docker-deps/` and `*.bak` are in `.gitignore` of every pilot repo
- [ ] **B4.** Pre-commit hook (husky): block commits containing `"file:../.docker-deps/"` inside any `package.json`. Add to `.husky/pre-commit`:
```bash
# Search the index (what is about to be committed), package.json files only,
# so docs or scripts that merely mention the pattern are not blocked.
if git grep --cached -l '"file:\.\./\.docker-deps/' -- '*package.json'; then
  echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first." >&2
  exit 1
fi
```
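- [ ] **B5.** Auto-restore on script error via `trap restore_on_error EXIT` (unless `--keep` passed). Because `EXIT` also fires on success, the handler must check `$?` and restore only on a non-zero status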
- [ ] **B6.** Update script header comment with explicit "use only when Gitea unreachable OR you need uncommitted common-plat changes"
- [ ] **B7.** Propose canonical home: `learning_ai_common_plat/scripts/docker-prep.template.sh` + `sync-docker-prep.sh` (mirrors `.npmrc` template pattern). Defer execution to Phase D.
- [ ] **B8.** Add a `--strip-overrides` option that removes `pnpm.overrides` block after build, in case `--restore` is forgotten (additional safety net)
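
The interaction between the B2 guard and the B5 trap is the part most likely to go wrong, so here is a minimal sketch of just those two items. It is not the real `docker-prep.sh`: `prep` and `rewrite_package_json` are hypothetical names, the rewrite step is a no-op placeholder, and the body runs in a subshell only so the trap stays scoped to one call.

```bash
# Placeholder for the real pack + rewrite logic (assumption: it takes the
# package.json path and returns non-zero on failure).
rewrite_package_json() { :; }

# Sketch of B2 (idempotency guard) + B5 (auto-restore on error).
# Usage: prep path/to/package.json [--force]
prep() {
  (
    pkg=$1
    force=${2:-}

    # B2: refuse to clobber an existing backup unless --force is given.
    # This runs before the trap is installed, so nothing is restored here.
    if [ -e "$pkg.bak" ] && [ "$force" != "--force" ]; then
      echo "refusing to run: $pkg.bak exists (use --force)" >&2
      exit 2
    fi

    restore_on_error() {
      status=$?
      # EXIT fires on success too, so restore only on a non-zero status
      if [ "$status" -ne 0 ] && [ -e "$pkg.bak" ]; then
        mv -f "$pkg.bak" "$pkg"
      fi
    }
    trap restore_on_error EXIT

    cp "$pkg" "$pkg.bak" || exit 1
    rewrite_package_json "$pkg" || exit 1
  )
}
```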
---
## 5. Phase C — Verification gates
Pilot exit criteria (must all pass before Phase D):
- [ ] **C1.** Cold Docker build succeeds on both pilots via Gitea-registry path (no `docker-prep.sh` invocation)
- [ ] **C2.** Warm rebuild (single source file touched) < 30 s on both pilots
- [ ] **C3.** `docker-prep.sh` → `docker compose build` → `--restore` leaves `git status` clean
- [ ] **C4.** Pre-commit hook blocks a deliberately-staged rewritten `package.json`
- [ ] **C5.** Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
- [ ] **C6.** Build-time metrics filled into the table in § 3.A7
- [ ] **C7.** Decision recorded in ADR for A3 (lockfile policy)
---
## 6. Phase D — Ecosystem rollout (deferred until § 5 passes)
Apply Phase A0–A2 + A4–A6 + B to remaining repos. **Pilots excluded.**
| Repo | Backend | Web | docker-prep | Notes |
|---|---|---|---|---|
| `learning_ai_notes` | | | | Uses `node:22-slim` (corp proxy / Alpine SSL issue) |
| `learning_ai_fastgap` | | | | Mobile + web + backend |
| `learning_ai_jarvis_jr` | | | | |
| `learning_ai_flowmonk` | | | | `.npmrc.docker` is tarball-only; needs A0-1 |
| `learning_ai_trails` | | | | |
| `learning_ai_local_memory_gpt` | | | | SQLite-based, no Cosmos |
| `learning_multimodal_memory_agents` (MindLyst) | | | | KMP repo, different layout |
| `learning_voice_ai_agent` (LysnrAI) | | | | Python desktop + TS dashboards |
| `learning_ai_efforise` | | | | |
| `learning_ai_auth_app` | | n/a | | iOS/Android; no backend Dockerfile |
| `learning_ai_talk2obsidian` | | | | Single-container app |
---
## 7. Reference snippets
### 7.1 Canonical `.npmrc.docker`
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```
### 7.2 Canonical backend Dockerfile (post Phase A)
```dockerfile
# syntax=docker/dockerfile:1.7
FROM node:22-alpine AS builder
WORKDIR /app/backend
ARG GITEA_NPM_HOST=host.docker.internal
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode (wildcard match)
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false
# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build
# ── Runtime ────────────────────────────────────────────────────────
FROM node:22-alpine
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]
```
> `--lockfile=false` is intentional pending the A3 ADR. Switch to
> `--frozen-lockfile` once the sibling-workspace problem (F2) is resolved.
### 7.3 Canonical `docker-compose.yml` service block
```yaml
services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
      secrets:
        - gitea_npm_token
      # Build-time mapping; a service-level extra_hosts only affects the running container
      extra_hosts:
        - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      # ...
    restart: unless-stopped

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```
### 7.4 Hardened `docker-prep.sh` header
```bash
#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
# - Local Gitea registry (:3300) is down or unreachable, OR
# - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
# docker compose build
#
# Usage:
# ./scripts/docker-prep.sh # pack tarballs + rewrite package.json
# ./scripts/docker-prep.sh --dry-run # show what would change (no side effects)
# ./scripts/docker-prep.sh --force # override idempotency guard
# ./scripts/docker-prep.sh --restore # undo rewrite
# ./scripts/docker-prep.sh --keep # skip auto-restore on error
#
# Side effects:
# - Creates .docker-deps/ (gitignored)
# - Backs up package.json → package.json.bak
# - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
# - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
# - Refuses to run if .bak files already exist (unless --force)
# - Auto-restores on error (trap EXIT) unless --keep passed
# - Pre-commit hook blocks committing rewritten package.json
```
---
## 8. Open questions (numbered TODOs, not blockers)
1. **Shared pnpm cache volume?** Should the BuildKit pnpm store cache be shared
across all 13 repos via a named Docker volume (`pnpm-store`) instead of
per-repo BuildKit caches keyed by `id=pnpm`? (BuildKit caches are already
shared by `id=`; verify before adding volume complexity.)
2. **Custom base image?** Publish `bytelyst/node-pnpm:22` with pnpm
pre-installed to skip the corepack step entirely. Cost: maintenance of a
base image; benefit: ~5 s/build × 13 repos × N builds/day.
3. **CI hostname?** Gitea Actions runs builds with `--add-host` to reach the
registry. Is `host.docker.internal:host-gateway` portable to Linux CI
runners, or do we need a CI-specific Dockerfile variant?
4. **Canonical script home?** `docker-prep.sh` is currently per-repo with
drift. Move to `learning_ai_common_plat/scripts/docker-prep.template.sh`
with a `sync-docker-prep.sh` (mirrors `.npmrc` template pattern)?
5. **Multi-platform builds?** Any need for `linux/amd64` + `linux/arm64`
images? If yes, BuildKit cache mounts interact awkwardly with `buildx`
`--platform`. Defer to separate roadmap.
6. **Workspace flattening?** Could we eliminate the
`../learning_ai_common_plat/packages/*` workspace entry inside Docker by
building with a flattened `pnpm-workspace.yaml` (only local `backend/`)?
This unlocks `--frozen-lockfile`. Requires lockfile regeneration step.
---
## 9. Execution order
1. **Now (this commit):** roadmap doc lands here; sign-off requested.
2. **A0 first:** fix `.npmrc.docker`, `docker-compose.yml`, `.dockerignore` on both pilots. Without this, the Gitea path doesn't work and no measurement is possible.
3. **A1 + A2** on peakpulse backend. Measure. Commit.
4. **A1 + A2** on clock backend, then clock web. Measure. Commit.
5. **A4 + A5 + A6** on all three surfaces. Commit.
6. **A3 ADR:** decide lockfile policy (defer implementation).
7. **A7:** fill in metrics table.
8. **Phase B:** harden `docker-prep.sh` on peakpulse, then mirror to clock.
9. **Phase C:** verification gates C1–C7.
10. **Phase D:** scheduled separately, only after § 5 passes.
---
## 10. Risk register
| Risk | Mitigation |
|---|---|
| Removing `pnpm-lock.yaml` from `.dockerignore` exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep `--lockfile=false` for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct `id=` per repo (`id=pnpm-${repo}`) if observed |
| `host.docker.internal` doesn't resolve in Linux Docker | `extra_hosts: ["host.docker.internal:host-gateway"]` (added in A0-4) |
| Removing `.docker-deps/` from default builds breaks repos that haven't done A0 yet | Wildcard `COPY .docker-deps*` keeps both paths working during migration |
| `docker-prep.sh` `--force` is misused and `.bak` files get committed | Pre-commit hook (B4) blocks this regardless |
| Corp network blocks `host.docker.internal:3300` | Verify SSH tunnel (`localhost:3300` from host) reaches Gitea; document in operations.md |