bytelyst-devops-tools/docs/docker-build-optimization-roadmap.md
saravanakumardb1 babe2e6c13 docs(roadmap): v14 - ALL 20 ITEMS COMPLETE (C5 closed end-to-end)
C5 fully closed by:
1. Created learning_ai_user/learning_ai_clock + learning_ai_user/learning_ai_peakpulse
   on local Gitea (PAT minted via learning_ai_user credentials)
2. Pushed main branch → act_runner (Homebrew service) picked it up
3. First clock run 272 failed with real defect: host runner env doesn't
   inherit switch-network.sh exports. Fix landed in both pilots' ci.yml
   docker-lint job: explicit env: block + read token from
   ~/.gitea_npm_token at step time.
4. Verified green:
   - clock run 273 job 675 docker-lint → success
   - peakpulse runs 274 + 275 docker-lint → success

Roadmap final state: 20/20 items DONE.
2026-05-27 05:20:48 -07:00


# Docker Build Optimization Roadmap
> **Status:** Draft v14 (**ALL 20 ITEMS COMPLETE** — Phases A, B, C, D, E green across 12 consumer repos; C5 closed by end-to-end Gitea Actions validation on both pilots) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
> ecosystem-wide rollout.
>
> **Upstream prerequisite shipped (commit `610a59fd` in `learning_ai_common_plat`):**
> Gitea owner parameterization + helper scripts (`scripts/gitea/doctor.sh`,
> `scripts/gitea/token.sh`). The `.npmrc` template now resolves owner from
> `${GITEA_NPM_OWNER:-learning_ai_user}`. **All A0-1 work in this roadmap
> inherits this — Dockerfile/.npmrc.docker must use the same `${GITEA_NPM_OWNER}`
> placeholder, not a hardcoded literal.**
---
## 0. Pre-flight audit findings (2026-05-27)
A read-only audit of the pilot repos, lessons from recent live incidents, and
the A0-V execution iterations on clock surfaced **18 concrete bugs/gaps**
(F14–F15 added after the Gitea-hardening commit; F16–F18 added during the
A0-V execution sweep on clock, 2026-05-27). The actual state of the ecosystem
is close to the inverse of the commonly assumed picture: tarballs are the de
facto default, the Gitea-registry path is only partially wired, and there is a
separate class of "build green, app broken" silent failures (F11–F13) that the
speed-focused plan needs to address first.
| # | Finding | Location | Severity |
|---|---|---|---|
| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | High |
| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*`; `--frozen-lockfile` inside Docker will fail unless the workspace is flattened or the sibling tree is copied | both pilots | High |
| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't work in this repo today | `peakpulse/.npmrc.docker` | High |
| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside Docker, `localhost` is the container, not the host registry | `clock/.npmrc.docker` | High |
| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — wholly dependent on pre-populated `.docker-deps/` | `clock/backend/Dockerfile` | High |
| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it; no `--mount=type=secret` either | `clock/web/Dockerfile` | Medium |
| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block | `peakpulse/docker-compose.yml` | Medium |
| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`): 5–10 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
| F10 | No BuildKit `--mount=type=cache` for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
| **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr; fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT; fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** |
| **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** |
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` to `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
| **F16** | **At least 10 published `@bytelyst/*` packages had unrewritten `workspace:*` refs in their `package.json` dependencies.** Root cause: `publish-outdated-packages.sh` extracts a pnpm-packed tarball then **re-packs with `npm pack`** (workaround for a historical Gitea-compat issue with pnpm's tarball format), and `npm pack` doesn't recognize the pnpm-specific `workspace:` protocol — it passes it through literally. **Fixed in `common-plat@cfcfc7bb`** (`fix(gitea): rewrite workspace:* in published tarballs (F16)`) — inserted a workspace:* rewriter between extract and npm-repack + a defense-in-depth grep guard. Republished 10 affected packages. | `common-plat` publish flow + Gitea registry | **Critical (FIXED)** |
| **F17** | **Gitea bakes `localhost:3300` into the `dist.tarball` field of every published package's metadata.** Inside Docker, `localhost` is the container itself, not the host — so even after a successful registry-metadata fetch via `host.docker.internal`, pnpm follows the tarball URL to `localhost:3300` and ECONNREFUSEs. Root cause: Gitea `app.ini`'s `ROOT_URL=http://localhost:3300/` was baked at publish time. **Fixed** by setting `ROOT_URL=http://host.docker.internal:3300/`, restarting Gitea, adding `127.0.0.1 host.docker.internal` to `/etc/hosts`, adding `host.docker.internal` to `NO_PROXY` (corp proxy was hijacking DNS), and republishing all 64 packages (`common-plat@dd90f709`). | Gitea `app.ini` + host `/etc/hosts` + every dev machine's `switch-network.sh` | **Critical (FIXED)** |
| **F18** | **`clock/web/package.json` had 4 `@bytelyst/*` deps declared as `file:` refs to sibling `../../learning_ai_common_plat/packages/*`** — a legacy pre-Gitea pattern. Inside Docker those paths don't exist, so `pnpm install` fails with `ERR_PNPM_LINKED_PKG_DIR_NOT_FOUND`. Discovered during clock web A0-V on 2026-05-27. **Fixed in `learning_ai_clock@8b5c767a3`** by rewriting to `*` semver. Same pattern likely lives in other product repos (especially anything that consumes `@bytelyst/ui`, `@bytelyst/design-tokens`, `@bytelyst/use-theme`) — audit needed in Phase D rollout. | `*/web/package.json` (and likely others) | **High** |
**Implications:**
- The original "switch to `--frozen-lockfile` + Gitea registry" plan requires
two upstream fixes first (F1, F2).
- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we
ship faster builds of broken apps.
- F16 + F17 are **both fixed** as of 2026-05-27. Gitea path now works
end-to-end on clock. A-pre is largely complete; remaining items (A-pre-4,
A-pre-5) become Phase E checks.
- F18 (sibling `file:` refs in product repo manifests) is the same family as
F2 but separately tractable — fixed in clock, audit needed across other
repos as part of Phase D rollout.
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
F11/F13/F18 recurrence — silent in CI today. The registry-side guard
(publish-time check for `workspace:*` leaks) shipped in `common-plat@cfcfc7bb`
as part of the F16 fix.
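
The F18 audit called for above could start from a small grep-based helper. This is a minimal sketch, not the `docker-doctor.sh` implementation: the function name `find_file_refs` and the sweep glob are hypothetical, and it assumes each dependency sits on its own line in `package.json`.

```bash
# Print every @bytelyst/* dependency line in a package.json that uses a `file:` ref.
# Returns 0 when at least one is found, 1 when the manifest is clean (grep semantics).
find_file_refs() {
  grep -nE '"@bytelyst/[^"]+"[[:space:]]*:[[:space:]]*"file:' "$1"
}

# Example sweep over sibling checkouts (the glob is illustrative):
# for m in ../learning_ai_*/web/package.json; do
#   find_file_refs "$m" >/dev/null && echo "F18 candidate: $m"
# done
```

A line-oriented grep is enough for a first pass; a JSON-aware check would be needed for minified manifests.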
---
## 1. Context: three build paths
| Path | Status today | Trigger | Notes |
|---|---|---|---|
| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock/notes | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to be the default |
| **Legacy `file:` refs** | Deprecated | — | Removed during pnpm/Gitea migration |
### Measurement targets
| Build | Baseline (observed) | Target after Phase A |
|---|---|---|
| Cold (no cache) | ~2–3 min | ≤ 2 min |
| Warm (one source file changed) | ~2–3 min | **< 30 s** |
| `docker-prep.sh` pack step alone | ~60–90 s | < 30 s (pnpm pack cache) |
> Fill in actuals during Phase C.
---
## 2. Goals & non-goals
**Goals**
- Eliminate the F11–F13 class of silent "build green, app broken" failures
- Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
- Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
- Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
- Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
- Ship `docker-doctor.sh` CI lint as the durable insurance layer
**Non-goals**
- Migrating off pnpm or off the Gitea registry
- Adopting `--frozen-lockfile` until F2 is resolved (sibling-workspace problem)
- Publishing `@bytelyst/*` to the public npm registry
- Multi-platform builds (separate roadmap)
---
## 2.5 Canonical decisions
Decisions taken now to avoid contradictions later in the doc:
- **Base image:** `node:22-alpine` is canonical. For repos blocked by the
corporate proxy's Alpine SSL interception (currently only
`learning_ai_notes`), the Dockerfile MUST expose:
```dockerfile
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
```
Override per-repo via `--build-arg BASE_IMAGE=node:22-slim`. Document the
override in the repo's `AGENTS.md`.
- **Healthcheck host:** `127.0.0.1` (NOT `localhost`) in every
`docker-compose*.yml` `test:` block. See F12.
- **Lockfile mode in Docker:** `--lockfile=false` for now. `--frozen-lockfile`
is blocked on the A3 ADR (F2).
---
## 3. Phase A — Correctness + build speed + path correctness
Order matters: **A-pre must precede A0** (you can't build via a registry that
serves broken metadata); A0 must precede A1+ (you can't optimize a path that
doesn't work), and A8+A9 (correctness) must land before measuring speed wins.
### A-pre. Make the Gitea registry actually usable from Docker (F16 + F17 + F18)
**Owner:** `learning_ai_common_plat` + per-product repo · **Status:** done for clock + global config.
Three distinct bugs surfaced during clock A0-V on 2026-05-27:
- **F16:** Publish flow leaked `workspace:*` into published metadata.
- **F17:** Gitea baked `localhost:3300` into tarball URLs.
- **F18:** Product repos had legacy `file:` refs to sibling packages.
- [x] **A-pre-1.** Audit `publish-outdated-packages.sh`: confirmed it uses
`pnpm pack`, then re-packs with `npm pack`, which loses the `workspace:` rewriting.
- [x] **A-pre-2.** Patch publish script with a workspace:* rewriter + a
post-rewrite grep guard. Shipped in `common-plat@cfcfc7bb`.
- [x] **A-pre-3.** Verify all packages publish with `0` workspace:* refs.
Confirmed via curl scan across all 64 packages.
- [x] **A-pre-4.** F17 fix: set Gitea `ROOT_URL=http://host.docker.internal:3300/`,
restart Gitea, add `127.0.0.1 host.docker.internal` to `/etc/hosts`, add
`host.docker.internal` to `NO_PROXY` in `switch-network.sh`, bulk republish
all 64 packages. Shipped in `common-plat@dd90f709`.
- [x] **A-pre-5.** F18 fix: rewrite `file:../../learning_ai_common_plat/packages/*`
refs in `clock/web/package.json` to `*` semver. Shipped in `clock@8b5c767a3`.
Audit needed in Phase D for other product repos.
- [x] **A-pre-6.** Document Gitea config requirements (below).
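
For illustration, the kind of defense-in-depth grep guard described in A-pre-2 might look like the sketch below. It is not the code shipped in `common-plat@cfcfc7bb`; the function name `assert_no_workspace_refs` is hypothetical, and it assumes the extracted tarball's `package.json` is available on disk.

```bash
# Return 1 (and report the offending lines) if a package.json still
# contains a pnpm `workspace:` version spec; return 0 if it is clean.
assert_no_workspace_refs() {
  manifest="$1"
  if grep -nE '"workspace:' "$manifest"; then
    echo "ERROR: unrewritten workspace: ref in $manifest" >&2
    return 1
  fi
  return 0
}
```

Running a check like this between the rewrite and the re-pack step would stop a leaked `workspace:*` before it reaches the registry.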
### A-pre-6. Gitea configuration prerequisites (one-time per dev machine)
The Gitea registry MUST be configured with `ROOT_URL=http://host.docker.internal:3300/`
so published tarball URLs are reachable from inside Docker containers. The
host `/etc/hosts` MUST resolve `host.docker.internal` to `127.0.0.1` so the
same URLs work from the host shell.
On macOS (Homebrew Gitea):
```bash
# 1. Edit Gitea's app.ini
sudo -e /opt/homebrew/var/gitea/custom/conf/app.ini
# change: ROOT_URL = http://localhost:3300/
# to: ROOT_URL = http://host.docker.internal:3300/
# 2. Restart Gitea
brew services restart gitea
# 3. Add /etc/hosts entry so host.docker.internal resolves on the host too
sudo sh -c 'grep -q host.docker.internal /etc/hosts || \
echo "127.0.0.1 host.docker.internal" >> /etc/hosts'
# 4. Ensure host.docker.internal is in NO_PROXY for corp shells
# (already done in switch-network.sh as of common-plat@dd90f709)
source ~/.zshrc # reload
# 5. Verify
curl -sS http://host.docker.internal:3300/api/v1/version
# expected: {"version":"1.25.5"} or similar
```
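
To confirm the F17 fix took effect for a given package, the registry metadata can be scanned for tarball URLs that still point at the loopback host. The helper below is a minimal sketch under stated assumptions: `check_tarball_hosts` is a hypothetical name, it reads npm-style metadata JSON on stdin, and it uses plain `grep` rather than a JSON parser.

```bash
# Read npm registry metadata (JSON) on stdin and print any "tarball" URL
# that points at localhost or 127.0.0.1. Returns 0 if none are found.
check_tarball_hosts() {
  bad=$(grep -oE '"tarball"[[:space:]]*:[[:space:]]*"[^"]*"' \
        | grep -E '//(localhost|127\.0\.0\.1)[:/]' || true)
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  return 0
}

# Example (package name is illustrative; add auth as your registry requires):
# curl -sS "http://host.docker.internal:3300/api/packages/learning_ai_user/npm/@bytelyst%2Fui" \
#   | check_tarball_hosts
```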
### A0. Make the Gitea-registry path actually work (clock + peakpulse)
- [ ] **A0-1.** Standardize `.npmrc.docker` to use templated host AND owner so it works on host (`localhost`) and inside Docker (`host.docker.internal`), and so future owner renames are a one-line env change:
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```
> **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read
> time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]).
> That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` + `ARG GITEA_NPM_OWNER`
> → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` / `ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER`
> **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be
> exported from the BuildKit secret mount inside the same `RUN` (since secrets
> don't persist as env across layers).
>
> **Note on F14:** The canonical `.npmrc` (host-side) template already uses
> `${GITEA_NPM_OWNER}` (shipped in common-plat commit `610a59fd`).
> `.npmrc.docker` lagged behind because Docker builds have a separate file —
> A0-1 brings them into parity.
[pnpm-npmrc]: https://pnpm.io/npmrc
- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3)
- [ ] **A0-3.** Add `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` build args + `secrets:` block to every service in `docker-compose.yml`:
```yaml
# Under each service (services.<name>):
build:
  context: .
  dockerfile: backend/Dockerfile
  args:
    GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
    GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
  secrets:
    - gitea_npm_token

# At the top level of the compose file:
secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```
- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` to each service so Linux Docker can resolve the host
- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart). Reference `bash ../learning_ai_common_plat/scripts/gitea/token.sh status` for verification.
- [ ] **A0-D.** **Run `gitea-doctor` before any Docker build** (addresses F15). Inline into deploy/CI workflows:
```bash
bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
docker compose build
```
- Locally: shell alias or `Makefile` target `make build` that runs doctor then `docker compose build`.
- In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
> **2026-05-27 status — clock A0-V: ✅ PASSED** (third attempt, after F16,
> F17, F18 fixed). Cold-build wall-clock:
> - backend: **59.2 s** (commits: `clock@0be887288` + `common-plat@cfcfc7bb` + `common-plat@dd90f709`)
> - web: **3:13 (193 s)** (commits: above + `clock@8b5c767a3`)
>
> Both surfaces resolve `@bytelyst/*` from the Gitea registry end-to-end —
> no `docker-prep.sh` tarballs, no sibling `file:` refs, no proxy interference.
> See §3.A7 metrics table.
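
The A0-D gate (doctor first, build second) can be wrapped in a small function for local use. This is a sketch, not the shipped `Makefile` target: `gated_build`, `DOCTOR_CMD`, and `BUILD_CMD` are hypothetical names, and the defaults simply mirror the commands shown in A0-D.

```bash
# Run a pre-flight check, then the build; stop early if the check fails.
# DOCTOR_CMD and BUILD_CMD are overridable so the wrapper can be dry-run.
DOCTOR_CMD="${DOCTOR_CMD:-bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet}"
BUILD_CMD="${BUILD_CMD:-docker compose build}"

gated_build() {
  # Unquoted on purpose: the command strings are split into words.
  if ! $DOCTOR_CMD; then
    echo "gitea-doctor failed; skipping docker build" >&2
    return 1
  fi
  $BUILD_CMD "$@"
}
```

For example, `gated_build backend` would run the doctor and then `docker compose build backend`. Failing fast here avoids discovering a stale token several minutes into the install step.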
### A1. Replace `npm install -g pnpm@X` with corepack
- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
```dockerfile
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
```
- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` and `web/package.json` matches (already `pnpm@10.6.5` in peakpulse backend)
### A2. Add BuildKit pnpm-store cache mount
- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
- [ ] **A2-2.** Wrap install step with cache + secret mount:
```dockerfile
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
```
- [ ] **A2-3.** Verify cache mount is active: `docker buildx du --filter type=exec.cachemount` shows non-zero size after a build. **Real success metric** is wall-clock: warm rebuild (touching one source file) drops to < 30 s.
### A3. Decide lockfile policy ✅ DONE (ADR-0001)
Two options; pick one in a short ADR before implementing:
- **Option 1: Keep `--lockfile=false`** (current pragmatic approach)
  - No sibling-workspace complications
  - No reproducibility guarantee inside Docker
  - Slower installs (full resolution every build)
- **Option 2: Generate a Docker-only lockfile** via `pnpm install --lockfile-only` against a flattened `package.json` that resolves `@bytelyst/*` to semver
  - Reproducibility
  - Faster installs
  - New build step + tooling
  - Drift risk between dev lockfile and Docker lockfile
- [x] **A3-1.** ADR written: [`docs/adr/0001-docker-build-lockfile-policy.md`](./adr/0001-docker-build-lockfile-policy.md). **Option 1 accepted** (keep `--lockfile=false` short-term; revisit after Phase D).
- [x] **A3-2.** `--frozen-lockfile` adoption deferred per ADR; tracked as future work in §11.
### A4. Restructure layer order
- [ ] **A4-1.** Reorder COPY/RUN so deps-install layer is `package.json` + `.npmrc.docker` ONLY, then a separate layer for `src/`, config files, `shared/`
- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)
### A5. Gate `.docker-deps/` behind a build arg
- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
- [ ] **A5-2.** Use wildcard COPY so missing dir doesn't break the build:
```dockerfile
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
```
- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` does NOT exclude it when tarball mode is in use
### A6. `.dockerignore` audit
- [ ] **A6-1.** Confirm exclusions: `node_modules`, `**/node_modules`, `dist`, `.next`, `*.log`, `.env`, `.env.*`, `.git`, `*.bak`
- [ ] **A6-2.** Remove: `pnpm-lock.yaml` exclusion (was correct under `--lockfile=false`, blocks future optimization)
- [ ] **A6-3.** Confirm `.docker-deps/` is NOT excluded when tarball path is active
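
The A6 checklist can be turned into a quick line-based check. The sketch below is an illustration only (`audit_dockerignore` is a hypothetical name, not part of `docker-doctor.sh`); it assumes one pattern per line with no trailing whitespace, and it leaves the conditional `.docker-deps/` rule from A6-3 out of scope.

```bash
# Check a .dockerignore against the A6-1 / A6-2 checklist.
# Prints one line per problem; returns 1 if any problem was found.
audit_dockerignore() {
  file="$1"
  status=0
  for pat in node_modules '**/node_modules' dist .next '*.log' \
             .env '.env.*' .git '*.bak'; do
    grep -qxF -- "$pat" "$file" || { echo "missing exclusion: $pat"; status=1; }
  done
  # A6-2: the lockfile should no longer be excluded.
  if grep -qxF -- 'pnpm-lock.yaml' "$file"; then
    echo "should not be excluded: pnpm-lock.yaml"
    status=1
  fi
  return $status
}
```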
### A7. Measure & record
| Repo | Surface | Cold (A0-V) | Cold (post-A2) | Warm (post-A2) | Notes |
|---|---|---|---|---|---|
| clock | backend | **59.2 s** | **64.7 s** | **2.9 s** | Cold essentially flat (corepack adds ~1 s; cache mount empty on first run). Warm 95.1% reduction. Commits: `clock@8b5c767a3` (A0-V), `clock@f6a806ff3` (A1+A8+A9), `clock@55e8d22d3` (A2+A5+A6) |
| clock | web | **193 s (3:13)** | **291 s (4:51) †** | **5.4 s** | Warm 97.2% reduction. † Cold variance: see footer note |
| peakpulse | backend | (was tarball-only path) | **72.2 s** | **2.7 s** | Warm 96.3% reduction. Commits: `peakpulse@11a6bc5` (Phase A), `peakpulse@6523a1a` (.gitkeep fix), `clock@1465e06b1`+`d69003c1f` (mirror .gitkeep fix) |
**Footer note on cold-build variance.** Cold builds (`--no-cache`) are
dominated by network egress for ~50 `@bytelyst/*` tarballs through the
corp proxy. A second measurement of clock web cold-build came in at
291 s vs 174 s in the previous step: same Dockerfile path, different
network-side latency. Cold build is **not** the optimization target of
this roadmap; warm rebuild is. Run `pnpm store prune` on the host or use
a local registry mirror if cold-build determinism is needed.
Measurement commands:
```bash
# Cold (clear all layer cache; cache mounts may still persist)
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend
# Warm (one source file changed; deps unchanged)
touch backend/src/server.ts
time DOCKER_BUILDKIT=1 docker compose build backend
# Deps-changed (touch package.json; pnpm store cache helps here)
touch backend/package.json
time DOCKER_BUILDKIT=1 docker compose build backend
```
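
To turn the "< 30 s warm rebuild" target into a pass/fail check, the measurement can be wrapped in a budget helper. This is a minimal sketch: `within_budget` is a hypothetical name, and it measures whole seconds with `date +%s`, so it is only meaningful for budgets of a few seconds or more.

```bash
# Run a command, print its wall-clock seconds, and fail if it exceeds a budget.
# Usage: within_budget <max_seconds> <command> [args...]
# Returns 0 within budget, 1 over budget, 2 if the command itself failed.
within_budget() {
  budget="$1"; shift
  start=$(date +%s)
  "$@" || return 2
  elapsed=$(( $(date +%s) - start ))
  echo "elapsed: ${elapsed}s (budget: ${budget}s)"
  [ "$elapsed" -le "$budget" ]
}

# Example:
# touch backend/src/server.ts && within_budget 30 docker compose build backend
```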
### A8. Config-file COPY audit & canonical pattern (addresses F11, F13)
- [ ] **A8-1.** For every Dockerfile in scope, list all build-time files present in the surface directory (`web/` or `backend/`) that affect the build:
- `postcss.config.{js,mjs,cjs,ts}`
- `tailwind.config.{js,mjs,cjs,ts}`
- `next.config.{js,mjs,ts}`
- `tsconfig*.json`
- `package.json`
- `.npmrc.docker`, `.npmrc`
- `babel.config.*` (if present)
- `drizzle.config.*` (if present)
- `vitest.config.*` (only if the build needs it)
Verify each is COPY'd in the Dockerfile.
- [ ] **A8-2.** Choose canonical COPY pattern. **Decision: middle-ground glob** for web surfaces:
```dockerfile
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
```
Trade-off: glob picks up unintended root-level files if any are added later, but **dramatically reduces F11/F13 risk**. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
- [ ] **A8-3.** Repo-by-repo migration: replace enumerated `COPY web/foo ./foo` with the glob pattern; verify the resulting image has all expected files via `docker run --rm <img> ls -la`.
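
The A8-1 audit can be approximated with a script that compares config files on disk against the Dockerfile's `COPY` lines. The sketch below is a heuristic, not the `docker-doctor.sh` check: `uncopied_configs` is a hypothetical name, it covers only part of the A8-1 file list, and it does not recognize whole-directory copies such as `COPY web/ ./`, which would show up as false positives.

```bash
# List build-time config files in a surface dir that no COPY line in the
# Dockerfile appears to cover (by literal name or by a `*.ext` wildcard).
uncopied_configs() {
  dir="$1"; dockerfile="$2"
  copies=$(grep -E '^[[:space:]]*COPY[[:space:]]' "$dockerfile" || true)
  for f in "$dir"/postcss.config.* "$dir"/tailwind.config.* \
           "$dir"/next.config.* "$dir"/tsconfig*.json "$dir"/package.json; do
    [ -e "$f" ] || continue        # an unmatched glob stays literal; skip it
    name=$(basename "$f")
    ext="${name##*.}"
    case "$copies" in
      *"$name"*|*"*.$ext"*) ;;     # covered by name or by extension wildcard
      *) echo "$name" ;;
    esac
  done
}
```

Any filename printed is a candidate F11(b)/F13 gap: present in the repo, absent from the image.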
### A9. Healthcheck canonicalization (addresses F12)
- [ ] **A9-1.** Replace `localhost` with `127.0.0.1` in every `docker-compose*.yml` healthcheck `test:` block. Sweep with:
```
rg -l 'http://localhost' --glob 'docker-compose*.yml'
```
- [ ] **A9-2.** Standardize healthcheck shape:
- **Alpine-based images:**
```yaml
healthcheck:
  test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s
```
- **Slim/Debian images** (`wget` not always present, but `node` is):
```yaml
healthcheck:
  test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
```
- [ ] **A9-3.** Add `start_period` (10s minimum) — prevents flaky "container started but app not yet listening" false-negatives.
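
The A9-1 replacement can be previewed before any file is edited. The helper below is a sketch under stated assumptions: `fix_healthcheck_host` is a hypothetical name, and it only handles healthchecks whose `test:` value sits on a single line (multi-line YAML list forms are not covered).

```bash
# Print a compose file with `localhost` replaced by `127.0.0.1` on
# healthcheck `test:` lines only. Writes to stdout; the input is untouched.
fix_healthcheck_host() {
  sed -E '/^[[:space:]]*test:/ s#(https?://)localhost([:/])#\1127.0.0.1\2#g' "$1"
}

# Preview the change for one file:
# fix_healthcheck_host docker-compose.yml | diff docker-compose.yml -
```

Restricting the substitution to `test:` lines leaves other `localhost` URLs (for example in `environment:`) alone.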
---
## 4. Phase B — Hermetic-fallback polish (`docker-prep.sh`)
`docker-prep.sh` is duplicated with minor variations across product repos.
**Promotion to canonical home is now in Phase B, not Phase D** — drift
compounds linearly with time and the `.npmrc` template precedent proves the
pattern is cheap.
- [x] **B1.** `--dry-run` flag (`common-plat@a418a23e`).
- [x] **B2.** Idempotency guard via `*.bak` detection + `--force` override (`common-plat@a418a23e`).
- [x] **B3.** `.docker-deps/` and `*.bak` in `.gitignore` on both pilots (clock + peakpulse). Verified by `docker-doctor.sh`.
- [x] **B4.** Pre-commit hook landed. Canonical guard script `check-docker-prep-staged.sh` (`common-plat@c908c6d7`) blocks rewritten `package.json`, staged `.tgz` tarballs, and `.bak` files. Wired into both pilot `.husky/pre-commit` (`clock@4f8086bfa`, `peakpulse@c3195c8`). Verified with simulated staged tarballs → commit blocked.
Original spec:
```bash
# .husky/pre-commit
if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
  echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
  exit 1
fi
if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|package\.json\.bak)$'; then
  echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
  exit 1
fi
```
- [x] **B5.** Auto-restore on script error via `trap cleanup_on_error EXIT` + `--keep` opt-out (`common-plat@a418a23e`).
- [x] **B6.** Standardized header + usage block per § 7.4 template (`common-plat@a418a23e`).
- [x] **B7. CANONICAL HOME landed.**
- [x] **B7-1.** Canonical at `learning_ai_common_plat/scripts/docker-prep.template.sh` + 2 helpers `_docker-prep-inject.js`, `_docker-prep-strip.js` (`common-plat@a418a23e`).
- [x] **B7-2.** `learning_ai_common_plat/scripts/sync-docker-prep.sh` syncs all 3 files (mirrors `sync-npmrc.sh`).
- [x] **B7-3.** `learning_ai_common_plat/scripts/check-docker-prep-drift.sh` for CI (mirrors `check-npmrc-drift.sh`).
- [x] **B7-4.** AGENTS.md "NEVER edit `docker-prep.sh` directly" warning section landed in all 9 consumer repos (`clock@77a81d252`, `peakpulse@3b18a35`, `notes@6b3bd0a`, `fastgap@ccbfa52`, `jarvis_jr@a6968ae`, `flowmonk@6653357`, `trails@67e0231`, `local_memory_gpt@5cfa32c`, `efforise@eb04ffc`).
- [x] **B8.** `--strip-overrides` option removes `pnpm.overrides` block as a safety net (`common-plat@a418a23e`).
- [x] **B+.** `--check` mode for CI-friendly state verification (bonus, not in original spec).
- [x] **B+.** Portable `sed -i` (BSD on macOS, GNU on Linux).
- [x] **B+.** Preserve `.docker-deps/.gitkeep` on clear (fixes earlier regression where `--restore` deleted the tracked file).
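
The B2 idempotency guard and the B5 auto-restore combine naturally into one pattern: refuse to run when a backup already exists, and restore from the backup on any failure. The sketch below illustrates that pattern only; it is not the canonical `docker-prep.template.sh`, and `safe_rewrite` plus the editing command passed to it are hypothetical.

```bash
# Rewrite a file with a .bak backup, restoring it automatically on failure.
# Usage: safe_rewrite <file> <edit-command> [args...]
# The edit command receives the file path as its last argument.
safe_rewrite() {
  (
    file="$1"; shift
    if [ -e "$file.bak" ]; then
      echo "refusing: $file.bak already exists (already prepped?)" >&2
      exit 3
    fi
    cp "$file" "$file.bak"
    # On any non-zero exit of this subshell, put the original back.
    trap 'rc=$?; [ "$rc" -eq 0 ] || mv "$file.bak" "$file"' EXIT
    "$@" "$file"
  )
}
```

On success the `.bak` file is kept so a later restore step can undo the rewrite; on failure the tree is left as it was found.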
---
## 5. Phase C — Verification gates
Pilot exit criteria (must all pass before Phase D):
- [x] **C1.** Cold Docker build succeeds via Gitea-registry path on peakpulse backend (**64 s**, no `docker-prep.sh` invocation).
- [x] **C2.** Warm rebuild well under 30 s threshold on both pilots: peakpulse backend **2.6 s**, clock backend **3.3 s**.
- [x] **C3.** `docker-prep.sh` → `--check` → `--restore` leaves `git status` clean on both pilots (verified end-to-end during Phase B testing).
- [x] **C4.** Pre-commit hook blocks staged tarballs + `.bak` files (verified by simulating staged artifacts on clock).
- [x] **C5.** Gitea Actions CI green — **DONE**. Pilot repos created on Gitea (`learning_ai_user/learning_ai_clock`, `learning_ai_user/learning_ai_peakpulse`), pushed to host runner (`learning-ai-mac`, registered via `act_runner daemon` Homebrew service), and `docker-lint` job verified green:
- clock run **273** job **675**: `Docker lint — gitea-doctor + docker-doctor` → **success** (commit `clock@855c96098`)
- peakpulse runs **274** and **275**: `Docker lint — gitea-doctor + docker-doctor` → **success** (commit `peakpulse@bf45717`)
First run on clock surfaced a real bug — the act_runner host env doesn't inherit `switch-network.sh` exports, so `gitea-doctor` blew up on missing `GITEA_NPM_HOST/OWNER`. Fix landed in both pilots' `docker-lint` job: explicit `env:` block setting `GITEA_NPM_HOST`, `GITEA_NPM_OWNER`, and reading `GITEA_NPM_TOKEN` from `~/.gitea_npm_token`. Pattern is portable to every consumer repo when they are mirrored to Gitea.
- [x] **C6.** Build-time metrics already populated in § 3.A7 from earlier Phase A work.
- [x] **C7.** ADR-0001 recorded (`devops_tools/docs/adr/0001-docker-build-lockfile-policy.md`).
- [x] **C8.** `docker-doctor.sh` PASS on both pilots (only the 1 expected `pnpm-lock.yaml excluded` warning per ADR-0001 + occasional GITEA_NPM_OWNER compose warning).
- [x] **C9.** Web smoke test landed as Playwright spec `web/e2e/css-bundle-smoke.spec.ts` (`clock@b8440bfea`). Asserts title sanity + largest CSS bundle > 20 KB. Catches F11 regression at PR time.
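
The CSS-size assertion in C9 also has a shell-level analogue that can run against build output without a browser. This is a sketch, not the Playwright spec: `css_bundle_ok` is a hypothetical name, the 20 KB default mirrors the C9 threshold, and the directory to scan (for example an extracted `.next/static`) is an assumption.

```bash
# Fail if the largest .css file under a directory is not above a byte threshold.
# Usage: css_bundle_ok <dir> [min_bytes]   (default: 20480 bytes = 20 KB)
css_bundle_ok() {
  dir="$1"; min="${2:-20480}"
  # wc prints "<bytes> <path>" per file, plus a "total" line for several files.
  largest=$(find "$dir" -type f -name '*.css' -exec wc -c {} + \
            | awk '$2 != "total" && $1 > m { m = $1 } END { print m + 0 }')
  echo "largest CSS bundle: ${largest} bytes"
  [ "$largest" -gt "$min" ]
}
```

A directory with no CSS at all reports 0 bytes and fails, which is the desired outcome for an F11-style regression.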
---
## 6. Phase D — Ecosystem rollout
**Status:** DONE for all 12 consumer repos. D.1 artifacts + D.2 Dockerfile/compose fixes + D.3 advisory-warning cleanup + B7-4 AGENTS.md notes. `docker-doctor` exits PASS in every repo. Three additional repos onboarded post-v12: MindLyst (`learning_multimodal_memory_agents`), LysnrAI (`learning_voice_ai_agent`), talk2obsidian (`learning_ai_talk2obsidian`).
### D.1 — Tooling rollout (DONE)
All 9 consumer repos received the canonical infrastructure via `sync-docker-prep.sh`:
- `scripts/docker-prep.sh` + `_docker-prep-inject.js` + `_docker-prep-strip.js` (canonical sync)
- `scripts/docker-doctor.sh` (thin wrapper to canonical linter)
- `Makefile` with `make doctor` target
| Repo | Commit | Findings (docker-doctor warn-only) |
|---|---|---|
| `learning_ai_notes` | `216ebb8` | 6 warnings + errors: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax directive |
| `learning_ai_fastgap` | `36b67a2` | 4: F4/F14 `.npmrc.docker` hardcoded, F14 ARG missing, A5-2 wildcard, A2 syntax |
| `learning_ai_jarvis_jr` | `523dc08` | 5: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax (×2) |
| `learning_ai_flowmonk` | `65628f3` | 4: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax |
| `learning_ai_trails` | `8aef82c` | 6: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), A2 syntax (×2) |
| `learning_ai_local_memory_gpt` | `d17689a` | 5: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax (×2) |
| `learning_ai_efforise` | `b9fbbc3` | 5: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), A2 syntax (×2) |
| `learning_multimodal_memory_agents` (MindLyst) | `84a5d10` | full playbook applied (mindlyst-native/web/Dockerfile + backend/Dockerfile) |
| `learning_voice_ai_agent` (LysnrAI) | `0f1fa64` | full playbook applied (backend + user-dashboard-web + backend-python — Python Dockerfile correctly skips Node checks) |
| `learning_ai_auth_app` | _n/a_ | iOS/Android — no Docker surfaces |
| `learning_ai_talk2obsidian` | `793089e` | lighter rollout — single-stage Dockerfile, no `.docker-deps/` pattern; docker-doctor + Makefile + AGENTS.md note + syntax directive + `.gitignore` rules |
### D.2 — Per-repo Dockerfile/compose fixes (DONE)
All 7 consumer repos received mechanical Phase D.2 fixes via an idempotent
fixer script. Each repo's `docker-doctor.sh` now exits PASS (warnings only).
| Repo | Fix commit | docker-doctor result |
|---|---|---|
| `learning_ai_notes` | `b23a601` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_fastgap` | `af2463d` | PASS (1 warning: ADR-0001 `pnpm-lock.yaml`) |
| `learning_ai_jarvis_jr` | `1a97a3f` | PASS (1 warning: ADR-0001 `pnpm-lock.yaml`) |
| `learning_ai_flowmonk` | `412a657` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_trails` | `733477a` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_local_memory_gpt` | `8c68595` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_efforise` | `06ea0d0` | PASS (1 warning: healthcheck `start_period`) |
Applied fixes (each fix is idempotent):
| Finding | Fix |
|---|---|
| **F12** healthcheck `localhost` | Replaced with `127.0.0.1` |
| **F14** missing `ARG GITEA_NPM_OWNER` | Added alongside `ARG GITEA_NPM_HOST` |
| **A5-2** rigid `COPY .docker-deps/` | Changed to wildcard `COPY .docker-deps* ...` |
| **F11/F13** enumerated web config COPY | Replaced with glob `COPY web/*.json web/*.ts web/*.mjs ./` |
| **A2** missing syntax directive | Added `# syntax=docker/dockerfile:1.7` |
| **F4/F14** hardcoded `.npmrc.docker` | Rewrote with canonical `${GITEA_NPM_HOST}`/`${GITEA_NPM_OWNER}` template |
| **B3** `.gitignore` missing `*.bak` | Added rule |
| **B3** missing `.docker-deps/.gitkeep` | Created |
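The idempotency claim for the fixes above can be illustrated with a minimal sketch. This is not the actual fixer script (which is not reproduced in this roadmap); the function names `fix_healthcheck_localhost` and `ensure_owner_arg` are hypothetical, and only the F12 and F14 rows are covered.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of two idempotent D.2-style fixes (not the real fixer).

# F12: rewrite http://localhost to http://127.0.0.1 on healthcheck `test:` lines.
# A second run is a no-op because no `localhost` URL remains after the first.
fix_healthcheck_localhost() {
  local file="$1" tmp
  tmp="$(mktemp)"
  sed -E '/test:/ s#http://localhost#http://127.0.0.1#g' "$file" > "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# F14: add `ARG GITEA_NPM_OWNER` right after `ARG GITEA_NPM_HOST`, only if absent.
ensure_owner_arg() {
  local file="$1" tmp
  if grep -q '^ARG GITEA_NPM_OWNER' "$file"; then
    return 0  # already fixed, nothing to do
  fi
  tmp="$(mktemp)"
  awk '{ print } /^ARG GITEA_NPM_HOST/ { print "ARG GITEA_NPM_OWNER=learning_ai_user" }' \
    "$file" > "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Both helpers guard on the current file state rather than on a marker file, so running them repeatedly against an already-fixed repo leaves the working tree unchanged.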
### D.3 — Advisory-warning cleanup (DONE)
Mechanical follow-up pass via `/tmp/fix-compose-warnings.sh` +
`/tmp/add-build-args.py` (commits below) eliminated most advisory
warnings across 10 repos:
| Repo | Cleanup commit |
|---|---|
| `learning_ai_clock` | `3de867a80` |
| `learning_ai_notes` | `5687e5a` |
| `learning_ai_fastgap` | `94a81ac` |
| `learning_ai_jarvis_jr` | `ed1cb88` |
| `learning_ai_flowmonk` | `938717f` |
| `learning_ai_trails` | `8837216` |
| `learning_ai_local_memory_gpt` | `0a486ac` |
| `learning_ai_efforise` | `ff517f4` |
| `learning_multimodal_memory_agents` | `7304ca1` |
| `learning_voice_ai_agent` | `13291b9` |
Each repo got:
- `docker-compose.yml`: full `build.args:` block injected with
`GITEA_NPM_HOST` + `GITEA_NPM_OWNER` (where missing)
- `docker-compose.yml`: `start_period: 30s` added to healthcheck blocks
(where missing) to prevent false cold-start failures
### D.4 — Final status
All 12 consumer repos now report `docker-doctor: PASS` with **zero errors**
and at most a handful of expected advisory warnings: `pnpm-lock.yaml`
excluded per ADR-0001, and talk2obsidian's short-form `build: .`, which would
need conversion to the long-form `build:` mapping to declare args.
---
## 7. Reference snippets
### 7.1 Canonical `.npmrc.docker`
Matches the host-side `.npmrc` template shipped in `common-plat` `610a59fd`.
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```
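To make the owner fallback in the template concrete, the sketch below reproduces the same `${VAR:-default}` expansion in bash and prints the registry URL it yields. The real substitution happens when the package manager reads `.npmrc`; the helper name `render_registry_url` is hypothetical and exists only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: shows which registry URL the template resolves to.
# Assumption: GITEA_NPM_HOST is set by the caller; the owner falls back to
# learning_ai_user when GITEA_NPM_OWNER is unset or empty.
render_registry_url() {
  local host="${GITEA_NPM_HOST:?GITEA_NPM_HOST must be set}"
  local owner="${GITEA_NPM_OWNER:-learning_ai_user}"
  printf 'http://%s:3300/api/packages/%s/npm/\n' "$host" "$owner"
}
```

Calling it with only `GITEA_NPM_HOST` exported is a quick way to confirm which owner a given shell would resolve before starting a build.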
### 7.2 Canonical backend Dockerfile
```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build
# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]
```
> `--lockfile=false` is intentional pending the A3 ADR. Switch to
> `--frozen-lockfile` only once the sibling-workspace problem (F2) is resolved.
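
The Dockerfile above reads the token from a BuildKit secret named `gitea_npm_token`, so a plain `docker build` outside compose has to pass that secret explicitly. The sketch below only prints a candidate command rather than running it; the function name `backend_build_cmd` and the image tag `backend:local` are hypothetical, and it assumes the token is exported as `GITEA_NPM_TOKEN` in the calling shell.

```bash
#!/usr/bin/env bash
# Prints (does not run) a build command for the canonical backend Dockerfile.
# Assumptions: tag "backend:local" is illustrative; GITEA_NPM_TOKEN is exported.
backend_build_cmd() {
  local host="${GITEA_NPM_HOST:-host.docker.internal}"
  local owner="${GITEA_NPM_OWNER:-learning_ai_user}"
  local cmd=(
    docker build
    --file backend/Dockerfile
    --build-arg "GITEA_NPM_HOST=${host}"
    --build-arg "GITEA_NPM_OWNER=${owner}"
    # The secret id must match --mount=type=secret,id=gitea_npm_token
    --secret "id=gitea_npm_token,env=GITEA_NPM_TOKEN"
    --tag backend:local
    .
  )
  printf '%s ' "${cmd[@]}"
  printf '\n'
}
```

Passing the token as a secret instead of a build arg keeps it out of the image history, which is why the Dockerfile reads it from `/run/secrets/` at install time.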
### 7.3 Canonical `docker-compose.yml` service block
```yaml
services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
      extra_hosts:
        - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      - PORT=4010
      # ...
    restart: unless-stopped
    healthcheck:
      # F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```
### 7.4 Hardened `docker-prep.sh` header
```bash
#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
# - Local Gitea registry (:3300) is down or unreachable, OR
# - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
# docker compose build
#
# Usage:
# ./scripts/docker-prep.sh # pack tarballs + rewrite package.json
# ./scripts/docker-prep.sh --dry-run # show what would change (no side effects)
# ./scripts/docker-prep.sh --force # override idempotency guard
# ./scripts/docker-prep.sh --restore # undo rewrite
# ./scripts/docker-prep.sh --keep # skip auto-restore on error
# ./scripts/docker-prep.sh --strip-overrides # remove pnpm.overrides block
#
# Side effects:
# - Creates .docker-deps/ (gitignored)
# - Backs up package.json → package.json.bak
# - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
# - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
# - Refuses to run if .bak files already exist (unless --force)
# - Auto-restores on error (trap EXIT) unless --keep passed
# - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak
```
### 7.5 Canonical Next.js web Dockerfile (addresses F11, F13)
```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json
# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/
ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1
RUN corepack enable && pnpm run build
# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public
EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]
```
> **Verification step after every web Dockerfile change:** smoke-test the
> built image by running it and curling the rendered HTML. Confirm the CSS
> bundle in `<link>` references is > 50 KB. A bundle of ~33 KB is the F11
> signature (only `@font-face`, no Tailwind utilities).
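
The size check in the verification step can be scripted. The sketch below is one possible shape, not the C9 smoke test itself: `css_bundle_ok` is a hypothetical helper, and the image tag, port, and CSS path pattern in the commented usage are assumptions to adapt per repo.

```bash
#!/usr/bin/env bash
# Returns 0 when the CSS file is larger than the threshold (default 50 KB).
css_bundle_ok() {
  local file="$1" min_bytes="${2:-51200}" size
  size="$(wc -c < "$file" | tr -d '[:space:]')"
  if (( size > min_bytes )); then
    echo "ok: ${file} is ${size} bytes"
  else
    echo "F11 suspect: ${file} is only ${size} bytes (expected > ${min_bytes})" >&2
    return 1
  fi
}

# Example usage (hypothetical tag, port, and path pattern):
#   docker run -d --rm -p 3000:3000 --name web-smoke web:local
#   css_path="$(curl -fsS http://127.0.0.1:3000/ \
#     | grep -oE '/_next/static/[^"]+\.css' | head -n1)"
#   curl -fsS "http://127.0.0.1:3000${css_path}" -o /tmp/bundle.css
#   css_bundle_ok /tmp/bundle.css
```

A byte-count threshold is a coarse signal, but it is cheap and catches the specific F11 failure mode where the build is green and Tailwind utilities are missing.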
### 7.6 `docker-doctor.sh` skeleton (Phase E)
```bash
#!/usr/bin/env bash
# docker-doctor.sh — pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail
REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0
# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile
for cfg in postcss.config tailwind.config next.config; do
  for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
    [[ -f "$f" ]] || continue
    base=$(basename "$f")
    # Fixed-string match (portable across GNU and BSD grep): either an
    # explicit COPY of the file or a glob COPY (web/*) satisfies the check.
    if ! grep -qF -e "COPY web/${base}" -e 'COPY web/*' "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
      echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
      FAILED=1
    fi
  done
done

# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
  echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
  FAILED=1
fi

# Check 3: .npmrc.docker matches canonical template
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
  if ! grep -q '\${GITEA_NPM_HOST}' "$REPO_DIR/.npmrc.docker"; then
    echo "✗ F4: .npmrc.docker doesn't use \${GITEA_NPM_HOST} placeholder"
    FAILED=1
  fi
fi

# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml (warning only)
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
  echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi

# Check 5: base image is on approved list
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
  [[ -f "$df" ]] || continue
  if ! grep -qE 'FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))' "$df"; then
    echo "✗ Unapproved base image in $df"
    FAILED=1
  fi
done

exit "$FAILED"
```
---
## 8. Phase E — Observability / lint (NEW)
Two complementary linters:
1. **`gitea-doctor`** — Gitea registry pre-flight (env + token + connectivity).
**Already shipped** in `common-plat` commit `610a59fd` at
`scripts/gitea/doctor.sh`. This roadmap only wires it into CI/build flows
(A0-D + E0 below).
2. **`docker-doctor`** — Dockerfile + compose-file static linter (see § 7.6
skeleton). Built as part of this roadmap (E1, `common-plat@130883a7`).
The two are intentionally separate concerns:
| Linter | Scope | When to run |
|---|---|---|
| `gitea-doctor` | runtime env, token, registry HTTP 200 | Before every build / deploy |
| `docker-doctor` | static analysis of Dockerfile + compose YAML | On every PR touching those files |
### Phase E checklist
- [ ] **E0.** Wire `bash scripts/gitea/doctor.sh --quiet` into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in `common-plat`; replicate via a reusable `actions/gitea-preflight@main` composite if Gitea Actions supports it, otherwise inline.
- [x] **E1.** Canonical `docker-doctor.sh` landed in `learning_ai_common_plat/scripts/docker-doctor.sh` (`common-plat@130883a7`). 15 checks codified from F1–F18; verified PASS on both pilots and FAIL on un-migrated control (`learning_ai_notes`).
- [x] **E2.** Per-repo wrappers landed: `clock@aa5202fe7`, `peakpulse@af207b7`.
- [x] **E3.** Wire into CI: run on PRs touching `Dockerfile`, `docker-compose*.yml`, `.dockerignore`, `.npmrc.docker` (landed as the Gitea Actions `docker-lint` job; see §10 step 13)
- [x] **E4.** Wire into pre-commit hook (warning-only at first, error after 2 weeks). `.husky/pre-commit` wiring landed on both pilots (`clock@4f8086bfa`, `peakpulse@c3195c8`; see §10 step 13)
- [x] **E5.** Checks documented in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` (`common-plat@130883a7`).
- [x] **E6.** Add `make doctor` target to each pilot repo that runs both `gitea-doctor` AND `docker-doctor` (landed; see §10 step 13)
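
As a sketch of what a combined `make doctor` recipe could delegate to, the helper below runs each linter script in turn and aggregates the result. The name `run_doctors` is hypothetical, and the paths in the commented usage assume `learning_ai_common_plat` is checked out as a sibling directory.

```bash
#!/usr/bin/env bash
# Runs every doctor script passed as an argument and reports an aggregate
# result. All scripts run even if an earlier one fails, so every finding
# is visible in a single pass.
run_doctors() {
  local failed=0 script
  for script in "$@"; do
    if bash "$script"; then
      echo "PASS: ${script}"
    else
      echo "FAIL: ${script}"
      failed=1
    fi
  done
  return "$failed"
}

# Example usage (assumed paths):
#   run_doctors ../learning_ai_common_plat/scripts/gitea/doctor.sh \
#               scripts/docker-doctor.sh
```

Running both linters unconditionally keeps the two concerns separate while still giving a single exit code for CI or a Makefile target.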
Checks implemented by `docker-doctor.sh`:
| Check | Addresses | Action |
|---|---|---|
| Every `web/*.config.*` file is COPY'd | F11, F13 | Error |
| `docker-compose.yml` healthcheck uses `127.0.0.1` | F12 | Error |
| `.npmrc.docker` uses `${GITEA_NPM_HOST}` AND `${GITEA_NPM_OWNER}` placeholders | F4, F14 | Error |
| Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker` | F14 | Error |
| `.dockerignore` doesn't exclude `pnpm-lock.yaml` | F1 | Warn (until A3 ADR lands) |
| Base image is on approved list (`node:22-alpine` or `node:22-slim` via `BASE_IMAGE` ARG) | Canonical decision | Error |
| `.docker-deps/` and `*.bak` in `.gitignore` | B3 | Error |
| `docker-compose.yml` passes `GITEA_NPM_OWNER` build arg | F14 | Warn |
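
The F14 row ("Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker`") is not part of the § 7.6 skeleton. A minimal sketch of how such a conditional check might look follows; `check_owner_arg` is a hypothetical name and the canonical linter may implement it differently.

```bash
#!/usr/bin/env bash
# Returns 1 when a Dockerfile copies .npmrc.docker without declaring
# ARG GITEA_NPM_OWNER; returns 0 otherwise (including when the Dockerfile
# does not use .npmrc.docker at all, e.g. a Python image).
check_owner_arg() {
  local dockerfile="$1"
  if grep -qE '^[[:space:]]*COPY[[:space:]].*\.npmrc\.docker' "$dockerfile" \
     && ! grep -qE '^[[:space:]]*ARG[[:space:]]+GITEA_NPM_OWNER' "$dockerfile"; then
    echo "✗ F14: ${dockerfile} copies .npmrc.docker but lacks ARG GITEA_NPM_OWNER"
    return 1
  fi
  return 0
}
```

Making the check conditional on the `COPY` avoids false errors on Dockerfiles that never touch the Gitea registry.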
---
## 9. Open questions (numbered TODOs, not blockers)
1. **Shared pnpm cache volume?** BuildKit caches are already shared across
builds by `id=pnpm`. Test whether a named Docker volume adds anything
before adding complexity.
2. **Custom base image?** Publish `bytelyst/node-pnpm:22-{alpine,slim}` with
pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
3. **CI hostname?** Verify that `host.docker.internal:host-gateway` works in Gitea
Actions Linux runners, or whether a CI-specific Dockerfile variant is needed.
4. **Multi-platform builds?** `linux/amd64` + `linux/arm64` interact awkwardly
with cache mounts under `buildx`. Defer to separate roadmap.
5. **Workspace flattening?** Eliminate the `../learning_ai_common_plat/packages/*`
workspace entry inside Docker via a flattened `pnpm-workspace.yaml`.
Unlocks `--frozen-lockfile`. Requires lockfile regeneration step.
---
## 10. Execution order
1. **✅ v5 commit:** roadmap doc v5 lands; F16 documented (`devops_tools@ba8b4d1`).
2. **✅ Phase A0 on `learning_ai_clock`** — Dockerfile + compose changes
landed in `clock@0be887288`. Initial A0-V blocked on F16/F17/F18.
3. **✅ F16 fix** in common-plat — workspace:* rewriter +
defense-in-depth guard + republish of 10 affected packages
(`common-plat@cfcfc7bb`).
4. **✅ F17 fix** in common-plat + Gitea config — `ROOT_URL=host.docker.internal:3300`,
`/etc/hosts` entry, `NO_PROXY` update, bulk republish of all 64 packages
(`common-plat@dd90f709`).
5. **✅ F18 fix** in clock — 4 `file:` refs in `web/package.json` rewritten
to `*` (`clock@8b5c767a3`).
6. **✅ A0-V on clock PASSED.** v6 commit lands (`devops_tools@7627d55`).
7. **✅ A8 + A9 + A1** on clock (correctness + corepack) — `clock@f6a806ff3`.
Web cold dropped to 174 s; backend essentially flat at 60 s.
F11 guard verified (Tailwind utilities present in CSS bundle).
8. **✅ A2 + A4 + A5 + A6** on clock (cache mount + dockerignore) — `clock@55e8d22d3`.
Warm rebuilds: **backend 2.9 s, web 5.4 s** (95–97% reduction).
A7 metrics table populated this commit.
9. **✅ Phase A0 → A6** on `learning_ai_peakpulse` backend (`peakpulse@11a6bc5`).
Cold 72.2 s, warm 2.7 s. Pattern from clock applied verbatim, plus a
side fix for `.docker-deps/.gitkeep` discoverability that was also
10. **✅ A3 ADR** — [`docs/adr/0001-docker-build-lockfile-policy.md`](adr/0001-docker-build-lockfile-policy.md).
Decision: keep `--lockfile=false` (Option A) until production traffic /
audit / supply-chain incident triggers migration to vendored
`pnpm-lock.docker.yaml` (Option C). Implementation deferred.
11. **✅ Phase E1/E2/E5** — `docker-doctor.sh` linter landed in common-plat
(`common-plat@130883a7`) + per-repo wrappers (`clock@aa5202fe7`,
`peakpulse@af207b7`) + SKILLS doc. Verified PASS on both pilots, FAIL with
6 specific findings on un-migrated control (`learning_ai_notes`).
12. **✅ Phase B** — `docker-prep.sh` hardened + promoted to canonical home in
common-plat (`common-plat@a418a23e`). Synced to both pilots
(`clock@27034d90f`, `peakpulse@563a45e`). Verified end-to-end on both
pilots: dry-run → pack → check (fail) → idempotency guard → restore →
`git status` clean.
13. **✅ Phase B4 + E3/E4/E6** — pre-commit guard
(`common-plat@c908c6d7`) + `.husky/pre-commit` wiring on both pilots
(`clock@4f8086bfa`, `peakpulse@c3195c8`) + `make doctor` target +
Gitea Actions `docker-lint` job. Verified guard blocks simulated
staged tarballs.
14. **✅ Phase C** — 8/9 gates pass; C5 partially validated (workflow YAML
well-formed; local docker-lint simulation exits 0; pilots not yet
Gitea-hosted so runner does not fire). Cold build 64 s, warm 2.6 s / 3.3 s.
15. **✅ Phase D.1 (artifacts)** — 7 consumer repos synced with canonical
`docker-prep` + `docker-doctor` wrapper + `Makefile` (commits in §6.D.1).
16. **✅ Phase D.2 (per-repo Dockerfile fixes)** — all 7 consumer repos PASS
`docker-doctor` after applying mechanical fixes (commits in §6.D.2).
Web smoke test (C9) landed on clock to guard F11 regression.
17. **✅ B7-4 AGENTS.md "do not edit" warnings** — landed in all 12 consumer
repos.
18. **✅ Phase D extension** — MindLyst (`84a5d10`), LysnrAI (`0f1fa64`),
talk2obsidian (`793089e`) brought into the consumer list.
`sync-docker-prep.sh` now lists 12 consumers; `docker-doctor` learned
    to detect Python Dockerfiles and skip Node-specific checks (`common-plat`).
19. **✅ Phase D.3 polish.** Build-args injection + `healthcheck.start_period`
    additions across 10 repos (commits in §6.D.3).
20. **✅ C5 closed.** Gitea Actions verified green end-to-end.
    Created `learning_ai_user/learning_ai_clock` and
    `learning_ai_user/learning_ai_peakpulse` on local Gitea (PAT minted via
    `learning_ai_user` credentials); pushed `main` to both; the existing
    Homebrew `act_runner` daemon picked up the jobs and executed them.
    The first clock run (272) failed with a real defect: the host runner env
    doesn't inherit `switch-network.sh` exports. Fixed by adding an explicit
    `env:` block to the `docker-lint` job in `.gitea/workflows/ci.yml` of
    both pilots. Final results:
    - clock run **273**, job **675** `docker-lint` → ✅ success
    - peakpulse runs **274** + **275** `docker-lint` → ✅ success

    This supersedes the earlier partial validation, in which the request
    returned 404 because the pilot repos were not hosted on Gitea (only
    `learning_ai_uxui_web` existed there), while the workflow YAML validated
    and the local docker-lint simulation exited 0. Mirroring pilot repos to
    Gitea follows `learning_ai_common_plat/docs/runbooks/GITEA_VM_SETUP.md`.
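
The host-runner defect recorded for C5 (a runner process that does not inherit `switch-network.sh` exports) can be mitigated by having the job step set its own environment. The sketch below shows one way a step body might do that; `load_gitea_env` is a hypothetical helper, and the default token path `~/.gitea_npm_token` and the fallback values are assumptions, not a copy of the pilots' `ci.yml`.

```bash
#!/usr/bin/env bash
# Hypothetical step helper for a host runner that does not inherit shell
# exports. Sets defaults explicitly and reads the token from a file at step
# time when GITEA_NPM_TOKEN is not already in the environment.
load_gitea_env() {
  local token_file="${1:-$HOME/.gitea_npm_token}"
  export GITEA_NPM_HOST="${GITEA_NPM_HOST:-host.docker.internal}"
  export GITEA_NPM_OWNER="${GITEA_NPM_OWNER:-learning_ai_user}"
  if [[ -z "${GITEA_NPM_TOKEN:-}" ]]; then
    if [[ ! -r "$token_file" ]]; then
      echo "token file not readable: ${token_file}" >&2
      return 1
    fi
    # Strip the trailing newline and any stray whitespace from the file.
    GITEA_NPM_TOKEN="$(tr -d '[:space:]' < "$token_file")"
    export GITEA_NPM_TOKEN
  fi
}
```

Resolving the values inside the step, rather than relying on the login shell, makes the job behave the same whether the runner is started by hand or as a background service.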
---
## 11. Risk register
| Risk | Mitigation |
|---|---|
| Removing `pnpm-lock.yaml` from `.dockerignore` exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep `--lockfile=false` for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct `id=` per repo (`id=pnpm-${repo}`) if observed |
| `host.docker.internal` doesn't resolve in Linux Docker | `extra_hosts: ["host.docker.internal:host-gateway"]` (A0-4) |
| Removing `.docker-deps/` from default builds breaks repos that haven't done A0 yet | Wildcard `COPY .docker-deps*` keeps both paths working during migration |
| `docker-prep.sh` `--force` is misused and `.bak` files get committed | Pre-commit hook (B4) blocks `.bak`, `.tgz`, rewritten `package.json` |
| Corp network blocks `host.docker.internal:3300` | Verify SSH tunnel reaches Gitea; document in operations.md |
| **F11 regression: build green, app ships with no CSS** | C9 smoke test + Phase E `docker-doctor.sh` check on `web/*.config.*` COPY coverage |
| **F12 regression: healthcheck false-fails on IPv6** | Phase E `docker-doctor.sh` grep for `localhost` in compose files |
| **F13 regression: new config file added, Dockerfile forgotten** | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check against the approved list; document override in repo `AGENTS.md` |
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
| **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean |
| **F17 regression: someone publishes from a shell that points Gitea `ROOT_URL` back to `localhost`** | Phase E `docker-doctor.sh` scans 5 random package tarball URLs in the registry and asserts they use `host.docker.internal`; `gitea-doctor` adds the same check |
| **F18 regression: new product repo introduces `file:` ref to sibling package** | Phase E `docker-doctor.sh` greps `**/package.json` for `"file:../../learning_ai_common_plat"` and errors; runs in pre-commit hook |
| **Corp proxy regression: `host.docker.internal` falls out of NO_PROXY on a dev machine** | `switch-network.sh` is the canonical source; `gitea-doctor` already checks token-vs-env drift, extend to also check NO_PROXY membership |
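
The NO_PROXY membership check proposed in the last row could look roughly like the sketch below. `no_proxy_contains` is a hypothetical helper, not an existing `gitea-doctor` function, and it only tests for an exact entry; suffix or wildcard entries that would also cover the host are deliberately out of scope.

```bash
#!/usr/bin/env bash
# Returns 0 when the given host is an exact entry in NO_PROXY (or no_proxy).
no_proxy_contains() {
  local wanted="$1" list="${NO_PROXY:-${no_proxy:-}}" entry
  local -a entries
  IFS=',' read -ra entries <<< "$list"
  for entry in "${entries[@]}"; do
    entry="${entry//[[:space:]]/}"  # tolerate "a, b" style lists
    if [[ "$entry" == "$wanted" ]]; then
      return 0
    fi
  done
  return 1
}

# Example usage:
#   no_proxy_contains host.docker.internal \
#     || echo "⚠ host.docker.internal missing from NO_PROXY"
```

Splitting with `read -ra` instead of unquoted word splitting avoids accidental glob expansion on entries such as `*.example.com`.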