bytelyst-devops-tools/docs/docker-build-optimization-roadmap.md
saravanakumardb1 8025cd5d82 docs(docker): roadmap v4 — integrate Gitea hardening (F14, F15)
Adds 2 new findings to the docker build optimization roadmap and updates
templates to consume the new GITEA_NPM_OWNER env var shipped in common-plat
commit 610a59fd.

- F14: hardcoded Gitea owner literal across 14 repos (now resolved upstream
  via ${GITEA_NPM_OWNER:-learning_ai_user})
- F15: stale shell-env tokens (caught by scripts/gitea/doctor.sh)
- A0-1, A0-3, 7.1, 7.2, 7.5: snippets updated to thread GITEA_NPM_OWNER
  through .npmrc.docker, Dockerfile ARG/ENV, and docker-compose build args
- A0-D: new step — run gitea-doctor.sh as pre-build gate (replaces
  'wait 4 minutes for ERR_PNPM_AUTHENTICATION' with 'fail fast in 2 sec')
- Phase E: now distinguishes gitea-doctor (shipped) from docker-doctor (to
  build). Adds two new docker-doctor checks for F14
- Risk register: F14/F15 mitigations called out explicitly
2026-05-27 00:53:33 -07:00

# Docker Build Optimization Roadmap
> **Status:** Draft v4 (Gitea hardening integrated) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
> ecosystem-wide rollout.
>
> **Upstream prerequisite shipped (commit `610a59fd` in `learning_ai_common_plat`):**
> Gitea owner parameterization + helper scripts (`scripts/gitea/doctor.sh`,
> `scripts/gitea/token.sh`). The `.npmrc` template now resolves owner from
> `${GITEA_NPM_OWNER:-learning_ai_user}`. **All A0-1 work in this roadmap
> inherits this — Dockerfile/.npmrc.docker must use the same `${GITEA_NPM_OWNER}`
> placeholder, not a hardcoded literal.**
---
## 0. Pre-flight audit findings (2026-05-27)
A read-only audit of pilot repos + lessons from recent live incidents surfaced
**15 concrete bugs/gaps** (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the
inverse of the casual narrative: tarballs are the de facto default, the
Gitea-registry path is partially wired, and there is a separate class of
"build green, app broken" silent failures (F11–F13) that the speed-focused
plan needs to address first.
| # | Finding | Location | Severity |
|---|---|---|---|
| F1 | `pnpm-lock.yaml` is in `.dockerignore` — any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | High |
| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*`, so `--frozen-lockfile` inside Docker will fail unless the workspace is flattened or the sibling tree is copied | both pilots | High |
| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line) — the "Gitea-registry" path doesn't work in this repo today | `peakpulse/.npmrc.docker` | High |
| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300` — from inside Docker, `localhost` is the container, not the host registry | `clock/.npmrc.docker` | High |
| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount — wholly dependent on pre-populated `.docker-deps/` | `clock/backend/Dockerfile` | High |
| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it; no `--mount=type=secret` either | `clock/web/Dockerfile` | Medium |
| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block | `peakpulse/docker-compose.yml` | Medium |
| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile — every build requires `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`): 5–10 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
| F10 | No BuildKit `--mount=type=cache` for pnpm store — cold install on every rebuild even when deps unchanged | all four Dockerfiles | High (main speed win) |
| **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** |
| **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** |
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` to `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
**Implications:**
- The original "switch to `--frozen-lockfile` + Gitea registry" plan requires
two upstream fixes first (F1, F2).
- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we
ship faster builds of broken apps.
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
F11/F13 recurrence — they are silent in CI today.
---
## 1. Context: three build paths
| Path | Status today | Trigger | Notes |
|---|---|---|---|
| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock/notes | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to be the default |
| **Legacy `file:` refs** | Deprecated | — | Removed during pnpm/Gitea migration |
### Measurement targets
| Build | Baseline (observed) | Target after Phase A |
|---|---|---|
| Cold (no cache) | ~2–3 min | ≤ 2 min |
| Warm (one source file changed) | ~2–3 min | **< 30 s** |
| `docker-prep.sh` pack step alone | ~60–90 s | < 30 s (pnpm pack cache) |
> Fill in actuals during Phase C.
---
## 2. Goals & non-goals
**Goals**
- Eliminate the F11–F13 class of silent "build green, app broken" failures
- Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
- Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
- Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
- Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
- Ship `docker-doctor.sh` CI lint as the durable insurance layer
**Non-goals**
- Migrating off pnpm or off the Gitea registry
- Adopting `--frozen-lockfile` until F2 is resolved (sibling-workspace problem)
- Publishing `@bytelyst/*` to the public npm registry
- Multi-platform builds (separate roadmap)
---
## 2.5 Canonical decisions
Decisions taken now to avoid contradictions later in the doc:
- **Base image:** `node:22-alpine` is canonical. For repos blocked by the
corporate proxy's Alpine SSL interception (currently only
`learning_ai_notes`), the Dockerfile MUST expose:
```dockerfile
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
```
Override per-repo via `--build-arg BASE_IMAGE=node:22-slim`. Document the
override in the repo's `AGENTS.md`.
- **Healthcheck host:** `127.0.0.1` (NOT `localhost`) in every
`docker-compose*.yml` `test:` block. See F12.
- **Lockfile mode in Docker:** `--lockfile=false` for now. `--frozen-lockfile`
is blocked on the A3 ADR (F2).
---
## 3. Phase A — Correctness + build speed + path correctness
Order matters: A0 must precede A1+ (you can't optimize a path that doesn't
work), and A8+A9 (correctness) must land before measuring speed wins.
### A0. Make the Gitea-registry path actually work (clock + peakpulse)
- [ ] **A0-1.** Standardize `.npmrc.docker` to use templated host AND owner so it works on host (`localhost`) and inside Docker (`host.docker.internal`), and so future owner renames are a one-line env change:
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```
> **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read
> time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]).
> That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` + `ARG GITEA_NPM_OWNER`
> → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` / `ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER`
> **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be
> exported from the BuildKit secret mount inside the same `RUN` (since secrets
> don't persist as env across layers).
>
> **Note on F14:** The canonical `.npmrc` (host-side) template already uses
> `${GITEA_NPM_OWNER}` (shipped in common-plat commit `610a59fd`).
> `.npmrc.docker` lagged behind because Docker builds have a separate file —
> A0-1 brings them into parity.
[pnpm-npmrc]: https://pnpm.io/npmrc
- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3)
- [ ] **A0-3.** Add `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` build args + `secrets:` block to every service in `docker-compose.yml`:
```yaml
    # under each service
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
# top level of the compose file
secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```
- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` to each service so Linux Docker can resolve the host
- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart). Reference `bash ../learning_ai_common_plat/scripts/gitea/token.sh status` for verification.
- [ ] **A0-D.** **Run `gitea-doctor` before any Docker build** (addresses F15). Inline into deploy/CI workflows:
```bash
bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
docker compose build
```
  - Locally: shell alias or `Makefile` target `make build` that runs doctor then `docker compose build`.
  - In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
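
Before running A0-V it can help to preview what the `.npmrc.docker` placeholders should resolve to in the current shell. The following is a minimal sketch, not part of `common-plat`: `resolve_registry_url` and `require_token` are hypothetical helper names, and the fallbacks simply mirror the defaults used in A0-1 and A0-3.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (assumption: not shipped anywhere in the ecosystem).

# Print the registry URL using the same fallbacks as the A0-3 compose args.
resolve_registry_url() {
  local host="${GITEA_NPM_HOST:-host.docker.internal}"
  local owner="${GITEA_NPM_OWNER:-learning_ai_user}"
  printf 'http://%s:3300/api/packages/%s/npm/\n' "$host" "$owner"
}

# Return non-zero when the token is missing from the current shell.
require_token() {
  if [ -z "${GITEA_NPM_TOKEN:-}" ]; then
    echo "GITEA_NPM_TOKEN is not exported in this shell" >&2
    return 1
  fi
}
```

A possible use is `require_token && resolve_registry_url` just before `docker compose build --no-cache`. It does not replace `scripts/gitea/doctor.sh`, which also checks connectivity and env-vs-file drift.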
### A1. Replace `npm install -g pnpm@X` with corepack
- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
```dockerfile
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
```
- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` and `web/package.json` matches (already `pnpm@10.6.5` in peakpulse backend)
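
A1-2 can be checked mechanically. This is a small sketch under two assumptions: the `packageManager` field sits on a single line, and it carries no `+sha…` integrity suffix. `package_manager_of` and `pins_pnpm` are hypothetical names.

```bash
# Extract the "packageManager" value from a package.json without needing jq.
package_manager_of() {
  sed -n 's/.*"packageManager"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Return 0 when the file pins exactly pnpm@<version>.
pins_pnpm() {
  [ "$(package_manager_of "$1")" = "pnpm@$2" ]
}
```

For example, `pins_pnpm backend/package.json 10.6.5 || echo "packageManager drift"` would flag a mismatch with the version passed to `corepack prepare`.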
### A2. Add BuildKit pnpm-store cache mount
- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
- [ ] **A2-2.** Wrap install step with cache + secret mount:
```dockerfile
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
```
- [ ] **A2-3.** Verify cache mount is active: `docker buildx du --filter type=exec.cachemount` shows non-zero size after a build. **Real success metric** is wall-clock: warm rebuild (touching one source file) drops to < 30 s.
### A3. Decide lockfile policy (BLOCKED on F2 resolution)
Two options; pick one in a short ADR before implementing:
- **Option 1: Keep `--lockfile=false`** (current pragmatic approach)
  - Pro: no sibling-workspace complications
  - Con: no reproducibility guarantee inside Docker
  - Con: slower installs (full resolution every build)
- **Option 2: Generate a Docker-only lockfile** via `pnpm install --lockfile-only` against a flattened `package.json` that resolves `@bytelyst/*` to semver
  - Pro: reproducibility
  - Pro: faster installs
  - Con: new build step + tooling
  - Con: drift risk between dev lockfile and Docker lockfile
- [ ] **A3-1.** Write 1-page ADR (`docs/decisions/0001-docker-lockfile-policy.md`) and pick Option 1 or 2
- [ ] **A3-2.** Defer `--frozen-lockfile` adoption until ADR lands
### A4. Restructure layer order
- [ ] **A4-1.** Reorder COPY/RUN so deps-install layer is `package.json` + `.npmrc.docker` ONLY, then a separate layer for `src/`, config files, `shared/`
- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)
### A5. Gate `.docker-deps/` behind a build arg
- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
- [ ] **A5-2.** Use wildcard COPY so missing dir doesn't break the build:
```dockerfile
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
```
- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` does NOT exclude it when tarball mode is in use
### A6. `.dockerignore` audit
- [ ] **A6-1.** Confirm exclusions: `node_modules`, `**/node_modules`, `dist`, `.next`, `*.log`, `.env`, `.env.*`, `.git`, `*.bak`
- [ ] **A6-2.** Remove: `pnpm-lock.yaml` exclusion (was correct under `--lockfile=false`, blocks future optimization)
- [ ] **A6-3.** Confirm `.docker-deps/` is NOT excluded when tarball path is active
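
The A6-1 list lends itself to a quick scripted audit. A minimal sketch, assuming each exclusion appears in `.dockerignore` as its own literal line (negations and equivalent spellings are not understood); `missing_ignores` is a hypothetical name.

```bash
# Print every required exclusion from A6-1 that is absent from the given .dockerignore.
missing_ignores() {
  local file="$1" pattern
  for pattern in 'node_modules' '**/node_modules' 'dist' '.next' '*.log' '.env' '.env.*' '.git' '*.bak'; do
    # -x: whole-line match, -F: treat the pattern as a literal string, not a regex
    grep -qxF -- "$pattern" "$file" || printf '%s\n' "$pattern"
  done
}
```

Running `missing_ignores .dockerignore` prints nothing when all nine entries are present.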
### A7. Measure & record
| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
|---|---|---|---|---|---|---|
| clock | web | | | | | |
| clock | backend | | | | | |
| peakpulse | backend | | | | | |
Use:
```
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend # cold
touch backend/src/server.ts && time docker compose build backend # warm
```
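
To turn the warm-rebuild target into a pass/fail check rather than a number read off the terminal, a wrapper along these lines could be used. It is a sketch with whole-second resolution; `within_budget` is a hypothetical helper name.

```bash
# Run a command and fail when it exceeds a wall-clock budget given in seconds.
# Exit codes: 0 = within budget, 1 = over budget, 2 = the command itself failed.
within_budget() {
  local budget="$1"; shift
  local start end elapsed
  start=$(date +%s)
  "$@" || return 2
  end=$(date +%s)
  elapsed=$((end - start))
  echo "elapsed: ${elapsed}s (budget: ${budget}s)"
  [ "$elapsed" -le "$budget" ]
}
```

For the warm case this might look like `touch backend/src/server.ts && within_budget 30 docker compose build backend`.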
### A8. Config-file COPY audit & canonical pattern (addresses F11, F13)
- [ ] **A8-1.** For every Dockerfile in scope, list all build-time files present in the surface directory (`web/` or `backend/`) that affect the build:
  - `postcss.config.{js,mjs,cjs,ts}`
  - `tailwind.config.{js,mjs,cjs,ts}`
  - `next.config.{js,mjs,ts}`
  - `tsconfig*.json`
  - `package.json`
  - `.npmrc.docker`, `.npmrc`
  - `babel.config.*` (if present)
  - `drizzle.config.*` (if present)
  - `vitest.config.*` (only if the build needs it)

  Verify each is COPY'd in the Dockerfile.
- [ ] **A8-2.** Choose canonical COPY pattern. **Decision: middle-ground glob** for web surfaces:
```dockerfile
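# Dockerfile COPY does not expand {a,b} braces, so each extension gets its own glob
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./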
COPY web/public/ ./public/
COPY web/src/ ./src/
```
Trade-off: glob picks up unintended root-level files if any are added later, but **dramatically reduces F11/F13 risk**. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
- [ ] **A8-3.** Repo-by-repo migration: replace enumerated `COPY web/foo ./foo` with the glob pattern; verify the resulting image has all expected files via `docker run --rm <img> ls -la`.
### A9. Healthcheck canonicalization (addresses F12)
- [ ] **A9-1.** Replace `localhost` with `127.0.0.1` in every `docker-compose*.yml` healthcheck `test:` block. Sweep with:
```
rg -l 'http://localhost' --glob 'docker-compose*.yml'
```
- [ ] **A9-2.** Standardize healthcheck shape:
  - **Alpine-based images:**
```yaml
healthcheck:
  # Compose interpolates ${PORT} from the host environment or .env at parse time;
  # write $${PORT} instead to defer to the container's own PORT variable.
  test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s
```
  - **Slim/Debian images** (`wget` not always present, but `node` is):
```yaml
healthcheck:
  test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
```
- [ ] **A9-3.** Add `start_period` (10s minimum) — prevents flaky "container started but app not yet listening" false-negatives.
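
The A9-1 sweep finds offending files; the rewrite itself can be scripted too. A minimal sketch, assuming the healthcheck URL sits on the same line as `test:` (multi-line YAML lists would be missed); `fix_healthcheck_host` is a hypothetical name.

```bash
# Rewrite localhost to 127.0.0.1 on healthcheck "test:" lines only.
fix_healthcheck_host() {
  local file="$1" tmp
  tmp="$(mktemp)"
  # The /test:/ address restricts the substitution so other URLs stay untouched.
  if sed -E '/test:/ s#(https?://)localhost#\1127.0.0.1#g' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # write back in place, preserving the file's permissions
  fi
  rm -f "$tmp"
}
```

Combined with the `rg -l` sweep, each reported file can be passed to `fix_healthcheck_host`; reviewing the resulting `git diff` before committing is still advisable.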
---
## 4. Phase B — Hermetic-fallback polish (`docker-prep.sh`)
`docker-prep.sh` is duplicated with minor variations across product repos.
**Promotion to canonical home is now in Phase B, not Phase D** — drift
compounds linearly with time and the `.npmrc` template precedent proves the
pattern is cheap.
- [ ] **B1.** Add `--dry-run` flag — list packs/rewrites, no side effects
- [ ] **B2.** Idempotency guard — refuse to run if any `*.bak` exists unless `--force`
- [ ] **B3.** Ensure `.docker-deps/` and `*.bak` are in `.gitignore` of every pilot repo
- [ ] **B4.** Pre-commit hook (husky) — block commits containing rewritten `package.json`, staged tarballs, OR `.bak` files:
```bash
# .husky/pre-commit
if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
  echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
  exit 1
fi
if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|package\.json\.bak)$'; then
  echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
  exit 1
fi
```
- [ ] **B5.** Auto-restore on script error via `trap restore_on_error EXIT` (unless `--keep` passed)
- [ ] **B6.** Update script header comment per § 7.4 template
- [ ] **B7. CANONICAL HOME (was deferred — now in Phase B proper).**
  - [ ] **B7-1.** Move script to `learning_ai_common_plat/scripts/docker-prep.template.sh`
  - [ ] **B7-2.** Add `learning_ai_common_plat/scripts/sync-docker-prep.sh` to copy template into all product repos (mirrors `sync-npmrc.sh`)
  - [ ] **B7-3.** Add `learning_ai_common_plat/scripts/check-docker-prep-drift.sh` for CI (mirrors `check-npmrc-drift.sh`)
  - [ ] **B7-4.** Update every repo's `AGENTS.md` with the "NEVER edit `docker-prep.sh` directly" warning + template link
- [ ] **B8.** Add `--strip-overrides` option that removes `pnpm.overrides` block after build — safety net in case `--restore` is forgotten
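
B2 and B5 interact in a way that is easy to get wrong: an `EXIT` trap fires on successful exits too, so the handler has to inspect the exit status itself. The sketch below illustrates both pieces. The function bodies are assumptions (the real `docker-prep.sh` is not shown here), `guard_no_backups` is a hypothetical name, and `KEEP` is a hypothetical variable standing in for the `--keep` flag.

```bash
# B2: refuse to continue when a previous run left *.bak files behind.
guard_no_backups() {
  local dir="$1" force="${2:-}"
  if [ "$force" != "--force" ] &&
     [ -n "$(find "$dir" -name '*.bak' -not -path '*/node_modules/*')" ]; then
    echo "ERROR: *.bak files found; run --restore or pass --force" >&2
    return 1
  fi
}

# B5: only restore when the script is exiting with a non-zero status.
restore_on_error() {
  local status=$?
  if [ "$status" -ne 0 ] && [ "${KEEP:-0}" != "1" ]; then
    echo "docker-prep failed (exit $status); restoring package.json" >&2
    # restore logic would go here, e.g. moving each package.json.bak back
  fi
  return "$status"
}
# In the script proper: trap restore_on_error EXIT
```

Keeping the guard as a separate function makes it straightforward to reuse from `--dry-run` (B1), which could report the leftover backups without failing.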
---
## 5. Phase C — Verification gates
Pilot exit criteria (must all pass before Phase D):
- [ ] **C1.** Cold Docker build succeeds on both pilots via Gitea-registry path (no `docker-prep.sh` invocation)
- [ ] **C2.** Warm rebuild (single source file touched) < 30 s on both pilots
- [ ] **C3.** Running `docker-prep.sh`, then `docker compose build`, then `docker-prep.sh --restore` leaves `git status` clean
- [ ] **C4.** Pre-commit hook blocks: (a) rewritten `package.json`, (b) staged `.tgz`, (c) staged `.bak`
- [ ] **C5.** Gitea Actions CI green on both pilots (verify CI uses the same Dockerfile path)
- [ ] **C6.** Build-time metrics filled into the table in § 3.A7
- [ ] **C7.** ADR recorded for A3 (lockfile policy)
- [ ] **C8.** `docker-doctor.sh` (Phase E) runs clean against both pilots
- [ ] **C9.** Smoke test: render the web app, inspect `<head>` for non-trivial CSS bundle (> 50 KB), confirm Tailwind classes apply. Guard against F11 regression.
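
The C9 size check can be approximated offline against a copy of the build output. This sketch sums every `*.css` file under a directory and compares the total to the 50 KB floor. Treating the sum of all CSS files as "the bundle" is an assumption, and `css_bytes` / `css_bundle_ok` are hypothetical names.

```bash
# Total size in bytes of all *.css files under the given directory (0 when none exist).
css_bytes() {
  find "$1" -type f -name '*.css' -exec cat {} + | wc -c | tr -d '[:space:]'
}

# Return 0 when the CSS total is above the 50 KB floor from C9.
css_bundle_ok() {
  [ "$(css_bytes "$1")" -gt $((50 * 1024)) ]
}
```

Pointed at an extracted `.next/static` directory, a failing `css_bundle_ok` would be a hint to look for the F11 signature before any manual rendering check.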
---
## 6. Phase D — Ecosystem rollout (deferred until § 5 passes)
Apply Phase A + B + E to remaining repos. **Pilots excluded.**
| Repo | Backend | Web | docker-prep | Healthcheck | Notes |
|---|---|---|---|---|---|
| `learning_ai_notes` | ☐ | ☐ | ☐ | ☐ | `BASE_IMAGE=node:22-slim` override (corp proxy Alpine SSL) |
| `learning_ai_fastgap` | ☐ | ☐ | ☐ | ☐ | Mobile + web + backend |
| `learning_ai_jarvis_jr` | ☐ | ☐ | ☐ | ☐ | F12 incident already fixed; verify regression-proof |
| `learning_ai_flowmonk` | ☐ | ☐ | ☐ | ☐ | `.npmrc.docker` is tarball-only — needs A0-1 |
| `learning_ai_trails` | ☐ | ☐ | ☐ | ☐ | |
| `learning_ai_local_memory_gpt` | ☐ | ☐ | ☐ | ☐ | SQLite-based; F11(b) already fixed `07cdf6b` — verify regression-proof |
| `learning_multimodal_memory_agents` (MindLyst) | ☐ | ☐ | ☐ | ☐ | KMP repo, different layout |
| `learning_voice_ai_agent` (LysnrAI) | ☐ | ☐ | ☐ | ☐ | Python desktop + TS dashboards |
| `learning_ai_efforise` | ☐ | ☐ | ☐ | ☐ | |
| `learning_ai_auth_app` | ☐ | n/a | ☐ | n/a | iOS/Android — no Docker surfaces |
| `learning_ai_talk2obsidian` | ☐ | ☐ | ☐ | ☐ | Single-container app |
---
## 7. Reference snippets
### 7.1 Canonical `.npmrc.docker`
Matches the host-side `.npmrc` template shipped in `common-plat` `610a59fd`.
```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```
### 7.2 Canonical backend Dockerfile
```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build
# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]
```
> `--lockfile=false` is intentional pending the A3 ADR. Switch to
> `--frozen-lockfile` only once the sibling-workspace problem (F2) is resolved.
### 7.3 Canonical `docker-compose.yml` service block
```yaml
services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
      # Build-time host mapping; service-level extra_hosts only affects the running container
      extra_hosts:
        - "host.docker.internal:host-gateway"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      - PORT=4010
      # ...
    restart: unless-stopped
    healthcheck:
      # F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```
### 7.4 Hardened `docker-prep.sh` header
```bash
#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
# - Local Gitea registry (:3300) is down or unreachable, OR
# - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
# docker compose build
#
# Usage:
# ./scripts/docker-prep.sh # pack tarballs + rewrite package.json
# ./scripts/docker-prep.sh --dry-run # show what would change (no side effects)
# ./scripts/docker-prep.sh --force # override idempotency guard
# ./scripts/docker-prep.sh --restore # undo rewrite
# ./scripts/docker-prep.sh --keep # skip auto-restore on error
# ./scripts/docker-prep.sh --strip-overrides # remove pnpm.overrides block
#
# Side effects:
# - Creates .docker-deps/ (gitignored)
# - Backs up package.json → package.json.bak
# - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
# - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
# - Refuses to run if .bak files already exist (unless --force)
# - Auto-restores on error (trap EXIT) unless --keep passed
# - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak
```
### 7.5 Canonical Next.js web Dockerfile (addresses F11, F13)
```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web
ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER
RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/
RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
--mount=type=secret,id=gitea_npm_token \
export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
pnpm install --ignore-scripts --lockfile=false
# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json
# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/
ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1
RUN corepack enable && pnpm run build
# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public
EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]
```
> **Verification step after every web Dockerfile change:** smoke-test the
> built image by running it and curling the rendered HTML. Confirm the CSS
> bundle in `<link>` references is > 50 KB. A bundle of ~33 KB is the F11
> signature (only `@font-face`, no Tailwind utilities).
### 7.6 `docker-doctor.sh` skeleton (Phase E)
```bash
#!/usr/bin/env bash
# docker-doctor.sh — pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail
REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0
# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile
for cfg in postcss.config tailwind.config next.config; do
  for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
    [[ -f "$f" ]] || continue
    base=$(basename "$f")
    if ! grep -q "COPY web/${base}\\|COPY web/\\*" "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
      echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
      FAILED=1
    fi
  done
done
# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
  echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
  FAILED=1
fi
# Check 3 (F4/F14): .npmrc.docker uses the templated host and owner
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
  if ! grep -q '\${GITEA_NPM_HOST}' "$REPO_DIR/.npmrc.docker"; then
    echo "✗ F4: .npmrc.docker doesn't use \${GITEA_NPM_HOST} placeholder"
    FAILED=1
  fi
  # Matches both ${GITEA_NPM_OWNER} and ${GITEA_NPM_OWNER:-default}
  if ! grep -q '\${GITEA_NPM_OWNER' "$REPO_DIR/.npmrc.docker"; then
    echo "✗ F14: .npmrc.docker doesn't use \${GITEA_NPM_OWNER} placeholder"
    FAILED=1
  fi
fi
# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml (warning only)
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
  echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi
# Check 5: base image is on approved list
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
  [[ -f "$df" ]] || continue
  if ! grep -qE 'FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))' "$df"; then
    echo "✗ Unapproved base image in $df"
    FAILED=1
  fi
done
exit $FAILED
```
---
## 8. Phase E — Observability / lint (NEW)
Two complementary linters:
1. **`gitea-doctor`** — Gitea registry pre-flight (env + token + connectivity).
**Already shipped** in `common-plat` commit `610a59fd` at
`scripts/gitea/doctor.sh`. This roadmap only wires it into CI/build flows
(A0-D + E0 below).
2. **`docker-doctor`** — Dockerfile + compose-file static linter (see § 7.6
skeleton). To be built as part of this roadmap.
The two are intentionally separate concerns:
| Linter | Scope | When to run |
|---|---|---|
| `gitea-doctor` | runtime env, token, registry HTTP 200 | Before every build / deploy |
| `docker-doctor` | static analysis of Dockerfile + compose YAML | On every PR touching those files |
### Phase E checklist
- [ ] **E0.** Wire `bash scripts/gitea/doctor.sh --quiet` into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in `common-plat`; replicate via a reusable `actions/gitea-preflight@main` composite if Gitea Actions supports it, otherwise inline.
- [ ] **E1.** Land `docker-doctor.sh` in `learning_ai_common_plat/scripts/` (canonical, mirrors `gitea/doctor.sh` shipped earlier)
- [ ] **E2.** Provide a thin per-repo wrapper at `scripts/docker-doctor.sh` that calls the canonical
- [ ] **E3.** Wire into CI: run on PRs touching `Dockerfile`, `docker-compose*.yml`, `.dockerignore`, `.npmrc.docker`
- [ ] **E4.** Wire into pre-commit hook (warning-only at first, error after 2 weeks)
- [ ] **E5.** Document checks in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` (sibling doc to the existing `gitea-doctor` patterns)
- [ ] **E6.** Add `make doctor` target to each pilot repo that runs both `gitea-doctor` AND `docker-doctor`
Checks implemented by `docker-doctor.sh`:
| Check | Addresses | Action |
|---|---|---|
| Every `web/*.config.*` file is COPY'd | F11, F13 | Error |
| `docker-compose.yml` healthcheck uses `127.0.0.1` | F12 | Error |
| `.npmrc.docker` uses `${GITEA_NPM_HOST}` AND `${GITEA_NPM_OWNER}` placeholders | F4, F14 | Error |
| Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker` | F14 | Error |
| `.dockerignore` doesn't exclude `pnpm-lock.yaml` | F1 | Warn (until A3 ADR lands) |
| Base image is on approved list (`node:22-alpine` or `node:22-slim` via `BASE_IMAGE` ARG) | Canonical decision | Error |
| `.docker-deps/` and `*.bak` in `.gitignore` | B3 | Error |
| `docker-compose.yml` passes `GITEA_NPM_OWNER` build arg | F14 | Warn |
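
Three of the checks above (F12 and the two F14 rows) could be sketched as small grep-based functions. This is an illustration under assumptions, not the shipped skeleton: the function names are hypothetical, the healthcheck `test:` is assumed to sit on a single line, and the placeholder check matches the `${GITEA_NPM_OWNER` prefix so a `:-default` form also passes.

```shell
#!/usr/bin/env bash
# Hypothetical check functions for docker-doctor.sh; each returns non-zero on failure.

# F12: a compose healthcheck should probe 127.0.0.1, not localhost.
# Assumption: the healthcheck command is written as a single-line `test:` entry.
check_healthcheck_host() {
  local compose_file="$1"
  if grep -qE '^[[:space:]]*test:.*localhost' "$compose_file"; then
    echo "✗ healthcheck in $compose_file uses localhost (use 127.0.0.1)"
    return 1
  fi
  return 0
}

# F4/F14: .npmrc.docker must reference both placeholders.
# Single quotes are intentional: the literal text ${...} is what we search for.
check_npmrc_placeholders() {
  local npmrc="$1" prefix rc=0
  for prefix in '${GITEA_NPM_HOST' '${GITEA_NPM_OWNER'; do
    if ! grep -qF -- "$prefix" "$npmrc"; then
      echo "✗ $npmrc is missing the ${prefix}} placeholder"
      rc=1
    fi
  done
  return "$rc"
}

# F14: a Dockerfile that COPYs .npmrc.docker must declare ARG GITEA_NPM_OWNER.
check_dockerfile_owner_arg() {
  local df="$1"
  if grep -qE '^[[:space:]]*COPY[[:space:]].*\.npmrc\.docker' "$df" &&
    ! grep -qE '^[[:space:]]*ARG[[:space:]]+GITEA_NPM_OWNER' "$df"; then
    echo "✗ $df copies .npmrc.docker but never declares ARG GITEA_NPM_OWNER"
    return 1
  fi
  return 0
}
```

Each function reports one finding and leaves the Error-versus-Warn decision to the caller, which keeps the severity column of the table in one place.
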
---
## 9. Open questions (numbered TODOs, not blockers)
1. **Shared pnpm cache volume?** BuildKit caches are already shared across
builds by `id=pnpm`. Test whether a named Docker volume adds anything
before adding complexity.
2. **Custom base image?** Publish `bytelyst/node-pnpm:22-{alpine,slim}` with
pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
3. **CI hostname?** Verify whether `host.docker.internal:host-gateway` works in
Gitea Actions Linux runners, or whether a CI-specific Dockerfile variant is needed.
4. **Multi-platform builds?** `linux/amd64` + `linux/arm64` interact awkwardly
with cache mounts under `buildx`. Defer to separate roadmap.
5. **Workspace flattening?** Eliminate the `../learning_ai_common_plat/packages/*`
workspace entry inside Docker via a flattened `pnpm-workspace.yaml`.
Unlocks `--frozen-lockfile`. Requires lockfile regeneration step.
---
## 10. Execution order
1. **Now (this commit):** roadmap doc v4 lands here; sign-off requested.
2. **Phase A0 on `learning_ai_clock`** (web + backend) — pilot order
intentionally inverted vs. v2: web is where F11/F13 incidents lived, and
clock exercises both surface types in one repo. Fix `.npmrc.docker`,
`docker-compose.yml`, `.dockerignore`. Verify **A0-V** (Gitea path works
end-to-end) before any speed work.
3. **A8 + A9 + A1** on clock (correctness before speed). Commit.
4. **A2 + A4 + A5 + A6** on clock. Measure. Commit.
5. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as a second
validation pass for the simpler case.
6. **A7** — fill in metrics table.
7. **A3 ADR** — decide lockfile policy (defer implementation).
8. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
home in common-plat (B7) and sync to peakpulse.
9. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
10. **Phase C** — verification gates C1–C9.
11. **Phase D** — scheduled separately, only after § 5 passes.
---
## 11. Risk register
| Risk | Mitigation |
|---|---|
| Removing `pnpm-lock.yaml` from `.dockerignore` exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep `--lockfile=false` for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct `id=` per repo (`id=pnpm-${repo}`) if observed |
| `host.docker.internal` doesn't resolve in Linux Docker | `extra_hosts: ["host.docker.internal:host-gateway"]` (A0-4) |
| Removing `.docker-deps/` from default builds breaks repos that haven't done A0 yet | Wildcard `COPY .docker-deps*` keeps both paths working during migration |
| `docker-prep.sh` `--force` is misused and `.bak` files get committed | Pre-commit hook (B4) blocks `.bak`, `.tgz`, rewritten `package.json` |
| Corp network blocks `host.docker.internal:3300` | Verify SSH tunnel reaches Gitea; document in operations.md |
| **F11 regression: build green, app ships with no CSS** | C9 smoke test + Phase E `docker-doctor.sh` check on `web/*.config.*` COPY coverage |
| **F12 regression: healthcheck false-fails on IPv6** | Phase E `docker-doctor.sh` grep for `localhost` in compose files |
| **F13 regression: new config file added, Dockerfile forgotten** | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E checks the base image against the approved list; document any override in the repo's `AGENTS.md` |
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
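
The F15 mitigation amounts to a fail-fast gate in front of the build. A minimal sketch, assuming the doctor script is reachable at whatever path the caller passes in; `gated_build` is a hypothetical name, and only the `--quiet` flag is taken from E0.

```shell
#!/usr/bin/env bash
# Run the Gitea pre-flight first; start the build only if it passes.
# Usage: gated_build <path-to-doctor.sh> <build command...>
gated_build() {
  local doctor="$1"
  shift
  if ! bash "$doctor" --quiet; then
    echo "✗ gitea-doctor failed; refusing to start build" >&2
    return 1
  fi
  # Pre-flight passed: run the build command exactly as given.
  "$@"
}
```

For example, `gated_build scripts/gitea/doctor.sh docker compose build` would stop within seconds on a stale token instead of failing partway through the install step.
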