# Docker Build Optimization Roadmap

> **Status:** Draft v13 (Phases A, B, C, D, E **complete across all 12 consumer repos**; docker-doctor PASS everywhere; only advisory warnings remain) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
> ecosystem-wide rollout.
>
> **Upstream prerequisite shipped (commit `610a59fd` in `learning_ai_common_plat`):**
> Gitea owner parameterization + helper scripts (`scripts/gitea/doctor.sh`,
> `scripts/gitea/token.sh`). The `.npmrc` template now resolves owner from
> `${GITEA_NPM_OWNER:-learning_ai_user}`. **All A0-1 work in this roadmap
> inherits this: Dockerfile/.npmrc.docker must use the same `${GITEA_NPM_OWNER}`
> placeholder, not a hardcoded literal.**
---

## 0. Pre-flight audit findings (2026-05-27)

A read-only audit of pilot repos + lessons from recent live incidents +
the A0-V execution iterations on clock surfaced **18 concrete bugs/gaps**
(F14–F15 added after the Gitea-hardening commit; F16–F18 added during the
A0-V execution sweep on clock, 2026-05-27). The actual state of the ecosystem
is closer to the inverse of the casual narrative: tarballs are the de facto
default, the Gitea-registry path is partially wired, and there is a separate
class of "build green, app broken" silent failures (F11–F13) that the
speed-focused plan needs to address first.

| # | Finding | Location | Severity |
|---|---|---|---|
| F1 | `pnpm-lock.yaml` is in `.dockerignore`; any lockfile-based optimization is blocked until removed | `peakpulse/.dockerignore`, `clock/.dockerignore` | High |
| F2 | `pnpm-workspace.yaml` references sibling `../learning_ai_common_plat/packages/*`; `--frozen-lockfile` inside Docker will fail unless workspace is flattened or sibling tree is copied | both pilots | High |
| F3 | `peakpulse/.npmrc.docker` is tarball-only (no `@bytelyst:registry=…` line), so the "Gitea-registry" path doesn't work in this repo today | `peakpulse/.npmrc.docker` | High |
| F4 | `clock/.npmrc.docker` hardcodes `http://localhost:3300`; from inside Docker, `localhost` is the container, not the host registry | `clock/.npmrc.docker` | High |
| F5 | `clock/backend/Dockerfile` has neither `ARG GITEA_NPM_HOST` nor a BuildKit secret mount; wholly dependent on pre-populated `.docker-deps/` | `clock/backend/Dockerfile` | High |
| F6 | `clock/web/Dockerfile` accepts `ARG GITEA_NPM_HOST` but never uses it; no `--mount=type=secret` either | `clock/web/Dockerfile` | Medium |
| F7 | `peakpulse/docker-compose.yml` does not pass `GITEA_NPM_HOST` build arg or declare `secrets:` block | `peakpulse/docker-compose.yml` | Medium |
| F8 | `COPY .docker-deps/` is unconditional in every backend Dockerfile; every build requires `docker-prep.sh` to have run OR an empty `.docker-deps/` dir to pre-exist | both repos | Medium |
| F9 | `npm install -g pnpm@10.6.5` runs on every build (no `corepack`): 5–10 s overhead, no pinning to `packageManager` field | all four Dockerfiles | Low |
| F10 | No BuildKit `--mount=type=cache` for pnpm store, so every rebuild is a cold install even when deps are unchanged | all four Dockerfiles | High (main speed win) |
| **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** |
| **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** |
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
| **F16** | **At least 10 published `@bytelyst/*` packages had unrewritten `workspace:*` refs in their `package.json` dependencies.** Root cause: `publish-outdated-packages.sh` extracts a pnpm-packed tarball then **re-packs with `npm pack`** (workaround for a historical Gitea-compat issue with pnpm's tarball format), and `npm pack` doesn't recognize the pnpm-specific `workspace:` protocol; it passes it through literally. **Fixed in `common-plat@cfcfc7bb`** (`fix(gitea): rewrite workspace:* in published tarballs (F16)`): inserted a workspace:* rewriter between extract and npm-repack + a defense-in-depth grep guard. Republished 10 affected packages. | `common-plat` publish flow + Gitea registry | **Critical (FIXED)** |
| **F17** | **Gitea bakes `localhost:3300` into the `dist.tarball` field of every published package's metadata.** Inside Docker, `localhost` is the container itself, not the host, so even after a successful registry-metadata fetch via `host.docker.internal`, pnpm follows the tarball URL to `localhost:3300` and fails with `ECONNREFUSED`. Root cause: Gitea `app.ini`'s `ROOT_URL=http://localhost:3300/` was baked at publish time. **Fixed** by setting `ROOT_URL=http://host.docker.internal:3300/`, restarting Gitea, adding `127.0.0.1 host.docker.internal` to `/etc/hosts`, adding `host.docker.internal` to `NO_PROXY` (corp proxy was hijacking DNS), and republishing all 64 packages (`common-plat@dd90f709`). | Gitea `app.ini` + host `/etc/hosts` + every dev machine's `switch-network.sh` | **Critical (FIXED)** |
| **F18** | **`clock/web/package.json` had 4 `@bytelyst/*` deps declared as `file:` refs to sibling `../../learning_ai_common_plat/packages/*`**, a legacy pre-Gitea pattern. Inside Docker those paths don't exist, so `pnpm install` fails with `ERR_PNPM_LINKED_PKG_DIR_NOT_FOUND`. Discovered during clock web A0-V on 2026-05-27. **Fixed in `learning_ai_clock@8b5c767a3`** by rewriting to `*` semver. Same pattern likely lives in other product repos (especially anything that consumes `@bytelyst/ui`, `@bytelyst/design-tokens`, `@bytelyst/use-theme`); audit needed in Phase D rollout. | `*/web/package.json` (and likely others) | **High** |

**Implications:**

- The original "switch to `--frozen-lockfile` + Gitea registry" plan requires
  two upstream fixes first (F1, F2).
- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we
  ship faster builds of broken apps.
- F16 + F17 are **both fixed** as of 2026-05-27. Gitea path now works
  end-to-end on clock. A-pre is largely complete; remaining items (A-pre-4,
  A-pre-5) become Phase E checks.
- F18 (sibling `file:` refs in product repo manifests) is the same family as
  F2 but separately tractable; fixed in clock, audit needed across other
  repos as part of Phase D rollout.
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
  F11/F13/F18 recurrence, which is silent in CI today. The registry-side guard
  (publish-time check for `workspace:*` leaks) shipped in `common-plat@cfcfc7bb`
  as part of the F16 fix.
---

## 1. Context: three build paths

| Path | Status today | Trigger | Notes |
|---|---|---|---|
| **`docker-prep.sh` tarballs** | **De facto default** in peakpulse + flowmonk; also works in clock/notes | Run `docker-prep.sh` then `docker compose build` | Hermetic; mutates `package.json`; slow to repack |
| **Gitea NPM registry** | Partially wired in clock + notes; broken in peakpulse | `docker compose build` with `GITEA_NPM_HOST` arg + secret | Needs `.npmrc.docker` standardization to be the default |
| **Legacy `file:` refs** | Deprecated | n/a | Removed during pnpm/Gitea migration |

### Measurement targets

| Build | Baseline (observed) | Target after Phase A |
|---|---|---|
| Cold (no cache) | ~2–3 min | ≤ 2 min |
| Warm (one source file changed) | ~2–3 min | **< 30 s** |
| `docker-prep.sh` pack step alone | ~60–90 s | < 30 s (pnpm pack cache) |

> Fill in actuals during Phase C.
---

## 2. Goals & non-goals

**Goals**

- ✅ Eliminate F11–F13 class of silent "build green, app broken" failures
- ✅ Cut warm rebuild time via BuildKit pnpm-store cache mount (single biggest speed win)
- ✅ Make `docker-prep.sh` idempotent, safe to re-run, gitignore-clean, and canonical (no per-repo drift)
- ✅ Standardize `.npmrc.docker` across the ecosystem so the Gitea path actually works
- ✅ Fix `docker-compose.yml` to pass `GITEA_NPM_HOST` + secrets so the registry path is usable without manual flags
- ✅ Ship `docker-doctor.sh` CI lint as the durable insurance layer

**Non-goals**

- ❌ Migrating off pnpm or off the Gitea registry
- ❌ Adopting `--frozen-lockfile` until F2 is resolved (sibling-workspace problem)
- ❌ Publishing `@bytelyst/*` to the public npm registry
- ❌ Multi-platform builds (separate roadmap)
---

## 2.5 Canonical decisions

Decisions taken now to avoid contradictions later in the doc:

- **Base image:** `node:22-alpine` is canonical. For repos blocked by the
  corporate proxy's Alpine SSL interception (currently only
  `learning_ai_notes`), the Dockerfile MUST expose:
  ```dockerfile
  ARG BASE_IMAGE=node:22-alpine
  FROM ${BASE_IMAGE} AS builder
  ```
  Override per-repo via `--build-arg BASE_IMAGE=node:22-slim`. Document the
  override in the repo's `AGENTS.md`.
- **Healthcheck host:** `127.0.0.1` (NOT `localhost`) in every
  `docker-compose*.yml` `test:` block. See F12.
- **Lockfile mode in Docker:** `--lockfile=false` for now. `--frozen-lockfile`
  is blocked on the A3 ADR (F2).
---

## 3. Phase A: Correctness + build speed + path correctness

Order matters: **A-pre must precede A0** (you can't build via a registry that
serves broken metadata); A0 must precede A1+ (you can't optimize a path that
doesn't work), and A8+A9 (correctness) must land before measuring speed wins.

### A-pre. Make the Gitea registry actually usable from Docker (F16 + F17 + F18)

**Owner:** `learning_ai_common_plat` + per-product repo · **Status:** ✅ done for clock + global config.

Three distinct bugs surfaced during clock A0-V on 2026-05-27:

- **F16:** Publish flow leaked `workspace:*` into published metadata.
- **F17:** Gitea baked `localhost:3300` into tarball URLs.
- **F18:** Product repos had legacy `file:` refs to sibling packages.

- [x] **A-pre-1.** Audit `publish-outdated-packages.sh`: confirmed it uses
  `pnpm pack` then re-tars with `npm pack`, which loses `workspace:` rewriting.
- [x] **A-pre-2.** Patch publish script with a workspace:* rewriter + a
  post-rewrite grep guard. Shipped in `common-plat@cfcfc7bb`.
- [x] **A-pre-3.** Verify all packages publish with `0` workspace:* refs.
  Confirmed via curl scan across all 64 packages.
- [x] **A-pre-4.** F17 fix: set Gitea `ROOT_URL=http://host.docker.internal:3300/`,
  restart Gitea, add `127.0.0.1 host.docker.internal` to `/etc/hosts`, add
  `host.docker.internal` to `NO_PROXY` in `switch-network.sh`, bulk republish
  all 64 packages. Shipped in `common-plat@dd90f709`.
- [x] **A-pre-5.** F18 fix: rewrite `file:../../learning_ai_common_plat/packages/*`
  refs in `clock/web/package.json` to `*` semver. Shipped in `clock@8b5c767a3`.
  Audit needed in Phase D for other product repos.
- [x] **A-pre-6.** Document Gitea config requirements (below).
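The F16 grep guard and the F18 audit both come down to scanning a `package.json` for `@bytelyst/*` dependencies that still use the `workspace:` or `file:` protocol. The following is a minimal sketch of such a scan, not the guard that shipped in `common-plat@cfcfc7bb`; the function name `find_leaky_refs` and the line-based regex (rather than real JSON parsing) are assumptions made for illustration.

```bash
# Hypothetical helper: print every @bytelyst/* dependency line in a
# package.json that still uses the workspace: or file: protocol (F16 / F18).
# Returns 1 if any such line is found, 0 otherwise.
find_leaky_refs() {
  local manifest="$1"
  # Line-based match; assumes one dependency per line or a minified manifest.
  if grep -nE '"@bytelyst/[^"]+"[[:space:]]*:[[:space:]]*"(workspace:|file:)' "$manifest"; then
    return 1
  fi
  return 0
}
```

A possible use during the Phase D audit is `find_leaky_refs web/package.json || echo "rewrite these refs to semver"`, run once per product repo.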
### A-pre-6. Gitea configuration prerequisites (one-time per dev machine)

The Gitea registry MUST be configured with `ROOT_URL=http://host.docker.internal:3300/`
so published tarball URLs are reachable from inside Docker containers. The
host `/etc/hosts` MUST resolve `host.docker.internal` to `127.0.0.1` so the
same URLs work from the host shell.

On macOS (Homebrew Gitea):

```bash
# 1. Edit Gitea's app.ini
sudo -e /opt/homebrew/var/gitea/custom/conf/app.ini
# change: ROOT_URL = http://localhost:3300/
# to:     ROOT_URL = http://host.docker.internal:3300/

# 2. Restart Gitea
brew services restart gitea

# 3. Add /etc/hosts entry so host.docker.internal resolves on the host too
sudo sh -c 'grep -q host.docker.internal /etc/hosts || \
  echo "127.0.0.1 host.docker.internal" >> /etc/hosts'

# 4. Ensure host.docker.internal is in NO_PROXY for corp shells
# (already done in switch-network.sh as of common-plat@dd90f709)
source ~/.zshrc # reload

# 5. Verify
curl -sS http://host.docker.internal:3300/api/v1/version
# expected: {"version":"1.25.5"} or similar
```
### A0. Make the Gitea-registry path actually work (clock + peakpulse)

- [ ] **A0-1.** Standardize `.npmrc.docker` to use templated host AND owner so it works on host (`localhost`) and inside Docker (`host.docker.internal`), and so future owner renames are a one-line env change:
  ```
  @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
  //${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
  strict-ssl=false
  auto-install-peers=true
  ```
  > **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read
  > time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]).
  > That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` + `ARG GITEA_NPM_OWNER`
  > → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` / `ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER`
  > **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be
  > exported from the BuildKit secret mount inside the same `RUN` (since secrets
  > don't persist as env across layers).
  >
  > **Note on F14:** The canonical `.npmrc` (host-side) template already uses
  > `${GITEA_NPM_OWNER}` (shipped in common-plat commit `610a59fd`).
  > `.npmrc.docker` lagged behind because Docker builds have a separate file;
  > A0-1 brings them into parity.

  [pnpm-npmrc]: https://pnpm.io/npmrc
- [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3)
- [ ] **A0-3.** Add `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` build args + `secrets:` block to every service in `docker-compose.yml`:
  ```yaml
      # under each service:
      build:
        context: .
        dockerfile: backend/Dockerfile
        args:
          GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
          GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
        secrets:
          - gitea_npm_token

  # top level of docker-compose.yml:
  secrets:
    gitea_npm_token:
      environment: GITEA_NPM_TOKEN
  ```
- [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` to each service so Linux Docker can resolve the host
- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart). Reference `bash ../learning_ai_common_plat/scripts/gitea/token.sh status` for verification.
- [ ] **A0-D.** **Run `gitea-doctor` before any Docker build** (addresses F15). Inline into deploy/CI workflows:
  ```bash
  bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
  docker compose build
  ```
  - Locally: shell alias or `Makefile` target `make build` that runs doctor then `docker compose build`.
  - In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.

> **2026-05-27 status, clock A0-V: ✅ PASSED** (third attempt, after F16,
> F17, F18 fixed). Cold-build wall-clock:
> - backend: **59.2 s** (commits: `clock@0be887288` + `common-plat@cfcfc7bb` + `common-plat@dd90f709`)
> - web: **3:13 (193 s)** (commits: above + `clock@8b5c767a3`)
>
> Both surfaces resolve `@bytelyst/*` from the Gitea registry end-to-end:
> no `docker-prep.sh` tarballs, no sibling `file:` refs, no proxy interference.
> See §3.A7 metrics table.
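The env-vs-file drift that A0-D guards against (F15) can be illustrated with a short check. This is a sketch only and is not the contents of `scripts/gitea/doctor.sh`: the function name `check_token_drift` and its return codes are assumptions, while the `~/.gitea_npm_token` path and the `GITEA_NPM_TOKEN` variable come from F15 and A0-5.

```bash
# Hypothetical sketch of an env-vs-file drift check (not the shipped doctor.sh).
# Returns 0 when $GITEA_NPM_TOKEN matches the token file, 1 on drift,
# and 2 when the token file is missing.
check_token_drift() {
  local token_file="${1:-$HOME/.gitea_npm_token}"
  local on_disk
  if [ ! -f "$token_file" ]; then
    echo "token file not found: $token_file" >&2
    return 2
  fi
  # Strip the trailing newline and any stray whitespace before comparing.
  on_disk="$(tr -d '[:space:]' < "$token_file")"
  if [ "${GITEA_NPM_TOKEN:-}" != "$on_disk" ]; then
    echo "GITEA_NPM_TOKEN differs from $token_file; reload your shell" >&2
    return 1
  fi
  return 0
}
```

Chained as `check_token_drift && docker compose build`, a stale shell fails immediately instead of partway through `pnpm install`.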
### A1. Replace `npm install -g pnpm@X` with corepack

- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
  ```dockerfile
  RUN corepack enable && corepack prepare pnpm@10.6.5 --activate
  ```
- [ ] **A1-2.** Verify `packageManager` field in `backend/package.json` and `web/package.json` matches (already `pnpm@10.6.5` in peakpulse backend)
### A2. Add BuildKit pnpm-store cache mount

- [ ] **A2-1.** Set `# syntax=docker/dockerfile:1.7` directive at top of every Dockerfile
- [ ] **A2-2.** Wrap install step with cache + secret mount:
  ```dockerfile
  RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
      --mount=type=secret,id=gitea_npm_token \
      export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
      pnpm install --ignore-scripts --lockfile=false
  ```
- [ ] **A2-3.** Verify cache mount is active: `docker buildx du --filter type=exec.cachemount` shows non-zero size after a build. **Real success metric** is wall-clock: warm rebuild (touching one source file) drops to < 30 s.
### A3. Decide lockfile policy ✅ DONE (ADR-0001)

Two options; pick one in a short ADR before implementing:

- **Option 1: Keep `--lockfile=false`** (current pragmatic approach)
  - ✅ No sibling-workspace complications
  - ❌ No reproducibility guarantee inside Docker
  - ❌ Slower installs (full resolution every build)
- **Option 2: Generate a Docker-only lockfile** via `pnpm install --lockfile-only` against a flattened `package.json` that resolves `@bytelyst/*` to semver
  - ✅ Reproducibility
  - ✅ Faster installs
  - ❌ New build step + tooling
  - ❌ Drift risk between dev lockfile and Docker lockfile

- [x] **A3-1.** ADR written: [`docs/adr/0001-docker-build-lockfile-policy.md`](./adr/0001-docker-build-lockfile-policy.md). **Option 1 accepted** (keep `--lockfile=false` short-term; revisit after Phase D).
- [x] **A3-2.** `--frozen-lockfile` adoption deferred per ADR; tracked as future work in §11.
### A4. Restructure layer order

- [ ] **A4-1.** Reorder COPY/RUN so deps-install layer is `package.json` + `.npmrc.docker` ONLY, then a separate layer for `src/`, config files, `shared/`
- [ ] **A4-2.** Move all `ARG` lines that affect deps install **before** the install step; move `NEXT_PUBLIC_*` ARGs (web) closer to the build step (they invalidate the build layer, not the deps layer)
### A5. Gate `.docker-deps/` behind a build arg

- [ ] **A5-1.** Add `ARG USE_TARBALLS=false` to Dockerfile
- [ ] **A5-2.** Use wildcard COPY so missing dir doesn't break the build:
  ```dockerfile
  RUN mkdir -p /app/.docker-deps
  COPY .docker-deps* /app/.docker-deps/
  ```
- [ ] **A5-3.** Verify `.docker-deps/` is in `.gitignore` and `.dockerignore` does NOT exclude it when tarball mode is in use
### A6. `.dockerignore` audit

- [ ] **A6-1.** Confirm exclusions: `node_modules`, `**/node_modules`, `dist`, `.next`, `*.log`, `.env`, `.env.*`, `.git`, `*.bak`
- [ ] **A6-2.** Remove: `pnpm-lock.yaml` exclusion (was correct under `--lockfile=false`, blocks future optimization)
- [ ] **A6-3.** Confirm `.docker-deps/` is NOT excluded when tarball path is active
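The ignore-file rules from A0-2, A5-3 and A6 are mechanical enough to check with a script. The sketch below is one way to do that, under stated assumptions: it is not the shipped `docker-doctor.sh`, the function name `audit_ignore_files` is hypothetical, and it matches whole lines only, so comments, negation rules (`!pattern`) and nested ignore files are not interpreted.

```bash
# Hypothetical sketch (not the shipped docker-doctor.sh): check the ignore-file
# rules from A0-2, A5-3 and A6-3 for one repo root. Prints one line per
# problem and returns the number of problems found (0 means clean).
audit_ignore_files() {
  local root="$1" problems=0
  # Matches ".docker-deps", ".docker-deps/" and ".docker-deps/*", with or
  # without a leading slash.
  local deps_re='^/?\.docker-deps(/\*?)?$'
  if grep -qxF 'pnpm-lock.yaml' "$root/.dockerignore" 2>/dev/null; then
    echo "A0-2/A6-2: pnpm-lock.yaml is excluded in .dockerignore"
    problems=$((problems + 1))
  fi
  if ! grep -qE "$deps_re" "$root/.gitignore" 2>/dev/null; then
    echo "A5-3: .docker-deps/ is not listed in .gitignore"
    problems=$((problems + 1))
  fi
  if grep -qE "$deps_re" "$root/.dockerignore" 2>/dev/null; then
    echo "A6-3: .docker-deps/ is excluded in .dockerignore"
    problems=$((problems + 1))
  fi
  return "$problems"
}
```

Run as `audit_ignore_files .` from a repo root; a non-zero exit status equals the number of rule violations printed.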
### A7. Measure & record

| Repo | Surface | Cold (A0-V) | Cold (post-A2) | Warm (post-A2) | Notes |
|---|---|---|---|---|---|
| clock | backend | **59.2 s** | **64.7 s** | **2.9 s** | Cold essentially flat (corepack adds ~1 s; cache mount empty on first run). Warm → 95.1% reduction. Commits: `clock@8b5c767a3` (A0-V), `clock@f6a806ff3` (A1+A8+A9), `clock@55e8d22d3` (A2+A5+A6) |
| clock | web | **193 s (3:13)** | **291 s (4:51) †** | **5.4 s** | Warm → 97.2% reduction. † Cold variance: see footer |
| peakpulse | backend | n/a (was tarball-only path) | **72.2 s** | **2.7 s** | Warm → 96.3% reduction. Commits: `peakpulse@11a6bc5` (Phase A), `peakpulse@6523a1a` (.gitkeep fix), `clock@1465e06b1`+`d69003c1f` (mirror .gitkeep fix) |

**Footer note on cold-build variance.** Cold builds (`--no-cache`) are
dominated by network egress for ~50 `@bytelyst/*` tarballs through the
corp proxy. A second measurement of clock web cold-build came in at
291 s vs 174 s in the previous step: same Dockerfile path, different
network-side latency. Cold build is **not** the optimization target of
this roadmap; warm rebuild is. Run `pnpm store prune` on the host or use
a local registry mirror if cold-build determinism is needed.

Measurement commands:

```bash
# Cold (clear all layer cache; cache mounts may still persist)
time DOCKER_BUILDKIT=1 docker compose build --no-cache backend

# Warm (one source file changed; deps unchanged)
touch backend/src/server.ts
time DOCKER_BUILDKIT=1 docker compose build backend

# Deps-changed (touch package.json; pnpm store cache helps here)
touch backend/package.json
time DOCKER_BUILDKIT=1 docker compose build backend
```
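The "Warm → N% reduction" figures in the table can be reproduced from the recorded timings. The helper below is a small sketch of that arithmetic; the name `reduction_pct` is hypothetical, and the baselines are an inference from the numbers: the clock rows match the A0-V cold time, while the peakpulse row (no A0-V measurement) matches the post-A2 cold time.

```bash
# Percentage reduction from a cold build time to a warm one, both in seconds.
# reduction = (1 - warm / cold) * 100, rounded to one decimal place.
reduction_pct() {
  awk -v cold="$1" -v warm="$2" 'BEGIN { printf "%.1f\n", (1 - warm / cold) * 100 }'
}

reduction_pct 59.2 2.9    # clock backend: cold (A0-V) vs warm
reduction_pct 193 5.4     # clock web: cold (A0-V) vs warm
reduction_pct 72.2 2.7    # peakpulse backend: cold (post-A2) vs warm
```

The three calls print 95.1, 97.2 and 96.3, matching the Notes column.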
### A8. Config-file COPY audit & canonical pattern (addresses F11, F13)

- [ ] **A8-1.** For every Dockerfile in scope, list all build-time files present in the surface directory (`web/` or `backend/`) that affect the build:
  - `postcss.config.{js,mjs,cjs,ts}`
  - `tailwind.config.{js,mjs,cjs,ts}`
  - `next.config.{js,mjs,ts}`
  - `tsconfig*.json`
  - `package.json`
  - `.npmrc.docker`, `.npmrc`
  - `babel.config.*` (if present)
  - `drizzle.config.*` (if present)
  - `vitest.config.*` (only if the build needs it)

  Verify each is COPY'd in the Dockerfile.
- [ ] **A8-2.** Choose canonical COPY pattern. **Decision: middle-ground glob** for web surfaces. `COPY` matches sources with Go's `filepath.Match` rules, which have no `{a,b}` brace expansion, so each extension gets its own wildcard source:
  ```dockerfile
  COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
  COPY web/public/ ./public/
  COPY web/src/ ./src/
  ```
  Trade-off: glob picks up unintended root-level files if any are added later, but **dramatically reduces F11/F13 risk**. Backend surfaces with few root config files can keep enumerated COPY (lower risk surface).
- [ ] **A8-3.** Repo-by-repo migration: replace enumerated `COPY web/foo ./foo` with the glob pattern; verify the resulting image has all expected files via `docker run --rm <img> ls -la`.
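For Dockerfiles that keep the enumerated COPY style, the A8-1 audit can be approximated by comparing config files on disk against the names that appear on `COPY` lines. The sketch below assumes exactly that enumerated style (it cannot see through wildcard sources) and checks only a subset of the A8-1 list; `find_uncopied_configs` is a hypothetical name, not part of `docker-doctor.sh`.

```bash
# Hypothetical sketch of an F13 drift check for Dockerfiles that use
# enumerated COPY lines. Prints every root-level config file in the surface
# directory whose name never appears on a COPY line; returns 1 if any do.
find_uncopied_configs() {
  local surface_dir="$1" dockerfile="$2" missing=0 path name
  for path in "$surface_dir"/*.config.* "$surface_dir"/tsconfig*.json "$surface_dir"/package.json; do
    [ -f "$path" ] || continue   # an unmatched glob stays literal; skip it
    name="$(basename "$path")"
    # Only COPY lines count; a mention in a comment or RUN line does not.
    if ! grep -E '^[[:space:]]*COPY[[:space:]]' "$dockerfile" | grep -qF "$name"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `find_uncopied_configs web web/Dockerfile` would have flagged `postcss.config.mjs` in the F11(b) case, where the file existed but was never COPY'd.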
### A9. Healthcheck canonicalization (addresses F12)

- [ ] **A9-1.** Replace `localhost` with `127.0.0.1` in every `docker-compose*.yml` healthcheck `test:` block. Sweep with:
  ```
  rg -l 'http://localhost' --glob 'docker-compose*.yml'
  ```
- [ ] **A9-2.** Standardize healthcheck shape:
  - **Alpine-based images:**
    ```yaml
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:${PORT}/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    ```
  - **Slim/Debian images** (`wget` not always present, but `node` is):
    ```yaml
    healthcheck:
      test: ["CMD-SHELL", "node -e \"fetch('http://127.0.0.1:${PORT}/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))\""]
    ```
- [ ] **A9-3.** Add `start_period` (10s minimum) to prevent flaky "container started but app not yet listening" false-negatives.
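Where `rg` is not installed (for example on a minimal CI runner), the A9-1 sweep can be approximated with plain `grep`. This is a sketch under two assumptions: the healthcheck `test:` entry sits on a single line, as in the A9-2 shapes, and the function name `find_localhost_healthchecks` is hypothetical.

```bash
# Hypothetical sketch: print healthcheck test: lines that still point at
# localhost in the compose files directly under a repo root (F12).
# Returns 1 if any are found, 0 otherwise.
find_localhost_healthchecks() {
  local root="$1" found=0 file
  for file in "$root"/docker-compose*.yml; do
    [ -f "$file" ] || continue   # no compose files: glob stays literal
    # -n adds line numbers, -H forces the file name prefix.
    if grep -nHE 'test:.*http://localhost' "$file"; then
      found=1
    fi
  done
  return "$found"
}
```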
---

## 4. Phase B: Hermetic-fallback polish (`docker-prep.sh`)

`docker-prep.sh` is duplicated with minor variations across product repos.
**Promotion to canonical home is now in Phase B, not Phase D**: drift
compounds linearly with time and the `.npmrc` template precedent proves the
pattern is cheap.

- [x] **B1.** `--dry-run` flag (`common-plat@a418a23e`).
- [x] **B2.** Idempotency guard via `*.bak` detection + `--force` override (`common-plat@a418a23e`).
- [x] **B3.** `.docker-deps/` and `*.bak` in `.gitignore` on both pilots (clock + peakpulse). Verified by `docker-doctor.sh`.
- [x] **B4.** Pre-commit hook landed. Canonical guard script `check-docker-prep-staged.sh` (`common-plat@c908c6d7`) blocks rewritten `package.json`, staged `.tgz` tarballs, and `.bak` files. Wired into both pilot `.husky/pre-commit` (`clock@4f8086bfa`, `peakpulse@c3195c8`). Verified with simulated staged tarballs → commit blocked.

  Original spec:
  ```bash
  # .husky/pre-commit
  if git diff --cached --name-only | xargs grep -l '"file:\.\./\.docker-deps/' 2>/dev/null; then
    echo "ERROR: rewritten package.json detected. Run scripts/docker-prep.sh --restore first."
    exit 1
  fi
  if git diff --cached --name-only | grep -qE '(\.docker-deps/.*\.tgz|package\.json\.bak)$'; then
    echo "ERROR: docker-prep.sh artifacts staged. Run --restore first."
    exit 1
  fi
  ```
- [x] **B5.** Auto-restore on script error via `trap cleanup_on_error EXIT` + `--keep` opt-out (`common-plat@a418a23e`).
- [x] **B6.** Standardized header + usage block per § 7.4 template (`common-plat@a418a23e`).
- [x] **B7. CANONICAL HOME landed.**
  - [x] **B7-1.** Canonical at `learning_ai_common_plat/scripts/docker-prep.template.sh` + 2 helpers `_docker-prep-inject.js`, `_docker-prep-strip.js` (`common-plat@a418a23e`).
  - [x] **B7-2.** `learning_ai_common_plat/scripts/sync-docker-prep.sh` syncs all 3 files (mirrors `sync-npmrc.sh`).
  - [x] **B7-3.** `learning_ai_common_plat/scripts/check-docker-prep-drift.sh` for CI (mirrors `check-npmrc-drift.sh`).
  - [x] **B7-4.** AGENTS.md "NEVER edit `docker-prep.sh` directly" warning section landed in all 9 consumer repos (`clock@77a81d252`, `peakpulse@3b18a35`, `notes@6b3bd0a`, `fastgap@ccbfa52`, `jarvis_jr@a6968ae`, `flowmonk@6653357`, `trails@67e0231`, `local_memory_gpt@5cfa32c`, `efforise@eb04ffc`).
- [x] **B8.** `--strip-overrides` option removes `pnpm.overrides` block as a safety net (`common-plat@a418a23e`).
- [x] **B+.** `--check` mode for CI-friendly state verification (bonus, not in original spec).
- [x] **B+.** Portable `sed -i` (BSD on macOS, GNU on Linux).
- [x] **B+.** Preserve `.docker-deps/.gitkeep` on clear (fixes earlier regression where `--restore` deleted the tracked file).
---

## 5. Phase C: Verification gates

Pilot exit criteria (must all pass before Phase D):

- [x] **C1.** Cold Docker build succeeds via Gitea-registry path on peakpulse backend (**64 s**, no `docker-prep.sh` invocation).
- [x] **C2.** Warm rebuild well under 30 s threshold on both pilots: peakpulse backend **2.6 s**, clock backend **3.3 s**.
- [x] **C3.** `docker-prep.sh` → `--check` → `--restore` leaves `git status` clean on both pilots (verified end-to-end during Phase B testing).
- [x] **C4.** Pre-commit hook blocks staged tarballs + `.bak` files (verified by simulating staged artifacts on clock).
- [~] **C5.** Gitea Actions CI green: **partially validated**. Workflow YAML is well-formed in both pilots (`clock@4f8086bfa`, `peakpulse@c3195c8`); local simulation of the `docker-lint` job (`bash scripts/gitea/doctor.sh --quiet && bash scripts/docker-doctor.sh --quiet`) exits 0 on both pilots. **Gap:** the pilot repos are not currently hosted on Gitea (`http://localhost:3300/learning_ai_user/` has only `learning_ai_uxui_web`), so the workflow file ships but the runner never fires. A dummy `git push gitea` returns 404. C5 will fully close when the pilot repos are mirrored to Gitea (see `learning_ai_common_plat/docs/runbooks/GITEA_VM_SETUP.md` for the hosting setup).
- [x] **C6.** Build-time metrics already populated in § 3.A7 from earlier Phase A work.
- [x] **C7.** ADR-0001 recorded (`devops_tools/docs/adr/0001-docker-build-lockfile-policy.md`).
- [x] **C8.** `docker-doctor.sh` PASS on both pilots (only the 1 expected `pnpm-lock.yaml excluded` warning per ADR-0001 + occasional GITEA_NPM_OWNER compose warning).
- [x] **C9.** Web smoke test landed as Playwright spec `web/e2e/css-bundle-smoke.spec.ts` (`clock@b8440bfea`). Asserts title sanity + largest CSS bundle > 20 KB. Catches F11 regression at PR time.
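The size half of the C9 assertion can also be run against a build output directory without a browser, for example as a quick local check after `next build`. The sketch below is a shell approximation and not the Playwright spec: `check_css_bundle_size` is a hypothetical name, and the 20480-byte default assumes "20 KB" means 20 KiB. The threshold is a parameter because the right value depends on the app.

```bash
# Hypothetical shell counterpart to the C9 size assertion: succeed only when
# the largest .css file under a directory is larger than a byte threshold.
# The 20480-byte default assumes "20 KB" means 20 KiB.
check_css_bundle_size() {
  local dir="$1" min_bytes="${2:-20480}" largest
  # wc -c prints "<bytes> <path>" per file (plus a "total" line when given
  # several files); awk keeps the largest per-file byte count, 0 if none.
  largest="$(find "$dir" -type f -name '*.css' -exec wc -c {} + |
    awk '$2 != "total" && $1 + 0 > max { max = $1 + 0 } END { print max + 0 }')"
  echo "largest CSS bundle: ${largest} bytes"
  [ "$largest" -gt "$min_bytes" ]
}
```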
---
|
||
|
||
## 6. Phase D — Ecosystem rollout
|
||
|
||
**Status:** DONE for all 12 consumer repos. D.1 artifacts + D.2 Dockerfile/compose fixes + D.3 advisory-warning cleanup + B7-4 AGENTS.md notes. `docker-doctor` exits PASS in every repo. Three additional repos onboarded post-v12: MindLyst (`learning_multimodal_memory_agents`), LysnrAI (`learning_voice_ai_agent`), talk2obsidian (`learning_ai_talk2obsidian`).
|
||
|
||
### D.1 — Tooling rollout (DONE)

The consumer repos listed below received the canonical infrastructure via `sync-docker-prep.sh`:

- `scripts/docker-prep.sh` + `_docker-prep-inject.js` + `_docker-prep-strip.js` (canonical sync)
- `scripts/docker-doctor.sh` (thin wrapper to canonical linter)
- `Makefile` with `make doctor` target

| Repo | Commit | Findings (docker-doctor warn-only) |
|---|---|---|
| `learning_ai_notes` | `216ebb8` | 6 warnings + errors: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax directive |
| `learning_ai_fastgap` | `36b67a2` | 4: F4/F14 `.npmrc.docker` hardcoded, F14 ARG missing, A5-2 wildcard, A2 syntax |
| `learning_ai_jarvis_jr` | `523dc08` | 5: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax (×2) |
| `learning_ai_flowmonk` | `65628f3` | 4: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax |
| `learning_ai_trails` | `8aef82c` | 6: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), A2 syntax (×2) |
| `learning_ai_local_memory_gpt` | `d17689a` | 5: F14 ARG missing (×2), A5-2 wildcard (×2), F11/F13 web glob, A2 syntax (×2) |
| `learning_ai_efforise` | `b9fbbc3` | 5: F12 localhost, F14 ARG missing (×2), A5-2 wildcard (×2), A2 syntax (×2) |
| `learning_multimodal_memory_agents` (MindLyst) | `84a5d10` | full playbook applied (mindlyst-native/web/Dockerfile + backend/Dockerfile) |
| `learning_voice_ai_agent` (LysnrAI) | `0f1fa64` | full playbook applied (backend + user-dashboard-web + backend-python — Python Dockerfile correctly skips Node checks) |
| `learning_ai_auth_app` | _n/a_ | iOS/Android — no Docker surfaces |
| `learning_ai_talk2obsidian` | `793089e` | lighter rollout — single-stage Dockerfile, no `.docker-deps/` pattern; docker-doctor + Makefile + AGENTS.md note + syntax directive + `.gitignore` rules |

### D.2 — Per-repo Dockerfile/compose fixes (DONE)

All 7 consumer repos received mechanical Phase D.2 fixes via an idempotent
fixer script. Each repo's `docker-doctor.sh` now exits PASS (warnings only).

| Repo | Fix commit | docker-doctor result |
|---|---|---|
| `learning_ai_notes` | `b23a601` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_fastgap` | `af2463d` | PASS (1 warning: ADR-0001 `pnpm-lock.yaml`) |
| `learning_ai_jarvis_jr` | `1a97a3f` | PASS (1 warning: ADR-0001 `pnpm-lock.yaml`) |
| `learning_ai_flowmonk` | `412a657` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_trails` | `733477a` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_local_memory_gpt` | `8c68595` | PASS (1 warning: compose `GITEA_NPM_OWNER` arg) |
| `learning_ai_efforise` | `06ea0d0` | PASS (1 warning: healthcheck `start_period`) |

Applied fixes (each fix is idempotent):

| Finding | Fix |
|---|---|
| **F12** healthcheck `localhost` | Replaced with `127.0.0.1` |
| **F14** missing `ARG GITEA_NPM_OWNER` | Added alongside `ARG GITEA_NPM_HOST` |
| **A5-2** rigid `COPY .docker-deps/` | Changed to wildcard `COPY .docker-deps* ...` |
| **F11/F13** enumerated web config COPY | Replaced with glob `COPY web/*.json web/*.ts web/*.mjs ./` |
| **A2** missing syntax directive | Added `# syntax=docker/dockerfile:1.7` |
| **F4/F14** hardcoded `.npmrc.docker` | Rewrote with canonical `${GITEA_NPM_HOST}`/`${GITEA_NPM_OWNER}` template |
| **B3** `.gitignore` missing `*.bak` | Added rule |
| **B3** missing `.docker-deps/.gitkeep` | Created |

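
The fixer script itself is not reproduced in this roadmap. As an illustration of what "idempotent" means for the F12 row, here is a minimal sketch; the function name `fix_healthcheck_localhost` is hypothetical, and it assumes healthcheck URLs sit on the same line as the `test:` key.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an idempotent F12 fix: rewrite healthcheck URLs
# from localhost to 127.0.0.1, and leave the file untouched on a second run.
fix_healthcheck_localhost() {
  local file="$1" tmp
  tmp="$(mktemp)"
  # Only touch lines that carry a healthcheck `test:` entry.
  sed -E '/test:/ s#http://localhost([:/])#http://127.0.0.1\1#g' "$file" > "$tmp"
  if ! cmp -s "$file" "$tmp"; then
    # Write through the existing file so its permissions are preserved.
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

Running the function twice on the same compose file should produce a byte-identical result, which is the property a rerunnable rollout relies on.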
### D.3 — Advisory-warning cleanup (DONE)

Mechanical follow-up pass via `/tmp/fix-compose-warnings.sh` +
`/tmp/add-build-args.py` (commits below) eliminated most advisory
warnings across 10 repos:

| Repo | Cleanup commit |
|---|---|
| `learning_ai_clock` | `3de867a80` |
| `learning_ai_notes` | `5687e5a` |
| `learning_ai_fastgap` | `94a81ac` |
| `learning_ai_jarvis_jr` | `ed1cb88` |
| `learning_ai_flowmonk` | `938717f` |
| `learning_ai_trails` | `8837216` |
| `learning_ai_local_memory_gpt` | `0a486ac` |
| `learning_ai_efforise` | `ff517f4` |
| `learning_multimodal_memory_agents` | `7304ca1` |
| `learning_voice_ai_agent` | `13291b9` |

Each repo got:

- `docker-compose.yml`: full `build.args:` block injected with
  `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` (where missing)
- `docker-compose.yml`: `start_period: 30s` added to healthcheck blocks
  (where missing) to prevent false cold-start failures

### D.4 — Final status

All 12 consumer repos now report `docker-doctor: PASS` with **zero errors**
and at most a handful of expected advisory warnings (`pnpm-lock.yaml`
excluded per ADR-0001; talk2obsidian's short-form `build: .` which would
need yaml conversion to declare args).

---

## 7. Reference snippets

### 7.1 Canonical `.npmrc.docker`

Matches the host-side `.npmrc` template shipped in `common-plat` `610a59fd`.

```
@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
strict-ssl=false
auto-install-peers=true
```

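To preview what the template resolves to for a given host and owner, a small shell sketch can mirror the placeholder semantics. It assumes the package manager applies the same "use the default when unset" rule that bash applies to `${VAR:-default}`; the function name `render_registry_url` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render the @bytelyst registry URL the template would
# produce. The owner falls back to learning_ai_user, as in the template above.
render_registry_url() {
  local host="${1:?GITEA_NPM_HOST is required}"
  local owner="${2:-learning_ai_user}"
  printf 'http://%s:3300/api/packages/%s/npm/\n' "$host" "$owner"
}

# Example: default owner vs. an explicitly supplied one.
render_registry_url host.docker.internal
render_registry_url host.docker.internal "${GITEA_NPM_OWNER:-learning_ai_user}"
```

Comparing this output against the URL that `pnpm install` actually hits inside the build is a quick way to spot a hardcoded owner (F14).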
### 7.2 Canonical backend Dockerfile

```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/backend

ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ARG USE_TARBALLS=false
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

# ── Deps layer (cacheable) ─────────────────────────────────────────
COPY .npmrc.docker ./.npmrc
COPY backend/package.json ./package.json
# Tolerate missing .docker-deps/ when in registry mode
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Source layer (changes most often) ──────────────────────────────
COPY backend/tsconfig.json ./tsconfig.json
COPY backend/src/ ./src/
COPY shared/ ../shared/
RUN pnpm run build

# ── Runtime ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE}
WORKDIR /app/backend
ENV NODE_ENV=production
COPY --from=builder /app/backend/node_modules ./node_modules
COPY --from=builder /app/backend/package.json ./package.json
COPY --from=builder /app/backend/dist ./dist
COPY shared/ ../shared/
EXPOSE 4010
CMD ["node", "dist/server.js"]
```

> `--lockfile=false` is intentional; the lockfile policy is recorded in
> ADR-0001 (`devops_tools/docs/adr/0001-docker-build-lockfile-policy.md`). Switch to
> `--frozen-lockfile` only once the sibling-workspace problem (F2) is resolved.

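
For a build outside Compose, the same Dockerfile needs its build args and the token secret passed on the command line. The sketch below is one plausible invocation, assuming BuildKit is enabled and the token lives in the `GITEA_NPM_TOKEN` environment variable; the wrapper name `backend_build` and the `DRY_RUN` switch are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around `docker build` for the canonical backend image.
# With DRY_RUN=1 it only prints the command, which makes the flags easy to review.
backend_build() {
  local cmd=(
    docker build
    --file backend/Dockerfile
    --build-arg "GITEA_NPM_HOST=${GITEA_NPM_HOST:-host.docker.internal}"
    --build-arg "GITEA_NPM_OWNER=${GITEA_NPM_OWNER:-learning_ai_user}"
    # Expose the token as a BuildKit secret; it is never written to a layer.
    --secret "id=gitea_npm_token,env=GITEA_NPM_TOKEN"
    # Linux hosts need the host-gateway mapping for host.docker.internal.
    --add-host "host.docker.internal:host-gateway"
    .
  )
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

`DRY_RUN=1 backend_build` prints the assembled command without contacting the Docker daemon.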
### 7.3 Canonical `docker-compose.yml` service block

```yaml
services:
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        GITEA_NPM_HOST: host.docker.internal
        # F14: pass the owner through instead of relying on the Dockerfile default
        GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
      secrets:
        - gitea_npm_token
      extra_hosts:
        - "host.docker.internal:host-gateway"
    ports:
      - "4010:4010"
    environment:
      - NODE_ENV=production
      - PORT=4010
      # ...
    restart: unless-stopped
    healthcheck:
      # F12: use 127.0.0.1 NOT localhost (IPv6 resolution false-fails)
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:4010/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

secrets:
  gitea_npm_token:
    environment: GITEA_NPM_TOKEN
```

### 7.4 Hardened `docker-prep.sh` header

```bash
#!/usr/bin/env bash
# Hermetic Docker-build helper. Packs @bytelyst/* tarballs from the sibling
# common-plat repo when the Gitea npm registry is unreachable.
#
# Use this ONLY when:
#   - Local Gitea registry (:3300) is down or unreachable, OR
#   - You need a Docker build that includes uncommitted common-plat changes.
#
# For normal builds (Gitea up + clean common-plat), use:
#   docker compose build
#
# Usage:
#   ./scripts/docker-prep.sh                   # pack tarballs + rewrite package.json
#   ./scripts/docker-prep.sh --dry-run         # show what would change (no side effects)
#   ./scripts/docker-prep.sh --force           # override idempotency guard
#   ./scripts/docker-prep.sh --restore         # undo rewrite
#   ./scripts/docker-prep.sh --keep            # skip auto-restore on error
#   ./scripts/docker-prep.sh --strip-overrides # remove pnpm.overrides block
#
# Side effects:
#   - Creates .docker-deps/ (gitignored)
#   - Backs up package.json → package.json.bak
#   - Rewrites @bytelyst/* deps to file:../.docker-deps/<tarball>
#   - Injects pnpm.overrides for transitive @bytelyst/* deps
#
# Safety:
#   - Refuses to run if .bak files already exist (unless --force)
#   - Auto-restores on error (trap EXIT) unless --keep passed
#   - Pre-commit hook blocks committing rewritten package.json, .tgz, .bak
```

### 7.5 Canonical Next.js web Dockerfile (addresses F11, F13)

```dockerfile
# syntax=docker/dockerfile:1.7
ARG BASE_IMAGE=node:22-alpine
FROM ${BASE_IMAGE} AS deps
WORKDIR /app/web

ARG GITEA_NPM_HOST=host.docker.internal
ARG GITEA_NPM_OWNER=learning_ai_user
ENV NODE_TLS_REJECT_UNAUTHORIZED=0
ENV NPM_CONFIG_STRICT_SSL=false
ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

COPY .npmrc.docker ./.npmrc
COPY web/package.json ./package.json
RUN mkdir -p /app/.docker-deps
COPY .docker-deps* /app/.docker-deps/

RUN --mount=type=cache,id=pnpm,target=/root/.local/share/pnpm/store \
    --mount=type=secret,id=gitea_npm_token \
    export GITEA_NPM_TOKEN="$(cat /run/secrets/gitea_npm_token 2>/dev/null || echo '')" && \
    pnpm install --ignore-scripts --lockfile=false

# ── Builder ────────────────────────────────────────────────────────
FROM ${BASE_IMAGE} AS builder
WORKDIR /app/web
COPY --from=deps /app/web/node_modules ./node_modules
COPY --from=deps /app/web/package.json ./package.json

# F11/F13 fix: glob ALL root-level config files instead of enumerating.
# Picks up postcss.config.*, tailwind.config.*, next.config.*, tsconfig*,
# any future *.config.* additions without Dockerfile changes.
COPY web/*.json web/*.ts web/*.mjs web/*.js web/*.cjs ./
COPY web/public/ ./public/
COPY web/src/ ./src/
COPY shared/ ../shared/

ARG NEXT_PUBLIC_BACKEND_URL
ARG NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_PLATFORM_SERVICE_URL=$NEXT_PUBLIC_PLATFORM_SERVICE_URL
ENV NEXT_TELEMETRY_DISABLED=1

RUN corepack enable && pnpm run build

# ── Runtime (Next.js standalone) ───────────────────────────────────
FROM ${BASE_IMAGE} AS runner
WORKDIR /app/web
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1

COPY --from=builder /app/web/.next/standalone ./
# Next 16 standalone server runs as `node web/server.js` from /app/web,
# so static assets live at /app/web/web/.next/static (NOT ./.next/static).
COPY --from=builder /app/web/.next/static ./web/.next/static
COPY --from=builder /app/web/public ./web/public

EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME=0.0.0.0
CMD ["node", "web/server.js"]
```

> **Verification step after every web Dockerfile change:** smoke-test the
> built image by running it and curling the rendered HTML. Confirm the CSS
> bundle in `<link>` references is > 50 KB. A bundle of ~33 KB is the F11
> signature (only `@font-face`, no Tailwind utilities).

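
The size check in that verification step can be scripted once the CSS bundle has been downloaded to disk. The sketch below is illustrative only: the function name `css_bundle_ok` is hypothetical, and the 51200-byte default assumes 1 KB = 1024 bytes for the 50 KB threshold.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only when a downloaded CSS bundle is larger
# than the threshold (default 50 KB, taken here as 51200 bytes).
css_bundle_ok() {
  local file="$1" min_bytes="${2:-51200}"
  local size
  # wc pads its output on some platforms, so strip whitespace before comparing.
  size="$(wc -c < "$file" | tr -d '[:space:]')"
  (( size > min_bytes ))
}
```

A typical use would be to `curl` the stylesheet URL found in the rendered HTML into a temp file and then run `css_bundle_ok "$tmpfile" || echo "possible F11 regression"`.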
### 7.6 `docker-doctor.sh` skeleton (Phase E)

```bash
#!/usr/bin/env bash
# docker-doctor.sh — pre-flight Dockerfile + docker-compose health checks.
# Run on PRs touching Dockerfile, docker-compose*.yml, .dockerignore.
set -euo pipefail

REPO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
FAILED=0

# Check 1 (A8/F11/F13): every config file in web/ is COPY'd in web/Dockerfile
for cfg in postcss.config tailwind.config next.config; do
  for f in "$REPO_DIR"/web/${cfg}.{js,mjs,cjs,ts}; do
    [[ -f "$f" ]] || continue
    base=$(basename "$f")
    if ! grep -q "COPY web/${base}\\|COPY web/\\*" "$REPO_DIR/web/Dockerfile" 2>/dev/null; then
      echo "✗ F11/F13: $base exists but not COPY'd in web/Dockerfile"
      FAILED=1
    fi
  done
done

# Check 2 (A9/F12): healthchecks use 127.0.0.1
if grep -rE 'test:.*http://localhost' "$REPO_DIR"/docker-compose*.yml 2>/dev/null; then
  echo "✗ F12: healthcheck uses localhost (should be 127.0.0.1)"
  FAILED=1
fi

# Check 3: .npmrc.docker matches canonical template
if [[ -f "$REPO_DIR/.npmrc.docker" ]]; then
  if ! grep -q '\${GITEA_NPM_HOST}' "$REPO_DIR/.npmrc.docker"; then
    echo "✗ F4: .npmrc.docker doesn't use \${GITEA_NPM_HOST} placeholder"
    FAILED=1
  fi
fi

# Check 4: .dockerignore doesn't exclude pnpm-lock.yaml
if grep -q '^pnpm-lock\.yaml$' "$REPO_DIR/.dockerignore" 2>/dev/null; then
  echo "⚠ F1: .dockerignore excludes pnpm-lock.yaml (blocks lockfile optimization)"
fi

# Check 5: base image is on approved list
for df in "$REPO_DIR"/{backend,web}/Dockerfile; do
  [[ -f "$df" ]] || continue
  if ! grep -qE 'FROM (\$\{BASE_IMAGE\}|node:22-(alpine|slim))' "$df"; then
    echo "✗ Unapproved base image in $df"
    FAILED=1
  fi
done

exit "$FAILED"
```

---

## 8. Phase E — Observability / lint (NEW)

Two complementary linters:

1. **`gitea-doctor`** — Gitea registry pre-flight (env + token + connectivity).
   **Already shipped** in `common-plat` commit `610a59fd` at
   `scripts/gitea/doctor.sh`. This roadmap only wires it into CI/build flows
   (A0-D + E0 below).
2. **`docker-doctor`** — Dockerfile + compose-file static linter (see § 7.6
   skeleton). Built as part of this roadmap (E1 below).

The two are intentionally separate concerns:

| Linter | Scope | When to run |
|---|---|---|
| `gitea-doctor` | runtime env, token, registry HTTP 200 | Before every build / deploy |
| `docker-doctor` | static analysis of Dockerfile + compose YAML | On every PR touching those files |

### Phase E checklist

- [ ] **E0.** Wire `bash scripts/gitea/doctor.sh --quiet` into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in `common-plat`; replicate via a reusable `actions/gitea-preflight@main` composite if Gitea Actions supports it, otherwise inline.
- [x] **E1.** Canonical `docker-doctor.sh` landed in `learning_ai_common_plat/scripts/docker-doctor.sh` (`common-plat@130883a7`). 15 checks codified from F1–F18; verified PASS on both pilots and FAIL on un-migrated control (`learning_ai_notes`).
- [x] **E2.** Per-repo wrappers landed: `clock@aa5202fe7`, `peakpulse@af207b7`.
- [x] **E3.** Wired into CI as the Gitea Actions `docker-lint` job: runs on PRs touching `Dockerfile`, `docker-compose*.yml`, `.dockerignore`, `.npmrc.docker` (see § 10, item 13; runner status tracked under C5).
- [x] **E4.** Wired into the pre-commit hook (warning-only at first, error after 2 weeks; see § 10, item 13).
- [x] **E6.** `make doctor` target added; it runs both `gitea-doctor` AND `docker-doctor` (see § 10, item 13).
- [x] **E5.** Checks documented in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` (`common-plat@130883a7`).

Checks implemented by `docker-doctor.sh`:

| Check | Addresses | Action |
|---|---|---|
| Every `web/*.config.*` file is COPY'd | F11, F13 | Error |
| `docker-compose.yml` healthcheck uses `127.0.0.1` | F12 | Error |
| `.npmrc.docker` uses `${GITEA_NPM_HOST}` AND `${GITEA_NPM_OWNER}` placeholders | F4, F14 | Error |
| Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker` | F14 | Error |
| `.dockerignore` doesn't exclude `pnpm-lock.yaml` | F1 | Warn (expected per ADR-0001) |
| Base image is on approved list (`node:22-alpine` or `node:22-slim` via `BASE_IMAGE` ARG) | Canonical decision | Error |
| `.docker-deps/` and `*.bak` in `.gitignore` | B3 | Error |
| `docker-compose.yml` passes `GITEA_NPM_OWNER` build arg | F14 | Warn |

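
The § 7.6 skeleton predates the F14 rows in this table. As an illustration of how the "declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker`" rule could be expressed, here is a minimal sketch; it is not the canonical implementation in `common-plat`, and the function name `check_f14_arg` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the F14 Dockerfile check: a Dockerfile that copies
# .npmrc.docker must also declare ARG GITEA_NPM_OWNER.
check_f14_arg() {
  local dockerfile="$1"
  if grep -qE '^COPY[[:space:]].*\.npmrc\.docker' "$dockerfile" &&
     ! grep -qE '^ARG[[:space:]]+GITEA_NPM_OWNER([=[:space:]]|$)' "$dockerfile"; then
    echo "✗ F14: $dockerfile copies .npmrc.docker but never declares ARG GITEA_NPM_OWNER"
    return 1
  fi
  return 0
}
```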
---

## 9. Open questions (numbered TODOs, not blockers)

1. **Shared pnpm cache volume?** BuildKit caches are already shared across
   builds by `id=pnpm`. Test whether a named Docker volume adds anything
   before adding complexity.
2. **Custom base image?** Publish `bytelyst/node-pnpm:22-{alpine,slim}` with
   pnpm pre-installed to skip corepack. Cost: image maintenance; benefit: ~5 s/build.
3. **CI hostname?** Verify that `host.docker.internal:host-gateway` works in Gitea
   Actions Linux runners, or whether a CI-specific Dockerfile variant is needed.
4. **Multi-platform builds?** `linux/amd64` + `linux/arm64` interact awkwardly
   with cache mounts under `buildx`. Defer to separate roadmap.
5. **Workspace flattening?** Eliminate the `../learning_ai_common_plat/packages/*`
   workspace entry inside Docker via a flattened `pnpm-workspace.yaml`.
   Unlocks `--frozen-lockfile`. Requires lockfile regeneration step.

---

## 10. Execution order

1. **✅ v5 commit:** roadmap doc v5 lands; F16 documented (`devops_tools@ba8b4d1`).
2. **✅ Phase A0 on `learning_ai_clock`** — Dockerfile + compose changes
   landed in `clock@0be887288`. Initial A0-V blocked on F16/F17/F18.
3. **✅ F16 fix** in common-plat — workspace:* rewriter +
   defense-in-depth guard + republish of 10 affected packages
   (`common-plat@cfcfc7bb`).
4. **✅ F17 fix** in common-plat + Gitea config — `ROOT_URL=host.docker.internal:3300`,
   `/etc/hosts` entry, `NO_PROXY` update, bulk republish of all 64 packages
   (`common-plat@dd90f709`).
5. **✅ F18 fix** in clock — 4 `file:` refs in `web/package.json` rewritten
   to `*` (`clock@8b5c767a3`).
6. **✅ A0-V on clock PASSED.** v6 commit lands (`devops_tools@7627d55`).
7. **✅ A8 + A9 + A1** on clock (correctness + corepack) — `clock@f6a806ff3`.
   Web cold dropped to 174 s; backend essentially flat at 60 s.
   F11 guard verified (Tailwind utilities present in CSS bundle).
8. **✅ A2 + A4 + A5 + A6** on clock (cache mount + dockerignore) — `clock@55e8d22d3`.
   Warm rebuilds: **backend 2.9 s, web 5.4 s** (95–97% reduction).
   A7 metrics table populated this commit.
9. **✅ Phase A0 → A6** on `learning_ai_peakpulse` backend (`peakpulse@11a6bc5`).
   Cold 72.2 s, warm 2.7 s. Pattern from clock applied verbatim, plus a
   side fix for `.docker-deps/.gitkeep` discoverability that was also
   ported back to clock (`peakpulse@6523a1a`, `clock@1465e06b1`,
   `clock@d69003c1f`).
10. **✅ Phase D expansion** — MindLyst (`84a5d10`), LysnrAI (`0f1fa64`),
    talk2obsidian (`793089e`) brought into the consumer list. `sync-docker-prep.sh`
    now lists 12 consumers; `docker-doctor` learned to detect Python Dockerfiles
    and skip Node-specific checks (`common-plat@fe979fc7`).
19. **✅ Phase D.3 advisory-warning cleanup** — 10 repos received
    mechanical `build.args` injection + `start_period` additions.
    All 12 repos now report `docker-doctor: PASS` with **zero errors**.
20. **⏸ Lone remaining follow-up** — C5 (verify Gitea Actions `docker-lint`
    job is green) waits for the next CI run on either pilot. Nothing
    actionable from here.
11. **✅ Phase E1/E2/E5** — `docker-doctor.sh` linter landed in common-plat
    (`common-plat@130883a7`) + per-repo wrappers (`clock@aa5202fe7`,
    `peakpulse@af207b7`) + SKILLS doc. Verified PASS on both pilots, FAIL with
    6 specific findings on un-migrated control (`learning_ai_notes`).
12. **✅ Phase B** — `docker-prep.sh` hardened + promoted to canonical home in
    common-plat (`common-plat@a418a23e`). Synced to both pilots
    (`clock@27034d90f`, `peakpulse@563a45e`). All Phase B checklist items
    landed except B4 (husky pre-commit hook) and B7-4 (per-repo AGENTS.md
    warnings — deferred to Phase D rollout). Verified end-to-end on both
    pilots: dry-run → pack → check (fail) → idempotency guard → restore →
    `git status` clean.
13. **✅ Phase B4 + E3/E4/E6** — pre-commit guard
    (`common-plat@c908c6d7`) + `.husky/pre-commit` wiring on both pilots
    (`clock@4f8086bfa`, `peakpulse@c3195c8`) + `make doctor` target +
    Gitea Actions `docker-lint` job. Verified guard blocks simulated
    staged tarballs.
14. **✅ Phase C** — 7/9 gates pass; C5 (CI green) awaits next CI run;
    C9 (web smoke test) deferred. Cold build 64 s, warm 2.6 s / 3.3 s.
15. **✅ Phase D.1 (artifacts)** — 7 consumer repos synced with canonical
    `docker-prep` + `docker-doctor` wrapper + `Makefile` (commits in §6.D.1).
16. **✅ Phase D.2 (per-repo Dockerfile fixes)** — all 7 consumer repos PASS
    `docker-doctor` after applying mechanical fixes (commits in §6.D.2).
    Web smoke test (C9) landed on clock to guard F11 regression.
17. **✅ B7-4 AGENTS.md "do not edit" warnings** — landed in all 9 consumer
    repos.
18. **⏸ Follow-ups** — (a) C5 confirmation after next Gitea CI run;
    (b) MindLyst / LysnrAI / talk2obsidian — separate scoping; (c) optional:
    add `compose: GITEA_NPM_OWNER` arg + healthcheck `start_period` to
    repos still warning on those checks.

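
The "95–97% reduction" quoted in item 8 follows from the cold times in item 7 (backend 60 s, web 174 s) and the warm times in item 8 (2.9 s, 5.4 s). A small worked calculation, with a hypothetical helper name `percent_reduction`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage reduction from a cold build time to a warm one.
percent_reduction() {
  # LC_ALL=C keeps the decimal separator a dot regardless of the shell's locale.
  LC_ALL=C awk -v cold="$1" -v warm="$2" \
    'BEGIN { printf "%.1f\n", (cold - warm) / cold * 100 }'
}

percent_reduction 60 2.9    # backend: (60 - 2.9) / 60   → 95.2
percent_reduction 174 5.4   # web:     (174 - 5.4) / 174 → 96.9
```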
---

## 11. Risk register

| Risk | Mitigation |
|---|---|
| Removing `pnpm-lock.yaml` from `.dockerignore` exposes a stale or sibling-aware lockfile that breaks Docker installs | Keep `--lockfile=false` for now (A3 ADR); revisit after F2 resolution |
| BuildKit cache mount on shared CI runners causes cross-build interference | Use distinct `id=` per repo (`id=pnpm-${repo}`) if observed |
| `host.docker.internal` doesn't resolve in Linux Docker | `extra_hosts: ["host.docker.internal:host-gateway"]` (A0-4) |
| Removing `.docker-deps/` from default builds breaks repos that haven't done A0 yet | Wildcard `COPY .docker-deps*` keeps both paths working during migration |
| `docker-prep.sh` `--force` is misused and `.bak` files get committed | Pre-commit hook (B4) blocks `.bak`, `.tgz`, rewritten `package.json` |
| Corp network blocks `host.docker.internal:3300` | Verify SSH tunnel reaches Gitea; document in operations.md |
| **F11 regression: build green, app ships with no CSS** | C9 smoke test + Phase E `docker-doctor.sh` check on `web/*.config.*` COPY coverage |
| **F12 regression: healthcheck false-fails on IPv6** | Phase E `docker-doctor.sh` grep for `localhost` in compose files |
| **F13 regression: new config file added, Dockerfile forgotten** | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
| `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` |
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
| **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean |
| **F17 regression: someone publishes from a shell that points Gitea `ROOT_URL` back to `localhost`** | Phase E `docker-doctor.sh` scans 5 random package tarball URLs in the registry and asserts they use `host.docker.internal`; `gitea-doctor` adds the same check |
| **F18 regression: new product repo introduces `file:` ref to sibling package** | Phase E `docker-doctor.sh` greps `**/package.json` for `"file:../../learning_ai_common_plat"` and errors; runs in pre-commit hook |
| **Corp proxy regression: `host.docker.internal` falls out of NO_PROXY on a dev machine** | `switch-network.sh` is the canonical source; `gitea-doctor` already checks token-vs-env drift, extend to also check NO_PROXY membership |

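
The F18 mitigation in the table can be sketched as a recursive grep. This is an illustration rather than the shipped `docker-doctor.sh` check: the function name `check_f18_file_refs` is hypothetical, and the pattern is slightly broader than the literal in the table (it matches any `file:` reference whose path mentions `learning_ai_common_plat`).

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the F18 check: fail when any package.json outside
# node_modules points at the sibling common-plat repo via a file: reference.
check_f18_file_refs() {
  local root="${1:-.}" hits
  hits="$(grep -rHnE --include=package.json --exclude-dir=node_modules \
    '"file:[^"]*learning_ai_common_plat' "$root" || true)"
  if [[ -n "$hits" ]]; then
    echo "✗ F18: file: reference to sibling common-plat package"
    echo "$hits"
    return 1
  fi
  return 0
}
```

Called as `check_f18_file_refs .` from a repo root, it prints the offending file and line number for each match, which makes it usable from a pre-commit hook.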