diff --git a/docs/docker-build-optimization-roadmap.md b/docs/docker-build-optimization-roadmap.md index 186786a..b11cf52 100644 --- a/docs/docker-build-optimization-roadmap.md +++ b/docs/docker-build-optimization-roadmap.md @@ -1,6 +1,6 @@ # Docker Build Optimization Roadmap -> **Status:** Draft v5 (F16 — registry workspace:* leaks discovered during A0-V) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 +> **Status:** Draft v6 (F17 + F18 fixed; A0-V PASSED on clock) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 > > Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend) > and `learning_ai_peakpulse` (backend), then capture the playbook here for @@ -18,9 +18,9 @@ ## 0. Pre-flight audit findings (2026-05-27) A read-only audit of pilot repos + lessons from recent live incidents + -A0-V execution failure on clock surfaced **16 concrete bugs/gaps** (F14–F15 -added after the Gitea-hardening commit; F16 added during A0-V execution on -clock, 2026-05-27). The actual state of the ecosystem is closer to the +the A0-V execution iterations on clock surfaced **18 concrete bugs/gaps** +(F14–F15 added after the Gitea-hardening commit; F16–F18 added during the +A0-V execution sweep on clock, 2026-05-27). The actual state of the ecosystem is closer to the inverse of the casual narrative: tarballs are the de facto default, the Gitea-registry path is partially wired, and there is a separate class of "build green, app broken" silent failures (F11–F13) that the speed-focused @@ -43,7 +43,9 @@ plan needs to address first. | **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). 
| every Dockerfile using enumerated COPY | **Medium** | | **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** | | **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) | -| **F16** | **At least 10 published `@bytelyst/*` packages in the Gitea registry have unrewritten `workspace:*` refs in their `package.json` dependencies.** `pnpm publish` should rewrite these to concrete semver automatically; these packages were either published via a path that bypassed `pnpm publish` (raw `npm publish`, `pnpm pack` + manual upload), or `pnpm publish` was invoked with a flag that disabled rewriting. **Discovered during A0-V on clock 2026-05-27** — `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6` (both direct deps of clock backend) fail `pnpm install` inside Docker with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND`. **Blocks A0-V on every pilot repo.** Full list: `auth@0.1.5`, `diagnostics-client@0.1.6`, `events@0.1.5`, `extraction@0.1.5`, `fastify-auth@0.1.5`, `fastify-core@0.1.5`, `feedback-client@0.1.6`, `field-encrypt@0.1.6`, `react-auth@0.1.6`, `sync@0.1.5`. 
| `common-plat` publish flow + Gitea registry | **Critical** | +| **F16** | **At least 10 published `@bytelyst/*` packages had unrewritten `workspace:*` refs in their `package.json` dependencies.** Root cause: `publish-outdated-packages.sh` extracts a pnpm-packed tarball then **re-packs with `npm pack`** (workaround for a historical Gitea-compat issue with pnpm's tarball format), and `npm pack` doesn't recognize the pnpm-specific `workspace:` protocol — it passes it through literally. **Fixed in `common-plat@cfcfc7bb`** (`fix(gitea): rewrite workspace:* in published tarballs (F16)`) — inserted a workspace:* rewriter between extract and npm-repack + a defense-in-depth grep guard. Republished 10 affected packages. | `common-plat` publish flow + Gitea registry | **Critical (FIXED)** | +| **F17** | **Gitea bakes `localhost:3300` into the `dist.tarball` field of every published package's metadata.** Inside Docker, `localhost` is the container itself, not the host — so even after a successful registry-metadata fetch via `host.docker.internal`, pnpm follows the tarball URL to `localhost:3300` and ECONNREFUSEs. Root cause: Gitea `app.ini`'s `ROOT_URL=http://localhost:3300/` was baked at publish time. **Fixed** by setting `ROOT_URL=http://host.docker.internal:3300/`, restarting Gitea, adding `127.0.0.1 host.docker.internal` to `/etc/hosts`, adding `host.docker.internal` to `NO_PROXY` (corp proxy was hijacking DNS), and republishing all 64 packages (`common-plat@dd90f709`). | Gitea `app.ini` + host `/etc/hosts` + every dev machine's `switch-network.sh` | **Critical (FIXED)** | +| **F18** | **`clock/web/package.json` had 4 `@bytelyst/*` deps declared as `file:` refs to sibling `../../learning_ai_common_plat/packages/*`** — a legacy pre-Gitea pattern. Inside Docker those paths don't exist, so `pnpm install` fails with `ERR_PNPM_LINKED_PKG_DIR_NOT_FOUND`. Discovered during clock web A0-V on 2026-05-27. **Fixed in `learning_ai_clock@8b5c767a3`** by rewriting to `*` semver. 
Same pattern likely lives in other product repos (especially anything that consumes `@bytelyst/ui`, `@bytelyst/design-tokens`, `@bytelyst/use-theme`) — audit needed in Phase D rollout. | `*/web/package.json` (and likely others) | **High** | **Implications:** @@ -51,13 +53,16 @@ plan needs to address first. two upstream fixes first (F1, F2). - F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we ship faster builds of broken apps. -- **F16 blocks the entire Gitea-registry pilot** — no pilot repo can A0-V - green until the registry is republished cleanly. Added new Phase A-pre - (§3) and updated §10 execution order to insert republish ahead of any - Docker-side verification. +- F16 + F17 are **both fixed** as of 2026-05-27. Gitea path now works + end-to-end on clock. A-pre is largely complete; remaining items (A-pre-4, + A-pre-5) become Phase E checks. +- F18 (sibling `file:` refs in product repo manifests) is the same family as + F2 but separately tractable — fixed in clock, audit needed across other + repos as part of Phase D rollout. - A linter (Phase E `docker-doctor.sh`) is the durable insurance against - F11/F13 recurrence — they are silent in CI today. A registry-side guard - (publish-time check for `workspace:*` leaks) is the equivalent for F16. + F11/F13/F18 recurrence — silent in CI today. The registry-side guard + (publish-time check for `workspace:*` leaks) shipped in `common-plat@cfcfc7bb` + as part of the F16 fix. --- @@ -127,43 +132,61 @@ Order matters: **A-pre must precede A0** (you can't build via a registry that serves broken metadata); A0 must precede A1+ (you can't optimize a path that doesn't work), and A8+A9 (correctness) must land before measuring speed wins. -### A-pre. Republish `@bytelyst/*` to Gitea cleanly (addresses F16) +### A-pre. Make the Gitea registry actually usable from Docker (F16 + F17 + F18) -**Owner:** `learning_ai_common_plat` · **Blocks:** every A0-V on every pilot. 
+**Owner:** `learning_ai_common_plat` + per-product repo · **Status:** ✅ done for clock + global config. -Discovered during clock A0-V on 2026-05-27 (see [§0 F16](#0-pre-flight-audit-findings-2026-05-27)). -At least 10 packages in the registry have literal `workspace:*` strings in -their published `package.json` dependencies. `pnpm install` inside Docker -fails with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because there is no workspace -context inside the container. +Three distinct bugs surfaced during clock A0-V on 2026-05-27: -- [ ] **A-pre-1.** Audit the publish flow in `common-plat`. Identify whether - packages are published via `pnpm publish` (which rewrites `workspace:*` - automatically) or via a path that bypasses rewriting (raw `npm publish`, - `pnpm pack` then manual upload, or `pnpm publish --no-workspace-protocol`). -- [ ] **A-pre-2.** Republish all `@bytelyst/*` packages with proper rewriting. - Bump patch version if Gitea refuses to overwrite existing versions. - ```bash - cd learning_ai_common_plat - pnpm -r --filter './packages/*' publish \ - --no-git-checks \ - --registry http://localhost:3300/api/packages/learning_ai_user/npm/ - ``` -- [ ] **A-pre-3.** Verify with the same curl scan used in clock A0-V (output - should be `0/N workspace:* refs`): - ```bash - for pkg in $(list); do - curl -sS -H "Authorization: token $GITEA_NPM_TOKEN" \ - "http://localhost:3300/api/packages/learning_ai_user/npm/${pkg}" \ - | jq -r '.versions[.["dist-tags"].latest] | (.dependencies // {}) + (.peerDependencies // {}) | to_entries | map(select(.value | test("workspace:"))) | length' - done - ``` -- [ ] **A-pre-4.** Add a publish-time guard in `common-plat`: - pre-publish script that runs `node -e 'JSON.parse(fs.readFileSync("package.json")).dependencies && Object.values(...).every(v => !v.startsWith("workspace:"))'` - against each package's tarball'd `package.json` before push to Gitea. - Mirror of Phase E lint at the registry layer. 
-- [ ] **A-pre-5.** Document publish flow in `common-plat/AGENTS.md` and link - back to this roadmap section. +- **F16:** Publish flow leaked `workspace:*` into published metadata. +- **F17:** Gitea baked `localhost:3300` into tarball URLs. +- **F18:** Product repos had legacy `file:` refs to sibling packages. + +- [x] **A-pre-1.** Audit `publish-outdated-packages.sh` — confirmed it uses + `pnpm pack` then re-tars with `npm pack`, which loses `workspace:` rewriting. +- [x] **A-pre-2.** Patch publish script with a workspace:* rewriter + a + post-rewrite grep guard. Shipped in `common-plat@cfcfc7bb`. +- [x] **A-pre-3.** Verify all packages publish with `0` workspace:* refs. + Confirmed via curl scan across all 64 packages. +- [x] **A-pre-4.** F17 fix: set Gitea `ROOT_URL=http://host.docker.internal:3300/`, + restart Gitea, add `127.0.0.1 host.docker.internal` to `/etc/hosts`, add + `host.docker.internal` to `NO_PROXY` in `switch-network.sh`, bulk republish + all 64 packages. Shipped in `common-plat@dd90f709`. +- [x] **A-pre-5.** F18 fix: rewrite `file:../../learning_ai_common_plat/packages/*` + refs in `clock/web/package.json` to `*` semver. Shipped in `clock@8b5c767a3`. + Audit needed in Phase D for other product repos. +- [x] **A-pre-6.** Document Gitea config requirements (below). + +### A-pre-6. Gitea configuration prerequisites (one-time per dev machine) + +The Gitea registry MUST be configured with `ROOT_URL=http://host.docker.internal:3300/` +so published tarball URLs are reachable from inside Docker containers. The +host `/etc/hosts` MUST resolve `host.docker.internal` to `127.0.0.1` so the +same URLs work from the host shell. + +On macOS (Homebrew Gitea): + +```bash +# 1. Edit Gitea's app.ini +sudo -e /opt/homebrew/var/gitea/custom/conf/app.ini +# change: ROOT_URL = http://localhost:3300/ +# to: ROOT_URL = http://host.docker.internal:3300/ + +# 2. Restart Gitea +brew services restart gitea + +# 3. 
Add /etc/hosts entry so host.docker.internal resolves on the host too +sudo sh -c 'grep -q host.docker.internal /etc/hosts || \ + echo "127.0.0.1 host.docker.internal" >> /etc/hosts' + +# 4. Ensure host.docker.internal is in NO_PROXY for corp shells +# (already done in switch-network.sh as of common-plat@dd90f709) +source ~/.zshrc # reload + +# 5. Verify +curl -sS http://host.docker.internal:3300/api/v1/version +# expected: {"version":"1.25.5"} or similar +``` ### A0. Make the Gitea-registry path actually work (clock + peakpulse) @@ -214,13 +237,14 @@ context inside the container. - In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`. - [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit. - > **2026-05-27 status — clock A0-V: BLOCKED on F16.** Dockerfile + compose - > changes landed in `learning_ai_clock@0be887288` (`feat(docker): A0 — wire - > Gitea-registry path (blocked by F16)`). Build fails at - > `pnpm install` with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because - > `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6` - > (both direct deps) have unrewritten `workspace:*` refs in published - > metadata. A-pre must complete before retry. + > **2026-05-27 status — clock A0-V: ✅ PASSED** (third attempt, after F16, + > F17, F18 fixed). 
Cold-build wall-clock: + > - backend: **59.2 s** (commits: `clock@0be887288` + `common-plat@cfcfc7bb` + `common-plat@dd90f709`) + > - web: **3:13 (193 s)** (commits: above + `clock@8b5c767a3`) + > + > Both surfaces resolve `@bytelyst/*` from the Gitea registry end-to-end — + > no `docker-prep.sh` tarballs, no sibling `file:` refs, no proxy interference. + > See §3.A7 metrics table. ### A1. Replace `npm install -g pnpm@X` with corepack @@ -284,9 +308,13 @@ Two options — pick one in a short ADR before implementing: | Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes | |---|---|---|---|---|---|---| -| clock | web | — | — | — | — | | -| clock | backend | — | — | — | — | | -| peakpulse | backend | — | — | — | — | | +| clock | backend | ≈2–3 min (failed via tarball path) | **59.2 s** (A0-V #2) | — | — | Pre-A1/A2; no corepack, no cache mount yet. Speed is just from working Gitea path | +| clock | web | ≈2–3 min (failed via tarball path) | **3:13 (193 s)** (A0-V #3) | — | — | Same caveat | +| peakpulse | backend | — | — | — | — | Pending step 7 | + +Warm-rebuild numbers will be measured after A1 (corepack) + A2 (cache mount) +land — those are the actual speed phases. Current numbers establish the +baseline that A1+A2 must beat. Use: ``` @@ -732,26 +760,30 @@ Checks implemented by `docker-doctor.sh`: ## 10. Execution order -1. **Now (v5 commit):** roadmap doc v5 lands here; F16 documented. -2. **✅ Phase A0 on `learning_ai_clock`** (web + backend) — Dockerfile + - compose changes landed in `learning_ai_clock@0be887288`. A0-V verification - **BLOCKED on F16**; retry after A-pre. -3. **⚠️ Phase A-pre on `learning_ai_common_plat`** — republish all - `@bytelyst/*` packages with `workspace:*` rewritten; add publish-time - guard. **Unblocks every A0-V downstream.** -4. **Retry A0-V on clock** (no code change needed; rebuild only). Once green, - commit a doc-only "A0-V passed" update to this roadmap. -5. 
**A8 + A9 + A1** on clock (correctness before speed). Commit. -6. **A2 + A4 + A5 + A6** on clock. Measure. Commit. -7. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation - second pass for the simpler case. -8. **A7** — fill in metrics table. -9. **A3 ADR** — decide lockfile policy (defer implementation). -10. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical +1. **✅ v5 commit:** roadmap doc v5 lands; F16 documented (`devops_tools@ba8b4d1`). +2. **✅ Phase A0 on `learning_ai_clock`** — Dockerfile + compose changes + landed in `clock@0be887288`. Initial A0-V blocked on F16/F17/F18. +3. **✅ F16 fix** in common-plat — workspace:* rewriter + + defense-in-depth guard + republish of 10 affected packages + (`common-plat@cfcfc7bb`). +4. **✅ F17 fix** in common-plat + Gitea config — `ROOT_URL=http://host.docker.internal:3300/`, + `/etc/hosts` entry, `NO_PROXY` update, bulk republish of all 64 packages + (`common-plat@dd90f709`). +5. **✅ F18 fix** in clock — 4 `file:` refs in `web/package.json` rewritten + to `*` (`clock@8b5c767a3`). +6. **✅ A0-V on clock PASSED.** v6 commit lands here documenting it. +7. **⏳ A8 + A9 + A1** on clock (correctness before speed). Commit. +8. **⏳ A2 + A4 + A5 + A6** on clock. Measure warm-rebuild numbers. Commit. +9. **⏳ Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation + second pass. +10. **⏳ A7** — fill in warm-rebuild numbers in metrics table. +11. **⏳ A3 ADR** — decide lockfile policy (defer implementation). +12. **⏳ Phase B** — harden `docker-prep.sh` on clock, then promote to canonical home in common-plat (B7) and sync to peakpulse. -11. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error. -12. **Phase C** — verification gates C1–C9. -13. **Phase D** — scheduled separately, only after § 5 passes. +13. **⏳ Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error. +14. **⏳ Phase C** — verification gates C1–C9. +15. 
**⏸ Phase D** — scheduled separately, only after §5 C-gates pass. **STOP + and request approval before starting.** --- @@ -772,3 +804,6 @@ Checks implemented by `docker-doctor.sh`: | **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration | | **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected | | **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean | +| **F17 regression: packages are published while Gitea's `ROOT_URL` in `app.ini` points back at `localhost`** | Phase E `docker-doctor.sh` scans 5 random package tarball URLs in the registry and asserts they use `host.docker.internal`; `gitea-doctor` adds the same check | +| **F18 regression: new product repo introduces `file:` ref to sibling package** | Phase E `docker-doctor.sh` greps `**/package.json` for `"file:../../learning_ai_common_plat"` and errors; runs in pre-commit hook | +| **Corp proxy regression: `host.docker.internal` falls out of `NO_PROXY` on a dev machine** | `switch-network.sh` is the canonical source; `gitea-doctor` already checks token-vs-env drift and should be extended to also check `NO_PROXY` membership |
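
The F18 and corp-proxy rows above describe two checks that are simple enough to sketch in shell. The following is a minimal sketch only, not the actual `docker-doctor.sh` or `gitea-doctor` implementation: the function names `no_proxy_contains` and `find_sibling_file_refs` are hypothetical, and the grep pattern is taken from the F18 row as written.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and structure are assumptions, not the
# contents of docker-doctor.sh or gitea-doctor.

# no_proxy_contains HOST LIST
# Succeeds if HOST appears as an exact entry in the comma-separated LIST
# (for example the value of $NO_PROXY). Whitespace around entries is ignored.
no_proxy_contains() {
  local host="$1" list="${2:-}" entry
  local -a entries=()
  IFS=',' read -r -a entries <<< "$list"
  for entry in ${entries[@]+"${entries[@]}"}; do
    entry="${entry//[[:space:]]/}"
    if [ "$entry" = "$host" ]; then
      return 0
    fi
  done
  return 1
}

# find_sibling_file_refs [ROOT]
# Prints every package.json under ROOT (skipping node_modules) that contains
# a "file:" dependency pointing at the sibling common-plat checkout.
# Prints nothing when the tree is clean.
find_sibling_file_refs() {
  local root="${1:-.}"
  find "$root" -type f -name package.json -not -path '*/node_modules/*' \
    -exec grep -lF '"file:../../learning_ai_common_plat' {} + || true
}
```

A caller could fail a pre-commit hook when `find_sibling_file_refs .` prints anything, and warn when `no_proxy_contains host.docker.internal "${NO_PROXY:-}"` returns non-zero. Exact-entry matching is a deliberate simplification: it does not interpret wildcard or domain-suffix entries in `NO_PROXY`.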