From ba8b4d1ace568f77150a3a9eb0c173bc4651f9c7 Mon Sep 17 00:00:00 2001 From: saravanakumardb1 Date: Wed, 27 May 2026 01:18:25 -0700 Subject: [PATCH] =?UTF-8?q?docs(docker):=20roadmap=20v5=20=E2=80=94=20add?= =?UTF-8?q?=20F16=20(registry=20workspace:*=20leaks)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Discovered during A0-V execution on learning_ai_clock (2026-05-27). F16: At least 10 of ~50 published @bytelyst/* packages in the Gitea registry have unrewritten 'workspace:*' refs in their package.json dependencies. pnpm install inside Docker fails with ERR_PNPM_WORKSPACE_PKG_NOT_FOUND because there is no workspace context inside the container. Confirmed broken (latest version each): @bytelyst/auth@0.1.5 → errors=workspace:* @bytelyst/diagnostics-client@0.1.6 → api-client=workspace:* @bytelyst/events@0.1.5 → queue=workspace:* @bytelyst/extraction@0.1.5 → api-client=workspace:* @bytelyst/fastify-auth@0.1.5 → errors=workspace:* @bytelyst/fastify-core@0.1.5 → errors=workspace:* ← clock dep @bytelyst/feedback-client@0.1.6 → api-client=workspace:* @bytelyst/field-encrypt@0.1.6 → errors=workspace:* ← clock dep @bytelyst/react-auth@0.1.6 → api-client=workspace:* @bytelyst/sync@0.1.5 → api-client, telemetry-client=workspace:* Changes: - § 0: bump count to 16; add F16 row (Critical severity) - § 0 Implications: F16 blocks every A0-V; updated rationale - § 3: insert new Phase A-pre (republish + publish-time guard) before A0 - § 3 A0-V: append blocked-status note linking to clock@0be887288 - § 10 Execution order: renumber; insert A-pre as step 3 - § 11 Risk register: add F16 row Implementation status: ✅ Step 2 (A0 on clock) — committed in learning_ai_clock@0be887288; Dockerfile + compose changes correct, end-to-end build blocked on F16 ⏸ Step 3 (A-pre) — next; touches common-plat publish flow ⏸ Step 4+ (A0-V retry on clock, then onward) — blocked on A-pre --- docs/docker-build-optimization-roadmap.md | 103 +++++++++++++++++----- 
1 file changed, 81 insertions(+), 22 deletions(-) diff --git a/docs/docker-build-optimization-roadmap.md b/docs/docker-build-optimization-roadmap.md index 4892b59..186786a 100644 --- a/docs/docker-build-optimization-roadmap.md +++ b/docs/docker-build-optimization-roadmap.md @@ -1,6 +1,6 @@ # Docker Build Optimization Roadmap -> **Status:** Draft v4 (Gitea hardening integrated) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 +> **Status:** Draft v5 (F16 — registry workspace:* leaks discovered during A0-V) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 > > Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend) > and `learning_ai_peakpulse` (backend), then capture the playbook here for @@ -17,8 +17,10 @@ ## 0. Pre-flight audit findings (2026-05-27) -A read-only audit of pilot repos + lessons from recent live incidents surfaced -**15 concrete bugs/gaps** (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the +A read-only audit of pilot repos + lessons from recent live incidents + +A0-V execution failure on clock surfaced **16 concrete bugs/gaps** (F14–F15 +added after the Gitea-hardening commit; F16 added during A0-V execution on +clock, 2026-05-27). The actual state of the ecosystem is closer to the inverse of the casual narrative: tarballs are the de facto default, the Gitea-registry path is partially wired, and there is a separate class of "build green, app broken" silent failures (F11–F13) that the speed-focused @@ -41,6 +43,7 @@ plan needs to address first. | **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). 
| every Dockerfile using enumerated COPY | **Medium** | | **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** | | **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) | +| **F16** | **At least 10 published `@bytelyst/*` packages in the Gitea registry have unrewritten `workspace:*` refs in their `package.json` dependencies.** `pnpm publish` should rewrite these to concrete semver automatically; these packages were either published via a path that bypassed `pnpm publish` (raw `npm publish`, `pnpm pack` + manual upload), or `pnpm publish` was invoked with a flag that disabled rewriting. **Discovered during A0-V on clock 2026-05-27** — `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6` (both direct deps of clock backend) fail `pnpm install` inside Docker with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND`. **Blocks A0-V on every pilot repo.** Full list: `auth@0.1.5`, `diagnostics-client@0.1.6`, `events@0.1.5`, `extraction@0.1.5`, `fastify-auth@0.1.5`, `fastify-core@0.1.5`, `feedback-client@0.1.6`, `field-encrypt@0.1.6`, `react-auth@0.1.6`, `sync@0.1.5`. 
| `common-plat` publish flow + Gitea registry | **Critical** | **Implications:** @@ -48,8 +51,13 @@ plan needs to address first. two upstream fixes first (F1, F2). - F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we ship faster builds of broken apps. +- **F16 blocks the entire Gitea-registry pilot** — no pilot repo can pass A0-V + until the registry is republished cleanly. Added new Phase A-pre + (§3) and updated §10 execution order to insert republish ahead of any + Docker-side verification. - A linter (Phase E `docker-doctor.sh`) is the durable insurance against - F11/F13 recurrence — they are silent in CI today. + F11/F13 recurrence — they are silent in CI today. A registry-side guard + (publish-time check for `workspace:*` leaks) is the equivalent for F16. --- @@ -115,8 +123,47 @@ Decisions taken now to avoid contradictions later in the doc: ## 3. Phase A — Correctness + build speed + path correctness -Order matters: A0 must precede A1+ (you can't optimize a path that doesn't -work), and A8+A9 (correctness) must land before measuring speed wins. +Order matters: **A-pre must precede A0** (you can't build via a registry that +serves broken metadata); A0 must precede A1+ (you can't optimize a path that +doesn't work), and A8+A9 (correctness) must land before measuring speed wins. + +### A-pre. Republish `@bytelyst/*` to Gitea cleanly (addresses F16) + +**Owner:** `learning_ai_common_plat` · **Blocks:** every A0-V on every pilot. + +Discovered during clock A0-V on 2026-05-27 (see [§0 F16](#0-pre-flight-audit-findings-2026-05-27)). +At least 10 packages in the registry have literal `workspace:*` strings in +their published `package.json` dependencies. `pnpm install` inside Docker +fails with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because there is no workspace +context inside the container. + +- [ ] **A-pre-1.** Audit the publish flow in `common-plat`. 
Identify whether + packages are published via `pnpm publish` (which rewrites `workspace:*` + automatically) or via a path that bypasses rewriting (raw `npm publish`, + `pnpm pack` then manual upload, or `pnpm publish --no-workspace-protocol`). +- [ ] **A-pre-2.** Republish all `@bytelyst/*` packages with proper rewriting. + Bump patch version if Gitea refuses to overwrite existing versions. + ```bash + cd learning_ai_common_plat + pnpm -r --filter './packages/*' publish \ + --no-git-checks \ + --registry "http://localhost:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/" + ``` +- [ ] **A-pre-3.** Verify with the same curl scan used in clock A0-V (output + should be `0/N workspace:* refs`): + ```bash + for pkg in $(list); do + curl -sS -H "Authorization: token $GITEA_NPM_TOKEN" \ + "http://localhost:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/${pkg}" \ + | jq -r '.versions[.["dist-tags"].latest] | (.dependencies // {}) + (.peerDependencies // {}) | to_entries | map(select(.value | test("workspace:"))) | length' + done + ``` +- [ ] **A-pre-4.** Add a publish-time guard in `common-plat`: + pre-publish script that runs `node -e 'const p = JSON.parse(require("fs").readFileSync("package.json", "utf8")); process.exit(Object.values({ ...p.dependencies, ...p.peerDependencies }).some(v => v.startsWith("workspace:")) ? 1 : 0)'` + against each package's tarball'd `package.json` before push to Gitea. + Mirror of Phase E lint at the registry layer. +- [ ] **A-pre-5.** Document publish flow in `common-plat/AGENTS.md` and link + back to this roadmap section. ### A0. Make the Gitea-registry path actually work (clock + peakpulse) @@ -167,6 +214,14 @@ work), and A8+A9 (correctness) must land before measuring speed wins. - In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`. - [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. 
Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit. + > **2026-05-27 status — clock A0-V: BLOCKED on F16.** Dockerfile + compose + > changes landed in `learning_ai_clock@0be887288` (`feat(docker): A0 — wire + > Gitea-registry path (blocked by F16)`). Build fails at + > `pnpm install` with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because + > `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6` + > (both direct deps) have unrewritten `workspace:*` refs in published + > metadata. A-pre must complete before retry. + ### A1. Replace `npm install -g pnpm@X` with corepack - [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with: @@ -677,23 +732,26 @@ Checks implemented by `docker-doctor.sh`: ## 10. Execution order -1. **Now (this commit):** roadmap doc v3 lands here; sign-off requested. -2. **Phase A0 on `learning_ai_clock`** (web + backend) — pilot order - intentionally inverted vs. v2: web is where F11/F13 incidents lived, and - clock exercises both surface types in one repo. Fix `.npmrc.docker`, - `docker-compose.yml`, `.dockerignore`. Verify **A0-V** (Gitea path works - end-to-end) before any speed work. -3. **A8 + A9 + A1** on clock (correctness before speed). Commit. -4. **A2 + A4 + A5 + A6** on clock. Measure. Commit. -5. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation +1. **Now (v5 commit):** roadmap doc v5 lands here; F16 documented. +2. **✅ Phase A0 on `learning_ai_clock`** (web + backend) — Dockerfile + + compose changes landed in `learning_ai_clock@0be887288`. A0-V verification + **BLOCKED on F16**; retry after A-pre. +3. **⚠️ Phase A-pre on `learning_ai_common_plat`** — republish all + `@bytelyst/*` packages with `workspace:*` rewritten; add publish-time + guard. **Unblocks every A0-V downstream.** +4. **Retry A0-V on clock** (no code change needed; rebuild only). 
Once green, + commit a doc-only "A0-V passed" update to this roadmap. +5. **A8 + A9 + A1** on clock (correctness before speed). Commit. +6. **A2 + A4 + A5 + A6** on clock. Measure. Commit. +7. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation second pass for the simpler case. -6. **A7** — fill in metrics table. -7. **A3 ADR** — decide lockfile policy (defer implementation). -8. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical - home in common-plat (B7) and sync to peakpulse. -9. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error. -10. **Phase C** — verification gates C1–C9. -11. **Phase D** — scheduled separately, only after § 5 passes. +8. **A7** — fill in metrics table. +9. **A3 ADR** — decide lockfile policy (defer implementation). +10. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical + home in common-plat (B7) and sync to peakpulse. +11. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error. +12. **Phase C** — verification gates C1–C9. +13. **Phase D** — scheduled separately, only after § 5 passes. 
--- @@ -713,3 +771,4 @@ Checks implemented by `docker-doctor.sh`: | `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` | | **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration | | **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected | +| **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean |
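
The publish-time guard named in the F16 row could look roughly like the sketch below. This is a hedged illustration, not the `common-plat` implementation: the function names `has_workspace_refs` and `check_package` are hypothetical, and the check is a plain-text grep for any JSON string value that begins with `workspace:` rather than a structured parse of the dependency maps.

```shell
#!/usr/bin/env bash
# Hypothetical guard (an assumption, not the common-plat script): detect
# unrewritten workspace: specifiers in a package.json before it is pushed.

# has_workspace_refs FILE
# Exit status 0 if FILE contains a string value starting with "workspace:",
# non-zero otherwise.
has_workspace_refs() {
  grep -Eq ':[[:space:]]*"workspace:' "$1"
}

# check_package FILE
# Returns 2 if FILE is unreadable, 1 if a leak is found, 0 if clean,
# so a publish wrapper can abort on any non-zero status.
check_package() {
  if [ ! -r "$1" ]; then
    echo "ERROR: cannot read $1" >&2
    return 2
  fi
  if has_workspace_refs "$1"; then
    echo "LEAK: $1 still contains workspace: specifiers" >&2
    return 1
  fi
  echo "OK: $1"
}
```

A guard like this only makes sense against the `package.json` extracted from the packed tarball, because the source manifest in the monorepo legitimately contains `workspace:*`. A text-level grep can also flag a `workspace:` string outside the dependency maps; that is acceptable for a fail-closed gate but worth knowing when reading its output.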