docs(docker): roadmap v5 — add F16 (registry workspace:* leaks)

Discovered during A0-V execution on learning_ai_clock (2026-05-27).

F16: At least 10 of ~50 published @bytelyst/* packages in the Gitea
registry have unrewritten 'workspace:*' refs in their package.json
dependencies. pnpm install inside Docker fails with
ERR_PNPM_WORKSPACE_PKG_NOT_FOUND because there is no workspace context
inside the container.

Confirmed broken (latest version each):
  @bytelyst/auth@0.1.5             → errors=workspace:*
  @bytelyst/diagnostics-client@0.1.6 → api-client=workspace:*
  @bytelyst/events@0.1.5           → queue=workspace:*
  @bytelyst/extraction@0.1.5       → api-client=workspace:*
  @bytelyst/fastify-auth@0.1.5     → errors=workspace:*
  @bytelyst/fastify-core@0.1.5     → errors=workspace:*   ← clock dep
  @bytelyst/feedback-client@0.1.6  → api-client=workspace:*
  @bytelyst/field-encrypt@0.1.6    → errors=workspace:*   ← clock dep
  @bytelyst/react-auth@0.1.6       → api-client=workspace:*
  @bytelyst/sync@0.1.5             → api-client, telemetry-client=workspace:*

Changes:
- § 0: bump count to 16; add F16 row (Critical severity)
- § 0 Implications: F16 blocks every A0-V; updated rationale
- § 3: insert new Phase A-pre (republish + publish-time guard) before A0
- § 3 A0-V: append blocked-status note linking to clock@0be887288
- § 10 Execution order: renumber; insert A-pre as step 3
- § 11 Risk register: add F16 row

Implementation status:
  ✅ Step 2 (A0 on clock) — committed in learning_ai_clock@0be887288;
     Dockerfile + compose changes correct, end-to-end build blocked on F16
  ⏸  Step 3 (A-pre) — next; touches common-plat publish flow
  ⏸  Step 4+ (A0-V retry on clock, then onward) — blocked on A-pre
saravanakumardb1 2026-05-27 01:18:25 -07:00
parent 8025cd5d82
commit ba8b4d1ace


@ -1,6 +1,6 @@
# Docker Build Optimization Roadmap
> **Status:** Draft v4 (Gitea hardening integrated) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
> **Status:** Draft v5 (F16 — registry workspace:* leaks discovered during A0-V) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
@ -17,8 +17,10 @@
## 0. Pre-flight audit findings (2026-05-27)
A read-only audit of pilot repos + lessons from recent live incidents surfaced
**15 concrete bugs/gaps** (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the
A read-only audit of pilot repos + lessons from recent live incidents +
A0-V execution failure on clock surfaced **16 concrete bugs/gaps** (F14–F15
added after the Gitea-hardening commit; F16 added during A0-V execution on
clock, 2026-05-27). The actual state of the ecosystem is closer to the
inverse of the casual narrative: tarballs are the de facto default, the
Gitea-registry path is partially wired, and there is a separate class of
"build green, app broken" silent failures (F11–F13) that the speed-focused
@ -41,6 +43,7 @@ plan needs to address first.
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` to `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
| **F16** | **At least 10 published `@bytelyst/*` packages in the Gitea registry have unrewritten `workspace:*` refs in their `package.json` dependencies.** `pnpm publish` should rewrite these to concrete semver automatically; these packages were either published via a path that bypassed pnpm's rewrite (raw `npm publish`, or `npm pack` + manual upload; `pnpm pack` itself rewrites the refs), or `pnpm publish` was invoked in a way that disabled rewriting. **Discovered during A0-V on clock 2026-05-27:** `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6` (both direct deps of clock backend) fail `pnpm install` inside Docker with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND`. **Blocks A0-V on every pilot repo.** Full list: `auth@0.1.5`, `diagnostics-client@0.1.6`, `events@0.1.5`, `extraction@0.1.5`, `fastify-auth@0.1.5`, `fastify-core@0.1.5`, `feedback-client@0.1.6`, `field-encrypt@0.1.6`, `react-auth@0.1.6`, `sync@0.1.5`. | `common-plat` publish flow + Gitea registry | **Critical** |
**Implications:**
@ -48,8 +51,13 @@ plan needs to address first.
two upstream fixes first (F1, F2).
- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we
ship faster builds of broken apps.
- **F16 blocks the entire Gitea-registry pilot** — no pilot repo can A0-V
green until the registry is republished cleanly. Added new Phase A-pre
(§3) and updated §10 execution order to insert republish ahead of any
Docker-side verification.
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
F11/F13 recurrence — they are silent in CI today.
F11/F13 recurrence — they are silent in CI today. A registry-side guard
(publish-time check for `workspace:*` leaks) is the equivalent for F16.
---
@ -115,8 +123,47 @@ Decisions taken now to avoid contradictions later in the doc:
## 3. Phase A — Correctness + build speed + path correctness
Order matters: A0 must precede A1+ (you can't optimize a path that doesn't
work), and A8+A9 (correctness) must land before measuring speed wins.
Order matters: **A-pre must precede A0** (you can't build via a registry that
serves broken metadata); A0 must precede A1+ (you can't optimize a path that
doesn't work), and A8+A9 (correctness) must land before measuring speed wins.
### A-pre. Republish `@bytelyst/*` to Gitea cleanly (addresses F16)
**Owner:** `learning_ai_common_plat` · **Blocks:** every A0-V on every pilot.
Discovered during clock A0-V on 2026-05-27 (see [§0 F16](#0-pre-flight-audit-findings-2026-05-27)).
At least 10 packages in the registry have literal `workspace:*` strings in
their published `package.json` dependencies. `pnpm install` inside Docker
fails with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because there is no workspace
context inside the container.
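
For reference, the substitution that a correct publish is expected to perform can be sketched as below. This is an illustration of the rule described in pnpm's workspace documentation, not pnpm's actual implementation; the function name `rewriteWorkspaceSpec` and the version `0.1.5` are hypothetical, and the aliased form (`workspace:foo@*`) is deliberately not handled.

```javascript
// Sketch only: mirrors the documented publish-time rewrite of workspace: specs.
// `version` is the current version of the referenced workspace package (assumed input).
function rewriteWorkspaceSpec(spec, version) {
  // Anything without the workspace: prefix is already a registry-resolvable range.
  if (!spec.startsWith("workspace:")) return spec;
  const range = spec.slice("workspace:".length);
  // workspace:* pins the exact current version.
  if (range === "*") return version;
  // workspace:^ and workspace:~ keep the operator and append the current version.
  if (range === "^" || range === "~") return range + version;
  // workspace:<range> (for example workspace:^0.1.0) just drops the prefix.
  return range;
}

console.log(rewriteWorkspaceSpec("workspace:*", "0.1.5")); // 0.1.5
console.log(rewriteWorkspaceSpec("workspace:^", "0.1.5")); // ^0.1.5
```

A published `package.json` that still contains the left-hand form means this step never ran for that package, which is the condition the checklist below audits and repairs.
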
- [ ] **A-pre-1.** Audit the publish flow in `common-plat`. Identify whether
packages are published via `pnpm publish` (which rewrites `workspace:*`
automatically, as does `pnpm pack`) or via a path that bypasses rewriting
(raw `npm publish`, `npm pack` then manual upload, or any other step that
uploads a tarball pnpm did not build).
- [ ] **A-pre-2.** Republish all `@bytelyst/*` packages with proper rewriting.
Bump patch version if Gitea refuses to overwrite existing versions.
```bash
cd learning_ai_common_plat
pnpm -r --filter './packages/*' publish \
--no-git-checks \
--registry http://localhost:3300/api/packages/learning_ai_user/npm/
```
- [ ] **A-pre-3.** Verify with the same curl scan used in clock A0-V (output
should be `0/N workspace:* refs`):
```bash
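# Confirmed-broken packages from §0 F16; extend to every published @bytelyst/* package.
for name in auth diagnostics-client events extraction fastify-auth \
fastify-core feedback-client field-encrypt react-auth sync; do
pkg="@bytelyst/${name}"
printf '%s: ' "$pkg"
curl -sS -H "Authorization: token $GITEA_NPM_TOKEN" \
"http://localhost:3300/api/packages/learning_ai_user/npm/${pkg}" \
| jq -r '.versions[.["dist-tags"].latest] | (.dependencies // {}) + (.peerDependencies // {}) | to_entries | map(select(.value | test("workspace:"))) | length'
done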
```
- [ ] **A-pre-4.** Add a publish-time guard in `common-plat`: a pre-publish script that runs
`node -e 'const deps = JSON.parse(require("fs").readFileSync("package.json", "utf8")).dependencies || {}; process.exit(Object.values(deps).some((v) => v.startsWith("workspace:")) ? 1 : 0)'`
against the `package.json` inside each package's packed tarball before pushing to Gitea,
and aborts the publish on a non-zero exit. Mirror of the Phase E lint at the registry layer.
- [ ] **A-pre-5.** Document publish flow in `common-plat/AGENTS.md` and link
back to this roadmap section.
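
The guard in A-pre-4 could be fleshed out along these lines. This is a minimal sketch, not the script that ships in `common-plat`: the names `findWorkspaceLeaks` and `checkManifestFile` are hypothetical, and covering `peerDependencies` and `optionalDependencies` in addition to `dependencies` is an assumption that goes slightly beyond the checklist item.

```javascript
// Hypothetical pre-publish guard (sketch). Function names and the set of
// dependency fields checked are assumptions, not the common-plat implementation.
const fs = require("node:fs");

const DEP_FIELDS = ["dependencies", "peerDependencies", "optionalDependencies"];

// Return one "field.name=spec" string per dependency that still uses workspace:.
function findWorkspaceLeaks(manifest) {
  const leaks = [];
  for (const field of DEP_FIELDS) {
    for (const [name, spec] of Object.entries(manifest[field] || {})) {
      if (typeof spec === "string" && spec.startsWith("workspace:")) {
        leaks.push(`${field}.${name}=${spec}`);
      }
    }
  }
  return leaks;
}

// Read a package.json from disk and return its leaks (empty array means clean).
function checkManifestFile(path) {
  return findWorkspaceLeaks(JSON.parse(fs.readFileSync(path, "utf8")));
}
```

The check only makes sense against the `package.json` extracted from the packed tarball: the source manifest in the monorepo legitimately contains `workspace:*`. A wrapper would call `checkManifestFile` on that extracted file, print the returned entries, and exit non-zero when the array is not empty.
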
### A0. Make the Gitea-registry path actually work (clock + peakpulse)
@ -167,6 +214,14 @@ work), and A8+A9 (correctness) must land before measuring speed wins.
- In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
> **2026-05-27 status — clock A0-V: BLOCKED on F16.** Dockerfile + compose
> changes landed in `learning_ai_clock@0be887288` (`feat(docker): A0 — wire
> Gitea-registry path (blocked by F16)`). Build fails at
> `pnpm install` with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because
> `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6`
> (both direct deps) have unrewritten `workspace:*` refs in published
> metadata. A-pre must complete before retry.
### A1. Replace `npm install -g pnpm@X` with corepack
- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
@ -677,23 +732,26 @@ Checks implemented by `docker-doctor.sh`:
## 10. Execution order
1. **Now (this commit):** roadmap doc v3 lands here; sign-off requested.
2. **Phase A0 on `learning_ai_clock`** (web + backend) — pilot order
intentionally inverted vs. v2: web is where F11/F13 incidents lived, and
clock exercises both surface types in one repo. Fix `.npmrc.docker`,
`docker-compose.yml`, `.dockerignore`. Verify **A0-V** (Gitea path works
end-to-end) before any speed work.
3. **A8 + A9 + A1** on clock (correctness before speed). Commit.
4. **A2 + A4 + A5 + A6** on clock. Measure. Commit.
5. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
1. **Now (v5 commit):** roadmap doc v5 lands here; F16 documented.
2. **✅ Phase A0 on `learning_ai_clock`** (web + backend) — Dockerfile +
compose changes landed in `learning_ai_clock@0be887288`. A0-V verification
**BLOCKED on F16**; retry after A-pre.
3. **⚠️ Phase A-pre on `learning_ai_common_plat`** — republish all
`@bytelyst/*` packages with `workspace:*` rewritten; add publish-time
guard. **Unblocks every A0-V downstream.**
4. **Retry A0-V on clock** (no code change needed; rebuild only). Once green,
commit a doc-only "A0-V passed" update to this roadmap.
5. **A8 + A9 + A1** on clock (correctness before speed). Commit.
6. **A2 + A4 + A5 + A6** on clock. Measure. Commit.
7. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
second pass for the simpler case.
6. **A7** — fill in metrics table.
7. **A3 ADR** — decide lockfile policy (defer implementation).
8. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
home in common-plat (B7) and sync to peakpulse.
9. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
10. **Phase C** — verification gates C1–C9.
11. **Phase D** — scheduled separately, only after § 5 passes.
8. **A7** — fill in metrics table.
9. **A3 ADR** — decide lockfile policy (defer implementation).
10. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
home in common-plat (B7) and sync to peakpulse.
11. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
12. **Phase C** — verification gates C1–C9.
13. **Phase D** — scheduled separately, only after § 5 passes.
---
@ -713,3 +771,4 @@ Checks implemented by `docker-doctor.sh`:
| `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` |
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
| **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean |
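
The recurring registry scan named in the F16 row could reuse the logic of the A-pre-3 `jq` filter. The sketch below assumes the standard npm packument shape (`dist-tags.latest` plus a `versions` map) as returned by the registry; the function names are hypothetical, fetching is left out, and reading the `k/N` summary as "packages with at least one leak out of N scanned" is an assumption about the format quoted in A-pre-3.

```javascript
// Sketch of a registry-side scan over already-fetched packuments.
// Function names and the meaning of the k/N summary are assumptions.

// Count workspace: refs in the latest published version of one package.
function countLeaks(packument) {
  const latest = (packument["dist-tags"] || {}).latest;
  const manifest = (packument.versions || {})[latest] || {};
  // Same merge as the jq filter: dependencies plus peerDependencies.
  const deps = { ...(manifest.dependencies || {}), ...(manifest.peerDependencies || {}) };
  return Object.values(deps).filter((spec) => String(spec).includes("workspace:")).length;
}

// Summarize a batch as "<leaking packages>/<packages scanned> workspace:* refs".
function summarize(packuments) {
  const leaking = packuments.filter((p) => countLeaks(p) > 0).length;
  return `${leaking}/${packuments.length} workspace:* refs`;
}
```

A gate built on this would refuse to mark any A0-V as passed unless the summary starts with `0/`.
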