docs(docker): roadmap v5 — add F16 (registry workspace:* leaks)
Discovered during A0-V execution on learning_ai_clock (2026-05-27).
F16: At least 10 of ~50 published @bytelyst/* packages in the Gitea
registry have unrewritten 'workspace:*' refs in their package.json
dependencies. pnpm install inside Docker fails with
ERR_PNPM_WORKSPACE_PKG_NOT_FOUND because there is no workspace context
inside the container.
Confirmed broken (latest version each):
@bytelyst/auth@0.1.5 → errors=workspace:*
@bytelyst/diagnostics-client@0.1.6 → api-client=workspace:*
@bytelyst/events@0.1.5 → queue=workspace:*
@bytelyst/extraction@0.1.5 → api-client=workspace:*
@bytelyst/fastify-auth@0.1.5 → errors=workspace:*
@bytelyst/fastify-core@0.1.5 → errors=workspace:* ← clock dep
@bytelyst/feedback-client@0.1.6 → api-client=workspace:*
@bytelyst/field-encrypt@0.1.6 → errors=workspace:* ← clock dep
@bytelyst/react-auth@0.1.6 → api-client=workspace:*
@bytelyst/sync@0.1.5 → api-client, telemetry-client=workspace:*
Changes:
- § 0: bump count to 16; add F16 row (Critical severity)
- § 0 Implications: F16 blocks every A0-V; updated rationale
- § 3: insert new Phase A-pre (republish + publish-time guard) before A0
- § 3 A0-V: append blocked-status note linking to clock@0be887288
- § 10 Execution order: renumber; insert A-pre as step 3
- § 11 Risk register: add F16 row
Implementation status:
✅ Step 2 (A0 on clock) — committed in learning_ai_clock@0be887288;
Dockerfile + compose changes correct, end-to-end build blocked on F16
⏸ Step 3 (A-pre) — next; touches common-plat publish flow
⏸ Step 4+ (A0-V retry on clock, then onward) — blocked on A-pre
Parent: 8025cd5d82 · Commit: ba8b4d1ace
@ -1,6 +1,6 @@
|
||||
# Docker Build Optimization Roadmap
|
||||
|
||||
> **Status:** Draft v4 (Gitea hardening integrated) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
|
||||
> **Status:** Draft v5 (F16 — registry workspace:* leaks discovered during A0-V) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
|
||||
>
|
||||
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
|
||||
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
|
||||
@ -17,8 +17,10 @@
|
||||
|
||||
## 0. Pre-flight audit findings (2026-05-27)
|
||||
|
||||
A read-only audit of pilot repos + lessons from recent live incidents surfaced
|
||||
**15 concrete bugs/gaps** (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the
|
||||
A read-only audit of pilot repos + lessons from recent live incidents +
|
||||
A0-V execution failure on clock surfaced **16 concrete bugs/gaps** (F14–F15
|
||||
added after the Gitea-hardening commit; F16 added during A0-V execution on
|
||||
clock, 2026-05-27). The actual state of the ecosystem is closer to the
|
||||
inverse of the casual narrative: tarballs are the de facto default, the
|
||||
Gitea-registry path is partially wired, and there is a separate class of
|
||||
"build green, app broken" silent failures (F11–F13) that the speed-focused
|
||||
@ -41,6 +43,7 @@ plan needs to address first.
|
||||
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
|
||||
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
|
||||
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
|
||||
| **F16** | **At least 10 published `@bytelyst/*` packages in the Gitea registry have unrewritten `workspace:*` refs in their `package.json` dependencies.** `pnpm publish` should rewrite these to concrete semver automatically; these packages were either published via a path that bypassed `pnpm publish` (raw `npm publish`, `pnpm pack` + manual upload), or `pnpm publish` was invoked with a flag that disabled rewriting. **Discovered during A0-V on clock 2026-05-27** — `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6` (both direct deps of clock backend) fail `pnpm install` inside Docker with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND`. **Blocks A0-V on every pilot repo.** Full list: `auth@0.1.5`, `diagnostics-client@0.1.6`, `events@0.1.5`, `extraction@0.1.5`, `fastify-auth@0.1.5`, `fastify-core@0.1.5`, `feedback-client@0.1.6`, `field-encrypt@0.1.6`, `react-auth@0.1.6`, `sync@0.1.5`. | `common-plat` publish flow + Gitea registry | **Critical** |
|
||||
|
||||
**Implications:**
|
||||
|
||||
@ -48,8 +51,13 @@ plan needs to address first.
|
||||
two upstream fixes first (F1, F2).
|
||||
- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we
|
||||
ship faster builds of broken apps.
|
||||
- **F16 blocks the entire Gitea-registry pilot**: no pilot repo can pass A0-V
until the registry is republished cleanly. Added new Phase A-pre
|
||||
(§3) and updated §10 execution order to insert republish ahead of any
|
||||
Docker-side verification.
|
||||
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
|
||||
F11/F13 recurrence — they are silent in CI today.
|
||||
F11/F13 recurrence — they are silent in CI today. A registry-side guard
|
||||
(publish-time check for `workspace:*` leaks) is the equivalent for F16.
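
A minimal sketch of what such a publish-time check could look like, assuming a plain-text scan is acceptable. The function name `check_no_workspace_refs` and its exit codes are hypothetical, not part of `common-plat`. It flags any string value that starts with `workspace:`, so it covers `dependencies`, `peerDependencies`, and any other field in one pass.

```bash
#!/usr/bin/env bash
# Hypothetical pre-publish guard (name and exit codes are assumptions).
# Coarse text match on the manifest, not a full JSON parse.

# check_no_workspace_refs <path/to/package.json>
# Returns 0 when clean, 1 when a workspace: ref is found, 2 when the file is missing.
check_no_workspace_refs() {
  local manifest="$1"
  if [ ! -f "$manifest" ]; then
    echo "guard: $manifest not found" >&2
    return 2
  fi
  # Match   "some-key": "workspace:...   anywhere in the file and
  # print the offending lines (with line numbers) to stderr.
  if grep -En '"[^"]+"[[:space:]]*:[[:space:]]*"workspace:' "$manifest" >&2; then
    echo "guard: unrewritten workspace: refs in $manifest" >&2
    return 1
  fi
  return 0
}
```

The check only makes sense against the manifest that actually ships, since the in-repo `package.json` legitimately contains `workspace:*`. One way to get at it is to extract `package/package.json` from the packed tarball (for example `tar -xzf pkg.tgz -O package/package.json > /tmp/manifest.json`) and run the function on that file.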
|
||||
|
||||
---
|
||||
|
||||
@ -115,8 +123,47 @@ Decisions taken now to avoid contradictions later in the doc:
|
||||
|
||||
## 3. Phase A — Correctness + build speed + path correctness
|
||||
|
||||
Order matters: A0 must precede A1+ (you can't optimize a path that doesn't
|
||||
work), and A8+A9 (correctness) must land before measuring speed wins.
|
||||
Order matters: **A-pre must precede A0** (you can't build via a registry that
|
||||
serves broken metadata); A0 must precede A1+ (you can't optimize a path that
|
||||
doesn't work), and A8+A9 (correctness) must land before measuring speed wins.
|
||||
|
||||
### A-pre. Republish `@bytelyst/*` to Gitea cleanly (addresses F16)
|
||||
|
||||
**Owner:** `learning_ai_common_plat` · **Blocks:** every A0-V on every pilot.
|
||||
|
||||
Discovered during clock A0-V on 2026-05-27 (see [§0 F16](#0-pre-flight-audit-findings-2026-05-27)).
|
||||
At least 10 packages in the registry have literal `workspace:*` strings in
|
||||
their published `package.json` dependencies. `pnpm install` inside Docker
|
||||
fails with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because there is no workspace
|
||||
context inside the container.
|
||||
|
||||
- [ ] **A-pre-1.** Audit the publish flow in `common-plat`. Identify whether
|
||||
packages are published via `pnpm publish` (which rewrites `workspace:*`
|
||||
automatically) or via a path that bypasses rewriting (raw `npm publish`,
|
||||
`pnpm pack` then manual upload, or `pnpm publish` invoked with a flag or config that disables rewriting).
|
||||
- [ ] **A-pre-2.** Republish all `@bytelyst/*` packages with proper rewriting.
|
||||
Bump patch version if Gitea refuses to overwrite existing versions.
|
||||
```bash
cd learning_ai_common_plat
# Owner comes from the env var (see F14), falling back to the current default.
pnpm -r --filter './packages/*' publish \
  --no-git-checks \
  --registry "http://localhost:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/"
```
|
||||
- [ ] **A-pre-3.** Verify with the same curl scan used in clock A0-V (output
|
||||
should be `0/N workspace:* refs`):
|
||||
```bash
# PACKAGES: names to scan; extend to every published @bytelyst/* package.
PACKAGES=(@bytelyst/fastify-core @bytelyst/field-encrypt)
OWNER="${GITEA_NPM_OWNER:-learning_ai_user}"
for pkg in "${PACKAGES[@]}"; do
  curl -sS -H "Authorization: token $GITEA_NPM_TOKEN" \
    "http://localhost:3300/api/packages/${OWNER}/npm/${pkg}" \
    | jq -r '.versions[.["dist-tags"].latest] | (.dependencies // {}) + (.peerDependencies // {}) | to_entries | map(select(.value | test("workspace:"))) | length'
done
```
|
||||
- [ ] **A-pre-4.** Add a publish-time guard in `common-plat`:
|
||||
pre-publish script that runs `node -e 'const d = JSON.parse(require("fs").readFileSync("package.json", "utf8")).dependencies || {}; process.exit(Object.values(d).every(v => !v.startsWith("workspace:")) ? 0 : 1)'`
|
||||
against each package's tarball'd `package.json` before push to Gitea.
|
||||
Mirror of Phase E lint at the registry layer.
|
||||
- [ ] **A-pre-5.** Document publish flow in `common-plat/AGENTS.md` and link
|
||||
back to this roadmap section.
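
The A-pre-3 scan prints one leak count per package, while the expected result is stated as `0/N workspace:* refs`. A small sketch of how those counts might be folded into that summary line, assuming "N" means the number of packages scanned and the numerator is the number of packages with at least one leak. The function name `summarize_scan` is hypothetical.

```bash
#!/usr/bin/env bash
# summarize_scan: reads one leak count per line on stdin (one line per package).
# Prints "<packages with leaks>/<packages scanned> workspace:* refs".
# Returns 1 when any package has leaks, 0 otherwise.
# Assumption: lines that are not plain integers (e.g. jq errors) are skipped,
# so a failed fetch lowers N instead of counting as a leak.
summarize_scan() {
  awk '
    /^[0-9]+$/ { total++; if ($1 > 0) bad++ }
    END {
      printf "%d/%d workspace:* refs\n", bad, total
      exit (bad > 0 ? 1 : 0)
    }
  '
}
```

Piping the scan loop's output into `summarize_scan` gives a single line to compare against the number of packages expected, plus an exit code that a CI step can act on.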
|
||||
|
||||
### A0. Make the Gitea-registry path actually work (clock + peakpulse)
|
||||
|
||||
@ -167,6 +214,14 @@ work), and A8+A9 (correctness) must land before measuring speed wins.
|
||||
- In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
|
||||
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
|
||||
|
||||
> **2026-05-27 status — clock A0-V: BLOCKED on F16.** Dockerfile + compose
|
||||
> changes landed in `learning_ai_clock@0be887288` (`feat(docker): A0 — wire
|
||||
> Gitea-registry path (blocked by F16)`). Build fails at
|
||||
> `pnpm install` with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because
|
||||
> `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6`
|
||||
> (both direct deps) have unrewritten `workspace:*` refs in published
|
||||
> metadata. A-pre must complete before retry.
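
The gating behaviour described for `doctor` (skip the build with a clear error instead of failing minutes in) can be sketched as a thin wrapper. This is an illustration under assumptions: the function name `run_gated_build` is hypothetical, and both commands are passed in as strings so the same wrapper could later gate on a registry scan as well.

```bash
#!/usr/bin/env bash
# run_gated_build <gate command> <build command>
# Runs the gate first; the build only starts when the gate exits 0.
# Returns 1 when the gate fails, otherwise the build command's exit code.
run_gated_build() {
  local gate_cmd="$1" build_cmd="$2"
  if ! bash -c "$gate_cmd"; then
    echo "pre-build gate failed: '$gate_cmd' exited non-zero; build skipped" >&2
    return 1
  fi
  bash -c "$build_cmd"
}

# Example invocation (not executed here):
#   run_gated_build "bash scripts/gitea/doctor.sh" "docker compose build --no-cache"
```

Keeping the gate and the build as separate commands means a gate failure is reported as such, rather than surfacing later as an opaque `pnpm install` error inside the container.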
|
||||
|
||||
### A1. Replace `npm install -g pnpm@X` with corepack
|
||||
|
||||
- [ ] **A1-1.** Replace `RUN npm install -g pnpm@10.6.5` with:
|
||||
@ -677,23 +732,26 @@ Checks implemented by `docker-doctor.sh`:
|
||||
|
||||
## 10. Execution order
|
||||
|
||||
1. **Now (this commit):** roadmap doc v3 lands here; sign-off requested.
|
||||
2. **Phase A0 on `learning_ai_clock`** (web + backend) — pilot order
|
||||
intentionally inverted vs. v2: web is where F11/F13 incidents lived, and
|
||||
clock exercises both surface types in one repo. Fix `.npmrc.docker`,
|
||||
`docker-compose.yml`, `.dockerignore`. Verify **A0-V** (Gitea path works
|
||||
end-to-end) before any speed work.
|
||||
3. **A8 + A9 + A1** on clock (correctness before speed). Commit.
|
||||
4. **A2 + A4 + A5 + A6** on clock. Measure. Commit.
|
||||
5. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
|
||||
1. **Now (v5 commit):** roadmap doc v5 lands here; F16 documented.
|
||||
2. **✅ Phase A0 on `learning_ai_clock`** (web + backend) — Dockerfile +
|
||||
compose changes landed in `learning_ai_clock@0be887288`. A0-V verification
|
||||
**BLOCKED on F16**; retry after A-pre.
|
||||
3. **⚠️ Phase A-pre on `learning_ai_common_plat`** — republish all
|
||||
`@bytelyst/*` packages with `workspace:*` rewritten; add publish-time
|
||||
guard. **Unblocks every A0-V downstream.**
|
||||
4. **Retry A0-V on clock** (no code change needed; rebuild only). Once green,
|
||||
commit a doc-only "A0-V passed" update to this roadmap.
|
||||
5. **A8 + A9 + A1** on clock (correctness before speed). Commit.
|
||||
6. **A2 + A4 + A5 + A6** on clock. Measure. Commit.
|
||||
7. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
|
||||
second pass for the simpler case.
|
||||
6. **A7** — fill in metrics table.
|
||||
7. **A3 ADR** — decide lockfile policy (defer implementation).
|
||||
8. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
|
||||
home in common-plat (B7) and sync to peakpulse.
|
||||
9. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
|
||||
10. **Phase C** — verification gates C1–C9.
|
||||
11. **Phase D** — scheduled separately, only after § 5 passes.
|
||||
8. **A7** — fill in metrics table.
|
||||
9. **A3 ADR** — decide lockfile policy (defer implementation).
|
||||
10. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
|
||||
home in common-plat (B7) and sync to peakpulse.
|
||||
11. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
|
||||
12. **Phase C** — verification gates C1–C9.
|
||||
13. **Phase D** — scheduled separately, only after § 5 passes.
|
||||
|
||||
---
|
||||
|
||||
@ -713,3 +771,4 @@ Checks implemented by `docker-doctor.sh`:
|
||||
| `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` |
|
||||
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
|
||||
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
|
||||
| **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean |
|
||||
|
||||