docs(docker): roadmap v6 — F17 + F18 fixed, A0-V PASSED on clock

Resolves the A-pre phase entirely. Gitea-registry path now works
end-to-end on learning_ai_clock for both backend + web.

Findings added to § 0:
  F17: Gitea baked localhost:3300 in tarball URLs (Critical, FIXED)
  F18: clock/web/package.json had file: refs to sibling repo (High, FIXED)

Status updates:
  - A0-V on clock:  PASSED (3rd attempt, after F16/F17/F18 cleared)
    backend cold build: 59.2 s
    web cold build:     193 s (3:13)
  - A-pre fully complete (all 6 sub-items checked)
  - § 3.A7 metrics table populated with baseline numbers
  - § 10 execution order: steps 1-6 marked done; step 7 (A8+A9+A1) is next

New § A-pre-6: documents Gitea config requirements (ROOT_URL, /etc/hosts,
NO_PROXY) as one-time setup per dev machine. Required because the F17 fix
spans Gitea server config + host /etc/hosts + repo-side switch-network.sh.

Risk register: 4 new rows (F17 regression, F18 regression,
corp-proxy NO_PROXY drift, BASE_IMAGE override drift kept from v5).

Cross-repo commit chain (this run):
  clock@0be887288       feat(docker): A0 — wire Gitea-registry path
  devops_tools@ba8b4d1  docs(docker): roadmap v5 — add F16
  common-plat@cfcfc7bb  fix(gitea): rewrite workspace:* in tarballs (F16)
  common-plat@dd90f709  fix(gitea): ROOT_URL host.docker.internal (F17)
  clock@8b5c767a3       fix(docker): rewrite file: refs to * (F18)
  devops_tools@HEAD     docs(docker): roadmap v6 (this commit)

Next: § 10 step 7 — A8 (config-file COPY audit) + A9 (healthcheck IPv4)
+ A1 (corepack) on clock.
saravanakumardb1 2026-05-27 01:54:14 -07:00
parent ba8b4d1ace
commit 7627d5526d


@@ -1,6 +1,6 @@
# Docker Build Optimization Roadmap
> **Status:** Draft v5 (F16 — registry workspace:* leaks discovered during A0-V) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
> **Status:** Draft v6 (F17 + F18 fixed; A0-V PASSED on clock) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
>
> Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
> and `learning_ai_peakpulse` (backend), then capture the playbook here for
@@ -18,9 +18,9 @@
## 0. Pre-flight audit findings (2026-05-27)
A read-only audit of pilot repos + lessons from recent live incidents +
A0-V execution failure on clock surfaced **16 concrete bugs/gaps** (F14–F15
added after the Gitea-hardening commit; F16 added during A0-V execution on
clock, 2026-05-27). The actual state of the ecosystem is closer to the
the A0-V execution iterations on clock surfaced **18 concrete bugs/gaps**
(F14–F15 added after the Gitea-hardening commit; F16–F18 added during the
A0-V execution sweep on clock, 2026-05-27). The actual state of the ecosystem is closer to the
inverse of the casual narrative: tarballs are the de facto default, the
Gitea-registry path is partially wired, and there is a separate class of
"build green, app broken" silent failures (F11–F13) that the speed-focused
@@ -43,7 +43,9 @@ plan needs to address first.
| **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
| **F16** | **At least 10 published `@bytelyst/*` packages in the Gitea registry have unrewritten `workspace:*` refs in their `package.json` dependencies.** `pnpm publish` should rewrite these to concrete semver automatically; these packages were either published via a path that bypassed `pnpm publish` (raw `npm publish`, `pnpm pack` + manual upload), or `pnpm publish` was invoked with a flag that disabled rewriting. **Discovered during A0-V on clock 2026-05-27:** `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6` (both direct deps of clock backend) fail `pnpm install` inside Docker with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND`. **Blocks A0-V on every pilot repo.** Full list: `auth@0.1.5`, `diagnostics-client@0.1.6`, `events@0.1.5`, `extraction@0.1.5`, `fastify-auth@0.1.5`, `fastify-core@0.1.5`, `feedback-client@0.1.6`, `field-encrypt@0.1.6`, `react-auth@0.1.6`, `sync@0.1.5`. | `common-plat` publish flow + Gitea registry | **Critical** |
| **F16** | **At least 10 published `@bytelyst/*` packages had unrewritten `workspace:*` refs in their `package.json` dependencies.** Root cause: `publish-outdated-packages.sh` extracts a pnpm-packed tarball then **re-packs with `npm pack`** (workaround for a historical Gitea-compat issue with pnpm's tarball format), and `npm pack` doesn't recognize the pnpm-specific `workspace:` protocol — it passes it through literally. **Fixed in `common-plat@cfcfc7bb`** (`fix(gitea): rewrite workspace:* in published tarballs (F16)`) — inserted a workspace:* rewriter between extract and npm-repack + a defense-in-depth grep guard. Republished 10 affected packages. | `common-plat` publish flow + Gitea registry | **Critical (FIXED)** |
| **F17** | **Gitea bakes `localhost:3300` into the `dist.tarball` field of every published package's metadata.** Inside Docker, `localhost` is the container itself, not the host — so even after a successful registry-metadata fetch via `host.docker.internal`, pnpm follows the tarball URL to `localhost:3300` and ECONNREFUSEs. Root cause: Gitea `app.ini`'s `ROOT_URL=http://localhost:3300/` was baked at publish time. **Fixed** by setting `ROOT_URL=http://host.docker.internal:3300/`, restarting Gitea, adding `127.0.0.1 host.docker.internal` to `/etc/hosts`, adding `host.docker.internal` to `NO_PROXY` (corp proxy was hijacking DNS), and republishing all 64 packages (`common-plat@dd90f709`). | Gitea `app.ini` + host `/etc/hosts` + every dev machine's `switch-network.sh` | **Critical (FIXED)** |
| **F18** | **`clock/web/package.json` had 4 `@bytelyst/*` deps declared as `file:` refs to sibling `../../learning_ai_common_plat/packages/*`** — a legacy pre-Gitea pattern. Inside Docker those paths don't exist, so `pnpm install` fails with `ERR_PNPM_LINKED_PKG_DIR_NOT_FOUND`. Discovered during clock web A0-V on 2026-05-27. **Fixed in `learning_ai_clock@8b5c767a3`** by rewriting to `*` semver. Same pattern likely lives in other product repos (especially anything that consumes `@bytelyst/ui`, `@bytelyst/design-tokens`, `@bytelyst/use-theme`) — audit needed in Phase D rollout. | `*/web/package.json` (and likely others) | **High** |
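The env-vs-file token drift described in F15 can be illustrated with a minimal sketch. The function name `token_in_sync` is hypothetical and is not taken from `scripts/gitea/doctor.sh`; the sketch only assumes that the token is stored in a file such as `~/.gitea_npm_token` and exported as `GITEA_NPM_TOKEN`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not the real doctor.sh: succeeds only when the
# exported token matches the token stored on disk.
token_in_sync() {
  local env_value="$1" token_file="$2"
  [ -r "$token_file" ] || return 2           # file missing or unreadable
  [ "$env_value" = "$(cat "$token_file")" ]  # $(...) strips the trailing newline
}

# Usage (assumed variable and path names):
#   token_in_sync "${GITEA_NPM_TOKEN:-}" "$HOME/.gitea_npm_token" \
#     || echo "token drift: re-source your shell profile" >&2
```

A pre-build gate built this way would stop before `docker compose build` starts, rather than surfacing a 401 partway through the install step.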
**Implications:**
@@ -51,13 +53,16 @@ plan needs to address first.
two upstream fixes first (F1, F2).
- F11–F13 mean **correctness fixes must precede speed fixes**, otherwise we
ship faster builds of broken apps.
- **F16 blocks the entire Gitea-registry pilot** — no pilot repo can A0-V
green until the registry is republished cleanly. Added new Phase A-pre
(§3) and updated §10 execution order to insert republish ahead of any
Docker-side verification.
- F16 + F17 are **both fixed** as of 2026-05-27. Gitea path now works
end-to-end on clock. A-pre is complete for clock (all six sub-items
checked); guarding against regressions becomes a Phase E check.
- F18 (sibling `file:` refs in product repo manifests) is the same family as
F2 but separately tractable — fixed in clock, audit needed across other
repos as part of Phase D rollout.
- A linter (Phase E `docker-doctor.sh`) is the durable insurance against
F11/F13 recurrence — they are silent in CI today. A registry-side guard
(publish-time check for `workspace:*` leaks) is the equivalent for F16.
F11/F13/F18 recurrence — silent in CI today. The registry-side guard
(publish-time check for `workspace:*` leaks) shipped in `common-plat@cfcfc7bb`
as part of the F16 fix.
---
@@ -127,43 +132,61 @@ Order matters: **A-pre must precede A0** (you can't build via a registry that
serves broken metadata); A0 must precede A1+ (you can't optimize a path that
doesn't work), and A8+A9 (correctness) must land before measuring speed wins.
### A-pre. Republish `@bytelyst/*` to Gitea cleanly (addresses F16)
### A-pre. Make the Gitea registry actually usable from Docker (F16 + F17 + F18)
**Owner:** `learning_ai_common_plat` · **Blocks:** every A0-V on every pilot.
**Owner:** `learning_ai_common_plat` + per-product repo · **Status:** ✅ done for clock + global config.
Discovered during clock A0-V on 2026-05-27 (see [§0 F16](#0-pre-flight-audit-findings-2026-05-27)).
At least 10 packages in the registry have literal `workspace:*` strings in
their published `package.json` dependencies. `pnpm install` inside Docker
fails with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because there is no workspace
context inside the container.
Three distinct bugs surfaced during clock A0-V on 2026-05-27:
- **F16:** Publish flow leaked `workspace:*` into published metadata.
- **F17:** Gitea baked `localhost:3300` into tarball URLs.
- **F18:** Product repos had legacy `file:` refs to sibling packages.
- [x] **A-pre-1.** Audit `publish-outdated-packages.sh` — confirmed it uses
`pnpm pack` then re-tars with `npm pack`, which loses `workspace:` rewriting.
- [x] **A-pre-2.** Patch publish script with a workspace:* rewriter + a
post-rewrite grep guard. Shipped in `common-plat@cfcfc7bb`.
- [x] **A-pre-3.** Verify all packages publish with `0` workspace:* refs.
Confirmed via curl scan across all 64 packages.
- [x] **A-pre-4.** F17 fix: set Gitea `ROOT_URL=http://host.docker.internal:3300/`,
restart Gitea, add `127.0.0.1 host.docker.internal` to `/etc/hosts`, add
`host.docker.internal` to `NO_PROXY` in `switch-network.sh`, bulk republish
all 64 packages. Shipped in `common-plat@dd90f709`.
- [x] **A-pre-5.** F18 fix: rewrite `file:../../learning_ai_common_plat/packages/*`
refs in `clock/web/package.json` to `*` semver. Shipped in `clock@8b5c767a3`.
Audit needed in Phase D for other product repos.
- [x] **A-pre-6.** Document Gitea config requirements (below).
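The post-rewrite grep guard from A-pre-2 could look roughly like the sketch below. This is not the code shipped in `common-plat@cfcfc7bb`; the function name `has_workspace_refs` and the idea of running it against an extracted `package.json` are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical guard: returns 0 when a manifest still contains a dependency
# value that uses pnpm's workspace: protocol (e.g. "workspace:*").
has_workspace_refs() {
  # $1: path to a package.json extracted from the tarball about to be published
  grep -Eq '"[[:space:]]*:[[:space:]]*"workspace:' "$1"
}

# Usage (assumed layout: the tarball was extracted to ./package/):
#   if has_workspace_refs package/package.json; then
#     echo "refusing to publish: workspace: refs remain" >&2
#     exit 1
#   fi
```

Matching on the `": "workspace:` shape keeps the check limited to JSON values, so a package that merely mentions the word "workspace" in its description does not trip the guard.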
### A-pre-6. Gitea configuration prerequisites (one-time per dev machine)
The Gitea registry MUST be configured with `ROOT_URL=http://host.docker.internal:3300/`
so published tarball URLs are reachable from inside Docker containers. The
host `/etc/hosts` MUST resolve `host.docker.internal` to `127.0.0.1` so the
same URLs work from the host shell.
On macOS (Homebrew Gitea):
- [ ] **A-pre-1.** Audit the publish flow in `common-plat`. Identify whether
packages are published via `pnpm publish` (which rewrites `workspace:*`
automatically) or via a path that bypasses rewriting (raw `npm publish`,
`pnpm pack` then manual upload, or `pnpm publish --no-workspace-protocol`).
- [ ] **A-pre-2.** Republish all `@bytelyst/*` packages with proper rewriting.
Bump patch version if Gitea refuses to overwrite existing versions.
```bash
cd learning_ai_common_plat
pnpm -r --filter './packages/*' publish \
--no-git-checks \
--registry http://localhost:3300/api/packages/learning_ai_user/npm/
# 1. Edit Gitea's app.ini
sudo -e /opt/homebrew/var/gitea/custom/conf/app.ini
# change: ROOT_URL = http://localhost:3300/
# to: ROOT_URL = http://host.docker.internal:3300/
# 2. Restart Gitea
brew services restart gitea
# 3. Add /etc/hosts entry so host.docker.internal resolves on the host too
sudo sh -c 'grep -q host.docker.internal /etc/hosts || \
echo "127.0.0.1 host.docker.internal" >> /etc/hosts'
# 4. Ensure host.docker.internal is in NO_PROXY for corp shells
# (already done in switch-network.sh as of common-plat@dd90f709)
source ~/.zshrc # reload
# 5. Verify
curl -sS http://host.docker.internal:3300/api/v1/version
# expected: {"version":"1.25.5"} or similar
```
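To confirm that the F17 fix took effect, one option is to inspect the host part of a published tarball URL. The sketch below uses two hypothetical helpers, `url_host` and `tarball_reachable_from_docker`; the example URL in the usage comment is made up and does not reflect the exact path layout Gitea uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the host[:port] portion of a URL.
url_host() {
  local rest="${1#*://}"        # drop the scheme
  printf '%s\n' "${rest%%/*}"   # drop everything after the first slash
}

# Hypothetical check: fail when the URL points at the loopback host,
# which inside a container is the container itself, not the dev machine.
tarball_reachable_from_docker() {
  case "$(url_host "$1")" in
    localhost|localhost:*|127.0.0.1|127.0.0.1:*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage (illustrative URL only):
#   tarball_reachable_from_docker "http://host.docker.internal:3300/some/pkg.tgz" \
#     || echo "tarball URL still points at localhost" >&2
```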
- [ ] **A-pre-3.** Verify with the same curl scan used in clock A0-V (output
should be `0/N workspace:* refs`):
```bash
for pkg in $(list); do
curl -sS -H "Authorization: token $GITEA_NPM_TOKEN" \
"http://localhost:3300/api/packages/learning_ai_user/npm/${pkg}" \
| jq -r '.versions[.["dist-tags"].latest] | (.dependencies // {}) + (.peerDependencies // {}) | to_entries | map(select(.value | test("workspace:"))) | length'
done
```
- [ ] **A-pre-4.** Add a publish-time guard in `common-plat`:
pre-publish script that runs `node -e 'JSON.parse(fs.readFileSync("package.json")).dependencies && Object.values(...).every(v => !v.startsWith("workspace:"))'`
against each package's tarball'd `package.json` before push to Gitea.
Mirror of Phase E lint at the registry layer.
- [ ] **A-pre-5.** Document publish flow in `common-plat/AGENTS.md` and link
back to this roadmap section.
### A0. Make the Gitea-registry path actually work (clock + peakpulse)
@@ -214,13 +237,14 @@ context inside the container.
- In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
- [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
> **2026-05-27 status — clock A0-V: BLOCKED on F16.** Dockerfile + compose
> changes landed in `learning_ai_clock@0be887288` (`feat(docker): A0 — wire
> Gitea-registry path (blocked by F16)`). Build fails at
> `pnpm install` with `ERR_PNPM_WORKSPACE_PKG_NOT_FOUND` because
> `@bytelyst/fastify-core@0.1.5` and `@bytelyst/field-encrypt@0.1.6`
> (both direct deps) have unrewritten `workspace:*` refs in published
> metadata. A-pre must complete before retry.
> **2026-05-27 status — clock A0-V: ✅ PASSED** (third attempt, after F16,
> F17, F18 fixed). Cold-build wall-clock:
> - backend: **59.2 s** (commits: `clock@0be887288` + `common-plat@cfcfc7bb` + `common-plat@dd90f709`)
> - web: **3:13 (193 s)** (commits: above + `clock@8b5c767a3`)
>
> Both surfaces resolve `@bytelyst/*` from the Gitea registry end-to-end —
> no `docker-prep.sh` tarballs, no sibling `file:` refs, no proxy interference.
> See §3.A7 metrics table.
### A1. Replace `npm install -g pnpm@X` with corepack
@@ -284,9 +308,13 @@ Two options — pick one in a short ADR before implementing:
| Repo | Surface | Cold before | Cold after | Warm before | Warm after | Notes |
|---|---|---|---|---|---|---|
| clock | web | — | — | — | — | |
| clock | backend | — | — | — | — | |
| peakpulse | backend | — | — | — | — | |
| clock | backend | ≈23 min (failed via tarball path) | **59.2 s** (A0-V #2) | — | — | Pre-A1/A2; no corepack, no cache mount yet. Speed is just from working Gitea path |
| clock | web | ≈23 min (failed via tarball path) | **3:13 (193 s)** (A0-V #3) | — | — | Same caveat |
| peakpulse | backend | — | — | — | — | Pending step 7 |
Warm-rebuild numbers will be measured after A1 (corepack) + A2 (cache mount)
land — those are the actual speed phases. Current numbers establish the
baseline that A1+A2 must beat.
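The table reports cold builds both in seconds and as minutes:seconds. A small conversion helper keeps the two forms consistent when new rows are added; `fmt_duration` is a hypothetical name, and the helper assumes whole seconds as input.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert whole seconds to m:ss for the metrics table.
fmt_duration() {
  local total="$1"
  printf '%d:%02d\n' "$(( total / 60 ))" "$(( total % 60 ))"
}

# Usage:
#   fmt_duration 193   # the web cold build from the table, prints 3:13
```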
Use:
```
@@ -732,26 +760,30 @@ Checks implemented by `docker-doctor.sh`:
## 10. Execution order
1. **Now (v5 commit):** roadmap doc v5 lands here; F16 documented.
2. **✅ Phase A0 on `learning_ai_clock`** (web + backend) — Dockerfile +
compose changes landed in `learning_ai_clock@0be887288`. A0-V verification
**BLOCKED on F16**; retry after A-pre.
3. **⚠️ Phase A-pre on `learning_ai_common_plat`** — republish all
`@bytelyst/*` packages with `workspace:*` rewritten; add publish-time
guard. **Unblocks every A0-V downstream.**
4. **Retry A0-V on clock** (no code change needed; rebuild only). Once green,
commit a doc-only "A0-V passed" update to this roadmap.
5. **A8 + A9 + A1** on clock (correctness before speed). Commit.
6. **A2 + A4 + A5 + A6** on clock. Measure. Commit.
7. **Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
second pass for the simpler case.
8. **A7** — fill in metrics table.
9. **A3 ADR** — decide lockfile policy (defer implementation).
10. **Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
1. **✅ v5 commit:** roadmap doc v5 lands; F16 documented (`devops_tools@ba8b4d1`).
2. **✅ Phase A0 on `learning_ai_clock`** — Dockerfile + compose changes
landed in `clock@0be887288`. Initial A0-V blocked on F16/F17/F18.
3. **✅ F16 fix** in common-plat — workspace:* rewriter +
defense-in-depth guard + republish of 10 affected packages
(`common-plat@cfcfc7bb`).
4. **✅ F17 fix** in common-plat + Gitea config — `ROOT_URL=host.docker.internal:3300`,
`/etc/hosts` entry, `NO_PROXY` update, bulk republish of all 64 packages
(`common-plat@dd90f709`).
5. **✅ F18 fix** in clock — 4 `file:` refs in `web/package.json` rewritten
to `*` (`clock@8b5c767a3`).
6. **✅ A0-V on clock PASSED.** v6 commit lands here documenting it.
7. **⏳ A8 + A9 + A1** on clock (correctness before speed). Commit.
8. **⏳ A2 + A4 + A5 + A6** on clock. Measure warm-rebuild numbers. Commit.
9. **⏳ Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
second pass.
10. **⏳ A7** — fill in warm-rebuild numbers in metrics table.
11. **⏳ A3 ADR** — decide lockfile policy (defer implementation).
12. **⏳ Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
home in common-plat (B7) and sync to peakpulse.
11. **Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
12. **Phase C** — verification gates C1–C9.
13. **Phase D** — scheduled separately, only after § 5 passes.
13. **⏳ Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.
14. **⏳ Phase C** — verification gates C1–C9.
15. **⏸ Phase D** — scheduled separately, only after §5 C-gates pass. **STOP
and request approval before starting.**
---
@@ -772,3 +804,6 @@ Checks implemented by `docker-doctor.sh`:
| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
| **F16: publish-side `workspace:*` leak silently breaks Docker registry path; only surfaces 60+ s into `pnpm install`** | A-pre republish + publish-time guard in `common-plat`; recurring scan via Phase E `docker-doctor.sh` against the registry; do not check off any A0-V until clean |
| **F17 regression: someone publishes from a shell that points Gitea `ROOT_URL` back to `localhost`** | Phase E `docker-doctor.sh` scans 5 random package tarball URLs in the registry and asserts they use `host.docker.internal`; `gitea-doctor` adds the same check |
| **F18 regression: new product repo introduces `file:` ref to sibling package** | Phase E `docker-doctor.sh` greps `**/package.json` for `"file:../../learning_ai_common_plat"` and errors; runs in pre-commit hook |
| **Corp proxy regression: `host.docker.internal` falls out of NO_PROXY on a dev machine** | `switch-network.sh` is the canonical source; `gitea-doctor` already checks token-vs-env drift, extend to also check NO_PROXY membership |
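The F18 and NO_PROXY mitigations in the rows above can be sketched as two small checks. Neither function exists in `docker-doctor.sh` or `gitea-doctor` as far as this document shows; `has_sibling_file_refs` and `in_no_proxy` are hypothetical names, and the NO_PROXY check assumes a comma-separated list with no surrounding whitespace.

```bash
#!/usr/bin/env bash
# Hypothetical F18 check: returns 0 when a manifest declares a file: ref
# to the sibling common-plat checkout.
has_sibling_file_refs() {
  grep -Fq '"file:../../learning_ai_common_plat' "$1"
}

# Hypothetical NO_PROXY check: returns 0 when host $1 is an exact entry in
# the comma-separated list $2.
in_no_proxy() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (assumed variable name):
#   in_no_proxy host.docker.internal "${NO_PROXY:-}" \
#     || echo "host.docker.internal missing from NO_PROXY" >&2
```

Wrapping the list in commas before matching avoids false positives from substrings, for example an entry such as `xhost.docker.internal`.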