From 8025cd5d82a501fd30fe0fdc08df8c5b8a82927a Mon Sep 17 00:00:00 2001 From: saravanakumardb1 Date: Wed, 27 May 2026 00:53:33 -0700 Subject: [PATCH] =?UTF-8?q?docs(docker):=20roadmap=20v4=20=E2=80=94=20inte?= =?UTF-8?q?grate=20Gitea=20hardening=20(F14,=20F15)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds 2 new findings to the docker build optimization roadmap and updates templates to consume the new GITEA_NPM_OWNER env var shipped in common-plat commit 610a59fd. - F14: hardcoded Gitea owner literal across 14 repos (now resolved upstream via ${GITEA_NPM_OWNER:-learning_ai_user}) - F15: stale shell-env tokens (caught by scripts/gitea/doctor.sh) - A0-1, A0-3, 7.1, 7.2, 7.5: snippets updated to thread GITEA_NPM_OWNER through .npmrc.docker, Dockerfile ARG/ENV, and docker-compose build args - A0-D: new step: run scripts/gitea/doctor.sh as pre-build gate (replaces 'wait 4 minutes for ERR_PNPM_AUTHENTICATION' with 'fail fast in 2 sec') - Phase E: now distinguishes gitea-doctor (shipped) from docker-doctor (to build). 
Adds two new docker-doctor checks for F14 - Risk register: F14/F15 mitigations called out explicitly --- docs/docker-build-optimization-roadmap.md | 87 ++++++++++++++++++----- 1 file changed, 68 insertions(+), 19 deletions(-) diff --git a/docs/docker-build-optimization-roadmap.md b/docs/docker-build-optimization-roadmap.md index 3f586a5..4892b59 100644 --- a/docs/docker-build-optimization-roadmap.md +++ b/docs/docker-build-optimization-roadmap.md @@ -1,17 +1,24 @@ # Docker Build Optimization Roadmap -> **Status:** Draft v3 (post-review) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 +> **Status:** Draft v4 (Gitea hardening integrated) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 > > Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend) > and `learning_ai_peakpulse` (backend), then capture the playbook here for > ecosystem-wide rollout. +> +> **Upstream prerequisite shipped (commit `610a59fd` in `learning_ai_common_plat`):** +> Gitea owner parameterization + helper scripts (`scripts/gitea/doctor.sh`, +> `scripts/gitea/token.sh`). The `.npmrc` template now resolves owner from +> `${GITEA_NPM_OWNER:-learning_ai_user}`. **All A0-1 work in this roadmap +> inherits this — Dockerfile/.npmrc.docker must use the same `${GITEA_NPM_OWNER}` +> placeholder, not a hardcoded literal.** --- ## 0. Pre-flight audit findings (2026-05-27) A read-only audit of pilot repos + lessons from recent live incidents surfaced -**13 concrete bugs/gaps**. The actual state of the ecosystem is closer to the +**15 concrete bugs/gaps** (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the inverse of the casual narrative: tarballs are the de facto default, the Gitea-registry path is partially wired, and there is a separate class of "build green, app broken" silent failures (F11–F13) that the speed-focused @@ -32,6 +39,8 @@ plan needs to address first. 
| **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** | | **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** | | **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** | +| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. 
| `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** | +| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) | **Implications:** @@ -111,29 +120,36 @@ work), and A8+A9 (correctness) must land before measuring speed wins. ### A0. Make the Gitea-registry path actually work (clock + peakpulse) -- [ ] **A0-1.** Standardize `.npmrc.docker` to use a templated host so it works on host (`localhost`) and inside Docker (`host.docker.internal`): +- [ ] **A0-1.** Standardize `.npmrc.docker` to use templated host AND owner so it works on host (`localhost`) and inside Docker (`host.docker.internal`), and so future owner renames are a one-line env change: ``` - @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/ - //${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN} + @bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/ + //${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN} strict-ssl=false auto-install-peers=true ``` > **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read > time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]). 
- > That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` + > That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` + `ARG GITEA_NPM_OWNER` + > → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` / `ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER` > **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be > exported from the BuildKit secret mount inside the same `RUN` (since secrets > don't persist as env across layers). + > + > **Note on F14:** The canonical `.npmrc` (host-side) template already uses + > `${GITEA_NPM_OWNER}` (shipped in common-plat commit `610a59fd`). + > `.npmrc.docker` lagged behind because Docker builds have a separate file — + > A0-1 brings them into parity. [pnpm-npmrc]: https://pnpm.io/npmrc - [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3) -- [ ] **A0-3.** Add `GITEA_NPM_HOST` build arg + `secrets:` block to every service in `docker-compose.yml`: +- [ ] **A0-3.** Add `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` build args + `secrets:` block to every service in `docker-compose.yml`: ```yaml build: context: . dockerfile: backend/Dockerfile args: - GITEA_NPM_HOST: host.docker.internal + GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal} + GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user} secrets: - gitea_npm_token secrets: @@ -141,7 +157,14 @@ work), and A8+A9 (correctness) must land before measuring speed wins. environment: GITEA_NPM_TOKEN ``` - [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` to each service so Linux Docker can resolve the host -- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart) +- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart). 
Reference `bash ../learning_ai_common_plat/scripts/gitea/token.sh status` for verification. +- [ ] **A0-D.** **Run `gitea-doctor` before any Docker build** (addresses F15). Inline into deploy/CI workflows: + ```bash + bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1 + docker compose build + ``` + - Locally: shell alias or `Makefile` target `make build` that runs doctor then `docker compose build`. + - In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`. - [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit. ### A1. Replace `npm install -g pnpm@X` with corepack @@ -336,9 +359,11 @@ Apply Phase A + B + E to remaining repos. **Pilots excluded.** ### 7.1 Canonical `.npmrc.docker` +Matches the host-side `.npmrc` template shipped in `common-plat` `610a59fd`. 
+ ``` -@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/ -//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN} +@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/ +//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN} strict-ssl=false auto-install-peers=true ``` @@ -352,10 +377,12 @@ FROM ${BASE_IMAGE} AS builder WORKDIR /app/backend ARG GITEA_NPM_HOST=host.docker.internal +ARG GITEA_NPM_OWNER=learning_ai_user ARG USE_TARBALLS=false ENV NODE_TLS_REJECT_UNAUTHORIZED=0 ENV NPM_CONFIG_STRICT_SSL=false ENV GITEA_NPM_HOST=$GITEA_NPM_HOST +ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER RUN corepack enable && corepack prepare pnpm@10.6.5 --activate @@ -469,9 +496,11 @@ FROM ${BASE_IMAGE} AS deps WORKDIR /app/web ARG GITEA_NPM_HOST=host.docker.internal +ARG GITEA_NPM_OWNER=learning_ai_user ENV NODE_TLS_REJECT_UNAUTHORIZED=0 ENV NPM_CONFIG_STRICT_SSL=false ENV GITEA_NPM_HOST=$GITEA_NPM_HOST +ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER RUN corepack enable && corepack prepare pnpm@10.6.5 --activate @@ -588,26 +617,44 @@ exit $FAILED ## 8. Phase E — Observability / lint (NEW) -New phase: `docker-doctor.sh` (see § 7.6 skeleton) as durable insurance -against tonight's-class silent bugs (F11, F12, F13). +Two complementary linters: -- [ ] **E1.** Land `docker-doctor.sh` in `learning_ai_common_plat/scripts/` (canonical) +1. **`gitea-doctor`** — Gitea registry pre-flight (env + token + connectivity). + **Already shipped** in `common-plat` commit `610a59fd` at + `scripts/gitea/doctor.sh`. This roadmap only wires it into CI/build flows + (A0-D + E0 below). +2. **`docker-doctor`** — Dockerfile + compose-file static linter (see § 7.6 + skeleton). To be built as part of this roadmap. 
+ +The two are intentionally separate concerns: + +| Linter | Scope | When to run | +|---|---|---| +| `gitea-doctor` | runtime env, token, registry HTTP 200 | Before every build / deploy | +| `docker-doctor` | static analysis of Dockerfile + compose YAML | On every PR touching those files | + +### Phase E checklist + +- [ ] **E0.** Wire `bash scripts/gitea/doctor.sh --quiet` into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in `common-plat`; replicate via a reusable `actions/gitea-preflight@main` composite if Gitea Actions supports it, otherwise inline. +- [ ] **E1.** Land `docker-doctor.sh` in `learning_ai_common_plat/scripts/` (canonical, mirrors `gitea/doctor.sh` shipped earlier) - [ ] **E2.** Provide a thin per-repo wrapper at `scripts/docker-doctor.sh` that calls the canonical - [ ] **E3.** Wire into CI: run on PRs touching `Dockerfile`, `docker-compose*.yml`, `.dockerignore`, `.npmrc.docker` - [ ] **E4.** Wire into pre-commit hook (warning-only at first, error after 2 weeks) -- [ ] **E5.** Document checks in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` -- [ ] **E6.** Add `make docker-doctor` target to each pilot repo +- [ ] **E5.** Document checks in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` (sibling doc to the existing `gitea-doctor` patterns) +- [ ] **E6.** Add `make doctor` target to each pilot repo that runs both `gitea-doctor` AND `docker-doctor` -Checks implemented: +Checks implemented by `docker-doctor.sh`: | Check | Addresses | Action | |---|---|---| | Every `web/*.config.*` file is COPY'd | F11, F13 | Error | | `docker-compose.yml` healthcheck uses `127.0.0.1` | F12 | Error | -| `.npmrc.docker` uses `${GITEA_NPM_HOST}` placeholder | F4 | Error | +| `.npmrc.docker` uses `${GITEA_NPM_HOST}` AND `${GITEA_NPM_OWNER}` placeholders | F4, F14 | Error | +| Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker` | F14 | Error | | `.dockerignore` doesn't exclude `pnpm-lock.yaml` | 
F1 | Warn (until A3 ADR lands) | -| Base image is on approved list | Canonical decision | Error | +| Base image is on approved list (`node:22-alpine` or `node:22-slim` via `BASE_IMAGE` ARG) | Canonical decision | Error | | `.docker-deps/` and `*.bak` in `.gitignore` | B3 | Error | +| `docker-compose.yml` passes `GITEA_NPM_OWNER` build arg | F14 | Warn | --- @@ -664,3 +711,5 @@ Checks implemented: | **F12 regression: healthcheck false-fails on IPv6** | Phase E `docker-doctor.sh` grep for `localhost` in compose files | | **F13 regression: new config file added, Dockerfile forgotten** | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) | | `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` | +| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration | +| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
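
The two F14 checks that Phase E assigns to `docker-doctor.sh` (placeholders in `.npmrc.docker`, and `ARG GITEA_NPM_OWNER` in any Dockerfile that COPYs it) could be sketched roughly as below. This is a minimal illustration, not the § 7.6 skeleton: the function names `check_npmrc_placeholders` and `check_dockerfile_owner_arg` are hypothetical, and the sketch assumes plain `grep` matching is sufficient for the file layouts described in this roadmap.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the two F14 docker-doctor checks.
# Function names are illustrative; they are not taken from the shipped scripts.

# Returns 0 when the given .npmrc.docker references both placeholders.
check_npmrc_placeholders() {
  local npmrc="$1"
  # Host placeholder (F4): must appear literally as ${GITEA_NPM_HOST}.
  grep -qF '${GITEA_NPM_HOST}' "$npmrc" || return 1
  # Owner placeholder (F14): accept ${GITEA_NPM_OWNER} and ${GITEA_NPM_OWNER:-default}.
  grep -qE '\$\{GITEA_NPM_OWNER(:-[^}]*)?\}' "$npmrc" || return 1
}

# Returns 0 when a Dockerfile that COPYs .npmrc.docker also declares ARG GITEA_NPM_OWNER.
check_dockerfile_owner_arg() {
  local dockerfile="$1"
  # Dockerfiles that never copy .npmrc.docker are out of scope for this check.
  grep -qE '^[[:space:]]*COPY[[:space:]].*\.npmrc\.docker' "$dockerfile" || return 0
  grep -qE '^[[:space:]]*ARG[[:space:]]+GITEA_NPM_OWNER([=[:space:]]|$)' "$dockerfile"
}
```

A wrapper might call these per repo, for example `check_npmrc_placeholders .npmrc.docker || echo "F14: owner or host placeholder missing"`, and map a failure to the Error action listed in the Phase E checks table.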