docs(docker): roadmap v4 — integrate Gitea hardening (F14, F15)

Adds two new findings to the Docker build optimization roadmap and updates
the templates to consume the new GITEA_NPM_OWNER env var shipped in
common-plat commit 610a59fd.

- F14: hardcoded Gitea owner literal across 14 repos (now resolved upstream
  via ${GITEA_NPM_OWNER:-learning_ai_user})
- F15: stale shell-env tokens (caught by scripts/gitea/doctor.sh)
- A0-1, A0-3, 7.1, 7.2, 7.5: snippets updated to thread GITEA_NPM_OWNER
  through .npmrc.docker, Dockerfile ARG/ENV, and docker-compose build args
- A0-D: new step that runs scripts/gitea/doctor.sh as a pre-build gate
  (replaces 'wait 4 minutes for ERR_PNPM_AUTHENTICATION' with 'fail fast
  in 2 sec')
- Phase E: now distinguishes gitea-doctor (shipped) from docker-doctor (to
  build) and adds two new docker-doctor checks for F14
- Risk register: F14/F15 mitigations called out explicitly

parent 1a638a84e1
commit 8025cd5d82

@@ -1,17 +1,24 @@
 # Docker Build Optimization Roadmap

-> **Status:** Draft v3 (post-review) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
+> **Status:** Draft v4 (Gitea hardening integrated) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
 >
 > Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
 > and `learning_ai_peakpulse` (backend), then capture the playbook here for
 > ecosystem-wide rollout.
+>
+> **Upstream prerequisite shipped (commit `610a59fd` in `learning_ai_common_plat`):**
+> Gitea owner parameterization + helper scripts (`scripts/gitea/doctor.sh`,
+> `scripts/gitea/token.sh`). The `.npmrc` template now resolves owner from
+> `${GITEA_NPM_OWNER:-learning_ai_user}`. **All A0-1 work in this roadmap
+> inherits this — Dockerfile/.npmrc.docker must use the same `${GITEA_NPM_OWNER}`
+> placeholder, not a hardcoded literal.**

 ---

 ## 0. Pre-flight audit findings (2026-05-27)

 A read-only audit of pilot repos + lessons from recent live incidents surfaced
-**13 concrete bugs/gaps**. The actual state of the ecosystem is closer to the
+**15 concrete bugs/gaps** (F14–F15 added after the Gitea-hardening commit). The actual state of the ecosystem is closer to the
 inverse of the casual narrative: tarballs are the de facto default, the
 Gitea-registry path is partially wired, and there is a separate class of
 "build green, app broken" silent failures (F11–F13) that the speed-focused
@@ -32,6 +39,8 @@ plan needs to address first.
 | **F11** | **Build-time config file missing from repo or not COPY'd in Dockerfile causes silent UI breakage. Symptom: `next build` succeeds, container is "healthy", but CSS bundle is ~33 KB (only `@font-face`) and all Tailwind classes are absent → UI renders unstyled.** Two sub-bugs: (a) `postcss.config.mjs` missing entirely while `@tailwindcss/postcss` is in `package.json` (NoteLett, JarvisJr fixes `dff459e`, `36f6bc1`); (b) file exists but Dockerfile never COPYs it (Clock, LocalMemGPT fixes `a308c6444`, `07cdf6b`). | `*/web/Dockerfile`, `*/web/postcss.config.*` | **High** |
 | **F12** | **Healthcheck uses `localhost`, resolves to IPv6 `::1`, false-fails.** Backend listens on `0.0.0.0` (IPv4 only). `wget --spider http://localhost:.../health` hits `::1`, connection refused, container marked "unhealthy", `web` service won't start due to `depends_on: condition: service_healthy`. Incident: `learning_ai_jarvis_jr/docker-compose.yml`. | every `docker-compose*.yml` healthcheck | **Medium** |
 | **F13** | **Enumerated `COPY web/foo ./foo` pattern drifts from filesystem.** New config file added to repo but Dockerfile's enumerated COPY list isn't updated. Build succeeds silently with the file absent; behavior diverges from local dev. Root cause of F11(b). | every Dockerfile using enumerated COPY | **Medium** |
+| **F14** | **Hardcoded Gitea owner (`learning_ai_user`) literally embedded in `.npmrc.docker` + CI workflows + publish scripts across 14 repos.** When the org was renamed from `bytelyst` → `learning_ai_user`, every repo needed a manual commit. **Resolved upstream in `common-plat` (`610a59fd`):** owner now resolves from `${GITEA_NPM_OWNER:-learning_ai_user}`; `scripts/gitea/{doctor,token}.sh` ship as pre-flight/rotation helpers. Docker work in this roadmap MUST consume the env var, not the literal. | `.npmrc.docker`, Dockerfile `ARG`/`ENV`, CI workflows | **Medium** |
+| **F15** | **Stale shell-env tokens.** `~/.gitea_npm_token` rotated on disk; long-lived shells still exported the old value. Caused 401s during `docker compose build` until `source ~/.zshrc`. **Mitigation shipped:** `bash scripts/gitea/doctor.sh` detects env-vs-file drift and refuses to proceed. **Action required in this roadmap:** wire doctor as a pre-build CI gate. | dev workstation + CI runners | Low (now caught) |
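
The env-vs-file drift described in F15 can be illustrated with a short shell check. This is a minimal sketch, not the actual `scripts/gitea/doctor.sh` (its internals are not shown in this commit). The function name `check_token_drift`, its optional file argument, and the exit codes are hypothetical, and it assumes the token file holds only the token on a single line.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an env-vs-file drift check (F15).
# Not the real scripts/gitea/doctor.sh; names, messages and exit codes are illustrative.

check_token_drift() {
  # Token file path; ~/.gitea_npm_token is the location named in F15.
  local token_file="${1:-$HOME/.gitea_npm_token}"

  if [ -z "${GITEA_NPM_TOKEN:-}" ]; then
    echo "GITEA_NPM_TOKEN is not exported in this shell" >&2
    return 2
  fi
  if [ ! -r "$token_file" ]; then
    echo "token file not readable: $token_file" >&2
    return 3
  fi

  # Assumption: the file contains only the token, possibly with a trailing newline.
  local on_disk
  on_disk="$(tr -d '\r\n' < "$token_file")"

  if [ "$GITEA_NPM_TOKEN" != "$on_disk" ]; then
    echo "drift: exported GITEA_NPM_TOKEN differs from $token_file (re-source your shell profile)" >&2
    return 1
  fi
  return 0
}
```

A caller could gate a build on it with `check_token_drift && docker compose build`, which mirrors the "refuse to proceed" behaviour the table attributes to the shipped doctor script.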

 **Implications:**

@@ -111,29 +120,36 @@ work), and A8+A9 (correctness) must land before measuring speed wins.

 ### A0. Make the Gitea-registry path actually work (clock + peakpulse)

-- [ ] **A0-1.** Standardize `.npmrc.docker` to use a templated host so it works on host (`localhost`) and inside Docker (`host.docker.internal`):
+- [ ] **A0-1.** Standardize `.npmrc.docker` to use templated host AND owner so it works on host (`localhost`) and inside Docker (`host.docker.internal`), and so future owner renames are a one-line env change:
 ```
-@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
-//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
+@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
+//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
 strict-ssl=false
 auto-install-peers=true
 ```
 > **⚠️ Env-var expansion chain:** pnpm expands `${VAR}` in `.npmrc` at read
 > time using the current process environment (see [pnpm npmrc docs][pnpm-npmrc]).
-> That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST`
+> That means the Dockerfile MUST do `ARG GITEA_NPM_HOST` + `ARG GITEA_NPM_OWNER`
+> → `ENV GITEA_NPM_HOST=$GITEA_NPM_HOST` / `ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER`
 > **before** the `pnpm install` RUN line, AND the `GITEA_NPM_TOKEN` must be
 > exported from the BuildKit secret mount inside the same `RUN` (since secrets
 > don't persist as env across layers).
+>
+> **Note on F14:** The canonical `.npmrc` (host-side) template already uses
+> `${GITEA_NPM_OWNER}` (shipped in common-plat commit `610a59fd`).
+> `.npmrc.docker` lagged behind because Docker builds have a separate file —
+> A0-1 brings them into parity.

 [pnpm-npmrc]: https://pnpm.io/npmrc
 - [ ] **A0-2.** Remove `pnpm-lock.yaml` from `.dockerignore` in both repos (fixes F1; harmless under `--lockfile=false` since we don't COPY it, but unblocks future A3)
-- [ ] **A0-3.** Add `GITEA_NPM_HOST` build arg + `secrets:` block to every service in `docker-compose.yml`:
+- [ ] **A0-3.** Add `GITEA_NPM_HOST` + `GITEA_NPM_OWNER` build args + `secrets:` block to every service in `docker-compose.yml`:
 ```yaml
 build:
   context: .
   dockerfile: backend/Dockerfile
   args:
-    GITEA_NPM_HOST: host.docker.internal
+    GITEA_NPM_HOST: ${GITEA_NPM_HOST:-host.docker.internal}
+    GITEA_NPM_OWNER: ${GITEA_NPM_OWNER:-learning_ai_user}
   secrets:
     - gitea_npm_token
 secrets:
@@ -141,7 +157,14 @@ work), and A8+A9 (correctness) must land before measuring speed wins.
     environment: GITEA_NPM_TOKEN
 ```
 - [ ] **A0-4.** Add `extra_hosts: ["host.docker.internal:host-gateway"]` to each service so Linux Docker can resolve the host
-- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart)
+- [ ] **A0-5.** Document required env: `GITEA_NPM_TOKEN` must be exported in the shell that runs `docker compose build` (add to repo `README.md` quickstart). Reference `bash ../learning_ai_common_plat/scripts/gitea/token.sh status` for verification.
+- [ ] **A0-D.** **Run `gitea-doctor` before any Docker build** (addresses F15). Inline into deploy/CI workflows:
+```bash
+bash ../learning_ai_common_plat/scripts/gitea/doctor.sh --quiet || exit 1
+docker compose build
+```
+  - Locally: shell alias or `Makefile` target `make build` that runs doctor then `docker compose build`.
+  - In Gitea Actions CI: a pre-job step. If `doctor` exits non-zero, the build is skipped with a clear error rather than failing 4 minutes in with `ERR_PNPM_AUTHENTICATION`.
 - [ ] **A0-V.** **Verification gate (between A0 and A1):** build the registry path **without** any cache-mount or layer optimizations. Confirm `docker compose build --no-cache` succeeds end-to-end pulling from Gitea. Only proceed to A1 once this is green. Don't conflate "make it work" with "make it fast" in one commit.
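
The expansion chain noted under A0-1 implies that the `RUN` step reads the token from the secret mount and exports it in the same shell that runs `pnpm install`. The following is a minimal sketch of that step's shell body. It assumes the secret id `gitea_npm_token` from A0-3 is mounted at BuildKit's default path `/run/secrets/gitea_npm_token`; `load_gitea_token` is a hypothetical helper name, not something shipped in `common-plat`.

```bash
# Hypothetical helper: export GITEA_NPM_TOKEN from a mounted secret file.
# Assumption: the secret file contains only the token (trailing newline tolerated).
load_gitea_token() {
  local secret_file="${1:-/run/secrets/gitea_npm_token}"
  if [ ! -r "$secret_file" ]; then
    echo "secret not mounted at $secret_file" >&2
    return 1
  fi
  GITEA_NPM_TOKEN="$(tr -d '\r\n' < "$secret_file")"
  export GITEA_NPM_TOKEN
}

# Intended use inside a single RUN step, so the export is visible to pnpm
# when it expands ${GITEA_NPM_TOKEN} in .npmrc:
#   load_gitea_token && pnpm install
```

Because the variable only exists in that one shell, it is gone once the step finishes, which is why the export and the install have to share a `RUN`.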

 ### A1. Replace `npm install -g pnpm@X` with corepack
@@ -336,9 +359,11 @@ Apply Phase A + B + E to remaining repos. **Pilots excluded.**

 ### 7.1 Canonical `.npmrc.docker`

+Matches the host-side `.npmrc` template shipped in `common-plat` `610a59fd`.
+
 ```
-@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/
-//${GITEA_NPM_HOST}:3300/api/packages/learning_ai_user/npm/:_authToken=${GITEA_NPM_TOKEN}
+@bytelyst:registry=http://${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/
+//${GITEA_NPM_HOST}:3300/api/packages/${GITEA_NPM_OWNER:-learning_ai_user}/npm/:_authToken=${GITEA_NPM_TOKEN}
 strict-ssl=false
 auto-install-peers=true
 ```
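To preview what the templated registry line in § 7.1 resolves to, the same `${VAR:-default}` form can be evaluated in a shell. This is a small sketch that assumes pnpm resolves the placeholder the way the shell does here. `registry_url` is a hypothetical helper, port 3300 is taken from the template, and the host default mirrors the Dockerfile `ARG` default rather than anything in `.npmrc.docker` itself.

```bash
# Hypothetical helper: print the registry URL the template would resolve to.
registry_url() {
  # Host default mirrors the Dockerfile ARG default (assumption for preview only).
  local host="${GITEA_NPM_HOST:-host.docker.internal}"
  # Owner default is the fallback written into the template.
  local owner="${GITEA_NPM_OWNER:-learning_ai_user}"
  printf 'http://%s:3300/api/packages/%s/npm/\n' "$host" "$owner"
}

# With neither variable set, the defaults apply:
#   registry_url
# After exporting a different owner, only the owner segment changes:
#   export GITEA_NPM_OWNER=some_other_owner; registry_url
```

This is only a preview aid; the build itself still depends on the Dockerfile `ARG`/`ENV` pair being in place before `pnpm install`.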
@@ -352,10 +377,12 @@ FROM ${BASE_IMAGE} AS builder
 WORKDIR /app/backend

 ARG GITEA_NPM_HOST=host.docker.internal
+ARG GITEA_NPM_OWNER=learning_ai_user
 ARG USE_TARBALLS=false
 ENV NODE_TLS_REJECT_UNAUTHORIZED=0
 ENV NPM_CONFIG_STRICT_SSL=false
 ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
+ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

 RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

@@ -469,9 +496,11 @@ FROM ${BASE_IMAGE} AS deps
 WORKDIR /app/web

 ARG GITEA_NPM_HOST=host.docker.internal
+ARG GITEA_NPM_OWNER=learning_ai_user
 ENV NODE_TLS_REJECT_UNAUTHORIZED=0
 ENV NPM_CONFIG_STRICT_SSL=false
 ENV GITEA_NPM_HOST=$GITEA_NPM_HOST
+ENV GITEA_NPM_OWNER=$GITEA_NPM_OWNER

 RUN corepack enable && corepack prepare pnpm@10.6.5 --activate

@@ -588,26 +617,44 @@ exit $FAILED

 ## 8. Phase E — Observability / lint (NEW)

-New phase: `docker-doctor.sh` (see § 7.6 skeleton) as durable insurance
-against tonight's-class silent bugs (F11, F12, F13).
+Two complementary linters:

-- [ ] **E1.** Land `docker-doctor.sh` in `learning_ai_common_plat/scripts/` (canonical)
+1. **`gitea-doctor`** — Gitea registry pre-flight (env + token + connectivity).
+**Already shipped** in `common-plat` commit `610a59fd` at
+`scripts/gitea/doctor.sh`. This roadmap only wires it into CI/build flows
+(A0-D + E0 below).
+2. **`docker-doctor`** — Dockerfile + compose-file static linter (see § 7.6
+skeleton). To be built as part of this roadmap.
+
+The two are intentionally separate concerns:
+
+| Linter | Scope | When to run |
+|---|---|---|
+| `gitea-doctor` | runtime env, token, registry HTTP 200 | Before every build / deploy |
+| `docker-doctor` | static analysis of Dockerfile + compose YAML | On every PR touching those files |
+
+### Phase E checklist
+
+- [ ] **E0.** Wire `bash scripts/gitea/doctor.sh --quiet` into every Gitea Actions CI workflow as a pre-build job (addresses F15). Pattern shipped in `common-plat`; replicate via a reusable `actions/gitea-preflight@main` composite if Gitea Actions supports it, otherwise inline.
+- [ ] **E1.** Land `docker-doctor.sh` in `learning_ai_common_plat/scripts/` (canonical, mirrors `gitea/doctor.sh` shipped earlier)
 - [ ] **E2.** Provide a thin per-repo wrapper at `scripts/docker-doctor.sh` that calls the canonical
 - [ ] **E3.** Wire into CI: run on PRs touching `Dockerfile`, `docker-compose*.yml`, `.dockerignore`, `.npmrc.docker`
 - [ ] **E4.** Wire into pre-commit hook (warning-only at first, error after 2 weeks)
-- [ ] **E5.** Document checks in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md`
-- [ ] **E6.** Add `make docker-doctor` target to each pilot repo
+- [ ] **E5.** Document checks in `learning_ai_common_plat/AI.dev/SKILLS/docker-doctor.md` (sibling doc to the existing `gitea-doctor` patterns)
+- [ ] **E6.** Add `make doctor` target to each pilot repo that runs both `gitea-doctor` AND `docker-doctor`

-Checks implemented:
+Checks implemented by `docker-doctor.sh`:

 | Check | Addresses | Action |
 |---|---|---|
 | Every `web/*.config.*` file is COPY'd | F11, F13 | Error |
 | `docker-compose.yml` healthcheck uses `127.0.0.1` | F12 | Error |
-| `.npmrc.docker` uses `${GITEA_NPM_HOST}` placeholder | F4 | Error |
+| `.npmrc.docker` uses `${GITEA_NPM_HOST}` AND `${GITEA_NPM_OWNER}` placeholders | F4, F14 | Error |
+| Dockerfile declares `ARG GITEA_NPM_OWNER` if it COPYs `.npmrc.docker` | F14 | Error |
 | `.dockerignore` doesn't exclude `pnpm-lock.yaml` | F1 | Warn (until A3 ADR lands) |
-| Base image is on approved list | Canonical decision | Error |
+| Base image is on approved list (`node:22-alpine` or `node:22-slim` via `BASE_IMAGE` ARG) | Canonical decision | Error |
 | `.docker-deps/` and `*.bak` in `.gitignore` | B3 | Error |
+| `docker-compose.yml` passes `GITEA_NPM_OWNER` build arg | F14 | Warn |
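
The two error-level F14 rows in the checks table can be approximated with plain `grep`. The sketch below is hypothetical and is not the § 7.6 skeleton: the function name `check_f14`, its two path arguments, and the regular expressions are assumptions chosen for illustration.

```bash
# Hypothetical sketch of the two error-level F14 checks; not the § 7.6 skeleton.
# Usage: check_f14 <path/to/.npmrc.docker> <path/to/Dockerfile>
check_f14() {
  local npmrc="$1" dockerfile="$2" failed=0

  # .npmrc.docker must reference the owner placeholder (with or without a default).
  if ! grep -qF '${GITEA_NPM_OWNER' "$npmrc"; then
    echo "ERROR: $npmrc does not use \${GITEA_NPM_OWNER}" >&2
    failed=1
  fi

  # If the Dockerfile COPYs .npmrc.docker, it must also declare ARG GITEA_NPM_OWNER.
  if grep -qE '^[[:space:]]*COPY[[:space:]].*\.npmrc\.docker' "$dockerfile" &&
     ! grep -qE '^[[:space:]]*ARG[[:space:]]+GITEA_NPM_OWNER([[:space:]=]|$)' "$dockerfile"; then
    echo "ERROR: $dockerfile copies .npmrc.docker but lacks ARG GITEA_NPM_OWNER" >&2
    failed=1
  fi

  return "$failed"
}
```

A line-oriented grep will miss multi-line `COPY` instructions, so a real linter may need a more careful parse; the sketch only shows the shape of the check.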

 ---

@@ -664,3 +711,5 @@ Checks implemented:
 | **F12 regression: healthcheck false-fails on IPv6** | Phase E `docker-doctor.sh` grep for `localhost` in compose files |
 | **F13 regression: new config file added, Dockerfile forgotten** | A8-2 glob COPY pattern (root cause fix) + Phase E lint (defense in depth) |
 | `BASE_IMAGE` override in `notes` diverges silently from canonical | Phase E check approved list; document override in repo `AGENTS.md` |
+| **F14 regression: future Gitea owner rename re-introduces literal in some Dockerfile** | Phase E `docker-doctor.sh` checks `.npmrc.docker` for `${GITEA_NPM_OWNER}` placeholder + Dockerfile for `ARG GITEA_NPM_OWNER` declaration |
+| **F15: stale token in dev shell hits build mid-way through, wastes ~4 min** | A0-D + E0 wire `gitea-doctor` as pre-build gate; refuses to start build if env/file drift detected |
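
The F12 mitigation in the risk register ("grep for `localhost` in compose files") could look roughly like the following. This is a hedged sketch rather than the planned `docker-doctor.sh` implementation: `check_healthcheck_host` is a hypothetical name, and the heuristic flags any `http(s)://localhost` URL in the file instead of parsing the `healthcheck:` block specifically.

```bash
# Hypothetical sketch of the F12 check: flag localhost URLs in compose files.
# Usage: check_healthcheck_host docker-compose.yml [more compose files...]
check_healthcheck_host() {
  local failed=0 file
  for file in "$@"; do
    # Heuristic (assumption): any http(s)://localhost URL is treated as a
    # healthcheck target; matching lines are echoed to stderr with line numbers.
    if grep -nE 'https?://localhost([:/]|$)' "$file" >&2; then
      echo "ERROR: $file uses localhost; prefer 127.0.0.1 (F12)" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Matching on the URL rather than the bare word keeps unrelated mentions of `localhost` (comments, hostnames in other settings) from tripping the check, at the cost of missing healthchecks that do not use an HTTP URL.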