diff --git a/docs/adr/0001-docker-build-lockfile-policy.md b/docs/adr/0001-docker-build-lockfile-policy.md new file mode 100644 index 0000000..24ce668 --- /dev/null +++ b/docs/adr/0001-docker-build-lockfile-policy.md @@ -0,0 +1,221 @@ +# ADR-0001: Docker build lockfile policy + +> **Status:** Accepted (decision); Deferred (implementation) · **Date:** 2026-05-27 +> **Context:** docker-build-optimization-roadmap §A3 · **Supersedes:** None +> **Authors:** Platform DevOps + +--- + +## 1. Context + +The pilot Phase A work in `docker-build-optimization-roadmap` standardized +on `pnpm install --lockfile=false` inside Docker for both +`learning_ai_clock` (web + backend) and `learning_ai_peakpulse` (backend). +That choice unblocked Phase A by sidestepping a structural mismatch: + +- `pnpm-lock.yaml` is generated against the **outer pnpm workspace**, which + includes `../learning_ai_common_plat/packages/*` as workspace members + (sibling-repo path). +- Inside the Docker build context, the sibling repo doesn't exist + (a single-repo build context is intentionally used for hermeticity). +- `--frozen-lockfile` therefore fails immediately with workspace + resolution errors (finding F2 in the roadmap audit). + +`--lockfile=false` skips lockfile validation entirely and re-resolves all +dependencies against the registry on every `pnpm install`. This is +correct for the workspace-mismatch problem but introduces non-determinism: +the **same Dockerfile + same source tree can produce a different lockset** +across two builds if upstream `@bytelyst/*` versions move between them. + +Phase A2's BuildKit cache mount mitigates the *speed* cost of +re-resolution but not the *determinism* cost. + +This ADR records the decision on which long-term policy to adopt for +Docker builds. Implementation is deferred to a future Phase A3 sprint. + +--- + +## 2. Options considered + +### Option A — Keep `--lockfile=false` (status quo) + +**How it works.** Docker `pnpm install` re-resolves on every cold build. +Cache mount preserves the pnpm content-addressed store across builds, so +warm rebuilds don't pay re-resolution cost. + +**Pros:** +- Zero churn — already shipped in Phase A. +- Tolerates sibling-repo workspace mismatch for free. +- Tolerates `*` semver across all `@bytelyst/*` deps without rework. +- Compatible with the F17 fix (Gitea `host.docker.internal` URLs). + +**Cons:** +- **Non-deterministic builds.** Same Dockerfile + same source can produce + different `node_modules` if a dependency was published between two + cold builds. CI runs days apart can ship divergent images for the same + commit. +- No supply-chain pinning. Any compromised upstream auto-rolls forward. +- `pnpm audit` on the host can disagree with what's actually inside + the image. + +### Option B — Generate a Docker-only flat lockfile during build + +**How it works.** Add a build step that runs `pnpm install --lockfile-only` +in a temp dir against a flattened `pnpm-workspace.yaml` that excludes +sibling-repo paths, then `--frozen-lockfile` against that generated lock. + +**Pros:** +- Deterministic *within a single build* — same registry state at the + moment of the build always produces the same lockset. +- Doesn't require changes to the source tree's `pnpm-workspace.yaml`. + +**Cons:** +- Still non-deterministic across builds (the lock is regenerated each time + unless cached separately). +- Adds Dockerfile complexity and a non-trivial new failure mode + (workspace-flattening logic). +- Marginal value over Option A given the cache mount. + +### Option C — Vendor a Docker-flattened lockfile in the repo + +**How it works.** Commit a `pnpm-lock.docker.yaml` (or similar) per repo +that's generated against a flattened workspace. Dockerfile uses +`pnpm install --frozen-lockfile --lockfile=pnpm-lock.docker.yaml`. + +**Pros:** +- Fully deterministic. Same commit → same lockset → same image. +- Supply chain pins enforced. +- `pnpm audit` matches image contents. + +**Cons:** +- Two lockfiles to maintain (the workspace one + the Docker one). +- Drift risk between the two — solved only by a CI gate that regenerates + the Docker lockfile on every PR that touches `package.json`. +- Requires a tested regenerate-on-CI workflow per repo. +- Workspace flattening logic must be encoded somewhere (script in + `common-plat/scripts/regen-docker-lockfile.sh`). + +### Option D — Restructure to single-repo workspace (eliminate sibling) + +**How it works.** Inline the consumed `@bytelyst/*` packages into each +product repo (vendor them) so there is no sibling-workspace dependency. +Then `--frozen-lockfile` works trivially. + +**Pros:** +- Cleanest from a Docker-build-determinism standpoint. + +**Cons:** +- Massive churn across 14+ product repos. +- Defeats the entire `learning_ai_common_plat` shared-package model. +- Multiplies maintenance cost of `@bytelyst/*` updates by the number of + consumers. +- Out of scope; would supersede the entire ecosystem architecture. + +--- + +## 3. Decision + +**Adopt Option A (`--lockfile=false`) as the official short-term policy.** +**Plan to migrate to Option C (`pnpm-lock.docker.yaml`) when supply-chain +determinism becomes a hard requirement** (e.g., before any production +deployment of a Docker-built image, or before SOC2-style attestation). + +**Reasoning:** + +1. **Phase A is already shipped on Option A** with verified speed wins + (warm rebuilds 2.7–5.4 s across all surfaces). Switching policies + mid-rollout would invalidate metrics + add risk. +2. **The cache mount (Phase A2) addresses the speed concern** that + Option A creates. The remaining concern is determinism, which is a + correctness concern — but the actual blast radius is limited because: + - All `@bytelyst/*` deps are first-party and pinned in source repos. + - Third-party deps already have fixed semver in `package.json` (no + loose `*` ranges to public registries). + - The Gitea registry is the only `@bytelyst/*` source — no public + supply-chain risk for the in-house deps. +3. **Option C is the right end state** but requires CI infrastructure + that doesn't exist yet (auto-regen-on-PR). Building it inside this + roadmap is scope creep. +4. **Option B is dominated by Option C** — same complexity, weaker + guarantees. +5. **Option D is non-starter** — it would require redesigning the + ByteLyst shared-package model. + +--- + +## 4. Consequences + +### Positive + +- Phase A speed wins are preserved with zero policy churn. +- `pnpm-lock.yaml` continues to live in source repos for host development; + it stays in `.dockerignore` for Docker builds. +- The decision is reversible: switching to Option C in the future is + additive (add a Docker lockfile + change one Dockerfile line). + +### Negative + +- Same commit can produce different Docker images on different days. CI + must not assume image hash stability for a given commit. +- `pnpm audit` results from the host don't match Docker image contents. + Workaround: run `pnpm audit` inside the built container as a separate + CI job (cheap; no rebuild needed). +- Supply-chain attestation (SOC2, SLSA) cannot be produced for these + images today. Acceptable while there is no production traffic. + +### Migration trigger + +Switch to Option C when **any** of the following becomes true: + +1. A production environment (paid customers, real PII) deploys a + Docker-built image from this codebase. +2. A regulatory/audit requirement demands reproducible builds. +3. A supply-chain incident occurs (compromised upstream package) and + we need rollback granularity finer than "rebuild from current `*`". +4. The cache-mount speed win disappears (e.g., CI runner switch removes + BuildKit cache persistence). + +### Implementation sketch (when triggered) + +1. In `learning_ai_common_plat`, add `scripts/regen-docker-lockfile.sh`: + - Reads each product repo's `package.json`. + - Generates a flattened `pnpm-workspace.yaml` (no sibling paths). + - Runs `pnpm install --lockfile-only` against the Gitea registry. + - Writes `pnpm-lock.docker.yaml` back to the product repo. +2. Each product repo gets a `.gitea/workflows/regen-docker-lockfile.yml` + that runs the script on PR-touch of `package.json` and either: + - commits the regenerated lockfile (auto-PR), or + - fails the PR with a "run regen-docker-lockfile.sh and commit" message. +3. Each product Dockerfile changes one line: + ```dockerfile + # before + RUN pnpm install --ignore-scripts --lockfile=false + # after + COPY pnpm-lock.docker.yaml ./pnpm-lock.yaml + RUN pnpm install --ignore-scripts --frozen-lockfile + ``` +4. `.dockerignore` removes `pnpm-lock.yaml` exclusion (or adds explicit + include for `pnpm-lock.docker.yaml`). + +This work is **not scoped** in the current roadmap and should be its own +small ADR-driven sprint. + +--- + +## 5. Status tracking + +| Phase | State | Notes | +|---|---|---| +| Decision | ✅ Accepted | This ADR | +| Implementation | ⏸ Deferred | Triggered by §4 conditions | +| Trigger monitor | ⚳ Open | Re-evaluate when Phase D rollout begins | + +--- + +## 6. References + +- `docker-build-optimization-roadmap.md` §0 F1, F2 (lockfile findings) +- `docker-build-optimization-roadmap.md` §A3 (deferred phase) +- `docker-build-optimization-roadmap.md` §A2 (BuildKit cache mount that + mitigates the speed concern of Option A) +- `learning_ai_common_plat/AGENTS.md` (canonical pnpm workspace config) diff --git a/docs/docker-build-optimization-roadmap.md b/docs/docker-build-optimization-roadmap.md index 7792c1d..d432597 100644 --- a/docs/docker-build-optimization-roadmap.md +++ b/docs/docker-build-optimization-roadmap.md @@ -1,6 +1,6 @@ # Docker Build Optimization Roadmap -> **Status:** Draft v7 (Phase A complete on clock; warm rebuilds 2.9 s backend / 5.4 s web) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 +> **Status:** Draft v8 (Phase A complete on clock + peakpulse; A3 ADR accepted; warm rebuilds 2.7–5.4 s) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27 > > Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend) > and `learning_ai_peakpulse` (backend), then capture the playbook here for @@ -310,7 +310,7 @@ Two options — pick one in a short ADR before implementing: |---|---|---|---|---|---| | clock | backend | **59.2 s** | **64.7 s** | **2.9 s** | Cold essentially flat (corepack adds ~1 s; cache mount empty on first run). Warm → 95.1% reduction. Commits: `clock@8b5c767a3` (A0-V), `clock@f6a806ff3` (A1+A8+A9), `clock@55e8d22d3` (A2+A5+A6) | | clock | web | **193 s (3:13)** | **291 s (4:51) †** | **5.4 s** | Warm → 97.2% reduction. † Cold variance — see footer | -| peakpulse | backend | — | — | — | Pending §10 step 9 | +| peakpulse | backend | — (was tarball-only path) | **72.2 s** | **2.7 s** | Warm → 96.3% reduction. Commits: `peakpulse@11a6bc5` (Phase A), `peakpulse@6523a1a` (.gitkeep fix), `clock@1465e06b1`+`d69003c1f` (mirror .gitkeep fix) | **Footer note on cold-build variance.** Cold builds (`--no-cache`) are dominated by network egress for ~50 `@bytelyst/*` tarballs through the @@ -790,9 +790,15 @@ Checks implemented by `docker-doctor.sh`: 8. **✅ A2 + A4 + A5 + A6** on clock (cache mount + dockerignore) — `clock@55e8d22d3`. Warm rebuilds: **backend 2.9 s, web 5.4 s** (95–97% reduction). A7 metrics table populated this commit. -9. **⏸ Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation - second pass. **STOP and request approval — different repo, separate audit.** -10. **⚳ A3 ADR** — decide lockfile policy (defer implementation). +9. **✅ Phase A0 → A6** on `learning_ai_peakpulse` backend (`peakpulse@11a6bc5`). + Cold 72.2 s, warm 2.7 s. Pattern from clock applied verbatim, plus a + side fix for `.docker-deps/.gitkeep` discoverability that was also + ported back to clock (`peakpulse@6523a1a`, `clock@1465e06b1`, + `clock@d69003c1f`). +10. **✅ A3 ADR** — [`docs/adr/0001-docker-build-lockfile-policy.md`](adr/0001-docker-build-lockfile-policy.md). + Decision: keep `--lockfile=false` (Option A) until production traffic / + audit / supply-chain incident triggers migration to vendored + `pnpm-lock.docker.yaml` (Option C). Implementation deferred. 11. **⚳ Phase B** — harden `docker-prep.sh` on clock, then promote to canonical home in common-plat (B7) and sync to peakpulse. 12. **⚳ Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.