docs(docker): roadmap v8 — peakpulse Phase A done + A3 ADR-0001 accepted

Per § 10 steps 9 + 10. Step 9: Peakpulse backend Phase A complete. cold 72.2 s, warm 2.7 s (96.3% reduction). Pattern from clock applied verbatim plus .docker-deps/.gitkeep discoverability fix back-ported to clock. Commits: peakpulse@11a6bc5 feat(docker): Phase A on peakpulse backend peakpulse@6523a1a fix(docker): track .docker-deps/.gitkeep clock@1465e06b1 fix(docker): track .docker-deps/.gitkeep clock@d69003c1f chore: dedupe .docker-deps in .gitignore Step 10: A3 ADR accepted. New file: docs/adr/0001-docker-build-lockfile-policy.md Decision: short-term Option A (--lockfile=false) — already shipped in Phase A; long-term Option C (vendored pnpm-lock.docker.yaml). Migration triggered by production deployment, audit requirement, supply-chain incident, or loss of BuildKit cache. Implementation sketch in ADR § 4. Roadmap doc updates: - § A7 metrics table: peakpulse row populated (72.2 s / 2.7 s). - § A3: collapsed bullet list into decision-record summary linking ADR. - § 10: steps 9 + 10 marked ✅; status banner v7 → v8. Next per § 10: step 11 (Phase B docker-prep hardening) or step 12 (Phase E docker-doctor.sh linter). Phase E is higher-value as durable insurance against F11/F13/F16/F17/F18 regressions across the ecosystem.
2026-05-27 02:54:08 -07:00 · 2026-05-27 02:54:08 -07:00 · b00af09942
commit b00af09942
parent 062155b81e
2 changed files with 232 additions and 5 deletions
--- a/docs/adr/0001-docker-build-lockfile-policy.md
+++ b/docs/adr/0001-docker-build-lockfile-policy.md
@ -0,0 +1,221 @@
 # ADR-0001: Docker build lockfile policy
 > **Status:** Accepted (decision); Deferred (implementation) · **Date:** 2026-05-27
 > **Context:** docker-build-optimization-roadmap §A3 · **Supersedes:** None
 > **Authors:** Platform DevOps
 ---
 ## 1. Context
 The pilot Phase A work in `docker-build-optimization-roadmap` standardized
 on `pnpm install --lockfile=false` inside Docker for both
 `learning_ai_clock` (web + backend) and `learning_ai_peakpulse` (backend).
 That choice unblocked Phase A by sidestepping a structural mismatch:
 - `pnpm-lock.yaml` is generated against the **outer pnpm workspace**, which
  includes `../learning_ai_common_plat/packages/*` as workspace members
  (sibling-repo path).
 - Inside the Docker build context, the sibling repo doesn't exist
  (a single-repo build context is intentionally used for hermeticity).
 - `--frozen-lockfile` therefore fails immediately with workspace
  resolution errors (finding F2 in the roadmap audit).
 `--lockfile=false` skips lockfile validation entirely and re-resolves all
 dependencies against the registry on every `pnpm install`. This is
 correct for the workspace-mismatch problem but introduces non-determinism:
 the **same Dockerfile + same source tree can produce a different lockset**
 across two builds if upstream `@bytelyst/*` versions move between them.
 Phase A2's BuildKit cache mount mitigates the *speed* cost of
 re-resolution but not the *determinism* cost.
 This ADR records the decision on which long-term policy to adopt for
 Docker builds. Implementation is deferred to a future Phase A3 sprint.
 ---
 ## 2. Options considered
 ### Option A — Keep `--lockfile=false` (status quo)
 **How it works.** Docker `pnpm install` re-resolves on every cold build.
 Cache mount preserves the pnpm content-addressed store across builds, so
 warm rebuilds don't pay re-resolution cost.
 **Pros:**
 - Zero churn — already shipped in Phase A.
 - Tolerates sibling-repo workspace mismatch for free.
 - Tolerates `*` semver across all `@bytelyst/*` deps without rework.
 - Compatible with the F17 fix (Gitea `host.docker.internal` URLs).
 **Cons:**
 - **Non-deterministic builds.** Same Dockerfile + same source can produce
  different `node_modules` if a dependency was published between two
  cold builds. CI runs days apart can ship divergent images for the same
  commit.
 - No supply-chain pinning. Any compromised upstream auto-rolls forward.
 - `pnpm audit` on the host can disagree with what's actually inside
  the image.
 ### Option B — Generate a Docker-only flat lockfile during build
 **How it works.** Add a build step that runs `pnpm install --lockfile-only`
 in a temp dir against a flattened `pnpm-workspace.yaml` that excludes
 sibling-repo paths, then `--frozen-lockfile` against that generated lock.
 **Pros:**
 - Deterministic *within a single build* — same registry state at the
  moment of the build always produces the same lockset.
 - Doesn't require changes to the source tree's `pnpm-workspace.yaml`.
 **Cons:**
 - Still non-deterministic across builds (the lock is regenerated each time
  unless cached separately).
 - Adds Dockerfile complexity and a non-trivial new failure mode
  (workspace-flattening logic).
 - Marginal value over Option A given the cache mount.
 ### Option C — Vendor a Docker-flattened lockfile in the repo
 **How it works.** Commit a `pnpm-lock.docker.yaml` (or similar) per repo
 that's generated against a flattened workspace. Dockerfile uses
 `pnpm install --frozen-lockfile --lockfile=pnpm-lock.docker.yaml`.
 **Pros:**
 - Fully deterministic. Same commit → same lockset → same image.
 - Supply chain pins enforced.
 - `pnpm audit` matches image contents.
 **Cons:**
 - Two lockfiles to maintain (the workspace one + the Docker one).
 - Drift risk between the two — solved only by a CI gate that regenerates
  the Docker lockfile on every PR that touches `package.json`.
 - Requires a tested regenerate-on-CI workflow per repo.
 - Workspace flattening logic must be encoded somewhere (script in
  `common-plat/scripts/regen-docker-lockfile.sh`).
 ### Option D — Restructure to single-repo workspace (eliminate sibling)
 **How it works.** Inline the consumed `@bytelyst/*` packages into each
 product repo (vendor them) so there is no sibling-workspace dependency.
 Then `--frozen-lockfile` works trivially.
 **Pros:**
 - Cleanest from a Docker-build-determinism standpoint.
 **Cons:**
 - Massive churn across 14+ product repos.
 - Defeats the entire `learning_ai_common_plat` shared-package model.
 - Multiplies maintenance cost of `@bytelyst/*` updates by the number of
  consumers.
 - Out of scope; would supersede the entire ecosystem architecture.
 ---
 ## 3. Decision
 **Adopt Option A (`--lockfile=false`) as the official short-term policy.**
 **Plan to migrate to Option C (`pnpm-lock.docker.yaml`) when supply-chain
 determinism becomes a hard requirement** (e.g., before any production
 deployment of a Docker-built image, or before SOC2-style attestation).
 **Reasoning:**
 1. **Phase A is already shipped on Option A** with verified speed wins
   (warm rebuilds 2.7–5.4 s across all surfaces). Switching policies
   mid-rollout would invalidate metrics + add risk.
 2. **The cache mount (Phase A2) addresses the speed concern** that
   Option A creates. The remaining concern is determinism, which is a
   correctness concern — but the actual blast radius is limited because:
   - All `@bytelyst/*` deps are first-party and pinned in source repos.
   - Third-party deps already have fixed semver in `package.json` (no
     loose `*` ranges to public registries).
   - The Gitea registry is the only `@bytelyst/*` source — no public
     supply-chain risk for the in-house deps.
 3. **Option C is the right end state** but requires CI infrastructure
   that doesn't exist yet (auto-regen-on-PR). Building it inside this
   roadmap is scope creep.
 4. **Option B is dominated by Option C** — same complexity, weaker
   guarantees.
 5. **Option D is non-starter** — it would require redesigning the
   ByteLyst shared-package model.
 ---
 ## 4. Consequences
 ### Positive
 - Phase A speed wins are preserved with zero policy churn.
 - `pnpm-lock.yaml` continues to live in source repos for host development;
  it stays in `.dockerignore` for Docker builds.
 - The decision is reversible: switching to Option C in the future is
  additive (add a Docker lockfile + change one Dockerfile line).
 ### Negative
 - Same commit can produce different Docker images on different days. CI
  must not assume image hash stability for a given commit.
 - `pnpm audit` results from the host don't match Docker image contents.
  Workaround: run `pnpm audit` inside the built container as a separate
  CI job (cheap; no rebuild needed).
 - Supply-chain attestation (SOC2, SLSA) cannot be produced for these
  images today. Acceptable while there is no production traffic.
 ### Migration trigger
 Switch to Option C when **any** of the following becomes true:
 1. A production environment (paid customers, real PII) deploys a
   Docker-built image from this codebase.
 2. A regulatory/audit requirement demands reproducible builds.
 3. A supply-chain incident occurs (compromised upstream package) and
   we need rollback granularity finer than "rebuild from current `*`".
 4. The cache-mount speed win disappears (e.g., CI runner switch removes
   BuildKit cache persistence).
 ### Implementation sketch (when triggered)
 1. In `learning_ai_common_plat`, add `scripts/regen-docker-lockfile.sh`:
   - Reads each product repo's `package.json`.
   - Generates a flattened `pnpm-workspace.yaml` (no sibling paths).
   - Runs `pnpm install --lockfile-only` against the Gitea registry.
   - Writes `pnpm-lock.docker.yaml` back to the product repo.
 2. Each product repo gets a `.gitea/workflows/regen-docker-lockfile.yml`
   that runs the script on PR-touch of `package.json` and either:
   - commits the regenerated lockfile (auto-PR), or
   - fails the PR with a "run regen-docker-lockfile.sh and commit" message.
 3. Each product Dockerfile changes one line:
   ```dockerfile
   # before
   RUN pnpm install --ignore-scripts --lockfile=false
   # after
   COPY pnpm-lock.docker.yaml ./pnpm-lock.yaml
   RUN pnpm install --ignore-scripts --frozen-lockfile
   ```
 4. `.dockerignore` removes `pnpm-lock.yaml` exclusion (or adds explicit
   include for `pnpm-lock.docker.yaml`).
 This work is **not scoped** in the current roadmap and should be its own
 small ADR-driven sprint.
 ---
 ## 5. Status tracking
 | Phase | State | Notes |
 |---|---|---|
 | Decision | ✅ Accepted | This ADR |
 | Implementation | ⏸ Deferred | Triggered by §4 conditions |
 | Trigger monitor | ⚳ Open | Re-evaluate when Phase D rollout begins |
 ---
 ## 6. References
 - `docker-build-optimization-roadmap.md` §0 F1, F2 (lockfile findings)
 - `docker-build-optimization-roadmap.md` §A3 (deferred phase)
 - `docker-build-optimization-roadmap.md` §A2 (BuildKit cache mount that
  mitigates the speed concern of Option A)
 - `learning_ai_common_plat/AGENTS.md` (canonical pnpm workspace config)
--- a/docs/docker-build-optimization-roadmap.md
+++ b/docs/docker-build-optimization-roadmap.md
@ -1,6 +1,6 @@
 # Docker Build Optimization Roadmap
-> **Status:** Draft v7 (Phase A complete on clock; warm rebuilds 2.9 s backend / 5.4 s web) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
+> **Status:** Draft v8 (Phase A complete on clock + peakpulse; A3 ADR accepted; warm rebuilds 2.7–5.4 s) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
 >
 > Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
 > and `learning_ai_peakpulse` (backend), then capture the playbook here for
@ -310,7 +310,7 @@ Two options — pick one in a short ADR before implementing:
 |---|---|---|---|---|---|
 | clock | backend | **59.2 s** | **64.7 s** | **2.9 s** | Cold essentially flat (corepack adds ~1 s; cache mount empty on first run). Warm → 95.1% reduction. Commits: `clock@8b5c767a3` (A0-V), `clock@f6a806ff3` (A1+A8+A9), `clock@55e8d22d3` (A2+A5+A6) |
 | clock | web | **193 s (3:13)** | **291 s (4:51) †** | **5.4 s** | Warm → 97.2% reduction. † Cold variance — see footer |
-| peakpulse | backend | — | — | — | Pending §10 step 9 |
+| peakpulse | backend | — (was tarball-only path) | **72.2 s** | **2.7 s** | Warm → 96.3% reduction. Commits: `peakpulse@11a6bc5` (Phase A), `peakpulse@6523a1a` (.gitkeep fix), `clock@1465e06b1`+`d69003c1f` (mirror .gitkeep fix) |
 **Footer note on cold-build variance.** Cold builds (`--no-cache`) are
 dominated by network egress for ~50 `@bytelyst/*` tarballs through the
@ -790,9 +790,15 @@ Checks implemented by `docker-doctor.sh`:
 8. **✅ A2 + A4 + A5 + A6** on clock (cache mount + dockerignore) — `clock@55e8d22d3`.
   Warm rebuilds: **backend 2.9 s, web 5.4 s** (95–97% reduction).
   A7 metrics table populated this commit.
-9. **⏸ Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
+9. **✅ Phase A0 → A6** on `learning_ai_peakpulse` backend (`peakpulse@11a6bc5`).
-   second pass. **STOP and request approval — different repo, separate audit.**
+   Cold 72.2 s, warm 2.7 s. Pattern from clock applied verbatim, plus a
-10. **⚳ A3 ADR** — decide lockfile policy (defer implementation).
+   side fix for `.docker-deps/.gitkeep` discoverability that was also
   ported back to clock (`peakpulse@6523a1a`, `clock@1465e06b1`,
   `clock@d69003c1f`).
 10. **✅ A3 ADR** — [`docs/adr/0001-docker-build-lockfile-policy.md`](adr/0001-docker-build-lockfile-policy.md).
   Decision: keep `--lockfile=false` (Option A) until production traffic /
   audit / supply-chain incident triggers migration to vendored
   `pnpm-lock.docker.yaml` (Option C). Implementation deferred.
 11. **⚳ Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
    home in common-plat (B7) and sync to peakpulse.
 12. **⚳ Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.