docs(docker): roadmap v8 — peakpulse Phase A done + A3 ADR-0001 accepted

Per § 10 steps 9 + 10.

Step 9: Peakpulse backend Phase A complete.
  cold 72.2 s, warm 2.7 s (96.3% reduction). Pattern from clock applied
  verbatim plus .docker-deps/.gitkeep discoverability fix back-ported
  to clock. Commits:
    peakpulse@11a6bc5  feat(docker): Phase A on peakpulse backend
    peakpulse@6523a1a  fix(docker): track .docker-deps/.gitkeep
    clock@1465e06b1    fix(docker): track .docker-deps/.gitkeep
    clock@d69003c1f    chore: dedupe .docker-deps in .gitignore

Step 10: A3 ADR accepted.
  New file: docs/adr/0001-docker-build-lockfile-policy.md
  Decision: short-term Option A (--lockfile=false) — already shipped in
  Phase A; long-term Option C (vendored pnpm-lock.docker.yaml). Migration
  triggered by production deployment, audit requirement, supply-chain
  incident, or loss of BuildKit cache. Implementation sketch in ADR § 4.

Roadmap doc updates:
  - § A7 metrics table: peakpulse row populated (72.2 s / 2.7 s).
  - § A3: collapsed bullet list into decision-record summary linking ADR.
  - § 10: steps 9 + 10 marked ✅; status banner v7 → v8.

Next per § 10: step 11 (Phase B docker-prep hardening) or step 12
(Phase E docker-doctor.sh linter). Phase E is higher-value as durable
insurance against F11/F13/F16/F17/F18 regressions across the ecosystem.
saravanakumardb1 2026-05-27 02:54:08 -07:00
parent 062155b81e
commit b00af09942
2 changed files with 232 additions and 5 deletions


@@ -0,0 +1,221 @@
# ADR-0001: Docker build lockfile policy
> **Status:** Accepted (decision); Deferred (implementation) · **Date:** 2026-05-27
> **Context:** docker-build-optimization-roadmap §A3 · **Supersedes:** None
> **Authors:** Platform DevOps
---
## 1. Context
The pilot Phase A work in `docker-build-optimization-roadmap` standardized
on `pnpm install --lockfile=false` inside Docker for both
`learning_ai_clock` (web + backend) and `learning_ai_peakpulse` (backend).
That choice unblocked Phase A by sidestepping a structural mismatch:
- `pnpm-lock.yaml` is generated against the **outer pnpm workspace**, which
includes `../learning_ai_common_plat/packages/*` as workspace members
(sibling-repo path).
- Inside the Docker build context, the sibling repo doesn't exist
(a single-repo build context is intentionally used for hermeticity).
- `--frozen-lockfile` therefore fails immediately with workspace
resolution errors (finding F2 in the roadmap audit).
`--lockfile=false` skips lockfile validation entirely and re-resolves all
dependencies against the registry on every `pnpm install`. This is
correct for the workspace-mismatch problem but introduces non-determinism:
the **same Dockerfile + same source tree can produce a different lockset**
across two builds if upstream `@bytelyst/*` versions move between them.
Phase A2's BuildKit cache mount mitigates the *speed* cost of
re-resolution but not the *determinism* cost.
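One way to make this drift observable is to reduce each build's resolved dependency list to a single fingerprint and compare fingerprints across builds of the same commit. The following is a minimal sketch, not part of the shipped Phase A work: `lockset_fingerprint` is a hypothetical helper, and it assumes the resolved set has already been exported from the image as one `name@version` entry per line (how that export happens is left open).

```shell
#!/usr/bin/env sh
# Hypothetical helper: hash a resolved dependency list read from stdin.
# Assumes one "name@version" entry per line, in any order.
lockset_fingerprint() {
  # Sort under a fixed locale and drop duplicates so that ordering
  # differences between builds do not change the hash.
  LC_ALL=C sort -u | {
    if command -v sha256sum >/dev/null 2>&1; then
      sha256sum
    else
      shasum -a 256   # fallback for hosts without coreutils sha256sum
    fi
  } | cut -d' ' -f1
}
```

Two cold builds of the same commit that print different fingerprints have resolved different dependency sets, which is exactly the non-determinism described above.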
This ADR records the decision on which long-term policy to adopt for
Docker builds. Implementation is deferred to a future Phase A3 sprint.
---
## 2. Options considered
### Option A — Keep `--lockfile=false` (status quo)
**How it works.** Docker `pnpm install` re-resolves on every cold build.
Cache mount preserves the pnpm content-addressed store across builds, so
warm rebuilds don't pay re-resolution cost.
**Pros:**
- Zero churn — already shipped in Phase A.
- Tolerates sibling-repo workspace mismatch for free.
- Tolerates `*` semver across all `@bytelyst/*` deps without rework.
- Compatible with the F17 fix (Gitea `host.docker.internal` URLs).
**Cons:**
- **Non-deterministic builds.** Same Dockerfile + same source can produce
different `node_modules` if a dependency was published between two
cold builds. CI runs days apart can ship divergent images for the same
commit.
- No supply-chain pinning. Any compromised upstream auto-rolls forward.
- `pnpm audit` on the host can disagree with what's actually inside
the image.
### Option B — Generate a Docker-only flat lockfile during build
**How it works.** Add a build step that runs `pnpm install --lockfile-only`
in a temp dir against a flattened `pnpm-workspace.yaml` that excludes
sibling-repo paths, then `--frozen-lockfile` against that generated lock.
**Pros:**
- Deterministic *within a single build* — same registry state at the
moment of the build always produces the same lockset.
- Doesn't require changes to the source tree's `pnpm-workspace.yaml`.
**Cons:**
- Still non-deterministic across builds (the lock is regenerated each time
unless cached separately).
- Adds Dockerfile complexity and a non-trivial new failure mode
(workspace-flattening logic).
- Marginal value over Option A given the cache mount.
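The workspace-flattening step that Option B relies on could look roughly like the sketch below. It is an illustration under assumptions, not the implemented logic: `flatten_workspace` is a hypothetical name, and it assumes sibling-repo members appear in `pnpm-workspace.yaml` as simple list entries whose path starts with `../`.

```shell
#!/usr/bin/env sh
# Hypothetical helper: print a copy of a pnpm-workspace.yaml with
# sibling-repo members removed.
# $1: path to the source pnpm-workspace.yaml; result goes to stdout.
flatten_workspace() {
  # Drop list entries such as  - '../learning_ai_common_plat/packages/*'
  # (optionally quoted); every other line passes through unchanged.
  grep -Ev "^[[:space:]]*-[[:space:]]*[\"']?\.\./" "$1"
}
```

A line filter keeps the sketch dependency-free, but it does not understand YAML; multi-line or anchored entries would need a real YAML parser.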
### Option C — Vendor a Docker-flattened lockfile in the repo
**How it works.** Commit a `pnpm-lock.docker.yaml` (or similar) per repo
that's generated against a flattened workspace. The Dockerfile copies it to
`pnpm-lock.yaml` and runs `pnpm install --frozen-lockfile`.
**Pros:**
- Fully deterministic. Same commit → same lockset → same image.
- Supply chain pins enforced.
- `pnpm audit` matches image contents.
**Cons:**
- Two lockfiles to maintain (the workspace one + the Docker one).
- Drift risk between the two — solved only by a CI gate that regenerates
the Docker lockfile on every PR that touches `package.json`.
- Requires a tested regenerate-on-CI workflow per repo.
- Workspace flattening logic must be encoded somewhere (script in
`common-plat/scripts/regen-docker-lockfile.sh`).
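The CI gate mentioned in the cons above can be reduced to a byte-for-byte comparison between the committed Docker lockfile and a freshly regenerated one. The sketch below assumes the regeneration has already run and written its output to a second path; `check_docker_lock_drift` and its exit codes are hypothetical choices for illustration.

```shell
#!/usr/bin/env sh
# Hypothetical PR gate: fail when the committed Docker lockfile is stale.
# $1: committed lockfile, $2: freshly regenerated lockfile.
# Returns 0 when identical, 1 when they differ, 2 when $1 is missing.
check_docker_lock_drift() {
  drift_committed=$1
  drift_regenerated=$2
  if [ ! -f "$drift_committed" ]; then
    echo "missing $drift_committed: run regen-docker-lockfile.sh and commit" >&2
    return 2
  fi
  # cmp -s is silent and exits 0 only when both files match exactly.
  if cmp -s "$drift_committed" "$drift_regenerated"; then
    return 0
  fi
  echo "$drift_committed is stale: run regen-docker-lockfile.sh and commit" >&2
  return 1
}
```

An exact comparison is deliberately strict: any difference, including formatting changes from a new pnpm version, fails the gate and forces a regenerate-and-commit.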
### Option D — Restructure to single-repo workspace (eliminate sibling)
**How it works.** Inline the consumed `@bytelyst/*` packages into each
product repo (vendor them) so there is no sibling-workspace dependency.
Then `--frozen-lockfile` works trivially.
**Pros:**
- Cleanest from a Docker-build-determinism standpoint.
**Cons:**
- Massive churn across 14+ product repos.
- Defeats the entire `learning_ai_common_plat` shared-package model.
- Multiplies maintenance cost of `@bytelyst/*` updates by the number of
consumers.
- Out of scope; would supersede the entire ecosystem architecture.
---
## 3. Decision
**Adopt Option A (`--lockfile=false`) as the official short-term policy.**
**Plan to migrate to Option C (`pnpm-lock.docker.yaml`) when supply-chain
determinism becomes a hard requirement** (e.g., before any production
deployment of a Docker-built image, or before SOC2-style attestation).
**Reasoning:**
1. **Phase A is already shipped on Option A** with verified speed wins
(warm rebuilds 2.7–5.4 s across all surfaces). Switching policies
mid-rollout would invalidate metrics + add risk.
2. **The cache mount (Phase A2) addresses the speed concern** that
Option A creates. The remaining concern is determinism, which is a
correctness concern — but the actual blast radius is limited because:
- All `@bytelyst/*` deps are first-party and pinned in source repos.
- Third-party deps already have fixed semver in `package.json` (no
loose `*` ranges to public registries).
- The Gitea registry is the only `@bytelyst/*` source — no public
supply-chain risk for the in-house deps.
3. **Option C is the right end state** but requires CI infrastructure
that doesn't exist yet (auto-regen-on-PR). Building it inside this
roadmap is scope creep.
4. **Option B is dominated by Option C** — same complexity, weaker
guarantees.
5. **Option D is a non-starter** — it would require redesigning the
ByteLyst shared-package model.
---
## 4. Consequences
### Positive
- Phase A speed wins are preserved with zero policy churn.
- `pnpm-lock.yaml` continues to live in source repos for host development;
it stays in `.dockerignore` for Docker builds.
- The decision is reversible: switching to Option C in the future is
additive (add a Docker lockfile + change one Dockerfile line).
### Negative
- Same commit can produce different Docker images on different days. CI
must not assume image hash stability for a given commit.
- `pnpm audit` results from the host don't match Docker image contents.
Workaround: run `pnpm audit` inside the built container as a separate
CI job (cheap; no rebuild needed).
- Supply-chain attestation (SOC2, SLSA) cannot be produced for these
images today. Acceptable while there is no production traffic.
### Migration trigger
Switch to Option C when **any** of the following becomes true:
1. A production environment (paid customers, real PII) deploys a
Docker-built image from this codebase.
2. A regulatory/audit requirement demands reproducible builds.
3. A supply-chain incident occurs (compromised upstream package) and
we need rollback granularity finer than "rebuild from current `*`".
4. The cache-mount speed win disappears (e.g., CI runner switch removes
BuildKit cache persistence).
### Implementation sketch (when triggered)
1. In `learning_ai_common_plat`, add `scripts/regen-docker-lockfile.sh`:
- Reads each product repo's `package.json`.
- Generates a flattened `pnpm-workspace.yaml` (no sibling paths).
- Runs `pnpm install --lockfile-only` against the Gitea registry.
- Writes `pnpm-lock.docker.yaml` back to the product repo.
2. Each product repo gets a `.gitea/workflows/regen-docker-lockfile.yml`
that runs the script on PR-touch of `package.json` and either:
- commits the regenerated lockfile (auto-PR), or
- fails the PR with a "run regen-docker-lockfile.sh and commit" message.
3. Each product Dockerfile changes one line:
```dockerfile
# before
RUN pnpm install --ignore-scripts --lockfile=false
# after
COPY pnpm-lock.docker.yaml ./pnpm-lock.yaml
RUN pnpm install --ignore-scripts --frozen-lockfile
```
4. `.dockerignore` removes `pnpm-lock.yaml` exclusion (or adds explicit
include for `pnpm-lock.docker.yaml`).
This work is **not scoped** in the current roadmap and should be its own
small ADR-driven sprint.
---
## 5. Status tracking
| Phase | State | Notes |
|---|---|---|
| Decision | ✅ Accepted | This ADR |
| Implementation | ⏸ Deferred | Triggered by §4 conditions |
| Trigger monitor | ⚳ Open | Re-evaluate when Phase D rollout begins |
---
## 6. References
- `docker-build-optimization-roadmap.md` §0 F1, F2 (lockfile findings)
- `docker-build-optimization-roadmap.md` §A3 (deferred phase)
- `docker-build-optimization-roadmap.md` §A2 (BuildKit cache mount that
mitigates the speed concern of Option A)
- `learning_ai_common_plat/AGENTS.md` (canonical pnpm workspace config)


@@ -1,6 +1,6 @@
 # Docker Build Optimization Roadmap
-> **Status:** Draft v7 (Phase A complete on clock; warm rebuilds 2.9 s backend / 5.4 s web) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
+> **Status:** Draft v8 (Phase A complete on clock + peakpulse; A3 ADR accepted; warm rebuilds 2.7–5.4 s) · **Owner:** Platform DevOps · **Created:** 2026-05-27 · **Revised:** 2026-05-27
 >
 > Pilot Docker-build correctness + speed fixes on `learning_ai_clock` (web + backend)
 > and `learning_ai_peakpulse` (backend), then capture the playbook here for
@@ -310,7 +310,7 @@ Two options — pick one in a short ADR before implementing:
 |---|---|---|---|---|---|
 | clock | backend | **59.2 s** | **64.7 s** | **2.9 s** | Cold essentially flat (corepack adds ~1 s; cache mount empty on first run). Warm → 95.1% reduction. Commits: `clock@8b5c767a3` (A0-V), `clock@f6a806ff3` (A1+A8+A9), `clock@55e8d22d3` (A2+A5+A6) |
 | clock | web | **193 s (3:13)** | **291 s (4:51) †** | **5.4 s** | Warm → 97.2% reduction. † Cold variance — see footer |
-| peakpulse | backend | — | — | — | Pending §10 step 9 |
+| peakpulse | backend | — (was tarball-only path) | **72.2 s** | **2.7 s** | Warm → 96.3% reduction. Commits: `peakpulse@11a6bc5` (Phase A), `peakpulse@6523a1a` (.gitkeep fix), `clock@1465e06b1`+`d69003c1f` (mirror .gitkeep fix) |
 **Footer note on cold-build variance.** Cold builds (`--no-cache`) are
 dominated by network egress for ~50 `@bytelyst/*` tarballs through the
@@ -790,9 +790,15 @@ Checks implemented by `docker-doctor.sh`:
 8. **✅ A2 + A4 + A5 + A6** on clock (cache mount + dockerignore) — `clock@55e8d22d3`.
 Warm rebuilds: **backend 2.9 s, web 5.4 s** (95–97% reduction).
 A7 metrics table populated this commit.
-9. **⏸ Phase A0 → A6** on `learning_ai_peakpulse` (backend only) as validation
-second pass. **STOP and request approval — different repo, separate audit.**
-10. **⚳ A3 ADR** — decide lockfile policy (defer implementation).
+9. **✅ Phase A0 → A6** on `learning_ai_peakpulse` backend (`peakpulse@11a6bc5`).
+Cold 72.2 s, warm 2.7 s. Pattern from clock applied verbatim, plus a
+side fix for `.docker-deps/.gitkeep` discoverability that was also
+ported back to clock (`peakpulse@6523a1a`, `clock@1465e06b1`,
+`clock@d69003c1f`).
+10. **✅ A3 ADR** — [`docs/adr/0001-docker-build-lockfile-policy.md`](adr/0001-docker-build-lockfile-policy.md).
+Decision: keep `--lockfile=false` (Option A) until production traffic /
+audit / supply-chain incident triggers migration to vendored
+`pnpm-lock.docker.yaml` (Option C). Implementation deferred.
 11. **⚳ Phase B** — harden `docker-prep.sh` on clock, then promote to canonical
 home in common-plat (B7) and sync to peakpulse.
 12. **⚳ Phase E** — land `docker-doctor.sh`, wire into CI as warning, then error.