bytelyst-devops-tools/docs/adr/0001-docker-build-lockfile-policy.md
saravanakumardb1 b00af09942 docs(docker): roadmap v8 — peakpulse Phase A done + A3 ADR-0001 accepted
Per § 10 steps 9 + 10.

Step 9: Peakpulse backend Phase A complete.
  cold 72.2 s, warm 2.7 s (96.3% reduction). Pattern from clock applied
  verbatim plus .docker-deps/.gitkeep discoverability fix back-ported
  to clock. Commits:
    peakpulse@11a6bc5  feat(docker): Phase A on peakpulse backend
    peakpulse@6523a1a  fix(docker): track .docker-deps/.gitkeep
    clock@1465e06b1    fix(docker): track .docker-deps/.gitkeep
    clock@d69003c1f    chore: dedupe .docker-deps in .gitignore

Step 10: A3 ADR accepted.
  New file: docs/adr/0001-docker-build-lockfile-policy.md
  Decision: short-term Option A (--lockfile=false) — already shipped in
  Phase A; long-term Option C (vendored pnpm-lock.docker.yaml). Migration
  triggered by production deployment, audit requirement, supply-chain
  incident, or loss of BuildKit cache. Implementation sketch in ADR § 4.

Roadmap doc updates:
  - § A7 metrics table: peakpulse row populated (72.2 s / 2.7 s).
  - § A3: collapsed bullet list into decision-record summary linking ADR.
  - § 10: steps 9 + 10 marked done; status banner v7 → v8.

Next per § 10: step 11 (Phase B docker-prep hardening) or step 12
(Phase E docker-doctor.sh linter). Phase E is higher-value as durable
insurance against F11/F13/F16/F17/F18 regressions across the ecosystem.
2026-05-27 02:54:08 -07:00


# ADR-0001: Docker build lockfile policy
> **Status:** Accepted (decision); Deferred (implementation) · **Date:** 2026-05-27
> **Context:** docker-build-optimization-roadmap §A3 · **Supersedes:** None
> **Authors:** Platform DevOps
---
## 1. Context
The pilot Phase A work in `docker-build-optimization-roadmap` standardized
on `pnpm install --lockfile=false` inside Docker for both
`learning_ai_clock` (web + backend) and `learning_ai_peakpulse` (backend).
That choice unblocked Phase A by sidestepping a structural mismatch:
- `pnpm-lock.yaml` is generated against the **outer pnpm workspace**, which
includes `../learning_ai_common_plat/packages/*` as workspace members
(sibling-repo path).
- Inside the Docker build context, the sibling repo doesn't exist
(a single-repo build context is intentionally used for hermeticity).
- `--frozen-lockfile` therefore fails immediately with workspace
resolution errors (finding F2 in the roadmap audit).
`--lockfile=false` makes pnpm ignore `pnpm-lock.yaml` entirely (it is
neither read nor written) and re-resolve all dependencies against the
registry on every `pnpm install`. This sidesteps the workspace-mismatch
problem but introduces non-determinism:
the **same Dockerfile + same source tree can produce a different lockset**
across two builds if upstream `@bytelyst/*` versions move between them.
Phase A2's BuildKit cache mount mitigates the *speed* cost of
re-resolution but not the *determinism* cost.
This ADR records the decision on which long-term policy to adopt for
Docker builds. Implementation is deferred to a future Phase A3 sprint.
---
## 2. Options considered
### Option A — Keep `--lockfile=false` (status quo)
**How it works.** `pnpm install` inside Docker re-resolves on every cold
build. The BuildKit cache mount preserves pnpm's content-addressed store
across builds, so warm rebuilds don't pay the re-resolution cost.
**Pros:**
- Zero churn — already shipped in Phase A.
- Tolerates sibling-repo workspace mismatch for free.
- Tolerates `*` semver across all `@bytelyst/*` deps without rework.
- Compatible with the F17 fix (Gitea `host.docker.internal` URLs).
**Cons:**
- **Non-deterministic builds.** Same Dockerfile + same source can produce
different `node_modules` if a dependency was published between two
cold builds. CI runs days apart can ship divergent images for the same
commit.
- No supply-chain pinning. Any compromised upstream auto-rolls forward.
- `pnpm audit` on the host can disagree with what's actually inside
the image.
### Option B — Generate a Docker-only flat lockfile during build
**How it works.** Add a build step that runs `pnpm install --lockfile-only`
in a temp dir against a flattened `pnpm-workspace.yaml` that excludes
sibling-repo paths, then `--frozen-lockfile` against that generated lock.
**Pros:**
- Deterministic *within a single build* — same registry state at the
moment of the build always produces the same lockset.
- Doesn't require changes to the source tree's `pnpm-workspace.yaml`.
**Cons:**
- Still non-deterministic across builds (the lock is regenerated each time
unless cached separately).
- Adds Dockerfile complexity and a non-trivial new failure mode
(workspace-flattening logic).
- Marginal value over Option A given the cache mount.
### Option C — Vendor a Docker-flattened lockfile in the repo
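**How it works.** Commit a `pnpm-lock.docker.yaml` (or similar) per repo
that's generated against a flattened workspace. Because pnpm's `--lockfile`
setting is a boolean rather than a path, the Dockerfile copies that file to
`pnpm-lock.yaml` and runs `pnpm install --frozen-lockfile` (see the §4
implementation sketch).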
**Pros:**
- Fully deterministic. Same commit → same lockset → same image.
- Supply chain pins enforced.
- `pnpm audit` matches image contents.
**Cons:**
- Two lockfiles to maintain (the workspace one + the Docker one).
- Drift risk between the two — solved only by a CI gate that regenerates
the Docker lockfile on every PR that touches `package.json`.
- Requires a tested regenerate-on-CI workflow per repo.
- Workspace flattening logic must be encoded somewhere (script in
`common-plat/scripts/regen-docker-lockfile.sh`).
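
The CI gate mentioned in the cons above could look roughly like the following. This is a minimal sketch, not an existing script: the function names are hypothetical, and it assumes the pipeline has already written the list of changed paths to a file and regenerated a candidate lockfile somewhere outside the repo.

```shell
#!/bin/sh
# Hypothetical CI gate for the Docker lockfile (names are illustrative only).

# Reads changed paths (one per line) on stdin.
# Succeeds if any of them is a package.json at any depth.
touches_package_json() {
  grep -q -E '(^|/)package\.json$'
}

# Succeeds when the committed Docker lockfile is byte-identical
# to a freshly regenerated one.
docker_lockfile_is_fresh() {
  cmp -s "$1" "$2"
}

# $1 = file listing changed paths
# $2 = committed pnpm-lock.docker.yaml
# $3 = regenerated candidate lockfile
# Returns 0 when the PR may proceed, 1 when the lockfile is stale.
docker_lockfile_gate() {
  if ! touches_package_json < "$1"; then
    return 0   # no manifest change, nothing to verify
  fi
  if docker_lockfile_is_fresh "$2" "$3"; then
    return 0
  fi
  echo "pnpm-lock.docker.yaml is stale: run regen-docker-lockfile.sh and commit" >&2
  return 1
}
```

The changed-path list could come from something like `git diff --name-only origin/main...HEAD`, where the base branch name is an assumption. Comparing bytes with `cmp` keeps the gate independent of pnpm's lockfile format.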
### Option D — Restructure to single-repo workspace (eliminate sibling)
**How it works.** Inline the consumed `@bytelyst/*` packages into each
product repo (vendor them) so there is no sibling-workspace dependency.
Then `--frozen-lockfile` works trivially.
**Pros:**
- Cleanest from a Docker-build-determinism standpoint.
**Cons:**
- Massive churn across 14+ product repos.
- Defeats the entire `learning_ai_common_plat` shared-package model.
- Multiplies maintenance cost of `@bytelyst/*` updates by the number of
consumers.
- Out of scope; would supersede the entire ecosystem architecture.
---
## 3. Decision
**Adopt Option A (`--lockfile=false`) as the official short-term policy.**
**Plan to migrate to Option C (`pnpm-lock.docker.yaml`) when supply-chain
determinism becomes a hard requirement** (e.g., before any production
deployment of a Docker-built image, or before SOC2-style attestation).
**Reasoning:**
1. **Phase A is already shipped on Option A** with verified speed wins
(warm rebuilds 2.7 to 5.4 s across all surfaces). Switching policies
mid-rollout would invalidate metrics + add risk.
2. **The cache mount (Phase A2) addresses the speed concern** that
Option A creates. The remaining concern is determinism, which is a
correctness concern — but the actual blast radius is limited because:
- All `@bytelyst/*` deps are first-party and pinned in source repos.
- Third-party deps already have fixed semver in `package.json` (no
loose `*` ranges to public registries).
- The Gitea registry is the only `@bytelyst/*` source — no public
supply-chain risk for the in-house deps.
3. **Option C is the right end state** but requires CI infrastructure
that doesn't exist yet (auto-regen-on-PR). Building it inside this
roadmap is scope creep.
4. **Option B is dominated by Option C** — same complexity, weaker
guarantees.
5. **Option D is a non-starter**: it would require redesigning the
ByteLyst shared-package model.
---
## 4. Consequences
### Positive
- Phase A speed wins are preserved with zero policy churn.
- `pnpm-lock.yaml` continues to live in source repos for host development;
it stays in `.dockerignore` for Docker builds.
- The decision is reversible: switching to Option C in the future is
additive (add a Docker lockfile + change one Dockerfile line).
### Negative
- Same commit can produce different Docker images on different days. CI
must not assume image hash stability for a given commit.
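- `pnpm audit` results from the host don't match Docker image contents.
Workaround: run `pnpm audit` inside the built container as a separate
CI job (cheap; no rebuild needed). Caveat: `pnpm audit` reads
`pnpm-lock.yaml`, which a `--lockfile=false` install does not write, so
that job needs some way to produce a lockfile reflecting the image's
contents.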
- Supply-chain attestation (SOC2, SLSA) cannot be produced for these
images today. Acceptable while there is no production traffic.
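
Since CI cannot assume two images built from the same commit are identical, it may help to make the drift visible. The sketch below is one possible approach, not part of the roadmap: `lockset_drift` is a hypothetical helper that compares two plain-text dependency listings (one `name@version` per line) and prints what changed.

```shell
#!/bin/sh
# Hypothetical helper: show which resolved packages differ between two builds.
# $1 = listing from the earlier image, $2 = listing from the later image.
# Lines only in the earlier listing are prefixed "- ", lines only in the
# later listing are prefixed "+ ". Prints nothing when the locksets match.
lockset_drift() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  # comm requires sorted input; pin the locale so sort and comm agree.
  LC_ALL=C sort -u "$1" > "$old_sorted"
  LC_ALL=C sort -u "$2" > "$new_sorted"
  LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | sed 's/^/- /'
  LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+ /'
  rm -f "$old_sorted" "$new_sorted"
}
```

How the listings are produced is left open here. One possibility, stated as an assumption, is to run `pnpm list --depth Infinity --parseable` inside each image and save the output; any stable one-line-per-package format works with the helper.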
### Migration trigger
Switch to Option C when **any** of the following becomes true:
1. A production environment (paid customers, real PII) deploys a
Docker-built image from this codebase.
2. A regulatory/audit requirement demands reproducible builds.
3. A supply-chain incident occurs (compromised upstream package) and
we need rollback granularity finer than "rebuild from current `*`".
4. The cache-mount speed win disappears (e.g., CI runner switch removes
BuildKit cache persistence).
### Implementation sketch (when triggered)
1. In `learning_ai_common_plat`, add `scripts/regen-docker-lockfile.sh`:
- Reads each product repo's `package.json`.
- Generates a flattened `pnpm-workspace.yaml` (no sibling paths).
- Runs `pnpm install --lockfile-only` against the Gitea registry.
- Writes `pnpm-lock.docker.yaml` back to the product repo.
2. Each product repo gets a `.gitea/workflows/regen-docker-lockfile.yml`
that runs the script on PR-touch of `package.json` and either:
- commits the regenerated lockfile (auto-PR), or
- fails the PR with a "run regen-docker-lockfile.sh and commit" message.
3. Each product Dockerfile replaces its install line:
```dockerfile
# before
RUN pnpm install --ignore-scripts --lockfile=false
# after
COPY pnpm-lock.docker.yaml ./pnpm-lock.yaml
RUN pnpm install --ignore-scripts --frozen-lockfile
```
4. `.dockerignore` removes the `pnpm-lock.yaml` exclusion (or adds an
explicit include for `pnpm-lock.docker.yaml`).
This work is **not scoped** in the current roadmap and should be its own
small ADR-driven sprint.
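
For orientation, the core of step 1 might be sketched as below. Everything here is an assumption rather than the eventual script: the function names are hypothetical, the workspace file is assumed to list one glob per line under `packages:`, the repo is assumed to need only its root `package.json` for resolution, and registry settings are assumed to live in an `.npmrc` next to it.

```shell
#!/bin/sh
# Sketch of what regen-docker-lockfile.sh might contain (hypothetical names).

# Reads a pnpm-workspace.yaml on stdin and drops list entries that point
# outside the repo, e.g.  - '../learning_ai_common_plat/packages/*'
flatten_workspace() {
  sed "/^[[:space:]]*-[[:space:]]*['\"]\{0,1\}\.\.\//d"
}

# $1 = product repo directory. Resolves dependencies in a scratch directory
# against the flattened workspace and writes pnpm-lock.docker.yaml back.
regen_docker_lockfile() {
  repo_dir=$1
  work_dir=$(mktemp -d)
  status=0
  cp "$repo_dir/package.json" "$work_dir/"
  if [ -f "$repo_dir/.npmrc" ]; then
    cp "$repo_dir/.npmrc" "$work_dir/"   # assumed location of registry config
  fi
  flatten_workspace < "$repo_dir/pnpm-workspace.yaml" \
    > "$work_dir/pnpm-workspace.yaml"
  # --lockfile-only resolves and writes pnpm-lock.yaml without installing.
  (cd "$work_dir" && pnpm install --lockfile-only --ignore-scripts) \
    && cp "$work_dir/pnpm-lock.yaml" "$repo_dir/pnpm-lock.docker.yaml" \
    || status=1
  rm -rf "$work_dir"
  return "$status"
}
```

A call such as `regen_docker_lockfile ../learning_ai_clock` (path illustrative) would then refresh that repo's Docker lockfile. Repos with in-repo workspace members would also need those members' `package.json` files copied into the scratch directory, which this sketch leaves out.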
---
## 5. Status tracking
| Phase | State | Notes |
|---|---|---|
| Decision | ✅ Accepted | This ADR |
| Implementation | ⏸ Deferred | Triggered by §4 conditions |
| Trigger monitor | ⚳ Open | Re-evaluate when Phase D rollout begins |
---
## 6. References
- `docker-build-optimization-roadmap.md` §0 F1, F2 (lockfile findings)
- `docker-build-optimization-roadmap.md` §A3 (deferred phase)
- `docker-build-optimization-roadmap.md` §A2 (BuildKit cache mount that
mitigates the speed concern of Option A)
- `learning_ai_common_plat/AGENTS.md` (canonical pnpm workspace config)