bytelyst-devops-tools/docs/adr/0001-docker-build-lockfile-policy.md
saravanakumardb1 b00af09942 docs(docker): roadmap v8 — peakpulse Phase A done + A3 ADR-0001 accepted
Per § 10 steps 9 + 10.

Step 9: Peakpulse backend Phase A complete.
  cold 72.2 s, warm 2.7 s (96.3% reduction). Pattern from clock applied
  verbatim plus .docker-deps/.gitkeep discoverability fix back-ported
  to clock. Commits:
    peakpulse@11a6bc5  feat(docker): Phase A on peakpulse backend
    peakpulse@6523a1a  fix(docker): track .docker-deps/.gitkeep
    clock@1465e06b1    fix(docker): track .docker-deps/.gitkeep
    clock@d69003c1f    chore: dedupe .docker-deps in .gitignore

Step 10: A3 ADR accepted.
  New file: docs/adr/0001-docker-build-lockfile-policy.md
  Decision: short-term Option A (--lockfile=false) — already shipped in
  Phase A; long-term Option C (vendored pnpm-lock.docker.yaml). Migration
  triggered by production deployment, audit requirement, supply-chain
  incident, or loss of BuildKit cache. Implementation sketch in ADR § 4.

Roadmap doc updates:
  - § A7 metrics table: peakpulse row populated (72.2 s / 2.7 s).
  - § A3: collapsed bullet list into decision-record summary linking ADR.
  - § 10: steps 9 + 10 marked ; status banner v7 → v8.

Next per § 10: step 11 (Phase B docker-prep hardening) or step 12
(Phase E docker-doctor.sh linter). Phase E is higher-value as durable
insurance against F11/F13/F16/F17/F18 regressions across the ecosystem.
2026-05-27 02:54:08 -07:00

8.8 KiB
Raw Blame History

ADR-0001: Docker build lockfile policy

Status: Accepted (decision); Deferred (implementation) · Date: 2026-05-27 Context: docker-build-optimization-roadmap §A3 · Supersedes: None Authors: Platform DevOps


1. Context

The pilot Phase A work in docker-build-optimization-roadmap standardized on pnpm install --lockfile=false inside Docker for both learning_ai_clock (web + backend) and learning_ai_peakpulse (backend). That choice unblocked Phase A by sidestepping a structural mismatch:

  • pnpm-lock.yaml is generated against the outer pnpm workspace, which includes ../learning_ai_common_plat/packages/* as workspace members (sibling-repo path).
  • Inside the Docker build context, the sibling repo doesn't exist (a single-repo build context is intentionally used for hermeticity).
  • --frozen-lockfile therefore fails immediately with workspace resolution errors (finding F2 in the roadmap audit).

--lockfile=false skips lockfile validation entirely and re-resolves all dependencies against the registry on every pnpm install. This is correct for the workspace-mismatch problem but introduces non-determinism: the same Dockerfile + same source tree can produce a different lockset across two builds if upstream @bytelyst/* versions move between them.

Phase A2's BuildKit cache mount mitigates the speed cost of re-resolution but not the determinism cost.

This ADR records the decision on which long-term policy to adopt for Docker builds. Implementation is deferred to a future Phase A3 sprint.


2. Options considered

Option A — Keep --lockfile=false (status quo)

How it works. Docker pnpm install re-resolves on every cold build. Cache mount preserves the pnpm content-addressed store across builds, so warm rebuilds don't pay re-resolution cost.

Pros:

  • Zero churn — already shipped in Phase A.
  • Tolerates sibling-repo workspace mismatch for free.
  • Tolerates * semver across all @bytelyst/* deps without rework.
  • Compatible with the F17 fix (Gitea host.docker.internal URLs).

Cons:

  • Non-deterministic builds. Same Dockerfile + same source can produce different node_modules if a dependency was published between two cold builds. CI runs days apart can ship divergent images for the same commit.
  • No supply-chain pinning. Any compromised upstream auto-rolls forward.
  • pnpm audit on the host can disagree with what's actually inside the image.

Option B — Generate a Docker-only flat lockfile during build

How it works. Add a build step that runs pnpm install --lockfile-only in a temp dir against a flattened pnpm-workspace.yaml that excludes sibling-repo paths, then --frozen-lockfile against that generated lock.

Pros:

  • Deterministic within a single build — same registry state at the moment of the build always produces the same lockset.
  • Doesn't require changes to the source tree's pnpm-workspace.yaml.

Cons:

  • Still non-deterministic across builds (the lock is regenerated each time unless cached separately).
  • Adds Dockerfile complexity and a non-trivial new failure mode (workspace-flattening logic).
  • Marginal value over Option A given the cache mount.

Option C — Vendor a Docker-flattened lockfile in the repo

How it works. Commit a pnpm-lock.docker.yaml (or similar) per repo that's generated against a flattened workspace. Dockerfile uses pnpm install --frozen-lockfile --lockfile=pnpm-lock.docker.yaml.

Pros:

  • Fully deterministic. Same commit → same lockset → same image.
  • Supply chain pins enforced.
  • pnpm audit matches image contents.

Cons:

  • Two lockfiles to maintain (the workspace one + the Docker one).
  • Drift risk between the two — solved only by a CI gate that regenerates the Docker lockfile on every PR that touches package.json.
  • Requires a tested regenerate-on-CI workflow per repo.
  • Workspace flattening logic must be encoded somewhere (script in common-plat/scripts/regen-docker-lockfile.sh).

Option D — Restructure to single-repo workspace (eliminate sibling)

How it works. Inline the consumed @bytelyst/* packages into each product repo (vendor them) so there is no sibling-workspace dependency. Then --frozen-lockfile works trivially.

Pros:

  • Cleanest from a Docker-build-determinism standpoint.

Cons:

  • Massive churn across 14+ product repos.
  • Defeats the entire learning_ai_common_plat shared-package model.
  • Multiplies maintenance cost of @bytelyst/* updates by the number of consumers.
  • Out of scope; would supersede the entire ecosystem architecture.

3. Decision

Adopt Option A (--lockfile=false) as the official short-term policy. Plan to migrate to Option C (pnpm-lock.docker.yaml) when supply-chain determinism becomes a hard requirement (e.g., before any production deployment of a Docker-built image, or before SOC2-style attestation).

Reasoning:

  1. Phase A is already shipped on Option A with verified speed wins (warm rebuilds 2.75.4 s across all surfaces). Switching policies mid-rollout would invalidate metrics + add risk.
  2. The cache mount (Phase A2) addresses the speed concern that Option A creates. The remaining concern is determinism, which is a correctness concern — but the actual blast radius is limited because:
    • All @bytelyst/* deps are first-party and pinned in source repos.
    • Third-party deps already have fixed semver in package.json (no loose * ranges to public registries).
    • The Gitea registry is the only @bytelyst/* source — no public supply-chain risk for the in-house deps.
  3. Option C is the right end state but requires CI infrastructure that doesn't exist yet (auto-regen-on-PR). Building it inside this roadmap is scope creep.
  4. Option B is dominated by Option C — same complexity, weaker guarantees.
  5. Option D is non-starter — it would require redesigning the ByteLyst shared-package model.

4. Consequences

Positive

  • Phase A speed wins are preserved with zero policy churn.
  • pnpm-lock.yaml continues to live in source repos for host development; it stays in .dockerignore for Docker builds.
  • The decision is reversible: switching to Option C in the future is additive (add a Docker lockfile + change one Dockerfile line).

Negative

  • Same commit can produce different Docker images on different days. CI must not assume image hash stability for a given commit.
  • pnpm audit results from the host don't match Docker image contents. Workaround: run pnpm audit inside the built container as a separate CI job (cheap; no rebuild needed).
  • Supply-chain attestation (SOC2, SLSA) cannot be produced for these images today. Acceptable while there is no production traffic.

Migration trigger

Switch to Option C when any of the following becomes true:

  1. A production environment (paid customers, real PII) deploys a Docker-built image from this codebase.
  2. A regulatory/audit requirement demands reproducible builds.
  3. A supply-chain incident occurs (compromised upstream package) and we need rollback granularity finer than "rebuild from current *".
  4. The cache-mount speed win disappears (e.g., CI runner switch removes BuildKit cache persistence).

Implementation sketch (when triggered)

  1. In learning_ai_common_plat, add scripts/regen-docker-lockfile.sh:
    • Reads each product repo's package.json.
    • Generates a flattened pnpm-workspace.yaml (no sibling paths).
    • Runs pnpm install --lockfile-only against the Gitea registry.
    • Writes pnpm-lock.docker.yaml back to the product repo.
  2. Each product repo gets a .gitea/workflows/regen-docker-lockfile.yml that runs the script on PR-touch of package.json and either:
    • commits the regenerated lockfile (auto-PR), or
    • fails the PR with a "run regen-docker-lockfile.sh and commit" message.
  3. Each product Dockerfile changes one line:
    # before
    RUN pnpm install --ignore-scripts --lockfile=false
    # after
    COPY pnpm-lock.docker.yaml ./pnpm-lock.yaml
    RUN pnpm install --ignore-scripts --frozen-lockfile
    
  4. .dockerignore removes pnpm-lock.yaml exclusion (or adds explicit include for pnpm-lock.docker.yaml).

This work is not scoped in the current roadmap and should be its own small ADR-driven sprint.


5. Status tracking

Phase State Notes
Decision Accepted This ADR
Implementation ⏸ Deferred Triggered by §4 conditions
Trigger monitor ⚳ Open Re-evaluate when Phase D rollout begins

6. References

  • docker-build-optimization-roadmap.md §0 F1, F2 (lockfile findings)
  • docker-build-optimization-roadmap.md §A3 (deferred phase)
  • docker-build-optimization-roadmap.md §A2 (BuildKit cache mount that mitigates the speed concern of Option A)
  • learning_ai_common_plat/AGENTS.md (canonical pnpm workspace config)