saravanakumardb1 b00af09942 docs(docker): roadmap v8 — peakpulse Phase A done + A3 ADR-0001 accepted

Per § 10 steps 9 + 10.

Step 9: Peakpulse backend Phase A complete.
  cold 72.2 s, warm 2.7 s (96.3% reduction). Pattern from clock applied
  verbatim plus .docker-deps/.gitkeep discoverability fix back-ported
  to clock. Commits:
    peakpulse@11a6bc5  feat(docker): Phase A on peakpulse backend
    peakpulse@6523a1a  fix(docker): track .docker-deps/.gitkeep
    clock@1465e06b1    fix(docker): track .docker-deps/.gitkeep
    clock@d69003c1f    chore: dedupe .docker-deps in .gitignore

Step 10: A3 ADR accepted.
  New file: docs/adr/0001-docker-build-lockfile-policy.md
  Decision: short-term Option A (--lockfile=false) — already shipped in
  Phase A; long-term Option C (vendored pnpm-lock.docker.yaml). Migration
  triggered by production deployment, audit requirement, supply-chain
  incident, or loss of BuildKit cache. Implementation sketch in ADR § 4.

Roadmap doc updates:
  - § A7 metrics table: peakpulse row populated (72.2 s / 2.7 s).
  - § A3: collapsed bullet list into decision-record summary linking ADR.
  - § 10: steps 9 + 10 marked ✅; status banner v7 → v8.

Next per § 10: step 11 (Phase B docker-prep hardening) or step 12
(Phase E docker-doctor.sh linter). Phase E is higher-value as durable
insurance against F11/F13/F16/F17/F18 regressions across the ecosystem.

2026-05-27 02:54:08 -07:00

ADR-0001: Docker build lockfile policy

Status: Accepted (decision); Deferred (implementation) · Date: 2026-05-27
Context: docker-build-optimization-roadmap §A3 · Supersedes: None
Authors: Platform DevOps

1. Context

The pilot Phase A work in docker-build-optimization-roadmap standardized on pnpm install --lockfile=false inside Docker for both learning_ai_clock (web + backend) and learning_ai_peakpulse (backend). That choice unblocked Phase A by sidestepping a structural mismatch:

pnpm-lock.yaml is generated against the outer pnpm workspace, which includes ../learning_ai_common_plat/packages/* as workspace members (sibling-repo path).
Inside the Docker build context, the sibling repo doesn't exist (a single-repo build context is intentionally used for hermeticity).
--frozen-lockfile therefore fails immediately with workspace resolution errors (finding F2 in the roadmap audit).
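
The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming the workspace globs have already been read out of pnpm-workspace.yaml into a list. The function name `members_outside_context` is hypothetical and not part of the roadmap tooling, and the `packages/*` entry is an illustrative in-repo member.

```python
import posixpath


def members_outside_context(workspace_globs):
    """Return the workspace globs that resolve outside the build context root.

    Assumption: globs are POSIX-style paths relative to the repo root,
    as they would appear under `packages:` in pnpm-workspace.yaml.
    """
    outside = []
    for pattern in workspace_globs:
        # Negated patterns ("!foo/**") only exclude members; skip them.
        if pattern.startswith("!"):
            continue
        normalized = posixpath.normpath(pattern)
        # After normalization, a path that escapes the root starts with "..".
        escapes_root = normalized == ".." or normalized.startswith("../")
        if escapes_root or posixpath.isabs(normalized):
            outside.append(pattern)
    return outside


# "packages/*" is a hypothetical in-repo member; the sibling path is from this ADR.
globs = ["packages/*", "../learning_ai_common_plat/packages/*"]
print(members_outside_context(globs))
```

Any non-empty result means the lockfile was generated against members that a single-repo build context cannot see, which is the condition under which --frozen-lockfile fails.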

--lockfile=false skips lockfile validation entirely and re-resolves all dependencies against the registry on every pnpm install. This is correct for the workspace-mismatch problem but introduces non-determinism: the same Dockerfile + same source tree can produce a different lockset across two builds if upstream @bytelyst/* versions move between them.
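
To make the non-determinism concrete, here is a toy model of resolving a `*` range against two registry snapshots. It is a sketch only: the version numbers are invented for illustration, and real pnpm resolution also handles full semver ranges, dist-tags, and pre-releases, all of which this ignores.

```python
def resolve_star(published_versions):
    """Pick the highest plain MAJOR.MINOR.PATCH version, as a `*` range might.

    Simplification: no pre-release tags and no dist-tags.
    """
    return max(
        published_versions,
        key=lambda version: tuple(int(part) for part in version.split(".")),
    )


# Hypothetical registry contents at two cold builds of the same commit.
registry_first_build = ["1.4.0", "1.4.1"]
registry_second_build = ["1.4.0", "1.4.1", "1.5.0"]

print(resolve_star(registry_first_build))   # 1.4.1
print(resolve_star(registry_second_build))  # 1.5.0
```

The source tree is identical in both runs; only the registry state differs, which is enough to change what ends up in node_modules.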

Phase A2's BuildKit cache mount mitigates the speed cost of re-resolution but not the determinism cost.

This ADR records the decision on which long-term policy to adopt for Docker builds. Implementation is deferred to a future Phase A3 sprint.

2. Options considered

Option A — Keep `--lockfile=false` (status quo)

How it works. pnpm install inside Docker re-resolves on every cold build. The BuildKit cache mount preserves the pnpm content-addressed store across builds, so warm rebuilds don't pay the re-resolution cost.

Pros:

Zero churn — already shipped in Phase A.
Tolerates sibling-repo workspace mismatch for free.
Tolerates * semver across all @bytelyst/* deps without rework.
Compatible with the F17 fix (Gitea host.docker.internal URLs).

Cons:

Non-deterministic builds. Same Dockerfile + same source can produce different node_modules if a dependency was published between two cold builds. CI runs days apart can ship divergent images for the same commit.
No supply-chain pinning. Any compromised upstream auto-rolls forward.
pnpm audit on the host can disagree with what's actually inside the image.

Option B — Generate a Docker-only flat lockfile during build

How it works. Add a build step that runs pnpm install --lockfile-only in a temp dir against a flattened pnpm-workspace.yaml that excludes sibling-repo paths, then --frozen-lockfile against that generated lock.

Pros:

Deterministic within a single build — same registry state at the moment of the build always produces the same lockset.
Doesn't require changes to the source tree's pnpm-workspace.yaml.

Cons:

Still non-deterministic across builds (the lock is regenerated each time unless cached separately).
Adds Dockerfile complexity and a non-trivial new failure mode (workspace-flattening logic).
Marginal value over Option A given the cache mount.

Option C — Vendor a Docker-flattened lockfile in the repo

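How it works. Commit a pnpm-lock.docker.yaml (or similar) per repo that's generated against a flattened workspace. The Dockerfile copies it into place as pnpm-lock.yaml and runs pnpm install --frozen-lockfile, because pnpm's --lockfile flag is a boolean toggle rather than a path to an alternative lockfile.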

Pros:

Fully deterministic. Same commit → same lockset → same image.
Supply chain pins enforced.
pnpm audit matches image contents.

Cons:

Two lockfiles to maintain (the workspace one + the Docker one).
Drift risk between the two — solved only by a CI gate that regenerates the Docker lockfile on every PR that touches package.json.
Requires a tested regenerate-on-CI workflow per repo.
Workspace flattening logic must be encoded somewhere (e.g., a script at learning_ai_common_plat/scripts/regen-docker-lockfile.sh).
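
The CI gate mentioned in the drift-risk item could start as a simple changed-files check. This sketch assumes the CI system can supply the list of paths changed in a PR. `docker_lockfile_is_stale` is a hypothetical name, and a real gate would regenerate and diff the lockfile rather than only compare paths.

```python
import posixpath

DOCKER_LOCKFILE = "pnpm-lock.docker.yaml"


def docker_lockfile_is_stale(changed_paths):
    """Return True if a PR edits any package.json without updating the Docker lockfile.

    Path-based heuristic only: it cannot tell whether the committed
    lockfile actually matches the manifests.
    """
    names = {posixpath.basename(path) for path in changed_paths}
    touches_manifest = "package.json" in names
    touches_lockfile = DOCKER_LOCKFILE in names
    return touches_manifest and not touches_lockfile


print(docker_lockfile_is_stale(["backend/package.json"]))  # True
```

A check like this is cheap enough to run on every PR, which is why it is a plausible first step before building the full regenerate-and-compare workflow.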

Option D — Restructure to single-repo workspace (eliminate sibling)

How it works. Inline the consumed @bytelyst/* packages into each product repo (vendor them) so there is no sibling-workspace dependency. Then --frozen-lockfile works trivially.

Pros:

Cleanest from a Docker-build-determinism standpoint.

Cons:

Massive churn across 14+ product repos.
Defeats the entire learning_ai_common_plat shared-package model.
Multiplies maintenance cost of @bytelyst/* updates by the number of consumers.
Out of scope; would supersede the entire ecosystem architecture.

3. Decision

Adopt Option A (--lockfile=false) as the official short-term policy. Plan to migrate to Option C (pnpm-lock.docker.yaml) when supply-chain determinism becomes a hard requirement (e.g., before any production deployment of a Docker-built image, or before SOC2-style attestation).

Reasoning:

Phase A is already shipped on Option A with verified speed wins (warm rebuilds 2.7–5.4 s across all surfaces). Switching policies mid-rollout would invalidate metrics + add risk.
The cache mount (Phase A2) addresses the speed concern that Option A creates. The remaining concern is determinism, which is a correctness concern — but the actual blast radius is limited because:
- All @bytelyst/* deps are first-party and pinned in source repos.
- Third-party deps already have fixed semver in package.json (no loose * ranges to public registries).
- The Gitea registry is the only @bytelyst/* source — no public supply-chain risk for the in-house deps.
Option C is the right end state but requires CI infrastructure that doesn't exist yet (auto-regen-on-PR). Building it inside this roadmap is scope creep.
Option B is dominated by Option C — same complexity, weaker guarantees.
Option D is a non-starter: it would require redesigning the ByteLyst shared-package model.

4. Consequences

Positive

Phase A speed wins are preserved with zero policy churn.
pnpm-lock.yaml continues to live in source repos for host development; it stays in .dockerignore for Docker builds.
The decision is reversible: switching to Option C in the future is additive (add a Docker lockfile and change the Dockerfile install step).

Negative

Same commit can produce different Docker images on different days. CI must not assume image hash stability for a given commit.
pnpm audit results from the host don't match Docker image contents. Workaround: run pnpm audit inside the built container as a separate CI job (cheap; no rebuild needed).
Supply-chain attestation (SOC2, SLSA) cannot be produced for these images today. Acceptable while there is no production traffic.

Migration trigger

Switch to Option C when any of the following becomes true:

A production environment (paid customers, real PII) deploys a Docker-built image from this codebase.
A regulatory/audit requirement demands reproducible builds.
A supply-chain incident occurs (compromised upstream package) and we need rollback granularity finer than "rebuild from current *".
The cache-mount speed win disappears (e.g., CI runner switch removes BuildKit cache persistence).
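
The four triggers above can be recorded as explicit flags so the "Trigger monitor" review has something concrete to evaluate. This is a minimal sketch; the class and field names are hypothetical labels for the conditions listed, not an existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationTriggers:
    """One flag per migration trigger listed in this ADR."""

    production_deployment: bool = False
    reproducibility_required: bool = False
    supply_chain_incident: bool = False
    cache_persistence_lost: bool = False

    def fired(self):
        """Return the names of the triggers that are currently true."""
        return [name for name, value in vars(self).items() if value]

    def should_migrate(self):
        # The ADR says "any of the following", so one trigger is enough.
        return bool(self.fired())


status = MigrationTriggers(supply_chain_incident=True)
print(status.should_migrate(), status.fired())
```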

Implementation sketch (when triggered)

In learning_ai_common_plat, add scripts/regen-docker-lockfile.sh:
- Reads each product repo's package.json.
- Generates a flattened pnpm-workspace.yaml (no sibling paths).
- Runs pnpm install --lockfile-only against the Gitea registry.
- Writes pnpm-lock.docker.yaml back to the product repo.
Each product repo gets a .gitea/workflows/regen-docker-lockfile.yml that runs the script on PR-touch of package.json and either:
- commits the regenerated lockfile (auto-PR), or
- fails the PR with a "run regen-docker-lockfile.sh and commit" message.

Each product Dockerfile changes its install step (the single RUN line becomes a COPY plus a RUN):

# before
RUN pnpm install --ignore-scripts --lockfile=false
# after
COPY pnpm-lock.docker.yaml ./pnpm-lock.yaml
RUN pnpm install --ignore-scripts --frozen-lockfile

Each product .dockerignore drops its pnpm-lock.yaml exclusion (or adds an explicit include for pnpm-lock.docker.yaml).
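
The workspace-flattening step of the regen script could look roughly like the following. It is a sketch under stated assumptions: the globs are already parsed into a list, `packages/*` stands in for whatever in-repo members a product actually has, and both function names are hypothetical. The actual script is described here as a shell script; Python is used only to illustrate the logic.

```python
import posixpath


def flatten_workspace(workspace_globs):
    """Drop workspace globs that point outside the repo (sibling-repo paths)."""
    kept = []
    for pattern in workspace_globs:
        # Strip a leading "!" so negated sibling patterns are dropped as well.
        normalized = posixpath.normpath(pattern.lstrip("!"))
        if normalized == ".." or normalized.startswith("../"):
            continue
        kept.append(pattern)
    return kept


def render_workspace_yaml(workspace_globs):
    """Render a minimal pnpm-workspace.yaml containing only a `packages:` list."""
    lines = ["packages:"]
    lines += [f'  - "{pattern}"' for pattern in workspace_globs]
    return "\n".join(lines) + "\n"


original = ["packages/*", "../learning_ai_common_plat/packages/*"]
print(render_workspace_yaml(flatten_workspace(original)), end="")
```

With the sibling members removed, the @bytelyst/* packages are no longer workspace links, so the subsequent pnpm install --lockfile-only has to resolve them from the Gitea registry, as the first step of the sketch describes.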

This work is not scoped in the current roadmap and should be its own small ADR-driven sprint.

5. Status tracking

| Phase | State | Notes |
| --- | --- | --- |
| Decision | ✅ Accepted | This ADR |
| Implementation | ⏸ Deferred | Triggered by §4 conditions |
| Trigger monitor | ⚳ Open | Re-evaluate when Phase D rollout begins |

6. References

docker-build-optimization-roadmap.md §0 F1, F2 (lockfile findings)
docker-build-optimization-roadmap.md §A3 (deferred phase)
docker-build-optimization-roadmap.md §A2 (BuildKit cache mount that mitigates the speed concern of Option A)
learning_ai_common_plat/AGENTS.md (canonical pnpm workspace config)
