docs(deploy): add deployment optimization roadmap

Document a phased roadmap for the single-VM deployment layer (build-off-VM, recreate-in-place to cut downtime, change-detection + BuildKit guarantee, image slimming + resource caps, artifact-based rollback). Scoped to deploy orchestration; defers image-build internals to docker-build-optimization-roadmap. Register the doc in repo-map.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 00:40:10 -07:00
commit f837512026
parent 9d871282c3
2 changed files with 261 additions and 0 deletions
--- a/docs/deployment-optimization-roadmap.md
+++ b/docs/deployment-optimization-roadmap.md
@@ -0,0 +1,260 @@
+# Deployment Optimization Roadmap
+
+> **Status:** v1 (**PROPOSED** — analysis complete, no changes applied yet) · **Owner:** Platform DevOps · **Created:** 2026-05-31
+>
+> Optimize the **deployment orchestration layer** for the single-Azure-VM MVP:
+> reduce deploy wall-clock time, eliminate deploy-time downtime, and stop the
+> production VM from running out of RAM/disk during builds.
+>
+> **Scope boundary — read this first.** This roadmap is about *how we ship and
+> run* images on the VM (the `deploy-*.sh` scripts, `docker compose` lifecycle,
+> registry strategy, VM resource limits). The complementary *image-build*
+> concerns (pnpm install speed, BuildKit cache mounts, `.npmrc.docker`, Gitea
+> registry correctness, "build green / app broken" silent failures) live in
+> [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md)
+> and are **out of scope here** — this doc references that work, never duplicates it.
+
+---
+
+## 0. Current state (audited 2026-05-31)
+
+The production deployment model is **Docker Compose on a single Azure VM**,
+fronted by Caddy on `80/443` (see
+[`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) and
+[`learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md)).
+Four production repos deploy here: `learning_ai_invt_trdg`,
+`learning_ai_common_plat`, `learning_ai_clock`, `learning_ai_notes`.
+
+**What is already good** (do not redo): the per-repo Dockerfiles are
+multi-stage, use `node:22-alpine`, mount a BuildKit pnpm-store cache, inject the
+Gitea token via a BuildKit secret, and emit Next.js `standalone` output. The
+image-build layer is in good shape — credit to the build-optimization roadmap.
+
+**The bottlenecks are in the deployment flow, not the Dockerfiles.**
+
+| # | Finding | Location | Symptom it causes | Severity |
+|---|---|---|---|---|
+| **D1** | **Images are built _on the production VM_.** `docker build` runs on the box while ~31 containers (Cosmos emulator + Azurite are RAM-heavy) are live. The VM has **less RAM than the deploy scripts assumed** (per the deployment-status doc), so builds thrash/swap. | `deploy-invttrdg.sh` build step; `deploy-all.sh` `docker compose build` | Slow builds **and** memory pressure **and** risk of OOM-killing live services | **High (keystone)** |
+| **D2** | **Blanket `docker compose down` → `up -d --force-recreate`.** Every deploy stops _all_ services and restarts them cold, even for a one-line change. | `deploy-invttrdg.sh` (down then force-recreate) | Deploy-time downtime + cold caches on every release | **High** |
+| **D3** | **`deploy-all.sh` rebuilds every service in every repo, sequentially.** No change-detection, no parallelism; loops repos one-by-one and runs `docker compose build` (all services). | `deploy-all.sh` deploy loop | Multi-repo "deploy all" takes minutes even when one file changed | **High** |
+| **D4** | **`deploy-all.sh` does not guarantee BuildKit.** It calls plain `docker compose build` with no `DOCKER_BUILDKIT=1` / `COMPOSE_DOCKER_CLI_BUILD=1`. On an older Docker it can silently fall back to the legacy builder and **lose the Dockerfile `--mount=type=cache` pnpm-store wins** the build roadmap added. | `deploy-all.sh` build step | Silent slow path; warm rebuilds behave like cold builds | **Medium** |
+| **D5** | **Whole `node_modules` (dev deps included) copied into the runtime stage.** Backend runtime `COPY --from=builder .../node_modules` is the full install. | `*/backend/Dockerfile` runtime stage | Larger images → more disk, slower `pull`/start | **Medium** |
+| **D6** | **No RAM/disk guardrails on the VM.** No `mem_limit`/reservations on Cosmos emulator/Azurite; no scheduled image prune; no container log rotation. | `docker-compose*.yml`, VM cron | Disk creeps to full + a single container can starve the box (silent cause of "deploys suddenly got slow") | **Medium** |
+| **D7** | **Rollback requires a rebuild.** `DEPLOYMENT_GUIDE.md` rollback path is `git revert` + `docker compose up -d --build` — i.e. rebuild on the VM to roll back. Images are tagged only `:latest`, so there is no immutable artifact to re-point to. | `DEPLOYMENT_GUIDE.md` rollback section; image tags | Slow, risky rollbacks; no clean "known-good" artifact | **Medium** |
+
+**Implications**
+
+- **D1 is the keystone.** Moving builds off the VM removes the build/runtime
+  resource contention that drives slow builds and memory pressure, _and_ (with
+  SHA-tagged images) enables fast rollback. Most other items compound on top of it.
+- D2 alone removes the majority of deploy-time downtime and is low-risk.
+- D3/D4 are the "deploy all is slow" story; D5/D6 are the disk/memory story.
+
+---
+
+## 1. Goals & non-goals
+
+**Goals**
+
+- Cut warm deploy wall-clock time to **seconds** for a single changed service.
+- **Zero (or near-zero) downtime** for routine deploys.
+- Keep the production VM's RAM/disk **predictable and bounded** during deploys.
+- Make rollback an **artifact re-point**, not a rebuild.
+
+**Non-goals**
+
+- ❌ Adopting Kubernetes / Swarm / Nomad. Compose-on-one-VM is the correct
+  model at MVP; revisit orchestration only when we outgrow a single host.
+- ❌ Re-doing image-build internals (pnpm, BuildKit cache, Gitea path) — owned
+  by [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
+- ❌ Full blue/green infra. Phase 2 already removes most downtime; multi-replica
+  comes only when traffic justifies it.
+
+### Measurement targets
+
+| Metric | Baseline (observed/estimated) | Target |
+|---|---|---|
+| Warm deploy, 1 service changed | ~2–3 min (build-on-VM) | **< 30 s** (pull + recreate one service) |
+| Deploy-time downtime per service | full stop/start cycle | **near-zero** (recreate-in-place: only the changed service restarts briefly, everything else stays up; true zero needs the two-replica option in 2.4) |
+| Peak VM RAM during deploy | build spike on top of live stack | **no build spike** (build is off-VM) |
+| Rollback to previous release | rebuild on VM (minutes) | **< 30 s** (re-point SHA tag + `up -d`) |
+
+> Fill in actuals during Phase 3.
+
+---
+
+## 2. Phased roadmap (why / what / how)
+
+Phases are ordered by leverage. **Phase 1 is the keystone** — do it first; the
+rest compound on it.
+
+### Phase 1 — Build off the VM, ship images, VM pulls  ⟵ keystone
+
+**Why.** D1 is the single biggest cause of slow builds, memory pressure, and
+risky rollback. The production VM should *run* containers, not *compile* them.
+Removing build work from the box frees its scarce RAM/CPU and turns deploys into
+a fast `pull` + recreate. Tagging images by commit SHA gives an immutable,
+re-pointable artifact (fixes D7).
+
+**What.**
+- Build images in CI (GitHub Actions / Gitea Actions) or on a dedicated builder.
+- Push to the Gitea container/image registry, tagged `:<commit-sha>` **and**
+  `:latest`.
+- The VM deploy step becomes `docker compose pull && docker compose up -d` — no
+  `docker build` on the box.
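+
+A minimal sketch of such a CI job, assuming a GitHub/Gitea Actions runner with
+Docker available. The workflow path, the registry host `registry.example.com`,
+the `acme/backend` image path, the `./backend` build context, and the
+`REGISTRY_USER`/`REGISTRY_TOKEN` secret names are hypothetical placeholders. The
+BuildKit secret the Dockerfiles use for the Gitea token is omitted here and
+would also have to be passed to `docker build`.
+
+```yaml
+# .gitea/workflows/build-images.yml (hypothetical path and names throughout)
+name: build-images
+on:
+  push:
+    branches: [main]
+jobs:
+  backend:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Log in to the registry
+        run: >
+          echo "${{ secrets.REGISTRY_TOKEN }}" |
+          docker login registry.example.com
+          -u "${{ secrets.REGISTRY_USER }}" --password-stdin
+      - name: Build and push, tagged by commit SHA and latest
+        env:
+          DOCKER_BUILDKIT: "1"  # make the BuildKit path explicit
+        run: |
+          IMAGE=registry.example.com/acme/backend
+          docker build -t "$IMAGE:${{ github.sha }}" -t "$IMAGE:latest" ./backend
+          docker push "$IMAGE:${{ github.sha }}"
+          docker push "$IMAGE:latest"
+```
+
+Pushing the SHA tag alongside `:latest` is what later lets the VM deploy or
+roll back to one specific commit.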
+
+**How (checklist).**
+- [ ] **1.1** Stand up / confirm an image registry (reuse Gitea on the VM, or a
+      hosted registry). Decide auth: reuse the existing `GITEA_NPM_TOKEN` pattern
+      from `deploy-invttrdg.sh` for `docker login`.
+- [ ] **1.2** Add a CI build job per production repo: build backend + web images
+      with BuildKit, tag `:<git-sha>` + `:latest`, push to the registry. Reuse the
+      existing `BYTELYST_COMMIT_*` build args already collected in
+      `deploy-invttrdg.sh`.
+- [ ] **1.3** Parameterize each `docker-compose.yml` service to use
+      `image: <registry>/<svc>:${IMAGE_TAG:-latest}` instead of a local `build:`
+      context for production. (Keep `build:` for local dev via an override file.)
+- [ ] **1.4** Rewrite the VM-side deploy path to `docker compose pull` then
+      `docker compose up -d` (no build). Pass `IMAGE_TAG=<sha>` to deploy a
+      specific release.
+- [ ] **1.5** Keep a thin "emergency build-on-VM" fallback flag for when the
+      registry/CI is unavailable, but make pull-based the default.
+- [ ] **1.6** Verify: a clean deploy uses **zero** `tsc`/`next build` CPU on the VM.
+
+**Done when:** deploys to the VM perform no compilation; images are addressable
+by commit SHA in the registry.
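+
+Item 1.3 could look roughly like the following Compose fragments. This is a
+sketch, not the repos' actual files: the service name `backend`, the registry
+path, the `./backend` build context, and the `docker-compose.dev.yml` file name
+are assumptions for illustration.
+
+```yaml
+# docker-compose.yml (production): run a prebuilt image, no build context.
+services:
+  backend:  # hypothetical service name
+    # IMAGE_TAG selects the release; falls back to "latest" when unset.
+    image: registry.example.com/acme/backend:${IMAGE_TAG:-latest}
+```
+
+```yaml
+# docker-compose.dev.yml (local dev only, hypothetical file name), layered on
+# with: docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build
+services:
+  backend:
+    build:
+      context: ./backend
+```
+
+With this split, `IMAGE_TAG=<sha> docker compose pull && docker compose up -d`
+on the VM resolves to the image CI pushed for that commit.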
+
+---
+
+### Phase 2 — Stop-the-world → recreate-in-place
+
+**Why.** D2: `docker compose down` + `--force-recreate` takes everything down on
+every deploy. Plain `up -d` already recreates only the containers whose
+image/config changed, leaving the rest running.
+
+**What.** Remove the blanket `down`; rely on Compose's differential recreate.
+Target individual services where possible.
+
+**How (checklist).**
+- [ ] **2.1** Remove `docker compose down` from the production deploy path
+      (`deploy-invttrdg.sh`). Replace `up -d --force-recreate` (all services)
+      with plain `up -d` (differential). Keep `--force-recreate` only for the
+      case where the image changed but the config did not and you are _not_
+      using SHA tags; after Phase 1, the SHA tag change makes Compose recreate
+      the affected services automatically.
+- [ ] **2.2** Support per-service deploys: `docker compose up -d --no-deps <svc>`
+      so deploying the backend doesn't bounce unrelated services.
+- [ ] **2.3** Confirm every service has a correct healthcheck (cross-check the
+      IPv6/`localhost` healthcheck pitfall documented as F12 in the build roadmap)
+      and deploy with `up -d --wait`, so the command blocks until services report
+      healthy; plain `up -d` returns as soon as the containers have started.
+- [ ] **2.4** (Later / optional) For true zero-downtime on a hot service, run two
+      replicas behind Caddy and recreate one at a time. Defer until traffic needs it.
+
+**Done when:** a routine single-service deploy does not interrupt the other
+running services.
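+
+A healthcheck along the lines of item 2.3 might look like the fragment below.
+It is a sketch under stated assumptions: the service name, port `3000`, and the
+timing values are placeholders, and it assumes the service exposes the
+`/api/devops/version` endpoint. It also assumes the F12 pitfall refers to
+`localhost` resolving to the IPv6 loopback while the app listens on IPv4 only;
+check the build roadmap for the actual finding.
+
+```yaml
+services:
+  backend:  # hypothetical service name
+    healthcheck:
+      # 127.0.0.1 instead of localhost sidesteps IPv6 loopback resolution
+      # (assumed to be the F12 pitfall). BusyBox wget ships with Alpine images.
+      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3000/api/devops/version"]
+      interval: 10s      # placeholder timings, tune per service
+      timeout: 3s
+      retries: 5
+      start_period: 20s
+```
+
+With a healthcheck in place, `docker compose up -d --wait --no-deps backend`
+returns only once the recreated container reports healthy, or fails if it
+never does.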
+
+---
+
+### Phase 3 — Deploy only what changed; guarantee the fast path
+
+**Why.** D3/D4: `deploy-all.sh` rebuilds everything sequentially and may silently
+drop BuildKit. After Phase 1 most of this moves to CI, but the VM-side and any
+remaining build paths should still skip untouched services and never fall back to
+the legacy builder.
+
+**What.** Change-detection on what to deploy + an explicit BuildKit guarantee for
+any path that still builds.
+
+**How (checklist).**
+- [ ] **3.1** In `deploy-all.sh`, compute changed services via
+      `git diff --name-only <last-deployed-sha>..HEAD` and skip services with no
+      changes. Record the last-deployed SHA per repo (e.g. a small state file or
+      the registry tag).
+- [ ] **3.2** Export `DOCKER_BUILDKIT=1` (plus `COMPOSE_DOCKER_CLI_BUILD=1` where
+      the legacy `docker-compose` v1 binary is still in use), or switch to
+      `docker buildx bake`, anywhere a build still runs, so the Dockerfile
+      `--mount=type=cache` pnpm-store wins are never silently lost (D4).
+- [ ] **3.3** Where multiple independent images must build, build them in
+      parallel (`buildx bake`, or `docker compose build --parallel`) instead of
+      the current sequential loop.
+- [ ] **3.4** Capture before/after timings into the Measurement targets table above.
+
+**Done when:** "deploy all" only touches changed services and always uses the
+warm-cache build path.
+
+---
+
+### Phase 4 — Image size & VM resource guardrails
+
+**Why.** D5/D6: bloated images and unbounded disk/RAM are the slow-burn causes of
+"the VM filled up / a build OOM'd the box." Caps make pressure predictable.
+
+**What.** Prune runtime deps, cap memory, rotate logs, prune images.
+
+**How (checklist).**
+- [ ] **4.1** In each backend runtime stage, install production-only deps
+      (`pnpm install --prod` / `pnpm deploy --prod`) instead of copying the full
+      builder `node_modules` (D5). Verify the app still starts.
+- [ ] **4.2** Add `mem_limit` + `mem_reservation` (and sensible `cpus`) to the
+      RAM-heavy services first — Cosmos emulator, Azurite — then the rest, in the
+      `docker-compose*.yml` files.
+- [ ] **4.3** Add Docker daemon log rotation (`json-file` with `max-size` +
+      `max-file`, or ship logs to Loki only) so container logs can't fill disk.
+- [ ] **4.4** Add a scheduled `docker image prune -f` (and `builder prune`) on the
+      VM to reclaim dangling layers left by rebuilds.
+- [ ] **4.5** Add a small swap file on the VM as an OOM safety net for any
+      residual on-box work; alert when disk > 80%.
+
+**Done when:** runtime images are prod-only, every heavy service is memory-capped,
+and disk usage is bounded by rotation + prune.
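+
+Items 4.2 and 4.3 can both be expressed per service in Compose. The fragment
+below is a sketch only: the service name is hypothetical and every number is a
+placeholder to be sized against the VM's real RAM and the emulator's observed
+usage.
+
+```yaml
+services:
+  cosmos-emulator:  # hypothetical service name
+    mem_limit: 3g          # hard cap (placeholder value)
+    mem_reservation: 2g    # soft reservation (placeholder value)
+    cpus: 2.0              # placeholder value
+    logging:
+      driver: json-file
+      options:
+        max-size: "10m"    # rotate each log file at 10 MB
+        max-file: "3"      # keep at most three rotated files
+```
+
+Setting `logging` per service keeps the rotation policy in the repo; the same
+`max-size`/`max-file` options can instead be set once in the Docker daemon
+configuration, where they apply only to containers created afterwards.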
+
+---
+
+### Phase 5 — Fast, artifact-based rollback
+
+**Why.** D7: rollback today means rebuild-on-VM. With SHA-tagged images from
+Phase 1, rollback becomes re-pointing to a known-good tag.
+
+**What.** A `rollback` command that redeploys a previous image tag.
+
+**How (checklist).**
+- [ ] **5.1** Keep the last N image SHAs in the registry (don't prune the most
+      recent good tags).
+- [ ] **5.2** Add an `IMAGE_TAG=<previous-sha> docker compose up -d` rollback path
+      (and a `deploy-*.sh --rollback [sha]` wrapper).
+- [ ] **5.3** Update [`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) rollback
+      section to use tag re-point instead of `git revert` + rebuild.
+- [ ] **5.4** Document how to find the currently-deployed SHA (the
+      `/api/devops/version` endpoint already exposed and checked by
+      `deploy-invttrdg.sh`).
+
+**Done when:** rolling back is a sub-30s tag re-point with no rebuild.
+
+---
+
+## 3. Quick-reference summary
+
+| Phase | Theme | Fixes | Primary symptom addressed |
+|---|---|---|---|
+| **1** | Build off VM, pull images | D1, D7 | Slow builds + memory pressure + rollback |
+| **2** | Recreate-in-place | D2 | Downtime |
+| **3** | Deploy only changed + BuildKit guarantee | D3, D4 | Slow "deploy all" |
+| **4** | Image slimming + resource caps | D5, D6 | Disk/memory |
+| **5** | Artifact rollback | D7 | Rollback speed/safety |
+
+**Suggested order:** Phase 1 → 2 (an estimated ≈80% of the benefit across all
+three symptoms: slow deploys, deploy-time downtime, and RAM/disk pressure), then
+3 → 4 → 5.
+
+## 4. Explicitly out of scope
+
+- Image-build internals (pnpm/BuildKit/Gitea/`.npmrc.docker`/silent-break
+  correctness) — see
+  [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
+- Migration to Kubernetes/Swarm or a managed cloud runtime.
+- Multi-platform image builds.
+
+## 5. Related docs
+
+- [`../DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) — current production deploy procedure
+- [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md) — image-build layer
+- [`VM_OBSERVABILITY_ROADMAP.md`](VM_OBSERVABILITY_ROADMAP.md) — metrics/monitoring for the VM
+- [`vm-security-blind-spots-roadmap.md`](vm-security-blind-spots-roadmap.md) — VM hardening
+- [`../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md) — VM deployment status snapshot
--- a/docs/repo-map.md
+++ b/docs/repo-map.md
@@ -52,6 +52,7 @@ Current key files:
 - `docs/remove_user_interactive.md`
 - `docs/hermes-setup-upgrade-roadmap.md`
 - `docs/hermes-operations.md`
+- `docs/deployment-optimization-roadmap.md` - deployment orchestration roadmap (build-off-VM, zero-downtime, VM resource caps); complements `docker-build-optimization-roadmap.md`
 - `docs/llm-utility-workflows.md`
 - `docs/gitea-registry-and-package-resolution.md`
 - `docs/vm-security-blind-spots-roadmap.md`