From f837512026834450611d69a068b8857115c850b4 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Sun, 31 May 2026 00:40:10 -0700
Subject: [PATCH] docs(deploy): add deployment optimization roadmap

Document a phased roadmap for the single-VM deployment layer (build-off-VM,
recreate-in-place to cut downtime, change-detection + BuildKit guarantee,
image slimming + resource caps, artifact-based rollback). Scoped to deploy
orchestration; defers image-build internals to
docker-build-optimization-roadmap. Register the doc in repo-map.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
---
 docs/deployment-optimization-roadmap.md | 260 ++++++++++++++++++++++++
 docs/repo-map.md                        |   1 +
 2 files changed, 261 insertions(+)
 create mode 100644 docs/deployment-optimization-roadmap.md

diff --git a/docs/deployment-optimization-roadmap.md b/docs/deployment-optimization-roadmap.md
new file mode 100644
index 0000000..221008f
--- /dev/null
+++ b/docs/deployment-optimization-roadmap.md
@@ -0,0 +1,260 @@
+# Deployment Optimization Roadmap
+
+> **Status:** v1 (**PROPOSED** — analysis complete, no changes applied yet) · **Owner:** Platform DevOps · **Created:** 2026-05-31
+>
+> Optimize the **deployment orchestration layer** for the single-Azure-VM MVP:
+> reduce deploy wall-clock time, eliminate deploy-time downtime, and stop the
+> production VM from running out of RAM/disk during builds.
+>
+> **Scope boundary — read this first.** This roadmap is about *how we ship and
+> run* images on the VM (the `deploy-*.sh` scripts, `docker compose` lifecycle,
+> registry strategy, VM resource limits).
+> The complementary *image-build*
+> concerns (pnpm install speed, BuildKit cache mounts, `.npmrc.docker`, Gitea
+> registry correctness, "build green / app broken" silent failures) live in
+> [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md)
+> and are **out of scope here** — this doc references that work, never duplicates it.
+
+---
+
+## 0. Current state (audited 2026-05-31)
+
+The production deployment model is **Docker Compose on a single Azure VM**,
+fronted by Caddy on `80/443` (see
+[`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) and
+[`learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md)).
+Four production repos deploy here: `learning_ai_invt_trdg`,
+`learning_ai_common_plat`, `learning_ai_clock`, `learning_ai_notes`.
+
+**What is already good** (do not redo): the per-repo Dockerfiles are
+multi-stage, use `node:22-alpine`, mount a BuildKit pnpm-store cache, inject the
+Gitea token via a BuildKit secret, and emit Next.js `standalone` output. The
+image-build layer is in good shape — credit to the build-optimization roadmap.
+
+**The bottlenecks are in the deployment flow, not the Dockerfiles.**
+
+| # | Finding | Location | Symptom it causes | Severity |
+|---|---|---|---|---|
+| **D1** | **Images are built _on the production VM_.** `docker build` runs on the box while ~31 containers (Cosmos emulator + Azurite are RAM-heavy) are live. The VM has **less RAM than the deploy scripts assumed** (per the deployment-status doc), so builds thrash/swap. | `deploy-invttrdg.sh` build step; `deploy-all.sh` `docker compose build` | Slow builds **and** memory pressure **and** risk of OOM-killing live services | **High (keystone)** |
+| **D2** | **Blanket `docker compose down` → `up -d --force-recreate`.** Every deploy stops _all_ services and restarts them cold, even for a one-line change. | `deploy-invttrdg.sh` (down then force-recreate) | Deploy-time downtime + cold caches on every release | **High** |
+| **D3** | **`deploy-all.sh` rebuilds every service in every repo, sequentially.** No change-detection, no parallelism; loops repos one-by-one and runs `docker compose build` (all services). | `deploy-all.sh` deploy loop | Multi-repo "deploy all" takes minutes even when one file changed | **High** |
+| **D4** | **`deploy-all.sh` does not guarantee BuildKit.** It calls plain `docker compose build` with no `DOCKER_BUILDKIT=1` / `COMPOSE_DOCKER_CLI_BUILD=1`. On an older Docker it can silently fall back to the legacy builder and **lose the Dockerfile `--mount=type=cache` pnpm-store wins** the build roadmap added. | `deploy-all.sh` build step | Silent slow path; warm rebuilds behave like cold builds | **Medium** |
+| **D5** | **Whole `node_modules` (dev deps included) copied into the runtime stage.** Backend runtime `COPY --from=builder .../node_modules` is the full install. | `*/backend/Dockerfile` runtime stage | Larger images → more disk, slower `pull`/start | **Medium** |
+| **D6** | **No RAM/disk guardrails on the VM.** No `mem_limit`/reservations on Cosmos emulator/Azurite; no scheduled image prune; no container log rotation. | `docker-compose*.yml`, VM cron | Disk creeps to full + a single container can starve the box (silent cause of "deploys suddenly got slow") | **Medium** |
+| **D7** | **Rollback requires a rebuild.** `DEPLOYMENT_GUIDE.md` rollback path is `git revert` + `docker compose up -d --build` — i.e. rebuild on the VM to roll back. Images are tagged only `:latest`, so there is no immutable artifact to re-point to. | `DEPLOYMENT_GUIDE.md` rollback section; image tags | Slow, risky rollbacks; no clean "known-good" artifact | **Medium** |
+
+**Implications**
+
+- **D1 is the keystone.** Moving builds off the VM removes the build/runtime
+  resource contention that drives slow builds, memory pressure, _and_ (with
+  SHA-tagged images) enables fast rollback. Most other items compound on top of it.
+- D2 alone removes the majority of deploy-time downtime and is low-risk.
+- D3/D4 are the "deploy all is slow" story; D5/D6 are the disk/memory story.
+
+---
+
+## 1. Goals & non-goals
+
+**Goals**
+
+- Cut warm deploy wall-clock time to **seconds** for a single changed service.
+- **Zero (or near-zero) downtime** for routine deploys.
+- Keep the production VM's RAM/disk **predictable and bounded** during deploys.
+- Make rollback an **artifact re-point**, not a rebuild.
+
+**Non-goals**
+
+- ❌ Adopting Kubernetes / Swarm / Nomad. Compose-on-one-VM is the correct
+  model at MVP; revisit orchestration only when we outgrow a single host.
+- ❌ Re-doing image-build internals (pnpm, BuildKit cache, Gitea path) — owned
+  by [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
+- ❌ Full blue/green infra. Tier 1 already removes most downtime; multi-replica
+  comes only when traffic justifies it.
+
+### Measurement targets
+
+| Metric | Baseline (observed/estimated) | Target |
+|---|---|---|
+| Warm deploy, 1 service changed | ~2–3 min (build-on-VM) | **< 30 s** (pull + recreate one service) |
+| Deploy-time downtime per service | full stop/start cycle | **~0** (recreate-in-place: only the changed service restarts, the rest stay up) |
+| Peak VM RAM during deploy | build spike on top of live stack | **no build spike** (build is off-VM) |
+| Rollback to previous release | rebuild on VM (minutes) | **< 30 s** (re-point SHA tag + `up -d`) |
+
+> Fill in actuals during Phase 3.
+
+---
+
+## 2. Phased roadmap (why / what / how)
+
+Phases are ordered by leverage.
+**Phase 1 is the keystone** — do it first; the
+rest compound on it.
+
+### Phase 1 — Build off the VM, ship images, VM pulls ⟵ keystone
+
+**Why.** D1 is the single biggest cause of slow builds, memory pressure, and
+risky rollback. The production VM should *run* containers, not *compile* them.
+Removing build work from the box frees its scarce RAM/CPU and turns deploys into
+a fast `pull` + recreate. Tagging images by commit SHA gives an immutable,
+re-pointable artifact (fixes D7).
+
+**What.**
+- Build images in CI (GitHub Actions / Gitea Actions) or on a dedicated builder.
+- Push to the Gitea container/image registry, tagged `:<sha>` **and**
+  `:latest`.
+- The VM deploy step becomes `docker compose pull && docker compose up -d` — no
+  `docker build` on the box.
+
+**How (checklist).**
+- [ ] **1.1** Stand up / confirm an image registry (reuse Gitea on the VM, or a
+      hosted registry). Decide auth: reuse the existing `GITEA_NPM_TOKEN` pattern
+      from `deploy-invttrdg.sh` for `docker login`.
+- [ ] **1.2** Add a CI build job per production repo: build backend + web images
+      with BuildKit, tag `:<sha>` + `:latest`, push to the registry. Reuse the
+      existing `BYTELYST_COMMIT_*` build args already collected in
+      `deploy-invttrdg.sh`.
+- [ ] **1.3** Parameterize each `docker-compose.yml` service to use
+      `image: <registry>/<image>:${IMAGE_TAG:-latest}` instead of a local `build:`
+      context for production. (Keep `build:` for local dev via an override file.)
+- [ ] **1.4** Rewrite the VM-side deploy path to `docker compose pull` then
+      `docker compose up -d` (no build). Pass `IMAGE_TAG=<sha>` to deploy a
+      specific release.
+- [ ] **1.5** Keep a thin "emergency build-on-VM" fallback flag for when the
+      registry/CI is unavailable, but make pull-based the default.
+- [ ] **1.6** Verify: a clean deploy uses **zero** `tsc`/`next build` CPU on the VM.
+
+**Done when:** deploys to the VM perform no compilation; images are addressable
+by commit SHA in the registry.
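
The VM-side step in 1.4 can be sketched as a small wrapper. This is a minimal sketch, not the contents of `deploy-invttrdg.sh`: the `run` and `deploy_release` helper names and the `DRY_RUN` switch are hypothetical, and it assumes the compose file already reads `${IMAGE_TAG:-latest}` as described in 1.3.

```shell
# Sketch of a pull-based deploy step (checklist item 1.4).
# `run`, `deploy_release`, and DRY_RUN are hypothetical names for illustration.
set -eu

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Deploy the release identified by an image tag (a commit SHA, or "latest").
deploy_release() {
  IMAGE_TAG="${1:-latest}"
  export IMAGE_TAG            # read by ${IMAGE_TAG:-latest} in the compose file
  run docker compose pull     # fetch prebuilt images; nothing compiles on the VM
  run docker compose up -d    # recreate only containers whose image/config changed
}
```

With `DRY_RUN=0`, calling `deploy_release` with a commit SHA would run the two commands for real. Rolling back would be the same call with an older SHA, which is the property Phase 5 relies on.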
+
+---
+
+### Phase 2 — Stop-the-world → recreate-in-place
+
+**Why.** D2: `docker compose down` + `--force-recreate` takes everything down on
+every deploy. Plain `up -d` already recreates only the containers whose
+image/config changed, leaving the rest running.
+
+**What.** Remove the blanket `down`; rely on Compose's differential recreate.
+Target individual services where possible.
+
+**How (checklist).**
+- [ ] **2.1** Remove `docker compose down` from the production deploy path
+      (`deploy-invttrdg.sh`). Replace `up -d --force-recreate` (all services)
+      with `up -d` (differential) — `--force-recreate` only when config didn't
+      change but image did and you're _not_ using SHA tags (after Phase 1, the
+      SHA tag change makes Compose recreate automatically).
+- [ ] **2.2** Support per-service deploys: `docker compose up -d --no-deps <service>`
+      so deploying the backend doesn't bounce unrelated services.
+- [ ] **2.3** Confirm every service has a correct healthcheck (cross-check the
+      IPv6/`localhost` healthcheck pitfall documented as F12 in the build roadmap)
+      so `up -d --wait` blocks until services are healthy before the deploy is
+      considered done (plain `up -d` returns without waiting).
+- [ ] **2.4** (Later / optional) For true zero-downtime on a hot service, run two
+      replicas behind Caddy and recreate one at a time. Defer until traffic needs it.
+
+**Done when:** a routine single-service deploy does not interrupt the other
+running services.
+
+---
+
+### Phase 3 — Deploy only what changed; guarantee the fast path
+
+**Why.** D3/D4: `deploy-all.sh` rebuilds everything sequentially and may silently
+drop BuildKit. After Phase 1 most of this moves to CI, but the VM-side and any
+remaining build paths should still skip untouched services and never fall back to
+the legacy builder.
+
+**What.** Change-detection on what to deploy + an explicit BuildKit guarantee for
+any path that still builds.
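
The change-detection half of this phase can be sketched as a filter over `git diff --name-only` output. This is a minimal sketch under stated assumptions: it assumes each deployable service lives in its own top-level directory, and the directory names `backend` and `web`, the `changed_services` helper, and the `LAST_DEPLOYED_SHA` variable are hypothetical rather than taken from `deploy-all.sh`.

```shell
# Sketch of change detection plus the BuildKit guarantee.
# Assumption: one top-level directory per service (hypothetical: backend, web).
set -eu

# Force the BuildKit path for any build that still runs.
export DOCKER_BUILDKIT=1 COMPOSE_DOCKER_CLI_BUILD=1

# Read changed file paths on stdin; print each affected service once.
changed_services() {
  cut -d/ -f1 | sort -u | while IFS= read -r dir; do
    case "$dir" in
      backend|web) printf '%s\n' "$dir" ;;
    esac
  done
}

# Example wiring (not executed here):
#   git diff --name-only "$LAST_DEPLOYED_SHA"..HEAD | changed_services
```

Only the services this prints would then be deployed; a repo whose diff touches no service directory prints nothing and can be skipped entirely.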
+
+**How (checklist).**
+- [ ] **3.1** In `deploy-all.sh`, compute changed services via
+      `git diff --name-only <last-deployed-sha>..HEAD` and skip services with no
+      changes. Record the last-deployed SHA per repo (e.g. a small state file or
+      the registry tag).
+- [ ] **3.2** Export `DOCKER_BUILDKIT=1` and `COMPOSE_DOCKER_CLI_BUILD=1` (or
+      switch to `docker buildx bake`) anywhere a build still runs, so the
+      Dockerfile `--mount=type=cache` pnpm-store wins are never silently lost (D4).
+- [ ] **3.3** Where multiple independent images must build, build them in
+      parallel (`buildx bake`, or `docker compose build --parallel`) instead of
+      the current sequential loop.
+- [ ] **3.4** Capture before/after timings into the Measurement targets table above.
+
+**Done when:** "deploy all" only touches changed services and always uses the
+warm-cache build path.
+
+---
+
+### Phase 4 — Image size & VM resource guardrails
+
+**Why.** D5/D6: bloated images and unbounded disk/RAM are the slow-burn causes of
+"the VM filled up / a build OOM'd the box." Caps make pressure predictable.
+
+**What.** Prune runtime deps, cap memory, rotate logs, prune images.
+
+**How (checklist).**
+- [ ] **4.1** In each backend runtime stage, install production-only deps
+      (`pnpm install --prod` / `pnpm deploy --prod`) instead of copying the full
+      builder `node_modules` (D5). Verify the app still starts.
+- [ ] **4.2** Add `mem_limit` + `mem_reservation` (and sensible `cpus`) to the
+      RAM-heavy services first — Cosmos emulator, Azurite — then the rest, in the
+      `docker-compose*.yml` files.
+- [ ] **4.3** Add Docker daemon log rotation (`json-file` with `max-size` +
+      `max-file`, or ship logs to Loki only) so container logs can't fill disk.
+- [ ] **4.4** Add a scheduled `docker image prune -f` (and `builder prune`) on the
+      VM to reclaim dangling layers left by rebuilds.
+- [ ] **4.5** Add a small swap file on the VM as an OOM safety net for any
+      residual on-box work; alert when disk > 80%.
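
The disk guardrails in 4.4 and 4.5 can be sketched as one small script suitable for cron. This is a minimal sketch: the `over_threshold`, `disk_usage_pct`, `run`, and `prune_if_full` helpers and the `DRY_RUN` switch are hypothetical names, the 80% default is the threshold named in 4.5, and pruning only above the threshold (rather than on every run) is a design choice of this sketch, not something the checklist prescribes.

```shell
# Sketch of a scheduled disk guardrail (checklist items 4.4 and 4.5).
# Helper names and DRY_RUN are hypothetical; the 80% default comes from 4.5.
set -eu

# Succeed when usage (an integer percent) is strictly above the threshold.
over_threshold() {
  [ "$1" -gt "$2" ]
}

# Print the used-space percentage of the filesystem holding the given path.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '+ %s\n' "$*"; else "$@"; fi
}

# Warn, then prune dangling images and build cache, once usage crosses the limit.
prune_if_full() {
  pct="$1"
  limit="${2:-80}"
  if over_threshold "$pct" "$limit"; then
    echo "WARN: disk at ${pct}% (limit ${limit}%)" >&2
    run docker image prune -f
    run docker builder prune -f
  fi
}
```

A cron entry could call `prune_if_full "$(disk_usage_pct /var/lib/docker)"` with `DRY_RUN=0`, assuming Docker uses its default data root. The warning goes to stderr so it can be routed to whatever alerting the VM already has.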
+
+**Done when:** runtime images are prod-only, every heavy service is memory-capped,
+and disk usage is bounded by rotation + prune.
+
+---
+
+### Phase 5 — Fast, artifact-based rollback
+
+**Why.** D7: rollback today means rebuild-on-VM. With SHA-tagged images from
+Phase 1, rollback becomes re-pointing to a known-good tag.
+
+**What.** A `rollback` command that redeploys a previous image tag.
+
+**How (checklist).**
+- [ ] **5.1** Keep the last N image SHAs in the registry (don't prune the most
+      recent good tags).
+- [ ] **5.2** Add `IMAGE_TAG=<sha> docker compose up -d` rollback path
+      (and a `deploy-*.sh --rollback [sha]` wrapper).
+- [ ] **5.3** Update [`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) rollback
+      section to use tag re-point instead of `git revert` + rebuild.
+- [ ] **5.4** Document how to find the currently-deployed SHA (the
+      `/api/devops/version` endpoint already exposed and checked by
+      `deploy-invttrdg.sh`).
+
+**Done when:** rolling back is a sub-30s tag re-point with no rebuild.
+
+---
+
+## 3. Quick-reference summary
+
+| Phase | Theme | Fixes | Primary symptom addressed |
+|---|---|---|---|
+| **1** | Build off VM, pull images | D1, D7 | Slow builds + memory pressure + rollback |
+| **2** | Recreate-in-place | D2 | Downtime |
+| **3** | Deploy only changed + BuildKit guarantee | D3, D4 | Slow "deploy all" |
+| **4** | Image slimming + resource caps | D5, D6 | Disk/memory |
+| **5** | Artifact rollback | D7 | Rollback speed/safety |
+
+**Suggested order:** Phase 1 → 2 (≈80% of the benefit across all three
+symptoms), then 3 → 4 → 5.
+
+## 4. Explicitly out of scope
+
+- Image-build internals (pnpm/BuildKit/Gitea/`.npmrc.docker`/silent-break
+  correctness) — see
+  [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
+- Migration to Kubernetes/Swarm or a managed cloud runtime.
+- Multi-platform image builds.
+
+## 5. Related docs
+
+- [`../DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) — current production deploy procedure
+- [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md) — image-build layer
+- [`VM_OBSERVABILITY_ROADMAP.md`](VM_OBSERVABILITY_ROADMAP.md) — metrics/monitoring for the VM
+- [`vm-security-blind-spots-roadmap.md`](vm-security-blind-spots-roadmap.md) — VM hardening
+- [`../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md) — VM deployment status snapshot
diff --git a/docs/repo-map.md b/docs/repo-map.md
index e3bcfe6..3f5ff0c 100644
--- a/docs/repo-map.md
+++ b/docs/repo-map.md
@@ -52,6 +52,7 @@ Current key files:
 - `docs/remove_user_interactive.md`
 - `docs/hermes-setup-upgrade-roadmap.md`
 - `docs/hermes-operations.md`
+- `docs/deployment-optimization-roadmap.md` - deployment orchestration roadmap (build-off-VM, zero-downtime, VM resource caps); complements `docker-build-optimization-roadmap.md`
 - `docs/llm-utility-workflows.md`
 - `docs/gitea-registry-and-package-resolution.md`
 - `docs/vm-security-blind-spots-roadmap.md`