From 38aefb05e4afd488b7c6451348f12507184c01ff Mon Sep 17 00:00:00 2001 From: saravanakumardb1 Date: Sun, 31 May 2026 00:44:52 -0700 Subject: [PATCH] =?UTF-8?q?docs(deploy):=20v2=20review=20pass=20=E2=80=94?= =?UTF-8?q?=20correct=20findings=20after=20full=20script/compose=20audit?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - D6: memory limits already exist (deploy.resources.limits); reframe as RAM right-sizing + disk hygiene rather than "limits missing" - D2: down/--force-recreate is invttrdg-only; clock/notes already differential - D4: broaden BuildKit gap to all docker compose build paths; fix accuracy - D8 (new): deploy-script drift across per-product scripts + dashboard/deploy.sh - add Phase 0 (unify scripts) as prerequisite; update quick-ref + ordering Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> --- docs/deployment-optimization-roadmap.md | 155 +++++++++++++++++------- 1 file changed, 111 insertions(+), 44 deletions(-) diff --git a/docs/deployment-optimization-roadmap.md b/docs/deployment-optimization-roadmap.md index 221008f..18695ab 100644 --- a/docs/deployment-optimization-roadmap.md +++ b/docs/deployment-optimization-roadmap.md @@ -1,6 +1,12 @@ # Deployment Optimization Roadmap -> **Status:** v1 (**PROPOSED** — analysis complete, no changes applied yet) · **Owner:** Platform DevOps · **Created:** 2026-05-31 +> **Status:** v2 (**PROPOSED** — analysis complete, no changes applied yet) · **Owner:** Platform DevOps · **Created:** 2026-05-31 · **Revised:** 2026-05-31 +> +> **v2 review pass:** corrected findings after auditing *all* deploy scripts + +> compose files (not just `deploy-invttrdg.sh`). 
Key fixes: memory limits already +> exist (D6 rewritten), `down`/`--force-recreate` is invttrdg-only not universal +> (D2 re-scoped), BuildKit gap spans all `docker compose build` paths (D4 +> broadened), and added **D8 — deploy-script drift** across the per-product scripts. > > Optimize the **deployment orchestration layer** for the single-Azure-VM MVP: > reduce deploy wall-clock time, eliminate deploy-time downtime, and stop the @@ -25,6 +31,10 @@ fronted by Caddy on `80/443` (see Four production repos deploy here: `learning_ai_invt_trdg`, `learning_ai_common_plat`, `learning_ai_clock`, `learning_ai_notes`. +**Deploy surfaces audited (2026-05-31):** `deploy-invttrdg.sh`, `deploy-clock.sh`, +`deploy-notes.sh`, `deploy-all.sh`, `dashboard/deploy.sh`, plus the production +`docker-compose.yml` of clock and the `docker-compose.ecosystem.yml` of common_plat. + **What is already good** (do not redo): the per-repo Dockerfiles are multi-stage, use `node:22-alpine`, mount a BuildKit pnpm-store cache, inject the Gitea token via a BuildKit secret, and emit Next.js `standalone` output. The @@ -35,20 +45,28 @@ image-build layer is in good shape — credit to the build-optimization roadmap. | # | Finding | Location | Symptom it causes | Severity | |---|---|---|---|---| | **D1** | **Images are built _on the production VM_.** `docker build` runs on the box while ~31 containers (Cosmos emulator + Azurite are RAM-heavy) are live. The VM has **less RAM than the deploy scripts assumed** (per the deployment-status doc), so builds thrash/swap. | `deploy-invttrdg.sh` build step; `deploy-all.sh` `docker compose build` | Slow builds **and** memory pressure **and** risk of OOM-killing live services | **High (keystone)** | -| **D2** | **Blanket `docker compose down` → `up -d --force-recreate`.** Every deploy stops _all_ services and restarts them cold, even for a one-line change. 
| `deploy-invttrdg.sh` (down then force-recreate) | Deploy-time downtime + cold caches on every release | **High** | +| **D2** | **`deploy-invttrdg.sh` does a blanket `docker compose down` → `up -d --force-recreate`** — stops _all_ its services and restarts them cold, even for a one-line change. (Note: `deploy-clock.sh`/`deploy-notes.sh` do **not** — they already run `docker compose build` + `docker compose up -d`, i.e. differential recreate. So this is a per-script inconsistency, not a universal pattern — see D8.) | `deploy-invttrdg.sh` (down then `--force-recreate`) | Deploy-time downtime + cold caches on invttrdg releases | **Medium** | | **D3** | **`deploy-all.sh` rebuilds every service in every repo, sequentially.** No change-detection, no parallelism; loops repos one-by-one and runs `docker compose build` (all services). | `deploy-all.sh` deploy loop | Multi-repo "deploy all" takes minutes even when one file changed | **High** | -| **D4** | **`deploy-all.sh` does not guarantee BuildKit.** It calls plain `docker compose build` with no `DOCKER_BUILDKIT=1` / `COMPOSE_DOCKER_CLI_BUILD=1`. On an older Docker it can silently fall back to the legacy builder and **lose the Dockerfile `--mount=type=cache` pnpm-store wins** the build roadmap added. | `deploy-all.sh` build step | Silent slow path; warm rebuilds behave like cold builds | **Medium** | -| **D5** | **Whole `node_modules` (dev deps included) copied into the runtime stage.** Backend runtime `COPY --from=builder .../node_modules` is the full install. | `*/backend/Dockerfile` runtime stage | Larger images → more disk, slower `pull`/start | **Medium** | -| **D6** | **No RAM/disk guardrails on the VM.** No `mem_limit`/reservations on Cosmos emulator/Azurite; no scheduled image prune; no container log rotation. 
| `docker-compose*.yml`, VM cron | Disk creeps to full + a single container can starve the box (silent cause of "deploys suddenly got slow") | **Medium** | -| **D7** | **Rollback requires a rebuild.** `DEPLOYMENT_GUIDE.md` rollback path is `git revert` + `docker compose up -d --build` — i.e. rebuild on the VM to roll back. Images are tagged only `:latest`, so there is no immutable artifact to re-point to. | `DEPLOYMENT_GUIDE.md` rollback section; image tags | Slow, risky rollbacks; no clean "known-good" artifact | **Medium** | +| **D4** | **The `docker compose build` paths don't pin BuildKit.** `deploy-clock.sh`, `deploy-notes.sh`, and `deploy-all.sh` all call plain `docker compose build` with no `DOCKER_BUILDKIT=1` / `COMPOSE_DOCKER_CLI_BUILD=1`. Modern Compose v2 defaults to BuildKit (and the compose files use build-time `secrets:`, which **require** BuildKit — so on a stale/old toolchain the build hard-fails rather than silently slows). Pinning it explicitly removes the version-dependent ambiguity and guarantees the Dockerfile `--mount=type=cache` pnpm-store wins. | `deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh` build steps | Toolchain-dependent build behavior; risk of losing warm-cache wins | **Low–Medium** | +| **D5** | **Whole `node_modules` (dev deps included) copied into the runtime stage.** Backend runtime `COPY --from=builder .../node_modules` is the full install (verified in `learning_ai_clock/backend/Dockerfile`). | `*/backend/Dockerfile` runtime stage | Larger images → more disk, slower `pull`/start | **Medium** | +| **D6** | **Memory limits exist but are likely untuned vs the real VM, and log/image-disk guardrails are absent.** Limits _are_ set via `deploy.resources.limits.memory` (cosmos-emulator `1g`, azurite `256m`, most services `128m–512m`). 
The gaps: (a) the **sum** of limits hasn't been reconciled against the VM's actual (lower-than-assumed) RAM; (b) only `limits`, no `reservations`; (c) no scheduled `docker image prune` and no container-log rotation, so disk creeps unbounded. | `docker-compose.ecosystem.yml`, product `docker-compose.yml`, VM cron | Disk creeps to full; limits may over- or under-commit RAM | **Medium** | +| **D7** | **Rollback requires a rebuild.** `DEPLOYMENT_GUIDE.md` rollback path is `git revert` + `docker compose up -d --build` — i.e. rebuild on the VM to roll back. Production images are tagged `:latest` (e.g. `invttrdg-backend:latest`) or auto-named by Compose, so there is no immutable per-release artifact to re-point to. | `DEPLOYMENT_GUIDE.md` rollback section; image tags | Slow, risky rollbacks; no clean "known-good" artifact | **Medium** | +| **D8** | **Deploy-script drift — no single source of truth.** Three near-duplicate per-product scripts (`deploy-invttrdg.sh`, `deploy-clock.sh`, `deploy-notes.sh`) plus `deploy-all.sh` and `dashboard/deploy.sh` have diverged: invttrdg builds via an explicit `docker build` loop then `down` + `--force-recreate`; clock/notes use `docker compose build` + `up -d`; common_plat has no dedicated script (goes through `deploy-all.sh`). Any fix in this roadmap must be applied N times and risks further drift. | all `deploy-*.sh` + `dashboard/deploy.sh` | Inconsistent behavior; fixes don't propagate; maintenance burden | **Medium** | **Implications** - **D1 is the keystone.** Moving builds off the VM removes the build/runtime resource contention that drives slow builds, memory pressure, _and_ (with SHA-tagged images) enables fast rollback. Most other items compound on top of it. -- D2 alone removes the majority of deploy-time downtime and is low-risk. -- D3/D4 are the "deploy all is slow" story; D5/D6 are the disk/memory story. 
+- **D8 is a force-multiplier risk.** Because the per-product scripts have already + drifted, every other fix (D2, D4, change-detection, rollback) must either be + applied 3+ times or — better — the scripts should first be unified behind one + parameterized deploy library. Prefer consolidating early so later phases land once. +- D2 removes invttrdg's deploy-time downtime and is low-risk; clock/notes already + do the right thing here, which is exactly why unifying (D8) matters. +- D3/D4 are the "deploy all is slow" story; D5/D6 are the disk/memory story. Note + D6's headline gap is **disk hygiene + RAM right-sizing**, not "limits missing" — + limits already exist. --- @@ -85,8 +103,45 @@ image-build layer is in good shape — credit to the build-optimization roadmap. ## 2. Phased roadmap (why / what / how) -Phases are ordered by leverage. **Phase 1 is the keystone** — do it first; the -rest compound on it. +Phases are ordered by leverage. **Phase 0 (unify the scripts) is a prerequisite +enabler; Phase 1 is the keystone** — every later phase should land in the unified +script once, not be copy-pasted across the per-product scripts. + +### Phase 0 — Unify the deploy scripts (prerequisite) + +**Why.** D8: `deploy-invttrdg.sh`, `deploy-clock.sh`, and `deploy-notes.sh` are +near-duplicate copies that have already drifted (invttrdg uses an explicit +`docker build` loop + `down` + `--force-recreate`; clock/notes use +`docker compose build` + `up -d`). Every fix below would otherwise have to be +written 3+ times. Consolidating first means Phases 1–5 are implemented once. + +**What.** A single parameterized deploy entrypoint (one script or a sourced +library) that takes the repo/product as input and encodes the build + lifecycle ++ health-check steps once. The per-product scripts become thin wrappers (or are +replaced by `deploy.sh `). 
+ +**How (checklist).** +- [ ] **0.1** Inventory the divergence across `deploy-invttrdg.sh`, + `deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh`, `dashboard/deploy.sh` + (build method, lifecycle command, health endpoints, package-publication check). +- [ ] **0.2** Extract the shared steps (dirty-check, fetch/rebase, smoke test, + build, deploy, health check) into one library; express per-product + differences (ports, endpoints, image names) as config/data. +- [ ] **0.3** Replace the per-product scripts with thin wrappers calling the + library; keep the old filenames as shims so existing muscle memory + the + DevOps dashboard's deploy buttons keep working. +- [ ] **0.4** Point `deploy-all.sh` and `dashboard/deploy.sh` at the same library. +- [ ] **0.5** Add a drift guard (lint/CI) so the scripts can't silently diverge + again — mirror the `check-*-drift.sh` pattern already used for `.npmrc` and + `docker-prep.sh` in `learning_ai_common_plat`. + +**Done when:** one code path drives all production deploys; per-product files are +config or thin shims; a drift check guards against regression. + +> If Phase 0 is deferred, treat every checklist item in Phases 1–5 as +> "apply to invttrdg **and** clock **and** notes" — the drift tax is real. + +--- ### Phase 1 — Build off the VM, ship images, VM pulls ⟵ keystone @@ -128,24 +183,28 @@ by commit SHA in the registry. ### Phase 2 — Stop-the-world → recreate-in-place -**Why.** D2: `docker compose down` + `--force-recreate` takes everything down on -every deploy. Plain `up -d` already recreates only the containers whose -image/config changed, leaving the rest running. +**Why.** D2: `deploy-invttrdg.sh` does `docker compose down` + `--force-recreate`, +taking everything down on every deploy. Plain `up -d` already recreates only the +containers whose image/config changed, leaving the rest running — which is exactly +what `deploy-clock.sh`/`deploy-notes.sh` already do. 
**This phase is mostly about +bringing invttrdg in line with clock/notes** (and, post-Phase 0, deleting the +divergence entirely). **What.** Remove the blanket `down`; rely on Compose's differential recreate. Target individual services where possible. **How (checklist).** -- [ ] **2.1** Remove `docker compose down` from the production deploy path - (`deploy-invttrdg.sh`). Replace `up -d --force-recreate` (all services) - with `up -d` (differential) — `--force-recreate` only when config didn't - change but image did and you're _not_ using SHA tags (after Phase 1, the - SHA tag change makes Compose recreate automatically). +- [ ] **2.1** Remove `docker compose down` from `deploy-invttrdg.sh` (the only + script that has it). Replace `up -d --force-recreate` (all services) with + `up -d` (differential) — `--force-recreate` only when config didn't change + but image did and you're _not_ using SHA tags (after Phase 1, the SHA tag + change makes Compose recreate automatically). Adopt the clock/notes pattern. - [ ] **2.2** Support per-service deploys: `docker compose up -d --no-deps ` so deploying the backend doesn't bounce unrelated services. -- [ ] **2.3** Confirm every service has a correct healthcheck (cross-check the - IPv6/`localhost` healthcheck pitfall documented as F12 in the build roadmap) - so `up -d` waits for healthy before considering the deploy done. +- [ ] **2.3** Confirm every service has a correct healthcheck so `up -d --wait` + blocks until services report healthy before the deploy is considered done. Clock already handles the + IPv6/`localhost` pitfall (F12 in the build roadmap) by forcing `127.0.0.1` + in its healthcheck — verify the other production repos do the same. - [ ] **2.4** (Later / optional) For true zero-downtime on a hot service, run two replicas behind Caddy and recreate one at a time. Defer until traffic needs it. @@ -156,22 +215,24 @@ running services. 
### Phase 3 — Deploy only what changed; guarantee the fast path -**Why.** D3/D4: `deploy-all.sh` rebuilds everything sequentially and may silently -drop BuildKit. After Phase 1 most of this moves to CI, but the VM-side and any -remaining build paths should still skip untouched services and never fall back to -the legacy builder. +**Why.** D3/D4: `deploy-all.sh` rebuilds everything sequentially, and none of the +`docker compose build` paths pin BuildKit. After Phase 1 most of this moves to CI, +but the VM-side and any remaining build paths should still skip untouched services +and use a deterministic, warm-cache builder. **What.** Change-detection on what to deploy + an explicit BuildKit guarantee for any path that still builds. **How (checklist).** -- [ ] **3.1** In `deploy-all.sh`, compute changed services via - `git diff --name-only ..HEAD` and skip services with no - changes. Record the last-deployed SHA per repo (e.g. a small state file or - the registry tag). -- [ ] **3.2** Export `DOCKER_BUILDKIT=1` and `COMPOSE_DOCKER_CLI_BUILD=1` (or - switch to `docker buildx bake`) anywhere a build still runs, so the - Dockerfile `--mount=type=cache` pnpm-store wins are never silently lost (D4). +- [ ] **3.1** In `deploy-all.sh` (or the Phase-0 unified library), compute changed + services via `git diff --name-only ..HEAD` and skip + services with no changes. Record the last-deployed SHA per repo (e.g. a + small state file or the registry tag). +- [ ] **3.2** Pin `DOCKER_BUILDKIT=1` and `COMPOSE_DOCKER_CLI_BUILD=1` (or switch + to `docker buildx bake`) in **every** path that still builds — + `deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh` (D4) — so behavior is + not toolchain-version-dependent and the Dockerfile `--mount=type=cache` + pnpm-store wins are guaranteed. - [ ] **3.3** Where multiple independent images must build, build them in parallel (`buildx bake`, or `docker compose build --parallel`) instead of the current sequential loop. 
@@ -184,18 +245,22 @@ warm-cache build path. ### Phase 4 — Image size & VM resource guardrails -**Why.** D5/D6: bloated images and unbounded disk/RAM are the slow-burn causes of -"the VM filled up / a build OOM'd the box." Caps make pressure predictable. +**Why.** D5/D6: bloated images and unbounded disk are the slow-burn causes of +"the VM filled up / a build OOM'd the box." Memory **limits already exist** via +`deploy.resources.limits.memory` — the remaining work is right-sizing them against +the real (lower-than-assumed) VM RAM, adding reservations, and adding disk hygiene. -**What.** Prune runtime deps, cap memory, rotate logs, prune images. +**What.** Prune runtime deps, reconcile + extend memory caps, rotate logs, prune images. **How (checklist).** - [ ] **4.1** In each backend runtime stage, install production-only deps (`pnpm install --prod` / `pnpm deploy --prod`) instead of copying the full builder `node_modules` (D5). Verify the app still starts. -- [ ] **4.2** Add `mem_limit` + `mem_reservation` (and sensible `cpus`) to the - RAM-heavy services first — Cosmos emulator, Azurite — then the rest, in the - `docker-compose*.yml` files. +- [ ] **4.2** Reconcile the existing `deploy.resources.limits.memory` values + (cosmos-emulator `1g`, azurite `256m`, services `128m–512m`) against the VM's + actual RAM — confirm the **sum** fits with headroom. Add `reservations` (not + just `limits`) so the scheduler protects critical services, and add `cpus` + where a service is CPU-bursty. - [ ] **4.3** Add Docker daemon log rotation (`json-file` with `max-size` + `max-file`, or ship logs to Loki only) so container logs can't fill disk. - [ ] **4.4** Add a scheduled `docker image prune -f` (and `builder prune`) on the @@ -203,8 +268,8 @@ warm-cache build path. - [ ] **4.5** Add a small swap file on the VM as an OOM safety net for any residual on-box work; alert when disk > 80%. 
-**Done when:** runtime images are prod-only, every heavy service is memory-capped, -and disk usage is bounded by rotation + prune. +**Done when:** runtime images are prod-only, memory limits are reconciled with VM +RAM (+ reservations), and disk usage is bounded by rotation + prune. --- @@ -234,14 +299,16 @@ Phase 1, rollback becomes re-pointing to a known-good tag. | Phase | Theme | Fixes | Primary symptom addressed | |---|---|---|---| +| **0** | Unify deploy scripts | D8 | Stops fixes from being copy-pasted/drifting | | **1** | Build off VM, pull images | D1, D7 | Slow builds + memory pressure + rollback | -| **2** | Recreate-in-place | D2 | Downtime | +| **2** | Recreate-in-place (align invttrdg) | D2 | Downtime | | **3** | Deploy only changed + BuildKit guarantee | D3, D4 | Slow "deploy all" | -| **4** | Image slimming + resource caps | D5, D6 | Disk/memory | +| **4** | Image slimming + RAM right-sizing + disk hygiene | D5, D6 | Disk/memory | | **5** | Artifact rollback | D7 | Rollback speed/safety | -**Suggested order:** Phase 1 → 2 (≈80% of the benefit across all three -symptoms), then 3 → 4 → 5. +**Suggested order:** Phase 0 (unify, so later fixes land once) → Phase 1 → 2 +(≈80% of the benefit across all three symptoms), then 3 → 4 → 5. If Phase 0 is +skipped, apply Phases 2–4 to invttrdg, clock, and notes individually. ## 4. Explicitly out of scope