saravanakumardb1 f837512026 docs(deploy): add deployment optimization roadmap

Document a phased roadmap for the single-VM deployment layer (build-off-VM,
recreate-in-place to cut downtime, change-detection + BuildKit guarantee,
image slimming + resource caps, artifact-based rollback). Scoped to deploy
orchestration; defers image-build internals to docker-build-optimization-roadmap.
Register the doc in repo-map.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

2026-05-31 00:40:15 -07:00

Deployment Optimization Roadmap

Status: v1 (PROPOSED — analysis complete, no changes applied yet) · Owner: Platform DevOps · Created: 2026-05-31

Optimize the deployment orchestration layer for the single-Azure-VM MVP: reduce deploy wall-clock time, eliminate deploy-time downtime, and stop the production VM from running out of RAM/disk during builds.

Scope boundary — read this first. This roadmap is about how we ship and run images on the VM (the deploy-*.sh scripts, docker compose lifecycle, registry strategy, VM resource limits). The complementary image-build concerns (pnpm install speed, BuildKit cache mounts, .npmrc.docker, Gitea registry correctness, "build green / app broken" silent failures) live in docker-build-optimization-roadmap.md and are out of scope here — this doc references that work, never duplicates it.

0. Current state (audited 2026-05-31)

The production deployment model is Docker Compose on a single Azure VM, fronted by Caddy on 80/443 (see DEPLOYMENT_GUIDE.md and learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md). Four production repos deploy here: learning_ai_invt_trdg, learning_ai_common_plat, learning_ai_clock, learning_ai_notes.

What is already good (do not redo): the per-repo Dockerfiles are multi-stage, use node:22-alpine, mount a BuildKit pnpm-store cache, inject the Gitea token via a BuildKit secret, and emit Next.js standalone output. The image-build layer is in good shape — credit to the build-optimization roadmap.

The bottlenecks are in the deployment flow, not the Dockerfiles.

| # | Finding | Location | Symptom it causes | Severity |
|---|---------|----------|-------------------|----------|
| D1 | Images are built on the production VM. `docker build` runs on the box while ~31 containers (Cosmos emulator + Azurite are RAM-heavy) are live. The VM has less RAM than the deploy scripts assumed (per the deployment-status doc), so builds thrash/swap. | `deploy-invttrdg.sh` build step; `deploy-all.sh` `docker compose build` | Slow builds, memory pressure, and risk of OOM-killing live services | High (keystone) |
| D2 | Blanket `docker compose down` → `up -d --force-recreate`. Every deploy stops all services and restarts them cold, even for a one-line change. | `deploy-invttrdg.sh` (down then force-recreate) | Deploy-time downtime + cold caches on every release | High |
| D3 | `deploy-all.sh` rebuilds every service in every repo, sequentially. No change-detection, no parallelism; loops repos one-by-one and runs `docker compose build` (all services). | `deploy-all.sh` deploy loop | Multi-repo "deploy all" takes minutes even when one file changed | High |
| D4 | `deploy-all.sh` does not guarantee BuildKit. It calls plain `docker compose build` with no `DOCKER_BUILDKIT=1` / `COMPOSE_DOCKER_CLI_BUILD=1`. On an older Docker it can silently fall back to the legacy builder and lose the Dockerfile `--mount=type=cache` pnpm-store wins the build roadmap added. | `deploy-all.sh` build step | Silent slow path; warm rebuilds behave like cold builds | Medium |
| D5 | Whole `node_modules` (dev deps included) copied into the runtime stage. Backend runtime `COPY --from=builder .../node_modules` is the full install. | `*/backend/Dockerfile` runtime stage | Larger images → more disk, slower `pull`/start | Medium |
| D6 | No RAM/disk guardrails on the VM. No `mem_limit`/reservations on Cosmos emulator/Azurite; no scheduled image prune; no container log rotation. | `docker-compose*.yml`, VM cron | Disk creeps to full + a single container can starve the box (silent cause of "deploys suddenly got slow") | Medium |
| D7 | Rollback requires a rebuild. `DEPLOYMENT_GUIDE.md` rollback path is `git revert` + `docker compose up -d --build`, i.e. a rebuild on the VM to roll back. Images are tagged only `:latest`, so there is no immutable artifact to re-point to. | `DEPLOYMENT_GUIDE.md` rollback section; image tags | Slow, risky rollbacks; no clean "known-good" artifact | Medium |

Implications

D1 is the keystone. Moving builds off the VM removes the build/runtime resource contention that drives slow builds and memory pressure and, with SHA-tagged images, enables fast rollback. Most other items compound on top of it.
Fixing D2 alone removes the majority of deploy-time downtime and is low-risk.
D3/D4 are the "deploy all is slow" story; D5/D6 are the disk/memory story.

1. Goals & non-goals

Goals

Cut warm deploy wall-clock time to under 30 seconds for a single changed service.
Zero (or near-zero) downtime for routine deploys.
Keep the production VM's RAM/disk predictable and bounded during deploys.
Make rollback an artifact re-point, not a rebuild.

Non-goals

❌ Adopting Kubernetes / Swarm / Nomad. Compose-on-one-VM is the correct model at MVP; revisit orchestration only when we outgrow a single host.
❌ Re-doing image-build internals (pnpm, BuildKit cache, Gitea path) — owned by docker-build-optimization-roadmap.md.
❌ Full blue/green infra. Phase 2 (recreate-in-place) already removes most downtime; multi-replica comes only when traffic justifies it.

Measurement targets

| Metric | Baseline (observed/estimated) | Target |
|--------|-------------------------------|--------|
| Warm deploy, 1 service changed | ~2–3 min (build-on-VM) | < 30 s (pull + recreate one service) |
| Deploy-time downtime per service | full stop/start cycle | near 0 (recreate-in-place: only the changed service restarts briefly; unchanged services stay up) |
| Peak VM RAM during deploy | build spike on top of live stack | no build spike (build is off-VM) |
| Rollback to previous release | rebuild on VM (minutes) | < 30 s (re-point SHA tag + `up -d`) |

Fill in actuals during Phase 3.
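
To make the baseline and target columns comparable, each deploy step could be timed the same way before and after the changes. The following is a minimal sketch only: `time_step` is a hypothetical helper, not something that exists in the deploy scripts today, and it measures whole seconds with `date +%s`.

```shell
#!/bin/sh
# Hypothetical helper: run one deploy step and print a tab-separated record
# "label<TAB>seconds<TAB>exit-status" on stdout.
# The step's own stdout is redirected to stderr so stdout carries only the record.
time_step() {
  label=$1
  shift
  start=$(date +%s)
  status=0
  "$@" >&2 || status=$?
  end=$(date +%s)
  printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$status"
  return "$status"
}
```

For example, `time_step warm-deploy docker compose up -d >> timings.tsv` would append one record per run; the file name is illustrative.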

2. Phased roadmap (why / what / how)

Phases are ordered by leverage. Phase 1 is the keystone — do it first; the rest compound on it.

Phase 1 — Build off the VM, ship images, VM pulls ⟵ keystone

Why. D1 is the single biggest cause of slow builds, memory pressure, and risky rollback. The production VM should run containers, not compile them. Removing build work from the box frees its scarce RAM/CPU and turns deploys into a fast pull + recreate. Tagging images by commit SHA gives an immutable, re-pointable artifact (fixes D7).

What.

Build images in CI (GitHub Actions / Gitea Actions) or on a dedicated builder.
Push to the Gitea container/image registry, tagged :<commit-sha> and :latest.
The VM deploy step becomes docker compose pull && docker compose up -d — no docker build on the box.
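
The CI side of this could look roughly like the sketch below: build once, tag the same image with the commit SHA and `latest`, and push both. The function name, the `REGISTRY` variable, the `registry.example.com` placeholder, and the `DRY_RUN` switch are all assumptions made for illustration; the real registry host and build arguments come from the repo's own configuration.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (handy for rehearsing a job).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Build one image and push it under two tags: the immutable commit SHA and latest.
# Arguments: image name, build context directory, commit SHA.
# REGISTRY is a hypothetical variable; registry.example.com is a placeholder.
build_and_push() {
  svc=$1
  context=$2
  sha=$3
  ref="${REGISTRY:-registry.example.com}/$svc"
  run docker build -t "$ref:$sha" -t "$ref:latest" "$context"
  run docker push "$ref:$sha"
  run docker push "$ref:latest"
}
```

Running `DRY_RUN=1` first prints the three commands without touching Docker, which makes it easy to check the tag names before wiring the job into CI.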

How (checklist).

1.1 Stand up / confirm an image registry (reuse Gitea on the VM, or a hosted registry). Decide auth: reuse the existing GITEA_NPM_TOKEN pattern from deploy-invttrdg.sh for docker login.
1.2 Add a CI build job per production repo: build backend + web images with BuildKit, tag :<git-sha> + :latest, push to the registry. Reuse the existing BYTELYST_COMMIT_* build args already collected in deploy-invttrdg.sh.
1.3 Parameterize each docker-compose.yml service to use image: <registry>/<svc>:${IMAGE_TAG:-latest} instead of a local build: context for production. (Keep build: for local dev via an override file.)
1.4 Rewrite the VM-side deploy path to docker compose pull then docker compose up -d (no build). Pass IMAGE_TAG=<sha> to deploy a specific release.
1.5 Keep a thin "emergency build-on-VM" fallback flag for when the registry/CI is unavailable, but make pull-based the default.
1.6 Verify: a clean deploy uses zero tsc/next build CPU on the VM.

Done when: deploys to the VM perform no compilation; images are addressable by commit SHA in the registry.
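
A VM-side deploy under this model might reduce to the sketch below. It assumes the compose file already references `${IMAGE_TAG:-latest}` as described in 1.3; the `deploy` function and the `DRY_RUN` switch are hypothetical names, not part of the existing scripts.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Pull-based deploy: no docker build on the VM.
# Argument: image tag to deploy (a commit SHA); defaults to latest.
deploy() {
  IMAGE_TAG=${1:-latest}
  export IMAGE_TAG   # read by the compose file's ${IMAGE_TAG:-latest} reference
  printf 'deploying IMAGE_TAG=%s\n' "$IMAGE_TAG"
  run docker compose pull
  run docker compose up -d
}
```

Calling `deploy <sha>` pins a specific release, while `deploy` with no argument follows `latest`.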

Phase 2 — Stop-the-world → recreate-in-place

Why. D2: docker compose down + --force-recreate takes everything down on every deploy. Plain up -d already recreates only the containers whose image/config changed, leaving the rest running.

What. Remove the blanket down; rely on Compose's differential recreate. Target individual services where possible.

How (checklist).

2.1 Remove docker compose down from the production deploy path (deploy-invttrdg.sh). Replace up -d --force-recreate (all services) with plain up -d, which recreates only what changed. Reserve --force-recreate for the case where the image changed but the configuration did not and you are not using SHA tags; after Phase 1 the SHA tag change makes Compose recreate the container automatically.
2.2 Support per-service deploys: docker compose up -d --no-deps <svc> so deploying the backend doesn't bounce unrelated services.
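2.3 Confirm every service has a correct healthcheck (cross-check the IPv6/localhost healthcheck pitfall documented as F12 in the build roadmap), and run up -d with --wait so the command blocks until services report healthy before the deploy is considered done. Plain up -d returns as soon as the containers are started.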
2.4 (Later / optional) For true zero-downtime on a hot service, run two replicas behind Caddy and recreate one at a time. Defer until traffic needs it.

Done when: a routine single-service deploy does not interrupt the other running services.
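
A per-service deploy along these lines could be sketched as follows. It assumes a Compose v2 release that supports `up --wait`, which blocks until the service is running or healthy. `deploy_service` and `DRY_RUN` are hypothetical names. Note that Compose stops the old container before starting its replacement, so the targeted service still sees a short gap; only the other services are untouched.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Recreate a single service without bouncing its dependencies.
# --no-deps leaves linked services alone; --wait blocks until the service
# is running (or healthy, when a healthcheck is defined).
deploy_service() {
  svc=${1:?usage: deploy_service <service>}
  run docker compose pull "$svc"
  run docker compose up -d --no-deps --wait "$svc"
}
```

Because `--wait` exits non-zero when the service never becomes healthy, the wrapper's exit status can gate any follow-up step such as a version check.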

Phase 3 — Deploy only what changed; guarantee the fast path

Why. D3/D4: deploy-all.sh rebuilds everything sequentially and may silently drop BuildKit. After Phase 1 most of this moves to CI, but the VM-side and any remaining build paths should still skip untouched services and never fall back to the legacy builder.

What. Change-detection on what to deploy + an explicit BuildKit guarantee for any path that still builds.

How (checklist).

3.1 In deploy-all.sh, compute changed services via git diff --name-only <last-deployed-sha>..HEAD and skip services with no changes. Record the last-deployed SHA per repo (e.g. a small state file or the registry tag).
3.2 Export DOCKER_BUILDKIT=1 and COMPOSE_DOCKER_CLI_BUILD=1 (or switch to docker buildx bake) anywhere a build still runs, so the Dockerfile --mount=type=cache pnpm-store wins are never silently lost (D4).
3.3 Where multiple independent images must build, build them in parallel (buildx bake, or docker compose build --parallel) instead of the current sequential loop.
3.4 Capture before/after timings into the Measurement targets table above.

Done when: "deploy all" only touches changed services and always uses the warm-cache build path.
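
Items 3.1 and 3.2 could be combined roughly as in the sketch below. It assumes each service lives in its own top-level directory (for example `backend/` and `web/`), which may not match every repo; `changed_services` and `changed_since` are hypothetical helpers. On recent Docker Engine releases BuildKit is already the default builder, so the exports mainly protect older installs.

```shell
#!/bin/sh
# Force the BuildKit path for any build that still runs from this script (3.2).
export DOCKER_BUILDKIT=1
export COMPOSE_DOCKER_CLI_BUILD=1

# Read changed file paths (one per line) on stdin and print the services they touch.
# Argument: space-separated list of service directories.
# Assumption: a service is "changed" when any path starts with "<service>/".
changed_services() {
  known=$1
  paths=$(cat)
  for svc in $known; do
    if printf '%s\n' "$paths" | grep -q -- "^$svc/"; then
      printf '%s\n' "$svc"
    fi
  done
}

# Print the services changed between the last deployed commit and HEAD (3.1).
# Arguments: last deployed SHA, space-separated list of service directories.
changed_since() {
  git diff --name-only "$1"..HEAD | changed_services "$2"
}
```

A caller might loop over `changed_since "$last_sha" "backend web"` and deploy only the names it prints; an empty result means nothing needs to be redeployed.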

Phase 4 — Image size & VM resource guardrails

Why. D5/D6: bloated images and unbounded disk/RAM are the slow-burn causes of "the VM filled up / a build OOM'd the box." Caps make pressure predictable.

What. Prune runtime deps, cap memory, rotate logs, prune images.

How (checklist).

4.1 In each backend runtime stage, install production-only deps (pnpm install --prod / pnpm deploy --prod) instead of copying the full builder node_modules (D5). Verify the app still starts.
4.2 Add mem_limit + mem_reservation (and sensible cpus) to the RAM-heavy services first — Cosmos emulator, Azurite — then the rest, in the docker-compose*.yml files.
4.3 Add Docker daemon log rotation (json-file with max-size + max-file, or ship logs to Loki only) so container logs can't fill disk.
4.4 Add a scheduled docker image prune -f (and builder prune) on the VM to reclaim dangling layers left by rebuilds.
4.5 Add a small swap file on the VM as an OOM safety net for any residual on-box work; alert when disk > 80%.

Done when: runtime images are prod-only, every heavy service is memory-capped, and disk usage is bounded by rotation + prune.
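
The prune and disk-alert items (4.4 and 4.5) could be covered by a small maintenance script such as the sketch below. The function names, the `DISK_ALERT_PCT` variable, and the `DRY_RUN` switch are assumptions for illustration; the 80% default mirrors the threshold named in 4.5.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Reclaim dangling image layers and unused build cache.
prune_images() {
  run docker image prune -f
  run docker builder prune -f
}

# Print the used percentage (integer) of the filesystem holding a path.
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the first integer is strictly greater than the second.
over_threshold() {
  [ "$1" -gt "$2" ]
}

# Warn on stderr and return 1 when disk usage exceeds DISK_ALERT_PCT (default 80).
check_disk() {
  pct=$(disk_used_pct "${1:-/}")
  if over_threshold "$pct" "${DISK_ALERT_PCT:-80}"; then
    printf 'WARN: %s is %s%% full\n' "${1:-/}" "$pct" >&2
    return 1
  fi
}
```

Scheduling `prune_images` followed by `check_disk /` from cron would be one way to run this; how the warning reaches an alerting channel is left open here.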

Phase 5 — Fast, artifact-based rollback

Why. D7: rollback today means rebuild-on-VM. With SHA-tagged images from Phase 1, rollback becomes re-pointing to a known-good tag.

What. A rollback command that redeploys a previous image tag.

How (checklist).

5.1 Keep the last N image SHAs in the registry (don't prune the most recent good tags).
5.2 Add an IMAGE_TAG=<previous-sha> docker compose up -d rollback path (and a deploy-*.sh --rollback [sha] wrapper).
5.3 Update DEPLOYMENT_GUIDE.md rollback section to use tag re-point instead of git revert + rebuild.
5.4 Document how to find the currently-deployed SHA (the /api/devops/version endpoint already exposed and checked by deploy-invttrdg.sh).

Done when: rolling back is a sub-30s tag re-point with no rebuild.
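
A rollback wrapper in this spirit might look like the sketch below. It assumes SHA-tagged images exist in the registry (Phase 1) and that the compose file reads `IMAGE_TAG`; `rollback`, `is_sha`, and `DRY_RUN` are hypothetical names. The SHA check only guards against typos such as passing `latest`; it does not confirm the tag exists in the registry.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Succeed when the argument looks like a git SHA: 7 to 40 lowercase hex characters.
is_sha() {
  case $1 in
    ''|*[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -ge 7 ] && [ "${#1}" -le 40 ]
}

# Re-point the stack at a previously built image tag; no rebuild involved.
rollback() {
  sha=${1:-}
  if ! is_sha "$sha"; then
    printf 'usage: rollback <commit-sha>\n' >&2
    return 2
  fi
  IMAGE_TAG=$sha
  export IMAGE_TAG
  printf 'rolling back to IMAGE_TAG=%s\n' "$IMAGE_TAG"
  run docker compose pull
  run docker compose up -d
}
```

After a rollback, the deployed SHA can be confirmed through the version endpoint mentioned in 5.4.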

3. Quick-reference summary

| Phase | Theme | Fixes | Primary symptom addressed |
|-------|-------|-------|---------------------------|
| 1 | Build off VM, pull images | D1, D7 | Slow builds + memory pressure + rollback |
| 2 | Recreate-in-place | D2 | Downtime |
| 3 | Deploy only changed + BuildKit guarantee | D3, D4 | Slow "deploy all" |
| 4 | Image slimming + resource caps | D5, D6 | Disk/memory |
| 5 | Artifact rollback | D7 | Rollback speed/safety |

Suggested order: Phase 1 → 2 (an estimated ≈80% of the benefit across deploy time, downtime, and VM resource pressure), then 3 → 4 → 5.

4. Explicitly out of scope

Image-build internals (pnpm/BuildKit/Gitea/.npmrc.docker/silent-break correctness) — see docker-build-optimization-roadmap.md.
Migration to Kubernetes/Swarm or a managed cloud runtime.
Multi-platform image builds.

5. Related docs

../DEPLOYMENT_GUIDE.md — current production deploy procedure
docker-build-optimization-roadmap.md — image-build layer
VM_OBSERVABILITY_ROADMAP.md — metrics/monitoring for the VM
vm-security-blind-spots-roadmap.md — VM hardening
../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md — VM deployment status snapshot
