# Deployment Optimization Roadmap

> **Status:** v2 (**PROPOSED**: analysis complete, no changes applied yet) · **Owner:** Platform DevOps · **Created:** 2026-05-31 · **Revised:** 2026-05-31
>
> **v2 review pass:** corrected findings after auditing *all* deploy scripts +
> compose files (not just `deploy-invttrdg.sh`). Key fixes: memory limits already
> exist (D6 rewritten), `down`/`--force-recreate` is invttrdg-only not universal
> (D2 re-scoped), BuildKit gap spans all `docker compose build` paths (D4
> broadened), and added **D8 (deploy-script drift)** across the per-product scripts.
>
> Optimize the **deployment orchestration layer** for the single-Azure-VM MVP:
> reduce deploy wall-clock time, eliminate deploy-time downtime, and stop the
> production VM from running out of RAM/disk during builds.
>
> **Scope boundary: read this first.** This roadmap is about *how we ship and
> run* images on the VM (the `deploy-*.sh` scripts, `docker compose` lifecycle,
> registry strategy, VM resource limits). The complementary *image-build*
> concerns (pnpm install speed, BuildKit cache mounts, `.npmrc.docker`, Gitea
> registry correctness, "build green / app broken" silent failures) live in
> [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md)
> and are **out of scope here**; this doc references that work, never duplicates it.

---

## 0. Current state (audited 2026-05-31)

The production deployment model is **Docker Compose on a single Azure VM**,
fronted by Caddy on `80/443` (see
[`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) and
[`learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md)).
Four production repos deploy here: `learning_ai_invt_trdg`,
`learning_ai_common_plat`, `learning_ai_clock`, `learning_ai_notes`.

**Deploy surfaces audited (2026-05-31):** `deploy-invttrdg.sh`, `deploy-clock.sh`,
`deploy-notes.sh`, `deploy-all.sh`, `dashboard/deploy.sh`, plus the production
`docker-compose.yml` of clock and the `docker-compose.ecosystem.yml` of common_plat.

**What is already good** (do not redo): the per-repo Dockerfiles are
multi-stage, use `node:22-alpine`, mount a BuildKit pnpm-store cache, inject the
Gitea token via a BuildKit secret, and emit Next.js `standalone` output. The
image-build layer is in good shape; credit to the build-optimization roadmap.

**The bottlenecks are in the deployment flow, not the Dockerfiles.**

| # | Finding | Location | Symptom it causes | Severity |
|---|---|---|---|---|
| **D1** | **Images are built _on the production VM_.** `docker build` runs on the box while ~31 containers (Cosmos emulator + Azurite are RAM-heavy) are live. The VM has **less RAM than the deploy scripts assumed** (per the deployment-status doc), so builds thrash/swap. | `deploy-invttrdg.sh` build step; `deploy-all.sh` `docker compose build` | Slow builds **and** memory pressure **and** risk of OOM-killing live services | **High (keystone)** |
| **D2** | **`deploy-invttrdg.sh` does a blanket `docker compose down` → `up -d --force-recreate`**, which stops _all_ its services and restarts them cold, even for a one-line change. (Note: `deploy-clock.sh`/`deploy-notes.sh` do **not**; they already run `docker compose build` + `docker compose up -d`, i.e. differential recreate. So this is a per-script inconsistency, not a universal pattern; see D8.) | `deploy-invttrdg.sh` (down then `--force-recreate`) | Deploy-time downtime + cold caches on invttrdg releases | **Medium** |
| **D3** | **`deploy-all.sh` rebuilds every service in every repo, sequentially.** No change-detection, no parallelism; loops repos one-by-one and runs `docker compose build` (all services). | `deploy-all.sh` deploy loop | Multi-repo "deploy all" takes minutes even when one file changed | **High** |
| **D4** | **The `docker compose build` paths don't pin BuildKit.** `deploy-clock.sh`, `deploy-notes.sh`, and `deploy-all.sh` all call plain `docker compose build` with no `DOCKER_BUILDKIT=1` / `COMPOSE_DOCKER_CLI_BUILD=1`. Modern Compose v2 defaults to BuildKit (and the compose files use build-time `secrets:`, which **require** BuildKit, so on a stale/old toolchain the build hard-fails rather than silently slows). Pinning it explicitly removes the version-dependent ambiguity and guarantees the Dockerfile `--mount=type=cache` pnpm-store wins. (`COMPOSE_DOCKER_CLI_BUILD` is only read by legacy Compose v1; on Compose v2, `DOCKER_BUILDKIT` is the switch that matters.) | `deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh` build steps | Toolchain-dependent build behavior; risk of losing warm-cache wins | **Low–Medium** |
| **D5** | **Whole `node_modules` (dev deps included) copied into the runtime stage.** Backend runtime `COPY --from=builder .../node_modules` is the full install (verified in `learning_ai_clock/backend/Dockerfile`). | `*/backend/Dockerfile` runtime stage | Larger images → more disk, slower `pull`/start | **Medium** |
| **D6** | **Memory limits exist but are likely untuned vs the real VM, and log/image-disk guardrails are absent.** Limits _are_ set via `deploy.resources.limits.memory` (cosmos-emulator `1g`, azurite `256m`, most services `128m–512m`). The gaps: (a) the **sum** of limits hasn't been reconciled against the VM's actual (lower-than-assumed) RAM; (b) only `limits`, no `reservations`; (c) no scheduled `docker image prune` and no container-log rotation, so disk usage grows unbounded. | `docker-compose.ecosystem.yml`, product `docker-compose.yml`, VM cron | Disk creeps to full; limits may over- or under-commit RAM | **Medium** |
| **D7** | **Rollback requires a rebuild.** `DEPLOYMENT_GUIDE.md` rollback path is `git revert` + `docker compose up -d --build`, i.e. rebuild on the VM to roll back. Production images are tagged `:latest` (e.g. `invttrdg-backend:latest`) or auto-named by Compose, so there is no immutable per-release artifact to re-point to. | `DEPLOYMENT_GUIDE.md` rollback section; image tags | Slow, risky rollbacks; no clean "known-good" artifact | **Medium** |
| **D8** | **Deploy-script drift: no single source of truth.** Three near-duplicate per-product scripts (`deploy-invttrdg.sh`, `deploy-clock.sh`, `deploy-notes.sh`) plus `deploy-all.sh` and `dashboard/deploy.sh` have diverged: invttrdg builds via an explicit `docker build` loop then `down` + `--force-recreate`; clock/notes use `docker compose build` + `up -d`; common_plat has no dedicated script (goes through `deploy-all.sh`). Any fix in this roadmap must be applied N times and risks further drift. | all `deploy-*.sh` + `dashboard/deploy.sh` | Inconsistent behavior; fixes don't propagate; maintenance burden | **Medium** |

**Implications**

- **D1 is the keystone.** Moving builds off the VM removes the build/runtime
  resource contention that drives slow builds, memory pressure, _and_ (with
  SHA-tagged images) enables fast rollback. Most other items compound on top of it.
- **D8 is a force-multiplier risk.** Because the per-product scripts have already
  drifted, every other fix (D2, D4, change-detection, rollback) must either be
  applied 3+ times or, better, the scripts should first be unified behind one
  parameterized deploy library. Prefer consolidating early so later phases land once.
- D2 removes invttrdg's deploy-time downtime and is low-risk; clock/notes already
  do the right thing here, which is exactly why unifying (D8) matters.
- D3/D4 are the "deploy all is slow" story; D5/D6 are the disk/memory story. Note
  D6's headline gap is **disk hygiene + RAM right-sizing**, not "limits missing":
  limits already exist.

---

## 1. Goals & non-goals

**Goals**

- Cut warm deploy wall-clock time to **seconds** for a single changed service.
- **Zero (or near-zero) downtime** for routine deploys.
- Keep the production VM's RAM/disk **predictable and bounded** during deploys.
- Make rollback an **artifact re-point**, not a rebuild.

**Non-goals**

- ❌ Adopting Kubernetes / Swarm / Nomad. Compose-on-one-VM is the correct
  model at MVP; revisit orchestration only when we outgrow a single host.
- ❌ Re-doing image-build internals (pnpm, BuildKit cache, Gitea path), which are
  owned by [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
- ❌ Full blue/green infra. Tier 1 already removes most downtime; multi-replica
  comes only when traffic justifies it.

### Measurement targets

| Metric | Baseline (observed/estimated) | Target |
|---|---|---|
| Warm deploy, 1 service changed | ~2–3 min (build-on-VM) | **< 30 s** (pull + recreate one service) |
| Deploy-time downtime per service | full stop/start cycle | **~0** (recreate only the changed service in place; unchanged services stay up) |
| Peak VM RAM during deploy | build spike on top of live stack | **no build spike** (build is off-VM) |
| Rollback to previous release | rebuild on VM (minutes) | **< 30 s** (re-point SHA tag + `up -d`) |

> Fill in actuals during Phase 3.

---

## 2. Phased roadmap (why / what / how)

Phases are ordered by leverage. **Phase 0 (unify the scripts) is a prerequisite
enabler; Phase 1 is the keystone.** Every later phase should land in the unified
script once, not be copy-pasted across the per-product scripts.

### Phase 0: Unify the deploy scripts (prerequisite)

**Why.** D8: `deploy-invttrdg.sh`, `deploy-clock.sh`, and `deploy-notes.sh` are
near-duplicate copies that have already drifted (invttrdg uses an explicit
`docker build` loop + `down` + `--force-recreate`; clock/notes use
`docker compose build` + `up -d`). Every fix below would otherwise have to be
written 3+ times. Consolidating first means Phases 1–5 are implemented once.

**What.** A single parameterized deploy entrypoint (one script or a sourced
library) that takes the repo/product as input and encodes the build, lifecycle,
and health-check steps once. The per-product scripts become thin wrappers (or are
replaced by `deploy.sh <product>`).

**How (checklist).**

- [ ] **0.1** Inventory the divergence across `deploy-invttrdg.sh`,
      `deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh`, `dashboard/deploy.sh`
      (build method, lifecycle command, health endpoints, package-publication check).
- [ ] **0.2** Extract the shared steps (dirty-check, fetch/rebase, smoke test,
      build, deploy, health check) into one library; express per-product
      differences (ports, endpoints, image names) as config/data.
- [ ] **0.3** Replace the per-product scripts with thin wrappers calling the
      library; keep the old filenames as shims so existing muscle memory + the
      DevOps dashboard's deploy buttons keep working.
- [ ] **0.4** Point `deploy-all.sh` and `dashboard/deploy.sh` at the same library.
- [ ] **0.5** Add a drift guard (lint/CI) so the scripts can't silently diverge
      again; mirror the `check-*-drift.sh` pattern already used for `.npmrc` and
      `docker-prep.sh` in `learning_ai_common_plat`.

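A minimal sketch of what the Phase 0 entrypoint could look like, assuming Bash and assuming each product can be described by a small config table. The function names (`product_config`, `run_in`, `deploy_product`), the `DRY_RUN` switch, the assumption that each repo is checked out in a directory named after it, and the health URLs are all illustrative placeholders rather than anything taken from the existing scripts. The smoke test and package-publication check are left out for brevity.

```bash
#!/usr/bin/env bash
# Hypothetical unified entrypoint, e.g. sourced by a future deploy.sh <product>.
set -euo pipefail

# Per-product differences expressed as data. Repo names and the common_plat
# compose file come from the audit above; the health URLs are placeholders.
product_config() {
  COMPOSE_FILE_NAME="docker-compose.yml"
  case "$1" in
    invttrdg)    REPO_DIR="learning_ai_invt_trdg";   HEALTH_URL="http://127.0.0.1:3001/health" ;;
    clock)       REPO_DIR="learning_ai_clock";       HEALTH_URL="http://127.0.0.1:3002/health" ;;
    notes)       REPO_DIR="learning_ai_notes";       HEALTH_URL="http://127.0.0.1:3003/health" ;;
    common_plat) REPO_DIR="learning_ai_common_plat"; HEALTH_URL="http://127.0.0.1:3004/health"
                 COMPOSE_FILE_NAME="docker-compose.ecosystem.yml" ;;
    *) echo "unknown product: $1" >&2; return 1 ;;
  esac
}

# Run a command inside a repo checkout; with DRY_RUN=1 only print it.
run_in() {
  local dir="$1"; shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ ($dir) $*"
  else
    (cd "$dir" && "$@")
  fi
}

# One code path for every product: dirty-check, update, build, deploy, health check.
deploy_product() {
  product_config "$1" || return 1
  # Dirty-check: refuse to deploy over uncommitted tracked changes.
  run_in "$REPO_DIR" git diff --quiet HEAD || { echo "dirty tree: $REPO_DIR" >&2; return 1; }
  run_in "$REPO_DIR" git pull --rebase
  run_in "$REPO_DIR" docker compose -f "$COMPOSE_FILE_NAME" build
  run_in "$REPO_DIR" docker compose -f "$COMPOSE_FILE_NAME" up -d
  run_in "$REPO_DIR" curl -fsS "$HEALTH_URL"
}
```

Under this sketch, a shim such as `deploy-clock.sh` would shrink to sourcing the library and calling `deploy_product clock`, and `DRY_RUN=1` gives a cheap way to diff what each product would run, which could also feed the drift guard in 0.5.
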
**Done when:** one code path drives all production deploys; per-product files are
config or thin shims; a drift check guards against regression.

> If Phase 0 is deferred, treat every checklist item in Phases 1–5 as
> "apply to invttrdg **and** clock **and** notes": the drift tax is real.

---

### Phase 1: Build off the VM, ship images, VM pulls ⟵ keystone

**Why.** D1 is the single biggest cause of slow builds, memory pressure, and
risky rollback. The production VM should *run* containers, not *compile* them.
Removing build work from the box frees its scarce RAM/CPU and turns deploys into
a fast `pull` + recreate. Tagging images by commit SHA gives an immutable,
re-pointable artifact (fixes D7).

**What.**

- Build images in CI (GitHub Actions / Gitea Actions) or on a dedicated builder.
- Push to the Gitea container/image registry, tagged `:<commit-sha>` **and**
  `:latest`.
- The VM deploy step becomes `docker compose pull && docker compose up -d`, with no
  `docker build` on the box.

**How (checklist).**

- [ ] **1.1** Stand up / confirm an image registry (reuse Gitea on the VM, or a
      hosted registry). Decide auth: reuse the existing `GITEA_NPM_TOKEN` pattern
      from `deploy-invttrdg.sh` for `docker login`.
- [ ] **1.2** Add a CI build job per production repo: build backend + web images
      with BuildKit, tag `:<git-sha>` + `:latest`, push to the registry. Reuse the
      existing `BYTELYST_COMMIT_*` build args already collected in
      `deploy-invttrdg.sh`.
- [ ] **1.3** Parameterize each `docker-compose.yml` service to use
      `image: <registry>/<svc>:${IMAGE_TAG:-latest}` instead of a local `build:`
      context for production. (Keep `build:` for local dev via an override file.)
- [ ] **1.4** Rewrite the VM-side deploy path to `docker compose pull` then
      `docker compose up -d` (no build). Pass `IMAGE_TAG=<sha>` to deploy a
      specific release.
- [ ] **1.5** Keep a thin "emergency build-on-VM" fallback flag for when the
      registry/CI is unavailable, but make pull-based the default.
- [ ] **1.6** Verify: a clean deploy uses **zero** `tsc`/`next build` CPU on the VM.

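The build/ship/pull split in 1.2 through 1.4 might be sketched as two small functions, one for CI and one for the VM. This is an illustration under assumptions, not the planned implementation: the registry path `gitea.example.internal/platform`, the function names, and the `DRY_RUN` helper are hypothetical, and the `BYTELYST_COMMIT_*` build args are omitted because their exact names are not listed here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical registry path; substitute the registry chosen in 1.1.
REGISTRY="${REGISTRY:-gitea.example.internal/platform}"
export DOCKER_BUILDKIT=1

# Print instead of execute when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# CI side: build once, tag with the commit SHA and :latest, push both.
# Usage: ci_build_push <service> <build-context> <git-sha>
ci_build_push() {
  local svc="$1" context="$2" sha="$3"
  run docker build -t "$REGISTRY/$svc:$sha" -t "$REGISTRY/$svc:latest" "$context"
  run docker push "$REGISTRY/$svc:$sha"
  run docker push "$REGISTRY/$svc:latest"
}

# VM side: no build, only pull + recreate. IMAGE_TAG feeds the
# image: <registry>/<svc>:${IMAGE_TAG:-latest} reference from 1.3.
vm_deploy() {
  export IMAGE_TAG="${1:-latest}"
  echo "deploying IMAGE_TAG=$IMAGE_TAG"
  run docker compose pull
  run docker compose up -d
}
```

In CI this could be called as `ci_build_push backend ./backend "$(git rev-parse HEAD)"`, and on the VM as `vm_deploy <sha>`. Because the VM function never invokes `docker build`, check 1.6 reduces to confirming that this is the only path the deploy takes.
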
**Done when:** deploys to the VM perform no compilation; images are addressable
by commit SHA in the registry.

---

### Phase 2: Stop-the-world → recreate-in-place

**Why.** D2: `deploy-invttrdg.sh` does `docker compose down` + `--force-recreate`,
taking everything down on every deploy. Plain `up -d` already recreates only the
containers whose image/config changed, leaving the rest running, which is exactly
what `deploy-clock.sh`/`deploy-notes.sh` already do. **This phase is mostly about
bringing invttrdg in line with clock/notes** (and, post-Phase 0, deleting the
divergence entirely).

**What.** Remove the blanket `down`; rely on Compose's differential recreate.
Target individual services where possible.

**How (checklist).**

- [ ] **2.1** Remove `docker compose down` from `deploy-invttrdg.sh` (the only
      script that has it). Replace `up -d --force-recreate` (all services) with
      `up -d` (differential); keep `--force-recreate` only for the case where
      config didn't change but image did and you're _not_ using SHA tags (after
      Phase 1, the SHA tag change makes Compose recreate automatically). Adopt
      the clock/notes pattern.
- [ ] **2.2** Support per-service deploys: `docker compose up -d --no-deps <svc>`
      so deploying the backend doesn't bounce unrelated services.
- [ ] **2.3** Confirm every service has a correct healthcheck and deploy with
      `up -d --wait`, so the command blocks until containers report healthy
      before the deploy is considered done (plain `up -d` returns as soon as the
      containers have started). Clock already handles the IPv6/`localhost`
      pitfall (F12 in the build roadmap) by forcing `127.0.0.1` in its
      healthcheck; verify the other production repos do the same.
- [ ] **2.4** (Later / optional) For true zero-downtime on a hot service, run two
      replicas behind Caddy and recreate one at a time. Defer until traffic needs it.

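Items 2.1 through 2.3 combine into a single Compose invocation. The sketch below assumes a Compose v2 release recent enough to support `up --wait`; the function name `deploy_services` and the `DRY_RUN` helper are illustrative and not part of the current scripts.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print instead of execute when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Recreate only the named services (no blanket `down`, no --force-recreate),
# leave their dependencies untouched, and block until healthchecks pass.
deploy_services() {
  if [ "$#" -eq 0 ]; then
    echo "usage: deploy_services <svc> [svc...]" >&2
    return 1
  fi
  run docker compose up -d --no-deps --wait "$@"
}
```

For example, `deploy_services backend` would touch only the backend container. A service without a healthcheck has nothing for `--wait` to observe beyond "running", which is why 2.3 asks for correct healthchecks first.
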
**Done when:** a routine single-service deploy does not interrupt the other
running services.

---

### Phase 3: Deploy only what changed; guarantee the fast path

**Why.** D3/D4: `deploy-all.sh` rebuilds everything sequentially, and none of the
`docker compose build` paths pin BuildKit. After Phase 1 most of this moves to CI,
but the VM-side and any remaining build paths should still skip untouched services
and use a deterministic, warm-cache builder.

**What.** Change-detection on what to deploy + an explicit BuildKit guarantee for
any path that still builds.

**How (checklist).**

- [ ] **3.1** In `deploy-all.sh` (or the Phase-0 unified library), compute changed
      services via `git diff --name-only <last-deployed-sha>..HEAD` and skip
      services with no changes. Record the last-deployed SHA per repo (e.g. a
      small state file or the registry tag).
- [ ] **3.2** Pin `DOCKER_BUILDKIT=1` (and `COMPOSE_DOCKER_CLI_BUILD=1` wherever
      legacy Compose v1 may still be installed), or switch to
      `docker buildx bake`, in **every** path that still builds:
      `deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh` (D4). That way
      behavior is not toolchain-version-dependent and the Dockerfile
      `--mount=type=cache` pnpm-store wins are guaranteed.
- [ ] **3.3** Where multiple independent images must build, build them in
      parallel (`buildx bake`, or `docker compose build --parallel`) instead of
      the current sequential loop.
- [ ] **3.4** Capture before/after timings into the Measurement targets table above.

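One possible shape for the change detection in 3.1, combined with the BuildKit pin from 3.2. It assumes each Compose service is built from a top-level directory of the same name (for example `backend/`, `web/`); that layout, the function names, and the `.last-deployed-sha` state file mentioned afterwards are assumptions made for the example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Map changed paths (stdin, one per line) to the services that own them.
# Arguments: the known service directory names.
changed_services() {
  local services=("$@") path svc
  while IFS= read -r path; do
    for svc in "${services[@]}"; do
      case "$path" in
        "$svc"/*) echo "$svc" ;;
      esac
    done
  done | sort -u
}

# Build and recreate only the services touched since the last deployed SHA.
# Usage: deploy_changed <last-deployed-sha> <svc> [svc...]
deploy_changed() {
  local last_sha="$1"; shift
  local -a changed
  mapfile -t changed < <(git diff --name-only "$last_sha"..HEAD | changed_services "$@")
  if [ "${#changed[@]}" -eq 0 ]; then
    echo "no service changed since $last_sha"
    return 0
  fi
  # Pin BuildKit explicitly so the cache-mount path is never version-dependent.
  DOCKER_BUILDKIT=1 docker compose build "${changed[@]}"
  docker compose up -d --no-deps "${changed[@]}"
}
```

A caller might run `deploy_changed "$(cat .last-deployed-sha)" backend web` and then write `git rev-parse HEAD` back to the state file on success. Note the limitation of a path-prefix mapping: a change to a shared file outside the service directories (a root lockfile, for instance) would not trigger any service, so a real implementation would likely need a "rebuild everything" rule for such paths.
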
**Done when:** "deploy all" only touches changed services and always uses the
warm-cache build path.

---

### Phase 4: Image size & VM resource guardrails

**Why.** D5/D6: bloated images and unbounded disk are the slow-burn causes of
"the VM filled up / a build OOM'd the box." Memory **limits already exist** via
`deploy.resources.limits.memory`; the remaining work is right-sizing them against
the real (lower-than-assumed) VM RAM, adding reservations, and adding disk hygiene.

**What.** Prune runtime deps, reconcile + extend memory caps, rotate logs, prune images.

**How (checklist).**

- [ ] **4.1** In each backend runtime stage, install production-only deps
      (`pnpm install --prod` / `pnpm deploy --prod`) instead of copying the full
      builder `node_modules` (D5). Verify the app still starts.
- [ ] **4.2** Reconcile the existing `deploy.resources.limits.memory` values
      (cosmos-emulator `1g`, azurite `256m`, services `128m–512m`) against the VM's
      actual RAM and confirm the **sum** fits with headroom. Add `reservations`
      (not just `limits`) so critical services keep a soft memory floor when the
      host is under pressure, and add `cpus` where a service is CPU-bursty.
- [ ] **4.3** Add Docker daemon log rotation (`json-file` with `max-size` +
      `max-file`, or ship logs to Loki only) so container logs can't fill disk.
- [ ] **4.4** Add a scheduled `docker image prune -f` (and `builder prune`) on the
      VM to reclaim dangling layers left by rebuilds.
- [ ] **4.5** Add a small swap file on the VM as an OOM safety net for any
      residual on-box work; alert when disk > 80%.

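The reconciliation in 4.2 is simple arithmetic, sketched here as a small check. The helper names, the 4096 MiB VM size, and the 20% headroom are placeholders chosen for the example (the VM's real RAM is not stated in this document), and only the `m` and `g` suffixes are handled.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Convert a Compose-style memory value ("256m", "1g") to MiB.
# Only whole-number m/g values are supported in this sketch.
to_mib() {
  local v n
  v="$(printf '%s' "$1" | tr 'A-Z' 'a-z')"
  n="${v%[gm]}"
  case "$n" in
    ''|*[!0-9]*) echo "unsupported value: $1" >&2; return 1 ;;
  esac
  case "$v" in
    *g) echo $(( n * 1024 )) ;;
    *m) echo "$n" ;;
    *)  echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Sum any number of limits, in MiB.
sum_limits_mib() {
  local total=0 v m
  for v in "$@"; do
    m="$(to_mib "$v")"
    total=$(( total + m ))
  done
  echo "$total"
}

# Succeed only if the summed limits fit inside VM RAM minus headroom.
# Usage: check_fits <vm-ram-mib> <headroom-percent> <limit> [limit...]
check_fits() {
  local ram="$1" headroom="$2"; shift 2
  local total budget
  total="$(sum_limits_mib "$@")"
  budget=$(( ram * (100 - headroom) / 100 ))
  echo "limits=${total}MiB budget=${budget}MiB"
  [ "$total" -le "$budget" ]
}
```

With the example inputs `check_fits 4096 20 1g 256m 512m 128m`, the limits sum to 1024 + 256 + 512 + 128 = 1920 MiB against a budget of 3276 MiB, so the check passes; the same limits against a 2048 MiB VM (budget 1638 MiB) would fail. The real inputs would be the full list of limits from the compose files and the RAM reported on the VM.
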
**Done when:** runtime images are prod-only, memory limits are reconciled with VM
RAM (+ reservations), and disk usage is bounded by rotation + prune.

---

### Phase 5: Fast, artifact-based rollback

**Why.** D7: rollback today means rebuild-on-VM. With SHA-tagged images from
Phase 1, rollback becomes re-pointing to a known-good tag.

**What.** A `rollback` command that redeploys a previous image tag.

**How (checklist).**

- [ ] **5.1** Keep the last N image SHAs in the registry (don't prune the most
      recent good tags).
- [ ] **5.2** Add `IMAGE_TAG=<previous-sha> docker compose up -d` rollback path
      (and a `deploy-*.sh --rollback [sha]` wrapper).
- [ ] **5.3** Update [`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) rollback
      section to use tag re-point instead of `git revert` + rebuild.
- [ ] **5.4** Document how to find the currently-deployed SHA (the
      `/api/devops/version` endpoint already exposed and checked by
      `deploy-invttrdg.sh`).

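A rough sketch of the rollback path in 5.2 and the version lookup in 5.4, assuming the Phase 1 `IMAGE_TAG` convention is in place. The function names, the `DRY_RUN` helper, and the base URL are hypothetical; the response format of `/api/devops/version` is not described here, so the sketch simply prints whatever the endpoint returns. Unlike the optional `[sha]` in 5.2, this version requires the SHA explicitly, since choosing "the previous release" automatically would need deploy history that is not modeled.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print instead of execute when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Deploy a specific image tag: pull it, then let Compose recreate what changed.
deploy_tag() {
  export IMAGE_TAG="$1"
  echo "deploying IMAGE_TAG=$IMAGE_TAG"
  run docker compose pull
  run docker compose up -d
}

# Rollback is just a deploy of an older, still-retained SHA tag (see 5.1).
rollback() {
  local sha="${1:-}"
  if [ -z "$sha" ]; then
    echo "usage: --rollback <previous-sha>" >&2
    return 1
  fi
  deploy_tag "$sha"
}

# Ask the running service which version it reports. Base URL is a placeholder.
deployed_version() {
  run curl -fsS "${1:-http://127.0.0.1:3000}/api/devops/version"
}

# Argument handling for a wrapper script.
main() {
  case "${1:-}" in
    --rollback) rollback "${2:-}" ;;
    *)          deploy_tag "${1:-latest}" ;;
  esac
}
```

A wrapper script would end with `main "$@"`, so `deploy.sh --rollback <sha>` and `deploy.sh <sha>` share the same pull-and-recreate path and no rebuild happens in either case.
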
**Done when:** rolling back is a sub-30s tag re-point with no rebuild.

---

## 3. Quick-reference summary

| Phase | Theme | Fixes | Primary symptom addressed |
|---|---|---|---|
| **0** | Unify deploy scripts | D8 | Stops fixes from being copy-pasted/drifting |
| **1** | Build off VM, pull images | D1, D7 | Slow builds + memory pressure + rollback |
| **2** | Recreate-in-place (align invttrdg) | D2 | Downtime |
| **3** | Deploy only changed + BuildKit guarantee | D3, D4 | Slow "deploy all" |
| **4** | Image slimming + RAM right-sizing + disk hygiene | D5, D6 | Disk/memory |
| **5** | Artifact rollback | D7 | Rollback speed/safety |

**Suggested order:** Phase 0 (unify, so later fixes land once) → Phase 1 → 2
(≈80% of the benefit across all three symptoms), then 3 → 4 → 5. If Phase 0 is
skipped, apply Phases 1–5 to invttrdg, clock, and notes individually.

## 4. Explicitly out of scope

- Image-build internals (pnpm/BuildKit/Gitea/`.npmrc.docker`/silent-break
  correctness); see
  [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
- Migration to Kubernetes/Swarm or a managed cloud runtime.
- Multi-platform image builds.

## 5. Related docs

- [`../DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md): current production deploy procedure
- [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md): image-build layer
- [`VM_OBSERVABILITY_ROADMAP.md`](VM_OBSERVABILITY_ROADMAP.md): metrics/monitoring for the VM
- [`vm-security-blind-spots-roadmap.md`](vm-security-blind-spots-roadmap.md): VM hardening
- [`../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md): VM deployment status snapshot