
# Deployment Optimization Roadmap
> **Status:** v2 (**PROPOSED** — analysis complete, no changes applied yet) · **Owner:** Platform DevOps · **Created:** 2026-05-31 · **Revised:** 2026-05-31
>
> **v2 review pass:** corrected findings after auditing *all* deploy scripts +
> compose files (not just `deploy-invttrdg.sh`). Key fixes: memory limits already
> exist (D6 rewritten), `down`/`--force-recreate` is invttrdg-only not universal
> (D2 re-scoped), BuildKit gap spans all `docker compose build` paths (D4
> broadened), and added **D8 — deploy-script drift** across the per-product scripts.
>
> Optimize the **deployment orchestration layer** for the single-Azure-VM MVP:
> reduce deploy wall-clock time, eliminate deploy-time downtime, and stop the
> production VM from running out of RAM/disk during builds.
>
> **Scope boundary — read this first.** This roadmap is about *how we ship and
> run* images on the VM (the `deploy-*.sh` scripts, `docker compose` lifecycle,
> registry strategy, VM resource limits). The complementary *image-build*
> concerns (pnpm install speed, BuildKit cache mounts, `.npmrc.docker`, Gitea
> registry correctness, "build green / app broken" silent failures) live in
> [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md)
> and are **out of scope here** — this doc references that work, never duplicates it.
---
## 0. Current state (audited 2026-05-31)
The production deployment model is **Docker Compose on a single Azure VM**,
fronted by Caddy on `80/443` (see
[`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) and
[`learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md)).
Four production repos deploy here: `learning_ai_invt_trdg`,
`learning_ai_common_plat`, `learning_ai_clock`, `learning_ai_notes`.

**Deploy surfaces audited (2026-05-31):** `deploy-invttrdg.sh`, `deploy-clock.sh`,
`deploy-notes.sh`, `deploy-all.sh`, `dashboard/deploy.sh`, plus the production
`docker-compose.yml` of clock and the `docker-compose.ecosystem.yml` of common_plat.

**What is already good** (do not redo): the per-repo Dockerfiles are
multi-stage, use `node:22-alpine`, mount a BuildKit pnpm-store cache, inject the
Gitea token via a BuildKit secret, and emit Next.js `standalone` output. The
image-build layer is in good shape, credit to the build-optimization roadmap.

**The bottlenecks are in the deployment flow, not the Dockerfiles.**

| # | Finding | Location | Symptom it causes | Severity |
|---|---|---|---|---|
| **D1** | **Images are built _on the production VM_.** `docker build` runs on the box while ~31 containers (Cosmos emulator + Azurite are RAM-heavy) are live. The VM has **less RAM than the deploy scripts assumed** (per the deployment-status doc), so builds thrash/swap. | `deploy-invttrdg.sh` build step; `deploy-all.sh` `docker compose build` | Slow builds **and** memory pressure **and** risk of OOM-killing live services | **High (keystone)** |
| **D2** | **`deploy-invttrdg.sh` does a blanket `docker compose down`, then `up -d --force-recreate`.** It stops _all_ its services and restarts them cold, even for a one-line change. (Note: `deploy-clock.sh`/`deploy-notes.sh` do **not**; they already run `docker compose build` + `docker compose up -d`, i.e. differential recreate. So this is a per-script inconsistency, not a universal pattern; see D8.) | `deploy-invttrdg.sh` (`down`, then `--force-recreate`) | Deploy-time downtime + cold caches on invttrdg releases | **Medium** |
| **D3** | **`deploy-all.sh` rebuilds every service in every repo, sequentially.** No change-detection, no parallelism; loops repos one-by-one and runs `docker compose build` (all services). | `deploy-all.sh` deploy loop | Multi-repo "deploy all" takes minutes even when one file changed | **High** |
| **D4** | **The `docker compose build` paths don't pin BuildKit.** `deploy-clock.sh`, `deploy-notes.sh`, and `deploy-all.sh` all call plain `docker compose build` with no `DOCKER_BUILDKIT=1` / `COMPOSE_DOCKER_CLI_BUILD=1`. Modern Compose v2 defaults to BuildKit, and the compose files use build-time `secrets:`, which **require** BuildKit, so on a stale toolchain the build hard-fails rather than silently slowing down. Pinning it explicitly removes the version-dependent ambiguity and guarantees the Dockerfile `--mount=type=cache` pnpm-store wins. | `deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh` build steps | Toolchain-dependent build behavior; risk of losing warm-cache wins | **Low–Medium** |
| **D5** | **Whole `node_modules` (dev deps included) copied into the runtime stage.** Backend runtime `COPY --from=builder .../node_modules` is the full install (verified in `learning_ai_clock/backend/Dockerfile`). | `*/backend/Dockerfile` runtime stage | Larger images → more disk, slower `pull`/start | **Medium** |
| **D6** | **Memory limits exist but are likely untuned vs the real VM, and log/image-disk guardrails are absent.** Limits _are_ set via `deploy.resources.limits.memory` (cosmos-emulator `1g`, azurite `256m`, most services `128m`–`512m`). The gaps: (a) the **sum** of limits hasn't been reconciled against the VM's actual (lower-than-assumed) RAM; (b) only `limits`, no `reservations`; (c) no scheduled `docker image prune` and no container-log rotation, so disk creeps unbounded. | `docker-compose.ecosystem.yml`, product `docker-compose.yml`, VM cron | Disk creeps to full; limits may over- or under-commit RAM | **Medium** |
| **D7** | **Rollback requires a rebuild.** `DEPLOYMENT_GUIDE.md` rollback path is `git revert` + `docker compose up -d --build` — i.e. rebuild on the VM to roll back. Production images are tagged `:latest` (e.g. `invttrdg-backend:latest`) or auto-named by Compose, so there is no immutable per-release artifact to re-point to. | `DEPLOYMENT_GUIDE.md` rollback section; image tags | Slow, risky rollbacks; no clean "known-good" artifact | **Medium** |
| **D8** | **Deploy-script drift: no single source of truth.** Three near-duplicate per-product scripts (`deploy-invttrdg.sh`, `deploy-clock.sh`, `deploy-notes.sh`) plus `deploy-all.sh` and `dashboard/deploy.sh` have diverged: invttrdg builds via an explicit `docker build` loop then `down` + `--force-recreate`; clock/notes use `docker compose build` + `up -d`; common_plat has no dedicated script (goes through `deploy-all.sh`). Any fix in this roadmap must be applied N times and risks further drift. | all `deploy-*.sh` + `dashboard/deploy.sh` | Inconsistent behavior; fixes don't propagate; maintenance burden | **Medium** |

**Implications**
- **D1 is the keystone.** Moving builds off the VM removes the build/runtime
resource contention that drives slow builds, memory pressure, _and_ (with
SHA-tagged images) enables fast rollback. Most other items compound on top of it.
- **D8 is a force-multiplier risk.** Because the per-product scripts have already
drifted, every other fix (D2, D4, change-detection, rollback) must either be
applied 3+ times or — better — the scripts should first be unified behind one
parameterized deploy library. Prefer consolidating early so later phases land once.
- D2 removes invttrdg's deploy-time downtime and is low-risk; clock/notes already
do the right thing here, which is exactly why unifying (D8) matters.
- D3/D4 are the "deploy all is slow" story; D5/D6 are the disk/memory story. Note
D6's headline gap is **disk hygiene + RAM right-sizing**, not "limits missing" —
limits already exist.
---
## 1. Goals & non-goals
**Goals**
- Cut warm deploy wall-clock time to **seconds** for a single changed service.
- **Zero (or near-zero) downtime** for routine deploys.
- Keep the production VM's RAM/disk **predictable and bounded** during deploys.
- Make rollback an **artifact re-point**, not a rebuild.

**Non-goals**
- ❌ Adopting Kubernetes / Swarm / Nomad. Compose-on-one-VM is the correct
model at MVP; revisit orchestration only when we outgrow a single host.
- ❌ Re-doing image-build internals (pnpm, BuildKit cache, Gitea path) — owned
by [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
- ❌ Full blue/green infra. Tier 1 already removes most downtime; multi-replica
comes only when traffic justifies it.
### Measurement targets
| Metric | Baseline (observed/estimated) | Target |
|---|---|---|
| Warm deploy, 1 service changed | ~2–3 min (build-on-VM) | **< 30 s** (pull + recreate one service) |
| Deploy-time downtime per service | full stop/start cycle | **~0** (recreate-in-place, old stays up until new is ready) |
| Peak VM RAM during deploy | build spike on top of live stack | **no build spike** (build is off-VM) |
| Rollback to previous release | rebuild on VM (minutes) | **< 30 s** (re-point SHA tag + `up -d`) |
> Fill in actuals during Phase 3.
---
## 2. Phased roadmap (why / what / how)
Phases are ordered by leverage. **Phase 0 (unify the scripts) is a prerequisite
enabler; Phase 1 is the keystone.** Every later phase should land in the unified
script once, not be copy-pasted across the per-product scripts.
### Phase 0 — Unify the deploy scripts (prerequisite)
**Why.** D8: `deploy-invttrdg.sh`, `deploy-clock.sh`, and `deploy-notes.sh` are
near-duplicate copies that have already drifted (invttrdg uses an explicit
`docker build` loop + `down` + `--force-recreate`; clock/notes use
`docker compose build` + `up -d`). Every fix below would otherwise have to be
written 3+ times. Consolidating first means Phases 1–5 are implemented once.

**What.** A single parameterized deploy entrypoint (one script or a sourced
library) that takes the repo/product as input and encodes the build, lifecycle,
and health-check steps once. The per-product scripts become thin wrappers (or are
replaced by `deploy.sh <product>`).

**How (checklist).**
- [ ] **0.1** Inventory the divergence across `deploy-invttrdg.sh`,
`deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh`, `dashboard/deploy.sh`
(build method, lifecycle command, health endpoints, package-publication check).
- [ ] **0.2** Extract the shared steps (dirty-check, fetch/rebase, smoke test,
build, deploy, health check) into one library; express per-product
differences (ports, endpoints, image names) as config/data.
- [ ] **0.3** Replace the per-product scripts with thin wrappers calling the
library; keep the old filenames as shims so existing muscle memory + the
DevOps dashboard's deploy buttons keep working.
- [ ] **0.4** Point `deploy-all.sh` and `dashboard/deploy.sh` at the same library.
- [ ] **0.5** Add a drift guard (lint/CI) so the scripts can't silently diverge
again; mirror the `check-*-drift.sh` pattern already used for `.npmrc` and
`docker-prep.sh` in `learning_ai_common_plat`.
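
A minimal sketch of what the unified library from 0.2 and 0.3 could look like. The file name `deploy-lib.sh`, the function names, the install paths under `/opt/apps`, the ports, and the `/health` path are all placeholders, not values taken from the existing scripts. It also leaves out the dirty-check, fetch/rebase, and smoke-test steps. The point is only that per-product differences live in one lookup, so a wrapper such as `deploy-clock.sh` would shrink to sourcing the library and calling `deploy_product clock`.

```shell
# deploy-lib.sh: hypothetical shared library, meant to be sourced by thin wrappers.
# Every path, port, and health URL below is a placeholder.

# Print "<compose-dir> <health-url>" for a product; fail for unknown names.
product_config() {
  case "$1" in
    invttrdg) echo "/opt/apps/learning_ai_invt_trdg http://127.0.0.1:4000/health" ;;
    clock)    echo "/opt/apps/learning_ai_clock http://127.0.0.1:4100/health" ;;
    notes)    echo "/opt/apps/learning_ai_notes http://127.0.0.1:4200/health" ;;
    *)        echo "unknown product: $1" >&2; return 1 ;;
  esac
}

# One lifecycle for every product: build, differential recreate, health check.
deploy_product() {
  cfg=$(product_config "$1") || return 1
  dir=${cfg%% *}        # text before the first space
  health_url=${cfg#* }  # text after the first space
  ( cd "$dir" && docker compose build && docker compose up -d ) || return 1
  curl --fail --silent --show-error "$health_url" >/dev/null
}

# Thin wrapper usage (e.g. inside deploy-clock.sh):
#   . ./deploy-lib.sh && deploy_product clock
```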

**Done when:** one code path drives all production deploys; per-product files are
config or thin shims; a drift check guards against regression.

> If Phase 0 is deferred, treat every checklist item in Phases 1–5 as
> "apply to invttrdg **and** clock **and** notes". The drift tax is real.
---
### Phase 1 — Build off the VM, ship images, VM pulls ⟵ keystone
**Why.** D1 is the single biggest cause of slow builds, memory pressure, and
risky rollback. The production VM should *run* containers, not *compile* them.
Removing build work from the box frees its scarce RAM/CPU and turns deploys into
a fast `pull` + recreate. Tagging images by commit SHA gives an immutable,
re-pointable artifact (fixes D7).

**What.**
- Build images in CI (GitHub Actions / Gitea Actions) or on a dedicated builder.
- Push to the Gitea container/image registry, tagged `:<commit-sha>` **and**
`:latest`.
- The VM deploy step becomes `docker compose pull && docker compose up -d`, with no
`docker build` on the box.

**How (checklist).**
- [ ] **1.1** Stand up / confirm an image registry (reuse Gitea on the VM, or a
hosted registry). Decide auth: reuse the existing `GITEA_NPM_TOKEN` pattern
from `deploy-invttrdg.sh` for `docker login`.
- [ ] **1.2** Add a CI build job per production repo: build backend + web images
with BuildKit, tag `:<git-sha>` + `:latest`, push to the registry. Reuse the
existing `BYTELYST_COMMIT_*` build args already collected in
`deploy-invttrdg.sh`.
- [ ] **1.3** Parameterize each `docker-compose.yml` service to use
`image: <registry>/<svc>:${IMAGE_TAG:-latest}` instead of a local `build:`
context for production. (Keep `build:` for local dev via an override file.)
- [ ] **1.4** Rewrite the VM-side deploy path to `docker compose pull` then
`docker compose up -d` (no build). Pass `IMAGE_TAG=<sha>` to deploy a
specific release.
- [ ] **1.5** Keep a thin "emergency build-on-VM" fallback flag for when the
registry/CI is unavailable, but make pull-based the default.
- [ ] **1.6** Verify: a clean deploy uses **zero** `tsc`/`next build` CPU on the VM.
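
The split between the CI side (1.2) and the VM side (1.4) can be sketched as below. The function names and the registry value passed in are hypothetical, and the sketch omits the Gitea token build secret and the `BYTELYST_COMMIT_*` build args that a real job would need to pass. It assumes the compose file already references `image: <registry>/<svc>:${IMAGE_TAG:-latest}` as described in 1.3.

```shell
# Compose the fully qualified image reference.
# $1 = registry host/namespace, $2 = service name, $3 = tag
image_ref() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# CI side: build once, tag with the commit SHA and :latest, push both tags.
build_and_push() {
  registry=$1 svc=$2 context=$3
  sha=$(git rev-parse --short=12 HEAD) || return 1
  DOCKER_BUILDKIT=1 docker build \
    -t "$(image_ref "$registry" "$svc" "$sha")" \
    -t "$(image_ref "$registry" "$svc" latest)" \
    "$context" || return 1
  docker push "$(image_ref "$registry" "$svc" "$sha")" &&
    docker push "$(image_ref "$registry" "$svc" latest)"
}

# VM side: no build, only pull and differential recreate for the given tag.
deploy_tag() {
  IMAGE_TAG=$1 docker compose pull &&
    IMAGE_TAG=$1 docker compose up -d
}
```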

**Done when:** deploys to the VM perform no compilation; images are addressable
by commit SHA in the registry.

---
### Phase 2 — Stop-the-world → recreate-in-place
**Why.** D2: `deploy-invttrdg.sh` does `docker compose down` + `--force-recreate`,
taking everything down on every deploy. Plain `up -d` already recreates only the
containers whose image/config changed, leaving the rest running, which is exactly
what `deploy-clock.sh`/`deploy-notes.sh` already do. **This phase is mostly about
bringing invttrdg in line with clock/notes** (and, post-Phase 0, deleting the
divergence entirely).

**What.** Remove the blanket `down`; rely on Compose's differential recreate.
Target individual services where possible.

**How (checklist).**
- [ ] **2.1** Remove `docker compose down` from `deploy-invttrdg.sh` (the only
script that has it). Replace `up -d --force-recreate` (all services) with
`up -d` (differential); use `--force-recreate` only when the config didn't change
but the image did and you're _not_ using SHA tags (after Phase 1, the SHA tag
change makes Compose recreate automatically). Adopt the clock/notes pattern.
- [ ] **2.2** Support per-service deploys: `docker compose up -d --no-deps <svc>`
so deploying the backend doesn't bounce unrelated services.
- [ ] **2.3** Confirm every service has a correct healthcheck, and deploy with
`up -d --wait` so the command blocks until services report healthy before the
deploy is considered done (plain `up -d` returns without waiting on health).
Clock already handles the IPv6/`localhost` pitfall (F12 in the build roadmap) by
forcing `127.0.0.1` in its healthcheck; verify the other production repos do the same.
- [ ] **2.4** (Later / optional) For true zero-downtime on a hot service, run two
replicas behind Caddy and recreate one at a time. Defer until traffic needs it.
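
A per-service deploy along the lines of 2.2 and 2.3 might look like the sketch below. The function name and the `DRY_RUN` switch are hypothetical conveniences; `--no-deps` and `--wait` are standard `docker compose up` flags. The dry-run mode prints the command instead of running it, which makes the wrapper easy to check without touching live containers.

```shell
# Recreate only the named services and leave everything else running.
# With DRY_RUN=1 the command is printed instead of executed.
recreate_services() {
  if [ "$#" -eq 0 ]; then
    echo "usage: recreate_services <service>..." >&2
    return 2
  fi
  # --no-deps: do not start or recreate linked services
  # --wait:    block until the services are running/healthy
  set -- docker compose up -d --no-deps --wait "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: recreate_services backend
```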

**Done when:** a routine single-service deploy does not interrupt the other
running services.

---
### Phase 3 — Deploy only what changed; guarantee the fast path
**Why.** D3/D4: `deploy-all.sh` rebuilds everything sequentially, and none of the
`docker compose build` paths pin BuildKit. After Phase 1 most of this moves to CI,
but the VM-side and any remaining build paths should still skip untouched services
and use a deterministic, warm-cache builder.

**What.** Change-detection on what to deploy + an explicit BuildKit guarantee for
any path that still builds.

**How (checklist).**
- [ ] **3.1** In `deploy-all.sh` (or the Phase-0 unified library), compute changed
services via `git diff --name-only <last-deployed-sha>..HEAD` and skip
services with no changes. Record the last-deployed SHA per repo (e.g. a
small state file or the registry tag).
- [ ] **3.2** Pin `DOCKER_BUILDKIT=1` and `COMPOSE_DOCKER_CLI_BUILD=1` (or switch
to `docker buildx bake`) in **every** path that still builds
(`deploy-clock.sh`, `deploy-notes.sh`, `deploy-all.sh`; see D4) so behavior is
not toolchain-version-dependent and the Dockerfile `--mount=type=cache`
pnpm-store wins are guaranteed.
- [ ] **3.3** Where multiple independent images must build, build them in
parallel (`buildx bake`, or `docker compose build --parallel`) instead of
the current sequential loop.
- [ ] **3.4** Capture before/after timings into the Measurement targets table above.
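
The change detection in 3.1 can be sketched as a filter over `git diff --name-only` output. This assumes a hypothetical repo layout where each service lives in a top-level directory named after it (for example `backend/` and `web/`); the real repos may be laid out differently. It also ignores shared files such as a root lockfile, which a real implementation would probably treat as "everything changed".

```shell
# Read changed file paths on stdin; print each affected service once.
# The candidate service names are passed as arguments.
changed_services() {
  paths=$(cat)
  for svc in "$@"; do
    # A service counts as changed if any path starts with "<svc>/".
    if printf '%s\n' "$paths" | grep -q "^$svc/"; then
      echo "$svc"
    fi
  done
}

# Diff the last-deployed SHA against HEAD and list the services to deploy.
# $1 = last-deployed SHA, remaining arguments = service names
services_to_deploy() {
  last_sha=$1; shift
  git diff --name-only "$last_sha..HEAD" | changed_services "$@"
}

# Usage: services_to_deploy "$(cat .last-deployed-sha)" backend web
```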

**Done when:** "deploy all" only touches changed services and always uses the
warm-cache build path.

---
### Phase 4 — Image size & VM resource guardrails
**Why.** D5/D6: bloated images and unbounded disk are the slow-burn causes of
"the VM filled up / a build OOM'd the box." Memory **limits already exist** via
`deploy.resources.limits.memory`; the remaining work is right-sizing them against
the real (lower-than-assumed) VM RAM, adding reservations, and adding disk hygiene.

**What.** Prune runtime deps, reconcile + extend memory caps, rotate logs, prune images.

**How (checklist).**
- [ ] **4.1** In each backend runtime stage, install production-only deps
(`pnpm install --prod` / `pnpm deploy --prod`) instead of copying the full
builder `node_modules` (D5). Verify the app still starts.
- [ ] **4.2** Reconcile the existing `deploy.resources.limits.memory` values
(cosmos-emulator `1g`, azurite `256m`, services `128m`–`512m`) against the VM's
actual RAM and confirm the **sum** fits with headroom. Add `reservations` (not
just `limits`) so critical services keep a soft memory floor under contention,
and add `cpus` where a service is CPU-bursty.
- [ ] **4.3** Add Docker daemon log rotation (`json-file` with `max-size` +
`max-file`, or ship logs to Loki only) so container logs can't fill disk.
- [ ] **4.4** Add a scheduled `docker image prune -f` (and `builder prune`) on the
VM to reclaim dangling layers left by rebuilds.
- [ ] **4.5** Add a small swap file on the VM as an OOM safety net for any
residual on-box work; alert when disk > 80%.
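
The reconciliation in 4.2 is simple arithmetic: for example, limits of `1g`, `256m`, and `512m` add up to 1024 + 256 + 512 = 1792 MiB, which then has to fit under the VM's real RAM with headroom. A small helper along these lines could run in CI or before a deploy. The function names are hypothetical, the sketch only understands whole-number `m` and `g` suffixes, and the budget is whatever figure the team decides to allow for containers. The prune helper corresponds to 4.4 and uses the standard `docker image prune` and `docker builder prune` commands.

```shell
# Convert a Compose memory value such as 512m or 1g to MiB.
to_mib() {
  case "$1" in
    *[gG]) echo $(( ${1%[gG]} * 1024 )) ;;
    *[mM]) echo $(( ${1%[mM]} )) ;;
    *)     echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Sum the limits and compare with a RAM budget in MiB.
# $1 = budget in MiB, remaining arguments = memory limits
check_memory_budget() {
  budget=$1; shift
  total=0
  for limit in "$@"; do
    mib=$(to_mib "$limit") || return 1
    total=$(( total + mib ))
  done
  echo "total=${total}MiB budget=${budget}MiB"
  [ "$total" -le "$budget" ]
}

# Disk hygiene for a scheduled job: drop dangling images and build cache.
prune_docker_disk() {
  docker image prune -f && docker builder prune -f
}
```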

**Done when:** runtime images are prod-only, memory limits are reconciled with VM
RAM (+ reservations), and disk usage is bounded by rotation + prune.

---
### Phase 5 — Fast, artifact-based rollback
**Why.** D7: rollback today means rebuild-on-VM. With SHA-tagged images from
Phase 1, rollback becomes re-pointing to a known-good tag.

**What.** A `rollback` command that redeploys a previous image tag.

**How (checklist).**
- [ ] **5.1** Keep the last N image SHAs in the registry (don't prune the most
recent good tags).
- [ ] **5.2** Add an `IMAGE_TAG=<previous-sha> docker compose up -d` rollback path
(and a `deploy-*.sh --rollback [sha]` wrapper).
- [ ] **5.3** Update [`DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) rollback
section to use tag re-point instead of `git revert` + rebuild.
- [ ] **5.4** Document how to find the currently-deployed SHA (the
`/api/devops/version` endpoint is already exposed and checked by
`deploy-invttrdg.sh`).
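
The rollback path in 5.2 could be as small as the sketch below. The function names are hypothetical, and it assumes Phase 1 is in place so that the compose file resolves images through `IMAGE_TAG`. Validating the argument first guards against accidentally "rolling back" to a mutable tag such as `latest`.

```shell
# Accept only something that looks like a git SHA (7 to 40 lowercase hex chars).
is_sha() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}

# Re-point the stack to a previously pushed image tag; no rebuild involved.
rollback_to() {
  if ! is_sha "$1"; then
    echo "not a commit SHA: $1" >&2
    return 2
  fi
  # Compose recreates only the services whose image reference changed.
  IMAGE_TAG=$1 docker compose pull &&
    IMAGE_TAG=$1 docker compose up -d
}

# Usage: rollback_to <previous-sha>
```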

**Done when:** rolling back is a sub-30s tag re-point with no rebuild.

---
## 3. Quick-reference summary
| Phase | Theme | Fixes | Primary symptom addressed |
|---|---|---|---|
| **0** | Unify deploy scripts | D8 | Stops fixes from being copy-pasted/drifting |
| **1** | Build off VM, pull images | D1, D7 | Slow builds + memory pressure + rollback |
| **2** | Recreate-in-place (align invttrdg) | D2 | Downtime |
| **3** | Deploy only changed + BuildKit guarantee | D3, D4 | Slow "deploy all" |
| **4** | Image slimming + RAM right-sizing + disk hygiene | D5, D6 | Disk/memory |
| **5** | Artifact rollback | D7 | Rollback speed/safety |

**Suggested order:** Phase 0 (unify, so later fixes land once) → Phase 1 → 2
(≈80% of the benefit across all three symptoms), then 3 → 4 → 5. If Phase 0 is
skipped, apply Phases 2–4 to invttrdg, clock, and notes individually.
## 4. Explicitly out of scope
- Image-build internals (pnpm/BuildKit/Gitea/`.npmrc.docker`/silent-break
correctness) — see
[`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md).
- Migration to Kubernetes/Swarm or a managed cloud runtime.
- Multi-platform image builds.
## 5. Related docs
- [`../DEPLOYMENT_GUIDE.md`](../DEPLOYMENT_GUIDE.md) — current production deploy procedure
- [`docker-build-optimization-roadmap.md`](docker-build-optimization-roadmap.md) — image-build layer
- [`VM_OBSERVABILITY_ROADMAP.md`](VM_OBSERVABILITY_ROADMAP.md) — metrics/monitoring for the VM
- [`vm-security-blind-spots-roadmap.md`](vm-security-blind-spots-roadmap.md) — VM hardening
- [`../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md`](../../learning_ai_common_plat/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md) — VM deployment status snapshot