docs(infra): fix stale service count (27→30), update prompt.md + README.md for Codex agent readiness

- prompt.md: mark tasks 1-3 as DONE, add 'Current State' section listing all implemented features, update bugs-fixed table (16 items), fix service count in architecture diagram, add CLI reference, remove stale --frozen-lockfile - README.md: add Resume & Retry section with examples, add CLI Flags table, fix service count in title/phases, update build failure troubleshooting with build log paths and retry command - setup.sh: fix '27 services' → '30 services' in header comment and banner
2026-03-24 12:35:59 -07:00 · 2026-03-24 12:35:59 -07:00 · 35021b67b9
commit 35021b67b9
parent acbab75aaa
3 changed files with 108 additions and 86 deletions
--- a/docs/devops/single_azure_vm/README.md
+++ b/docs/devops/single_azure_vm/README.md
@ -1,6 +1,6 @@
 # ByteLyst Single-VM Deployment

-> Deploy the **entire ByteLyst ecosystem** on a single **raw** Azure VM.
+> Deploy the **entire ByteLyst ecosystem** (30 services, 10 products) on a single **raw** Azure VM.
 > Nothing pre-installed required — the script handles everything from a blank Ubuntu machine.
 > Two files: this README and `setup.sh`. Copy both to the VM and run the script.

@ -30,7 +30,20 @@ sudo ./setup.sh
 # 4. Wait ~15-25 minutes for full build + deploy

 # 5. Verify
-docker compose -f /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml ps
+/opt/bytelyst/check-health.sh
+```
+
+### Resume & Retry
+
+Phase completion is tracked. If anything fails, you don't have to start over:
+
+```bash
+sudo ./setup.sh --phase=7          # Retry just the deploy phase
+sudo ./setup.sh --resume           # Auto-resume after SSH disconnect
+sudo ./setup.sh --resume-from=7    # Jump to deploy after manual fix
+sudo ./setup.sh --status           # Check what's done
+sudo ./setup.sh --reset            # Start completely over
+sudo ./setup.sh --help             # Show full usage
 ```

 ## What the Script Installs & Does
@ -56,8 +69,8 @@ docker compose -f /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem
 | 4. Build | ~5 min | `pnpm install && pnpm -r build` all `@bytelyst/*` packages |
 | 5. Publish | ~3 min | Publish all packages to local Gitea npm registry |
 | 6. Env | instant | Generate `.env.ecosystem` with Cosmos emulator key, Azurite key, JWT secret |
-| 7. Deploy | ~10 min | `docker compose up --build` for 27 services |
-| 8. Verify | ~1 min | Health-check all services + create `/opt/bytelyst/check-health.sh` |
+| 7. Deploy | ~10 min | Per-service Docker build + deploy (30 services, with fallback) |
+| 8. Verify | ~1 min | Health-check all 30+ endpoints + create `/opt/bytelyst/check-health.sh` |

 ## Port Map (after deployment)

@ -147,11 +160,22 @@ All optional — defaults work for most setups:
 | `SKIP_CLONE` | `0` | Set `1` to skip cloning (re-runs) |
 | `SKIP_BUILD` | `0` | Set `1` to skip package build+publish (re-runs) |

+## CLI Flags
+
+| Flag | Description |
+|------|-------------|
+| `--resume` | Auto-resume from last completed phase |
+| `--resume-from=N` | Resume from phase N (1-8) |
+| `--phase=N` | Run ONLY phase N (useful for retrying) |
+| `--reset` | Clear phase markers and start fresh |
+| `--status` | Show completed phases and exit |
+| `-h`, `--help` | Show usage help |
+
 ## Troubleshooting

 - **Cosmos emulator slow:** It needs 20-30s on first boot. Services wait via health checks.
 - **Out of memory:** Use at least 32 GB RAM. Cosmos emulator needs ~4 GB, Ollama needs ~4 GB for 3B models.
- **Build failures:** Check Gitea is running (`docker ps | grep gitea`) and packages published (`curl http://localhost:3300/api/packages/bytelyst/npm/`).
+- **Build failures:** Check Gitea is running (`docker ps | grep gitea`) and packages published (`curl http://localhost:3300/api/packages/bytelyst/npm/`). Per-service build logs: `/opt/bytelyst/.setup-state/builds/<service>.log`. Retry: `sudo ./setup.sh --phase=7`.
 - **Ollama not responding:** Check `systemctl status ollama` or `curl http://localhost:11434/api/version`.
 - **Port conflicts:** Ensure nothing else runs on the listed ports before deploying.
 - **Corporate proxy in Dockerfiles:** The script auto-strips hardcoded proxy ENVs from cloned Dockerfiles.
--- a/docs/devops/single_azure_vm/prompt.md
+++ b/docs/devops/single_azure_vm/prompt.md
@ -1,23 +1,42 @@
 # Codex Agent Prompt: ByteLyst Single-VM E2E Deployment

-> **Goal:** Review, harden, test, and complete `setup.sh` so it works flawlessly on a raw Ubuntu 24.04 Azure VM — zero manual intervention, 100% completion, all 27 services healthy.
+> **Goal:** Review, harden, test, and complete `setup.sh` so it works flawlessly on a raw Ubuntu 24.04 Azure VM — zero manual intervention, 100% completion, all 30 services healthy.
+>
+> **IMPORTANT:** Read the "Current State" section below FIRST. Many tasks in this prompt are already completed. Do NOT re-implement them.

 ---

 ## Context

-This folder contains two files you must work with:
+This folder contains three files you must work with:

- **`setup.sh`** — 8-phase bash script that bootstraps the entire ByteLyst ecosystem on a blank Ubuntu VM
+- **`setup.sh`** — 8-phase bash script (~940 lines) that bootstraps the entire ByteLyst ecosystem on a blank Ubuntu VM
 - **`README.md`** — Deployment guide documenting what the script does, ports, troubleshooting
+- **`prompt.md`** — This file (agent instructions)

-The script installs everything from scratch (Docker, Node.js, pnpm, Gitea, Ollama) then clones 11 repos, builds + publishes ~49 `@bytelyst/*` npm packages to a local Gitea registry, generates environment config, and deploys 27 Docker Compose services.
+The script installs everything from scratch (Docker, Node.js, pnpm, Gitea, Ollama) then clones 11 repos, builds + publishes ~49 `@bytelyst/*` npm packages to a local Gitea registry, generates environment config, and deploys 30 Docker Compose services (6 infra + 3 platform + 2 dashboards + 10 backends + 9 webs).
+
+### Current State (ALREADY IMPLEMENTED — do NOT redo)
+
+The following features are already built and tested in `setup.sh`:
+
+- **Resume/retry support:** `--resume`, `--resume-from=N`, `--phase=N`, `--reset`, `--status`, `--help` CLI flags
+- **Phase completion markers:** Stored in `/opt/bytelyst/.setup-state/phaseN.done`
+- **GITEA_NPM_TOKEN auto-restore:** Token saved to `/opt/bytelyst/.gitea_token`, restored on resume
+- **Per-service Docker build:** Phase 7 builds each of 30 services individually with `[N/30]` progress
+- **Per-service fallback:** Failed builds are skipped, remaining services still start
+- **Build logs:** Saved per-service to `/opt/bytelyst/.setup-state/builds/<service>.log`
+- **Phase 7 partial failure handling:** Phase 7 NOT marked done if builds fail, so `--resume` retries it
+- **set -euo pipefail safety:** All pipelines in fallback paths use `|| true` to prevent premature abort
+- **Ollama model pull non-fatal:** Model download failure doesn't abort the entire setup
+- **SSH disconnect protection:** All output tee'd to `/opt/bytelyst/setup.log`
+- **Idempotent:** Every phase handles re-runs gracefully

 ### Key files outside this folder that the script depends on

 | File | Repo | Purpose |
 |------|------|---------|
-| `docker-compose.ecosystem.yml` | `learning_ai_common_plat` (root) | Defines all 27 services |
+| `docker-compose.ecosystem.yml` | `learning_ai_common_plat` (root) | Defines all 30 services |
 | `.env.ecosystem.example` | `learning_ai_common_plat` (root) | Template for env vars |
 | `packages/*/package.json` | `learning_ai_common_plat` | ~49 `@bytelyst/*` packages to publish |
 | `backend/Dockerfile` | Each of the 10 product repos | Product backend Docker builds |
@ -58,89 +77,51 @@ The following issues have already been identified and fixed in the current `setu
 | `localmemgpt-backend` can't reach Ollama on Linux | `extra_hosts: ['host.docker.internal:host-gateway']` in compose | `3b31709b` |
 | Dashboard Dockerfiles had hardcoded corporate proxy | Converted to `ARG`-based proxy with empty defaults | `2b9fd717` |
 | `pnpm install --frozen-lockfile` fails on shallow clones | Removed `--frozen-lockfile` | `3b31709b` |
+| 3 service Dockerfiles had stale package.json COPY lists | Updated to all 57 packages + workspace members | `85aca553` |
+| Phase 5 publish counted 409 conflicts as failures | Distinguish real failures from expected conflicts | `c0bc13e1` |
+| `set -e` + `pipefail` aborted script on `docker compose up` partial failure | Added `|| true` | `a9414218` |
+| Phase 7 marked done even with partial build failures | Only mark done when all builds succeed | `a9414218` |
+| `docker compose config --format json` called 30x in loop | Cached once | `a9414218` |
+| `--phase=7` printed success even with failures | Now exits 1 with build log path | `a9414218` |
+| `last_completed_phase` didn't enforce sequential order | Stops at first gap | `a3f4c6fa` |
+| Phase 7 missing `.env.ecosystem` guard | Fail early with helpful message | `a3f4c6fa` |
+| `ollama pull \| tail` aborted entire setup on slow network | Made non-fatal | `b634708d` |

 ---

 ## Your Tasks (in priority order)

-### 1. Audit `setup.sh` for correctness and completeness
+> **Tasks 1-3 are ALREADY DONE.** See "Current State" above and "Bugs Already Fixed" above.
+> Focus on Tasks 4-7 which are the remaining work.

-Read the entire script and identify every potential failure point. Specifically check:
+### ~~1. Audit `setup.sh` for correctness~~ ✅ DONE

- **Phase 1 (System):**
-  - Docker CE install via official apt repo — verify the GPG key + sources.list format works on Ubuntu 24.04
-  - Node.js 22 via NodeSource — verify `setup_22.x` URL is current
-  - pnpm 10.6.5 via `npm install -g` — correct
-  - Ollama install via `https://ollama.com/install.sh` — verify it starts as systemd service, has fallback
-  - All commands must be non-interactive (`DEBIAN_FRONTEND=noninteractive`)
+The script has been audited and all identified bugs fixed (see table above). Phases 1-8 are tested. Key things already verified:
+- Docker CE install, Node.js 22 (NodeSource), pnpm 10.6.5, Ollama — all idempotent
+- Gitea token: `.sha1 // .token` fallback in place
+- Corporate proxy: removed at source in all repos, no runtime `sed` needed
+- `pnpm install` runs without `--frozen-lockfile`
+- Phase 5 publish: tolerates 409 conflicts
+- Phase 6 env: heredoc with Cosmos/Azurite emulator keys, semicolons handled
+- Phase 7: per-service build with fallback, BuildKit secrets via `GITEA_NPM_TOKEN` env export
+- Phase 8: health check covers all 30 services + Gitea + Ollama

- **Phase 2 (Gitea):**
-  - Gitea Docker container `gitea/gitea:1.22` on port 3300
-  - Admin user creation via `gitea admin user create` inside the container
-  - Organization creation via REST API (`POST /api/v1/orgs`)
-  - API token creation with `write:package` + `read:package` scopes
-  - Token extracted via `jq -r '.sha1'` — verify Gitea 1.22 returns `.sha1` (not `.token`)
+### ~~2. Fix every bug you find~~ ✅ DONE

- **Phase 3 (Clone):**
-  - Shallow clone (`--depth 1`) all 11 repos
-  - Corporate proxy stripping: `sed` removes `HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY` ENV lines and `jfrog-pkg-proxy` registry references from ALL Dockerfiles
-  - **CRITICAL:** Verify the glob patterns catch ALL Dockerfiles including special paths:
-    - `learning_multimodal_memory_agents/mindlyst-native/web/Dockerfile`
-    - `learning_voice_ai_agent/user-dashboard-web/Dockerfile`
-    - `learning_voice_ai_agent/backend/Dockerfile` (has backend-python too)
-  - Verify `sed -i` works on Alpine/Ubuntu (GNU sed, not BSD)
+All bugs fixed — see the 16-item table in "Bugs Already Fixed" above.

- **Phase 4 (Build):**
-  - `.npmrc` written to common-plat root with Gitea registry URL + token
-  - `pnpm install` (no `--frozen-lockfile` — shallow clones may have lockfile drift)
-  - `pnpm -r build` — builds ALL packages in dependency order
-  - Verify this works when run as root (pnpm may have permission issues)
+### ~~3. Add error recovery and logging~~ ✅ DONE

- **Phase 5 (Publish):**
-  - Iterates `packages/*/`, skips non-`@bytelyst/*`, skips `private: true`, skips packages without `dist/`
-  - `pnpm publish --registry <url> --no-git-checks`
-  - Must tolerate "already exists" 409 errors gracefully
+Already implemented:
+- **Phase completion markers:** `/opt/bytelyst/.setup-state/phaseN.done`
+- **Resume:** `--resume` (auto-detect), `--resume-from=N`, `--phase=N` (single), `--reset`, `--status`
+- **Logging:** `exec > >(tee -a setup.log) 2>&1`
+- **Per-service fallback:** Failed Docker builds are skipped, remaining services start
+- **Build logs:** Per-service to `/opt/bytelyst/.setup-state/builds/<service>.log`

- **Phase 6 (Env):**
-  - Generates `.env.ecosystem` with well-known Cosmos emulator key and Azurite key
-  - Verify the heredoc correctly expands `${COSMOS_EMULATOR_KEY}` and `${AZURITE_KEY}`
-  - Verify the `AZURE_BLOB_CONNECTION_STRING` semicolons don't break the env file
-  - JWT secret generated via `openssl rand -base64 32`
+### 4. Add a dry-run / validation mode (TODO)

- **Phase 7 (Deploy):**
-  - `detect_docker_host_ip()` returns docker0 bridge IP (usually `172.17.0.1`)
-  - `GITEA_NPM_HOST` set to this IP so Docker builds can reach Gitea on the host
-  - `docker compose up --build -d` with BuildKit secrets for `GITEA_NPM_TOKEN`
-  - Verify the `x-product-build` YAML anchor in `docker-compose.ecosystem.yml` correctly passes `GITEA_NPM_HOST` as build arg and `gitea_npm_token` as secret
-
- **Phase 8 (Verify):**
-  - Waits for `platform-service` health (120s timeout)
-  - Creates `/opt/bytelyst/check-health.sh` with all 27+ service URLs
-  - Sleeps 30s then runs health check
-
-### 2. Fix every bug you find
-
-Do not just report issues — fix them directly in `setup.sh`. Common pitfalls to watch for:
-
- **Gitea API token field:** Gitea 1.22+ may return the token in `.token` instead of `.sha1`. Add fallback: `jq -r '.sha1 // .token'`
- **pnpm as root:** May need `--unsafe-perm` or setting `pnpm config set unsafe-perm true`
- **Docker BuildKit secrets:** The `secrets.gitea_npm_token.environment` directive requires the env var to be set in the shell running `docker compose`. Verify `export GITEA_NPM_TOKEN` is in scope.
- **Cosmos emulator on Linux:** The `vnext-preview` image requires `PROTOCOL=http`. Verify `cosmos-emulator` healthcheck works (it checks port 8080 for `/ready`, not 8081).
- **Product Dockerfiles:** Each uses `--mount=type=secret,id=gitea_npm_token` during `pnpm install`. Verify the secret ID matches what's in the compose file.
- **MindLyst special path:** Its web Dockerfile is at `mindlyst-native/web/Dockerfile` (not `web/Dockerfile`). The compose file references `../learning_multimodal_memory_agents` with `dockerfile: mindlyst-native/web/Dockerfile`. Verify this context + dockerfile path is correct.
- **LysnrAI extra dashboards:** Has `user-dashboard-web/Dockerfile` in addition to `backend/Dockerfile`. Verify the compose references the correct paths.
-
-### 3. Add error recovery and logging
-
-The script uses `set -euo pipefail` which exits on any error. This is too aggressive for a 25-minute deployment. Add:
-
- **Per-phase error trapping:** Wrap each phase in a function that catches errors and prints a clear message about which phase failed and what to check
- **Log file:** Tee all output to `/opt/bytelyst/setup.log` so the user can review after SSH disconnection
- **Resume support:** Save phase completion markers to `/opt/bytelyst/.phase_complete_N`. On re-run, skip already-completed phases (unless the user passes `FORCE_RERUN=1`)
-
-### 4. Add a dry-run / validation mode
-
-Add `DRY_RUN=1` support that:
+Add `--dry-run` support that:
 - Checks all prerequisites (disk space, memory, network access to GitHub)
 - Validates Docker is installed and running
 - Validates Gitea is reachable
@ -157,11 +138,13 @@ Read `docker-compose.ecosystem.yml` (in the repo root) and verify:
 - The `x-product-build` anchor correctly provides `GITEA_NPM_HOST` and `gitea_npm_token` secret
 - All `depends_on` conditions reference services that actually exist
 - The `localmemgpt-backend` service has `extra_hosts: ['host.docker.internal:host-gateway']` for Ollama access
+- **30 total services:** 6 infra (pre-built images) + 24 built from Dockerfiles

 ### 6. Update `README.md`

 After all fixes, update `README.md` to reflect:
- Any new env vars you added (e.g., `DRY_RUN`, `FORCE_RERUN`)
+- CLI flags: `--resume`, `--resume-from=N`, `--phase=N`, `--reset`, `--status`, `--help`
+- Correct service count: 30 (not 27)
 - Updated duration estimates if phases changed
 - Any new troubleshooting entries
 - NSG port list: `22, 80, 1025, 1234, 3000-3003, 3030, 3035, 3040, 3045, 3050, 3055, 3060, 3070, 3100, 3300, 4003, 4005, 4007, 4010-4019, 8025, 8080, 8081, 10000, 11434`
@ -207,7 +190,7 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t

 - [ ] `setup.sh` runs flawlessly from `sudo ./setup.sh` on a raw Ubuntu 24.04 VM
 - [ ] All 8 phases complete without manual intervention
- [ ] `/opt/bytelyst/check-health.sh` shows ALL services green
+- [ ] `/opt/bytelyst/check-health.sh` shows ALL 30+ services green
 - [ ] All 10 product backends respond to `/health` with `{"status":"ok",...}`
 - [ ] All 9 product web apps serve their landing page
 - [ ] Admin dashboard (`http://<vm-ip>:3001`) loads
@ -218,7 +201,10 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t
 - [ ] Mailpit accessible at `http://<vm-ip>:8025`
 - [ ] `README.md` is accurate and complete
 - [ ] Script is idempotent (second run succeeds without errors)
+- [ ] Resume works: `sudo ./setup.sh --resume` after interrupted run
+- [ ] Single-phase retry works: `sudo ./setup.sh --phase=7` after build failure
 - [ ] Setup log saved to `/opt/bytelyst/setup.log`
+- [ ] Build logs saved per-service to `/opt/bytelyst/.setup-state/builds/`

 ---

@ -228,7 +214,7 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t
 Raw Ubuntu 24.04 VM
 ├── Ollama (systemd, :11434) ─── local LLM inference
 ├── Gitea (Docker, :3300) ────── npm package registry
-└── Docker Compose Ecosystem (27 services)
+└── Docker Compose Ecosystem (30 services)
    ├── Infrastructure
    │   ├── cosmos-emulator (:8081, :1234)
    │   ├── azurite (:10000)
@ -273,10 +259,22 @@ Product Dockerfiles use BuildKit secret mount for the npm token:
 RUN --mount=type=secret,id=gitea_npm_token \
    cp .npmrc.docker .npmrc && \
    GITEA_NPM_TOKEN=$(cat /run/secrets/gitea_npm_token) \
-    pnpm install --frozen-lockfile
+    pnpm install
 ```

 The `.npmrc.docker` in each product repo uses `${GITEA_NPM_HOST}:3300` as the registry host.
 During `docker compose build`, the host's `GITEA_NPM_TOKEN` env var is passed as a BuildKit secret,
 and `GITEA_NPM_HOST` is passed as a build arg (defaults to `host.docker.internal`, overridden to
 `172.17.0.1` on Linux VMs by the setup script).
+
+## CLI Reference
+
+```bash
+sudo ./setup.sh                    # Fresh install (all 8 phases)
+sudo ./setup.sh --phase=7          # Retry just the deploy phase
+sudo ./setup.sh --resume           # Auto-resume after SSH disconnect
+sudo ./setup.sh --resume-from=7    # Jump to deploy after manual fix
+sudo ./setup.sh --status           # Check what's done
+sudo ./setup.sh --reset            # Start completely over
+sudo ./setup.sh --help             # Show usage
+```
--- a/docs/devops/single_azure_vm/setup.sh
+++ b/docs/devops/single_azure_vm/setup.sh
@ -12,7 +12,7 @@
 #   - Ollama (local LLM inference for LocalMemGPT on :11434)
 #   - All 11 ByteLyst repos (cloned from GitHub)
 #   - All @bytelyst/* packages (built + published to Gitea)
-#   - Full 27-service ecosystem (via docker-compose.ecosystem.yml)
+#   - Full 30-service ecosystem (via docker-compose.ecosystem.yml)
 #
 # Usage:  sudo ./setup.sh [OPTIONS]
 #
@ -859,7 +859,7 @@ main() {
  echo ""
  echo "╔═══════════════════════════════════════════════════════════════╗"
  echo "║       ByteLyst Single-VM Deployment (raw Ubuntu)              ║"
-  echo "║       27 services · 10 products · Ollama · Gitea · 1 VM      ║"
+  echo "║       30 services · 10 products · Ollama · Gitea · 1 VM      ║"
  echo "╚═══════════════════════════════════════════════════════════════╝"
  echo ""
  log "Log file: ${INSTALL_DIR}/setup.log"