From 35021b67b9e54b919f82d4525b720faa8b267a56 Mon Sep 17 00:00:00 2001 From: saravanakumardb1 Date: Tue, 24 Mar 2026 12:35:59 -0700 Subject: [PATCH] =?UTF-8?q?docs(infra):=20fix=20stale=20service=20count=20?= =?UTF-8?q?(27=E2=86=9230),=20update=20prompt.md=20+=20README.md=20for=20C?= =?UTF-8?q?odex=20agent=20readiness?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - prompt.md: mark tasks 1-3 as DONE, add 'Current State' section listing all implemented features, update bugs-fixed table (16 items), fix service count in architecture diagram, add CLI reference, remove stale --frozen-lockfile - README.md: add Resume & Retry section with examples, add CLI Flags table, fix service count in title/phases, update build failure troubleshooting with build log paths and retry command - setup.sh: fix '27 services' → '30 services' in header comment and banner --- docs/devops/single_azure_vm/README.md | 34 +++++- docs/devops/single_azure_vm/prompt.md | 156 +++++++++++++------------- docs/devops/single_azure_vm/setup.sh | 4 +- 3 files changed, 108 insertions(+), 86 deletions(-) diff --git a/docs/devops/single_azure_vm/README.md b/docs/devops/single_azure_vm/README.md index 7a4294ce..5371bdf7 100644 --- a/docs/devops/single_azure_vm/README.md +++ b/docs/devops/single_azure_vm/README.md @@ -1,6 +1,6 @@ # ByteLyst Single-VM Deployment -> Deploy the **entire ByteLyst ecosystem** on a single **raw** Azure VM. +> Deploy the **entire ByteLyst ecosystem** (30 services, 10 products) on a single **raw** Azure VM. > Nothing pre-installed required — the script handles everything from a blank Ubuntu machine. > Two files: this README and `setup.sh`. Copy both to the VM and run the script. @@ -30,7 +30,20 @@ sudo ./setup.sh # 4. Wait ~15-25 minutes for full build + deploy # 5. Verify -docker compose -f /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml ps +/opt/bytelyst/check-health.sh +``` + +### Resume & Retry + +Phase completion is tracked. If anything fails, you don't have to start over: + +```bash +sudo ./setup.sh --phase=7 # Retry just the deploy phase +sudo ./setup.sh --resume # Auto-resume after SSH disconnect +sudo ./setup.sh --resume-from=7 # Jump to deploy after manual fix +sudo ./setup.sh --status # Check what's done +sudo ./setup.sh --reset # Start completely over +sudo ./setup.sh --help # Show full usage ``` ## What the Script Installs & Does @@ -56,8 +69,8 @@ docker compose -f /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem | 4. Build | ~5 min | `pnpm install && pnpm -r build` all `@bytelyst/*` packages | | 5. Publish | ~3 min | Publish all packages to local Gitea npm registry | | 6. Env | instant | Generate `.env.ecosystem` with Cosmos emulator key, Azurite key, JWT secret | -| 7. Deploy | ~10 min | `docker compose up --build` for 27 services | -| 8. Verify | ~1 min | Health-check all services + create `/opt/bytelyst/check-health.sh` | +| 7. Deploy | ~10 min | Per-service Docker build + deploy (30 services, with fallback) | +| 8. Verify | ~1 min | Health-check all 30+ endpoints + create `/opt/bytelyst/check-health.sh` | ## Port Map (after deployment) @@ -147,11 +160,22 @@ All optional — defaults work for most setups: | `SKIP_CLONE` | `0` | Set `1` to skip cloning (re-runs) | | `SKIP_BUILD` | `0` | Set `1` to skip package build+publish (re-runs) | +## CLI Flags + +| Flag | Description | +|------|-------------| +| `--resume` | Auto-resume from last completed phase | +| `--resume-from=N` | Resume from phase N (1-8) | +| `--phase=N` | Run ONLY phase N (useful for retrying) | +| `--reset` | Clear phase markers and start fresh | +| `--status` | Show completed phases and exit | +| `-h`, `--help` | Show usage help | + ## Troubleshooting - **Cosmos emulator slow:** It needs 20-30s on first boot. Services wait via health checks. - **Out of memory:** Use at least 32 GB RAM. Cosmos emulator needs ~4 GB, Ollama needs ~4 GB for 3B models. -- **Build failures:** Check Gitea is running (`docker ps | grep gitea`) and packages published (`curl http://localhost:3300/api/packages/bytelyst/npm/`). +- **Build failures:** Check Gitea is running (`docker ps | grep gitea`) and packages published (`curl http://localhost:3300/api/packages/bytelyst/npm/`). Per-service build logs: `/opt/bytelyst/.setup-state/builds/.log`. Retry: `sudo ./setup.sh --phase=7`. - **Ollama not responding:** Check `systemctl status ollama` or `curl http://localhost:11434/api/version`. - **Port conflicts:** Ensure nothing else runs on the listed ports before deploying. - **Corporate proxy in Dockerfiles:** The script auto-strips hardcoded proxy ENVs from cloned Dockerfiles. diff --git a/docs/devops/single_azure_vm/prompt.md b/docs/devops/single_azure_vm/prompt.md index bf8ad283..1a3ddaa3 100644 --- a/docs/devops/single_azure_vm/prompt.md +++ b/docs/devops/single_azure_vm/prompt.md @@ -1,23 +1,42 @@ # Codex Agent Prompt: ByteLyst Single-VM E2E Deployment -> **Goal:** Review, harden, test, and complete `setup.sh` so it works flawlessly on a raw Ubuntu 24.04 Azure VM — zero manual intervention, 100% completion, all 27 services healthy. +> **Goal:** Review, harden, test, and complete `setup.sh` so it works flawlessly on a raw Ubuntu 24.04 Azure VM — zero manual intervention, 100% completion, all 30 services healthy. +> +> **IMPORTANT:** Read the "Current State" section below FIRST. Many tasks in this prompt are already completed. Do NOT re-implement them. --- ## Context -This folder contains two files you must work with: +This folder contains three files you must work with: -- **`setup.sh`** — 8-phase bash script that bootstraps the entire ByteLyst ecosystem on a blank Ubuntu VM +- **`setup.sh`** — 8-phase bash script (~940 lines) that bootstraps the entire ByteLyst ecosystem on a blank Ubuntu VM - **`README.md`** — Deployment guide documenting what the script does, ports, troubleshooting +- **`prompt.md`** — This file (agent instructions) -The script installs everything from scratch (Docker, Node.js, pnpm, Gitea, Ollama) then clones 11 repos, builds + publishes ~49 `@bytelyst/*` npm packages to a local Gitea registry, generates environment config, and deploys 27 Docker Compose services. +The script installs everything from scratch (Docker, Node.js, pnpm, Gitea, Ollama) then clones 11 repos, builds + publishes ~49 `@bytelyst/*` npm packages to a local Gitea registry, generates environment config, and deploys 30 Docker Compose services (6 infra + 3 platform + 2 dashboards + 10 backends + 9 webs). + +### Current State (ALREADY IMPLEMENTED — do NOT redo) + +The following features are already built and tested in `setup.sh`: + +- **Resume/retry support:** `--resume`, `--resume-from=N`, `--phase=N`, `--reset`, `--status`, `--help` CLI flags +- **Phase completion markers:** Stored in `/opt/bytelyst/.setup-state/phaseN.done` +- **GITEA_NPM_TOKEN auto-restore:** Token saved to `/opt/bytelyst/.gitea_token`, restored on resume +- **Per-service Docker build:** Phase 7 builds each of 30 services individually with `[N/30]` progress +- **Per-service fallback:** Failed builds are skipped, remaining services still start +- **Build logs:** Saved per-service to `/opt/bytelyst/.setup-state/builds/.log` +- **Phase 7 partial failure handling:** Phase 7 NOT marked done if builds fail, so `--resume` retries it +- **set -euo pipefail safety:** All pipelines in fallback paths use `|| true` to prevent premature abort +- **Ollama model pull non-fatal:** Model download failure doesn't abort the entire setup +- **SSH disconnect protection:** All output tee'd to `/opt/bytelyst/setup.log` +- **Idempotent:** Every phase handles re-runs gracefully ### Key files outside this folder that the script depends on | File | Repo | Purpose | |------|------|---------| -| `docker-compose.ecosystem.yml` | `learning_ai_common_plat` (root) | Defines all 27 services | +| `docker-compose.ecosystem.yml` | `learning_ai_common_plat` (root) | Defines all 30 services | | `.env.ecosystem.example` | `learning_ai_common_plat` (root) | Template for env vars | | `packages/*/package.json` | `learning_ai_common_plat` | ~49 `@bytelyst/*` packages to publish | | `backend/Dockerfile` | Each of the 10 product repos | Product backend Docker builds | @@ -58,89 +77,51 @@ The following issues have already been identified and fixed in the current `setu | `localmemgpt-backend` can't reach Ollama on Linux | `extra_hosts: ['host.docker.internal:host-gateway']` in compose | `3b31709b` | | Dashboard Dockerfiles had hardcoded corporate proxy | Converted to `ARG`-based proxy with empty defaults | `2b9fd717` | | `pnpm install --frozen-lockfile` fails on shallow clones | Removed `--frozen-lockfile` | `3b31709b` | +| 3 service Dockerfiles had stale package.json COPY lists | Updated to all 57 packages + workspace members | `85aca553` | +| Phase 5 publish counted 409 conflicts as failures | Distinguish real failures from expected conflicts | `c0bc13e1` | +| `set -e` + `pipefail` aborted script on `docker compose up` partial failure | Added `|| true` | `a9414218` | +| Phase 7 marked done even with partial build failures | Only mark done when all builds succeed | `a9414218` | +| `docker compose config --format json` called 30x in loop | Cached once | `a9414218` | +| `--phase=7` printed success even with failures | Now exits 1 with build log path | `a9414218` | +| `last_completed_phase` didn't enforce sequential order | Stops at first gap | `a3f4c6fa` | +| Phase 7 missing `.env.ecosystem` guard | Fail early with helpful message | `a3f4c6fa` | +| `ollama pull \| tail` aborted entire setup on slow network | Made non-fatal | `b634708d` | --- ## Your Tasks (in priority order) -### 1. Audit `setup.sh` for correctness and completeness +> **Tasks 1-3 are ALREADY DONE.** See "Current State" above and "Bugs Already Fixed" above. +> Focus on Tasks 4-7 which are the remaining work. -Read the entire script and identify every potential failure point. Specifically check: +### ~~1. Audit `setup.sh` for correctness~~ ✅ DONE -- **Phase 1 (System):** - - Docker CE install via official apt repo — verify the GPG key + sources.list format works on Ubuntu 24.04 - - Node.js 22 via NodeSource — verify `setup_22.x` URL is current - - pnpm 10.6.5 via `npm install -g` — correct - - Ollama install via `https://ollama.com/install.sh` — verify it starts as systemd service, has fallback - - All commands must be non-interactive (`DEBIAN_FRONTEND=noninteractive`) +The script has been audited and all identified bugs fixed (see table above). Phases 1-8 are tested. Key things already verified: +- Docker CE install, Node.js 22 (NodeSource), pnpm 10.6.5, Ollama — all idempotent +- Gitea token: `.sha1 // .token` fallback in place +- Corporate proxy: removed at source in all repos, no runtime `sed` needed +- `pnpm install` runs without `--frozen-lockfile` +- Phase 5 publish: tolerates 409 conflicts +- Phase 6 env: heredoc with Cosmos/Azurite emulator keys, semicolons handled +- Phase 7: per-service build with fallback, BuildKit secrets via `GITEA_NPM_TOKEN` env export +- Phase 8: health check covers all 30 services + Gitea + Ollama -- **Phase 2 (Gitea):** - - Gitea Docker container `gitea/gitea:1.22` on port 3300 - - Admin user creation via `gitea admin user create` inside the container - - Organization creation via REST API (`POST /api/v1/orgs`) - - API token creation with `write:package` + `read:package` scopes - - Token extracted via `jq -r '.sha1'` — verify Gitea 1.22 returns `.sha1` (not `.token`) +### ~~2. Fix every bug you find~~ ✅ DONE -- **Phase 3 (Clone):** - - Shallow clone (`--depth 1`) all 11 repos - - Corporate proxy stripping: `sed` removes `HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY` ENV lines and `jfrog-pkg-proxy` registry references from ALL Dockerfiles - - **CRITICAL:** Verify the glob patterns catch ALL Dockerfiles including special paths: - - `learning_multimodal_memory_agents/mindlyst-native/web/Dockerfile` - - `learning_voice_ai_agent/user-dashboard-web/Dockerfile` - - `learning_voice_ai_agent/backend/Dockerfile` (has backend-python too) - - Verify `sed -i` works on Alpine/Ubuntu (GNU sed, not BSD) +All bugs fixed — see the 16-item table in "Bugs Already Fixed" above. -- **Phase 4 (Build):** - - `.npmrc` written to common-plat root with Gitea registry URL + token - - `pnpm install` (no `--frozen-lockfile` — shallow clones may have lockfile drift) - - `pnpm -r build` — builds ALL packages in dependency order - - Verify this works when run as root (pnpm may have permission issues) +### ~~3. Add error recovery and logging~~ ✅ DONE -- **Phase 5 (Publish):** - - Iterates `packages/*/`, skips non-`@bytelyst/*`, skips `private: true`, skips packages without `dist/` - - `pnpm publish --registry --no-git-checks` - - Must tolerate "already exists" 409 errors gracefully +Already implemented: +- **Phase completion markers:** `/opt/bytelyst/.setup-state/phaseN.done` +- **Resume:** `--resume` (auto-detect), `--resume-from=N`, `--phase=N` (single), `--reset`, `--status` +- **Logging:** `exec > >(tee -a setup.log) 2>&1` +- **Per-service fallback:** Failed Docker builds are skipped, remaining services start +- **Build logs:** Per-service to `/opt/bytelyst/.setup-state/builds/.log` -- **Phase 6 (Env):** - - Generates `.env.ecosystem` with well-known Cosmos emulator key and Azurite key - - Verify the heredoc correctly expands `${COSMOS_EMULATOR_KEY}` and `${AZURITE_KEY}` - - Verify the `AZURE_BLOB_CONNECTION_STRING` semicolons don't break the env file - - JWT secret generated via `openssl rand -base64 32` +### 4. Add a dry-run / validation mode (TODO) -- **Phase 7 (Deploy):** - - `detect_docker_host_ip()` returns docker0 bridge IP (usually `172.17.0.1`) - - `GITEA_NPM_HOST` set to this IP so Docker builds can reach Gitea on the host - - `docker compose up --build -d` with BuildKit secrets for `GITEA_NPM_TOKEN` - - Verify the `x-product-build` YAML anchor in `docker-compose.ecosystem.yml` correctly passes `GITEA_NPM_HOST` as build arg and `gitea_npm_token` as secret - -- **Phase 8 (Verify):** - - Waits for `platform-service` health (120s timeout) - - Creates `/opt/bytelyst/check-health.sh` with all 27+ service URLs - - Sleeps 30s then runs health check - -### 2. Fix every bug you find - -Do not just report issues — fix them directly in `setup.sh`. Common pitfalls to watch for: - -- **Gitea API token field:** Gitea 1.22+ may return the token in `.token` instead of `.sha1`. Add fallback: `jq -r '.sha1 // .token'` -- **pnpm as root:** May need `--unsafe-perm` or setting `pnpm config set unsafe-perm true` -- **Docker BuildKit secrets:** The `secrets.gitea_npm_token.environment` directive requires the env var to be set in the shell running `docker compose`. Verify `export GITEA_NPM_TOKEN` is in scope. -- **Cosmos emulator on Linux:** The `vnext-preview` image requires `PROTOCOL=http`. Verify `cosmos-emulator` healthcheck works (it checks port 8080 for `/ready`, not 8081). -- **Product Dockerfiles:** Each uses `--mount=type=secret,id=gitea_npm_token` during `pnpm install`. Verify the secret ID matches what's in the compose file. -- **MindLyst special path:** Its web Dockerfile is at `mindlyst-native/web/Dockerfile` (not `web/Dockerfile`). The compose file references `../learning_multimodal_memory_agents` with `dockerfile: mindlyst-native/web/Dockerfile`. Verify this context + dockerfile path is correct. -- **LysnrAI extra dashboards:** Has `user-dashboard-web/Dockerfile` in addition to `backend/Dockerfile`. Verify the compose references the correct paths. - -### 3. Add error recovery and logging - -The script uses `set -euo pipefail` which exits on any error. This is too aggressive for a 25-minute deployment. Add: - -- **Per-phase error trapping:** Wrap each phase in a function that catches errors and prints a clear message about which phase failed and what to check -- **Log file:** Tee all output to `/opt/bytelyst/setup.log` so the user can review after SSH disconnection -- **Resume support:** Save phase completion markers to `/opt/bytelyst/.phase_complete_N`. On re-run, skip already-completed phases (unless the user passes `FORCE_RERUN=1`) - -### 4. Add a dry-run / validation mode - -Add `DRY_RUN=1` support that: +Add `--dry-run` support that: - Checks all prerequisites (disk space, memory, network access to GitHub) - Validates Docker is installed and running - Validates Gitea is reachable @@ -157,11 +138,13 @@ Read `docker-compose.ecosystem.yml` (in the repo root) and verify: - The `x-product-build` anchor correctly provides `GITEA_NPM_HOST` and `gitea_npm_token` secret - All `depends_on` conditions reference services that actually exist - The `localmemgpt-backend` service has `extra_hosts: ['host.docker.internal:host-gateway']` for Ollama access +- **30 total services:** 6 infra (pre-built images) + 24 built from Dockerfiles ### 6. Update `README.md` After all fixes, update `README.md` to reflect: -- Any new env vars you added (e.g., `DRY_RUN`, `FORCE_RERUN`) +- CLI flags: `--resume`, `--resume-from=N`, `--phase=N`, `--reset`, `--status`, `--help` +- Correct service count: 30 (not 27) - Updated duration estimates if phases changed - Any new troubleshooting entries - NSG port list: `22, 80, 1025, 1234, 3000-3003, 3030, 3035, 3040, 3045, 3050, 3055, 3060, 3070, 3100, 3300, 4003, 4005, 4007, 4010-4019, 8025, 8080, 8081, 10000, 11434` @@ -207,7 +190,7 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t - [ ] `setup.sh` runs flawlessly from `sudo ./setup.sh` on a raw Ubuntu 24.04 VM - [ ] All 8 phases complete without manual intervention -- [ ] `/opt/bytelyst/check-health.sh` shows ALL services green +- [ ] `/opt/bytelyst/check-health.sh` shows ALL 30+ services green - [ ] All 10 product backends respond to `/health` with `{"status":"ok",...}` - [ ] All 9 product web apps serve their landing page - [ ] Admin dashboard (`http://:3001`) loads @@ -218,7 +201,10 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t - [ ] Mailpit accessible at `http://:8025` - [ ] `README.md` is accurate and complete - [ ] Script is idempotent (second run succeeds without errors) +- [ ] Resume works: `sudo ./setup.sh --resume` after interrupted run +- [ ] Single-phase retry works: `sudo ./setup.sh --phase=7` after build failure - [ ] Setup log saved to `/opt/bytelyst/setup.log` +- [ ] Build logs saved per-service to `/opt/bytelyst/.setup-state/builds/` --- @@ -228,7 +214,7 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t Raw Ubuntu 24.04 VM ├── Ollama (systemd, :11434) ─── local LLM inference ├── Gitea (Docker, :3300) ────── npm package registry -└── Docker Compose Ecosystem (27 services) +└── Docker Compose Ecosystem (30 services) ├── Infrastructure │ ├── cosmos-emulator (:8081, :1234) │ ├── azurite (:10000) @@ -273,10 +259,22 @@ Product Dockerfiles use BuildKit secret mount for the npm token: RUN --mount=type=secret,id=gitea_npm_token \ cp .npmrc.docker .npmrc && \ GITEA_NPM_TOKEN=$(cat /run/secrets/gitea_npm_token) \ - pnpm install --frozen-lockfile + pnpm install ``` The `.npmrc.docker` in each product repo uses `${GITEA_NPM_HOST}:3300` as the registry host. During `docker compose build`, the host's `GITEA_NPM_TOKEN` env var is passed as a BuildKit secret, and `GITEA_NPM_HOST` is passed as a build arg (defaults to `host.docker.internal`, overridden to `172.17.0.1` on Linux VMs by the setup script). + +## CLI Reference + +```bash +sudo ./setup.sh # Fresh install (all 8 phases) +sudo ./setup.sh --phase=7 # Retry just the deploy phase +sudo ./setup.sh --resume # Auto-resume after SSH disconnect +sudo ./setup.sh --resume-from=7 # Jump to deploy after manual fix +sudo ./setup.sh --status # Check what's done +sudo ./setup.sh --reset # Start completely over +sudo ./setup.sh --help # Show usage +``` diff --git a/docs/devops/single_azure_vm/setup.sh b/docs/devops/single_azure_vm/setup.sh index 3476aec4..7f0948ba 100755 --- a/docs/devops/single_azure_vm/setup.sh +++ b/docs/devops/single_azure_vm/setup.sh @@ -12,7 +12,7 @@ # - Ollama (local LLM inference for LocalMemGPT on :11434) # - All 11 ByteLyst repos (cloned from GitHub) # - All @bytelyst/* packages (built + published to Gitea) -# - Full 27-service ecosystem (via docker-compose.ecosystem.yml) +# - Full 30-service ecosystem (via docker-compose.ecosystem.yml) # # Usage: sudo ./setup.sh [OPTIONS] # @@ -859,7 +859,7 @@ main() { echo "" echo "╔═══════════════════════════════════════════════════════════════╗" echo "║ ByteLyst Single-VM Deployment (raw Ubuntu) ║" - echo "║ 27 services · 10 products · Ollama · Gitea · 1 VM ║" + echo "║ 30 services · 10 products · Ollama · Gitea · 1 VM ║" echo "╚═══════════════════════════════════════════════════════════════╝" echo "" log "Log file: ${INSTALL_DIR}/setup.log"