docs(infra): fix stale service count (27→30), update prompt.md + README.md for Codex agent readiness

- prompt.md: mark tasks 1-3 as DONE, add 'Current State' section listing
  all implemented features, update bugs-fixed table (16 items), fix service
  count in architecture diagram, add CLI reference, remove stale --frozen-lockfile
- README.md: add Resume & Retry section with examples, add CLI Flags table,
  fix service count in title/phases, update build failure troubleshooting
  with build log paths and retry command
- setup.sh: fix '27 services' → '30 services' in header comment and banner
This commit is contained in:
saravanakumardb1 2026-03-24 12:35:59 -07:00
parent acbab75aaa
commit 35021b67b9
3 changed files with 108 additions and 86 deletions

View File

@ -1,6 +1,6 @@
# ByteLyst Single-VM Deployment
> Deploy the **entire ByteLyst ecosystem** on a single **raw** Azure VM.
> Deploy the **entire ByteLyst ecosystem** (30 services, 10 products) on a single **raw** Azure VM.
> Nothing pre-installed required — the script handles everything from a blank Ubuntu machine.
> Two files: this README and `setup.sh`. Copy both to the VM and run the script.
@ -30,7 +30,20 @@ sudo ./setup.sh
# 4. Wait ~15-25 minutes for full build + deploy
# 5. Verify
docker compose -f /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml ps
/opt/bytelyst/check-health.sh
```
### Resume & Retry
Phase completion is tracked. If anything fails, you don't have to start over:
```bash
sudo ./setup.sh --phase=7 # Retry just the deploy phase
sudo ./setup.sh --resume # Auto-resume after SSH disconnect
sudo ./setup.sh --resume-from=7 # Jump to deploy after manual fix
sudo ./setup.sh --status # Check what's done
sudo ./setup.sh --reset # Start completely over
sudo ./setup.sh --help # Show full usage
```
## What the Script Installs & Does
@ -56,8 +69,8 @@ docker compose -f /opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem
| 4. Build | ~5 min | `pnpm install && pnpm -r build` all `@bytelyst/*` packages |
| 5. Publish | ~3 min | Publish all packages to local Gitea npm registry |
| 6. Env | instant | Generate `.env.ecosystem` with Cosmos emulator key, Azurite key, JWT secret |
| 7. Deploy | ~10 min | `docker compose up --build` for 27 services |
| 8. Verify | ~1 min | Health-check all services + create `/opt/bytelyst/check-health.sh` |
| 7. Deploy | ~10 min | Per-service Docker build + deploy (30 services, with fallback) |
| 8. Verify | ~1 min | Health-check all 30+ endpoints + create `/opt/bytelyst/check-health.sh` |
## Port Map (after deployment)
@ -147,11 +160,22 @@ All optional — defaults work for most setups:
| `SKIP_CLONE` | `0` | Set `1` to skip cloning (re-runs) |
| `SKIP_BUILD` | `0` | Set `1` to skip package build+publish (re-runs) |
## CLI Flags
| Flag | Description |
|------|-------------|
| `--resume` | Auto-resume from last completed phase |
| `--resume-from=N` | Resume from phase N (1-8) |
| `--phase=N` | Run ONLY phase N (useful for retrying) |
| `--reset` | Clear phase markers and start fresh |
| `--status` | Show completed phases and exit |
| `-h`, `--help` | Show usage help |
## Troubleshooting
- **Cosmos emulator slow:** It needs 20-30s on first boot. Services wait via health checks.
- **Out of memory:** Use at least 32 GB RAM. Cosmos emulator needs ~4 GB, Ollama needs ~4 GB for 3B models.
- **Build failures:** Check Gitea is running (`docker ps | grep gitea`) and packages published (`curl http://localhost:3300/api/packages/bytelyst/npm/`).
- **Build failures:** Check Gitea is running (`docker ps | grep gitea`) and packages published (`curl http://localhost:3300/api/packages/bytelyst/npm/`). Per-service build logs: `/opt/bytelyst/.setup-state/builds/<service>.log`. Retry: `sudo ./setup.sh --phase=7`.
- **Ollama not responding:** Check `systemctl status ollama` or `curl http://localhost:11434/api/version`.
- **Port conflicts:** Ensure nothing else runs on the listed ports before deploying.
- **Corporate proxy in Dockerfiles:** The script auto-strips hardcoded proxy ENVs from cloned Dockerfiles.

View File

@ -1,23 +1,42 @@
# Codex Agent Prompt: ByteLyst Single-VM E2E Deployment
> **Goal:** Review, harden, test, and complete `setup.sh` so it works flawlessly on a raw Ubuntu 24.04 Azure VM — zero manual intervention, 100% completion, all 27 services healthy.
> **Goal:** Review, harden, test, and complete `setup.sh` so it works flawlessly on a raw Ubuntu 24.04 Azure VM — zero manual intervention, 100% completion, all 30 services healthy.
>
> **IMPORTANT:** Read the "Current State" section below FIRST. Many tasks in this prompt are already completed. Do NOT re-implement them.
---
## Context
This folder contains two files you must work with:
This folder contains three files you must work with:
- **`setup.sh`** — 8-phase bash script that bootstraps the entire ByteLyst ecosystem on a blank Ubuntu VM
- **`setup.sh`** — 8-phase bash script (~940 lines) that bootstraps the entire ByteLyst ecosystem on a blank Ubuntu VM
- **`README.md`** — Deployment guide documenting what the script does, ports, troubleshooting
- **`prompt.md`** — This file (agent instructions)
The script installs everything from scratch (Docker, Node.js, pnpm, Gitea, Ollama) then clones 11 repos, builds + publishes ~49 `@bytelyst/*` npm packages to a local Gitea registry, generates environment config, and deploys 27 Docker Compose services.
The script installs everything from scratch (Docker, Node.js, pnpm, Gitea, Ollama) then clones 11 repos, builds + publishes ~49 `@bytelyst/*` npm packages to a local Gitea registry, generates environment config, and deploys 30 Docker Compose services (6 infra + 3 platform + 2 dashboards + 10 backends + 9 webs).
### Current State (ALREADY IMPLEMENTED — do NOT redo)
The following features are already built and tested in `setup.sh`:
- **Resume/retry support:** `--resume`, `--resume-from=N`, `--phase=N`, `--reset`, `--status`, `--help` CLI flags
- **Phase completion markers:** Stored in `/opt/bytelyst/.setup-state/phaseN.done`
- **GITEA_NPM_TOKEN auto-restore:** Token saved to `/opt/bytelyst/.gitea_token`, restored on resume
- **Per-service Docker build:** Phase 7 builds each of 30 services individually with `[N/30]` progress
- **Per-service fallback:** Failed builds are skipped, remaining services still start
- **Build logs:** Saved per-service to `/opt/bytelyst/.setup-state/builds/<service>.log`
- **Phase 7 partial failure handling:** Phase 7 NOT marked done if builds fail, so `--resume` retries it
- **set -euo pipefail safety:** All pipelines in fallback paths use `|| true` to prevent premature abort
- **Ollama model pull non-fatal:** Model download failure doesn't abort the entire setup
- **SSH disconnect protection:** All output tee'd to `/opt/bytelyst/setup.log`
- **Idempotent:** Every phase handles re-runs gracefully
### Key files outside this folder that the script depends on
| File | Repo | Purpose |
|------|------|---------|
| `docker-compose.ecosystem.yml` | `learning_ai_common_plat` (root) | Defines all 27 services |
| `docker-compose.ecosystem.yml` | `learning_ai_common_plat` (root) | Defines all 30 services |
| `.env.ecosystem.example` | `learning_ai_common_plat` (root) | Template for env vars |
| `packages/*/package.json` | `learning_ai_common_plat` | ~49 `@bytelyst/*` packages to publish |
| `backend/Dockerfile` | Each of the 10 product repos | Product backend Docker builds |
@ -58,89 +77,51 @@ The following issues have already been identified and fixed in the current `setu
| `localmemgpt-backend` can't reach Ollama on Linux | `extra_hosts: ['host.docker.internal:host-gateway']` in compose | `3b31709b` |
| Dashboard Dockerfiles had hardcoded corporate proxy | Converted to `ARG`-based proxy with empty defaults | `2b9fd717` |
| `pnpm install --frozen-lockfile` fails on shallow clones | Removed `--frozen-lockfile` | `3b31709b` |
| 3 service Dockerfiles had stale package.json COPY lists | Updated to all 57 packages + workspace members | `85aca553` |
| Phase 5 publish counted 409 conflicts as failures | Distinguish real failures from expected conflicts | `c0bc13e1` |
| `set -e` + `pipefail` aborted script on `docker compose up` partial failure | Added `|| true` | `a9414218` |
| Phase 7 marked done even with partial build failures | Only mark done when all builds succeed | `a9414218` |
| `docker compose config --format json` called 30x in loop | Cached once | `a9414218` |
| `--phase=7` printed success even with failures | Now exits 1 with build log path | `a9414218` |
| `last_completed_phase` didn't enforce sequential order | Stops at first gap | `a3f4c6fa` |
| Phase 7 missing `.env.ecosystem` guard | Fail early with helpful message | `a3f4c6fa` |
| `ollama pull \| tail` aborted entire setup on slow network | Made non-fatal | `b634708d` |
---
## Your Tasks (in priority order)
### 1. Audit `setup.sh` for correctness and completeness
> **Tasks 1-3 are ALREADY DONE.** See "Current State" above and "Bugs Already Fixed" above.
> Focus on Tasks 4-7 which are the remaining work.
Read the entire script and identify every potential failure point. Specifically check:
### ~~1. Audit `setup.sh` for correctness~~ ✅ DONE
- **Phase 1 (System):**
- Docker CE install via official apt repo — verify the GPG key + sources.list format works on Ubuntu 24.04
- Node.js 22 via NodeSource — verify `setup_22.x` URL is current
- pnpm 10.6.5 via `npm install -g` — correct
- Ollama install via `https://ollama.com/install.sh` — verify it starts as systemd service, has fallback
- All commands must be non-interactive (`DEBIAN_FRONTEND=noninteractive`)
The script has been audited and all identified bugs fixed (see table above). Phases 1-8 are tested. Key things already verified:
- Docker CE install, Node.js 22 (NodeSource), pnpm 10.6.5, Ollama — all idempotent
- Gitea token: `.sha1 // .token` fallback in place
- Corporate proxy: removed at source in all repos, no runtime `sed` needed
- `pnpm install` runs without `--frozen-lockfile`
- Phase 5 publish: tolerates 409 conflicts
- Phase 6 env: heredoc with Cosmos/Azurite emulator keys, semicolons handled
- Phase 7: per-service build with fallback, BuildKit secrets via `GITEA_NPM_TOKEN` env export
- Phase 8: health check covers all 30 services + Gitea + Ollama
- **Phase 2 (Gitea):**
- Gitea Docker container `gitea/gitea:1.22` on port 3300
- Admin user creation via `gitea admin user create` inside the container
- Organization creation via REST API (`POST /api/v1/orgs`)
- API token creation with `write:package` + `read:package` scopes
- Token extracted via `jq -r '.sha1'` — verify Gitea 1.22 returns `.sha1` (not `.token`)
### ~~2. Fix every bug you find~~ ✅ DONE
- **Phase 3 (Clone):**
- Shallow clone (`--depth 1`) all 11 repos
- Corporate proxy stripping: `sed` removes `HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY` ENV lines and `jfrog-pkg-proxy` registry references from ALL Dockerfiles
- **CRITICAL:** Verify the glob patterns catch ALL Dockerfiles including special paths:
- `learning_multimodal_memory_agents/mindlyst-native/web/Dockerfile`
- `learning_voice_ai_agent/user-dashboard-web/Dockerfile`
- `learning_voice_ai_agent/backend/Dockerfile` (has backend-python too)
- Verify `sed -i` works on Alpine/Ubuntu (GNU sed, not BSD)
All bugs fixed — see the 16-item table in "Bugs Already Fixed" above.
- **Phase 4 (Build):**
- `.npmrc` written to common-plat root with Gitea registry URL + token
- `pnpm install` (no `--frozen-lockfile` — shallow clones may have lockfile drift)
- `pnpm -r build` — builds ALL packages in dependency order
- Verify this works when run as root (pnpm may have permission issues)
### ~~3. Add error recovery and logging~~ ✅ DONE
- **Phase 5 (Publish):**
- Iterates `packages/*/`, skips non-`@bytelyst/*`, skips `private: true`, skips packages without `dist/`
- `pnpm publish --registry <url> --no-git-checks`
- Must tolerate "already exists" 409 errors gracefully
Already implemented:
- **Phase completion markers:** `/opt/bytelyst/.setup-state/phaseN.done`
- **Resume:** `--resume` (auto-detect), `--resume-from=N`, `--phase=N` (single), `--reset`, `--status`
- **Logging:** `exec > >(tee -a setup.log) 2>&1`
- **Per-service fallback:** Failed Docker builds are skipped, remaining services start
- **Build logs:** Per-service to `/opt/bytelyst/.setup-state/builds/<service>.log`
- **Phase 6 (Env):**
- Generates `.env.ecosystem` with well-known Cosmos emulator key and Azurite key
- Verify the heredoc correctly expands `${COSMOS_EMULATOR_KEY}` and `${AZURITE_KEY}`
- Verify the `AZURE_BLOB_CONNECTION_STRING` semicolons don't break the env file
- JWT secret generated via `openssl rand -base64 32`
### 4. Add a dry-run / validation mode (TODO)
- **Phase 7 (Deploy):**
- `detect_docker_host_ip()` returns docker0 bridge IP (usually `172.17.0.1`)
- `GITEA_NPM_HOST` set to this IP so Docker builds can reach Gitea on the host
- `docker compose up --build -d` with BuildKit secrets for `GITEA_NPM_TOKEN`
- Verify the `x-product-build` YAML anchor in `docker-compose.ecosystem.yml` correctly passes `GITEA_NPM_HOST` as build arg and `gitea_npm_token` as secret
- **Phase 8 (Verify):**
- Waits for `platform-service` health (120s timeout)
- Creates `/opt/bytelyst/check-health.sh` with all 27+ service URLs
- Sleeps 30s then runs health check
### 2. Fix every bug you find
Do not just report issues — fix them directly in `setup.sh`. Common pitfalls to watch for:
- **Gitea API token field:** Gitea 1.22+ may return the token in `.token` instead of `.sha1`. Add fallback: `jq -r '.sha1 // .token'`
- **pnpm as root:** May need `--unsafe-perm` or setting `pnpm config set unsafe-perm true`
- **Docker BuildKit secrets:** The `secrets.gitea_npm_token.environment` directive requires the env var to be set in the shell running `docker compose`. Verify `export GITEA_NPM_TOKEN` is in scope.
- **Cosmos emulator on Linux:** The `vnext-preview` image requires `PROTOCOL=http`. Verify `cosmos-emulator` healthcheck works (it checks port 8080 for `/ready`, not 8081).
- **Product Dockerfiles:** Each uses `--mount=type=secret,id=gitea_npm_token` during `pnpm install`. Verify the secret ID matches what's in the compose file.
- **MindLyst special path:** Its web Dockerfile is at `mindlyst-native/web/Dockerfile` (not `web/Dockerfile`). The compose file references `../learning_multimodal_memory_agents` with `dockerfile: mindlyst-native/web/Dockerfile`. Verify this context + dockerfile path is correct.
- **LysnrAI extra dashboards:** Has `user-dashboard-web/Dockerfile` in addition to `backend/Dockerfile`. Verify the compose references the correct paths.
### 3. Add error recovery and logging
The script uses `set -euo pipefail` which exits on any error. This is too aggressive for a 25-minute deployment. Add:
- **Per-phase error trapping:** Wrap each phase in a function that catches errors and prints a clear message about which phase failed and what to check
- **Log file:** Tee all output to `/opt/bytelyst/setup.log` so the user can review after SSH disconnection
- **Resume support:** Save phase completion markers to `/opt/bytelyst/.phase_complete_N`. On re-run, skip already-completed phases (unless the user passes `FORCE_RERUN=1`)
### 4. Add a dry-run / validation mode
Add `DRY_RUN=1` support that:
Add `--dry-run` support that:
- Checks all prerequisites (disk space, memory, network access to GitHub)
- Validates Docker is installed and running
- Validates Gitea is reachable
@ -157,11 +138,13 @@ Read `docker-compose.ecosystem.yml` (in the repo root) and verify:
- The `x-product-build` anchor correctly provides `GITEA_NPM_HOST` and `gitea_npm_token` secret
- All `depends_on` conditions reference services that actually exist
- The `localmemgpt-backend` service has `extra_hosts: ['host.docker.internal:host-gateway']` for Ollama access
- **30 total services:** 6 infra (pre-built images) + 24 built from Dockerfiles
### 6. Update `README.md`
After all fixes, update `README.md` to reflect:
- Any new env vars you added (e.g., `DRY_RUN`, `FORCE_RERUN`)
- CLI flags: `--resume`, `--resume-from=N`, `--phase=N`, `--reset`, `--status`, `--help`
- Correct service count: 30 (not 27)
- Updated duration estimates if phases changed
- Any new troubleshooting entries
- NSG port list: `22, 80, 1025, 1234, 3000-3003, 3030, 3035, 3040, 3045, 3050, 3055, 3060, 3070, 3100, 3300, 4003, 4005, 4007, 4010-4019, 8025, 8080, 8081, 10000, 11434`
@ -207,7 +190,7 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t
- [ ] `setup.sh` runs flawlessly from `sudo ./setup.sh` on a raw Ubuntu 24.04 VM
- [ ] All 8 phases complete without manual intervention
- [ ] `/opt/bytelyst/check-health.sh` shows ALL services green
- [ ] `/opt/bytelyst/check-health.sh` shows ALL 30+ services green
- [ ] All 10 product backends respond to `/health` with `{"status":"ok",...}`
- [ ] All 9 product web apps serve their landing page
- [ ] Admin dashboard (`http://<vm-ip>:3001`) loads
@ -218,7 +201,10 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t
- [ ] Mailpit accessible at `http://<vm-ip>:8025`
- [ ] `README.md` is accurate and complete
- [ ] Script is idempotent (second run succeeds without errors)
- [ ] Resume works: `sudo ./setup.sh --resume` after interrupted run
- [ ] Single-phase retry works: `sudo ./setup.sh --phase=7` after build failure
- [ ] Setup log saved to `/opt/bytelyst/setup.log`
- [ ] Build logs saved per-service to `/opt/bytelyst/.setup-state/builds/`
---
@ -228,7 +214,7 @@ Add a section to `README.md` (or a separate `test-plan.md`) that describes how t
Raw Ubuntu 24.04 VM
├── Ollama (systemd, :11434) ─── local LLM inference
├── Gitea (Docker, :3300) ────── npm package registry
└── Docker Compose Ecosystem (27 services)
└── Docker Compose Ecosystem (30 services)
├── Infrastructure
│ ├── cosmos-emulator (:8081, :1234)
│ ├── azurite (:10000)
@ -273,10 +259,22 @@ Product Dockerfiles use BuildKit secret mount for the npm token:
RUN --mount=type=secret,id=gitea_npm_token \
cp .npmrc.docker .npmrc && \
GITEA_NPM_TOKEN=$(cat /run/secrets/gitea_npm_token) \
pnpm install --frozen-lockfile
pnpm install
```
The `.npmrc.docker` in each product repo uses `${GITEA_NPM_HOST}:3300` as the registry host.
During `docker compose build`, the host's `GITEA_NPM_TOKEN` env var is passed as a BuildKit secret,
and `GITEA_NPM_HOST` is passed as a build arg (defaults to `host.docker.internal`, overridden to
`172.17.0.1` on Linux VMs by the setup script).
## CLI Reference
```bash
sudo ./setup.sh # Fresh install (all 8 phases)
sudo ./setup.sh --phase=7 # Retry just the deploy phase
sudo ./setup.sh --resume # Auto-resume after SSH disconnect
sudo ./setup.sh --resume-from=7 # Jump to deploy after manual fix
sudo ./setup.sh --status # Check what's done
sudo ./setup.sh --reset # Start completely over
sudo ./setup.sh --help # Show usage
```

View File

@ -12,7 +12,7 @@
# - Ollama (local LLM inference for LocalMemGPT on :11434)
# - All 11 ByteLyst repos (cloned from GitHub)
# - All @bytelyst/* packages (built + published to Gitea)
# - Full 27-service ecosystem (via docker-compose.ecosystem.yml)
# - Full 30-service ecosystem (via docker-compose.ecosystem.yml)
#
# Usage: sudo ./setup.sh [OPTIONS]
#
@ -859,7 +859,7 @@ main() {
echo ""
echo "╔═══════════════════════════════════════════════════════════════╗"
echo "║ ByteLyst Single-VM Deployment (raw Ubuntu) ║"
echo "║ 27 services · 10 products · Ollama · Gitea · 1 VM ║"
echo "║ 30 services · 10 products · Ollama · Gitea · 1 VM ║"
echo "╚═══════════════════════════════════════════════════════════════╝"
echo ""
log "Log file: ${INSTALL_DIR}/setup.log"