saravanakumardb1 35021b67b9 docs(infra): fix stale service count (27→30), update prompt.md + README.md for Codex agent readiness

- prompt.md: mark tasks 1-3 as DONE, add 'Current State' section listing
  all implemented features, update bugs-fixed table (16 items), fix service
  count in architecture diagram, add CLI reference, remove stale --frozen-lockfile
- README.md: add Resume & Retry section with examples, add CLI Flags table,
  fix service count in title/phases, update build failure troubleshooting
  with build log paths and retry command
- setup.sh: fix '27 services' → '30 services' in header comment and banner

2026-03-24 12:35:59 -07:00


Codex Agent Prompt: ByteLyst Single-VM E2E Deployment

Goal: Review, harden, test, and complete setup.sh so it works flawlessly on a raw Ubuntu 24.04 Azure VM — zero manual intervention, 100% completion, all 30 services healthy.

IMPORTANT: Read the "Current State" section below FIRST. Many tasks in this prompt are already completed. Do NOT re-implement them.

Context

This folder contains three files you must work with:

setup.sh — 8-phase bash script (~940 lines) that bootstraps the entire ByteLyst ecosystem on a blank Ubuntu VM
README.md — Deployment guide documenting what the script does, ports, troubleshooting
prompt.md — This file (agent instructions)

The script installs everything from scratch (Docker, Node.js, pnpm, Gitea, Ollama) then clones 11 repos, builds + publishes ~49 @bytelyst/* npm packages to a local Gitea registry, generates environment config, and deploys 30 Docker Compose services (6 infra + 3 platform + 2 dashboards + 10 backends + 9 webs).

Current State (ALREADY IMPLEMENTED — do NOT redo)

The following features are already built and tested in setup.sh:

Resume/retry support: --resume, --resume-from=N, --phase=N, --reset, --status, --help CLI flags
Phase completion markers: Stored in /opt/bytelyst/.setup-state/phaseN.done
GITEA_NPM_TOKEN auto-restore: Token saved to /opt/bytelyst/.gitea_token, restored on resume
Per-service Docker build: Phase 7 builds each of 30 services individually with [N/30] progress
Per-service fallback: Failed builds are skipped, remaining services still start
Build logs: Saved per-service to /opt/bytelyst/.setup-state/builds/<service>.log
Phase 7 partial failure handling: Phase 7 NOT marked done if builds fail, so --resume retries it
set -euo pipefail safety: All pipelines in fallback paths use || true to prevent premature abort
Ollama model pull non-fatal: Model download failure doesn't abort the entire setup
SSH disconnect protection: All output tee'd to /opt/bytelyst/setup.log
Idempotent: Every phase handles re-runs gracefully
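
The marker-based resume logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in setup.sh: `last_completed_phase` is named in this document, while `mark_phase_done`, `next_phase`, and `TOTAL_PHASES` are hypothetical names, and `STATE_DIR` is made overridable only so the sketch can be tried outside `/opt/bytelyst`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Marker directory from the list above; overridable for local experiments (assumption).
STATE_DIR="${STATE_DIR:-/opt/bytelyst/.setup-state}"
TOTAL_PHASES=8

# Record that phase N finished (hypothetical helper name).
mark_phase_done() {
  mkdir -p "$STATE_DIR"
  date -u +%FT%TZ > "$STATE_DIR/phase$1.done"
}

# Highest N such that phases 1..N all have markers; stops at the first gap.
last_completed_phase() {
  local n last=0
  for ((n = 1; n <= TOTAL_PHASES; n++)); do
    [[ -f "$STATE_DIR/phase$n.done" ]] || break
    last=$n
  done
  echo "$last"
}

# Phase that a resume would start from (hypothetical helper name).
next_phase() { echo $(( $(last_completed_phase) + 1 )); }
```

Because the scan stops at the first missing marker, a stray `phase4.done` left behind while phase 3 is incomplete does not cause a resume to skip phase 3.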

Key files outside this folder that the script depends on

File	Repo	Purpose
`docker-compose.ecosystem.yml`	`learning_ai_common_plat` (root)	Defines all 30 services
`.env.ecosystem.example`	`learning_ai_common_plat` (root)	Template for env vars
`packages/*/package.json`	`learning_ai_common_plat`	~49 `@bytelyst/*` packages to publish
`backend/Dockerfile`	Each of the 10 product repos	Product backend Docker builds
`web/Dockerfile`	Each of the 10 product repos	Product web Docker builds
`.npmrc.docker`	Each of the 10 product repos	Gitea npm registry config for Docker builds

Repo list (all 11, cloned to `/opt/bytelyst/`)

learning_ai_common_plat          # Shared platform: packages, services, dashboards, compose
learning_voice_ai_agent          # LysnrAI
learning_multimodal_memory_agents # MindLyst (web is at mindlyst-native/web/)
learning_ai_clock                # ChronoMind
learning_ai_jarvis_jr            # JarvisJr
learning_ai_fastgap              # NomGap
learning_ai_peakpulse            # PeakPulse
learning_ai_flowmonk             # FlowMonk
learning_ai_notes                # NoteLett
learning_ai_trails               # ActionTrail
learning_ai_local_memory_gpt     # LocalMemGPT

GitHub org: saravanakumardb1 (repos are public).

Bugs Already Fixed (do NOT re-fix these)

The following issues have already been identified and fixed in the current setup.sh:

Bug	Fix	Commit
Docker apt source had extra whitespace from `\` continuation	Single-line echo	`ddd2db84`
Gitea 1.22 returns token in `.sha1`, newer versions use `.token`	`jq -r '.sha1 // .token'` fallback	`ddd2db84`
jfrog registry sed didn't handle multi-line `\` continuation	Added `/jfrog-pkg-proxy.*\\$/d` pattern	`ddd2db84`
`detect_docker_host_ip()` uses `ip` command not in minimal installs	Added `iproute2` to apt deps	`ddd2db84`
SSH disconnect loses all output	`exec > >(tee -a setup.log) 2>&1`	`ddd2db84`
`localmemgpt-backend` can't reach Ollama on Linux	`extra_hosts: ['host.docker.internal:host-gateway']` in compose	`3b31709b`
Dashboard Dockerfiles had hardcoded corporate proxy	Converted to `ARG`-based proxy with empty defaults	`2b9fd717`
`pnpm install --frozen-lockfile` fails on shallow clones	Removed `--frozen-lockfile`	`3b31709b`
3 service Dockerfiles had stale package.json COPY lists	Updated to all 57 packages + workspace members	`85aca553`
Phase 5 publish counted 409 conflicts as failures	Distinguish real failures from expected conflicts	`c0bc13e1`
`set -e` + `pipefail` aborted script on `docker compose up` partial failure	Added `
Phase 7 marked done even with partial build failures	Only mark done when all builds succeed	`a9414218`
`docker compose config --format json` called 30x in loop	Cached once	`a9414218`
`--phase=7` printed success even with failures	Now exits 1 with build log path	`a9414218`
`last_completed_phase` didn't enforce sequential order	Stops at first gap	`a3f4c6fa`
Phase 7 missing `.env.ecosystem` guard	Fail early with helpful message	`a3f4c6fa`
`ollama pull \| tail` aborted entire setup on slow network	Made non-fatal	`b634708d`
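
Several rows above come down to the same interaction between `set -e` and `pipefail`. The sketch below illustrates it under stated assumptions: `flaky_pull` is a hypothetical stand-in for a failing `ollama pull`, and the handling shown is one common pattern, not necessarily the exact fix used in setup.sh.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a slow or failing `ollama pull`; not part of setup.sh.
flaky_pull() { echo "pulling manifest"; return 1; }

# With pipefail, `flaky_pull | tail -n 1` reports the failure of the left-hand
# command, so running it bare would trip `set -e` and abort the script.
# Testing the pipeline inside `if` exempts it from errexit and keeps it non-fatal.
pull_model_nonfatal() {
  if ! flaky_pull | tail -n 1; then
    echo "WARN: model pull failed; continuing" >&2
  fi
  return 0
}

pull_model_nonfatal
echo "setup continues"
```

Appending `|| true` to the pipeline has the same effect on control flow, but it also discards the failure, so the `if` form is used here to leave room for a warning.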

Your Tasks (in priority order)

Tasks 1-3 are ALREADY DONE; see "Current State" and "Bugs Already Fixed" above. Focus on Tasks 4-7, which are the remaining work.

1. Audit `setup.sh` for correctness ✅ DONE

The script has been audited and all identified bugs fixed (see table above). Phases 1-8 are tested. Key things already verified:

Docker CE install, Node.js 22 (NodeSource), pnpm 10.6.5, Ollama — all idempotent
Gitea token: .sha1 // .token fallback in place
Corporate proxy: removed at source in all repos, no runtime sed needed
pnpm install runs without --frozen-lockfile
Phase 5 publish: tolerates 409 conflicts
Phase 6 env: heredoc with Cosmos/Azurite emulator keys, semicolons handled
Phase 7: per-service build with fallback, BuildKit secrets via GITEA_NPM_TOKEN env export
Phase 8: health check covers all 30 services + Gitea + Ollama

2. Fix every bug you find ✅ DONE

All bugs fixed; see the 17-item table in "Bugs Already Fixed" above.

3. Add error recovery and logging ✅ DONE

Already implemented:

Phase completion markers: /opt/bytelyst/.setup-state/phaseN.done
Resume: --resume (auto-detect), --resume-from=N, --phase=N (single), --reset, --status
Logging: exec > >(tee -a setup.log) 2>&1
Per-service fallback: Failed Docker builds are skipped, remaining services start
Build logs: Per-service to /opt/bytelyst/.setup-state/builds/<service>.log

4. Add a dry-run / validation mode (TODO)

Add --dry-run support that:

Checks all prerequisites (disk space, memory, network access to GitHub)
Validates Docker is installed and running
Validates Gitea is reachable
Validates all repos can be cloned (HEAD request to GitHub)
Does NOT build, publish, or deploy
Prints a summary of what WOULD happen
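
One possible shape for these checks is sketched below. It is a starting point under stated assumptions, not an existing part of setup.sh: the thresholds `MIN_DISK_GB` and `MIN_MEM_GB` are illustrative guesses, all function names are hypothetical, and the Gitea check assumes the registry is already listening on localhost:3300.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative thresholds (assumptions, not values taken from setup.sh).
MIN_DISK_GB=60
MIN_MEM_GB=16

# Free space in whole GB on the filesystem holding the given path.
free_disk_gb() { df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'; }

# Total RAM in whole GB from /proc/meminfo (Linux only; rounds down).
total_mem_gb() { awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo; }

# Print a PASS/FAIL line; return non-zero when actual < required.
require_at_least() {
  local label=$1 actual=$2 required=$3
  if (( actual >= required )); then
    echo "PASS $label: $actual >= $required"
  else
    echo "FAIL $label: $actual < $required"
    return 1
  fi
}

# HEAD request only; nothing is cloned.
repo_reachable() { curl -fsSI --max-time 10 "https://github.com/saravanakumardb1/$1" > /dev/null; }

# Gitea exposes its version at /api/v1/version.
gitea_reachable() { curl -fsS --max-time 5 "http://localhost:3300/api/v1/version" > /dev/null; }

dry_run() {
  local failures=0
  require_at_least "disk (GB)" "$(free_disk_gb /)" "$MIN_DISK_GB" || failures=$((failures + 1))
  require_at_least "memory (GB)" "$(total_mem_gb)" "$MIN_MEM_GB" || failures=$((failures + 1))
  docker info > /dev/null 2>&1 || { echo "FAIL docker daemon not reachable"; failures=$((failures + 1)); }
  gitea_reachable || { echo "FAIL Gitea not reachable"; failures=$((failures + 1)); }
  repo_reachable learning_ai_common_plat || { echo "FAIL GitHub repo not reachable"; failures=$((failures + 1)); }
  echo "dry run: $failures failure(s); nothing was built, published, or deployed"
  (( failures == 0 ))
}
```

A full version would loop `repo_reachable` over all 11 repos and print the phases that would run. Each check records its failure and continues, so the summary always covers every prerequisite.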

5. Validate the `docker-compose.ecosystem.yml` integration

Read docker-compose.ecosystem.yml (in the repo root) and verify:

Every service's build.context and build.dockerfile paths are correct relative to the compose file location
Every service's port mapping matches the backend's PORT env var
The x-product-build anchor correctly provides GITEA_NPM_HOST and gitea_npm_token secret
All depends_on conditions reference services that actually exist
The localmemgpt-backend service has extra_hosts: ['host.docker.internal:host-gateway'] for Ollama access
30 total services: 6 infra (pre-built images) + 24 built from Dockerfiles
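
The depends_on check in particular is easy to automate. The sketch below assumes that the normalized output of `docker compose config --format json` represents `depends_on` as an object keyed by service name, and that `jq` is available; `missing_dependencies` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print each dependency name (second argument, one per line) that is absent from
# the service list (first argument, one per line). No output means all exist.
missing_dependencies() {
  local services=$1 deps=$2 dep
  while IFS= read -r dep; do
    [[ -z "$dep" ]] && continue
    grep -Fxq -- "$dep" <<< "$services" || echo "$dep"
  done <<< "$deps" | sort -u
}

# Possible usage against the real compose file (run from the repo root):
#   f=docker-compose.ecosystem.yml
#   services=$(docker compose -f "$f" config --services)
#   deps=$(docker compose -f "$f" config --format json \
#            | jq -r '.services[] | (.depends_on // {}) | keys[]')
#   missing_dependencies "$services" "$deps"
```

The same service list can be counted with `wc -l` to confirm the expected total of 30.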

6. Update `README.md`

After all fixes, update README.md to reflect:

CLI flags: --resume, --resume-from=N, --phase=N, --reset, --status, --help
Correct service count: 30 (not 27)
Updated duration estimates if phases changed
Any new troubleshooting entries
NSG port list: 22, 80, 1025, 1234, 3000-3003, 3030, 3035, 3040, 3045, 3050, 3055, 3060, 3070, 3100, 3300, 4003, 4005, 4007, 4010-4019, 8025, 8080, 8081, 10000, 11434
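
When cross-checking the NSG list against what is listening on the VM, it can help to expand the compact list into one port per line. A small sketch follows; `expand_ports` is a hypothetical helper, not something setup.sh provides.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Expand a comma-separated list such as "22, 3000-3003" to one port per line.
expand_ports() {
  local item
  local -a items
  IFS=',' read -r -a items <<< "${1// /}"   # strip spaces, split on commas
  for item in "${items[@]}"; do
    if [[ $item == *-* ]]; then
      seq "${item%-*}" "${item#*-}"         # inclusive range
    else
      echo "$item"
    fi
  done
}

# Possible usage: report ports from the list with no TCP listener.
#   for p in $(expand_ports "22, 80, 3000-3003"); do
#     ss -ltn | grep -q ":$p " || echo "not listening: $p"
#   done
```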

7. Create a test plan

Add a section to README.md (or a separate test-plan.md) that describes how to validate the deployment end-to-end:

1. SSH into VM
2. Run: /opt/bytelyst/check-health.sh
   Expected: All 30+ checks green
3. Run: curl http://localhost:4003/health
   Expected: {"status":"ok","service":"platform-service",...}
4. Run: curl http://localhost:4003/api/auth/register -X POST -H 'Content-Type: application/json' -d '{"email":"test@test.com","password":"Test1234!","displayName":"Test"}'
   Expected: 201 with user object
5. Open browser: http://<vm-ip>:3001
   Expected: Admin dashboard login page
6. Open browser: http://<vm-ip>:3040
   Expected: FlowMonk web app
7. Run: curl http://localhost:4019/api/models
   Expected: List of Ollama models including llama3.2:3b
8. Open browser: http://<vm-ip>:8025
   Expected: Mailpit inbox (empty)
9. Open browser: http://<vm-ip>:3000
   Expected: Grafana login (admin / bytelyst)
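
The curl-based steps above could be scripted along these lines. This is only a sketch: `is_healthy`, `health_body`, and `smoke_test` are hypothetical names, and it covers only the endpoints this document states return `{"status":"ok",...}` on `/health` (platform-service and the ten product backends).

```bash
#!/usr/bin/env bash
set -euo pipefail

# True when a /health response body reports status "ok" (tolerates whitespace).
is_healthy() { grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"' <<< "$1"; }

# Fetch /health from a local port; prints nothing if the request fails.
health_body() { curl -fsS --max-time 5 "http://localhost:$1/health" 2> /dev/null || true; }

smoke_test() {
  local port failed=0
  # 4003 is platform-service; 4010-4019 are the ten product backends.
  for port in 4003 {4010..4019}; do
    if is_healthy "$(health_body "$port")"; then
      echo "ok   :$port"
    else
      echo "FAIL :$port"
      failed=$((failed + 1))
    fi
  done
  return "$(( failed > 0 ))"
}
```

Calling `smoke_test` on the VM prints one line per port and returns non-zero if any endpoint is unhealthy, which makes it usable from CI or a cron check.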

Constraints

DO NOT change any files outside docs/devops/single_azure_vm/ without asking
DO NOT modify docker-compose.ecosystem.yml or any Dockerfile — the script must work with the repos as-is (it patches Dockerfiles after cloning)
DO NOT hardcode secrets or API keys (Cosmos emulator and Azurite keys are well-known public keys, those are OK)
DO NOT add emojis to code
DO NOT use console.log or print — use the existing log(), ok(), warn(), fail() helpers
The script MUST work on a completely fresh Ubuntu 24.04 LTS VM with NOTHING pre-installed except SSH
The script MUST be idempotent — running it twice should not break anything
The script MUST complete in under 30 minutes on a Standard_D8s_v5 (8 vCPU, 32 GB)

Definition of Done

setup.sh runs flawlessly from sudo ./setup.sh on a raw Ubuntu 24.04 VM
All 8 phases complete without manual intervention
/opt/bytelyst/check-health.sh shows ALL 30+ services green
All 10 product backends respond to /health with {"status":"ok",...}
All 9 product web apps serve their landing page
Admin dashboard (http://<vm-ip>:3001) loads
Tracker dashboard (http://<vm-ip>:3003) loads
LocalMemGPT can reach Ollama (curl http://localhost:4019/api/models returns models)
Gitea UI accessible at http://<vm-ip>:3300 with all @bytelyst/* packages visible
Grafana accessible at http://<vm-ip>:3000 (admin / bytelyst)
Mailpit accessible at http://<vm-ip>:8025
README.md is accurate and complete
Script is idempotent (second run succeeds without errors)
Resume works: sudo ./setup.sh --resume after interrupted run
Single-phase retry works: sudo ./setup.sh --phase=7 after build failure
Setup log saved to /opt/bytelyst/setup.log
Build logs saved per-service to /opt/bytelyst/.setup-state/builds/

Architecture Reference

Raw Ubuntu 24.04 VM
├── Ollama (systemd, :11434) ─── local LLM inference
├── Gitea (Docker, :3300) ────── npm package registry
└── Docker Compose Ecosystem (30 services)
    ├── Infrastructure
    │   ├── cosmos-emulator (:8081, :1234)
    │   ├── azurite (:10000)
    │   ├── mailpit (:1025, :8025)
    │   ├── loki (:3100)
    │   ├── grafana (:3000)
    │   └── gateway/traefik (:80, :8080)
    ├── Platform Services
    │   ├── platform-service (:4003) ── auth, billing, flags, audit
    │   ├── extraction-service (:4005) ── AI text extraction
    │   └── mcp-server (:4007) ── MCP tool server
    ├── Dashboards
    │   ├── admin-web (:3001) ── platform admin console
    │   └── tracker-web (:3003) ── issue tracker
    ├── Product Backends (Fastify 5 + TypeScript)
    │   ├── peakpulse-backend (:4010)
    │   ├── chronomind-backend (:4011)
    │   ├── jarvisjr-backend (:4012)
    │   ├── nomgap-backend (:4013)
    │   ├── mindlyst-backend (:4014)
    │   ├── lysnrai-backend (:4015)
    │   ├── notelett-backend (:4016)
    │   ├── flowmonk-backend (:4017)
    │   ├── actiontrail-backend (:4018)
    │   └── localmemgpt-backend (:4019) ── connects to Ollama
    └── Product Web Apps (Next.js 16)
        ├── lysnrai-web (:3002)
        ├── chronomind-web (:3030)
        ├── jarvisjr-web (:3035)
        ├── flowmonk-web (:3040)
        ├── notelett-web (:3045)
        ├── mindlyst-web (:3050)
        ├── nomgap-web (:3055)
        ├── actiontrail-web (:3060)
        └── localmemgpt-web (:3070)

How Docker Builds Reach Gitea

Product Dockerfiles use BuildKit secret mount for the npm token:

RUN --mount=type=secret,id=gitea_npm_token \
    cp .npmrc.docker .npmrc && \
    GITEA_NPM_TOKEN=$(cat /run/secrets/gitea_npm_token) \
    pnpm install

The .npmrc.docker in each product repo uses ${GITEA_NPM_HOST}:3300 as the registry host. During docker compose build, the host's GITEA_NPM_TOKEN env var is passed as a BuildKit secret, and GITEA_NPM_HOST is passed as a build arg (defaults to host.docker.internal, overridden to 172.17.0.1 on Linux VMs by the setup script).
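
On the host side, the override to the Docker bridge address might look like the sketch below. `detect_docker_host_ip` is named in this document, but the body shown here is an assumption about how such a helper could work, and `first_ipv4` is a hypothetical name.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Extract the first IPv4 address from `ip -4 addr show <iface>` output on stdin.
first_ipv4() { awk '$1 == "inet" { split($2, a, "/"); print a[1]; exit }'; }

# Address that containers can use to reach the host; falls back to Docker's
# default bridge gateway when docker0 cannot be inspected.
detect_docker_host_ip() {
  local addr
  addr=$(ip -4 addr show docker0 2> /dev/null | first_ipv4 || true)
  echo "${addr:-172.17.0.1}"
}

# Possible usage before a build (token path as listed under "Current State"):
#   export GITEA_NPM_HOST="$(detect_docker_host_ip)"
#   export GITEA_NPM_TOKEN="$(cat /opt/bytelyst/.gitea_token)"
#   docker compose -f docker-compose.ecosystem.yml build <service>
```

Because the token travels as a BuildKit secret and not as a build arg, it is mounted only for the `RUN` step that needs it and is not written into an image layer.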

CLI Reference

sudo ./setup.sh                    # Fresh install (all 8 phases)
sudo ./setup.sh --phase=7          # Retry just the deploy phase
sudo ./setup.sh --resume           # Auto-resume after SSH disconnect
sudo ./setup.sh --resume-from=7    # Jump to deploy after manual fix
sudo ./setup.sh --status           # Check what's done
sudo ./setup.sh --reset            # Start completely over
sudo ./setup.sh --help             # Show usage
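
For orientation, flag handling of this kind is commonly written as a `case` loop. The sketch below is illustrative only: `parse_args`, `MODE`, and `PHASE` are hypothetical names, and the real parser in setup.sh may differ.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Sets MODE (fresh|resume|resume-from|phase|status|reset|help) and PHASE (1-8 or empty).
parse_args() {
  MODE=fresh
  PHASE=""
  local arg
  for arg in "$@"; do
    case "$arg" in
      --resume)        MODE=resume ;;
      --resume-from=*) MODE=resume-from; PHASE=${arg#*=} ;;
      --phase=*)       MODE=phase; PHASE=${arg#*=} ;;
      --status)        MODE=status ;;
      --reset)         MODE=reset ;;
      --help)          MODE=help ;;
      *) echo "unknown option: $arg" >&2; return 2 ;;
    esac
  done
  # Flags that take a phase number must name one of the 8 phases.
  if [[ $MODE == resume-from || $MODE == phase ]] && [[ ! $PHASE =~ ^[1-8]$ ]]; then
    echo "phase must be 1-8, got '$PHASE'" >&2
    return 2
  fi
}
```

Rejecting unknown flags and out-of-range phases up front means a typo such as `--phase=9` fails immediately and never starts a fresh install.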
