diff --git a/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md b/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md new file mode 100644 index 00000000..209422c7 --- /dev/null +++ b/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md @@ -0,0 +1,374 @@ +# Azure VM Deployment Status — 2026-03-29 + +> Status snapshot for the single-Azure-VM Docker deployment described in this folder. +> This document records what was completed on the VM, what was manually fixed during validation, what is currently healthy, and what still remains. + +--- + +## Executive Summary + +The Azure VM deployment is **partially successful**. + +### What is working + +- all platform and product **backend services** are running and healthy +- core infrastructure needed by the backend stack is up: + - Cosmos emulator + - Azurite + - Mailpit + - Loki + - Grafana + - Gitea + - Traefik +- the generated host-side health script confirms all backend APIs are reachable +- the package registry path used by Docker builds was repaired, and the backend fleet was recovered after build failures + +### What is not fully complete + +- `setup.sh --status` still reports **Phase 7 pending** +- `admin-web` and `tracker-web` remain unresolved +- some product web apps were brought up but remain out of scope for a backend-only completion target +- some prompt-level validation checks still fail because they depend on web/UI or local-LLM surfaces rather than backend health + +### Current bottom line + +If the definition of success is: + +- "all backends are running and healthy" + +then the deployment is in a usable state. + +If the definition of success is: + +- "all 31 services are green and the original prompt is fully complete" + +then the deployment is **not yet complete**. + +--- + +## Environment Reality vs. Prompt Assumptions + +The prompt assumed: + +- Ubuntu 24.04 LTS +- `Standard_D8s_v5` +- 8 vCPU +- 32 GB RAM +- 128 GB disk + +The actual VM observed during execution differed: + +- Ubuntu was newer than the prompt assumption +- available RAM was lower than the prompt expectation + +This matters because the deployment and dry-run checks were written around the original VM assumptions. + +--- + +## Phase Status + +From `sudo ./setup.sh --status` on the deployed VM copy: + +- Phase 1: DONE (`2026-03-29T07:48:24+00:00`) +- Phase 2: DONE (`2026-03-29T17:26:38+00:00`) +- Phase 3: DONE (`2026-03-29T07:48:37+00:00`) +- Phase 4: DONE (`2026-03-29T07:52:46+00:00`) +- Phase 5: DONE (`2026-03-29T07:53:33+00:00`) +- Phase 6: DONE (`2026-03-29T07:53:33+00:00`) +- Phase 7: **pending** +- Phase 8: DONE (`2026-03-29T08:33:11+00:00`) + +### Interpretation + +The setup state is inconsistent with the actual recovered runtime: + +- several services were manually repaired and rebuilt after the original deploy attempt +- backends are healthy now +- but the phase marker for Phase 7 was never cleanly advanced to done after the manual recovery work + +--- + +## Verified Healthy Backend Services + +Host-side health verification confirmed these backend ports returned success: + +- `4003` platform-service +- `4005` extraction-service +- `4007` mcp-server +- `4010` peakpulse +- `4011` chronomind +- `4012` jarvisjr +- `4013` nomgap +- `4014` mindlyst +- `4015` lysnrai +- `4016` notelett +- `4017` flowmonk +- `4018` actiontrail +- `4019` localmemgpt + +This was verified with host-context curls against `http://127.0.0.1:/health`. + +--- + +## Current Container State Summary + +### Healthy infrastructure / service containers + +- cosmos-emulator +- azurite +- mailpit +- loki +- grafana +- gateway +- platform-service +- extraction-service +- mcp-server +- all 10 product backends + +### Not fully resolved + +- `admin-web` +- `tracker-web` + +### Deliberately deprioritized during the final pass + +These were not needed for the backend-only goal: + +- product web apps +- LLM Lab dashboard + +Some of those web containers were partially repaired and some became reachable, but the final backend-only objective explicitly skipped them. + +--- + +## Manual Fixes Applied During Recovery + +The deployment required several manual fixes on the **deployed VM copies under `/opt/bytelyst/`**, not in this repo working tree. + +### 1. Gitea package metadata host fix + +Problem: + +- package tarballs were advertised with `http://localhost:3300/...` +- Docker builds inside containers could not fetch those packages + +Fix: + +- changed the deployed Gitea setup to advertise package URLs via the Docker-reachable host IP +- recreated the Gitea registry container with the corrected host settings + +Impact: + +- allowed product Docker builds to fetch private `@bytelyst/*` packages successfully + +### 2. Re-published shared packages needed by downstream builds + +Published corrected package versions in the local Gitea registry: + +- `@bytelyst/ui@0.1.1` +- `@bytelyst/llm-router@0.1.1` + +Impact: + +- fixed broken downstream web and dashboard builds that depended on those packages + +### 3. Extraction-service healthcheck fix + +Problem: + +- extraction runtime container healthcheck used `wget` +- runtime image did not include `wget` + +Fix: + +- updated the deployed extraction-service runtime image to include `wget` + +Impact: + +- extraction-service became healthy + +### 4. Product web build repairs + +Examples: + +- added missing `next-env.d.ts` to ChronoMind web +- moved several product web apps off `file:` package references and onto published Gitea package versions +- corrected some runtime `CMD` entrypoints for Next standalone images + +Impact: + +- many product web apps became buildable and at least partially runnable + +### 5. Local LLM dashboard Dockerfile repair + +Problem: + +- dashboard Dockerfile had an invalid `COPY ... 2>/dev/null || true` pattern + +Fix: + +- replaced that with a valid Docker instruction and corrected runtime pathing + +Impact: + +- dashboard image became buildable + +--- + +## Validation Results + +### `/opt/bytelyst/check-health.sh` + +At the time of the final backend-focused validation: + +- backend and core service health checks were green +- `admin-web` and `tracker-web` were still red +- host-side probes for some optional/non-backend surfaces still failed + +### `sudo ./setup.sh --dry-run` + +Dry-run reported `13/17` checks passed. + +Failures included: + +- disk/RAM checks reported `0 GB` +- Ollama service running +- GitHub reachable +- Phase 7 pending + +### Important dry-run caveat + +The RAM and disk failures were not trustworthy as-is because the dry-run parsing logic itself emitted `awk` errors and produced `0 GB` values. + +That means the dry-run output should be treated as: + +- useful for broad status +- not authoritative for RAM/disk capacity on this VM + +--- + +## What Is Completed So Far + +### Completed deployment capabilities + +- base system dependencies installed +- local registry installed and working +- repos cloned on the VM +- shared packages built and published +- environment generated +- backend fleet deployed and healthy +- host-side health script generated + +### Completed operational outcome + +The VM can currently serve the backend API surface required by: + +- platform clients +- product backends +- backend-to-backend traffic +- backend health validation + +This makes the VM usable for backend integration testing and further deployment hardening. + +--- + +## What Remains + +### High priority + +1. Resolve `admin-web` +2. Resolve `tracker-web` +3. Bring Phase 7 state into a truthful completed state + +### Medium priority + +4. Normalize the web runtime images so all product web containers use the correct Next standalone entrypoint +5. Re-run the full end-to-end prompt validation after web surfaces are repaired + +### Security / hardening priority + +6. Stop documenting raw `http://:` exposure as the recommended client model +7. Move client-facing access behind a single HTTPS API hostname +8. Reduce public NSG exposure to `80/443` for app traffic +9. Lock down or remove public access to operational ports: + - `3000` + - `3300` + - `8025` + - `8080` + - `10000` + - `11434` + +### Nice to have + +10. Correct the dry-run disk/RAM check parsing so it reflects real system capacity + +--- + +## Recommended Next Steps + +### If the goal is backend-only readiness + +The VM is already in acceptable shape. + +Recommended action: + +- treat the backend deployment as operational +- document the two remaining dashboard gaps +- move on to HTTPS and gateway hardening + +### If the goal is full prompt completion + +Do this next: + +1. finish `admin-web` +2. finish `tracker-web` +3. re-run full health validation +4. update phase markers or complete Phase 7 cleanly + +### If the goal is client-facing production readiness + +Do not expose raw backend ports directly. + +Instead: + +1. front the APIs behind one HTTPS domain +2. route by path prefix +3. keep backend ports private on the Docker network + +That recommendation is documented separately in: + +- [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md) + +--- + +## Status Classification + +### Backend deployment + +- **Status:** complete enough for backend use + +### Full single-VM ecosystem deployment + +- **Status:** partially complete + +### Client-facing production-ready deployment + +- **Status:** not complete + +--- + +## Notes on Scope + +This document records deployment status as observed during an execution and repair session on the VM. + +It is intentionally: + +- factual rather than aspirational +- focused on what was verified +- separate from the secure API exposure decision doc + +It should be updated again after: + +- `admin-web` is fixed +- `tracker-web` is fixed +- the final phase-state inconsistency is resolved diff --git a/docs/devops/single_azure_vm/docker/README.md b/docs/devops/single_azure_vm/docker/README.md index ca52dbcd..6d799f3d 100644 --- a/docs/devops/single_azure_vm/docker/README.md +++ b/docs/devops/single_azure_vm/docker/README.md @@ -7,6 +7,7 @@ Related: - [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md) — recommended public API exposure model, alternatives, and security guidance for client-facing URLs +- [`DEPLOYMENT_STATUS_2026-03-29.md`](./DEPLOYMENT_STATUS_2026-03-29.md) — deployment snapshot: what completed on the Azure VM, what was manually fixed, and what remains ---