docs(devops): add azure vm deployment status snapshot

2026-03-29 22:42:33 +00:00 · 2026-03-29 22:42:33 +00:00 · 388d71a06f
commit 388d71a06f
parent 626e19f776
2 changed files with 375 additions and 0 deletions
--- a/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md
+++ b/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md
@ -0,0 +1,374 @@
+# Azure VM Deployment Status — 2026-03-29
+
+> Status snapshot for the single-Azure-VM Docker deployment described in this folder.
+> This document records what was completed on the VM, what was manually fixed during validation, what is currently healthy, and what still remains.
+
+---
+
+## Executive Summary
+
+The Azure VM deployment is **partially successful**.
+
+### What is working
+
+- all platform and product **backend services** are running and healthy
+- core infrastructure needed by the backend stack is up:
+  - Cosmos emulator
+  - Azurite
+  - Mailpit
+  - Loki
+  - Grafana
+  - Gitea
+  - Traefik
+- the generated host-side health script confirms all backend APIs are reachable
+- the package registry path used by Docker builds was repaired, and the backend fleet was recovered after build failures
+
+### What is not fully complete
+
+- `setup.sh --status` still reports **Phase 7 pending**
+- `admin-web` and `tracker-web` remain unresolved
+- some product web apps were brought up but remain out of scope for a backend-only completion target
+- some prompt-level validation checks still fail because they depend on web/UI or local-LLM surfaces rather than backend health
+
+### Current bottom line
+
+If the definition of success is:
+
+- "all backends are running and healthy"
+
+then the deployment is in a usable state.
+
+If the definition of success is:
+
+- "all 31 services are green and the original prompt is fully complete"
+
+then the deployment is **not yet complete**.
+
+---
+
+## Environment Reality vs. Prompt Assumptions
+
+The prompt assumed:
+
+- Ubuntu 24.04 LTS
+- `Standard_D8s_v5`
+- 8 vCPU
+- 32 GB RAM
+- 128 GB disk
+
+The actual VM observed during execution differed:
+
+- Ubuntu was newer than the prompt assumption
+- available RAM was lower than the prompt expectation
+
+This matters because the deployment and dry-run checks were written around the original VM assumptions.
+
+---
+
+## Phase Status
+
+From `sudo ./setup.sh --status` on the deployed VM copy:
+
+- Phase 1: DONE (`2026-03-29T07:48:24+00:00`)
+- Phase 2: DONE (`2026-03-29T17:26:38+00:00`)
+- Phase 3: DONE (`2026-03-29T07:48:37+00:00`)
+- Phase 4: DONE (`2026-03-29T07:52:46+00:00`)
+- Phase 5: DONE (`2026-03-29T07:53:33+00:00`)
+- Phase 6: DONE (`2026-03-29T07:53:33+00:00`)
+- Phase 7: **pending**
+- Phase 8: DONE (`2026-03-29T08:33:11+00:00`)
+
+### Interpretation
+
+The setup state is inconsistent with the actual recovered runtime:
+
+- several services were manually repaired and rebuilt after the original deploy attempt
+- backends are healthy now
+- but the phase marker for Phase 7 was never cleanly advanced to done after the manual recovery work
+
+---
+
+## Verified Healthy Backend Services
+
+Host-side health verification confirmed these backend ports returned success:
+
+- `4003` platform-service
+- `4005` extraction-service
+- `4007` mcp-server
+- `4010` peakpulse
+- `4011` chronomind
+- `4012` jarvisjr
+- `4013` nomgap
+- `4014` mindlyst
+- `4015` lysnrai
+- `4016` notelett
+- `4017` flowmonk
+- `4018` actiontrail
+- `4019` localmemgpt
+
+This was verified with host-context curls against `http://127.0.0.1:<port>/health`.
+
+---
+
+## Current Container State Summary
+
+### Healthy infrastructure / service containers
+
+- cosmos-emulator
+- azurite
+- mailpit
+- loki
+- grafana
+- gateway
+- platform-service
+- extraction-service
+- mcp-server
+- all 10 product backends
+
+### Not fully resolved
+
+- `admin-web`
+- `tracker-web`
+
+### Deliberately deprioritized during the final pass
+
+These were not needed for the backend-only goal:
+
+- product web apps
+- LLM Lab dashboard
+
+Some of those web containers were partially repaired and some became reachable, but the final backend-only objective explicitly skipped them.
+
+---
+
+## Manual Fixes Applied During Recovery
+
+The deployment required several manual fixes on the **deployed VM copies under `/opt/bytelyst/`**, not in this repo working tree.
+
+### 1. Gitea package metadata host fix
+
+Problem:
+
+- package tarballs were advertised with `http://localhost:3300/...`
+- Docker builds inside containers could not fetch those packages
+
+Fix:
+
+- changed the deployed Gitea setup to advertise package URLs via the Docker-reachable host IP
+- recreated the Gitea registry container with the corrected host settings
+
+Impact:
+
+- allowed product Docker builds to fetch private `@bytelyst/*` packages successfully
+
+### 2. Re-published shared packages needed by downstream builds
+
+Published corrected package versions in the local Gitea registry:
+
+- `@bytelyst/ui@0.1.1`
+- `@bytelyst/llm-router@0.1.1`
+
+Impact:
+
+- fixed broken downstream web and dashboard builds that depended on those packages
+
+### 3. Extraction-service healthcheck fix
+
+Problem:
+
+- extraction runtime container healthcheck used `wget`
+- runtime image did not include `wget`
+
+Fix:
+
+- updated the deployed extraction-service runtime image to include `wget`
+
+Impact:
+
+- extraction-service became healthy
+
+### 4. Product web build repairs
+
+Examples:
+
+- added missing `next-env.d.ts` to ChronoMind web
+- moved several product web apps off `file:` package references and onto published Gitea package versions
+- corrected some runtime `CMD` entrypoints for Next standalone images
+
+Impact:
+
+- many product web apps became buildable and at least partially runnable
+
+### 5. Local LLM dashboard Dockerfile repair
+
+Problem:
+
+- dashboard Dockerfile had an invalid `COPY ... 2>/dev/null || true` pattern
+
+Fix:
+
+- replaced that with a valid Docker instruction and corrected runtime pathing
+
+Impact:
+
+- dashboard image became buildable
+
+---
+
+## Validation Results
+
+### `/opt/bytelyst/check-health.sh`
+
+At the time of the final backend-focused validation:
+
+- backend and core service health checks were green
+- `admin-web` and `tracker-web` were still red
+- host-side probes for some optional/non-backend surfaces still failed
+
+### `sudo ./setup.sh --dry-run`
+
+Dry-run reported `13/17` checks passed.
+
+Failures included:
+
+- disk/RAM checks reported `0 GB`
+- Ollama service running
+- GitHub reachable
+- Phase 7 pending
+
+### Important dry-run caveat
+
+The RAM and disk failures were not trustworthy as-is because the dry-run parsing logic itself emitted `awk` errors and produced `0 GB` values.
+
+That means the dry-run output should be treated as:
+
+- useful for broad status
+- not authoritative for RAM/disk capacity on this VM
+
+---
+
+## What Is Completed So Far
+
+### Completed deployment capabilities
+
+- base system dependencies installed
+- local registry installed and working
+- repos cloned on the VM
+- shared packages built and published
+- environment generated
+- backend fleet deployed and healthy
+- host-side health script generated
+
+### Completed operational outcome
+
+The VM can currently serve the backend API surface required by:
+
+- platform clients
+- product backends
+- backend-to-backend traffic
+- backend health validation
+
+This makes the VM usable for backend integration testing and further deployment hardening.
+
+---
+
+## What Remains
+
+### High priority
+
+1. Resolve `admin-web`
+2. Resolve `tracker-web`
+3. Bring Phase 7 state into a truthful completed state
+
+### Medium priority
+
+4. Normalize the web runtime images so all product web containers use the correct Next standalone entrypoint
+5. Re-run the full end-to-end prompt validation after web surfaces are repaired
+
+### Security / hardening priority
+
+6. Stop documenting raw `http://<vm-ip>:<port>` exposure as the recommended client model
+7. Move client-facing access behind a single HTTPS API hostname
+8. Reduce public NSG exposure to `80/443` for app traffic
+9. Lock down or remove public access to operational ports:
+   - `3000`
+   - `3300`
+   - `8025`
+   - `8080`
+   - `10000`
+   - `11434`
+
+### Nice to have
+
+10. Correct the dry-run disk/RAM check parsing so it reflects real system capacity
+
+---
+
+## Recommended Next Steps
+
+### If the goal is backend-only readiness
+
+The VM is already in acceptable shape.
+
+Recommended action:
+
+- treat the backend deployment as operational
+- document the two remaining dashboard gaps
+- move on to HTTPS and gateway hardening
+
+### If the goal is full prompt completion
+
+Do this next:
+
+1. finish `admin-web`
+2. finish `tracker-web`
+3. re-run full health validation
+4. update phase markers or complete Phase 7 cleanly
+
+### If the goal is client-facing production readiness
+
+Do not expose raw backend ports directly.
+
+Instead:
+
+1. front the APIs behind one HTTPS domain
+2. route by path prefix
+3. keep backend ports private on the Docker network
+
+That recommendation is documented separately in:
+
+- [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md)
+
+---
+
+## Status Classification
+
+### Backend deployment
+
+- **Status:** complete enough for backend use
+
+### Full single-VM ecosystem deployment
+
+- **Status:** partially complete
+
+### Client-facing production-ready deployment
+
+- **Status:** not complete
+
+---
+
+## Notes on Scope
+
+This document records deployment status as observed during an execution and repair session on the VM.
+
+It is intentionally:
+
+- factual rather than aspirational
+- focused on what was verified
+- separate from the secure API exposure decision doc
+
+It should be updated again after:
+
+- `admin-web` is fixed
+- `tracker-web` is fixed
+- the final phase-state inconsistency is resolved
--- a/docs/devops/single_azure_vm/docker/README.md
+++ b/docs/devops/single_azure_vm/docker/README.md
@ -7,6 +7,7 @@
 Related:

 - [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md) — recommended public API exposure model, alternatives, and security guidance for client-facing URLs
+- [`DEPLOYMENT_STATUS_2026-03-29.md`](./DEPLOYMENT_STATUS_2026-03-29.md) — deployment snapshot: what completed on the Azure VM, what was manually fixed, and what remains

 ---