docs(devops): add azure vm deployment status snapshot

This commit is contained in:
root 2026-03-29 22:42:33 +00:00
parent 626e19f776
commit 388d71a06f
2 changed files with 375 additions and 0 deletions

View File

@ -0,0 +1,374 @@
# Azure VM Deployment Status — 2026-03-29
> Status snapshot for the single-Azure-VM Docker deployment described in this folder.
> This document records what was completed on the VM, what was manually fixed during validation, what is currently healthy, and what still remains.
---
## Executive Summary
The Azure VM deployment is **partially successful**.
### What is working
- all platform and product **backend services** are running and healthy
- core infrastructure needed by the backend stack is up:
- Cosmos emulator
- Azurite
- Mailpit
- Loki
- Grafana
- Gitea
- Traefik
- the generated host-side health script confirms all backend APIs are reachable
- the package registry path used by Docker builds was repaired, and the backend fleet was recovered after build failures
### What is not fully complete
- `setup.sh --status` still reports **Phase 7 pending**
- `admin-web` and `tracker-web` remain unresolved
- some product web apps were brought up but remain out of scope for a backend-only completion target
- some prompt-level validation checks still fail because they depend on web/UI or local-LLM surfaces rather than backend health
### Current bottom line
If the definition of success is:
- "all backends are running and healthy"
then the deployment is in a usable state.
If the definition of success is:
- "all 31 services are green and the original prompt is fully complete"
then the deployment is **not yet complete**.
---
## Environment Reality vs. Prompt Assumptions
The prompt assumed:
- Ubuntu 24.04 LTS
- `Standard_D8s_v5`
- 8 vCPU
- 32 GB RAM
- 128 GB disk
The actual VM observed during execution differed:
- Ubuntu was newer than the prompt assumption
- available RAM was lower than the prompt expectation
This matters because the deployment and dry-run checks were written around the original VM assumptions.
---
## Phase Status
From `sudo ./setup.sh --status` on the deployed VM copy:
- Phase 1: DONE (`2026-03-29T07:48:24+00:00`)
- Phase 2: DONE (`2026-03-29T17:26:38+00:00`)
- Phase 3: DONE (`2026-03-29T07:48:37+00:00`)
- Phase 4: DONE (`2026-03-29T07:52:46+00:00`)
- Phase 5: DONE (`2026-03-29T07:53:33+00:00`)
- Phase 6: DONE (`2026-03-29T07:53:33+00:00`)
- Phase 7: **pending**
- Phase 8: DONE (`2026-03-29T08:33:11+00:00`)
### Interpretation
The setup state is inconsistent with the actual recovered runtime:
- several services were manually repaired and rebuilt after the original deploy attempt
- backends are healthy now
- but the phase marker for Phase 7 was never cleanly advanced to done after the manual recovery work
---
## Verified Healthy Backend Services
Host-side health verification confirmed these backend ports returned success:
- `4003` platform-service
- `4005` extraction-service
- `4007` mcp-server
- `4010` peakpulse
- `4011` chronomind
- `4012` jarvisjr
- `4013` nomgap
- `4014` mindlyst
- `4015` lysnrai
- `4016` notelett
- `4017` flowmonk
- `4018` actiontrail
- `4019` localmemgpt
This was verified with host-context curls against `http://127.0.0.1:<port>/health`.
---
## Current Container State Summary
### Healthy infrastructure / service containers
- cosmos-emulator
- azurite
- mailpit
- loki
- grafana
- gateway
- platform-service
- extraction-service
- mcp-server
- all 10 product backends
### Not fully resolved
- `admin-web`
- `tracker-web`
### Deliberately deprioritized during the final pass
These were not needed for the backend-only goal:
- product web apps
- LLM Lab dashboard
Some of those web containers were partially repaired and some became reachable, but the final backend-only objective explicitly skipped them.
---
## Manual Fixes Applied During Recovery
The deployment required several manual fixes on the **deployed VM copies under `/opt/bytelyst/`**, not in this repo working tree.
### 1. Gitea package metadata host fix
Problem:
- package tarballs were advertised with `http://localhost:3300/...`
- Docker builds inside containers could not fetch those packages
Fix:
- changed the deployed Gitea setup to advertise package URLs via the Docker-reachable host IP
- recreated the Gitea registry container with the corrected host settings
Impact:
- allowed product Docker builds to fetch private `@bytelyst/*` packages successfully
### 2. Re-published shared packages needed by downstream builds
Published corrected package versions in the local Gitea registry:
- `@bytelyst/ui@0.1.1`
- `@bytelyst/llm-router@0.1.1`
Impact:
- fixed broken downstream web and dashboard builds that depended on those packages
### 3. Extraction-service healthcheck fix
Problem:
- extraction runtime container healthcheck used `wget`
- runtime image did not include `wget`
Fix:
- updated the deployed extraction-service runtime image to include `wget`
Impact:
- extraction-service became healthy
### 4. Product web build repairs
Examples:
- added missing `next-env.d.ts` to ChronoMind web
- moved several product web apps off `file:` package references and onto published Gitea package versions
- corrected some runtime `CMD` entrypoints for Next standalone images
Impact:
- many product web apps became buildable and at least partially runnable
### 5. Local LLM dashboard Dockerfile repair
Problem:
- dashboard Dockerfile had an invalid `COPY ... 2>/dev/null || true` pattern
Fix:
- replaced that with a valid Docker instruction and corrected runtime pathing
Impact:
- dashboard image became buildable
---
## Validation Results
### `/opt/bytelyst/check-health.sh`
At the time of the final backend-focused validation:
- backend and core service health checks were green
- `admin-web` and `tracker-web` were still red
- host-side probes for some optional/non-backend surfaces still failed
### `sudo ./setup.sh --dry-run`
Dry-run reported `13/17` checks passed.
Failures included:
- disk/RAM checks reported `0 GB`
- Ollama service running
- GitHub reachable
- Phase 7 pending
### Important dry-run caveat
The RAM and disk failures were not trustworthy as-is because the dry-run parsing logic itself emitted `awk` errors and produced `0 GB` values.
That means the dry-run output should be treated as:
- useful for broad status
- not authoritative for RAM/disk capacity on this VM
---
## What Is Completed So Far
### Completed deployment capabilities
- base system dependencies installed
- local registry installed and working
- repos cloned on the VM
- shared packages built and published
- environment generated
- backend fleet deployed and healthy
- host-side health script generated
### Completed operational outcome
The VM can currently serve the backend API surface required by:
- platform clients
- product backends
- backend-to-backend traffic
- backend health validation
This makes the VM usable for backend integration testing and further deployment hardening.
---
## What Remains
### High priority
1. Resolve `admin-web`
2. Resolve `tracker-web`
3. Bring Phase 7 state into a truthful completed state
### Medium priority
4. Normalize the web runtime images so all product web containers use the correct Next standalone entrypoint
5. Re-run the full end-to-end prompt validation after web surfaces are repaired
### Security / hardening priority
6. Stop documenting raw `http://<vm-ip>:<port>` exposure as the recommended client model
7. Move client-facing access behind a single HTTPS API hostname
8. Reduce public NSG exposure to `80/443` for app traffic
9. Lock down or remove public access to operational ports:
- `3000`
- `3300`
- `8025`
- `8080`
- `10000`
- `11434`
### Nice to have
10. Correct the dry-run disk/RAM check parsing so it reflects real system capacity
---
## Recommended Next Steps
### If the goal is backend-only readiness
The VM is already in acceptable shape.
Recommended action:
- treat the backend deployment as operational
- document the two remaining dashboard gaps
- move on to HTTPS and gateway hardening
### If the goal is full prompt completion
Do this next:
1. finish `admin-web`
2. finish `tracker-web`
3. re-run full health validation
4. update phase markers or complete Phase 7 cleanly
### If the goal is client-facing production readiness
Do not expose raw backend ports directly.
Instead:
1. front the APIs behind one HTTPS domain
2. route by path prefix
3. keep backend ports private on the Docker network
That recommendation is documented separately in:
- [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md)
---
## Status Classification
### Backend deployment
- **Status:** complete enough for backend use
### Full single-VM ecosystem deployment
- **Status:** partially complete
### Client-facing production-ready deployment
- **Status:** not complete
---
## Notes on Scope
This document records deployment status as observed during an execution and repair session on the VM.
It is intentionally:
- factual rather than aspirational
- focused on what was verified
- separate from the secure API exposure decision doc
It should be updated again after:
- `admin-web` is fixed
- `tracker-web` is fixed
- the final phase-state inconsistency is resolved

View File

@ -7,6 +7,7 @@
Related:
- [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md) — recommended public API exposure model, alternatives, and security guidance for client-facing URLs
- [`DEPLOYMENT_STATUS_2026-03-29.md`](./DEPLOYMENT_STATUS_2026-03-29.md) — deployment snapshot: what completed on the Azure VM, what was manually fixed, and what remains
---