docs(devops): add azure vm deployment status snapshot
parent 626e19f776
commit 388d71a06f
@ -0,0 +1,374 @@
# Azure VM Deployment Status — 2026-03-29

> Status snapshot for the single-Azure-VM Docker deployment described in this folder.
> This document records what was completed on the VM, what was manually fixed during validation, what is currently healthy, and what still remains.

---

## Executive Summary

The Azure VM deployment is **partially successful**.

### What is working

- all platform and product **backend services** are running and healthy
- core infrastructure needed by the backend stack is up:
  - Cosmos emulator
  - Azurite
  - Mailpit
  - Loki
  - Grafana
  - Gitea
  - Traefik
- the generated host-side health script confirms all backend APIs are reachable
- the package registry path used by Docker builds was repaired, and the backend fleet was recovered after build failures

### What is not fully complete

- `setup.sh --status` still reports **Phase 7 pending**
- `admin-web` and `tracker-web` remain unresolved
- some product web apps were brought up but remain out of scope for a backend-only completion target
- some prompt-level validation checks still fail because they depend on web/UI or local-LLM surfaces rather than backend health

### Current bottom line

If the definition of success is:

- "all backends are running and healthy"

then the deployment is in a usable state.

If the definition of success is:

- "all 31 services are green and the original prompt is fully complete"

then the deployment is **not yet complete**.

---

## Environment Reality vs. Prompt Assumptions

The prompt assumed:

- Ubuntu 24.04 LTS
- `Standard_D8s_v5`
- 8 vCPU
- 32 GB RAM
- 128 GB disk

The actual VM observed during execution differed:

- Ubuntu was newer than the prompt assumption
- available RAM was lower than the prompt expectation

This matters because the deployment and dry-run checks were written around the original VM assumptions.

---

## Phase Status

From `sudo ./setup.sh --status` on the deployed VM copy:

- Phase 1: DONE (`2026-03-29T07:48:24+00:00`)
- Phase 2: DONE (`2026-03-29T17:26:38+00:00`)
- Phase 3: DONE (`2026-03-29T07:48:37+00:00`)
- Phase 4: DONE (`2026-03-29T07:52:46+00:00`)
- Phase 5: DONE (`2026-03-29T07:53:33+00:00`)
- Phase 6: DONE (`2026-03-29T07:53:33+00:00`)
- Phase 7: **pending**
- Phase 8: DONE (`2026-03-29T08:33:11+00:00`)

### Interpretation

The setup state is inconsistent with the actual recovered runtime:

- several services were manually repaired and rebuilt after the original deploy attempt
- backends are healthy now
- but the phase marker for Phase 7 was never cleanly advanced to done after the manual recovery work
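
The inconsistency can also be read mechanically from the phase list. The following is a minimal sketch, assuming the status lines look like the bullets in this document; the real `setup.sh --status` output format is not reproduced here, so the regular expression and all function names are hypothetical. It reports phases that are not done, and done phases whose timestamp is later than that of a higher-numbered done phase, which may indicate a later re-run.

```python
import re
from datetime import datetime

# Hypothetical line format, modeled on the bullets in this document.
PHASE_RE = re.compile(
    r"Phase (\d+): (DONE|\*\*pending\*\*|pending)(?: \(`?([^`)]+)`?\))?"
)

STATUS_TEXT = """\
- Phase 1: DONE (`2026-03-29T07:48:24+00:00`)
- Phase 2: DONE (`2026-03-29T17:26:38+00:00`)
- Phase 3: DONE (`2026-03-29T07:48:37+00:00`)
- Phase 4: DONE (`2026-03-29T07:52:46+00:00`)
- Phase 5: DONE (`2026-03-29T07:53:33+00:00`)
- Phase 6: DONE (`2026-03-29T07:53:33+00:00`)
- Phase 7: **pending**
- Phase 8: DONE (`2026-03-29T08:33:11+00:00`)
"""


def parse_phases(text):
    """Return {phase_number: datetime or None}; None means the phase is not done."""
    phases = {}
    for match in PHASE_RE.finditer(text):
        number, state, stamp = match.groups()
        # A DONE line without a timestamp is treated as not done.
        is_done = state == "DONE" and stamp is not None
        phases[int(number)] = datetime.fromisoformat(stamp) if is_done else None
    return phases


def pending_phases(phases):
    """Phase numbers that have no completion timestamp."""
    return sorted(n for n, ts in phases.items() if ts is None)


def out_of_order(phases):
    """Done phases stamped later than some higher-numbered done phase."""
    done = sorted((n, ts) for n, ts in phases.items() if ts is not None)
    return [
        n
        for i, (n, ts) in enumerate(done)
        if any(ts > later for _, later in done[i + 1:])
    ]
```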

---

## Verified Healthy Backend Services

Host-side health verification, run as curl requests from the host against `http://127.0.0.1:<port>/health`, confirmed that these backend ports returned success:

- `4003` platform-service
- `4005` extraction-service
- `4007` mcp-server
- `4010` peakpulse
- `4011` chronomind
- `4012` jarvisjr
- `4013` nomgap
- `4014` mindlyst
- `4015` lysnrai
- `4016` notelett
- `4017` flowmonk
- `4018` actiontrail
- `4019` localmemgpt
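
The same host-side check could be reproduced without curl. This is a minimal sketch, not the generated `check-health.sh`: it assumes that "success" means any 2xx response from `/health`, and the function names and the injectable `probe` parameter are illustrative only. The ports and service names are the ones listed above.

```python
import http.client
import urllib.request

# Ports and service names as listed in this section.
BACKENDS = {
    4003: "platform-service",
    4005: "extraction-service",
    4007: "mcp-server",
    4010: "peakpulse",
    4011: "chronomind",
    4012: "jarvisjr",
    4013: "nomgap",
    4014: "mindlyst",
    4015: "lysnrai",
    4016: "notelett",
    4017: "flowmonk",
    4018: "actiontrail",
    4019: "localmemgpt",
}


def health_url(port, host="127.0.0.1"):
    """Build the health URL probed from the host context."""
    return f"http://{host}:{port}/health"


def http_ok(url, timeout=3.0):
    """Return True if the URL answers with a 2xx status (assumed success criterion)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def check_backends(backends=BACKENDS, probe=http_ok):
    """Return the names of services whose health endpoint did not answer successfully."""
    return [
        name
        for port, name in sorted(backends.items())
        if not probe(health_url(port))
    ]

# On the VM: an empty list from check_backends() means every backend answered.
```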

---

## Current Container State Summary

### Healthy infrastructure / service containers

- cosmos-emulator
- azurite
- mailpit
- loki
- grafana
- gateway
- platform-service
- extraction-service
- mcp-server
- all 10 product backends

### Not fully resolved

- `admin-web`
- `tracker-web`

### Deliberately deprioritized during the final pass

These were not needed for the backend-only goal:

- product web apps
- LLM Lab dashboard

Some of those web containers were partially repaired and some became reachable, but the final backend-only objective explicitly skipped them.

---

## Manual Fixes Applied During Recovery

The deployment required several manual fixes on the **deployed VM copies under `/opt/bytelyst/`**, not in this repo working tree.

### 1. Gitea package metadata host fix

Problem:

- package tarballs were advertised with `http://localhost:3300/...`
- Docker builds inside containers could not fetch those packages

Fix:

- changed the deployed Gitea setup to advertise package URLs via the Docker-reachable host IP
- recreated the Gitea registry container with the corrected host settings

Impact:

- allowed product Docker builds to fetch private `@bytelyst/*` packages successfully
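
The failure mode is that `localhost` inside a build container refers to the container itself, not to the VM running the registry. A minimal sketch of a pre-build check for this is shown below. It assumes the registry returns standard npm package metadata, where each version carries a `dist.tarball` URL; the function name and the sample data (including the placeholder path and IP address) are hypothetical.

```python
from urllib.parse import urlsplit

# Host names that resolve to the container itself when used inside a container.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def loopback_tarballs(packument):
    """Return (version, tarball_url) pairs whose tarball host is a loopback name."""
    flagged = []
    for version, meta in sorted(packument.get("versions", {}).items()):
        tarball = meta.get("dist", {}).get("tarball", "")
        if urlsplit(tarball).hostname in LOOPBACK_HOSTS:
            flagged.append((version, tarball))
    return flagged


# Hypothetical metadata; paths and the 10.0.0.4 address are placeholders.
SAMPLE = {
    "name": "@bytelyst/ui",
    "versions": {
        "0.1.0": {"dist": {"tarball": "http://localhost:3300/placeholder/ui-0.1.0.tgz"}},
        "0.1.1": {"dist": {"tarball": "http://10.0.0.4:3300/placeholder/ui-0.1.1.tgz"}},
    },
}
```

With the sample above, only version `0.1.0` is flagged, because its tarball URL would not be fetchable from inside a build container.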

### 2. Re-published shared packages needed by downstream builds

Published corrected package versions in the local Gitea registry:

- `@bytelyst/ui@0.1.1`
- `@bytelyst/llm-router@0.1.1`

Impact:

- fixed broken downstream web and dashboard builds that depended on those packages

### 3. Extraction-service healthcheck fix

Problem:

- extraction runtime container healthcheck used `wget`
- runtime image did not include `wget`

Fix:

- updated the deployed extraction-service runtime image to include `wget`

Impact:

- extraction-service became healthy

### 4. Product web build repairs

Examples:

- added missing `next-env.d.ts` to ChronoMind web
- moved several product web apps off `file:` package references and onto published Gitea package versions
- corrected some runtime `CMD` entrypoints for Next standalone images

Impact:

- many product web apps became buildable and at least partially runnable

### 5. Local LLM dashboard Dockerfile repair

Problem:

- dashboard Dockerfile had an invalid `COPY ... 2>/dev/null || true` pattern

Fix:

- replaced that with a valid Docker instruction and corrected runtime pathing

Impact:

- dashboard image became buildable
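
The pattern in fix 5 fails because `COPY` is not executed by a shell, so redirections and `|| true` are not interpreted as shell syntax. A minimal sketch of a lint for this is shown below. It is a heuristic under stated assumptions: it only inspects single-line, shell-form `COPY` and `ADD` instructions, and the function name and sample Dockerfile (base image and paths) are hypothetical, not taken from the dashboard repo.

```python
# Tokens that only have meaning when a shell interprets the line.
SHELL_TOKENS = {"||", "&&", "|", ";"}


def suspicious_copy_lines(dockerfile_text):
    """Return (line_number, line) for COPY/ADD lines that contain shell syntax."""
    flagged = []
    for lineno, raw in enumerate(dockerfile_text.splitlines(), start=1):
        words = raw.split()
        if not words or words[0].upper() not in {"COPY", "ADD"}:
            continue
        # ">" inside a word catches redirections such as "2>/dev/null".
        if any(word in SHELL_TOKENS or ">" in word for word in words[1:]):
            flagged.append((lineno, raw.strip()))
    return flagged


# Hypothetical Dockerfile used only to exercise the check.
SAMPLE_DOCKERFILE = """\
FROM node:20-alpine
COPY package.json ./
COPY public ./public 2>/dev/null || true
RUN npm ci || true
"""
```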

---

## Validation Results

### `/opt/bytelyst/check-health.sh`

At the time of the final backend-focused validation:

- backend and core service health checks were green
- `admin-web` and `tracker-web` were still red
- host-side probes for some optional/non-backend surfaces still failed

### `sudo ./setup.sh --dry-run`

Dry-run reported `13/17` checks passed.

The failing checks included:

- disk/RAM checks, which reported `0 GB`
- Ollama service running
- GitHub reachable
- Phase 7, which was still pending

### Important dry-run caveat

The RAM and disk failures were not trustworthy as-is because the dry-run parsing logic itself emitted `awk` errors and produced `0 GB` values.

That means the dry-run output should be treated as:

- useful for broad status
- not authoritative for RAM/disk capacity on this VM
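
Until the dry-run parsing is corrected, capacity can be read independently on the VM. This is a minimal sketch, not the logic used by `setup.sh`, which is not shown in this document. It assumes a Linux host where `/proc/meminfo` reports `MemTotal` in kB, and it raises an error rather than silently reporting `0 GB` when parsing fails. All function names are illustrative.

```python
import shutil


def mem_total_gb(meminfo_text):
    """Parse MemTotal (reported in kB) from /proc/meminfo text and return GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    # Fail loudly instead of falling back to 0.
    raise ValueError("MemTotal not found in meminfo text")


def read_mem_total_gb(path="/proc/meminfo"):
    """Read and parse total RAM from the given meminfo file (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return mem_total_gb(handle.read())


def disk_total_gb(path="/"):
    """Total size in GiB of the filesystem that contains the given path."""
    return shutil.disk_usage(path).total / (1024 ** 3)
```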

---

## What Is Completed So Far

### Completed deployment capabilities

- base system dependencies installed
- local registry installed and working
- repos cloned on the VM
- shared packages built and published
- environment generated
- backend fleet deployed and healthy
- host-side health script generated

### Completed operational outcome

The VM can currently serve the backend API surface required by:

- platform clients
- product backends
- backend-to-backend traffic
- backend health validation

This makes the VM usable for backend integration testing and further deployment hardening.

---

## What Remains

### High priority

1. Resolve `admin-web`
2. Resolve `tracker-web`
3. Bring the Phase 7 marker into a state that truthfully reflects completion

### Medium priority

4. Normalize the web runtime images so all product web containers use the correct Next standalone entrypoint
5. Re-run the full end-to-end prompt validation after web surfaces are repaired

### Security / hardening priority

6. Stop documenting raw `http://<vm-ip>:<port>` exposure as the recommended client model
7. Move client-facing access behind a single HTTPS API hostname
8. Reduce public NSG exposure to `80/443` for app traffic
9. Lock down or remove public access to operational ports:
   - `3000`
   - `3300`
   - `8025`
   - `8080`
   - `10000`
   - `11434`

### Nice to have

10. Correct the dry-run disk/RAM check parsing so it reflects real system capacity

---

## Recommended Next Steps

### If the goal is backend-only readiness

The VM is already in acceptable shape.

Recommended action:

- treat the backend deployment as operational
- document the two remaining web gaps (`admin-web` and `tracker-web`)
- move on to HTTPS and gateway hardening

### If the goal is full prompt completion

Do this next:

1. finish `admin-web`
2. finish `tracker-web`
3. re-run full health validation
4. update phase markers or complete Phase 7 cleanly

### If the goal is client-facing production readiness

Do not expose raw backend ports directly.

Instead:

1. front the APIs behind one HTTPS domain
2. route by path prefix
3. keep backend ports private on the Docker network

That recommendation is documented separately in:

- [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md)
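
The path-prefix routing recommended above can be illustrated with a small lookup. This is a sketch only: the prefix scheme (`/<service>/...`) and the assumption that each service name resolves on the Docker network are hypothetical, and the actual scheme is whatever `SECURE_API_EXPOSURE.md` defines. In the deployment itself this mapping would live in the gateway configuration, not in application code. The service names and ports are the ones verified earlier in this document.

```python
# Service names and ports as verified in this document.
BACKEND_PORTS = {
    "platform-service": 4003,
    "extraction-service": 4005,
    "mcp-server": 4007,
    "peakpulse": 4010,
    "chronomind": 4011,
    "jarvisjr": 4012,
    "nomgap": 4013,
    "mindlyst": 4014,
    "lysnrai": 4015,
    "notelett": 4016,
    "flowmonk": 4017,
    "actiontrail": 4018,
    "localmemgpt": 4019,
}


def resolve(path):
    """Map '/<service>/<rest>' to (upstream_base_url, '/<rest>'), or None if unknown.

    Assumes the first path segment names the backend (hypothetical scheme).
    """
    segments = path.lstrip("/").split("/", 1)
    port = BACKEND_PORTS.get(segments[0])
    if port is None:
        return None
    rest = "/" + segments[1] if len(segments) > 1 else "/"
    # Upstream stays on the private Docker network; only the gateway is public.
    return f"http://{segments[0]}:{port}", rest
```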

---

## Status Classification

### Backend deployment

- **Status:** complete enough for backend use

### Full single-VM ecosystem deployment

- **Status:** partially complete

### Client-facing production-ready deployment

- **Status:** not complete

---

## Notes on Scope

This document records deployment status as observed during an execution and repair session on the VM.

It is intentionally:

- factual rather than aspirational
- focused on what was verified
- separate from the secure API exposure decision doc

It should be updated again after:

- `admin-web` is fixed
- `tracker-web` is fixed
- the final phase-state inconsistency is resolved
@ -7,6 +7,7 @@
Related:

- [`SECURE_API_EXPOSURE.md`](./SECURE_API_EXPOSURE.md) — recommended public API exposure model, alternatives, and security guidance for client-facing URLs
- [`DEPLOYMENT_STATUS_2026-03-29.md`](./DEPLOYMENT_STATUS_2026-03-29.md) — deployment snapshot: what completed on the Azure VM, what was manually fixed, and what remains

---