diff --git a/docs/devops/SINGLE_VM_ENHANCED_PLAN.md b/docs/devops/SINGLE_VM_ENHANCED_PLAN.md
index b252a39a..947f3f06 100644
--- a/docs/devops/SINGLE_VM_ENHANCED_PLAN.md
+++ b/docs/devops/SINGLE_VM_ENHANCED_PLAN.md
@@ -12,6 +12,7 @@
 |------|----------|-----|
 | **[Coolify](https://coolify.io)** | Manual compose orchestration + Traefik config + SSL + deploy scripts | Self-hosted PaaS. Git-push deploys, automatic SSL (Let's Encrypt), env var management UI, Docker Compose support, real-time logs, one-click rollbacks. **Eliminates ~60% of manual deployment work.** |
 | **[Uptime Kuma](https://github.com/louislam/uptime-kuma)** | Custom health-check scripts + `prototype-self-test.sh` | Beautiful status page + monitoring for all 25+ endpoints. Slack/Discord/email alerts. Multi-protocol (HTTP, TCP, DNS, Docker). Setup: 2 minutes. |
+| **[Prometheus](https://prometheus.io)** + **node-exporter** + **cadvisor** | Missing metrics stack next to Grafana/Loki | Adds host metrics, container metrics, alertable service metrics, and closes the main observability gap in the current VM stack. |
 | **[Valkey](https://valkey.io)** (Redis fork, BSD licensed) | In-memory caches scattered across services | Centralized session store, rate-limit counters, pub/sub for SSE fan-out, feature flag cache, job queue backend. Eliminates per-service in-memory state that dies on restart. |
 
 ### Tier 2 — Operational Excellence
@@ -22,6 +23,7 @@
 | **[Portainer CE](https://portainer.io)** | CLI-only Docker management | Visual container management, resource monitoring, compose stack deployment, volume management. Good for when the AI agent isn't available. |
 | **[Restic](https://restic.net)** + cron | No backup strategy | Encrypted, deduplicated backups of Docker volumes (Cosmos data, Gitea repos, Grafana dashboards) to Azure Blob or S3. Scheduled via the platform-service jobs module. |
 | **[SOPS](https://github.com/getsops/sops)** + [age](https://github.com/FiloSottile/age) | Plain `.env` files (secrets in cleartext) | Encrypt secrets in git. `sops -e .env.production > .env.production.enc`. Decrypt at deploy time. No Key Vault dependency for single-VM. |
+| **[PostgreSQL](https://www.postgresql.org)** + **[pgvector](https://github.com/pgvector/pgvector)** | Ad hoc relational/vector persistence plans | Add only when a concrete service needs relational data plus embedding/vector search. Not a day-one requirement for the current VM. |
 
 ### Tier 3 — Developer Experience
 
@@ -48,6 +50,15 @@
 
 **Recommendation: Use Coolify for the VM deployment.** It's a mature, actively maintained project (36K+ GitHub stars) that handles the boring plumbing. Reserve raw compose/K3s for when you need multi-node or fine-grained control.
 
+### Recommended rollout order for the current VM
+
+1. Keep Grafana and Loki internal on the VM.
+2. Add Prometheus + node-exporter + cadvisor next.
+3. Add Valkey after metrics are in place.
+4. Add PostgreSQL + pgvector only when a concrete product or platform service requires it.
+
+This keeps the stack incremental and avoids carrying the operational weight of PostgreSQL before there is a real consumer.
+
 ---
 
 ## 2. Enhanced Architecture (Single VM)
diff --git a/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md b/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md
index 2c8fdaa0..5de11100 100644
--- a/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md
+++ b/docs/devops/single_azure_vm/docker/DEPLOYMENT_STATUS_2026-03-29.md
@@ -125,6 +125,17 @@ This was verified with host-context curls against `http://127.0.0.1:/healt
 - mcp-server
 - all 10 product backends
 
+### Present but not yet added to the active VM stack
+
+- Prometheus
+- Alertmanager
+- node-exporter
+- cadvisor
+- Valkey / Redis-compatible cache
+- PostgreSQL + pgvector
+
+These are not currently running containers on the VM snapshot described by this document.
+
 ### Not fully resolved
 
 - `admin-web`
@@ -259,6 +270,36 @@ The RAM and disk failures were not trustworthy as-is because the dry-run parsing
 
 That means the dry-run output should be treated as:
 
+---
+
+## Recommended Phase 2 Additions
+
+The next infrastructure additions for this VM should be introduced in this order:
+
+1. Prometheus + node-exporter + cadvisor
+2. Valkey
+3. PostgreSQL + pgvector only when a concrete service requires it
+
+### Why this order
+
+- Grafana and Loki are already running, so metrics collection is the biggest missing observability gap
+- Valkey is low-overhead and unlocks shared cache, pub/sub, rate-limits, and SSE fan-out
+- PostgreSQL + pgvector adds more operational weight and should be installed only for an actual relational/vector workload
+
+### Internal-only hosting policy
+
+These services should remain VM-hosted and internal-only:
+
+- Grafana
+- Loki
+- Prometheus
+- Alertmanager
+- admin-web
+- tracker-web
+- any future logs, traces, metrics, or ops dashboards
+
+Do not expose raw service ports publicly. Front any browser-accessed internal tool through Caddy with authentication and preferably IP allowlisting, VPN, or SSO.
+
 - useful for broad status
 - not authoritative for RAM/disk capacity on this VM
 
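
Reviewer note on the internal-only hosting policy added in `DEPLOYMENT_STATUS_2026-03-29.md`: the rule "do not expose raw service ports publicly" could be checked mechanically before deploy. The following is a minimal sketch, not part of this change. It assumes the services are published through Docker Compose short-syntax port strings, and the function names, service names, and port numbers are hypothetical, chosen only for illustration. It handles the common `PORT`, `HOST_PORT:PORT`, and `IP:HOST_PORT:PORT` forms and raises `ValueError` on a host that is not an IP literal.

```python
import ipaddress


def is_publicly_bound(port_spec: str) -> bool:
    """Return True if a Compose short-syntax port mapping binds beyond loopback."""
    spec = port_spec.split("/", 1)[0]  # drop an optional /tcp or /udp suffix
    if spec.startswith("["):
        # Bracketed IPv6 host, e.g. [::1]:3100:3100
        host = spec[1:spec.index("]")]
    else:
        parts = spec.split(":")
        # PORT or HOST_PORT:PORT carries no host IP, so Docker binds all interfaces.
        if len(parts) < 3:
            return True
        host = parts[0]
    return not ipaddress.ip_address(host).is_loopback


def find_public_ports(services: dict) -> dict:
    """Map each service name to the port specs that violate the internal-only rule."""
    violations = {}
    for name, ports in services.items():
        bad = [p for p in ports if is_publicly_bound(p)]
        if bad:
            violations[name] = bad
    return violations


# Hypothetical port mappings; the real compose files may differ.
example_services = {
    "grafana": ["127.0.0.1:3000:3000"],
    "loki": ["[::1]:3100:3100/tcp"],
    "prometheus": ["9090:9090"],
    "alertmanager": ["0.0.0.0:9093:9093"],
}
```

Calling `find_public_ports(example_services)` on the hypothetical mapping above flags `prometheus` and `alertmanager`, the two entries that are not bound to a loopback address. The public-facing Caddy container would be left out of the input, since it is the one service intended to listen on external interfaces.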