docs(devops): add phased VM stack recommendations
parent b7b3869014
commit 4aba0a83cc
@@ -12,6 +12,7 @@
|------|----------|-----|
| **[Coolify](https://coolify.io)** | Manual compose orchestration + Traefik config + SSL + deploy scripts | Self-hosted PaaS. Git-push deploys, automatic SSL (Let's Encrypt), env var management UI, Docker Compose support, real-time logs, one-click rollbacks. **Eliminates ~60% of manual deployment work.** |
| **[Uptime Kuma](https://github.com/louislam/uptime-kuma)** | Custom health-check scripts + `prototype-self-test.sh` | Beautiful status page + monitoring for all 25+ endpoints. Slack/Discord/email alerts. Multi-protocol (HTTP, TCP, DNS, Docker). Setup: 2 minutes. |
| **[Prometheus](https://prometheus.io)** + **node-exporter** + **cadvisor** | Missing metrics stack next to Grafana/Loki | Adds host metrics, container metrics, alertable service metrics, and closes the main observability gap in the current VM stack. |
| **[Valkey](https://valkey.io)** (Redis fork, BSD licensed) | In-memory caches scattered across services | Centralized session store, rate-limit counters, pub/sub for SSE fan-out, feature flag cache, job queue backend. Eliminates per-service in-memory state that dies on restart. |
### Tier 2 — Operational Excellence
@@ -22,6 +23,7 @@
| **[Portainer CE](https://portainer.io)** | CLI-only Docker management | Visual container management, resource monitoring, compose stack deployment, volume management. Good for when the AI agent isn't available. |
| **[Restic](https://restic.net)** + cron | No backup strategy | Encrypted, deduplicated backups of Docker volumes (Cosmos data, Gitea repos, Grafana dashboards) to Azure Blob or S3. Scheduled via the platform-service jobs module. |
| **[SOPS](https://github.com/getsops/sops)** + [age](https://github.com/FiloSottile/age) | Plain `.env` files (secrets in cleartext) | Encrypt secrets before committing them to git: `sops -e .env.production > .env.production.enc`. Decrypt at deploy time. No Key Vault dependency for a single-VM deployment. |
| **[PostgreSQL](https://www.postgresql.org)** + **[pgvector](https://github.com/pgvector/pgvector)** | Ad hoc relational/vector persistence plans | Add only when a concrete service needs relational data plus embedding/vector search. Not a day-one requirement for the current VM. |
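The SOPS + age row above can be sketched as a pair of shell helpers. This is a minimal sketch, assuming `sops` and `age` are installed on the machine doing the encryption and on the VM. The key path `./age.key`, the helper names, and the `DRY_RUN` switch are hypothetical and not part of the original document. The format is stated explicitly with `--input-type dotenv` rather than relying on filename detection.

```shell
#!/bin/sh
# Sketch: encrypt and decrypt a dotenv file with SOPS + age.
# Assumed, not from the source: the key path, the helper names, and DRY_RUN.
set -eu

AGE_KEY_FILE="${AGE_KEY_FILE:-./age.key}"   # hypothetical private key location

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# encrypt_env <file> <age-recipient>: writes <file>.enc, which is safe to commit.
encrypt_env() {
  run sops --encrypt --age "$2" \
    --input-type dotenv --output-type dotenv \
    --output "$1.enc" "$1"
}

# decrypt_env <file.enc> <out>: intended to run at deploy time on the VM.
# The subshell body keeps the exported key path from leaking into the caller.
decrypt_env() (
  export SOPS_AGE_KEY_FILE="$AGE_KEY_FILE"
  run sops --decrypt \
    --input-type dotenv --output-type dotenv \
    --output "$2" "$1"
)
```

A key pair can be created once with `age-keygen -o age.key`; `age-keygen -y age.key` prints the public recipient to pass as the second argument of `encrypt_env`. Only the `.enc` file belongs in git; the private key stays on the VM.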
### Tier 3 — Developer Experience
@@ -48,6 +50,15 @@
**Recommendation: Use Coolify for the VM deployment.** It's a mature, actively maintained project (36K+ GitHub stars) that handles the boring plumbing. Reserve raw compose/K3s for when you need multi-node or fine-grained control.
### Recommended rollout order for the current VM
1. Keep Grafana and Loki internal on the VM.
2. Add Prometheus + node-exporter + cadvisor next.
3. Add Valkey after metrics are in place.
4. Add PostgreSQL + pgvector only when a concrete product or platform service requires it.

This keeps the stack incremental and avoids carrying the operational weight of PostgreSQL before there is a real consumer.

---
## 2. Enhanced Architecture (Single VM)
@@ -125,6 +125,17 @@ This was verified with host-context curls against `http://127.0.0.1:<port>/healt
- mcp-server
- all 10 product backends
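A host-context probe of the kind mentioned above could look roughly like the following. This is a sketch under stated assumptions: the helper names are hypothetical, and because the health path is cut off in this excerpt, the full URLs are supplied by the caller instead of being hard-coded.

```shell
#!/bin/sh
# Sketch: probe health URLs from the VM host itself.
# Assumed, not from the source: helper names and the 5-second timeout.
set -eu

# Print the HTTP status code for a URL; curl reports 000 when no response arrives.
fetch_status() {
  curl --silent --output /dev/null --max-time 5 \
    --write-out '%{http_code}' "$1" || true
}

# check_endpoints <url>...: print one line per URL, fail if any is not 200.
check_endpoints() {
  failed=0
  for url in "$@"; do
    status="$(fetch_status "$url")"
    if [ "$status" = "200" ]; then
      printf 'OK   %s\n' "$url"
    else
      printf 'FAIL %s (status %s)\n' "$url" "$status"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Called as `check_endpoints "http://127.0.0.1:${PORT}${HEALTH_PATH}"`, where `PORT` and `HEALTH_PATH` are placeholders for whatever each service actually exposes, the function exits non-zero as soon as one endpoint is unhealthy, which makes it usable from cron or a deploy script.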
### Present but not yet added to the active VM stack
- Prometheus
- Alertmanager
- node-exporter
- cadvisor
- Valkey / Redis-compatible cache
- PostgreSQL + pgvector

None of these is currently running as a container on the VM snapshot described by this document.
### Not fully resolved
- `admin-web`
@@ -259,6 +270,36 @@ The RAM and disk failures were not trustworthy as-is because the dry-run parsing
That means the dry-run output should be treated as:

---
## Recommended Phase 2 Additions
The next infrastructure additions for this VM should be introduced in this order:

1. Prometheus + node-exporter + cadvisor
2. Valkey
3. PostgreSQL + pgvector only when a concrete service requires it
### Why this order
- Grafana and Loki are already running, so metrics collection is the biggest remaining observability gap
- Valkey is low-overhead and unlocks a shared cache, pub/sub, rate limiting, and SSE fan-out
- PostgreSQL + pgvector adds more operational weight and should be installed only for an actual relational/vector workload
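The ordering above can be expressed as a small gate that reports which addition comes next, given the containers already running. This is only an illustrative sketch: the container names (`prometheus`, `node-exporter`, `cadvisor`, `valkey`) and the helper names are assumptions and should be replaced with the names actually used on the VM.

```shell
#!/bin/sh
# Sketch: pick the next Phase 2 addition from the list of running containers.
# Assumed, not from the source: container names and helper names.
set -eu

# Succeeds when every argument appears as a whole line in $RUNNING.
all_running() {
  for name in "$@"; do
    printf '%s\n' "$RUNNING" | grep -qxF -- "$name" || return 1
  done
}

# Reads one container name per line on stdin and prints the next step.
next_addition() {
  RUNNING="$(cat)"
  if ! all_running prometheus node-exporter cadvisor; then
    echo "metrics: Prometheus + node-exporter + cadvisor"
  elif ! all_running valkey; then
    echo "cache: Valkey"
  else
    echo "none: add PostgreSQL + pgvector only when a concrete service requires it"
  fi
}
```

On the VM it would be fed from Docker, for example `docker ps --format '{{.Names}}' | next_addition`. PostgreSQL is deliberately never suggested automatically, since the document ties it to a concrete consumer, not to a position in the sequence.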
### Internal-only hosting policy
These services should remain VM-hosted and internal-only:

- Grafana
- Loki
- Prometheus
- Alertmanager
- admin-web
- tracker-web
- any future logs, traces, metrics, or ops dashboards

Do not expose raw service ports publicly. Put any browser-accessed internal tool behind Caddy with authentication, and preferably add IP allowlisting, a VPN, or SSO.
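One way to spot accidentally exposed ports is to filter the host's listening sockets for anything not bound to loopback. The sketch below assumes a Linux host with `ss` from iproute2; the helper name and the list of expected public ports are assumptions, not part of the original policy. It only inspects host sockets and does not replace a firewall or cloud security-group review.

```shell
#!/bin/sh
# Sketch: list TCP listeners that are reachable beyond loopback.
# Assumed, not from the source: helper name and the allowed-port list.
set -eu

# public_listeners "<allowed ports>": reads `ss -H -tln` output on stdin and
# prints local addresses that are neither loopback nor on an allowed port.
public_listeners() {
  allowed=" ${1:-} "
  awk -v allowed="$allowed" '
    $1 == "LISTEN" {
      addr = $4
      n = split(addr, parts, ":")
      port = parts[n]
      if (addr ~ /^127\./ || addr ~ /^\[::1\]/) next   # loopback only
      if (index(allowed, " " port " ") > 0) next        # expected public port
      print addr
    }'
}
```

For example, `ss -H -tln | public_listeners "22 80 443"` would print nothing on a VM where only SSH and Caddy (assumed here to listen on 80 and 443) are publicly bound; any other line is a service that should be rebound to `127.0.0.1` or placed behind Caddy.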
- useful for broad status
- not authoritative for RAM/disk capacity on this VM