From 5646cefcbd5debf7847a82dd6b25ba081b3b6bc8 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Sun, 22 Mar 2026 00:36:59 -0700
Subject: [PATCH] docs(devops): add K8s best practices from production comparisons, update gap table to reflect all Dockerfiles created

---
 docs/devops/SINGLE_VM_DEPLOYMENT.md | 329 +++++++++++++++++++++++++---
 1 file changed, 302 insertions(+), 27 deletions(-)

diff --git a/docs/devops/SINGLE_VM_DEPLOYMENT.md b/docs/devops/SINGLE_VM_DEPLOYMENT.md
index 16e12d0d..ee243329 100644
--- a/docs/devops/SINGLE_VM_DEPLOYMENT.md
+++ b/docs/devops/SINGLE_VM_DEPLOYMENT.md
@@ -729,22 +729,22 @@ kubectl port-forward svc/platform-service 4003:4003 -n bytelyst-platform
 
 ## 10. What's NOT Dockerized Yet (gaps)
 
-| Repo | Backend Dockerfile | Web Dockerfile | `docker-prep.sh` | `output:'standalone'` | Status |
-| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | -------------------------------------------------------------- |
-| **LysnrAI** | ❌ | ✅ user-dashboard | ❌ | ✅ (conditional) | Need backend Dockerfile + docker-prep.sh |
-| **MindLyst** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
-| **ChronoMind** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
-| **JarvisJr** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
-| **PeakPulse** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
-| **FlowMonk** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
-| **NomGap** | ✅ ⚠️ | ✅ | ✅ | ✅ | Backend Dockerfile ignores `file:` deps — see §12.F3 |
-| **NoteLett** | ✅ ⚠️ | ✅ | ✅ | ✅ | Backend Dockerfile `COPY .` pulls broken symlinks — see §12.F4 |
-| **ActionTrail** | ✅ | ✅ | ✅ | ✅ | Ready (uses `.tarballs/` pattern) |
-| **LocalMemGPT** | ✅ | ✅ | ✅ | ✅ | Ready (repo-root build context) |
-| **admin-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
-| **tracker-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
+| Repo | Backend Dockerfile | Web Dockerfile | `docker-prep.sh` | `output:'standalone'` | Status |
+| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | ------------------------------------ |
+| **LysnrAI** | ✅ | ✅ user-dashboard | ✅ | ✅ (conditional) | ✅ Ready |
+| **MindLyst** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
+| **ChronoMind** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
+| **JarvisJr** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
+| **PeakPulse** | ✅ | — (no web) | ✅ | — | ✅ Ready |
+| **FlowMonk** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
+| **NomGap** | ✅ | ✅ | ✅ | ✅ | ✅ Fixed (added `.tarballs/` COPY) |
+| **NoteLett** | ✅ | ✅ | ✅ | ✅ | ✅ Fixed (explicit COPY, not `.`) |
+| **ActionTrail** | ✅ | ✅ | ✅ | ✅ | ✅ Ready (uses `.tarballs/` pattern) |
+| **LocalMemGPT** | ✅ | ✅ | ✅ | ✅ | ✅ Ready (repo-root build context) |
+| **admin-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | ✅ Ready |
+| **tracker-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | ✅ Ready |
 
-**6 repos need Dockerfiles** + `docker-prep.sh` + `output:'standalone'`. 2 existing Dockerfiles have issues.
+**All 10 product repos now have Dockerfiles, `docker-prep.sh`, and `output:'standalone'`.** Created 2026-03-22.
 
 ---
 
@@ -931,17 +931,292 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
 
 ### Summary of Required Work Before Compose Works
 
-| Priority | Item | Count |
-| -------- | -------------------------------------------------------- | ------------- |
-| **P0** | Create missing `docker-prep.sh` | 6 repos |
-| **P0** | Create missing backend Dockerfiles | 6 repos |
-| **P0** | Create missing web Dockerfiles | 5 repos |
-| **P0** | Add `output: 'standalone'` to next.config.ts | 3 webs |
-| **P1** | Fix NomGap backend Dockerfile (add `.tarballs/` COPY) | 1 file |
-| **P1** | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file |
-| **P1** | Create `.env.ecosystem` template | 1 file |
-| **P2** | Standardize Node.js version to 22-alpine | 4 Dockerfiles |
-| **P2** | Add `extra_hosts` for Linux VM Ollama access | 1 service |
+| Priority | Item | Count | Status |
+| -------- | -------------------------------------------------------- | ------------- | ---------------------------------------------------------- |
+| **P0** | Create missing `docker-prep.sh` | 6 repos | ✅ Done (3 created, 3 already existed) |
+| **P0** | Create missing backend Dockerfiles | 6 repos | ✅ Done |
+| **P0** | Create missing web Dockerfiles | 5 repos | ✅ Done (4 created, PeakPulse has no web) |
+| **P0** | Add `output: 'standalone'` to next.config.ts | 3 webs | ✅ Done (4 webs: ChronoMind, JarvisJr, FlowMonk, MindLyst) |
+| **P1** | Fix NomGap backend Dockerfile (add `.tarballs/` COPY) | 1 file | ✅ Done |
+| **P1** | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file | ✅ Done |
+| **P1** | Create `.env.ecosystem` template | 1 file | Pending |
+| **P2** | Standardize Node.js version to 22-alpine | 4 Dockerfiles | ✅ Done (all new Dockerfiles use 22-alpine) |
+| **P2** | Add `extra_hosts` for Linux VM Ollama access | 1 service | Pending |
+
+---
+
+## 13. K8s & Docker Best Practices (from Production Comparisons)
+
+> Derived from comparing three production K8s deployments: a Go-based Call Controller (Paladin), a Python/FastAPI streaming agent platform (NetBond), and a Python/FastAPI voice agent (Welcome Agent). These patterns should be adopted when ByteLyst moves from Docker Compose → K3s → managed K8s.
+
+### 13.1 Deployment — Zero-Downtime Rolling Updates
+
+**Do this (NetBond pattern):**
+
+```yaml
+spec:
+  strategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 0 # Never kill a pod before its replacement is ready
+      maxSurge: 1 # Only 1 extra pod during rollout
+  template:
+    spec:
+      terminationGracePeriodSeconds: 45 # Match your app's drain timeout
+      containers:
+        - lifecycle:
+            preStop:
+              exec:
+                command: ['sleep', '5'] # Let load balancer deregister before SIGTERM
+```
+
+**Don't do this (Paladin anti-pattern):**
+
+```yaml
+maxUnavailable: 50% # Half your pods die instantly — users get errors
+maxSurge: 50% # Wastes resources by doubling pod count
+```
+
+**ByteLyst action:** Every deployment template should use `maxUnavailable: 0` + preStop sleep + explicit `terminationGracePeriodSeconds` matching the Fastify graceful shutdown timeout.
+
+### 13.2 Pod Security Context
+
+**Always set (NetBond pattern):**
+
+```yaml
+securityContext:
+  runAsNonRoot: true
+  runAsUser: 1000
+  runAsGroup: 1000
+  allowPrivilegeEscalation: false # container-level field (not valid in the pod-level securityContext)
+  readOnlyRootFilesystem: true # container-level field (not valid in the pod-level securityContext)
+```
+
+If the app needs writable paths (e.g., `/tmp`, cache dirs), use `emptyDir` volumes:
+
+```yaml
+volumes:
+  - name: tmp
+    emptyDir: {}
+  - name: cache
+    emptyDir: {}
+volumeMounts:
+  - name: tmp
+    mountPath: /tmp
+  - name: cache
+    mountPath: /home/node/.cache
+```
+
+**ByteLyst action:** All Fastify backends are stateless — `readOnlyRootFilesystem: true` works. Next.js standalone servers may need `/tmp` writable.
+
+### 13.3 Health Probes — Dedicated Endpoints
+
+**Do this:**
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health # Dedicated lightweight endpoint
+    port: 4003
+  initialDelaySeconds: 10
+  periodSeconds: 10
+  timeoutSeconds: 5 # Fast fail — 5s max
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 4003
+  initialDelaySeconds: 5
+  periodSeconds: 5
+  timeoutSeconds: 5
+```
+
+**Don't do this (Welcome Agent anti-pattern):**
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /openapi.json # Heavy endpoint, not a health check
+  timeoutSeconds: 60 # Masks real failures for a full minute
+```
+
+**ByteLyst action:** All backends already expose `GET /health` → `{ status: "ok" }`. Use it. Set timeout to 5s.
+
+### 13.4 Ingress — WebSocket Support
+
+If any service uses WebSocket or SSE (FlowMonk SSE, LocalMemGPT streaming, future real-time features):
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  annotations:
+    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
+    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
+    nginx.ingress.kubernetes.io/proxy-buffering: 'off'
+    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
+    nginx.ingress.kubernetes.io/configuration-snippet: |
+      proxy_set_header Upgrade $http_upgrade;
+      proxy_set_header Connection "upgrade";
+```
+
+Missing WebSocket headers is a silent failure — connections drop after 60s with no error.
+
+### 13.5 HPA — Use `autoscaling/v2`
+
+**Do this:**
+
+```yaml
+apiVersion: autoscaling/v2 # Current API, supports multiple metrics
+```
+
+**Don't do this:**
+
+```yaml
+apiVersion: autoscaling/v1 # CPU-only; still served, but no memory or custom metrics
+```
+
+### 13.6 Dockerfile Best Practices
+
+| Practice | Do | Don't |
+| ------------------- | ---------------------------------------------------------- | ---------------------------------------------------------------------------------- |
+| **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]` (exec form) | `ENTRYPOINT node dist/server.js` (shell form — PID 1 is `/bin/sh`, signals broken) |
+| **COPY scope** | `COPY package.json ./` then `COPY src/ ./src/` (selective) | `COPY . .` (copies node_modules, .git, tests, everything) |
+| **Layer count** | Combine related `RUN` steps | 3 separate `RUN pip install` / `RUN npm install` steps |
+| **Non-root** | `USER node` (Node.js images have a `node` user) | Running as root in production |
+| **Local variant** | Provide `local.Dockerfile` without corp proxy/JFrog deps | Single Dockerfile that only works behind corporate proxy |
+| **Build args** | `ARG NODE_ENV=production` for conditional behavior | Hardcoded env in Dockerfile |
+
+### 13.7 Helm Values Layering
+
+Use 3 layers for environment management:
+
+```
+values.yaml              # Base defaults (image, port, probes, resources)
+├── env/local.yaml       # Local K3s overrides (lower resources, NodePort, no TLS)
+├── env/dev.yaml         # Dev cluster overrides (replicas, hostnames, secrets)
+└── env/prod.yaml        # Prod overrides (more replicas, real TLS, HPA limits)
+```
+
+Deploy with layered `-f` flags:
+
+```bash
+# Local
+helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/local.yaml
+
+# Dev
+helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/dev.yaml
+
+# Prod
+helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/prod.yaml
+```
+
+### 13.8 Namespace Strategy
+
+Use Helm `_helpers.tpl` for namespace — never hardcode:
+
+```yaml
+# ✅ Standard pattern — respects --namespace flag
+{{ include "myapp.namespace" . }}
+
+# ❌ Anti-pattern — ignores helm --namespace, causes confusion
+{{ .Values.namespace }}
+```
+
+### 13.9 Secrets Management Progression
+
+| Phase | Strategy | Complexity |
+| ------------------------ | ----------------------------------------------------- | ---------- |
+| **Phase 1** (Compose) | `.env.ecosystem` file (gitignored) | Trivial |
+| **Phase 2** (K3s) | Native K8s `Secret` objects + `kubectl create secret` | Low |
+| **Phase 3** (Production) | Azure Key Vault via `SecretProviderClass` CSI driver | Medium |
+| **Phase 4** (Enterprise) | AKV + `AzureKeyVaultSecret` CRD with auto-sync | High |
+
+ByteLyst already uses AKV in production (platform-service) — the CSI driver pattern is the natural next step.
+
+### 13.10 CI/CD Best Practices (Lessons from Production Pipelines)
+
+| Practice | Description |
+| ---------------------- | ------------------------------------------------------------------------------------------------------------ |
+| **Semantic release** | Auto-version from commit messages (`feat:` → minor, `fix:` → patch). ByteLyst already uses this convention. |
+| **Image promotion** | Build once → push to staging repo → promote to gold/prod repo (never rebuild for prod). |
+| **Branch pipelines** | Different CI stages per branch: feature (lint+test), develop (build+deploy-dev), main (promote+deploy-prod). |
+| **Security gates** | SAST + SCA scans on every build. Block merges on critical findings. |
+| **Quality gates** | Unit tests + coverage + SonarQube. Fail pipeline if coverage drops. |
+| **Auto-deploy to dev** | Pipeline trigger: when build completes → auto-deploy to dev. Manual gate for prod. |
+| **Chart versioning** | Publish Helm chart to OCI registry (ACR) with semantic version. Pull by version during deploy. |
+
+### 13.11 Local K8s Development Script Template
+
+A good local K8s deploy script should handle:
+
+```bash
+#!/usr/bin/env bash
+# deploy-local-k8s.sh — Full local K8s deployment for ByteLyst ecosystem
+
+set -euo pipefail
+
+NAMESPACE="bytelyst"
+ACTION="${1:-deploy}" # deploy | teardown
+
+case "$ACTION" in
+  deploy)
+    # 1. Build all Docker images
+    for svc in platform-service extraction-service mcp-server; do
+      docker build -t bytelyst/$svc:local ./learning_ai_common_plat/services/$svc
+    done
+
+    # 2. Load images into K3s containerd (not needed with Docker Desktop)
+    if command -v k3s &>/dev/null; then
+      for img in $(docker images --format '{{.Repository}}:{{.Tag}}' | grep bytelyst); do
+        docker save "$img" | sudo k3s ctr images import - # pipe via stdin; <(...) is not readable under sudo
+      done
+    fi
+
+    # 3. Create namespace + secrets
+    kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
+    kubectl create secret generic bytelyst-secrets \
+      --from-env-file=.env.ecosystem \
+      -n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
+
+    # 4. Deploy via Helm with local overlay
+    helm upgrade --install bytelyst ./helm/bytelyst-ecosystem \
+      -f helm/bytelyst-ecosystem/values.yaml \
+      -f helm/bytelyst-ecosystem/env/local.yaml \
+      -n "$NAMESPACE"
+
+    # 5. Wait + verify
+    kubectl rollout status deploy -n "$NAMESPACE" --timeout=120s
+    echo "All pods:"
+    kubectl get pods -n "$NAMESPACE"
+    echo ""
+    echo "Port-forward: kubectl port-forward svc/platform-service 4003:4003 -n $NAMESPACE"
+    ;;
+
+  teardown)
+    helm uninstall bytelyst -n "$NAMESPACE" 2>/dev/null || true
+    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
+    echo "Teardown complete."
+    ;;
+esac
+```
+
+### 13.12 Quick Reference — What to Apply at Each Phase
+
+| Best Practice | Phase 1 (Compose) | Phase 2 (K3s) | Phase 3 (Prod K8s) |
+| ---------------------------- | ------------------------ | ------------------ | ------------------ |
+| Zero-downtime rolling update | N/A | ✅ Apply | ✅ Apply |
+| Pod security context | N/A | ✅ Apply | ✅ Apply |
+| Health probes | N/A (use `healthcheck:`) | ✅ Apply | ✅ Apply |
+| WebSocket ingress headers | N/A | ✅ If using SSE/WS | ✅ Apply |
+| HPA v2 | N/A | Optional | ✅ Apply |
+| Exec-form ENTRYPOINT | ✅ Apply now | ✅ | ✅ |
+| Selective COPY | ✅ Apply now | ✅ | ✅ |
+| Non-root user | ✅ Apply now | ✅ | ✅ |
+| Values layering | N/A | ✅ Apply | ✅ Apply |
+| Secrets via AKV CSI | N/A | N/A | ✅ Apply |
+| Semantic release | ✅ Apply now | ✅ | ✅ |
+| Image promotion | N/A | N/A | ✅ Apply |
+| Local deploy script | N/A | ✅ Apply | ✅ Adapt |
 
 ---
 
@@ -950,7 +1225,7 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
 | Question | Answer |
 | ------------------------------ | -------------------------------------------------------------------------------------------------------------- |
 | **Can deploy on single VM?** | **Yes.** All ~25 services fit in 32 GB RAM. |
-| **All Dockerized?** | 4/10 product repos fully Dockerized. 6 need Dockerfiles (copy-paste template). |
+| **All Dockerized?** | **Yes.** All 10 product repos now have Dockerfiles + docker-prep.sh. |
 | **K8s practice on single VM?** | **K3s** — certified K8s, single binary, same manifests scale to multi-node or AKS/EKS/GKE. |
 | **Recommended VM?** | 8 vCPU / 32 GB (min) or 16 vCPU / 64 GB (with Ollama). Hetzner ~$45/mo for dev. |
 | **Time to production K8s?** | Phase 1 (compose) → Phase 2 (K3s single) → Phase 3 (K3s multi) → Phase 4 (managed). Same manifests throughout. |