docs(devops): add K8s best practices from production comparisons, update gap table to reflect all Dockerfiles created

2026-03-22 00:36:59 -07:00 · 2026-03-22 00:36:59 -07:00 · 5646cefcbd
commit 5646cefcbd
parent 09525f671f
1 changed files with 302 additions and 27 deletions
--- a/docs/devops/SINGLE_VM_DEPLOYMENT.md
+++ b/docs/devops/SINGLE_VM_DEPLOYMENT.md
@@ -729,22 +729,22 @@ kubectl port-forward svc/platform-service 4003:4003 -n bytelyst-platform

 ## 10. What's NOT Dockerized Yet (gaps)

-| Repo            | Backend Dockerfile | Web Dockerfile      | `docker-prep.sh` | `output:'standalone'` | Status                                                         |
-| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | -------------------------------------------------------------- |
-| **LysnrAI**     | ❌                 | ✅ user-dashboard   | ❌               | ✅ (conditional)      | Need backend Dockerfile + docker-prep.sh                       |
-| **MindLyst**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
-| **ChronoMind**  | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
-| **JarvisJr**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
-| **PeakPulse**   | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
-| **FlowMonk**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
-| **NomGap**      | ✅ ⚠️              | ✅                  | ✅               | ✅                    | Backend Dockerfile ignores `file:` deps — see §12.F3           |
-| **NoteLett**    | ✅ ⚠️              | ✅                  | ✅               | ✅                    | Backend Dockerfile `COPY .` pulls broken symlinks — see §12.F4 |
-| **ActionTrail** | ✅                 | ✅                  | ✅               | ✅                    | Ready (uses `.tarballs/` pattern)                              |
-| **LocalMemGPT** | ✅                 | ✅                  | ✅               | ✅                    | Ready (repo-root build context)                                |
-| **admin-web**   | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | Ready                                                          |
-| **tracker-web** | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | Ready                                                          |
+| Repo            | Backend Dockerfile | Web Dockerfile      | `docker-prep.sh` | `output:'standalone'` | Status                               |
+| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | ------------------------------------ |
+| **LysnrAI**     | ✅                 | ✅ user-dashboard   | ✅               | ✅ (conditional)      | ✅ Ready                             |
+| **MindLyst**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
+| **ChronoMind**  | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
+| **JarvisJr**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
+| **PeakPulse**   | ✅                 | — (no web)          | ✅               | —                     | ✅ Ready                             |
+| **FlowMonk**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
+| **NomGap**      | ✅                 | ✅                  | ✅               | ✅                    | ✅ Fixed (added `.tarballs/` COPY)   |
+| **NoteLett**    | ✅                 | ✅                  | ✅               | ✅                    | ✅ Fixed (explicit COPY, not `.`)    |
+| **ActionTrail** | ✅                 | ✅                  | ✅               | ✅                    | ✅ Ready (uses `.tarballs/` pattern) |
+| **LocalMemGPT** | ✅                 | ✅                  | ✅               | ✅                    | ✅ Ready (repo-root build context)   |
+| **admin-web**   | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | ✅ Ready                             |
+| **tracker-web** | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | ✅ Ready                             |

-**6 repos need Dockerfiles** + `docker-prep.sh` + `output:'standalone'`. 2 existing Dockerfiles have issues.
+**All 10 product repos now have Dockerfiles and `docker-prep.sh`, plus `output:'standalone'` wherever a web app exists (PeakPulse has none).** Created 2026-03-22.

 ---

@@ -931,17 +931,292 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work

 ### Summary of Required Work Before Compose Works

-| Priority | Item                                                     | Count         |
-| -------- | -------------------------------------------------------- | ------------- |
-| **P0**   | Create missing `docker-prep.sh`                          | 6 repos       |
-| **P0**   | Create missing backend Dockerfiles                       | 6 repos       |
-| **P0**   | Create missing web Dockerfiles                           | 5 repos       |
-| **P0**   | Add `output: 'standalone'` to next.config.ts             | 3 webs        |
-| **P1**   | Fix NomGap backend Dockerfile (add `.tarballs/` COPY)    | 1 file        |
-| **P1**   | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file        |
-| **P1**   | Create `.env.ecosystem` template                         | 1 file        |
-| **P2**   | Standardize Node.js version to 22-alpine                 | 4 Dockerfiles |
-| **P2**   | Add `extra_hosts` for Linux VM Ollama access             | 1 service     |
+| Priority | Item                                                     | Count         | Status                                                     |
+| -------- | -------------------------------------------------------- | ------------- | ---------------------------------------------------------- |
+| **P0**   | Create missing `docker-prep.sh`                          | 6 repos       | ✅ Done (3 created, 3 already existed)                     |
+| **P0**   | Create missing backend Dockerfiles                       | 6 repos       | ✅ Done                                                    |
+| **P0**   | Create missing web Dockerfiles                           | 5 repos       | ✅ Done (4 created, PeakPulse has no web)                  |
+| **P0**   | Add `output: 'standalone'` to next.config.ts             | 3 webs        | ✅ Done (4 webs: ChronoMind, JarvisJr, FlowMonk, MindLyst) |
+| **P1**   | Fix NomGap backend Dockerfile (add `.tarballs/` COPY)    | 1 file        | ✅ Done                                                    |
+| **P1**   | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file        | ✅ Done                                                    |
+| **P1**   | Create `.env.ecosystem` template                         | 1 file        | Pending                                                    |
+| **P2**   | Standardize Node.js version to 22-alpine                 | 4 Dockerfiles | ✅ Done (all new Dockerfiles use 22-alpine)                |
+| **P2**   | Add `extra_hosts` for Linux VM Ollama access             | 1 service     | Pending                                                    |
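+
+The pending `extra_hosts` item could look roughly like the Compose fragment below. This is a sketch, not the repo's actual file: the service name `localmemgpt-backend` is a placeholder, and the `host-gateway` value assumes Docker Engine 20.10 or newer.
+
+```yaml
+services:
+  localmemgpt-backend: # placeholder service name (assumption)
+    extra_hosts:
+      # Lets host.docker.internal resolve to the host on a Linux VM
+      - 'host.docker.internal:host-gateway'
+```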
+
+---
+
+## 13. K8s & Docker Best Practices (from Production Comparisons)
+
+> Derived from comparing three production K8s deployments: a Go-based Call Controller (Paladin), a Python/FastAPI streaming agent platform (NetBond), and a Python/FastAPI voice agent (Welcome Agent). These patterns should be adopted when ByteLyst moves from Docker Compose → K3s → managed K8s.
+
+### 13.1 Deployment — Zero-Downtime Rolling Updates
+
+**Do this (NetBond pattern):**
+
+```yaml
+spec:
+  strategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 0 # Never kill a pod before its replacement is ready
+      maxSurge: 1 # Only 1 extra pod during rollout
+  template:
+    spec:
+      terminationGracePeriodSeconds: 45 # Match your app's drain timeout
+      containers:
+        - lifecycle:
+            preStop:
+              exec:
+                command: ['sleep', '5'] # Let load balancer deregister before SIGTERM
+```
+
+**Don't do this (Paladin anti-pattern):**
+
+```yaml
+maxUnavailable: 50% # Up to half the pods can be down at once, so users may see errors
+maxSurge: 50% # Up to 50% extra pods at once, a large resource spike during rollout
+```
+
+**ByteLyst action:** Every deployment template should use `maxUnavailable: 0` + preStop sleep + explicit `terminationGracePeriodSeconds` matching the Fastify graceful shutdown timeout.
+
+### 13.2 Pod Security Context
+
+**Always set on every container (NetBond pattern):**
+
+```yaml
+securityContext: # container-level; the last two fields are not valid at pod level
+  runAsNonRoot: true
+  runAsUser: 1000
+  runAsGroup: 1000
+  allowPrivilegeEscalation: false
+  readOnlyRootFilesystem: true
+```
+
+If the app needs writable paths (e.g., `/tmp`, cache dirs), use `emptyDir` volumes:
+
+```yaml
+# Pod level (spec.template.spec)
+volumes:
+  - name: tmp
+    emptyDir: {}
+  - name: cache
+    emptyDir: {}
+# Container level (spec.template.spec.containers[])
+volumeMounts:
+  - name: tmp
+    mountPath: /tmp
+  - name: cache
+    mountPath: /home/node/.cache
+```
+
+**ByteLyst action:** All Fastify backends are stateless — `readOnlyRootFilesystem: true` works. Next.js standalone servers may need `/tmp` writable.
+
+### 13.3 Health Probes — Dedicated Endpoints
+
+**Do this:**
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health # Dedicated lightweight endpoint
+    port: 4003
+  initialDelaySeconds: 10
+  periodSeconds: 10
+  timeoutSeconds: 5 # Fast fail — 5s max
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 4003
+  initialDelaySeconds: 5
+  periodSeconds: 5
+  timeoutSeconds: 5
+```
+
+**Don't do this (Welcome Agent anti-pattern):**
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /openapi.json # Heavy endpoint, not a health check
+  timeoutSeconds: 60 # Masks real failures for a full minute
+```
+
+**ByteLyst action:** All backends already expose `GET /health` → `{ status: "ok" }`. Use it. Set timeout to 5s.
+
+### 13.4 Ingress — WebSocket Support
+
+If any service uses WebSocket or SSE (FlowMonk SSE, LocalMemGPT streaming, future real-time features):
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  annotations:
+    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
+    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
+    nginx.ingress.kubernetes.io/proxy-buffering: 'off'
+    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
+    nginx.ingress.kubernetes.io/configuration-snippet: |
+      proxy_set_header Upgrade $http_upgrade;
+      proxy_set_header Connection "upgrade";
+```
+
+Misconfiguration here fails silently: ingress-nginx's default 60-second proxy read/send timeouts close idle WebSocket and SSE connections after 60s with no error.
+
+### 13.5 HPA — Use `autoscaling/v2`
+
+**Do this:**
+
+```yaml
+apiVersion: autoscaling/v2 # Current API, supports multiple metrics
+```
+
+**Don't do this:**
+
+```yaml
+apiVersion: autoscaling/v1 # CPU utilization only; no memory, custom, or multiple metrics
+```
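+
+For reference, a fuller `autoscaling/v2` manifest might look like the sketch below. The target name, replica bounds, and utilization thresholds are illustrative assumptions, not values taken from the compared deployments.
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: platform-service # assumed Deployment name
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: platform-service
+  minReplicas: 2 # illustrative bounds
+  maxReplicas: 5
+  metrics:
+    # Two metrics at once is what autoscaling/v1 cannot express
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70
+    - type: Resource
+      resource:
+        name: memory
+        target:
+          type: Utilization
+          averageUtilization: 80
+```
+
+Utilization targets are computed against each container's resource `requests`, so the Deployment needs CPU and memory requests set for this to work.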
+
+### 13.6 Dockerfile Best Practices
+
+| Practice            | Do                                                         | Don't                                                                              |
+| ------------------- | ---------------------------------------------------------- | ---------------------------------------------------------------------------------- |
+| **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]` (exec form)        | `ENTRYPOINT node dist/server.js` (shell form — PID 1 is `/bin/sh`, signals broken) |
+| **COPY scope**      | `COPY package.json ./` then `COPY src/ ./src/` (selective) | `COPY . .` (copies node_modules, .git, tests, everything)                          |
+| **Layer count**     | Combine related `RUN` steps                                | 3 separate `RUN pip install` / `RUN npm install` steps                             |
+| **Non-root**        | `USER node` (Node.js images have a `node` user)            | Running as root in production                                                      |
+| **Local variant**   | Provide `local.Dockerfile` without corp proxy/JFrog deps   | Single Dockerfile that only works behind corporate proxy                           |
+| **Build args**      | `ARG NODE_ENV=production` for conditional behavior         | Hardcoded env in Dockerfile                                                        |
+
+### 13.7 Helm Values Layering
+
+Use 3 layers for environment management:
+
+```
+values.yaml          # Base defaults (image, port, probes, resources)
+├── env/local.yaml   # Local K3s overrides (lower resources, NodePort, no TLS)
+├── env/dev.yaml     # Dev cluster overrides (replicas, hostnames, secrets)
+└── env/prod.yaml    # Prod overrides (more replicas, real TLS, HPA limits)
+```
+
+Deploy with layered `-f` flags:
+
+```bash
+# Local
+helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/local.yaml
+
+# Dev
+helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/dev.yaml
+
+# Prod
+helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/prod.yaml
+```
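+
+To illustrate how the layers combine, here is a minimal sketch of a base file and a local overlay. The keys (`replicaCount`, `service`, `resources`) are assumed chart values, not the actual ByteLyst chart schema.
+
+```yaml
+# values.yaml (base defaults)
+replicaCount: 2
+service:
+  type: ClusterIP
+  port: 4003
+resources:
+  requests:
+    memory: 256Mi
+---
+# env/local.yaml (only the keys that differ from the base)
+replicaCount: 1
+service:
+  type: NodePort # port 4003 is inherited from values.yaml
+resources:
+  requests:
+    memory: 128Mi
+```
+
+Helm merges maps key by key and the last `-f` file wins, so an overlay only needs to list its differences. Lists are replaced wholesale rather than merged.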
+
+### 13.8 Namespace Strategy
+
+Use Helm `_helpers.tpl` for namespace — never hardcode:
+
+```yaml
+# ✅ Standard pattern — respects --namespace flag
+{{ include "myapp.namespace" . }}
+
+# ❌ Anti-pattern — ignores helm --namespace, causes confusion
+{{ .Values.namespace }}
+```
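+
+The helper itself is not shown above. One common way to define it in `_helpers.tpl` is sketched below; `namespaceOverride` is an assumed values key, and the chart name `myapp` follows the example above.
+
+```yaml
+{{/* Hypothetical helper: falls back to the release namespace set by --namespace */}}
+{{- define "myapp.namespace" -}}
+{{- default .Release.Namespace .Values.namespaceOverride -}}
+{{- end -}}
+```
+
+With this definition, `helm install -n foo` is honored unless `namespaceOverride` is explicitly set.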
+
+### 13.9 Secrets Management Progression
+
+| Phase                    | Strategy                                              | Complexity |
+| ------------------------ | ----------------------------------------------------- | ---------- |
+| **Phase 1** (Compose)    | `.env.ecosystem` file (gitignored)                    | Trivial    |
+| **Phase 2** (K3s)        | Native K8s `Secret` objects + `kubectl create secret` | Low        |
+| **Phase 3** (Production) | Azure Key Vault via `SecretProviderClass` CSI driver  | Medium     |
+| **Phase 4** (Enterprise) | AKV + `AzureKeyVaultSecret` CRD with auto-sync        | High       |
+
+ByteLyst already uses AKV in production (platform-service) — the CSI driver pattern is the natural next step.
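+
+As a rough sketch of Phase 3, a `SecretProviderClass` for the Azure provider could look like this. The vault name, tenant ID, and secret name are placeholders, and identity settings are omitted because they depend on how the cluster authenticates to Azure.
+
+```yaml
+apiVersion: secrets-store.csi.x-k8s.io/v1
+kind: SecretProviderClass
+metadata:
+  name: bytelyst-akv # placeholder name
+spec:
+  provider: azure
+  parameters:
+    keyvaultName: '<your-vault-name>' # placeholder
+    tenantId: '<your-tenant-id>' # placeholder
+    objects: |
+      array:
+        - |
+          objectName: example-secret   # placeholder secret name
+          objectType: secret
+```
+
+Pods then mount the secrets through a CSI volume that references this class by name.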
+
+### 13.10 CI/CD Best Practices (Lessons from Production Pipelines)
+
+| Practice               | Description                                                                                                  |
+| ---------------------- | ------------------------------------------------------------------------------------------------------------ |
+| **Semantic release**   | Auto-version from commit messages (`feat:` → minor, `fix:` → patch). ByteLyst already uses this convention.  |
+| **Image promotion**    | Build once → push to staging repo → promote to gold/prod repo (never rebuild for prod).                      |
+| **Branch pipelines**   | Different CI stages per branch: feature (lint+test), develop (build+deploy-dev), main (promote+deploy-prod). |
+| **Security gates**     | SAST + SCA scans on every build. Block merges on critical findings.                                          |
+| **Quality gates**      | Unit tests + coverage + SonarQube. Fail pipeline if coverage drops.                                          |
+| **Auto-deploy to dev** | Pipeline trigger: when build completes → auto-deploy to dev. Manual gate for prod.                           |
+| **Chart versioning**   | Publish Helm chart to OCI registry (ACR) with semantic version. Pull by version during deploy.               |
+
+### 13.11 Local K8s Development Script Template
+
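+A local K8s deploy script should build the images, load them into the cluster, create the namespace and secrets, deploy via Helm, and verify the rollout: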
+
+```bash
+#!/usr/bin/env bash
+# deploy-local-k8s.sh — Full local K8s deployment for ByteLyst ecosystem
+
+set -euo pipefail
+
+NAMESPACE="bytelyst"
+ACTION="${1:-deploy}"  # deploy | teardown
+
+case "$ACTION" in
+  deploy)
+    # 1. Build all Docker images
+    for svc in platform-service extraction-service mcp-server; do
+      docker build -t bytelyst/$svc:local ./learning_ai_common_plat/services/$svc
+    done
+
+    # 2. Load images into K3s containerd (not needed with Docker Desktop)
+    if command -v k3s &>/dev/null; then
+      for img in $(docker images --format '{{.Repository}}:{{.Tag}}' | grep '^bytelyst/'); do
+        # Pipe via stdin: sudo closes the extra file descriptor that <(...) relies on
+        docker save "$img" | sudo k3s ctr images import -
+      done
+    fi
+
+    # 3. Create namespace + secrets
+    kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
+    kubectl create secret generic bytelyst-secrets \
+      --from-env-file=.env.ecosystem \
+      -n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
+
+    # 4. Deploy via Helm with local overlay
+    helm upgrade --install bytelyst ./helm/bytelyst-ecosystem \
+      -f helm/bytelyst-ecosystem/values.yaml \
+      -f helm/bytelyst-ecosystem/env/local.yaml \
+      -n "$NAMESPACE"
+
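+    # 5. Wait + verify (kubectl wait covers every Deployment; older kubectl
+    #    versions require a resource name for `rollout status`)
+    kubectl wait --for=condition=Available deployment --all -n "$NAMESPACE" --timeout=120s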
+    echo "All pods:"
+    kubectl get pods -n "$NAMESPACE"
+    echo ""
+    echo "Port-forward: kubectl port-forward svc/platform-service 4003:4003 -n $NAMESPACE"
+    ;;
+
+  teardown)
+    helm uninstall bytelyst -n "$NAMESPACE" 2>/dev/null || true
+    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
+    echo "Teardown complete."
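+    ;;
+
+  *)
+    # Reject unknown actions instead of silently doing nothing
+    echo "Usage: $0 [deploy|teardown]" >&2
+    exit 1
+    ;;
+esac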
+```
+
+### 13.12 Quick Reference — What to Apply at Each Phase
+
+| Best Practice                | Phase 1 (Compose)        | Phase 2 (K3s)      | Phase 3 (Prod K8s) |
+| ---------------------------- | ------------------------ | ------------------ | ------------------ |
+| Zero-downtime rolling update | N/A                      | ✅ Apply           | ✅ Apply           |
+| Pod security context         | N/A                      | ✅ Apply           | ✅ Apply           |
+| Health probes                | N/A (use `healthcheck:`) | ✅ Apply           | ✅ Apply           |
+| WebSocket ingress headers    | N/A                      | ✅ If using SSE/WS | ✅ Apply           |
+| HPA v2                       | N/A                      | Optional           | ✅ Apply           |
+| Exec-form ENTRYPOINT         | ✅ Apply now             | ✅                 | ✅                 |
+| Selective COPY               | ✅ Apply now             | ✅                 | ✅                 |
+| Non-root user                | ✅ Apply now             | ✅                 | ✅                 |
+| Values layering              | N/A                      | ✅ Apply           | ✅ Apply           |
+| Secrets via AKV CSI          | N/A                      | N/A                | ✅ Apply           |
+| Semantic release             | ✅ Apply now             | ✅                 | ✅                 |
+| Image promotion              | N/A                      | N/A                | ✅ Apply           |
+| Local deploy script          | N/A                      | ✅ Apply           | ✅ Adapt           |
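+
+For the Phase 1 `healthcheck:` entry in the table, a Compose equivalent of the K8s probes might look like the sketch below. It assumes the image ships BusyBox `wget` (as Alpine-based Node images do) and that the service listens on port 4003; the service name is illustrative.
+
+```yaml
+services:
+  platform-service: # illustrative service name
+    healthcheck:
+      # Exits non-zero when /health does not return a 2xx response
+      test: ['CMD', 'wget', '-qO-', 'http://localhost:4003/health']
+      interval: 10s
+      timeout: 5s
+      retries: 3
+      start_period: 10s
+```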

 ---

@@ -950,7 +1225,7 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
 | Question                       | Answer                                                                                                         |
 | ------------------------------ | -------------------------------------------------------------------------------------------------------------- |
 | **Can deploy on single VM?**   | **Yes.** All ~25 services fit in 32 GB RAM.                                                                    |
-| **All Dockerized?**            | 4/10 product repos fully Dockerized. 6 need Dockerfiles (copy-paste template).                                 |
+| **All Dockerized?**            | **Yes.** All 10 product repos now have Dockerfiles + docker-prep.sh.                                           |
 | **K8s practice on single VM?** | **K3s** — certified K8s, single binary, same manifests scale to multi-node or AKS/EKS/GKE.                     |
 | **Recommended VM?**            | 8 vCPU / 32 GB (min) or 16 vCPU / 64 GB (with Ollama). Hetzner ~$45/mo for dev.                                |
 | **Time to production K8s?**    | Phase 1 (compose) → Phase 2 (K3s single) → Phase 3 (K3s multi) → Phase 4 (managed). Same manifests throughout. |