docs(devops): add K8s best practices from production comparisons, update gap table to reflect all Dockerfiles created

2026-03-22 00:36:59 -07:00 · 2026-03-22 00:36:59 -07:00 · 5646cefcbd
commit 5646cefcbd
parent 09525f671f
1 changed file with 302 additions and 27 deletions
--- a/docs/devops/SINGLE_VM_DEPLOYMENT.md
+++ b/docs/devops/SINGLE_VM_DEPLOYMENT.md
@@ -729,22 +729,22 @@ kubectl port-forward svc/platform-service 4003:4003 -n bytelyst-platform
 ## 10. What's NOT Dockerized Yet (gaps)
-| Repo            | Backend Dockerfile | Web Dockerfile      | `docker-prep.sh` | `output:'standalone'` | Status                                                         |
+| Repo            | Backend Dockerfile | Web Dockerfile      | `docker-prep.sh` | `output:'standalone'` | Status                               |
-| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | -------------------------------------------------------------- |
+| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | ------------------------------------ |
-| **LysnrAI**     | ❌                 | ✅ user-dashboard   | ❌               | ✅ (conditional)      | Need backend Dockerfile + docker-prep.sh                       |
+| **LysnrAI**     | ✅                 | ✅ user-dashboard   | ✅               | ✅ (conditional)      | ✅ Ready                             |
-| **MindLyst**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
+| **MindLyst**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
-| **ChronoMind**  | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
+| **ChronoMind**  | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
-| **JarvisJr**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
+| **JarvisJr**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
-| **PeakPulse**   | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
+| **PeakPulse**   | ✅                 | — (no web)          | ✅               | —                     | ✅ Ready                             |
-| **FlowMonk**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
+| **FlowMonk**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
-| **NomGap**      | ✅ ⚠️              | ✅                  | ✅               | ✅                    | Backend Dockerfile ignores `file:` deps — see §12.F3           |
+| **NomGap**      | ✅                 | ✅                  | ✅               | ✅                    | ✅ Fixed (added `.tarballs/` COPY)   |
-| **NoteLett**    | ✅ ⚠️              | ✅                  | ✅               | ✅                    | Backend Dockerfile `COPY .` pulls broken symlinks — see §12.F4 |
+| **NoteLett**    | ✅                 | ✅                  | ✅               | ✅                    | ✅ Fixed (explicit COPY, not `.`)    |
-| **ActionTrail** | ✅                 | ✅                  | ✅               | ✅                    | Ready (uses `.tarballs/` pattern)                              |
+| **ActionTrail** | ✅                 | ✅                  | ✅               | ✅                    | ✅ Ready (uses `.tarballs/` pattern) |
-| **LocalMemGPT** | ✅                 | ✅                  | ✅               | ✅                    | Ready (repo-root build context)                                |
+| **LocalMemGPT** | ✅                 | ✅                  | ✅               | ✅                    | ✅ Ready (repo-root build context)   |
-| **admin-web**   | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | Ready                                                          |
+| **admin-web**   | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | ✅ Ready                             |
-| **tracker-web** | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | Ready                                                          |
+| **tracker-web** | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | ✅ Ready                             |
-**6 repos need Dockerfiles** + `docker-prep.sh` + `output:'standalone'`. 2 existing Dockerfiles have issues.
+**All 10 product repos now have Dockerfiles and `docker-prep.sh`, and every repo with a web app sets `output:'standalone'`** (PeakPulse has no web app). Created 2026-03-22.
 ---
@@ -931,17 +931,292 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
 ### Summary of Required Work Before Compose Works
-| Priority | Item                                                     | Count         |
+| Priority | Item                                                     | Count         | Status                                                     |
-| -------- | -------------------------------------------------------- | ------------- |
+| -------- | -------------------------------------------------------- | ------------- | ---------------------------------------------------------- |
-| **P0**   | Create missing `docker-prep.sh`                          | 6 repos       |
+| **P0**   | Create missing `docker-prep.sh`                          | 6 repos       | ✅ Done (3 created, 3 already existed)                     |
-| **P0**   | Create missing backend Dockerfiles                       | 6 repos       |
+| **P0**   | Create missing backend Dockerfiles                       | 6 repos       | ✅ Done                                                    |
-| **P0**   | Create missing web Dockerfiles                           | 5 repos       |
+| **P0**   | Create missing web Dockerfiles                           | 5 repos       | ✅ Done (4 created, PeakPulse has no web)                  |
-| **P0**   | Add `output: 'standalone'` to next.config.ts             | 3 webs        |
+| **P0**   | Add `output: 'standalone'` to next.config.ts             | 3 webs        | ✅ Done (4 webs: ChronoMind, JarvisJr, FlowMonk, MindLyst) |
-| **P1**   | Fix NomGap backend Dockerfile (add `.tarballs/` COPY)    | 1 file        |
+| **P1**   | Fix NomGap backend Dockerfile (add `.tarballs/` COPY)    | 1 file        | ✅ Done                                                    |
-| **P1**   | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file        |
+| **P1**   | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file        | ✅ Done                                                    |
-| **P1**   | Create `.env.ecosystem` template                         | 1 file        |
+| **P1**   | Create `.env.ecosystem` template                         | 1 file        | Pending                                                    |
-| **P2**   | Standardize Node.js version to 22-alpine                 | 4 Dockerfiles |
+| **P2**   | Standardize Node.js version to 22-alpine                 | 4 Dockerfiles | ✅ Done (all new Dockerfiles use 22-alpine)                |
-| **P2**   | Add `extra_hosts` for Linux VM Ollama access             | 1 service     |
+| **P2**   | Add `extra_hosts` for Linux VM Ollama access             | 1 service     | Pending                                                    |
 ---
 ## 13. K8s & Docker Best Practices (from Production Comparisons)
 > Derived from comparing three production K8s deployments: a Go-based Call Controller (Paladin), a Python/FastAPI streaming agent platform (NetBond), and a Python/FastAPI voice agent (Welcome Agent). These patterns should be adopted when ByteLyst moves from Docker Compose → K3s → managed K8s.
 ### 13.1 Deployment — Zero-Downtime Rolling Updates
 **Do this (NetBond pattern):**
 ```yaml
 spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # Never kill a pod before its replacement is ready
      maxSurge: 1 # Only 1 extra pod during rollout
  template:
    spec:
      terminationGracePeriodSeconds: 45 # Match your app's drain timeout
      containers:
        - lifecycle:
            preStop:
              exec:
                command: ['sleep', '5'] # Let load balancer deregister before SIGTERM
 ```
 **Don't do this (Paladin anti-pattern):**
 ```yaml
 maxUnavailable: 50% # Up to half your pods can be down at once: users get errors under load
 maxSurge: 50% # Up to 50% extra pods scheduled at once: resource spike during every rollout
 ```
 **ByteLyst action:** Every deployment template should use `maxUnavailable: 0` + preStop sleep + explicit `terminationGracePeriodSeconds` matching the Fastify graceful shutdown timeout.
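 The grace period has to cover both the preStop sleep and the application's own drain time, because the kubelet counts the preStop hook against `terminationGracePeriodSeconds`. A minimal sketch of that budget check follows; the function name and the 30s drain timeout are hypothetical, so substitute the real Fastify shutdown timeout.

 ```shell
 #!/usr/bin/env bash
 # Sketch: verify the pod's grace period covers preStop sleep + app drain time.
 # The 30s drain timeout used in the example call is a hypothetical value.

 # grace_period_ok <terminationGracePeriodSeconds> <preStopSleepSeconds> <appDrainSeconds>
 # Succeeds (exit 0) when the grace period is at least preStop + drain.
 grace_period_ok() {
   local grace="$1" pre_stop="$2" drain="$3"
   (( grace >= pre_stop + drain ))
 }

 if grace_period_ok 45 5 30; then
   echo "ok: 45s grace covers 5s preStop + 30s drain"
 else
   echo "too short: the pod would be SIGKILLed mid-drain" >&2
 fi
 ```

 If the check fails, either raise `terminationGracePeriodSeconds` or shorten the application's drain timeout; the preStop sleep should stay long enough for the load balancer to deregister the pod.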
 ### 13.2 Pod Security Context
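 **Always set (NetBond pattern), at the container level** (`allowPrivilegeEscalation` and `readOnlyRootFilesystem` are container-level fields and are not valid in the pod-level `securityContext`):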
 ```yaml
 securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
 ```
 If the app needs writable paths (e.g., `/tmp`, cache dirs), use `emptyDir` volumes:
 ```yaml
 # In the pod spec
 volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
 # In the container spec
 volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: cache
    mountPath: /home/node/.cache
 ```
 **ByteLyst action:** All Fastify backends are stateless — `readOnlyRootFilesystem: true` works. Next.js standalone servers may need `/tmp` writable.
 ### 13.3 Health Probes — Dedicated Endpoints
 **Do this:**
 ```yaml
 livenessProbe:
  httpGet:
    path: /health # Dedicated lightweight endpoint
    port: 4003
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5 # Fast fail — 5s max
 readinessProbe:
  httpGet:
    path: /health
    port: 4003
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
 ```
 **Don't do this (Welcome Agent anti-pattern):**
 ```yaml
 livenessProbe:
  httpGet:
    path: /openapi.json # Heavy endpoint, not a health check
  timeoutSeconds: 60 # Masks real failures for a full minute
 ```
 **ByteLyst action:** All backends already expose `GET /health` → `{ status: "ok" }`. Use it. Set timeout to 5s.
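 Before wiring the probes into a manifest, the endpoint can be smoke-tested from a shell with the same 5s budget. This is a sketch only: the function names are hypothetical, and it assumes the body is JSON containing `"status": "ok"` as described above.

 ```shell
 #!/usr/bin/env bash
 # Sketch: smoke-test a /health endpoint with the probe's 5s timeout.

 # health_body_ok <json>: succeeds when the body contains "status": "ok".
 health_body_ok() {
   printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
 }

 # check_health <url>: fetch the endpoint, failing on HTTP errors or after 5s.
 check_health() {
   local body
   body="$(curl --silent --fail --max-time 5 "$1")" || return 1
   health_body_ok "$body"
 }

 # Example (assumes a port-forward to platform-service on 4003 is running):
 #   check_health http://localhost:4003/health && echo healthy
 ```

 Keeping the body check separate from the fetch makes it easy to reuse in a Compose `healthcheck:` or a CI step.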
 ### 13.4 Ingress — WebSocket Support
 If any service uses WebSocket or SSE (FlowMonk SSE, LocalMemGPT streaming, future real-time features):
 ```yaml
 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-buffering: 'off'
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
 ```
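 ingress-nginx forwards the WebSocket `Upgrade` headers on its own; the usual silent failure is the default 60s `proxy-read-timeout` / `proxy-send-timeout`, which drops idle connections after 60s with no error. Recent ingress-nginx releases also ignore `configuration-snippet` unless the controller enables `allow-snippet-annotations`, so rely on the timeout annotations first and treat the snippet as optional.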
 ### 13.5 HPA — Use `autoscaling/v2`
 **Do this:**
 ```yaml
 apiVersion: autoscaling/v2 # Current API, supports multiple metrics
 ```
 **Don't do this:**
 ```yaml
 apiVersion: autoscaling/v1 # CPU utilization only: no memory, custom metrics, or `behavior` tuning
 ```
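 To find manifests still pinned to an older HPA API (including the removed `v2beta1` / `v2beta2` versions), a grep over the chart or manifest directory is usually enough. The sketch below uses a hypothetical function name and assumes plain or templated YAML where `apiVersion:` sits on its own line.

 ```shell
 #!/usr/bin/env bash
 # Sketch: list every autoscaling apiVersion under a directory that is not exactly v2.

 # list_non_v2_hpa <dir>: print "file:line:text" for each non-v2 autoscaling apiVersion.
 list_non_v2_hpa() {
   grep -rnE '^[[:space:]]*apiVersion:[[:space:]]*autoscaling/' "$1" \
     | grep -vE 'autoscaling/v2([[:space:]]|$)' || true
 }

 # Example: list_non_v2_hpa ./helm
 ```

 An empty result means every HPA found already uses `autoscaling/v2`.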
 ### 13.6 Dockerfile Best Practices
 | Practice            | Do                                                         | Don't                                                                              |
 | ------------------- | ---------------------------------------------------------- | ---------------------------------------------------------------------------------- |
 | **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]` (exec form)        | `ENTRYPOINT node dist/server.js` (shell form — PID 1 is `/bin/sh`, signals broken) |
 | **COPY scope**      | `COPY package.json ./` then `COPY src/ ./src/` (selective) | `COPY . .` (copies node_modules, .git, tests, everything)                          |
 | **Layer count**     | Combine related `RUN` steps                                | 3 separate `RUN pip install` / `RUN npm install` steps                             |
 | **Non-root**        | `USER node` (Node.js images have a `node` user)            | Running as root in production                                                      |
 | **Local variant**   | Provide `local.Dockerfile` without corp proxy/JFrog deps   | Single Dockerfile that only works behind corporate proxy                           |
 | **Build args**      | `ARG NODE_ENV=production` for conditional behavior         | Hardcoded env in Dockerfile                                                        |
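 The exec-form rule from the table is easy to check mechanically. The following is a rough lint sketch with a hypothetical function name: it flags `ENTRYPOINT` / `CMD` lines whose argument does not start with `[`, and it does not handle instructions continued across lines.

 ```shell
 #!/usr/bin/env bash
 # Sketch: flag shell-form ENTRYPOINT/CMD instructions in a Dockerfile.

 # shell_form_lines <Dockerfile>: print "line:text" for each ENTRYPOINT/CMD not in exec form.
 shell_form_lines() {
   grep -niE '^[[:space:]]*(ENTRYPOINT|CMD)[[:space:]]' "$1" \
     | grep -viE '^[0-9]+:[[:space:]]*(ENTRYPOINT|CMD)[[:space:]]+\[' || true
 }

 # Example: shell_form_lines ./Dockerfile
 ```

 Any line it prints would run under `/bin/sh -c`, so the Node process would not be PID 1 and would not receive SIGTERM directly.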
 ### 13.7 Helm Values Layering
 Layer a base values file with one overlay per environment (local, dev, prod):
 ```
 values.yaml          # Base defaults (image, port, probes, resources)
 ├── env/local.yaml   # Local K3s overrides (lower resources, NodePort, no TLS)
 ├── env/dev.yaml     # Dev cluster overrides (replicas, hostnames, secrets)
 └── env/prod.yaml    # Prod overrides (more replicas, real TLS, HPA limits)
 ```
 Deploy with layered `-f` flags:
 ```bash
 # Local
 helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/local.yaml
 # Dev
 helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/dev.yaml
 # Prod
 helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/prod.yaml
 ```
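 Helm gives priority to the right-most `-f` file, so the environment overlay always wins over the base. The three commands differ only in the overlay name, which a small wrapper can capture; the function names below are hypothetical, and the release name and paths are the ones used in the commands above.

 ```shell
 #!/usr/bin/env bash
 # Sketch: build the layered helm arguments for one environment.

 # helm_args <env>: print the helm arguments, one per line; rejects unknown environments.
 helm_args() {
   local env="$1"
   case "$env" in
     local|dev|prod) ;;
     *) echo "unknown environment: $env" >&2; return 1 ;;
   esac
   printf '%s\n' upgrade --install myapp ./charts \
     -f charts/values.yaml -f "charts/env/${env}.yaml"
 }

 # deploy <env>: run helm with those arguments.
 deploy() {
   local args
   args="$(helm_args "$1")" || return 1
   # Word splitting is intentional: none of the arguments contain spaces.
   # shellcheck disable=SC2086
   helm $args
 }

 # Example: deploy local
 ```

 Rejecting unknown environment names up front avoids a typo such as `deploy prd` silently installing with base values only.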
 ### 13.8 Namespace Strategy
 Resolve the namespace through a helper in `_helpers.tpl` (typically one that wraps `.Release.Namespace`) and never hardcode it:
 ```yaml
 # ✅ Standard pattern — respects --namespace flag
 {{ include "myapp.namespace" . }}
 # ❌ Anti-pattern — ignores helm --namespace, causes confusion
 {{ .Values.namespace }}
 ```
 ### 13.9 Secrets Management Progression
 | Phase                    | Strategy                                              | Complexity |
 | ------------------------ | ----------------------------------------------------- | ---------- |
 | **Phase 1** (Compose)    | `.env.ecosystem` file (gitignored)                    | Trivial    |
 | **Phase 2** (K3s)        | Native K8s `Secret` objects + `kubectl create secret` | Low        |
 | **Phase 3** (Production) | Azure Key Vault via `SecretProviderClass` CSI driver  | Medium     |
 | **Phase 4** (Enterprise) | AKV + `AzureKeyVaultSecret` CRD with auto-sync        | High       |
 ByteLyst already uses AKV in production (platform-service) — the CSI driver pattern is the natural next step.
 ### 13.10 CI/CD Best Practices (Lessons from Production Pipelines)
 | Practice               | Description                                                                                                  |
 | ---------------------- | ------------------------------------------------------------------------------------------------------------ |
 | **Semantic release**   | Auto-version from commit messages (`feat:` → minor, `fix:` → patch). ByteLyst already uses this convention.  |
 | **Image promotion**    | Build once → push to staging repo → promote to gold/prod repo (never rebuild for prod).                      |
 | **Branch pipelines**   | Different CI stages per branch: feature (lint+test), develop (build+deploy-dev), main (promote+deploy-prod). |
 | **Security gates**     | SAST + SCA scans on every build. Block merges on critical findings.                                          |
 | **Quality gates**      | Unit tests + coverage + SonarQube. Fail pipeline if coverage drops.                                          |
 | **Auto-deploy to dev** | Pipeline trigger: when build completes → auto-deploy to dev. Manual gate for prod.                           |
 | **Chart versioning**   | Publish Helm chart to OCI registry (ACR) with semantic version. Pull by version during deploy.               |
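 The semantic-release row can be illustrated with a small sketch that maps a commit subject to a version bump. The `feat:` and `fix:` rules come from the table; treating `!` or `BREAKING CHANGE` as a major bump is an assumption borrowed from the Conventional Commits convention, and both function names are hypothetical. Real pipelines would normally use a release tool rather than hand-rolled parsing.

 ```shell
 #!/usr/bin/env bash
 # Sketch: derive a semantic-version bump from a commit subject.

 # bump_type <commit subject>: print major, minor, patch, or none.
 bump_type() {
   case "$1" in
     *'BREAKING CHANGE'*) echo major ;;                         # assumption: Conventional Commits
     feat!:*|feat\(*\)!:*|fix!:*|fix\(*\)!:*) echo major ;;     # assumption: Conventional Commits
     feat:*|feat\(*\):*) echo minor ;;
     fix:*|fix\(*\):*) echo patch ;;
     *) echo none ;;
   esac
 }

 # next_version <x.y.z> <bump>: print the version after applying the bump.
 next_version() {
   local major minor patch
   IFS=. read -r major minor patch <<< "$1"
   case "$2" in
     major) echo "$((major + 1)).0.0" ;;
     minor) echo "${major}.$((minor + 1)).0" ;;
     patch) echo "${major}.${minor}.$((patch + 1))" ;;
     *)     echo "$1" ;;
   esac
 }

 # Example: next_version 1.4.2 "$(bump_type 'feat: add export')"
 ```

 The resulting version is what both the image tag and the Helm chart version would carry through promotion, so the same artifact is identifiable in every environment.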
 ### 13.11 Local K8s Development Script Template
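 A good local K8s deploy script should handle image builds, image import into K3s, namespace and secret creation, the Helm deploy, verification, and teardown: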
 ```bash
 #!/usr/bin/env bash
 # deploy-local-k8s.sh — Full local K8s deployment for ByteLyst ecosystem
 set -euo pipefail
 NAMESPACE="bytelyst"
 ACTION="${1:-deploy}"  # deploy | teardown
 case "$ACTION" in
  deploy)
    # 1. Build all Docker images
    for svc in platform-service extraction-service mcp-server; do
      docker build -t bytelyst/$svc:local ./learning_ai_common_plat/services/$svc
    done
    # 2. Load images into K3s containerd (not needed with Docker Desktop)
    if command -v k3s &>/dev/null; then
      for img in $(docker images --format '{{.Repository}}:{{.Tag}}' | grep bytelyst); do
        # Pipe via stdin: sudo closes inherited file descriptors, so <(docker save ...) is not readable
        docker save "$img" | sudo k3s ctr images import -
      done
    fi
    # 3. Create namespace + secrets
    kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
    kubectl create secret generic bytelyst-secrets \
      --from-env-file=.env.ecosystem \
      -n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
    # 4. Deploy via Helm with local overlay
    helm upgrade --install bytelyst ./helm/bytelyst-ecosystem \
      -f helm/bytelyst-ecosystem/values.yaml \
      -f helm/bytelyst-ecosystem/env/local.yaml \
      -n "$NAMESPACE"
    # 5. Wait + verify
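    # kubectl wait --all covers every Deployment; older kubectl needs a resource name for `rollout status`
    kubectl wait --for=condition=Available deployment --all -n "$NAMESPACE" --timeout=120s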
    echo "All pods:"
    kubectl get pods -n "$NAMESPACE"
    echo ""
    echo "Port-forward: kubectl port-forward svc/platform-service 4003:4003 -n $NAMESPACE"
    ;;
  teardown)
    helm uninstall bytelyst -n "$NAMESPACE" 2>/dev/null || true
    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
    echo "Teardown complete."
    ;;
  *)
    echo "Usage: $0 [deploy|teardown]" >&2
    exit 1
    ;;
 esac
 ```
 ### 13.12 Quick Reference — What to Apply at Each Phase
 | Best Practice                | Phase 1 (Compose)        | Phase 2 (K3s)      | Phase 3 (Prod K8s) |
 | ---------------------------- | ------------------------ | ------------------ | ------------------ |
 | Zero-downtime rolling update | N/A                      | ✅ Apply           | ✅ Apply           |
 | Pod security context         | N/A                      | ✅ Apply           | ✅ Apply           |
 | Health probes                | N/A (use `healthcheck:`) | ✅ Apply           | ✅ Apply           |
 | WebSocket ingress headers    | N/A                      | ✅ If using SSE/WS | ✅ Apply           |
 | HPA v2                       | N/A                      | Optional           | ✅ Apply           |
 | Exec-form ENTRYPOINT         | ✅ Apply now             | ✅                 | ✅                 |
 | Selective COPY               | ✅ Apply now             | ✅                 | ✅                 |
 | Non-root user                | ✅ Apply now             | ✅                 | ✅                 |
 | Values layering              | N/A                      | ✅ Apply           | ✅ Apply           |
 | Secrets via AKV CSI          | N/A                      | N/A                | ✅ Apply           |
 | Semantic release             | ✅ Apply now             | ✅                 | ✅                 |
 | Image promotion              | N/A                      | N/A                | ✅ Apply           |
 | Local deploy script          | N/A                      | ✅ Apply           | ✅ Adapt           |
 ---
@@ -950,7 +1225,7 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
 | Question                       | Answer                                                                                                         |
 | ------------------------------ | -------------------------------------------------------------------------------------------------------------- |
 | **Can deploy on single VM?**   | **Yes.** All ~25 services fit in 32 GB RAM.                                                                    |
-| **All Dockerized?**            | 4/10 product repos fully Dockerized. 6 need Dockerfiles (copy-paste template).                                 |
+| **All Dockerized?**            | **Yes.** All 10 product repos now have Dockerfiles + docker-prep.sh.                                           |
 | **K8s practice on single VM?** | **K3s** — certified K8s, single binary, same manifests scale to multi-node or AKS/EKS/GKE.                     |
 | **Recommended VM?**            | 8 vCPU / 32 GB (min) or 16 vCPU / 64 GB (with Ollama). Hetzner ~$45/mo for dev.                                |
 | **Time to production K8s?**    | Phase 1 (compose) → Phase 2 (K3s single) → Phase 3 (K3s multi) → Phase 4 (managed). Same manifests throughout. |