docs(devops): add K8s best practices from production comparisons, update gap table to reflect all Dockerfiles created

This commit is contained in:
saravanakumardb1 2026-03-22 00:36:59 -07:00
parent 09525f671f
commit 5646cefcbd

View File

@ -729,22 +729,22 @@ kubectl port-forward svc/platform-service 4003:4003 -n bytelyst-platform
## 10. What's NOT Dockerized Yet (gaps)
| Repo | Backend Dockerfile | Web Dockerfile | `docker-prep.sh` | `output:'standalone'` | Status |
| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | -------------------------------------------------------------- |
| **LysnrAI** | ❌ | ✅ user-dashboard | ❌ | ✅ (conditional) | Need backend Dockerfile + docker-prep.sh |
| **MindLyst** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
| **ChronoMind** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
| **JarvisJr** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
| **PeakPulse** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
| **FlowMonk** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
| **NomGap** | ✅ ⚠️ | ✅ | ✅ | ✅ | Backend Dockerfile ignores `file:` deps — see §12.F3 |
| **NoteLett** | ✅ ⚠️ | ✅ | ✅ | ✅ | Backend Dockerfile `COPY .` pulls broken symlinks — see §12.F4 |
| **ActionTrail** | ✅ | ✅ | ✅ | ✅ | Ready (uses `.tarballs/` pattern) |
| **LocalMemGPT** | ✅ | ✅ | ✅ | ✅ | Ready (repo-root build context) |
| **admin-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
| **tracker-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
| Repo | Backend Dockerfile | Web Dockerfile | `docker-prep.sh` | `output:'standalone'` | Status |
| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | ------------------------------------ |
| **LysnrAI** | ✅ | ✅ user-dashboard | ✅ | ✅ (conditional) | ✅ Ready |
| **MindLyst** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
| **ChronoMind** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
| **JarvisJr** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
| **PeakPulse** | ✅ | — (no web) | ✅ | — | ✅ Ready |
| **FlowMonk** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
| **NomGap** | ✅ | ✅ | ✅ | ✅ | ✅ Fixed (added `.tarballs/` COPY) |
| **NoteLett** | ✅ | ✅ | ✅ | ✅ | ✅ Fixed (explicit COPY, not `.`) |
| **ActionTrail** | ✅ | ✅ | ✅ | ✅ | Ready (uses `.tarballs/` pattern) |
| **LocalMemGPT** | ✅ | ✅ | ✅ | ✅ | Ready (repo-root build context) |
| **admin-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
| **tracker-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
**6 repos need Dockerfiles** + `docker-prep.sh` + `output:'standalone'`. 2 existing Dockerfiles have issues.
**All 10 product repos now have Dockerfiles, `docker-prep.sh`, and `output:'standalone'`.** Created 2026-03-22.
---
@ -931,17 +931,292 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
### Summary of Required Work Before Compose Works
| Priority | Item | Count |
| -------- | -------------------------------------------------------- | ------------- |
| **P0** | Create missing `docker-prep.sh` | 6 repos |
| **P0** | Create missing backend Dockerfiles | 6 repos |
| **P0** | Create missing web Dockerfiles | 5 repos |
| **P0** | Add `output: 'standalone'` to next.config.ts | 3 webs |
| **P1** | Fix NomGap backend Dockerfile (add `.tarballs/` COPY) | 1 file |
| **P1** | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file |
| **P1** | Create `.env.ecosystem` template | 1 file |
| **P2** | Standardize Node.js version to 22-alpine | 4 Dockerfiles |
| **P2** | Add `extra_hosts` for Linux VM Ollama access | 1 service |
| Priority | Item | Count | Status |
| -------- | -------------------------------------------------------- | ------------- | ---------------------------------------------------------- |
| **P0** | Create missing `docker-prep.sh` | 6 repos | ✅ Done (3 created, 3 already existed) |
| **P0** | Create missing backend Dockerfiles | 6 repos | ✅ Done |
| **P0** | Create missing web Dockerfiles | 5 repos | ✅ Done (4 created, PeakPulse has no web) |
| **P0** | Add `output: 'standalone'` to next.config.ts | 3 webs | ✅ Done (4 webs: ChronoMind, JarvisJr, FlowMonk, MindLyst) |
| **P1** | Fix NomGap backend Dockerfile (add `.tarballs/` COPY) | 1 file | ✅ Done |
| **P1** | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file | ✅ Done |
| **P1** | Create `.env.ecosystem` template | 1 file | Pending |
| **P2** | Standardize Node.js version to 22-alpine | 4 Dockerfiles | ✅ Done (all new Dockerfiles use 22-alpine) |
| **P2** | Add `extra_hosts` for Linux VM Ollama access | 1 service | Pending |
---
## 13. K8s & Docker Best Practices (from Production Comparisons)
> Derived from comparing three production K8s deployments: a Go-based Call Controller (Paladin), a Python/FastAPI streaming agent platform (NetBond), and a Python/FastAPI voice agent (Welcome Agent). These patterns should be adopted when ByteLyst moves from Docker Compose → K3s → managed K8s.
### 13.1 Deployment — Zero-Downtime Rolling Updates
**Do this (NetBond pattern):**
```yaml
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0 # Never kill a pod before its replacement is ready
maxSurge: 1 # Only 1 extra pod during rollout
template:
spec:
terminationGracePeriodSeconds: 45 # Match your app's drain timeout
containers:
- lifecycle:
preStop:
exec:
command: ['sleep', '5'] # Let load balancer deregister before SIGTERM
```
**Don't do this (Paladin anti-pattern):**
```yaml
maxUnavailable: 50% # Half your pods die instantly — users get errors
maxSurge: 50% # Wastes resources by doubling pod count
```
**ByteLyst action:** Every deployment template should use `maxUnavailable: 0` + preStop sleep + explicit `terminationGracePeriodSeconds` matching the Fastify graceful shutdown timeout.
### 13.2 Pod Security Context
**Always set (NetBond pattern):**
```yaml
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
```
If the app needs writable paths (e.g., `/tmp`, cache dirs), use `emptyDir` volumes:
```yaml
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /home/node/.cache
```
**ByteLyst action:** All Fastify backends are stateless — `readOnlyRootFilesystem: true` works. Next.js standalone servers may need `/tmp` writable.
### 13.3 Health Probes — Dedicated Endpoints
**Do this:**
```yaml
livenessProbe:
httpGet:
path: /health # Dedicated lightweight endpoint
port: 4003
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5 # Fast fail — 5s max
readinessProbe:
httpGet:
path: /health
port: 4003
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 5
```
**Don't do this (Welcome Agent anti-pattern):**
```yaml
livenessProbe:
httpGet:
path: /openapi.json # Heavy endpoint, not a health check
timeoutSeconds: 60 # Masks real failures for a full minute
```
**ByteLyst action:** All backends already expose `GET /health``{ status: "ok" }`. Use it. Set timeout to 5s.
### 13.4 Ingress — WebSocket Support
If any service uses WebSocket or SSE (FlowMonk SSE, LocalMemGPT streaming, future real-time features):
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
nginx.ingress.kubernetes.io/proxy-buffering: 'off'
nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
nginx.ingress.kubernetes.io/configuration-snippet: |
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
```
Missing WebSocket headers is a silent failure — connections drop after 60s with no error.
### 13.5 HPA — Use `autoscaling/v2`
**Do this:**
```yaml
apiVersion: autoscaling/v2 # Current API, supports multiple metrics
```
**Don't do this:**
```yaml
apiVersion: autoscaling/v1 # Deprecated, CPU-only, will be removed
```
### 13.6 Dockerfile Best Practices
| Practice | Do | Don't |
| ------------------- | ---------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]` (exec form) | `ENTRYPOINT node dist/server.js` (shell form — PID 1 is `/bin/sh`, signals broken) |
| **COPY scope** | `COPY package.json ./` then `COPY src/ ./src/` (selective) | `COPY . .` (copies node_modules, .git, tests, everything) |
| **Layer count** | Combine related `RUN` steps | 3 separate `RUN pip install` / `RUN npm install` steps |
| **Non-root** | `USER node` (Node.js images have a `node` user) | Running as root in production |
| **Local variant** | Provide `local.Dockerfile` without corp proxy/JFrog deps | Single Dockerfile that only works behind corporate proxy |
| **Build args** | `ARG NODE_ENV=production` for conditional behavior | Hardcoded env in Dockerfile |
### 13.7 Helm Values Layering
Use 3 layers for environment management:
```
values.yaml # Base defaults (image, port, probes, resources)
├── env/local.yaml # Local K3s overrides (lower resources, NodePort, no TLS)
├── env/dev.yaml # Dev cluster overrides (replicas, hostnames, secrets)
└── env/prod.yaml # Prod overrides (more replicas, real TLS, HPA limits)
```
Deploy with layered `-f` flags:
```bash
# Local
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/local.yaml
# Dev
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/dev.yaml
# Prod
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/prod.yaml
```
### 13.8 Namespace Strategy
Use Helm `_helpers.tpl` for namespace — never hardcode:
```yaml
# ✅ Standard pattern — respects --namespace flag
{{ include "myapp.namespace" . }}
# ❌ Anti-pattern — ignores helm --namespace, causes confusion
{{ .Values.namespace }}
```
### 13.9 Secrets Management Progression
| Phase | Strategy | Complexity |
| ------------------------ | ----------------------------------------------------- | ---------- |
| **Phase 1** (Compose) | `.env.ecosystem` file (gitignored) | Trivial |
| **Phase 2** (K3s) | Native K8s `Secret` objects + `kubectl create secret` | Low |
| **Phase 3** (Production) | Azure Key Vault via `SecretProviderClass` CSI driver | Medium |
| **Phase 4** (Enterprise) | AKV + `AzureKeyVaultSecret` CRD with auto-sync | High |
ByteLyst already uses AKV in production (platform-service) — the CSI driver pattern is the natural next step.
### 13.10 CI/CD Best Practices (Lessons from Production Pipelines)
| Practice | Description |
| ---------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Semantic release** | Auto-version from commit messages (`feat:` → minor, `fix:` → patch). ByteLyst already uses this convention. |
| **Image promotion** | Build once → push to staging repo → promote to gold/prod repo (never rebuild for prod). |
| **Branch pipelines** | Different CI stages per branch: feature (lint+test), develop (build+deploy-dev), main (promote+deploy-prod). |
| **Security gates** | SAST + SCA scans on every build. Block merges on critical findings. |
| **Quality gates** | Unit tests + coverage + SonarQube. Fail pipeline if coverage drops. |
| **Auto-deploy to dev** | Pipeline trigger: when build completes → auto-deploy to dev. Manual gate for prod. |
| **Chart versioning** | Publish Helm chart to OCI registry (ACR) with semantic version. Pull by version during deploy. |
### 13.11 Local K8s Development Script Template
A good local K8s deploy script should handle:
```bash
#!/usr/bin/env bash
# deploy-local-k8s.sh — Full local K8s deployment for ByteLyst ecosystem
set -euo pipefail
NAMESPACE="bytelyst"
ACTION="${1:-deploy}" # deploy | teardown
case "$ACTION" in
deploy)
# 1. Build all Docker images
for svc in platform-service extraction-service mcp-server; do
docker build -t bytelyst/$svc:local ./learning_ai_common_plat/services/$svc
done
# 2. Load images into K3s containerd (not needed with Docker Desktop)
if command -v k3s &>/dev/null; then
for img in $(docker images --format '{{.Repository}}:{{.Tag}}' | grep bytelyst); do
sudo k3s ctr images import <(docker save "$img")
done
fi
# 3. Create namespace + secrets
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic bytelyst-secrets \
--from-env-file=.env.ecosystem \
-n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
# 4. Deploy via Helm with local overlay
helm upgrade --install bytelyst ./helm/bytelyst-ecosystem \
-f helm/bytelyst-ecosystem/values.yaml \
-f helm/bytelyst-ecosystem/env/local.yaml \
-n "$NAMESPACE"
# 5. Wait + verify
kubectl rollout status deploy -n "$NAMESPACE" --timeout=120s
echo "All pods:"
kubectl get pods -n "$NAMESPACE"
echo ""
echo "Port-forward: kubectl port-forward svc/platform-service 4003:4003 -n $NAMESPACE"
;;
teardown)
helm uninstall bytelyst -n "$NAMESPACE" 2>/dev/null || true
kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
echo "Teardown complete."
;;
esac
```
### 13.12 Quick Reference — What to Apply at Each Phase
| Best Practice | Phase 1 (Compose) | Phase 2 (K3s) | Phase 3 (Prod K8s) |
| ---------------------------- | ------------------------ | ------------------ | ------------------ |
| Zero-downtime rolling update | N/A | ✅ Apply | ✅ Apply |
| Pod security context | N/A | ✅ Apply | ✅ Apply |
| Health probes | N/A (use `healthcheck:`) | ✅ Apply | ✅ Apply |
| WebSocket ingress headers | N/A | ✅ If using SSE/WS | ✅ Apply |
| HPA v2 | N/A | Optional | ✅ Apply |
| Exec-form ENTRYPOINT | ✅ Apply now | ✅ | ✅ |
| Selective COPY | ✅ Apply now | ✅ | ✅ |
| Non-root user | ✅ Apply now | ✅ | ✅ |
| Values layering | N/A | ✅ Apply | ✅ Apply |
| Secrets via AKV CSI | N/A | N/A | ✅ Apply |
| Semantic release | ✅ Apply now | ✅ | ✅ |
| Image promotion | N/A | N/A | ✅ Apply |
| Local deploy script | N/A | ✅ Apply | ✅ Adapt |
---
@ -950,7 +1225,7 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
| Question | Answer |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------- |
| **Can deploy on single VM?** | **Yes.** All ~25 services fit in 32 GB RAM. |
| **All Dockerized?** | 4/10 product repos fully Dockerized. 6 need Dockerfiles (copy-paste template). |
| **All Dockerized?** | **Yes.** All 10 product repos now have Dockerfiles + docker-prep.sh. |
| **K8s practice on single VM?** | **K3s** — certified K8s, single binary, same manifests scale to multi-node or AKS/EKS/GKE. |
| **Recommended VM?** | 8 vCPU / 32 GB (min) or 16 vCPU / 64 GB (with Ollama). Hetzner ~$45/mo for dev. |
| **Time to production K8s?** | Phase 1 (compose) → Phase 2 (K3s single) → Phase 3 (K3s multi) → Phase 4 (managed). Same manifests throughout. |