docs(devops): add K8s best practices from production comparisons, update gap table to reflect all Dockerfiles created
This commit is contained in:
parent
09525f671f
commit
5646cefcbd
@ -729,22 +729,22 @@ kubectl port-forward svc/platform-service 4003:4003 -n bytelyst-platform
|
|||||||
|
|
||||||
## 10. What's NOT Dockerized Yet (gaps)
|
## 10. What's NOT Dockerized Yet (gaps)
|
||||||
|
|
||||||
| Repo | Backend Dockerfile | Web Dockerfile | `docker-prep.sh` | `output:'standalone'` | Status |
|
| Repo | Backend Dockerfile | Web Dockerfile | `docker-prep.sh` | `output:'standalone'` | Status |
|
||||||
| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | -------------------------------------------------------------- |
|
| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | ------------------------------------ |
|
||||||
| **LysnrAI** | ❌ | ✅ user-dashboard | ❌ | ✅ (conditional) | Need backend Dockerfile + docker-prep.sh |
|
| **LysnrAI** | ✅ | ✅ user-dashboard | ✅ | ✅ (conditional) | ✅ Ready |
|
||||||
| **MindLyst** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
|
| **MindLyst** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
|
||||||
| **ChronoMind** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
|
| **ChronoMind** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
|
||||||
| **JarvisJr** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
|
| **JarvisJr** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
|
||||||
| **PeakPulse** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
|
| **PeakPulse** | ✅ | — (no web) | ✅ | — | ✅ Ready |
|
||||||
| **FlowMonk** | ❌ | ❌ | ❌ | ❌ | Need all 4 |
|
| **FlowMonk** | ✅ | ✅ | ✅ | ✅ (conditional) | ✅ Ready |
|
||||||
| **NomGap** | ✅ ⚠️ | ✅ | ✅ | ✅ | Backend Dockerfile ignores `file:` deps — see §12.F3 |
|
| **NomGap** | ✅ | ✅ | ✅ | ✅ | ✅ Fixed (added `.tarballs/` COPY) |
|
||||||
| **NoteLett** | ✅ ⚠️ | ✅ | ✅ | ✅ | Backend Dockerfile `COPY .` pulls broken symlinks — see §12.F4 |
|
| **NoteLett** | ✅ | ✅ | ✅ | ✅ | ✅ Fixed (explicit COPY, not `.`) |
|
||||||
| **ActionTrail** | ✅ | ✅ | ✅ | ✅ | Ready (uses `.tarballs/` pattern) |
|
| **ActionTrail** | ✅ | ✅ | ✅ | ✅ | ✅ Ready (uses `.tarballs/` pattern) |
|
||||||
| **LocalMemGPT** | ✅ | ✅ | ✅ | ✅ | Ready (repo-root build context) |
|
| **LocalMemGPT** | ✅ | ✅ | ✅ | ✅ | ✅ Ready (repo-root build context) |
|
||||||
| **admin-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
|
| **admin-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | ✅ Ready |
|
||||||
| **tracker-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | Ready |
|
| **tracker-web** | — | ✅ (in common-plat) | N/A (pnpm) | ✅ (conditional) | ✅ Ready |
|
||||||
|
|
||||||
**6 repos need Dockerfiles** + `docker-prep.sh` + `output:'standalone'`. 2 existing Dockerfiles have issues.
|
**All 10 product repos now have Dockerfiles, `docker-prep.sh`, and `output:'standalone'`.** Created 2026-03-22.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@ -931,17 +931,292 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
|
|||||||
|
|
||||||
### Summary of Required Work Before Compose Works
|
### Summary of Required Work Before Compose Works
|
||||||
|
|
||||||
| Priority | Item | Count |
|
| Priority | Item | Count | Status |
|
||||||
| -------- | -------------------------------------------------------- | ------------- |
|
| -------- | -------------------------------------------------------- | ------------- | ---------------------------------------------------------- |
|
||||||
| **P0** | Create missing `docker-prep.sh` | 6 repos |
|
| **P0** | Create missing `docker-prep.sh` | 6 repos | ✅ Done (3 created, 3 already existed) |
|
||||||
| **P0** | Create missing backend Dockerfiles | 6 repos |
|
| **P0** | Create missing backend Dockerfiles | 6 repos | ✅ Done |
|
||||||
| **P0** | Create missing web Dockerfiles | 5 repos |
|
| **P0** | Create missing web Dockerfiles | 5 repos | ✅ Done (4 created, PeakPulse has no web) |
|
||||||
| **P0** | Add `output: 'standalone'` to next.config.ts | 3 webs |
|
| **P0** | Add `output: 'standalone'` to next.config.ts | 3 webs | ✅ Done (4 webs: ChronoMind, JarvisJr, FlowMonk, MindLyst) |
|
||||||
| **P1** | Fix NomGap backend Dockerfile (add `.tarballs/` COPY) | 1 file |
|
| **P1** | Fix NomGap backend Dockerfile (add `.tarballs/` COPY) | 1 file | ✅ Done |
|
||||||
| **P1** | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file |
|
| **P1** | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file | ✅ Done |
|
||||||
| **P1** | Create `.env.ecosystem` template | 1 file |
|
| **P1** | Create `.env.ecosystem` template | 1 file | Pending |
|
||||||
| **P2** | Standardize Node.js version to 22-alpine | 4 Dockerfiles |
|
| **P2** | Standardize Node.js version to 22-alpine | 4 Dockerfiles | ✅ Done (all new Dockerfiles use 22-alpine) |
|
||||||
| **P2** | Add `extra_hosts` for Linux VM Ollama access | 1 service |
|
| **P2** | Add `extra_hosts` for Linux VM Ollama access | 1 service | Pending |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 13. K8s & Docker Best Practices (from Production Comparisons)
|
||||||
|
|
||||||
|
> Derived from comparing three production K8s deployments: a Go-based Call Controller (Paladin), a Python/FastAPI streaming agent platform (NetBond), and a Python/FastAPI voice agent (Welcome Agent). These patterns should be adopted when ByteLyst moves from Docker Compose → K3s → managed K8s.
|
||||||
|
|
||||||
|
### 13.1 Deployment — Zero-Downtime Rolling Updates
|
||||||
|
|
||||||
|
**Do this (NetBond pattern):**
|
||||||
|
|
||||||
|
```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # Never kill a pod before its replacement is ready
      maxSurge: 1 # Only 1 extra pod during rollout
  template:
    spec:
      terminationGracePeriodSeconds: 45 # Match your app's drain timeout
      containers:
        - lifecycle:
            preStop:
              exec:
                command: ['sleep', '5'] # Let load balancer deregister before SIGTERM
```
|
||||||
|
|
||||||
|
**Don't do this (Paladin anti-pattern):**
|
||||||
|
|
||||||
|
```yaml
maxUnavailable: 50% # Up to half your pods can be down at once during a rollout, so users may get errors
maxSurge: 50% # Up to 1.5x the desired pod count during a rollout
```
|
||||||
|
|
||||||
|
**ByteLyst action:** Every deployment template should use `maxUnavailable: 0` + preStop sleep + explicit `terminationGracePeriodSeconds` matching the Fastify graceful shutdown timeout.
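
The grace period has to cover both the preStop sleep and the application's own drain time, because the preStop hook runs inside the `terminationGracePeriodSeconds` window. A minimal sketch of that budget check follows; `check_grace_budget` is a hypothetical helper, and the 30-second drain timeout is an assumed value, not a figure taken from the ByteLyst services.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the grace period covers
# the preStop sleep plus the application's drain timeout.
check_grace_budget() {
  local grace="$1" prestop="$2" drain="$3"
  local needed
  needed=$(( prestop + drain ))
  if [ "$grace" -ge "$needed" ]; then
    echo "ok: ${grace}s grace covers ${needed}s (preStop ${prestop}s + drain ${drain}s)"
  else
    echo "too short: ${grace}s grace < ${needed}s (preStop ${prestop}s + drain ${drain}s)"
    return 1
  fi
}

# Values from the manifest above, plus an assumed 30s drain timeout.
check_grace_budget 45 5 30
```

If the check fails, either raise `terminationGracePeriodSeconds` or shorten the drain timeout; otherwise the kubelet sends SIGKILL before in-flight requests finish.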
|
||||||
|
|
||||||
|
### 13.2 Pod Security Context
|
||||||
|
|
||||||
|
**Always set (NetBond pattern):**
|
||||||
|
|
||||||
|
```yaml
# Container-level securityContext: allowPrivilegeEscalation and
# readOnlyRootFilesystem are not valid at pod level
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
```
|
||||||
|
|
||||||
|
If the app needs writable paths (e.g., `/tmp`, cache dirs), use `emptyDir` volumes:
|
||||||
|
|
||||||
|
```yaml
# Pod spec
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
# Container spec
volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: cache
    mountPath: /home/node/.cache
```
|
||||||
|
|
||||||
|
**ByteLyst action:** All Fastify backends are stateless — `readOnlyRootFilesystem: true` works. Next.js standalone servers may need `/tmp` writable.
|
||||||
|
|
||||||
|
### 13.3 Health Probes — Dedicated Endpoints
|
||||||
|
|
||||||
|
**Do this:**
|
||||||
|
|
||||||
|
```yaml
livenessProbe:
  httpGet:
    path: /health # Dedicated lightweight endpoint
    port: 4003
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5 # Fast fail: 5s max
readinessProbe:
  httpGet:
    path: /health
    port: 4003
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
```
|
||||||
|
|
||||||
|
**Don't do this (Welcome Agent anti-pattern):**
|
||||||
|
|
||||||
|
```yaml
livenessProbe:
  httpGet:
    path: /openapi.json # Heavy endpoint, not a health check
  timeoutSeconds: 60 # Masks real failures for a full minute
```
|
||||||
|
|
||||||
|
**ByteLyst action:** All backends already expose `GET /health` → `{ status: "ok" }`. Use it. Set timeout to 5s.
|
||||||
|
|
||||||
|
### 13.4 Ingress — WebSocket Support
|
||||||
|
|
||||||
|
If any service uses WebSocket or SSE (FlowMonk SSE, LocalMemGPT streaming, future real-time features):
|
||||||
|
|
||||||
|
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-buffering: 'off'
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
```
|
||||||
|
|
||||||
|
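Getting this wrong fails quietly. Without the `Upgrade`/`Connection` headers the WebSocket handshake never completes, and without the longer timeouts an idle stream is closed once the proxy read timeout expires (60s by default in ingress-nginx), with no application-level error.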
|
||||||
|
|
||||||
|
### 13.5 HPA — Use `autoscaling/v2`
|
||||||
|
|
||||||
|
**Do this:**
|
||||||
|
|
||||||
|
```yaml
apiVersion: autoscaling/v2 # Current API, supports multiple metrics
```
|
||||||
|
|
||||||
|
**Don't do this:**
|
||||||
|
|
||||||
|
```yaml
apiVersion: autoscaling/v1 # CPU utilization only: no multiple metrics, no scaling behavior (the removed versions are v2beta1/v2beta2, not v1)
```
|
||||||
|
|
||||||
|
### 13.6 Dockerfile Best Practices
|
||||||
|
|
||||||
|
| Practice            | Do                                                         | Don't                                                                                        |
| ------------------- | ---------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]` (exec form)        | `ENTRYPOINT node dist/server.js` (shell form: PID 1 is `/bin/sh`, so signals never reach Node) |
| **COPY scope**      | `COPY package.json ./` then `COPY src/ ./src/` (selective) | `COPY . .` (copies node_modules, .git, tests, everything)                                    |
| **Layer count**     | Combine related `RUN` steps                                | 3 separate `RUN pip install` / `RUN npm install` steps                                       |
| **Non-root**        | `USER node` (Node.js images have a `node` user)            | Running as root in production                                                                |
| **Local variant**   | Provide `local.Dockerfile` without corp proxy/JFrog deps   | Single Dockerfile that only works behind corporate proxy                                     |
| **Build args**      | `ARG NODE_ENV=production` for conditional behavior         | Hardcoded env in Dockerfile                                                                  |
|
||||||
|
|
||||||
|
### 13.7 Helm Values Layering
|
||||||
|
|
||||||
|
Use 3 layers for environment management:
|
||||||
|
|
||||||
|
```
values.yaml              # Base defaults (image, port, probes, resources)
├── env/local.yaml       # Local K3s overrides (lower resources, NodePort, no TLS)
├── env/dev.yaml         # Dev cluster overrides (replicas, hostnames, secrets)
└── env/prod.yaml        # Prod overrides (more replicas, real TLS, HPA limits)
```
|
||||||
|
|
||||||
|
Deploy with layered `-f` flags:
|
||||||
|
|
||||||
|
```bash
# Local
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/local.yaml

# Dev
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/dev.yaml

# Prod
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/prod.yaml
```
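
The three commands differ only in the overlay file, so they can be generated from one function. The sketch below prints the command rather than running it, which makes it safe to try without a cluster; `helm_deploy_cmd` and its argument order are hypothetical, and the `local`/`dev`/`prod` names simply mirror the layout above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the layered helm command for one environment
# instead of executing it.
helm_deploy_cmd() {
  local release="$1" chart_dir="$2" env="$3"
  case "$env" in
    local|dev|prod) ;;
    *) echo "unknown environment: $env" >&2; return 1 ;;
  esac
  # Helm gives precedence to the last -f file, so the env overlay goes after the base values.
  printf 'helm upgrade --install %s %s -f %s/values.yaml -f %s/env/%s.yaml\n' \
    "$release" "$chart_dir" "$chart_dir" "$chart_dir" "$env"
}

helm_deploy_cmd myapp ./charts dev
```

Rejecting unknown environment names up front avoids a typo silently deploying with base defaults only.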
|
||||||
|
|
||||||
|
### 13.8 Namespace Strategy
|
||||||
|
|
||||||
|
Use Helm `_helpers.tpl` for namespace — never hardcode:
|
||||||
|
|
||||||
|
```yaml
# ✅ Standard pattern: respects the --namespace flag
{{ include "myapp.namespace" . }}

# ❌ Anti-pattern: ignores helm --namespace, causes confusion
{{ .Values.namespace }}
```
|
||||||
|
|
||||||
|
### 13.9 Secrets Management Progression
|
||||||
|
|
||||||
|
| Phase                    | Strategy                                              | Complexity |
| ------------------------ | ----------------------------------------------------- | ---------- |
| **Phase 1** (Compose)    | `.env.ecosystem` file (gitignored)                    | Trivial    |
| **Phase 2** (K3s)        | Native K8s `Secret` objects + `kubectl create secret` | Low        |
| **Phase 3** (Production) | Azure Key Vault via `SecretProviderClass` CSI driver  | Medium     |
| **Phase 4** (Enterprise) | AKV + `AzureKeyVaultSecret` CRD with auto-sync        | High       |
|
||||||
|
|
||||||
|
ByteLyst already uses AKV in production (platform-service) — the CSI driver pattern is the natural next step.
|
||||||
|
|
||||||
|
### 13.10 CI/CD Best Practices (Lessons from Production Pipelines)
|
||||||
|
|
||||||
|
| Practice               | Description                                                                                                  |
| ---------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Semantic release**   | Auto-version from commit messages (`feat:` → minor, `fix:` → patch). ByteLyst already uses this convention.  |
| **Image promotion**    | Build once → push to staging repo → promote to gold/prod repo (never rebuild for prod).                      |
| **Branch pipelines**   | Different CI stages per branch: feature (lint+test), develop (build+deploy-dev), main (promote+deploy-prod). |
| **Security gates**     | SAST + SCA scans on every build. Block merges on critical findings.                                          |
| **Quality gates**      | Unit tests + coverage + SonarQube. Fail pipeline if coverage drops.                                          |
| **Auto-deploy to dev** | Pipeline trigger: when build completes → auto-deploy to dev. Manual gate for prod.                           |
| **Chart versioning**   | Publish Helm chart to OCI registry (ACR) with semantic version. Pull by version during deploy.               |
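
To make the semantic-release row concrete, here is a minimal sketch of the bump rule it describes. This is not the semantic-release tool itself: `next_version` is a hypothetical function that applies only the two rules named in the table (`feat` bumps minor, `fix` bumps patch), leaves every other commit type unchanged, and does not handle breaking changes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: computes the next MAJOR.MINOR.PATCH version
# from the current version and a commit subject line.
next_version() {
  local current="$1" subject="$2"
  local major minor patch rest
  major="${current%%.*}"
  rest="${current#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"
  case "$subject" in
    # feat: or feat(scope): bumps minor and resets patch
    feat:*|"feat("*"):"*) minor=$((minor + 1)); patch=0 ;;
    # fix: or fix(scope): bumps patch
    fix:*|"fix("*"):"*) patch=$((patch + 1)) ;;
  esac
  echo "${major}.${minor}.${patch}"
}

next_version 1.4.2 "feat(web): add dashboard"   # prints 1.5.0
next_version 1.4.2 "fix: handle empty payload"  # prints 1.4.3
next_version 1.4.2 "docs(devops): update table" # prints 1.4.2
```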
|
||||||
|
| **Chart versioning** | Publish Helm chart to OCI registry (ACR) with semantic version. Pull by version during deploy. |
|
||||||
|
|
||||||
|
### 13.11 Local K8s Development Script Template
|
||||||
|
|
||||||
|
A good local K8s deploy script should handle:
|
||||||
|
|
||||||
|
```bash
#!/usr/bin/env bash
# deploy-local-k8s.sh: full local K8s deployment for the ByteLyst ecosystem

set -euo pipefail

NAMESPACE="bytelyst"
ACTION="${1:-deploy}" # deploy | teardown

case "$ACTION" in
  deploy)
    # 1. Build all Docker images
    for svc in platform-service extraction-service mcp-server; do
      docker build -t "bytelyst/$svc:local" "./learning_ai_common_plat/services/$svc"
    done

    # 2. Load images into K3s containerd (not needed with Docker Desktop)
    if command -v k3s &>/dev/null; then
      for img in $(docker images --format '{{.Repository}}:{{.Tag}}' | grep bytelyst); do
        # Pipe through stdin: sudo typically closes extra file descriptors,
        # so the /dev/fd path from <(docker save ...) would be unreadable
        docker save "$img" | sudo k3s ctr images import -
      done
    fi

    # 3. Create namespace + secrets
    kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
    kubectl create secret generic bytelyst-secrets \
      --from-env-file=.env.ecosystem \
      -n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

    # 4. Deploy via Helm with local overlay
    helm upgrade --install bytelyst ./helm/bytelyst-ecosystem \
      -f helm/bytelyst-ecosystem/values.yaml \
      -f helm/bytelyst-ecosystem/env/local.yaml \
      -n "$NAMESPACE"

    # 5. Wait + verify
    kubectl rollout status deploy -n "$NAMESPACE" --timeout=120s
    echo "All pods:"
    kubectl get pods -n "$NAMESPACE"
    echo ""
    echo "Port-forward: kubectl port-forward svc/platform-service 4003:4003 -n $NAMESPACE"
    ;;

  teardown)
    helm uninstall bytelyst -n "$NAMESPACE" 2>/dev/null || true
    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
    echo "Teardown complete."
    ;;

  *)
    echo "Usage: $0 [deploy|teardown]" >&2
    exit 1
    ;;
esac
```
|
||||||
|
|
||||||
|
### 13.12 Quick Reference — What to Apply at Each Phase
|
||||||
|
|
||||||
|
| Best Practice                | Phase 1 (Compose)        | Phase 2 (K3s)      | Phase 3 (Prod K8s) |
| ---------------------------- | ------------------------ | ------------------ | ------------------ |
| Zero-downtime rolling update | N/A                      | ✅ Apply           | ✅ Apply           |
| Pod security context         | N/A                      | ✅ Apply           | ✅ Apply           |
| Health probes                | N/A (use `healthcheck:`) | ✅ Apply           | ✅ Apply           |
| WebSocket ingress headers    | N/A                      | ✅ If using SSE/WS | ✅ Apply           |
| HPA v2                       | N/A                      | Optional           | ✅ Apply           |
| Exec-form ENTRYPOINT         | ✅ Apply now             | ✅                 | ✅                 |
| Selective COPY               | ✅ Apply now             | ✅                 | ✅                 |
| Non-root user                | ✅ Apply now             | ✅                 | ✅                 |
| Values layering              | N/A                      | ✅ Apply           | ✅ Apply           |
| Secrets via AKV CSI          | N/A                      | N/A                | ✅ Apply           |
| Semantic release             | ✅ Apply now             | ✅                 | ✅                 |
| Image promotion              | N/A                      | N/A                | ✅ Apply           |
| Local deploy script          | N/A                      | ✅ Apply           | ✅ Adapt           |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@ -950,7 +1225,7 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
|
|||||||
| Question | Answer |
|
| Question | Answer |
|
||||||
| ------------------------------ | -------------------------------------------------------------------------------------------------------------- |
|
| ------------------------------ | -------------------------------------------------------------------------------------------------------------- |
|
||||||
| **Can deploy on single VM?** | **Yes.** All ~25 services fit in 32 GB RAM. |
|
| **Can deploy on single VM?** | **Yes.** All ~25 services fit in 32 GB RAM. |
|
||||||
| **All Dockerized?** | 4/10 product repos fully Dockerized. 6 need Dockerfiles (copy-paste template). |
|
| **All Dockerized?** | **Yes.** All 10 product repos now have Dockerfiles + docker-prep.sh. |
|
||||||
| **K8s practice on single VM?** | **K3s** — certified K8s, single binary, same manifests scale to multi-node or AKS/EKS/GKE. |
|
| **K8s practice on single VM?** | **K3s** — certified K8s, single binary, same manifests scale to multi-node or AKS/EKS/GKE. |
|
||||||
| **Recommended VM?** | 8 vCPU / 32 GB (min) or 16 vCPU / 64 GB (with Ollama). Hetzner ~$45/mo for dev. |
|
| **Recommended VM?** | 8 vCPU / 32 GB (min) or 16 vCPU / 64 GB (with Ollama). Hetzner ~$45/mo for dev. |
|
||||||
| **Time to production K8s?** | Phase 1 (compose) → Phase 2 (K3s single) → Phase 3 (K3s multi) → Phase 4 (managed). Same manifests throughout. |
|
| **Time to production K8s?** | Phase 1 (compose) → Phase 2 (K3s single) → Phase 3 (K3s multi) → Phase 4 (managed). Same manifests throughout. |
|
||||||
|
|||||||