docs(devops): add K8s best practices from production comparisons, update gap table to reflect all Dockerfiles created
parent 09525f671f
commit 5646cefcbd
@@ -729,22 +729,22 @@ kubectl port-forward svc/platform-service 4003:4003 -n bytelyst-platform

## 10. What's NOT Dockerized Yet (gaps)

| Repo            | Backend Dockerfile | Web Dockerfile      | `docker-prep.sh` | `output:'standalone'` | Status                                                         |
| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | -------------------------------------------------------------- |
| **LysnrAI**     | ❌                 | ✅ user-dashboard   | ❌               | ✅ (conditional)      | Need backend Dockerfile + docker-prep.sh                       |
| **MindLyst**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
| **ChronoMind**  | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
| **JarvisJr**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
| **PeakPulse**   | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
| **FlowMonk**    | ❌                 | ❌                  | ❌               | ❌                    | Need all 4                                                     |
| **NomGap**      | ✅ ⚠️              | ✅                  | ✅               | ✅                    | Backend Dockerfile ignores `file:` deps; see §12.F3            |
| **NoteLett**    | ✅ ⚠️              | ✅                  | ✅               | ✅                    | Backend Dockerfile `COPY .` pulls broken symlinks; see §12.F4  |
| **ActionTrail** | ✅                 | ✅                  | ✅               | ✅                    | Ready (uses `.tarballs/` pattern)                              |
| **LocalMemGPT** | ✅                 | ✅                  | ✅               | ✅                    | Ready (repo-root build context)                                |
| **admin-web**   | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | Ready                                                          |
| **tracker-web** | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | Ready                                                          |

| Repo            | Backend Dockerfile | Web Dockerfile      | `docker-prep.sh` | `output:'standalone'` | Status                               |
| --------------- | ------------------ | ------------------- | ---------------- | --------------------- | ------------------------------------ |
| **LysnrAI**     | ✅                 | ✅ user-dashboard   | ✅               | ✅ (conditional)      | ✅ Ready                             |
| **MindLyst**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
| **ChronoMind**  | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
| **JarvisJr**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
| **PeakPulse**   | ✅                 | — (no web)          | ✅               | —                     | ✅ Ready                             |
| **FlowMonk**    | ✅                 | ✅                  | ✅               | ✅ (conditional)      | ✅ Ready                             |
| **NomGap**      | ✅                 | ✅                  | ✅               | ✅                    | ✅ Fixed (added `.tarballs/` COPY)   |
| **NoteLett**    | ✅                 | ✅                  | ✅               | ✅                    | ✅ Fixed (explicit COPY, not `.`)    |
| **ActionTrail** | ✅                 | ✅                  | ✅               | ✅                    | ✅ Ready (uses `.tarballs/` pattern) |
| **LocalMemGPT** | ✅                 | ✅                  | ✅               | ✅                    | ✅ Ready (repo-root build context)   |
| **admin-web**   | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | ✅ Ready                             |
| **tracker-web** | —                  | ✅ (in common-plat) | N/A (pnpm)       | ✅ (conditional)      | ✅ Ready                             |

**6 repos need Dockerfiles** + `docker-prep.sh` + `output:'standalone'`. 2 existing Dockerfiles have issues.

**All 10 product repos now have Dockerfiles, `docker-prep.sh`, and `output:'standalone'`.** Created 2026-03-22.

---

@@ -931,17 +931,292 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work

### Summary of Required Work Before Compose Works

| Priority | Item                                                     | Count         |
| -------- | -------------------------------------------------------- | ------------- |
| **P0**   | Create missing `docker-prep.sh`                          | 6 repos       |
| **P0**   | Create missing backend Dockerfiles                       | 6 repos       |
| **P0**   | Create missing web Dockerfiles                           | 5 repos       |
| **P0**   | Add `output: 'standalone'` to next.config.ts             | 3 webs        |
| **P1**   | Fix NomGap backend Dockerfile (add `.tarballs/` COPY)    | 1 file        |
| **P1**   | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file        |
| **P1**   | Create `.env.ecosystem` template                         | 1 file        |
| **P2**   | Standardize Node.js version to 22-alpine                 | 4 Dockerfiles |
| **P2**   | Add `extra_hosts` for Linux VM Ollama access             | 1 service     |

| Priority | Item                                                     | Count         | Status                                                     |
| -------- | -------------------------------------------------------- | ------------- | ---------------------------------------------------------- |
| **P0**   | Create missing `docker-prep.sh`                          | 6 repos       | ✅ Done (3 created, 3 already existed)                     |
| **P0**   | Create missing backend Dockerfiles                       | 6 repos       | ✅ Done                                                    |
| **P0**   | Create missing web Dockerfiles                           | 5 repos       | ✅ Done (4 created, PeakPulse has no web)                  |
| **P0**   | Add `output: 'standalone'` to next.config.ts             | 3 webs        | ✅ Done (4 webs: ChronoMind, JarvisJr, FlowMonk, MindLyst) |
| **P1**   | Fix NomGap backend Dockerfile (add `.tarballs/` COPY)    | 1 file        | ✅ Done                                                    |
| **P1**   | Fix NoteLett backend Dockerfile (explicit COPY, not `.`) | 1 file        | ✅ Done                                                    |
| **P1**   | Create `.env.ecosystem` template                         | 1 file        | Pending                                                    |
| **P2**   | Standardize Node.js version to 22-alpine                 | 4 Dockerfiles | ✅ Done (all new Dockerfiles use 22-alpine)                |
| **P2**   | Add `extra_hosts` for Linux VM Ollama access             | 1 service     | Pending                                                    |

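
The pending `extra_hosts` item could look roughly like the Compose fragment below. This is a sketch under assumptions, not the project's actual file: the service name `localmemgpt-backend` is hypothetical, and the `host-gateway` value assumes Docker Engine 20.10 or newer on the Linux VM.

```yaml
# docker-compose.yml (fragment); the service name is an assumption
services:
  localmemgpt-backend:
    environment:
      OLLAMA_URL: 'http://host.docker.internal:11434'
    extra_hosts:
      # On Linux, Docker does not define host.docker.internal automatically;
      # this maps it to the host's gateway IP.
      - 'host.docker.internal:host-gateway'
```

For this to work, Ollama on the host also has to listen on an address the container can reach (for example by setting `OLLAMA_HOST=0.0.0.0`), since it binds to loopback by default.
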

---

## 13. K8s & Docker Best Practices (from Production Comparisons)

> Derived from comparing three production K8s deployments: a Go-based Call Controller (Paladin), a Python/FastAPI streaming agent platform (NetBond), and a Python/FastAPI voice agent (Welcome Agent). These patterns should be adopted when ByteLyst moves from Docker Compose → K3s → managed K8s.

### 13.1 Deployment: Zero-Downtime Rolling Updates

**Do this (NetBond pattern):**

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # Never kill a pod before its replacement is ready
      maxSurge: 1 # Only 1 extra pod during rollout
  template:
    spec:
      terminationGracePeriodSeconds: 45 # Must cover the preStop sleep plus your app's drain timeout
      containers:
        - lifecycle:
            preStop:
              exec:
                command: ['sleep', '5'] # Let load balancer deregister before SIGTERM
```

**Don't do this (Paladin anti-pattern):**

```yaml
maxUnavailable: 50% # Up to half your pods go down at once; the rest can be overloaded and users get errors
maxSurge: 50% # Up to 50% extra pods during rollout; needs that much spare capacity
```

**ByteLyst action:** Every deployment template should use `maxUnavailable: 0` + preStop sleep + an explicit `terminationGracePeriodSeconds` that covers the preStop sleep plus the Fastify graceful shutdown timeout.

### 13.2 Pod Security Context

**Always set (NetBond pattern):**

```yaml
# Container-level securityContext: allowPrivilegeEscalation and
# readOnlyRootFilesystem are not valid in the pod-level securityContext.
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
```

If the app needs writable paths (e.g., `/tmp`, cache dirs), use `emptyDir` volumes:

```yaml
# Pod level (spec.template.spec)
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
# Container level (spec.template.spec.containers[])
volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: cache
    mountPath: /home/node/.cache
```

**ByteLyst action:** All Fastify backends are stateless, so `readOnlyRootFilesystem: true` works. Next.js standalone servers may need `/tmp` writable.

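
To show where the two fragments above sit relative to each other, here is a minimal Deployment sketch for a Next.js standalone server. The names `admin-web`, `web`, and the image tag are placeholders chosen for illustration, not values taken from the ByteLyst charts.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: admin-web # hypothetical name
spec:
  selector:
    matchLabels:
      app: admin-web
  template:
    metadata:
      labels:
        app: admin-web
    spec:
      securityContext: # pod level: applies to every container
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: web
          image: bytelyst/admin-web:local # hypothetical tag
          securityContext: # container level
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
          volumeMounts:
            - name: tmp
              mountPath: /tmp # writable scratch space on an otherwise read-only root
      volumes:
        - name: tmp
          emptyDir: {}
```
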
### 13.3 Health Probes: Dedicated Endpoints

**Do this:**

```yaml
livenessProbe:
  httpGet:
    path: /health # Dedicated lightweight endpoint
    port: 4003
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5 # Fast fail: 5s max
readinessProbe:
  httpGet:
    path: /health
    port: 4003
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
```

**Don't do this (Welcome Agent anti-pattern):**

```yaml
livenessProbe:
  httpGet:
    path: /openapi.json # Heavy endpoint, not a health check
  timeoutSeconds: 60 # Masks real failures for a full minute
```

**ByteLyst action:** All backends already expose `GET /health` → `{ status: "ok" }`. Use it. Set timeout to 5s.

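
For Phase 1, the same endpoint can back a Compose `healthcheck:`. The fragment below is a sketch: the timing values simply mirror the probe settings above, and it assumes the image is Alpine-based (such as `node:22-alpine`) so that BusyBox `wget` is available inside the container.

```yaml
# docker-compose.yml (fragment)
services:
  platform-service:
    healthcheck:
      # wget exits non-zero on connection failure or an HTTP error status
      test: ['CMD', 'wget', '-qO-', 'http://127.0.0.1:4003/health']
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s
```

Other services can then wait on it with `depends_on` and `condition: service_healthy`.
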
### 13.4 Ingress: WebSocket Support

If any service uses WebSocket or SSE (FlowMonk SSE, LocalMemGPT streaming, future real-time features):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-buffering: 'off'
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
    # Snippet annotations may be disabled by default in recent ingress-nginx releases
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
```

Missing WebSocket headers and short proxy timeouts fail silently: with NGINX's default 60s read timeout, idle connections drop with no error.

### 13.5 HPA: Use `autoscaling/v2`

**Do this:**

```yaml
apiVersion: autoscaling/v2 # Current API, supports multiple metrics
```

**Don't do this:**

```yaml
apiVersion: autoscaling/v1 # CPU-only: no memory or custom metrics, no scaling behavior settings
```

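
A fuller `autoscaling/v2` manifest might look like the sketch below. The replica bounds, utilization targets, and stabilization window are illustrative assumptions rather than tuned values, and the target Deployment name is assumed to match the `platform-service` service used elsewhere in this document.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: platform-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: platform-service # assumed Deployment name
  minReplicas: 2 # illustrative
  maxReplicas: 6 # illustrative
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 minutes before scaling down
```

`Utilization` targets are computed against container resource requests, so the Deployment needs CPU and memory requests set, and the cluster needs a metrics source such as metrics-server.
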
### 13.6 Dockerfile Best Practices

| Practice            | Do                                                         | Don't                                                                                            |
| ------------------- | ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]` (exec form)        | `ENTRYPOINT node dist/server.js` (shell form: PID 1 is `/bin/sh`, so signals never reach Node)   |
| **COPY scope**      | `COPY package.json ./` then `COPY src/ ./src/` (selective) | `COPY . .` (copies node_modules, .git, tests, everything)                                        |
| **Layer count**     | Combine related `RUN` steps                                | 3 separate `RUN pip install` / `RUN npm install` steps                                           |
| **Non-root**        | `USER node` (Node.js images have a `node` user)            | Running as root in production                                                                    |
| **Local variant**   | Provide `local.Dockerfile` without corp proxy/JFrog deps   | Single Dockerfile that only works behind corporate proxy                                         |
| **Build args**      | `ARG NODE_ENV=production` for conditional behavior         | Hardcoded env in Dockerfile                                                                      |

### 13.7 Helm Values Layering

Use 3 layers for environment management:

```
values.yaml          # Base defaults (image, port, probes, resources)
├── env/local.yaml   # Local K3s overrides (lower resources, NodePort, no TLS)
├── env/dev.yaml     # Dev cluster overrides (replicas, hostnames, secrets)
└── env/prod.yaml    # Prod overrides (more replicas, real TLS, HPA limits)
```

Deploy with layered `-f` flags:

```bash
# Local
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/local.yaml

# Dev
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/dev.yaml

# Prod
helm upgrade --install myapp ./charts -f charts/values.yaml -f charts/env/prod.yaml
```

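
As an illustration of what an overlay contains, a local file might look like the sketch below. Every key name here is hypothetical; the real keys are whatever the chart's templates read from `.Values`.

```yaml
# charts/env/local.yaml: only the keys that differ from values.yaml
# All key names are hypothetical placeholders.
replicaCount: 1
service:
  type: NodePort
ingress:
  tls: [] # no TLS locally
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    memory: 256Mi
```

Helm applies `-f` files left to right, so later files win. Maps are merged key by key, while lists are replaced wholesale, which is why an overlay should restate a whole list when it changes one entry. The chart's own `values.yaml` is loaded automatically, so passing it explicitly with `-f` is redundant but harmless.
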
### 13.8 Namespace Strategy

Use a Helm `_helpers.tpl` helper for the namespace; never hardcode it:

```yaml
# ✅ Standard pattern: respects --namespace flag
{{ include "myapp.namespace" . }}

# ❌ Anti-pattern: ignores helm --namespace, causes confusion
{{ .Values.namespace }}
```

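
One common way to write such a helper is sketched below. The `namespaceOverride` value is an assumption borrowed from widely used community charts, not something this document defines; `.Release.Namespace` is the built-in that carries the `--namespace` flag.

```yaml
{{/* templates/_helpers.tpl: namespace helper (sketch) */}}
{{- define "myapp.namespace" -}}
{{- default .Release.Namespace .Values.namespaceOverride -}}
{{- end -}}
```

A manifest template would then reference it like this (the resource name is a placeholder):

```yaml
metadata:
  name: platform-service
  namespace: {{ include "myapp.namespace" . }}
```

With this shape, `helm upgrade --install ... -n bytelyst` controls the namespace unless an override value is set explicitly.
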
### 13.9 Secrets Management Progression

| Phase                    | Strategy                                              | Complexity |
| ------------------------ | ----------------------------------------------------- | ---------- |
| **Phase 1** (Compose)    | `.env.ecosystem` file (gitignored)                    | Trivial    |
| **Phase 2** (K3s)        | Native K8s `Secret` objects + `kubectl create secret` | Low        |
| **Phase 3** (Production) | Azure Key Vault via `SecretProviderClass` CSI driver  | Medium     |
| **Phase 4** (Enterprise) | AKV + `AzureKeyVaultSecret` CRD with auto-sync        | High       |

ByteLyst already uses AKV in production (platform-service), so the CSI driver pattern is the natural next step.

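
For Phase 2, a Deployment can load every key of a native `Secret` as environment variables. The fragment below is a sketch that assumes a Secret named `bytelyst-secrets` (the name used by the deploy script in 13.11) already exists in the namespace; the container name is a placeholder.

```yaml
spec:
  template:
    spec:
      containers:
        - name: platform-service # hypothetical container name
          envFrom:
            - secretRef:
                name: bytelyst-secrets # each key becomes an env var
```

Because the Secret is built from `.env.ecosystem`, the variable names inside the pod match the ones the services already read under Compose.
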
### 13.10 CI/CD Best Practices (Lessons from Production Pipelines)

| Practice               | Description                                                                                                  |
| ---------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Semantic release**   | Auto-version from commit messages (`feat:` → minor, `fix:` → patch). ByteLyst already uses this convention.  |
| **Image promotion**    | Build once → push to staging repo → promote to gold/prod repo (never rebuild for prod).                      |
| **Branch pipelines**   | Different CI stages per branch: feature (lint+test), develop (build+deploy-dev), main (promote+deploy-prod). |
| **Security gates**     | SAST + SCA scans on every build. Block merges on critical findings.                                          |
| **Quality gates**      | Unit tests + coverage + SonarQube. Fail pipeline if coverage drops.                                          |
| **Auto-deploy to dev** | Pipeline trigger: when build completes → auto-deploy to dev. Manual gate for prod.                           |
| **Chart versioning**   | Publish Helm chart to OCI registry (ACR) with semantic version. Pull by version during deploy.               |

### 13.11 Local K8s Development Script Template

A good local K8s deploy script should handle:

```bash
#!/usr/bin/env bash
# deploy-local-k8s.sh: full local K8s deployment for the ByteLyst ecosystem

set -euo pipefail

NAMESPACE="bytelyst"
ACTION="${1:-deploy}" # deploy | teardown

case "$ACTION" in
  deploy)
    # 1. Build all Docker images
    for svc in platform-service extraction-service mcp-server; do
      docker build -t "bytelyst/$svc:local" "./learning_ai_common_plat/services/$svc"
    done

    # 2. Load images into K3s containerd (not needed with Docker Desktop)
    if command -v k3s &>/dev/null; then
      for img in $(docker images --format '{{.Repository}}:{{.Tag}}' | grep bytelyst); do
        # Pipe via stdin: sudo typically closes the extra descriptor that <(...) relies on
        docker save "$img" | sudo k3s ctr images import -
      done
    fi

    # 3. Create namespace + secrets (idempotent via dry-run + apply)
    kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
    kubectl create secret generic bytelyst-secrets \
      --from-env-file=.env.ecosystem \
      -n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

    # 4. Deploy via Helm with local overlay
    helm upgrade --install bytelyst ./helm/bytelyst-ecosystem \
      -f helm/bytelyst-ecosystem/values.yaml \
      -f helm/bytelyst-ecosystem/env/local.yaml \
      -n "$NAMESPACE"

    # 5. Wait + verify (waits for every Deployment in the namespace)
    kubectl wait --for=condition=Available deployment --all -n "$NAMESPACE" --timeout=120s
    echo "All pods:"
    kubectl get pods -n "$NAMESPACE"
    echo ""
    echo "Port-forward: kubectl port-forward svc/platform-service 4003:4003 -n $NAMESPACE"
    ;;

  teardown)
    helm uninstall bytelyst -n "$NAMESPACE" 2>/dev/null || true
    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
    echo "Teardown complete."
    ;;

  *)
    echo "Usage: $0 [deploy|teardown]" >&2
    exit 1
    ;;
esac
```

### 13.12 Quick Reference: What to Apply at Each Phase

| Best Practice                | Phase 1 (Compose)        | Phase 2 (K3s)      | Phase 3 (Prod K8s) |
| ---------------------------- | ------------------------ | ------------------ | ------------------ |
| Zero-downtime rolling update | N/A                      | ✅ Apply           | ✅ Apply           |
| Pod security context         | N/A                      | ✅ Apply           | ✅ Apply           |
| Health probes                | N/A (use `healthcheck:`) | ✅ Apply           | ✅ Apply           |
| WebSocket ingress headers    | N/A                      | ✅ If using SSE/WS | ✅ Apply           |
| HPA v2                       | N/A                      | Optional           | ✅ Apply           |
| Exec-form ENTRYPOINT         | ✅ Apply now             | ✅                 | ✅                 |
| Selective COPY               | ✅ Apply now             | ✅                 | ✅                 |
| Non-root user                | ✅ Apply now             | ✅                 | ✅                 |
| Values layering              | N/A                      | ✅ Apply           | ✅ Apply           |
| Secrets via AKV CSI          | N/A                      | N/A                | ✅ Apply           |
| Semantic release             | ✅ Apply now             | ✅                 | ✅                 |
| Image promotion              | N/A                      | N/A                | ✅ Apply           |
| Local deploy script          | N/A                      | ✅ Apply           | ✅ Adapt           |

---

@@ -950,7 +1225,7 @@ LocalMemGPT uses `OLLAMA_URL: 'http://host.docker.internal:11434'` — this work
| Question                       | Answer                                                                                                         |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------- |
| **Can deploy on single VM?**   | **Yes.** All ~25 services fit in 32 GB RAM.                                                                    |
| **All Dockerized?**            | 4/10 product repos fully Dockerized. 6 need Dockerfiles (copy-paste template).                                 |
| **All Dockerized?**            | **Yes.** All 10 product repos now have Dockerfiles + docker-prep.sh.                                           |
| **K8s practice on single VM?** | **K3s**: certified K8s, single binary, same manifests scale to multi-node or AKS/EKS/GKE.                      |
| **Recommended VM?**            | 8 vCPU / 32 GB (min) or 16 vCPU / 64 GB (with Ollama). Hetzner ~$45/mo for dev.                                |
| **Time to production K8s?**    | Phase 1 (compose) → Phase 2 (K3s single) → Phase 3 (K3s multi) → Phase 4 (managed). Same manifests throughout. |
