learning_ai_common_plat/docs/devops/KUBERNETES_ROADMAP.md
2026-03-23 18:04:18 -07:00


# ByteLyst Ecosystem — Kubernetes Roadmap
This document is the standalone roadmap for moving the ByteLyst ecosystem from Docker Compose on a single VM to local Kubernetes practice and eventually production-grade Kubernetes deployment.
## Scope
Use this roadmap for:
- Docker Compose → Docker Desktop Kubernetes / K3s transition planning
- local Kubernetes validation strategy
- Helm/chart planning
- Kubernetes best practices for deployments, security, probes, ingress, and scaling
- secrets progression from `.env.ecosystem` to Kubernetes `Secret` objects and later Azure Key Vault integration
- CI/CD expectations for image promotion and chart versioning

This document does **not** replace `docs/devops/SINGLE_VM_DEPLOYMENT.md`.

`SINGLE_VM_DEPLOYMENT.md` remains the source of truth for:
- single-VM deployment scope
- Docker Compose ecosystem architecture
- Dockerization and package-manager-aware deployment guidance
- current implementation status and audit findings
## Current State
### Completed foundation
- Docker Compose ecosystem architecture is documented
- product repos have Dockerfiles and `docker-prep.sh`
- shared services have been built and validated in the ecosystem stack
- LocalMemGPT Linux-host Ollama access is addressed in Compose via `extra_hosts`
- deployment docs now separate Compose/source-of-truth concerns from Kubernetes roadmap concerns
### Not yet completed
- standalone local Kubernetes assets
- Helm charts / values structure in-repo
- Kubernetes manifests for the ecosystem
- local K8s deployment script implementation
- full K3s / Docker Desktop K8s validation
## Phase Plan
### Phase 1 — Docker Compose baseline
Goal: keep Compose as the operational baseline while Docker/build/runtime contracts stabilize.

Success criteria:
- all ecosystem images build successfully
- all required services start in Docker Compose
- health endpoints are reachable for shared services and product backends
- major host/container networking assumptions are documented
### Phase 2 — Local Kubernetes practice
Goal: run the same ecosystem ideas on a single-node Kubernetes environment for production-readiness practice.

Two supported paths:
#### Option A: Docker Desktop Kubernetes
Best for:
- macOS / Windows development
- quick iteration
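- visual debugging

Characteristics:
- built-in single-node cluster managed by Docker Desktop
- images built with the local Docker engine are visible to the cluster without a registry push, provided the pod does not force a pull (an untagged or `latest` image defaults `imagePullPolicy` to `Always`)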
- easiest local path for validating manifests and Helm shape
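A sketch of a container entry for a locally built image. The image name and tag are hypothetical. Because the tag is not `latest`, Kubernetes would already default `imagePullPolicy` to `IfNotPresent`; setting it explicitly documents that the kubelet should use the image already present in the local engine:

```yaml
# Hypothetical container entry; image name and tag are illustrative only.
containers:
  - name: flowmonk-backend
    image: bytelyst/flowmonk-backend:dev # non-latest tag, built locally with `docker build`
    imagePullPolicy: IfNotPresent # use the image already in the local engine
```

`imagePullPolicy: Never` is a stricter alternative that fails fast with `ErrImageNeverPull` when the image has not been built locally.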
#### Option B: K3s
Best for:
- Linux VMs
- Hetzner or cloud-hosted single-node practice
- future multi-node growth

Characteristics:
- lightweight CNCF-certified Kubernetes distro
- built-in Traefik ingress
- built-in local-path storage class
- can evolve from single-node to multi-node more naturally than Docker Desktop
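A sketch of a claim against the K3s built-in `local-path` storage class. The claim name and size are hypothetical. In a stock K3s install `local-path` is the default StorageClass, so `storageClassName` could be omitted; naming it keeps the manifest explicit:

```yaml
# Hypothetical claim name and size; K3s provisions the volume on the node's local disk.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bytelyst-data
spec:
  storageClassName: local-path # K3s built-in storage class
  accessModes:
    - ReadWriteOnce # node-local storage: mounted from a single node
  resources:
    requests:
      storage: 5Gi
```

In a default install the class uses `WaitForFirstConsumer` binding, so the claim stays `Pending` until a pod that mounts it is scheduled.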
### Phase 3 — Production-grade Kubernetes shape
Goal: make local K8s patterns production-ready enough to port to AKS/EKS/GKE later without redesign.

Key outcomes:
- health probes standardized
- rolling update behavior standardized
- security context standardized
- ingress and SSE/WebSocket behavior standardized
- Helm values layering defined
- secret management progression defined
### Phase 4 — Managed Kubernetes target
Goal: preserve the same deployment model while moving to managed infrastructure.

Expected direction:
- managed ingress controller and TLS
- chart/image promotion flow
- Azure Key Vault CSI integration
- HPA and environment-specific overlays
## Local Kubernetes Best Practices
### 1. Deployment rollout safety
Use zero-downtime defaults:
```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - lifecycle:
            preStop:
              exec:
                command: ['sleep', '5']
```
Guidance:
- never use aggressive `maxUnavailable` values for user-facing services
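- set `terminationGracePeriodSeconds` longer than the `preStop` delay plus the application's graceful-shutdown time, because the grace period countdown starts before the `preStop` hook runs
- use a short `preStop` delay so endpoints and the ingress/load balancer stop routing new requests to the pod before it receives `SIGTERM`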
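The `exec` form above assumes the image ships a `sleep` binary, which minimal or distroless images may not. A sketch of the same delay using the built-in `sleep` lifecycle action, assuming a cluster where that action is available (enabled by default from Kubernetes 1.30); the container name is illustrative:

```yaml
# Container name is illustrative; relies on the built-in sleep lifecycle action (Kubernetes 1.30+).
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45 # must cover the preStop delay plus app shutdown
      containers:
        - name: app
          lifecycle:
            preStop:
              sleep:
                seconds: 5 # handled by the kubelet; no sleep binary needed in the image
```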
### 2. Pod security context
Default posture:
```yaml
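spec:
  securityContext: # pod level: defaults for every container in the pod
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
  containers:
    - name: app # illustrative container name
      securityContext: # these two fields exist only at container level
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true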
```
If writable paths are needed:
```yaml
# Pod spec
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
# Container spec
volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: cache
    mountPath: /home/node/.cache
```
Guidance:
- Fastify backends should generally tolerate read-only root filesystems
- Next.js standalone servers may need a writable `/tmp`, and typically a writable `.next/cache` as well when image optimization or ISR is used
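A pod fragment combining a read-only root filesystem with the writable paths a Next.js standalone server typically needs. This is a sketch that assumes the server runs from `WORKDIR /app` (so the cache lives at `/app/.next/cache`); the container and volume names are illustrative:

```yaml
# Hypothetical Next.js standalone pod fragment; assumes the server runs from WORKDIR /app.
spec:
  containers:
    - name: web
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: next-cache
          mountPath: /app/.next/cache # assumed Next.js cache location under /app
  volumes:
    - name: tmp
      emptyDir: {}
    - name: next-cache
      emptyDir: {}
```

Because `emptyDir` volumes are tied to the pod, the cache starts empty on every new pod; that is usually acceptable for optimized-image caches.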
### 3. Health probes
Use dedicated `/health` endpoints:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 4003
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 4003
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
```
Guidance:
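- do not use heavy endpoints like `/openapi.json` for liveness
- keep liveness checks limited to the process itself; a liveness probe that depends on databases or other services can turn a downstream outage into a restart loop
- keep timeouts short enough to expose real failures quickly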
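For services that occasionally start slowly, a `startupProbe` can absorb the boot time instead of a long `initialDelaySeconds`: liveness and readiness probes do not run until the startup probe succeeds. A sketch reusing the `/health` endpoint and port from the example above; the threshold values are illustrative:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 4003
  periodSeconds: 5
  failureThreshold: 24 # allows up to 24 x 5s = 120s for startup
livenessProbe:
  httpGet:
    path: /health
    port: 4003
  periodSeconds: 10 # no initialDelaySeconds needed; starts after startupProbe passes
  timeoutSeconds: 5
```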
### 4. Ingress for SSE / WebSocket traffic
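For streaming or long-lived connections behind the ingress-nginx controller (the `nginx.ingress.kubernetes.io/*` annotations are ignored by other controllers, including the Traefik ingress bundled with K3s):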
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-buffering: 'off'
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
    # WebSocket Upgrade/Connection headers are handled by ingress-nginx out of the box;
    # `configuration-snippet` annotations are disabled by default in recent controller releases.
```
Applies to:
- FlowMonk SSE
- LocalMemGPT streaming
- future realtime features
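A sketch of a complete streaming Ingress for the nginx ingress class. The resource name, host, Service name, and port are hypothetical placeholders, not values from the ecosystem:

```yaml
# Hypothetical name, host, Service, and port; annotations as in the block above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flowmonk-sse
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-buffering: 'off' # flush SSE events immediately
spec:
  ingressClassName: nginx
  rules:
    - host: flowmonk.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: flowmonk-backend
                port:
                  number: 8080
```

`ingressClassName: nginx` assumes the controller registered the `nginx` IngressClass, which is the ingress-nginx default.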
### 5. HPA API choice
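Use `autoscaling/v2` (stable since Kubernetes 1.23), which supports resource, custom, and external metrics plus configurable scaling `behavior`: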
```yaml
apiVersion: autoscaling/v2
```
Avoid:
```yaml
apiVersion: autoscaling/v1
```
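A minimal `autoscaling/v2` sketch scaling on both CPU and memory with a slower scale-down. The target Deployment name and thresholds are hypothetical. Utilization targets are computed against container resource requests, so the target pods must set `resources.requests`, and the cluster needs a metrics source such as metrics-server (bundled with K3s; it may need to be installed separately on Docker Desktop):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flowmonk-backend # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flowmonk-backend # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # percent of requested CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80 # percent of requested memory
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # avoid flapping after short spikes
```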
## Docker and Image Guidance for K8s Readiness
| Practice | Do | Avoid |
| ------------------- | -------------------------------------------- | --------------------------------------------------------- |
| **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]`     | shell form (`/bin/sh -c` does not forward `SIGTERM`)      |
| **COPY scope** | selective `COPY` steps | broad `COPY . .` |
| **Layer count** | combine related `RUN` steps | fragmented install layers |
| **Non-root** | run as `node` or non-root UID | root runtime |
| **Local variant** | allow local Dockerfile variants where needed | one Dockerfile that only works in one network environment |
| **Build args** | use `ARG`/`ENV` deliberately | hardcoded deployment assumptions |
## Helm Values Layering
Recommended structure:
```text
helm/bytelyst-ecosystem/
├── values.yaml        # shared defaults
└── env/
    ├── local.yaml     # per-environment overrides
    ├── dev.yaml
    └── prod.yaml
```
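A sketch of what a local overlay might contain. The keys follow the layout generated by `helm create` and are assumptions until the chart's own `values.yaml` exists. Helm merges the `-f` files left to right, so later files win; nested maps are merged key by key, while lists are replaced as a whole:

```yaml
# helm/bytelyst-ecosystem/env/local.yaml (hypothetical keys; must match the chart's values.yaml)
replicaCount: 1
image:
  pullPolicy: IfNotPresent # images are built locally, not pulled
ingress:
  enabled: false # rely on kubectl port-forward locally
autoscaling:
  enabled: false # HPA is optional for local practice
```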
Recommended usage:
```bash
helm upgrade --install bytelyst ./helm/bytelyst-ecosystem -f helm/bytelyst-ecosystem/values.yaml -f helm/bytelyst-ecosystem/env/local.yaml
helm upgrade --install bytelyst ./helm/bytelyst-ecosystem -f helm/bytelyst-ecosystem/values.yaml -f helm/bytelyst-ecosystem/env/dev.yaml
helm upgrade --install bytelyst ./helm/bytelyst-ecosystem -f helm/bytelyst-ecosystem/values.yaml -f helm/bytelyst-ecosystem/env/prod.yaml
```
## Namespace Strategy
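Resolve the namespace through a named template helper (typically one that defaults to `.Release.Namespace`) rather than reading it straight from values:
```yaml
namespace: {{ include "myapp.namespace" . }}
```
Avoid reading it directly from values, which can place resources outside the namespace passed to `helm install --namespace`:
```yaml
namespace: {{ .Values.namespace }}
```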
## Secrets Progression
| Phase | Strategy | Complexity |
| ----------- | --------------------------------------------- | ---------- |
| **Phase 1** | `.env.ecosystem` file (gitignored) | Trivial |
| **Phase 2** | Native Kubernetes `Secret` objects | Low |
| **Phase 3** | Azure Key Vault via CSI `SecretProviderClass` | Medium |
| **Phase 4** | AKV + operator/CRD auto-sync model | High |
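A sketch of the Phase 2 step, assuming `.env.ecosystem` contains plain `KEY=value` lines. The Secret name and key are hypothetical. Note that `Secret` values are only base64-encoded, not encrypted, unless encryption at rest is configured for the cluster, which is one reason for the later Key Vault phases:

```yaml
# Hypothetical Secret name and key; real values come from .env.ecosystem and are never committed.
# An equivalent object can be generated with:
#   kubectl create secret generic bytelyst-env --from-env-file=.env.ecosystem
apiVersion: v1
kind: Secret
metadata:
  name: bytelyst-env
type: Opaque
stringData: # plain strings; the API server stores them base64-encoded under `data`
  DATABASE_URL: '<set-from-.env.ecosystem>'
# A container can then load every key as an environment variable:
#   envFrom:
#     - secretRef:
#         name: bytelyst-env
```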
## CI/CD Expectations
| Practice | Expectation |
| -------------------- | --------------------------------------------------------------- |
| **Semantic release** | keep `feat:` / `fix:` conventions usable for release automation |
| **Image promotion** | build once, promote later; do not rebuild for prod |
| **Branch pipelines** | branch-specific quality and deploy stages |
| **Security gates** | SAST/SCA in pipeline |
| **Quality gates** | tests, coverage, type safety, build verification |
| **Chart versioning** | publish/version charts independently |
## Local K8s Deployment Workflow Shape
A future local K8s script should do the following:
1. detect Docker Desktop K8s vs K3s
2. build required images
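3. load/import images into the local cluster runtime when needed (K3s uses containerd rather than the Docker engine, so Docker-built images must be exported with `docker save` and loaded with `k3s ctr images import`)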
4. create namespace
5. create secrets from `.env.ecosystem`
6. deploy Helm chart with local overlay
7. wait for rollout
8. print verification commands and port-forward hints
## Recommended Next Items
### Next now
- run full Docker Compose ecosystem validation end-to-end
- capture blockers by service
- decide whether K8s phase starts with Docker Desktop K8s or K3s first
### Next after Compose validation
- define `helm/bytelyst-ecosystem/` layout
- define namespace and secret model
- draft minimal shared-service-first Kubernetes manifests or chart values
- create local K8s deploy helper script
### Hold for later
- full Helm/K3s implementation across the ecosystem
- managed cluster rollout details
- advanced autoscaling and production ingress hardening
## Quick Reference
| Practice | Compose | Local K8s | Prod K8s |
| ---------------------------- | --------------------------------------- | --------- | -------- |
| Zero-downtime rolling update | N/A | Apply | Apply |
| Pod security context | N/A | Apply | Apply |
| Health probes | use Docker `healthcheck` where relevant | Apply | Apply |
| SSE/WebSocket ingress tuning | N/A | If needed | Apply |
| HPA v2 | N/A | Optional | Apply |
| Exec-form entrypoint | Apply now | Apply | Apply |
| Selective COPY | Apply now | Apply | Apply |
| Non-root user | Apply now | Apply | Apply |
| Values layering | N/A | Apply | Apply |
| AKV CSI | N/A | N/A | Apply |
| Image promotion | N/A | N/A | Apply |
## Status
- standalone Kubernetes roadmap: **created**
- Compose source-of-truth split: **done**
- Helm/K3s implementation: **held pending validation**