
# ByteLyst Ecosystem — Kubernetes Roadmap

This document is the standalone roadmap for moving the ByteLyst ecosystem from Docker Compose on a single VM, through local Kubernetes practice, to an eventual production-grade Kubernetes deployment.

## Scope

Use this roadmap for:

- Docker Compose → Docker Desktop Kubernetes / K3s transition planning
- local Kubernetes validation strategy
- Helm/chart planning
- Kubernetes best practices for deployments, security, probes, ingress, and scaling
- secrets progression from `.env.ecosystem` to Kubernetes `Secret` objects and later Azure Key Vault integration
- CI/CD expectations for image promotion and chart versioning

This document does **not** replace `docs/devops/SINGLE_VM_DEPLOYMENT.md`.

`SINGLE_VM_DEPLOYMENT.md` remains the source of truth for:

- single-VM deployment scope
- Docker Compose ecosystem architecture
- Dockerization and package-manager-aware deployment guidance
- current implementation status and audit findings

## Current State

### Completed foundation

- Docker Compose ecosystem architecture is documented
- product repos have Dockerfiles and `docker-prep.sh`
- shared services have been built and validated in the ecosystem stack
- LocalMemGPT Linux-host Ollama access is addressed in Compose via `extra_hosts`
- deployment docs now separate Compose/source-of-truth concerns from Kubernetes roadmap concerns

### Not yet completed

- standalone local Kubernetes assets
- Helm charts / values structure in-repo
- Kubernetes manifests for the ecosystem
- local K8s deployment script implementation
- full K3s / Docker Desktop K8s validation

## Phase Plan

### Phase 1 — Docker Compose baseline

Goal: keep Compose as the operational baseline while Docker/build/runtime contracts stabilize.

Success criteria:

- all ecosystem images build successfully
- all required services start in Docker Compose
- health endpoints are reachable for shared services and product backends
- major host/container networking assumptions are documented
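
The health-endpoint criterion can be checked by Compose itself. The following is a minimal sketch of a Compose `healthcheck`, assuming a Node.js 18+ image and a service that exposes `/health` on port 4003. The service name, image tag, and port are hypothetical placeholders, not values taken from the ecosystem stack.

```yaml
services:
  example-backend:                 # hypothetical service name
    image: example-backend:local   # hypothetical image tag
    healthcheck:
      # Exec-form test, so no shell or curl is required in the image.
      # Assumes Node.js 18+ (global fetch) and a /health route on port 4003.
      test:
        - CMD
        - node
        - -e
        - "fetch('http://127.0.0.1:4003/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s
```

With a check like this in place, `docker compose ps` reports the service as `healthy` or `unhealthy`, and dependent services can wait on it with `depends_on` and `condition: service_healthy`.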

### Phase 2 — Local Kubernetes practice

Goal: run the same ecosystem ideas on a single-node Kubernetes environment for production-readiness practice.

Two supported paths:

#### Option A: Docker Desktop Kubernetes

Best for:

- macOS / Windows development
- quick iteration
- visual debugging

Characteristics:

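- built-in single-node cluster (kubeadm-provisioned by default; recent Docker Desktop releases can also provision a `kind`-based cluster)
- with the default provisioner, locally built images are usable by the cluster without a registry push, provided `imagePullPolicy` is not `Always`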
- easiest local path for validating manifests and Helm shape

#### Option B: K3s

Best for:

- Linux VMs
- Hetzner or cloud-hosted single-node practice
- future multi-node growth

Characteristics:

- lightweight CNCF-certified Kubernetes distro
- built-in Traefik ingress
- built-in local-path storage class
- can evolve from single-node to multi-node more naturally than Docker Desktop
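
As a rough illustration of what the bundled components mean for manifests, the sketch below requests storage from the `local-path` storage class and routes traffic through the `traefik` ingress class. It assumes neither component was disabled at install time; the resource names, host, and port are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data               # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path     # K3s bundled provisioner
  resources:
    requests:
      storage: 1Gi
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-backend            # hypothetical name
spec:
  ingressClassName: traefik        # K3s bundled ingress controller
  rules:
    - host: backend.example.internal   # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-backend
                port:
                  number: 4003
```

`local-path` volumes live on a single node's disk, so they suit single-node practice but do not follow a pod to another node.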

### Phase 3 — Production-grade Kubernetes shape

Goal: make local K8s patterns production-ready enough to port to AKS/EKS/GKE later without redesign.

Key outcomes:

- health probes standardized
- rolling update behavior standardized
- security context standardized
- ingress and SSE/WebSocket behavior standardized
- Helm values layering defined
- secret management progression defined

### Phase 4 — Managed Kubernetes target

Goal: preserve the same deployment model while moving to managed infrastructure.

Expected direction:

- managed ingress controller and TLS
- chart/image promotion flow
- Azure Key Vault CSI integration
- HPA and environment-specific overlays

## Local Kubernetes Best Practices

### 1. Deployment rollout safety

Use zero-downtime defaults:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 45
      containers:
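        # Fragment: merge into the service's existing container entry.
        - lifecycle:
            preStop:
              exec:
                # Requires a `sleep` binary in the image (absent from some distroless bases).
                command: ['sleep', '5']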
```

Guidance:

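- keep `maxUnavailable: 0` for user-facing services so a rollout never drops below the desired replica count
- set `terminationGracePeriodSeconds` longer than the `preStop` delay plus the service's worst-case graceful shutdown time, since the grace period covers both
- use the `preStop` delay to give the ingress controller and Service endpoints time to stop routing to the pod before `SIGTERM` arrives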

### 2. Pod security context

Default posture:

```yaml
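# Container-level block (containers[].securityContext).
# allowPrivilegeEscalation and readOnlyRootFilesystem are not valid in the pod-level securityContext.
securityContext: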
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
```

If writable paths are needed:

```yaml
# Pod level (spec.template.spec.volumes)
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}
# Container level (containers[].volumeMounts)
volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: cache
    mountPath: /home/node/.cache
```

Guidance:

- Fastify backends should generally tolerate read-only root filesystems
- Next.js standalone servers may need writable `/tmp`

### 3. Health probes

Use dedicated, lightweight `/health` endpoints (port `4003` below is an example; use each service's own listen port):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 4003
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 4003
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
```

Guidance:

- do not use heavy endpoints like `/openapi.json` for liveness
- keep timeouts short enough to expose real failures quickly

### 4. Ingress for SSE / WebSocket traffic

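For streaming or long-lived connections behind the ingress-nginx controller (these annotations are controller-specific and have no effect on the Traefik ingress bundled with K3s):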

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy-buffering: 'off'
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
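    # Optional: ingress-nginx already handles WebSocket upgrades, and snippet
    # annotations must be explicitly enabled on the controller to take effect.
    nginx.ingress.kubernetes.io/configuration-snippet: |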
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
```

Applies to:

- FlowMonk SSE
- LocalMemGPT streaming
- future realtime features

### 5. HPA API choice

Use the stable `autoscaling/v2` API, which supports CPU, memory, custom, and external metrics:

```yaml
apiVersion: autoscaling/v2
```

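Avoid `autoscaling/v1`, which only supports a CPU utilization target (the `v2beta1` and `v2beta2` versions have been removed from current Kubernetes releases):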

```yaml
apiVersion: autoscaling/v1
```
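
For context, a minimal `autoscaling/v2` manifest might look like the sketch below. It assumes a metrics source such as metrics-server is running in the cluster and that the target containers declare CPU and memory requests, because utilization is calculated relative to requests. The names, replica bounds, and thresholds are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-backend            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-backend          # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
    - type: Resource
      resource:
        name: memory               # memory targets are not expressible in autoscaling/v1
        target:
          type: Utilization
          averageUtilization: 80   # percent of the memory request
```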

## Docker and Image Guidance for K8s Readiness

| Practice            | Do                                           | Avoid                                                     |
| ------------------- | -------------------------------------------- | --------------------------------------------------------- |
| **ENTRYPOINT form** | `ENTRYPOINT ["node", "dist/server.js"]`      | shell-form entrypoints                                    |
| **COPY scope**      | selective `COPY` steps                       | broad `COPY . .`                                          |
| **Layer count**     | combine related `RUN` steps                  | fragmented install layers                                 |
| **Non-root**        | run as `node` or non-root UID                | root runtime                                              |
| **Local variant**   | allow local Dockerfile variants where needed | one Dockerfile that only works in one network environment |
| **Build args**      | use `ARG`/`ENV` deliberately                 | hardcoded deployment assumptions                          |

## Helm Values Layering

Recommended structure:

```text
helm/bytelyst-ecosystem/
├── values.yaml
└── env/
    ├── local.yaml
    ├── dev.yaml
    └── prod.yaml
```

Recommended usage (later `-f` files override earlier ones, so the environment overlay wins over `values.yaml`):

```bash
helm upgrade --install bytelyst ./helm/bytelyst-ecosystem -f helm/bytelyst-ecosystem/values.yaml -f helm/bytelyst-ecosystem/env/local.yaml
helm upgrade --install bytelyst ./helm/bytelyst-ecosystem -f helm/bytelyst-ecosystem/values.yaml -f helm/bytelyst-ecosystem/env/dev.yaml
helm upgrade --install bytelyst ./helm/bytelyst-ecosystem -f helm/bytelyst-ecosystem/values.yaml -f helm/bytelyst-ecosystem/env/prod.yaml
```
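
To make the override behavior concrete, here is a small sketch of a base file and a local overlay. These are two separate files shown in one block, and every key is a hypothetical example rather than the chart's actual schema.

```yaml
# File 1: helm/bytelyst-ecosystem/values.yaml (shared defaults)
replicaCount: 2
image:
  tag: "1.4.0"
  pullPolicy: IfNotPresent
ingress:
  enabled: true
---
# File 2: helm/bytelyst-ecosystem/env/local.yaml (local overlay)
replicaCount: 1
image:
  tag: local
  pullPolicy: Never                # use the locally built image only
```

Helm merges maps key by key, so passing both files in that order would yield `replicaCount: 1`, `image.tag: local`, and `image.pullPolicy: Never`, while `ingress.enabled: true` is inherited unchanged from the base file.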

## Namespace Strategy

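Resolve the namespace through a chart helper rather than referencing the raw value in each template, so that fallback logic (for example, defaulting to the release namespace) lives in one place: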

```yaml
{{ include "myapp.namespace" . }}
```

Avoid referencing the value directly:

```yaml
{{ .Values.namespace }}
```

## Secrets Progression

| Phase       | Strategy                                      | Complexity |
| ----------- | --------------------------------------------- | ---------- |
| **Phase 1** | `.env.ecosystem` file (gitignored)            | Trivial    |
| **Phase 2** | Native Kubernetes `Secret` objects            | Low        |
| **Phase 3** | Azure Key Vault via CSI `SecretProviderClass` | Medium     |
| **Phase 4** | AKV + operator/CRD auto-sync model            | High       |
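
As an illustration of the Phase 2 row, the sketch below shows a native `Secret` and the container fragment that consumes it. The secret name, namespace, and keys are hypothetical. An equivalent object could also be generated from the existing file with `kubectl create secret generic <name> --from-env-file=.env.ecosystem`.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ecosystem-env              # hypothetical name
  namespace: bytelyst              # hypothetical namespace
type: Opaque
stringData:                        # plain-text input; stored base64-encoded under `data`
  DATABASE_URL: "replace-me"       # hypothetical key
  JWT_SECRET: "replace-me"         # hypothetical key
---
# Container fragment: expose every key in the Secret as an environment variable
envFrom:
  - secretRef:
      name: ecosystem-env
```

Base64 encoding is not encryption, so manifests carrying real values should stay out of version control, just as `.env.ecosystem` does.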

## CI/CD Expectations

| Practice             | Expectation                                                     |
| -------------------- | --------------------------------------------------------------- |
| **Semantic release** | keep `feat:` / `fix:` conventions usable for release automation |
| **Image promotion**  | build once, promote later; do not rebuild for prod              |
| **Branch pipelines** | branch-specific quality and deploy stages                       |
| **Security gates**   | SAST/SCA in pipeline                                            |
| **Quality gates**    | tests, coverage, type safety, build verification                |
| **Chart versioning** | publish/version charts independently                            |

## Local K8s Deployment Workflow Shape

A future local K8s script should do the following:

1. detect Docker Desktop K8s vs K3s
2. build required images
3. load/import images into the local cluster runtime when needed
4. create namespace
5. create secrets from `.env.ecosystem`
6. deploy Helm chart with local overlay
7. wait for rollout
8. print verification commands and port-forward hints

## Recommended Next Items

### Next now

- run full Docker Compose ecosystem validation end-to-end
- capture blockers by service
- decide whether K8s phase starts with Docker Desktop K8s or K3s first

### Next after Compose validation

- define `helm/bytelyst-ecosystem/` layout
- define namespace and secret model
- draft minimal shared-service-first Kubernetes manifests or chart values
- create local K8s deploy helper script

### Hold for later

- full Helm/K3s implementation across the ecosystem
- managed cluster rollout details
- advanced autoscaling and production ingress hardening

## Quick Reference

| Practice                     | Compose                                 | Local K8s | Prod K8s |
| ---------------------------- | --------------------------------------- | --------- | -------- |
| Zero-downtime rolling update | N/A                                     | Apply     | Apply    |
| Pod security context         | N/A                                     | Apply     | Apply    |
| Health probes                | use Docker `healthcheck` where relevant | Apply     | Apply    |
| SSE/WebSocket ingress tuning | N/A                                     | If needed | Apply    |
| HPA v2                       | N/A                                     | Optional  | Apply    |
| Exec-form entrypoint         | Apply now                               | Apply     | Apply    |
| Selective COPY               | Apply now                               | Apply     | Apply    |
| Non-root user                | Apply now                               | Apply     | Apply    |
| Values layering              | N/A                                     | Apply     | Apply    |
| AKV CSI                      | N/A                                     | N/A       | Apply    |
| Image promotion              | N/A                                     | N/A       | Apply    |

## Status

- standalone Kubernetes roadmap: **created**
- Compose source-of-truth split: **done**
- Helm/K3s implementation: **held pending validation**