bytelyst-devops-tools/dashboard/docker-compose.yml
Hermes VM 253e888a24 feat(infra): Phase 2.3 — memory limits across all active Docker stacks
Apply deploy.resources.limits.memory to 45 services across 5 compose files.
Limits take effect on next docker compose up (no running containers affected).
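
A quick audit can confirm that no service in a compose file was missed. The sketch below assumes the compose file has already been loaded into a Python dict, for example from the output of `docker compose config --format json`. The helper names and the `worker` service are hypothetical and used only for illustration.

```python
def _dig(node, *keys):
    """Walk nested dicts, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node


def services_without_memory_limit(compose: dict) -> list[str]:
    """List services that lack deploy.resources.limits.memory."""
    services = compose.get("services") or {}
    return sorted(
        name
        for name, svc in services.items()
        if not _dig(svc, "deploy", "resources", "limits", "memory")
    )


# Hypothetical input: two services with limits and one without.
SAMPLE = {
    "services": {
        "backend": {"deploy": {"resources": {"limits": {"memory": "512m"}}}},
        "web": {"deploy": {"resources": {"limits": {"memory": "512m"}}}},
        "worker": {"restart": "unless-stopped"},
    }
}

print(services_without_memory_limit(SAMPLE))
```

An empty list means every service in that file carries a limit. Containers started outside compose are not visible to this check.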

Limits are derived from a 2-day Prometheus RSS baseline (average over 2026-05-27 to 2026-05-29):

  common_plat ecosystem (37 services):
    cosmos-emulator: 1g   (319 MiB baseline, can spike on writes)
    loki:           384m  (75 MiB)
    prometheus:     384m  (91 MiB, grows with series cardinality)
    node-exporter:  128m  (21 MiB, very stable)
    cadvisor:       256m  (38 MiB)
    valkey:         128m  (tiny)
    caddy:          256m  (35 MiB)
    platform-service: 512m (61 MiB)
    extraction-service: 512m (99 MiB, Python sidecar)
    mcp-server:     384m  (21 MiB)
    product backends: 512m (30-65 MiB each)
    product webs:   512m  (35-93 MiB each)
    llmlab-dashboard: 512m (Ollama proxy, larger cache budget)

  dashboard (2 services): backend 512m, web 512m
  invttrdg (2 services): backend 768m (159 MiB + heavy state writes),
                          web 256m (nginx SPA)
  clock/chronomind (2 services): backend 512m, web 512m
  notes/notelett (2 services): backend 512m, web 512m
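
The headroom each limit leaves over its measured baseline is simple arithmetic. The sketch below uses a few of the limit and baseline pairs listed above and assumes the `m` and `g` suffixes are binary units (MiB and GiB), which is how Docker interprets them. The function names are illustrative.

```python
import re

# MiB per unit; Docker reads k, m, g as powers of 1024.
_MIB_PER_UNIT = {"k": 1 / 1024, "m": 1, "g": 1024}


def limit_to_mib(limit: str) -> float:
    """Convert a Compose-style memory string such as '384m' or '1g' to MiB."""
    match = re.fullmatch(r"(\d+)([kmg])b?", limit.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised memory limit: {limit!r}")
    value, unit = match.groups()
    return int(value) * _MIB_PER_UNIT[unit]


def headroom(limit: str, baseline_mib: float) -> float:
    """Ratio of the configured limit to the observed RSS baseline."""
    return limit_to_mib(limit) / baseline_mib


# (limit, baseline in MiB) pairs copied from the list above.
SERVICES = {
    "cosmos-emulator": ("1g", 319),
    "loki": ("384m", 75),
    "prometheus": ("384m", 91),
    "node-exporter": ("128m", 21),
    "mcp-server": ("384m", 21),
}

for name, (limit, baseline) in SERVICES.items():
    print(f"{name:16s} {limit:>5s}  {headroom(limit, baseline):5.1f}x baseline")
```

Every listed service gets at least about 3x its baseline. The smallest margin among these is cosmos-emulator, which is also the one noted as spiking on writes.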

Ollama host process has NO limit (model load unpredictable, up to 8 GB).
trading-backend compose file not on disk — limit not applied.
gitea-npm-registry started manually — limit not applied.

Monitor for OOM kills for 48h after the next stack restart:
  dmesg | grep -i oom
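
Per-container state can complement the kernel log. The sketch below assumes the JSON array printed by `docker inspect` is available as a string and reads each container's `State.OOMKilled` flag. The sample data is hypothetical and only mirrors the shape of that output.

```python
import json


def oom_killed(inspect_json: str) -> list[str]:
    """Return names of containers whose State.OOMKilled flag is true."""
    containers = json.loads(inspect_json)
    return [
        c.get("Name", "").lstrip("/")
        for c in containers
        if (c.get("State") or {}).get("OOMKilled", False)
    ]


# Hypothetical sample in the shape `docker inspect` emits (a JSON array).
SAMPLE = json.dumps([
    {"Name": "/devops-backend", "State": {"OOMKilled": False, "ExitCode": 0}},
    {"Name": "/devops-web", "State": {"OOMKilled": True, "ExitCode": 137}},
])

print(oom_killed(SAMPLE))
```

In practice the input could come from something like `docker inspect $(docker ps -aq)`. The flag describes only the container's most recent state, so the dmesg check above is still worth keeping.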

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 05:26:49 +00:00


# Production-mode compose for DevOps Dashboard
# Usage:
#   docker compose up --build
#
# Requires:
#   - backend/.env populated (copy from backend/.env.example)
#   - web/.env.local populated (copy from web/.env.local.example)
#
# For hot-reload dev mode use:
#   docker compose -f docker-compose.yml -f docker-compose.dev.yml up
services:
  # ---------------------------------------------------------------------------
  # Backend: DevOps API service
  # ---------------------------------------------------------------------------
  backend:
    build:
      context: .
      dockerfile: backend/Dockerfile
      args:
        BYTELYST_PACKAGE_SOURCE: ${BYTELYST_PACKAGE_SOURCE:-vendor}
    container_name: devops-backend
    env_file:
      - backend/.env
    environment:
      - VM_SCRIPTS_PATH=/vm-scripts/VMs/HostingerVM
      - VM_LOG_DIR=/host-logs
    ports:
      - '127.0.0.1:4004:4004'
    networks:
      - default
      - platform_net
    volumes:
      # Read-only access to VM management scripts
      - /opt/bytelyst/learning_ai_devops_tools/scripts:/vm-scripts:ro
      # Read-write access to VM log files (cleanup + health-check write here)
      - /var/log/vm-cleanup.log:/host-logs/vm-cleanup.log
      - /var/log/vm-health-check.log:/host-logs/vm-health-check.log
      - /var/log/docker-watchdog.log:/host-logs/docker-watchdog.log
      # Docker socket: allows running docker commands against the host daemon
      # (same pattern as Portainer/cAdvisor; container already runs as root)
      - /var/run/docker.sock:/var/run/docker.sock
    extra_hosts:
      # Reach the host for Ollama API (port 11434) and host-only services
      - "host-gateway:host-gateway"
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 512m
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:4004/health']
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # ---------------------------------------------------------------------------
  # Web: Next.js dashboard
  # ---------------------------------------------------------------------------
  web:
    build:
      context: .
      dockerfile: web/Dockerfile
      args:
        BYTELYST_PACKAGE_SOURCE: ${BYTELYST_PACKAGE_SOURCE:-vendor}
        NEXT_PUBLIC_PRODUCT_ID: ${NEXT_PUBLIC_PRODUCT_ID:-devops}
        NEXT_PUBLIC_PLATFORM_URL: https://api.bytelyst.com/platform/api
        NEXT_PUBLIC_DEVOPS_API_URL: https://api.bytelyst.com/devops
    container_name: devops-web
    ports:
      - '127.0.0.1:3049:3000'
    networks:
      - default
      - platform_net
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 512m
    depends_on:
      backend:
        condition: service_healthy
    environment:
      - NODE_ENV=production

networks:
  default: {}
  platform_net:
    external: true
    name: learning_ai_common_plat_default