feat(infra): Phase 2.3 — memory limits across all active Docker stacks

Apply deploy.resources.limits.memory to 45 services across 5 compose files.
Limits take effect on next docker compose up (no running containers affected).

Limits derived from a 2-day Prometheus RSS baseline (average over 2026-05-27 to 2026-05-29):

  common_plat ecosystem (37 services):
    cosmos-emulator:    1g    (319 MiB baseline, can spike on writes)
    loki:               384m  (75 MiB)
    prometheus:         384m  (91 MiB, grows with series cardinality)
    node-exporter:      128m  (21 MiB, very stable)
    cadvisor:           256m  (38 MiB)
    valkey:             128m  (tiny)
    caddy:              256m  (35 MiB)
    platform-service:   512m  (61 MiB)
    extraction-service: 512m  (99 MiB, Python sidecar)
    mcp-server:         384m  (21 MiB)
    product backends:   512m  (30-65 MiB each)
    product webs:       512m  (35-93 MiB each)
    llmlab-dashboard:   512m  (Ollama proxy, larger cache budget)

  dashboard (2 services): backend 512m, web 512m
  invttrdg (2 services): backend 768m (159 MiB + heavy state writes),
                          web 256m (nginx SPA)
  clock/chronomind (2 services): backend 512m, web 512m
  notes/notelett (2 services): backend 512m, web 512m
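
As a rough sanity check, the headroom each limit leaves over its RSS baseline can be computed from the figures listed above. This is a sketch only: `parse_limit_mib` and `headroom` are hypothetical helper names, not code in this repo, and it assumes the Compose suffixes `m` and `g` are binary units (MiB, GiB). Only services with a single quoted baseline are included.

```python
# Sketch only: helper names are hypothetical and not part of this repo.
UNIT_MIB = {"m": 1, "g": 1024}  # assumption: binary units (MiB, GiB)


def parse_limit_mib(limit: str) -> int:
    """Convert a Compose-style memory string such as '512m' or '1g' to MiB."""
    value, unit = limit[:-1], limit[-1].lower()
    if unit not in UNIT_MIB:
        raise ValueError(f"unsupported unit in {limit!r}")
    return int(value) * UNIT_MIB[unit]


def headroom(limit: str, baseline_mib: float) -> float:
    """Return the limit divided by the observed RSS baseline."""
    return parse_limit_mib(limit) / baseline_mib


# (limit, baseline in MiB) pairs copied from the commit message.
SERVICES = {
    "cosmos-emulator": ("1g", 319),
    "loki": ("384m", 75),
    "prometheus": ("384m", 91),
    "node-exporter": ("128m", 21),
    "cadvisor": ("256m", 38),
    "caddy": ("256m", 35),
    "platform-service": ("512m", 61),
    "extraction-service": ("512m", 99),
    "mcp-server": ("384m", 21),
    "invttrdg backend": ("768m", 159),
}

if __name__ == "__main__":
    for name, (limit, baseline) in SERVICES.items():
        print(f"{name:20s} {limit:>5s}  {headroom(limit, baseline):5.1f}x")
```

Under those assumptions the ratios run from roughly 3.2x (cosmos-emulator) to roughly 18.3x (mcp-server), so the tightest limits are the ones to watch first.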

Ollama host process has NO limit (model load unpredictable, up to 8 GB).
trading-backend compose file not on disk — limit not applied.
gitea-npm-registry started manually — limit not applied.

Monitor OOMKill for 48h after next stack restart:
  dmesg | grep -i oom
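
If the grep output gets noisy, the kill events can be pulled out of the kernel log text with a small filter. This is a minimal sketch, not tooling from this repo: `find_oom_kills` is a hypothetical name, and the pattern assumes the kernel's usual "... out of memory: Killed process <pid> (<name>)" wording, which may differ between kernel versions.

```python
import re

# Assumed message shape (may vary by kernel version):
#   "Out of memory: Killed process <pid> (<name>) ..."
#   "Memory cgroup out of memory: Killed process <pid> (<name>) ..."
OOM_KILL = re.compile(
    r"out of memory: Killed process (\d+) \(([^)]+)\)", re.IGNORECASE
)


def find_oom_kills(log_text: str) -> list[tuple[int, str]]:
    """Return (pid, process name) for each OOM-kill line found in log_text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(log_text)]
```

Feed it the captured output of `dmesg` (for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`); an empty list means no kills matched the assumed pattern.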

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hermes VM 2026-05-29 00:49:50 +00:00
parent 42c3b9cdd5
commit 253e888a24

@@ -44,6 +44,10 @@ services:
       # Reach the host for Ollama API (port 11434) and host-only services
       - "host-gateway:host-gateway"
     restart: unless-stopped
+    deploy:
+      resources:
+        limits:
+          memory: 512m
     healthcheck:
       test: ['CMD', 'curl', '-f', 'http://localhost:4004/health']
       interval: 30s
@@ -70,6 +74,10 @@ services:
       - default
       - platform_net
     restart: unless-stopped
+    deploy:
+      resources:
+        limits:
+          memory: 512m
     depends_on:
       backend:
         condition: service_healthy