Compare commits d60c81ebda...b15c570587 (2 commits: d9618ba7b0, b15c570587)
@@ -57,31 +57,31 @@ These listeners were bound on `0.0.0.0` and/or `[::]` during review.
 | `22` | `sshd` | host systemd | direct SSH | `public-direct` | Keep public only after SSH key hardening |
 | `80`, `443` | `caddy` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | public ingress | `public-caddy` | Keep public |
 | `3000` | `notelett-web` | `/opt/bytelyst/learning_ai_notes/docker-compose.yml` | `notes.bytelyst.com` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `3002` | `lysnrai-dashboard` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `needs-decision` | Private/admin or retire direct exposure |
-| `3003` | `tracker-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `tracker.bytelyst.com` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
+| `3002` | `lysnrai-dashboard` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
+| `3003` | `tracker-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `tracker.bytelyst.com` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; keep Caddy route |
 | `3030` | `chronomind-web` | `/root/bytelyst.ai/repos/learning_ai_clock/docker-compose.yml` | `clock.bytelyst.com` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `3035` | `jarvisjr-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `needs-decision` | Unhealthy; classify as private/admin or retire |
-| `3040` | `flowmonk-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `needs-decision` | Unhealthy; classify as private/admin or retire |
+| `3035` | `jarvisjr-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
+| `3040` | `flowmonk-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
 | `3049` | `devops-web` | `/opt/bytelyst/bytelyst-devops-tools/dashboard/docker-compose.yml` | `devops.bytelyst.com` | `private-admin` with direct bypass | Fix old repo path drift, then bind loopback/private |
-| `3050` | `mindlyst-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `needs-decision` | Unhealthy; classify as private/admin or retire |
+| `3050` | `mindlyst-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
 | `3055` | `nomgap-web` | orphan from older `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `retire` | Retired on 2026-05-27; current Compose says Nomgap web is deployed to Vercel |
-| `3060` | `actiontrail-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `needs-decision` | Unhealthy; classify as private/admin or retire |
-| `3070` | `localmemgpt-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `needs-decision` | Unhealthy; classify as private/admin or retire |
-| `3075` | `llmlab-dashboard` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `llmlab.bytelyst.com` | `private-admin` with direct bypass | Dashboard unhealthy; gate or retire |
+| `3060` | `actiontrail-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
+| `3070` | `localmemgpt-web` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27; still needs public/private product decision |
+| `3075` | `llmlab-dashboard` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `llmlab.bytelyst.com` | `private-admin` | Bound to `127.0.0.1` on 2026-05-27; still needs auth/private gate for Caddy route |
 | `3085` | `invttrdg-web` | `/opt/bytelyst/learning_ai_invt_trdg/docker-compose.yml` | `invttrdg.bytelyst.com` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
 | `3100` | `loki` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
 | `3300` | `gitea-npm-registry` | non-Compose container labels absent | `gitea.bytelyst.com` | `public-caddy` with direct bypass | Bind loopback or private; keep Caddy route |
 | `4004` | `devops-backend` | `/opt/bytelyst/learning_ai_devops_tools/dashboard/docker-compose.yml` | `api.bytelyst.com/devops/*` | `private-admin` with direct bypass | Bind loopback/private |
-| `4010` | `peakpulse-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/peakpulse/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
+| `4010` | `peakpulse-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/peakpulse/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
 | `4011` | `chronomind-backend` | `/root/bytelyst.ai/repos/learning_ai_clock/docker-compose.yml` | `api.bytelyst.com/chronomind/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `4012` | `jarvisjr-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/jarvisjr/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `4013` | `nomgap-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/nomgap/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `4014` | `mindlyst-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/mindlyst/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `4015` | `lysnrai-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/lysnrai/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
+| `4012` | `jarvisjr-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/jarvisjr/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
+| `4013` | `nomgap-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/nomgap/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
+| `4014` | `mindlyst-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/mindlyst/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
+| `4015` | `lysnrai-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/lysnrai/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
 | `4016` | `notelett-backend` | `/opt/bytelyst/learning_ai_notes/docker-compose.yml` | `api.bytelyst.com/notelett/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `4017` | `flowmonk-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/flowmonk/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `4019` | `localmemgpt-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/localmemgpt/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
-| `4020` | `actiontrail-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/actiontrail/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
+| `4017` | `flowmonk-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/flowmonk/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
+| `4019` | `localmemgpt-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/localmemgpt/*` | `public-caddy` | Host port removed by Compose recreate on 2026-05-27; keep Caddy route |
+| `4020` | `actiontrail-backend` | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | `api.bytelyst.com/actiontrail/*` | `public-caddy` | Bound to `127.0.0.1` on 2026-05-27; route mapping still needs Caddy/product verification |
 | `4025` | `invttrdg-backend` | `/opt/bytelyst/learning_ai_invt_trdg/docker-compose.yml` | `api.bytelyst.com/invttrdg/*` | `public-caddy` with direct bypass | Bind loopback or remove host port after Caddy smoke |
 | `1025` | `mailpit` SMTP | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
 | `8025` | `mailpit` UI | `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` | none found in Caddy | `loopback-only` | Bound to `127.0.0.1` on 2026-05-27 |
@@ -182,6 +182,7 @@ Effective `sshd -T` settings showed:
 - [ ] For each service, decide one of: public via Caddy, private via Tailscale/SSH, loopback-only host port, Docker-internal only, or remove.
 - [ ] Bind non-public Compose ports to `127.0.0.1` or remove host port mapping entirely.
 - [x] Internal emulator/mail/observability ports `1025`, `8025`, `10000`, `1234`, `8081`, and `3100` are loopback-bound.
+- [x] Common-platform direct app/API bypasses are loopback-bound or removed from host publishing.
 - [ ] Add a `DOCKER-USER` chain policy to drop unsolicited traffic to non-approved published ports before Docker's accept rules.
 - [ ] Keep only `80/443` and intentionally public SSH exposed at the provider/firewall layer.
 - [ ] Add a recurring check that compares `ss -ltn` and Docker published ports against the approved inventory.
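The recurring-check item in the list above could be prototyped roughly as follows. This is a minimal sketch under stated assumptions, not a script from the repository: the function name `unapproved_wildcard_ports` and the `APPROVED_PORTS` default are hypothetical, and it assumes `ss -ltnH` output with the local address in the fourth column.

```shell
#!/usr/bin/env bash
# Hypothetical allowlist: SSH plus Caddy ingress, mirroring the "keep public" rows.
APPROVED_PORTS="${APPROVED_PORTS:-22 80 443}"

# Reads `ss -ltnH` output on stdin and prints every wildcard-bound TCP port
# that is missing from the space-separated allowlist passed as $1.
unapproved_wildcard_ports() {
  approved=" $1 "
  awk '{print $4}' | while read -r addr; do
    host="${addr%:*}"   # everything before the last colon
    port="${addr##*:}"  # everything after the last colon
    case "$host" in
      '0.0.0.0'|'*'|'[::]') ;;   # wildcard binds are the ones of interest
      *) continue ;;             # loopback or specific-address binds are skipped
    esac
    case "$approved" in
      *" $port "*) ;;            # approved, nothing to report
      *) echo "$port" ;;
    esac
  done | sort -un
}
```

A timer-driven wrapper might call `ss -ltnH | unapproved_wildcard_ports "$APPROVED_PORTS"` and alert when the output is non-empty.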
@@ -391,6 +392,7 @@ Effective `sshd -T` settings showed:

 - [ ] Close or loopback-bind non-public Docker host ports.
 - [x] Loopback-bound internal emulator/mail/observability ports `1025`, `8025`, `10000`, `1234`, `8081`, and `3100`.
+- [x] Closed/loopback-bound common-platform direct app/API bypasses.
 - [ ] Add `DOCKER-USER` default-deny rules for non-approved ports.
 - [ ] Harden SSH root/password access after key-based access is verified.
 - [ ] Put `ollama.bytelyst.com`, admin dashboards, and dev tooling behind private/auth-gated access unless explicitly approved as public.
@@ -565,6 +567,36 @@ Minimum post-checks for Phase 1:
 - Public direct bypass remains for app/API ports, Gitea direct port `3300`, devops/admin surfaces, and Ollama `11434`.
 - Add a `DOCKER-USER` fallback policy after the remaining allowlist is reviewed.

+### 2026-05-27 — Phase 1 common-platform app/API bypasses
+
+**Changed:**
+
+- Updated `/opt/bytelyst/learning_ai_common_plat/docker-compose.ecosystem.yml` so remaining published common-platform web/dashboard ports bind to `127.0.0.1`.
+- Recreated the common-platform web/dashboard services that previously published on `0.0.0.0`: `tracker-web`, `lysnrai-dashboard`, `jarvisjr-web`, `flowmonk-web`, `mindlyst-web`, `actiontrail-web`, `localmemgpt-web`, and `llmlab-dashboard`.
+- Recreated stale common-platform backend containers `peakpulse-backend`, `lysnrai-backend`, and `nomgap-backend`; their current Compose definitions do not publish host ports, so the old direct `4010`, `4015`, and `4013` mappings were removed.
+
+**Verified:**
+
+- `docker compose -f docker-compose.ecosystem.yml --env-file .env.ecosystem config --quiet` passed.
+- `docker ps --filter name=learning_ai_common_plat ... | grep 0.0.0.0` returned no common-platform wildcard-published containers.
+- `docker ps --filter health=unhealthy` returned no unhealthy containers.
+- `ss -ltnp` shows `3002`, `3003`, `3035`, `3040`, `3050`, `3060`, `3070`, and `3075` bound to `127.0.0.1`.
+- Host smoke checks returned HTTP `200` for `3002`, `3003`, `3035`, `3040`, `3050`, `3060`, `3070`, and `3075`.
+
+**Committed/pushed:**
+
+- `learning_ai_common_plat`: `e29cc58a` (`fix: bind app host ports to loopback`) pushed to GitHub.
+
+**Remaining wildcard Docker publishes after this checkpoint:**
+
+- Caddy public ingress: `80`, `443`.
+- Local Gitea direct port: `3300`.
+- DevOps dashboard/API: `3049`, `4004`.
+- Notes direct ports: `3000`, `4016`.
+- Clock direct ports: `3030`, `4011`.
+- InvtTrdg direct ports: `3085`, `4025`.
+- Host Ollama still listens on wildcard `11434`.
+
 ## Do Not Start With

 - Rootless Docker migration.
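The wildcard-publish verification recorded in the checkpoint above greps for `0.0.0.0`; a publish that shows up only as an IPv6 wildcard would slip past that pattern. The sketch below is one way to cover both forms. The function name `wildcard_published` is hypothetical, and the input is assumed to be tab-separated `name<TAB>ports` lines such as `docker ps --format '{{.Names}}\t{{.Ports}}'` produces.

```shell
#!/usr/bin/env bash
# Reads "name<TAB>ports" lines on stdin and prints the names of containers
# with at least one port published on an IPv4 or IPv6 wildcard address.
# Loopback publishes (127.0.0.1:...) and unpublished ports (e.g. "4013/tcp")
# do not match.
wildcard_published() {
  awk -F'\t' '$2 ~ /(^|, )(0\.0\.0\.0|::|\[::\]):[0-9-]+->/ { print $1 }'
}
```

For example, `docker ps --format '{{.Names}}\t{{.Ports}}' | wildcard_published` should list only the containers that still belong in the "remaining wildcard Docker publishes" inventory.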
scripts/VMs/HostingerVM/docker-health-watchdog.sh (executable file, 63 lines changed)
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+# =============================================================================
+# docker-health-watchdog.sh — restart containers stuck in unhealthy state
+#
+# Systemd timer invokes this every 10 minutes.
+# A container is only restarted after 3 consecutive failing health checks
+# (i.e. the last 3 entries in .State.Health.Log all have ExitCode != 0).
+# Docker logs entries at each container's own healthcheck interval, so the grace
+# window for transient failures is about three such intervals, not 30 minutes.
+#
+# Log: /var/log/docker-watchdog.log
+# =============================================================================
+set -Eeuo pipefail
+
+LOG=/var/log/docker-watchdog.log
+TOKEN_FILE="${HERMES_HOME:-/root/.hermes}/.env"
+
+log() { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" | tee -a "$LOG" 2>/dev/null || true; }
+
+notify_telegram() {
+  local msg="$1"
+  local token chat_id
+  token=$(grep -oP '(?<=TELEGRAM_BOT_TOKEN=)\S+' "$TOKEN_FILE" 2>/dev/null || true)
+  chat_id=$(grep -oP '(?<=TELEGRAM_CHAT_ID=)\S+' "$TOKEN_FILE" 2>/dev/null || true)
+  [[ -z "$token" || -z "$chat_id" ]] && return
+  curl -sf -X POST "https://api.telegram.org/bot${token}/sendMessage" \
+    -d chat_id="$chat_id" \
+    -d text="$msg" > /dev/null 2>&1 || true
+}
+
+if ! command -v docker &>/dev/null || ! docker info &>/dev/null; then
+  log "Docker not available — skipping watchdog run"
+  exit 0
+fi
+
+mapfile -t unhealthy < <(docker ps --filter health=unhealthy --format '{{.Names}}' 2>/dev/null || true)
+
+if (( ${#unhealthy[@]} == 0 )); then
+  exit 0
+fi
+
+log "Unhealthy containers detected: ${unhealthy[*]}"
+
+for container in "${unhealthy[@]}"; do
+  # Count how many of the last 3 health check log entries failed (ExitCode != 0)
+  failures=$(docker inspect "$container" 2>/dev/null | python3 -c "
+import json, sys
+data = json.load(sys.stdin)
+if not data:
+    print(0); sys.exit()
+log = data[0].get('State', {}).get('Health', {}).get('Log', [])
+recent = log[-3:] if len(log) >= 3 else log
+print(sum(1 for e in recent if e.get('ExitCode', 0) != 0))
+" 2>/dev/null || echo 0)
+
+  if [[ "$failures" -eq 3 ]]; then
+    log "Auto-restarting $container (unhealthy 3/3 consecutive checks)"
+    docker restart "$container" 2>&1 | head -1 | tee -a "$LOG" || true
+    notify_telegram "🔄 docker-watchdog restarted $container (3 consecutive unhealthy health checks) — $(hostname)"
+  else
+    log "$container is unhealthy but only $failures/3 consecutive failures — waiting"
+  fi
+done
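The failure count in the watchdog above could also be computed without the embedded Python. The sketch below is an alternative under stated assumptions, not the committed approach: `recent_failures` is a hypothetical helper that receives health-check exit codes as arguments, oldest first, and reports how many of the last three are non-zero.

```shell
#!/usr/bin/env bash
# Prints how many of the last three exit codes (arguments, oldest first)
# are non-zero. Fewer than three arguments are all counted.
recent_failures() {
  local count=0 code
  # Discard everything except the three most recent entries.
  while [ "$#" -gt 3 ]; do shift; done
  for code in "$@"; do
    if [ "$code" -ne 0 ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

The exit codes could be fed from Docker's Go-template output, for instance `recent_failures $(docker inspect --format '{{range .State.Health.Log}}{{.ExitCode}} {{end}}' "$container")`, where the entries come from the container's own healthcheck runs.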
@@ -30,8 +30,11 @@ LOAD_WARN=4.0 # absolute (not per-CPU)
 LOAD_CRIT=8.0
 RAM_FREE_WARN_GB=3 # GB available
 RAM_FREE_CRIT_GB=1
-SWAP_USED_WARN_GB=1
+SWAP_USED_WARN_GB=1.5
 SWAP_USED_CRIT_GB=3
+SWAP_CACHED_WARN_MB=200 # early-warning: recent swap pressure even if current usage looks ok
+STEAL_WARN=5 # % steal time
+STEAL_CRIT=15
 CONTAINER_RESTART_WARN=10
 CONTAINER_RESTART_CRIT=50
 BUILD_CACHE_WARN_GB=5
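Because shell integer arithmetic cannot compare against a fractional threshold such as `1.5`, one common workaround is to scale both sides to tenths before comparing. The helper below is a minimal sketch of that idea; the name `swap_at_least` is hypothetical, and values are truncated to 0.1 GiB granularity.

```shell
#!/usr/bin/env bash
# Succeeds when used_kb (KiB, $1) is at or above threshold_gb (GiB, $2).
# Both values are converted to whole tenths of a GiB so that an integer
# comparison is enough.
swap_at_least() {
  local used_kb="$1" threshold_gb="$2" used_10x limit_10x
  used_10x=$(awk -v kb="$used_kb" 'BEGIN { printf "%d", (kb / 1024 / 1024) * 10 }')
  limit_10x=$(awk -v t="$threshold_gb" 'BEGIN { printf "%d", t * 10 }')
  [ "$used_10x" -ge "$limit_10x" ]
}
```

With this helper, `swap_at_least 1572864 1.5` succeeds (exactly 1.5 GiB used), while `swap_at_least 1048576 1.5` fails (1.0 GiB used).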
@@ -161,24 +164,66 @@ check_memory() {
   fi
 }

+check_steal() {
+  header "CPU STEAL"
+  # Requires two /proc/stat samples 1s apart — single sample gives lifetime average, not current.
+  local s1 s2
+  s1=$(awk '/^cpu /{print $9" "$2+$3+$4+$5+$6+$7+$8+$9+$10}' /proc/stat)
+  sleep 1
+  s2=$(awk '/^cpu /{print $9" "$2+$3+$4+$5+$6+$7+$8+$9+$10}' /proc/stat)
+  local steal_pct
+  steal_pct=$(awk -v s1="$s1" -v s2="$s2" 'BEGIN{
+    split(s1,a," "); split(s2,b," ")
+    delta_steal=b[1]-a[1]; delta_total=b[2]-a[2]
+    if (delta_total == 0) { printf "0.0"; exit }
+    printf "%.1f", (delta_steal/delta_total)*100
+  }')
+  local steal_int
+  steal_int=$(awk -v v="$steal_pct" 'BEGIN{printf "%d", v}')
+
+  if (( steal_int >= STEAL_CRIT )); then record steal CRIT "${steal_pct}%" "CPU steal ${steal_pct}% — CRITICAL (host is overcommitted)"
+  elif (( steal_int >= STEAL_WARN )); then record steal WARN "${steal_pct}%" "CPU steal ${steal_pct}% — WARNING (host contention; degrades LLM inference)"
+  else record steal OK "${steal_pct}%" "CPU steal OK (${steal_pct}%)"
+  fi
+}
+
 check_swap() {
   header "SWAP"
-  local swap_total_kb swap_used_kb
+  local swap_total_kb swap_free_kb swap_cached_kb
   swap_total_kb=$(awk '/^SwapTotal/ {print $2}' /proc/meminfo)
-  swap_used_kb=$(awk '/^SwapFree/ {print $2}' /proc/meminfo)
-  swap_used_kb=$(( swap_total_kb - swap_used_kb ))
-  local swap_total_gb swap_used_gb
+  swap_free_kb=$(awk '/^SwapFree/ {print $2}' /proc/meminfo)
+  swap_cached_kb=$(awk '/^SwapCached/ {print $2}' /proc/meminfo)
+  local swap_used_kb
+  swap_used_kb=$(( swap_total_kb - swap_free_kb ))
+  local swap_total_gb
   swap_total_gb=$(( swap_total_kb / 1024 / 1024 ))
-  swap_used_gb=$(( swap_used_kb / 1024 / 1024 ))
+  local swap_cached_mb
+  swap_cached_mb=$(( swap_cached_kb / 1024 ))

   if (( swap_total_kb == 0 )); then
     record swap CRIT "no swap" "NO SWAP configured — CRITICAL (add swapfile!)"
     return
   fi

-  if (( swap_used_gb >= SWAP_USED_CRIT_GB )); then record swap CRIT "${swap_used_gb}G used" "Swap ${swap_used_gb}G used — CRITICAL"
-  elif (( swap_used_gb >= SWAP_USED_WARN_GB )); then record swap WARN "${swap_used_gb}G used" "Swap ${swap_used_gb}G used — WARNING"
-  else record swap OK "${swap_used_gb}G / ${swap_total_gb}G" "Swap OK (${swap_used_gb}G used)"
+  # Compare used GB using awk to handle the fractional threshold (1.5)
+  local used_gb_10x warn_10x crit_10x
+  used_gb_10x=$(awk -v kb="$swap_used_kb" 'BEGIN{printf "%d", (kb/1024/1024)*10}')
+  warn_10x=$(awk -v t="$SWAP_USED_WARN_GB" 'BEGIN{printf "%d", t*10}')
+  crit_10x=$(awk -v t="$SWAP_USED_CRIT_GB" 'BEGIN{printf "%d", t*10}')
+  local swap_used_display
+  swap_used_display=$(awk -v kb="$swap_used_kb" 'BEGIN{printf "%.1fG", kb/1024/1024}')
+
+  if (( used_gb_10x >= crit_10x )); then
+    record swap CRIT "${swap_used_display} used" "Swap ${swap_used_display} used — CRITICAL"
+  elif (( used_gb_10x >= warn_10x )); then
+    record swap WARN "${swap_used_display} used" "Swap ${swap_used_display} used — WARNING (>${SWAP_USED_WARN_GB}G)"
+  elif (( swap_cached_mb >= SWAP_CACHED_WARN_MB )); then
+    # SwapCached is pages reclaimed from swap still sitting in cache — indicates
+    # recent memory pressure even though current usage looks ok.
+    record swap WARN "${swap_used_display} used, ${swap_cached_mb}MB cached" \
+      "Swap pressure indicator: SwapCached ${swap_cached_mb}MB — recent memory pressure (threshold ${SWAP_CACHED_WARN_MB}MB)"
+  else
+    record swap OK "${swap_used_display} / ${swap_total_gb}G" "Swap OK (${swap_used_display} used, ${swap_cached_mb}MB cached)"
   fi
 }

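The two-sample steal calculation in `check_steal` can be isolated into a pure function for testing. The sketch below is an illustration, not the committed code: `steal_percent` is a hypothetical name, it takes two complete `cpu ...` lines from `/proc/stat`, and it sums fields 2 through 9 only (user through steal), on the assumption that guest time is already accounted for within user time.

```shell
#!/usr/bin/env bash
# Prints the steal percentage between two aggregate "cpu ..." lines taken
# from /proc/stat at different times ($1 = earlier, $2 = later).
steal_percent() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Fields 2..9: user nice system idle iowait irq softirq steal
    for (i = 2; i <= 9; i++) { t1 += x[i]; t2 += y[i] }
    dt = t2 - t1
    if (dt <= 0) { print "0.0"; exit }
    printf "%.1f\n", ((y[9] - x[9]) / dt) * 100
  }'
}
```

A caller might capture `s1=$(grep '^cpu ' /proc/stat)`, sleep for a second, capture `s2` the same way, and then run `steal_percent "$s1" "$s2"`.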
@@ -310,6 +355,7 @@ fi

 check_disk
 check_load
+check_steal
 check_memory
 check_swap
 check_docker_containers
systemd/docker-health-watchdog.service (normal file, 12 lines changed)
@@ -0,0 +1,12 @@
+[Unit]
+Description=Restart Docker containers stuck in unhealthy state
+Documentation=file:///usr/local/bin/docker-health-watchdog.sh
+After=docker.service
+Requires=docker.service
+
+[Service]
+Type=oneshot
+User=root
+Group=root
+Environment="HERMES_HOME=/root/.hermes"
+ExecStart=/usr/local/bin/docker-health-watchdog.sh
systemd/docker-health-watchdog.timer (normal file, 11 lines changed)
@@ -0,0 +1,11 @@
+[Unit]
+Description=Run Docker health watchdog every 10 minutes
+After=docker.service
+
+[Timer]
+OnBootSec=5min
+OnUnitActiveSec=10min
+AccuracySec=30s
+
+[Install]
+WantedBy=timers.target
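The diff adds the unit files but does not show how they are installed. The following is a hedged sketch of one plausible installation sequence: `install_watchdog` is a hypothetical helper, the `/etc/systemd/system/` destination is an assumption, and the script destination follows the `ExecStart` path in the service unit. By default it only prints the commands; setting `RUN=1` executes them.

```shell
#!/usr/bin/env bash
# Prints (default) or runs (RUN=1) the steps to install the watchdog script
# and its systemd units from a repository checkout at $1 (default: ".").
install_watchdog() {
  local repo="${1:-.}" run="echo"
  if [ "${RUN:-0}" = "1" ]; then run=""; fi
  $run install -m 0755 "$repo/scripts/VMs/HostingerVM/docker-health-watchdog.sh" /usr/local/bin/docker-health-watchdog.sh
  $run install -m 0644 "$repo/systemd/docker-health-watchdog.service" "$repo/systemd/docker-health-watchdog.timer" /etc/systemd/system/
  $run systemctl daemon-reload
  $run systemctl enable --now docker-health-watchdog.timer
}
```

After a real run, `systemctl list-timers docker-health-watchdog.timer` should show the next scheduled activation.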