learning_ai_invt_trdg/backend/runbooks/loop-health.md

Trading & Reconciliation Loop Health Runbook

Incident description

The trading loop or reconciliation loop is not executing at its expected cadence, which threatens data freshness and trading responsiveness.

Symptoms

  • /internal/health fields tradingLoopHealthy or reconciliationLoopHealthy flip to false.
  • reconciliationLastRun or loop duration metrics show no updates for longer than twice the expected interval.
  • Prometheus loop duration metrics stop changing or are no longer emitted.
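The "twice the expected interval" rule above can be expressed as a small check. This is a minimal sketch, not the service's actual logic: the function name `is_stale` and its parameters are hypothetical, and it assumes reconciliationLastRun can be parsed into a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_run: datetime,
             expected_interval: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True when the last run is older than twice the expected interval.

    Hypothetical helper: `last_run` stands in for a parsed
    reconciliationLastRun value; the real field format is an assumption.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_run > 2 * expected_interval
```

For a loop expected every 60 seconds, a last run more than 120 seconds old counts as stale under this rule.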

Metrics to check

  • /internal/health → tradingLoopHealthy, reconciliationLoopHealthy, monitorLoopHealthy, reconciliationLastRun.
  • /metrics → trading_loop_duration_seconds, reconciliation_loop_duration_seconds, loop_run_failure_total.
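One way to read these two endpoints programmatically is sketched below. The field and metric names come from the list above; everything else is an assumption, including that /internal/health returns a flat JSON object with boolean flags and that /metrics uses the standard Prometheus text exposition format. The helper names `unhealthy_loops` and `sum_metric` are hypothetical.

```python
# Health flags listed in this runbook. The flat-boolean payload shape is an assumption.
LOOP_FLAGS = ("tradingLoopHealthy", "reconciliationLoopHealthy", "monitorLoopHealthy")


def unhealthy_loops(health: dict) -> list:
    """Return the flags that are not explicitly true (missing counts as unhealthy)."""
    return [flag for flag in LOOP_FLAGS if health.get(flag) is not True]


def sum_metric(metrics_text: str, name: str) -> float:
    """Sum all samples of one metric across label sets in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample name ends at the label block or at the first space.
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        if sample_name != name:
            continue
        # The value is the first field after the optional label block.
        rest = line.rsplit("}", 1)[-1] if "{" in line else line[len(name):]
        total += float(rest.split()[0])
    return total
```

The inputs would typically be the decoded response bodies of the two endpoints, for example fetched with `urllib.request.urlopen` against whatever base URL the service is deployed on.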

Immediate mitigation

  1. Inspect the last log timestamp to determine whether the loop crashed or is still running but slow.
  2. Verify that the process is still up and not stuck on an external call (e.g., to the exchange or Supabase). Use a debugger or profiler if needed.
  3. If a loop is hung, send a kill signal to that specific loop worker (do not restart the entire service) and allow it to auto-restart if configured.
  4. Check the lock contention and capital invariant metrics to confirm that neither a held lock nor a failed capital invariant is blocking the loop.
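The triage in steps 1 to 3 can be summarized as a decision function. This is only a sketch under stated assumptions: the `LoopState` names, the `triage` helper, and the thresholds (one and two expected intervals of log silence) are illustrative and are not taken from the service.

```python
from enum import Enum


class LoopState(Enum):
    CRASHED = "crashed"  # process is gone; rely on the service's restart handling
    HUNG = "hung"        # process is up but the loop has gone silent
    SLOW = "slow"        # loop is still logging, but later than expected
    OK = "ok"


def triage(process_alive: bool,
           seconds_since_last_log: float,
           expected_interval_s: float) -> LoopState:
    """Classify a loop from process liveness and log silence.

    Thresholds are assumptions: silence beyond 2x the interval is treated
    as hung, silence beyond 1x the interval as slow.
    """
    if not process_alive:
        return LoopState.CRASHED
    if seconds_since_last_log > 2 * expected_interval_s:
        return LoopState.HUNG
    if seconds_since_last_log > expected_interval_s:
        return LoopState.SLOW
    return LoopState.OK
```

Only the hung case calls for signalling the loop worker; a slow loop is better investigated through the external-call and lock checks first.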

Expected self-recovery

  • Auto-recovery restarts the loop, or the loop continues once the blocking resource clears (e.g., an exchange call returns or Supabase writes succeed).
  • /internal/health marks the loop healthy after successful iteration.
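To confirm recovery without watching the endpoint by hand, a bounded polling loop can help. The sketch below is hypothetical: `wait_until_healthy` is not part of the service, and `check` is any caller-supplied function that returns True once /internal/health reports the loop healthy.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       timeout_s: float,
                       poll_s: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    `sleep` and `clock` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A False result after the timeout is a signal to move on to the escalation criteria rather than to keep waiting.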

When to escalate

  • A loop fails more than twice in a row within 10 minutes.
  • Loop restarts exceed the configured threshold without recovery.
  • Business trades miss critical windows due to loop downtime.

Escalate via docs/runbooks/loop-health.md and notify the reliability team.
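The first escalation criterion (more than two consecutive failures within 10 minutes) can be checked mechanically. This is a minimal sketch assuming run history is available as ordered (timestamp, succeeded) pairs; the `should_escalate` helper and that input shape are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def should_escalate(runs: Iterable[Tuple[datetime, bool]],
                    max_consecutive: int = 2,
                    window: timedelta = timedelta(minutes=10)) -> bool:
    """Return True when more than `max_consecutive` failures occur back to back
    within `window`. `runs` must be ordered oldest first."""
    streak = []  # timestamps of the current unbroken run of failures
    for ts, succeeded in runs:
        if succeeded:
            streak = []  # a success breaks the streak
            continue
        streak.append(ts)
        # Keep only failures inside the window that ends at this failure.
        streak = [t for t in streak if ts - t <= window]
        if len(streak) > max_consecutive:
            return True
    return False
```

Three failures at minutes 0, 3, and 6 trigger escalation under these defaults, while the same failures spread over 12 minutes, or interrupted by a success, do not.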

What NOT to do

  • Do not disable the health endpoint; it is the single source of truth.
  • Do not restart unrelated services; focus on the affected loop.
  • Do not skip verification of capital or lock metrics before concluding the loop itself is broken.