Trading & Reconciliation Loop Health Runbook
Incident description
The trading loop or reconciliation loop is not executing at its expected cadence, which threatens data freshness and trading responsiveness.
Symptoms
- /internal/health fields tradingLoopHealthy or reconciliationLoopHealthy flip to false.
- reconciliationLastRun or loop duration metrics show no updates for longer than twice the expected interval.
- Prometheus metrics show loop duration stuck or not emitted.
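The "longer than twice the expected interval" symptom can be sketched as a small check. This is a minimal illustration, not the service's own logic: the 30-second cadence is a hypothetical value, and parsing reconciliationLastRun into a datetime is left to the caller because the runbook does not specify its format.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_run: datetime, expected_interval: timedelta, now: datetime) -> bool:
    """Return True when the last run is older than twice the expected interval."""
    return now - last_run > 2 * expected_interval


# Hypothetical values for illustration only.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
interval = timedelta(seconds=30)  # assumed loop cadence, not taken from the service config

print(is_stale(now - timedelta(seconds=45), interval, now))  # False: within 2x the interval
print(is_stale(now - timedelta(seconds=61), interval, now))  # True: older than 2x the interval
```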
Metrics to check
- /internal/health: tradingLoopHealthy, reconciliationLoopHealthy, monitorLoopHealthy, reconciliationLastRun.
- /metrics: trading_loop_duration_seconds, reconciliation_loop_duration_seconds, loop_run_failure_total.
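Both endpoints can be checked from their response bodies. The sketch below assumes /internal/health returns a flat JSON object with boolean flags and that /metrics uses the Prometheus text exposition format; neither is confirmed by this runbook. The sample bodies and the loop label are hypothetical, and fetching the bodies (for example with curl) is left out.

```python
import json

HEALTH_FLAGS = ("tradingLoopHealthy", "reconciliationLoopHealthy", "monitorLoopHealthy")
LOOP_METRICS = (
    "trading_loop_duration_seconds",
    "reconciliation_loop_duration_seconds",
    "loop_run_failure_total",
)


def unhealthy_flags(health_body: str) -> list[str]:
    """Return the health flags that are missing or not exactly true."""
    health = json.loads(health_body)  # assumption: flat JSON object
    return [flag for flag in HEALTH_FLAGS if health.get(flag) is not True]


def missing_metrics(metrics_body: str) -> list[str]:
    """Return expected metric families that have no sample line at all."""
    sample_names = set()
    for line in metrics_body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        # The sample name ends at the first "{" (labels) or space (value).
        sample_names.add(line.split("{", 1)[0].split(" ", 1)[0])
    # Histograms and summaries expose samples with these suffixes.
    suffixes = ("", "_bucket", "_sum", "_count")
    return [m for m in LOOP_METRICS
            if not any(m + s in sample_names for s in suffixes)]


# Hypothetical response bodies for illustration.
health_body = ('{"tradingLoopHealthy": true, "reconciliationLoopHealthy": false, '
               '"monitorLoopHealthy": true}')
metrics_body = (
    "# TYPE trading_loop_duration_seconds summary\n"
    "trading_loop_duration_seconds_sum 1.5\n"
    "trading_loop_duration_seconds_count 3\n"
    'loop_run_failure_total{loop="trading"} 0\n'
)

print(unhealthy_flags(health_body))   # ['reconciliationLoopHealthy']
print(missing_metrics(metrics_body))  # ['reconciliation_loop_duration_seconds']
```

A metric that is present but stuck will not show up here; detecting that needs two scrapes and a comparison of the _count values.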
Immediate mitigation
- Inspect the last log timestamp to determine whether the loop crashed or is still running but slow.
- Verify that the process is still up and not stuck on external calls (e.g., exchange or Supabase). Use a debugger or profiler if needed.
- If a loop is hung, send a kill signal to the specific loop worker (do not restart the entire service) and let auto-restart bring it back, if configured.
- Check the lock contention and capital invariant metrics to confirm that neither is blocking the loop.
Expected self-recovery
- Auto-recovery restarts the loop, or the loop continues once the blocking resource clears (e.g., the exchange call returns or Supabase writes succeed).
- /internal/health marks the loop healthy after a successful iteration.
When to escalate
- A loop fails more than twice in a row within 10 minutes.
- Loop restarts exceed the configured threshold without recovery.
- Business trades miss critical windows due to loop downtime.
Escalate via docs/runbooks/loop-health.md and notify the reliability team.
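The first escalation condition (more than two consecutive failures within 10 minutes) can be sketched as follows. This is one possible reading, not the service's alerting rule: it counts only the most recent unbroken run of failures, so a loop that has since succeeded does not trigger. The run-history shape, a list of (timestamp, succeeded) tuples, is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone


def should_escalate(runs, now, window=timedelta(minutes=10), max_consecutive=2):
    """Return True if the trailing failure streak inside the window exceeds the limit.

    runs: list of (timestamp, succeeded) tuples, sorted oldest first (assumed shape).
    """
    streak = 0
    for ts, succeeded in reversed(runs):
        # Stop at the first success or at the first run outside the window.
        if succeeded or now - ts > window:
            break
        streak += 1
    return streak > max_consecutive


# Hypothetical run history for illustration.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
runs = [
    (now - timedelta(minutes=9), True),
    (now - timedelta(minutes=6), False),
    (now - timedelta(minutes=4), False),
    (now - timedelta(minutes=1), False),
]
print(should_escalate(runs, now))  # True: three consecutive failures within 10 minutes
```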
What NOT to do
- Do not disable the health endpoint; it is the single source of truth.
- Do not restart unrelated services; focus on the affected loop.
- Do not skip verification of capital or lock metrics before concluding the loop itself is broken.