# Trading & Reconciliation Loop Health Runbook
## Incident description
The trading loop or the reconciliation loop is not executing at its expected cadence, which threatens data freshness and trading responsiveness.
## Symptoms
- The `/internal/health` field `tradingLoopHealthy` or `reconciliationLoopHealthy` flips to `false`.
- `reconciliationLastRun` or the loop duration metrics show no updates for longer than twice the expected interval.
- Prometheus loop duration metrics are stuck at a constant value or are no longer emitted.
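
The "twice the expected interval" rule above can be written as a small check. This is a minimal sketch, assuming `reconciliationLastRun` is an ISO 8601 timestamp (the runbook does not state its format); `parse_timestamp` and `is_stale` are hypothetical helper names, not part of the service.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string; a trailing 'Z' is treated as UTC (assumed format)."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: timestamps without an offset are in UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_stale(last_run: datetime,
             expected_interval: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True when the last run is older than twice the expected interval."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_run) > 2 * expected_interval
```

For example, with a hypothetical 60-second reconciliation interval, `is_stale(parse_timestamp(value), timedelta(seconds=60))` reports staleness once more than 120 seconds have passed since the last run.
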
## Metrics to check
- `/internal/health` → `tradingLoopHealthy`, `reconciliationLoopHealthy`, `monitorLoopHealthy`, `reconciliationLastRun`.
- `/metrics` → `trading_loop_duration_seconds`, `reconciliation_loop_duration_seconds`, `loop_run_failure_total`.
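
The health fields listed above can be checked from a script. This is a sketch under stated assumptions: it assumes `/internal/health` returns a flat JSON object with boolean loop flags, which the runbook does not confirm, and `fetch_health`, `unhealthy_loops`, and the base URL are hypothetical.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen

# Field names come from this runbook; the flat JSON shape is an assumption.
LOOP_FLAGS = (
    "tradingLoopHealthy",
    "reconciliationLoopHealthy",
    "monitorLoopHealthy",
)


def unhealthy_loops(health: Dict[str, Any]) -> List[str]:
    """Return every loop flag that is missing or not exactly True."""
    return [flag for flag in LOOP_FLAGS if health.get(flag) is not True]


def fetch_health(base_url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """Fetch and decode the health payload (base_url is deployment specific)."""
    with urlopen(f"{base_url}/internal/health", timeout=timeout) as response:
        return json.load(response)
```

Treating a missing flag as unhealthy is a deliberate choice here: a payload that silently drops a field should not be read as a healthy loop.
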
## Immediate mitigation
1. Inspect the last log timestamp to determine whether the loop crashed or is still running but slow.
2. Verify that the process is still up and not stuck on external calls (e.g., exchange or Supabase). Use a debugger or profiler if needed.
3. If a loop is hung, send a kill signal to the specific loop worker (do not restart the entire service) and allow auto-restart if configured.
4. Check the lock contention and capital invariant metrics to confirm that neither is blocking the loop.
## Expected self-recovery
- Auto-recovery restarts the loop, or the loop continues once the blocking resource clears (e.g., an exchange call returns or Supabase writes succeed).
- `/internal/health` marks the loop healthy after a successful iteration.
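
Waiting for self-recovery can be scripted as a bounded poll. This is a minimal sketch, not the service's own recovery logic: `wait_until_healthy` is a hypothetical helper, `check` is any callable you supply that returns `True` once the loop reports healthy, and the default timeout and poll interval are placeholder values.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       timeout_s: float = 300.0,
                       poll_s: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True on recovery and False on timeout. check() is always
    called at least once, even when timeout_s is zero.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A `False` result means the loop did not recover within the window you chose, at which point the escalation criteria apply.
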
## When to escalate
- A loop fails more than twice consecutively within 10 minutes.
- Loop restarts exceed the configured threshold without recovery.
- Business trades miss critical windows due to loop downtime.

Escalate via `docs/runbooks/loop-health.md` and notify the reliability team.
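
The first escalation criterion can be evaluated mechanically from a list of loop run outcomes. This sketch reads "more than twice consecutively within 10 minutes" as three or more back-to-back failures whose timestamps span at most 10 minutes; that reading, the `should_escalate` name, and the `(finished_at, succeeded)` record shape are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def should_escalate(runs: Iterable[Tuple[datetime, bool]],
                    max_consecutive: int = 2,
                    window: timedelta = timedelta(minutes=10)) -> bool:
    """Return True if more than max_consecutive failures occur back to back
    within the given window. Each run is a (finished_at, succeeded) pair."""
    needed = max_consecutive + 1
    streak: List[datetime] = []
    for finished_at, succeeded in sorted(runs, key=lambda run: run[0]):
        if succeeded:
            streak.clear()  # a success breaks the consecutive run
            continue
        streak.append(finished_at)
        # Compare the newest failure with the one `needed - 1` failures earlier.
        if len(streak) >= needed and streak[-1] - streak[-needed] <= window:
            return True
    return False
```

A success between failures resets the streak, so intermittent failures that recover on their own do not trigger escalation under this rule.
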
## What NOT to do
- Do not disable the health endpoint; it is the single source of truth.
- Do not restart unrelated services; focus on the affected loop.
- Do not skip verification of capital or lock metrics before concluding the loop itself is broken.