learning_ai_invt_trdg/backend/runbooks/exchange-degradation.md

1.7 KiB

Exchange API Degradation Runbook

Incident description

The exchange API responds slowly or returns errors, affecting ENTRY/EXIT execution and reconciliation data.

Symptoms

  • Exchange latency histogram in /metrics shows spikes; errors logged from exchange connector.
  • tradingLoopHealthy or monitorLoopHealthy flag false because loops hit timeouts.
  • Logs show exchange timeout or repeated 429/503 responses.

Metrics to check

  • /internal/health ? tradingLoopHealthy, monitorLoopHealthy, exchangeLatencyHistogram.
  • /metrics ? exchange_api_latency_seconds, exchange_api_errors_total, entry_orders_rejections_total.

Immediate mitigation

  1. Back off new ENTRY signals for profiles if exchange is unreachable.
  2. Ensure deterministic clientOrderId is ready before retries; do not reissue new orders.
  3. Activate retry/backoff logic in connectors; log each retry with correlation IDs.
  4. Inform downstream systems (dashboard, ops) about degraded state.

Expected self-recovery

  • Exchange recovers and accepts pending requests; trading loop resumes once latency normalizes.
  • Reconciliation loop eventually runs against fresh data; metrics fall back to baseline.

When to escalate

  • Errors persist beyond 5 minutes despite retries.
  • Exchange reports credential or rate-limit problems requiring intervention.
  • Business-critical trading windows are missed. Escalate to the Exchange Account Manager and Cloud Ops; reference docs/runbooks/exchange-degradation.md.

What NOT to do

  • Do not flood the exchange with retries; respect backoff policies.
  • Do not change API keys mid-incident without direction from the exchange team.
  • Do not pause reconciliation; accurate state is needed to diagnose missing fills.