1.7 KiB
1.7 KiB
Exchange API Degradation Runbook
Incident description
The exchange API responds slowly or returns errors, affecting ENTRY/EXIT execution and reconciliation data.
Symptoms
- Exchange latency histogram in
/metricsshows spikes; errors logged from exchange connector. tradingLoopHealthyormonitorLoopHealthyflag false because loops hit timeouts.- Logs show
exchange timeoutor repeated429/503responses.
Metrics to check
/internal/health?tradingLoopHealthy,monitorLoopHealthy,exchangeLatencyHistogram./metrics?exchange_api_latency_seconds,exchange_api_errors_total,entry_orders_rejections_total.
Immediate mitigation
- Back off new ENTRY signals for profiles if exchange is unreachable.
- Ensure deterministic clientOrderId is ready before retries; do not reissue new orders.
- Activate retry/backoff logic in connectors; log each retry with correlation IDs.
- Inform downstream systems (dashboard, ops) about degraded state.
Expected self-recovery
- Exchange recovers and accepts pending requests; trading loop resumes once latency normalizes.
- Reconciliation loop eventually runs against fresh data; metrics fall back to baseline.
When to escalate
- Errors persist beyond 5 minutes despite retries.
- Exchange reports credential or rate-limit problems requiring intervention.
- Business-critical trading windows are missed. Escalate to the Exchange Account Manager and Cloud Ops; reference docs/runbooks/exchange-degradation.md.
What NOT to do
- Do not flood the exchange with retries; respect backoff policies.
- Do not change API keys mid-incident without direction from the exchange team.
- Do not pause reconciliation; accurate state is needed to diagnose missing fills.