# Exchange API Degradation Runbook
## Incident description
The exchange API responds slowly or returns errors, affecting ENTRY/EXIT execution and reconciliation data.
## Symptoms
- The exchange latency histogram in `/metrics` shows spikes, and the exchange connector logs errors.
- `tradingLoopHealthy` or `monitorLoopHealthy` reports `false` because the loops hit timeouts.
- Logs show `exchange timeout` or repeated `429`/`503` responses.
## Metrics to check
- `/internal/health` → `tradingLoopHealthy`, `monitorLoopHealthy`, `exchangeLatencyHistogram`.
- `/metrics` → `exchange_api_latency_seconds`, `exchange_api_errors_total`, `entry_orders_rejections_total`.
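
As a rough illustration of the health check, the sketch below inspects a payload that has already been fetched from `/internal/health` and reports which loop flags are not healthy. The flag names come from this runbook; the assumption that the endpoint returns a flat JSON object with boolean flags is not confirmed here, and `unhealthy_loops` is a hypothetical helper name.

```python
from typing import Any

# Flag names are taken from this runbook; the flat JSON shape is an assumption.
LOOP_FLAGS = ("tradingLoopHealthy", "monitorLoopHealthy")


def unhealthy_loops(health: dict[str, Any]) -> list[str]:
    """Return the loop flags that are missing or not exactly True."""
    return [flag for flag in LOOP_FLAGS if health.get(flag) is not True]


if __name__ == "__main__":
    # Example payload, shaped as assumed above.
    sample = {"tradingLoopHealthy": False, "monitorLoopHealthy": True}
    print(unhealthy_loops(sample))
```

Treating a missing flag as unhealthy is a deliberately conservative choice: during degradation, an incomplete health payload is more likely a symptom than a sign of health.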
## Immediate mitigation
1. Back off new ENTRY signals for profiles if the exchange is unreachable.
2. Ensure a deterministic `clientOrderId` is assigned before any retry; retry with the same ID rather than issuing a new order.
3. Activate retry/backoff logic in connectors; log each retry with correlation IDs.
4. Inform downstream systems (dashboard, ops) about degraded state.
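
The retry steps above can be sketched as follows. This is a minimal illustration, not the connector's actual implementation: `TransientExchangeError`, `client_order_id`, `retry_with_backoff`, the hashing scheme, and all default delays are assumptions introduced for the example. The sketch derives the same `clientOrderId` for the same logical order on every attempt and retries with capped exponential backoff plus jitter, logging each retry with a correlation ID.

```python
import hashlib
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientExchangeError(Exception):
    """Hypothetical error type for timeouts and 429/503 responses."""


def client_order_id(profile: str, signal_id: str, side: str) -> str:
    """Derive the same ID for the same logical order on every attempt."""
    digest = hashlib.sha256(f"{profile}:{signal_id}:{side}".encode()).hexdigest()
    return f"coid-{digest[:24]}"


def retry_with_backoff(
    call: Callable[[], T],
    correlation_id: str,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> T:
    """Retry `call` on transient errors with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientExchangeError as exc:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, cap)
            log(
                f"retry attempt={attempt} correlation_id={correlation_id} "
                f"delay={delay:.2f}s error={exc}"
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

Because the ID depends only on the order's identifying fields, a retried request carries the same `clientOrderId` as the original, so an exchange that deduplicates on that field would not open a second position. Whether the exchange in use does deduplicate this way should be confirmed against its API documentation.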
## Expected self-recovery
- Exchange recovers and accepts pending requests; trading loop resumes once latency normalizes.
- Reconciliation loop eventually runs against fresh data; metrics fall back to baseline.
## When to escalate
- Errors persist beyond 5 minutes despite retries.
- Exchange reports credential or rate-limit problems requiring intervention.
- Business-critical trading windows are missed.

Escalate to the Exchange Account Manager and Cloud Ops; reference `docs/runbooks/exchange-degradation.md`.
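
The 5-minute criterion could be tracked with something like the sketch below. `ErrorWindow` is a hypothetical helper, and the rule that any successful call resets the window is an assumption; the runbook only states that errors must persist beyond 5 minutes despite retries.

```python
from typing import Optional

ESCALATE_AFTER_SECONDS = 5 * 60  # threshold stated in this runbook


class ErrorWindow:
    """Track how long exchange errors have persisted without a success."""

    def __init__(self) -> None:
        self.first_error_at: Optional[float] = None

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of one exchange call at time `now` (seconds)."""
        if ok:
            # Assumption: any success ends the current error streak.
            self.first_error_at = None
        elif self.first_error_at is None:
            self.first_error_at = now

    def should_escalate(self, now: float) -> bool:
        """True once errors have persisted beyond the threshold."""
        return (
            self.first_error_at is not None
            and now - self.first_error_at > ESCALATE_AFTER_SECONDS
        )
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the helper easy to test and avoids surprises from wall-clock adjustments.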
## What NOT to do
- Do not flood the exchange with retries; respect backoff policies.
- Do not change API keys mid-incident without direction from the exchange team.
- Do not pause reconciliation; accurate state is needed to diagnose missing fills.