learning_ai_invt_trdg/backend/runbooks/exchange-degradation.md

35 lines
1.7 KiB
Markdown

# Exchange API Degradation Runbook
## Incident description
The exchange API responds slowly or returns errors, affecting ENTRY/EXIT execution and reconciliation data.
## Symptoms
- Exchange latency histogram in `/metrics` shows spikes; errors logged from exchange connector.
- `tradingLoopHealthy` or `monitorLoopHealthy` flag false because loops hit timeouts.
- Logs show `exchange timeout` or repeated `429`/`503` responses.
## Metrics to check
- `/internal/health` ? `tradingLoopHealthy`, `monitorLoopHealthy`, `exchangeLatencyHistogram`.
- `/metrics` ? `exchange_api_latency_seconds`, `exchange_api_errors_total`, `entry_orders_rejections_total`.
## Immediate mitigation
1. Back off new ENTRY signals for profiles if exchange is unreachable.
2. Ensure deterministic clientOrderId is ready before retries; do not reissue new orders.
3. Activate retry/backoff logic in connectors; log each retry with correlation IDs.
4. Inform downstream systems (dashboard, ops) about degraded state.
## Expected self-recovery
- Exchange recovers and accepts pending requests; trading loop resumes once latency normalizes.
- Reconciliation loop eventually runs against fresh data; metrics fall back to baseline.
## When to escalate
- Errors persist beyond 5 minutes despite retries.
- Exchange reports credential or rate-limit problems requiring intervention.
- Business-critical trading windows are missed.
Escalate to the Exchange Account Manager and Cloud Ops; reference docs/runbooks/exchange-degradation.md.
## What NOT to do
- Do not flood the exchange with retries; respect backoff policies.
- Do not change API keys mid-incident without direction from the exchange team.
- Do not pause reconciliation; accurate state is needed to diagnose missing fills.