bytelyst/learning_ai_invt_trdg

Saravana Achu Mac 3cbbd6ccaa feat: scaffold trading monorepo foundation

2026-04-04 11:18:21 -07:00

1.7 KiB

Raw Blame History

Exchange API Degradation Runbook

Incident description

The exchange API responds slowly or returns errors, affecting ENTRY/EXIT execution and reconciliation data.

Symptoms

Exchange latency histogram in /metrics shows spikes; errors logged from exchange connector.
tradingLoopHealthy or monitorLoopHealthy flag false because loops hit timeouts.
Logs show exchange timeout or repeated 429/503 responses.

Metrics to check

/internal/health ? tradingLoopHealthy, monitorLoopHealthy, exchangeLatencyHistogram.
/metrics ? exchange_api_latency_seconds, exchange_api_errors_total, entry_orders_rejections_total.

Immediate mitigation

Back off new ENTRY signals for profiles if exchange is unreachable.
Ensure deterministic clientOrderId is ready before retries; do not reissue new orders.
Activate retry/backoff logic in connectors; log each retry with correlation IDs.
Inform downstream systems (dashboard, ops) about degraded state.

Expected self-recovery

Exchange recovers and accepts pending requests; trading loop resumes once latency normalizes.
Reconciliation loop eventually runs against fresh data; metrics fall back to baseline.

When to escalate

Errors persist beyond 5 minutes despite retries.
Exchange reports credential or rate-limit problems requiring intervention.
Business-critical trading windows are missed. Escalate to the Exchange Account Manager and Cloud Ops; reference docs/runbooks/exchange-degradation.md.

What NOT to do

Do not flood the exchange with retries; respect backoff policies.
Do not change API keys mid-incident without direction from the exchange team.
Do not pause reconciliation; accurate state is needed to diagnose missing fills.