# Incident Runbooks
Date: 2026-02-14
Scope: `bytelyst-trading-bot-service`
## Severity Levels
- `SEV-1`: Active risk of financial loss or uncontrolled exposure.
- `SEV-2`: Trading degraded but risk controls still active.
- `SEV-3`: Non-critical observability or configuration issue.
## 1) Ghost Position (Exchange Open, Bot Closed)
Trigger:
- Dashboard/API shows no open position, but exchange account has an open position.
Severity:
- `SEV-1`
Immediate Actions:
1. Stop new entries for impacted profile(s): set profile status to inactive in DB.
2. Confirm live exchange position size/side via broker UI/API.
3. Manually close or hedge exchange position if risk threshold breached.
4. Capture evidence: order IDs, timestamps, profile ID, symbol, side, qty.
Bot Recovery:
1. Run reconciliation:
   - Wait for scheduled reconciliation cycle or restart bot to trigger startup reconciliation.
2. Verify `OrderStatusSyncService` has resolved related stale orders.
3. If still mismatched, update order state to `unknown` and treat as quarantined.
4. Re-enable profile only after position parity is confirmed.
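
A minimal sketch of what "position parity" could mean in practice: compare the bot's view of open positions with the exchange's view, symbol by symbol. The `Position` type, the `find_position_mismatches` helper, and the quantity tolerance are hypothetical names introduced for illustration; they are not part of `bytelyst-trading-bot-service`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    symbol: str
    side: str   # "long" or "short" (assumed encoding)
    qty: float


def find_position_mismatches(bot, exchange, qty_tolerance=1e-9):
    """Compare bot-side and exchange-side positions, both keyed by symbol.

    Returns a list of (symbol, reason) tuples; an empty list means parity.
    """
    mismatches = []
    for symbol in sorted(set(bot) | set(exchange)):
        b, e = bot.get(symbol), exchange.get(symbol)
        if b is None:
            # The ghost-position case this runbook covers.
            mismatches.append((symbol, "ghost: open on exchange, closed in bot"))
        elif e is None:
            mismatches.append((symbol, "open in bot, closed on exchange"))
        elif b.side != e.side:
            mismatches.append((symbol, "side mismatch"))
        elif abs(b.qty - e.qty) > qty_tolerance:
            mismatches.append((symbol, "quantity mismatch"))
    return mismatches
```

Under these assumptions, a profile would be re-enabled only when the returned list is empty for every symbol it trades.
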
Post-Incident:
1. Open RCA with timeline and root cause category:
   - exchange timeout
   - rejected exit
   - stale local state
2. Add regression test for failing path.
## 2) Stale Pending Orders
Trigger:
- Orders remain `pending_new`/non-terminal beyond expected SLA.
Severity:
- `SEV-2` (escalate to `SEV-1` if exposure is uncertain).
Immediate Actions:
1. Check stale backlog via `/health` and logs (`[OrderSync]`, `[QUARANTINE]`).
2. Validate broker status for impacted order IDs.
3. Cancel stuck live orders in broker if safe and policy-approved.
Bot Recovery:
1. Allow `OrderStatusSyncService` to run.
2. Orders older than 24h and missing on exchange must be marked `unknown` (quarantined).
3. For quarantined orders, require manual review and final status correction.
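
The quarantine rule above (older than 24h and missing on the exchange) can be sketched as a small triage function. This is an illustration only: `triage_order`, the returned labels, and the set of terminal statuses are assumptions, not the actual logic of `OrderStatusSyncService`.

```python
from datetime import datetime, timedelta, timezone

QUARANTINE_AGE = timedelta(hours=24)  # threshold taken from the rule above
# Assumed terminal statuses; the real set depends on the broker integration.
TERMINAL_STATUSES = {"filled", "canceled", "rejected", "expired"}


def triage_order(status, created_at, found_on_exchange, now):
    """Return "ok", "sync", or "quarantine" for a single order."""
    if status in TERMINAL_STATUSES:
        return "ok"
    age = now - created_at
    if age > QUARANTINE_AGE and not found_on_exchange:
        # Would be marked `unknown` and held for manual review.
        return "quarantine"
    # Still within the window, or still visible on the exchange:
    # leave it for the regular sync cycle.
    return "sync"
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to cover with the regression tests the runbook asks for.
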
Escalation Criteria:
- Backlog > 20 for > 15 minutes.
- Repeated stale growth across multiple profiles.
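
The first criterion (backlog above 20 for more than 15 minutes) could be evaluated from periodic backlog samples roughly as follows. The `should_escalate` helper and the sample format are hypothetical; how the backlog is actually sampled is not specified in this runbook.

```python
from datetime import datetime, timedelta, timezone


def should_escalate(samples, threshold=20, window=timedelta(minutes=15)):
    """Decide whether the stale backlog breach has lasted long enough.

    `samples` is a list of (timestamp, backlog) tuples in ascending time
    order. Returns True only if every sample since some point has been
    above `threshold` and that run is longer than `window`.
    """
    breach_start = None
    for ts, backlog in samples:
        if backlog > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current continuous breach
        else:
            breach_start = None    # any recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > window
```

Resetting on any dip is a deliberately conservative choice for this sketch; a team might prefer to tolerate brief recoveries.
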
## 3) Auth Failures (API/WebSocket)
Trigger:
- Spike in `401`/`403` responses or websocket auth rejections.
Severity:
- `SEV-2` (or `SEV-1` if all trading control endpoints fail).
Immediate Actions:
1. Confirm Supabase availability and JWT issuance health.
2. Validate environment variables:
   - `SUPABASE_URL`
   - `SUPABASE_SERVICE_ROLE_KEY`
3. Verify dashboard token refresh behavior and expiration handling.
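
A quick way to check that both variables are present without printing their values might look like this. `missing_env_vars` is a hypothetical helper; it only checks for presence and does not validate that the key is correct.

```python
import os

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    # Report names only; never log the service role key itself.
    print("missing:", ", ".join(missing) if missing else "none")
```
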
Bot Recovery:
1. Restart bot service after verifying credentials/config.
2. Validate:
   - `/health/live` returns `200`
   - `/health/ready` returns `200` (or investigate degraded fields)
3. Perform controlled API test:
   - authenticated `/api/status`
   - unauthenticated `/api/trade` should return unauthorized
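
The validation steps above can be sketched as a small script using only the Python standard library. Several details are assumptions: the helper names, the HTTP method for each endpoint (`/api/trade` may well require `POST`), bearer-token authentication, and treating either `401` or `403` as "unauthorized".

```python
import urllib.error
import urllib.request


def http_status(url, token=None, method="GET", timeout=5.0):
    """Return the HTTP status code for a request, including 4xx/5xx."""
    request = urllib.request.Request(url, method=method)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; the code is what we want


def evaluate(statuses):
    """Turn observed status codes into a list of failures (empty = pass)."""
    failures = []
    for path in ("/health/live", "/health/ready", "/api/status"):
        if statuses.get(path) != 200:
            failures.append(f"{path}: expected 200, got {statuses.get(path)}")
    # Called without a token, so anything other than a rejection is a failure.
    if statuses.get("/api/trade") not in (401, 403):
        failures.append(
            f"/api/trade: expected 401/403, got {statuses.get('/api/trade')}"
        )
    return failures
```

To use it, collect `http_status(base_url + path, ...)` for each of the four paths, passing a valid token only for `/api/status`, and hand the resulting dictionary to `evaluate`.
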
Post-Incident:
1. Capture failing token claims (issuer, audience, exp, user id).
2. Record whether failure was config, infra, or app regression.
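
For capturing token claims, the payload of a compact JWT can be decoded without any third-party library. This sketch does not verify the signature, so it is suitable for diagnostics only. It assumes the user id is carried in the standard `sub` claim; the function name is illustrative.

```python
import base64
import json


def decode_claims_unverified(token):
    """Decode a JWT payload WITHOUT verifying the signature.

    Returns only the claims this runbook asks for: issuer, audience,
    expiry, and subject (assumed to be the user id).
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT (expected 3 dot-separated parts)")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return {key: claims.get(key) for key in ("iss", "aud", "exp", "sub")}
```

Record the decoded claims in the RCA rather than the raw token, since a token that has not yet expired is still a usable credential.
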
## Communication Template
Use this template in the incident channel:
1. `Incident`: short title
2. `Severity`: SEV-1/2/3
3. `Impact`: profiles/symbols/orders affected
4. `Mitigation`: action taken
5. `Next Update`: timestamp (UTC)
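
If updates are posted by a script, the template could be filled in with a small formatter like the one below. `format_incident_update` is a hypothetical helper, and the exact timestamp format is an assumption; the runbook only requires UTC.

```python
from datetime import datetime, timezone

SEVERITIES = ("SEV-1", "SEV-2", "SEV-3")


def format_incident_update(title, severity, impact, mitigation, next_update):
    """Render the five template fields as plain text, one per line."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")
    if next_update.tzinfo is None:
        # Refuse naive datetimes rather than guess their timezone.
        raise ValueError("next_update must be timezone-aware")
    stamp = next_update.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return "\n".join([
        f"Incident: {title}",
        f"Severity: {severity}",
        f"Impact: {impact}",
        f"Mitigation: {mitigation}",
        f"Next Update: {stamp}",
    ])
```
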