# Incident Runbooks

Date: 2026-02-14
Scope: `bytelyst-trading-bot-service`

## Severity Levels

- `SEV-1`: Active risk of financial loss or uncontrolled exposure.
- `SEV-2`: Trading degraded but risk controls still active.
- `SEV-3`: Non-critical observability or configuration issue.

## 1) Ghost Position (Exchange Open, Bot Closed)

Trigger:
- Dashboard/API shows no open position, but the exchange account has an open position.

Severity:
- `SEV-1`

Immediate Actions:
1. Stop new entries for impacted profile(s): set profile status to inactive in DB.
2. Confirm live exchange position size/side via broker UI/API.
3. Manually close or hedge the exchange position if the risk threshold is breached.
4. Capture evidence: order IDs, timestamps, profile ID, symbol, side, qty.

Bot Recovery:
1. Run reconciliation:
   - Wait for the scheduled reconciliation cycle, or restart the bot to trigger startup reconciliation.
2. Verify `OrderStatusSyncService` has resolved related stale orders.
3. If still mismatched, update order state to `unknown` and treat as quarantined.
4. Re-enable the profile only after position parity is confirmed (bot and exchange agree on size and side).

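The parity condition in the recovery steps above can be sketched as a simple comparison of bot-side and exchange-side positions. This is a minimal illustration, not the service's actual reconciliation logic: the `Position` type, the `find_position_mismatches` helper, the signed-quantity convention, and the `BTCUSDT` symbol are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Hypothetical position record; qty is signed (long > 0, short < 0)."""
    symbol: str
    qty: float


def find_position_mismatches(bot, exchange, tolerance=1e-9):
    """Return {symbol: (bot_qty, exchange_qty)} for every symbol whose
    bot-side and exchange-side quantities differ by more than tolerance."""
    bot_qty = {p.symbol: p.qty for p in bot}
    exch_qty = {p.symbol: p.qty for p in exchange}
    mismatches = {}
    for symbol in sorted(set(bot_qty) | set(exch_qty)):
        b = bot_qty.get(symbol, 0.0)  # a missing symbol counts as flat
        e = exch_qty.get(symbol, 0.0)
        if abs(b - e) > tolerance:
            mismatches[symbol] = (b, e)
    return mismatches


# Ghost position: the bot believes it is flat, the exchange holds 0.5 BTCUSDT.
ghosts = find_position_mismatches(
    bot=[],
    exchange=[Position("BTCUSDT", 0.5)],
)
print(ghosts)  # {'BTCUSDT': (0.0, 0.5)}
```

An empty result is what "parity confirmed" would mean under these assumptions; any non-empty result keeps the profile inactive.
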
Post-Incident:
1. Open RCA with timeline and root cause category:
   - exchange timeout
   - rejected exit
   - stale local state
2. Add regression test for failing path.

## 2) Stale Pending Orders

Trigger:
- Orders remain in `pending_new` or another non-terminal state beyond the expected SLA.

Severity:
- `SEV-2` (escalate to `SEV-1` if exposure is uncertain).

Immediate Actions:
1. Check stale backlog via `/health` and logs (`[OrderSync]`, `[QUARANTINE]`).
2. Validate broker status for impacted order IDs.
3. Cancel stuck live orders in broker if safe and policy-approved.

Bot Recovery:
1. Allow `OrderStatusSyncService` to run.
2. Orders older than 24h and missing on the exchange must be marked `unknown` (quarantined).
3. For quarantined orders, require manual review and final status correction.

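The quarantine rule in step 2 combines two conditions: age over 24 hours and absence on the exchange. A minimal sketch of that predicate follows. The function name and its parameters are hypothetical; how `OrderStatusSyncService` actually evaluates the rule is not shown in this runbook.

```python
from datetime import datetime, timedelta, timezone

QUARANTINE_AGE = timedelta(hours=24)  # threshold taken from the runbook rule


def should_quarantine(created_at, found_on_exchange, now=None):
    """True when an order is older than 24h AND the exchange has no record of it."""
    now = now or datetime.now(timezone.utc)
    return (now - created_at) > QUARANTINE_AGE and not found_on_exchange


now = datetime(2026, 2, 14, 12, 0, tzinfo=timezone.utc)
old = now - timedelta(hours=30)
recent = now - timedelta(hours=2)

print(should_quarantine(old, found_on_exchange=False, now=now))     # True
print(should_quarantine(old, found_on_exchange=True, now=now))      # False
print(should_quarantine(recent, found_on_exchange=False, now=now))  # False
```

Both conditions must hold: an old order that the exchange still knows about is resolved from the broker status, not quarantined.
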
Escalation Criteria:
- Backlog > 20 for > 15 minutes.
- Repeated stale growth across multiple profiles.

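The first criterion is a sustained-threshold check, which is easy to get wrong when the backlog dips briefly below the limit. The sketch below assumes backlog counts are sampled periodically as `(timestamp, count)` pairs; the sampling source and the `backlog_breach` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone

BACKLOG_LIMIT = 20
SUSTAINED_FOR = timedelta(minutes=15)


def backlog_breach(samples, limit=BACKLOG_LIMIT, window=SUSTAINED_FOR):
    """samples: list of (timestamp, backlog_count) in chronological order.
    Returns True if the most recent unbroken run of samples above `limit`
    spans more than `window`."""
    run_start = None
    for ts, count in samples:
        if count > limit:
            if run_start is None:
                run_start = ts  # first sample of a new run above the limit
        else:
            run_start = None  # any dip to or below the limit resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > window


t0 = datetime(2026, 2, 14, 12, 0, tzinfo=timezone.utc)
step = timedelta(minutes=5)

sustained = [(t0 + i * step, c) for i, c in enumerate([25, 30, 28, 27, 26])]
interrupted = [(t0 + i * step, c) for i, c in enumerate([25, 30, 10, 27, 26])]

print(backlog_breach(sustained))    # True: above 20 for 20 minutes
print(backlog_breach(interrupted))  # False: the run restarted 5 minutes ago
```
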
## 3) Auth Failures (API/WebSocket)

Trigger:
- Spike in `401`/`403` responses or WebSocket auth rejections.

Severity:
- `SEV-2` (or `SEV-1` if all trading control endpoints fail).

Immediate Actions:
1. Confirm Supabase availability and JWT issuance health.
2. Validate environment variables:
   - `SUPABASE_URL`
   - `SUPABASE_SERVICE_ROLE_KEY`
3. Verify dashboard token refresh behavior and expiration handling.

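The environment variable check in step 2 can be done without printing secret values. The sketch below only reports which of the two variables named above are unset or blank; it does not confirm that the values are correct. The `missing_env_vars` helper and the sample URL are illustrative.

```python
import os

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real process environment.
sample_env = {
    "SUPABASE_URL": "https://example.supabase.co",
    "SUPABASE_SERVICE_ROLE_KEY": " ",  # whitespace only counts as missing
}
print(missing_env_vars(sample_env))  # ['SUPABASE_SERVICE_ROLE_KEY']
```

Calling `missing_env_vars()` with no arguments inspects the current process environment.
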
Bot Recovery:
1. Restart bot service after verifying credentials/config.
2. Validate:
   - `/health/live` returns `200`
   - `/health/ready` returns `200` (or investigate degraded fields)
3. Perform controlled API test:
   - authenticated `/api/status` should succeed
   - unauthenticated `/api/trade` should return unauthorized

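The validation and controlled API test above can be scripted so the same checks run after every restart. The sketch below rests on several assumptions not stated in this runbook: that all four endpoints answer `GET`, that the API expects a `Bearer` token in the `Authorization` header, that "unauthorized" means `401` or `403`, and that the service listens on `localhost:8080`. The demonstration uses a stub in place of real HTTP calls so it runs offline.

```python
import urllib.error
import urllib.request

# (path, send_auth, acceptable status codes); expected codes are assumptions
CHECKS = [
    ("/health/live", False, {200}),
    ("/health/ready", False, {200}),
    ("/api/status", True, {200}),
    ("/api/trade", False, {401, 403}),
]


def http_status(url, token=None, timeout=5.0):
    """GET url and return the HTTP status code, including 4xx/5xx codes."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urlopen raises on 4xx/5xx; the code is what we want


def run_checks(base_url, token, fetch=http_status):
    """Run every check and return a list of (path, status, passed)."""
    results = []
    for path, send_auth, expected in CHECKS:
        status = fetch(base_url + path, token if send_auth else None)
        results.append((path, status, status in expected))
    return results


def fake_fetch(url, token=None):
    """Stand-in for http_status that simulates a healthy service."""
    if url.endswith("/api/trade") and token is None:
        return 401
    return 200


for path, status, passed in run_checks("http://localhost:8080", "test-token", fetch=fake_fetch):
    print(f"{path}: {status} {'OK' if passed else 'FAIL'}")
```

Dropping the `fetch=fake_fetch` argument makes `run_checks` issue real requests. Connection failures (`urllib.error.URLError`) are deliberately not caught here, so an unreachable service stops the script instead of being reported as a failed check.
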
Post-Incident:
1. Capture failing token claims (issuer, audience, exp, user id).
2. Record whether failure was config, infra, or app regression.

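Token claims can be read for the RCA without any signing key, because a JWT payload is just base64url-encoded JSON. The sketch below decodes the payload without verifying the signature, so its output is suitable for incident notes only. It assumes the user id is carried in the standard `sub` claim; the sample token and its values are fabricated for the example.

```python
import base64
import json


def unverified_claims(token):
    """Decode a JWT payload WITHOUT verifying the signature.
    For incident notes only; never use the result for authorization."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def rca_fields(claims):
    """Pick the claims the runbook asks for; 'sub' is assumed to hold the user id."""
    return {key: claims.get(key) for key in ("iss", "aud", "exp", "sub")}


def _b64(obj):
    """Encode a dict the way JWT segments are encoded (base64url, no padding)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Fabricated token for demonstration; the signature segment is a placeholder.
sample = ".".join([
    _b64({"alg": "HS256", "typ": "JWT"}),
    _b64({"iss": "https://example.invalid/auth/v1", "aud": "authenticated",
          "exp": 1771070400, "sub": "user-123", "role": "authenticated"}),
    "signature",
])
print(rca_fields(unverified_claims(sample)))
```

Recording only the selected claims, not the raw token, avoids leaking a still-valid credential into the RCA.
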
## Communication Template

Use this template in the incident channel:

1. `Incident`: short title
2. `Severity`: SEV-1/2/3
3. `Impact`: profiles/symbols/orders affected
4. `Mitigation`: action taken
5. `Next Update`: timestamp (UTC)
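
Filling in the template by hand under pressure tends to produce missing fields or local-time timestamps. A small formatter, sketched below, enforces the five fields and converts the next-update time to UTC. The `format_incident_update` helper and the example values are hypothetical, and the function expects a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def format_incident_update(incident, severity, impact, mitigation, next_update):
    """Render the five template fields as a numbered message."""
    if severity not in {"SEV-1", "SEV-2", "SEV-3"}:
        raise ValueError(f"unknown severity: {severity}")
    if next_update.tzinfo is None:
        raise ValueError("next_update must be timezone-aware")
    stamp = next_update.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    fields = [
        ("Incident", incident),
        ("Severity", severity),
        ("Impact", impact),
        ("Mitigation", mitigation),
        ("Next Update", stamp),
    ]
    return "\n".join(
        f"{i}. {name}: {value}" for i, (name, value) in enumerate(fields, start=1)
    )


# Next update given in UTC+1; the formatter normalizes it to UTC.
message = format_incident_update(
    incident="Ghost position on one profile",
    severity="SEV-1",
    impact="1 profile, 1 symbol, 1 open exchange position",
    mitigation="Profile set inactive; position closed manually",
    next_update=datetime(2026, 2, 14, 14, 30, tzinfo=timezone(timedelta(hours=1))),
)
print(message)
```
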