# Incident Runbooks

Date: 2026-02-14

Scope: `bytelyst-trading-bot-service`

## Severity Levels

- `SEV-1`: Active risk of financial loss or uncontrolled exposure.
- `SEV-2`: Trading degraded but risk controls still active.
- `SEV-3`: Non-critical observability or configuration issue.

## 1) Ghost Position (Exchange Open, Bot Closed)

Trigger:
- Dashboard/API shows no open position, but exchange account has an open position.

Severity:
- `SEV-1`

Immediate Actions:
1. Stop new entries for impacted profile(s): set profile status to inactive in DB.
2. Confirm live exchange position size/side via broker UI/API.
3. Manually close or hedge exchange position if risk threshold breached.
4. Capture evidence: order IDs, timestamps, profile ID, symbol, side, qty.

Bot Recovery:
1. Run reconciliation:
   - Wait for scheduled reconciliation cycle or restart bot to trigger startup reconciliation.
2. Verify `OrderStatusSyncService` has resolved related stale orders.
3. If still mismatched, update order state to `unknown` and treat as quarantined.
4. Re-enable profile only after position parity is confirmed.

Post-Incident:
1. Open RCA with timeline and root cause category:
   - exchange timeout
   - rejected exit
   - stale local state
2. Add regression test for failing path.

## 2) Stale Pending Orders

Trigger:
- Orders remain `pending_new`/non-terminal beyond expected SLA.

Severity:
- `SEV-2` (escalate to `SEV-1` if exposure is uncertain).

Immediate Actions:
1. Check stale backlog via `/health` and logs (`[OrderSync]`, `[QUARANTINE]`).
2. Validate broker status for impacted order IDs.
3. Cancel stuck live orders in broker if safe and policy-approved.

Bot Recovery:
1. Allow `OrderStatusSyncService` to run.
2. Orders older than 24h and missing on exchange must be marked `unknown` (quarantined).
3. For quarantined orders, require manual review and final status correction.

Escalation Criteria:
- Backlog > 20 for > 15 minutes.
- Repeated stale growth across multiple profiles.

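
The 24-hour quarantine rule and the backlog escalation threshold for stale pending orders can be sketched as two pure functions. This is a minimal illustration under stated assumptions, not the service's actual code: `PendingOrder`, `should_quarantine`, and `should_escalate` are hypothetical names, and the backlog samples are assumed to come from periodic polling of `/health`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Thresholds taken from the stale pending orders runbook.
QUARANTINE_AGE = timedelta(hours=24)
BACKLOG_LIMIT = 20
BACKLOG_WINDOW = timedelta(minutes=15)


@dataclass
class PendingOrder:
    """Hypothetical view of a non-terminal local order."""
    order_id: str
    created_at: datetime     # timezone-aware, UTC
    found_on_exchange: bool  # result of a broker lookup


def should_quarantine(order: PendingOrder, now: datetime) -> bool:
    """True when the order is older than 24h and missing on the exchange."""
    too_old = (now - order.created_at) > QUARANTINE_AGE
    return too_old and not order.found_on_exchange


def should_escalate(samples: list[tuple[datetime, int]], now: datetime) -> bool:
    """True when the backlog has stayed above 20 for more than 15 minutes.

    `samples` is an assumed list of (timestamp, stale_backlog_count)
    pairs, ordered oldest first.
    """
    breach_start = None
    for ts, count in samples:
        if count > BACKLOG_LIMIT:
            # Remember when the current uninterrupted breach began.
            if breach_start is None:
                breach_start = ts
        else:
            # Backlog recovered, so any earlier breach no longer counts.
            breach_start = None
    return breach_start is not None and (now - breach_start) > BACKLOG_WINDOW
```

Keeping both checks free of I/O makes them easy to unit test. The second escalation criterion (repeated stale growth across multiple profiles) needs per-profile history and is not covered by this sketch.
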
## 3) Auth Failures (API/WebSocket)

Trigger:
- Spike in `401`/`403` responses or websocket auth rejections.

Severity:
- `SEV-2` (or `SEV-1` if all trading control endpoints fail).

Immediate Actions:
1. Confirm Supabase availability and JWT issuance health.
2. Validate environment variables:
   - `SUPABASE_URL`
   - `SUPABASE_SERVICE_ROLE_KEY`
3. Verify dashboard token refresh behavior and expiration handling.

Bot Recovery:
1. Restart bot service after verifying credentials/config.
2. Validate:
   - `/health/live` returns `200`
   - `/health/ready` returns `200` (or investigate degraded fields)
3. Perform controlled API test:
   - authenticated `/api/status`
   - unauthenticated `/api/trade` should return unauthorized

Post-Incident:
1. Capture failing token claims (issuer, audience, exp, user id).
2. Record whether failure was config, infra, or app regression.

## Communication Template

Use this template in incident channel:
1. `Incident`: short title
2. `Severity`: SEV-1/2/3
3. `Impact`: profiles/symbols/orders affected
4. `Mitigation`: action taken
5. `Next Update`: timestamp (UTC)
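
To keep incident updates consistent, the five template fields can be rendered by a small helper. The sketch below is illustrative only: `IncidentUpdate` and `format_update` are hypothetical names, and the timestamp format is an assumption, since the template only requires that the time be in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

VALID_SEVERITIES = {"SEV-1", "SEV-2", "SEV-3"}


@dataclass
class IncidentUpdate:
    """Hypothetical container for the five template fields."""
    incident: str
    severity: str
    impact: str
    mitigation: str
    next_update: datetime  # must be timezone-aware


def format_update(update: IncidentUpdate) -> str:
    """Render the update as the numbered list used in the incident channel."""
    if update.severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {update.severity}")
    if update.next_update.tzinfo is None:
        raise ValueError("next_update must be timezone-aware")
    # Normalize to UTC so every update in the channel uses one time zone.
    next_utc = update.next_update.astimezone(timezone.utc)
    return "\n".join([
        f"1. Incident: {update.incident}",
        f"2. Severity: {update.severity}",
        f"3. Impact: {update.impact}",
        f"4. Mitigation: {update.mitigation}",
        f"5. Next Update: {next_utc.strftime('%Y-%m-%d %H:%M')} UTC",
    ])
```

Rejecting naive datetimes avoids posting a local time that readers would misread as UTC.
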