Incident Runbooks
Date: 2026-02-14
Scope: bytelyst-trading-bot-service
Severity Levels
- SEV-1: Active risk of financial loss or uncontrolled exposure.
- SEV-2: Trading degraded but risk controls still active.
- SEV-3: Non-critical observability or configuration issue.
1) Ghost Position (Exchange Open, Bot Closed)
Trigger:
- Dashboard/API shows no open position, but exchange account has an open position.
Severity:
SEV-1
Immediate Actions:
- Stop new entries for impacted profile(s): set profile status to inactive in DB.
- Confirm live exchange position size/side via broker UI/API.
- Manually close or hedge exchange position if risk threshold breached.
- Capture evidence: order IDs, timestamps, profile ID, symbol, side, qty.
Bot Recovery:
- Run reconciliation:
  - Wait for scheduled reconciliation cycle or restart bot to trigger startup reconciliation.
- Verify OrderStatusSyncService has resolved related stale orders.
- If still mismatched, update order state to unknown and treat as quarantined.
- Re-enable profile only after position parity is confirmed.
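The position parity check that gates re-enabling a profile can be sketched as follows. This is a minimal illustration, not the bot's actual reconciliation code: the Position record and find_parity_mismatches are hypothetical names, and the two input lists are assumed to come from the bot's DB and the broker API respectively.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Position:
    """Hypothetical position record; field names are assumptions."""
    symbol: str
    side: str      # "long" or "short"
    qty: Decimal


def find_parity_mismatches(bot_positions, exchange_positions):
    """Return symbols whose bot and exchange positions disagree.

    Each result is (symbol, bot_position_or_None, exchange_position_or_None).
    A ghost position shows up as (symbol, None, <exchange position>).
    """
    bot = {p.symbol: p for p in bot_positions}
    exchange = {p.symbol: p for p in exchange_positions}
    mismatches = []
    for symbol in sorted(bot.keys() | exchange.keys()):
        b, e = bot.get(symbol), exchange.get(symbol)
        if b != e:  # differs in presence, side, or quantity
            mismatches.append((symbol, b, e))
    return mismatches


if __name__ == "__main__":
    bot_view = []  # bot believes it is flat
    exchange_view = [Position("BTCUSDT", "long", Decimal("0.5"))]
    for symbol, b, e in find_parity_mismatches(bot_view, exchange_view):
        print(symbol, "bot:", b, "exchange:", e)
```

An empty result is the parity condition; anything else should keep the profile inactive.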
Post-Incident:
- Open RCA with timeline and root cause category:
  - exchange timeout
  - rejected exit
  - stale local state
- Add regression test for failing path.
2) Stale Pending Orders
Trigger:
- Orders remain pending_new/non-terminal beyond expected SLA.
Severity:
SEV-2 (escalate to SEV-1 if exposure is uncertain).
Immediate Actions:
- Check stale backlog via /health and logs ([OrderSync], [QUARANTINE]).
- Validate broker status for impacted order IDs.
- Cancel stuck live orders in broker if safe and policy-approved.
Bot Recovery:
- Allow OrderStatusSyncService to run.
- Orders older than 24h and missing on exchange must be marked unknown (quarantined).
- For quarantined orders, require manual review and final status correction.
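The quarantine rule above (older than 24h and missing on exchange) can be expressed as a small predicate. This is a sketch under stated assumptions: should_quarantine is a hypothetical helper, the set of terminal states is a guess at the service's status vocabulary, and created_at is assumed to be a timezone-aware UTC timestamp.

```python
from datetime import datetime, timedelta, timezone

QUARANTINE_AGE = timedelta(hours=24)  # threshold taken from the runbook
# Assumed terminal statuses; adjust to the service's real status names.
TERMINAL_STATES = {"filled", "canceled", "rejected", "expired"}


def should_quarantine(order_status, created_at, found_on_exchange, now=None):
    """True when a non-terminal order is older than 24h and absent on the exchange.

    created_at and now must be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    if order_status in TERMINAL_STATES or order_status == "unknown":
        return False  # already settled or already quarantined
    too_old = (now - created_at) > QUARANTINE_AGE
    return too_old and not found_on_exchange
```

Orders that are old but still present on the exchange are deliberately not quarantined here; they should be resolved from the broker status instead.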
Escalation Criteria:
- Backlog > 20 for > 15 minutes.
- Repeated stale growth across multiple profiles.
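The first escalation criterion (backlog > 20 for > 15 minutes) could be evaluated from periodic backlog samples roughly as below. The sample source is an assumption (for example, values scraped from /health at a fixed interval), and backlog_breach is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

BACKLOG_LIMIT = 20                      # from the escalation criteria
SUSTAINED_FOR = timedelta(minutes=15)   # from the escalation criteria


def backlog_breach(samples, now):
    """Check whether the backlog has stayed above the limit for long enough.

    samples: list of (timestamp, backlog_count), sorted oldest first.
    Returns True when the most recent unbroken run of samples above
    BACKLOG_LIMIT started more than SUSTAINED_FOR before `now`.
    """
    streak_start = None
    for ts, count in samples:
        if count > BACKLOG_LIMIT:
            if streak_start is None:
                streak_start = ts   # first sample of a new streak
        else:
            streak_start = None     # any dip at or below the limit resets it
    return streak_start is not None and (now - streak_start) > SUSTAINED_FOR
```

Resetting on any dip is a conservative design choice for this sketch; a noisy backlog that oscillates around the limit may warrant a different rule.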
3) Auth Failures (API/WebSocket)
Trigger:
- Spike in 401/403 responses or websocket auth rejections.
Severity:
SEV-2 (or SEV-1 if all trading control endpoints fail).
Immediate Actions:
- Confirm Supabase availability and JWT issuance health.
- Validate environment variables:
  - SUPABASE_URL
  - SUPABASE_SERVICE_ROLE_KEY
- Verify dashboard token refresh behavior and expiration handling.
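A quick presence and shape check for the two environment variables might look like the sketch below. check_auth_env is a hypothetical helper; it only confirms that the values are set and that the URL parses, not that the key is actually accepted by Supabase, and it never prints the key itself.

```python
import os
from urllib.parse import urlparse

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def check_auth_env(env=None):
    """Return a list of human-readable problems; empty means the basic checks passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(f"{name} is missing or empty")
    url = env.get("SUPABASE_URL", "").strip()
    if url:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("SUPABASE_URL is not a valid http(s) URL")
    return problems


if __name__ == "__main__":
    for problem in check_auth_env():
        print("CONFIG:", problem)
```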
Bot Recovery:
- Restart bot service after verifying credentials/config.
- Validate:
  - /health/live returns 200
  - /health/ready returns 200 (or investigate degraded fields)
- Perform controlled API test:
  - authenticated /api/status
  - unauthenticated /api/trade should return unauthorized
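The validation and controlled API test above can be scripted along these lines. This is a sketch with several assumptions: every endpoint is probed with a plain GET (the real /api/trade method may differ), the health endpoints need no token, an authenticated /api/status is expected to return 200, and "unauthorized" is taken to mean 401 or 403. http_status and run_recovery_checks are hypothetical helpers.

```python
import urllib.error
import urllib.request

# (path, send_token, acceptable status codes); expectations are assumptions
CHECKS = [
    ("/health/live", False, {200}),
    ("/health/ready", False, {200}),
    ("/api/status", True, {200}),
    ("/api/trade", False, {401, 403}),
]


def http_status(base_url, path, token=None, timeout=5.0):
    """Issue a GET and return the HTTP status code, including error statuses."""
    request = urllib.request.Request(base_url.rstrip("/") + path)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions


def run_recovery_checks(fetch_status):
    """fetch_status(path, send_token) -> int. Returns a list of (path, status) failures."""
    failures = []
    for path, send_token, expected in CHECKS:
        status = fetch_status(path, send_token)
        if status not in expected:
            failures.append((path, status))
    return failures
```

A call such as `run_recovery_checks(lambda path, send: http_status(base_url, path, token if send else None))` returns an empty list when all four checks pass; base_url and token are supplied by the operator.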
Post-Incident:
- Capture failing token claims (issuer, audience, exp, user id).
- Record whether failure was config, infra, or app regression.
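For capturing the claims of a failing token, the payload of a JWT can be read without the signing secret, as in the sketch below. unverified_claims is a hypothetical helper; it does not verify the signature, so its output is suitable for an RCA record only, never for an authorization decision. Mapping "user id" to the sub claim is an assumption about how the tokens are issued.

```python
import base64
import json


def unverified_claims(token):
    """Decode the payload segment of a JWT WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    # Keep only the fields the runbook asks for; "sub" is assumed to be the user id.
    return {key: claims.get(key) for key in ("iss", "aud", "exp", "sub")}
```

Record the returned dictionary in the RCA rather than the raw token, since the raw token is a credential until it expires.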
Communication Template
Use this template in incident channel:
Incident: short title
Severity: SEV-1/2/3
Impact: profiles/symbols/orders affected
Mitigation: action taken
Next Update: timestamp (UTC)