learning_ai_invt_trdg/backend/INCIDENT_RUNBOOKS.md

3.0 KiB

Incident Runbooks

Date: 2026-02-14 Scope: bytelyst-trading-bot-service

Severity Levels

  • SEV-1: Active risk of financial loss or uncontrolled exposure.
  • SEV-2: Trading degraded but risk controls still active.
  • SEV-3: Non-critical observability or configuration issue.

1) Ghost Position (Exchange Open, Bot Closed)

Trigger:

  • Dashboard/API shows no open position, but exchange account has an open position.

Severity:

  • SEV-1

Immediate Actions:

  1. Stop new entries for impacted profile(s): set profile status to inactive in DB.
  2. Confirm live exchange position size/side via broker UI/API.
  3. Manually close or hedge exchange position if risk threshold breached.
  4. Capture evidence: order IDs, timestamps, profile ID, symbol, side, qty.

Bot Recovery:

  1. Run reconciliation:
    • Wait for scheduled reconciliation cycle or restart bot to trigger startup reconciliation.
  2. Verify OrderStatusSyncService has resolved related stale orders.
  3. If still mismatched, update order state to unknown and treat as quarantined.
  4. Re-enable profile only after position parity is confirmed.

Post-Incident:

  1. Open RCA with timeline and root cause category:
    • exchange timeout
    • rejected exit
    • stale local state
  2. Add regression test for failing path.

2) Stale Pending Orders

Trigger:

  • Orders remain pending_new/non-terminal beyond expected SLA.

Severity:

  • SEV-2 (escalate to SEV-1 if exposure is uncertain).

Immediate Actions:

  1. Check stale backlog via /health and logs ([OrderSync], [QUARANTINE]).
  2. Validate broker status for impacted order IDs.
  3. Cancel stuck live orders in broker if safe and policy-approved.

Bot Recovery:

  1. Allow OrderStatusSyncService to run.
  2. Orders older than 24h and missing on exchange must be marked unknown (quarantined).
  3. For quarantined orders, require manual review and final status correction.

Escalation Criteria:

  • Backlog > 20 for > 15 minutes.
  • Repeated stale growth across multiple profiles.

3) Auth Failures (API/WebSocket)

Trigger:

  • Spike in 401/403 responses or websocket auth rejections.

Severity:

  • SEV-2 (or SEV-1 if all trading control endpoints fail).

Immediate Actions:

  1. Confirm Supabase availability and JWT issuance health.
  2. Validate environment variables:
    • SUPABASE_URL
    • SUPABASE_SERVICE_ROLE_KEY
  3. Verify dashboard token refresh behavior and expiration handling.

Bot Recovery:

  1. Restart bot service after verifying credentials/config.
  2. Validate:
    • /health/live returns 200
    • /health/ready returns 200 (or investigate degraded fields)
  3. Perform controlled API test:
    • authenticated /api/status
    • unauthenticated /api/trade should return unauthorized

Post-Incident:

  1. Capture failing token claims (issuer, audience, exp, user id).
  2. Record whether failure was config, infra, or app regression.

Communication Template

Use this template in incident channel:

  1. Incident: short title
  2. Severity: SEV-1/2/3
  3. Impact: profiles/symbols/orders affected
  4. Mitigation: action taken
  5. Next Update: timestamp (UTC)