Reconciliation Divergence Runbook

Incident description

The reconciliation loop has detected mismatches between the orders/positions stored in Supabase and the open orders reported by the exchange, which means the database has drifted from exchange truth.

Symptoms

  • reconciliationMismatchCount, reconciliationMissingFromExchange, or reconciliationMissingInDb metrics rise.
  • Logs show lifecycle-safe handlers executing to correct entry/exit states.
  • Dashboard shows active orders or positions that do not exist on the exchange.

Metrics to check

  • /internal/health → reconciliationLoopHealthy, reconciliationMismatchCount, reconciliationLastRun.
  • /metrics → reconciliation_mismatch_total, reconciliation_missing_from_exchange_total, reconciliation_missing_in_db_total.
  • /internal/health runtime fields:
    • reconciliationParityMismatchTrades
    • reconciliationParityQuarantinedTrades
    • reconciliationParityAutoClosedTrades
    • reconciliationParityMaxMismatchNotionalUsd
    • reconciliationParityTotalMismatchNotionalUsd
    • reconciliationIntegrityWatchdogTriggered
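
The checks above can be sketched as a small triage helper. This is a minimal illustration, not the service's own code: it assumes the /internal/health response has already been fetched and parsed into a flat dictionary keyed by the field names listed above, and that the fields are booleans or numbers. The function name and the wording of the findings are hypothetical.

```python
from typing import Any, Mapping


def summarize_reconciliation_health(health: Mapping[str, Any]) -> list[str]:
    """Return human-readable findings for a parsed /internal/health payload.

    Assumption: the payload is a flat mapping using the field names from this
    runbook. Missing fields are treated as "no signal", except the loop health
    flag, which is treated as unhealthy when absent.
    """
    findings: list[str] = []
    if not health.get("reconciliationLoopHealthy", False):
        findings.append("reconciliation loop is not reporting healthy")
    mismatches = health.get("reconciliationMismatchCount", 0)
    if mismatches > 0:
        findings.append(f"reconciliationMismatchCount is {mismatches}")
    quarantined = health.get("reconciliationParityQuarantinedTrades", 0)
    if quarantined > 0:
        findings.append(f"{quarantined} trade(s) quarantined by the parity heartbeat")
    if health.get("reconciliationIntegrityWatchdogTriggered", False):
        findings.append("integrity watchdog has been triggered")
    return findings
```

An empty result suggests the runtime fields show no active divergence; any finding is a prompt to continue with the sections below.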

Automated parity heartbeat (ghost self-healing)

  • Feature flag: ENABLE_RECON_POSITION_PARITY_HEARTBEAT=true (default is true; set false only for controlled rollback).
  • Confirmation gate: RECON_POSITION_PARITY_CONFIRMATIONS (default 3 consecutive checks).
  • Attribution safety gate: RECON_POSITION_PARITY_REQUIRE_SUBTAG_ATTRIBUTION (default true).
  • Watchdog threshold: RECON_POSITION_PARITY_MAX_NOTIONAL_PCT (default 0.5 of allocated capital).
  • Auto-resume gate: ENABLE_RECON_WATCHDOG_AUTO_RESUME=true.
  • Auto-resume delay: RECON_WATCHDOG_AUTO_RESUME_MIN_PAUSE_MS (default 900000 ms, i.e. 15 minutes).
  • Auto-resume clean streak: RECON_WATCHDOG_AUTO_RESUME_CLEAN_CYCLES (default 2).
  • Auto-resume cooldown: RECON_WATCHDOG_AUTO_RESUME_COOLDOWN_MS (default 1800000 ms, i.e. 30 minutes).
  • Dry-run mode: RECON_POSITION_PARITY_DRY_RUN=true to observe without applying synthetic exits.
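
As a rough sketch of how these settings fit together, the loader below reads the variables named above and applies the documented defaults. It is illustrative only. The ParityConfig class, the load_parity_config function, and the boolean parsing rule are hypothetical. The runbook does not state defaults for ENABLE_RECON_WATCHDOG_AUTO_RESUME or RECON_POSITION_PARITY_DRY_RUN, so the sketch assumes both are off when unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable; only the literal 'true' (any case) enables it."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() == "true"


def _int(env: Mapping[str, str], name: str, default: int) -> int:
    raw = env.get(name)
    return default if raw is None else int(raw)


@dataclass(frozen=True)
class ParityConfig:  # hypothetical container, not part of the codebase
    heartbeat_enabled: bool
    confirmations: int
    require_subtag_attribution: bool
    max_notional_pct: float
    auto_resume_enabled: bool
    auto_resume_min_pause_ms: int
    auto_resume_clean_cycles: int
    auto_resume_cooldown_ms: int
    dry_run: bool


def load_parity_config(env: Mapping[str, str] = os.environ) -> ParityConfig:
    return ParityConfig(
        heartbeat_enabled=_flag(env, "ENABLE_RECON_POSITION_PARITY_HEARTBEAT", True),
        confirmations=_int(env, "RECON_POSITION_PARITY_CONFIRMATIONS", 3),
        require_subtag_attribution=_flag(
            env, "RECON_POSITION_PARITY_REQUIRE_SUBTAG_ATTRIBUTION", True
        ),
        max_notional_pct=float(env.get("RECON_POSITION_PARITY_MAX_NOTIONAL_PCT", "0.5")),
        # Assumption: default not documented in this runbook; treated as off.
        auto_resume_enabled=_flag(env, "ENABLE_RECON_WATCHDOG_AUTO_RESUME", False),
        auto_resume_min_pause_ms=_int(env, "RECON_WATCHDOG_AUTO_RESUME_MIN_PAUSE_MS", 900_000),
        auto_resume_clean_cycles=_int(env, "RECON_WATCHDOG_AUTO_RESUME_CLEAN_CYCLES", 2),
        auto_resume_cooldown_ms=_int(env, "RECON_WATCHDOG_AUTO_RESUME_COOLDOWN_MS", 1_800_000),
        # Assumption: default not documented in this runbook; treated as off.
        dry_run=_flag(env, "RECON_POSITION_PARITY_DRY_RUN", False),
    )
```

Passing an explicit mapping instead of os.environ, for example load_parity_config({"RECON_POSITION_PARITY_DRY_RUN": "true"}), makes it easy to check what a proposed override would change before rolling it out.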

Heartbeat behavior:

  • Detects ghost lifecycle slices where virtual open qty remains but exchange position is effectively zero.
  • Requires consecutive mismatch confirmations before synthetic EXIT reconciliation.
  • Enforces sub-tag attribution before any synthetic close; unattributed slices are quarantined.
  • Triggers integrity watchdog pause when cumulative mismatch notional exceeds configured capital ratio.
  • Auto-resumes trading only when pause source is parity watchdog and reconciliation stays clean for required consecutive cycles.
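
The behavior listed above can be approximated by the sketch below. It is a simplified model, not the actual heartbeat. The Slice type, the outcome strings, the "parity_watchdog" pause-source label, and the zero-quantity tolerance are all hypothetical. It also assumes that the attribution check runs after the confirmation gate and that auto-resume needs both the minimum pause and the clean streak; the cooldown is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional

EPSILON_QTY = 1e-9  # assumed tolerance for "effectively zero" exchange position


@dataclass
class Slice:  # hypothetical view of one lifecycle slice
    virtual_open_qty: float
    exchange_position_qty: float
    sub_tag: Optional[str]
    consecutive_mismatches: int = 0


def evaluate_slice(s: Slice, confirmations: int = 3, require_subtag: bool = True) -> str:
    """Run one heartbeat check for a slice and return the resulting action."""
    is_ghost = s.virtual_open_qty > EPSILON_QTY and abs(s.exchange_position_qty) <= EPSILON_QTY
    if not is_ghost:
        s.consecutive_mismatches = 0  # a clean check breaks the streak
        return "OK"
    s.consecutive_mismatches += 1
    if s.consecutive_mismatches < confirmations:
        return "PENDING"  # wait for more consecutive confirmations
    if require_subtag and not s.sub_tag:
        return "QUARANTINE"  # unattributed slices are never auto-closed
    return "SYNTHETIC_EXIT"


def watchdog_should_pause(
    total_mismatch_notional_usd: float,
    allocated_capital_usd: float,
    max_notional_pct: float = 0.5,
) -> bool:
    """Pause when cumulative mismatch notional exceeds the capital ratio."""
    return total_mismatch_notional_usd > allocated_capital_usd * max_notional_pct


def may_auto_resume(
    pause_source: str,
    paused_for_ms: int,
    clean_cycles: int,
    min_pause_ms: int = 900_000,
    required_clean_cycles: int = 2,
) -> bool:
    """Resume only for parity-watchdog pauses with a long enough clean streak."""
    return (
        pause_source == "parity_watchdog"  # hypothetical label
        and paused_for_ms >= min_pause_ms
        and clean_cycles >= required_clean_cycles
    )
```

With the default of three confirmations, a ghost slice yields PENDING on its first two checks and is only closed or quarantined on the third, which is why a single transient mismatch should not produce a synthetic exit.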

EXIT backfill safety gates

  • RECON_EXIT_BACKFILL_REQUIRE_STRONG_ATTRIBUTION=true:
    • only uses exchange fills that are attributable to the profile/trade (sub_tag, deterministic client_order_id, or explicit trade_id hint).
    • prevents auto-backfill from consuming unrelated account activity.
  • RECON_EXIT_BACKFILL_ALLOW_HEURISTIC_MATCH=false:
    • disables heuristic assignment modes (single_open_trade, qty_unique) by default.
    • keeps unmatched rows in NO_GO for operator review instead of auto-closing.
  • RECON_EXIT_BACKFILL_FILL_AFTER_TRADE_GRACE_MINUTES=5:
    • rejects stale fill evidence that predates the lifecycle slice timestamp beyond grace.
    • blocks historical fills from being attached to newer open trades.
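
A minimal sketch of how the three gates might combine for a single fill and a single open trade follows. The Fill and OpenTrade types, their field names, and the outcome strings other than NO_GO are hypothetical. The sketch also assumes that the staleness check runs first and that heuristic matching is considered only when strong attribution is not required.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Fill:  # hypothetical shape of an exchange fill
    filled_at: datetime
    sub_tag: Optional[str] = None
    client_order_id: Optional[str] = None
    trade_id_hint: Optional[str] = None


@dataclass(frozen=True)
class OpenTrade:  # hypothetical shape of an open lifecycle slice
    trade_id: str
    sub_tag: str
    expected_client_order_id: str
    opened_at: datetime


def is_strongly_attributed(fill: Fill, trade: OpenTrade) -> bool:
    """True when sub_tag, deterministic client_order_id, or trade_id hint matches."""
    return (
        fill.sub_tag == trade.sub_tag
        or fill.client_order_id == trade.expected_client_order_id
        or fill.trade_id_hint == trade.trade_id
    )


def backfill_decision(
    fill: Fill,
    trade: OpenTrade,
    require_strong: bool = True,
    allow_heuristic: bool = False,
    grace_minutes: int = 5,
) -> str:
    # Reject fill evidence that predates the slice by more than the grace window.
    if fill.filled_at < trade.opened_at - timedelta(minutes=grace_minutes):
        return "REJECT_STALE"
    if is_strongly_attributed(fill, trade):
        return "BACKFILL"
    # Without strong attribution, keep the row for operator review unless
    # heuristic matching has been explicitly enabled.
    if require_strong or not allow_heuristic:
        return "NO_GO"
    return "HEURISTIC_CANDIDATE"
```

Under the default settings an unattributed fill always ends in NO_GO, matching the intent that unrelated account activity is reviewed by an operator instead of being consumed automatically.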

Immediate mitigation

  1. Confirm the reconciliation lock is available for the affected profile to avoid double processing.
  2. Allow the reconciliation loop to run; it will route mismatches through lifecycle-safe handlers (reconcileEntryFill, reconcileExitFill, reconcileCancel).
  3. If divergence persists, manually inspect trade_history and positions for inconsistent state.
  4. Notify stakeholders that reconciliation is running and that no manual edits should occur during the fix.

Expected self-recovery

  • Handler corrections align DB orders/positions with exchange data, and metrics return to zero.
  • The capital ledger recalculates reservations, and dashboard data becomes consistent.

When to escalate

  • Mismatch metrics stay elevated after two reconciliation runs.
  • Reconciliation lock contention prevents the loop from running.
  • Exchanges report stale or unknown fills after reconciliation.

Escalate to the trading engineering lead and reference docs/runbooks/reconciliation.md and docs/runbooks/lifecycle-incident.md for follow-up.

What NOT to do

  • Do not manually patch orders or positions tables while reconciliation is active.
  • Do not disable the reconciliation loop; divergence will only grow.
  • Do not trigger new ENTRY/EXIT flows for the affected profile until reconciliation completes.