learning_ai_invt_trdg/backend/runbooks/reconciliation.md

# Reconciliation Divergence Runbook
## Incident description
The reconciliation loop detects mismatches between Supabase orders/positions and exchange open orders, which means the database has drifted from exchange truth.
## Symptoms
- `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, or `reconciliationMissingInDb` metrics rise.
- Logs show lifecycle-safe handlers executing to correct entry/exit states.
- Dashboard shows active orders or positions that do not exist on the exchange.
## Metrics to check
- `/internal/health` → `reconciliationLoopHealthy`, `reconciliationMismatchCount`, `reconciliationLastRun`.
- `/metrics` → `reconciliation_mismatch_total`, `reconciliation_missing_from_exchange_total`, `reconciliation_missing_in_db_total`.
- `/internal/health` runtime fields:
  - `reconciliationParityMismatchTrades`
  - `reconciliationParityQuarantinedTrades`
  - `reconciliationParityAutoClosedTrades`
  - `reconciliationParityMaxMismatchNotionalUsd`
  - `reconciliationParityTotalMismatchNotionalUsd`
  - `reconciliationIntegrityWatchdogTriggered`
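
A first-pass triage of the health fields above could look like the sketch below. It is only illustrative: the field names come from this runbook, but the flat JSON payload shape, the `triage_health` helper, and the "any non-zero counter deserves a look" rule are assumptions, not the service's documented contract.

```python
from __future__ import annotations

from typing import Any, Mapping

# Field names are taken from this runbook. Treating the payload as a flat
# JSON object is an assumption about the /internal/health response.
PARITY_COUNTERS = (
    "reconciliationParityMismatchTrades",
    "reconciliationParityQuarantinedTrades",
    "reconciliationParityAutoClosedTrades",
)


def triage_health(payload: Mapping[str, Any]) -> list[str]:
    """Return human-readable findings for an already-parsed health payload."""
    findings: list[str] = []
    if payload.get("reconciliationLoopHealthy") is not True:
        findings.append("reconciliation loop is not reporting healthy")
    mismatch_count = payload.get("reconciliationMismatchCount", 0)
    if mismatch_count > 0:
        findings.append(f"mismatch count = {mismatch_count}")
    for name in PARITY_COUNTERS:
        value = payload.get(name, 0)
        if value:
            findings.append(f"{name} = {value}")
    if payload.get("reconciliationIntegrityWatchdogTriggered"):
        findings.append("integrity watchdog has triggered")
    return findings
```

Feeding it the parsed JSON body of `/internal/health` yields an empty list when nothing needs attention, which makes it easy to wire into a periodic check.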
## Automated parity heartbeat (ghost self-healing)
- Feature flag: `ENABLE_RECON_POSITION_PARITY_HEARTBEAT=true` (default is `true`; set `false` only for controlled rollback).
- Confirmation gate: `RECON_POSITION_PARITY_CONFIRMATIONS` (default `3` consecutive checks).
- Attribution safety gate: `RECON_POSITION_PARITY_REQUIRE_SUBTAG_ATTRIBUTION` (default `true`).
- Watchdog threshold: `RECON_POSITION_PARITY_MAX_NOTIONAL_PCT` (default `0.5` of allocated capital).
- Auto-resume gate: `ENABLE_RECON_WATCHDOG_AUTO_RESUME=true`.
- Auto-resume delay: `RECON_WATCHDOG_AUTO_RESUME_MIN_PAUSE_MS` (default `900000`).
- Auto-resume clean streak: `RECON_WATCHDOG_AUTO_RESUME_CLEAN_CYCLES` (default `2`).
- Auto-resume cooldown: `RECON_WATCHDOG_AUTO_RESUME_COOLDOWN_MS` (default `1800000`).
- Dry-run mode: `RECON_POSITION_PARITY_DRY_RUN=true` to observe without applying synthetic exits.
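
One way to read the four auto-resume settings together is sketched below. The numeric defaults are the ones listed above; everything else is an assumption made for illustration: the `AutoResumeConfig` and `may_auto_resume` names, the `"parity_watchdog"` pause-source label, and the interpretation of the cooldown as a minimum gap between two consecutive auto-resumes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AutoResumeConfig:
    # ENABLE_RECON_WATCHDOG_AUTO_RESUME; shown as `true` above, the actual
    # default is not stated in this runbook.
    enabled: bool = True
    min_pause_ms: int = 900_000     # RECON_WATCHDOG_AUTO_RESUME_MIN_PAUSE_MS
    clean_cycles: int = 2           # RECON_WATCHDOG_AUTO_RESUME_CLEAN_CYCLES
    cooldown_ms: int = 1_800_000    # RECON_WATCHDOG_AUTO_RESUME_COOLDOWN_MS


def may_auto_resume(
    cfg: AutoResumeConfig,
    *,
    pause_source: str,
    paused_for_ms: int,
    clean_streak: int,
    ms_since_last_auto_resume: Optional[int],
) -> bool:
    """Return True only if every auto-resume gate passes."""
    if not cfg.enabled:
        return False
    # Only pauses raised by the parity watchdog are eligible (label is hypothetical).
    if pause_source != "parity_watchdog":
        return False
    if paused_for_ms < cfg.min_pause_ms:
        return False
    if clean_streak < cfg.clean_cycles:
        return False
    # Assumed meaning of the cooldown: do not auto-resume twice in quick succession.
    if ms_since_last_auto_resume is not None and ms_since_last_auto_resume < cfg.cooldown_ms:
        return False
    return True
```

With the defaults, a watchdog pause becomes eligible after 15 minutes and two clean cycles, provided no auto-resume happened in the previous 30 minutes.
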
Heartbeat behavior:
- Detects ghost lifecycle slices where virtual open qty remains but exchange position is effectively zero.
- Requires consecutive mismatch confirmations before synthetic EXIT reconciliation.
- Enforces sub-tag attribution before any synthetic close; unattributed slices are quarantined.
- Triggers integrity watchdog pause when cumulative mismatch notional exceeds configured capital ratio.
- Auto-resumes trading only when pause source is parity watchdog and reconciliation stays clean for required consecutive cycles.
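
The per-slice decision described above can be sketched roughly as follows. This is not the service's implementation: the `Slice` shape, the `Action` names, the `QTY_EPSILON` tolerance for "effectively zero", the ordering of the confirmation and attribution checks, and treating the watchdog threshold as a plain ratio (0.5 meaning 50% of allocated capital) are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"                      # slice is not a ghost
    WAIT = "wait"                      # mismatch seen, not yet confirmed
    QUARANTINE = "quarantine"          # confirmed but unattributed
    SYNTHETIC_EXIT = "synthetic_exit"  # confirmed and attributed
    DRY_RUN_LOG = "dry_run_log"        # would exit, but dry-run is on


@dataclass
class Slice:
    virtual_open_qty: float
    exchange_qty: float
    has_sub_tag: bool
    consecutive_mismatches: int  # assumed to include the current check


QTY_EPSILON = 1e-9  # hypothetical tolerance for "effectively zero"


def decide(slice_: Slice, *, confirmations: int = 3,
           require_attribution: bool = True, dry_run: bool = False) -> Action:
    """Classify one lifecycle slice for the parity heartbeat."""
    is_ghost = (slice_.virtual_open_qty > QTY_EPSILON
                and abs(slice_.exchange_qty) <= QTY_EPSILON)
    if not is_ghost:
        return Action.NONE
    if slice_.consecutive_mismatches < confirmations:
        return Action.WAIT
    if require_attribution and not slice_.has_sub_tag:
        return Action.QUARANTINE
    return Action.DRY_RUN_LOG if dry_run else Action.SYNTHETIC_EXIT


def watchdog_should_pause(total_mismatch_notional_usd: float,
                          allocated_capital_usd: float,
                          max_ratio: float = 0.5) -> bool:
    """Pause when cumulative mismatch notional exceeds the capital ratio."""
    return total_mismatch_notional_usd > allocated_capital_usd * max_ratio
```

Keeping the decision a pure function of the slice and the settings makes dry-run mode a one-line branch and lets the gates be unit-tested without touching the exchange.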
## EXIT backfill safety gates
- `RECON_EXIT_BACKFILL_REQUIRE_STRONG_ATTRIBUTION=true`:
  - only uses exchange fills that are attributable to the profile/trade (`sub_tag`, deterministic `client_order_id`, or explicit `trade_id` hint).
  - prevents auto-backfill from consuming unrelated account activity.
- `RECON_EXIT_BACKFILL_ALLOW_HEURISTIC_MATCH=false`:
  - disables heuristic assignment modes (`single_open_trade`, `qty_unique`) by default.
  - keeps unmatched rows in `NO_GO` for operator review instead of auto-closing.
- `RECON_EXIT_BACKFILL_FILL_AFTER_TRADE_GRACE_MINUTES=5`:
  - rejects stale fill evidence that predates the lifecycle slice timestamp beyond grace.
  - blocks historical fills from being attached to newer open trades.
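
As a rough illustration of how the three gates above might combine when judging a single exchange fill, consider the sketch below. The `Fill` shape and the `fill_is_usable` helper are hypothetical, and the way the strong-attribution and heuristic flags interact (heuristics are considered only when strong attribution is not required) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

HEURISTIC_MODES = {"single_open_trade", "qty_unique"}


@dataclass
class Fill:
    filled_at: datetime
    sub_tag: Optional[str] = None
    client_order_id_matches: bool = False  # deterministic client_order_id matched
    trade_id_hint: Optional[str] = None
    heuristic_mode: Optional[str] = None   # set when only a heuristic matched


def fill_is_usable(fill: Fill, *, trade_id: str, trade_sub_tag: Optional[str],
                   slice_opened_at: datetime, require_strong: bool = True,
                   allow_heuristic: bool = False, grace_minutes: int = 5) -> bool:
    """Return True if the fill may be used to backfill an EXIT for this trade."""
    # Grace gate: reject fills that predate the slice by more than the grace window.
    if fill.filled_at < slice_opened_at - timedelta(minutes=grace_minutes):
        return False
    strong = (
        (fill.sub_tag is not None and fill.sub_tag == trade_sub_tag)
        or fill.client_order_id_matches
        or (fill.trade_id_hint is not None and fill.trade_id_hint == trade_id)
    )
    if strong:
        return True
    # Heuristic matches count only when explicitly allowed and strong
    # attribution is not required (assumed interaction of the two flags).
    return (allow_heuristic and not require_strong
            and fill.heuristic_mode in HEURISTIC_MODES)


opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = Fill(filled_at=opened - timedelta(minutes=6), sub_tag="profile-a")
print(fill_is_usable(stale, trade_id="t1", trade_sub_tag="profile-a",
                     slice_opened_at=opened))  # False: outside the 5-minute grace
```

A fill rejected here would stay unmatched, which per the gates above leaves the row in `NO_GO` for operator review.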
## Immediate mitigation
1. Confirm the reconciliation lock is available for the affected profile to avoid double processing.
2. Allow the reconciliation loop to run; it will route mismatches through lifecycle-safe handlers (`reconcileEntryFill`, `reconcileExitFill`, `reconcileCancel`).
3. If divergence persists, manually inspect `trade_history` and `positions` for inconsistent state.
4. Notify stakeholders that reconciliation is running and that no manual edits should occur during the fix.
## Expected self-recovery
- Handler corrections align DB orders/positions with exchange data, and metrics return to zero.
- The capital ledger recalculates reservations, and dashboard data becomes consistent.
## When to escalate
- Mismatch metrics stay elevated after two reconciliation runs.
- Reconciliation lock contention prevents the loop from running.
- Exchanges report stale or unknown fills after reconciliation.
Escalate to the trading engineering lead and reference `docs/runbooks/reconciliation.md` and `docs/runbooks/lifecycle-incident.md` for follow-up.
## What NOT to do
- Do not manually patch `orders` or `positions` tables while reconciliation is active.
- Do not disable the reconciliation loop; divergence will only grow.
- Do not trigger new ENTRY/EXIT flows for the affected profile until reconciliation completes.