learning_ai_invt_trdg/backend/ENTERPRISE_ARCHITECTURE_REFERENCE.md

SECTION 1 — SYSTEM OVERVIEW

  • High-level architecture diagram (textual)
    Trading Bot Service (single codebase) ←→ Supabase/Postgres for durable state ←→ Exchange Connectors (Alpaca, etc.)
    Observability Layer (Prometheus /metrics, structured logs)
    Dashboard UI reads Supabase state and subscribes to user-scoped WebSocket channels.
    Distributed workers (multi-instance) share Supabase + Exchange key + Observability feeds.

  • Runtime components

    • Trading loop: scheduled loop per profile that evaluates strategy signals, performs capital checks, acquires lock, submits ENTRY order, and invokes lifecycle RPC.
    • Monitor loop: polls exchange/account state, updates positions/orders, enforces invariant watchdogs (capital, lifecycle) and emits metrics/logs.
    • Reconciliation loop: acquires per-profile reconciliation lock, fetches full DB vs exchange open order sets, routes discrepancies through lifecycle-safe handlers, and updates metrics/health.
    • Order sync loop: keeps DB and exchange orders synced (fills, cancels) by reconciling via lifecycle flows and ledger adjustments.
  • Exchange interaction model

    • Exchange order submission always occurs before DB persistence.
    • A deterministic clientOrderId (bytelyst-${profileId}-${tradeId}) makes exchange requests idempotent.
    • Lifecycle RPC persists confirmed exchange order metadata; no retries issue new exchange orders.
  • Single exchange key multi-profile model

    • One shared exchange API key per bot deployment.
    • Profile isolation achieved via REST/WebSocket tenant scoping and capital ledger segregation.
    • Distributed lock ensures one active ENTRY per (profile_id, symbol) even with shared API key.
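The deterministic clientOrderId described above can be sketched as a pure function. The format string comes from this document; the function name buildClientOrderId and its input validation are illustrative assumptions, not the codebase's actual helper.

```typescript
// Hypothetical helper: only the "bytelyst-${profileId}-${tradeId}" format
// is taken from this document; the name and validation are assumptions.
function buildClientOrderId(profileId: string, tradeId: string): string {
  if (!profileId || !tradeId) {
    throw new Error("profileId and tradeId are required");
  }
  // Same inputs always yield the same id, so a retried submission
  // reuses the id instead of creating a second exchange order.
  return `bytelyst-${profileId}-${tradeId}`;
}
```

Because the id is derived only from profileId and tradeId, any worker can recompute it after a crash without reading local state. Exchanges usually cap the length of client order ids, so the connector's limit should be checked against the id lengths in use.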

SECTION 2 — PHASE-BY-PHASE ENTERPRISE HARDENING

Phase 1 — Tenant Isolation

  • Profiles isolated with Supabase RLS: orders/trade_history/positions rows are scoped to profile_id, enforced by RLS policies (profile owner or service_role only).
  • WebSocket scopes: runtime state broadcast filtered by user_id, preventing cross-tenant leakage.
  • Data exposure guarantees: authenticated requests only see their profile records; Realtime channels emit tenant-scoped runtime state; service tokens required for administrative access.
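The WebSocket scoping rule above amounts to a filter on user_id before any send. A minimal sketch, assuming a hypothetical in-process subscriber list (the type names and broadcastScoped are illustrative, not the service's real API):

```typescript
// Hypothetical shapes for illustration only.
interface RuntimeStateMessage {
  userId: string;
  profileId: string;
  payload: unknown;
}

interface StateSubscriber {
  userId: string;
  send: (msg: RuntimeStateMessage) => void;
}

// Deliver a runtime-state message only to subscribers owned by the same
// user, and return how many subscribers received it.
function broadcastScoped(subscribers: StateSubscriber[], msg: RuntimeStateMessage): number {
  let delivered = 0;
  for (const sub of subscribers) {
    if (sub.userId === msg.userId) {
      sub.send(msg);
      delivered++;
    }
  }
  return delivered;
}
```

The filter runs on the server side, so a client cannot widen its own scope by changing what it subscribes to.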

Phase 2 — Restart Durability

  • Startup rebuild flow: on start, for each profile, load persisted profiles, fetch exchange open positions/orders, rebuild lifecycle maps and ledger.
  • Capital rebuild logic: ledger reset per profile then rebuilt by re-playing open positions/orders from exchange state, recalculating reserved_for_positions/reserved_for_orders.
  • Lifecycle rebuild: trade lifecycle map reconstructed from persisted orders/trade_history; open positions re-linked via trade_id.
  • Pending order reconstruction: open exchange orders reinserted if missing, reconciled via lifecycle RPCs to ensure consistent DB state.
  • Deterministic state rebuild proof: deterministic parsing of exchange data + ledger reconstruction ensures restart idempotency (same inputs → same ledger state).
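The capital rebuild step can be illustrated as a pure function over exchange state. This is a sketch under stated assumptions: the unfilled part of an open order is reserved at its limit price, and an open position is reserved at its average entry price. The field names and rebuildReservations are hypothetical.

```typescript
// Hypothetical exchange-state shapes; field names are assumptions.
interface ExchangeOpenOrder {
  qty: number;
  filledQty: number;
  limitPrice: number;
}

interface ExchangeOpenPosition {
  qty: number;
  avgEntryPrice: number;
}

interface RebuiltReservations {
  reservedForOrders: number;
  reservedForPositions: number;
}

// Reset-and-replay: start from zero and sum over exchange state, so the
// result depends only on the inputs and never on cached values.
function rebuildReservations(
  orders: ExchangeOpenOrder[],
  positions: ExchangeOpenPosition[],
): RebuiltReservations {
  const reservedForOrders = orders.reduce(
    (sum, o) => sum + (o.qty - o.filledQty) * o.limitPrice,
    0,
  );
  const reservedForPositions = positions.reduce(
    (sum, p) => sum + p.qty * p.avgEntryPrice,
    0,
  );
  return { reservedForOrders, reservedForPositions };
}
```

Running the function twice on the same exchange snapshot returns the same reservations, which is the idempotency property the rebuild relies on.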

Phase 3 — Capital Ledger

  • Ledger schema: per-profile ledger table columns allocated_capital, reserved_for_orders, reserved_for_positions, realized_pnl. available_capital computed from invariant.
  • RPC guarantees: ledger updates happen via atomic RPCs; reservation occurs before exchange calls; releases on cancel/exit/fail also go through RPCs, so ledger state stays durable.
  • Reservation lifecycle: before ENTRY, profile-level mutex acquired, required capital deducted into reserved_for_orders, released upon exchange failure.
  • Invariant:
    available_capital = allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl
    always upheld after every ledger mutation.
  • Partial fill math: filled notional moves proportionally from reserved_for_orders into reserved_for_positions; partial execution persists both fill amount and remaining reservation.
  • Restart math proof: ledger rebuild sums the exact fill notional of the exchange's open orders/positions, so the invariant is recomputed identically on every restart.
  • Crash recovery proof: if crash occurs during reservation, restart logic recomputes reservation from exchange state. New reservations only re-run when capital available.
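A worked example of the invariant and the partial fill math, as a minimal sketch. The field names mirror the ledger columns in camelCase; availableCapital and applyPartialFill are hypothetical helpers, not the actual RPCs.

```typescript
// Mirrors the ledger columns named above; camelCase is an assumption.
interface CapitalLedger {
  allocatedCapital: number;
  reservedForOrders: number;
  reservedForPositions: number;
  realizedPnl: number;
}

// available_capital = allocated_capital - reserved_for_orders
//                     - reserved_for_positions + realized_pnl
function availableCapital(l: CapitalLedger): number {
  return l.allocatedCapital - l.reservedForOrders - l.reservedForPositions + l.realizedPnl;
}

// Move the filled notional from the order reservation to the position
// reservation. The total reserved amount is unchanged, so available
// capital is unchanged as well.
function applyPartialFill(l: CapitalLedger, filledNotional: number): CapitalLedger {
  if (filledNotional < 0 || filledNotional > l.reservedForOrders) {
    throw new Error("filled notional must be between 0 and reservedForOrders");
  }
  return {
    allocatedCapital: l.allocatedCapital,
    reservedForOrders: l.reservedForOrders - filledNotional,
    reservedForPositions: l.reservedForPositions + filledNotional,
    realizedPnl: l.realizedPnl,
  };
}
```

With allocated capital 1000 and an order reservation of 200, available capital is 800. A partial fill of 50 leaves 150 reserved for orders and 50 reserved for positions, and available capital stays at 800.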

Phase 4 — Transactional Lifecycle

  • Exchange-first entry flow: trade signals evaluate → capital reserve → lock → exchange order → receive order_id → call fn_persist_entry_lifecycle.
  • Lifecycle RPC flow: inserts trade_lifecycle, orders, positions, trade_history atomically; uses UNIQUE(profile_id, trade_id) guard; child inserts have ON CONFLICT DO NOTHING.
  • Idempotency keys: RPC uses trade_id + profile_id; deterministic clientOrderId; repeated calls look up existing lifecycle instead of inserting duplicates.
  • Unique constraints: trade_lifecycle(profile_id, trade_id) unique; orders(order_id) unique; positions keyed by (profile_id, trade_id); ensures duplicates cannot arise.
  • Failure handling: RPC wrapped in transaction; failures roll back entire lifecycle INSERT. On retry, unique constraint ensures safe idempotency; duplicate lifecycle fetch returns existing state.
  • Why exchange is source of truth: order is placed before any persistence. If DB commit fails, replays fetch confirmed exchange order via idempotent RPC without re-submitting.
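The idempotency behavior of the lifecycle RPC can be modeled in memory. This sketch stands in for fn_persist_entry_lifecycle with a keyed store that plays the role of the UNIQUE(profile_id, trade_id) guard; the class and its method names are hypothetical, and the real RPC runs inside a Postgres transaction.

```typescript
// Hypothetical record shape for illustration.
interface LifecycleRow {
  profileId: string;
  tradeId: string;
  exchangeOrderId: string;
}

class InMemoryLifecycleStore {
  // The composite key models UNIQUE(profile_id, trade_id).
  private readonly rows: { [key: string]: LifecycleRow } = {};
  private rowCount = 0;

  // A repeated call with the same (profileId, tradeId) returns the
  // existing row instead of inserting a duplicate.
  persistEntry(rec: LifecycleRow): { created: boolean; row: LifecycleRow } {
    const key = `${rec.profileId}:${rec.tradeId}`;
    const existing = this.rows[key];
    if (existing !== undefined) {
      return { created: false, row: existing };
    }
    this.rows[key] = rec;
    this.rowCount++;
    return { created: true, row: rec };
  }

  count(): number {
    return this.rowCount;
  }
}
```

A retry after a failed commit therefore needs only the confirmed exchange order_id; it never has to place a new exchange order.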

Phase 5 — Reconciliation

  • Deterministic comparison algorithm: for each profile, fetch entire open DB order set + recent closed set (no limit) and full exchange open set; match using order_id/client_order_id/trade_id.
  • Locking model: row-based reconciliation_locks per profile with TTL; RPCs fn_try_acquire_reconciliation_lock_row, fn_release_reconciliation_lock_row.
  • Lifecycle-safe handler routing: discrepancies processed via handlers (reconcileEntryFill, reconcileExitFill, reconcileCancel, logOrder) instead of raw updateOrderStatus.
  • Ledger adjustment routing: reconciliation uses capital ledger APIs for fills/cancels to maintain invariants.
  • Health metrics: reconciliation loop exposes reconciliationLoopHealthy, reconciliationLoopLastRun, reconciliationMismatchCount, reconciliationMissingFromExchange, reconciliationMissingInDb, reconciliationLockContentionCount.
  • Failure table: reconciliation handles DB-only orders (mark cancel via lifecycle), exchange-only orders (insert lifecycle), status mismatches (trigger lifecycle transitions), partial fills, exchange cancels.
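The comparison step can be sketched as a pure classification function. Several things here are assumptions: the match priority (order_id first, then client_order_id, then trade_id), the use of the deterministic clientOrderId format to match on trade_id, and every type and function name. Only open DB orders are considered, and discrepancies are returned for routing to lifecycle handlers rather than applied directly.

```typescript
// Hypothetical shapes for illustration.
interface DbOrder {
  profileId: string;
  tradeId: string;
  orderId?: string;
  clientOrderId?: string;
  status: string;
}

interface ExchangeOrder {
  orderId: string;
  clientOrderId?: string;
  status: string;
}

type Discrepancy =
  | { kind: "missing_from_exchange"; db: DbOrder }
  | { kind: "missing_in_db"; exchange: ExchangeOrder }
  | { kind: "status_mismatch"; db: DbOrder; exchange: ExchangeOrder };

// Try order_id, then client_order_id, then the id derived from trade_id.
function findMatch(db: DbOrder, exchangeOpen: ExchangeOrder[]): ExchangeOrder | undefined {
  const byOrderId = exchangeOpen.filter(
    (e) => db.orderId !== undefined && e.orderId === db.orderId,
  )[0];
  if (byOrderId) return byOrderId;
  const byClientId = exchangeOpen.filter(
    (e) => db.clientOrderId !== undefined && e.clientOrderId === db.clientOrderId,
  )[0];
  if (byClientId) return byClientId;
  const derivedId = `bytelyst-${db.profileId}-${db.tradeId}`;
  return exchangeOpen.filter((e) => e.clientOrderId === derivedId)[0];
}

function reconcileOpenOrders(dbOpen: DbOrder[], exchangeOpen: ExchangeOrder[]): Discrepancy[] {
  const found: Discrepancy[] = [];
  const matched: ExchangeOrder[] = [];
  for (const db of dbOpen) {
    const ex = findMatch(db, exchangeOpen);
    if (!ex) {
      found.push({ kind: "missing_from_exchange", db });
      continue;
    }
    matched.push(ex);
    if (ex.status !== db.status) {
      found.push({ kind: "status_mismatch", db, exchange: ex });
    }
  }
  // Anything on the exchange that no DB order claimed is exchange-only.
  for (const ex of exchangeOpen) {
    if (matched.indexOf(ex) === -1) {
      found.push({ kind: "missing_in_db", exchange: ex });
    }
  }
  return found;
}
```

The function iterates in input order and has no side effects, so the same two order sets always produce the same discrepancy list.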

Phase 6 — Distributed Safety

  • Row-based lock model: entry_locks(profile_id, lower(symbol)) with TTL; RPCs fn_try_acquire_entry_lock_row, fn_release_entry_lock_row.
  • Lock TTL logic: default 30s TTL; lock expires automatically if owner crashes; optimistic updates ensure quick lock turnover.
  • Owner token design: owner string processPid-uuid, stored per attempt; only matching owner can release lock.
  • Deterministic clientOrderId: bytelyst-${profileId}-${tradeId} ensures the same trade never submits a second order; the exchange rejects duplicates, and the rejection is interpreted as the existing order.
  • Multi-instance behavior: each worker attempts lock acquisition; only one obtains lock, performs ENTRY; others skip and wait for lock release.
  • Horizontal scaling model: distributed lock + shared DB/exchange key allows safe scaling; no per-instance state relied upon.
  • Deadlock prevention: TTL and owner-based release ensure locks eventually expire; finally blocks always release lock.
  • Failure table: lock acquisition failure leads to immediate entry skip; network partition releases lock via TTL; crash during exchange preserves safety because lock expires before restart.
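The lock semantics above (TTL expiry, owner-only release, lower-cased symbol key) can be modeled in memory. This is a sketch only: the real locks are rows behind fn_try_acquire_entry_lock_row and fn_release_entry_lock_row, and the class, its method names, and the explicit nowMs parameter are assumptions made so the behavior is easy to follow.

```typescript
// Hypothetical in-memory stand-in for the entry_locks table.
interface EntryLockRow {
  owner: string;
  expiresAtMs: number;
}

class InMemoryEntryLocks {
  private readonly rows: { [key: string]: EntryLockRow } = {};

  // Models the (profile_id, lower(symbol)) key.
  private static keyFor(profileId: string, symbol: string): string {
    return `${profileId}:${symbol.toLowerCase()}`;
  }

  // Succeeds if no lock exists or the existing lock has expired.
  tryAcquire(profileId: string, symbol: string, owner: string, nowMs: number, ttlMs = 30000): boolean {
    const key = InMemoryEntryLocks.keyFor(profileId, symbol);
    const current = this.rows[key];
    if (current !== undefined && current.expiresAtMs > nowMs) {
      return false;
    }
    this.rows[key] = { owner, expiresAtMs: nowMs + ttlMs };
    return true;
  }

  // Only the owner token that holds the lock may release it.
  release(profileId: string, symbol: string, owner: string): boolean {
    const key = InMemoryEntryLocks.keyFor(profileId, symbol);
    const current = this.rows[key];
    if (current === undefined || current.owner !== owner) {
      return false;
    }
    delete this.rows[key];
    return true;
  }
}
```

The owner check matters after a TTL takeover: a slow worker whose lock has expired and been re-acquired by another worker cannot release the new holder's lock.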

SECTION 3 — CRITICAL INVARIANTS

  1. No duplicate exchange order
    • Why holds: deterministic clientOrderId + row-based lock prevents re-entry; lifecycle RPC guarded by unique constraints.
  2. No lifecycle without confirmed exchange order
    • Why holds: exchange-first submission ensures RPC only called with confirmed order_id; RPC never replays exchange call.
  3. Capital cannot go negative
    • Why holds: ledger enforces check before reservation; available capital invariant prevents overspend; watchdog logs and rejects actions if invariant violated.
  4. Only one active ENTRY per (profile_id, symbol)
    • Why holds: acquisition of (profile_id, symbol) row lock before entry; TTL ensures exclusivity.
  5. Reconciliation converges to exchange truth
    • Why holds: reconciliation fetches full open order sets, uses lifecycle handlers, ledger updates, and repeats deterministically (idempotent).
  6. Restart does not corrupt ledger
    • Why holds: restart rebuild recomputes ledger from exchange open positions/orders; no reliance on cached values.
  7. Distributed workers cannot double submit
    • Why holds: distributed locks + deterministic clientOrderId + lifecycle uniqueness ensure only one worker can create a lifecycle even under concurrency.
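Invariant 3 depends on a check that runs before every reservation. A minimal sketch of that guard, assuming a hypothetical tryReserve helper that returns the updated ledger or null on rejection (the real check lives in the ledger RPCs):

```typescript
// Field names mirror the ledger columns; the shape is an assumption.
interface LedgerSnapshot {
  allocatedCapital: number;
  reservedForOrders: number;
  reservedForPositions: number;
  realizedPnl: number;
}

function computeAvailable(l: LedgerSnapshot): number {
  return l.allocatedCapital - l.reservedForOrders - l.reservedForPositions + l.realizedPnl;
}

// Reserve only if the required notional fits in available capital;
// otherwise return null and leave the ledger untouched.
function tryReserve(l: LedgerSnapshot, requiredNotional: number): LedgerSnapshot | null {
  if (requiredNotional <= 0 || requiredNotional > computeAvailable(l)) {
    return null;
  }
  return {
    allocatedCapital: l.allocatedCapital,
    reservedForOrders: l.reservedForOrders + requiredNotional,
    reservedForPositions: l.reservedForPositions,
    realizedPnl: l.realizedPnl,
  };
}
```

Realized losses reduce what can be reserved: with 1000 allocated, 200 and 500 already reserved, and a realized loss of 50, at most 250 can be reserved for a new entry.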

SECTION 4 — EXECUTION FLOW DIAGRAMS

  1. ENTRY execution
    • signal → profile-level lock acquire → capital check/reserve → deterministic clientOrderId → exchange order → lifecycle RPC (insert lifecycle/orders/position/history) → ledger update → lock release.
  2. EXIT execution
    • cancel/exit signal → lifecycle handler identifies trade_id → create exit order via exchange → lifecycle RPC atomic update (order row, lifecycle, trade_history, position) → ledger releases position reservation, adds realized_pnl.
  3. Partial fill handling
    • exchange fill update via reconciliation/monitor → partial quantityFilled → lifecycle handler adjusts reserved_for_orders → move filled notional to reserved_for_positions → ensure remaining reservation equals unfilled amount.
  4. Restart rebuild
    • service start → load profiles → fetch exchange open positions/orders → rebuild ledger reservations + lifecycle map → reconcile pending orders → resume loops.
  5. Reconciliation cycle
    • for each profile: acquire reconciliation lock → fetch full DB open + recent orders + exchange open set → deterministic matching (order_id/client_order_id/trade_id) → route through lifecycle handlers → release lock → emit metrics.
  6. Distributed lock acquisition
    • compute deterministic lock key (profile_id + symbol) → call fn_try_acquire_entry_lock_row → on success proceed → finally release via fn_release_entry_lock_row; TTL auto-expiry handles crashes.
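The ENTRY flow in item 1 can be sketched as an orchestration function with injected dependencies. The calls are synchronous here for brevity, whereas real exchange and RPC calls would be asynchronous. EntryDeps, EntryOutcome, and executeEntry are hypothetical names, and the ordering follows item 1 (lock, reserve, submit, persist, release).

```typescript
// Hypothetical dependency surface for illustration.
interface EntryDeps {
  acquireLock: () => boolean;
  releaseLock: () => void;
  reserveCapital: () => boolean;
  releaseReservation: () => void;
  // Returns the exchange order_id; throws on exchange failure.
  submitOrder: (clientOrderId: string) => string;
  persistLifecycle: (exchangeOrderId: string) => void;
}

type EntryOutcome = "skipped_lock" | "skipped_capital" | "exchange_failed" | "persisted";

function executeEntry(profileId: string, tradeId: string, deps: EntryDeps): EntryOutcome {
  if (!deps.acquireLock()) return "skipped_lock";
  try {
    if (!deps.reserveCapital()) return "skipped_capital";
    let orderId: string | null = null;
    try {
      orderId = deps.submitOrder(`bytelyst-${profileId}-${tradeId}`);
    } catch (_err) {
      // No confirmed order: give the reserved capital back.
      deps.releaseReservation();
    }
    if (orderId === null) return "exchange_failed";
    // If this throws, the reservation is kept on purpose: the order is
    // live on the exchange and a retry can persist it idempotently.
    deps.persistLifecycle(orderId);
    return "persisted";
  } finally {
    // Runs on every path after a successful acquire.
    deps.releaseLock();
  }
}
```

Placing the release in a finally block covers early returns and thrown errors alike; the lock TTL is only needed for the case where the process dies before finally can run.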

SECTION 5 — FAILURE SCENARIO TABLE

| Scenario | What happens | Why safe | Recovery behavior |
| --- | --- | --- | --- |
| Two workers race | Only one acquires row lock; other aborts | Lock ensures mutual exclusion | Winner proceeds; loser tries next signal |
| Network partition | Lock TTL expires, prevents hang | TTL avoids perpetual ownership | Worker restarts, reacquires lock after TTL |
| DB failure | Transaction aborts, no lifecycle persisted | Persistent state only changes when txn commits | Retry after DB available; idempotent RPC avoids duplicates |
| Exchange timeout | Capital reservation rolled back via mutex/finally | No exchange order submitted | Signal retries after timeout |
| Crash before lifecycle RPC | Lock TTL ensures future worker can resume | No lifecycle inserted, no capital moved | Restart replay resumes from exchange state |
| Crash after exchange but before persistence | Lifecycle RPC retried with existing clientOrderId | Unique constraints prevent duplicates | RPC idempotent insert replays once |
| Partial fill after restart | Reconciliation partial fill handler adjusts ledger | Handler moves filled notional into positions | Consistent ledger, no double counting |
| Supabase outage | RPCs fail, operations roll back | Transactions atomic; no partial writes | Retry after Supabase recovers; observers alerted |
| Lock stuck | TTL expiry clears stale lock | Hard TTL prevents deadlock | Waiting worker acquires after TTL |

SECTION 6 — HEALTH & OBSERVABILITY

  • /internal/health fields:
    tradingLoopHealthy, tradingLoopLastRun, tradingLoopDuration,
    monitorLoopHealthy, monitorLoopLastRun, monitorLoopDuration,
    reconciliationLoopHealthy, reconciliationLoopLastRun, reconciliationMismatchCount, reconciliationMissingFromExchange, reconciliationMissingInDb, reconciliationLockContentionCount,
    lockContentionCount, capitalInvariantViolations, observabilityTimestamp.
  • Loop metrics: duration histograms + last run timestamps; SLO: healthy flag true if last run < 2x expected interval.
  • Lock contention metrics: increment per failed lock acquisition; field surfaced both via /internal/health and Prometheus.
  • Reconciliation metrics: mismatch counts, missing-from-exchange, missing-in-db, lock contention.
  • Readiness signals: SLO flags combined to determine readiness; observability records degrade gracefully (logs emitted on invariant violations).
  • Degraded mode behavior: if capital invariant fails, watchdog increments violation counter, logs critical error, halts further ENTRY until resolved.
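The loop SLO above (healthy if the last run is less than 2x the expected interval old) reduces to a small predicate. A sketch, assuming timestamps in milliseconds and a hypothetical isLoopHealthy helper; a loop that has never run is treated as unhealthy here, which is an assumption rather than documented behavior.

```typescript
// lastRunMs is null when the loop has not completed a run yet.
function isLoopHealthy(lastRunMs: number | null, nowMs: number, expectedIntervalMs: number): boolean {
  if (lastRunMs === null) {
    return false;
  }
  // Strictly less than 2x the expected interval, per the SLO above.
  return nowMs - lastRunMs < 2 * expectedIntervalMs;
}
```

For a loop expected every 1000 ms, a last run 1500 ms ago is healthy, while a last run exactly 2000 ms ago is not.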

SECTION 7 — HORIZONTAL SCALING MODEL

  • Multi-worker deployment: each worker runs trading/monitor/reconciliation loops; shared DB/exchange key; distributed locks coordinate actions.
  • Shared DB: Supabase/Postgres is the single source of truth; all loops interact with same tables.
  • Shared exchange key: deterministic clientOrderId + lock prevent double submissions despite shared credentials.
  • Lock guarantees: row-based entry locks and reconciliation locks with TTL/owner ensure cross-instance exclusivity.
  • Why no duplication is possible: distributed locks + deterministic order IDs + lifecycle RPC uniqueness ensure only one worker can create a lifecycle even under concurrency.

SECTION 8 — SAFE ENHANCEMENT RULES

  • Lifecycle:
    • DO NOT bypass fn_persist_entry_lifecycle.
    • DO NOT write raw status updates; use lifecycle-safe handlers.
  • Ledger:
    • DO NOT mutate ledger without RPCs that respect invariant.
    • Always recompute available_capital as allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl.
  • Reconciliation:
    • Always acquire reconciliation_locks.
    • Route changes through lifecycle handlers.
  • Locking:
    • Always use row locks with TTL and owner tokens; release in finally block.
    • Do not assume single-process state.
  • Exchange submission:
    • Always reserve capital before exchange call.
    • Use deterministic clientOrderId.
    • Never re-submit the same trade_id; rely on idempotent failures.

DO NOT BREAK these rules; any change violating them risks duplicate executions, capital drift, or stale lifecycle data.