learning_ai_invt_trdg/backend/architecture.md


Bytelyst Trading Platform Architecture Reference

SECTION 1 — SYSTEM OVERVIEW

  • High-level architecture diagram (textual) — A single trading service instance drives a trading loop that evaluates profiles, places orders through exchange connectors, and sends confirmed state into Supabase. Supabase dispatches data to the dashboard, the monitoring loop, and the reconciliation loop. Operational health is published via the /internal/health endpoint, while observability metrics flow to the Prometheus /metrics exporter.
  • Runtime components:
    • Trading loop — Executes profile signals, reserves capital, acquires row-based locks, delegates confirmed exchange orders to transactional lifecycle RPCs, and emits audit logs.
    • Monitor loop — Collects exchange syncs, ensures lifecycle state mirrors exchange fills, and feeds capital/position snapshots back into Supabase.
    • Reconciliation loop — Locks per profile, fetches full open orders from exchange and database, routes mismatches through lifecycle-safe handlers, and updates metrics for parity, miss counts, and lock contention.
    • Order sync loop — Aggregates lifecycle history, updates the public dashboard data, and ensures active orders, positions, and trade history remain aligned with Supabase slices.
  • Exchange interaction model — The trading loop treats the exchange as the source of truth: it first places an order, receives an exchange order_id alongside a deterministically generated clientOrderId, and only then persists lifecycle data. Subsequent reconciliation cycles compare Supabase rows to the exchange's reported open orders and trigger lifecycle handlers rather than raw database patches.
  • Single exchange key multi-profile model — A shared exchange API key serves multiple profiles, but isolation is maintained through per-profile capital ledgers, row-based locking, and RLS policies; no profile may observe or affect another profile's runtime state.

SECTION 2 — PHASE-BY-PHASE ENTERPRISE HARDENING

Phase 1 — Tenant Isolation

  • Purpose: Prevent cross-profile data leakage and enforce per-user scoping.
  • Problem solved: Without isolation, one authenticated user could receive global runtime state or execute orders for another profile.
  • Core mechanisms: Supabase RLS filters on the orders and trade_history tables, WebSocket broadcasts that partition runtime state per authenticated user (payloads scoped by user_id), and runtime checks that reject exchange submissions for a mismatched profile_id.
  • Must never break: The global runtime state must never be broadcast without tenant attribution, and no query should bypass RLS.
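
The per-user scoping described above can be sketched in a few lines. This is an illustration only: the `ProfileState` shape and both helper names are hypothetical, and in the real system the database-side filtering is done by Supabase RLS rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileState:
    profile_id: str
    user_id: str      # owner of the profile (assumed field)
    open_orders: int


def payload_for_user(states: list[ProfileState], user_id: str) -> list[ProfileState]:
    """Return only the runtime state that belongs to the authenticated user."""
    return [s for s in states if s.user_id == user_id]


def assert_profile_owned(states: list[ProfileState], user_id: str, profile_id: str) -> None:
    """Reject a request whose profile_id does not belong to the caller."""
    owned = {s.profile_id for s in states if s.user_id == user_id}
    if profile_id not in owned:
        raise PermissionError(f"profile {profile_id} is not owned by user {user_id}")
```

A broadcast loop would call `payload_for_user` once per connected user instead of sending one global payload to every socket.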

Phase 2 — Restart Durability

  • Purpose: Guarantee deterministic state reconstruction after service restart.
  • Problem solved: Volatile in-memory maps caused missed reservations, lost lifecycle state, and inconsistent dashboards post-restart.
  • Core mechanisms: On startup, the service loads profiles, replays exchange open positions and orders, rebuilds lifecycle/trade history mappings, and rehydrates the capital ledger from the database/exchange state. File snapshots are deprecated; the DB and the exchange are authoritative.
  • Must never break: Restart must not rely on process-local idempotency maps; open orders/positions must always be re-fetched on boot.

Phase 3 — Capital Ledger

  • Purpose: Enforce deterministic capital isolation per profile.
  • Problem solved: Concurrent entries could over-allocate capital and leave the ledger inconsistent.
  • Core mechanisms: A ledger schema maintains allocated_capital, reserved_for_orders, reserved_for_positions, and realized_pnl, with available_capital computed as allocated minus reservations plus realized. Entry execution acquires a profile-level mutex, reserves capital before exchange placement, and adjusts reservations on fills, partial fills, cancels, and exits. Restart rebuilds the ledger from exchange open orders/positions, ensuring any drift resets to authoritative values.
  • Must never break: The invariant available_capital = allocated - reserved_for_orders - reserved_for_positions + realized_pnl must always hold; no code should mutate ledger fields outside the defined RPCs or ledger service.
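
A minimal in-memory sketch of the ledger invariant follows. The field names come from the schema above; the `CapitalLedger` class and its method names are hypothetical, and the real ledger lives in Supabase behind RPCs, not in process memory.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class CapitalLedger:
    allocated_capital: Decimal
    reserved_for_orders: Decimal = Decimal("0")
    reserved_for_positions: Decimal = Decimal("0")
    realized_pnl: Decimal = Decimal("0")

    @property
    def available_capital(self) -> Decimal:
        # Invariant from the text: allocated - reservations + realized.
        return (
            self.allocated_capital
            - self.reserved_for_orders
            - self.reserved_for_positions
            + self.realized_pnl
        )

    def reserve_for_order(self, amount: Decimal) -> None:
        """Reserve capital before exchange placement; refuse to overdraw."""
        if amount <= 0:
            raise ValueError("reservation must be positive")
        if amount > self.available_capital:
            raise ValueError("insufficient available capital")
        self.reserved_for_orders += amount

    def release_order_reservation(self, amount: Decimal) -> None:
        """Give back an order reservation after a cancel or failed placement."""
        if amount <= 0 or amount > self.reserved_for_orders:
            raise ValueError("cannot release more than is reserved")
        self.reserved_for_orders -= amount
```

Because `available_capital` is derived rather than stored, it cannot drift from the four underlying fields; the only way to break the invariant is to mutate those fields outside the two methods.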

Phase 4 — Transactional Lifecycle

  • Purpose: Guarantee atomic ENTRY/EXIT persistence tied to confirmed exchange events.
  • Problem solved: Partial writes left orphaned lifecycle entries, phantom positions, and duplicate trade history rows.
  • Core mechanisms: The ENTRY flow places the exchange order first, then calls the fn_persist_entry_lifecycle RPC, which inserts the lifecycle row, order row, position seed, and optional history slice within one transaction (using UNIQUE(profile_id, trade_id) and idempotent child inserts). The EXIT flow places exit orders, then updates lifecycle rows, positions, and history in another single transaction. Idempotency keys prevent duplicate rows, and the unique constraints enforce lifecycle integrity.
  • Must never break: The exchange must remain the source of truth; no lifecycle row may exist without a confirmed exchange order, and idempotency safeguards must never be bypassed.
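
The combination of a single transaction, a UNIQUE(profile_id, trade_id) constraint, and idempotent child inserts can be demonstrated with an in-memory SQLite database. This is a stand-in, not the actual RPC: the table layouts, the `persist_entry` function, and its return value are assumptions made for illustration, and fn_persist_entry_lifecycle itself runs in Supabase.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lifecycles (
            profile_id TEXT NOT NULL,
            trade_id TEXT NOT NULL,
            exchange_order_id TEXT NOT NULL,
            UNIQUE (profile_id, trade_id)
        );
        CREATE TABLE orders (
            exchange_order_id TEXT PRIMARY KEY,
            profile_id TEXT NOT NULL,
            trade_id TEXT NOT NULL
        );
        """
    )
    return conn


def persist_entry(conn: sqlite3.Connection, profile_id: str, trade_id: str,
                  exchange_order_id: str) -> bool:
    """Insert lifecycle and order rows atomically; return True if newly created."""
    if not exchange_order_id:
        raise ValueError("lifecycle requires a confirmed exchange order_id")
    with conn:  # one transaction: commit on success, roll back on any error
        cur = conn.execute(
            "INSERT OR IGNORE INTO lifecycles VALUES (?, ?, ?)",
            (profile_id, trade_id, exchange_order_id),
        )
        created = cur.rowcount == 1
        conn.execute(
            "INSERT OR IGNORE INTO orders VALUES (?, ?, ?)",
            (exchange_order_id, profile_id, trade_id),
        )
    return created
```

Calling `persist_entry` twice with the same arguments leaves exactly one lifecycle row and one order row, which is the property a retrying worker depends on.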

Phase 5 — Reconciliation

  • Purpose: Align database state with exchange truth continuously.
  • Problem solved: Discrepancies between Supabase rows and exchange orders led to stale dashboard data and capital mismatches.
  • Core mechanisms: The reconciliation loop acquires a profile-specific row lock, loads full open orders from both the exchange and the database (no limit), compares by order_id/client_order_id/trade_id, and routes any discrepancy through lifecycle-safe handlers (entry fill, exit fill, cancel). It also tracks metrics for reconciliation health and mismatch counts.
  • Must never break: Raw status patches are forbidden; every correction must trigger lifecycle handlers so the capital ledger and positions stay consistent.
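
The comparison step can be sketched as a pure function over two snapshots. The sketch assumes orders are keyed by client_order_id with a status string as the value; `ReconcileDiff` and `diff_open_orders` are hypothetical names, and the resulting buckets would be handed to the lifecycle handlers rather than written to the database directly.

```python
from dataclasses import dataclass


@dataclass
class ReconcileDiff:
    missing_in_db: list[str]          # on the exchange, no DB row
    missing_from_exchange: list[str]  # DB row, not open on the exchange
    status_mismatch: list[str]        # present in both, statuses differ


def diff_open_orders(exchange: dict[str, str], database: dict[str, str]) -> ReconcileDiff:
    """Compare open orders keyed by client_order_id; values are order statuses."""
    exchange_ids, db_ids = set(exchange), set(database)
    return ReconcileDiff(
        missing_in_db=sorted(exchange_ids - db_ids),
        missing_from_exchange=sorted(db_ids - exchange_ids),
        status_mismatch=sorted(
            oid for oid in exchange_ids & db_ids if exchange[oid] != database[oid]
        ),
    )
```

Keeping the diff free of side effects makes it easy to test and keeps every correction on the handler path.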

Phase 6 — Distributed Safety

  • Purpose: Make ENTRY execution safe across horizontally scaled instances.
  • Problem solved: In-memory profile mutexes could not coordinate across multiple bots, leading to double submissions.
  • Core mechanisms: ENTRY locking now uses a row-based lock table with TTL, owner tokens, and deterministic symbol keys, ensuring only one active signal per profile/symbol. A deterministic clientOrderId (bytelyst-profile-trade) prevents duplicate exchange submissions when retries occur. Horizontal scaling relies on shared DB locks, so no duplicate lifecycle creation occurs even with many workers.
  • Must never break: Row-lock acquisition and release must always wrap the exchange submission; failing to release the lock or to regenerate the owner token is unacceptable.
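
The lock semantics (TTL, owner token, one holder per profile/symbol) can be modelled in memory as below. This is a sketch under stated assumptions: `EntryLockTable` is a hypothetical stand-in for the shared lock table, the clock is passed in explicitly so the behaviour is deterministic, and the `bytelyst-{profile_id}-{trade_id}` format is only one reading of the "bytelyst-profile-trade" pattern named above.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockRow:
    owner_token: str
    expires_at: float


class EntryLockTable:
    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._rows: dict[tuple[str, str], LockRow] = {}

    def try_acquire(self, profile_id: str, symbol: str, now: float) -> Optional[str]:
        """Return a fresh owner token, or None if a live lock already exists."""
        key = (profile_id, symbol)
        row = self._rows.get(key)
        if row is not None and row.expires_at > now:
            return None
        token = uuid.uuid4().hex  # new token on every acquisition
        self._rows[key] = LockRow(token, now + self.ttl_seconds)
        return token

    def release(self, profile_id: str, symbol: str, owner_token: str) -> bool:
        """Release only if the caller still owns the lock."""
        key = (profile_id, symbol)
        row = self._rows.get(key)
        if row is None or row.owner_token != owner_token:
            return False
        del self._rows[key]
        return True


def client_order_id(profile_id: str, trade_id: str) -> str:
    """Same inputs always yield the same id, so a retry cannot create a new order."""
    return f"bytelyst-{profile_id}-{trade_id}"
```

The owner check in `release` matters after a TTL expiry: a slow worker whose lock has been taken over must not be able to release the new holder's lock.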

Phase 7 — Observability & Health

  • Purpose: Provide operational insight and safeguards.
  • Problem solved: Blind spots in loops and invariants made debugging and proactive alerting difficult.
  • Core mechanisms: Prometheus /metrics tracks loop durations, reconciliation mismatches, lock contention, capital invariant violations, and exchange latency histograms. /internal/health exposes trading/monitor/reconciliation loop health, lock contention counts, reconciliation mismatch counts, and degraded indicators. Structured audit logs capture ENTRY/EXIT submissions, fills, cancellations, and reconciliation corrections. A capital invariant watchdog logs critical errors if the ledger computation goes negative.
  • Must never break: Observable metrics must never regress; the health endpoint must always include the counters referenced by runbooks.

Phase 8 — Final Enterprise Validation

  • Purpose: Formalize invariants, failure runbooks, and emergency controls.
  • Problem solved: Without operator guidance, teams risk violating essential guarantees when extending the system.
  • Core mechanisms: Operators rely on docs/invariants.md for safety rules, docs/runbooks/*.md for failure handling, kill switches for trading loops, and circuit breakers (global/profile/exchange) described in those runbooks. Incident response includes capital invariant monitoring and mutex lock health checks.
  • Must never break: The canonical architecture reference must remain accurate; any code touching core components must cite these runbooks before modification.

SECTION 3 — CRITICAL INVARIANTS

  • No duplicate exchange order. Documented in docs/invariants.md; deterministic clientOrderId plus lifecycle atomicity prevents duplicates.
  • No lifecycle without confirmed exchange order. ENTRY RPC rejects inserts unless exchange order_id is confirmed.
  • Capital cannot go negative. The ledger service enforces available_capital calculations and invariants; violations trigger logs and the capital invariant metric.
  • Only one active ENTRY per (profile_id, symbol). Row-based entry_locks enforce exclusivity.
  • Reconciliation converges to exchange truth. The reconciliation loop locks per profile, compares full datasets, and uses lifecycle handlers to correct mismatches.
  • Restart does not corrupt ledger. Startup rebuild rehydrates ledger from exchange open orders/positions.
  • Distributed workers cannot double submit. Shared locks and deterministic clientOrderId plus RPC idempotency guarantee this.

Each invariant references docs/invariants.md and the relevant runbook under docs/runbooks/*.md for procedures when invariants fail.

SECTION 4 — EXECUTION FLOW DIAGRAMS

  1. ENTRY execution — Validate signal, acquire reconciliation/entry row lock, reserve capital via ledger RPC, place exchange order with deterministic clientOrderId, call fn_persist_entry_lifecycle within transaction, release lock, emit audit log, and update dashboard state.
  2. EXIT execution — Trigger exit signal, place exit order with exchange, call transactional RPC to update lifecycle, adjust ledger (release reserved positions, add realized_pnl), close positions, notify dashboard.
  3. Partial fill handling — Exchange reports partial fill, monitor/reconciliation loop routes through entry-fill handler, move delta from reserved_for_orders to reserved_for_positions, update position quantity, emit lifecycle history slice, maintain ledger invariant.
  4. Restart rebuild — On boot, load profiles, fetch exchange open orders and positions, rebuild lifecycle map, reconstruct ledger reservations and realized_pnl, validate no stale idempotency entries remain.
  5. Reconciliation cycle — Acquire profile reconciliation lock, fetch full DB open/closed orders, fetch exchange open orders, match by identifiers, route discrepancies through lifecycle handlers, update metrics, release lock.
  6. Distributed lock acquisition — Generate owner token, call fn_try_acquire_entry_lock_row with TTL (30s), verify success, re-check lifecycle state, proceed with capital reservation and exchange call, and finally release lock via fn_release_entry_lock_row.
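
Step 3 (partial fill handling) is the flow where ledger arithmetic is easiest to get wrong, so a small sketch may help. It assumes the exchange reports cumulative filled quantity and that the order reservation was sized to cover the fill value; `Reservations` and `apply_partial_fill` are hypothetical names, not functions from the codebase.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Reservations:
    reserved_for_orders: Decimal
    reserved_for_positions: Decimal = Decimal("0")
    filled_quantity: Decimal = Decimal("0")


def apply_partial_fill(res: Reservations, cumulative_filled_qty: Decimal,
                       fill_price: Decimal) -> Decimal:
    """Move the newly filled value from order to position reservations.

    Returns the amount moved. A replayed or stale report moves nothing.
    """
    delta_qty = cumulative_filled_qty - res.filled_quantity
    if delta_qty <= 0:
        return Decimal("0")
    # Assumption: never move more than is still reserved for the order.
    moved = min(delta_qty * fill_price, res.reserved_for_orders)
    res.reserved_for_orders -= moved
    res.reserved_for_positions += moved
    res.filled_quantity = cumulative_filled_qty
    return moved
```

The sum of the two reservation fields is unchanged by a fill, so available_capital is unaffected and the ledger invariant holds throughout.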

SECTION 5 — FAILURE SCENARIO TABLE

Scenario | What happens | Why safe | Recovery behavior
Two workers race | Only one worker receives the lock; the other skips the entry | Row lock plus lifecycle check prevents double submission | Loser retries after lock TTL; health endpoint increments lock contention metric
Network partition | Exchange call eventually times out | Lock TTL ensures no permanent hold; trading loop raises error via runbook steps | Retry logic plus alert in docs/runbooks/lock-timeout.md
DB failure | RPC fails, transaction rolled back | ENTRY RPC transaction atomicity prevents partial lifecycle writes | Retry after DB recovery; reconciliation finds any discrepancy
Exchange timeout | Entry order fails after reservation | Reserved capital is released in a finally block and the ledger recalculates | Monitor loop logs failure; health endpoint flags the service as degraded
Crash before lifecycle RPC | Exchange order exists, RPC never called | Reconciliation detects order without lifecycle and replays handler | Lifecycle handler reprocesses confirmed order; runbook instructs to check logs
Crash after exchange but before persistence | Exchange order_id exists, lifecycle missing | Reconciliation handles orphaned orders via lifecycle-safe handler | Handler inserts lifecycle; capital ledger adjusts from exchange positions
Partial fill after restart | Rebuilt ledger recalculates reserved positions from exchange fills | Ledger rebuild logic replays fills; invariants hold | No manual recovery needed; reconciliation verifies
Supabase outage | Health metrics report service_role inability to write | Monitoring loop marks degraded; lock/metric thresholds trigger alerts | Operations route to runbook in docs/runbooks/supabase-outage.md
Lock stuck | Entry lock expires at TTL and is reacquired | TTL prevents deadlock | Health metric increments lock contention; runbook uses kill switch

SECTION 6 — HEALTH & OBSERVABILITY

  • /internal/health fields — tradingLoopHealthy, monitorLoopHealthy, reconciliationLoopHealthy, reconciliationLastRun, lockContentionCount, reconciliationMismatchCount, reconciliationMissingFromExchange, reconciliationMissingInDb, capitalInvariantViolations, exchangeLatencyHistogram, readiness (true only when loops run within expected intervals). Degradation occurs if any loop exceeds twice its normal cadence or capitalInvariantViolations increments.
  • Loop metrics — /metrics exposes durations and last-run timestamps for the trading, monitor, reconciliation, and order sync loops. A loop is considered unhealthy once it exceeds 2x its expected interval.
  • Lock contention metrics — Incremented when fn_try_acquire_entry_lock_row fails; acquisition latency is recorded as a histogram; stuck locks are surfaced when TTL expires without release.
  • Reconciliation metrics — Mismatch and missing counts show convergence progress; reconciliationHealthy is set once the mismatch count has remained zero for two consecutive cycles.
  • Observability design — Overhead stays minimal because the health tracker updates in-process counters that Prometheus reads from the /metrics exporter; structured logs include profile_id, trade_id, event, and shedding to maintain compliance.
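
The 2x-interval rule and the readiness flag can be sketched as follows. The per-loop key names mirror the /internal/health fields listed above; the `LoopStatus` class, the `degraded` key, and treating any non-zero violation count as degraded are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class LoopStatus:
    name: str                   # e.g. "trading", "monitor", "reconciliation"
    expected_interval_s: float
    last_run_at: float

    def healthy(self, now: float) -> bool:
        # Unhealthy once the loop exceeds twice its expected interval.
        return (now - self.last_run_at) <= 2 * self.expected_interval_s


def health_report(loops: list[LoopStatus], capital_invariant_violations: int,
                  now: float) -> dict:
    per_loop = {f"{loop.name}LoopHealthy": loop.healthy(now) for loop in loops}
    readiness = all(per_loop.values())
    return {
        **per_loop,
        "capitalInvariantViolations": capital_invariant_violations,
        "readiness": readiness,
        "degraded": (not readiness) or capital_invariant_violations > 0,
    }
```

Passing `now` in explicitly keeps the check deterministic in tests; a live endpoint would supply the current monotonic or wall-clock time.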

SECTION 7 — HORIZONTAL SCALING MODEL

  • Multi-worker deployment — Each bot instance shares the same Supabase project and exchange key; they coordinate through row-based locks and the shared capital ledger stored in Supabase.
  • Shared DB — Lifecycles, ledgers, locks, and reconciliation state live in Supabase; worker nodes treat the DB as the single source of state, and all RPCs operate against it.
  • Shared exchange key — A deterministic clientOrderId plus order lifecycle handling ensures duplicate submissions never occur even though multiple workers use the same key.
  • Lock guarantees — Entry locks and reconciliation locks use TTL and owner tokens; only one worker may hold a lock for a profile/symbol combination or for a profile's reconciliation cycle.
  • No duplication — Atomic lifecycle RPCs, deterministic clientOrderId, and reconciliation lock semantics guarantee that no two workers can simultaneously report conflicting lifecycles or capital adjustments.

SECTION 8 — SAFE ENHANCEMENT RULES

Checklist for future agents:

  • Lifecycle: Do not modify fn_persist_entry_lifecycle or exit RPCs without referencing docs/runbooks/lifecycle-incident.md; every change must maintain exchange-first order and transactional guarantees.
  • Ledger: Preserve available_capital invariants and only update ledger fields through the ledger service; see docs/invariants.md for safety rules.
  • Reconciliation: Never bypass lifecycle handlers; matching logic must still route through reconcileEntryFill/reconcileExitFill/reconcileCancel.
  • Locking: Entry and reconciliation locks (row-based) must stay TTL-based and owner-checked; no in-memory mutex hacks.
  • Exchange submission: The clientOrderId strategy is deterministic; do not modify it without ensuring idempotency.

"DO NOT BREAK" rules: The exchange remains the source of truth, the capital ledger must never go negative, distributed locks must be respected, and reconciliation must converge to exchange state. Any deviation triggers procedures spelled out in docs/runbooks/invariant-violation.md.