learning_ai_invt_trdg/backend/ENTERPRISE_ARCHITECTURE_REFERENCE.md

177 lines
15 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

**SECTION 1 — SYSTEM OVERVIEW**
- **High-level architecture diagram (textual)**
Trading Bot Service (single codebase) ←→ Supabase/Postgres for durable state ←→ Exchange Connectors (Alpaca, etc.)
Observability Layer (Prometheus `/metrics`, structured logs)
Dashboard UI reads Supabase state and subscribes to userscoped WebSocket channels.
Distributed workers (multi-instance) share Supabase + Exchange key + Observability feeds.
- **Runtime components**
- *Trading loop*: scheduled loop per profile that evaluates strategy signals, performs capital checks, acquires lock, submits ENTRY order, and invokes lifecycle RPC.
- *Monitor loop*: polls exchange/account state, updates positions/orders, enforces invariant watchdogs (capital, lifecycle) and emits metrics/logs.
- *Reconciliation loop*: acquires per-profile reconciliation lock, fetches full DB vs exchange open order sets, routes discrepancies through lifecycle-safe handlers, and updates metrics/health.
- *Order sync loop*: keeps DB and exchange orders synced (fills, cancels) by reconciling via lifecycle flows and ledger adjustments.
- **Exchange interaction model**
- Exchange order submission always occurs before DB persistence.
- `clientOrderId` deterministic (`bytelyst-${profileId}-${tradeId}`) ensures idempotent exchange requests.
- Lifecycle RPC persists confirmed exchange order metadata; no retries issue new exchange orders.
- **Single exchange key multi-profile model**
- One shared exchange API key per bot deployment.
- Profile isolation achieved via REST/WebSocket tenant scoping and capital ledger segregation.
- Distributed lock ensures one active ENTRY per `(profile_id, symbol)` even with shared API key.
---
**SECTION 2 — PHASE-BY-PHASE ENTERPRISE HARDENING**
*Phase 1 — Tenant Isolation*
- Profiles isolated with Supabase RLS: orders/trade_history/positions rows are scoped to `profile_id`, forced by RLS policies (profile owner or `service_role` only).
- WebSocket scopes: runtime state broadcast filtered by `user_id`, preventing cross-tenant leakage.
- Data exposure guarantees: authenticated requests only see their profile records; Realtime channels emit tenant-scoped runtime state; service tokens required for administrative access.
*Phase 2 — Restart Durability*
- **Startup rebuild flow**: on start, for each profile, load persisted profiles, fetch exchange open positions/orders, rebuild lifecycle maps and ledger.
- **Capital rebuild logic**: ledger reset per profile then rebuilt by re-playing open positions/orders from exchange state, recalculating `reserved_for_positions`/`reserved_for_orders`.
- **Lifecycle rebuild**: trade lifecycle map reconstructed from persisted orders/trade_history; open positions re-linked via trade_id.
- **Pending order reconstruction**: open exchange orders reinserted if missing, reconciled via lifecycle RPCs to ensure consistent DB state.
- **Deterministic state rebuild proof**: deterministic parsing of exchange data + ledger reconstruction ensures restart idempotency (same inputs → same ledger state).
*Phase 3 — Capital Ledger*
- **Ledger schema**: per-profile ledger table columns `allocated_capital`, `reserved_for_orders`, `reserved_for_positions`, `realized_pnl`. `available_capital` computed from invariant.
- **RPC guarantees**: ledger updates happen via atomic RPCs; reservation occurs before exchange calls; releases happen on cancel/exit/fail via RPC ensures durable state.
- **Reservation lifecycle**: before ENTRY, profile-level mutex acquired, required capital deducted into `reserved_for_orders`, released upon exchange failure.
- **Invariant**:
`available_capital = allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl`
always upheld after every ledger mutation.
- **Partial fill math**: filled notional moves proportionally from `reserved_for_orders` into `reserved_for_positions`; partial execution persists both fill amount and remaining reservation.
- **Restart math proof**: ledger rebuild sums exchange open orders/positions exact fill notional, ensuring invariant recomputed identically upon restart.
- **Crash recovery proof**: if crash occurs during reservation, restart logic recomputes reservation from exchange state. New reservations only re-run when capital available.
*Phase 4 — Transactional Lifecycle*
- **Exchange-first entry flow**: trade signals evaluate → capital reserve → lock → exchange order → receive `order_id` → call `fn_persist_entry_lifecycle`.
- **Lifecycle RPC flow**: inserts `trade_lifecycle`, `orders`, `positions`, `trade_history` atomically; uses `UNIQUE(profile_id, trade_id)` guard; child inserts have `ON CONFLICT DO NOTHING`.
- **Idempotency keys**: RPC uses `trade_id` + `profile_id`; deterministic `clientOrderId`; repeated calls look up existing lifecycle instead of inserting duplicates.
- **Unique constraints**: `trade_lifecycle(profile_id, trade_id)` unique; `orders(order_id)` unique; positions keyed by `(profile_id, trade_id)`; ensures duplicates cannot arise.
- **Failure handling**: RPC wrapped in transaction; failures roll back entire lifecycle `INSERT`. On retry, unique constraint ensures safe idempotency; duplicate lifecycle fetch returns existing state.
- **Why exchange is source of truth**: order is placed before any persistence. If DB commit fails, replays fetch confirmed exchange order via idempotent RPC without re-submitting.
*Phase 5 — Reconciliation*
- **Deterministic comparison algorithm**: for each profile, fetch entire open DB order set + recent closed set (no limit) and full exchange open set; match using `order_id``client_order_id``trade_id`.
- **Locking model**: row-based `reconciliation_locks` per profile with TTL; RPCs `fn_try_acquire_reconciliation_lock_row`, `fn_release_reconciliation_lock_row`.
- **Lifecycle-safe handler routing**: discrepancies processed via handlers (`reconcileEntryFill`, `reconcileExitFill`, `reconcileCancel`, `logOrder`) instead of raw `updateOrderStatus`.
- **Ledger adjustment routing**: reconciliation uses capital ledger APIs for fills/cancels to maintain invariants.
- **Health metrics**: reconciliation loop exposes `reconciliationLoopHealthy`, `reconciliationLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`.
- **Failure table**: reconciliation handles DB-only orders (mark cancel via lifecycle), exchange-only orders (insert lifecycle), status mismatches (trigger lifecycle transitions), partial fills, exchange cancels.
*Phase 6 — Distributed Safety*
- **Row-based lock model**: `entry_locks(profile_id, lower(symbol))` with TTL; RPCs `fn_try_acquire_entry_lock_row`, `fn_release_entry_lock_row`.
- **Lock TTL logic**: default 30s TTL; lock expires automatically if owner crashes; optimistic updates ensure quick lock turnover.
- **Owner token design**: owner string `processPid-uuid`, stored per attempt; only matching owner can release lock.
- **Deterministic `clientOrderId`**: `bytelyst-${profileId}-${tradeId}` ensures same trade never re-submits new order; exchange rejects duplicates and response interpreted as existing order.
- **Multi-instance behavior**: each worker attempts lock acquisition; only one obtains lock, performs ENTRY; others skip and wait for lock release.
- **Horizontal scaling model**: distributed lock + shared DB/exchange key allows safe scaling; no per-instance state relied upon.
- **Deadlock prevention**: TTL and owner-based release ensure locks eventually expire; finally blocks always release lock.
- **Failure table**: lock acquisition failure leads to immediate entry skip; network partition releases lock via TTL; crash during exchange preserves safety because lock expires before restart.
---
**SECTION 3 — CRITICAL INVARIANTS**
1. **No duplicate exchange order**
- Why holds: deterministic `clientOrderId` + row-based lock prevents re-entry; lifecycle RPC guarded by unique constraints.
2. **No lifecycle without confirmed exchange order**
- Why holds: exchange-first submission ensures RPC only called with confirmed `order_id`; RPC never replays exchange call.
3. **Capital cannot go negative**
- Why holds: ledger enforces check before reservation; available capital invariant prevents overspend; watchdog logs and rejects actions if invariant violated.
4. **Only one active ENTRY per (profile_id, symbol)**
- Why holds: acquisition of `(profile_id, symbol)` row lock before entry; TTL ensures exclusivity.
5. **Reconciliation converges to exchange truth**
- Why holds: reconciliation fetches full open order sets, uses lifecycle handlers, ledger updates, and repeats deterministically (idempotent).
6. **Restart does not corrupt ledger**
- Why holds: restart rebuild recomputes ledger from exchange open positions/orders; no reliance on cached values.
7. **Distributed workers cannot double submit**
- Why holds: distributed locks + deterministic clientOrderId + lifecycle uniqueness ensure only one worker can create a lifecycle even under concurrency.
---
**SECTION 4 — EXECUTION FLOW DIAGRAMS**
1. **ENTRY execution**
- signal → profile-level lock acquire → capital check/reserve → deterministic `clientOrderId` → exchange order → lifecycle RPC (insert lifecycle/orders/position/history) → ledger update → lock release.
2. **EXIT execution**
- cancel/exit signal → lifecycle handler identifies trade_id → create exit order via exchange → lifecycle RPC atomic update (order row, lifecycle, trade_history, position) → ledger releases position reservation, adds realized_pnl.
3. **Partial fill handling**
- exchange fill update via reconciliation/monitor → partial `quantityFilled` → lifecycle handler adjusts `reserved_for_orders` → move filled notional to `reserved_for_positions` → ensure remaining reservation equals unfilled amount.
4. **Restart rebuild**
- service start → load profiles → fetch exchange open positions/orders → rebuild ledger reservations + lifecycle map → reconcile pending orders → resume loops.
5. **Reconciliation cycle**
- for each profile: acquire reconciliation lock → fetch full DB open + recent orders + exchange open set → deterministic matching (order_id/client_order_id/trade_id) → route through lifecycle handlers → release lock → emit metrics.
6. **Distributed lock acquisition**
- compute deterministic lock key (profile_id + symbol) → call `fn_try_acquire_entry_lock_row` → on success proceed → finally release via `fn_release_entry_lock_row`; TTL auto-expiry handles crashes.
---
**SECTION 5 — FAILURE SCENARIO TABLE**
| Scenario | What happens | Why safe | Recovery behavior |
|---|---|---|---|
| Two workers race | Only one acquires row lock; other aborts | Lock ensures mutual exclusion | Winner proceeds; loser tries next signal |
| Network partition | Lock TTL expires, prevents hang | TTL avoids perpetual ownership | Worker restarts, reacquires lock after TTL |
| DB failure | Transaction aborts, no lifecycle persisted | Persistent state only changes when txn commits | Retry after DB available; idempotent RPC avoids duplicates |
| Exchange timeout | Capital reservation rolled back via mutex/finally | No exchange order submitted | Signal retries after timeout |
| Crash before lifecycle RPC | Lock TTL ensures future worker can resume | No lifecycle inserted, no capital moved | Restart replay resumes from exchange state |
| Crash after exchange but before persistence | Lifecycle RPC retried with existing `clientOrderId` | Unique constraints prevent duplicates | RPC idempotent insert replays once |
| Partial fill after restart | Reconciliation partial fill handler adjusts ledger | Handler moves filled notional into positions | Consistent ledger, no double counting |
| Supabase outage | RPCs fail, operations roll back | Transactions atomic; no partial writes | Retry after Supabase recovers; observers alerted |
| Lock stuck | TTL expiry clears stale lock | Hard TTL prevents deadlock | Waiting worker acquires after TTL |
---
**SECTION 6 — HEALTH & OBSERVABILITY**
- `/internal/health` fields:
`tradingLoopHealthy`, `tradingLoopLastRun`, `tradingLoopDuration`,
`monitorLoopHealthy`, `monitorLoopLastRun`, `monitorLoopDuration`,
`reconciliationLoopHealthy`, `reconciliationLoopLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`,
`lockContentionCount`, `capitalInvariantViolations`, `observabilityTimestamp`.
- Loop metrics: duration histograms + last run timestamps; SLO: healthy flag true if last run < 2x expected interval.
- Lock contention metrics: increment per failed lock acquisition; field surfaced both via `/internal/health` and Prometheus.
- Reconciliation metrics: mismatch counts, missing-from-exchange, missing-in-db, lock contention.
- Readiness signals: SLO flags combined to determine readiness; observability records degrade gracefully (logs emitted on invariant violations).
- Degraded mode behavior: if capital invariant fails, watchdog increments violation counter, logs critical error, halts further ENTRY until resolved.
---
**SECTION 7 — HORIZONTAL SCALING MODEL**
- Multi-worker deployment: each worker runs trading/monitor/reconciliation loops; shared DB/exchange key; distributed locks coordinate actions.
- Shared DB: Supabase/Postgres is the single source of truth; all loops interact with same tables.
- Shared exchange key: deterministic `clientOrderId` + lock prevent double submissions despite shared credentials.
- Lock guarantees: row-based entry locks and reconciliation locks with TTL/owner ensure cross-instance exclusivity.
- Why no duplication is possible: distributed locks + deterministic order IDs + lifecycle RPC uniqueness ensure only one worker can create a lifecycle even under concurrency.
---
**SECTION 8 — SAFE ENHANCEMENT RULES**
- **Lifecycle**:
- DO NOT bypass `fn_persist_entry_lifecycle`.
- DO NOT write raw status updates; use lifecycle-safe handlers.
- **Ledger**:
- DO NOT mutate ledger without RPCs that respect invariant.
- Always recompute `available_capital` via `allocated - reserved_orders - reserved_positions + realized_pnl`.
- **Reconciliation**:
- Always acquire `reconciliation_locks`.
- Route changes through lifecycle handlers.
- **Locking**:
- Always use row locks with TTL and owner tokens; release in finally block.
- Do not assume single-process state.
- **Exchange submission**:
- Always reserve capital before exchange call.
- Use deterministic `clientOrderId`.
- Never re-submit the same trade_id; rely on idempotent failures.
**DO NOT BREAK** these rules; any change violating them risks duplicate executions, capital drift, or stale lifecycle data.