SECTION 1 — SYSTEM OVERVIEW
- High-level architecture diagram (textual)
  - Trading Bot Service (single codebase) ←→ Supabase/Postgres for durable state ←→ Exchange Connectors (Alpaca, etc.)
  - Observability Layer (Prometheus/metrics, structured logs)
  - Dashboard UI reads Supabase state and subscribes to user-scoped WebSocket channels.
  - Distributed workers (multi-instance) share Supabase + Exchange key + Observability feeds.
- Runtime components
  - Trading loop: scheduled loop per profile that evaluates strategy signals, performs capital checks, acquires lock, submits ENTRY order, and invokes lifecycle RPC.
  - Monitor loop: polls exchange/account state, updates positions/orders, enforces invariant watchdogs (capital, lifecycle) and emits metrics/logs.
  - Reconciliation loop: acquires per-profile reconciliation lock, fetches full DB vs exchange open order sets, routes discrepancies through lifecycle-safe handlers, and updates metrics/health.
  - Order sync loop: keeps DB and exchange orders synced (fills, cancels) by reconciling via lifecycle flows and ledger adjustments.
- Exchange interaction model
  - Exchange order submission always occurs before DB persistence.
  - `clientOrderId` deterministic (`bytelyst-${profileId}-${tradeId}`) ensures idempotent exchange requests.
  - Lifecycle RPC persists confirmed exchange order metadata; no retries issue new exchange orders.
- Single exchange key multi-profile model
  - One shared exchange API key per bot deployment.
  - Profile isolation achieved via REST/WebSocket tenant scoping and capital ledger segregation.
  - Distributed lock ensures one active ENTRY per `(profile_id, symbol)` even with shared API key.
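
The deterministic `clientOrderId` described above can be sketched as follows. The helper name `buildClientOrderId` is hypothetical; only the `bytelyst-${profileId}-${tradeId}` format comes from this document.

```typescript
// Hypothetical helper: derives the exchange clientOrderId from stable identifiers only.
// No timestamps or random values are involved, so a retry of the same trade
// produces the same ID and the exchange can detect the duplicate.
function buildClientOrderId(profileId: string, tradeId: string): string {
  if (profileId === "" || tradeId === "") {
    throw new Error("profileId and tradeId are required");
  }
  return `bytelyst-${profileId}-${tradeId}`;
}

const first = buildClientOrderId("p1", "t42");
const retry = buildClientOrderId("p1", "t42");
console.log(first === retry); // true: same trade, same ID
```

Because the ID depends only on `profileId` and `tradeId`, any worker that retries the same trade sends the same ID.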
SECTION 2 — PHASE-BY-PHASE ENTERPRISE HARDENING
Phase 1 — Tenant Isolation
- Profiles isolated with Supabase RLS: orders/trade_history/positions rows are scoped to `profile_id`, enforced by RLS policies (profile owner or `service_role` only).
- WebSocket scopes: runtime state broadcast filtered by `user_id`, preventing cross-tenant leakage.
- Data exposure guarantees: authenticated requests only see their profile records; Realtime channels emit tenant-scoped runtime state; service tokens required for administrative access.
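
As an illustration of the `user_id` broadcast filter, here is a minimal sketch. The payload and subscriber shapes (`RuntimeStateEvent`, `ChannelSubscriber`) and the function `broadcastScoped` are assumptions made for the example; the document does not specify them.

```typescript
// Hypothetical shapes: the document does not define the runtime-state payload.
interface RuntimeStateEvent { userId: string; profileId: string; payload: string; }
interface ChannelSubscriber { userId: string; received: RuntimeStateEvent[]; }

// Deliver an event only to subscribers whose userId matches the event owner.
// Returns the number of subscribers that received it.
function broadcastScoped(evt: RuntimeStateEvent, subscribers: ChannelSubscriber[]): number {
  let delivered = 0;
  for (const sub of subscribers) {
    if (sub.userId === evt.userId) {
      sub.received.push(evt);
      delivered += 1;
    }
  }
  return delivered;
}

const alice: ChannelSubscriber = { userId: "alice", received: [] };
const bob: ChannelSubscriber = { userId: "bob", received: [] };
const deliveredCount = broadcastScoped(
  { userId: "alice", profileId: "p1", payload: "position opened" },
  [alice, bob],
);
console.log(deliveredCount, bob.received.length); // 1 0
```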
Phase 2 — Restart Durability
- Startup rebuild flow: on start, load persisted profiles; then, for each profile, fetch exchange open positions/orders and rebuild lifecycle maps and ledger.
- Capital rebuild logic: ledger reset per profile then rebuilt by re-playing open positions/orders from exchange state, recalculating `reserved_for_positions`/`reserved_for_orders`.
- Lifecycle rebuild: trade lifecycle map reconstructed from persisted orders/trade_history; open positions re-linked via trade_id.
- Pending order reconstruction: open exchange orders reinserted if missing, reconciled via lifecycle RPCs to ensure consistent DB state.
- Deterministic state rebuild proof: deterministic parsing of exchange data + ledger reconstruction ensures restart idempotency (same inputs → same ledger state).
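
The "same inputs → same ledger state" property can be sketched as below. The snapshot shapes and field names (`remainingNotional`, `notional`) and the function `rebuildReservations` are assumptions for illustration, not the service's actual types.

```typescript
// Hypothetical exchange snapshot shapes (field names are assumptions).
interface OpenOrderSnapshot { orderId: string; remainingNotional: number; }
interface OpenPositionSnapshot { tradeId: string; notional: number; }
interface RebuiltLedger { reservedForOrders: number; reservedForPositions: number; }

function compareIds(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Reset to zero, then replay exchange state. Sorting by ID first makes the
// summation order independent of the order in which the exchange returned rows.
function rebuildReservations(
  orders: OpenOrderSnapshot[],
  positions: OpenPositionSnapshot[],
): RebuiltLedger {
  const ledger: RebuiltLedger = { reservedForOrders: 0, reservedForPositions: 0 };
  const sortedOrders = [...orders].sort((a, b) => compareIds(a.orderId, b.orderId));
  const sortedPositions = [...positions].sort((a, b) => compareIds(a.tradeId, b.tradeId));
  for (const o of sortedOrders) ledger.reservedForOrders += o.remainingNotional;
  for (const p of sortedPositions) ledger.reservedForPositions += p.notional;
  return ledger;
}

const snapshotOrders: OpenOrderSnapshot[] = [
  { orderId: "o2", remainingNotional: 100 },
  { orderId: "o1", remainingNotional: 250 },
];
const snapshotPositions: OpenPositionSnapshot[] = [{ tradeId: "t1", notional: 400 }];
const rebuilt = rebuildReservations(snapshotOrders, snapshotPositions);
console.log(JSON.stringify(rebuilt)); // {"reservedForOrders":350,"reservedForPositions":400}
```

The function keeps no state between calls, so running it twice on the same snapshot yields the same ledger.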
Phase 3 — Capital Ledger
- Ledger schema: per-profile ledger table columns `allocated_capital`, `reserved_for_orders`, `reserved_for_positions`, `realized_pnl`. `available_capital` computed from invariant.
- RPC guarantees: ledger updates happen via atomic RPCs; reservation occurs before exchange calls; releases on cancel/exit/fail also go through RPCs, which keeps ledger state durable.
- Reservation lifecycle: before ENTRY, profile-level mutex acquired, required capital deducted into `reserved_for_orders`, released upon exchange failure.
- Invariant: `available_capital = allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl` always upheld after every ledger mutation.
- Partial fill math: filled notional moves proportionally from `reserved_for_orders` into `reserved_for_positions`; partial execution persists both fill amount and remaining reservation.
- Restart math proof: ledger rebuild sums the exact fill notional of exchange open orders/positions, so the invariant is recomputed identically upon restart.
- Crash recovery proof: if crash occurs during reservation, restart logic recomputes reservation from exchange state. New reservations only re-run when capital available.
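
The invariant and the partial fill math above can be worked through in a small sketch. The `LedgerRow` type and the function names are hypothetical, and the real ledger is mutated through RPCs rather than in application memory.

```typescript
// Hypothetical in-memory mirror of the ledger columns.
interface LedgerRow {
  allocatedCapital: number;
  reservedForOrders: number;
  reservedForPositions: number;
  realizedPnl: number;
}

// available_capital is derived on demand, so it cannot drift from the invariant.
function availableCapital(l: LedgerRow): number {
  return l.allocatedCapital - l.reservedForOrders - l.reservedForPositions + l.realizedPnl;
}

// Reserve capital for an ENTRY; rejects if available capital would go negative.
function reserveForOrder(l: LedgerRow, notional: number): LedgerRow {
  if (notional <= 0) throw new Error("notional must be positive");
  if (notional > availableCapital(l)) throw new Error("insufficient available capital");
  return { ...l, reservedForOrders: l.reservedForOrders + notional };
}

// Move the filled share of an order reservation into the position reservation.
function applyPartialFill(
  l: LedgerRow,
  orderReservation: number,
  filledQty: number,
  totalQty: number,
): LedgerRow {
  if (totalQty <= 0 || filledQty < 0 || filledQty > totalQty) {
    throw new Error("invalid fill quantities");
  }
  const moved = (orderReservation * filledQty) / totalQty;
  return {
    ...l,
    reservedForOrders: l.reservedForOrders - moved,
    reservedForPositions: l.reservedForPositions + moved,
  };
}

let ledgerRow: LedgerRow = {
  allocatedCapital: 10000, reservedForOrders: 0, reservedForPositions: 0, realizedPnl: 0,
};
ledgerRow = reserveForOrder(ledgerRow, 1000);         // available drops to 9000
ledgerRow = applyPartialFill(ledgerRow, 1000, 4, 10); // 400 moves to positions
console.log(ledgerRow.reservedForOrders, ledgerRow.reservedForPositions, availableCapital(ledgerRow)); // 600 400 9000
```

A partial fill only moves notional between the two reservation columns, so `available_capital` is unchanged by it. The sketch uses plain numbers for brevity; a production ledger would more likely use integer minor units or a decimal type.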
Phase 4 — Transactional Lifecycle
- Exchange-first entry flow: trade signals evaluate → capital reserve → lock → exchange order → receive `order_id` → call `fn_persist_entry_lifecycle`.
- Lifecycle RPC flow: inserts `trade_lifecycle`, `orders`, `positions`, `trade_history` atomically; uses `UNIQUE(profile_id, trade_id)` guard; child inserts have `ON CONFLICT DO NOTHING`.
- Idempotency keys: RPC uses `trade_id` + `profile_id`; deterministic `clientOrderId`; repeated calls look up existing lifecycle instead of inserting duplicates.
- Unique constraints: `trade_lifecycle(profile_id, trade_id)` unique; `orders(order_id)` unique; positions keyed by `(profile_id, trade_id)`; ensures duplicates cannot arise.
- Failure handling: RPC wrapped in transaction; failures roll back entire lifecycle `INSERT`. On retry, unique constraint ensures safe idempotency; duplicate lifecycle fetch returns existing state.
- Why exchange is source of truth: order is placed before any persistence. If DB commit fails, replays fetch confirmed exchange order via idempotent RPC without re-submitting.
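
The idempotent behavior of the lifecycle RPC can be modeled in memory as follows. `LifecycleStore` is a hypothetical stand-in for the `trade_lifecycle` table and its `UNIQUE(profile_id, trade_id)` guard; it is not the actual `fn_persist_entry_lifecycle` implementation.

```typescript
interface LifecycleRecord { profileId: string; tradeId: string; orderId: string; }

// In-memory stand-in for the trade_lifecycle table and its
// UNIQUE(profile_id, trade_id) constraint. Not the real RPC.
class LifecycleStore {
  private rows = new Map<string, LifecycleRecord>();

  // Insert once; a repeated call with the same key returns the existing row.
  persistEntry(rec: LifecycleRecord): { row: LifecycleRecord; created: boolean } {
    const key = `${rec.profileId}:${rec.tradeId}`;
    const existing = this.rows.get(key);
    if (existing !== undefined) return { row: existing, created: false };
    this.rows.set(key, rec);
    return { row: rec, created: true };
  }

  count(): number {
    return this.rows.size;
  }
}

const lifecycleStore = new LifecycleStore();
const firstCall = lifecycleStore.persistEntry({ profileId: "p1", tradeId: "t1", orderId: "ex-1" });
// Simulated retry after a crash: same trade, same confirmed exchange order.
const retryCall = lifecycleStore.persistEntry({ profileId: "p1", tradeId: "t1", orderId: "ex-1" });
console.log(firstCall.created, retryCall.created, lifecycleStore.count()); // true false 1
```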
Phase 5 — Reconciliation
- Deterministic comparison algorithm: for each profile, fetch entire open DB order set + recent closed set (no limit) and full exchange open set; match using `order_id` → `client_order_id` → `trade_id`.
- Locking model: row-based `reconciliation_locks` per profile with TTL; RPCs `fn_try_acquire_reconciliation_lock_row`, `fn_release_reconciliation_lock_row`.
- Lifecycle-safe handler routing: discrepancies processed via handlers (`reconcileEntryFill`, `reconcileExitFill`, `reconcileCancel`, `logOrder`) instead of raw `updateOrderStatus`.
- Ledger adjustment routing: reconciliation uses capital ledger APIs for fills/cancels to maintain invariants.
- Health metrics: reconciliation loop exposes `reconciliationLoopHealthy`, `reconciliationLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`.
- Failure table: reconciliation handles DB-only orders (mark cancel via lifecycle), exchange-only orders (insert lifecycle), status mismatches (trigger lifecycle transitions), partial fills, exchange cancels.
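
The `order_id` → `client_order_id` → `trade_id` matching priority could look like the sketch below. The row shapes and `matchOrders` are hypothetical. The `trade_id` fallback assumes the exchange order carries the deterministic `clientOrderId`, so it compares the ID's suffix; the real service may resolve `trade_id` differently.

```typescript
// Hypothetical row shapes; nullable IDs model a DB row written before IDs were known.
interface DbOrderRow { orderId: string | null; clientOrderId: string | null; tradeId: string; }
interface ExchangeOrderRow { orderId: string; clientOrderId: string; }

interface MatchResult {
  matched: Array<{ db: DbOrderRow; exchange: ExchangeOrderRow }>;
  missingFromExchange: DbOrderRow[];
  missingInDb: ExchangeOrderRow[];
}

function matchOrders(dbOrders: DbOrderRow[], exchangeOrders: ExchangeOrderRow[]): MatchResult {
  const unmatched = new Map<string, ExchangeOrderRow>();
  for (const ex of exchangeOrders) unmatched.set(ex.orderId, ex);
  const result: MatchResult = { matched: [], missingFromExchange: [], missingInDb: [] };

  for (const db of dbOrders) {
    const candidates = Array.from(unmatched.values());
    // Try the strongest key first, then fall back to weaker ones.
    const hit =
      candidates.find((ex) => db.orderId !== null && ex.orderId === db.orderId) ??
      candidates.find((ex) => db.clientOrderId !== null && ex.clientOrderId === db.clientOrderId) ??
      candidates.find((ex) => ex.clientOrderId.endsWith(`-${db.tradeId}`));
    if (hit === undefined) {
      result.missingFromExchange.push(db);
    } else {
      unmatched.delete(hit.orderId); // each exchange order matches at most once
      result.matched.push({ db, exchange: hit });
    }
  }
  result.missingInDb = Array.from(unmatched.values());
  return result;
}

const matchResult = matchOrders(
  [
    { orderId: "ex-1", clientOrderId: "bytelyst-p1-t1", tradeId: "t1" },
    { orderId: null, clientOrderId: null, tradeId: "t2" },
    { orderId: "ex-9", clientOrderId: "bytelyst-p1-t9", tradeId: "t9" },
  ],
  [
    { orderId: "ex-1", clientOrderId: "bytelyst-p1-t1" },
    { orderId: "ex-2", clientOrderId: "bytelyst-p1-t2" },
    { orderId: "ex-3", clientOrderId: "bytelyst-p1-t3" },
  ],
);
console.log(
  matchResult.matched.length,
  matchResult.missingFromExchange.length,
  matchResult.missingInDb.length,
); // 2 1 1
```

The two unmatched buckets correspond to the `reconciliationMissingFromExchange` and `reconciliationMissingInDb` counters; in the design above each discrepancy would then be routed to a lifecycle-safe handler rather than updated directly.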
Phase 6 — Distributed Safety
- Row-based lock model: `entry_locks(profile_id, lower(symbol))` with TTL; RPCs `fn_try_acquire_entry_lock_row`, `fn_release_entry_lock_row`.
- Lock TTL logic: default 30s TTL; lock expires automatically if owner crashes; optimistic updates ensure quick lock turnover.
- Owner token design: owner string `processPid-uuid`, stored per attempt; only matching owner can release lock.
- Deterministic `clientOrderId`: `bytelyst-${profileId}-${tradeId}` ensures same trade never re-submits new order; exchange rejects duplicates and response interpreted as existing order.
- Multi-instance behavior: each worker attempts lock acquisition; only one obtains lock, performs ENTRY; others skip and wait for lock release.
- Horizontal scaling model: distributed lock + shared DB/exchange key allows safe scaling; no per-instance state relied upon.
- Deadlock prevention: TTL and owner-based release ensure locks eventually expire; finally blocks always release lock.
- Failure table: lock acquisition failure leads to immediate entry skip; network partition releases lock via TTL; crash during exchange preserves safety because lock expires before restart.
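
The TTL and owner-token rules can be modeled with an in-memory table. `EntryLockTable` is a hypothetical stand-in for `entry_locks`; the clock is passed in explicitly so expiry is easy to follow. In the real system the acquire step would need to be a single atomic database operation, which this single-threaded sketch does not demonstrate.

```typescript
interface EntryLockRow { owner: string; expiresAtMs: number; }

// In-memory stand-in for the entry_locks table; the real system uses RPCs.
class EntryLockTable {
  private rows = new Map<string, EntryLockRow>();

  // Key mirrors entry_locks(profile_id, lower(symbol)).
  private key(profileId: string, symbol: string): string {
    return `${profileId}:${symbol.toLowerCase()}`;
  }

  // Succeeds if no row exists or the existing row has expired.
  tryAcquire(profileId: string, symbol: string, owner: string, nowMs: number, ttlMs = 30000): boolean {
    const k = this.key(profileId, symbol);
    const current = this.rows.get(k);
    if (current !== undefined && current.expiresAtMs > nowMs) return false;
    this.rows.set(k, { owner, expiresAtMs: nowMs + ttlMs });
    return true;
  }

  // Only the owner that currently holds the row may delete it.
  release(profileId: string, symbol: string, owner: string): boolean {
    const k = this.key(profileId, symbol);
    const current = this.rows.get(k);
    if (current === undefined || current.owner !== owner) return false;
    this.rows.delete(k);
    return true;
  }
}

const entryLocks = new EntryLockTable();
console.log(entryLocks.tryAcquire("p1", "AAPL", "101-aaa", 0));     // true
console.log(entryLocks.tryAcquire("p1", "aapl", "202-bbb", 1000));  // false: held, key is case-insensitive
console.log(entryLocks.release("p1", "AAPL", "202-bbb"));           // false: not the owner
console.log(entryLocks.tryAcquire("p1", "AAPL", "202-bbb", 31000)); // true: 30s TTL elapsed
```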
SECTION 3 — CRITICAL INVARIANTS
- No duplicate exchange order
  - Why holds: deterministic `clientOrderId` + row-based lock prevents re-entry; lifecycle RPC guarded by unique constraints.
- No lifecycle without confirmed exchange order
  - Why holds: exchange-first submission ensures RPC only called with confirmed `order_id`; RPC never replays exchange call.
- Capital cannot go negative
  - Why holds: ledger enforces check before reservation; available capital invariant prevents overspend; watchdog logs and rejects actions if invariant violated.
- Only one active ENTRY per (profile_id, symbol)
  - Why holds: acquisition of `(profile_id, symbol)` row lock before entry; TTL ensures exclusivity.
- Reconciliation converges to exchange truth
  - Why holds: reconciliation fetches full open order sets, uses lifecycle handlers, ledger updates, and repeats deterministically (idempotent).
- Restart does not corrupt ledger
  - Why holds: restart rebuild recomputes ledger from exchange open positions/orders; no reliance on cached values.
- Distributed workers cannot double submit
  - Why holds: distributed locks + deterministic clientOrderId + lifecycle uniqueness ensure only one worker can create a lifecycle even under concurrency.
SECTION 4 — EXECUTION FLOW DIAGRAMS
- ENTRY execution
  - signal → profile-level lock acquire → capital check/reserve → deterministic `clientOrderId` → exchange order → lifecycle RPC (insert lifecycle/orders/position/history) → ledger update → lock release.
- EXIT execution
  - cancel/exit signal → lifecycle handler identifies trade_id → create exit order via exchange → lifecycle RPC atomic update (order row, lifecycle, trade_history, position) → ledger releases position reservation, adds realized_pnl.
- Partial fill handling
  - exchange fill update via reconciliation/monitor → partial `quantityFilled` → lifecycle handler adjusts `reserved_for_orders` → move filled notional to `reserved_for_positions` → ensure remaining reservation equals unfilled amount.
- Restart rebuild
  - service start → load profiles → fetch exchange open positions/orders → rebuild ledger reservations + lifecycle map → reconcile pending orders → resume loops.
- Reconciliation cycle
  - for each profile: acquire reconciliation lock → fetch full DB open + recent orders + exchange open set → deterministic matching (order_id/client_order_id/trade_id) → route through lifecycle handlers → release lock → emit metrics.
- Distributed lock acquisition
  - compute deterministic lock key (profile_id + symbol) → call `fn_try_acquire_entry_lock_row` → on success proceed → finally release via `fn_release_entry_lock_row`; TTL auto-expiry handles crashes.
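
The ENTRY flow can be sketched as a single function with injected dependencies. Everything in `EntryDeps`, and the name `runEntry`, is hypothetical; the sketch is synchronous for brevity, whereas real exchange and RPC calls would be asynchronous.

```typescript
// Hypothetical dependency bundle; each member stands in for an RPC or exchange call.
interface EntryDeps {
  tryAcquireLock: () => boolean;
  releaseLock: () => void;
  reserveCapital: () => boolean;
  releaseReservation: () => void;
  submitOrder: (clientOrderId: string) => string; // returns order_id, throws on failure
  persistLifecycle: (orderId: string) => void;
}

type EntryOutcome = "skipped_locked" | "skipped_no_capital" | "exchange_failed" | "entered";

function runEntry(profileId: string, tradeId: string, deps: EntryDeps): EntryOutcome {
  if (!deps.tryAcquireLock()) return "skipped_locked";
  try {
    if (!deps.reserveCapital()) return "skipped_no_capital";
    let orderId: string | null = null;
    try {
      orderId = deps.submitOrder(`bytelyst-${profileId}-${tradeId}`);
    } catch {
      orderId = null;
    }
    if (orderId === null) {
      deps.releaseReservation(); // exchange failure: give the capital back
      return "exchange_failed";
    }
    deps.persistLifecycle(orderId); // only reached with a confirmed order_id
    return "entered";
  } finally {
    deps.releaseLock(); // runs on every path after a successful acquire
  }
}

const callTrace: string[] = [];
const entryOutcome = runEntry("p1", "t1", {
  tryAcquireLock: () => { callTrace.push("lock"); return true; },
  releaseLock: () => { callTrace.push("unlock"); },
  reserveCapital: () => { callTrace.push("reserve"); return true; },
  releaseReservation: () => { callTrace.push("unreserve"); },
  submitOrder: () => { callTrace.push("submit"); throw new Error("exchange rejected"); },
  persistLifecycle: () => { callTrace.push("persist"); },
});
console.log(entryOutcome, callTrace.join(" > ")); // exchange_failed lock > reserve > submit > unreserve > unlock
```

The `finally` block releases the lock on every path once it has been acquired, and the lifecycle step is unreachable without a confirmed `order_id`.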
SECTION 5 — FAILURE SCENARIO TABLE
| Scenario | What happens | Why safe | Recovery behavior |
|---|---|---|---|
| Two workers race | Only one acquires row lock; other aborts | Lock ensures mutual exclusion | Winner proceeds; loser tries next signal |
| Network partition | Lock TTL expires, prevents hang | TTL avoids perpetual ownership | Worker restarts, reacquires lock after TTL |
| DB failure | Transaction aborts, no lifecycle persisted | Persistent state only changes when txn commits | Retry after DB available; idempotent RPC avoids duplicates |
| Exchange timeout | Capital reservation rolled back via mutex/finally | No order is confirmed; if the exchange did accept it, the deterministic `clientOrderId` resolves a retry to the existing order | Signal retries after timeout |
| Crash before lifecycle RPC | Lock TTL ensures future worker can resume | No lifecycle inserted, no capital moved | Restart replay resumes from exchange state |
| Crash after exchange but before persistence | Lifecycle RPC retried with existing `clientOrderId` | Unique constraints prevent duplicates | RPC idempotent insert replays once |
| Partial fill after restart | Reconciliation partial fill handler adjusts ledger | Handler moves filled notional into positions | Consistent ledger, no double counting |
| Supabase outage | RPCs fail, operations roll back | Transactions atomic; no partial writes | Retry after Supabase recovers; observers alerted |
| Lock stuck | TTL expiry clears stale lock | Hard TTL prevents deadlock | Waiting worker acquires after TTL |
SECTION 6 — HEALTH & OBSERVABILITY
- `/internal/health` fields: `tradingLoopHealthy`, `tradingLoopLastRun`, `tradingLoopDuration`, `monitorLoopHealthy`, `monitorLoopLastRun`, `monitorLoopDuration`, `reconciliationLoopHealthy`, `reconciliationLoopLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`, `lockContentionCount`, `capitalInvariantViolations`, `observabilityTimestamp`.
- Loop metrics: duration histograms + last run timestamps; SLO: healthy flag true if last run < 2x expected interval.
- Lock contention metrics: increment per failed lock acquisition; field surfaced both via `/internal/health` and Prometheus.
- Reconciliation metrics: mismatch counts, missing-from-exchange, missing-in-db, lock contention.
- Readiness signals: SLO flags combined to determine readiness; observability records degrade gracefully (logs emitted on invariant violations).
- Degraded mode behavior: if capital invariant fails, watchdog increments violation counter, logs critical error, halts further ENTRY until resolved.
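
The "last run < 2x expected interval" rule can be sketched as below. `LoopStatus`, `isLoopHealthy`, and `isReady` are hypothetical names. Treating a loop that has never run as unhealthy, and defining readiness as all loops healthy with zero capital invariant violations, are assumptions made for this example.

```typescript
interface LoopStatus { lastRunMs: number | null; expectedIntervalMs: number; }

// Healthy if the loop has run at least once and its last run is
// less than 2x the expected interval old.
function isLoopHealthy(s: LoopStatus, nowMs: number): boolean {
  if (s.lastRunMs === null) return false; // assumption: never-run counts as unhealthy
  return nowMs - s.lastRunMs < 2 * s.expectedIntervalMs;
}

// Assumption: readiness requires every loop healthy and no invariant violations.
function isReady(loops: LoopStatus[], capitalInvariantViolations: number, nowMs: number): boolean {
  return capitalInvariantViolations === 0 && loops.every((l) => isLoopHealthy(l, nowMs));
}

const currentMs = 100000;
const tradingLoop: LoopStatus = { lastRunMs: 95000, expectedIntervalMs: 5000 };  // 5s old, limit 10s
const reconLoop: LoopStatus = { lastRunMs: 30000, expectedIntervalMs: 30000 };   // 70s old, limit 60s
console.log(
  isLoopHealthy(tradingLoop, currentMs),
  isLoopHealthy(reconLoop, currentMs),
  isReady([tradingLoop, reconLoop], 0, currentMs),
); // true false false
```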
SECTION 7 — HORIZONTAL SCALING MODEL
- Multi-worker deployment: each worker runs trading/monitor/reconciliation loops; shared DB/exchange key; distributed locks coordinate actions.
- Shared DB: Supabase/Postgres is the single source of truth; all loops interact with same tables.
- Shared exchange key: deterministic `clientOrderId` + lock prevent double submissions despite shared credentials.
- Lock guarantees: row-based entry locks and reconciliation locks with TTL/owner ensure cross-instance exclusivity.
- Why no duplication is possible: distributed locks + deterministic order IDs + lifecycle RPC uniqueness ensure only one worker can create a lifecycle even under concurrency.
SECTION 8 — SAFE ENHANCEMENT RULES
- Lifecycle:
  - DO NOT bypass `fn_persist_entry_lifecycle`.
  - DO NOT write raw status updates; use lifecycle-safe handlers.
- Ledger:
  - DO NOT mutate ledger without RPCs that respect invariant.
  - Always recompute `available_capital` via `allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl`.
- Reconciliation:
  - Always acquire `reconciliation_locks`.
  - Route changes through lifecycle handlers.
- Locking:
  - Always use row locks with TTL and owner tokens; release in finally block.
  - Do not assume single-process state.
- Exchange submission:
  - Always reserve capital before exchange call.
  - Use deterministic `clientOrderId`.
  - Never re-submit the same trade_id; rely on idempotent failures.
DO NOT BREAK these rules; any change violating them risks duplicate executions, capital drift, or stale lifecycle data.