**SECTION 1 — SYSTEM OVERVIEW**

- **High-level architecture diagram (textual)**
  - Trading Bot Service (single codebase) ←→ Supabase/Postgres for durable state ←→ Exchange Connectors (Alpaca, etc.)
  - Observability Layer (Prometheus `/metrics`, structured logs)
  - Dashboard UI reads Supabase state and subscribes to user-scoped WebSocket channels.
  - Distributed workers (multi-instance) share Supabase, the exchange key, and observability feeds.

- **Runtime components**
  - *Trading loop*: a scheduled loop per profile that evaluates strategy signals, performs capital checks, acquires the entry lock, submits the ENTRY order, and invokes the lifecycle RPC.
  - *Monitor loop*: polls exchange and account state, updates positions and orders, enforces invariant watchdogs (capital, lifecycle), and emits metrics and logs.
  - *Reconciliation loop*: acquires the per-profile reconciliation lock, fetches the full DB and exchange open-order sets, routes discrepancies through lifecycle-safe handlers, and updates metrics and health.
  - *Order sync loop*: keeps DB and exchange orders in sync (fills, cancels) by reconciling through lifecycle flows and ledger adjustments.

- **Exchange interaction model**
  - Exchange order submission always occurs before DB persistence.
  - A deterministic `clientOrderId` (`bytelyst-${profileId}-${tradeId}`) makes exchange requests idempotent.
  - The lifecycle RPC persists confirmed exchange order metadata; retries never issue new exchange orders.

- **Single exchange key multi-profile model**
  - One shared exchange API key per bot deployment.
  - Profile isolation is achieved via REST/WebSocket tenant scoping and capital ledger segregation.
  - A distributed lock ensures one active ENTRY per `(profile_id, symbol)` even with a shared API key.

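
The exchange interaction model above relies on a deterministic `clientOrderId`. The following is a minimal sketch of that idea, not the bot's actual code: `buildClientOrderId`, `FakeExchange`, and `ExchangeOrder` are hypothetical names, and the in-memory exchange only imitates a venue that deduplicates on `clientOrderId`.

```typescript
// Hypothetical sketch: every name here is illustrative, not the bot's real API.

interface ExchangeOrder {
  orderId: string;
  clientOrderId: string;
  symbol: string;
  qty: number;
}

// Builds the deterministic clientOrderId in the format given in this document.
function buildClientOrderId(profileId: string, tradeId: string): string {
  if (profileId.length === 0 || tradeId.length === 0) {
    throw new Error("profileId and tradeId must be non-empty");
  }
  return `bytelyst-${profileId}-${tradeId}`;
}

// In-memory stand-in for an exchange that deduplicates on clientOrderId.
class FakeExchange {
  private readonly byClientId = new Map<string, ExchangeOrder>();
  private nextId = 1;

  // A repeated clientOrderId returns the existing order instead of creating
  // a second one, so a retry is interpreted as "order already exists".
  submit(
    clientOrderId: string,
    symbol: string,
    qty: number
  ): { order: ExchangeOrder; duplicate: boolean } {
    const existing = this.byClientId.get(clientOrderId);
    if (existing !== undefined) {
      return { order: existing, duplicate: true };
    }
    const order: ExchangeOrder = {
      orderId: `ex-${this.nextId++}`,
      clientOrderId,
      symbol,
      qty,
    };
    this.byClientId.set(clientOrderId, order);
    return { order, duplicate: false };
  }
}
```
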
---

**SECTION 2 — PHASE-BY-PHASE ENTERPRISE HARDENING**

*Phase 1 — Tenant Isolation*
- Profiles are isolated with Supabase RLS: rows in `orders`, `trade_history`, and `positions` are scoped to `profile_id` and enforced by RLS policies (profile owner or `service_role` only).
- WebSocket scopes: runtime state broadcasts are filtered by `user_id`, preventing cross-tenant leakage.
- Data exposure guarantees: authenticated requests see only their own profile records; Realtime channels emit tenant-scoped runtime state; service tokens are required for administrative access.

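
The WebSocket scoping rule above can be pictured as a filter applied at broadcast time. This is only a sketch under assumptions: `RuntimeEvent`, `Subscriber`, and `broadcastScoped` are hypothetical names, and a real deployment would rely on the Realtime channel's own scoping rather than an in-process loop.

```typescript
// Hypothetical sketch of user-scoped broadcasting; not the actual channel code.

interface RuntimeEvent {
  userId: string;    // owner of the profile that produced the event
  profileId: string;
  payload: string;
}

interface Subscriber {
  userId: string;             // authenticated user behind the socket
  received: RuntimeEvent[];   // stands in for "sent over the socket"
}

// Delivers the event only to subscribers whose userId matches the event's
// userId, and returns how many subscribers received it.
function broadcastScoped(event: RuntimeEvent, subscribers: Subscriber[]): number {
  let delivered = 0;
  for (const sub of subscribers) {
    if (sub.userId === event.userId) {
      sub.received.push(event);
      delivered++;
    }
  }
  return delivered;
}
```

The point of the sketch is that the filter key is the authenticated `user_id`, never a value supplied by the client in the subscription request.
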
*Phase 2 — Restart Durability*
- **Startup rebuild flow**: on start, load the persisted profiles; then, for each profile, fetch exchange open positions and orders and rebuild the lifecycle maps and ledger.
- **Capital rebuild logic**: the ledger is reset per profile, then rebuilt by replaying open positions and orders from exchange state, recalculating `reserved_for_positions` and `reserved_for_orders`.
- **Lifecycle rebuild**: the trade lifecycle map is reconstructed from persisted `orders` and `trade_history`; open positions are re-linked via `trade_id`.
- **Pending order reconstruction**: open exchange orders missing from the DB are reinserted and reconciled via lifecycle RPCs to keep DB state consistent.
- **Deterministic state rebuild proof**: deterministic parsing of exchange data plus ledger reconstruction makes restart idempotent (same inputs → same ledger state).

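
The capital rebuild described above amounts to "reset, then replay exchange state". A minimal sketch follows, assuming that notional is `qty * price`, that `allocated_capital` and `realized_pnl` are read back from persisted state, and that each exchange record has already been attributed to a profile. All type and function names are hypothetical.

```typescript
// Hypothetical sketch of the restart ledger rebuild; field names are assumed.

interface ExchangeOpenOrder {
  profileId: string;
  remainingQty: number; // unfilled quantity still working on the exchange
  limitPrice: number;
}

interface ExchangePosition {
  profileId: string;
  qty: number;
  avgEntryPrice: number;
}

interface Ledger {
  allocatedCapital: number;
  reservedForOrders: number;
  reservedForPositions: number;
  realizedPnl: number;
}

// Resets both reservations to zero, then replays open orders and positions.
// The result depends only on the inputs, so repeated restarts agree.
function rebuildLedger(
  profileId: string,
  allocatedCapital: number,
  realizedPnl: number,
  openOrders: ExchangeOpenOrder[],
  positions: ExchangePosition[]
): Ledger {
  const ledger: Ledger = {
    allocatedCapital,
    reservedForOrders: 0,
    reservedForPositions: 0,
    realizedPnl,
  };
  for (const order of openOrders) {
    if (order.profileId === profileId) {
      ledger.reservedForOrders += order.remainingQty * order.limitPrice;
    }
  }
  for (const position of positions) {
    if (position.profileId === profileId) {
      ledger.reservedForPositions += position.qty * position.avgEntryPrice;
    }
  }
  return ledger;
}
```

Because the function never reads a previous ledger value, running it twice on the same exchange snapshot yields the same reservations.
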
*Phase 3 — Capital Ledger*
- **Ledger schema**: a per-profile ledger table with columns `allocated_capital`, `reserved_for_orders`, `reserved_for_positions`, and `realized_pnl`. `available_capital` is computed from the invariant.
- **RPC guarantees**: ledger updates happen via atomic RPCs; reservation occurs before exchange calls; releases on cancel, exit, or failure also go through RPCs, so ledger state stays durable.
- **Reservation lifecycle**: before ENTRY, a profile-level mutex is acquired and the required capital is moved into `reserved_for_orders`; the reservation is released if the exchange call fails.
- **Invariant**:
  `available_capital = allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl`
  is upheld after every ledger mutation.
- **Partial fill math**: filled notional moves proportionally from `reserved_for_orders` into `reserved_for_positions`; a partial execution persists both the fill amount and the remaining reservation.
- **Restart math proof**: the ledger rebuild sums the exact notional of exchange open orders and positions, so the invariant is recomputed identically on restart.
- **Crash recovery proof**: if a crash occurs during reservation, restart logic recomputes reservations from exchange state. New reservations are attempted only when capital is available.

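
The invariant and the partial fill rule above can be checked with a few pure functions. This is a sketch under assumptions: the real ledger is mutated through database RPCs, whereas `reserveForOrder`, `applyFill`, and `releaseOrderReservation` below are hypothetical in-memory equivalents.

```typescript
// Hypothetical in-memory model of the capital ledger; the real one lives in RPCs.

interface LedgerState {
  allocatedCapital: number;
  reservedForOrders: number;
  reservedForPositions: number;
  realizedPnl: number;
}

// The invariant from this section, computed rather than stored.
function availableCapital(l: LedgerState): number {
  return l.allocatedCapital - l.reservedForOrders - l.reservedForPositions + l.realizedPnl;
}

// Reserves capital for a new order; refuses if it would overdraw the profile.
function reserveForOrder(l: LedgerState, amount: number): LedgerState {
  if (amount <= 0) throw new Error("amount must be positive");
  if (amount > availableCapital(l)) throw new Error("insufficient available capital");
  return { ...l, reservedForOrders: l.reservedForOrders + amount };
}

// Moves filled notional from the order reservation to the position reservation.
// Available capital is unchanged because the two terms offset each other.
function applyFill(l: LedgerState, filledNotional: number): LedgerState {
  if (filledNotional < 0 || filledNotional > l.reservedForOrders) {
    throw new Error("fill exceeds order reservation");
  }
  return {
    ...l,
    reservedForOrders: l.reservedForOrders - filledNotional,
    reservedForPositions: l.reservedForPositions + filledNotional,
  };
}

// Releases an unused order reservation (cancel or exchange failure).
function releaseOrderReservation(l: LedgerState, amount: number): LedgerState {
  if (amount < 0 || amount > l.reservedForOrders) {
    throw new Error("release exceeds order reservation");
  }
  return { ...l, reservedForOrders: l.reservedForOrders - amount };
}
```

For example, with 1000 allocated, reserving 400 leaves 600 available; a partial fill of 100 then leaves 300 in `reserved_for_orders`, 100 in `reserved_for_positions`, and available capital still at 600.
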
*Phase 4 — Transactional Lifecycle*
- **Exchange-first entry flow**: evaluate trade signals → reserve capital → acquire lock → submit exchange order → receive `order_id` → call `fn_persist_entry_lifecycle`.
- **Lifecycle RPC flow**: inserts `trade_lifecycle`, `orders`, `positions`, and `trade_history` rows atomically; guarded by `UNIQUE(profile_id, trade_id)`; child inserts use `ON CONFLICT DO NOTHING`.
- **Idempotency keys**: the RPC is keyed on `trade_id` + `profile_id`, alongside the deterministic `clientOrderId`; repeated calls look up the existing lifecycle instead of inserting duplicates.
- **Unique constraints**: `trade_lifecycle(profile_id, trade_id)` is unique; `orders(order_id)` is unique; positions are keyed by `(profile_id, trade_id)`; together these prevent duplicate rows.
- **Failure handling**: the RPC is wrapped in a transaction; a failure rolls back the entire lifecycle `INSERT`. On retry, the unique constraint makes the call idempotent, and a duplicate lifecycle fetch returns the existing state.
- **Why exchange is source of truth**: the order is placed before any persistence. If the DB commit fails, replays fetch the confirmed exchange order and persist it via the idempotent RPC without re-submitting.

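
The idempotency behavior described above can be imitated in memory. The sketch below is not `fn_persist_entry_lifecycle` itself (that is a database RPC); `InMemoryLifecycleStore` and its method are hypothetical stand-ins that mimic the `(profile_id, trade_id)` and `order_id` uniqueness rules.

```typescript
// Hypothetical in-memory imitation of the lifecycle RPC's uniqueness rules.

interface EntryLifecycle {
  profileId: string;
  tradeId: string;
  orderId: string;       // confirmed exchange order id
  clientOrderId: string;
}

interface PersistResult {
  lifecycle: EntryLifecycle;
  created: boolean; // false when an existing lifecycle was returned
}

class InMemoryLifecycleStore {
  private readonly byTrade = new Map<string, EntryLifecycle>();
  private readonly orderOwners = new Map<string, string>();

  // All checks run before any write, so a failure leaves no partial state,
  // which is the property the transaction provides in the real RPC.
  persistEntryLifecycle(input: EntryLifecycle): PersistResult {
    const key = JSON.stringify([input.profileId, input.tradeId]);
    const existing = this.byTrade.get(key);
    if (existing !== undefined) {
      return { lifecycle: existing, created: false }; // idempotent replay
    }
    if (this.orderOwners.has(input.orderId)) {
      throw new Error(`order_id ${input.orderId} already belongs to another lifecycle`);
    }
    const stored: EntryLifecycle = { ...input };
    this.byTrade.set(key, stored);
    this.orderOwners.set(input.orderId, key);
    return { lifecycle: stored, created: true };
  }

  count(): number {
    return this.byTrade.size;
  }
}
```

Calling `persistEntryLifecycle` twice with the same `(profileId, tradeId)` returns `created: true` and then `created: false`, and the store still holds a single lifecycle.
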
*Phase 5 — Reconciliation*
- **Deterministic comparison algorithm**: for each profile, fetch the entire open DB order set plus the recent closed set (no limit) and the full exchange open set; match by `order_id`, then `client_order_id`, then `trade_id`.
- **Locking model**: row-based `reconciliation_locks` per profile with a TTL; RPCs `fn_try_acquire_reconciliation_lock_row` and `fn_release_reconciliation_lock_row`.
- **Lifecycle-safe handler routing**: discrepancies are processed by handlers (`reconcileEntryFill`, `reconcileExitFill`, `reconcileCancel`, `logOrder`) instead of raw `updateOrderStatus` calls.
- **Ledger adjustment routing**: reconciliation uses the capital ledger APIs for fills and cancels to maintain invariants.
- **Health metrics**: the reconciliation loop exposes `reconciliationLoopHealthy`, `reconciliationLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, and `reconciliationLockContentionCount`.
- **Failure table**: reconciliation handles DB-only orders (marked cancelled via lifecycle), exchange-only orders (lifecycle inserted), status mismatches (lifecycle transitions triggered), partial fills, and exchange cancels.

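
The three-step matching order above (`order_id`, then `client_order_id`, then `trade_id`) can be sketched as follows. The types and `matchOrders` are hypothetical, and the `trade_id` fallback assumes the exchange order can be tied to a trade by recomputing the expected `bytelyst-${profileId}-${tradeId}` value; the document does not say how that step is actually implemented.

```typescript
// Hypothetical sketch of deterministic DB-vs-exchange order matching.

interface DbOrder {
  profileId: string;
  tradeId: string;
  orderId: string | null;
  clientOrderId: string | null;
}

interface ExchangeOrderView {
  orderId: string;
  clientOrderId: string | null;
}

interface MatchResult {
  matched: Array<{ db: DbOrder; exchange: ExchangeOrderView }>;
  missingFromExchange: DbOrder[];
  missingInDb: ExchangeOrderView[];
}

function matchOrders(dbOrders: DbOrder[], exchangeOrders: ExchangeOrderView[]): MatchResult {
  const used = new Set<number>();
  const result: MatchResult = { matched: [], missingFromExchange: [], missingInDb: [] };

  // Returns the index of the first unused exchange order satisfying `pred`.
  const find = (pred: (e: ExchangeOrderView) => boolean): number => {
    for (let i = 0; i < exchangeOrders.length; i++) {
      if (!used.has(i) && pred(exchangeOrders[i])) return i;
    }
    return -1;
  };

  for (const db of dbOrders) {
    const expectedClientId = `bytelyst-${db.profileId}-${db.tradeId}`;
    let idx = db.orderId !== null ? find((e) => e.orderId === db.orderId) : -1;
    if (idx < 0 && db.clientOrderId !== null) {
      idx = find((e) => e.clientOrderId === db.clientOrderId);
    }
    if (idx < 0) {
      idx = find((e) => e.clientOrderId === expectedClientId); // trade_id fallback
    }
    if (idx >= 0) {
      used.add(idx);
      result.matched.push({ db, exchange: exchangeOrders[idx] });
    } else {
      result.missingFromExchange.push(db);
    }
  }
  for (let i = 0; i < exchangeOrders.length; i++) {
    if (!used.has(i)) result.missingInDb.push(exchangeOrders[i]);
  }
  return result;
}
```

The two leftover lists correspond to the `reconciliationMissingFromExchange` and `reconciliationMissingInDb` counters; each entry would then be routed to a lifecycle-safe handler rather than patched directly.
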
*Phase 6 — Distributed Safety*
- **Row-based lock model**: `entry_locks(profile_id, lower(symbol))` with a TTL; RPCs `fn_try_acquire_entry_lock_row` and `fn_release_entry_lock_row`.
- **Lock TTL logic**: default TTL of 30 s; the lock expires automatically if its owner crashes; optimistic updates keep lock turnover quick.
- **Owner token design**: the owner string is `processPid-uuid`, stored per attempt; only the matching owner can release the lock.
- **Deterministic `clientOrderId`**: `bytelyst-${profileId}-${tradeId}` ensures the same trade never submits a second order; the exchange rejects duplicates, and that response is interpreted as the existing order.
- **Multi-instance behavior**: each worker attempts lock acquisition; only one obtains the lock and performs the ENTRY; the others skip and wait for lock release.
- **Horizontal scaling model**: the distributed lock plus the shared DB and exchange key allow safe scaling; no per-instance state is relied upon.
- **Deadlock prevention**: TTL and owner-based release ensure locks eventually expire; `finally` blocks always release the lock.
- **Failure table**: a lock acquisition failure leads to an immediate entry skip; a network partition releases the lock via TTL; a crash during the exchange call stays safe because the lock expires before restart.

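
The lock semantics above (TTL expiry plus owner-checked release) can be modelled in memory. This is a sketch, not the `fn_try_acquire_entry_lock_row` RPC: `EntryLockTable` is a hypothetical class, and the clock is passed in explicitly so the TTL behavior is easy to follow.

```typescript
// Hypothetical in-memory model of the row-based entry lock.

interface LockRow {
  owner: string;      // e.g. a "processPid-uuid" token
  expiresAt: number;  // epoch milliseconds
}

class EntryLockTable {
  private readonly rows = new Map<string, LockRow>();

  constructor(private readonly ttlMs: number = 30000) {}

  // Mirrors the (profile_id, lower(symbol)) key from this section.
  private key(profileId: string, symbol: string): string {
    return `${profileId}:${symbol.toLowerCase()}`;
  }

  // Succeeds only when no unexpired lock exists for the key.
  tryAcquire(profileId: string, symbol: string, owner: string, nowMs: number): boolean {
    const k = this.key(profileId, symbol);
    const existing = this.rows.get(k);
    if (existing !== undefined && existing.expiresAt > nowMs) {
      return false; // held by a live owner
    }
    this.rows.set(k, { owner, expiresAt: nowMs + this.ttlMs });
    return true;
  }

  // Only the matching owner may release; others get false and change nothing.
  release(profileId: string, symbol: string, owner: string): boolean {
    const k = this.key(profileId, symbol);
    const existing = this.rows.get(k);
    if (existing === undefined || existing.owner !== owner) {
      return false;
    }
    this.rows.delete(k);
    return true;
  }
}
```
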
---

**SECTION 3 — CRITICAL INVARIANTS**

1. **No duplicate exchange order**
   - Why it holds: the deterministic `clientOrderId` and the row-based lock prevent re-entry; the lifecycle RPC is guarded by unique constraints.
2. **No lifecycle without confirmed exchange order**
   - Why it holds: exchange-first submission means the RPC is only called with a confirmed `order_id`; the RPC never replays the exchange call.
3. **Capital cannot go negative**
   - Why it holds: the ledger checks available capital before every reservation; the available-capital invariant prevents overspend; the watchdog logs and rejects actions if the invariant is violated.
4. **Only one active ENTRY per (profile_id, symbol)**
   - Why it holds: the `(profile_id, symbol)` row lock is acquired before entry and is exclusive while held; the TTL bounds how long a crashed owner can block others.
5. **Reconciliation converges to exchange truth**
   - Why it holds: reconciliation fetches the full open order sets, routes changes through lifecycle handlers and ledger updates, and is deterministic and idempotent across repeated runs.
6. **Restart does not corrupt ledger**
   - Why it holds: the restart rebuild recomputes the ledger from exchange open positions and orders; it does not rely on cached values.
7. **Distributed workers cannot double submit**
   - Why it holds: distributed locks, the deterministic `clientOrderId`, and lifecycle uniqueness ensure only one worker can create a lifecycle, even under concurrency.

---

**SECTION 4 — EXECUTION FLOW DIAGRAMS**

1. **ENTRY execution**
   - signal → profile-level lock acquire → capital check/reserve → deterministic `clientOrderId` → exchange order → lifecycle RPC (insert lifecycle/orders/position/history) → ledger update → lock release.
2. **EXIT execution**
   - cancel/exit signal → lifecycle handler identifies `trade_id` → create exit order via exchange → lifecycle RPC atomic update (order row, lifecycle, trade_history, position) → ledger releases position reservation and adds `realized_pnl`.
3. **Partial fill handling**
   - exchange fill update via reconciliation/monitor → partial `quantityFilled` → lifecycle handler adjusts `reserved_for_orders` → move filled notional to `reserved_for_positions` → ensure remaining reservation equals unfilled amount.
4. **Restart rebuild**
   - service start → load profiles → fetch exchange open positions/orders → rebuild ledger reservations + lifecycle map → reconcile pending orders → resume loops.
5. **Reconciliation cycle**
   - for each profile: acquire reconciliation lock → fetch full DB open + recent orders + exchange open set → deterministic matching (`order_id`/`client_order_id`/`trade_id`) → route through lifecycle handlers → release lock → emit metrics.
6. **Distributed lock acquisition**
   - compute deterministic lock key (`profile_id` + symbol) → call `fn_try_acquire_entry_lock_row` → on success proceed → finally release via `fn_release_entry_lock_row`; TTL auto-expiry handles crashes.

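
The ENTRY flow in item 1 can be sketched as a single function with a `try`/`finally` around the locked region. This is an illustration under assumptions: `EntryDeps`, `EntryOutcome`, and `runEntry` are hypothetical, the dependencies are injected so the ordering is visible, and the calls are synchronous here for brevity although real exchange and database calls would be asynchronous.

```typescript
// Hypothetical orchestration sketch of the ENTRY flow; not the bot's real code.

interface EntryDeps {
  tryAcquireLock(): boolean;
  releaseLock(): void;
  reserveCapital(): boolean;               // false when capital is insufficient
  releaseCapital(): void;
  submitOrder(clientOrderId: string): string; // returns order_id, throws on failure
  persistLifecycle(orderId: string): void;    // stands in for the lifecycle RPC
}

type EntryOutcome = "skipped_lock" | "skipped_capital" | "exchange_failed" | "persisted";

function runEntry(profileId: string, tradeId: string, deps: EntryDeps): EntryOutcome {
  if (!deps.tryAcquireLock()) {
    return "skipped_lock"; // another worker owns this (profile, symbol)
  }
  try {
    if (!deps.reserveCapital()) {
      return "skipped_capital";
    }
    const clientOrderId = `bytelyst-${profileId}-${tradeId}`;
    let orderId: string;
    try {
      orderId = deps.submitOrder(clientOrderId);
    } catch (_err) {
      deps.releaseCapital(); // exchange failure: give the reservation back
      return "exchange_failed";
    }
    // Persist only after the exchange has confirmed the order. If this throws,
    // the error propagates and a later retry goes through the idempotent RPC.
    deps.persistLifecycle(orderId);
    return "persisted";
  } finally {
    deps.releaseLock(); // always runs, including on early returns and throws
  }
}
```
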
---

**SECTION 5 — FAILURE SCENARIO TABLE**

| Scenario | What happens | Why safe | Recovery behavior |
|---|---|---|---|
| Two workers race | Only one acquires row lock; other aborts | Lock ensures mutual exclusion | Winner proceeds; loser tries next signal |
| Network partition | Lock TTL expires, prevents hang | TTL avoids perpetual ownership | Worker restarts, reacquires lock after TTL |
| DB failure | Transaction aborts, no lifecycle persisted | Persistent state only changes when txn commits | Retry after DB available; idempotent RPC avoids duplicates |
| Exchange timeout | Capital reservation rolled back via mutex/finally | No exchange order submitted | Signal retries after timeout |
| Crash before lifecycle RPC | Lock TTL ensures future worker can resume | No lifecycle inserted, no capital moved | Restart replay resumes from exchange state |
| Crash after exchange but before persistence | Lifecycle RPC retried with existing `clientOrderId` | Unique constraints prevent duplicates | RPC idempotent insert replays once |
| Partial fill after restart | Reconciliation partial fill handler adjusts ledger | Handler moves filled notional into positions | Consistent ledger, no double counting |
| Supabase outage | RPCs fail, operations roll back | Transactions atomic; no partial writes | Retry after Supabase recovers; observers alerted |
| Lock stuck | TTL expiry clears stale lock | Hard TTL prevents deadlock | Waiting worker acquires after TTL |

---

**SECTION 6 — HEALTH & OBSERVABILITY**

- `/internal/health` fields:
  - `tradingLoopHealthy`, `tradingLoopLastRun`, `tradingLoopDuration`,
  - `monitorLoopHealthy`, `monitorLoopLastRun`, `monitorLoopDuration`,
  - `reconciliationLoopHealthy`, `reconciliationLoopLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`,
  - `lockContentionCount`, `capitalInvariantViolations`, `observabilityTimestamp`.
- Loop metrics: duration histograms plus last-run timestamps; SLO: a loop's healthy flag is true if its last run was less than 2x the expected interval ago.
- Lock contention metrics: incremented on every failed lock acquisition; surfaced via both `/internal/health` and Prometheus.
- Reconciliation metrics: mismatch counts, missing-from-exchange, missing-in-db, lock contention.
- Readiness signals: SLO flags are combined to determine readiness; observability degrades gracefully (logs are emitted on invariant violations).
- Degraded mode behavior: if the capital invariant fails, the watchdog increments the violation counter, logs a critical error, and halts further ENTRY until resolved.

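
The "less than 2x the expected interval" rule above reduces to a small comparison. The sketch below assumes timestamps in epoch milliseconds and that readiness is the logical AND of the per-loop flags; `LoopStatus`, `isLoopHealthy`, and `isReady` are hypothetical names.

```typescript
// Hypothetical sketch of the loop health SLO and readiness computation.

interface LoopStatus {
  name: string;
  lastRunMs: number;          // epoch ms of the last completed iteration
  expectedIntervalMs: number; // how often the loop is scheduled to run
}

// Healthy when the last run is more recent than twice the expected interval.
function isLoopHealthy(loop: LoopStatus, nowMs: number): boolean {
  return nowMs - loop.lastRunMs < 2 * loop.expectedIntervalMs;
}

// Ready only when every loop meets its SLO (assumed combination rule).
function isReady(loops: LoopStatus[], nowMs: number): boolean {
  return loops.every((loop) => isLoopHealthy(loop, nowMs));
}
```
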
---

**SECTION 7 — HORIZONTAL SCALING MODEL**

- Multi-worker deployment: each worker runs the trading, monitor, and reconciliation loops; workers share the DB and exchange key; distributed locks coordinate their actions.
- Shared DB: Supabase/Postgres is the single source of truth for persisted state; all loops interact with the same tables.
- Shared exchange key: the deterministic `clientOrderId` and the entry lock prevent double submissions despite shared credentials.
- Lock guarantees: row-based entry locks and reconciliation locks with TTL and owner tokens ensure cross-instance exclusivity.
- Why no duplication is possible: distributed locks, deterministic order IDs, and lifecycle RPC uniqueness ensure only one worker can create a lifecycle, even under concurrency.

---

**SECTION 8 — SAFE ENHANCEMENT RULES**

- **Lifecycle**:
  - DO NOT bypass `fn_persist_entry_lifecycle`.
  - DO NOT write raw status updates; use lifecycle-safe handlers.
- **Ledger**:
  - DO NOT mutate the ledger without RPCs that respect the invariant.
  - Always recompute `available_capital` via `allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl`.
- **Reconciliation**:
  - Always acquire `reconciliation_locks`.
  - Route changes through lifecycle handlers.
- **Locking**:
  - Always use row locks with TTL and owner tokens; release in a `finally` block.
  - Do not assume single-process state.
- **Exchange submission**:
  - Always reserve capital before the exchange call.
  - Use the deterministic `clientOrderId`.
  - Never re-submit the same `trade_id`; rely on the exchange rejecting the duplicate `clientOrderId`.

**DO NOT BREAK** these rules; any change violating them risks duplicate executions, capital drift, or stale lifecycle data.