**SECTION 1 — SYSTEM OVERVIEW**

- **High-level architecture diagram (textual)**
  - Trading Bot Service (single codebase) ←→ Supabase/Postgres for durable state ←→ Exchange Connectors (Alpaca, etc.)
  - Observability Layer (Prometheus `/metrics`, structured logs)
  - Dashboard UI reads Supabase state and subscribes to user-scoped WebSocket channels.
  - Distributed workers (multi-instance) share Supabase + Exchange key + Observability feeds.
- **Runtime components**
  - *Trading loop*: scheduled loop per profile that evaluates strategy signals, performs capital checks, acquires lock, submits ENTRY order, and invokes lifecycle RPC.
  - *Monitor loop*: polls exchange/account state, updates positions/orders, enforces invariant watchdogs (capital, lifecycle) and emits metrics/logs.
  - *Reconciliation loop*: acquires per-profile reconciliation lock, fetches full DB vs exchange open order sets, routes discrepancies through lifecycle-safe handlers, and updates metrics/health.
  - *Order sync loop*: keeps DB and exchange orders synced (fills, cancels) by reconciling via lifecycle flows and ledger adjustments.
- **Exchange interaction model**
  - Exchange order submission always occurs before DB persistence.
  - `clientOrderId` deterministic (`bytelyst-${profileId}-${tradeId}`) ensures idempotent exchange requests.
  - Lifecycle RPC persists confirmed exchange order metadata; no retries issue new exchange orders.
- **Single exchange key multi-profile model**
  - One shared exchange API key per bot deployment.
  - Profile isolation achieved via REST/WebSocket tenant scoping and capital ledger segregation.
  - Distributed lock ensures one active ENTRY per `(profile_id, symbol)` even with shared API key.

---

**SECTION 2 — PHASE-BY-PHASE ENTERPRISE HARDENING**

*Phase 1 — Tenant Isolation*

- Profiles isolated with Supabase RLS: orders/trade_history/positions rows are scoped to `profile_id`, forced by RLS policies (profile owner or `service_role` only).
- WebSocket scopes: runtime state broadcast filtered by `user_id`, preventing cross-tenant leakage.
- Data exposure guarantees: authenticated requests only see their profile records; Realtime channels emit tenant-scoped runtime state; service tokens required for administrative access.

*Phase 2 — Restart Durability*

- **Startup rebuild flow**: on start, load persisted profiles and, for each profile, fetch exchange open positions/orders, then rebuild lifecycle maps and ledger.
- **Capital rebuild logic**: ledger reset per profile, then rebuilt by replaying open positions/orders from exchange state, recalculating `reserved_for_positions`/`reserved_for_orders`.
- **Lifecycle rebuild**: trade lifecycle map reconstructed from persisted orders/trade_history; open positions re-linked via trade_id.
- **Pending order reconstruction**: open exchange orders reinserted if missing, reconciled via lifecycle RPCs to ensure consistent DB state.
- **Deterministic state rebuild proof**: deterministic parsing of exchange data + ledger reconstruction ensures restart idempotency (same inputs → same ledger state).

*Phase 3 — Capital Ledger*

- **Ledger schema**: per-profile ledger table with columns `allocated_capital`, `reserved_for_orders`, `reserved_for_positions`, `realized_pnl`. `available_capital` is computed from the invariant.
- **RPC guarantees**: ledger updates happen via atomic RPCs; reservation occurs before exchange calls; releases on cancel/exit/failure also go through RPCs, so ledger state stays durable.
- **Reservation lifecycle**: before ENTRY, profile-level mutex acquired, required capital deducted into `reserved_for_orders`, released upon exchange failure.
- **Invariant**: `available_capital = allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl` always upheld after every ledger mutation.
- **Partial fill math**: filled notional moves proportionally from `reserved_for_orders` into `reserved_for_positions`; partial execution persists both fill amount and remaining reservation.
- **Restart math proof**: ledger rebuild sums the exact fill notional of exchange open orders/positions, so the invariant is recomputed identically on every restart.
- **Crash recovery proof**: if a crash occurs during reservation, restart logic recomputes the reservation from exchange state. New reservations are re-run only when capital is available.

*Phase 4 — Transactional Lifecycle*

- **Exchange-first entry flow**: trade signals evaluate → capital reserve → lock → exchange order → receive `order_id` → call `fn_persist_entry_lifecycle`.
- **Lifecycle RPC flow**: inserts `trade_lifecycle`, `orders`, `positions`, `trade_history` atomically; uses `UNIQUE(profile_id, trade_id)` guard; child inserts have `ON CONFLICT DO NOTHING`.
- **Idempotency keys**: RPC uses `trade_id` + `profile_id`; deterministic `clientOrderId`; repeated calls look up existing lifecycle instead of inserting duplicates.
- **Unique constraints**: `trade_lifecycle(profile_id, trade_id)` unique; `orders(order_id)` unique; positions keyed by `(profile_id, trade_id)`; together these ensure duplicates cannot arise.
- **Failure handling**: RPC wrapped in transaction; failures roll back entire lifecycle `INSERT`. On retry, unique constraint ensures safe idempotency; duplicate lifecycle fetch returns existing state.
- **Why exchange is source of truth**: order is placed before any persistence. If DB commit fails, replays fetch confirmed exchange order via idempotent RPC without re-submitting.

*Phase 5 — Reconciliation*

- **Deterministic comparison algorithm**: for each profile, fetch entire open DB order set + recent closed set (no limit) and full exchange open set; match using `order_id` → `client_order_id` → `trade_id`.
- **Locking model**: row-based `reconciliation_locks` per profile with TTL; RPCs `fn_try_acquire_reconciliation_lock_row`, `fn_release_reconciliation_lock_row`.
- **Lifecycle-safe handler routing**: discrepancies processed via handlers (`reconcileEntryFill`, `reconcileExitFill`, `reconcileCancel`, `logOrder`) instead of raw `updateOrderStatus`.
- **Ledger adjustment routing**: reconciliation uses capital ledger APIs for fills/cancels to maintain invariants.
- **Health metrics**: reconciliation loop exposes `reconciliationLoopHealthy`, `reconciliationLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`.
- **Failure table**: reconciliation handles DB-only orders (mark cancel via lifecycle), exchange-only orders (insert lifecycle), status mismatches (trigger lifecycle transitions), partial fills, exchange cancels.

*Phase 6 — Distributed Safety*

- **Row-based lock model**: `entry_locks(profile_id, lower(symbol))` with TTL; RPCs `fn_try_acquire_entry_lock_row`, `fn_release_entry_lock_row`.
- **Lock TTL logic**: default 30s TTL; lock expires automatically if owner crashes; optimistic updates ensure quick lock turnover.
- **Owner token design**: owner string `processPid-uuid`, stored per attempt; only matching owner can release lock.
- **Deterministic `clientOrderId`**: `bytelyst-${profileId}-${tradeId}` ensures same trade never re-submits new order; exchange rejects duplicates and response interpreted as existing order.
- **Multi-instance behavior**: each worker attempts lock acquisition; only one obtains lock, performs ENTRY; others skip and wait for lock release.
- **Horizontal scaling model**: distributed lock + shared DB/exchange key allows safe scaling; no per-instance state relied upon.
- **Deadlock prevention**: TTL and owner-based release ensure locks eventually expire; finally blocks always release lock.
- **Failure table**: lock acquisition failure leads to immediate entry skip; network partition releases lock via TTL; crash during exchange preserves safety because lock expires before restart.
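
The Phase 6 lock rules (row keyed by `(profile_id, lower(symbol))`, TTL expiry, owner-only release, deterministic `clientOrderId`) can be sketched with an in-memory model. This is only an illustration under stated assumptions: `EntryLockTable`, `tryAcquire`, `release`, the explicit `now` argument, and the worker names are hypothetical, and the real system is described as using Postgres rows behind `fn_try_acquire_entry_lock_row` / `fn_release_entry_lock_row`, whose signatures are not shown in this document. Passing `now` explicitly keeps the TTL behavior deterministic and easy to reason about.

```typescript
// Hypothetical in-memory stand-in for the `entry_locks` table.
interface LockRow {
  owner: string;     // owner token, e.g. "<pid>-<uuid>"
  expiresAt: number; // epoch milliseconds
}

class EntryLockTable {
  private rows = new Map<string, LockRow>();

  // Key mirrors entry_locks(profile_id, lower(symbol)).
  private static key(profileId: string, symbol: string): string {
    return `${profileId}:${symbol.toLowerCase()}`;
  }

  // Succeeds only when no live (unexpired) row exists for the key.
  tryAcquire(profileId: string, symbol: string, owner: string, ttlMs: number, now: number): boolean {
    const k = EntryLockTable.key(profileId, symbol);
    const row = this.rows.get(k);
    if (row !== undefined && row.expiresAt > now) {
      return false; // another attempt still holds a live lock
    }
    this.rows.set(k, { owner, expiresAt: now + ttlMs });
    return true;
  }

  // Only the owner token stored on the row may release it.
  release(profileId: string, symbol: string, owner: string): boolean {
    const k = EntryLockTable.key(profileId, symbol);
    const row = this.rows.get(k);
    if (row === undefined || row.owner !== owner) {
      return false;
    }
    this.rows.delete(k);
    return true;
  }
}

// Deterministic ID format taken from the document.
function clientOrderId(profileId: string, tradeId: string): string {
  return `bytelyst-${profileId}-${tradeId}`;
}

const TTL_MS = 30000; // default TTL stated above
const t0 = 1000000;   // arbitrary start time for the walkthrough
const locks = new EntryLockTable();

// Worker A wins; worker B loses the race on the same (profile, symbol), case-insensitively.
const gotA = locks.tryAcquire("p1", "AAPL", "worker-a", TTL_MS, t0);
const gotB = locks.tryAcquire("p1", "aapl", "worker-b", TTL_MS, t0 + 1000);
// A non-owner cannot release the lock.
const wrongRelease = locks.release("p1", "AAPL", "worker-b");
// If A crashes and never releases, B can acquire once the TTL has elapsed.
const gotBAfterTtl = locks.tryAcquire("p1", "AAPL", "worker-b", TTL_MS, t0 + TTL_MS);
// A's late release is rejected because the row now belongs to B.
const staleRelease = locks.release("p1", "AAPL", "worker-a");
const orderId = clientOrderId("p1", "t42");
```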
---

**SECTION 3 — CRITICAL INVARIANTS**

1. **No duplicate exchange order**
   - Why it holds: deterministic `clientOrderId` + row-based lock prevents re-entry; lifecycle RPC guarded by unique constraints.
2. **No lifecycle without confirmed exchange order**
   - Why it holds: exchange-first submission ensures RPC only called with confirmed `order_id`; RPC never replays exchange call.
3. **Capital cannot go negative**
   - Why it holds: ledger enforces check before reservation; available capital invariant prevents overspend; watchdog logs and rejects actions if invariant violated.
4. **Only one active ENTRY per (profile_id, symbol)**
   - Why it holds: acquisition of `(profile_id, symbol)` row lock before entry; TTL ensures exclusivity.
5. **Reconciliation converges to exchange truth**
   - Why it holds: reconciliation fetches full open order sets, uses lifecycle handlers, ledger updates, and repeats deterministically (idempotent).
6. **Restart does not corrupt ledger**
   - Why it holds: restart rebuild recomputes ledger from exchange open positions/orders; no reliance on cached values.
7. **Distributed workers cannot double submit**
   - Why it holds: distributed locks + deterministic `clientOrderId` + lifecycle uniqueness ensure only one worker can create a lifecycle even under concurrency.

---

**SECTION 4 — EXECUTION FLOW DIAGRAMS**

1. **ENTRY execution**
   - signal → profile-level lock acquire → capital check/reserve → deterministic `clientOrderId` → exchange order → lifecycle RPC (insert lifecycle/orders/position/history) → ledger update → lock release.
2. **EXIT execution**
   - cancel/exit signal → lifecycle handler identifies trade_id → create exit order via exchange → lifecycle RPC atomic update (order row, lifecycle, trade_history, position) → ledger releases position reservation, adds realized_pnl.
3. **Partial fill handling**
   - exchange fill update via reconciliation/monitor → partial `quantityFilled` → lifecycle handler adjusts `reserved_for_orders` → move filled notional to `reserved_for_positions` → ensure remaining reservation equals unfilled amount.
4. **Restart rebuild**
   - service start → load profiles → fetch exchange open positions/orders → rebuild ledger reservations + lifecycle map → reconcile pending orders → resume loops.
5. **Reconciliation cycle**
   - for each profile: acquire reconciliation lock → fetch full DB open + recent orders + exchange open set → deterministic matching (order_id/client_order_id/trade_id) → route through lifecycle handlers → release lock → emit metrics.
6. **Distributed lock acquisition**
   - compute deterministic lock key (profile_id + symbol) → call `fn_try_acquire_entry_lock_row` → on success proceed → finally release via `fn_release_entry_lock_row`; TTL auto-expiry handles crashes.

---

**SECTION 5 — FAILURE SCENARIO TABLE**

| Scenario | What happens | Why safe | Recovery behavior |
|---|---|---|---|
| Two workers race | Only one acquires row lock; other aborts | Lock ensures mutual exclusion | Winner proceeds; loser tries next signal |
| Network partition | Lock TTL expires, prevents hang | TTL avoids perpetual ownership | Worker restarts, reacquires lock after TTL |
| DB failure | Transaction aborts, no lifecycle persisted | Persistent state only changes when txn commits | Retry after DB available; idempotent RPC avoids duplicates |
| Exchange timeout | Capital reservation rolled back via mutex/finally | No exchange order submitted | Signal retries after timeout |
| Crash before lifecycle RPC | Lock TTL ensures future worker can resume | No lifecycle inserted, no capital moved | Restart replay resumes from exchange state |
| Crash after exchange but before persistence | Lifecycle RPC retried with existing `clientOrderId` | Unique constraints prevent duplicates | RPC idempotent insert replays once |
| Partial fill after restart | Reconciliation partial fill handler adjusts ledger | Handler moves filled notional into positions | Consistent ledger, no double counting |
| Supabase outage | RPCs fail, operations roll back | Transactions atomic; no partial writes | Retry after Supabase recovers; observers alerted |
| Lock stuck | TTL expiry clears stale lock | Hard TTL prevents deadlock | Waiting worker acquires after TTL |

---

**SECTION 6 — HEALTH & OBSERVABILITY**

- `/internal/health` fields: `tradingLoopHealthy`, `tradingLoopLastRun`, `tradingLoopDuration`, `monitorLoopHealthy`, `monitorLoopLastRun`, `monitorLoopDuration`, `reconciliationLoopHealthy`, `reconciliationLoopLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`, `lockContentionCount`, `capitalInvariantViolations`, `observabilityTimestamp`.
- Loop metrics: duration histograms + last run timestamps; SLO: healthy flag true if last run < 2x expected interval.
- Lock contention metrics: increment per failed lock acquisition; field surfaced both via `/internal/health` and Prometheus.
- Reconciliation metrics: mismatch counts, missing-from-exchange, missing-in-db, lock contention.
- Readiness signals: SLO flags combined to determine readiness; observability records degrade gracefully (logs emitted on invariant violations).
- Degraded mode behavior: if capital invariant fails, watchdog increments violation counter, logs critical error, halts further ENTRY until resolved.

---

**SECTION 7 — HORIZONTAL SCALING MODEL**

- Multi-worker deployment: each worker runs trading/monitor/reconciliation loops; shared DB/exchange key; distributed locks coordinate actions.
- Shared DB: Supabase/Postgres is the single source of truth; all loops interact with same tables.
- Shared exchange key: deterministic `clientOrderId` + lock prevent double submissions despite shared credentials.
- Lock guarantees: row-based entry locks and reconciliation locks with TTL/owner ensure cross-instance exclusivity.
- Why no duplication is possible: distributed locks + deterministic order IDs + lifecycle RPC uniqueness ensure only one worker can create a lifecycle even under concurrency.

---

**SECTION 8 — SAFE ENHANCEMENT RULES**

- **Lifecycle**:
  - DO NOT bypass `fn_persist_entry_lifecycle`.
  - DO NOT write raw status updates; use lifecycle-safe handlers.
- **Ledger**:
  - DO NOT mutate ledger without RPCs that respect invariant.
  - Always recompute `available_capital` via `allocated - reserved_orders - reserved_positions + realized_pnl`.
- **Reconciliation**:
  - Always acquire `reconciliation_locks`.
  - Route changes through lifecycle handlers.
- **Locking**:
  - Always use row locks with TTL and owner tokens; release in finally block.
  - Do not assume single-process state.
- **Exchange submission**:
  - Always reserve capital before exchange call.
  - Use deterministic `clientOrderId`.
  - Never re-submit the same trade_id; rely on idempotent failures.

**DO NOT BREAK** these rules; any change violating them risks duplicate executions, capital drift, or stale lifecycle data.
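
As a worked illustration of the ledger rule above, the following sketch recomputes `available_capital` from the four ledger columns after each mutation and walks through reserve, partial fill, cancel, and exit. It is a minimal sketch under assumptions, not the actual RPC logic: the `Ledger` type, the camelCase field names, and the helper functions are hypothetical, amounts are plain integers, and fills are assumed to execute at the reserved price so that filled notional moves one-for-one between reservations.

```typescript
// Hypothetical in-memory mirror of the per-profile ledger columns.
interface Ledger {
  allocatedCapital: number;
  reservedForOrders: number;
  reservedForPositions: number;
  realizedPnl: number;
}

// Invariant from the document: available is always derived, never stored.
function availableCapital(l: Ledger): number {
  return l.allocatedCapital - l.reservedForOrders - l.reservedForPositions + l.realizedPnl;
}

// Reserve before any exchange call; reject if it would overspend.
function reserveForOrder(l: Ledger, notional: number): Ledger {
  if (notional <= 0 || notional > availableCapital(l)) {
    throw new Error("insufficient available capital");
  }
  return { ...l, reservedForOrders: l.reservedForOrders + notional };
}

// Partial or full fill: filled notional moves from orders to positions.
function applyFill(l: Ledger, filledNotional: number): Ledger {
  if (filledNotional < 0 || filledNotional > l.reservedForOrders) {
    throw new Error("fill exceeds order reservation");
  }
  return {
    ...l,
    reservedForOrders: l.reservedForOrders - filledNotional,
    reservedForPositions: l.reservedForPositions + filledNotional,
  };
}

// Cancel or exchange failure: release the unfilled remainder.
function releaseOrderReservation(l: Ledger, notional: number): Ledger {
  if (notional < 0 || notional > l.reservedForOrders) {
    throw new Error("release exceeds order reservation");
  }
  return { ...l, reservedForOrders: l.reservedForOrders - notional };
}

// Exit: release the position reservation and book realized PnL.
function closePosition(l: Ledger, positionNotional: number, pnl: number): Ledger {
  if (positionNotional < 0 || positionNotional > l.reservedForPositions) {
    throw new Error("close exceeds position reservation");
  }
  return {
    ...l,
    reservedForPositions: l.reservedForPositions - positionNotional,
    realizedPnl: l.realizedPnl + pnl,
  };
}

const start: Ledger = { allocatedCapital: 10000, reservedForOrders: 0, reservedForPositions: 0, realizedPnl: 0 };
const afterReserve = reserveForOrder(start, 4000);             // available: 6000
const afterPartial = applyFill(afterReserve, 1500);            // orders 2500, positions 1500
const afterCancel = releaseOrderReservation(afterPartial, 2500); // available: 8500
const afterExit = closePosition(afterCancel, 1500, 200);       // available: 10200

// An over-sized reservation is rejected instead of driving capital negative.
let overReserveRejected = false;
try {
  reserveForOrder(afterReserve, 9000);
} catch {
  overReserveRejected = true;
}
```

Note that a partial fill leaves `availableCapital` unchanged, because the same notional is only moved between the two reservation buckets; available capital changes only on reserve, release, and realized PnL.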