learning_ai_invt_trdg/backend/ENTERPRISE_ARCHITECTURE_REFERENCE.md

**SECTION 1 — SYSTEM OVERVIEW**

- **High-level architecture diagram (textual)**
  Trading Bot Service (single codebase) ←→ Supabase/Postgres for durable state ←→ Exchange Connectors (Alpaca, etc.)
  Observability Layer (Prometheus `/metrics`, structured logs)
  Dashboard UI reads Supabase state and subscribes to user‑scoped WebSocket channels.
  Distributed workers (multi-instance) share Supabase + Exchange key + Observability feeds.

- **Runtime components**
  - *Trading loop*: scheduled loop per profile that evaluates strategy signals, performs capital checks, acquires lock, submits ENTRY order, and invokes lifecycle RPC.
  - *Monitor loop*: polls exchange/account state, updates positions/orders, enforces invariant watchdogs (capital, lifecycle) and emits metrics/logs.
  - *Reconciliation loop*: acquires per-profile reconciliation lock, fetches full DB vs exchange open order sets, routes discrepancies through lifecycle-safe handlers, and updates metrics/health.
  - *Order sync loop*: keeps DB and exchange orders synced (fills, cancels) by reconciling via lifecycle flows and ledger adjustments.

- **Exchange interaction model**
  - Exchange order submission always occurs before DB persistence.
  - A deterministic `clientOrderId` (`bytelyst-${profileId}-${tradeId}`) makes exchange requests idempotent.
  - Lifecycle RPC persists confirmed exchange order metadata; no retries issue new exchange orders.

- **Single exchange key multi-profile model**
  - One shared exchange API key per bot deployment.
  - Profile isolation achieved via REST/WebSocket tenant scoping and capital ledger segregation.
  - Distributed lock ensures one active ENTRY per `(profile_id, symbol)` even with shared API key.
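
The deterministic `clientOrderId` scheme from the exchange interaction model can be illustrated with a minimal sketch. Only the `bytelyst-${profileId}-${tradeId}` format comes from this document; `buildClientOrderId` and `FakeExchange` are hypothetical names, and the duplicate handling shown is an assumed stand-in for a real connector's behaviour.

```typescript
// Hypothetical helper; only the id format itself is taken from this document.
function buildClientOrderId(profileId: string, tradeId: string): string {
  if (profileId.length === 0 || tradeId.length === 0) {
    throw new Error("profileId and tradeId must be non-empty");
  }
  return `bytelyst-${profileId}-${tradeId}`;
}

// Assumed stand-in for an exchange that recognises a reused client order id.
class FakeExchange {
  private readonly ordersByClientId = new Map<string, string>();
  private nextId = 1;

  // A duplicate clientOrderId resolves to the existing order instead of a new one.
  submit(clientOrderId: string): { orderId: string; duplicate: boolean } {
    const existing = this.ordersByClientId.get(clientOrderId);
    if (existing !== undefined) {
      return { orderId: existing, duplicate: true };
    }
    const orderId = `ord-${this.nextId++}`;
    this.ordersByClientId.set(clientOrderId, orderId);
    return { orderId, duplicate: false };
  }
}

const exchange = new FakeExchange();
const clientOrderId = buildClientOrderId("profile-1", "trade-42");
const firstSubmit = exchange.submit(clientOrderId);
const retrySubmit = exchange.submit(clientOrderId); // same trade, same id, no new order
```

Because the id is a pure function of `profileId` and `tradeId`, any worker retrying the same trade produces the same string, which is what lets the duplicate be detected.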

---

**SECTION 2 — PHASE-BY-PHASE ENTERPRISE HARDENING**

*Phase 1 — Tenant Isolation*
- Profiles isolated with Supabase RLS: orders/trade_history/positions rows are scoped to `profile_id`, enforced by RLS policies (profile owner or `service_role` only).
- WebSocket scopes: runtime state broadcast filtered by `user_id`, preventing cross-tenant leakage.
- Data exposure guarantees: authenticated requests only see their profile records; Realtime channels emit tenant-scoped runtime state; service tokens required for administrative access.

*Phase 2 — Restart Durability*
- **Startup rebuild flow**: on start, load persisted profiles; then, for each profile, fetch exchange open positions/orders and rebuild the lifecycle map and ledger.
- **Capital rebuild logic**: the ledger is reset per profile and then rebuilt by replaying open positions/orders from exchange state, recalculating `reserved_for_positions`/`reserved_for_orders`.
- **Lifecycle rebuild**: trade lifecycle map reconstructed from persisted orders/trade_history; open positions re-linked via trade_id.
- **Pending order reconstruction**: open exchange orders reinserted if missing, reconciled via lifecycle RPCs to ensure consistent DB state.
- **Deterministic state rebuild proof**: deterministic parsing of exchange data + ledger reconstruction ensures restart idempotency (same inputs → same ledger state).
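
The capital rebuild step can be sketched as a pure function of exchange state, which is what makes it repeatable. This is a minimal illustration, not the service's implementation: the record shapes, the `profileId` field on exchange records (under a shared key it would have to be derived somehow), and `rebuildReservations` are all assumptions.

```typescript
// Assumed shapes for exchange state already attributed to a profile.
interface ExchangeOpenOrder { profileId: string; symbol: string; remainingNotional: number; }
interface ExchangeOpenPosition { profileId: string; symbol: string; notional: number; }
interface RebuiltReservations { reservedForOrders: number; reservedForPositions: number; }

// Hypothetical rebuild: reset to zero, then replay open orders and positions.
function rebuildReservations(
  profileId: string,
  openOrders: ExchangeOpenOrder[],
  openPositions: ExchangeOpenPosition[],
): RebuiltReservations {
  let reservedForOrders = 0;
  let reservedForPositions = 0;
  for (const order of openOrders) {
    if (order.profileId === profileId) reservedForOrders += order.remainingNotional;
  }
  for (const position of openPositions) {
    if (position.profileId === profileId) reservedForPositions += position.notional;
  }
  return { reservedForOrders, reservedForPositions };
}

const openOrders: ExchangeOpenOrder[] = [
  { profileId: "p1", symbol: "AAPL", remainingNotional: 100 },
  { profileId: "p1", symbol: "MSFT", remainingNotional: 250 },
  { profileId: "p2", symbol: "AAPL", remainingNotional: 999 }, // other tenant, ignored for p1
];
const openPositions: ExchangeOpenPosition[] = [
  { profileId: "p1", symbol: "TSLA", notional: 400 },
];

// Two rebuilds from the same inputs yield the same reservations.
const firstRebuild = rebuildReservations("p1", openOrders, openPositions);
const secondRebuild = rebuildReservations("p1", openOrders, openPositions);
```

The function reads nothing but its arguments, so a second restart with unchanged exchange state cannot produce a different ledger.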

*Phase 3 — Capital Ledger*
- **Ledger schema**: per-profile ledger table columns `allocated_capital`, `reserved_for_orders`, `reserved_for_positions`, `realized_pnl`. `available_capital` computed from invariant.
- **RPC guarantees**: ledger updates happen via atomic RPCs; reservation occurs before exchange calls; releases on cancel/exit/failure also go through RPCs, so ledger state stays durable.
- **Reservation lifecycle**: before ENTRY, profile-level mutex acquired, required capital deducted into `reserved_for_orders`, released upon exchange failure.
- **Invariant**:
  `available_capital = allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl`
  always upheld after every ledger mutation.
- **Partial fill math**: filled notional moves proportionally from `reserved_for_orders` into `reserved_for_positions`; partial execution persists both fill amount and remaining reservation.
- **Restart math proof**: the ledger rebuild sums the exact notional of exchange open orders and positions, so the invariant is recomputed identically on every restart.
- **Crash recovery proof**: if a crash occurs during reservation, restart logic recomputes reservations from exchange state. New reservations are attempted only when capital is available.
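
The invariant and the partial fill math can be worked through in a small sketch. The field names mirror the ledger columns in camelCase; the functions `availableCapital`, `reserveForOrder`, and `applyFill` are hypothetical and stand in for the atomic RPCs, which this example does not model.

```typescript
interface Ledger {
  allocatedCapital: number;
  reservedForOrders: number;
  reservedForPositions: number;
  realizedPnl: number;
}

// available_capital = allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl
function availableCapital(ledger: Ledger): number {
  return ledger.allocatedCapital - ledger.reservedForOrders
    - ledger.reservedForPositions + ledger.realizedPnl;
}

// Reserve before the exchange call; refuse if it would overspend.
function reserveForOrder(ledger: Ledger, notional: number): Ledger {
  if (notional <= 0) throw new Error("notional must be positive");
  if (notional > availableCapital(ledger)) throw new Error("insufficient available capital");
  return { ...ledger, reservedForOrders: ledger.reservedForOrders + notional };
}

// A (partial) fill moves the filled notional from the order bucket to the position bucket.
function applyFill(ledger: Ledger, filledNotional: number): Ledger {
  if (filledNotional <= 0 || filledNotional > ledger.reservedForOrders) {
    throw new Error("fill must be positive and within the outstanding order reservation");
  }
  return {
    ...ledger,
    reservedForOrders: ledger.reservedForOrders - filledNotional,
    reservedForPositions: ledger.reservedForPositions + filledNotional,
  };
}

const startingLedger: Ledger = {
  allocatedCapital: 1000, reservedForOrders: 0, reservedForPositions: 0, realizedPnl: 0,
};
const afterReserve = reserveForOrder(startingLedger, 400);   // available drops to 600
const afterPartialFill = applyFill(afterReserve, 100);       // 300 still on order, 100 in position
```

A fill only moves notional between the two reserved buckets, so `available_capital` is unchanged by it; only reservations, releases, and realized PnL change what is available.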

*Phase 4 — Transactional Lifecycle*
- **Exchange-first entry flow**: trade signals evaluate → capital reserve → lock → exchange order → receive `order_id` → call `fn_persist_entry_lifecycle`.
- **Lifecycle RPC flow**: inserts `trade_lifecycle`, `orders`, `positions`, `trade_history` atomically; uses `UNIQUE(profile_id, trade_id)` guard; child inserts have `ON CONFLICT DO NOTHING`.
- **Idempotency keys**: RPC uses `trade_id` + `profile_id`; deterministic `clientOrderId`; repeated calls look up existing lifecycle instead of inserting duplicates.
- **Unique constraints**: `trade_lifecycle(profile_id, trade_id)` unique; `orders(order_id)` unique; positions keyed by `(profile_id, trade_id)`; ensures duplicates cannot arise.
- **Failure handling**: RPC wrapped in transaction; failures roll back entire lifecycle `INSERT`. On retry, unique constraint ensures safe idempotency; duplicate lifecycle fetch returns existing state.
- **Why exchange is source of truth**: order is placed before any persistence. If DB commit fails, replays fetch confirmed exchange order via idempotent RPC without re-submitting.
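
The retry behaviour of the lifecycle RPC can be illustrated with an in-memory stand-in. This sketch does not reproduce `fn_persist_entry_lifecycle`; `InMemoryLifecycleStore` is hypothetical and uses a map key to play the role of the `UNIQUE(profile_id, trade_id)` constraint.

```typescript
interface EntryLifecycle { profileId: string; tradeId: string; orderId: string; }
interface PersistResult { lifecycle: EntryLifecycle; inserted: boolean; }

// Hypothetical store; the map key emulates UNIQUE(profile_id, trade_id).
class InMemoryLifecycleStore {
  private readonly rows = new Map<string, EntryLifecycle>();

  // Insert once; a repeated call returns the existing row instead of a duplicate.
  persistEntry(candidate: EntryLifecycle): PersistResult {
    const key = JSON.stringify([candidate.profileId, candidate.tradeId]);
    const existing = this.rows.get(key);
    if (existing !== undefined) {
      return { lifecycle: existing, inserted: false };
    }
    this.rows.set(key, candidate);
    return { lifecycle: candidate, inserted: true };
  }

  size(): number {
    return this.rows.size;
  }
}

const lifecycleStore = new InMemoryLifecycleStore();
const confirmedOrder: EntryLifecycle = { profileId: "p1", tradeId: "t1", orderId: "ord-1" };
const firstPersist = lifecycleStore.persistEntry(confirmedOrder);
// A retry after a lost response replays the same confirmed order, never a new exchange call.
const replayPersist = lifecycleStore.persistEntry(confirmedOrder);
```

The caller can treat `inserted: false` the same as success, which is why a retry after a dropped response is harmless.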

*Phase 5 — Reconciliation*
- **Deterministic comparison algorithm**: for each profile, fetch entire open DB order set + recent closed set (no limit) and full exchange open set; match using `order_id` → `client_order_id` → `trade_id`.
- **Locking model**: row-based `reconciliation_locks` per profile with TTL; RPCs `fn_try_acquire_reconciliation_lock_row`, `fn_release_reconciliation_lock_row`.
- **Lifecycle-safe handler routing**: discrepancies processed via handlers (`reconcileEntryFill`, `reconcileExitFill`, `reconcileCancel`, `logOrder`) instead of raw `updateOrderStatus`.
- **Ledger adjustment routing**: reconciliation uses capital ledger APIs for fills/cancels to maintain invariants.
- **Health metrics**: reconciliation loop exposes `reconciliationLoopHealthy`, `reconciliationLoopLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`.
- **Failure table**: reconciliation handles DB-only orders (mark cancel via lifecycle), exchange-only orders (insert lifecycle), status mismatches (trigger lifecycle transitions), partial fills, exchange cancels.
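
The comparison step can be sketched as a three-pass match in the stated key priority. This is an illustrative approximation: the `OrderKeys` shape, the assumption that both sides expose the same optional keys, and `diffOrderSets` are not taken from the codebase.

```typescript
// Assumed common shape: any of the three keys may be absent on either side.
interface OrderKeys { orderId?: string; clientOrderId?: string; tradeId?: string; }
interface ReconcileDiff {
  matched: Array<{ db: OrderKeys; exchange: OrderKeys }>;
  missingFromExchange: OrderKeys[]; // present in DB only
  missingInDb: OrderKeys[];         // present on exchange only
}

const MATCH_PRIORITY: Array<keyof OrderKeys> = ["orderId", "clientOrderId", "tradeId"];

// One pass per key, strongest key first, so a weak match never steals a strong one.
function diffOrderSets(dbOrders: OrderKeys[], exchangeOrders: OrderKeys[]): ReconcileDiff {
  const dbTaken: boolean[] = dbOrders.map(() => false);
  const exTaken: boolean[] = exchangeOrders.map(() => false);
  const matched: Array<{ db: OrderKeys; exchange: OrderKeys }> = [];
  for (const key of MATCH_PRIORITY) {
    dbOrders.forEach((db, d) => {
      const wanted = db[key];
      if (dbTaken[d] || wanted === undefined) return;
      for (let e = 0; e < exchangeOrders.length; e++) {
        if (!exTaken[e] && exchangeOrders[e][key] === wanted) {
          dbTaken[d] = true;
          exTaken[e] = true;
          matched.push({ db, exchange: exchangeOrders[e] });
          break;
        }
      }
    });
  }
  return {
    matched,
    missingFromExchange: dbOrders.filter((_, d) => !dbTaken[d]),
    missingInDb: exchangeOrders.filter((_, e) => !exTaken[e]),
  };
}

const dbRows: OrderKeys[] = [
  { orderId: "o1", tradeId: "t1" },
  { clientOrderId: "c2", tradeId: "t2" }, // order id never persisted
  { orderId: "o3", tradeId: "t3" },       // no longer open on the exchange
];
const exchangeRows: OrderKeys[] = [
  { orderId: "o1", clientOrderId: "c1" },
  { orderId: "o9", clientOrderId: "c2" },
  { orderId: "o7", clientOrderId: "c7" }, // unknown to the DB
];
const reconcileDiff = diffOrderSets(dbRows, exchangeRows);
```

The two leftover lists correspond to the missing-from-exchange and missing-in-DB cases, which would then be routed through the lifecycle handlers rather than patched directly.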

*Phase 6 — Distributed Safety*
- **Row-based lock model**: `entry_locks(profile_id, lower(symbol))` with TTL; RPCs `fn_try_acquire_entry_lock_row`, `fn_release_entry_lock_row`.
- **Lock TTL logic**: default 30s TTL; lock expires automatically if owner crashes; optimistic updates ensure quick lock turnover.
- **Owner token design**: owner string `processPid-uuid`, stored per attempt; only matching owner can release lock.
- **Deterministic `clientOrderId`**: `bytelyst-${profileId}-${tradeId}` ensures the same trade never submits a second order; the exchange rejects the duplicate, and that rejection is interpreted as the existing order.
- **Multi-instance behavior**: each worker attempts lock acquisition; only one obtains lock, performs ENTRY; others skip and wait for lock release.
- **Horizontal scaling model**: distributed lock + shared DB/exchange key allows safe scaling; no per-instance state relied upon.
- **Deadlock prevention**: TTL and owner-based release ensure locks eventually expire; finally blocks always release lock.
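- **Failure table**: lock acquisition failure leads to an immediate entry skip; under a network partition the lock is released by TTL expiry; a crash during the exchange call stays safe because the lock expires via TTL and the deterministic `clientOrderId` prevents a second order for the same trade.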
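
The row-lock semantics (TTL expiry, owner-only release, case-insensitive symbol) can be sketched in memory. `EntryLockTable` is a hypothetical stand-in for the `entry_locks` table and its RPCs; the clock is passed in explicitly so the example is deterministic, and the owner tokens are made-up values in the `processPid-uuid` shape.

```typescript
interface LockRow { owner: string; expiresAtMs: number; }

// Hypothetical in-memory model of entry_locks(profile_id, lower(symbol)).
class EntryLockTable {
  private readonly rows = new Map<string, LockRow>();

  private static key(profileId: string, symbol: string): string {
    return JSON.stringify([profileId, symbol.toLowerCase()]);
  }

  // Succeeds only if no unexpired lock exists; default TTL is 30 s as in the text.
  tryAcquire(profileId: string, symbol: string, owner: string, nowMs: number, ttlMs = 30000): boolean {
    const key = EntryLockTable.key(profileId, symbol);
    const current = this.rows.get(key);
    if (current !== undefined && current.expiresAtMs > nowMs) {
      return false;
    }
    this.rows.set(key, { owner, expiresAtMs: nowMs + ttlMs });
    return true;
  }

  // Only the owner recorded on the row may release it.
  release(profileId: string, symbol: string, owner: string): boolean {
    const key = EntryLockTable.key(profileId, symbol);
    const current = this.rows.get(key);
    if (current === undefined || current.owner !== owner) {
      return false;
    }
    this.rows.delete(key);
    return true;
  }
}

const locks = new EntryLockTable();
const workerA = "4711-3f2a"; // made-up owner tokens
const workerB = "4712-9c1d";
const aAcquired = locks.tryAcquire("p1", "AAPL", workerA, 0);
const bBlocked = locks.tryAcquire("p1", "aapl", workerB, 1000);    // same key, lock still live
const bAfterTtl = locks.tryAcquire("p1", "aapl", workerB, 30000);  // A's lock has expired
const aStaleRelease = locks.release("p1", "AAPL", workerA);        // A no longer owns the row
const bRelease = locks.release("p1", "AAPL", workerB);
```

Note that a worker whose lock expired mid-flight is not stopped by the lock itself; in this design the deterministic `clientOrderId` and the lifecycle uniqueness guard are what cover that window.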

---

**SECTION 3 — CRITICAL INVARIANTS**

1. **No duplicate exchange order**
   - Why holds: deterministic `clientOrderId` + row-based lock prevents re-entry; lifecycle RPC guarded by unique constraints.
2. **No lifecycle without confirmed exchange order**
   - Why holds: exchange-first submission ensures RPC only called with confirmed `order_id`; RPC never replays exchange call.
3. **Capital cannot go negative**
   - Why holds: ledger enforces check before reservation; available capital invariant prevents overspend; watchdog logs and rejects actions if invariant violated.
4. **Only one active ENTRY per (profile_id, symbol)**
   - Why holds: the `(profile_id, symbol)` row lock is acquired before entry and grants exclusivity; the TTL only bounds how long a crashed owner can hold it.
5. **Reconciliation converges to exchange truth**
   - Why holds: reconciliation fetches full open order sets, uses lifecycle handlers, ledger updates, and repeats deterministically (idempotent).
6. **Restart does not corrupt ledger**
   - Why holds: restart rebuild recomputes ledger from exchange open positions/orders; no reliance on cached values.
7. **Distributed workers cannot double submit**
   - Why holds: distributed locks + deterministic clientOrderId + lifecycle uniqueness ensure only one worker can create a lifecycle even under concurrency.

---

**SECTION 4 — EXECUTION FLOW DIAGRAMS**

1. **ENTRY execution**
   - signal → profile-level lock acquire → capital check/reserve → deterministic `clientOrderId` → exchange order → lifecycle RPC (insert lifecycle/orders/position/history) → ledger update → lock release.
2. **EXIT execution**
   - cancel/exit signal → lifecycle handler identifies trade_id → create exit order via exchange → lifecycle RPC atomic update (order row, lifecycle, trade_history, position) → ledger releases position reservation, adds realized_pnl.
3. **Partial fill handling**
   - exchange fill update via reconciliation/monitor → partial `quantityFilled` → lifecycle handler adjusts `reserved_for_orders` → move filled notional to `reserved_for_positions` → ensure remaining reservation equals unfilled amount.
4. **Restart rebuild**
   - service start → load profiles → fetch exchange open positions/orders → rebuild ledger reservations + lifecycle map → reconcile pending orders → resume loops.
5. **Reconciliation cycle**
   - for each profile: acquire reconciliation lock → fetch full DB open + recent orders + exchange open set → deterministic matching (order_id/client_order_id/trade_id) → route through lifecycle handlers → release lock → emit metrics.
6. **Distributed lock acquisition**
   - compute deterministic lock key (profile_id + symbol) → call `fn_try_acquire_entry_lock_row` → on success proceed → finally release via `fn_release_entry_lock_row`; TTL auto-expiry handles crashes.
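
The ENTRY flow listed first above can be sketched as a single function with injected dependencies. It follows that item's ordering (lock, reserve, submit, persist, release) and is synchronous for brevity, whereas real connectors would be asynchronous. `EntryDeps`, `executeEntry`, and `makeTracingDeps` are hypothetical names, and a thrown `submitOrder` error is treated here as a definite rejection, which a timeout would not be.

```typescript
interface EntryDeps {
  tryAcquireLock(): boolean;
  releaseLock(): void;
  reserveCapital(): boolean;
  releaseCapital(): void;
  submitOrder(clientOrderId: string): string; // returns the exchange order id, throws on rejection
  persistLifecycle(orderId: string): void;
}
type EntryOutcome = "skipped_lock" | "skipped_capital" | "exchange_failed" | "persisted";

function executeEntry(profileId: string, tradeId: string, deps: EntryDeps): EntryOutcome {
  if (!deps.tryAcquireLock()) return "skipped_lock";
  try {
    if (!deps.reserveCapital()) return "skipped_capital";
    let orderId: string | null = null;
    try {
      orderId = deps.submitOrder(`bytelyst-${profileId}-${tradeId}`);
    } catch {
      orderId = null;
    }
    if (orderId === null) {
      deps.releaseCapital(); // reservation is released when the exchange call fails
      return "exchange_failed";
    }
    deps.persistLifecycle(orderId); // only ever called with a confirmed order id
    return "persisted";
  } finally {
    deps.releaseLock(); // runs on every path after the lock was acquired
  }
}

// Tracing stubs that record the call order for illustration.
function makeTracingDeps(trace: string[], exchangeFails: boolean): EntryDeps {
  return {
    tryAcquireLock: () => { trace.push("lock"); return true; },
    releaseLock: () => { trace.push("unlock"); },
    reserveCapital: () => { trace.push("reserve"); return true; },
    releaseCapital: () => { trace.push("release-capital"); },
    submitOrder: (clientOrderId) => {
      trace.push(`submit:${clientOrderId}`);
      if (exchangeFails) throw new Error("exchange rejected the order");
      return "ord-1";
    },
    persistLifecycle: (orderId) => { trace.push(`persist:${orderId}`); },
  };
}

const okTrace: string[] = [];
const okOutcome = executeEntry("p1", "t1", makeTracingDeps(okTrace, false));
const failTrace: string[] = [];
const failOutcome = executeEntry("p1", "t1", makeTracingDeps(failTrace, true));
```

If `persistLifecycle` throws, the error propagates after the lock is released, and a later retry can rely on the idempotent lifecycle RPC rather than a new exchange call.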

---

**SECTION 5 — FAILURE SCENARIO TABLE**

| Scenario | What happens | Why safe | Recovery behavior |
|---|---|---|---|
| Two workers race | Only one acquires row lock; other aborts | Lock ensures mutual exclusion | Winner proceeds; loser tries next signal |
| Network partition | Lock TTL expires, prevents hang | TTL avoids perpetual ownership | Worker restarts, reacquires lock after TTL |
| DB failure | Transaction aborts, no lifecycle persisted | Persistent state only changes when txn commits | Retry after DB available; idempotent RPC avoids duplicates |
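| Exchange timeout | Capital reservation rolled back via mutex/finally | Order outcome is unknown, but the deterministic `clientOrderId` stops a retry from creating a second order | Signal retries after timeout; reconciliation picks up the order if the exchange accepted it |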
| Crash before lifecycle RPC | Lock TTL ensures future worker can resume | No lifecycle inserted, no capital moved | Restart replay resumes from exchange state |
| Crash after exchange but before persistence | Lifecycle RPC retried with existing `clientOrderId` | Unique constraints prevent duplicates | RPC idempotent insert replays once |
| Partial fill after restart | Reconciliation partial fill handler adjusts ledger | Handler moves filled notional into positions | Consistent ledger, no double counting |
| Supabase outage | RPCs fail, operations roll back | Transactions atomic; no partial writes | Retry after Supabase recovers; observers alerted |
| Lock stuck | TTL expiry clears stale lock | Hard TTL prevents deadlock | Waiting worker acquires after TTL |

---

**SECTION 6 — HEALTH & OBSERVABILITY**

- `/internal/health` fields:
  `tradingLoopHealthy`, `tradingLoopLastRun`, `tradingLoopDuration`,
  `monitorLoopHealthy`, `monitorLoopLastRun`, `monitorLoopDuration`,
  `reconciliationLoopHealthy`, `reconciliationLoopLastRun`, `reconciliationMismatchCount`, `reconciliationMissingFromExchange`, `reconciliationMissingInDb`, `reconciliationLockContentionCount`,
  `lockContentionCount`, `capitalInvariantViolations`, `observabilityTimestamp`.
- Loop metrics: duration histograms + last run timestamps; SLO: the healthy flag is true if the last run is less than 2× the expected interval old.
- Lock contention metrics: increment per failed lock acquisition; field surfaced both via `/internal/health` and Prometheus.
- Reconciliation metrics: mismatch counts, missing-from-exchange, missing-in-db, lock contention.
- Readiness signals: SLO flags combined to determine readiness; observability records degrade gracefully (logs emitted on invariant violations).
- Degraded mode behavior: if capital invariant fails, watchdog increments violation counter, logs critical error, halts further ENTRY until resolved.
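
The loop SLO rule can be written down directly. In this sketch the 2× threshold comes from the text above, while `isLoopHealthy`, `isReady`, the example intervals, and the choice to combine flags with a logical AND are assumptions.

```typescript
interface LoopStatus { lastRunMs: number | null; expectedIntervalMs: number; }

// Healthy if the last run is less than 2x the expected interval old; never-run loops are unhealthy.
function isLoopHealthy(lastRunMs: number | null, nowMs: number, expectedIntervalMs: number): boolean {
  if (lastRunMs === null) return false;
  return nowMs - lastRunMs < 2 * expectedIntervalMs;
}

// Assumed combination rule: ready only when every loop is healthy.
function isReady(loops: LoopStatus[], nowMs: number): boolean {
  return loops.every((loop) => isLoopHealthy(loop.lastRunMs, nowMs, loop.expectedIntervalMs));
}

const nowMs = 100000;
const tradingLoop: LoopStatus = { lastRunMs: 95000, expectedIntervalMs: 5000 };       // 5 s old, limit 10 s
const staleMonitorLoop: LoopStatus = { lastRunMs: 70000, expectedIntervalMs: 10000 }; // 30 s old, limit 20 s

const readyWithFreshLoops = isReady([tradingLoop], nowMs);
const readyWithStaleLoop = isReady([tradingLoop, staleMonitorLoop], nowMs);
```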

---

**SECTION 7 — HORIZONTAL SCALING MODEL**

- Multi-worker deployment: each worker runs trading/monitor/reconciliation loops; shared DB/exchange key; distributed locks coordinate actions.
- Shared DB: Supabase/Postgres is the single source of truth for persisted state shared across workers (the exchange remains authoritative for order state); all loops interact with the same tables.
- Shared exchange key: deterministic `clientOrderId` + lock prevent double submissions despite shared credentials.
- Lock guarantees: row-based entry locks and reconciliation locks with TTL/owner ensure cross-instance exclusivity.
- Why no duplication is possible: distributed locks + deterministic order IDs + lifecycle RPC uniqueness ensure only one worker can create a lifecycle even under concurrency.

---

**SECTION 8 — SAFE ENHANCEMENT RULES**

- **Lifecycle**:
  - DO NOT bypass `fn_persist_entry_lifecycle`.
  - DO NOT write raw status updates; use lifecycle-safe handlers.
- **Ledger**:
  - DO NOT mutate ledger without RPCs that respect invariant.
  - Always recompute `available_capital` via `allocated_capital - reserved_for_orders - reserved_for_positions + realized_pnl`.
- **Reconciliation**:
  - Always acquire `reconciliation_locks`.
  - Route changes through lifecycle handlers.
- **Locking**:
  - Always use row locks with TTL and owner tokens; release in finally block.
  - Do not assume single-process state.
- **Exchange submission**:
  - Always reserve capital before exchange call.
  - Use deterministic `clientOrderId`.
  - Never re-submit the same `trade_id` under a new `clientOrderId`; rely on the exchange rejecting the duplicate and treat that rejection as the existing order.

**DO NOT BREAK** these rules; any change violating them risks duplicate executions, capital drift, or stale lifecycle data.