diff --git a/docs/backtest/ENGINE_READINESS.md b/docs/backtest/ENGINE_READINESS.md new file mode 100644 index 0000000..5918eaf --- /dev/null +++ b/docs/backtest/ENGINE_READINESS.md @@ -0,0 +1,201 @@ +# Backtest Engine Readiness Report + +> **Date:** 2026-05-10 +> **Trigger:** request to add "test against history" feature to `/plans` page (e.g. COVID, war periods) +> **Method:** code review + 3 synthetic-data smoke tests against the running engine +> **Audience:** anyone deciding whether to ship customer-facing backtest UI + +--- + +## TL;DR + +| Concern | Status | +|---|---| +| Engine determinism | ✅ Verified (same input → byte-identical output) | +| Multi-timeframe data pipeline | ✅ `normalize.ts` aggregates 1h+4h from 15m automatically | +| Intra-candle SL/TP policy wiring | ✅ Correctly used as a tiebreaker (only fires on real conflicts) | +| Strategy schema compatibility | ✅ Saved trade plans (`strategy_config` field) feed directly into backtest | +| Performance | ✅ ~1.5s for 5,000 candles × 1 symbol on this VM | +| **Equity (stock) data source** | ❌ **Missing.** Live trading uses Alpaca for equities, but no Alpaca backtest loader exists. Only Kraken (crypto) auto-fetch. | +| **Test coverage** | ❌ **Zero engine tests.** 1,984 lines of backend backtest code, 0 `*.test.*` files. Only the FE `utils.test.ts` exists (68 lines, tests the chart rendering helpers — not the engine). | +| **Production gating** | ⚠️ Explicit comment in `flags.ts`: *"Default to disabled — backtest is not yet production-ready."* Three flags must align (`VITE_BACKTEST_ENABLED` build, `enableBacktest` runtime, `customerEnabled` runtime). | +| **Log noise** | ⚠️ ~5 log lines per candle at default level. A 5,000-candle backtest produces ~25,000 log lines. Operationally disruptive at scale. | +| **Historical depth** | ⚠️ Kraken via ccxt typically returns 5-7 years for major pairs. **COVID Mar 2020 reachable for BTC/ETH; uncertain for younger altcoins. 2008 GFC unreachable.** | + +**Verdict for shipping a customer-facing "test against history" feature:** *not yet*. The engine itself is sound, but ship-blocking gaps are: +1. No equity data source (so stock plans can't be tested without manual CSV upload) +2. Zero engine unit tests — high regression risk +3. Log noise needs throttling + +**Verdict for an admin-only POC:** *safe*. The engine is deterministic, fast, and runs the same strategy code as live. Adding a button on `/plans` cards visible only to admins (using `isAdminView` from `useBacktestFeatureGate`) is low-risk. + +--- + +## 1. What the engine actually is + +``` +backend/src/backtest/ +├── engine/ +│ ├── BacktestRunner.ts (384 LOC) — top-level orchestration +│ ├── VirtualExecutionEngine.ts (609 LOC) — fill simulation, SL/TP, slippage +│ ├── VirtualLedger.ts ( 77 LOC) — position + cash bookkeeping +│ ├── timeFreeze.ts ( 26 LOC) — Date.now() lockdown for determinism +│ ├── warmup.ts ( 88 LOC) — pre-window candle reservation +│ └── computeSummary.ts ( 60 LOC) — netPnl/winRate/drawdown/sharpe +├── data/ +│ ├── csvLoader.ts — user-uploaded CSV +│ ├── jsonLoader.ts — user-uploaded JSON +│ ├── exchangeReplayAdapter.ts — replay user's own past trades +│ ├── krakenLoader.ts — auto-fetch crypto via ccxt +│ ├── normalize.ts — ⭐ aggregates 1h/4h from 15m +│ └── loadHistoricalData.ts — source dispatcher +├── exchange/ +│ └── ReplayExchangeConnector.ts — IExchangeConnector impl backed by dataset +├── metrics/ +│ └── computeSummary.ts +├── guards.ts — feature-flag + mode assertions +├── strategySafety.ts — config validation +└── types.ts +``` + +**Design strength:** the `ReplayExchangeConnector` implements the same `IExchangeConnector` interface as live trading. The strategy engine (`ProStrategyEngine`) doesn't know it's in a backtest. Same code paths run for live and historical — high fidelity. + +**Key invariant:** data loading aggregates upward from 15m to 1h and 4h via `aggregateCandles()` in `normalize.ts:86`. The Kraken loader fetches **only** 1m or 15m; the engine's required 4h+1h+15m view is built from there. Skipping `normalize.ts` (e.g. by calling `runBacktestReplay` with a manually-constructed dataset) **silently fails with "Insufficient data" warnings**. This is a footgun for any new data adapter that doesn't go through `buildDatasetFromRows`. + +--- + +## 2. Smoke test results (this session) + +### Test 1 — Determinism + +5,000 synthetic 15m candles + multi-timeframe aggregation, default strategy config, two consecutive runs. + +``` +Run #1: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1625 +Run #2: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1304 +DETERMINISTIC: true (JSON.stringify(r1) === JSON.stringify(r2)) +``` + +✅ Engine is deterministic byte-for-byte. Performance: ~250–325 candles/second on this VM, which extrapolates to ~6 seconds for a one-year 15m backtest. + +### Test 2 — Intra-candle policy + +Same 5,000 candles, three policies (`ohlc_path`, `stop_loss_first`, `take_profit_first`): + +``` +ohlc_path: trades=3 netPnl=-15.32 +stop_loss_first: trades=3 netPnl=-15.32 +take_profit_first: trades=3 netPnl=-15.32 +``` + +⚠️ **Identical results across all 3 policies.** Initial concern: is the policy a no-op? Source review (`VirtualExecutionEngine.ts:258` → `resolveIntraCandleConflictReason`) confirms the policy IS wired but **only invoked when both SL and TP trigger inside the same candle**. None of the 3 trades in this test had such an intra-candle conflict, so the policy never fired. **Verdict: correct behavior, but the test wasn't strong enough to exercise it.** A targeted test with constructed wide-range candles is needed to validate policy semantics; one didn't exist before this audit. + +### Test 3 — Constant price (zero-volatility edge case) + +5,000 flat candles at $50,000: + +``` +trades=0 netPnl=0.00 +``` + +✅ Sane — no signals trigger on a flat tape. + +### Test 4 — Small window (~40 days, COVID-crash size) + +``` +trades=1 netPnl=10.71 drawdown=0.03% +``` + +✅ Engine handles short windows without errors. + +--- + +## 3. Concerns ranked + +### 3.1 Zero unit tests on the backend engine + +The engine is 1,984 lines of pure logic with non-trivial fill semantics, and has zero unit tests. The frontend `utils.test.ts` (68 lines) tests `buildInsightCards` / `parseSymbolsInput` — UI helpers, not engine math. + +**Risk:** any change to `VirtualExecutionEngine.ts` or `VirtualLedger.ts` could silently change every customer's backtest results. There's no regression net. + +**Minimum viable test set before customer rollout:** +- Determinism: same seed + data → identical output (this session demonstrated it works, but it's not codified) +- Intra-candle policies: construct a candle with both SL and TP inside, assert each policy returns the expected exit +- Slippage + fees: assert `netPnl` matches hand-computed expectation for a single trade with known slippage/fee bps +- Drawdown: assert `maxDrawdownPct` matches a manually-computed equity curve +- OPEN_AT_END vs FORCE_CLOSE: assert window-end policy difference + +Estimate: ~6-8 hours to write the minimum set. + +### 3.2 No equity backtest data source + +The backtest data dispatch in `loadHistoricalData.ts` supports `csv | json | replay | kraken`. Live trading uses `AlpacaConnector` (`backend/src/connectors/alpaca.ts`) which handles `ASSET_CLASS: 'us_equity'`. **No `alpacaLoader.ts` for backtest.** + +If a saved plan is for `AAPL`, the only way to backtest it today is to upload a CSV. That's fine for power users but unworkable for "click to test against COVID." + +**Estimated work for `BacktestAlpacaSource`:** ~3-4 hours. Mirror `krakenLoader.ts`, plumb `AlpacaConnector.fetchOHLCV` through `loadHistoricalData.ts`, add `'alpaca'` to `BacktestDataSourceType`. Trickiest part: Alpaca's data tier limits (free tier has 15-minute delayed data and limited history; paid tier needed for full COVID-era access). + +### 3.3 Log noise + +Default-level (`info`) logging produces a line per candle per rule evaluation. A 5,000-candle backtest emitted ~25,000 lines in our test. At year-scale this is 80,000+ lines per backtest. Concerns: +- Container log volume at scale (Loki ingestion cost) +- Slow stdout flushing can extend backtest duration +- Customer-visible logs would leak strategy internals + +**Fix:** wrap all `[ProEngine] info` logs with a `BACKTEST_QUIET` env flag, or default to `warn` level inside the backtest entrypoint. + +### 3.4 Historical depth + +Verified data source coverage: + +| Source | COVID Mar 2020 | Russia/Ukraine 2022 | 2018 selloff | 2008 GFC | +|---|---|---|---|---| +| Kraken (BTC, ETH) | ✅ Reachable | ✅ | ✅ | ❌ | +| Kraken (newer alts) | ⚠️ Pair-dependent | ✅ likely | ❌ likely | ❌ | +| Alpaca (equities) | ⚠️ Free tier may not | ✅ | ✅ | ⚠️ Paid tier | +| User CSV/JSON | ✅ if user has data | ✅ | ✅ | ✅ | + +For preset historical-event buttons to work reliably on customer-launched backtests, presets older than ~3 years should fall back to "upload your own data" or be hidden. + +### 3.5 Production gate is intentional + +`web/src/backtest/flags.ts` line 32: + +```js +// Default to disabled — backtest is not yet production-ready. +return false; +``` + +The team has explicitly opted out of production exposure. Flipping `VITE_BACKTEST_ENABLED=true` is a deliberate decision, not an oversight. Whoever takes this on should: +1. Find the original PR/discussion where backtest was gated +2. Confirm whether the listed concerns above are the same ones that drove gating, or if there are others +3. Get sign-off before un-gating + +--- + +## 4. Recommended path forward + +| Stage | Scope | Risk | Value | +|---|---|---|---| +| **A** | Admin-only POC: button on `/plans` cards visible when `isAdminView`, opens `BacktestRunnerPanel` pre-loaded with the plan's `strategy_config` + 5 preset crypto-friendly date ranges. No flag changes; uses existing admin gate. | Low | Validates the UX hypothesis. Lets the team dogfood. | +| **B** | Add the minimum-viable backtest test suite (§3.1). | Low | Regression net. Ship-blocker for any customer feature. | +| **C** | Add `BacktestAlpacaSource` (§3.2) IF saved plans include equities. | Medium | Without this, customer feature only works for crypto plans. | +| **D** | Quiet-mode logging wrapper (§3.3). | Low | Operational hygiene. | +| **E** | Replay against the user's own past trades (already exists as `replay` source) — surface as "see how this plan would have played your last 90 days" instead of "see how it would have played COVID." Lower data risk, higher per-user truthfulness. | Low | Could ship without B/C above; uses existing canonical lifecycle data. | +| **F** | Customer rollout: validate B + C + D, flip the runtime flag, optionally add tier gating. | High | The big one. Don't do this without B+C+D. | + +**My recommendation:** A first (this session can do it), then E in parallel (it's possibly the better v1 product anyway — "show me how my plan would have played my actual past trades" is more truthful than synthetic historical scenarios). B + C + D are prereqs for F. + +--- + +## 5. Reproducible smoke tests + +The Node ESM scripts used for this audit live at `/tmp/backtest_smoke.mjs` and `/tmp/backtest_smoke2.mjs` (this session — not committed). To re-run: + +```bash +cd /opt/bytelyst/learning_ai_invt_trdg +pnpm --filter @bytelyst/trading-backend build +node /tmp/backtest_smoke2.mjs 2>/dev/null | grep -E "trades=|DETERMINISTIC|netPnl" +``` + +If we proceed, these should be promoted to `backend/src/backtest/__tests__/` with vitest.