# Backtest Engine Readiness Report

> **Date:** 2026-05-10
> **Trigger:** request to add "test against history" feature to `/plans` page (e.g. COVID, war periods)
> **Method:** code review + 4 synthetic-data smoke tests against the running engine
> **Audience:** anyone deciding whether to ship customer-facing backtest UI

---

## TL;DR

| Concern | Status |
|---|---|
| Engine determinism | ✅ Verified (same input → byte-identical output) |
| Multi-timeframe data pipeline | ✅ `normalize.ts` aggregates 1h+4h from 15m automatically |
| Intra-candle SL/TP policy wiring | ✅ Correctly used as a tiebreaker (only fires on real conflicts) |
| Strategy schema compatibility | ✅ Saved trade plans (`strategy_config` field) feed directly into backtest |
| Performance | ✅ ~1.5s for 5,000 candles × 1 symbol on this VM |
| **Equity (stock) data source** | ❌ **Missing.** Live trading uses Alpaca for equities, but no Alpaca backtest loader exists. Only Kraken (crypto) has auto-fetch. |
| **Test coverage** | ❌ **Zero engine tests.** 1,984 lines of backend backtest code, 0 `*.test.*` files. Only the FE `utils.test.ts` exists (68 lines; it tests UI helpers, not the engine). |
| **Production gating** | ⚠️ Explicit comment in `flags.ts`: *"Default to disabled — backtest is not yet production-ready."* Three flags must align (`VITE_BACKTEST_ENABLED` build, `enableBacktest` runtime, `customerEnabled` runtime). |
| **Log noise** | ⚠️ ~5 log lines per candle at default level. A 5,000-candle backtest produces ~25,000 log lines. Operationally disruptive at scale. |
| **Historical depth** | ⚠️ Kraken via ccxt typically returns 5-7 years for major pairs. **COVID Mar 2020 reachable for BTC/ETH; uncertain for younger altcoins. 2008 GFC unreachable.** |

**Verdict for shipping a customer-facing "test against history" feature:** *not yet*. The engine itself is sound, but the ship-blocking gaps are:

1. No equity data source (so stock plans can't be tested without manual CSV upload)
2. Zero engine unit tests (high regression risk)
3. Log noise needs throttling

**Verdict for an admin-only POC:** *safe*. The engine is deterministic, fast, and runs the same strategy code as live. Adding a button on `/plans` cards visible only to admins (using `isAdminView` from `useBacktestFeatureGate`) is low-risk.

---

## 1. What the engine actually is

```
backend/src/backtest/
├── engine/
│   ├── BacktestRunner.ts          (384 LOC)  # top-level orchestration
│   ├── VirtualExecutionEngine.ts  (609 LOC)  # fill simulation, SL/TP, slippage
│   ├── VirtualLedger.ts           ( 77 LOC)  # position + cash bookkeeping
│   ├── timeFreeze.ts              ( 26 LOC)  # Date.now() lockdown for determinism
│   ├── warmup.ts                  ( 88 LOC)  # pre-window candle reservation
│   └── computeSummary.ts          ( 60 LOC)  # netPnl/winRate/drawdown/sharpe
├── data/
│   ├── csvLoader.ts               # user-uploaded CSV
│   ├── jsonLoader.ts              # user-uploaded JSON
│   ├── exchangeReplayAdapter.ts   # replay user's own past trades
│   ├── krakenLoader.ts            # auto-fetch crypto via ccxt
│   ├── normalize.ts               # ⭐ aggregates 1h/4h from 15m
│   └── loadHistoricalData.ts      # source dispatcher
├── exchange/
│   └── ReplayExchangeConnector.ts # IExchangeConnector impl backed by dataset
├── metrics/
│   └── computeSummary.ts
├── guards.ts                      # feature-flag + mode assertions
├── strategySafety.ts              # config validation
└── types.ts
```

**Design strength:** the `ReplayExchangeConnector` implements the same `IExchangeConnector` interface as live trading. The strategy engine (`ProStrategyEngine`) doesn't know it's in a backtest. The same code paths run for live and historical data, which gives high fidelity.

**Key invariant:** data loading aggregates upward from 15m to 1h and 4h via `aggregateCandles()` in `normalize.ts:86`. The Kraken loader fetches **only** 1m or 15m; the engine's required 4h+1h+15m view is built from there. Skipping `normalize.ts` (e.g. by calling `runBacktestReplay` with a manually-constructed dataset) **silently fails with "Insufficient data" warnings**.
This is a footgun for any new data adapter that doesn't go through `buildDatasetFromRows`.

---

## 2. Smoke test results (this session)

### Test 1: Determinism

5,000 synthetic 15m candles + multi-timeframe aggregation, default strategy config, two consecutive runs.

```
Run #1: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1625
Run #2: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1304
DETERMINISTIC: true  (JSON.stringify(r1) === JSON.stringify(r2))
```

✅ Engine is deterministic byte-for-byte. Performance: 5,000 candles in 1.3–1.6 s is ~3,100–3,800 candles/second (about 0.26–0.33 ms per candle) on this VM, which extrapolates to roughly 9–11 seconds for a one-year 15m backtest (35,040 candles).

### Test 2: Intra-candle policy

Same 5,000 candles, three policies (`ohlc_path`, `stop_loss_first`, `take_profit_first`):

```
ohlc_path:         trades=3 netPnl=-15.32
stop_loss_first:   trades=3 netPnl=-15.32
take_profit_first: trades=3 netPnl=-15.32
```

⚠️ **Identical results across all 3 policies.** Initial concern: is the policy a no-op? Source review (`VirtualExecutionEngine.ts:258` → `resolveIntraCandleConflictReason`) confirms the policy IS wired but **only invoked when both SL and TP trigger inside the same candle**. None of the 3 trades in this test had such an intra-candle conflict, so the policy never fired.

**Verdict: correct behavior, but the test wasn't strong enough to exercise it.** A targeted test with constructed wide-range candles is needed to validate policy semantics; one didn't exist before this audit.

### Test 3: Constant price (zero-volatility edge case)

5,000 flat candles at $50,000:

```
trades=0 netPnl=0.00
```

✅ Sane: no signals trigger on a flat tape.

### Test 4: Small window (~40 days, COVID-crash size)

```
trades=1 netPnl=10.71 drawdown=0.03%
```

✅ Engine handles short windows without errors.

---

## 3. Concerns ranked

### 3.1 Zero unit tests on the backend engine

The engine is 1,984 lines of pure logic with non-trivial fill semantics, and has zero unit tests.
The frontend `utils.test.ts` (68 lines) tests `buildInsightCards` / `parseSymbolsInput`, which are UI helpers, not engine math.

**Risk:** any change to `VirtualExecutionEngine.ts` or `VirtualLedger.ts` could silently change every customer's backtest results. There's no regression net.

**Minimum viable test set before customer rollout:**

- Determinism: same seed + data → identical output (this session demonstrated it works, but it's not codified)
- Intra-candle policies: construct a candle with both SL and TP inside, assert each policy returns the expected exit
- Slippage + fees: assert `netPnl` matches hand-computed expectation for a single trade with known slippage/fee bps
- Drawdown: assert `maxDrawdownPct` matches a manually-computed equity curve
- OPEN_AT_END vs FORCE_CLOSE: assert window-end policy difference

Estimate: ~6-8 hours to write the minimum set.

### 3.2 No equity backtest data source

The backtest data dispatch in `loadHistoricalData.ts` supports `csv | json | replay | kraken`. Live trading uses `AlpacaConnector` (`backend/src/connectors/alpaca.ts`), which handles `ASSET_CLASS: 'us_equity'`. **There is no `alpacaLoader.ts` for backtest.**

If a saved plan is for `AAPL`, the only way to backtest it today is to upload a CSV. That's fine for power users but unworkable for "click to test against COVID."

**Estimated work for `BacktestAlpacaSource`:** ~3-4 hours. Mirror `krakenLoader.ts`, plumb `AlpacaConnector.fetchOHLCV` through `loadHistoricalData.ts`, add `'alpaca'` to `BacktestDataSourceType`. Trickiest part: Alpaca's data tier limits (free tier has 15-minute delayed data and limited history; paid tier needed for full COVID-era access).

### 3.3 Log noise

Default-level (`info`) logging produces a line per candle per rule evaluation. A 5,000-candle backtest emitted ~25,000 lines in our test. At year-scale this is 80,000+ lines per backtest.
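
As a rough check on that scaling, the arithmetic can be sketched as below. The function and constant names are illustrative only (they are not from the codebase), and the ~5 lines per candle figure is the rate observed in this audit, not a guaranteed constant. Under that assumption, a full year of 15m candles lands well above the 80,000-line floor quoted above.

```typescript
// Illustrative helper, not part of the codebase: estimates log volume for a backtest.
// Assumes a roughly constant number of log lines per candle (about 5 in this audit).
const CANDLES_PER_DAY_15M = (24 * 60) / 15; // 96 fifteen-minute candles per day

function estimateLogLines(candles: number, linesPerCandle: number): number {
  return Math.round(candles * linesPerCandle);
}

// The smoke-test run: 5,000 candles at ~5 lines each.
const smokeTestLines = estimateLogLines(5000, 5);

// One calendar year of 15m candles (365 days, no gaps) at the same rate.
const oneYearLines = estimateLogLines(365 * CANDLES_PER_DAY_15M, 5);

console.log({ smokeTestLines, oneYearLines });
// prints { smokeTestLines: 25000, oneYearLines: 175200 }
```
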
Concerns:

- Container log volume at scale (Loki ingestion cost)
- Slow stdout flushing can extend backtest duration
- Customer-visible logs would leak strategy internals

**Fix:** gate all `[ProEngine] info` logs behind a `BACKTEST_QUIET` env flag, or default to `warn` level inside the backtest entrypoint.

### 3.4 Historical depth

Verified data source coverage:

| Source | COVID Mar 2020 | Russia/Ukraine 2022 | 2018 selloff | 2008 GFC |
|---|---|---|---|---|
| Kraken (BTC, ETH) | ✅ Reachable | ✅ | ✅ | ❌ |
| Kraken (newer alts) | ⚠️ Pair-dependent | ✅ likely | ❌ likely | ❌ |
| Alpaca (equities) | ⚠️ Free tier may not reach | ✅ | ✅ | ⚠️ Paid tier |
| User CSV/JSON | ✅ if user has data | ✅ | ✅ | ✅ |

For preset historical-event buttons to work reliably on customer-launched backtests, presets older than ~3 years should fall back to "upload your own data" or be hidden.

### 3.5 Production gate is intentional

`web/src/backtest/flags.ts` line 32:

```js
// Default to disabled — backtest is not yet production-ready.
return false;
```

The team has explicitly opted out of production exposure. Flipping `VITE_BACKTEST_ENABLED=true` is a deliberate decision, not an oversight. Whoever takes this on should:

1. Find the original PR/discussion where backtest was gated
2. Confirm whether the concerns listed above are the same ones that drove gating, or if there are others
3. Get sign-off before un-gating

---

## 4. Recommended path forward

| Stage | Scope | Risk | Value |
|---|---|---|---|
| **A** | Admin-only POC: button on `/plans` cards visible when `isAdminView`, opens `BacktestRunnerPanel` pre-loaded with the plan's `strategy_config` + 5 preset crypto-friendly date ranges. No flag changes; uses existing admin gate. | Low | Validates the UX hypothesis. Lets the team dogfood. |
| **B** | Add the minimum-viable backtest test suite (§3.1). | Low | Regression net. Ship-blocker for any customer feature. |
| **C** | Add `BacktestAlpacaSource` (§3.2) IF saved plans include equities. | Medium | Without this, customer feature only works for crypto plans. |
| **D** | Quiet-mode logging wrapper (§3.3). | Low | Operational hygiene. |
| **E** | Replay against the user's own past trades (already exists as `replay` source), surfaced as "see how this plan would have played your last 90 days" instead of "see how it would have played COVID." Lower data risk, higher per-user truthfulness. | Low | Could ship without B/C above; uses existing canonical lifecycle data. |
| **F** | Customer rollout: validate B + C + D, flip the runtime flag, optionally add tier gating. | High | The big one. Don't do this without B+C+D. |

**My recommendation:** A first (this session can do it), with E in parallel. E is possibly the better v1 product anyway: "show me how my plan would have played my actual past trades" is more truthful than synthetic historical scenarios. B + C + D are prereqs for F.

---

## 5. Reproducible smoke tests

The Node ESM scripts used for this audit live at `/tmp/backtest_smoke.mjs` and `/tmp/backtest_smoke2.mjs` (this session only, not committed). To re-run:

```bash
cd /opt/bytelyst/learning_ai_invt_trdg
pnpm --filter @bytelyst/trading-backend build
node /tmp/backtest_smoke2.mjs 2>/dev/null | grep -E "trades=|DETERMINISTIC|netPnl"
```

If we proceed, these should be promoted to `backend/src/backtest/__tests__/` with vitest.
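
As a starting point for that promotion, here is a minimal sketch of the determinism check in plain TypeScript. It is written under assumptions: the `Candle` shape, `seededRandom`, `syntheticCandles`, and `toySummary` are all hypothetical and not taken from the codebase, and `toySummary` is only a stand-in for the real engine call. The one thing it mirrors from the audit is the `JSON.stringify` equality comparison used in Test 1.

```typescript
// Hypothetical candle shape for illustration; the real type lives in the backtest codebase.
interface Candle {
  ts: number; // candle open time, ms since epoch
  open: number;
  high: number;
  low: number;
  close: number;
  volume: number;
}

// Small seeded PRNG (mulberry32-style) so the tape does not depend on Math.random().
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Builds a reproducible random-walk tape of 15m candles starting at $50,000.
function syntheticCandles(count: number, seed: number, stepMs: number = 15 * 60 * 1000): Candle[] {
  const rand = seededRandom(seed);
  const out: Candle[] = [];
  let price = 50000;
  for (let i = 0; i < count; i++) {
    const open = price;
    const close = open * (1 + (rand() - 0.5) * 0.01); // at most a 0.5% move per candle
    const high = Math.max(open, close) * (1 + rand() * 0.002);
    const low = Math.min(open, close) * (1 - rand() * 0.002);
    out.push({ ts: i * stepMs, open, high, low, close, volume: 1 + rand() * 10 });
    price = close;
  }
  return out;
}

// Mirrors the smoke-test check: two results match if they serialize identically.
function sameSerialized(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Stand-in for the real engine call (NOT the project's API): a trivial summary of the tape.
function toySummary(candles: Candle[]): { candles: number; netMove: number } {
  const first = candles[0];
  const last = candles[candles.length - 1];
  const netMove = first && last ? last.close - first.open : 0;
  return { candles: candles.length, netMove };
}

const run1 = toySummary(syntheticCandles(5000, 42));
const run2 = toySummary(syntheticCandles(5000, 42));
console.log("DETERMINISTIC:", sameSerialized(run1, run2)); // prints "DETERMINISTIC: true"
```

In a vitest file, the last three lines would become the body of an `it(...)` case, with `toySummary` replaced by the real (likely awaited) engine call and the comparison wrapped in `expect(...).toBe(true)`. Seeding the data generator keeps the fixture itself reproducible, so a failure points at the engine and not at the test data.
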