
# Backtest Engine Readiness Report

> **Date:** 2026-05-10
> **Trigger:** request to add "test against history" feature to `/plans` page (e.g. COVID, war periods)
> **Method:** code review + 4 synthetic-data smoke tests against the running engine
> **Audience:** anyone deciding whether to ship customer-facing backtest UI

---

## TL;DR

| Concern | Status |
|---|---|
| Engine determinism | ✅ Verified (same input → byte-identical output) |
| Multi-timeframe data pipeline | ✅ `normalize.ts` aggregates 1h+4h from 15m automatically |
| Intra-candle SL/TP policy wiring | ✅ Correctly used as a tiebreaker (only fires on real conflicts) |
| Strategy schema compatibility | ✅ Saved trade plans (`strategy_config` field) feed directly into backtest |
| Performance | ✅ ~1.5s for 5,000 candles × 1 symbol on this VM |
| **Equity (stock) data source** | ❌ **Missing.** Live trading uses Alpaca for equities, but no Alpaca backtest loader exists. Only Kraken (crypto) auto-fetch. |
| **Test coverage** | ❌ **Zero engine tests.** 1,984 lines of backend backtest code, 0 `*.test.*` files. Only the FE `utils.test.ts` exists (68 lines; it tests UI helpers such as `buildInsightCards` and `parseSymbolsInput`, not the engine). |
| **Production gating** | ⚠️ Explicit comment in `flags.ts`: *"Default to disabled — backtest is not yet production-ready."* Three flags must align (`VITE_BACKTEST_ENABLED` build, `enableBacktest` runtime, `customerEnabled` runtime). |
| **Log noise** | ⚠️ ~5 log lines per candle at default level. A 5,000-candle backtest produces ~25,000 log lines. Operationally disruptive at scale. |
| **Historical depth** | ⚠️ Kraken via ccxt typically returns 5-7 years for major pairs. **COVID Mar 2020 reachable for BTC/ETH; uncertain for younger altcoins. 2008 GFC unreachable.** |

**Verdict for shipping a customer-facing "test against history" feature:** *not yet*. The engine itself is sound, but three ship-blocking gaps remain:
1. No equity data source (so stock plans can't be tested without manual CSV upload)
2. Zero engine unit tests — high regression risk
3. Log noise needs throttling

**Verdict for an admin-only POC:** *safe*. The engine is deterministic, fast, and runs the same strategy code as live. Adding a button on `/plans` cards visible only to admins (using `isAdminView` from `useBacktestFeatureGate`) is low-risk.

---

## 1. What the engine actually is

```
backend/src/backtest/
├── engine/
│   ├── BacktestRunner.ts          (384 LOC) — top-level orchestration
│   ├── VirtualExecutionEngine.ts  (609 LOC) — fill simulation, SL/TP, slippage
│   ├── VirtualLedger.ts           ( 77 LOC) — position + cash bookkeeping
│   ├── timeFreeze.ts              ( 26 LOC) — Date.now() lockdown for determinism
│   ├── warmup.ts                  ( 88 LOC) — pre-window candle reservation
│   └── computeSummary.ts          ( 60 LOC) — netPnl/winRate/drawdown/sharpe
├── data/
│   ├── csvLoader.ts               — user-uploaded CSV
│   ├── jsonLoader.ts              — user-uploaded JSON
│   ├── exchangeReplayAdapter.ts   — replay user's own past trades
│   ├── krakenLoader.ts            — auto-fetch crypto via ccxt
│   ├── normalize.ts               — ⭐ aggregates 1h/4h from 15m
│   └── loadHistoricalData.ts      — source dispatcher
├── exchange/
│   └── ReplayExchangeConnector.ts — IExchangeConnector impl backed by dataset
├── metrics/
│   └── computeSummary.ts
├── guards.ts                      — feature-flag + mode assertions
├── strategySafety.ts              — config validation
└── types.ts
```

**Design strength:** the `ReplayExchangeConnector` implements the same `IExchangeConnector` interface as live trading. The strategy engine (`ProStrategyEngine`) doesn't know it's in a backtest. Same code paths run for live and historical — high fidelity.
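
The pattern can be illustrated with a minimal sketch. The `PriceSource` interface, `ReplaySource` class, and `shouldBuy` function below are hypothetical stand-ins; the real `IExchangeConnector` and `ProStrategyEngine` are larger and are not reproduced here.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface Candle {
  ts: number; // candle close time, epoch ms
  close: number;
}

interface PriceSource {
  // Most recent close at or before `nowMs`, or undefined if none exists yet.
  latestClose(nowMs: number): number | undefined;
}

// Replay implementation backed by an in-memory dataset sorted by `ts`.
class ReplaySource implements PriceSource {
  private readonly candles: Candle[];

  constructor(candles: Candle[]) {
    this.candles = candles;
  }

  latestClose(nowMs: number): number | undefined {
    let last: number | undefined;
    for (const c of this.candles) {
      if (c.ts > nowMs) break; // never look past the simulated clock
      last = c.close;
    }
    return last;
  }
}

// Strategy code depends only on the interface, so the same function
// would run unchanged against a live implementation of PriceSource.
function shouldBuy(source: PriceSource, nowMs: number, threshold: number): boolean {
  const price = source.latestClose(nowMs);
  return price !== undefined && price < threshold;
}
```

Because the strategy function takes the interface rather than a concrete class, swapping the dataset-backed source for a live one requires no change to the strategy itself.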

**Key invariant:** data loading aggregates upward from 15m to 1h and 4h via `aggregateCandles()` in `normalize.ts:86`. The Kraken loader fetches **only** 1m or 15m; the engine's required 4h+1h+15m view is built from there. Skipping `normalize.ts` (e.g. by calling `runBacktestReplay` with a manually constructed dataset) **fails quietly**: no error is thrown, and the only symptom is "Insufficient data" warnings in the log. This is a footgun for any new data adapter that doesn't go through `buildDatasetFromRows`.
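
For readers unfamiliar with upward aggregation, here is a minimal sketch of the idea. It is not the code in `normalize.ts`; the `Candle` shape, the `aggregate` function, and the choice to drop partial buckets are all assumptions made for illustration.

```typescript
// Hypothetical candle shape; `ts` is the candle open time in epoch ms.
interface Candle {
  ts: number;
  open: number;
  high: number;
  low: number;
  close: number;
  volume: number;
}

const MINUTE_MS = 60_000;

// Groups `fromMs`-sized candles into epoch-aligned `toMs` buckets.
// Buckets that are missing any source candle are dropped (an assumption).
function aggregate(candles: Candle[], fromMs: number, toMs: number): Candle[] {
  const perBucket = toMs / fromMs;
  const buckets = new Map<number, Candle[]>();
  for (const c of candles) {
    const start = Math.floor(c.ts / toMs) * toMs;
    const group = buckets.get(start) ?? [];
    group.push(c);
    buckets.set(start, group);
  }
  const out: Candle[] = [];
  const ordered = [...buckets.entries()].sort((a, b) => a[0] - b[0]);
  for (const [start, group] of ordered) {
    if (group.length !== perBucket) continue; // skip incomplete buckets
    group.sort((a, b) => a.ts - b.ts);
    out.push({
      ts: start,
      open: group[0].open,
      high: Math.max(...group.map((g) => g.high)),
      low: Math.min(...group.map((g) => g.low)),
      close: group[group.length - 1].close,
      volume: group.reduce((sum, g) => sum + g.volume, 0),
    });
  }
  return out;
}
```

Calling `aggregate(rows, 15 * MINUTE_MS, 60 * MINUTE_MS)` yields the 1h view, and the same call with `240 * MINUTE_MS` yields the 4h view. A new adapter that hands the engine only 15m rows, without this step, would leave the higher timeframes empty.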

---

## 2. Smoke test results (this session)

### Test 1 — Determinism

5,000 synthetic 15m candles + multi-timeframe aggregation, default strategy config, two consecutive runs.

```
Run #1: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1625
Run #2: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1304
DETERMINISTIC: true   (JSON.stringify(r1) === JSON.stringify(r2))
```

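✅ Engine is deterministic byte-for-byte. Performance: 5,000 candles in 1.3–1.6 s is roughly 3,100–3,800 candles/second on this VM, which extrapolates to about 9–11 seconds for a one-year 15m backtest (~35,000 candles), assuming runtime scales linearly with candle count.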
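
The check used here, together with the clock lockdown that `timeFreeze.ts` is described as providing, can be sketched as follows. This is a sketch under assumptions: `fakeRun` is a hypothetical stand-in for a backtest run, and `withFrozenClock` shows one common way to pin `Date.now()`, not necessarily how `timeFreeze.ts` does it.

```typescript
// Hypothetical stand-in for a backtest: its result embeds the current time,
// which is exactly the kind of leak that breaks byte-identical output.
function fakeRun(prices: number[]): { netPnl: number; finishedAt: number } {
  const netPnl = prices[prices.length - 1] - prices[0];
  return { netPnl, finishedAt: Date.now() };
}

// Runs `fn` with Date.now pinned to `fixedMs`, then restores the real clock.
function withFrozenClock<T>(fixedMs: number, fn: () => T): T {
  const realNow = Date.now;
  Date.now = () => fixedMs;
  try {
    return fn();
  } finally {
    Date.now = realNow;
  }
}

// Two consecutive runs must serialize to the same string.
function isDeterministic<T>(run: () => T): boolean {
  return JSON.stringify(run()) === JSON.stringify(run());
}
```

With the clock frozen, `isDeterministic(() => withFrozenClock(1_600_000_000_000, () => fakeRun([1, 2, 3])))` compares equal; without the freeze, the `finishedAt` field may differ between runs.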

### Test 2 — Intra-candle policy

Same 5,000 candles, three policies (`ohlc_path`, `stop_loss_first`, `take_profit_first`):

```
ohlc_path:           trades=3  netPnl=-15.32
stop_loss_first:     trades=3  netPnl=-15.32
take_profit_first:   trades=3  netPnl=-15.32
```

⚠️ **Identical results across all 3 policies.** Initial concern: is the policy a no-op? Source review (`VirtualExecutionEngine.ts:258` → `resolveIntraCandleConflictReason`) confirms the policy IS wired but **only invoked when both SL and TP trigger inside the same candle**. None of the 3 trades in this test had such an intra-candle conflict, so the policy never fired. **Verdict: correct behavior, but the test wasn't strong enough to exercise it.** A targeted test with constructed wide-range candles is needed to validate policy semantics; one didn't exist before this audit.
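
To make the tiebreaker behavior concrete, here is a minimal sketch for a long position. It is not the engine's `resolveIntraCandleConflictReason`; the `resolveExit` function is hypothetical, and the `ohlc_path` rule shown (the extreme nearer to the open is assumed to be visited first) is an assumption for illustration, not a confirmed description of the engine.

```typescript
type Policy = "ohlc_path" | "stop_loss_first" | "take_profit_first";
type Exit = "stop_loss" | "take_profit" | null;

interface Candle {
  open: number;
  high: number;
  low: number;
  close: number;
}

// Long position only: decides which exit, if any, fires on this candle.
function resolveExit(c: Candle, stopLoss: number, takeProfit: number, policy: Policy): Exit {
  const slHit = c.low <= stopLoss;
  const tpHit = c.high >= takeProfit;
  if (!slHit && !tpHit) return null;
  if (slHit && !tpHit) return "stop_loss";
  if (tpHit && !slHit) return "take_profit";
  // Both levels lie inside the candle's range: only now does the policy matter.
  if (policy === "stop_loss_first") return "stop_loss";
  if (policy === "take_profit_first") return "take_profit";
  // ohlc_path (assumed heuristic): the extreme closer to the open comes first.
  return c.open - c.low <= c.high - c.open ? "stop_loss" : "take_profit";
}
```

A candle such as `{ open: 100, high: 110, low: 90, close: 100 }` with a stop at 95 and a target at 105 touches both levels, so the three policies can disagree. A candle that touches only one level returns the same exit under every policy, which matches the identical results observed above.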

### Test 3 — Constant price (zero-volatility edge case)

5,000 flat candles at $50,000:

```
trades=0 netPnl=0.00
```

✅ Sane — no signals trigger on a flat tape.

### Test 4 — Small window (~40 days, COVID-crash size)

```
trades=1 netPnl=10.71 drawdown=0.03%
```

✅ Engine handles short windows without errors.

---

## 3. Concerns ranked

### 3.1 Zero unit tests on the backend engine

The engine is 1,984 lines of pure logic with non-trivial fill semantics, and has zero unit tests. The frontend `utils.test.ts` (68 lines) tests `buildInsightCards` / `parseSymbolsInput` — UI helpers, not engine math.

**Risk:** any change to `VirtualExecutionEngine.ts` or `VirtualLedger.ts` could silently change every customer's backtest results. There's no regression net.

**Minimum viable test set before customer rollout:**
- Determinism: same seed + data → identical output (this session demonstrated it works, but it's not codified)
- Intra-candle policies: construct a candle with both SL and TP inside, assert each policy returns the expected exit
- Slippage + fees: assert `netPnl` matches hand-computed expectation for a single trade with known slippage/fee bps
- Drawdown: assert `maxDrawdownPct` matches a manually-computed equity curve
- OPEN_AT_END vs FORCE_CLOSE: assert window-end policy difference

Estimate: ~6-8 hours to write the minimum set.
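
For the slippage/fee and drawdown items, the hand-computed expectations could come from small reference helpers like the ones below. This is a sketch under assumptions: the engine's actual slippage and fee model is not documented here, so `netPnlLong` assumes slippage worsens both fills and fees are charged on filled notional. Both function names are hypothetical.

```typescript
// Net PnL for one long round trip.
// Assumption: slippage moves the entry up and the exit down by `slippageBps`,
// and fees of `feeBps` are charged on the filled notional of each leg.
function netPnlLong(
  entry: number,
  exit: number,
  qty: number,
  slippageBps: number,
  feeBps: number,
): number {
  const slip = slippageBps / 10_000;
  const fee = feeBps / 10_000;
  const entryFill = entry * (1 + slip);
  const exitFill = exit * (1 - slip);
  const fees = (entryFill + exitFill) * qty * fee;
  return (exitFill - entryFill) * qty - fees;
}

// Largest peak-to-trough decline of an equity curve, as a percent of the peak.
function maxDrawdownPct(equity: number[]): number {
  let peak = -Infinity;
  let worst = 0;
  for (const e of equity) {
    peak = Math.max(peak, e);
    if (peak > 0) worst = Math.max(worst, ((peak - e) / peak) * 100);
  }
  return worst;
}
```

As a worked case: buying 2 units at 100 and selling at 110 with 10 bps slippage and 10 bps fees gives fills of 100.10 and 109.89, a gross of 19.58, fees of about 0.42, and a net of about 19.16. An equity curve of `[100, 120, 90, 130, 117]` has a maximum drawdown of 25% (120 down to 90).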

### 3.2 No equity backtest data source

The backtest data dispatch in `loadHistoricalData.ts` supports `csv | json | replay | kraken`. Live trading uses `AlpacaConnector` (`backend/src/connectors/alpaca.ts`) which handles `ASSET_CLASS: 'us_equity'`. **No `alpacaLoader.ts` for backtest.**

If a saved plan is for `AAPL`, the only way to backtest it today is to upload a CSV. That's fine for power users but unworkable for "click to test against COVID."

**Estimated work for `BacktestAlpacaSource`:** ~3-4 hours. Mirror `krakenLoader.ts`, plumb `AlpacaConnector.fetchOHLCV` through `loadHistoricalData.ts`, add `'alpaca'` to `BacktestDataSourceType`. Trickiest part: Alpaca's data tier limits (free tier has 15-minute delayed data and limited history; paid tier needed for full COVID-era access).
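
One way to make the dispatcher change hard to get wrong is an exhaustive switch over the source union, so the compiler flags every place that forgets the new case. The sketch below is illustrative only: the `BacktestSource` and `AssetClass` types, both function names, the `"crypto"` label, and the `alpacaLoader` module are assumptions, and the real `BacktestDataSourceType` and `loadHistoricalData.ts` may be shaped differently.

```typescript
// Assumed union; "alpaca" is the proposed addition.
type BacktestSource = "csv" | "json" | "replay" | "kraken" | "alpaca";
type AssetClass = "crypto" | "us_equity";

// Maps each source to the module expected to load it.
function loaderModuleFor(source: BacktestSource): string {
  switch (source) {
    case "csv":
      return "csvLoader";
    case "json":
      return "jsonLoader";
    case "replay":
      return "exchangeReplayAdapter";
    case "kraken":
      return "krakenLoader";
    case "alpaca":
      return "alpacaLoader"; // hypothetical module, does not exist yet
    default: {
      // If a new member is added to the union without a case above,
      // this assignment stops compiling.
      const unreachable: never = source;
      throw new Error(`Unknown backtest source: ${String(unreachable)}`);
    }
  }
}

// Picks the auto-fetch source for a saved plan's asset class.
function autoFetchSourceFor(assetClass: AssetClass): BacktestSource {
  return assetClass === "us_equity" ? "alpaca" : "kraken";
}
```

With this shape, a saved `AAPL` plan would resolve to the `alpaca` source without the user choosing a loader, while crypto plans keep using Kraken.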

### 3.3 Log noise

Default-level (`info`) logging produces a line per candle per rule evaluation, about five lines per candle. A 5,000-candle backtest emitted ~25,000 lines in our test. At the same rate, a one-year 15m backtest (~35,000 candles) would emit roughly 175,000 lines. Concerns:
- Container log volume at scale (Loki ingestion cost)
- Slow stdout flushing can extend backtest duration
- Customer-visible logs would leak strategy internals

**Fix:** wrap all `[ProEngine] info` logs with a `BACKTEST_QUIET` env flag, or default to `warn` level inside the backtest entrypoint.
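
The second option (raising the minimum level inside the backtest entrypoint) could look roughly like this. It is a sketch, not the project's logger: `createLogger`, `minLevelFor`, and the convention that `BACKTEST_QUIET=1` means quiet are all assumptions. The environment is passed in as a plain object to keep the example self-contained.

```typescript
type Level = "debug" | "info" | "warn" | "error";

const LEVEL_ORDER: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3 };

// Assumed convention: BACKTEST_QUIET=1 raises the threshold from info to warn.
function minLevelFor(env: Record<string, string | undefined>): Level {
  return env.BACKTEST_QUIET === "1" ? "warn" : "info";
}

// Returns a logger that forwards only messages at or above `min` to `sink`.
function createLogger(min: Level, sink: (line: string) => void) {
  const log = (level: Level, msg: string): void => {
    if (LEVEL_ORDER[level] >= LEVEL_ORDER[min]) sink(`[${level}] ${msg}`);
  };
  return {
    debug: (msg: string) => log("debug", msg),
    info: (msg: string) => log("info", msg),
    warn: (msg: string) => log("warn", msg),
    error: (msg: string) => log("error", msg),
  };
}
```

In quiet mode the per-candle `info` lines are dropped before they reach stdout, while warnings such as "Insufficient data" still surface.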

### 3.4 Historical depth

Verified data source coverage:

| Source | COVID Mar 2020 | Russia/Ukraine 2022 | 2018 selloff | 2008 GFC |
|---|---|---|---|---|
| Kraken (BTC, ETH) | ✅ Reachable | ✅ | ✅ | ❌ |
| Kraken (newer alts) | ⚠️ Pair-dependent | ✅ likely | ❌ likely | ❌ |
| Alpaca (equities) | ⚠️ Free tier may not | ✅ | ✅ | ⚠️ Paid tier |
| User CSV/JSON | ✅ if user has data | ✅ | ✅ | ✅ |

For preset historical-event buttons to work reliably on customer-launched backtests, presets older than ~3 years should fall back to "upload your own data" or be hidden.
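
A data-driven alternative to a fixed age cutoff is to compare each preset's start against the earliest candle the chosen source can actually serve. The sketch below assumes that earliest timestamp is known per source and symbol; the `Preset` shape, `partitionPresets`, and the example windows are illustrative and not the product's presets.

```typescript
interface Preset {
  id: string;
  startMs: number;
  endMs: number;
}

// Illustrative windows only (UTC); real preset ranges are a product decision.
const EXAMPLE_PRESETS: Preset[] = [
  { id: "covid-2020", startMs: Date.UTC(2020, 1, 15), endMs: Date.UTC(2020, 3, 15) },
  { id: "gfc-2008", startMs: Date.UTC(2008, 8, 1), endMs: Date.UTC(2009, 2, 31) },
];

// Splits presets into those the source can cover and those that need
// the "upload your own data" fallback.
function partitionPresets(
  presets: Preset[],
  earliestAvailableMs: number,
): { offered: Preset[]; needsUpload: Preset[] } {
  const offered: Preset[] = [];
  const needsUpload: Preset[] = [];
  for (const p of presets) {
    if (earliestAvailableMs <= p.startMs) {
      offered.push(p);
    } else {
      needsUpload.push(p);
    }
  }
  return { offered, needsUpload };
}
```

For a source whose history starts in January 2019, the COVID window would be offered and the 2008 window would fall back to upload.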

### 3.5 Production gate is intentional

`web/src/backtest/flags.ts` line 32:

```ts
// Default to disabled — backtest is not yet production-ready.
return false;
```

The team has explicitly opted out of production exposure. Flipping `VITE_BACKTEST_ENABLED=true` is a deliberate decision, not an oversight. Whoever takes this on should:
1. Find the original PR/discussion where backtest was gated
2. Confirm whether the listed concerns above are the same ones that drove gating, or if there are others
3. Get sign-off before un-gating
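
When un-gating is eventually reviewed, it helps to see which of the three flags is still blocking customer exposure. The sketch below is an assumption about how such a check could be written, not the logic in `flags.ts`; the `BacktestFlags` shape and both function names are hypothetical, and the admin gate is deliberately left out because its interaction with these flags is not documented here.

```typescript
// The three flags named in the TL;DR, modeled as plain booleans.
interface BacktestFlags {
  viteBacktestEnabled: boolean; // build-time: VITE_BACKTEST_ENABLED
  enableBacktest: boolean; // runtime
  customerEnabled: boolean; // runtime
}

// Mirrors the "default to disabled" stance: everything is off unless set.
const DEFAULT_FLAGS: BacktestFlags = {
  viteBacktestEnabled: false,
  enableBacktest: false,
  customerEnabled: false,
};

// Lists the flags that still block customer exposure.
function blockingFlags(flags: BacktestFlags): string[] {
  const blocked: string[] = [];
  if (!flags.viteBacktestEnabled) blocked.push("VITE_BACKTEST_ENABLED");
  if (!flags.enableBacktest) blocked.push("enableBacktest");
  if (!flags.customerEnabled) blocked.push("customerEnabled");
  return blocked;
}

// Customers see the feature only when all three flags align.
function isCustomerBacktestEnabled(flags: BacktestFlags): boolean {
  return blockingFlags(flags).length === 0;
}
```

Returning the list of blocking flags, rather than a bare boolean, makes a partial rollout (for example, the build flag on but `customerEnabled` off) easy to diagnose.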

---

## 4. Recommended path forward

| Stage | Scope | Risk | Value |
|---|---|---|---|
| **A** | Admin-only POC: button on `/plans` cards visible when `isAdminView`, opens `BacktestRunnerPanel` pre-loaded with the plan's `strategy_config` + 5 preset crypto-friendly date ranges. No flag changes; uses existing admin gate. | Low | Validates the UX hypothesis. Lets the team dogfood. |
| **B** | Add the minimum-viable backtest test suite (§3.1). | Low | Regression net. Ship-blocker for any customer feature. |
| **C** | Add `BacktestAlpacaSource` (§3.2) IF saved plans include equities. | Medium | Without this, customer feature only works for crypto plans. |
| **D** | Quiet-mode logging wrapper (§3.3). | Low | Operational hygiene. |
| **E** | Replay against the user's own past trades (already exists as `replay` source) — surface as "see how this plan would have played your last 90 days" instead of "see how it would have played COVID." Lower data risk, higher per-user truthfulness. | Low | Could ship without B/C above; uses existing canonical lifecycle data. |
| **F** | Customer rollout: validate B + C + D, align all three flags (`VITE_BACKTEST_ENABLED` at build time, `enableBacktest` and `customerEnabled` at runtime), optionally add tier gating. | High | The big one. Don't do this without B+C+D. |

**My recommendation:** A first (this session can do it), then E in parallel (it's possibly the better v1 product anyway — "show me how my plan would have played my actual past trades" is more truthful than synthetic historical scenarios). B + C + D are prereqs for F.

---

## 5. Reproducible smoke tests

The Node ESM scripts used for this audit live at `/tmp/backtest_smoke.mjs` and `/tmp/backtest_smoke2.mjs` (this session — not committed). To re-run:

```bash
cd /opt/bytelyst/learning_ai_invt_trdg
pnpm --filter @bytelyst/trading-backend build
node /tmp/backtest_smoke2.mjs 2>/dev/null | grep -E "trades=|DETERMINISTIC|netPnl"
```

If we proceed, these should be promoted to `backend/src/backtest/__tests__/` with vitest.