learning_ai_invt_trdg/docs/backtest/ENGINE_READINESS.md
Devin 2fa41dd000 docs(backtest): engine readiness assessment
Investigates whether the backtest engine is ready to power a customer-
facing "test plan against history" feature on /plans (e.g. COVID, war
periods). Combines code review with 4 synthetic-data smoke tests run
against the running engine.

TL;DR:
  Engine itself is sound — deterministic, fast (~250-325 candles/sec),
  multi-timeframe data pipeline works, intra-candle policies wired
  correctly, runs the same strategy code as live trading.

Ship blockers for customer feature:
  - Zero engine unit tests (1,984 LOC backend, 0 *.test.* files)
  - No Alpaca data source — only Kraken auto-fetch + CSV/JSON upload
    means stock plans can't be tested without manual CSV
  - Log noise (~25k lines per 5k-candle backtest)
  - Production gate is intentional ("not yet production-ready" comment
    in flags.ts) — not an oversight

Recommended path: admin-only POC first (low risk, no flag changes),
plus the user's-own-trade-replay source as a v1 product alternative
that's truthier than synthetic historical scenarios.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-10 10:20:19 +00:00

# Backtest Engine Readiness Report
> **Date:** 2026-05-10
> **Trigger:** request to add "test against history" feature to `/plans` page (e.g. COVID, war periods)
> **Method:** code review + 4 synthetic-data smoke tests against the running engine
> **Audience:** anyone deciding whether to ship customer-facing backtest UI
---
## TL;DR
| Concern | Status |
|---|---|
| Engine determinism | ✅ Verified (same input → byte-identical output) |
| Multi-timeframe data pipeline | ✅ `normalize.ts` aggregates 1h+4h from 15m automatically |
| Intra-candle SL/TP policy wiring | ✅ Correctly used as a tiebreaker (only fires on real conflicts) |
| Strategy schema compatibility | ✅ Saved trade plans (`strategy_config` field) feed directly into backtest |
| Performance | ✅ ~1.5s for 5,000 candles × 1 symbol on this VM |
| **Equity (stock) data source** | ❌ **Missing.** Live trading uses Alpaca for equities, but no Alpaca backtest loader exists. Only Kraken (crypto) auto-fetch. |
| **Test coverage** | ❌ **Zero engine tests.** 1,984 lines of backend backtest code, 0 `*.test.*` files. Only the FE `utils.test.ts` exists (68 lines, tests the chart rendering helpers — not the engine). |
| **Production gating** | ⚠️ Explicit comment in `flags.ts`: *"Default to disabled — backtest is not yet production-ready."* Three flags must align (`VITE_BACKTEST_ENABLED` build, `enableBacktest` runtime, `customerEnabled` runtime). |
| **Log noise** | ⚠️ ~5 log lines per candle at default level. A 5,000-candle backtest produces ~25,000 log lines. Operationally disruptive at scale. |
| **Historical depth** | ⚠️ Kraken via ccxt typically returns 5-7 years for major pairs. **COVID Mar 2020 reachable for BTC/ETH; uncertain for younger altcoins. 2008 GFC unreachable.** |
**Verdict for shipping a customer-facing "test against history" feature:** *not yet*. The engine itself is sound, but ship-blocking gaps are:
1. No equity data source (so stock plans can't be tested without manual CSV upload)
2. Zero engine unit tests — high regression risk
3. Log noise needs throttling
**Verdict for an admin-only POC:** *safe*. The engine is deterministic, fast, and runs the same strategy code as live. Adding a button on `/plans` cards visible only to admins (using `isAdminView` from `useBacktestFeatureGate`) is low-risk.
---
## 1. What the engine actually is
```
backend/src/backtest/
├── engine/
│ ├── BacktestRunner.ts (384 LOC) — top-level orchestration
│ ├── VirtualExecutionEngine.ts (609 LOC) — fill simulation, SL/TP, slippage
│ ├── VirtualLedger.ts ( 77 LOC) — position + cash bookkeeping
│ ├── timeFreeze.ts ( 26 LOC) — Date.now() lockdown for determinism
│ ├── warmup.ts ( 88 LOC) — pre-window candle reservation
│ └── computeSummary.ts ( 60 LOC) — netPnl/winRate/drawdown/sharpe
├── data/
│ ├── csvLoader.ts — user-uploaded CSV
│ ├── jsonLoader.ts — user-uploaded JSON
│ ├── exchangeReplayAdapter.ts — replay user's own past trades
│ ├── krakenLoader.ts — auto-fetch crypto via ccxt
│ ├── normalize.ts — ⭐ aggregates 1h/4h from 15m
│ └── loadHistoricalData.ts — source dispatcher
├── exchange/
│ └── ReplayExchangeConnector.ts — IExchangeConnector impl backed by dataset
├── metrics/
│ └── computeSummary.ts
├── guards.ts — feature-flag + mode assertions
├── strategySafety.ts — config validation
└── types.ts
```
**Design strength:** the `ReplayExchangeConnector` implements the same `IExchangeConnector` interface as live trading. The strategy engine (`ProStrategyEngine`) doesn't know it's in a backtest. Same code paths run for live and historical — high fidelity.
**Key invariant:** data loading aggregates upward from 15m to 1h and 4h via `aggregateCandles()` in `normalize.ts:86`. The Kraken loader fetches **only** 1m or 15m; the engine's required 4h+1h+15m view is built from there. Skipping `normalize.ts` (e.g. by calling `runBacktestReplay` with a manually-constructed dataset) **does not raise an error: the run only emits "Insufficient data" warnings**, so the failure is easy to miss. This is a footgun for any new data adapter that doesn't go through `buildDatasetFromRows`.
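To make the invariant concrete, here is a minimal sketch of rolling 15m candles up into a higher timeframe. It is not the repo's `aggregateCandles()`; the `Candle` shape and the `aggregateCandlesSketch` name are assumptions made for illustration, and the input is assumed to be sorted by timestamp.

```typescript
// Hypothetical candle shape for illustration; the repo's real type may differ.
interface Candle {
  timestamp: number; // bucket open time, ms since epoch
  open: number;
  high: number;
  low: number;
  close: number;
  volume: number;
}

const MINUTE_MS = 60000;

// Roll lower-timeframe candles (e.g. 15m) up into a higher timeframe (e.g. 1h).
// Assumes `candles` is sorted by timestamp ascending.
function aggregateCandlesSketch(candles: Candle[], targetMinutes: number): Candle[] {
  const bucketMs = targetMinutes * MINUTE_MS;
  const out: Candle[] = [];
  for (const c of candles) {
    const bucketStart = Math.floor(c.timestamp / bucketMs) * bucketMs;
    const last = out[out.length - 1];
    if (last !== undefined && last.timestamp === bucketStart) {
      // Same bucket: extend the range, move the close, accumulate volume.
      last.high = Math.max(last.high, c.high);
      last.low = Math.min(last.low, c.low);
      last.close = c.close;
      last.volume += c.volume;
    } else {
      // New bucket: the first candle supplies the open.
      out.push({ ...c, timestamp: bucketStart });
    }
  }
  return out;
}

// Five 15m candles: four fill the first hour, one starts the second.
const fifteenMin: Candle[] = [0, 1, 2, 3, 4].map((i) => ({
  timestamp: i * 15 * MINUTE_MS,
  open: 100 + i,
  high: 101 + i,
  low: 99 + i,
  close: 100.5 + i,
  volume: 10,
}));
const hourly = aggregateCandlesSketch(fifteenMin, 60);
```

A new data adapter that only has 15m rows would need an equivalent roll-up for both 1h and 4h before the dataset reaches the engine.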
---
## 2. Smoke test results (this session)
### Test 1 — Determinism
5,000 synthetic 15m candles + multi-timeframe aggregation, default strategy config, two consecutive runs.
```
Run #1: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1625
Run #2: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1304
DETERMINISTIC: true (JSON.stringify(r1) === JSON.stringify(r2))
```
✅ Engine is deterministic byte-for-byte. Performance: ~250-325 candles/second on this VM, which extrapolates to ~6 seconds for a one-year 15m backtest.
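The check above can be codified roughly as follows. This is a sketch under assumptions: `withFrozenClock` and `isDeterministic` are hypothetical helpers (not the repo's `timeFreeze.ts`), and the run function is synchronous here, whereas the real runner is probably async.

```typescript
// Run `fn` with Date.now() pinned to a fixed instant, then restore the real clock.
function withFrozenClock<T>(fixedNowMs: number, fn: () => T): T {
  const realNow = Date.now;
  Date.now = () => fixedNowMs;
  try {
    return fn();
  } finally {
    Date.now = realNow;
  }
}

// True when two runs over the same input serialize to identical JSON.
function isDeterministic<I, R>(run: (input: I) => R, input: I, fixedNowMs = 0): boolean {
  const first = withFrozenClock(fixedNowMs, () => run(input));
  const second = withFrozenClock(fixedNowMs, () => run(input));
  return JSON.stringify(first) === JSON.stringify(second);
}

// Hypothetical stand-ins for a backtest run.
interface RunInput { firstClose: number; lastClose: number; }

// Reads the clock, but the clock is frozen, so the output is stable.
const pureRun = (input: RunInput) => ({
  netPnl: input.lastClose - input.firstClose,
  ranAt: Date.now(),
});

// Leaks hidden state between runs, so the output differs on the second call.
let runCounter = 0;
const leakyRun = (input: RunInput) => ({
  netPnl: input.lastClose - input.firstClose,
  seq: runCounter++,
});

const pureOk = isDeterministic(pureRun, { firstClose: 100, lastClose: 99 });
const leakyOk = isDeterministic(leakyRun, { firstClose: 100, lastClose: 99 });
```

Comparing serialized output catches any hidden state (clock reads, counters, unseeded randomness) without needing to know which field drifted.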
### Test 2 — Intra-candle policy
Same 5,000 candles, three policies (`ohlc_path`, `stop_loss_first`, `take_profit_first`):
```
ohlc_path: trades=3 netPnl=-15.32
stop_loss_first: trades=3 netPnl=-15.32
take_profit_first: trades=3 netPnl=-15.32
```
⚠️ **Identical results across all 3 policies.** Initial concern: is the policy a no-op? Source review (`VirtualExecutionEngine.ts:258` → `resolveIntraCandleConflictReason`) confirms the policy IS wired but **only invoked when both SL and TP trigger inside the same candle**. None of the 3 trades in this test had such an intra-candle conflict, so the policy never fired. **Verdict: correct behavior, but the test wasn't strong enough to exercise it.** A targeted test with constructed wide-range candles is needed to validate policy semantics; one didn't exist before this audit.
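The tiebreaker behavior described above can be sketched as follows. This is not the code in `VirtualExecutionEngine.ts`; the function and type names are hypothetical, only long positions are handled, and the `ohlc_path` rule shown (a bullish candle is assumed to have visited its low before its high) is an assumed heuristic, not a confirmed description of the engine.

```typescript
type IntraCandlePolicy = "ohlc_path" | "stop_loss_first" | "take_profit_first";
type ExitReason = "stop_loss" | "take_profit" | null;

interface Bar { open: number; high: number; low: number; close: number; }

// Long positions only, to keep the sketch short.
function resolveLongExit(
  bar: Bar,
  stopLoss: number,
  takeProfit: number,
  policy: IntraCandlePolicy,
): ExitReason {
  const slHit = bar.low <= stopLoss;
  const tpHit = bar.high >= takeProfit;
  if (!slHit && !tpHit) return null;
  // Only one level touched: no conflict, so the policy is never consulted.
  if (slHit !== tpHit) return slHit ? "stop_loss" : "take_profit";
  // Both levels touched inside one candle: the policy breaks the tie.
  switch (policy) {
    case "stop_loss_first":
      return "stop_loss";
    case "take_profit_first":
      return "take_profit";
    case "ohlc_path":
      // ASSUMPTION: bullish candle went open -> low -> high -> close, bearish the reverse.
      return bar.close >= bar.open ? "stop_loss" : "take_profit";
  }
}

// A wide-range candle that spans both levels exercises the policy...
const wideBar: Bar = { open: 100, high: 110, low: 90, close: 105 };
// ...while a candle that touches only the take-profit does not.
const narrowBar: Bar = { open: 100, high: 109, low: 97, close: 108 };

const wideSlFirst = resolveLongExit(wideBar, 95, 108, "stop_loss_first");
const wideTpFirst = resolveLongExit(wideBar, 95, 108, "take_profit_first");
const narrowSlFirst = resolveLongExit(narrowBar, 95, 108, "stop_loss_first");
const narrowTpFirst = resolveLongExit(narrowBar, 95, 108, "take_profit_first");
```

With `narrowBar` every policy yields the same exit, which mirrors the identical results in Test 2; only a candle like `wideBar` can distinguish the policies.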
### Test 3 — Constant price (zero-volatility edge case)
5,000 flat candles at $50,000:
```
trades=0 netPnl=0.00
```
✅ Sane — no signals trigger on a flat tape.
### Test 4 — Small window (~40 days, COVID-crash size)
```
trades=1 netPnl=10.71 drawdown=0.03%
```
✅ Engine handles short windows without errors.
---
## 3. Concerns ranked
### 3.1 Zero unit tests on the backend engine
The engine is 1,984 lines of pure logic with non-trivial fill semantics, and has zero unit tests. The frontend `utils.test.ts` (68 lines) tests `buildInsightCards` / `parseSymbolsInput` — UI helpers, not engine math.
**Risk:** any change to `VirtualExecutionEngine.ts` or `VirtualLedger.ts` could silently change every customer's backtest results. There's no regression net.
**Minimum viable test set before customer rollout:**
- Determinism: same seed + data → identical output (this session demonstrated it works, but it's not codified)
- Intra-candle policies: construct a candle with both SL and TP inside, assert each policy returns the expected exit
- Slippage + fees: assert `netPnl` matches hand-computed expectation for a single trade with known slippage/fee bps
- Drawdown: assert `maxDrawdownPct` matches a manually-computed equity curve
- OPEN_AT_END vs FORCE_CLOSE: assert window-end policy difference
Estimate: ~6-8 hours to write the minimum set.
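As a starting point for the slippage/fee and drawdown items, the hand-computed expectations could look like the sketch below. The fill model is an assumption (slippage worsens both fills, fees are charged on each fill's notional) and may not match `VirtualExecutionEngine.ts`; both function names are hypothetical reference helpers, not the engine's own.

```typescript
// Hand-computable expectation for one long round trip.
// ASSUMPTION: slippage worsens both fills; fees apply to each fill's notional.
function expectedNetPnl(
  entry: number,
  exit: number,
  qty: number,
  slippageBps: number,
  feeBps: number,
): number {
  const slip = slippageBps / 10000;
  const fee = feeBps / 10000;
  const entryFill = entry * (1 + slip); // buy slightly higher
  const exitFill = exit * (1 - slip); // sell slightly lower
  const gross = (exitFill - entryFill) * qty;
  const fees = (entryFill + exitFill) * qty * fee;
  return gross - fees;
}

// Largest peak-to-trough decline of an equity curve, as a percentage of the peak.
function referenceMaxDrawdownPct(equity: number[]): number {
  let peak = -Infinity;
  let maxDd = 0;
  for (const value of equity) {
    peak = Math.max(peak, value);
    if (peak > 0) {
      maxDd = Math.max(maxDd, ((peak - value) / peak) * 100);
    }
  }
  return maxDd;
}

// Buy 1 unit at 100, sell at 110, 10 bps slippage and 10 bps fee per side:
// fills are 100.1 and 109.89, gross is 9.79, fees are 0.20999.
const net = expectedNetPnl(100, 110, 1, 10, 10);

// Peak 1100, trough 990: a 10% drawdown.
const dd = referenceMaxDrawdownPct([1000, 1100, 990, 1050, 1200]);
```

A regression test would then run the engine on a single constructed trade and compare its `netPnl` and `maxDrawdownPct` against these independent calculations within a small tolerance.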
### 3.2 No equity backtest data source
The backtest data dispatch in `loadHistoricalData.ts` supports `csv | json | replay | kraken`. Live trading uses `AlpacaConnector` (`backend/src/connectors/alpaca.ts`) which handles `ASSET_CLASS: 'us_equity'`. **No `alpacaLoader.ts` for backtest.**
If a saved plan is for `AAPL`, the only way to backtest it today is to upload a CSV. That's fine for power users but unworkable for "click to test against COVID."
**Estimated work for `BacktestAlpacaSource`:** ~3-4 hours. Mirror `krakenLoader.ts`, plumb `AlpacaConnector.fetchOHLCV` through `loadHistoricalData.ts`, add `'alpaca'` to `BacktestDataSourceType`. Trickiest part: Alpaca's data tier limits (free tier has 15-minute delayed data and limited history; paid tier needed for full COVID-era access).
### 3.3 Log noise
Default-level (`info`) logging produces a line per candle per rule evaluation. A 5,000-candle backtest emitted ~25,000 lines in our test. At year-scale this is 80,000+ lines per backtest. Concerns:
- Container log volume at scale (Loki ingestion cost)
- Slow stdout flushing can extend backtest duration
- Customer-visible logs would leak strategy internals
**Fix:** wrap all `[ProEngine] info` logs with a `BACKTEST_QUIET` env flag, or default to `warn` level inside the backtest entrypoint.
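One possible shape for the quiet-mode wrapper is sketched below. `BACKTEST_QUIET` is the flag proposed above and does not exist yet; `makeBacktestLogger`, the `"1"` value convention, and the injected `env` and `sink` parameters are all assumptions chosen to keep the sketch testable without touching the real logger.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

const LEVEL_RANK: Record<LogLevel, number> = { debug: 0, info: 1, warn: 2, error: 3 };

// ASSUMPTION: BACKTEST_QUIET is a hypothetical env flag; "1" raises the floor to warn.
function makeBacktestLogger(
  env: Record<string, string | undefined>,
  sink: (line: string) => void,
): (level: LogLevel, message: string) => void {
  const minLevel: LogLevel = env["BACKTEST_QUIET"] === "1" ? "warn" : "info";
  return (level, message) => {
    if (LEVEL_RANK[level] >= LEVEL_RANK[minLevel]) {
      sink(`[${level}] ${message}`);
    }
  };
}

// Simulate 5 info lines per candle over 5,000 candles, plus one warning.
const quietLines: string[] = [];
const quietLog = makeBacktestLogger({ BACKTEST_QUIET: "1" }, (line) => quietLines.push(line));
const noisyLines: string[] = [];
const noisyLog = makeBacktestLogger({}, (line) => noisyLines.push(line));

for (let candle = 0; candle < 5000; candle++) {
  for (let rule = 0; rule < 5; rule++) {
    quietLog("info", `candle ${candle} rule ${rule} evaluated`);
    noisyLog("info", `candle ${candle} rule ${rule} evaluated`);
  }
}
quietLog("warn", "Insufficient data");
noisyLog("warn", "Insufficient data");
```

Keeping warnings visible in quiet mode matters here, because "Insufficient data" warnings are the only signal when a dataset bypasses normalization.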
### 3.4 Historical depth
Verified data source coverage:
| Source | COVID Mar 2020 | Russia/Ukraine 2022 | 2018 selloff | 2008 GFC |
|---|---|---|---|---|
| Kraken (BTC, ETH) | ✅ Reachable | ✅ | ✅ | ❌ |
| Kraken (newer alts) | ⚠️ Pair-dependent | ✅ likely | ❌ likely | ❌ |
| Alpaca (equities) | ⚠️ Free tier may not | ✅ | ✅ | ⚠️ Paid tier |
| User CSV/JSON | ✅ if user has data | ✅ | ✅ | ✅ |
For preset historical-event buttons to work reliably on customer-launched backtests, presets older than ~3 years should fall back to "upload your own data" or be hidden.
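The fallback rule could be sketched as a simple availability check. Everything here is hypothetical: the `EventPreset` shape, the `presetAction` name, the preset date ranges, and the assumed earliest-available date are illustrative values, not measured coverage.

```typescript
interface EventPreset {
  label: string;
  startIso: string;
  endIso: string;
}

type PresetAction = "run" | "upload_own_data";

// A preset is runnable only if the data source reaches back to its start date.
function presetAction(preset: EventPreset, earliestAvailableMs: number): PresetAction {
  return Date.parse(preset.startIso) >= earliestAvailableMs ? "run" : "upload_own_data";
}

// Illustrative presets; the date ranges are assumptions, not product decisions.
const covidPreset: EventPreset = {
  label: "COVID crash",
  startIso: "2020-02-15T00:00:00Z",
  endIso: "2020-04-15T00:00:00Z",
};
const gfcPreset: EventPreset = {
  label: "2008 GFC",
  startIso: "2008-09-01T00:00:00Z",
  endIso: "2009-03-31T00:00:00Z",
};

// ASSUMPTION: the earliest candle a source returns for this symbol, discovered at runtime.
const earliestAvailableMs = Date.parse("2018-01-01T00:00:00Z");

const covidAction = presetAction(covidPreset, earliestAvailableMs);
const gfcAction = presetAction(gfcPreset, earliestAvailableMs);
```

Deciding per symbol from the earliest candle actually returned, rather than from a fixed age cutoff, would handle the pair-dependent rows in the table above.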
### 3.5 Production gate is intentional
`web/src/backtest/flags.ts` line 32:
```js
// Default to disabled — backtest is not yet production-ready.
return false;
```
The team has explicitly opted out of production exposure. Flipping `VITE_BACKTEST_ENABLED=true` is a deliberate decision, not an oversight. Whoever takes this on should:
1. Find the original PR/discussion where backtest was gated
2. Confirm whether the listed concerns above are the same ones that drove gating, or if there are others
3. Get sign-off before un-gating
---
## 4. Recommended path forward
| Stage | Scope | Risk | Value |
|---|---|---|---|
| **A** | Admin-only POC: button on `/plans` cards visible when `isAdminView`, opens `BacktestRunnerPanel` pre-loaded with the plan's `strategy_config` + 5 preset crypto-friendly date ranges. No flag changes; uses existing admin gate. | Low | Validates the UX hypothesis. Lets the team dogfood. |
| **B** | Add the minimum-viable backtest test suite (§3.1). | Low | Regression net. Ship-blocker for any customer feature. |
| **C** | Add `BacktestAlpacaSource` (§3.2) IF saved plans include equities. | Medium | Without this, customer feature only works for crypto plans. |
| **D** | Quiet-mode logging wrapper (§3.3). | Low | Operational hygiene. |
| **E** | Replay against the user's own past trades (already exists as `replay` source) — surface as "see how this plan would have played your last 90 days" instead of "see how it would have played COVID." Lower data risk, higher per-user truthfulness. | Low | Could ship without B/C above; uses existing canonical lifecycle data. |
| **F** | Customer rollout: validate B + C + D, flip the runtime flag, optionally add tier gating. | High | The big one. Don't do this without B+C+D. |
**My recommendation:** A first (this session can do it), then E in parallel (it's possibly the better v1 product anyway — "show me how my plan would have played my actual past trades" is more truthful than synthetic historical scenarios). B + C + D are prereqs for F.
---
## 5. Reproducible smoke tests
The Node ESM scripts used for this audit live at `/tmp/backtest_smoke.mjs` and `/tmp/backtest_smoke2.mjs` (this session — not committed). To re-run:
```bash
cd /opt/bytelyst/learning_ai_invt_trdg
pnpm --filter @bytelyst/trading-backend build
node /tmp/backtest_smoke2.mjs 2>/dev/null | grep -E "trades=|DETERMINISTIC|netPnl"
```
If we proceed, these should be promoted to `backend/src/backtest/__tests__/` with vitest.
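Since the original scripts are not committed, a promoted test would need its own reproducible input. The sketch below generates seeded synthetic 15m candles; it is not the generator used in this audit. The PRNG is a mulberry32-style function, and `syntheticCandles`, the step sizes, and the starting price are assumptions.

```typescript
interface SynthCandle {
  timestamp: number;
  open: number;
  high: number;
  low: number;
  close: number;
  volume: number;
}

// Small seeded PRNG (mulberry32-style) returning values in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Random-walk candles: each open equals the previous close.
function syntheticCandles(
  count: number,
  seed: number,
  startPrice = 50000,
  stepMs = 15 * 60 * 1000,
): SynthCandle[] {
  const rand = seededRandom(seed);
  const out: SynthCandle[] = [];
  let price = startPrice;
  for (let i = 0; i < count; i++) {
    const open = price;
    const close = open * (1 + (rand() - 0.5) * 0.01); // up to +/-0.5% per candle
    const high = Math.max(open, close) * (1 + rand() * 0.002);
    const low = Math.min(open, close) * (1 - rand() * 0.002);
    out.push({ timestamp: i * stepMs, open, high, low, close, volume: 1 + rand() * 10 });
    price = close;
  }
  return out;
}

const seriesA = syntheticCandles(5000, 42);
const seriesB = syntheticCandles(5000, 42);
const seriesC = syntheticCandles(5000, 43);
```

Fixing the seed in the test file keeps the determinism assertion meaningful across machines, since both runs and all future runs see the same tape.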