Devin 2fa41dd000 docs(backtest): engine readiness assessment

Investigates whether the backtest engine is ready to power a customer-
facing "test plan against history" feature on /plans (e.g. COVID, war
periods). Combines code review with 4 synthetic-data smoke tests run
against the running engine.

TL;DR:
  Engine itself is sound — deterministic, fast (~250-325 candles/sec),
  multi-timeframe data pipeline works, intra-candle policies wired
  correctly, runs the same strategy code as live trading.

Ship blockers for customer feature:
  - Zero engine unit tests (1,984 LOC backend, 0 *.test.* files)
  - No Alpaca data source — only Kraken auto-fetch + CSV/JSON upload
    means stock plans can't be tested without manual CSV
  - Log noise (~25k lines per 5k-candle backtest)
  - Production gate is intentional ("not yet production-ready" comment
    in flags.ts) — not an oversight

Recommended path: admin-only POC first (low risk, no flag changes),
plus the user's-own-trade-replay source as a v1 product alternative
that's truthier than synthetic historical scenarios.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

2026-05-10 10:20:19 +00:00

11 KiB

Raw Permalink Blame History

Backtest Engine Readiness Report

Date: 2026-05-10 Trigger: request to add "test against history" feature to /plans page (e.g. COVID, war periods) Method: code review + 3 synthetic-data smoke tests against the running engine Audience: anyone deciding whether to ship customer-facing backtest UI

TL;DR

Concern	Status
Engine determinism	✅ Verified (same input → byte-identical output)
Multi-timeframe data pipeline	✅ `normalize.ts` aggregates 1h+4h from 15m automatically
Intra-candle SL/TP policy wiring	✅ Correctly used as a tiebreaker (only fires on real conflicts)
Strategy schema compatibility	✅ Saved trade plans (`strategy_config` field) feed directly into backtest
Performance	✅ ~1.5s for 5,000 candles × 1 symbol on this VM
Equity (stock) data source	❌ Missing. Live trading uses Alpaca for equities, but no Alpaca backtest loader exists. Only Kraken (crypto) auto-fetch.
Test coverage	❌ Zero engine tests. 1,984 lines of backend backtest code, 0 `.test.` files. Only the FE `utils.test.ts` exists (68 lines, tests the chart rendering helpers — not the engine).
Production gating	⚠️ Explicit comment in `flags.ts`: "Default to disabled — backtest is not yet production-ready." Three flags must align (`VITE_BACKTEST_ENABLED` build, `enableBacktest` runtime, `customerEnabled` runtime).
Log noise	⚠️ ~5 log lines per candle at default level. A 5,000-candle backtest produces ~25,000 log lines. Operationally disruptive at scale.
Historical depth	⚠️ Kraken via ccxt typically returns 5-7 years for major pairs. COVID Mar 2020 reachable for BTC/ETH; uncertain for younger altcoins. 2008 GFC unreachable.

Verdict for shipping a customer-facing "test against history" feature: not yet. The engine itself is sound, but ship-blocking gaps are:

No equity data source (so stock plans can't be tested without manual CSV upload)
Zero engine unit tests — high regression risk
Log noise needs throttling

Verdict for an admin-only POC: safe. The engine is deterministic, fast, and runs the same strategy code as live. Adding a button on /plans cards visible only to admins (using isAdminView from useBacktestFeatureGate) is low-risk.

1. What the engine actually is

backend/src/backtest/
├── engine/
│   ├── BacktestRunner.ts          (384 LOC) — top-level orchestration
│   ├── VirtualExecutionEngine.ts  (609 LOC) — fill simulation, SL/TP, slippage
│   ├── VirtualLedger.ts           ( 77 LOC) — position + cash bookkeeping
│   ├── timeFreeze.ts              ( 26 LOC) — Date.now() lockdown for determinism
│   ├── warmup.ts                  ( 88 LOC) — pre-window candle reservation
│   └── computeSummary.ts          ( 60 LOC) — netPnl/winRate/drawdown/sharpe
├── data/
│   ├── csvLoader.ts               — user-uploaded CSV
│   ├── jsonLoader.ts              — user-uploaded JSON
│   ├── exchangeReplayAdapter.ts   — replay user's own past trades
│   ├── krakenLoader.ts            — auto-fetch crypto via ccxt
│   ├── normalize.ts               — ⭐ aggregates 1h/4h from 15m
│   └── loadHistoricalData.ts      — source dispatcher
├── exchange/
│   └── ReplayExchangeConnector.ts — IExchangeConnector impl backed by dataset
├── metrics/
│   └── computeSummary.ts
├── guards.ts                      — feature-flag + mode assertions
├── strategySafety.ts              — config validation
└── types.ts

Design strength: the ReplayExchangeConnector implements the same IExchangeConnector interface as live trading. The strategy engine (ProStrategyEngine) doesn't know it's in a backtest. Same code paths run for live and historical — high fidelity.

Key invariant: data loading aggregates upward from 15m to 1h and 4h via aggregateCandles() in normalize.ts:86. The Kraken loader fetches only 1m or 15m; the engine's required 4h+1h+15m view is built from there. Skipping normalize.ts (e.g. by calling runBacktestReplay with a manually-constructed dataset) silently fails with "Insufficient data" warnings. This is a footgun for any new data adapter that doesn't go through buildDatasetFromRows.

2. Smoke test results (this session)

Test 1 — Determinism

5,000 synthetic 15m candles + multi-timeframe aggregation, default strategy config, two consecutive runs.

Run #1: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1625
Run #2: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1304
DETERMINISTIC: true   (JSON.stringify(r1) === JSON.stringify(r2))

✅ Engine is deterministic byte-for-byte. Performance: ~250–325 candles/second on this VM, which extrapolates to ~6 seconds for a one-year 15m backtest.

Test 2 — Intra-candle policy

Same 5,000 candles, three policies (ohlc_path, stop_loss_first, take_profit_first):

ohlc_path:           trades=3  netPnl=-15.32
stop_loss_first:     trades=3  netPnl=-15.32
take_profit_first:   trades=3  netPnl=-15.32

⚠️ Identical results across all 3 policies. Initial concern: is the policy a no-op? Source review (VirtualExecutionEngine.ts:258 → resolveIntraCandleConflictReason) confirms the policy IS wired but only invoked when both SL and TP trigger inside the same candle. None of the 3 trades in this test had such an intra-candle conflict, so the policy never fired. Verdict: correct behavior, but the test wasn't strong enough to exercise it. A targeted test with constructed wide-range candles is needed to validate policy semantics; one didn't exist before this audit.

Test 3 — Constant price (zero-volatility edge case)

5,000 flat candles at $50,000:

trades=0 netPnl=0.00

✅ Sane — no signals trigger on a flat tape.

Test 4 — Small window (~40 days, COVID-crash size)

trades=1 netPnl=10.71 drawdown=0.03%

✅ Engine handles short windows without errors.

3. Concerns ranked

3.1 Zero unit tests on the backend engine

The engine is 1,984 lines of pure logic with non-trivial fill semantics, and has zero unit tests. The frontend utils.test.ts (68 lines) tests buildInsightCards / parseSymbolsInput — UI helpers, not engine math.

Risk: any change to VirtualExecutionEngine.ts or VirtualLedger.ts could silently change every customer's backtest results. There's no regression net.

Minimum viable test set before customer rollout:

Determinism: same seed + data → identical output (this session demonstrated it works, but it's not codified)
Intra-candle policies: construct a candle with both SL and TP inside, assert each policy returns the expected exit
Slippage + fees: assert netPnl matches hand-computed expectation for a single trade with known slippage/fee bps
Drawdown: assert maxDrawdownPct matches a manually-computed equity curve
OPEN_AT_END vs FORCE_CLOSE: assert window-end policy difference

Estimate: ~6-8 hours to write the minimum set.

3.2 No equity backtest data source

The backtest data dispatch in loadHistoricalData.ts supports csv | json | replay | kraken. Live trading uses AlpacaConnector (backend/src/connectors/alpaca.ts) which handles ASSET_CLASS: 'us_equity'. No alpacaLoader.ts for backtest.

If a saved plan is for AAPL, the only way to backtest it today is to upload a CSV. That's fine for power users but unworkable for "click to test against COVID."

Estimated work for BacktestAlpacaSource: ~3-4 hours. Mirror krakenLoader.ts, plumb AlpacaConnector.fetchOHLCV through loadHistoricalData.ts, add 'alpaca' to BacktestDataSourceType. Trickiest part: Alpaca's data tier limits (free tier has 15-minute delayed data and limited history; paid tier needed for full COVID-era access).

3.3 Log noise

Default-level (info) logging produces a line per candle per rule evaluation. A 5,000-candle backtest emitted ~25,000 lines in our test. At year-scale this is 80,000+ lines per backtest. Concerns:

Container log volume at scale (Loki ingestion cost)
Slow stdout flushing can extend backtest duration
Customer-visible logs would leak strategy internals

Fix: wrap all [ProEngine] info logs with a BACKTEST_QUIET env flag, or default to warn level inside the backtest entrypoint.

3.4 Historical depth

Verified data source coverage:

Source	COVID Mar 2020	Russia/Ukraine 2022	2018 selloff	2008 GFC
Kraken (BTC, ETH)	✅ Reachable	✅	✅	❌
Kraken (newer alts)	⚠️ Pair-dependent	✅ likely	❌ likely	❌
Alpaca (equities)	⚠️ Free tier may not	✅	✅	⚠️ Paid tier
User CSV/JSON	✅ if user has data	✅	✅	✅

For preset historical-event buttons to work reliably on customer-launched backtests, presets older than ~3 years should fall back to "upload your own data" or be hidden.

3.5 Production gate is intentional

web/src/backtest/flags.ts line 32:

// Default to disabled — backtest is not yet production-ready.
return false;

The team has explicitly opted out of production exposure. Flipping VITE_BACKTEST_ENABLED=true is a deliberate decision, not an oversight. Whoever takes this on should:

Find the original PR/discussion where backtest was gated
Confirm whether the listed concerns above are the same ones that drove gating, or if there are others
Get sign-off before un-gating

4. Recommended path forward

Stage	Scope	Risk	Value
A	Admin-only POC: button on `/plans` cards visible when `isAdminView`, opens `BacktestRunnerPanel` pre-loaded with the plan's `strategy_config` + 5 preset crypto-friendly date ranges. No flag changes; uses existing admin gate.	Low	Validates the UX hypothesis. Lets the team dogfood.
B	Add the minimum-viable backtest test suite (§3.1).	Low	Regression net. Ship-blocker for any customer feature.
C	Add `BacktestAlpacaSource` (§3.2) IF saved plans include equities.	Medium	Without this, customer feature only works for crypto plans.
D	Quiet-mode logging wrapper (§3.3).	Low	Operational hygiene.
E	Replay against the user's own past trades (already exists as `replay` source) — surface as "see how this plan would have played your last 90 days" instead of "see how it would have played COVID." Lower data risk, higher per-user truthfulness.	Low	Could ship without B/C above; uses existing canonical lifecycle data.
F	Customer rollout: validate B + C + D, flip the runtime flag, optionally add tier gating.	High	The big one. Don't do this without B+C+D.

My recommendation: A first (this session can do it), then E in parallel (it's possibly the better v1 product anyway — "show me how my plan would have played my actual past trades" is more truthful than synthetic historical scenarios). B + C + D are prereqs for F.

5. Reproducible smoke tests

The Node ESM scripts used for this audit live at /tmp/backtest_smoke.mjs and /tmp/backtest_smoke2.mjs (this session — not committed). To re-run:

cd /opt/bytelyst/learning_ai_invt_trdg
pnpm --filter @bytelyst/trading-backend build
node /tmp/backtest_smoke2.mjs 2>/dev/null | grep -E "trades=|DETERMINISTIC|netPnl"

If we proceed, these should be promoted to backend/src/backtest/__tests__/ with vitest.

11 KiB Raw Permalink Blame History Unescape Escape