Investigates whether the backtest engine is ready to power a customer-
facing "test plan against history" feature on /plans (e.g. COVID, war
periods). Combines code review with 4 synthetic-data smoke tests run
against the running engine.
TL;DR:
Engine itself is sound: deterministic, fast (~3,100-3,800 candles/sec,
i.e. 5,000 candles in 1.3-1.6 s), multi-timeframe data pipeline works,
intra-candle policies wired correctly, runs the same strategy code as
live trading.
Ship blockers for customer feature:
- Zero engine unit tests (1,984 LOC backend, 0 *.test.* files)
- No Alpaca data source: only Kraken auto-fetch + CSV/JSON upload
exist, so stock plans can't be tested without a manual CSV
- Log noise (~25k lines per 5k-candle backtest)
- Production gate is intentional ("not yet production-ready" comment
in flags.ts) — not an oversight
Recommended path: admin-only POC first (low risk, no flag changes),
plus the user's-own-trade-replay source as a v1 product alternative
that's truthier than synthetic historical scenarios.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Backtest Engine Readiness Report
Date: 2026-05-10
Trigger: request to add "test against history" feature to the /plans page (e.g. COVID, war periods)
Method: code review + 4 synthetic-data smoke tests against the running engine
Audience: anyone deciding whether to ship customer-facing backtest UI
TL;DR
| Concern | Status |
|---|---|
| Engine determinism | ✅ Verified (same input → byte-identical output) |
| Multi-timeframe data pipeline | ✅ normalize.ts aggregates 1h+4h from 15m automatically |
| Intra-candle SL/TP policy wiring | ✅ Correctly used as a tiebreaker (only fires on real conflicts) |
| Strategy schema compatibility | ✅ Saved trade plans (strategy_config field) feed directly into backtest |
| Performance | ✅ ~1.5s for 5,000 candles × 1 symbol on this VM |
| Equity (stock) data source | ❌ Missing. Live trading uses Alpaca for equities, but no Alpaca backtest loader exists. Only Kraken (crypto) auto-fetch. |
| Test coverage | ❌ Zero engine tests. 1,984 lines of backend backtest code, 0 *.test.* files. Only the FE utils.test.ts exists (68 lines, tests the chart rendering helpers — not the engine). |
| Production gating | ⚠️ Explicit comment in flags.ts: "Default to disabled — backtest is not yet production-ready." Three flags must align (VITE_BACKTEST_ENABLED build, enableBacktest runtime, customerEnabled runtime). |
| Log noise | ⚠️ ~5 log lines per candle at default level. A 5,000-candle backtest produces ~25,000 log lines. Operationally disruptive at scale. |
| Historical depth | ⚠️ Kraken via ccxt typically returns 5-7 years for major pairs. COVID Mar 2020 reachable for BTC/ETH; uncertain for younger altcoins. 2008 GFC unreachable. |
Verdict for shipping a customer-facing "test against history" feature: not yet. The engine itself is sound, but ship-blocking gaps are:
- No equity data source (so stock plans can't be tested without manual CSV upload)
- Zero engine unit tests — high regression risk
- Log noise needs throttling
Verdict for an admin-only POC: safe. The engine is deterministic, fast, and runs the same strategy code as live. Adding a button on /plans cards visible only to admins (using isAdminView from useBacktestFeatureGate) is low-risk.
1. What the engine actually is
backend/src/backtest/
├── engine/
│ ├── BacktestRunner.ts (384 LOC) — top-level orchestration
│ ├── VirtualExecutionEngine.ts (609 LOC) — fill simulation, SL/TP, slippage
│ ├── VirtualLedger.ts ( 77 LOC) — position + cash bookkeeping
│ ├── timeFreeze.ts ( 26 LOC) — Date.now() lockdown for determinism
│ ├── warmup.ts ( 88 LOC) — pre-window candle reservation
│ └── computeSummary.ts ( 60 LOC) — netPnl/winRate/drawdown/sharpe
├── data/
│ ├── csvLoader.ts — user-uploaded CSV
│ ├── jsonLoader.ts — user-uploaded JSON
│ ├── exchangeReplayAdapter.ts — replay user's own past trades
│ ├── krakenLoader.ts — auto-fetch crypto via ccxt
│ ├── normalize.ts — ⭐ aggregates 1h/4h from 15m
│ └── loadHistoricalData.ts — source dispatcher
├── exchange/
│ └── ReplayExchangeConnector.ts — IExchangeConnector impl backed by dataset
├── metrics/
│ └── computeSummary.ts
├── guards.ts — feature-flag + mode assertions
├── strategySafety.ts — config validation
└── types.ts
Design strength: the ReplayExchangeConnector implements the same IExchangeConnector interface as live trading. The strategy engine (ProStrategyEngine) doesn't know it's in a backtest. Same code paths run for live and historical — high fidelity.
Key invariant: data loading aggregates upward from 15m to 1h and 4h via aggregateCandles() in normalize.ts:86. The Kraken loader fetches only 1m or 15m; the engine's required 4h+1h+15m view is built from there. Skipping normalize.ts (e.g. by calling runBacktestReplay with a manually-constructed dataset) silently fails with "Insufficient data" warnings. This is a footgun for any new data adapter that doesn't go through buildDatasetFromRows.
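The upward aggregation described above can be sketched roughly as follows. This is an illustration only, not the actual aggregateCandles() from normalize.ts: the Candle shape, the aggregateUp name, and the choice to drop incomplete buckets are all assumptions made for the example.

```typescript
// Hypothetical candle shape; the real types live in backend/src/backtest/types.ts.
interface Candle {
  timestamp: number; // bucket open time, ms since epoch
  open: number;
  high: number;
  low: number;
  close: number;
  volume: number;
}

const MS_15M = 15 * 60 * 1000;

// Roll time-sorted lower-timeframe candles up into buckets of `targetMs`.
// Assumption: incomplete buckets are dropped so a partial bar never
// reaches the strategy. The real loader may handle this differently.
function aggregateUp(candles: Candle[], sourceMs: number, targetMs: number): Candle[] {
  const perBucket = targetMs / sourceMs;
  const buckets = new Map<number, Candle[]>();
  for (const c of candles) {
    const start = Math.floor(c.timestamp / targetMs) * targetMs;
    const group = buckets.get(start) ?? [];
    group.push(c);
    buckets.set(start, group);
  }
  const out: Candle[] = [];
  const ordered = [...buckets.entries()].sort((a, b) => a[0] - b[0]);
  for (const [start, group] of ordered) {
    if (group.length !== perBucket) continue; // skip partial buckets
    out.push({
      timestamp: start,
      open: group[0].open,
      high: Math.max(...group.map((c) => c.high)),
      low: Math.min(...group.map((c) => c.low)),
      close: group[group.length - 1].close,
      volume: group.reduce((sum, c) => sum + c.volume, 0),
    });
  }
  return out;
}
```

Under these assumptions, a new data adapter would call aggregateUp(rows, MS_15M, 4 * MS_15M) for 1h and aggregateUp(rows, MS_15M, 16 * MS_15M) for 4h before handing the dataset to the engine, which is the step a manually-constructed dataset skips.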
2. Smoke test results (this session)
Test 1 — Determinism
5,000 synthetic 15m candles + multi-timeframe aggregation, default strategy config, two consecutive runs.
Run #1: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1625
Run #2: trades=3 netPnl=-15.32 drawdown=0.31% sharpe=-4.098 ms=1304
DETERMINISTIC: true (JSON.stringify(r1) === JSON.stringify(r2))
✅ Engine is deterministic byte-for-byte. Performance: 5,000 candles in 1.3–1.6 s, i.e. ~0.26–0.33 ms per candle (~3,100–3,800 candles/second) on this VM, which extrapolates to roughly 9–11 seconds for a one-year 15m backtest (35,040 candles).
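The comparison used in this test can be captured in a small helper. The sketch below is hypothetical: checkDeterminism and the RunBacktest type are names invented for the example, and it assumes a synchronous runner, whereas the real runBacktestReplay may well be async.

```typescript
// Hypothetical stand-in for the engine entrypoint; the real runner's
// signature is not shown in this report.
type RunBacktest<I, R> = (input: I) => R;

// Run the same input several times and compare serialized results.
// Wall-clock fields (such as the ms figures above) must be kept out of
// the compared object, otherwise every run looks different.
function checkDeterminism<I, R>(run: RunBacktest<I, R>, input: I, runs = 2): boolean {
  const baseline = JSON.stringify(run(input));
  for (let i = 1; i < runs; i++) {
    if (JSON.stringify(run(input)) !== baseline) return false;
  }
  return true;
}
```

JSON.stringify is sensitive to key order, so this only works if the result object is built the same way on every run, which is itself a useful property to pin down in a test.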
Test 2 — Intra-candle policy
Same 5,000 candles, three policies (ohlc_path, stop_loss_first, take_profit_first):
ohlc_path: trades=3 netPnl=-15.32
stop_loss_first: trades=3 netPnl=-15.32
take_profit_first: trades=3 netPnl=-15.32
⚠️ Identical results across all 3 policies. Initial concern: is the policy a no-op? Source review (VirtualExecutionEngine.ts:258 → resolveIntraCandleConflictReason) confirms the policy IS wired but only invoked when both SL and TP trigger inside the same candle. None of the 3 trades in this test had such an intra-candle conflict, so the policy never fired. Verdict: correct behavior, but the test wasn't strong enough to exercise it. A targeted test with constructed wide-range candles is needed to validate policy semantics; one didn't exist before this audit.
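To make the tiebreaker behavior concrete, here is a minimal sketch of how such a policy might resolve a long position's exit. It is not the code in VirtualExecutionEngine.ts: resolveExit and the Bar type are hypothetical, and the ohlc_path rule shown (bullish candle assumed to travel open, low, high, close; bearish candle open, high, low, close) is a common heuristic assumed here, not a confirmed description of the engine.

```typescript
type IntraCandlePolicy = 'ohlc_path' | 'stop_loss_first' | 'take_profit_first';
type ExitReason = 'stop_loss' | 'take_profit' | null;

interface Bar {
  open: number;
  high: number;
  low: number;
  close: number;
}

// Long positions only. The policy is consulted only when BOTH levels
// fall inside the candle's range; otherwise the outcome is unambiguous.
function resolveExit(
  bar: Bar,
  stopLoss: number,
  takeProfit: number,
  policy: IntraCandlePolicy,
): ExitReason {
  const slHit = bar.low <= stopLoss;
  const tpHit = bar.high >= takeProfit;
  if (!slHit && !tpHit) return null;
  if (slHit && !tpHit) return 'stop_loss';
  if (tpHit && !slHit) return 'take_profit';

  // Real conflict: both SL and TP were touched inside this candle.
  if (policy === 'stop_loss_first') return 'stop_loss';
  if (policy === 'take_profit_first') return 'take_profit';
  // ohlc_path (assumed heuristic): a bullish candle visits its low first,
  // a bearish candle visits its high first.
  return bar.close >= bar.open ? 'stop_loss' : 'take_profit';
}
```

A targeted test would feed a wide-range bar such as { open: 100, high: 110, low: 90, close: 105 } with a stop at 95 and a target at 108, then assert that each policy yields a different, expected exit. A narrow bar should return null under every policy, which matches what the three trades in this smoke test experienced.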
Test 3 — Constant price (zero-volatility edge case)
5,000 flat candles at $50,000:
trades=0 netPnl=0.00
✅ Sane — no signals trigger on a flat tape.
Test 4 — Small window (~40 days, COVID-crash size)
trades=1 netPnl=10.71 drawdown=0.03%
✅ Engine handles short windows without errors.
3. Concerns ranked
3.1 Zero unit tests on the backend engine
The engine is 1,984 lines of pure logic with non-trivial fill semantics, and has zero unit tests. The frontend utils.test.ts (68 lines) tests buildInsightCards / parseSymbolsInput — UI helpers, not engine math.
Risk: any change to VirtualExecutionEngine.ts or VirtualLedger.ts could silently change every customer's backtest results. There's no regression net.
Minimum viable test set before customer rollout:
- Determinism: same seed + data → identical output (this session demonstrated it works, but it's not codified)
- Intra-candle policies: construct a candle with both SL and TP inside, assert each policy returns the expected exit
- Slippage + fees: assert netPnl matches hand-computed expectation for a single trade with known slippage/fee bps
- Drawdown: assert maxDrawdownPct matches a manually-computed equity curve
- OPEN_AT_END vs FORCE_CLOSE: assert window-end policy difference
Estimate: ~6-8 hours to write the minimum set.
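For the slippage/fee and drawdown items, the hand-computed reference values could come from small oracle functions like the ones below. Both are sketches under stated assumptions: the fee model (bps charged on each fill's notional), the slippage model (adverse on both legs of a long trade), and the function names are invented for illustration and may not match how VirtualExecutionEngine.ts or computeSummary.ts actually compute these figures.

```typescript
// Expected net PnL for one long round trip.
// Assumptions: slippage moves each fill against the trader, and fees are
// charged in bps on the notional of both the entry and the exit fill.
function expectedLongNetPnl(
  entryPrice: number,
  exitPrice: number,
  qty: number,
  slippageBps: number,
  feeBps: number,
): number {
  const slip = slippageBps / 10_000;
  const fee = feeBps / 10_000;
  const entryFill = entryPrice * (1 + slip); // buy slightly higher
  const exitFill = exitPrice * (1 - slip); // sell slightly lower
  const fees = (entryFill + exitFill) * qty * fee;
  return (exitFill - entryFill) * qty - fees;
}

// Largest peak-to-trough decline of an equity curve, in percent.
function maxDrawdownPct(equity: number[]): number {
  let peak = -Infinity;
  let maxDd = 0;
  for (const value of equity) {
    peak = Math.max(peak, value);
    if (peak > 0) {
      maxDd = Math.max(maxDd, ((peak - value) / peak) * 100);
    }
  }
  return maxDd;
}
```

For example, the curve [100, 120, 90, 130, 117] has a worst decline from 120 to 90, so maxDrawdownPct returns 25. An engine test would run a backtest constructed to produce that curve and compare the summary's maxDrawdownPct against the oracle.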
3.2 No equity backtest data source
The backtest data dispatch in loadHistoricalData.ts supports csv | json | replay | kraken. Live trading uses AlpacaConnector (backend/src/connectors/alpaca.ts), which handles ASSET_CLASS: 'us_equity', but no alpacaLoader.ts exists for backtest.
If a saved plan is for AAPL, the only way to backtest it today is to upload a CSV. That's fine for power users but unworkable for "click to test against COVID."
Estimated work for BacktestAlpacaSource: ~3-4 hours. Mirror krakenLoader.ts, plumb AlpacaConnector.fetchOHLCV through loadHistoricalData.ts, add 'alpaca' to BacktestDataSourceType. Trickiest part: Alpaca's data tier limits (free tier has 15-minute delayed data and limited history; paid tier needed for full COVID-era access).
3.3 Log noise
Default-level (info) logging produces a line per candle per rule evaluation. A 5,000-candle backtest emitted ~25,000 lines in our test. At the same ~5 lines per candle, a one-year 15m backtest (35,040 candles) would emit roughly 175,000 lines. Concerns:
- Container log volume at scale (Loki ingestion cost)
- Slow stdout flushing can extend backtest duration
- Customer-visible logs would leak strategy internals
Fix: wrap all [ProEngine] info logs with a BACKTEST_QUIET env flag, or default to warn level inside the backtest entrypoint.
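One way the quiet-mode fix could look is a thin level-filtering wrapper created at the backtest entrypoint. Everything here is a sketch: createBacktestLogger, the sink callback, and the convention that BACKTEST_QUIET=1 raises the threshold to warn are assumptions, and the existing logger used by [ProEngine] is not shown in this report.

```typescript
type Level = 'debug' | 'info' | 'warn' | 'error';

const LEVEL_ORDER: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3 };

type Sink = (level: Level, message: string) => void;

// Build a logger whose threshold depends on a (hypothetical) BACKTEST_QUIET
// flag. The env object is injected so the behavior is testable without
// touching the real process environment.
function createBacktestLogger(env: Record<string, string | undefined>, sink: Sink) {
  const minLevel: Level = env.BACKTEST_QUIET === '1' ? 'warn' : 'info';
  const emit = (level: Level, message: string): void => {
    if (LEVEL_ORDER[level] >= LEVEL_ORDER[minLevel]) sink(level, message);
  };
  return {
    debug: (message: string) => emit('debug', message),
    info: (message: string) => emit('info', message),
    warn: (message: string) => emit('warn', message),
    error: (message: string) => emit('error', message),
  };
}
```

In the backend this would be wired as createBacktestLogger(process.env, (level, msg) => console.log(`[${level}] ${msg}`)) or pointed at whatever logger the service already uses. Filtering before the sink also avoids the stdout flushing cost, since suppressed lines are never written.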
3.4 Historical depth
Verified data source coverage:
| Source | COVID Mar 2020 | Russia/Ukraine 2022 | 2018 selloff | 2008 GFC |
|---|---|---|---|---|
| Kraken (BTC, ETH) | ✅ Reachable | ✅ | ✅ | ❌ |
| Kraken (newer alts) | ⚠️ Pair-dependent | ✅ likely | ❌ likely | ❌ |
| Alpaca (equities) | ⚠️ Free tier may not reach | ✅ | ✅ | ⚠️ Paid tier |
| User CSV/JSON | ✅ if user has data | ✅ | ✅ | ✅ |
For preset historical-event buttons to work reliably on customer-launched backtests, presets older than ~3 years should fall back to "upload your own data" or be hidden.
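The fallback rule could be expressed as a small classifier that compares each preset's start date with the earliest candle the auto-fetch source can return for the symbol. The EventPreset shape, the classifyPreset name, and the idea of probing an earliestAvailable date per symbol are all hypothetical; the report does not describe an existing API for this.

```typescript
// Hypothetical preset shape; dates are ISO strings (YYYY-MM-DD, parsed as UTC).
interface EventPreset {
  id: string;
  label: string;
  start: string;
  end: string;
}

type PresetAction = 'auto_fetch' | 'upload_required';

// A preset is auto-fetchable only if its whole window starts on or after
// the earliest candle the data source can return for the symbol.
function classifyPreset(preset: EventPreset, earliestAvailable: Date): PresetAction {
  const startMs = new Date(preset.start).getTime();
  return startMs >= earliestAvailable.getTime() ? 'auto_fetch' : 'upload_required';
}

// Split a preset list into buttons that can run immediately and buttons
// that should prompt for an upload (or be hidden).
function partitionPresets(presets: EventPreset[], earliestAvailable: Date) {
  return {
    runnable: presets.filter((p) => classifyPreset(p, earliestAvailable) === 'auto_fetch'),
    needsUpload: presets.filter((p) => classifyPreset(p, earliestAvailable) === 'upload_required'),
  };
}
```

Driving the decision from a measured earliest date per symbol, rather than a fixed age cutoff, would let BTC/ETH keep older presets while younger altcoins fall back to upload.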
3.5 Production gate is intentional
web/src/backtest/flags.ts line 32:
// Default to disabled — backtest is not yet production-ready.
return false;
The team has explicitly opted out of production exposure. Flipping VITE_BACKTEST_ENABLED=true is a deliberate decision, not an oversight. Whoever takes this on should:
- Find the original PR/discussion where backtest was gated
- Confirm whether the listed concerns above are the same ones that drove gating, or if there are others
- Get sign-off before un-gating
4. Recommended path forward
| Stage | Scope | Risk | Value |
|---|---|---|---|
| A | Admin-only POC: button on /plans cards visible when isAdminView, opens BacktestRunnerPanel pre-loaded with the plan's strategy_config + 5 preset crypto-friendly date ranges. No flag changes; uses existing admin gate. | Low | Validates the UX hypothesis. Lets the team dogfood. |
| B | Add the minimum-viable backtest test suite (§3.1). | Low | Regression net. Ship-blocker for any customer feature. |
| C | Add BacktestAlpacaSource (§3.2) IF saved plans include equities. | Medium | Without this, customer feature only works for crypto plans. |
| D | Quiet-mode logging wrapper (§3.3). | Low | Operational hygiene. |
| E | Replay against the user's own past trades (already exists as replay source), surfaced as "see how this plan would have played your last 90 days" instead of "see how it would have played COVID." Lower data risk, higher per-user truthfulness. | Low | Could ship without B/C above; uses existing canonical lifecycle data. |
| F | Customer rollout: validate B + C + D, flip the runtime flag, optionally add tier gating. | High | The big one. Don't do this without B+C+D. |
My recommendation: A first (this session can do it), then E in parallel (it's possibly the better v1 product anyway — "show me how my plan would have played my actual past trades" is more truthful than synthetic historical scenarios). B + C + D are prereqs for F.
5. Reproducible smoke tests
The Node ESM scripts used for this audit live at /tmp/backtest_smoke.mjs and /tmp/backtest_smoke2.mjs (this session — not committed). To re-run:
cd /opt/bytelyst/learning_ai_invt_trdg
pnpm --filter @bytelyst/trading-backend build
node /tmp/backtest_smoke2.mjs 2>/dev/null | grep -E "trades=|DETERMINISTIC|netPnl"
If we proceed, these should be promoted to backend/src/backtest/__tests__/ with vitest.