Trading Monorepo Operations
Purpose
This document is the operator and engineer runbook for learning_ai_invt_trdg.
It covers:
- local development setup
- verification and CI expectations
- staged rollout of the new monorepo deployment
- rollback rules
- release go/no-go checks
- post-cutover monitoring
Local Development
Prerequisites
- Node.js >=20
- pnpm >=10
- local checkout of:
  - learning_ai_invt_trdg
  - learning_ai_common_plat
- access to:
  - platform-service
  - Azure Cosmos DB
Workspace bootstrap
pnpm run install:common-plat
cp .env.example .env
pnpm verify
If you need the registry path instead, use pnpm run install:gitea. The active
resolver is controlled by BYTELYST_PACKAGE_SOURCE in .pnpmfile.cjs.
Core commands
pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release
Docker commands
# Production — build images and start backend + web
pnpm docker:up # equivalent: docker compose up --build
# Development — hot-reload (tsx for backend, Vite HMR for web)
pnpm docker:dev # equivalent: docker compose -f docker-compose.yml -f docker-compose.dev.yml up
# Stop all containers
pnpm docker:down
Prerequisites for Docker:
- .env at repo root filled in (copy from .env.example)
- GITEA_NPM_TOKEN set when BYTELYST_PACKAGE_SOURCE=gitea
- VITE_PLATFORM_URL and VITE_TRADING_API_URL set if not using localhost defaults
- For dev mode: run the matching local install first (for example pnpm run install:common-plat)
Docker note: common-plat mode targets /opt/bytelyst/learning_ai_common_plat
on the host, so container builds should stay on vendor or gitea unless the
build context is expanded to include the sibling repo.
Surface-specific commands
pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev
Environment Model
Platform-service
- PLATFORM_API_URL
- PLATFORM_AUTH_ENABLED
- PLATFORM_JWT_ISSUER
- PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL
- JWT_SECRET only for HS256 compatibility environments
Cosmos
- COSMOS_ENDPOINT
- COSMOS_KEY
- COSMOS_DATABASE
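A startup check can catch an incomplete .env before the backend touches platform-service or Cosmos. The following is a minimal sketch, not code from the repo: findEnvProblems and isSet are hypothetical helpers, and treating every listed variable as required is an assumption that may not hold for every environment.

```typescript
// Hypothetical helper, not part of the repo. Variable names come from the lists above.
type Env = Record<string, string | undefined>;

// Assumption: all of these are required in every environment.
const REQUIRED_KEYS = [
  "PLATFORM_API_URL",
  "PLATFORM_AUTH_ENABLED",
  "PLATFORM_JWT_ISSUER",
  "COSMOS_ENDPOINT",
  "COSMOS_KEY",
  "COSMOS_DATABASE",
] as const;

function isSet(env: Env, key: string): boolean {
  const value = env[key];
  return value !== undefined && value.trim() !== "";
}

/** Returns human-readable problems; an empty list means the env looks complete. */
function findEnvProblems(env: Env): string[] {
  const problems: string[] = REQUIRED_KEYS
    .filter((key) => !isSet(env, key))
    .map((key) => `${key} is missing`);

  // Token verification needs a public key or a JWKS URL;
  // JWT_SECRET is accepted only as the HS256 compatibility path.
  const hasVerifier =
    isSet(env, "PLATFORM_JWT_PUBLIC_KEY") ||
    isSet(env, "PLATFORM_JWT_JWKS_URL") ||
    isSet(env, "JWT_SECRET");
  if (!hasVerifier) {
    problems.push(
      "set PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL (or JWT_SECRET for HS256 compatibility)",
    );
  }
  return problems;
}
```

In a Node.js entry point this could be called as findEnvProblems(process.env), failing fast when the returned list is non-empty.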
Rules:
- platform-service and Cosmos are the only supported production systems for this repo
- legacy repos may still be consulted as code references, but they are not runtime dependencies
- trading user profiles, dynamic config, trading controls, snapshots, capital ledgers, and strategy presets already use Cosmos-backed authority paths
- dynamic config runtime refresh and admin updates no longer seed from or mirror to legacy storage in the active backend runtime path
Verification Standard
Before merge or release, all of the following must pass from repo root:
pnpm verify
pnpm lint
pnpm verify currently gates:
- backend, web, and mobile typecheck
- backend and web test suites
- backend and web build plus mobile typecheck
pnpm lint currently gates:
- backend contract and safety verification scripts
- web lint
- mobile lint
Request Tracing
- the main web and mobile API paths, operator actions, and lifecycle fetches now attach x-request-id
- backend HTTP responses echo x-request-id so browser/app logs can be correlated with backend logs
- during incident review, treat x-request-id as the primary request correlation key across client and backend traces
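The client side of this contract can be sketched as follows. This is an illustration under assumptions, not the repo's implementation: attachRequestId and tracedFetch are hypothetical names, and the use of crypto.randomUUID() as the ID generator is an assumption (the document does not say how IDs are produced).

```typescript
// Hypothetical helpers; names and ID format are illustrative, not taken from the repo.
const REQUEST_ID_HEADER = "x-request-id";

/** Copies the headers and attaches a request ID unless the caller already set one. */
function attachRequestId(
  init: Record<string, string> = {},
  generateId: () => string = () => crypto.randomUUID(),
): Headers {
  const headers = new Headers(init);
  // Headers lookups are case-insensitive, so "X-Request-Id" counts as already set.
  if (!headers.has(REQUEST_ID_HEADER)) {
    headers.set(REQUEST_ID_HEADER, generateId());
  }
  return headers;
}

/** Sends a GET request and logs the sent and echoed IDs so client and backend logs can be joined. */
async function tracedFetch(url: string, init: Record<string, string> = {}): Promise<Response> {
  const headers = attachRequestId(init);
  const response = await fetch(url, { headers });
  const sent = headers.get(REQUEST_ID_HEADER);
  const echoed = response.headers.get(REQUEST_ID_HEADER) ?? "none";
  console.info(`[${sent}] GET ${url} -> ${response.status} (echoed: ${echoed})`);
  return response;
}
```

Keeping a caller-supplied ID rather than overwriting it lets one operator action that fans out into several requests share a single correlation key.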
Feature Flag Ownership
- backend GET /api/feature-flags is the authoritative runtime contract for user-facing feature access
- web feature gates must read explicit feature-flag contracts instead of scraping generic config payloads
- dynamic config may still store the underlying values, but the product surfaces should consume the typed feature-flag API
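Reading an explicit contract rather than a generic config payload can look like the sketch below. The interface is a hypothetical, partial stand-in: only tabs.marketplace and tabs.membership are named in this document, and their being booleans is an assumption. The real TradingFeatureFlagsResponse may carry more fields.

```typescript
// Hypothetical partial shape. Boolean flag values are an assumption.
interface FeatureFlagsSketch {
  tabs: { marketplace: boolean; membership: boolean };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

/** Narrows an untyped JSON payload to the sketch shape, or returns null when it does not match. */
function parseFeatureFlags(payload: unknown): FeatureFlagsSketch | null {
  if (!isRecord(payload) || !isRecord(payload.tabs)) return null;
  const { marketplace, membership } = payload.tabs;
  if (typeof marketplace !== "boolean" || typeof membership !== "boolean") return null;
  return { tabs: { marketplace, membership } };
}

/** Fails closed: a tab is shown only when the parsed contract explicitly enables it. */
function isTabEnabled(
  flags: FeatureFlagsSketch | null,
  tab: "marketplace" | "membership",
): boolean {
  return flags?.tabs[tab] === true;
}
```

Failing closed means a malformed or missing response hides the gated surface instead of exposing it by accident.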
Staged Cutover
Order
- Backend internal validation
- Web internal adoption
- Mobile internal beta
- Controlled operator rollout
- Broader production cutover
Backend cutover
- deploy backend with platform JWT support and Cosmos-backed control-plane and execution persistence enabled
- confirm runtime control reads/writes work through backend APIs
- confirm dynamic_config, trading-control, order, trade-history, and manual-entry containers are readable and writable
- confirm unauthorized requests are rejected and tenant-scoped reads are enforced
Web cutover
See docs/CUTOVER_WEB.md for the full step-by-step checklist.
Summary:
- move operators to the monorepo web dashboard
- validate sign-in, session restore, kill-switch handling, and admin controls
- validate dynamic config writes through backend APIs
- run a parallel period (1–3 days) before switching traffic fully
- keep legacy direct-table workflows disabled where backend API replacements exist
Mobile cutover
- release to internal beta first
- validate sign-in, session restore, live state, degraded-state handling, and safe interventions
- do not enable broader rollout until backend/web contracts stay stable through at least one backend deploy cycle
Rollback Rules
Hard rollback triggers
- auth/session failures prevent sign-in or session refresh
- incorrect tenant scoping leaks another user's profile, orders, alerts, or history
- global trade halt or scoped disable controls do not apply correctly
- dynamic config writes fail or partially apply without clear operator visibility
- mobile/web clients cannot recover from degraded platform-service or backend states
Rollback actions
- stop rollout to additional users immediately
- revert the most recent monorepo deployment
- restore traffic to the previous stable web/backend/mobile release
- keep backend trade-halt authority available during rollback
- preserve audit logs and operational events for incident review
Data rollback rule
- do not rewrite or delete Cosmos control-plane state as part of first-response rollback
- prefer application rollback first, then explicit state repair if needed
Release Go/No-Go
Release is go only if all of the following are true:
- pnpm verify passes
- pnpm lint passes
- pnpm smoke:release passes
- platform-service auth is reachable from web and mobile
- Cosmos control-plane reads and writes succeed
- Cosmos execution-data reads and writes succeed
- kill-switch and maintenance behavior are validated on web and mobile
- backend tenant isolation checks are green
- operator-safe mobile interventions are limited to approved actions only
- no legacy runtime data dependency remains in critical public flows
Release is no-go if any of the following are true:
- auth source of truth is ambiguous in production
- admin/runtime-control actions are not fully audited
- rollback owner or rollback commands are unclear
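The decision rule above combines two lists: every go check must hold, and no no-go condition may be present. A minimal sketch of that logic, assuming a hypothetical evaluateRelease helper and free-form check names supplied by the release owner:

```typescript
// Hypothetical helper; check names are whatever the release owner records.
type CheckMap = Record<string, boolean>;

interface ReleaseDecision {
  decision: "go" | "no-go";
  blockers: string[];
}

/**
 * goChecks: true means the check passed.
 * noGoConditions: true means the blocking condition is present.
 */
function evaluateRelease(goChecks: CheckMap, noGoConditions: CheckMap): ReleaseDecision {
  const failedGo = Object.entries(goChecks)
    .filter(([, passed]) => !passed)
    .map(([name]) => `go check failed: ${name}`);
  const triggered = Object.entries(noGoConditions)
    .filter(([, present]) => present)
    .map(([name]) => `no-go condition present: ${name}`);
  const blockers = [...failedGo, ...triggered];
  return { decision: blockers.length === 0 ? "go" : "no-go", blockers };
}
```

Returning the blockers alongside the decision keeps the reason for a no-go visible in the release record.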
Release Smoke Checklist
pnpm smoke:release currently validates:
- web sign-in flow behavior
- web password reset flow behavior
- web authenticated session bootstrap behavior
- web websocket auth token gating
- web product kill-switch accessibility gating
- mobile auth and product-availability surfaces still compile against the shared platform contracts
npm run test in backend/ additionally validates:
- WebSocket BotState contract and lifecycle consistency (check:websocket-contract)
- Session rule normalization across all session-string variants (check:session-rule-normalization)
- API contract: feature-flag shapes, audit event literals, BotState health, realtime helpers, namespace constants (check:api-contract)
Manual mobile release smoke is still required before broad rollout:
- Sign in on a fresh install.
- Confirm session restore after app restart.
- Confirm product-disabled state blocks the app shell.
- Confirm maintenance/availability messaging is visible.
- Confirm the app recovers after re-enabling the product.
Post-Cutover Monitoring
Watch immediately after rollout
- platform auth failures
- token refresh failures
- backend 401 and 403 spikes
- websocket connection failure rate
- dynamic config update failures
- trading-control update failures
- mobile degraded/offline state frequency
- unexpected operator intervention failures
Watch for the first 24 hours
- tenant isolation anomalies
- runtime control drift between backend memory and Cosmos control state
- kill-switch misfires
- stale session behavior across web and mobile
- build or chunk-size regressions affecting web load
Known Remaining Gaps
The following are follow-up items, not hidden defects. They are tracked here until resolved.
Resolved since last update (2026-04-07)
- Exchange/order-level correlation-ID propagation: resolved. x-request-id is now standardised across all main web/mobile API paths, operator actions, lifecycle fetches, and backend HTTP responses. See OPERATIONS.md > Request Tracing.
- Feature-flag ownership beyond backtest: resolved. GET /api/feature-flags now returns the full TradingFeatureFlagsResponse including tabs.marketplace and tabs.membership. Web and mobile consume these flags. Key constants are shared via shared/feature-flags.ts. See docs/BACKEND_API_DEPRECATION.md.
- Admin audit event schema: resolved. Schema formalised in docs/BACKEND_AUDIT_SCHEMA.md. TradeAuditEvent interface covers all current audit call sites. Future: persist to Cosmos.
- Deprecated endpoint documentation: resolved. See docs/BACKEND_API_DEPRECATION.md for full endpoint lifecycle catalogue, WebSocket namespace model, and planned additions.
- WebSocket single-namespace isolation: resolved. /trading and /admin named namespaces added to backend alongside the backward-compatible root namespace. Web and mobile clients connect to /trading by default. Admin namespace rejects non-admins at connection time.
- Backend contract tests absent: resolved. verifyApiContract.ts added and wired into npm run test via check:api-contract. Tests cover feature-flag shape, audit event literals, BotState health contract, realtime helpers, and namespace constants.
Open
- Mobile push notification infrastructure: mobile settings UI has toggle state but no push provider, backend registration endpoint, or token storage. Defer to post-cutover. Planned endpoints: POST /api/push/register, DELETE /api/push/register.
- Backend telemetry infrastructure: backend has structured logging (Winston) but no OpenTelemetry or @bytelyst/telemetry-client integration. Web and mobile bootstrap telemetry via the common-platform SDK; backend does not yet send telemetry events. Defer until learning_ai_common_plat publishes a Node.js telemetry adapter.
- Cosmos audit-events container: auditRepository.ts and GET /api/admin/audit are implemented. Create the audit-events container in Cosmos (partition key: /productId, TTL: 7776000 seconds / 90 days) to activate durable audit persistence. Until the container exists, auditTradeEvent() logs to Winston only (safe fallback).
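The container settings for the audit-events item can be written down as a plain definition object, which also makes the TTL arithmetic explicit. This is a sketch under assumptions: the constant names are hypothetical, and the object is shaped like the container request accepted by containers.createIfNotExists in the @azure/cosmos SDK, which should be confirmed against the SDK version the backend actually uses.

```typescript
// Hypothetical constants; only the container name, partition key, and TTL come from this document.
const SECONDS_PER_DAY = 24 * 60 * 60; // 86400
const AUDIT_RETENTION_DAYS = 90;

const auditEventsContainer = {
  id: "audit-events",
  partitionKey: { paths: ["/productId"] },
  // 90 days * 86400 seconds/day = 7776000 seconds
  defaultTtl: AUDIT_RETENTION_DAYS * SECONDS_PER_DAY,
};
```

Deriving defaultTtl from a day count keeps the retention period readable and avoids a hand-typed seconds value drifting from the intended 90 days.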