# Trading Monorepo Operations

## Purpose

This document is the operator and engineer runbook for `learning_ai_invt_trdg`.

It covers:

- local development setup
- verification and CI expectations
- staged rollout of the new monorepo deployment
- rollback rules
- release go/no-go checks
- post-cutover monitoring

## Local Development

### Prerequisites

- Node.js `>=20`
- `pnpm` `>=10`
- local checkout of:
  - `learning_ai_invt_trdg`
  - `learning_ai_common_plat`
- access to:
  - platform-service
  - Azure Cosmos DB

### Workspace bootstrap

```bash
pnpm run install:common-plat
cp .env.example .env
pnpm verify
```

If you need the registry path instead, use `pnpm run install:gitea`. The active resolver is controlled by `BYTELYST_PACKAGE_SOURCE` in `.pnpmfile.cjs`.

### Core commands

```bash
pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release
```

### Docker commands

```bash
# Production: build images and start backend + web
pnpm docker:up
# equivalent: docker compose up --build

# Development: hot-reload (tsx for backend, Vite HMR for web)
pnpm docker:dev
# equivalent: docker compose -f docker-compose.yml -f docker-compose.dev.yml up

# Stop all containers
pnpm docker:down
```

Prerequisites for Docker:

- `.env` at repo root filled in (copy from `.env.example`)
- `GITEA_NPM_TOKEN` set when `BYTELYST_PACKAGE_SOURCE=gitea`
- `VITE_PLATFORM_URL` and `VITE_TRADING_API_URL` set if not using localhost defaults
- For dev mode: run the matching local install first (for example `pnpm run install:common-plat`)

Docker note: `common-plat` mode targets `/opt/bytelyst/learning_ai_common_plat` on the host, so container builds should stay on `vendor` or `gitea` unless the build context is expanded to include the sibling repo.
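The Docker prerequisites and the package-source note above can be checked mechanically before a container build. The following is a minimal sketch, not code from the repo: `checkDockerEnv` and the `Env` type are hypothetical names introduced for illustration, and the rules encode only what this runbook states (the token requirement for `gitea` and the host-path limitation of `common-plat`).

```typescript
// Hypothetical preflight helper; not part of the repo.
// It encodes only the Docker rules stated in this runbook.
type Env = Record<string, string | undefined>;

function checkDockerEnv(env: Env): string[] {
  const problems: string[] = [];
  const source = env.BYTELYST_PACKAGE_SOURCE;

  // The gitea registry path needs a token.
  if (source === "gitea" && !env.GITEA_NPM_TOKEN) {
    problems.push(
      "GITEA_NPM_TOKEN must be set when BYTELYST_PACKAGE_SOURCE=gitea"
    );
  }

  // common-plat mode points at a sibling checkout on the host,
  // which a default build context does not include.
  if (source === "common-plat") {
    problems.push(
      "common-plat mode targets a host path outside the build context; " +
        "use vendor or gitea for container builds"
    );
  }

  return problems;
}
```

In a Node script, a call such as `checkDockerEnv(process.env)` would return an empty array when neither rule is violated. The sketch deliberately leaves an unset `BYTELYST_PACKAGE_SOURCE` unflagged, because this runbook does not state what the default is.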
### Surface-specific commands

```bash
pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev
```

## Environment Model

### Platform-service

- `PLATFORM_API_URL`
- `PLATFORM_AUTH_ENABLED`
- `PLATFORM_JWT_ISSUER`
- `PLATFORM_JWT_PUBLIC_KEY` or `PLATFORM_JWT_JWKS_URL`
- `JWT_SECRET` only for HS256 compatibility environments

### Cosmos

- `COSMOS_ENDPOINT`
- `COSMOS_KEY`
- `COSMOS_DATABASE`

Rule:

- platform-service and Cosmos are the only supported production systems for this repo
- legacy repos may still be consulted as code references, but they are not runtime dependencies
- trading user profiles, dynamic config, trading controls, snapshots, capital ledgers, and strategy presets already use Cosmos-backed authority paths
- dynamic config runtime refresh and admin updates no longer seed from or mirror to legacy storage in the active backend runtime path

## Verification Standard

Before merge or release, all of the following must pass from repo root:

```bash
pnpm verify
pnpm lint
```

`pnpm verify` currently gates:

- backend, web, and mobile typecheck
- backend and web test suites
- backend and web build plus mobile typecheck

`pnpm lint` currently gates:

- backend contract and safety verification scripts
- web lint
- mobile lint

## Request Tracing

- the main web and mobile API paths, operator actions, and lifecycle fetches now attach `x-request-id`
- backend HTTP responses echo `x-request-id` so browser/app logs can be correlated with backend logs
- during incident review, treat `x-request-id` as the primary request correlation key across client and backend traces

## Feature Flag Ownership

- backend `GET /api/feature-flags` is the authoritative runtime contract for user-facing feature access
- web feature gates must read explicit feature-flag contracts instead of scraping generic config payloads
- dynamic config may still store the underlying values, but the product surfaces should consume the typed feature-flag API

## Staged Cutover

### Order

1. Backend internal validation
2. Web internal adoption
3. Mobile internal beta
4. Controlled operator rollout
5. Broader production cutover

### Backend cutover

- deploy backend with platform JWT support and Cosmos-backed control-plane and execution persistence enabled
- confirm runtime control reads/writes work through backend APIs
- confirm `dynamic_config`, trading-control, order, trade-history, and manual-entry containers are readable and writable
- confirm unauthorized requests are rejected and tenant-scoped reads are enforced

### Web cutover

See `docs/CUTOVER_WEB.md` for the full step-by-step checklist. Summary:

- move operators to the monorepo web dashboard
- validate sign-in, session restore, kill-switch handling, and admin controls
- validate dynamic config writes through backend APIs
- run parallel period (1–3 days) before switching traffic fully
- keep legacy direct-table workflows disabled where backend API replacements exist

### Mobile cutover

- release to internal beta first
- validate sign-in, session restore, live state, degraded-state handling, and safe interventions
- do not enable broader rollout until backend/web contracts stay stable through at least one backend deploy cycle

## Rollback Rules

### Hard rollback triggers

- auth/session failures prevent sign-in or session refresh
- incorrect tenant scoping leaks another user's profile, orders, alerts, or history
- global trade halt or scoped disable controls do not apply correctly
- dynamic config writes fail or partially apply without clear operator visibility
- mobile/web clients cannot recover from degraded platform-service or backend states

### Rollback actions

1. stop rollout to additional users immediately
2. revert the most recent monorepo deployment
3. restore traffic to the previous stable web/backend/mobile release
4. keep backend trade-halt authority available during rollback
5. preserve audit logs and operational events for incident review

### Data rollback rule

- do not rewrite or delete Cosmos control-plane state as part of first-response rollback
- prefer application rollback first, then explicit state repair if needed

## Release Go/No-Go

Release is `go` only if all of the following are true:

- `pnpm verify` passes
- `pnpm lint` passes
- `pnpm smoke:release` passes
- platform-service auth is reachable from web and mobile
- Cosmos control-plane reads and writes succeed
- Cosmos execution-data reads and writes succeed
- kill-switch and maintenance behavior are validated on web and mobile
- backend tenant isolation checks are green
- operator-safe mobile interventions are limited to approved actions only
- no legacy runtime data dependency remains in critical public flows

Release is `no-go` if any of the following are true:

- auth source of truth is ambiguous in production
- admin/runtime-control actions are not fully audited
- rollback owner or rollback commands are unclear

## Release Smoke Checklist

`pnpm smoke:release` currently validates:

- web sign-in flow behavior
- web password reset flow behavior
- web authenticated session bootstrap behavior
- web websocket auth token gating
- web product kill-switch accessibility gating
- mobile auth and product-availability surfaces still compile against the shared platform contracts

`npm run test` in `backend/` additionally validates:

- WebSocket BotState contract and lifecycle consistency (`check:websocket-contract`)
- Session rule normalization across all session-string variants (`check:session-rule-normalization`)
- API contract: feature-flag shapes, audit event literals, BotState health, realtime helpers, namespace constants (`check:api-contract`)

Manual mobile release smoke is still required before broad rollout:

1. Sign in on a fresh install.
2. Confirm session restore after app restart.
3. Confirm product-disabled state blocks the app shell.
4. Confirm maintenance/availability messaging is visible.
5. Confirm the app recovers after re-enabling the product.

## Post-Cutover Monitoring

### Watch immediately after rollout

- platform auth failures
- token refresh failures
- backend `401` and `403` spikes
- websocket connection failure rate
- dynamic config update failures
- trading-control update failures
- mobile degraded/offline state frequency
- unexpected operator intervention failures

### Watch for the first 24 hours

- tenant isolation anomalies
- runtime control drift between backend memory and Cosmos control state
- kill-switch misfires
- stale session behavior across web and mobile
- build or chunk-size regressions affecting web load

## Known Remaining Gaps

The following are follow-up items, not hidden defects. They are tracked here until resolved.

### Resolved since last update (2026-04-07)

- **Exchange/order-level correlation-ID propagation**: resolved. `x-request-id` is now standardised across all main web/mobile API paths, operator actions, lifecycle fetches, and backend HTTP responses. See `OPERATIONS.md > Request Tracing`.
- **Feature-flag ownership beyond backtest**: resolved. `GET /api/feature-flags` now returns the full `TradingFeatureFlagsResponse` including `tabs.marketplace` and `tabs.membership`. Web and mobile consume these flags. Key constants are shared via `shared/feature-flags.ts`. See `docs/BACKEND_API_DEPRECATION.md`.
- **Admin audit event schema**: resolved. Schema formalised in `docs/BACKEND_AUDIT_SCHEMA.md`. `TradeAuditEvent` interface covers all current audit call sites. Future: persist to Cosmos.
- **Deprecated endpoint documentation**: resolved. See `docs/BACKEND_API_DEPRECATION.md` for full endpoint lifecycle catalogue, WebSocket namespace model, and planned additions.
- **WebSocket single-namespace isolation**: resolved. `/trading` and `/admin` named namespaces added to backend alongside the backward-compatible root namespace. Web and mobile clients connect to `/trading` by default. Admin namespace rejects non-admins at connection time.
- **Backend contract tests absent**: resolved. `verifyApiContract.ts` added and wired into `npm run test` via `check:api-contract`. Tests cover feature-flag shape, audit event literals, BotState health contract, realtime helpers, and namespace constants.

### Open

- **Mobile push notification infrastructure**: mobile settings UI has toggle state but no push provider, backend registration endpoint, or token storage. Defer to post-cutover. Planned endpoints: `POST /api/push/register`, `DELETE /api/push/register`.
- **Backend telemetry infrastructure**: backend has structured logging (Winston) but no OpenTelemetry or `@bytelyst/telemetry-client` integration. Web and mobile bootstrap telemetry via the common-platform SDK; backend does not yet send telemetry events. Defer until `learning_ai_common_plat` publishes a Node.js telemetry adapter.
- **Cosmos audit-events container**: `auditRepository.ts` and `GET /api/admin/audit` are implemented. Create the `audit-events` container in Cosmos (partition key: `/productId`, TTL: 7776000 / 90 days) to activate durable audit persistence. Until the container exists, `auditTradeEvent()` logs to Winston only (safe fallback).
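
The `audit-events` container settings named in the last open item can be written down as a small definition object. This is a minimal sketch under stated assumptions: `ContainerSpec` and `auditEventsContainerSpec` are hypothetical names, not repo code, and the object shape follows the container-definition fields of the `@azure/cosmos` SDK (`id`, `partitionKey.paths`, `defaultTtl`), which this runbook does not confirm the backend uses. Only the container name, partition key, and 90-day TTL come from the text above.

```typescript
// Hypothetical definition helper; not part of the repo.
// Cosmos expresses container TTL in seconds.
const SECONDS_PER_DAY = 24 * 60 * 60;
const AUDIT_TTL_DAYS = 90;

interface ContainerSpec {
  id: string;
  partitionKey: { paths: string[] };
  defaultTtl: number;
}

function auditEventsContainerSpec(): ContainerSpec {
  return {
    id: "audit-events",
    partitionKey: { paths: ["/productId"] },
    // 90 days * 86400 seconds = 7776000 seconds
    defaultTtl: AUDIT_TTL_DAYS * SECONDS_PER_DAY,
  };
}
```

If the backend does use `@azure/cosmos`, a spec like this could be passed to `database.containers.createIfNotExists(...)`. Deriving the TTL from a day count keeps the 7776000 figure checkable against the stated 90-day retention.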