Trading Monorepo Operations

Purpose

This document is the operator and engineer runbook for learning_ai_invt_trdg.

It covers:

  • local development setup
  • verification and CI expectations
  • staged rollout of the new monorepo deployment
  • rollback rules
  • release go/no-go checks
  • post-cutover monitoring

Local Development

Prerequisites

  • Node.js >=20
  • pnpm >=10
  • local checkout of:
    • learning_ai_invt_trdg
    • learning_ai_common_plat
  • access to:
    • platform-service
    • Azure Cosmos DB

Workspace bootstrap

pnpm install
cp .env.example .env
pnpm verify

Core commands

pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release

Surface-specific commands

pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev

Environment Model

Platform-service

  • PLATFORM_API_URL
  • PLATFORM_AUTH_ENABLED
  • PLATFORM_JWT_ISSUER
  • PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL
  • JWT_SECRET (only in environments that still need HS256 compatibility)

Cosmos

  • COSMOS_ENDPOINT
  • COSMOS_KEY
  • COSMOS_DATABASE

Rules:

  • platform-service and Cosmos are the only supported production systems for this repo
  • legacy repos may still be consulted as code references, but they are not runtime dependencies
  • trading user profiles, dynamic config, trading controls, snapshots, capital ledgers, and strategy presets already use Cosmos-backed authority paths
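A startup check along the following lines can catch a missing variable before the backend begins serving traffic. This is a minimal sketch, not the repo's actual loader: the variable names come from the lists above, while `findEnvProblems`, the choice of which variables count as required, and the decision to ignore the HS256 `JWT_SECRET` path are all assumptions made for illustration.

```typescript
// Hypothetical helper; only the variable names are taken from this runbook.
type Env = Record<string, string | undefined>;

// Assumption: these are mandatory wherever platform auth and Cosmos are in use.
const REQUIRED = [
  "PLATFORM_API_URL",
  "PLATFORM_JWT_ISSUER",
  "COSMOS_ENDPOINT",
  "COSMOS_KEY",
  "COSMOS_DATABASE",
] as const;

function findEnvProblems(env: Env): string[] {
  const isSet = (name: string): boolean => (env[name] ?? "").trim() !== "";

  const problems: string[] = REQUIRED
    .filter((name) => !isSet(name))
    .map((name) => `missing ${name}`);

  // Either a static public key or a JWKS URL must be present.
  // Assumption: HS256-only environments (JWT_SECRET) are out of scope here.
  if (!isSet("PLATFORM_JWT_PUBLIC_KEY") && !isSet("PLATFORM_JWT_JWKS_URL")) {
    problems.push("missing PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL");
  }
  return problems;
}
```

Passing the environment in as a plain object (for example `process.env` in Node.js) keeps the check easy to unit test; an empty result means the configuration satisfies this sketch's rules.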

Verification Standard

Before merge or release, all of the following must pass from repo root:

pnpm verify
pnpm lint

pnpm verify currently gates:

  • backend, web, and mobile typecheck
  • backend and web test suites
  • backend and web builds (mobile is covered by the typecheck gate above rather than a build)

pnpm lint currently gates:

  • backend contract and safety verification scripts
  • web lint
  • mobile lint

Request Tracing

  • the main web and mobile API paths, operator actions, and lifecycle fetches attach an x-request-id header to each request
  • backend HTTP responses echo x-request-id so browser/app logs can be correlated with backend logs
  • during incident review, treat x-request-id as the primary request correlation key across client and backend traces
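The client side of this contract can be sketched as a small header helper. The function name `withRequestId` and its injected ID generator are hypothetical and not taken from the repo; only the `x-request-id` header name comes from this runbook.

```typescript
// Hypothetical client helper: keep a caller-supplied x-request-id, or mint one.
type HeaderMap = Record<string, string>;

function withRequestId(headers: HeaderMap, newId: () => string): HeaderMap {
  const out: HeaderMap = {};
  let requestId = "";

  for (const [name, value] of Object.entries(headers)) {
    if (name.toLowerCase() === "x-request-id") {
      // Header names are case-insensitive; normalize to one lowercase key.
      requestId = value.trim();
    } else {
      out[name] = value;
    }
  }

  // Reuse the existing ID so retries and nested calls stay correlated.
  out["x-request-id"] = requestId !== "" ? requestId : newId();
  return out;
}
```

In a browser or a Node.js 20 runtime, `() => crypto.randomUUID()` is one reasonable generator to pass as `newId`. Logging the resulting ID on the client next to the value echoed in the backend response gives the correlation key described above.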

Feature Flag Ownership

  • backend GET /api/feature-flags is the authoritative runtime contract for user-facing feature access
  • web feature gates must read explicit feature-flag contracts instead of scraping generic config payloads
  • dynamic config may still store the underlying values, but the product surfaces should consume the typed feature-flag API
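A web gate that consumes the typed contract, rather than a generic config payload, might look like the sketch below. The response shape `{ flags: { <name>: boolean } }` and the flag name `backtest` are assumptions for illustration; the real payload of GET /api/feature-flags is not specified in this document.

```typescript
// Assumed response shape: { "flags": { "<name>": true | false } }.
type FeatureFlags = Readonly<Record<string, boolean>>;

function parseFeatureFlags(payload: unknown): FeatureFlags {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("feature-flag payload must be an object");
  }
  const flags = (payload as { flags?: unknown }).flags;
  if (typeof flags !== "object" || flags === null || Array.isArray(flags)) {
    throw new Error("feature-flag payload must contain a flags object");
  }

  const out: Record<string, boolean> = {};
  for (const [name, value] of Object.entries(flags as Record<string, unknown>)) {
    // Reject loosely typed values instead of coercing them.
    if (typeof value !== "boolean") {
      throw new Error(`feature flag ${name} is not a boolean`);
    }
    out[name] = value;
  }
  return out;
}

function isEnabled(flags: FeatureFlags, name: string): boolean {
  // Unknown flags fail closed: a missing entry means the feature is off.
  return flags[name] === true;
}
```

Validating at the boundary means a malformed response surfaces as one explicit error instead of silently enabling or hiding a feature.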

Staged Cutover

Order

  1. Backend internal validation
  2. Web internal adoption
  3. Mobile internal beta
  4. Controlled operator rollout
  5. Broader production cutover

Backend cutover

  • deploy backend with platform JWT support and Cosmos-backed control-plane and execution persistence enabled
  • confirm runtime control reads/writes work through backend APIs
  • confirm dynamic_config, trading-control, order, trade-history, and manual-entry containers are readable and writable
  • confirm unauthorized requests are rejected and tenant-scoped reads are enforced

Web cutover

  • move operators to the monorepo web dashboard
  • validate sign-in, session restore, kill-switch handling, and admin controls
  • validate dynamic config writes through backend APIs
  • keep legacy direct-table workflows disabled where backend API replacements exist

Mobile cutover

  • release to internal beta first
  • validate sign-in, session restore, live state, degraded-state handling, and safe interventions
  • do not enable broader rollout until the backend/web contracts have stayed stable through at least one backend deploy cycle

Rollback Rules

Hard rollback triggers

  • auth/session failures prevent sign-in or session refresh
  • incorrect tenant scoping leaks another user's profile, orders, alerts, or history
  • global trade halt or scoped disable controls do not apply correctly
  • dynamic config writes fail or partially apply without clear operator visibility
  • mobile/web clients cannot recover from degraded platform-service or backend states

Rollback actions

  1. stop rollout to additional users immediately
  2. revert the most recent monorepo deployment
  3. restore traffic to the previous stable web/backend/mobile release
  4. keep backend trade-halt authority available during rollback
  5. preserve audit logs and operational events for incident review

Data rollback rule

  • do not rewrite or delete Cosmos control-plane state as part of first-response rollback
  • prefer application rollback first, then explicit state repair if needed

Release Go/No-Go

Release is go only if all of the following are true:

  • pnpm verify passes
  • pnpm lint passes
  • pnpm smoke:release passes
  • platform-service auth is reachable from web and mobile
  • Cosmos control-plane reads and writes succeed
  • Cosmos execution-data reads and writes succeed
  • kill-switch and maintenance behavior are validated on web and mobile
  • backend tenant isolation checks are green
  • mobile operator interventions are limited to the approved, operator-safe actions
  • no legacy runtime data dependency remains in critical public flows

Release is no-go if any of the following are true:

  • auth source of truth is ambiguous in production
  • admin/runtime-control actions are not fully audited
  • rollback owner or rollback commands are unclear
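The decision rule above reduces to "every go check passes and no no-go condition holds". The sketch below encodes that rule so a release script could print the blocking reasons; the type names and `decideRelease` are hypothetical, and collecting the individual check results is left to the caller.

```typescript
// Hypothetical types; the check names would mirror the bullets above.
interface GoCheck {
  name: string;
  passed: boolean;
}

interface NoGoCondition {
  name: string;
  present: boolean;
}

interface ReleaseDecision {
  go: boolean;
  reasons: string[];
}

function decideRelease(checks: GoCheck[], blockers: NoGoCondition[]): ReleaseDecision {
  const reasons: string[] = [
    ...checks.filter((c) => !c.passed).map((c) => `failed: ${c.name}`),
    ...blockers.filter((b) => b.present).map((b) => `no-go: ${b.name}`),
  ];
  // Go only when nothing failed and no blocking condition is present.
  return { go: reasons.length === 0, reasons };
}
```

Returning the reasons alongside the boolean keeps the outcome auditable: the same list can be pasted into the release record when the answer is no-go.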

Release Smoke Checklist

pnpm smoke:release currently validates:

  • web sign-in flow behavior
  • web password reset flow behavior
  • web authenticated session bootstrap behavior
  • web websocket auth token gating
  • web product kill-switch accessibility gating
  • mobile auth and product-availability surfaces still compile against the shared platform contracts

Manual mobile release smoke is still required before broad rollout:

  1. Sign in on a fresh install.
  2. Confirm session restore after app restart.
  3. Confirm product-disabled state blocks the app shell.
  4. Confirm maintenance/availability messaging is visible.
  5. Confirm the app recovers after re-enabling the product.

Post-Cutover Monitoring

Watch immediately after rollout

  • platform auth failures
  • token refresh failures
  • backend 401 and 403 spikes
  • websocket connection failure rate
  • dynamic config update failures
  • trading-control update failures
  • mobile degraded/offline state frequency
  • unexpected operator intervention failures
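For the 401 and 403 signal, one simple way to define a "spike" is to compare the unauthorized-response rate after rollout with a pre-rollout baseline. The sketch below is illustrative only: the multiplier, the minimum sample size, and the 1% rate floor are assumed values and not thresholds defined by this runbook.

```typescript
// Counts for one observation window; unauthorized = 401 + 403 responses.
interface WindowCounts {
  total: number;
  unauthorized: number;
}

function isAuthErrorSpike(
  baseline: WindowCounts,
  current: WindowCounts,
  factor = 3, // assumed multiplier over the baseline rate
  minRequests = 50, // assumed minimum sample before alerting
): boolean {
  // Too little traffic to judge; avoid alerting on a handful of requests.
  if (current.total < minRequests) return false;

  const rate = (w: WindowCounts): number =>
    w.total === 0 ? 0 : w.unauthorized / w.total;

  // Assumed floor of 1% so a near-zero baseline does not trigger on noise.
  const reference = Math.max(rate(baseline), 0.01);
  return rate(current) > reference * factor;
}
```

For example, a baseline of 10 unauthorized responses in 1000 requests and a current window of 20 in 200 would be flagged, because 10% exceeds three times the 1% baseline.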

Watch for the first 24 hours

  • tenant isolation anomalies
  • runtime control drift between backend memory and Cosmos control state
  • kill-switch misfires
  • stale session behavior across web and mobile
  • build or chunk-size regressions affecting web load

Known Remaining Gaps

  • Cosmos-only execution persistence is now in place for the main backend runtime paths, but dormant legacy code and one-off reference scripts still need cleanup
  • web now uses platform-session handling end to end; the remaining auth cleanup is removing dormant compatibility stubs and aligning profile bootstrap contracts fully with backend-owned product APIs
  • root pnpm verify is green again after aligning the web Vitest harness with platform-session storage and current API contracts
  • mobile does not yet include push notification infrastructure
  • broader feature-flag ownership beyond the current shared backtest contract is not fully standardized yet
  • exchange/order-level correlation-ID propagation is not fully standardized yet

These are follow-up items, not hidden defects. They should remain tracked in docs/ROADMAP.md.