Trading Monorepo Operations

Purpose

This document is the operator and engineer runbook for learning_ai_invt_trdg.

It covers:

  • local development setup
  • verification and CI expectations
  • staged rollout of the new monorepo deployment
  • rollback rules
  • release go/no-go checks
  • post-cutover monitoring

Local Development

Prerequisites

  • Node.js >=20
  • pnpm >=10
  • local checkout of:
    • learning_ai_invt_trdg
    • learning_ai_common_plat
  • access to:
    • platform-service
    • Azure Cosmos DB

Workspace bootstrap

pnpm install
cp .env.example .env
pnpm verify

Core commands

pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release

Surface-specific commands

pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev

Environment Model

Platform-service

  • PLATFORM_API_URL
  • PLATFORM_AUTH_ENABLED
  • PLATFORM_JWT_ISSUER
  • PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL
  • JWT_SECRET only for HS256 compatibility environments

Cosmos

  • COSMOS_ENDPOINT
  • COSMOS_KEY
  • COSMOS_DATABASE
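
A startup check along these lines can catch an incomplete environment before the backend boots. This is a minimal sketch, not code from the repo: the helper names (findEnvProblems, isSet) are hypothetical, and it assumes that every variable listed above except PLATFORM_AUTH_ENABLED and JWT_SECRET is mandatory.

```typescript
// Hypothetical helper, not part of the repo. Variable names come from the lists above.
type Env = Record<string, string | undefined>;

// Assumption: these are always required; PLATFORM_AUTH_ENABLED and JWT_SECRET are treated as optional.
const REQUIRED_VARS = [
  "PLATFORM_API_URL",
  "PLATFORM_JWT_ISSUER",
  "COSMOS_ENDPOINT",
  "COSMOS_KEY",
  "COSMOS_DATABASE",
] as const;

function isSet(env: Env, varName: string): boolean {
  const value = env[varName];
  return value !== undefined && value.trim() !== "";
}

// Returns human-readable problems; an empty list means the environment looks complete.
function findEnvProblems(env: Env): string[] {
  const problems: string[] = [];
  for (const varName of REQUIRED_VARS) {
    if (!isSet(env, varName)) {
      problems.push(`${varName} is missing`);
    }
  }
  // JWT verification needs either a static public key or a JWKS URL.
  if (!isSet(env, "PLATFORM_JWT_PUBLIC_KEY") && !isSet(env, "PLATFORM_JWT_JWKS_URL")) {
    problems.push("set PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL");
  }
  return problems;
}
```

In a Node.js entry point this could be called as findEnvProblems(process.env), failing fast when the returned list is non-empty.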

Rules:

  • platform-service and Cosmos are the only supported production systems for this repo
  • legacy repos may still be consulted as code references, but they are not runtime dependencies
  • trading user profiles, dynamic config, trading controls, snapshots, capital ledgers, and strategy presets already use Cosmos-backed authority paths
  • dynamic config runtime refresh and admin updates no longer seed from or mirror to legacy storage in the active backend runtime path

Verification Standard

Before merge or release, all of the following must pass from repo root:

pnpm verify
pnpm lint

pnpm verify currently gates:

  • backend, web, and mobile typecheck
  • backend and web test suites
  • backend and web build plus mobile typecheck

pnpm lint currently gates:

  • backend contract and safety verification scripts
  • web lint
  • mobile lint

Request Tracing

  • the main web and mobile API paths, operator actions, and lifecycle fetches attach an x-request-id header to each request
  • backend HTTP responses echo x-request-id so browser/app logs can be correlated with backend logs
  • during incident review, treat x-request-id as the primary request correlation key across client and backend traces
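
The tracing behavior above can be illustrated with a small client-side helper. This is a sketch under assumptions: withRequestId and logLine are hypothetical names, and the ID generator is injected rather than tied to a specific runtime API.

```typescript
const REQUEST_ID_HEADER = "x-request-id";

type HeaderMap = Record<string, string>;

// Returns a copy of `headers` that is guaranteed to carry x-request-id.
// An existing ID (matched case-insensitively) is preserved so retries keep the same correlation key.
function withRequestId(headers: HeaderMap, generateId: () => string): HeaderMap {
  const result: HeaderMap = {};
  let existing: string | undefined;
  for (const key of Object.keys(headers)) {
    const value = headers[key];
    if (value === undefined) {
      continue;
    }
    if (key.toLowerCase() === REQUEST_ID_HEADER) {
      existing = value;
    } else {
      result[key] = value;
    }
  }
  result[REQUEST_ID_HEADER] = existing !== undefined && existing.length > 0 ? existing : generateId();
  return result;
}

// Formats a log line so client and backend logs can be joined on the same key.
function logLine(requestId: string, message: string): string {
  return `[x-request-id=${requestId}] ${message}`;
}
```

Injecting generateId keeps the helper testable; in a browser or recent Node.js runtime, a UUID generator would be a natural choice for it.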

Feature Flag Ownership

  • backend GET /api/feature-flags is the authoritative runtime contract for user-facing feature access
  • web feature gates must read explicit feature-flag contracts instead of scraping generic config payloads
  • dynamic config may still store the underlying values, but the product surfaces should consume the typed feature-flag API
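
One way a product surface might consume the typed contract instead of a generic config payload is sketched below. The response shape and the flag name backtest are assumptions for illustration; the real contract is whatever GET /api/feature-flags returns.

```typescript
// Hypothetical flag list; extend it as the backend contract grows.
const KNOWN_FLAGS = ["backtest"] as const;
type FlagName = (typeof KNOWN_FLAGS)[number];
type FeatureFlags = Record<FlagName, boolean>;

// Parses an untyped response body into the explicit contract.
// Unknown keys are ignored; missing or non-boolean values fail closed (false).
function parseFeatureFlags(payload: unknown): FeatureFlags {
  const source: Record<string, unknown> =
    typeof payload === "object" && payload !== null
      ? (payload as Record<string, unknown>)
      : {};
  const flags = {} as FeatureFlags;
  for (const flagName of KNOWN_FLAGS) {
    flags[flagName] = source[flagName] === true;
  }
  return flags;
}

// Feature gates read from the parsed contract, never from raw config.
function isFeatureEnabled(flags: FeatureFlags, flagName: FlagName): boolean {
  return flags[flagName];
}
```

Failing closed means a malformed or partial response hides a feature rather than exposing it, which is the safer default for user-facing gates.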

Staged Cutover

Order

  1. Backend internal validation
  2. Web internal adoption
  3. Mobile internal beta
  4. Controlled operator rollout
  5. Broader production cutover

Backend cutover

  • deploy backend with platform JWT support and Cosmos-backed control-plane and execution persistence enabled
  • confirm runtime control reads/writes work through backend APIs
  • confirm dynamic_config, trading-control, order, trade-history, and manual-entry containers are readable and writable
  • confirm unauthorized requests are rejected and tenant-scoped reads are enforced
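
The last check can be made concrete with a small decision function. This is a sketch, not the backend's implementation: the types and the authorizeRead name are hypothetical, and it assumes tenant identity is carried in the verified token claims.

```typescript
// Hypothetical shapes for illustration only.
interface TokenClaims {
  tenantId: string;
}

interface StoredRecord {
  id: string;
  tenantId: string;
}

type ReadDecision =
  | { status: 200; record: StoredRecord }
  | { status: 401 }
  | { status: 403 };

// 401 when there are no verified claims, 403 when the record belongs to another tenant.
function authorizeRead(claims: TokenClaims | null, record: StoredRecord): ReadDecision {
  if (claims === null) {
    return { status: 401 };
  }
  if (claims.tenantId !== record.tenantId) {
    return { status: 403 };
  }
  return { status: 200, record };
}
```

Some services return 404 instead of 403 for cross-tenant reads so that record existence is not revealed; which one applies here is a product decision this sketch does not settle.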

Web cutover

  • move operators to the monorepo web dashboard
  • validate sign-in, session restore, kill-switch handling, and admin controls
  • validate dynamic config writes through backend APIs
  • keep legacy direct-table workflows disabled where backend API replacements exist

Mobile cutover

  • release to internal beta first
  • validate sign-in, session restore, live state, degraded-state handling, and safe interventions
  • do not enable broader rollout until backend/web contracts have stayed stable through at least one backend deploy cycle

Rollback Rules

Hard rollback triggers

  • auth/session failures prevent sign-in or session refresh
  • incorrect tenant scoping leaks another user's profile, orders, alerts, or history
  • global trade halt or scoped disable controls do not apply correctly
  • dynamic config writes fail or partially apply without clear operator visibility
  • mobile/web clients cannot recover from degraded platform-service or backend states

Rollback actions

  1. stop rollout to additional users immediately
  2. revert the most recent monorepo deployment
  3. restore traffic to the previous stable web/backend/mobile release
  4. keep backend trade-halt authority available during rollback
  5. preserve audit logs and operational events for incident review

Data rollback rule

  • do not rewrite or delete Cosmos control-plane state as part of first-response rollback
  • prefer application rollback first, then explicit state repair if needed

Release Go/No-Go

Release is go only if all of the following are true:

  • pnpm verify passes
  • pnpm lint passes
  • pnpm smoke:release passes
  • platform-service auth is reachable from web and mobile
  • Cosmos control-plane reads and writes succeed
  • Cosmos execution-data reads and writes succeed
  • kill-switch and maintenance behavior are validated on web and mobile
  • backend tenant isolation checks are green
  • operator-safe mobile interventions are limited to approved actions
  • no legacy runtime data dependency remains in critical public flows

Release is no-go if any of the following are true:

  • auth source of truth is ambiguous in production
  • admin/runtime-control actions are not fully audited
  • rollback owner or rollback commands are unclear
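
The go/no-go rule above reduces to a simple decision: every go check must pass and no no-go condition may be present. A minimal sketch, with hypothetical type and function names:

```typescript
interface GoCheck {
  label: string;
  passed: boolean;
}

interface NoGoCondition {
  label: string;
  present: boolean;
}

interface ReleaseDecision {
  go: boolean;
  blockers: string[];
}

// Go only if all go checks pass and no no-go condition is present.
function decideRelease(goChecks: GoCheck[], noGoConditions: NoGoCondition[]): ReleaseDecision {
  const blockers: string[] = [];
  for (const check of goChecks) {
    if (!check.passed) {
      blockers.push(`failed: ${check.label}`);
    }
  }
  for (const condition of noGoConditions) {
    if (condition.present) {
      blockers.push(`no-go: ${condition.label}`);
    }
  }
  return { go: blockers.length === 0, blockers };
}
```

Returning the list of blockers rather than a bare boolean gives the release owner something concrete to record in the release notes or incident log.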

Release Smoke Checklist

pnpm smoke:release currently validates:

  • web sign-in flow behavior
  • web password reset flow behavior
  • web authenticated session bootstrap behavior
  • web websocket auth token gating
  • web product kill-switch accessibility gating
  • mobile auth and product-availability surfaces still compile against the shared platform contracts

Manual mobile release smoke is still required before broad rollout:

  1. Sign in on a fresh install.
  2. Confirm session restore after app restart.
  3. Confirm product-disabled state blocks the app shell.
  4. Confirm maintenance/availability messaging is visible.
  5. Confirm the app recovers after re-enabling the product.

Post-Cutover Monitoring

Watch immediately after rollout

  • platform auth failures
  • token refresh failures
  • backend 401 and 403 spikes
  • websocket connection failure rate
  • dynamic config update failures
  • trading-control update failures
  • mobile degraded/offline state frequency
  • unexpected operator intervention failures

Watch for the first 24 hours

  • tenant isolation anomalies
  • runtime control drift between backend memory and Cosmos control state
  • kill-switch misfires
  • stale session behavior across web and mobile
  • build or chunk-size regressions affecting web load
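
Runtime control drift can be spotted by comparing the backend's in-memory control values with the state persisted in Cosmos. The following is a sketch under assumptions: the flat key/value shape and the findControlDrift name are hypothetical, and fetching both snapshots is left to the caller.

```typescript
// Hypothetical flat representation of control state.
type ControlState = Record<string, string | number | boolean>;

interface Drift {
  key: string;
  inMemory: unknown;
  persisted: unknown;
}

function hasOwn(state: ControlState, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(state, key);
}

// Reports every key whose value differs, including keys present on only one side.
function findControlDrift(inMemory: ControlState, persisted: ControlState): Drift[] {
  const drift: Drift[] = [];
  for (const key of Object.keys(inMemory)) {
    const persistedValue = hasOwn(persisted, key) ? persisted[key] : undefined;
    if (inMemory[key] !== persistedValue) {
      drift.push({ key, inMemory: inMemory[key], persisted: persistedValue });
    }
  }
  for (const key of Object.keys(persisted)) {
    if (!hasOwn(inMemory, key)) {
      drift.push({ key, inMemory: undefined, persisted: persisted[key] });
    }
  }
  return drift;
}
```

A non-empty result during the first 24 hours would be worth an alert, since it means the backend may be enforcing controls that differ from what operators last saved.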

Known Remaining Gaps

  • Cosmos-only execution persistence is now in place for the main backend runtime paths, but dormant legacy code and one-off reference scripts still need cleanup
  • web now uses platform-session handling end to end; the remaining auth cleanup is removing dormant compatibility stubs and aligning profile bootstrap contracts fully with backend-owned product APIs
  • root pnpm verify is green again after aligning the web Vitest harness with platform-session storage and current API contracts
  • mobile does not yet include push notification infrastructure
  • broader feature-flag ownership beyond the current shared backtest contract is not fully standardized yet
  • exchange/order-level correlation-ID propagation is not fully standardized yet

These are follow-up items, not hidden defects. They should remain tracked in docs/ROADMAP.md.