Trading Monorepo Operations

Purpose

This document is the operator and engineer runbook for learning_ai_invt_trdg.

It covers:

  • local development setup
  • verification and CI expectations
  • staged cutover from legacy repos
  • rollback rules
  • release go/no-go checks
  • post-cutover monitoring

Local Development

Prerequisites

  • Node.js >=20
  • pnpm >=10
  • local checkout of:
    • learning_ai_invt_trdg
    • learning_ai_common_plat
  • access to:
    • platform-service
    • Azure Cosmos DB
    • optional legacy Supabase project during migration

Workspace bootstrap

pnpm install
cp .env.example .env
pnpm verify

Core commands

pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release

Surface-specific commands

pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev

Environment Model

Platform-service

  • PLATFORM_API_URL
  • PLATFORM_AUTH_ENABLED
  • PLATFORM_JWT_ISSUER
  • PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL
  • JWT_SECRET (only for HS256 compatibility environments)

Cosmos

  • COSMOS_ENDPOINT
  • COSMOS_KEY
  • COSMOS_DATABASE

Transitional legacy migration support

  • SUPABASE_URL
  • SUPABASE_KEY
  • SUPABASE_JWT_ISSUER
  • SUPABASE_JWT_AUDIENCE

Rule:

  • platform-service and Cosmos are the target system
  • Supabase remains transitional only where trading persistence has not yet been migrated
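
The environment model above can be checked at startup with a small validator. The sketch below is illustrative only: the function name, the report shape, and the decision about which variables count as required are assumptions, not the backend's actual loader. It treats PLATFORM_AUTH_ENABLED as optional and accepts JWT_SECRET as a fallback for HS256 compatibility environments.

```typescript
// Hypothetical helper: none of these type or function names come from the repo.
type EnvMap = Record<string, string | undefined>;

interface EnvReport {
  missing: string[];  // required variables that are unset or blank
  warnings: string[]; // transitional settings worth an operator's attention
}

const isSet = (env: EnvMap, key: string): boolean =>
  (env[key] ?? "").trim().length > 0;

function checkTradingEnv(env: EnvMap): EnvReport {
  const missing: string[] = [];
  const warnings: string[] = [];

  // Target system: platform-service and Cosmos (assumed to be mandatory).
  for (const key of [
    "PLATFORM_API_URL",
    "PLATFORM_JWT_ISSUER",
    "COSMOS_ENDPOINT",
    "COSMOS_KEY",
    "COSMOS_DATABASE",
  ]) {
    if (!isSet(env, key)) missing.push(key);
  }

  // Token verification needs a public key or a JWKS URL; JWT_SECRET is
  // accepted only as the HS256 compatibility fallback and raises a warning.
  const hasAsymmetric =
    isSet(env, "PLATFORM_JWT_PUBLIC_KEY") || isSet(env, "PLATFORM_JWT_JWKS_URL");
  if (!hasAsymmetric && !isSet(env, "JWT_SECRET")) {
    missing.push("PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL");
  } else if (!hasAsymmetric) {
    warnings.push("JWT_SECRET in use: HS256 compatibility mode");
  }

  // Supabase is transitional; surface it so it does not linger unnoticed.
  if (isSet(env, "SUPABASE_URL") || isSet(env, "SUPABASE_KEY")) {
    warnings.push("Supabase variables set: legacy migration support is active");
  }

  return { missing, warnings };
}
```

In a Node.js process this could be called as checkTradingEnv(process.env), failing fast when missing is non-empty and logging warnings otherwise.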

Verification Standard

Before merge or release, all of the following must pass from repo root:

pnpm verify
pnpm lint

pnpm verify currently gates:

  • backend, web, and mobile typecheck
  • backend and web test suites
  • backend and web builds (for mobile, the typecheck above stands in for a build)

pnpm lint currently gates:

  • backend contract and safety verification scripts
  • web lint
  • mobile lint

Request Tracing

  • the main web and mobile API paths attach an x-request-id header to each request
  • backend HTTP responses echo x-request-id so browser/app logs can be correlated with backend logs
  • during incident review, treat x-request-id as the primary request correlation key across client and backend traces
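
The attach-and-echo behavior described above can be sketched as two small functions. This is a minimal illustration, not the repo's client or middleware code: HeaderMap, withRequestId, echoRequestId, and simpleId are hypothetical names, and header keys are assumed to be lowercased already.

```typescript
// Hypothetical shapes; the real client and server header types will differ.
type HeaderMap = Record<string, string>;

const REQUEST_ID_HEADER = "x-request-id";

// Illustrative ID source. A production client would more typically use
// crypto.randomUUID(), which is available in modern browsers and Node.js.
let requestCounter = 0;
const simpleId = (): string =>
  `req-${Date.now().toString(36)}-${(requestCounter++).toString(36)}`;

// Client side: keep an ID the caller already set, otherwise mint a new one.
function withRequestId(headers: HeaderMap, newId: () => string = simpleId): HeaderMap {
  const existing = headers[REQUEST_ID_HEADER];
  const id = existing && existing.trim() !== "" ? existing : newId();
  return { ...headers, [REQUEST_ID_HEADER]: id };
}

// Backend side: copy the inbound ID onto the response headers so client and
// backend log lines share one correlation key. If the client sent no ID,
// mint one so the backend logs are still searchable.
function echoRequestId(
  inbound: HeaderMap,
  outbound: HeaderMap,
  newId: () => string = simpleId
): HeaderMap {
  const id = inbound[REQUEST_ID_HEADER] ?? newId();
  return { ...outbound, [REQUEST_ID_HEADER]: id };
}
```

During incident review, searching both client and backend logs for the same x-request-id value should then surface the two halves of one request.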

Staged Cutover

Order

  1. Backend internal validation
  2. Web internal adoption
  3. Mobile internal beta
  4. Controlled operator rollout
  5. Broader production cutover

Backend cutover

  • deploy backend with platform JWT support and Cosmos-backed trading controls enabled
  • allow legacy Supabase reads only for controlled migration seeding where a Cosmos-native repository is not complete yet
  • confirm runtime control reads/writes work through backend APIs
  • confirm dynamic_config and trading-control containers are readable and writable
  • confirm unauthorized requests are rejected and tenant-scoped reads are enforced
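
The last check in the list above (unauthorized requests rejected, tenant-scoped reads enforced) can be expressed as a pure evaluation over probe results. The sketch below assumes a hypothetical probe that records the HTTP status of a request sent without a token and the rows returned for one tenant's token; the tenantId field name and all type names are illustrative.

```typescript
// Hypothetical record shape; the real trading records carry more fields.
interface TenantScopedRecord {
  id: string;
  tenantId: string;
}

interface AuthProbeResult {
  unauthenticatedStatus: number; // HTTP status returned when no token is sent
  records: TenantScopedRecord[]; // rows returned for the probing tenant's token
}

// Returns a list of human-readable failures; an empty list means the probe passed.
function evaluateAuthProbe(expectedTenantId: string, probe: AuthProbeResult): string[] {
  const failures: string[] = [];

  // A request without credentials must be rejected outright.
  if (probe.unauthenticatedStatus !== 401 && probe.unauthenticatedStatus !== 403) {
    failures.push(
      `unauthenticated request returned ${probe.unauthenticatedStatus}, expected 401 or 403`
    );
  }

  // Every returned row must belong to the tenant whose token was used.
  for (const rec of probe.records) {
    if (rec.tenantId !== expectedTenantId) {
      failures.push(`record ${rec.id} belongs to tenant ${rec.tenantId}`);
    }
  }

  return failures;
}
```

Any non-empty result here corresponds to the tenant-scoping hard rollback trigger, so it is worth treating as blocking rather than as a warning.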

Web cutover

  • move operators to the monorepo web dashboard
  • validate sign-in, session restore, kill-switch handling, and admin controls
  • validate dynamic config writes through backend APIs
  • keep legacy direct-table workflows disabled where backend API replacements exist

Mobile cutover

  • release to internal beta first
  • validate sign-in, session restore, live state, degraded-state handling, and safe interventions
  • do not enable broader rollout until backend/web contracts have stayed stable through at least one backend deploy cycle

Rollback Rules

Hard rollback triggers

  • auth/session failures prevent sign-in or session refresh
  • incorrect tenant scoping leaks another user's profile, orders, alerts, or history
  • global trade halt or scoped disable controls do not apply correctly
  • dynamic config writes fail or partially apply without clear operator visibility
  • mobile/web clients cannot recover from degraded platform-service or backend states

Rollback actions

  1. stop rollout to additional users immediately
  2. revert the most recent monorepo deployment
  3. restore traffic to the previous stable web/backend/mobile release
  4. keep backend trade-halt authority available during rollback
  5. preserve audit logs and operational events for incident review

Data rollback rule

  • do not rewrite or delete Cosmos control-plane state as part of first-response rollback
  • prefer application rollback first, then explicit state repair if needed

Release Go/No-Go

Release is go only if all of the following are true:

  • pnpm verify passes
  • pnpm lint passes
  • pnpm smoke:release passes
  • platform-service auth is reachable from web and mobile
  • Cosmos control-plane reads and writes succeed
  • kill-switch and maintenance behavior are validated on web and mobile
  • backend tenant isolation checks are green
  • operator-safe mobile interventions are limited to approved actions only
  • known migration-only legacy dependencies are documented

Release is no-go if any of the following are true:

  • Supabase fallback is still required for a critical public flow that has no monitored contingency
  • auth source of truth is ambiguous in production
  • admin/runtime-control actions are not fully audited
  • rollback owner or rollback commands are unclear
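
The two lists above combine into a single decision: every go condition must hold and no no-go condition may hold. A minimal sketch of that rule follows. The ReleaseChecks field names are hypothetical labels for the bullets above; how each value is gathered (CI status, manual sign-off) is out of scope here.

```typescript
// Hypothetical checklist shape mirroring the go and no-go bullets.
interface ReleaseChecks {
  // go conditions: every one must be true
  verifyPassed: boolean;
  lintPassed: boolean;
  smokePassed: boolean;
  platformAuthReachable: boolean;
  cosmosControlPlaneOk: boolean;
  killSwitchValidated: boolean;
  tenantIsolationGreen: boolean;
  mobileInterventionsRestricted: boolean;
  legacyDependenciesDocumented: boolean;
  // no-go conditions: any one being true blocks the release
  unmonitoredSupabaseFallback: boolean;
  authSourceAmbiguous: boolean;
  controlActionsUnaudited: boolean;
  rollbackOwnerUnclear: boolean;
}

interface ReleaseDecision {
  go: boolean;
  blockers: string[];
}

function decideRelease(c: ReleaseChecks): ReleaseDecision {
  const mustBeTrue: Array<[string, boolean]> = [
    ["pnpm verify", c.verifyPassed],
    ["pnpm lint", c.lintPassed],
    ["pnpm smoke:release", c.smokePassed],
    ["platform-service auth reachable", c.platformAuthReachable],
    ["Cosmos control-plane reads and writes", c.cosmosControlPlaneOk],
    ["kill-switch and maintenance behavior", c.killSwitchValidated],
    ["backend tenant isolation", c.tenantIsolationGreen],
    ["mobile interventions limited to approved actions", c.mobileInterventionsRestricted],
    ["legacy dependencies documented", c.legacyDependenciesDocumented],
  ];
  const mustBeFalse: Array<[string, boolean]> = [
    ["unmonitored Supabase fallback on a critical flow", c.unmonitoredSupabaseFallback],
    ["ambiguous auth source of truth", c.authSourceAmbiguous],
    ["unaudited admin/runtime-control actions", c.controlActionsUnaudited],
    ["unclear rollback owner or commands", c.rollbackOwnerUnclear],
  ];

  const blockers: string[] = [];
  for (const [label, ok] of mustBeTrue) {
    if (!ok) blockers.push(`failed: ${label}`);
  }
  for (const [label, bad] of mustBeFalse) {
    if (bad) blockers.push(`no-go: ${label}`);
  }
  return { go: blockers.length === 0, blockers };
}
```

Returning the full blocker list, rather than stopping at the first failure, gives the release owner one complete picture to act on.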

Release Smoke Checklist

pnpm smoke:release currently validates:

  • web sign-in flow behavior
  • web password reset flow behavior
  • web authenticated session bootstrap behavior
  • web websocket auth token gating
  • web product kill-switch accessibility gating
  • mobile auth and product-availability surfaces still compile against the shared platform contracts

A manual mobile release smoke test is still required before broad rollout:

  1. Sign in on a fresh install.
  2. Confirm session restore after app restart.
  3. Confirm product-disabled state blocks the app shell.
  4. Confirm maintenance/availability messaging is visible.
  5. Confirm the app recovers after re-enabling the product.

Post-Cutover Monitoring

Watch immediately after rollout

  • platform auth failures
  • token refresh failures
  • backend 401 and 403 spikes
  • websocket connection failure rate
  • dynamic config update failures
  • trading-control update failures
  • mobile degraded/offline state frequency
  • unexpected operator intervention failures
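
For the "backend 401 and 403 spikes" item above, one simple way to define a spike is to compare the denied-response rate in the current window against a pre-rollout baseline. The sketch below is one possible rule, not an existing alert in this repo: the window shape, the 3x factor, and the minimum count of 20 are all assumed values to tune against real traffic.

```typescript
// Hypothetical aggregate for one time window of backend responses.
interface StatusWindow {
  total: number;  // all responses in the window
  denied: number; // responses with status 401 or 403
}

function deniedRate(w: StatusWindow): number {
  return w.total === 0 ? 0 : w.denied / w.total;
}

// Flags a spike when the current denied rate is at least `factor` times the
// baseline rate. `minDenied` suppresses alerts at very low volume.
function isDeniedSpike(
  baseline: StatusWindow,
  current: StatusWindow,
  factor: number = 3,
  minDenied: number = 20
): boolean {
  if (current.denied < minDenied) return false;
  const base = deniedRate(baseline);
  const now = deniedRate(current);
  // With a zero baseline, any denial volume above the floor counts as a spike.
  return base === 0 ? now > 0 : now >= base * factor;
}
```

A rate-based comparison is used instead of a raw count so that a rollout which also increases traffic does not trigger the alert on volume alone.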

Watch for the first 24 hours

  • tenant isolation anomalies
  • runtime control drift between backend memory and Cosmos control state
  • kill-switch misfires
  • stale session behavior across web and mobile
  • build or chunk-size regressions affecting web load
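
The "runtime control drift" item above amounts to comparing the backend's in-memory control values with what is persisted in Cosmos. A minimal sketch of that comparison follows, assuming both sides can be flattened into key/value maps; ControlState, DriftEntry, and findControlDrift are hypothetical names, and reading either side is left to the caller.

```typescript
// Hypothetical flattened view of runtime controls (flags, limits, modes).
type ControlState = Record<string, boolean | string | number>;

interface DriftEntry {
  key: string;
  inMemory: boolean | string | number | undefined;  // undefined: absent in memory
  persisted: boolean | string | number | undefined; // undefined: absent in Cosmos
}

// Returns every key whose in-memory value differs from the persisted value,
// including keys present on only one side. Output is sorted by key.
function findControlDrift(inMemory: ControlState, persisted: ControlState): DriftEntry[] {
  const keys: string[] = Object.keys(inMemory);
  for (const k of Object.keys(persisted)) {
    if (keys.indexOf(k) === -1) keys.push(k);
  }

  const drift: DriftEntry[] = [];
  for (const key of keys.sort()) {
    if (inMemory[key] !== persisted[key]) {
      drift.push({ key, inMemory: inMemory[key], persisted: persisted[key] });
    }
  }
  return drift;
}
```

Running a check like this periodically during the first 24 hours, and logging any non-empty result, would make drift visible before it shows up as a kill-switch misfire.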

Known Remaining Gaps

  • full trading data-plane migration away from legacy Supabase is not complete
  • web still carries some legacy compatibility layers around auth/profile bootstrap
  • mobile does not yet include push notification infrastructure
  • feature-flag ownership and correlation-ID propagation are not fully standardized yet

These are follow-up items, not hidden defects. They should remain tracked in docs/ROADMAP.md.