learning_ai_invt_trdg/docs/OPERATIONS.md
Saravana Achu Mac 4cfb446f57 feat(backend): WebSocket namespaces, audit persistence, tab flags, telemetry
- Add /trading and /admin named Socket.IO namespaces; root namespace kept for
  backward compat; admin namespace rejects non-admins at connect time
- Wire auditRepository.ts: persist TradeAuditEvent to Cosmos audit-events
  container (best-effort); expose GET /api/admin/audit for admin queries
- Add tradingTelemetry singleton (Node.js Map-based storage adapter); init
  and fatal-error tracking wired in index.ts main()
- Add TAB_MARKETPLACE_ENABLED / TAB_MEMBERSHIP_ENABLED config flags; expose
  tabs.* shape in GET /api/feature-flags response
- Fix SupabaseService URL validation (regex check before createClient)
- Wire check:api-contract and check:audit-repository into npm run test
- Switch @bytelyst/* deps to file:../vendor/* references

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 19:35:00 -04:00


Trading Monorepo Operations

Purpose

This document is the operator and engineer runbook for learning_ai_invt_trdg.

It covers:

  • local development setup
  • verification and CI expectations
  • staged rollout of the new monorepo deployment
  • rollback rules
  • release go/no-go checks
  • post-cutover monitoring

Local Development

Prerequisites

  • Node.js >=20
  • pnpm >=10
  • local checkout of:
    • learning_ai_invt_trdg
    • learning_ai_common_plat
  • access to:
    • platform-service
    • Azure Cosmos DB

Workspace bootstrap

pnpm install
cp .env.example .env
pnpm verify

Core commands

pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release

Docker commands

# Production — build images and start backend + web
pnpm docker:up           # equivalent: docker compose up --build

# Development — hot-reload (tsx for backend, Vite HMR for web)
pnpm docker:dev          # equivalent: docker compose -f docker-compose.yml -f docker-compose.dev.yml up

# Stop all containers
pnpm docker:down

Prerequisites for Docker:

  • .env at repo root filled in (copy from .env.example)
  • GITEA_NPM_TOKEN set in .env for private @bytelyst/* registry
  • VITE_PLATFORM_URL and VITE_TRADING_API_URL set if not using localhost defaults
  • For dev mode: run pnpm install locally first (node_modules mounted as volume)

Surface-specific commands

pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev

Environment Model

Platform-service

  • PLATFORM_API_URL
  • PLATFORM_AUTH_ENABLED
  • PLATFORM_JWT_ISSUER
  • PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL
  • JWT_SECRET only for HS256 compatibility environments

Cosmos

  • COSMOS_ENDPOINT
  • COSMOS_KEY
  • COSMOS_DATABASE
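
As a rough illustration of how the variables above could be checked at startup, the sketch below validates a plain key/value map. It is not the repo's actual config loader: `checkEnv`, `EnvReport`, and the choice of which variables count as mandatory are assumptions made for this example. In particular, it assumes one of PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL is always required and treats a set JWT_SECRET as a warning rather than an error.

```typescript
// Hypothetical startup check for the variables listed above (not the repo's real loader).
type Env = Record<string, string | undefined>;

interface EnvReport {
  missing: string[];  // required variables that are unset or blank
  warnings: string[]; // settings that are allowed but worth a second look
}

// Assumption: these are treated as mandatory in every environment.
const REQUIRED = [
  "PLATFORM_API_URL",
  "PLATFORM_JWT_ISSUER",
  "COSMOS_ENDPOINT",
  "COSMOS_KEY",
  "COSMOS_DATABASE",
] as const;

function isSet(env: Env, name: string): boolean {
  const value = env[name];
  return typeof value === "string" && value.trim() !== "";
}

function checkEnv(env: Env): EnvReport {
  const missing: string[] = REQUIRED.filter((name) => !isSet(env, name));
  const warnings: string[] = [];

  // Either a static public key or a JWKS URL must be available for token verification.
  if (!isSet(env, "PLATFORM_JWT_PUBLIC_KEY") && !isSet(env, "PLATFORM_JWT_JWKS_URL")) {
    missing.push("PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL");
  }

  // JWT_SECRET is documented as HS256-compatibility only, so flag it when present.
  if (isSet(env, "JWT_SECRET")) {
    warnings.push("JWT_SECRET is set; expected only in HS256 compatibility environments");
  }

  return { missing, warnings };
}
```

In a Node.js process this could be called as `checkEnv(process.env)` before the server starts, failing fast when `missing` is non-empty.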

Rule:

  • platform-service and Cosmos are the only supported production systems for this repo
  • legacy repos may still be consulted as code references, but they are not runtime dependencies
  • trading user profiles, dynamic config, trading controls, snapshots, capital ledgers, and strategy presets already use Cosmos-backed authority paths
  • dynamic config runtime refresh and admin updates no longer seed from or mirror to legacy storage in the active backend runtime path

Verification Standard

Before merge or release, all of the following must pass from repo root:

pnpm verify
pnpm lint

pnpm verify currently gates:

  • backend, web, and mobile typecheck
  • backend and web test suites
  • backend and web build plus mobile typecheck

pnpm lint currently gates:

  • backend contract and safety verification scripts
  • web lint
  • mobile lint

Request Tracing

  • the main web and mobile API paths, operator actions, and lifecycle fetches now attach x-request-id
  • backend HTTP responses echo x-request-id so browser/app logs can be correlated with backend logs
  • during incident review, treat x-request-id as the primary request correlation key across client and backend traces
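
The attach-and-echo behaviour described above can be sketched with plain header maps. This is an illustration only, not the repo's middleware: `withRequestId`, `echoRequestId`, and `newRequestId` are hypothetical names, the id format is arbitrary (a real client would more likely use a UUID), and minting an id on the backend when the client sent none is an assumption.

```typescript
// Sketch: headers are modeled as plain objects so the example has no dependencies.
type HeaderMap = Record<string, string>;

const REQUEST_ID_HEADER = "x-request-id";

// Hypothetical id generator; the format is arbitrary for this example.
function newRequestId(): string {
  return `req-${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}

// Client side: attach an id unless the caller already supplied one.
function withRequestId(headers: HeaderMap, generate: () => string = newRequestId): HeaderMap {
  if (headers[REQUEST_ID_HEADER]) {
    return headers;
  }
  return { ...headers, [REQUEST_ID_HEADER]: generate() };
}

// Backend side: echo the incoming id on the response.
// Assumption: when the client sent no id, the backend mints one so logs still correlate.
function echoRequestId(
  requestHeaders: HeaderMap,
  responseHeaders: HeaderMap,
  generate: () => string = newRequestId,
): HeaderMap {
  const id = requestHeaders[REQUEST_ID_HEADER] || generate();
  return { ...responseHeaders, [REQUEST_ID_HEADER]: id };
}
```

During incident review, the same id should then appear in the client log line for the request and in the backend log line for the response.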

Feature Flag Ownership

  • backend GET /api/feature-flags is the authoritative runtime contract for user-facing feature access
  • web feature gates must read explicit feature-flag contracts instead of scraping generic config payloads
  • dynamic config may still store the underlying values, but the product surfaces should consume the typed feature-flag API
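
To make "consume the typed feature-flag API" concrete, here is a minimal sketch of a client-side reader for the tabs.* portion of the GET /api/feature-flags payload. It models only `tabs.marketplace` and `tabs.membership`; treating them as booleans and failing closed on malformed input are assumptions, and `readTabFlags` is a hypothetical helper rather than code from shared/feature-flags.ts.

```typescript
// Only the tabs.* portion is modeled; the real response type has more fields.
interface TabFlags {
  marketplace: boolean;
  membership: boolean;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Reads tabs.* from a feature-flags payload.
// Assumption: anything missing or malformed fails closed, so the tab stays hidden.
function readTabFlags(payload: unknown): TabFlags {
  const rawTabs = isRecord(payload) ? payload["tabs"] : undefined;
  const tabs: Record<string, unknown> = isRecord(rawTabs) ? rawTabs : {};
  return {
    marketplace: tabs["marketplace"] === true,
    membership: tabs["membership"] === true,
  };
}
```

Reading the flags through one explicit function like this keeps web gates from depending on the layout of generic config payloads.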

Staged Cutover

Order

  1. Backend internal validation
  2. Web internal adoption
  3. Mobile internal beta
  4. Controlled operator rollout
  5. Broader production cutover

Backend cutover

  • deploy backend with platform JWT support and Cosmos-backed control-plane and execution persistence enabled
  • confirm runtime control reads/writes work through backend APIs
  • confirm dynamic_config, trading-control, order, trade-history, and manual-entry containers are readable and writable
  • confirm unauthorized requests are rejected and tenant-scoped reads are enforced

Web cutover

See docs/CUTOVER_WEB.md for the full step-by-step checklist.

Summary:

  • move operators to the monorepo web dashboard
  • validate sign-in, session restore, kill-switch handling, and admin controls
  • validate dynamic config writes through backend APIs
  • run a parallel period (13 days) before switching traffic fully
  • keep legacy direct-table workflows disabled where backend API replacements exist

Mobile cutover

  • release to internal beta first
  • validate sign-in, session restore, live state, degraded-state handling, and safe interventions
  • do not enable broader rollout until backend/web contracts stay stable through at least one backend deploy cycle

Rollback Rules

Hard rollback triggers

  • auth/session failures prevent sign-in or session refresh
  • incorrect tenant scoping leaks another user's profile, orders, alerts, or history
  • global trade halt or scoped disable controls do not apply correctly
  • dynamic config writes fail or partially apply without clear operator visibility
  • mobile/web clients cannot recover from degraded platform-service or backend states

Rollback actions

  1. stop rollout to additional users immediately
  2. revert the most recent monorepo deployment
  3. restore traffic to the previous stable web/backend/mobile release
  4. keep backend trade-halt authority available during rollback
  5. preserve audit logs and operational events for incident review

Data rollback rule

  • do not rewrite or delete Cosmos control-plane state as part of first-response rollback
  • prefer application rollback first, then explicit state repair if needed

Release Go/No-Go

Release is go only if all of the following are true:

  • pnpm verify passes
  • pnpm lint passes
  • pnpm smoke:release passes
  • platform-service auth is reachable from web and mobile
  • Cosmos control-plane reads and writes succeed
  • Cosmos execution-data reads and writes succeed
  • kill-switch and maintenance behavior are validated on web and mobile
  • backend tenant isolation checks are green
  • operator-safe mobile interventions are limited to approved actions only
  • no legacy runtime data dependency remains in critical public flows

Release is no-go if any of the following are true:

  • auth source of truth is ambiguous in production
  • admin/runtime-control actions are not fully audited
  • rollback owner or rollback commands are unclear
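
The two lists above combine into a single decision: every go condition must hold and no no-go condition may hold. A minimal sketch of that rule follows; `Condition`, `Decision`, and `decideRelease` are hypothetical names for illustration, and how each condition is actually measured is outside the scope of the example.

```typescript
interface Condition {
  name: string;
  holds: boolean; // whether the condition is currently true
}

interface Decision {
  go: boolean;
  blockers: string[]; // human-readable reasons the release is blocked
}

// Go only if every "must hold" condition is true and every no-go condition is false.
function decideRelease(mustHold: Condition[], noGo: Condition[]): Decision {
  const blockers = [
    ...mustHold.filter((c) => !c.holds).map((c) => `not satisfied: ${c.name}`),
    ...noGo.filter((c) => c.holds).map((c) => `no-go condition present: ${c.name}`),
  ];
  return { go: blockers.length === 0, blockers };
}

// Example: all automated gates pass, but the rollback owner is unclear.
const decision = decideRelease(
  [
    { name: "pnpm verify passes", holds: true },
    { name: "pnpm lint passes", holds: true },
    { name: "pnpm smoke:release passes", holds: true },
  ],
  [{ name: "rollback owner or rollback commands are unclear", holds: true }],
);
```

Here `decision.go` is false with one blocker, which reflects the rule that a single no-go condition overrides passing checks.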

Release Smoke Checklist

pnpm smoke:release currently validates:

  • web sign-in flow behavior
  • web password reset flow behavior
  • web authenticated session bootstrap behavior
  • web websocket auth token gating
  • web product kill-switch accessibility gating
  • mobile auth and product-availability surfaces still compile against the shared platform contracts

npm run test in backend/ additionally validates:

  • WebSocket BotState contract and lifecycle consistency (check:websocket-contract)
  • Session rule normalization across all session-string variants (check:session-rule-normalization)
  • API contract: feature-flag shapes, audit event literals, BotState health, realtime helpers, namespace constants (check:api-contract)

Manual mobile release smoke is still required before broad rollout:

  1. Sign in on a fresh install.
  2. Confirm session restore after app restart.
  3. Confirm product-disabled state blocks the app shell.
  4. Confirm maintenance/availability messaging is visible.
  5. Confirm the app recovers after re-enabling the product.

Post-Cutover Monitoring

Watch immediately after rollout

  • platform auth failures
  • token refresh failures
  • backend 401 and 403 spikes
  • websocket connection failure rate
  • dynamic config update failures
  • trading-control update failures
  • mobile degraded/offline state frequency
  • unexpected operator intervention failures

Watch for the first 24 hours

  • tenant isolation anomalies
  • runtime control drift between backend memory and Cosmos control state
  • kill-switch misfires
  • stale session behavior across web and mobile
  • build or chunk-size regressions affecting web load

Known Remaining Gaps

The following are follow-up items, not hidden defects. They are tracked here until resolved.

Resolved since last update (2026-04-07)

  • Exchange/order-level correlation-ID propagation — resolved. x-request-id is now standardised across all main web/mobile API paths, operator actions, lifecycle fetches, and backend HTTP responses. See OPERATIONS.md > Request Tracing.
  • Feature-flag ownership beyond backtest — resolved. GET /api/feature-flags now returns the full TradingFeatureFlagsResponse including tabs.marketplace and tabs.membership. Web and mobile consume these flags. Key constants are shared via shared/feature-flags.ts. See docs/BACKEND_API_DEPRECATION.md.
  • Admin audit event schema — resolved. Schema formalised in docs/BACKEND_AUDIT_SCHEMA.md. TradeAuditEvent interface covers all current audit call sites. Future: persist to Cosmos.
  • Deprecated endpoint documentation — resolved. See docs/BACKEND_API_DEPRECATION.md for full endpoint lifecycle catalogue, WebSocket namespace model, and planned additions.
  • WebSocket single-namespace isolation — resolved. /trading and /admin named namespaces added to backend alongside the backward-compatible root namespace. Web and mobile clients connect to /trading by default. Admin namespace rejects non-admins at connection time.
  • Backend contract tests absent — resolved. verifyApiContract.ts added and wired into npm run test via check:api-contract. Tests cover feature-flag shape, audit event literals, BotState health contract, realtime helpers, and namespace constants.
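
The namespace model noted above (root, /trading, and /admin, with /admin rejecting non-admins at connection time) can be sketched as a pure decision function. This is an illustration, not the backend's code: `ConnectingUser`, `authorizeConnection`, and the rule that unauthenticated connections are rejected on every namespace are assumptions. In Socket.IO, a check like this would typically run inside a namespace middleware registered with `io.of("/admin").use((socket, next) => ...)`, calling `next(new Error(reason))` to refuse the connection.

```typescript
const NAMESPACES = { root: "/", trading: "/trading", admin: "/admin" } as const;

// Hypothetical shape of an already-authenticated user.
interface ConnectingUser {
  userId: string;
  isAdmin: boolean;
}

type ConnectDecision = { allow: true } | { allow: false; reason: string };

function authorizeConnection(namespace: string, user: ConnectingUser | null): ConnectDecision {
  // Assumption: every namespace requires an authenticated user.
  if (user === null) {
    return { allow: false, reason: "unauthenticated" };
  }
  // The admin namespace rejects non-admins at connection time.
  if (namespace === NAMESPACES.admin) {
    return user.isAdmin
      ? { allow: true }
      : { allow: false, reason: "admin namespace requires an admin user" };
  }
  // Root is kept for backward compatibility; /trading is the default for clients.
  if (namespace === NAMESPACES.root || namespace === NAMESPACES.trading) {
    return { allow: true };
  }
  return { allow: false, reason: `unknown namespace: ${namespace}` };
}
```

Keeping the decision separate from the socket library makes it straightforward to cover in a contract test without opening a real connection.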

Open

  • Mobile push notification infrastructure — mobile settings UI has toggle state but no push provider, backend registration endpoint, or token storage. Defer to post-cutover. Planned endpoints: POST /api/push/register, DELETE /api/push/register.
  • Backend telemetry infrastructure — backend has structured logging (Winston) but no OpenTelemetry or @bytelyst/telemetry-client integration. Web and mobile bootstrap telemetry via the common-platform SDK; backend does not yet send telemetry events. Defer until learning_ai_common_plat publishes a Node.js telemetry adapter.
  • Cosmos audit-events container: auditRepository.ts and GET /api/admin/audit are implemented, but the container itself has not been created yet. Create the audit-events container in Cosmos (partition key: /productId, TTL: 7776000 seconds, i.e. 90 days) to activate durable audit persistence. Until the container exists, auditTradeEvent() logs to Winston only (safe fallback).
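
For the audit-events item, the sketch below spells out the container settings and the best-effort fallback. It is a hedged illustration, not the contents of auditRepository.ts: `AuditEvent` is a hypothetical minimal shape and `recordAudit` is a stand-in for the log-then-try-to-persist behaviour. The definition object uses the field names the @azure/cosmos SDK accepts in `database.containers.createIfNotExists(...)` (`id`, `partitionKey`, `defaultTtl`), and is kept as plain data so the example has no dependencies.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;
const AUDIT_RETENTION_DAYS = 90;

// Container settings from the item above; 90 days * 86400 s = 7776000 s.
const auditEventsContainer = {
  id: "audit-events",
  partitionKey: { paths: ["/productId"] },
  defaultTtl: AUDIT_RETENTION_DAYS * SECONDS_PER_DAY,
};

// Hypothetical minimal event shape; the real TradeAuditEvent has more fields.
interface AuditEvent {
  productId: string;
  action: string;
}

// Best-effort persistence: always log, try to persist, never throw to the caller.
async function recordAudit(
  event: AuditEvent,
  persist: (event: AuditEvent) => Promise<void>,
  log: (message: string) => void,
): Promise<"persisted" | "logged-only"> {
  log(`audit ${event.action} product=${event.productId}`);
  try {
    await persist(event);
    return "persisted";
  } catch (error) {
    const detail = error instanceof Error ? error.message : String(error);
    log(`audit persistence skipped: ${detail}`);
    return "logged-only";
  }
}
```

With this shape, a missing container shows up as a "logged-only" result and a log line rather than a failed admin action, which matches the safe-fallback intent described above.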