# Trading Monorepo Operations
## Purpose
This document is the operator and engineer runbook for `learning_ai_invt_trdg`.
It covers:
- local development setup
- verification and CI expectations
- staged cutover from legacy repos
- rollback rules
- release go/no-go checks
- post-cutover monitoring
## Local Development
### Prerequisites
- Node.js `>=20`
- `pnpm` `>=10`
- local checkout of:
  - `learning_ai_invt_trdg`
  - `learning_ai_common_plat`
- access to:
  - platform-service
  - Azure Cosmos DB
  - optional legacy Supabase project during migration
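
A quick way to confirm the version floors above is a small shell check. This is a sketch, not a script that ships with the repo: the function names `major_at_least` and `check_prereqs` are hypothetical, and only the major version component is compared.

```bash
# Sketch: verify the Node.js and pnpm version floors from the prerequisites list.

# Succeeds if a version string such as "v20.11.1" or "10.4.0" has a major
# component >= the given minimum. Empty or non-numeric input fails.
major_at_least() {
  local version="${1#v}" min="$2"
  local major="${version%%.*}"
  case "$major" in
    '' | *[!0-9]*) return 1 ;;
  esac
  [ "$major" -ge "$min" ]
}

# Hypothetical wrapper: one status line per tool, non-zero if any floor is unmet.
check_prereqs() {
  local rc=0
  if major_at_least "$(node --version 2>/dev/null)" 20; then
    echo "node: ok"
  else
    echo "node: need >=20"; rc=1
  fi
  if major_at_least "$(pnpm --version 2>/dev/null)" 10; then
    echo "pnpm: ok"
  else
    echo "pnpm: need >=10"; rc=1
  fi
  return "$rc"
}
```

Running `check_prereqs` prints one line per tool and returns non-zero if either floor is not met.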
### Workspace bootstrap
```bash
pnpm install
cp .env.example .env
pnpm verify
```
### Core commands
```bash
pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
```
### Surface-specific commands
```bash
pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev
```
## Environment Model
### Platform-service
- `PLATFORM_API_URL`
- `PLATFORM_AUTH_ENABLED`
- `PLATFORM_JWT_ISSUER`
- `PLATFORM_JWT_PUBLIC_KEY` or `PLATFORM_JWT_JWKS_URL`
- `JWT_SECRET` only for HS256 compatibility environments
### Cosmos
- `COSMOS_ENDPOINT`
- `COSMOS_KEY`
- `COSMOS_DATABASE`
### Transitional legacy migration support
- `SUPABASE_URL`
- `SUPABASE_KEY`
- `SUPABASE_JWT_ISSUER`
- `SUPABASE_JWT_AUDIENCE`

Rule:
- platform-service and Cosmos are the target system
- Supabase remains transitional only where trading persistence has not yet been migrated
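
The variable groups above lend themselves to a pre-flight check before starting the backend. The sketch below is hypothetical (this runbook references no such script): it treats the platform-service and Cosmos variables as required, accepts either `PLATFORM_JWT_PUBLIC_KEY` or `PLATFORM_JWT_JWKS_URL`, and only notes the transitional Supabase settings rather than requiring them. It does not model the `JWT_SECRET` HS256 compatibility case.

```bash
# Sketch: report required variables that are unset or empty.
# Variable names come from the Environment Model; check_env itself is hypothetical.
check_env() {
  local missing=() name
  # Platform-service and Cosmos are the target system, so these are always required.
  for name in PLATFORM_API_URL PLATFORM_AUTH_ENABLED PLATFORM_JWT_ISSUER \
              COSMOS_ENDPOINT COSMOS_KEY COSMOS_DATABASE; do
    if [ -z "${!name:-}" ]; then   # indirect expansion: value of the variable named $name
      missing+=("$name")
    fi
  done
  # JWT verification needs either a static public key or a JWKS URL.
  if [ -z "${PLATFORM_JWT_PUBLIC_KEY:-}" ] && [ -z "${PLATFORM_JWT_JWKS_URL:-}" ]; then
    missing+=("PLATFORM_JWT_PUBLIC_KEY|PLATFORM_JWT_JWKS_URL")
  fi
  if [ "${#missing[@]}" -gt 0 ]; then
    printf 'missing: %s\n' "${missing[@]}"
    return 1
  fi
  # Supabase settings are transitional only: note them on stderr, do not fail.
  if [ -n "${SUPABASE_URL:-}" ]; then
    echo "note: legacy Supabase settings present (transitional)" >&2
  fi
  echo "ok"
}
```

Assuming `.env` contains plain `KEY=value` lines, `set -a; . ./.env; set +a; check_env` loads it and runs the check.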
## Verification Standard
Before merge or release, all of the following must pass from the repo root:
```bash
pnpm verify
pnpm lint
```
`pnpm verify` currently gates:
- backend, web, and mobile typecheck
- backend and web test suites
- backend and web builds (mobile is gated by typecheck only, with no test or build step)

`pnpm lint` currently gates:
- backend contract and safety verification scripts
- web lint
- mobile lint
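
To run both gates in one step and see every failure rather than stopping at the first, a wrapper along these lines works. It is a sketch, not an existing repo script: `run_gates` is a hypothetical name, and the `PNPM` override exists only so the wrapper can be dry-run with a stub.

```bash
# Sketch: run both merge/release gates from the repo root and summarize failures.
# PNPM defaults to pnpm; overriding it (e.g. with a stub function) is for dry runs.
run_gates() {
  local failed=() gate
  for gate in verify lint; do
    echo "==> ${PNPM:-pnpm} $gate"
    if ! "${PNPM:-pnpm}" "$gate"; then
      failed+=("$gate")   # keep going so every failing gate is reported
    fi
  done
  if [ "${#failed[@]}" -gt 0 ]; then
    echo "FAILED: ${failed[*]}"
    return 1
  fi
  echo "PASSED: verify lint"
}
```

Continuing past the first failure costs a little time but shows in one run whether both gates need attention.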
## Staged Cutover
### Order
1. Backend internal validation
2. Web internal adoption
3. Mobile internal beta
4. Controlled operator rollout
5. Broader production cutover
### Backend cutover
- deploy backend with platform JWT support and Cosmos-backed trading controls enabled
- keep the legacy Supabase fallback enabled during the first production bake
- confirm runtime control reads/writes work through backend APIs
- confirm `dynamic_config` and trading-control containers are readable and writable
- confirm unauthorized requests are rejected and tenant-scoped reads are enforced
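
The check that unauthorized requests are rejected can be spot-checked from a shell with `curl`. The sketch below is illustrative: this runbook names no backend routes, so any URL passed to `expect_rejected` is a placeholder, and the verdict labels are arbitrary. It deliberately sends no `Authorization` header. It does not cover tenant-scoped reads, which need an authenticated request as one tenant against another tenant's data.

```bash
# Sketch: confirm an unauthenticated request to a protected backend route is rejected.

# Map an HTTP status code to a verdict for the "no credentials" case.
classify_unauth_status() {
  case "$1" in
    401 | 403) echo "rejected" ;;    # expected: auth is enforced
    2??) echo "not-enforced" ;;      # data served without credentials
    000) echo "unreachable" ;;       # curl reports 000 when no HTTP response arrived
    *) echo "unexpected" ;;
  esac
}

# Sends a request with no Authorization header; succeeds only on 401/403.
expect_rejected() {
  local url="$1" code verdict
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
  verdict="$(classify_unauth_status "$code")"
  echo "$url -> $code ($verdict)"
  [ "$verdict" = "rejected" ]
}
```

For example, `expect_rejected "$BACKEND_URL/<protected-route>"` (both parts hypothetical) returns zero only for a `401` or `403`.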
### Web cutover
- move operators to the monorepo web dashboard
- validate sign-in, session restore, kill-switch handling, and admin controls
- validate dynamic config writes through backend APIs
- keep legacy direct-table workflows disabled where backend API replacements exist
### Mobile cutover
- release to internal beta first
- validate sign-in, session restore, live state, degraded-state handling, and safe interventions
- do not enable broader rollout until backend/web contracts have remained stable through at least one backend deploy cycle
## Rollback Rules
### Hard rollback triggers
- auth/session failures prevent sign-in or session refresh
- incorrect tenant scoping leaks another user's profile, orders, alerts, or history
- global trade halt or scoped disable controls do not apply correctly
- dynamic config writes fail or partially apply without clear operator visibility
- mobile/web clients cannot recover from degraded platform-service or backend states
### Rollback actions
1. stop rollout to additional users immediately
2. revert the most recent monorepo deployment
3. restore traffic to the previous stable web/backend/mobile release
4. keep backend trade-halt authority available during rollback
5. preserve audit logs and operational events for incident review
### Data rollback rule
- do not rewrite or delete Cosmos control-plane state as part of first-response rollback
- prefer application rollback first, then explicit state repair if needed
## Release Go/No-Go
Release is `go` only if all of the following are true:
- `pnpm verify` passes
- `pnpm lint` passes
- platform-service auth is reachable from web and mobile
- Cosmos control-plane reads and writes succeed
- kill-switch and maintenance behavior are validated on web and mobile
- backend tenant isolation checks are green
- mobile operator interventions are limited to approved, operator-safe actions
- known migration-only legacy dependencies are documented

Release is `no-go` if any of the following are true:
- Supabase fallback is still required for a critical public flow that has no monitored contingency
- auth source of truth is ambiguous in production
- admin/runtime-control actions are not fully audited
- rollback owner or rollback commands are unclear
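
If the go/no-go review is recorded as a set of check results, the decision rule above (every go check true, every no-go condition false) can be evaluated mechanically. The sketch below is illustrative only: the `go:<name>=true|false` and `nogo:<name>=true|false` argument format and all check names are assumptions, and anything unrecognized or missing counts as a blocker.

```bash
# Sketch: evaluate the go/no-go lists from recorded check results.
# Argument format (an assumption): go:<name>=true|false or nogo:<name>=true|false.
release_decision() {
  local blockers=() arg kind name value
  if [ "$#" -eq 0 ]; then
    echo "no-go"
    echo "  - no checks supplied"
    return 1
  fi
  for arg in "$@"; do
    kind="${arg%%:*}"                      # "go" or "nogo"
    name="${arg#*:}"; name="${name%%=*}"   # check name
    value="${arg##*=}"                     # expected "true" or "false"
    case "$kind:$value" in
      go:true | nogo:false) ;;             # condition satisfied
      go:*) blockers+=("failed check: $name") ;;        # false or any unexpected value
      nogo:*) blockers+=("no-go condition: $name") ;;   # true or any unexpected value
      *) blockers+=("unrecognized: $arg") ;;
    esac
  done
  if [ "${#blockers[@]}" -eq 0 ]; then
    echo "go"
  else
    echo "no-go"
    printf '  - %s\n' "${blockers[@]}"
    return 1
  fi
}
```

For example, `release_decision go:verify=true go:lint=true go:tenant_isolation=true nogo:auth_ambiguous=false` prints `go`; flipping any value prints `no-go` followed by the blocking item.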
## Post-Cutover Monitoring
### Watch immediately after rollout
- platform auth failures
- token refresh failures
- backend `401` and `403` spikes
- websocket connection failure rate
- dynamic config update failures
- trading-control update failures
- mobile degraded/offline state frequency
- unexpected operator intervention failures
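
Several of the signals above (platform auth failures, token refresh failures, backend `401` and `403` spikes) are counters where "spike" needs a concrete rule. A simple one, sketched here as an assumption rather than the team's alerting policy, compares the current window against a baseline window with a multiplier and an absolute floor. The default factor of 3 and floor of 20 are placeholders to tune, and the counts would come from whatever metrics source the deployment uses, which this runbook does not specify.

```bash
# Sketch: flag a spike when the current window count is at least FACTOR times
# the baseline count AND at least FLOOR (so low-volume noise is ignored).
# Usage: is_spike CURRENT BASELINE [FACTOR=3] [FLOOR=20]; returns 0 on spike.
is_spike() {
  local current="$1" baseline="$2" factor="${3:-3}" floor="${4:-20}"
  [ "$current" -ge "$floor" ] && [ "$current" -ge $(( baseline * factor )) ]
}
```

The floor matters right after rollout, when baselines are small and a handful of failures could otherwise look like a large relative jump.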
### Watch for the first 24 hours
- tenant isolation anomalies
- runtime control drift between backend memory and Cosmos control state
- kill-switch misfires
- stale session behavior across web and mobile
- build or chunk-size regressions affecting web load
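
Drift between backend memory and Cosmos control state can be checked by diffing two snapshots. This runbook does not say how either snapshot is exported, so the sketch assumes each has been dumped to a file of `key=value` lines (in any order); the `control_drift` name and the example keys are hypothetical.

```bash
# Sketch: compare a backend-memory snapshot with a Cosmos control-state snapshot.
# Each file holds key=value lines; how they are exported is an assumption.
control_drift() {
  local mem="$1" cosmos="$2" out
  # comm needs sorted input; -3 suppresses lines present in both files.
  out="$(comm -3 <(sort "$mem") <(sort "$cosmos"))"
  if [ -n "$out" ]; then
    echo "drift detected (unindented: memory only, tab-indented: Cosmos only):"
    echo "$out"
    return 1
  fi
  echo "no drift"
}
```

For example, if memory holds `trade_halt=false` while Cosmos holds `trade_halt=true`, both lines are printed and the function returns non-zero.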
## Known Remaining Gaps
- full trading data-plane migration away from legacy Supabase is not complete
- web still carries some legacy compatibility layers around auth/profile bootstrap
- mobile does not yet include push notification infrastructure
- feature-flag ownership and correlation-ID propagation are not fully standardized yet

These are follow-up items, not hidden defects, and should remain tracked in `docs/ROADMAP.md`.