# Trading Monorepo Operations
## Purpose
This document is the operator and engineer runbook for `learning_ai_invt_trdg`.
It covers:
- local development setup
- verification and CI expectations
- staged rollout of the new monorepo deployment
- rollback rules
- release go/no-go checks
- post-cutover monitoring
## Local Development
### Prerequisites
- Node.js `>=20`
- `pnpm` `>=10`
- local checkout of:
  - `learning_ai_invt_trdg`
  - `learning_ai_common_plat`
- access to:
  - platform-service
  - Azure Cosmos DB
### Workspace bootstrap
```bash
pnpm run install:common-plat
cp .env.example .env
pnpm verify
```
If you need the registry path instead, use `pnpm run install:gitea`. The active
resolver is controlled by `BYTELYST_PACKAGE_SOURCE` in `.pnpmfile.cjs`.
### Core commands
```bash
pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release
```
### Docker commands
```bash
# Production — build images and start backend + web
pnpm docker:up # equivalent: docker compose up --build
# Development — hot-reload (tsx for backend, Vite HMR for web)
pnpm docker:dev # equivalent: docker compose -f docker-compose.yml -f docker-compose.dev.yml up
# Stop all containers
pnpm docker:down
```
Prerequisites for Docker:
- `.env` at repo root filled in (copy from `.env.example`)
- `GITEA_NPM_TOKEN` set when `BYTELYST_PACKAGE_SOURCE=gitea`
- `VITE_PLATFORM_URL` and `VITE_TRADING_API_URL` set if not using localhost defaults
- For dev mode: run the matching local install first (for example `pnpm run install:common-plat`)

Docker note: `common-plat` mode targets `/opt/bytelyst/learning_ai_common_plat`
on the host, so container builds should stay on `vendor` or `gitea` unless the
build context is expanded to include the sibling repo.
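
The Docker constraints above can be checked before a build. The following is a minimal sketch, not the resolver in `.pnpmfile.cjs`: the helper names `parsePackageSource` and `checkDockerPackageSource` are hypothetical, and treating an unset `BYTELYST_PACKAGE_SOURCE` as a problem is an assumption (the real resolver may apply a default).

```typescript
// Hypothetical preflight helper; the real resolver lives in `.pnpmfile.cjs`
// and may behave differently.
type PackageSource = "common-plat" | "gitea" | "vendor";
type EnvMap = Record<string, string | undefined>;

// The three modes named in this runbook.
const KNOWN_SOURCES: PackageSource[] = ["common-plat", "gitea", "vendor"];

// Parse BYTELYST_PACKAGE_SOURCE; returns null for unset or unknown values.
function parsePackageSource(env: EnvMap): PackageSource | null {
  const raw = (env["BYTELYST_PACKAGE_SOURCE"] || "").trim();
  for (const candidate of KNOWN_SOURCES) {
    if (candidate === raw) return candidate;
  }
  return null;
}

// Collect reasons why a container build would likely fail with this env.
function checkDockerPackageSource(env: EnvMap): string[] {
  const problems: string[] = [];
  const source = parsePackageSource(env);
  if (source === null) {
    // Assumption: no default is applied when the variable is missing.
    problems.push("BYTELYST_PACKAGE_SOURCE is unset or not one of: " + KNOWN_SOURCES.join(", "));
  } else if (source === "common-plat") {
    problems.push("common-plat resolves to a host path outside the default build context");
  } else if (source === "gitea" && !(env["GITEA_NPM_TOKEN"] || "").trim()) {
    problems.push("GITEA_NPM_TOKEN must be set when BYTELYST_PACKAGE_SOURCE=gitea");
  }
  return problems;
}
```

In a Node.js script, `checkDockerPackageSource(process.env)` would return an empty array when the environment is suitable for `pnpm docker:up`.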
### Surface-specific commands
```bash
pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev
```
## Environment Model
### Platform-service
- `PLATFORM_API_URL`
- `PLATFORM_AUTH_ENABLED`
- `PLATFORM_JWT_ISSUER`
- `PLATFORM_JWT_PUBLIC_KEY` or `PLATFORM_JWT_JWKS_URL`
- `JWT_SECRET` only for HS256 compatibility environments
### Cosmos
- `COSMOS_ENDPOINT`
- `COSMOS_KEY`
- `COSMOS_DATABASE`

Rules:
- platform-service and Cosmos are the only supported production systems for this repo
- legacy repos may still be consulted as code references, but they are not runtime dependencies
- trading user profiles, dynamic config, trading controls, snapshots, capital ledgers, and strategy presets already use Cosmos-backed authority paths
- dynamic config runtime refresh and admin updates no longer seed from or mirror to legacy storage in the active backend runtime path
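
A startup check can make the environment model above explicit. This is a minimal sketch under stated assumptions: `validateEnvironment` is a hypothetical helper, and treating every listed variable except `PLATFORM_AUTH_ENABLED` as required is an assumption, since the runbook does not say which ones are optional.

```typescript
type EnvVars = Record<string, string | undefined>;

// Assumption: these variables are always required in supported deployments.
const REQUIRED_VARS: string[] = [
  "PLATFORM_API_URL",
  "PLATFORM_JWT_ISSUER",
  "COSMOS_ENDPOINT",
  "COSMOS_KEY",
  "COSMOS_DATABASE",
];

// True when the variable is present and not blank.
function hasValue(env: EnvVars, key: string): boolean {
  const value = env[key];
  return typeof value === "string" && value.trim() !== "";
}

// Returns human-readable problems; an empty array means the env looks complete.
function validateEnvironment(env: EnvVars): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED_VARS) {
    if (!hasValue(env, key)) problems.push(key + " is not set");
  }
  // Either a public key or a JWKS URL verifies platform JWTs;
  // JWT_SECRET is accepted only as the HS256 compatibility path.
  const hasVerificationKey =
    hasValue(env, "PLATFORM_JWT_PUBLIC_KEY") || hasValue(env, "PLATFORM_JWT_JWKS_URL");
  if (!hasVerificationKey && !hasValue(env, "JWT_SECRET")) {
    problems.push(
      "set PLATFORM_JWT_PUBLIC_KEY or PLATFORM_JWT_JWKS_URL (JWT_SECRET only for HS256 compatibility)",
    );
  }
  return problems;
}
```

Returning a list instead of throwing on the first gap lets an operator see every missing variable in one pass.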
## Verification Standard
Before merge or release, all of the following must pass from repo root:
```bash
pnpm verify
pnpm lint
```
`pnpm verify` currently gates:
- backend, web, and mobile typecheck
- backend and web test suites
- backend and web build plus mobile typecheck

`pnpm lint` currently gates:
- backend contract and safety verification scripts
- web lint
- mobile lint
## Request Tracing
- the main web and mobile API paths, operator actions, and lifecycle fetches now attach `x-request-id`
- backend HTTP responses echo `x-request-id` so browser/app logs can be correlated with backend logs
- during incident review, treat `x-request-id` as the primary request correlation key across client and backend traces
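
The correlation flow above can be sketched as two small helpers. This is an illustration only, not the clients' actual implementation: `withRequestId` and `isRequestIdEchoed` are hypothetical names, and the ID generator is passed in so the sketch stays free of runtime-specific APIs.

```typescript
const REQUEST_ID_HEADER = "x-request-id";

type HeaderMap = Record<string, string>;

// Copy headers with lower-cased names (HTTP header names are case-insensitive).
function normalizeHeaders(headers: HeaderMap): HeaderMap {
  const result: HeaderMap = {};
  for (const key of Object.keys(headers)) {
    const value = headers[key];
    if (value !== undefined) result[key.toLowerCase()] = value;
  }
  return result;
}

// Attach x-request-id unless the caller already supplied one.
function withRequestId(headers: HeaderMap, generateId: () => string): HeaderMap {
  const result = normalizeHeaders(headers);
  if (!result[REQUEST_ID_HEADER]) result[REQUEST_ID_HEADER] = generateId();
  return result;
}

// True when the response echoes the same x-request-id that the request sent.
function isRequestIdEchoed(requestHeaders: HeaderMap, responseHeaders: HeaderMap): boolean {
  const sent = normalizeHeaders(requestHeaders)[REQUEST_ID_HEADER];
  const echoed = normalizeHeaders(responseHeaders)[REQUEST_ID_HEADER];
  return sent !== undefined && sent === echoed;
}
```

On Node.js 20 or in a browser, `() => crypto.randomUUID()` is one reasonable generator to pass as `generateId`. Logging the resulting ID on both sides gives incident reviewers a single key to search for.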
## Feature Flag Ownership
- backend `GET /api/feature-flags` is the authoritative runtime contract for user-facing feature access
- web feature gates must read explicit feature-flag contracts instead of scraping generic config payloads
- dynamic config may still store the underlying values, but the product surfaces should consume the typed feature-flag API
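
Reading an explicit contract, rather than scraping a generic config payload, could look like the sketch below. It is a hypothetical illustration: only `tabs.marketplace` and `tabs.membership` are named in this runbook, so the `FeatureFlagsSketch` type covers just those two fields and is not the full `TradingFeatureFlagsResponse`.

```typescript
// Partial, assumed shape: the real response type likely has more fields.
interface FeatureFlagTabsSketch {
  marketplace: boolean;
  membership: boolean;
}

interface FeatureFlagsSketch {
  tabs: FeatureFlagTabsSketch;
}

function isPlainRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validate an untyped payload and return only the fields this sketch knows about.
function parseFeatureFlags(payload: unknown): FeatureFlagsSketch {
  if (!isPlainRecord(payload)) {
    throw new Error("feature-flag payload must be an object");
  }
  const tabs = payload["tabs"];
  if (!isPlainRecord(tabs)) {
    throw new Error("feature-flag payload is missing `tabs`");
  }
  const marketplace = tabs["marketplace"];
  const membership = tabs["membership"];
  if (typeof marketplace !== "boolean" || typeof membership !== "boolean") {
    throw new Error("`tabs.marketplace` and `tabs.membership` must be booleans");
  }
  return { tabs: { marketplace, membership } };
}
```

A web gate would call `parseFeatureFlags` on the JSON body of `GET /api/feature-flags` and fail closed (hide the tab) if parsing throws.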
## Staged Cutover
### Order
1. Backend internal validation
2. Web internal adoption
3. Mobile internal beta
4. Controlled operator rollout
5. Broader production cutover
### Backend cutover
- deploy backend with platform JWT support and Cosmos-backed control-plane and execution persistence enabled
- confirm runtime control reads/writes work through backend APIs
- confirm `dynamic_config`, trading-control, order, trade-history, and manual-entry containers are readable and writable
- confirm unauthorized requests are rejected and tenant-scoped reads are enforced
### Web cutover
See `docs/CUTOVER_WEB.md` for the full step-by-step checklist.
Summary:
- move operators to the monorepo web dashboard
- validate sign-in, session restore, kill-switch handling, and admin controls
- validate dynamic config writes through backend APIs
- run parallel period (13 days) before switching traffic fully
- keep legacy direct-table workflows disabled where backend API replacements exist
### Mobile cutover
- release to internal beta first
- validate sign-in, session restore, live state, degraded-state handling, and safe interventions
- do not enable broader rollout until backend/web contracts stay stable through at least one backend deploy cycle
## Rollback Rules
### Hard rollback triggers
- auth/session failures prevent sign-in or session refresh
- incorrect tenant scoping leaks another user's profile, orders, alerts, or history
- global trade halt or scoped disable controls do not apply correctly
- dynamic config writes fail or partially apply without clear operator visibility
- mobile/web clients cannot recover from degraded platform-service or backend states
### Rollback actions
1. stop rollout to additional users immediately
2. revert the most recent monorepo deployment
3. restore traffic to the previous stable web/backend/mobile release
4. keep backend trade-halt authority available during rollback
5. preserve audit logs and operational events for incident review
### Data rollback rule
- do not rewrite or delete Cosmos control-plane state as part of first-response rollback
- prefer application rollback first, then explicit state repair if needed
## Release Go/No-Go
Release is `go` only if all of the following are true:
- `pnpm verify` passes
- `pnpm lint` passes
- `pnpm smoke:release` passes
- platform-service auth is reachable from web and mobile
- Cosmos control-plane reads and writes succeed
- Cosmos execution-data reads and writes succeed
- kill-switch and maintenance behavior are validated on web and mobile
- backend tenant isolation checks are green
- operator-safe mobile interventions are limited to approved actions only
- no legacy runtime data dependency remains in critical public flows

Release is `no-go` if any of the following are true:
- auth source of truth is ambiguous in production
- admin/runtime-control actions are not fully audited
- rollback owner or rollback commands are unclear
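
The two lists above combine into a single decision: every `go` check must pass and no `no-go` condition may be present. A minimal sketch of that rule follows; `decideRelease` and its types are hypothetical and are not part of the repo's tooling.

```typescript
interface ReleaseCheck {
  label: string;
  passed: boolean;
}

interface ReleaseBlocker {
  label: string;
  present: boolean;
}

interface ReleaseDecision {
  decision: "go" | "no-go";
  reasons: string[];
}

// "go" only when every check passed and no blocker is present.
function decideRelease(checks: ReleaseCheck[], blockers: ReleaseBlocker[]): ReleaseDecision {
  const reasons: string[] = [];
  for (const check of checks) {
    if (!check.passed) reasons.push("failed: " + check.label);
  }
  for (const blocker of blockers) {
    if (blocker.present) reasons.push("blocker: " + blocker.label);
  }
  return { decision: reasons.length === 0 ? "go" : "no-go", reasons };
}
```

Collecting all reasons, instead of stopping at the first failure, gives the release owner the complete list to resolve before the next attempt.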
## Release Smoke Checklist
`pnpm smoke:release` currently validates:
- web sign-in flow behavior
- web password reset flow behavior
- web authenticated session bootstrap behavior
- web websocket auth token gating
- web product kill-switch accessibility gating
- mobile auth and product-availability surfaces still compile against the shared platform contracts

`npm run test` in `backend/` additionally validates:
- WebSocket BotState contract and lifecycle consistency (`check:websocket-contract`)
- Session rule normalization across all session-string variants (`check:session-rule-normalization`)
- API contract: feature-flag shapes, audit event literals, BotState health, realtime helpers, namespace constants (`check:api-contract`)

Manual mobile release smoke is still required before broad rollout:
1. Sign in on a fresh install.
2. Confirm session restore after app restart.
3. Confirm product-disabled state blocks the app shell.
4. Confirm maintenance/availability messaging is visible.
5. Confirm the app recovers after re-enabling the product.
## Post-Cutover Monitoring
### Watch immediately after rollout
- platform auth failures
- token refresh failures
- backend `401` and `403` spikes
- websocket connection failure rate
- dynamic config update failures
- trading-control update failures
- mobile degraded/offline state frequency
- unexpected operator intervention failures
### Watch for the first 24 hours
- tenant isolation anomalies
- runtime control drift between backend memory and Cosmos control state
- kill-switch misfires
- stale session behavior across web and mobile
- build or chunk-size regressions affecting web load
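
Runtime control drift, from the list above, is the case where the backend's in-memory control values differ from what Cosmos holds. One way to detect it is a key-by-key comparison, sketched below. The function name, the flat key/value shape, and the example keys in the usage note are all hypothetical; the real control documents may be nested.

```typescript
type ControlValue = string | number | boolean;
type ControlState = Record<string, ControlValue>;

interface DriftEntry {
  key: string;
  inMemory: ControlValue | undefined;
  persisted: ControlValue | undefined;
}

// Report every key whose value differs, including keys present on one side only.
function findControlDrift(inMemory: ControlState, persisted: ControlState): DriftEntry[] {
  const keys: string[] = Object.keys(inMemory);
  for (const key of Object.keys(persisted)) {
    if (keys.indexOf(key) === -1) keys.push(key);
  }
  keys.sort(); // stable, readable output order

  const drift: DriftEntry[] = [];
  for (const key of keys) {
    const memoryValue = inMemory[key];
    const persistedValue = persisted[key];
    if (memoryValue !== persistedValue) {
      drift.push({ key, inMemory: memoryValue, persisted: persistedValue });
    }
  }
  return drift;
}
```

For example, comparing `{ tradingHalted: false }` in memory against `{ tradingHalted: true }` in Cosmos yields one entry for `tradingHalted`, which is exactly the kind of mismatch worth alerting on during the first 24 hours.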
## Known Remaining Gaps
The following are follow-up items, not hidden defects. They are tracked here until resolved.
### Resolved since last update (2026-04-07)
- **Exchange/order-level correlation-ID propagation** — resolved. `x-request-id` is now
  standardised across all main web/mobile API paths, operator actions, lifecycle fetches,
  and backend HTTP responses. See `OPERATIONS.md > Request Tracing`.
- **Feature-flag ownership beyond backtest** — resolved. `GET /api/feature-flags` now
  returns the full `TradingFeatureFlagsResponse` including `tabs.marketplace` and
  `tabs.membership`. Web and mobile consume these flags. Key constants are shared via
  `shared/feature-flags.ts`. See `docs/BACKEND_API_DEPRECATION.md`.
- **Admin audit event schema** — resolved. Schema formalised in `docs/BACKEND_AUDIT_SCHEMA.md`.
  `TradeAuditEvent` interface covers all current audit call sites. Future: persist to Cosmos.
- **Deprecated endpoint documentation** — resolved. See `docs/BACKEND_API_DEPRECATION.md`
  for the full endpoint lifecycle catalogue, WebSocket namespace model, and planned additions.
- **WebSocket single-namespace isolation** — resolved. `/trading` and `/admin` named namespaces
  added to backend alongside the backward-compatible root namespace. Web and mobile clients
  connect to `/trading` by default. Admin namespace rejects non-admins at connection time.
- **Backend contract tests absent** — resolved. `verifyApiContract.ts` added and wired into
  `npm run test` via `check:api-contract`. Tests cover feature-flag shape, audit event
  literals, BotState health contract, realtime helpers, and namespace constants.
### Open
- **Mobile push notification infrastructure** — mobile settings UI has toggle state but
  no push provider, backend registration endpoint, or token storage. Defer to post-cutover.
  Planned endpoints: `POST /api/push/register`, `DELETE /api/push/register`.
- **Backend telemetry infrastructure** — backend has structured logging (Winston) but no
  OpenTelemetry or `@bytelyst/telemetry-client` integration. Web and mobile bootstrap
  telemetry via the common-platform SDK; backend does not yet send telemetry events.
  Defer until `learning_ai_common_plat` publishes a Node.js telemetry adapter.
- **Cosmos audit-events container** — `auditRepository.ts` and `GET /api/admin/audit`
  are implemented. Create the `audit-events` container in Cosmos (partition key: `/productId`,
  TTL: 7776000 / 90 days) to activate durable audit persistence. Until the container
  exists, `auditTradeEvent()` logs to Winston only (safe fallback).
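
The container settings in the last item can be written down as a definition object, with the TTL derived from the 90-day retention rather than hard-coded. This is a sketch, not a provisioning script: the field names follow the container definition accepted by the `@azure/cosmos` SDK (`id`, `partitionKey.paths`, `defaultTtl`), and the constant and helper names are hypothetical.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86400

// Convert a retention period in days to a Cosmos TTL in seconds.
function ttlSecondsForDays(days: number): number {
  return days * SECONDS_PER_DAY;
}

// Values taken from this runbook: container id, partition key, 90-day TTL.
const auditEventsContainerDefinition = {
  id: "audit-events",
  partitionKey: { paths: ["/productId"] },
  defaultTtl: ttlSecondsForDays(90), // 90 * 86400 = 7776000
};
```

With the SDK installed, a definition like this would typically be passed to `database.containers.createIfNotExists(...)`; verify the call against the SDK version pinned in the backend before running it against a production account.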