# Trading Monorepo Operations

## Purpose

This document is the operator and engineer runbook for `learning_ai_invt_trdg`.

It covers:

- local development setup
- verification and CI expectations
- staged rollout of the new monorepo deployment
- rollback rules
- release go/no-go checks
- post-cutover monitoring

## Local Development

### Prerequisites

- Node.js `>=20`
- `pnpm` `>=10`
- local checkout of:
  - `learning_ai_invt_trdg`
  - `learning_ai_common_plat`
- access to:
  - platform-service
  - Azure Cosmos DB

### Workspace bootstrap

```bash
pnpm run install:common-plat
cp .env.example .env
pnpm verify
```

If you need the registry path instead, use `pnpm run install:gitea`. The active
resolver is controlled by `BYTELYST_PACKAGE_SOURCE` in `.pnpmfile.cjs`.

### Core commands

```bash
pnpm verify
pnpm lint
pnpm typecheck
pnpm test
pnpm build
pnpm smoke:release
```

### Docker commands

```bash
# Production: build images and start backend + web
pnpm docker:up # equivalent: docker compose up --build

# Development: hot reload (tsx for backend, Vite HMR for web)
pnpm docker:dev # equivalent: docker compose -f docker-compose.yml -f docker-compose.dev.yml up

# Stop all containers
pnpm docker:down
```

Prerequisites for Docker:

- `.env` at repo root filled in (copy from `.env.example`)
- `GITEA_NPM_TOKEN` set when `BYTELYST_PACKAGE_SOURCE=gitea`
- `VITE_PLATFORM_URL` and `VITE_TRADING_API_URL` set if not using localhost defaults
- For dev mode: run the matching local install first (for example `pnpm run install:common-plat`)

Docker note: `common-plat` mode targets `/opt/bytelyst/learning_ai_common_plat`
on the host, so container builds should stay on `vendor` or `gitea` unless the
build context is expanded to include the sibling repo.

### Surface-specific commands

```bash
pnpm --filter @bytelyst/trading-backend dev
pnpm --filter @bytelyst/trading-web dev
pnpm --filter @bytelyst/trading-mobile dev
```

## Environment Model

### Platform-service

- `PLATFORM_API_URL`
- `PLATFORM_AUTH_ENABLED`
- `PLATFORM_JWT_ISSUER`
- `PLATFORM_JWT_PUBLIC_KEY` or `PLATFORM_JWT_JWKS_URL`
- `JWT_SECRET` only for HS256 compatibility environments

### Cosmos

- `COSMOS_ENDPOINT`
- `COSMOS_KEY`
- `COSMOS_DATABASE`

Rule:

- platform-service and Cosmos are the only supported production systems for this repo
- legacy repos may still be consulted as code references, but they are not runtime dependencies
- trading user profiles, dynamic config, trading controls, snapshots, capital ledgers, and strategy presets already use Cosmos-backed authority paths
- dynamic config runtime refresh and admin updates no longer seed from or mirror to legacy storage in the active backend runtime path

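The variables listed in this section could be checked once at startup so that a misconfigured environment fails fast. The following is a minimal sketch, not the repo's actual loader: `validateEnv` and its messages are hypothetical, and it assumes that `PLATFORM_API_URL`, `PLATFORM_JWT_ISSUER`, and the three Cosmos variables are always required, which this runbook does not state. `PLATFORM_AUTH_ENABLED` is left out because its accepted values are not documented here.

```typescript
// Hypothetical startup check for the variables listed in this section.
// Which variables are strictly required is an assumption, not a repo rule.
type Env = Record<string, string | undefined>;

const REQUIRED = [
  "PLATFORM_API_URL",
  "PLATFORM_JWT_ISSUER",
  "COSMOS_ENDPOINT",
  "COSMOS_KEY",
  "COSMOS_DATABASE",
] as const;

// Returns a list of human-readable problems; an empty list means "looks usable".
function validateEnv(env: Env): string[] {
  const problems: string[] = [];
  for (const name of REQUIRED) {
    if (!env[name]) problems.push(`${name} is not set`);
  }
  // The runbook lists a public key "or" a JWKS URL, with JWT_SECRET as the
  // HS256 compatibility path, so any one of the three is accepted here.
  const hasVerifier =
    Boolean(env.PLATFORM_JWT_PUBLIC_KEY) ||
    Boolean(env.PLATFORM_JWT_JWKS_URL) ||
    Boolean(env.JWT_SECRET);
  if (!hasVerifier) {
    problems.push(
      "set PLATFORM_JWT_PUBLIC_KEY, PLATFORM_JWT_JWKS_URL, or JWT_SECRET (HS256 only)",
    );
  }
  return problems;
}
```

A caller would typically pass `process.env`, log every returned problem, and exit before serving traffic if the list is non-empty.
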
## Verification Standard

Before merge or release, all of the following must pass from repo root:

```bash
pnpm verify
pnpm lint
```

`pnpm verify` currently gates:

- backend, web, and mobile typecheck
- backend and web test suites
- backend and web build plus mobile typecheck

`pnpm lint` currently gates:

- backend contract and safety verification scripts
- web lint
- mobile lint

## Request Tracing

- the main web and mobile API paths, operator actions, and lifecycle fetches now attach `x-request-id`
- backend HTTP responses echo `x-request-id` so browser/app logs can be correlated with backend logs
- during incident review, treat `x-request-id` as the primary request correlation key across client and backend traces

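The attach-and-echo behavior described above can be sketched on the client side as follows. This is an illustration under stated assumptions, not the code used by the web or mobile clients: `withRequestId` and `isEchoed` are hypothetical helper names, and the use of a UUID as the ID format is an assumption (the runbook only fixes the header name).

```typescript
import { randomUUID } from "node:crypto";

// Header name taken from the runbook; everything else here is illustrative.
const REQUEST_ID_HEADER = "x-request-id";

// Returns a copy of `headers` with lowercased names that carries a request ID.
// An ID that is already present is kept so chained calls stay correlated.
function withRequestId(
  headers: Record<string, string> = {},
  generate: () => string = randomUUID,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    out[name.toLowerCase()] = value; // HTTP header names are case-insensitive
  }
  if (!out[REQUEST_ID_HEADER]) {
    out[REQUEST_ID_HEADER] = generate();
  }
  return out;
}

// Checks that the backend echoed the same ID the client sent.
function isEchoed(sent: Record<string, string>, echoed: string | null): boolean {
  return echoed !== null && echoed === sent[REQUEST_ID_HEADER];
}
```

In a fetch-based client, the result of `withRequestId` would be passed as the request headers, and `response.headers.get("x-request-id")` would be passed to `isEchoed` and written to the client log next to the request outcome.
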
## Feature Flag Ownership

- backend `GET /api/feature-flags` is the authoritative runtime contract for user-facing feature access
- web feature gates must read explicit feature-flag contracts instead of scraping generic config payloads
- dynamic config may still store the underlying values, but the product surfaces should consume the typed feature-flag API

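One way a client could consume the typed contract rather than a generic config payload is to narrow the JSON response before any gate reads it. The sketch below is hypothetical: `FeatureFlagsSketch`, `parseFeatureFlags`, and `isTabEnabled` are illustrative names, and the shape covers only the `tabs.marketplace` and `tabs.membership` fields that this document mentions. The real `TradingFeatureFlagsResponse` may contain more.

```typescript
// Assumed partial shape: only `tabs.marketplace` and `tabs.membership` are
// confirmed by this runbook; the real response type may carry more fields.
interface FeatureFlagsSketch {
  tabs: { marketplace: boolean; membership: boolean };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Narrows an untyped JSON payload to the typed contract, or returns null so
// callers can fail closed instead of guessing from generic config.
function parseFeatureFlags(payload: unknown): FeatureFlagsSketch | null {
  if (!isRecord(payload)) return null;
  const tabs = payload["tabs"];
  if (!isRecord(tabs)) return null;
  const marketplace = tabs["marketplace"];
  const membership = tabs["membership"];
  if (typeof marketplace !== "boolean" || typeof membership !== "boolean") {
    return null;
  }
  return { tabs: { marketplace, membership } };
}

// A gate that treats a missing or malformed contract as "feature off".
function isTabEnabled(
  payload: unknown,
  tab: "marketplace" | "membership",
): boolean {
  const flags = parseFeatureFlags(payload);
  return flags !== null && flags.tabs[tab];
}
```

Failing closed is a design choice in this sketch: if the contract cannot be parsed, the tab stays hidden rather than falling back to values scraped from dynamic config.
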
## Staged Cutover

### Order

1. Backend internal validation
2. Web internal adoption
3. Mobile internal beta
4. Controlled operator rollout
5. Broader production cutover

### Backend cutover

- deploy backend with platform JWT support and Cosmos-backed control-plane and execution persistence enabled
- confirm runtime control reads/writes work through backend APIs
- confirm `dynamic_config`, trading-control, order, trade-history, and manual-entry containers are readable and writable
- confirm unauthorized requests are rejected and tenant-scoped reads are enforced

### Web cutover

See `docs/CUTOVER_WEB.md` for the full step-by-step checklist.

Summary:

- move operators to the monorepo web dashboard
- validate sign-in, session restore, kill-switch handling, and admin controls
- validate dynamic config writes through backend APIs
- run a parallel period (1–3 days) before switching traffic fully
- keep legacy direct-table workflows disabled where backend API replacements exist

### Mobile cutover

- release to internal beta first
- validate sign-in, session restore, live state, degraded-state handling, and safe interventions
- do not enable broader rollout until backend/web contracts stay stable through at least one backend deploy cycle

## Rollback Rules

### Hard rollback triggers

- auth/session failures prevent sign-in or session refresh
- incorrect tenant scoping leaks another user's profile, orders, alerts, or history
- global trade halt or scoped disable controls do not apply correctly
- dynamic config writes fail or partially apply without clear operator visibility
- mobile/web clients cannot recover from degraded platform-service or backend states

### Rollback actions

1. stop rollout to additional users immediately
2. revert the most recent monorepo deployment
3. restore traffic to the previous stable web/backend/mobile release
4. keep backend trade-halt authority available during rollback
5. preserve audit logs and operational events for incident review

### Data rollback rule

- do not rewrite or delete Cosmos control-plane state as part of first-response rollback
- prefer application rollback first, then explicit state repair if needed

## Release Go/No-Go

Release is `go` only if all of the following are true:

- `pnpm verify` passes
- `pnpm lint` passes
- `pnpm smoke:release` passes
- platform-service auth is reachable from web and mobile
- Cosmos control-plane reads and writes succeed
- Cosmos execution-data reads and writes succeed
- kill-switch and maintenance behavior are validated on web and mobile
- backend tenant isolation checks are green
- operator-safe mobile interventions are limited to approved actions only
- no legacy runtime data dependency remains in critical public flows

Release is `no-go` if any of the following are true:

- auth source of truth is ambiguous in production
- admin/runtime-control actions are not fully audited
- rollback owner or rollback commands are unclear

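The two lists above combine as "every `go` condition holds and no `no-go` condition holds". A minimal sketch of that rule is shown below, assuming the results of each check are collected by hand or by CI into two maps. `ReleaseChecks`, `decideRelease`, and the check names used as keys are hypothetical and exist only for illustration.

```typescript
// Each entry mirrors one bullet from the lists above. The key names are
// hypothetical labels chosen by whoever fills in the checklist.
interface ReleaseChecks {
  go: Record<string, boolean>; // every value must be true
  noGo: Record<string, boolean>; // any true value blocks the release
}

interface Decision {
  decision: "go" | "no-go";
  reasons: string[];
}

function decideRelease(checks: ReleaseChecks): Decision {
  const unmet = Object.entries(checks.go)
    .filter(([, passed]) => !passed)
    .map(([name]) => `not satisfied: ${name}`);
  const blockers = Object.entries(checks.noGo)
    .filter(([, present]) => present)
    .map(([name]) => `blocker: ${name}`);
  const reasons = [...unmet, ...blockers];
  return { decision: reasons.length === 0 ? "go" : "no-go", reasons };
}
```

Returning the reasons alongside the decision keeps the outcome reviewable: a `no-go` result names exactly which bullets failed.
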
## Release Smoke Checklist

`pnpm smoke:release` currently validates:

- web sign-in flow behavior
- web password reset flow behavior
- web authenticated session bootstrap behavior
- web websocket auth token gating
- web product kill-switch accessibility gating
- mobile auth and product-availability surfaces still compile against the shared platform contracts

`npm run test` in `backend/` additionally validates:

- WebSocket BotState contract and lifecycle consistency (`check:websocket-contract`)
- Session rule normalization across all session-string variants (`check:session-rule-normalization`)
- API contract: feature-flag shapes, audit event literals, BotState health, realtime helpers,
  namespace constants (`check:api-contract`)

Manual mobile release smoke is still required before broad rollout:

1. Sign in on a fresh install.
2. Confirm session restore after app restart.
3. Confirm product-disabled state blocks the app shell.
4. Confirm maintenance/availability messaging is visible.
5. Confirm the app recovers after re-enabling the product.

## Post-Cutover Monitoring

### Watch immediately after rollout

- platform auth failures
- token refresh failures
- backend `401` and `403` spikes
- websocket connection failure rate
- dynamic config update failures
- trading-control update failures
- mobile degraded/offline state frequency
- unexpected operator intervention failures

### Watch for the first 24 hours

- tenant isolation anomalies
- runtime control drift between backend memory and Cosmos control state
- kill-switch misfires
- stale session behavior across web and mobile
- build or chunk-size regressions affecting web load

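The "runtime control drift" item above amounts to comparing what the backend holds in memory with what is stored in Cosmos. The sketch below shows one way such a comparison could look. It is not an existing check in this repo: `findControlDrift` is a hypothetical name, and modelling control state as a flat map of primitive values is an assumption made to keep the example small.

```typescript
// Assumed flat shape for control state; the real documents may be nested.
type ControlState = Record<string, string | number | boolean>;

interface Drift {
  key: string;
  memory: string | number | boolean | undefined;
  persisted: string | number | boolean | undefined;
}

// Lists every key whose in-memory value differs from the persisted value,
// including keys that exist on only one side. Output is sorted by key.
function findControlDrift(
  memory: ControlState,
  persisted: ControlState,
): Drift[] {
  const keys = Array.from(
    new Set([...Object.keys(memory), ...Object.keys(persisted)]),
  ).sort();
  const drift: Drift[] = [];
  for (const key of keys) {
    if (memory[key] !== persisted[key]) {
      drift.push({ key, memory: memory[key], persisted: persisted[key] });
    }
  }
  return drift;
}
```

An empty result means the two views agree. Any entry is worth an alert during the first 24 hours, since it suggests a write reached one side but not the other.
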
## Known Remaining Gaps

The following are follow-up items, not hidden defects. They are tracked here until resolved.

### Resolved since last update (2026-04-07)

- **Exchange/order-level correlation-ID propagation**: resolved. `x-request-id` is now
  standardised across all main web/mobile API paths, operator actions, lifecycle fetches,
  and backend HTTP responses. See `OPERATIONS.md > Request Tracing`.
- **Feature-flag ownership beyond backtest**: resolved. `GET /api/feature-flags` now
  returns the full `TradingFeatureFlagsResponse` including `tabs.marketplace` and
  `tabs.membership`. Web and mobile consume these flags. Key constants are shared via
  `shared/feature-flags.ts`. See `docs/BACKEND_API_DEPRECATION.md`.
- **Admin audit event schema**: resolved. Schema formalised in `docs/BACKEND_AUDIT_SCHEMA.md`.
  `TradeAuditEvent` interface covers all current audit call sites. Future: persist to Cosmos.
- **Deprecated endpoint documentation**: resolved. See `docs/BACKEND_API_DEPRECATION.md`
  for full endpoint lifecycle catalogue, WebSocket namespace model, and planned additions.
- **WebSocket single-namespace isolation**: resolved. `/trading` and `/admin` named namespaces
  added to backend alongside the backward-compatible root namespace. Web and mobile clients
  connect to `/trading` by default. Admin namespace rejects non-admins at connection time.
- **Backend contract tests absent**: resolved. `verifyApiContract.ts` added and wired into
  `npm run test` via `check:api-contract`. Tests cover feature-flag shape, audit event
  literals, BotState health contract, realtime helpers, and namespace constants.

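The namespace rule noted in the list above (the admin namespace rejects non-admins at connection time) can be expressed as a small, transport-independent check. The sketch below is hypothetical and does not reproduce the backend's middleware: `SocketUser`, its `role` field, and `authorizeNamespace` are illustrative names, and requiring authentication on every namespace is an assumption.

```typescript
// Namespaces named in this runbook: the backward-compatible root plus two
// named namespaces.
type Namespace = "/" | "/trading" | "/admin";

// Hypothetical view of the verified token; the real claim names may differ.
interface SocketUser {
  userId: string;
  role: "admin" | "user";
}

interface ConnectionDecision {
  allowed: boolean;
  reason?: string;
}

// Decides at connection time whether a socket may join a namespace.
// Assumption: every namespace requires an authenticated user.
function authorizeNamespace(
  namespace: Namespace,
  user: SocketUser | null,
): ConnectionDecision {
  if (user === null) {
    return { allowed: false, reason: "unauthenticated" };
  }
  if (namespace === "/admin" && user.role !== "admin") {
    return { allowed: false, reason: "admin role required" };
  }
  return { allowed: true };
}
```

Keeping the decision in a pure function makes it easy to cover in a contract test without opening a real socket.
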
### Open

- **Mobile push notification infrastructure**: mobile settings UI has toggle state but
  no push provider, backend registration endpoint, or token storage. Defer to post-cutover.
  Planned endpoints: `POST /api/push/register`, `DELETE /api/push/register`.
- **Backend telemetry infrastructure**: backend has structured logging (Winston) but no
  OpenTelemetry or `@bytelyst/telemetry-client` integration. Web and mobile bootstrap
  telemetry via the common-platform SDK; backend does not yet send telemetry events.
  Defer until `learning_ai_common_plat` publishes a Node.js telemetry adapter.
- **Cosmos audit-events container**: `auditRepository.ts` and `GET /api/admin/audit`
  are implemented. Create the `audit-events` container in Cosmos (partition key: `/productId`,
  TTL: 7776000 seconds, i.e. 90 days) to activate durable audit persistence. Until the container
  exists, `auditTradeEvent()` logs to Winston only (safe fallback).

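The TTL figure in the audit-events item follows from 90 days × 24 hours × 60 minutes × 60 seconds. The sketch below restates that arithmetic and collects the container settings named above in one plain object. It does not call any SDK: the field names (`id`, `partitionKey.paths`, `defaultTtl`) follow the container-definition shape that the Azure Cosmos DB JavaScript SDK is understood to accept, which should be confirmed against the SDK version in use, and `expiresAt` is a hypothetical helper.

```typescript
// 90 days expressed in seconds, the unit Cosmos DB uses for TTL values.
const AUDIT_TTL_SECONDS = 90 * 24 * 60 * 60; // 7776000

// Plain-object sketch of the settings named in this runbook. The field names
// are an assumption about the SDK's container-definition shape.
const auditEventsContainer = {
  id: "audit-events",
  partitionKey: { paths: ["/productId"] },
  defaultTtl: AUDIT_TTL_SECONDS,
};

// Hypothetical helper: when an event written at `writtenAtMs` (epoch
// milliseconds) would become eligible for expiry under the given TTL.
function expiresAt(
  writtenAtMs: number,
  ttlSeconds: number = AUDIT_TTL_SECONDS,
): Date {
  return new Date(writtenAtMs + ttlSeconds * 1000);
}
```

For example, an event written at `2026-01-01T00:00:00Z` would be retained until `2026-04-01T00:00:00Z` under this TTL.
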