From 5685cb3449aaa1e7e938bd78d51d74a906bea682 Mon Sep 17 00:00:00 2001
From: Saravana Achu Mac
Date: Sat, 4 Apr 2026 14:36:41 -0700
Subject: [PATCH] docs: add operations runbook and refresh roadmap status

---
 README.md          |   6 ++
 docs/OPERATIONS.md | 207 +++++++++++++++++++++++++++++++++++++++++++++
 docs/ROADMAP.md    |  53 ++++++------
 3 files changed, 241 insertions(+), 25 deletions(-)
 create mode 100644 docs/OPERATIONS.md

diff --git a/README.md b/README.md
index e369e61..6f6e05d 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,12 @@ This repo consumes local ByteLyst common-platform packages from:
 ```bash
 pnpm install
 pnpm verify
+pnpm lint
 pnpm build
 ```
 
+## Operations
+
+- product and scope: `docs/PRD.md`
+- execution tracker: `docs/ROADMAP.md`
+- local dev, cutover, rollback, and release checks: `docs/OPERATIONS.md`
diff --git a/docs/OPERATIONS.md b/docs/OPERATIONS.md
new file mode 100644
index 0000000..a7bb543
--- /dev/null
+++ b/docs/OPERATIONS.md
@@ -0,0 +1,207 @@
+# Trading Monorepo Operations
+
+## Purpose
+
+This document is the operator and engineer runbook for `learning_ai_invt_trdg`.
+
+It covers:
+
+- local development setup
+- verification and CI expectations
+- staged cutover from legacy repos
+- rollback rules
+- release go/no-go checks
+- post-cutover monitoring
+
+## Local Development
+
+### Prerequisites
+
+- Node.js `>=20`
+- `pnpm` `>=10`
+- local checkout of:
+  - `learning_ai_invt_trdg`
+  - `learning_ai_common_plat`
+- access to:
+  - platform-service
+  - Azure Cosmos DB
+  - optional legacy Supabase project during migration
+
+### Workspace bootstrap
+
+```bash
+pnpm install
+cp .env.example .env
+pnpm verify
+```
+
+### Core commands
+
+```bash
+pnpm verify
+pnpm lint
+pnpm typecheck
+pnpm test
+pnpm build
+```
+
+### Surface-specific commands
+
+```bash
+pnpm --filter @bytelyst/trading-backend dev
+pnpm --filter @bytelyst/trading-web dev
+pnpm --filter @bytelyst/trading-mobile dev
+```
+
+## Environment Model
+
+### Platform-service
+
+- `PLATFORM_API_URL`
+- `PLATFORM_AUTH_ENABLED`
+- `PLATFORM_JWT_ISSUER`
+- `PLATFORM_JWT_PUBLIC_KEY` or `PLATFORM_JWT_JWKS_URL`
+- `JWT_SECRET` only for HS256 compatibility environments
+
+### Cosmos
+
+- `COSMOS_ENDPOINT`
+- `COSMOS_KEY`
+- `COSMOS_DATABASE`
+
+### Transitional legacy migration support
+
+- `SUPABASE_URL`
+- `SUPABASE_KEY`
+- `SUPABASE_JWT_ISSUER`
+- `SUPABASE_JWT_AUDIENCE`
+
+Rule:
+
+- platform-service and Cosmos are the target system
+- Supabase remains transitional only where trading persistence has not yet been migrated
+
+## Verification Standard
+
+Before merge or release, all of the following must pass from repo root:
+
+```bash
+pnpm verify
+pnpm lint
+```
+
+`pnpm verify` currently gates:
+
+- backend, web, and mobile typecheck
+- backend and web test suites
+- backend and web build plus mobile typecheck
+
+`pnpm lint` currently gates:
+
+- backend contract and safety verification scripts
+- web lint
+- mobile lint
+
+## Staged Cutover
+
+### Order
+
+1. Backend internal validation
+2. Web internal adoption
+3. Mobile internal beta
+4. Controlled operator rollout
+5. Broader production cutover
+
+### Backend cutover
+
+- deploy backend with platform JWT support and Cosmos-backed trading controls enabled
+- keep legacy Supabase fallback enabled during first production bake
+- confirm runtime control reads/writes work through backend APIs
+- confirm `dynamic_config` and trading-control containers are readable and writable
+- confirm unauthorized requests are rejected and tenant-scoped reads are enforced
+
+### Web cutover
+
+- move operators to the monorepo web dashboard
+- validate sign-in, session restore, kill-switch handling, and admin controls
+- validate dynamic config writes through backend APIs
+- keep legacy direct-table workflows disabled where backend API replacements exist
+
+### Mobile cutover
+
+- release to internal beta first
+- validate sign-in, session restore, live state, degraded-state handling, and safe interventions
+- do not enable broader rollout until backend/web contracts stay stable through at least one backend deploy cycle
+
+## Rollback Rules
+
+### Hard rollback triggers
+
+- auth/session failures prevent sign-in or session refresh
+- incorrect tenant scoping leaks another user's profile, orders, alerts, or history
+- global trade halt or scoped disable controls do not apply correctly
+- dynamic config writes fail or partially apply without clear operator visibility
+- mobile/web clients cannot recover from degraded platform-service or backend states
+
+### Rollback actions
+
+1. stop rollout to additional users immediately
+2. revert the most recent monorepo deployment
+3. restore traffic to the previous stable web/backend/mobile release
+4. keep backend trade-halt authority available during rollback
+5. preserve audit logs and operational events for incident review
+
+### Data rollback rule
+
+- do not rewrite or delete Cosmos control-plane state as part of first-response rollback
+- prefer application rollback first, then explicit state repair if needed
+
+## Release Go/No-Go
+
+Release is `go` only if all of the following are true:
+
+- `pnpm verify` passes
+- `pnpm lint` passes
+- platform-service auth is reachable from web and mobile
+- Cosmos control-plane reads and writes succeed
+- kill-switch and maintenance behavior are validated on web and mobile
+- backend tenant isolation checks are green
+- operator-safe mobile interventions are limited to approved actions only
+- known migration-only legacy dependencies are documented
+
+Release is `no-go` if any of the following are true:
+
+- Supabase fallback is still required for a critical public flow that has no monitored contingency
+- auth source of truth is ambiguous in production
+- admin/runtime-control actions are not fully audited
+- rollback owner or rollback commands are unclear
+
+## Post-Cutover Monitoring
+
+### Watch immediately after rollout
+
+- platform auth failures
+- token refresh failures
+- backend `401` and `403` spikes
+- websocket connection failure rate
+- dynamic config update failures
+- trading-control update failures
+- mobile degraded/offline state frequency
+- unexpected operator intervention failures
+
+### Watch for the first 24 hours
+
+- tenant isolation anomalies
+- runtime control drift between backend memory and Cosmos control state
+- kill-switch misfires
+- stale session behavior across web and mobile
+- build or chunk-size regressions affecting web load
+
+## Known Remaining Gaps
+
+- full trading data-plane migration away from legacy Supabase is not complete
+- web still carries some legacy compatibility layers around auth/profile bootstrap
+- mobile does not yet include push notification infrastructure
+- feature-flag ownership and correlation-ID propagation are not fully standardized yet
+
+These are follow-up items, not hidden defects. They should remain tracked in `docs/ROADMAP.md`.
diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md
index ee098d6..9369646 100644
--- a/docs/ROADMAP.md
+++ b/docs/ROADMAP.md
@@ -14,7 +14,7 @@ It assumes:
 
 ### Overall status
 
-- Current phase: `Phase 5`
+- Current phase: `Phase 6`
 - Overall state: `In Progress`
 
 ### Legend
@@ -31,6 +31,9 @@ It assumes:
 - [x] Web migrated into `web/` with shared runtime, shared kill-switch gate, shared telemetry bootstrap, normalized backend URL resolution, and platform-service-backed public auth via a compatibility shim
 - [x] Mobile migrated into `mobile/` with product identity, shared runtime bootstrap, launch-time kill-switch gate, platform-service auth, live backend polling plus websocket-backed updates, startup/error telemetry capture, secure session storage with invalidation handling, and explicit degraded/offline status surfacing
 - [x] Backend now accepts common-platform JWTs with legacy Supabase fallback and persists global trading-control state through Cosmos-backed control storage
+- [x] Dynamic config now flows through backend control-plane APIs with Cosmos-first storage and legacy Supabase fallback
+- [x] Mobile platform auth requests now use the common React Native platform SDK
+- [x] Root verification and lint flows now run successfully without sandbox-hostile script harness behavior
 - [-] DRY cleanup completed for runtime/config/bootstrap concerns, shared websocket auth helpers, and platform-session handling, but not yet for all data-plane persistence flows
 - [!] Full common-platform data-plane replacement remains a follow-up; backend and web still retain legacy Supabase data access for trading records and configuration tables
 
@@ -161,7 +164,7 @@ Ensure all surfaces adopt one consistent platform model for auth, kill switch, t
 
 ### Checklist
 
-- [-] Define web auth pattern using `@bytelyst/react-auth`
+- [x] Define web auth pattern using `@bytelyst/react-auth`
 - [x] Define mobile auth pattern using shared platform-service session contracts
 - [x] Define backend auth boundary and middleware strategy
 - [x] Define kill-switch semantics across web, mobile, and backend
@@ -306,7 +309,7 @@ Build mobile as a real ecosystem surface, not a mock UI shell.
 
 - [x] Create Expo app structure following FastGap-style monorepo conventions
 - [x] Add product config bootstrap
-- [-] Integrate `@bytelyst/react-native-platform-sdk`
+- [x] Integrate `@bytelyst/react-native-platform-sdk`
 - [x] Implement auth flow and session restore
 - [x] Define secure storage and session invalidation behavior
 - [x] Implement launch-time kill-switch and maintenance handling
@@ -389,13 +392,13 @@ Validate that the new monorepo is safer and more coherent than the legacy setup
 
 ### Checklist
 
-- [ ] Add root verify scripts
+- [x] Add root verify scripts
 - [ ] Add backend contract tests
 - [ ] Add web auth and kill-switch smoke tests
 - [ ] Add mobile launch/auth/kill-switch smoke coverage
-- [ ] Add docs for local dev, CI, Docker, and fallback behaviors
-- [ ] Define cutover sequencing from legacy repos
-- [ ] Define rollback paths
+- [x] Add docs for local dev, CI, Docker, and fallback behaviors
+- [x] Define cutover sequencing from legacy repos
+- [x] Define rollback paths
 - [ ] Define release go/no-go checklist
 - [ ] Define post-cutover monitoring checks
 
@@ -409,12 +412,12 @@ Validate that the new monorepo is safer and more coherent than the legacy setup
 
 ## 9.1 Root / Repository Tasks
 
-- [ ] Create root `package.json`
-- [ ] Create `pnpm-workspace.yaml`
-- [ ] Create `.env.example`
-- [ ] Create `shared/product.json`
-- [ ] Create `scripts/verify.sh` or equivalent
-- [ ] Create root README
+- [x] Create root `package.json`
+- [x] Create `pnpm-workspace.yaml`
+- [x] Create `.env.example`
+- [x] Create `shared/product.json`
+- [x] Create `scripts/verify.sh` or equivalent
+- [x] Create root README
 - [ ] Create docker/dev orchestration model
 - [ ] Define naming conventions and import boundaries
 
@@ -454,29 +457,29 @@ Validate that the new monorepo is safer and more coherent than the legacy setup
 
 ### Recommended Commit Order
 
-- [ ] Repo scaffold and product identity
-- [ ] Backend skeleton and config/auth contracts
-- [ ] Backend runtime control and health contracts
-- [ ] Web shell and auth migration
-- [ ] Web dashboard migration by tab priority
-- [ ] Mobile bootstrap and auth
+- [x] Repo scaffold and product identity
+- [x] Backend skeleton and config/auth contracts
+- [x] Backend runtime control and health contracts
+- [x] Web shell and auth migration
+- [x] Web dashboard migration by tab priority
+- [x] Mobile bootstrap and auth
 - [ ] Mobile overview/alerts/positions/history
 - [ ] DRY cleanup
-- [ ] Verification and cutover docs
+- [x] Verification and cutover docs
 
 ### Recommended Rollout Order
 
-- [ ] Backend internal validation
+- [x] Backend internal validation
 - [ ] Web internal adoption
 - [ ] Mobile internal beta
 - [ ] External / staged rollout
 
 ## 11. Definition of Done by Phase
 
-- [ ] Docs are updated
-- [ ] Contracts are explicit
-- [ ] Verification has been run or the remaining gap is documented
-- [ ] Ownership boundaries are clearer than before the phase started
+- [x] Docs are updated
+- [x] Contracts are explicit
+- [x] Verification has been run or the remaining gap is documented
+- [x] Ownership boundaries are clearer than before the phase started
 
 ## 12. Risks and Mitigations
 