From 856788c38613923d007089f470b4f6dc6db03db6 Mon Sep 17 00:00:00 2001
From: saravanakumardb1 <saravanakumardb1@users.noreply.github.com>
Date: Tue, 17 Feb 2026 10:35:46 -0800
Subject: [PATCH] docs: update documentation

---
 docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md | 1160 ++++++++++++++++++
 1 file changed, 1160 insertions(+)
 create mode 100644 docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md

diff --git a/docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md b/docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md
new file mode 100644
index 00000000..9aced78b
--- /dev/null
+++ b/docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md
@@ -0,0 +1,1160 @@
+# Platform Components Roadmap — What's Built, What's Missing, What's Next
+
+> **Status:** Living document — brainstorm + gap analysis  
+> **Last updated:** 2026-02-17  
+> **Scope:** All infrastructure components relevant to admin, DevOps, and product operations across the ByteLyst platform.  
+> **Repos:** `learning_ai_common_plat` (platform-service, packages) · `learning_voice_ai_agent` (dashboards, clients)
+
+---
+
+## Table of Contents
+
+1. [Current Inventory](#1-current-inventory)
+2. [Gap Analysis — Missing Components](#2-gap-analysis--missing-components)
+   - [P0 — Foundational](#p0--foundational)
+   - [P1 — Operational Maturity](#p1--operational-maturity)
+   - [P2 — Product Intelligence](#p2--product-intelligence)
+   - [P3 — Scale & Polish](#p3--scale--polish)
+3. [Implementation Priority Matrix](#3-implementation-priority-matrix)
+4. [New Cosmos Containers & Cost Impact](#4-new-cosmos-containers--cost-impact)
+5. [New Environment Variables](#5-new-environment-variables)
+6. [Quick Reference — Where Things Live](#6-quick-reference--where-things-live)
+
+---
+
+## 1. Current Inventory
+
+### 1.1 Platform-Service Modules (25 modules)
+
+| Category     | Module          | Endpoints | Description                                                                                         |
+| ------------ | --------------- | --------- | --------------------------------------------------------------------------------------------------- |
+| **Identity** | `auth`          | 11 routes | Login, register, refresh, SSO, profile, admin user CRUD                                             |
+| **Identity** | `tokens`        | CRUD      | API token management                                                                                |
+| **Identity** | `licenses`      | CRUD      | License key generation, activation, device binding                                                  |
+| **Billing**  | `subscriptions` | CRUD      | Plan management, trial tracking, period management                                                  |
+| **Billing**  | `stripe`        | Webhooks  | Inbound Stripe webhook processing                                                                   |
+| **Billing**  | `plans`         | CRUD      | Plan definitions (free, pro, enterprise)                                                            |
+| **Billing**  | `usage`         | CRUD      | Usage tracking and quota enforcement                                                                |
+| **Billing**  | `promos`        | CRUD      | Promo code creation, validation, redemption                                                         |
+| **Growth**   | `invitations`   | CRUD      | Invitation code generation, redemption, tracking                                                    |
+| **Growth**   | `referrals`     | CRUD      | Referral link tracking, status transitions                                                          |
+| **Growth**   | `waitlist`      | 12 routes | Pre-launch signups, position tracking, admin batch invite, CSV export                               |
+| **Growth**   | `public`        | 5 routes  | Public roadmap, community voting, feature submissions                                               |
+| **Content**  | `items`         | CRUD      | Tracker items (bugs, features, tasks)                                                               |
+| **Content**  | `comments`      | CRUD      | Threaded comments on items                                                                          |
+| **Content**  | `votes`         | CRUD      | User votes on items and comments                                                                    |
+| **Content**  | `memory`        | 5 routes  | Memory items — create, reassign, patch, delete                                                      |
+| **Ops**      | `audit`         | Query     | Audit log recording and admin queries                                                               |
+| **Ops**      | `flags`         | CRUD      | Feature flags with FNV-1a deterministic rollout                                                     |
+| **Ops**      | `telemetry`     | 9 routes  | Client event ingestion, error clustering, collection policies, GDPR erasure                         |
+| **Ops**      | `notifications` | 5 routes  | Device registration, notification preferences                                                       |
+| **Ops**      | `settings`      | 6 routes  | User/device settings, kill switch                                                                   |
+| **Ops**      | `ratelimit`     | 4 routes  | Rate limit checking, config management                                                              |
+| **Ops**      | `themes`        | 7 routes  | Platform theming (iOS, Android, Desktop)                                                            |
+| **Ops**      | `blob`          | 5 routes  | Azure Blob Storage SAS tokens, list, delete, info                                                   |
+| **Registry** | `products`      | 4 routes  | Multi-product registry with full lifecycle (draft → pre_launch → beta → active → sunset → disabled) |
+
+### 1.2 Shared Packages (13 packages)
+
+| Package                   | Purpose                                                     |
+| ------------------------- | ----------------------------------------------------------- |
+| `@bytelyst/errors`        | Typed HTTP errors (400–429)                                 |
+| `@bytelyst/cosmos`        | Cosmos DB client singleton + container registry             |
+| `@bytelyst/config`        | Zod env loader, product identity, AKV resolver              |
+| `@bytelyst/auth`          | JWT utilities, auth middleware, password hashing            |
+| `@bytelyst/api-client`    | Fetch wrapper with auth token injection                     |
+| `@bytelyst/fastify-core`  | `createServiceApp()` factory + `startService()`             |
+| `@bytelyst/react-auth`    | React auth context factory                                  |
+| `@bytelyst/logger`        | Structured logging (pino-based)                             |
+| `@bytelyst/testing`       | Shared test mocks, Fastify inject helpers                   |
+| `@bytelyst/blob`          | Azure Blob Storage client + SAS helpers                     |
+| `@bytelyst/extraction`    | Extraction client + shared types                            |
+| `@bytelyst/monitoring`    | Health-check utilities                                      |
+| `@bytelyst/design-tokens` | Cross-platform token generator (JSON → CSS/TS/Kotlin/Swift) |
+
+### 1.3 Services
+
+| Service                | Port | Description                                           |
+| ---------------------- | ---- | ----------------------------------------------------- |
+| **platform-service**   | 4003 | Consolidated Fastify service (25 modules, 158+ tests) |
+| **extraction-service** | 4005 | LangExtract text extraction + Python sidecar          |
+| **monitoring**         | 4004 | Health-check aggregator (all services)                |
+
+### 1.4 Dashboards
+
+| Dashboard                 | Port | Pages                                                            |
+| ------------------------- | ---- | ---------------------------------------------------------------- |
+| **admin-dashboard-web**   | 3001 | ~25 pages — users, billing, flags, ops, telemetry, secrets, etc. |
+| **user-dashboard-web**    | 3002 | User portal — subscription, usage, settings                      |
+| **tracker-dashboard-web** | 3003 | Public roadmap, issue tracker, community voting                  |
+
+### 1.5 Infrastructure Already In Place
+
+| Component              | Status     | Notes                                                                                                                                         |
+| ---------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Health checks**      | ✅         | Per-service `/health` + aggregated monitoring script                                                                                          |
+| **Structured logging** | ✅         | Pino (Fastify) + structlog (Python)                                                                                                           |
+| **Log aggregation**    | ✅         | Loki + Grafana (Docker Compose)                                                                                                               |
+| **Reverse proxy**      | ✅         | Traefik (Docker Compose)                                                                                                                      |
+| **Secret management**  | ✅         | Azure Key Vault + admin CRUD UI at `/ops/secrets`                                                                                             |
+| **Feature flags**      | ✅         | FNV-1a hash, percentage rollout, admin UI                                                                                                     |
+| **Client telemetry**   | ✅         | All platforms instrumented, admin Client Logs page                                                                                            |
+| **Rate limiting**      | ✅         | In-memory sliding window + configurable rules per product                                                                                     |
+| **Outbound webhooks**  | ⚠️ Partial | Fire-and-forget POST for 3 events (`lib/webhooks.ts`); no subscription model, no retry, no HMAC signing                                       |
+| **Kill switch**        | ✅         | Per-product, checked by all clients via `/settings/kill-switch`                                                                               |
+| **Audit logging**      | ✅         | Records admin actions, queryable from admin dashboard                                                                                         |
+| **Blob storage**       | ✅         | 6 containers (audio, transcripts, attachments, avatars, releases, backups), SAS tokens, admin endpoints                                       |
+| **Swagger / OpenAPI**  | ⚠️ Partial | `createServiceApp()` passes `swagger` config; Fastify plugin wired but Zod schemas not fully connected to route definitions via type provider |
+| **Prometheus metrics** | ⚠️ Partial | `metrics: true` in `createServiceApp()` — basic request metrics exposed; no custom business metrics, no Grafana dashboards for them           |
+| **Product registry**   | ✅         | Multi-product with full status lifecycle (draft → pre_launch → beta → active → sunset → disabled), prelaunch config, custom fields            |
+| **Admin doc browser**  | ✅         | `/docs` page with markdown viewer, search, and AI chat — browses repo documentation                                                           |
+
+---
+
+## 2. Gap Analysis — Missing Components
+
+### P0 — Foundational
+
+These are blocking features that nearly every production app needs. Without them, critical operational workflows are manual or impossible.
+
+---
+
+#### 2.1 Scheduled Jobs / Background Task Runner
+
+**Why:** No way to run recurring work today. Trial expirations, subscription renewals, usage quota resets, stale data cleanup, digest emails, and report generation all require a scheduler.
+
+**Current state:** Zero. All logic is request-driven (HTTP request → response).
+
+**Proposed design:**
+
+```
+platform-service/src/modules/jobs/
+├── types.ts         — JobDefinition, JobRun, JobSchedule schemas
+├── registry.ts      — Job registry (register named jobs with cron expressions)
+├── runner.ts        — Tick loop: evaluate cron, run due jobs, record outcomes
+├── repository.ts    — Cosmos: job_definitions, job_runs containers
+└── routes.ts        — Admin: list jobs, trigger manually, view run history, pause/resume
+```
+
+**Built-in jobs to ship on day 1:**
+
+| Job                      | Schedule              | Description                                                                                            |
+| ------------------------ | --------------------- | ------------------------------------------------------------------------------------------------------ |
+| `trial-expiration-check` | Every hour            | Find subscriptions with `status=trialing` past `currentPeriodEnd`, transition to `expired` or `active` |
+| `usage-quota-reset`      | Daily at midnight UTC | Reset daily/monthly counters in `usage_daily` container                                                |
+| `stale-session-cleanup`  | Every 6 hours         | Remove expired refresh tokens and inactive sessions                                                    |
+| `telemetry-ttl-sweep`    | Daily at 3am UTC      | Delete telemetry events past retention TTL (Cosmos TTL is best-effort)                                 |
+| `waitlist-reminder`      | Weekly                | Identify stale waitlist entries, mark for follow-up                                                    |
+| `license-expiry-check`   | Daily                 | Warn users whose licenses expire within 7 days                                                         |
+
+**Options for the runner:**
+
+- **In-process tick loop** (simplest): `setInterval` in platform-service, with leader election via Cosmos lease
+- **Azure Functions timer triggers** (serverless): Lower cost, built-in cron, but adds deployment complexity
+- **BullMQ + Redis** (heavy): Best for high-throughput, but adds a Redis dependency
+
+**Recommendation:** Start with in-process tick loop + Cosmos lease for leader election (avoids Redis). Migrate to Azure Functions if job volume grows.
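+
+A minimal sketch of the recommended tick loop, assuming a lease made of a holder ID and an expiry time. `LeaseStore`, `InMemoryLease`, `RegisteredJob`, and `tick` are hypothetical names, and a fixed `intervalMs` stands in for real cron parsing. In production the lease would presumably be a Cosmos document updated under an ETag precondition; the in-memory version exists only so the sketch runs on its own.
+
+```typescript
+// Hypothetical lease contract; a Cosmos-backed version would replace InMemoryLease.
+interface LeaseStore {
+  tryAcquire(holderId: string, nowMs: number, ttlMs: number): Promise<boolean>;
+}
+
+// In-memory stand-in so the sketch runs without Cosmos.
+class InMemoryLease implements LeaseStore {
+  private holder: string | null = null;
+  private expiresAt = 0;
+
+  async tryAcquire(holderId: string, nowMs: number, ttlMs: number): Promise<boolean> {
+    const free = this.holder === null || nowMs >= this.expiresAt;
+    if (free || this.holder === holderId) {
+      // Acquire a free lease, or renew the one this instance already holds.
+      this.holder = holderId;
+      this.expiresAt = nowMs + ttlMs;
+      return true;
+    }
+    return false;
+  }
+}
+
+interface RegisteredJob {
+  name: string;
+  intervalMs: number; // simplification: fixed interval instead of a cron expression
+  nextRunAt: number; // epoch milliseconds
+  run: () => Promise<void>;
+}
+
+// One tick: only the lease holder runs jobs whose nextRunAt has passed.
+// Resolves to the names of the jobs that were attempted.
+async function tick(
+  instanceId: string,
+  lease: LeaseStore,
+  jobs: RegisteredJob[],
+  nowMs: number,
+  leaseTtlMs = 30_000,
+): Promise<string[]> {
+  if (!(await lease.tryAcquire(instanceId, nowMs, leaseTtlMs))) return [];
+  const attempted: string[] = [];
+  for (const job of jobs) {
+    if (nowMs < job.nextRunAt) continue;
+    try {
+      await job.run();
+    } catch (err) {
+      // A real runner would write a failed job_runs document here.
+      console.error(`job ${job.name} failed`, err);
+    }
+    job.nextRunAt = nowMs + job.intervalMs;
+    attempted.push(job.name);
+  }
+  return attempted;
+}
+```
+
+Each instance would call `tick` from a `setInterval` shorter than the lease TTL, so the current leader keeps renewing its lease while the other instances return early.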
+
+**Admin UI:**
+
+- `/ops/jobs` page: list all registered jobs, last run status, next scheduled run
+- Manual trigger button per job
+- Run history table with duration, outcome, error details
+- Pause/resume toggle per job
+
+**Cosmos containers:**
+
+- `job_definitions` (pk: `/productId`) — name, cron, enabled, lastRunAt, nextRunAt
+- `job_runs` (pk: `/productId:jobName`) — runId, startedAt, completedAt, status, error, metrics
+
+---
+
+#### 2.2 Transactional Email & Push Delivery
+
+**Why:** The `notifications` module manages device registration and preferences, but has **no delivery mechanism**. Notifications are database records with no way to reach users.
+
+**Current state:** Device registration + preference management only. No email, no push, no SMS.
+
+**Proposed design:**
+
+```
+platform-service/src/modules/delivery/
+├── types.ts         — DeliveryRequest, DeliveryLog, ChannelConfig schemas
+├── channels/
+│   ├── email.ts     — SendGrid/Postmark adapter
+│   ├── push-apns.ts — Apple Push Notification Service
+│   ├── push-fcm.ts  — Firebase Cloud Messaging
+│   └── sms.ts       — Twilio/Azure Communication Services (future)
+├── renderer.ts      — Template rendering (Handlebars/Mustache for email)
+├── repository.ts    — delivery_log container (track sent/failed/bounced)
+├── dispatcher.ts    — Route delivery request to correct channel(s) based on prefs
+└── routes.ts        — Admin: send test, view delivery log, manage templates
+```
+
+**Email templates to ship on day 1:**
+
+| Template            | Trigger                                    | Description                                  |
+| ------------------- | ------------------------------------------ | -------------------------------------------- |
+| `welcome`           | `auth.register`                            | Welcome email with getting-started guide     |
+| `trial-expiring`    | `jobs.trial-expiration-check` (7d warning) | "Your trial ends in 7 days"                  |
+| `trial-expired`     | `jobs.trial-expiration-check`              | "Your trial has ended — upgrade to continue" |
+| `password-reset`    | Future: `/auth/forgot-password`            | One-time reset link                          |
+| `invitation`        | `invitations.create`                       | "You've been invited to join"                |
+| `waitlist-accepted` | `waitlist.invite`                          | "You're in! Here's your access"              |
+| `payment-failed`    | `stripe.invoice.payment_failed`            | "We couldn't charge your card"               |
+| `license-expiring`  | `jobs.license-expiry-check`                | "Your license expires in 7 days"             |
+
+**Push notification types:**
+
+| Type                   | Channel    | Description                                  |
+| ---------------------- | ---------- | -------------------------------------------- |
+| `dictation_reminder`   | APNs + FCM | "Haven't dictated today — keep your streak!" |
+| `feature_announcement` | APNs + FCM | Admin-triggered announcement                 |
+| `subscription_change`  | APNs + FCM | Plan upgraded/downgraded/expired             |
+
+**Cosmos container:**
+
+- `delivery_log` (pk: `/productId:channel:yyyyMM`) — id, userId, channel, template, status (sent/failed/bounced), sentAt, error
+
+**Admin UI:**
+
+- `/ops/delivery` page: delivery log with filters (channel, status, template, date range)
+- Template management: list, preview, edit (future: visual editor)
+- "Send test" button for each template
+- Delivery stats: sent/failed/bounced/opened (with SendGrid webhook integration)
+
+---
+
+#### 2.3 Outbound Webhook Subscriptions
+
+**Why:** The current `lib/webhooks.ts` sends fire-and-forget POSTs to URLs configured through environment variables, with no retry, no signing, and no subscriber management. External integrations (Zapier, Slack, custom) need a proper webhook subscription system.
+
+**Current state:** 3 hardcoded webhook dispatchers (invitation redeemed, referral status changed, waitlist joined). No retry. No HMAC signing. No subscription management.
+
+**Proposed design:**
+
+```
+platform-service/src/modules/webhooks/
+├── types.ts         — WebhookSubscription, WebhookDelivery, WebhookEvent schemas
+├── repository.ts    — Cosmos: webhook_subscriptions, webhook_deliveries containers
+├── dispatcher.ts    — Match event → subscriptions, queue delivery, HMAC-SHA256 sign
+├── delivery.ts      — HTTP POST with exponential backoff retry (3 attempts)
+└── routes.ts        — Admin CRUD for subscriptions + delivery log
+```
+
+**Event catalog (subscribe to any combination):**
+
+| Event                   | Payload                                        | Source                      |
+| ----------------------- | ---------------------------------------------- | --------------------------- |
+| `user.created`          | `{ userId, email, plan }`                      | `auth.register`, `auth.sso` |
+| `user.deleted`          | `{ userId }`                                   | `auth.delete`               |
+| `subscription.created`  | `{ subscriptionId, userId, plan, status }`     | Registration hook           |
+| `subscription.changed`  | `{ subscriptionId, oldPlan, newPlan, status }` | Stripe webhook              |
+| `subscription.canceled` | `{ subscriptionId, userId, reason }`           | User action / Stripe        |
+| `payment.succeeded`     | `{ invoiceId, amount, userId }`                | Stripe webhook              |
+| `payment.failed`        | `{ invoiceId, amount, userId, retryCount }`    | Stripe webhook              |
+| `invitation.redeemed`   | `{ invitationId, userId }`                     | Invitation module           |
+| `referral.completed`    | `{ referralId, referrerId, referredId }`       | Referral module             |
+| `waitlist.joined`       | `{ email, position }`                          | Waitlist module             |
+| `flag.toggled`          | `{ flagId, enabled, percentage }`              | Flags module                |
+| `license.activated`     | `{ licenseId, userId, deviceId }`              | License module              |
+| `license.expired`       | `{ licenseId, userId }`                        | Jobs: license-expiry-check  |
+
+**Security:**
+
+- Every delivery signed with `X-Webhook-Signature: sha256=<HMAC>` using per-subscription secret
+- Subscription secret generated at creation time, displayed once, rotatable
+- Replay protection: `X-Webhook-Timestamp` header, reject if > 5 min old
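+
+One way the signing and replay check could look, using Node's built-in `crypto` module. The function names, the Unix-millisecond timestamp format, and the choice to sign `<timestamp>.<body>` (so the timestamp cannot be swapped independently of the payload) are assumptions, not settled design.
+
+```typescript
+import { createHmac, timingSafeEqual } from "node:crypto";
+
+const MAX_AGE_MS = 5 * 60 * 1000; // 5-minute replay window
+
+// Assumption: the HMAC covers "<timestamp>.<body>", not the body alone.
+function signPayload(secret: string, timestamp: string, body: string): string {
+  const mac = createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
+  return `sha256=${mac}`;
+}
+
+// Receiver-side check: timestamp freshness first, then a constant-time signature comparison.
+function verifyDelivery(
+  secret: string,
+  headers: { signature: string; timestamp: string },
+  body: string,
+  nowMs: number,
+): boolean {
+  const sentAt = Number(headers.timestamp);
+  // Also rejects timestamps more than 5 minutes in the future.
+  if (!Number.isFinite(sentAt) || Math.abs(nowMs - sentAt) > MAX_AGE_MS) return false;
+  const expected = Buffer.from(signPayload(secret, headers.timestamp, body));
+  const received = Buffer.from(headers.signature);
+  // timingSafeEqual throws on length mismatch, so compare lengths first.
+  return expected.length === received.length && timingSafeEqual(expected, received);
+}
+```
+
+The dispatcher would send `signPayload(...)` as `X-Webhook-Signature` and the same timestamp string as `X-Webhook-Timestamp`; subscribers would run the equivalent of `verifyDelivery` against the raw request body.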
+
+**Retry policy:**
+
+- 3 attempts with exponential backoff: 10s → 60s → 300s
+- After 3 failures: mark subscription as `failing`, admin notification
+- After 10 consecutive failures: auto-disable subscription
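+
+The policy above can be expressed as two small pure functions. This is a sketch under assumptions: each listed delay precedes one retry of the same delivery, and the health thresholds count consecutive failures on the subscription. `retryDelayMs` and `subscriptionHealth` are hypothetical names.
+
+```typescript
+const RETRY_DELAYS_MS = [10_000, 60_000, 300_000]; // 10s, 60s, 300s
+const FAILING_AFTER = 3;
+const AUTO_DISABLE_AFTER = 10;
+
+type SubscriptionHealth = "healthy" | "failing" | "disabled";
+
+// failedAttempts is 1 after the first failure of a delivery, 2 after the second, and so on.
+// Returns the wait before the next retry, or null when no retry should be scheduled.
+function retryDelayMs(failedAttempts: number): number | null {
+  return RETRY_DELAYS_MS[failedAttempts - 1] ?? null;
+}
+
+// Maps the subscription's consecutive failure count to its health state.
+function subscriptionHealth(consecutiveFailures: number): SubscriptionHealth {
+  if (consecutiveFailures >= AUTO_DISABLE_AFTER) return "disabled";
+  if (consecutiveFailures >= FAILING_AFTER) return "failing";
+  return "healthy";
+}
+```
+
+Keeping both functions free of I/O makes the policy easy to unit test and to change without touching the HTTP delivery code.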
+
+**Admin UI:**
+
+- `/ops/webhooks` page: list subscriptions, create/edit/delete, test delivery
+- Delivery log: status (success/failed/retrying), response code, duration, payload preview
+- Per-subscription health indicator (green/yellow/red based on recent success rate)
+
+**Cosmos containers:**
+
+- `webhook_subscriptions` (pk: `/productId`) — id, url, secret, events[], enabled, failureCount, lastDeliveryAt
+- `webhook_deliveries` (pk: `/subscriptionId:yyyyMM`) — id, event, status, attempts[], responseCode, duration
+
+---
+
+#### 2.4 Async Event Bus / Internal Pub-Sub
+
+**Why:** Today everything is synchronous request-response. As the platform grows, many operations should be fire-and-forget: audit log writes, webhook delivery, email sending, telemetry cluster updates, usage tracking. Without decoupling, any slow downstream operation blocks the API response.
+
+**Current state:** Some operations are already fire-and-forget, but their promise rejections go unhandled (e.g., telemetry cluster updates). There is no formal event bus.
+
+**Proposed design:**
+
+```
+packages/events/
+├── src/
+│   ├── index.ts     — EventBus class, typed event definitions
+│   ├── types.ts     — PlatformEvent union type, EventHandler interface
+│   └── memory.ts    — In-memory implementation (default)
+```
+
+**Event flow:**
+
+```
+API route handler
+  → bus.emit('user.created', { userId, email, plan })
+    → [handler] audit.record()
+    → [handler] webhook.dispatch()
+    → [handler] email.sendWelcome()
+    → [handler] analytics.track()
+```
+
+**Implementation options:**
+
+- **Phase 1:** In-memory `EventEmitter` wrapper with typed events (zero dependencies)
+- **Phase 2:** Azure Service Bus adapter for cross-service events
+- **Phase 3:** Azure Event Grid for external consumer webhooks
+
+**Typed event definitions (Zod):**
+
+```typescript
+const PlatformEvents = {
+  'user.created': z.object({ userId: z.string(), email: z.string(), plan: z.string() }),
+  'user.deleted': z.object({ userId: z.string() }),
+  'subscription.changed': z.object({
+    subscriptionId: z.string(),
+    oldPlan: z.string(),
+    newPlan: z.string(),
+  }),
+  'payment.failed': z.object({ invoiceId: z.string(), userId: z.string() }),
+  // ... all events from webhook catalog
+} as const;
+```
+
+**Benefits:**
+
+- Audit logging becomes a subscriber, not inline code
+- Webhook delivery becomes a subscriber, not inline code
+- Email sending becomes a subscriber, not inline code
+- New features can subscribe to events without modifying existing modules
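+
+A dependency-free sketch of the Phase 1 in-memory bus, assuming a plain TypeScript event map in place of the Zod schemas. `PlatformEventMap` and `InMemoryEventBus` are illustrative names. Each handler runs inside its own `try`/`catch`, so a failing subscriber neither affects the others nor produces an unhandled rejection.
+
+```typescript
+// Subset of the event catalog, expressed as a type map for this sketch.
+interface PlatformEventMap {
+  "user.created": { userId: string; email: string; plan: string };
+  "user.deleted": { userId: string };
+}
+
+type Handler<T> = (payload: T) => void | Promise<void>;
+
+class InMemoryEventBus {
+  private handlers = new Map<string, Handler<unknown>[]>();
+
+  // Registers a subscriber; the payload type is inferred from the event name.
+  on<K extends keyof PlatformEventMap>(event: K, handler: Handler<PlatformEventMap[K]>): void {
+    const list = this.handlers.get(event) ?? [];
+    list.push(handler as Handler<unknown>);
+    this.handlers.set(event, list);
+  }
+
+  // Runs handlers in registration order and resolves to the number that failed.
+  async emit<K extends keyof PlatformEventMap>(
+    event: K,
+    payload: PlatformEventMap[K],
+  ): Promise<number> {
+    let failed = 0;
+    for (const handler of this.handlers.get(event) ?? []) {
+      try {
+        await handler(payload);
+      } catch (err) {
+        failed += 1;
+        console.error(`handler for ${event} failed`, err);
+      }
+    }
+    return failed;
+  }
+}
+```
+
+A route handler that should not wait for subscribers can call `void bus.emit('user.created', payload)` and return immediately; because `emit` catches handler errors itself, dropping the promise is safe.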
+
+---
+
+#### 2.5 Missing Auth Flows — Password Reset & Email Verification
+
+**Why:** The auth module has login, register, SSO, and refresh, but **no password reset** and **no email verification**. These are table stakes for any production auth system.
+
+**Current state:** If a user forgets their password, there is no recovery path. Registration accepts any email without verification.
+
+**Proposed additions to `auth` module:**
+
+**Password reset flow:**
+
+1. `POST /auth/forgot-password` — accepts `{ email, productId }`, generates a time-limited reset token (UUID), stores hash in `password_reset_tokens` container, sends email with reset link (via delivery module §2.2)
+2. `POST /auth/reset-password` — accepts `{ token, newPassword }`, validates token, updates `passwordHash`, invalidates token, optionally revokes all sessions (§2.7)
+
+**Email verification flow:**
+
+1. On register: generate verification token, store in `email_verifications` container, send email
+2. `POST /auth/verify-email` — accepts `{ token }`, marks user email as verified
+3. `POST /auth/resend-verification` — rate-limited, re-sends verification email
+4. Add `emailVerified: boolean` field to `UserDoc`
+
+**Reset token document:**
+
+```typescript
+interface PasswordResetToken {
+  id: string; // UUID
+  productId: string;
+  userId: string;
+  tokenHash: string; // SHA-256 hash of the token (raw token sent via email)
+  expiresAt: string; // 1 hour from creation
+  usedAt?: string;
+  createdAt: string;
+}
+```
+
+**Security considerations:**
+
+- Store hash of token, not raw token (same pattern as API tokens)
+- Tokens expire in 1 hour
+- Rate limit: 3 reset requests per email per hour
+- After successful reset, invalidate all existing sessions
+- Log all reset attempts to audit
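+
+The hash-only storage rule and the 1-hour expiry could be sketched as follows, using Node's built-in `crypto` module. `issueResetToken` and `isResetTokenValid` are hypothetical helper names; the returned fields mirror `PasswordResetToken`.
+
+```typescript
+import { createHash, randomUUID } from "node:crypto";
+
+const RESET_TOKEN_TTL_MS = 60 * 60 * 1000; // 1 hour
+
+function hashToken(rawToken: string): string {
+  return createHash("sha256").update(rawToken).digest("hex");
+}
+
+// Returns the raw token (for the email link only) plus the fields to persist.
+// The raw token itself must never be written to the database.
+function issueResetToken(nowMs: number): { rawToken: string; tokenHash: string; expiresAt: string } {
+  const rawToken = randomUUID();
+  return {
+    rawToken,
+    tokenHash: hashToken(rawToken),
+    expiresAt: new Date(nowMs + RESET_TOKEN_TTL_MS).toISOString(),
+  };
+}
+
+// A token is valid only if it is unused, unexpired, and hashes to the stored value.
+function isResetTokenValid(
+  stored: { tokenHash: string; expiresAt: string; usedAt?: string },
+  rawToken: string,
+  nowMs: number,
+): boolean {
+  return (
+    stored.usedAt === undefined &&
+    nowMs < Date.parse(stored.expiresAt) &&
+    stored.tokenHash === hashToken(rawToken)
+  );
+}
+```
+
+In `POST /auth/reset-password`, the handler would look the document up by `hashToken(token)`, run the validity check, and set `usedAt` in the same step that updates `passwordHash`.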
+
+**Cosmos container:**
+
+- `password_reset_tokens` (pk: `/productId`) — short-lived; a 24h Cosmos TTL removes documents automatically, well after the 1-hour token validity has lapsed
+
+**Dependency:** Requires email delivery (§2.2) for sending reset links and verification emails. Can ship the endpoints first with console-logged URLs for dev/testing.
+
+---
+
+#### 2.6 Public Status Page
+
+**Why:** Users and admins need a single place to check if services are operational. The health-check script exists but has no user-facing output.
+
+**Current state:** `monitoring/health-check.ts` polls services and prints to stdout. No persistent status, no incident history, no public URL.
+
+**Proposed design:**
+
+**Option A — Self-hosted (minimal):**
+
+```
+platform-service/src/modules/status/
+├── types.ts         — ServiceStatus, Incident, MaintenanceWindow schemas
+├── repository.ts    — Cosmos: service_status, incidents containers
+├── poller.ts        — Periodic health poll (reuses @bytelyst/monitoring)
+└── routes.ts        — Public: GET /public/status, GET /public/status/history
+```
+
+**Option B — External (Instatus, Statuspage, or Upptime):**
+
+- Upptime (GitHub-based, free, open-source) — runs as a GitHub Action, publishes to GitHub Pages
+- Better for public credibility (hosted on a separate domain)
+
+**Recommendation:** Option A for internal/admin use, Option B for public-facing.
+
+**Status page data model:**
+
+| Field                | Type       | Description                                            |
+| -------------------- | ---------- | ------------------------------------------------------ |
+| `services`           | array      | Current status per service (operational/degraded/down) |
+| `incidents`          | array      | Active and past incidents with timeline                |
+| `maintenanceWindows` | array      | Scheduled maintenance with start/end times             |
+| `overallStatus`      | enum       | `operational` / `degraded` / `major_outage`            |
+| `lastCheckedAt`      | ISO string | When the poller last ran                               |
+
+**Admin UI:**
+
+- `/ops/status` page (or extend existing Mission Control `/ops`): service health cards with history sparklines
+- Incident management: create/update/resolve incidents with public-facing messages
+- Maintenance scheduling: create windows with auto-banners
+
+---
+
+### P1 — Operational Maturity
+
+These components improve reliability, debuggability, and operational efficiency. Not launch-blocking, but critical for a team running production services.
+
+---
+
+#### 2.7 Session Management & Active Devices
+
+**Why:** Licenses track `deviceIds` but there's no concept of active sessions. Users can't see where they're logged in. Admins can't force-revoke a compromised session. "Sign out all devices" is impossible.
+
+**Current state:** JWT tokens with expiry. No session tracking. No revocation list. Refresh tokens are stateless.
+
+**Proposed design:**
+
+```
+platform-service/src/modules/sessions/
+├── types.ts         — SessionDoc, CreateSessionInput schemas
+├── repository.ts    — Cosmos: sessions container (pk: /userId)
+├── middleware.ts    — Session validation (check revocation on each request)
+└── routes.ts        — User: list my sessions, revoke one, revoke all
+                     — Admin: list user sessions, force-revoke
+```
+
+**Session document:**
+
+```typescript
+interface SessionDoc {
+  id: string; // session ID (embedded in JWT)
+  productId: string;
+  userId: string;
+  deviceId?: string; // linked to license device
+  platform: string; // ios, android, desktop, web
+  ipAddress: string;
+  userAgent: string;
+  lastActiveAt: string;
+  createdAt: string;
+  revokedAt?: string;
+  expiresAt: string;
+}
+```
+
+**Endpoints:**
+
+- `GET /sessions` — list my active sessions
+- `DELETE /sessions/:id` — revoke specific session
+- `DELETE /sessions` — revoke all sessions (sign out everywhere)
+- `GET /sessions/user/:userId` — admin: list user's sessions
+- `DELETE /sessions/user/:userId` — admin: force-revoke all
+
+**Integration:** Refresh token endpoint creates a session. Auth middleware checks that the session isn't revoked (a Cosmos point-read needs both the session ID and the `userId` partition key value; the result is cached in-memory with a short TTL).
+
+---
+
+#### 2.8 Database Migration & Schema Evolution Tracker
+
+**Why:** Cosmos DB is schemaless, but breaking changes still happen: new required fields, partition key changes, index policy updates, container renames. Without tracking, deployments are error-prone and rollbacks are impossible.
+
+**Current state:** No migration tracking. Schema changes are applied ad-hoc.
+
+**Proposed design:**
+
+```
+platform-service/src/migrations/
+├── runner.ts        — Run pending migrations on startup (idempotent)
+├── registry.ts      — List of migration files, ordered by version
+└── migrations/
+    ├── 001_add_productId_to_legacy_users.ts
+    ├── 002_create_telemetry_containers.ts
+    └── ...
+```
+
+**Migration document (in `migrations` container):**
+
+```typescript
+interface MigrationDoc {
+  id: string; // "001_add_productId_to_legacy_users"
+  productId: string; // "platform"
+  version: number;
+  description: string;
+  appliedAt: string;
+  durationMs: number;
+  status: 'applied' | 'failed' | 'rolled_back';
+  error?: string;
+}
+```
+
+**Behavior:**
+
+- On service startup, runner checks `migrations` container for applied versions
+- Runs any unapplied migrations in order
+- Each migration is idempotent (safe to re-run)
+- Failed migrations are recorded but don't block startup (logged as warnings)
+- Admin UI: `/ops/migrations` page showing applied/pending/failed
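+
+The startup behavior could look roughly like this. `Migration`, `MigrationRecord`, and `runPendingMigrations` are hypothetical names, and the `record` callback stands in for a write to the `migrations` container. One assumption goes beyond the list above: the runner stops at the first failure so later migrations never run out of order, while still returning normally so startup continues.
+
+```typescript
+interface Migration {
+  id: string;
+  version: number;
+  up: () => Promise<void>; // must be idempotent
+}
+
+interface MigrationRecord {
+  id: string;
+  version: number;
+  status: "applied" | "failed";
+  durationMs: number;
+  error?: string;
+}
+
+// Runs unapplied migrations in version order and records every outcome.
+// Never throws on a migration failure: the failure is recorded and logged as a warning.
+async function runPendingMigrations(
+  migrations: Migration[],
+  appliedIds: ReadonlySet<string>,
+  record: (doc: MigrationRecord) => Promise<void>,
+): Promise<MigrationRecord[]> {
+  const pending = migrations
+    .filter((m) => !appliedIds.has(m.id))
+    .sort((a, b) => a.version - b.version);
+  const results: MigrationRecord[] = [];
+  for (const m of pending) {
+    const startedAt = Date.now();
+    let doc: MigrationRecord;
+    try {
+      await m.up();
+      doc = { id: m.id, version: m.version, status: "applied", durationMs: Date.now() - startedAt };
+    } catch (err) {
+      const error = err instanceof Error ? err.message : String(err);
+      doc = { id: m.id, version: m.version, status: "failed", durationMs: Date.now() - startedAt, error };
+    }
+    await record(doc);
+    results.push(doc);
+    if (doc.status === "failed") {
+      console.warn(`migration ${m.id} failed; skipping later migrations`);
+      break; // assumption: preserve ordering rather than continue past a failure
+    }
+  }
+  return results;
+}
+```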
+
+---
+
+#### 2.9 Data Export & Bulk Operations
+
+**Why:** Admins regularly need to export users as CSV, export audit logs, apply bulk status updates, and revoke licenses in bulk. Today these tasks require direct database queries.
+
+**Current state:** Waitlist has a CSV export endpoint. Nothing else supports bulk operations.
+
+**Proposed design:**
+
+```
+platform-service/src/modules/exports/
+├── types.ts         — ExportJob, ExportFormat schemas
+├── repository.ts    — Cosmos: export_jobs container
+├── workers/
+│   ├── users.ts     — Export users as CSV/JSON
+│   ├── audit.ts     — Export audit log
+│   ├── telemetry.ts — Export telemetry events
+│   ├── usage.ts     — Export usage data
+│   └── subscriptions.ts — Export subscriptions
+└── routes.ts        — POST /exports (start), GET /exports (list), GET /exports/:id/download
+```
+
+**Flow:**
+
+1. Admin sends `POST /exports` with a body such as `{ type: 'users', format: 'csv', filters: { plan: 'free' } }`
+2. Background job runs query, writes result to blob storage
+3. Job status updates: `pending` → `processing` → `ready` / `failed`
+4. Admin downloads from signed blob URL
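+
+The job status lifecycle in step 3 could be enforced with a small transition table. A minimal sketch: the status names come from the flow above, while `EXPORT_TRANSITIONS`, `advanceExport`, and the choice to let a `pending` job fail directly are assumptions.
+
+```typescript
+type ExportStatus = 'pending' | 'processing' | 'ready' | 'failed';
+
+// Allowed moves per status (assumption: 'ready' and 'failed' are terminal).
+const EXPORT_TRANSITIONS: Record<ExportStatus, ExportStatus[]> = {
+  pending: ['processing', 'failed'],
+  processing: ['ready', 'failed'],
+  ready: [],
+  failed: [],
+};
+
+// Returns the next status, or throws if the move is not allowed.
+function advanceExport(current: ExportStatus, next: ExportStatus): ExportStatus {
+  if (EXPORT_TRANSITIONS[current].indexOf(next) === -1) {
+    throw new Error(`invalid export transition: ${current} -> ${next}`);
+  }
+  return next;
+}
+```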
+
+**Supported exports:**
+
+- Users (with filters: plan, status, date range)
+- Audit log (with filters: action, userId, date range)
+- Telemetry events (with filters: platform, eventType, date range)
+- Usage records (with filters: userId, date range)
+- Subscriptions (with filters: plan, status)
+- Licenses (with filters: status, plan)
+
+**Admin UI:**
+
+- `/ops/exports` page: create new export, list past exports, download links
+- Progress indicator for running exports
+- Auto-cleanup: delete export blobs after 7 days
+
+---
+
+#### 2.10 Maintenance Mode & Graceful Degradation
+
+**Why:** Kill switch is binary (on/off per product). Need nuanced control: read-only mode, specific features disabled, custom banner messages, admin bypass, scheduled windows.
+
+**Current state:** `settings/kill-switch` endpoint returns boolean per product. Clients check and fully disable themselves.
+
+**Proposed design:**
+
+Extend the existing `settings` module:
+
+```typescript
+interface MaintenanceConfig {
+  mode: 'off' | 'read_only' | 'maintenance' | 'emergency';
+  message: string; // Shown to users
+  adminMessage?: string; // Shown to admins
+  bypassRoles: string[]; // Roles that can bypass (e.g., ['admin', 'super_admin'])
+  bypassIPs: string[]; // IP addresses that bypass
+  scheduledStart?: string; // ISO — for planned maintenance
+  scheduledEnd?: string;
+  affectedServices: string[]; // ['api', 'dictation', 'extraction'] or ['*']
+  updatedAt: string;
+  updatedBy: string;
+}
+```
+
+**Modes:**
+
+- `off` — Normal operation
+- `read_only` — GET requests allowed, writes blocked (for database maintenance)
+- `maintenance` — All requests return 503 with message (except admin bypass)
+- `emergency` — Kill switch + maintenance message + all clients show error
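+
+A server-side gate for these modes might look like the sketch below. It is illustrative only: `MaintenanceGateConfig`, `IncomingCall`, and `evaluateMaintenance` are hypothetical names, and it assumes that bypass rules apply in `read_only` and `maintenance` but not in `emergency`, that `HEAD`/`OPTIONS` count as reads, and that blocked requests get a 503.
+
+```typescript
+type MaintenanceMode = 'off' | 'read_only' | 'maintenance' | 'emergency';
+
+// Subset of MaintenanceConfig needed for the decision.
+interface MaintenanceGateConfig {
+  mode: MaintenanceMode;
+  message: string;
+  bypassRoles: string[];
+  bypassIPs: string[];
+}
+
+interface IncomingCall {
+  method: string; // HTTP method
+  ip: string;
+  role?: string;
+}
+
+interface GateDecision {
+  allowed: boolean;
+  httpStatus?: number; // set only when blocked
+  message?: string;
+}
+
+const READ_METHODS = ['GET', 'HEAD', 'OPTIONS'];
+
+function evaluateMaintenance(cfg: MaintenanceGateConfig, call: IncomingCall): GateDecision {
+  const allow: GateDecision = { allowed: true };
+  const block: GateDecision = { allowed: false, httpStatus: 503, message: cfg.message };
+
+  if (cfg.mode === 'off') return allow;
+  if (cfg.mode === 'emergency') return block; // assumption: no bypass in emergency
+
+  const bypass =
+    (call.role !== undefined && cfg.bypassRoles.indexOf(call.role) !== -1) ||
+    cfg.bypassIPs.indexOf(call.ip) !== -1;
+  if (bypass) return allow;
+
+  if (cfg.mode === 'read_only') {
+    return READ_METHODS.indexOf(call.method.toUpperCase()) !== -1 ? allow : block;
+  }
+  return block; // 'maintenance'
+}
+```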
+
+**Endpoints:**
+
+- `GET /settings/maintenance` — Public: check current mode + message
+- `PUT /settings/maintenance` — Admin: update mode, message, bypass rules
+- `GET /settings/maintenance/schedule` — Upcoming maintenance windows
+
+**Client integration:**
+
+- Clients poll `/settings/maintenance` alongside kill-switch check
+- If `mode !== 'off'`, show banner with `message`
+- If `mode === 'read_only'`, disable write operations with user-facing explanation
+
+**Admin UI:**
+
+- Extend existing Settings page or add `/ops/maintenance`
+- Mode toggle (off/read-only/maintenance/emergency)
+- Message editor with preview
+- Schedule builder with start/end date pickers
+- Bypass IP allowlist management
+
+---
+
+#### 2.11 Rate Limit Dashboard & IP Allow/Deny Lists
+
+**Why:** `ratelimit` module exists but admins have zero visibility into who's being rate-limited, and no ability to allowlist VIP users or denylist abusive IPs.
+
+**Current state:** In-memory sliding window rate limiter with configurable rules. No persistence, no admin visibility.
+
+**Proposed design:**
+
+Extend `ratelimit` module:
+
+```typescript
+interface RateLimitEntry {
+  key: string; // userId or IP
+  productId: string;
+  currentCount: number;
+  windowStart: string;
+  wasLimited: boolean;
+  lastLimitedAt?: string;
+}
+
+interface IPRule {
+  id: string;
+  productId: string;
+  ip: string; // CIDR notation supported
+  action: 'allow' | 'deny';
+  reason: string;
+  createdBy: string;
+  createdAt: string;
+  expiresAt?: string; // Temporary blocks
+}
+```
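+
+Evaluating `IPRule` entries against a request IP could work roughly as follows. This sketch handles IPv4 only and uses hypothetical helpers (`IpRuleLite`, `matchesCidr`, `evaluateIp`); the "deny wins over allow" precedence is an assumption, not something the design above specifies.
+
+```typescript
+// Minimal stand-in for IPRule; only the fields the check needs.
+interface IpRuleLite {
+  ip: string; // single IPv4 address or CIDR block
+  action: 'allow' | 'deny';
+  expiresAt?: string; // ISO timestamp
+}
+
+function ipv4ToNumber(ip: string): number {
+  const parts = ip.split('.').map(Number);
+  const valid =
+    parts.length === 4 &&
+    parts.every((p) => Number.isInteger(p) && p >= 0 && p <= 255);
+  if (!valid) throw new Error(`invalid IPv4 address: ${ip}`);
+  return parts.reduce((acc, p) => acc * 256 + p, 0);
+}
+
+// True when `ip` falls inside `cidr`; a bare address is treated as /32.
+function matchesCidr(ip: string, cidr: string): boolean {
+  const slash = cidr.indexOf('/');
+  const base = slash === -1 ? cidr : cidr.slice(0, slash);
+  const bits = slash === -1 ? 32 : Number(cidr.slice(slash + 1));
+  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
+    throw new Error(`invalid prefix length in: ${cidr}`);
+  }
+  const blockSize = Math.pow(2, 32 - bits);
+  return (
+    Math.floor(ipv4ToNumber(ip) / blockSize) ===
+    Math.floor(ipv4ToNumber(base) / blockSize)
+  );
+}
+
+// Assumption: among unexpired matching rules, deny takes precedence over allow.
+function evaluateIp(
+  ip: string,
+  rules: IpRuleLite[],
+  now: Date = new Date(),
+): 'allow' | 'deny' | 'none' {
+  const active = rules.filter(
+    (r) =>
+      (r.expiresAt === undefined || Date.parse(r.expiresAt) > now.getTime()) &&
+      matchesCidr(ip, r.ip),
+  );
+  if (active.some((r) => r.action === 'deny')) return 'deny';
+  if (active.some((r) => r.action === 'allow')) return 'allow';
+  return 'none';
+}
+```
+
+A result of `'none'` would fall through to the normal sliding-window check.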
+
+**Additional endpoints:**
+
+- `GET /ratelimit/stats` — Admin: top rate-limited keys, total 429s in last hour/day
+- `GET /ratelimit/blocked` — Admin: currently blocked keys
+- `POST /ratelimit/ip-rules` — Admin: add IP allow/deny rule
+- `GET /ratelimit/ip-rules` — Admin: list rules
+- `DELETE /ratelimit/ip-rules/:id` — Admin: remove rule
+
+**Admin UI:**
+
+- `/ops/rate-limits` page: real-time rate limit stats
+- Top offenders table (most 429 responses)
+- IP rules management (allow/deny with expiry)
+- Per-user rate limit override
+
+---
+
+### P2 — Product Intelligence
+
+These components provide deeper insight into product health, user behavior, and experiment outcomes. They transform raw data into actionable intelligence.
+
+---
+
+#### 2.12 A/B Testing & Experiments Framework
+
+**Why:** Feature flags exist but only support on/off with percentage rollout. No variant assignment, metric collection, or statistical significance calculation.
+
+**Current state:** `flags` module with boolean flags and FNV-1a deterministic rollout.
+
+**Proposed design:**
+
+Extend `flags` module or create sibling `experiments` module:
+
+```
+platform-service/src/modules/experiments/
+├── types.ts         — Experiment, Variant, ExperimentMetric schemas
+├── repository.ts    — Cosmos: experiments container
+├── assignment.ts    — Deterministic variant assignment (extend FNV-1a)
+├── analysis.ts      — Statistical significance calculation
+└── routes.ts        — Admin CRUD + results endpoint
+```
+
+**Experiment document:**
+
+```typescript
+interface ExperimentDoc {
+  id: string;
+  productId: string;
+  name: string;
+  hypothesis: string;
+  status: 'draft' | 'running' | 'paused' | 'concluded';
+  variants: Variant[]; // [{id: 'control', weight: 50}, {id: 'treatment', weight: 50}]
+  targetingRules: Record<string, unknown>; // Same shape as flag targeting rules
+  primaryMetric: string; // e.g., 'dictation_completed_rate'
+  secondaryMetrics: string[];
+  startedAt?: string;
+  concludedAt?: string;
+  winningVariant?: string;
+  sampleSize: number;
+  results?: ExperimentResults;
+}
+```
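+
+Deterministic variant assignment can reuse the same hashing idea as the rollout. The sketch below is an assumption about how `assignment.ts` might work, not the existing `flags` code: it uses standard 32-bit FNV-1a over UTF-16 code units (identical to byte-wise hashing for ASCII keys) and maps the hash onto cumulative variant weights. `WeightedVariant`, `fnv1a32`, and `assignVariant` are hypothetical names.
+
+```typescript
+interface WeightedVariant {
+  id: string;
+  weight: number; // relative weight, e.g. 50
+}
+
+// Standard 32-bit FNV-1a (offset basis 0x811c9dc5, prime 0x01000193).
+function fnv1a32(input: string): number {
+  let hash = 0x811c9dc5;
+  for (let i = 0; i < input.length; i++) {
+    hash ^= input.charCodeAt(i);
+    hash = Math.imul(hash, 0x01000193);
+  }
+  return hash >>> 0; // force unsigned
+}
+
+// Same (experimentId, userId) pair always lands in the same variant.
+function assignVariant(
+  experimentId: string,
+  userId: string,
+  variants: WeightedVariant[],
+): string {
+  const total = variants.reduce((sum, v) => sum + v.weight, 0);
+  if (total <= 0) throw new Error('variants must have a positive total weight');
+  const bucket = fnv1a32(`${experimentId}:${userId}`) % total;
+  let cumulative = 0;
+  for (const v of variants) {
+    cumulative += v.weight;
+    if (bucket < cumulative) return v.id;
+  }
+  throw new Error('no variant selected'); // not reachable when total > 0
+}
+```
+
+Including the experiment ID in the hash key keeps a user's bucket independent across experiments.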
+
+**Admin UI:**
+
+- `/experiments` page: list experiments, create new, view results
+- Results view: conversion rates per variant, confidence interval, statistical significance indicator
+- "Conclude" action: pick winner, auto-convert to feature flag
+
+---
+
+#### 2.13 Analytics Aggregation Pipeline
+
+**Why:** `usage` tracks raw events but there are no pre-aggregated rollups. Admin dashboard charts require expensive real-time queries. DAU/WAU/MAU, retention cohorts, and funnel analysis are impossible without rollups.
+
+**Current state:** Raw `usage_daily` records. No aggregation.
+
+**Proposed design:**
+
+```
+platform-service/src/modules/analytics/
+├── types.ts         — MetricRollup, CohortEntry, FunnelStep schemas
+├── repository.ts    — Cosmos: analytics_rollups container
+├── rollup-jobs/
+│   ├── dau-wau-mau.ts    — Daily/weekly/monthly active users
+│   ├── retention.ts      — Cohort retention (D1, D7, D14, D30)
+│   ├── funnel.ts         — Conversion funnels (signup → activate → dictate → subscribe)
+│   └── feature-adoption.ts — Per-feature usage rates
+└── routes.ts        — Admin: GET /analytics/dau, /retention, /funnel, /adoption
+```
+
+**Rollup schedule (via jobs module):**
+
+- DAU: every hour (incremental)
+- WAU/MAU: daily at 1am UTC
+- Retention cohorts: daily at 2am UTC
+- Funnels: daily at 2:30am UTC
+
+**Key metrics:**
+
+- **DAU/WAU/MAU** — with breakdown by platform, plan
+- **Retention cohorts** — "Of users who signed up in week X, what % are active in week X+1, X+4?"
+- **Conversion funnel** — signup → first dictation → 5th dictation → subscription
+- **Feature adoption** — % of active users using each major feature
+- **Revenue metrics** — MRR, churn rate, ARPU, LTV (from subscriptions + Stripe data)
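+
+The DAU/WAU/MAU rollup reduces to counting distinct users in a trailing window. A minimal sketch, assuming activity rows carry a UTC `YYYY-MM-DD` date and that WAU and MAU mean trailing 7-day and 30-day windows; `ActivityRow` and `activeUsers` are hypothetical names.
+
+```typescript
+interface ActivityRow {
+  userId: string;
+  date: string; // 'YYYY-MM-DD', UTC
+}
+
+const DAY_MS = 86400000;
+
+// Counts distinct users active in the `windowDays` days ending on `asOf` (inclusive).
+function activeUsers(rows: ActivityRow[], asOf: string, windowDays: number): number {
+  const end = Date.parse(`${asOf}T00:00:00Z`);
+  const start = end - (windowDays - 1) * DAY_MS;
+  const users = new Set<string>();
+  for (const row of rows) {
+    const t = Date.parse(`${row.date}T00:00:00Z`);
+    if (t >= start && t <= end) users.add(row.userId);
+  }
+  return users.size;
+}
+```
+
+Usage: `activeUsers(rows, '2026-02-17', 1)` for DAU, `7` for WAU, `30` for MAU. A production rollup would read pre-filtered partitions rather than scan all rows in memory.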
+
+**Admin UI:**
+
+- Extend dashboard home or create `/analytics` page
+- Charts: DAU/WAU/MAU line chart, retention heatmap, funnel bar chart, MRR trend
+
+---
+
+#### 2.14 In-App Feedback & Support Widget
+
+**Why:** Tracker handles issue tracking but there's no way for end users to submit feedback directly from the app. Bug reports with device context, NPS surveys, and feature requests should flow into the tracker automatically.
+
+**Current state:** Public roadmap allows feature submissions and voting. No in-app feedback widget.
+
+**Proposed design:**
+
+```
+platform-service/src/modules/feedback/
+├── types.ts         — FeedbackEntry, FeedbackType, DeviceContext schemas
+├── repository.ts    — Cosmos: feedback container (pk: /productId)
+└── routes.ts        — POST /feedback (authenticated), GET /feedback (admin query)
+```
+
+**Feedback types:**
+
+- `bug_report` — with device context, screenshot URL (blob), reproduction steps
+- `feature_request` — auto-creates tracker item in `items` module
+- `nps_survey` — score (0-10), comment, context
+- `general` — free-form text
+
+**Client integration:**
+
+- Shake-to-report (iOS/Android) or keyboard shortcut (Desktop)
+- Auto-attach: device model, OS version, app version, current screen, last 10 telemetry events
+- Screenshot capture (optional, privacy-respecting)
+
+**Admin UI:**
+
+- `/feedback` page: list feedback with filters (type, platform, date range, NPS score range)
+- Quick actions: convert to tracker item, reply, dismiss
+- NPS dashboard: score distribution over time, detractor/promoter breakdown
+
+---
+
+#### 2.15 User Impersonation / Admin Shadow Mode
+
+**Why:** When a user reports a bug, admins need to see exactly what they see. Without impersonation, debugging requires asking users for screenshots and steps, which is slow and unreliable.
+
+**Current state:** No impersonation capability.
+
+**Proposed design:**
+
+**Endpoint:**
+
+- `POST /auth/impersonate` — Admin only. Accepts `{ targetUserId }`. Returns a scoped shadow token.
+
+**Shadow token properties:**
+
+- Contains `impersonatedBy: adminUserId` claim
+- Read-only by default (no writes unless explicitly allowed)
+- Expires in 15 minutes (non-renewable)
+- All actions logged to audit with `impersonatedBy` field
+- Visible banner in dashboard: "You are viewing as [user name] — all actions are audited"
+
+**Admin UI:**
+
+- On the user detail page (`/users/:id`), add "View as User" button
+- Opens user dashboard in new tab with shadow token
+- Impersonation sessions listed on `/ops/audit` with filter
+
+---
+
+#### 2.16 Changelog & In-App Release Notes
+
+**Why:** Users should know what changed in each release. A changelog system also serves as internal documentation and can be shown as a "What's New" modal in the app.
+
+**Current state:** `CHANGELOG.md` exists in the repo but nothing in-app.
+
+**Proposed design:**
+
+```
+platform-service/src/modules/changelog/
+├── types.ts         — ChangelogEntry, ReleaseNote schemas
+├── repository.ts    — Cosmos: changelog container (pk: /productId)
+└── routes.ts        — Public: GET /changelog (paginated)
+                     — Admin: CRUD changelog entries
+```
+
+**Entry document:**
+
+```typescript
+interface ChangelogEntry {
+  id: string;
+  productId: string;
+  version: string; // "1.2.0"
+  title: string;
+  body: string; // Markdown
+  category: 'feature' | 'improvement' | 'bugfix' | 'security';
+  platforms: string[]; // ['ios', 'android', 'desktop', 'web']
+  publishedAt?: string;
+  isDraft: boolean;
+  createdBy: string;
+}
+```
+
+**Client integration:**
+
+- App checks `GET /api/changelog?since=<lastSeenVersion>` on launch
+- If new entries exist, show "What's New" modal
+- User can dismiss; `lastSeenVersion` stored in settings
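+
+The `since=<lastSeenVersion>` filter needs a numeric version comparison, since plain string comparison would sort `1.10.0` before `1.9.0`. A minimal sketch, assuming versions are plain dotted numbers with no pre-release tags; `compareVersions`, `ChangelogEntryLite`, and `unseenEntries` are hypothetical names.
+
+```typescript
+// Returns 1 if a > b, -1 if a < b, 0 if equal. Missing segments count as 0.
+function compareVersions(a: string, b: string): number {
+  const pa = a.split('.').map(Number);
+  const pb = b.split('.').map(Number);
+  const len = Math.max(pa.length, pb.length);
+  for (let i = 0; i < len; i++) {
+    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
+    if (diff !== 0) return diff > 0 ? 1 : -1;
+  }
+  return 0;
+}
+
+// Subset of ChangelogEntry used by the filter.
+interface ChangelogEntryLite {
+  version: string;
+  title: string;
+  platforms: string[];
+  isDraft: boolean;
+}
+
+// Published entries for `platform` newer than `lastSeenVersion`, newest first.
+function unseenEntries(
+  entries: ChangelogEntryLite[],
+  lastSeenVersion: string,
+  platform: string,
+): ChangelogEntryLite[] {
+  return entries
+    .filter(
+      (e) =>
+        !e.isDraft &&
+        e.platforms.indexOf(platform) !== -1 &&
+        compareVersions(e.version, lastSeenVersion) > 0,
+    )
+    .sort((x, y) => compareVersions(y.version, x.version));
+}
+```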
+
+**Admin UI:**
+
+- `/changelog` page: create/edit/publish entries with Markdown editor
+- Preview mode before publishing
+- Schedule publishing for future date
+
+---
+
+### P3 — Scale & Polish
+
+These components are important for scale, security, and developer experience, but are lower urgency.
+
+---
+
+#### 2.17 CDN & Asset Pipeline
+
+**Why:** Blob storage serves files directly from Azure. No edge caching, no image optimization, no automatic resizing for avatars/thumbnails.
+
+**Proposed approach:**
+
+- Azure CDN or Cloudflare in front of blob storage
+- Image resize on upload (Sharp) for avatars: 64px, 128px, 256px variants
+- Cache headers: `Cache-Control: public, max-age=31536000, immutable` for content-addressed assets
+- Release binaries served via CDN for faster desktop app updates
+
+---
+
+#### 2.18 API Versioning Strategy
+
+**Why:** As external consumers appear (webhook integrations, third-party tools), breaking API changes need to be managed. Today all endpoints are unversioned.
+
+**Proposed approach:**
+
+- URL prefix: `/v1/api/...`
+- Deprecation headers: `Deprecation: @<unix-timestamp>` (RFC 9745) + `Sunset: <HTTP-date>` (RFC 8594)
+- Version lifecycle: `current` → `deprecated` (6 months notice) → `retired`
+- OpenAPI spec generated per version
+- Fastify plugin that routes to versioned handlers
+
+---
+
+#### 2.19 OpenAPI / Auto-Generated API Docs
+
+**Why:** Platform-service already passes `swagger` config to `createServiceApp()`, but Zod schemas aren't fully wired to route definitions. The admin `/docs` page is a markdown doc browser (not API docs). Auto-generated API docs from Zod schemas would be nearly free.
+
+**Current state:** `@fastify/swagger` is configured with title/description but route schemas aren't connected via `@fastify/type-provider-zod`. Swagger UI may already be partially served but without route-level detail.
+
+**Proposed approach:**
+
+- Wire `@fastify/type-provider-zod` to connect existing Zod schemas to Fastify route definitions
+- Verify `@fastify/swagger-ui` is serving at `/documentation` on platform-service
+- Add route-level `schema: { body, querystring, params, response }` using existing Zod schemas
+- Export OpenAPI JSON at `/documentation/json`
+- Admin dashboard links to platform-service Swagger UI
+
+---
+
+#### 2.20 Localization / i18n Service
+
+**Why:** Centralized string management for all platforms. When adding a new language, change one place, not four codebases.
+
+**Proposed approach:**
+
+- `translations` Cosmos container (pk: `/productId:locale`)
+- Admin UI: string management with translation status per locale
+- Client SDK: fetch translations on launch, cache locally
+- Fallback chain: requested locale → base locale → English
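+
+The fallback chain could be resolved on the client roughly like this. A minimal sketch: `TranslationCatalog`, `fallbackChain`, and `translate` are hypothetical names, the base locale is assumed to be the part before the first hyphen, and returning the key itself when nothing matches is an assumption.
+
+```typescript
+// locale -> string key -> translated text
+type TranslationCatalog = Record<string, Record<string, string>>;
+
+// 'fr-CA' -> ['fr-CA', 'fr', 'en']
+function fallbackChain(locale: string): string[] {
+  const chain = [locale];
+  const dash = locale.indexOf('-');
+  if (dash !== -1) chain.push(locale.slice(0, dash));
+  if (chain.indexOf('en') === -1) chain.push('en');
+  return chain;
+}
+
+function translate(catalog: TranslationCatalog, locale: string, key: string): string {
+  for (const candidate of fallbackChain(locale)) {
+    const value = catalog[candidate]?.[key];
+    if (value !== undefined) return value;
+  }
+  return key; // assumption: surface the key so missing strings are visible
+}
+```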
+
+---
+
+#### 2.21 Full-Text Search
+
+**Why:** Admin needs to search users by partial name/email. Users need to search memories/items. Cosmos SQL `CONTAINS()` is slow and doesn't rank results.
+
+**Proposed approach:**
+
+- **Phase 1:** Cosmos DB full-text search (preview feature, no extra cost)
+- **Phase 2:** Azure AI Search for richer capabilities (fuzzy matching, facets, suggestions)
+- Admin UI: unified search bar across entities (users, items, audit logs)
+
+---
+
+#### 2.22 Multi-Tenant Workspace / Org / Team Management
+
+**Why:** `productId` scopes data per product, but within a product there's no team or organization concept. Enterprise customers need: org hierarchy, team-scoped permissions, shared brains/workspaces.
+
+**Proposed design (future):**
+
+```
+users → belong to → organizations → have → teams → own → resources
+```
+
+This is a major architectural expansion. Defer until enterprise tier is validated.
+
+---
+
+#### 2.23 Data Retention & Lifecycle Policies
+
+**Why:** Telemetry has TTL. Other containers don't. Old audit logs, expired sessions, redeemed promos, and stale waitlist entries accumulate forever.
+
+**Proposed approach:**
+
+- Admin-configurable retention policies per container
+- Scheduled job (from §2.1) runs cleanup
+- Default policies: audit (365 days), telemetry (30 days), sessions (90 days), export files (7 days)
+- Admin UI: `/ops/retention` page showing policies and next cleanup run
+
+---
+
+#### 2.24 Automated Backup & Point-in-Time Restore
+
+**Why:** Azure Cosmos DB has continuous backup, but admin needs visibility and one-click restore capability.
+
+**Proposed approach:**
+
+- Admin UI: `/ops/backups` page showing Azure backup status
+- Manual export to blob (scheduled job from §2.1)
+- Restore button: triggers Azure Cosmos point-in-time restore API
+- Cross-region replication status indicator
+
+---
+
+#### 2.25 Billing Dunning & Payment Recovery
+
+**Why:** Stripe handles retries, but the platform needs to: notify users of failed payments, offer grace periods, and eventually downgrade plans.
+
+**Proposed flow:**
+
+1. `invoice.payment_failed` → send "payment failed" email (§2.2) + in-app banner
+2. After 3 failures (Stripe Smart Retries) → send "final warning" email
+3. After grace period (7 days) → downgrade to free plan + email notification
+4. All transitions logged to audit
+
+**Integration:** Stripe webhook handler (existing) + email delivery (§2.2) + scheduled job (§2.1) for grace period enforcement.
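+
+The flow above can be expressed as a single decision function that the webhook handler and the scheduled job both call. This is a sketch under stated assumptions: `DunningState`, `DunningAction`, and `nextDunningAction` are hypothetical names, and the grace period is assumed to start when the final warning is sent.
+
+```typescript
+type DunningAction =
+  | 'none'
+  | 'send_payment_failed'
+  | 'send_final_warning'
+  | 'downgrade_to_free';
+
+interface DunningState {
+  failedAttempts: number;
+  finalWarningAt?: string; // ISO timestamp, set once the final warning is sent
+}
+
+const DUNNING_DAY_MS = 86400000;
+
+function nextDunningAction(
+  state: DunningState,
+  now: Date,
+  graceDays: number = 7,
+): DunningAction {
+  if (state.failedAttempts <= 0) return 'none';
+  if (state.failedAttempts < 3) return 'send_payment_failed';
+  if (state.finalWarningAt === undefined) return 'send_final_warning';
+  // Assumption: grace period is measured from the final warning.
+  const graceEnds = Date.parse(state.finalWarningAt) + graceDays * DUNNING_DAY_MS;
+  return now.getTime() >= graceEnds ? 'downgrade_to_free' : 'none';
+}
+```
+
+Each returned action other than `none` would also be written to the audit log, per step 4.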
+
+---
+
+## 3. Implementation Priority Matrix
+
+| Phase        | Components                                 | Effort | Dependencies                     | Unlocks                                                    |
+| ------------ | ------------------------------------------ | ------ | -------------------------------- | ---------------------------------------------------------- |
+| **Sprint 1** | 2.1 Scheduled Jobs                         | M      | None                             | Foundation for all time-based operations                   |
+| **Sprint 1** | 2.4 Event Bus                              | S      | None                             | Decoupling for email, webhooks, audit                      |
+| **Sprint 2** | 2.2 Email Delivery                         | M      | 2.4 Event Bus                    | User communication (welcome, trial expiry, payment failed) |
+| **Sprint 2** | 2.5 Password Reset + Email Verify          | S      | 2.2 Email Delivery               | Auth completeness — table-stakes for production            |
+| **Sprint 3** | 2.3 Webhook Subscriptions                  | M      | 2.4 Event Bus                    | Third-party integrations, Zapier/Slack                     |
+| **Sprint 3** | 2.7 Session Management                     | S      | None                             | Security (sign out everywhere, revocation)                 |
+| **Sprint 4** | 2.10 Maintenance Mode                      | S      | None                             | Operational control during deployments                     |
+| **Sprint 4** | 2.9 Data Export                            | S      | 2.1 Jobs (for blob cleanup)      | Admin self-service, compliance                             |
+| **Sprint 5** | 2.13 Analytics Rollups                     | M      | 2.1 Jobs (for rollup scheduling) | Dashboard charts, business metrics                         |
+| **Sprint 5** | 2.19 OpenAPI Docs                          | S      | None                             | Developer experience, API discoverability                  |
+| **Sprint 6** | 2.6 Status Page                            | S      | None                             | User trust, incident communication                         |
+| **Sprint 6** | 2.16 Changelog                             | S      | None                             | User engagement, release communication                     |
+| **Sprint 7** | 2.11 Rate Limit Dashboard                  | S      | None                             | Ops visibility                                             |
+| **Sprint 7** | 2.25 Billing Dunning                       | S      | 2.1 Jobs + 2.2 Email             | Payment recovery automation                                |
+| **Later**    | 2.8, 2.12, 2.14–2.15, 2.17–2.18, 2.20–2.24 | Varies | —                                | Scale, polish, enterprise                                  |
+
+**Effort key:** S = Small (1–2 days), M = Medium (3–5 days), L = Large (1–2 weeks)
+
+**Critical path:** Event Bus (2.4) → Email Delivery (2.2) → Password Reset (2.5). These three should be the first items built, in that order.
+
+---
+
+## 4. New Cosmos Containers & Cost Impact
+
+Each new component introduces Cosmos containers. Cosmos DB Serverless charges per RU consumed + storage, so idle containers cost only storage (~$0.25/GB/month).
+
+| Component              | New Containers                                | Partition Key                             | Est. TTL        | Est. Daily RU                       |
+| ---------------------- | --------------------------------------------- | ----------------------------------------- | --------------- | ----------------------------------- |
+| **2.1 Jobs**           | `job_definitions`, `job_runs`                 | `/productId`, `/productId:jobName`        | runs: 90d       | ~50 RU (low volume)                 |
+| **2.2 Email/Push**     | `delivery_log`, `email_templates`             | `/productId:channel:yyyyMM`, `/productId` | log: 90d        | ~200 RU                             |
+| **2.3 Webhooks**       | `webhook_subscriptions`, `webhook_deliveries` | `/productId`, `/subscriptionId:yyyyMM`    | deliveries: 30d | ~100 RU                             |
+| **2.5 Password Reset** | `password_reset_tokens`                       | `/productId`                              | 24h auto        | ~10 RU                              |
+| **2.6 Status**         | `service_status`, `incidents`                 | `/productId`, `/productId`                | None            | ~20 RU                              |
+| **2.7 Sessions**       | `sessions`                                    | `/userId`                                 | 90d             | ~500 RU (read-heavy)                |
+| **2.8 Migrations**     | `migrations`                                  | `/productId`                              | None            | ~5 RU (startup only)                |
+| **2.9 Exports**        | `export_jobs`                                 | `/productId`                              | 30d             | ~20 RU                              |
+| **2.12 Experiments**   | `experiments`                                 | `/productId`                              | None            | ~50 RU                              |
+| **2.13 Analytics**     | `analytics_rollups`                           | `/productId:metric:period`                | None            | ~300 RU (write-heavy during rollup) |
+| **2.14 Feedback**      | `feedback`                                    | `/productId`                              | None            | ~50 RU                              |
+| **2.16 Changelog**     | `changelog`                                   | `/productId`                              | None            | ~10 RU                              |
+| **2.20 i18n**          | `translations`                                | `/productId:locale`                       | None            | ~100 RU (read-heavy, cacheable)     |
+| **2.23 Retention**     | `retention_policies`                          | `/productId`                              | None            | ~5 RU                               |
+
+**Total new containers:** 18 in the table above (across all phases)  
+**Existing containers:** ~25+ (across platform-service + dashboards)  
+**Cost impact:** Minimal for Serverless tier. Idle containers only consume storage, and active containers add burst RU during job runs.
+
+**Recommendation:** Register all new containers in `cosmos-init.ts` alongside existing ones. Use TTL liberally for transient data (tokens, deliveries, job runs) to keep storage bounded.
+
+---
+
+## 5. New Environment Variables
+
+New components will require additional env vars. All should be added to `.env.example` files in both repos and documented.
+
+| Component            | Variable                   | Example                          | Required                  |
+| -------------------- | -------------------------- | -------------------------------- | ------------------------- |
+| **2.1 Jobs**         | `JOB_RUNNER_ENABLED`       | `true`                           | No (default: true)        |
+| **2.1 Jobs**         | `JOB_TICK_INTERVAL_MS`     | `60000`                          | No (default: 60s)         |
+| **2.2 Email**        | `SENDGRID_API_KEY`         | `SG.xxx`                         | Yes (for email delivery)  |
+| **2.2 Email**        | `EMAIL_FROM_ADDRESS`       | `noreply@lysnrai.com`            | Yes                       |
+| **2.2 Email**        | `EMAIL_FROM_NAME`          | `LysnrAI`                        | No                        |
+| **2.2 Push**         | `APNS_KEY_ID`              | `ABC123`                         | Yes (for iOS push)        |
+| **2.2 Push**         | `APNS_TEAM_ID`             | `748N7QPX7J`                     | Yes                       |
+| **2.2 Push**         | `APNS_KEY_PATH`            | `./certs/AuthKey.p8`             | Yes                       |
+| **2.2 Push**         | `FCM_SERVICE_ACCOUNT_JSON` | `{...}`                          | Yes (for Android push)    |
+| **2.5 Auth**         | `PASSWORD_RESET_URL_BASE`  | `https://app.lysnrai.com/reset`  | Yes                       |
+| **2.5 Auth**         | `EMAIL_VERIFY_URL_BASE`    | `https://app.lysnrai.com/verify` | Yes                       |
+| **2.10 Maintenance** | `MAINTENANCE_MODE`         | `off`                            | No (default: off)         |
+| **2.10 Maintenance** | `MAINTENANCE_BYPASS_IPS`   | `10.0.0.1,10.0.0.2`              | No                        |
+| **2.19 OpenAPI**     | `SWAGGER_UI_ENABLED`       | `true`                           | No (default: true in dev) |
+
+**Secret management:** `SENDGRID_API_KEY`, `APNS_*`, and `FCM_*` should be added to Azure Key Vault as `lysnr-sendgrid-api-key`, `lysnr-apns-key-id`, etc. Update `LYSNR_SECRETS` in `@bytelyst/config` to include them.
+
+---
+
+## 6. Quick Reference — Where Things Live
+
+| Component                | Repo                      | Path                                            |
+| ------------------------ | ------------------------- | ----------------------------------------------- |
+| Platform-service modules | `learning_ai_common_plat` | `services/platform-service/src/modules/`        |
+| Shared packages          | `learning_ai_common_plat` | `packages/`                                     |
+| Admin dashboard          | `learning_voice_ai_agent` | `admin-dashboard-web/`                          |
+| User dashboard           | `learning_voice_ai_agent` | `user-dashboard-web/`                           |
+| Tracker dashboard        | `learning_voice_ai_agent` | `tracker-dashboard-web/`                        |
+| Docker Compose           | both repos                | `docker-compose.yml`                            |
+| Monitoring               | `learning_ai_common_plat` | `services/monitoring/`                          |
+| Design tokens            | `learning_ai_common_plat` | `packages/design-tokens/`                       |
+| Existing webhooks        | `learning_ai_common_plat` | `services/platform-service/src/lib/webhooks.ts` |
+| Telemetry design doc     | `learning_ai_common_plat` | `docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md`      |
+| Telemetry roadmap        | `learning_ai_common_plat` | `docs/WINDSURF/TELEMETRY_ROADMAP.md`            |
+| **This document**        | `learning_ai_common_plat` | `docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md`  |
+
+---
+
+## Appendix: Component Dependency Graph
+
+```
+                ┌─────────────────────┐
+                │   Event Bus (2.4)   │
+                └──────────┬──────────┘
+                           │ emits events to all subscribers
+      ┌─────────────┬──────┴──────┬─────────────┐
+      │             │             │             │
+      ▼             ▼             ▼             ▼
+┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
+│ Email/Push│ │ Webhook   │ │ Audit Log │ │ Analytics │
+│ (2.2)     │ │ (2.3)     │ │ (existing)│ │ (2.13)    │
+└─────┬─────┘ └───────────┘ └───────────┘ └───────────┘
+      │
+      │ triggers
+      ▼
+┌───────────┐
+│ Password  │
+│ Reset(2.5)│
+└───────────┘
+
+┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
+│ Scheduled     │──▶│ Analytics       │   │ Data Export     │
+│ Jobs (2.1)    │   │ Rollups (2.13)  │   │ (2.9)           │
+└───────┬───────┘   └─────────────────┘   └─────────────────┘
+        │
+        │ triggers on schedule
+        ▼
+┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
+│ Trial Expiry  │   │ Usage Reset     │   │ Retention       │
+│ Check         │   │                 │   │ Cleanup (2.23)  │
+└───────────────┘   └─────────────────┘   └─────────────────┘
+
+┌───────────────┐   ┌─────────────────┐
+│ Billing       │──▶│ Email/Push      │
+│ Dunning(2.25) │   │ Delivery (2.2)  │
+└───────────────┘   └─────────────────┘
+```
+
+---
+
+_This document is a living brainstorm. Items will be promoted to dedicated design docs (like `CLIENT_TELEMETRY_DESIGN.md`) as they move into implementation._