# Platform Components Roadmap — What's Built, What's Missing, What's Next

> **Status:** Living document — brainstorm + gap analysis  
> **Last updated:** 2026-03-15  
> **Scope:** All infrastructure components relevant to admin, DevOps, and product operations across the ByteLyst platform.  
> **Repos:** `learning_ai_common_plat` (platform-service, packages) · `learning_voice_ai_agent` (dashboards, clients)

---

## Table of Contents

1. [Current Inventory](#1-current-inventory)
2. [Gap Analysis — Missing Components](#2-gap-analysis--missing-components)
   - [P0 — Foundational](#p0--foundational)
   - [P1 — Operational Maturity](#p1--operational-maturity)
   - [P2 — Product Intelligence](#p2--product-intelligence)
   - [P3 — Scale & Polish](#p3--scale--polish)
3. [Implementation Priority Matrix](#3-implementation-priority-matrix)
4. [New Cosmos Containers & Cost Impact](#4-new-cosmos-containers--cost-impact)
5. [New Environment Variables](#5-new-environment-variables)
6. [Quick Reference — Where Things Live](#6-quick-reference--where-things-live)

- [Appendix A: Risks & Open Questions](#appendix-a-risks--open-questions)
- [Appendix B: Component Dependency Graph](#appendix-b-component-dependency-graph)
- [Appendix C: Review Findings](#appendix-c-review-findings)

---

## 1. Current Inventory

### 1.1 Platform-Service Modules (30 modules)

| Category     | Module          | Endpoints | Description                                                                                         |
| ------------ | --------------- | --------- | --------------------------------------------------------------------------------------------------- |
| **Identity** | `auth`          | 11 routes | Login, register, refresh, SSO, profile, admin user CRUD                                             |
| **Identity** | `tokens`        | 5 routes  | API token management (CRUD + validate)                                                              |
| **Identity** | `licenses`      | 6 routes  | License key generation, activation, device binding, validate                                        |
| **Billing**  | `subscriptions` | 5 routes  | Plan management, trial tracking, period management                                                  |
| **Billing**  | `stripe`        | 2 routes  | Inbound Stripe webhook + portal session                                                             |
| **Billing**  | `plans`         | 4 routes  | Plan definitions (free, pro, enterprise)                                                            |
| **Billing**  | `usage`         | 4 routes  | Usage tracking and quota enforcement                                                                |
| **Billing**  | `promos`        | 5 routes  | Promo code creation, validation, redemption                                                         |
| **Growth**   | `invitations`   | 5 routes  | Invitation code generation, redemption, tracking                                                    |
| **Growth**   | `referrals`     | 5 routes  | Referral link tracking, status transitions                                                          |
| **Growth**   | `waitlist`      | 12 routes | Pre-launch signups, position tracking, admin batch invite, CSV export                               |
| **Growth**   | `public`        | 5 routes  | Public roadmap, community voting, feature submissions                                               |
| **Content**  | `items`         | 5 routes  | Tracker items (bugs, features, tasks)                                                               |
| **Content**  | `comments`      | 4 routes  | Threaded comments on items                                                                          |
| **Content**  | `votes`         | 3 routes  | User votes on items and comments                                                                    |
| **Content**  | `memory`        | 5 routes  | Memory items — create, reassign, patch, delete                                                      |
| **Ops**      | `audit`         | Query     | Audit log recording and admin queries                                                               |
| **Ops**      | `flags`         | 5 routes  | Feature flags with FNV-1a deterministic rollout                                                     |
| **Ops**      | `telemetry`     | 9 routes  | Client event ingestion, error clustering, collection policies, GDPR erasure                         |
| **Ops**      | `notifications` | 5 routes  | Device registration, notification preferences                                                       |
| **Ops**      | `settings`      | 6 routes  | User/device settings, kill switch                                                                   |
| **Ops**      | `ratelimit`     | 4 routes  | Rate limit checking, config management                                                              |
| **Ops**      | `themes`        | 7 routes  | Platform theming (iOS, Android, Desktop)                                                            |
| **Ops**      | `blob`          | 5 routes  | Azure Blob Storage SAS tokens, list, delete, info                                                   |
| **Registry** | `products`      | 4 routes  | Multi-product registry with full lifecycle (draft → pre_launch → beta → active → sunset → disabled) |
| **Ops**      | `jobs`          | 5 routes  | Scheduled jobs: cron parser, registry, runner, 6 built-in jobs, manual trigger                      |
| **Ops**      | `status`        | 6 routes  | Public status page: health checker, incidents CRUD, history                                         |
| **Ops**      | `delivery`      | 6 routes  | Transactional email: 8 templates, renderer, SendGrid/Postmark/console adapters, delivery log        |
| **Identity** | `auth` (reset)  | 4 routes  | Password reset (forgot/reset) + email verification (verify/resend) — added to auth module           |
| **Infra**    | `event-bus`     | Singleton | In-memory typed pub/sub via @bytelyst/events — emits on register, password reset, email verified    |

### 1.2 Shared Packages (14 packages)

| Package                   | Purpose                                                     |
| ------------------------- | ----------------------------------------------------------- |
| `@bytelyst/errors`        | Typed HTTP errors (400–429)                                 |
| `@bytelyst/cosmos`        | Cosmos DB client singleton + container registry             |
| `@bytelyst/config`        | Zod env loader, product identity, AKV resolver              |
| `@bytelyst/auth`          | JWT utilities, auth middleware, password hashing            |
| `@bytelyst/api-client`    | Fetch wrapper with auth token injection                     |
| `@bytelyst/fastify-core`  | `createServiceApp()` factory + `startService()`             |
| `@bytelyst/react-auth`    | React auth context factory                                  |
| `@bytelyst/logger`        | Structured logging (pino-based)                             |
| `@bytelyst/testing`       | Shared test mocks, Fastify inject helpers                   |
| `@bytelyst/blob`          | Azure Blob Storage client + SAS helpers                     |
| `@bytelyst/extraction`    | Extraction client + shared types                            |
| `@bytelyst/monitoring`    | Health-check utilities                                      |
| `@bytelyst/design-tokens` | Cross-platform token generator (JSON → CSS/TS/Kotlin/Swift) |
| `@bytelyst/events`        | Typed in-memory event bus with error isolation (14 tests)   |

### 1.3 Services

| Service                | Port | Description                                          |
| ---------------------- | ---- | ---------------------------------------------------- |
| **platform-service**   | 4003 | Consolidated Fastify service (30 modules, 988 tests) |
| **extraction-service** | 4005 | LangExtract text extraction + Python sidecar         |
| **monitoring**         | 4004 | Health-check aggregator (all services)               |

### 1.4 Dashboards

| Dashboard                 | Port | Pages                                                            |
| ------------------------- | ---- | ---------------------------------------------------------------- |
| **admin-dashboard-web**   | 3001 | ~25 pages — users, billing, flags, ops, telemetry, secrets, etc. |
| **user-dashboard-web**    | 3002 | User portal — subscription, usage, settings                      |
| **tracker-dashboard-web** | 3003 | Public roadmap, issue tracker, community voting                  |

### 1.5 Infrastructure Already In Place

| Component              | Status     | Notes                                                                                                                                         |
| ---------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| **Health checks**      | ✅         | Per-service `/health` + aggregated monitoring script                                                                                          |
| **Structured logging** | ✅         | Pino (Fastify) + structlog (Python)                                                                                                           |
| **Log aggregation**    | ✅         | Loki + Grafana (Docker Compose)                                                                                                               |
| **Reverse proxy**      | ✅         | Traefik (Docker Compose)                                                                                                                      |
| **Secret management**  | ✅         | Azure Key Vault + admin CRUD UI at `/ops/secrets`                                                                                             |
| **Feature flags**      | ✅         | FNV-1a hash, percentage rollout, admin UI                                                                                                     |
| **Client telemetry**   | ✅         | All platforms instrumented, admin Client Logs page                                                                                            |
| **Rate limiting**      | ✅         | In-memory sliding window + configurable rules per product                                                                                     |
| **Outbound webhooks**  | ⚠️ Partial | Fire-and-forget POST for 3 events (`lib/webhooks.ts`); subscription model built in `modules/webhooks/` with HMAC signing + retry              |
| **Event bus**          | ✅         | `@bytelyst/events` package + singleton in platform-service; auth emits user.created, password_reset, email_verified                           |
| **Scheduled jobs**     | ✅         | Cron parser, registry, in-process runner, 6 built-in jobs, admin API                                                                          |
| **Email delivery**     | ✅         | 8 templates, renderer, SendGrid/Postmark/console adapters, delivery log, event bus subscribers                                                |
| **Password reset**     | ✅         | forgot-password + reset-password endpoints, SHA-256 token hashing, anti-enumeration                                                           |
| **Email verification** | ✅         | verify-email + resend-verification endpoints, emailVerified field on UserDoc                                                                  |
| **Status page**        | ✅         | Health checker (3 services), incident management, public + admin endpoints                                                                    |
| **Kill switch**        | ✅         | Per-product, checked by all clients via `/settings/kill-switch`                                                                               |
| **Audit logging**      | ✅         | Records admin actions, queryable from admin dashboard                                                                                         |
| **Blob storage**       | ✅         | 6 containers (audio, transcripts, attachments, avatars, releases, backups), SAS tokens, admin endpoints                                       |
| **Swagger / OpenAPI**  | ⚠️ Partial | `createServiceApp()` passes `swagger` config; Fastify plugin wired but Zod schemas not fully connected to route definitions via type provider |
| **Prometheus metrics** | ⚠️ Partial | `metrics: true` in `createServiceApp()` — basic request metrics exposed; no custom business metrics, no Grafana dashboards for them           |
| **Product registry**   | ✅         | Multi-product with full status lifecycle (draft → pre_launch → beta → active → sunset → disabled), prelaunch config, custom fields            |
| **Admin doc browser**  | ✅         | `/docs` page with markdown viewer, search, and AI chat — browses repo documentation                                                           |

---

## 2. Gap Analysis — Missing Components

### P0 — Foundational

These are blocking features that nearly every production app needs. Without them, critical operational workflows are manual or impossible.

---

#### 2.1 Scheduled Jobs / Background Task Runner

**Why:** No way to run recurring work today. Trial expirations, subscription renewals, usage quota resets, stale data cleanup, digest emails, and report generation all require a scheduler.

**Current state:** Zero. All logic is request-driven (HTTP request → response).

**Proposed design:**

```
platform-service/src/modules/jobs/
├── types.ts         — JobDefinition, JobRun, JobSchedule schemas
├── registry.ts      — Job registry (register named jobs with cron expressions)
├── runner.ts        — Tick loop: evaluate cron, run due jobs, record outcomes
├── repository.ts    — Cosmos: job_definitions, job_runs containers
└── routes.ts        — Admin: list jobs, trigger manually, view run history, pause/resume
```

**Built-in jobs to ship on day 1:**

| Job                      | Schedule              | Description                                                                                            |
| ------------------------ | --------------------- | ------------------------------------------------------------------------------------------------------ |
| `trial-expiration-check` | Every hour            | Find subscriptions with `status=trialing` past `currentPeriodEnd`, transition to `expired` or `active` |
| `usage-quota-reset`      | Daily at midnight UTC | Reset daily/monthly counters in `usage_daily` container                                                |
| `stale-session-cleanup`  | Every 6 hours         | Remove expired refresh tokens and inactive sessions                                                    |
| `telemetry-ttl-sweep`    | Daily at 3am UTC      | Delete telemetry events past retention TTL (Cosmos TTL is best-effort)                                 |
| `waitlist-reminder`      | Weekly                | Identify stale waitlist entries, mark for follow-up                                                    |
| `license-expiry-check`   | Daily                 | Warn users whose licenses expire within 7 days                                                         |

**Options for the runner:**

- **In-process tick loop** (simplest): `setInterval` in platform-service, with leader election via Cosmos lease
- **Azure Functions timer triggers** (serverless): Lower cost, built-in cron, but adds deployment complexity
- **BullMQ + Redis** (heavy): Best for high-throughput, but adds a Redis dependency

**Recommendation:** Start with in-process tick loop + Cosmos lease for leader election (avoids Redis). Migrate to Azure Functions if job volume grows.
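
As a rough illustration of the recommended tick loop, the sketch below evaluates cron expressions and picks the jobs that are due. It assumes five-field expressions evaluated in UTC and supports only `*`, `*/n`, and single integers. `JobDefinition`, `matchesCron`, and `dueJobs` are hypothetical names for this example, not the module's actual API, and leader election is left out.

```typescript
// Hypothetical shape for illustration; the real JobDefinition schema may differ.
interface JobDefinition {
  name: string;
  cron: string; // five fields: minute hour day-of-month month day-of-week
  enabled: boolean;
}

// Match one cron field against a value. Supports "*", "*/n", and a single integer.
function matchesField(field: string, value: number): boolean {
  if (field === "*") return true;
  if (field.startsWith("*/")) {
    const step = Number(field.slice(2));
    return Number.isInteger(step) && step > 0 && value % step === 0;
  }
  return Number(field) === value;
}

// True when the UTC minute containing `at` satisfies the expression.
// Simplification: all five fields must match (standard cron ORs the two day fields).
function matchesCron(cron: string, at: Date): boolean {
  const fields = cron.trim().split(/\s+/);
  if (fields.length !== 5) {
    throw new Error(`Expected 5 cron fields, got ${fields.length}`);
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] = fields;
  return (
    matchesField(minute, at.getUTCMinutes()) &&
    matchesField(hour, at.getUTCHours()) &&
    matchesField(dayOfMonth, at.getUTCDate()) &&
    matchesField(month, at.getUTCMonth() + 1) &&
    matchesField(dayOfWeek, at.getUTCDay())
  );
}

// One tick: return the enabled jobs that are due at `now`.
function dueJobs(jobs: JobDefinition[], now: Date): JobDefinition[] {
  return jobs.filter((job) => job.enabled && matchesCron(job.cron, now));
}
```

In the service, a once-per-minute `setInterval` callback could call `dueJobs(registry, new Date())`, but only on the instance that currently holds the Cosmos lease.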

**Admin UI:**

- `/ops/jobs` page: list all registered jobs, last run status, next scheduled run
- Manual trigger button per job
- Run history table with duration, outcome, error details
- Pause/resume toggle per job

**Cosmos containers:**

- `job_definitions` (pk: `/productId`) — name, cron, enabled, lastRunAt, nextRunAt
- `job_runs` (pk: `/productId:jobName`) — runId, startedAt, completedAt, status, error, metrics

---

#### 2.2 Transactional Email & Push Delivery

**Why:** The `notifications` module manages device registration and preferences, but has **no delivery mechanism**. Notifications are database records with no way to reach users.

**Current state:** Device registration + preference management only. No email, no push, no SMS.

**Proposed design:**

```
platform-service/src/modules/delivery/
├── types.ts         — DeliveryRequest, DeliveryLog, ChannelConfig schemas
├── channels/
│   ├── email.ts     — SendGrid/Postmark adapter
│   ├── push-apns.ts — Apple Push Notification Service
│   ├── push-fcm.ts  — Firebase Cloud Messaging
│   └── sms.ts       — Twilio/Azure Communication Services (future)
├── renderer.ts      — Template rendering (Handlebars for email bodies)
├── repository.ts    — delivery_log + email_templates containers
├── dispatcher.ts    — Route delivery request to correct channel(s) based on prefs
└── routes.ts        — Admin: send test, view delivery log, manage templates
```

**Email templates to ship on day 1:**

| Template            | Trigger                                    | Description                                  |
| ------------------- | ------------------------------------------ | -------------------------------------------- |
| `welcome`           | `auth.register`                            | Welcome email with getting-started guide     |
| `trial-expiring`    | `jobs.trial-expiration-check` (7d warning) | "Your trial ends in 7 days"                  |
| `trial-expired`     | `jobs.trial-expiration-check`              | "Your trial has ended — upgrade to continue" |
| `password-reset`    | Future: `/auth/forgot-password`            | One-time reset link                          |
| `invitation`        | `invitations.create`                       | "You've been invited to join"                |
| `waitlist-accepted` | `waitlist.invite`                          | "You're in! Here's your access"              |
| `payment-failed`    | `stripe.invoice.payment_failed`            | "We couldn't charge your card"               |
| `license-expiring`  | `jobs.license-expiry-check`                | "Your license expires in 7 days"             |

**Push notification types:**

| Type                   | Channel    | Description                                  |
| ---------------------- | ---------- | -------------------------------------------- |
| `dictation_reminder`   | APNs + FCM | "Haven't dictated today — keep your streak!" |
| `feature_announcement` | APNs + FCM | Admin-triggered announcement                 |
| `subscription_change`  | APNs + FCM | Plan upgraded/downgraded/expired             |

**Cosmos container:**

- `delivery_log` (pk: `/productId:channel:yyyyMM`) — id, userId, channel, template, status (sent/failed/bounced), sentAt, error
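
A Cosmos partition key path has to point at a document property, so a composite key such as `productId:channel:yyyyMM` would normally be stored as a precomputed field on each document. A minimal sketch of building that value follows. The `Channel` names are taken from the proposed file layout, and `deliveryLogPartitionKey` is a hypothetical helper name.

```typescript
// Channel names mirror the proposed channels/ files; treat them as assumptions.
type Channel = "email" | "push-apns" | "push-fcm" | "sms";

// Build the synthetic partition key value "<productId>:<channel>:<yyyyMM>" in UTC.
function deliveryLogPartitionKey(productId: string, channel: Channel, sentAt: Date): string {
  const year = sentAt.getUTCFullYear();
  const month = String(sentAt.getUTCMonth() + 1).padStart(2, "0");
  return `${productId}:${channel}:${year}${month}`;
}
```

Bucketing by month keeps each logical partition bounded and lets a date-range query in the admin delivery log target a small, known set of partitions.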

**Admin UI:**

- `/ops/delivery` page: delivery log with filters (channel, status, template, date range)
- Template management: list, preview, edit (future: visual editor)
- "Send test" button for each template
- Delivery stats: sent/failed/bounced/opened (with SendGrid webhook integration)

---

#### 2.3 Outbound Webhook Subscriptions

**Why:** The current `lib/webhooks.ts` sends fire-and-forget POSTs to URLs configured through environment variables, with no retry, no signing, and no subscriber management. External integrations (Zapier, Slack, custom) need a proper webhook subscription system.

**Current state:** 3 hardcoded webhook dispatchers (invitation redeemed, referral status changed, waitlist joined). No retry. No HMAC signing. No subscription management.

**Proposed design:**

```
platform-service/src/modules/webhooks/
├── types.ts         — WebhookSubscription, WebhookDelivery, WebhookEvent schemas
├── repository.ts    — Cosmos: webhook_subscriptions, webhook_deliveries containers
├── dispatcher.ts    — Match event → subscriptions, queue delivery, HMAC-SHA256 sign
├── delivery.ts      — HTTP POST with exponential backoff retry (3 attempts)
└── routes.ts        — Admin CRUD for subscriptions + delivery log
```

**Event catalog (subscribe to any combination):**

| Event                   | Payload                                        | Source                          |
| ----------------------- | ---------------------------------------------- | ------------------------------- |
| `user.created`          | `{ userId, email, plan }`                      | `auth.register`, `auth.sso`     |
| `user.deleted`          | `{ userId }`                                   | Admin: `DELETE /auth/users/:id` |
| `subscription.created`  | `{ subscriptionId, userId, plan, status }`     | Registration hook               |
| `subscription.changed`  | `{ subscriptionId, oldPlan, newPlan, status }` | Stripe webhook                  |
| `subscription.canceled` | `{ subscriptionId, userId, reason }`           | User action / Stripe            |
| `payment.succeeded`     | `{ invoiceId, amount, userId }`                | Stripe webhook                  |
| `payment.failed`        | `{ invoiceId, amount, userId, retryCount }`    | Stripe webhook                  |
| `invitation.redeemed`   | `{ invitationId, userId }`                     | Invitation module               |
| `referral.completed`    | `{ referralId, referrerId, referredId }`       | Referral module                 |
| `waitlist.joined`       | `{ email, position }`                          | Waitlist module                 |
| `flag.toggled`          | `{ flagId, enabled, percentage }`              | Flags module                    |
| `license.activated`     | `{ licenseId, userId, deviceId }`              | License module                  |
| `license.expired`       | `{ licenseId, userId }`                        | Jobs: license-expiry-check      |

**Security:**

- Every delivery signed with `X-Webhook-Signature: sha256=<HMAC>` using per-subscription secret
- Subscription secret generated at creation time, displayed once, rotatable
- Replay protection: `X-Webhook-Timestamp` header, reject if > 5 min old
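
The signing and replay checks above could be sketched as follows, using Node's built-in `crypto` module. Two details are assumptions made for this example because the design does not fix them: the signed string is `<timestamp>.<body>`, and `X-Webhook-Timestamp` carries Unix milliseconds. `signPayload` and `verifyDelivery` are hypothetical names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_AGE_MS = 5 * 60 * 1000; // 5-minute replay window from the design

// Produce the X-Webhook-Signature value for a delivery.
// Assumption: the timestamp is signed together with the raw body.
function signPayload(secret: string, timestamp: string, body: string): string {
  const digest = createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
  return `sha256=${digest}`;
}

// Receiver-side check: reject stale (or far-future) timestamps, then compare
// signatures in constant time.
function verifyDelivery(
  secret: string,
  signature: string,
  timestamp: string,
  body: string,
  nowMs: number,
): boolean {
  const sentAtMs = Number(timestamp);
  if (!Number.isFinite(sentAtMs) || Math.abs(nowMs - sentAtMs) > MAX_AGE_MS) {
    return false;
  }
  const expected = Buffer.from(signPayload(secret, timestamp, body));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Including the timestamp in the signed string means an attacker cannot replay an old body with a fresh timestamp header.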

**Retry policy:**

- 3 attempts with exponential backoff: 10s → 60s → 300s
- After 3 failures: mark subscription as `failing`, admin notification
- After 10 consecutive failures: auto-disable subscription
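
One way to encode this policy is as two small pure functions. The sketch reads the three listed delays as the waits that follow the first, second, and third failed attempt of a single delivery, and it derives subscription health from the count of consecutive failures. Both readings are assumptions, and the names are hypothetical.

```typescript
const RETRY_DELAYS_SECONDS = [10, 60, 300];
const FAILING_THRESHOLD = 3; // consecutive failures before "failing"
const AUTO_DISABLE_THRESHOLD = 10; // consecutive failures before auto-disable

type SubscriptionHealth = "healthy" | "failing" | "disabled";

// Seconds to wait after the given number of failed attempts for one delivery,
// or null when the retry budget is exhausted.
function retryDelaySeconds(failedAttempts: number): number | null {
  return RETRY_DELAYS_SECONDS[failedAttempts - 1] ?? null;
}

// Health of a subscription given its current run of consecutive failures.
function subscriptionHealth(consecutiveFailures: number): SubscriptionHealth {
  if (consecutiveFailures >= AUTO_DISABLE_THRESHOLD) return "disabled";
  if (consecutiveFailures >= FAILING_THRESHOLD) return "failing";
  return "healthy";
}
```

A successful delivery would reset the consecutive-failure counter to zero, which returns the subscription to `healthy`.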

**Admin UI:**

- `/ops/webhooks` page: list subscriptions, create/edit/delete, test delivery
- Delivery log: status (success/failed/retrying), response code, duration, payload preview
- Per-subscription health indicator (green/yellow/red based on recent success rate)

**Cosmos containers:**

- `webhook_subscriptions` (pk: `/productId`) — id, url, secret, events[], enabled, failureCount, lastDeliveryAt
- `webhook_deliveries` (pk: `/subscriptionId:yyyyMM`) — id, event, status, attempts[], responseCode, duration

---

#### 2.4 Async Event Bus / Internal Pub-Sub

**Why:** Today everything is synchronous request-response. As the platform grows, many operations should be fire-and-forget: audit log writes, webhook delivery, email sending, telemetry cluster updates, usage tracking. Without decoupling, any slow downstream operation blocks the API response.

**Current state:** Some operations are already fire-and-forget, but their promise rejections go unhandled (e.g., telemetry cluster updates). No formal event bus.

**Proposed design:**

```
packages/events/
├── src/
│   ├── index.ts     — EventBus class, typed event definitions
│   ├── types.ts     — PlatformEvent union type, EventHandler interface
│   └── memory.ts    — In-memory implementation (default)
```

**Event flow:**

```
API route handler
  → bus.emit('user.created', { userId, email, plan })
    → [handler] audit.record()
    → [handler] webhook.dispatch()
    → [handler] email.sendWelcome()
    → [handler] analytics.track()
```

**Implementation options:**

- **Phase 1:** In-memory `EventEmitter` wrapper with typed events (zero dependencies)
- **Phase 2:** Azure Service Bus adapter for cross-service events
- **Phase 3:** Azure Event Grid for external consumer webhooks
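
To make Phase 1 concrete, here is a minimal sketch of a typed in-memory bus with error isolation. It is not the `@bytelyst/events` implementation. `PlatformEventMap`, `TypedEventBus`, and the `onError` callback are hypothetical names, and payload typing is done with plain TypeScript types here instead of Zod to keep the example dependency-free.

```typescript
// Hypothetical event map for illustration only.
interface PlatformEventMap {
  "user.created": { userId: string; email: string; plan: string };
  "user.deleted": { userId: string };
}

type Handler<T> = (payload: T) => void | Promise<void>;

class TypedEventBus<Events extends object> {
  private handlers = new Map<keyof Events, Handler<any>[]>();
  private onError: (event: string, err: unknown) => void;

  constructor(onError: (event: string, err: unknown) => void) {
    this.onError = onError;
  }

  // Register a handler; the payload type is inferred from the event name.
  on<K extends keyof Events>(event: K, handler: Handler<Events[K]>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Fire-and-forget: every handler runs, and a failing handler (sync throw or
  // rejected promise) is reported through onError without affecting the rest.
  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    for (const handler of this.handlers.get(event) ?? []) {
      try {
        const result = handler(payload);
        if (result instanceof Promise) {
          result.catch((err) => this.onError(String(event), err));
        }
      } catch (err) {
        this.onError(String(event), err);
      }
    }
  }
}
```

Because `emit` never awaits handlers, a slow subscriber such as email delivery does not hold up the API response, and routing rejections to `onError` avoids the unhandled promise rejections noted under the current state.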

**Typed event definitions (Zod):**

```typescript
const PlatformEvents = {
  'user.created': z.object({ userId: z.string(), email: z.string(), plan: z.string() }),
  'user.deleted': z.object({ userId: z.string() }),
  'subscription.changed': z.object({
    subscriptionId: z.string(),
    oldPlan: z.string(),
    newPlan: z.string(),
  }),
  'payment.failed': z.object({ invoiceId: z.string(), userId: z.string() }),
  // ... all events from webhook catalog
} as const;
```

**Migration from existing `lib/webhooks.ts`:**

- Existing `dispatchInvitationRedeemed()`, `dispatchReferralStatusChanged()`, `dispatchWaitlistJoined()` become event bus subscribers
- Phase 1: Register existing webhooks.ts functions as handlers on the bus
- Phase 2: Replace inline dispatch calls in routes with `bus.emit()`
- Phase 3: Remove `lib/webhooks.ts` once all callers migrated

**Benefits:**

- Audit logging becomes a subscriber, not inline code
- Webhook delivery becomes a subscriber, not inline code
- Email sending becomes a subscriber, not inline code
- New features can subscribe to events without modifying existing modules

---

#### 2.5 Missing Auth Flows — Password Reset & Email Verification

**Why:** The auth module has login, register, SSO, and refresh — but **no password reset** and **no email verification**. These are table-stakes for any production auth system.

**Current state:** If a user forgets their password, there is no recovery path. Registration accepts any email without verification.

**Proposed additions to `auth` module:**

**Password reset flow:**

1. `POST /auth/forgot-password` — accepts `{ email, productId }`, generates a time-limited reset token (UUID), stores hash in `password_reset_tokens` container, sends email with reset link (via delivery module §2.2)
2. `POST /auth/reset-password` — accepts `{ token, newPassword }`, validates token, updates `passwordHash`, invalidates token, optionally revokes all sessions (§2.7)

**Email verification flow:**

1. On register: generate verification token, store in `email_verifications` container, send email
2. `POST /auth/verify-email` — accepts `{ token }`, marks user email as verified
3. `POST /auth/resend-verification` — rate-limited, re-sends verification email
4. Add `emailVerified: boolean` field to `UserDoc`

**Reset token document:**

```typescript
interface PasswordResetToken {
  id: string; // UUID
  productId: string;
  userId: string;
  tokenHash: string; // SHA-256 hash of the token (raw token sent via email)
  expiresAt: string; // 1 hour from creation
  usedAt?: string;
  createdAt: string;
}
```

**Security considerations:**

- Store hash of token, not raw token (same pattern as API tokens)
- Tokens expire in 1 hour
- Rate limit: 3 reset requests per email per hour
- After a successful reset, invalidate all existing sessions (depends on session tracking, §2.7)
- Log all reset attempts to audit
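
The hash-only storage and expiry rules above can be sketched with Node's built-in `crypto` module. The function names and the `StoredResetToken` shape are hypothetical. Only the fields needed for validation are shown, and persistence, rate limiting, and audit logging are omitted.

```typescript
import { createHash, randomUUID, timingSafeEqual } from "node:crypto";

const RESET_TOKEN_TTL_MS = 60 * 60 * 1000; // 1 hour, per the design

// Subset of the PasswordResetToken document needed for validation.
interface StoredResetToken {
  tokenHash: string;
  expiresAt: string;
  usedAt?: string;
}

function hashToken(rawToken: string): string {
  return createHash("sha256").update(rawToken).digest("hex");
}

// The raw token goes into the emailed link; only the hash is persisted.
function issueResetToken(now: Date): { rawToken: string; stored: StoredResetToken } {
  const rawToken = randomUUID();
  const expiresAt = new Date(now.getTime() + RESET_TOKEN_TTL_MS).toISOString();
  return { rawToken, stored: { tokenHash: hashToken(rawToken), expiresAt } };
}

// A token is valid if it is unused, unexpired, and its hash matches.
function isResetTokenValid(rawToken: string, stored: StoredResetToken, now: Date): boolean {
  if (stored.usedAt !== undefined) return false;
  if (now.getTime() >= Date.parse(stored.expiresAt)) return false;
  const presented = Buffer.from(hashToken(rawToken), "hex");
  const expected = Buffer.from(stored.tokenHash, "hex");
  return presented.length === expected.length && timingSafeEqual(presented, expected);
}
```

On a successful reset the caller would set `usedAt` on the stored document so that the same link cannot be used twice.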

**Cosmos container:**

- `password_reset_tokens` (pk: `/productId`) — short-lived, TTL 24h auto-expiry

**Dependency:** Requires email delivery (§2.2) for sending reset links and verification emails. Can ship the endpoints first with `req.log.info`-logged URLs for dev/testing (never `console.log`).

---

#### 2.6 Public Status Page

**Why:** Users and admins need a single place to check if services are operational. The health-check script exists but has no user-facing output.

**Current state:** `monitoring/health-check.ts` polls services and prints to stdout. No persistent status, no incident history, no public URL.

**Proposed design:**

**Option A — Self-hosted (minimal):**

```
platform-service/src/modules/status/
├── types.ts         — ServiceStatus, Incident, MaintenanceWindow schemas
├── repository.ts    — Cosmos: service_status, incidents containers
├── poller.ts        — Periodic health poll (reuses @bytelyst/monitoring)
└── routes.ts        — Public: GET /public/status, GET /public/status/history
```

**Option B — External (Instatus, Statuspage, or Upptime):**

- Upptime (GitHub-based, free, open-source) — runs as a GitHub Action, publishes to GitHub Pages
- Better for public credibility (hosted on a separate domain)

**Recommendation:** Option A for internal/admin use, Option B for public-facing.

**Status page data model:**

| Field                | Type       | Description                                            |
| -------------------- | ---------- | ------------------------------------------------------ |
| `services`           | array      | Current status per service (operational/degraded/down) |
| `incidents`          | array      | Active and past incidents with timeline                |
| `maintenanceWindows` | array      | Scheduled maintenance with start/end times             |
| `overallStatus`      | enum       | `operational` / `degraded` / `major_outage`            |
| `lastCheckedAt`      | ISO string | When the poller last ran                               |
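
The `overallStatus` field can be derived from the per-service statuses. The mapping below is one plausible rule, not a decided one: any service that is down yields `major_outage`, any degraded service yields `degraded`, and otherwise the platform is `operational`. `deriveOverallStatus` and `ServiceStatusEntry` are hypothetical names.

```typescript
type ServiceState = "operational" | "degraded" | "down";
type OverallStatus = "operational" | "degraded" | "major_outage";

interface ServiceStatusEntry {
  name: string;
  status: ServiceState;
}

// Assumed roll-up rule: the worst individual status determines the overall status.
function deriveOverallStatus(services: ServiceStatusEntry[]): OverallStatus {
  if (services.some((s) => s.status === "down")) return "major_outage";
  if (services.some((s) => s.status === "degraded")) return "degraded";
  return "operational";
}
```

A stricter variant might report `major_outage` only when a core service such as platform-service is down, and treat other outages as `degraded`.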

**Admin UI:**

- `/ops/status` page (or extend existing Mission Control `/ops`): service health cards with history sparklines
- Incident management: create/update/resolve incidents with public-facing messages
- Maintenance scheduling: create windows with auto-banners

---

### P1 — Operational Maturity

These components improve reliability, debuggability, and operational efficiency. Not launch-blocking, but critical for a team running production services.

---

#### 2.7 Session Management & Active Devices

**Why:** Licenses track `deviceIds` but there's no concept of active sessions. Users can't see where they're logged in. Admins can't force-revoke a compromised session. "Sign out all devices" is impossible.

**Current state:** JWT tokens with expiry. No session tracking. No revocation list. Refresh tokens are stateless.

**Proposed design:**

```
platform-service/src/modules/sessions/
├── types.ts         — SessionDoc, CreateSessionInput schemas
├── repository.ts    — Cosmos: sessions container (pk: /userId)
├── middleware.ts    — Session validation (check revocation on each request)
└── routes.ts        — User: list my sessions, revoke one, revoke all
                     — Admin: list user sessions, force-revoke
```

**Session document:**

```typescript
interface SessionDoc {
  id: string; // session ID (embedded in JWT)
  productId: string;
  userId: string;
  deviceId?: string; // linked to license device
  platform: string; // ios, android, desktop, web
  ipAddress: string;
  userAgent: string;
  lastActiveAt: string;
  createdAt: string;
  revokedAt?: string;
  expiresAt: string;
}
```

**Endpoints:**

- `GET /sessions` — list my active sessions
- `DELETE /sessions/:id` — revoke specific session
- `DELETE /sessions` — revoke all sessions (sign out everywhere)
- `GET /sessions/user/:userId` — admin: list user's sessions
- `DELETE /sessions/user/:userId` — admin: force-revoke all

**Integration:** Refresh token endpoint creates a session. Auth middleware checks that the session isn't revoked (Cosmos point-read by session ID within the `/userId` partition, cached in-memory with a short TTL).
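
A minimal sketch of that revocation check with a short-TTL in-memory cache is shown below. The Cosmos read is abstracted behind a `SessionLoader` function so the example stays self-contained. All names are hypothetical, and the clock is passed in explicitly to keep the logic testable.

```typescript
// Subset of SessionDoc needed for the check.
interface SessionSnapshot {
  revokedAt?: string;
  expiresAt: string;
}

// Stand-in for a Cosmos point-read by (sessionId, userId partition key).
type SessionLoader = (sessionId: string, userId: string) => Promise<SessionSnapshot | undefined>;

class RevocationCache {
  private entries = new Map<string, { revoked: boolean; cachedAtMs: number }>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached answer, or undefined when missing or older than the TTL.
  get(sessionId: string, nowMs: number): boolean | undefined {
    const entry = this.entries.get(sessionId);
    if (entry === undefined) return undefined;
    if (nowMs - entry.cachedAtMs >= this.ttlMs) {
      this.entries.delete(sessionId);
      return undefined;
    }
    return entry.revoked;
  }

  set(sessionId: string, revoked: boolean, nowMs: number): void {
    this.entries.set(sessionId, { revoked, cachedAtMs: nowMs });
  }
}

// Treats missing, revoked, and expired sessions as revoked.
async function isSessionRevoked(
  sessionId: string,
  userId: string,
  cache: RevocationCache,
  loadSession: SessionLoader,
  nowMs: number,
): Promise<boolean> {
  const cached = cache.get(sessionId, nowMs);
  if (cached !== undefined) return cached;
  const session = await loadSession(sessionId, userId);
  const revoked =
    session === undefined ||
    session.revokedAt !== undefined ||
    Date.parse(session.expiresAt) <= nowMs;
  cache.set(sessionId, revoked, nowMs);
  return revoked;
}
```

The TTL bounds how long a revoked session can keep working on an instance that cached it as active, so it trades revocation latency against Cosmos read volume.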

---

#### 2.8 Database Migration & Schema Evolution Tracker

**Why:** Cosmos DB is schemaless, but breaking changes still happen: new required fields, partition key changes, index policy updates, container renames. Without tracking, deployments are error-prone and rollbacks are impossible.

**Current state:** No migration tracking. Schema changes are applied ad-hoc.

**Proposed design:**

```
platform-service/src/migrations/
├── runner.ts        — Run pending migrations on startup (idempotent)
├── registry.ts      — List of migration files, ordered by version
└── migrations/
    ├── 001_add_productId_to_legacy_users.ts
    ├── 002_create_telemetry_containers.ts
    └── ...
```

**Migration document (in `migrations` container):**

```typescript
interface MigrationDoc {
  id: string; // "001_add_productId_to_legacy_users"
  productId: string; // "platform"
  version: number;
  description: string;
  appliedAt: string;
  durationMs: number;
  status: 'applied' | 'failed' | 'rolled_back';
  error?: string;
}
```

**Behavior:**

- On service startup, runner checks `migrations` container for applied versions
- Runs any unapplied migrations in order
- Each migration is idempotent (safe to re-run)
- Failed migrations are recorded but don't block startup (logged as warnings)
- Admin UI: `/ops/migrations` page showing applied/pending/failed
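
The "check applied versions, then run the rest in order" step can be sketched as a pure planning function. This is only an illustration: `MigrationDef`, `AppliedRecord`, and `pendingMigrations` are hypothetical names, and treating `failed` and `rolled_back` records as still pending is an assumption, not a decided behavior.

```typescript
interface MigrationDef {
  id: string; // e.g. "001_add_productId_to_legacy_users"
  version: number;
}

interface AppliedRecord {
  id: string;
  status: 'applied' | 'failed' | 'rolled_back';
}

// Returns the migrations the runner still has to execute, lowest version first.
// Assumption: only status 'applied' counts as done, so failed or rolled-back
// migrations are retried on the next startup (safe because they are idempotent).
function pendingMigrations(registry: MigrationDef[], records: AppliedRecord[]): MigrationDef[] {
  const done = new Set(records.filter((r) => r.status === 'applied').map((r) => r.id));

  // Two migrations sharing a version would make the order ambiguous.
  const seen = new Set<number>();
  for (const m of registry) {
    if (seen.has(m.version)) throw new Error(`Duplicate migration version ${m.version}`);
    seen.add(m.version);
  }

  // filter() copies the array, so sorting does not reorder the registry itself.
  return registry.filter((m) => !done.has(m.id)).sort((a, b) => a.version - b.version);
}
```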

---

#### 2.9 Data Export & Bulk Operations

**Why:** Admins regularly need: export users as CSV, export audit logs, bulk status updates, bulk license revocation. Today these require direct database queries.

**Current state:** Waitlist has a CSV export endpoint. Nothing else supports bulk operations.

**Proposed design:**

```
platform-service/src/modules/exports/
├── types.ts         — ExportJob, ExportFormat schemas
├── repository.ts    — Cosmos: export_jobs container
├── workers/
│   ├── users.ts     — Export users as CSV/JSON
│   ├── audit.ts     — Export audit log
│   ├── telemetry.ts — Export telemetry events
│   ├── usage.ts     — Export usage data
│   └── subscriptions.ts — Export subscriptions
└── routes.ts        — POST /exports (start), GET /exports (list), GET /exports/:id/download
```

**Flow:**

1. Admin POST `/api/exports` → `{ type: 'users', format: 'csv', filters: { plan: 'free' } }`
2. Background job runs query, writes result to blob storage (via existing `blob` module)
3. Job status updates: `pending` → `processing` → `ready` / `failed`
4. Admin downloads from signed blob URL (SAS token via `@bytelyst/blob`)

**Dependencies:** `blob` module (existing) for storage, `jobs` module (§2.1) for auto-cleanup of expired exports.
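
The status progression in step 3 can be enforced with a small transition guard. The sketch below is a hedged illustration: the `transition` helper is hypothetical, and allowing `pending` to go straight to `failed` (for example when the job cannot be scheduled) is an assumption.

```typescript
type ExportStatus = 'pending' | 'processing' | 'ready' | 'failed';

// Allowed next states for each status. 'ready' and 'failed' are terminal.
const ALLOWED_TRANSITIONS: Record<ExportStatus, ExportStatus[]> = {
  pending: ['processing', 'failed'],
  processing: ['ready', 'failed'],
  ready: [],
  failed: [],
};

// Returns the next status, or throws when the move is not part of the flow.
function transition(current: ExportStatus, next: ExportStatus): ExportStatus {
  if (!ALLOWED_TRANSITIONS[current].includes(next)) {
    throw new Error(`Invalid export status transition: ${current} -> ${next}`);
  }
  return next;
}
```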

**Supported exports:**

- Users (with filters: plan, status, date range)
- Audit log (with filters: action, userId, date range)
- Telemetry events (with filters: platform, eventType, date range)
- Usage records (with filters: userId, date range)
- Subscriptions (with filters: plan, status)
- Licenses (with filters: status, plan)

**Admin UI:**

- `/ops/exports` page: create new export, list past exports, download links
- Progress indicator for running exports
- Auto-cleanup: delete export blobs after 7 days

---

#### 2.10 Maintenance Mode & Graceful Degradation

**Why:** Kill switch is binary (on/off per product). Need nuanced control: read-only mode, specific features disabled, custom banner messages, admin bypass, scheduled windows.

**Current state:** `settings/kill-switch` endpoint returns boolean per product. Clients check and fully disable themselves.

**Proposed design:**

Extend the existing `settings` module:

```typescript
interface MaintenanceConfig {
  mode: 'off' | 'read_only' | 'maintenance' | 'emergency';
  message: string; // Shown to users
  adminMessage?: string; // Shown to admins
  bypassRoles: string[]; // Roles that can bypass (e.g., ['admin', 'super_admin'])
  bypassIPs: string[]; // IP addresses that bypass
  scheduledStart?: string; // ISO — for planned maintenance
  scheduledEnd?: string;
  affectedServices: string[]; // ['api', 'dictation', 'extraction'] or ['*']
  updatedAt: string;
  updatedBy: string;
}
```

**Modes:**

- `off` — Normal operation
- `read_only` — GET requests allowed, writes blocked (for database maintenance)
- `maintenance` — All requests return 503 with message (except admin bypass)
- `emergency` — Kill switch + maintenance message + all clients show error
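
A server-side check for these modes might look like the sketch below. It uses a trimmed copy of `MaintenanceConfig` and rests on assumptions the design above does not settle: bypass rules are ignored in `emergency`, blocked writes in `read_only` also answer with 503, and `HEAD`/`OPTIONS` count as reads. `IncomingRequest` and `checkMaintenance` are hypothetical names.

```typescript
type MaintenanceMode = 'off' | 'read_only' | 'maintenance' | 'emergency';

interface MaintenanceRules {
  mode: MaintenanceMode;
  message: string;
  bypassRoles: string[];
  bypassIPs: string[];
}

interface IncomingRequest {
  method: string;
  ip: string;
  role?: string;
}

type Decision = { allowed: true } | { allowed: false; status: 503; message: string };

const READ_METHODS = ['GET', 'HEAD', 'OPTIONS'];

function checkMaintenance(rules: MaintenanceRules, req: IncomingRequest): Decision {
  if (rules.mode === 'off') return { allowed: true };

  // Assumption: bypass rules apply in every mode except 'emergency'.
  const bypass =
    rules.mode !== 'emergency' &&
    ((req.role !== undefined && rules.bypassRoles.includes(req.role)) ||
      rules.bypassIPs.includes(req.ip));
  if (bypass) return { allowed: true };

  // read_only lets reads through; HEAD and OPTIONS are treated like GET here.
  if (rules.mode === 'read_only' && READ_METHODS.includes(req.method.toUpperCase())) {
    return { allowed: true };
  }
  return { allowed: false, status: 503, message: rules.message };
}
```

Scheduled windows (`scheduledStart`/`scheduledEnd`) are left out of the sketch; a job could simply flip `mode` at the window boundaries.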

**Endpoints:**

- `GET /settings/maintenance` — Public: check current mode + message
- `PUT /settings/maintenance` — Admin: update mode, message, bypass rules
- `GET /settings/maintenance/schedule` — Upcoming maintenance windows

**Client integration:**

- Clients poll `/settings/maintenance` alongside kill-switch check
- If `mode !== 'off'`, show banner with `message`
- If `mode === 'read_only'`, disable write operations with user-facing explanation

**Admin UI:**

- Extend existing Settings page or add `/ops/maintenance`
- Mode toggle (off/read-only/maintenance/emergency)
- Message editor with preview
- Schedule builder with start/end date pickers
- Bypass IP whitelist management

**Storage:** Maintenance config is a single document per product in the existing `settings` container (field: `maintenanceConfig`). No new Cosmos container needed.

---

#### 2.11 Rate Limit Dashboard & IP Allow/Deny Lists

**Why:** `ratelimit` module exists but admins have zero visibility into who's being rate-limited, and no ability to whitelist VIP users or blacklist abusive IPs.

**Current state:** In-memory sliding window rate limiter with configurable rules. No persistence, no admin visibility.

**Proposed design:**

Extend `ratelimit` module:

```typescript
interface RateLimitEntry {
  key: string; // userId or IP
  productId: string;
  currentCount: number;
  windowStart: string;
  wasLimited: boolean;
  lastLimitedAt?: string;
}

interface IPRule {
  id: string;
  productId: string;
  ip: string; // CIDR notation supported
  action: 'allow' | 'deny';
  reason: string;
  createdBy: string;
  createdAt: string;
  expiresAt?: string; // Temporary blocks
}
```

**Additional endpoints:**

- `GET /ratelimit/stats` — Admin: top rate-limited keys, total 429s in last hour/day
- `GET /ratelimit/blocked` — Admin: currently blocked keys
- `POST /ratelimit/ip-rules` — Admin: add IP allow/deny rule
- `GET /ratelimit/ip-rules` — Admin: list rules
- `DELETE /ratelimit/ip-rules/:id` — Admin: remove rule

**Admin UI:**

- `/ops/rate-limits` page: real-time rate limit stats
- Top offenders table (most 429 responses)
- IP rules management (allow/deny with expiry)
- Per-user rate limit override

**Cosmos container:**

- `ip_rules` (pk: `/productId`) — persistent IP allow/deny rules
- Rate limit stats remain in-memory (ephemeral); no persistence needed for counters
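
Evaluating `IPRule` entries with CIDR support could be sketched as follows. The example handles IPv4 only and assumes that a matching `deny` rule wins over a matching `allow` rule and that expired rules are ignored; neither precedence nor IPv6 handling is specified above. `IPRuleLite`, `matchesCidr`, and `evaluateIp` are illustrative names.

```typescript
interface IPRuleLite {
  ip: string; // IPv4 address or CIDR block, e.g. "203.0.113.0/24"
  action: 'allow' | 'deny';
  expiresAt?: string; // ISO timestamp for temporary rules
}

// Converts dotted IPv4 text to an unsigned 32-bit number.
function ipv4ToInt(ip: string): number {
  const parts = ip.split('.');
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  return parts.reduce((acc, p) => acc * 256 + Number(p), 0);
}

// True when `ip` falls inside `cidr`. A bare address is treated as a /32.
function matchesCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf('/');
  const base = slash === -1 ? cidr : cidr.slice(0, slash);
  const bitsText = slash === -1 ? '32' : cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(bitsText) || Number(bitsText) > 32) {
    throw new Error(`Invalid CIDR prefix: ${cidr}`);
  }
  const bits = Number(bitsText);
  if (bits === 0) return true; // a shift by 32 would be a no-op in JavaScript
  const mask = (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Assumption: deny beats allow when both match; 'none' means no rule applies.
function evaluateIp(ip: string, rules: IPRuleLite[], now: Date = new Date()): 'allow' | 'deny' | 'none' {
  const active = rules.filter(
    (r) => (!r.expiresAt || new Date(r.expiresAt) > now) && matchesCidr(ip, r.ip),
  );
  if (active.some((r) => r.action === 'deny')) return 'deny';
  return active.length > 0 ? 'allow' : 'none';
}
```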

---

### P2 — Product Intelligence

These components provide deeper insight into product health, user behavior, and experiment outcomes. They transform raw data into actionable intelligence.

---

#### 2.12 A/B Testing & Experiments Framework

**Why:** Feature flags exist but only support on/off with percentage rollout. No variant assignment, metric collection, or statistical significance calculation.

**Current state:** `flags` module with boolean flags and FNV-1a deterministic rollout.

**Proposed design:**

Extend `flags` module or create sibling `experiments` module:

```
platform-service/src/modules/experiments/
├── types.ts         — Experiment, Variant, ExperimentMetric schemas
├── repository.ts    — Cosmos: experiments container
├── assignment.ts    — Deterministic variant assignment (extend FNV-1a)
├── analysis.ts      — Statistical significance calculation
└── routes.ts        — Admin CRUD + results endpoint
```

**Experiment document:**

```typescript
interface ExperimentDoc {
  id: string;
  productId: string;
  name: string;
  hypothesis: string;
  status: 'draft' | 'running' | 'paused' | 'concluded';
  variants: Variant[]; // [{id: 'control', weight: 50}, {id: 'treatment', weight: 50}]
  targetingRules: FlagTargetingRules; // Reuse from flags module (platforms, versions, percentage)
  primaryMetric: string; // e.g., 'dictation_completed_rate'
  secondaryMetrics: string[];
  startedAt?: string;
  concludedAt?: string;
  winningVariant?: string;
  sampleSize: number;
  results?: ExperimentResults;
}
```
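
Deterministic variant assignment on top of FNV-1a could be sketched as below. This is an illustration under assumptions: the hash input format (`experimentId:userId`) and the `assignVariant` helper are hypothetical, and the existing `flags` module may hash differently.

```typescript
interface WeightedVariant {
  id: string;
  weight: number; // relative weight, e.g. 50
}

// 32-bit FNV-1a over the UTF-8 bytes of the input.
function fnv1a(input: string): number {
  const bytes = new TextEncoder().encode(input);
  let hash = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps a user to a variant. The same inputs always produce the same variant.
function assignVariant(experimentId: string, userId: string, variants: WeightedVariant[]): string {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  if (total <= 0) throw new Error('Variant weights must sum to a positive number');

  // Salting with the experiment ID keeps buckets independent across experiments.
  let bucket = fnv1a(`${experimentId}:${userId}`) % total;
  for (const v of variants) {
    if (bucket < v.weight) return v.id;
    bucket -= v.weight;
  }
  // Only reachable through floating-point rounding with fractional weights.
  return variants[variants.length - 1].id;
}
```

Because assignment depends only on the IDs and weights, clients and the server can compute it independently. Changing weights mid-experiment moves some users between variants, so weights are best frozen once an experiment is `running`.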

**Admin UI:**

- `/experiments` page: list experiments, create new, view results
- Results view: conversion rates per variant, confidence interval, statistical significance indicator
- "Conclude" action: pick winner, auto-convert to feature flag
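
For the significance indicator, one common choice for conversion-rate metrics is a two-proportion z-test. The sketch below assumes that test; the design above does not commit to a specific method, and `twoProportionZTest` is a hypothetical helper. The normal CDF uses the Abramowitz-Stegun approximation of `erf`.

```typescript
// Abramowitz-Stegun approximation of erf (formula 7.1.26), accurate to about 1.5e-7.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  return sign * (1 - poly * Math.exp(-ax * ax));
}

interface VariantCounts {
  conversions: number;
  users: number;
}

// Two-sided z-test for the difference between two conversion rates.
function twoProportionZTest(control: VariantCounts, treatment: VariantCounts): { z: number; pValue: number } {
  if (control.users <= 0 || treatment.users <= 0) throw new Error('Both variants need users');
  const p1 = control.conversions / control.users;
  const p2 = treatment.conversions / treatment.users;
  const pooled = (control.conversions + treatment.conversions) / (control.users + treatment.users);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / control.users + 1 / treatment.users));
  const z = se === 0 ? 0 : (p2 - p1) / se;
  const cdf = 0.5 * (1 + erf(Math.abs(z) / Math.SQRT2));
  return { z, pValue: 2 * (1 - cdf) };
}
```

A result would typically be flagged as significant when `pValue` falls below a preset threshold such as 0.05, and only after the planned `sampleSize` is reached, since checking repeatedly during the run inflates false positives.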

---

#### 2.13 Analytics Aggregation Pipeline

**Why:** `usage` tracks raw events but there are no pre-aggregated rollups. Admin dashboard charts require expensive real-time queries. DAU/WAU/MAU, retention cohorts, and funnel analysis are impossible without rollups.

**Current state:** Raw `usage_daily` records. No aggregation.

**Proposed design:**

```
platform-service/src/modules/analytics/
├── types.ts         — MetricRollup, CohortEntry, FunnelStep schemas
├── repository.ts    — Cosmos: analytics_rollups container
├── rollup-jobs/
│   ├── dau-wau-mau.ts    — Daily/weekly/monthly active users
│   ├── retention.ts      — Cohort retention (D1, D7, D14, D30)
│   ├── funnel.ts         — Conversion funnels (signup → activate → dictate → subscribe)
│   └── feature-adoption.ts — Per-feature usage rates
└── routes.ts        — Admin: GET /analytics/dau, /retention, /funnel, /adoption
```

**Rollup schedule (via jobs module):**

- DAU: every hour (incremental)
- WAU/MAU: daily at 1am UTC
- Retention cohorts: daily at 2am UTC
- Funnels: daily at 2:30am UTC

**Key metrics:**

- **DAU/WAU/MAU** — with breakdown by platform, plan
- **Retention cohorts** — "Of users who signed up in week X, what % are active in week X+1, X+4?"
- **Conversion funnel** — signup → first dictation → 5th dictation → subscription
- **Feature adoption** — % of active users using each major feature
- **Revenue metrics** — MRR, churn rate, ARPU, LTV (from subscriptions + Stripe data)
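
As a hedged illustration of the DAU/WAU/MAU rollup, the sketch below counts distinct users in a trailing window. It assumes trailing 1-, 7-, and 30-day windows ending on the rollup date and a simplified `ActivityEvent` shape; the real rollup would read from `usage_daily`.

```typescript
interface ActivityEvent {
  userId: string;
  date: string; // UTC calendar day, "YYYY-MM-DD"
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Counts distinct users active in the `windowDays` days ending on `endDate` (inclusive).
function activeUsers(events: ActivityEvent[], endDate: string, windowDays: number): number {
  const end = Date.parse(`${endDate}T00:00:00Z`);
  const start = end - (windowDays - 1) * DAY_MS;
  const users = new Set<string>();
  for (const e of events) {
    const t = Date.parse(`${e.date}T00:00:00Z`);
    if (t >= start && t <= end) users.add(e.userId);
  }
  return users.size;
}

// Usage: the three headline metrics for one rollup date.
function rollup(events: ActivityEvent[], date: string): { dau: number; wau: number; mau: number } {
  return {
    dau: activeUsers(events, date, 1),
    wau: activeUsers(events, date, 7),
    mau: activeUsers(events, date, 30),
  };
}
```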

**Admin UI:**

- Extend dashboard home or create `/analytics` page
- Charts: DAU/WAU/MAU line chart, retention heatmap, funnel bar chart, MRR trend

---

#### 2.14 In-App Feedback & Support Widget

**Why:** Tracker handles issue tracking but there's no way for end users to submit feedback directly from the app. Bug reports with device context, NPS surveys, and feature requests should flow into the tracker automatically.

**Current state:** Public roadmap allows feature submissions and voting. No in-app feedback widget.

**Proposed design:**

```
platform-service/src/modules/feedback/
├── types.ts         — FeedbackEntry, FeedbackType, DeviceContext schemas
├── repository.ts    — Cosmos: feedback container (pk: /productId)
└── routes.ts        — POST /feedback (authenticated), GET /feedback (admin query)
```

**Feedback types:**

- `bug_report` — with device context, screenshot URL (blob), reproduction steps
- `feature_request` — auto-creates tracker item in `items` module
- `nps_survey` — score (0-10), comment, context
- `general` — free-form text

**Client integration:**

- Shake-to-report (iOS/Android) or keyboard shortcut (Desktop)
- Auto-attach: device model, OS version, app version, current screen, last 10 telemetry events
- Screenshot capture (optional, privacy-respecting)

**Admin UI:**

- `/feedback` page: list feedback with filters (type, platform, date range, NPS score range)
- Quick actions: convert to tracker item, reply, dismiss
- NPS dashboard: score distribution over time, detractor/promoter breakdown
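
The NPS figure behind that dashboard follows the standard definition: the percentage of promoters (scores 9 to 10) minus the percentage of detractors (scores 0 to 6). A minimal sketch, with `npsScore` as a hypothetical helper name:

```typescript
// Net Promoter Score for a list of 0-10 survey scores, rounded to an integer in [-100, 100].
function npsScore(scores: number[]): number {
  if (scores.length === 0) throw new Error('No survey responses');
  let promoters = 0;
  let detractors = 0;
  for (const s of scores) {
    if (!Number.isInteger(s) || s < 0 || s > 10) throw new Error(`Invalid NPS score: ${s}`);
    if (s >= 9) promoters++;
    else if (s <= 6) detractors++;
    // Scores 7 and 8 are passives: they count toward the total only.
  }
  return Math.round(((promoters - detractors) / scores.length) * 100);
}
```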

---

#### 2.15 User Impersonation / Admin Shadow Mode

**Why:** When a user reports a bug, admins need to see exactly what they see. Without impersonation, debugging requires asking users for screenshots and steps, which is slow and unreliable.

**Current state:** No impersonation capability.

**Proposed design:**

**Endpoint:**

- `POST /auth/impersonate` — Admin only. Accepts `{ targetUserId }`. Returns a scoped shadow token.

**Shadow token properties:**

- Contains `impersonatedBy: adminUserId` claim
- Read-only by default (no writes unless explicitly allowed)
- Expires in 15 minutes (non-renewable)
- All actions logged to audit with `impersonatedBy` field
- Visible banner in dashboard: "You are viewing as [user name] — all actions are audited"
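
The token properties above can be sketched as a claims builder plus a request guard. Apart from `impersonatedBy`, the claim names (`sub`, `readOnly`, `iat`, `exp`) and both helper functions are assumptions made for illustration; signing the token is left to the existing auth module.

```typescript
interface ShadowClaims {
  sub: string; // the user being viewed
  impersonatedBy: string; // admin user ID
  readOnly: boolean;
  iat: number; // issued-at, seconds since epoch
  exp: number; // expiry, seconds since epoch
}

const SHADOW_TTL_SECONDS = 15 * 60;

// Builds the claims for a 15-minute, read-only shadow token.
function buildShadowClaims(adminUserId: string, targetUserId: string, nowMs: number = Date.now()): ShadowClaims {
  if (adminUserId === targetUserId) throw new Error('Cannot impersonate yourself');
  const iat = Math.floor(nowMs / 1000);
  return { sub: targetUserId, impersonatedBy: adminUserId, readOnly: true, iat, exp: iat + SHADOW_TTL_SECONDS };
}

// Rejects expired tokens and, for read-only tokens, anything that is not a read.
function isShadowRequestAllowed(claims: ShadowClaims, method: string, nowMs: number = Date.now()): boolean {
  if (nowMs / 1000 >= claims.exp) return false;
  if (!claims.readOnly) return true;
  return ['GET', 'HEAD', 'OPTIONS'].includes(method.toUpperCase());
}
```

Since the token is non-renewable, the refresh endpoint would also need to refuse any token carrying `impersonatedBy`.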

**Admin UI:**

- On the user detail page (`/users/:id`), add "View as User" button
- Opens user dashboard in new tab with shadow token
- Impersonation sessions listed on `/ops/audit` with filter

---

#### 2.16 Changelog & In-App Release Notes

**Why:** Users should know what changed in each release. A changelog system also serves as internal documentation and can be shown as a "What's New" modal in the app.

**Current state:** `CHANGELOG.md` exists in the repo but nothing in-app.

**Proposed design:**

```
platform-service/src/modules/changelog/
├── types.ts         — ChangelogEntry, ReleaseNote schemas
├── repository.ts    — Cosmos: changelog container (pk: /productId)
└── routes.ts        — Public: GET /changelog (paginated)
                     — Admin: CRUD changelog entries
```

**Entry document:**

```typescript
interface ChangelogEntry {
  id: string;
  productId: string;
  version: string; // "1.2.0"
  title: string;
  body: string; // Markdown
  category: 'feature' | 'improvement' | 'bugfix' | 'security';
  platforms: string[]; // ['ios', 'android', 'desktop', 'web']
  publishedAt?: string;
  isDraft: boolean;
  createdBy: string;
}
```

**Client integration:**

- App checks `GET /api/changelog?since=<lastSeenVersion>` on launch
- If new entries exist, show "What's New" modal
- User can dismiss; `lastSeenVersion` stored in settings
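
The `since=<lastSeenVersion>` filter could be implemented roughly as below. The sketch assumes plain dotted numeric versions (no pre-release tags) and that only published, non-draft entries for the caller's platform are returned; `compareVersions` and `entriesSince` are hypothetical helpers.

```typescript
interface ChangelogSummary {
  version: string; // "1.2.0"
  platforms: string[];
  isDraft: boolean;
  publishedAt?: string;
}

// Numeric comparison of dotted versions: returns -1, 0, or 1.
function compareVersions(a: string, b: string): number {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

// Published entries newer than `lastSeenVersion` for one platform, newest first.
function entriesSince<T extends ChangelogSummary>(entries: T[], lastSeenVersion: string, platform: string): T[] {
  return entries
    .filter((e) => !e.isDraft && e.publishedAt !== undefined)
    .filter((e) => e.platforms.includes(platform))
    .filter((e) => compareVersions(e.version, lastSeenVersion) > 0)
    .sort((a, b) => compareVersions(b.version, a.version));
}
```

Comparing segment by segment as numbers matters here: a plain string comparison would rank `1.9.0` above `1.10.0`.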

**Admin UI:**

- `/changelog` page: create/edit/publish entries with Markdown editor
- Preview mode before publishing
- Schedule publishing for future date

---

### P3 — Scale & Polish

These components are important for scale, security, and developer experience, but are lower urgency.

---

#### 2.17 CDN & Asset Pipeline

**Why:** Blob storage serves files directly from Azure. No edge caching, no image optimization, no automatic resizing for avatars/thumbnails.

**Proposed approach:**

- Azure CDN or Cloudflare in front of blob storage
- Image resize on upload (Sharp) for avatars: 64px, 128px, 256px variants
- Cache headers: `Cache-Control: public, max-age=31536000, immutable` for content-addressed assets
- Release binaries served via CDN for faster desktop app updates

---

#### 2.18 API Versioning Strategy

**Why:** As external consumers appear (webhook integrations, third-party tools), breaking API changes need to be managed. Today all endpoints are unversioned.

**Proposed approach:**

- URL prefix: `/v1/api/...`
- Deprecation header: `Sunset: <date>` + `Deprecation: true`
- Version lifecycle: `current` → `deprecated` (6 months notice) → `retired`
- OpenAPI spec generated per version
- Fastify plugin that routes to versioned handlers
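
A small helper could derive the deprecation headers from a version's lifecycle state. This sketch assumes the sunset date is the deprecation date plus the six-month notice period and follows the `Deprecation: true` form proposed above; `sunsetFor` and `lifecycleHeaders` are hypothetical names.

```typescript
type VersionState = 'current' | 'deprecated' | 'retired';

// Sunset date = deprecation date plus six calendar months.
// Note: setUTCMonth rolls over for end-of-month dates (Aug 31 lands in early March).
function sunsetFor(deprecatedAt: Date): Date {
  const sunset = new Date(deprecatedAt.getTime());
  sunset.setUTCMonth(sunset.getUTCMonth() + 6);
  return sunset;
}

// Headers to attach to every response served by a given API version.
// Assumption: 'retired' versions are rejected earlier, so they get no headers here.
function lifecycleHeaders(state: VersionState, deprecatedAt?: Date): Record<string, string> {
  if (state !== 'deprecated' || !deprecatedAt) return {};
  return {
    Deprecation: 'true',
    Sunset: sunsetFor(deprecatedAt).toUTCString(), // HTTP-date format
  };
}
```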

---

#### 2.19 OpenAPI / Auto-Generated API Docs

**Why:** Platform-service already passes `swagger` config to `createServiceApp()`, but Zod schemas aren't fully wired to route definitions. The admin `/docs` page is a markdown doc browser (not API docs). Auto-generated API docs from Zod schemas would be nearly free.

**Current state:** `@fastify/swagger` is configured with title/description but route schemas aren't connected via `fastify-type-provider-zod`. Swagger UI may already be partially served but without route-level detail.

**Proposed approach:**

- Wire `fastify-type-provider-zod` to connect existing Zod schemas to Fastify route definitions
- Verify `@fastify/swagger-ui` is serving at `/documentation` on platform-service
- Add route-level `schema: { body, querystring, params, response }` using existing Zod schemas
- Export OpenAPI JSON at `/documentation/json`
- Admin dashboard links to platform-service Swagger UI

---

#### 2.20 Localization / i18n Service

**Why:** Centralized string management for all platforms. When adding a new language, change one place, not four codebases.

**Proposed approach:**

- `translations` Cosmos container (pk: `/productId:locale`)
- Admin UI: string management with translation status per locale
- Client SDK: fetch translations on launch, cache locally
- Fallback chain: requested locale → base locale → English
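
The fallback chain can be sketched as a lookup over an in-memory catalog. The example assumes BCP 47 style tags such as `pt-BR`, where the base locale is the part before the first hyphen, and returns the key itself when no translation exists; `Catalog`, `fallbackChain`, and `translate` are illustrative names.

```typescript
// locale -> string key -> translated text
type Catalog = Record<string, Record<string, string> | undefined>;

// "pt-BR" -> ["pt-BR", "pt", "en"]; "en" -> ["en"]
function fallbackChain(locale: string): string[] {
  const chain = [locale];
  const base = locale.split('-')[0];
  if (base !== locale) chain.push(base);
  if (!chain.includes('en')) chain.push('en');
  return chain;
}

// Returns the first translation found along the chain, or the key as a last resort.
function translate(catalog: Catalog, locale: string, key: string): string {
  for (const candidate of fallbackChain(locale)) {
    const text = catalog[candidate]?.[key];
    if (text !== undefined) return text;
  }
  return key;
}
```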

---

#### 2.21 Full-Text Search

**Why:** Admin needs to search users by partial name/email. Users need to search memories/items. Cosmos SQL `CONTAINS()` is slow and doesn't rank results.

**Proposed approach:**

- **Phase 1:** Cosmos DB full-text search (preview feature, no extra cost)
- **Phase 2:** Azure AI Search for richer capabilities (fuzzy matching, facets, suggestions)
- Admin UI: unified search bar across entities (users, items, audit logs)

---

#### 2.22 Multi-Tenant Workspace / Org / Team Management

**Why:** `productId` scopes data per product, but within a product there's no team or organization concept. Enterprise customers need: org hierarchy, team-scoped permissions, shared brains/workspaces.

**Proposed design (future):**

```
users → belong to → organizations → have → teams → own → resources
```

This is a major architectural expansion. Defer until enterprise tier is validated.

---

#### 2.23 Data Retention & Lifecycle Policies

**Why:** Telemetry has TTL. Other containers don't. Old audit logs, expired sessions, redeemed promos, and stale waitlist entries accumulate forever.

**Proposed approach:**

- Admin-configurable retention policies per container
- Scheduled job (from §2.1) runs cleanup
- Default policies: audit (365 days), telemetry (30 days), sessions (90 days), export files (7 days)
- Admin UI: `/ops/retention` page showing policies and next cleanup run
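
The cleanup job needs a cutoff timestamp per data kind. A minimal sketch using the default policies listed above, with admin overrides taking precedence; the key names and the `retentionCutoff` helper are hypothetical.

```typescript
// Default retention in days, taken from the list above.
const DEFAULT_RETENTION_DAYS: Record<string, number | undefined> = {
  audit: 365,
  telemetry: 30,
  sessions: 90,
  export_files: 7,
};

// Documents older than the returned date are eligible for deletion.
// Returns undefined when no policy exists, meaning the data is kept.
function retentionCutoff(
  kind: string,
  now: Date,
  overrides: Record<string, number | undefined> = {},
): Date | undefined {
  const days = overrides[kind] ?? DEFAULT_RETENTION_DAYS[kind];
  if (days === undefined) return undefined;
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
}
```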

---

#### 2.24 Automated Backup & Point-in-Time Restore

**Why:** Azure Cosmos DB has continuous backup, but admin needs visibility and one-click restore capability.

**Proposed approach:**

- Admin UI: `/ops/backups` page showing Azure backup status
- Manual export to blob (scheduled job from §2.1)
- Restore button: triggers Azure Cosmos point-in-time restore API
- Cross-region replication status indicator

---

#### 2.25 Billing Dunning & Payment Recovery

**Why:** Stripe handles retries, but the platform needs to: notify users of failed payments, offer grace periods, and eventually downgrade plans.

**Proposed flow:**

1. `invoice.payment_failed` → send "payment failed" email (§2.2) + in-app banner
2. After 3 failures (Stripe Smart Retries) → send "final warning" email
3. After grace period (7 days) → downgrade to free plan + email notification
4. All transitions logged to audit

**Integration:** Stripe webhook handler (existing) + email delivery (§2.2) + scheduled job (§2.1) for grace period enforcement.
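
The flow can be sketched as a pure decision function that the webhook handler and the scheduled job both call. It is an illustration under assumptions: the grace period is taken to start when the final warning is sent, and `DunningState`, `finalWarningSentAt`, and `nextDunningAction` are hypothetical names.

```typescript
type DunningAction = 'none' | 'send_payment_failed' | 'send_final_warning' | 'downgrade_to_free';

interface DunningState {
  failedAttempts: number; // invoice.payment_failed events seen for the open invoice
  finalWarningSentAt?: string; // ISO timestamp, set once the final warning went out
}

const MAX_FAILURES = 3;
const GRACE_PERIOD_MS = 7 * 24 * 60 * 60 * 1000;

// Decides the next step for one subscription.
function nextDunningAction(state: DunningState, now: Date): DunningAction {
  if (state.failedAttempts === 0) return 'none';
  if (state.failedAttempts < MAX_FAILURES) return 'send_payment_failed';
  if (!state.finalWarningSentAt) return 'send_final_warning';
  const elapsed = now.getTime() - Date.parse(state.finalWarningSentAt);
  return elapsed >= GRACE_PERIOD_MS ? 'downgrade_to_free' : 'none';
}
```

The webhook handler would call this once per `invoice.payment_failed` event, while the scheduled job would call it periodically to catch the end of the grace period. Whatever action is taken should also be written to the audit log, per step 4.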

---

## 3. Implementation Priority Matrix

| Phase        | Components                                 | Effort | Dependencies                     | Unlocks                                                    |
| ------------ | ------------------------------------------ | ------ | -------------------------------- | ---------------------------------------------------------- |
| **Sprint 1** | 2.1 Scheduled Jobs                         | M      | None                             | Foundation for all time-based operations                   |
| **Sprint 1** | 2.4 Event Bus                              | S      | None                             | Decoupling for email, webhooks, audit                      |
| **Sprint 2** | 2.2 Email Delivery                         | M      | 2.4 Event Bus                    | User communication (welcome, trial expiry, payment failed) |
| **Sprint 2** | 2.5 Password Reset + Email Verify          | S      | 2.2 Email Delivery               | Auth completeness — table-stakes for production            |
| **Sprint 3** | 2.3 Webhook Subscriptions                  | M      | 2.4 Event Bus                    | Third-party integrations, Zapier/Slack                     |
| **Sprint 3** | 2.7 Session Management                     | S      | None                             | Security (sign out everywhere, revocation)                 |
| **Sprint 4** | 2.10 Maintenance Mode                      | S      | None                             | Operational control during deployments                     |
| **Sprint 4** | 2.9 Data Export                            | S      | 2.1 Jobs (for blob cleanup)      | Admin self-service, compliance                             |
| **Sprint 5** | 2.13 Analytics Rollups                     | M      | 2.1 Jobs (for rollup scheduling) | Dashboard charts, business metrics                         |
| **Sprint 5** | 2.19 OpenAPI Docs                          | S      | None                             | Developer experience, API discoverability                  |
| **Sprint 6** | 2.6 Status Page                            | S      | None                             | User trust, incident communication                         |
| **Sprint 6** | 2.16 Changelog                             | S      | None                             | User engagement, release communication                     |
| **Sprint 7** | 2.11 Rate Limit Dashboard                  | S      | None                             | Ops visibility                                             |
| **Sprint 7** | 2.25 Billing Dunning                       | S      | 2.1 Jobs + 2.2 Email             | Payment recovery automation                                |
| **Later**    | 2.8, 2.12, 2.14–2.15, 2.17–2.18, 2.20–2.24 | Varies | —                                | Scale, polish, enterprise                                  |

**Effort key:** S = Small (1–2 days), M = Medium (3–5 days), L = Large (1–2 weeks)

**Critical path:** Event Bus (2.4) → Email Delivery (2.2) → Password Reset (2.5). These three should be the first items built, in that order.

---

## 4. New Cosmos Containers & Cost Impact

Each new component introduces Cosmos containers. Cosmos DB Serverless charges per RU consumed + storage, so idle containers cost only storage (~$0.25/GB/month).

| Component              | New Containers                                 | Partition Key                             | Est. TTL        | Est. Daily RU                       |
| ---------------------- | ---------------------------------------------- | ----------------------------------------- | --------------- | ----------------------------------- |
| **2.1 Jobs**           | `job_definitions`, `job_runs`                  | `/productId`, `/productId:jobName`        | runs: 90d       | ~50 RU (low volume)                 |
| **2.2 Email/Push**     | `delivery_log`, `email_templates`              | `/productId:channel:yyyyMM`, `/productId` | log: 90d        | ~200 RU                             |
| **2.3 Webhooks**       | `webhook_subscriptions`, `webhook_deliveries`  | `/productId`, `/subscriptionId:yyyyMM`    | deliveries: 30d | ~100 RU                             |
| **2.5 Password Reset** | `password_reset_tokens`, `email_verifications` | `/productId`, `/productId`                | 24h auto        | ~10 RU                              |
| **2.6 Status**         | `service_status`, `incidents`                  | `/productId`, `/productId`                | None            | ~20 RU                              |
| **2.7 Sessions**       | `sessions`                                     | `/userId`                                 | 90d             | ~500 RU (read-heavy)                |
| **2.8 Migrations**     | `migrations`                                   | `/productId`                              | None            | ~5 RU (startup only)                |
| **2.9 Exports**        | `export_jobs`                                  | `/productId`                              | 30d             | ~20 RU                              |
| **2.11 IP Rules**      | `ip_rules`                                     | `/productId`                              | None (manual)   | ~10 RU                              |
| **2.12 Experiments**   | `experiments`                                  | `/productId`                              | None            | ~50 RU                              |
| **2.13 Analytics**     | `analytics_rollups`                            | `/productId:metric:period`                | None            | ~300 RU (write-heavy during rollup) |
| **2.14 Feedback**      | `feedback`                                     | `/productId`                              | None            | ~50 RU                              |
| **2.16 Changelog**     | `changelog`                                    | `/productId`                              | None            | ~10 RU                              |
| **2.20 i18n**          | `translations`                                 | `/productId:locale`                       | None            | ~100 RU (read-heavy, cacheable)     |
| **2.23 Retention**     | `retention_policies`                           | `/productId`                              | None            | ~5 RU                               |

**Total new containers:** 20 (across all phases, per the table above)
**Existing containers:** 27 (defined in `cosmos-init.ts`: products, users, settings, devices, notification_prefs, audit_log, feature_flags, invitation_codes, referrals, subscriptions, payments, licenses, plans, usage_daily, api_tokens, tracker_items, comments, votes, themes, waitlist, memory_items, daily_briefs, reflections, brain_insights, telemetry_events, telemetry_error_clusters, telemetry_collection_policies). Note: `promos` module uses Stripe API directly — no Cosmos container.
**Cost impact:** Minimal for Serverless tier — idle containers only consume storage. Active containers during job runs add burst RU.

**Recommendation:** Register all new containers in `cosmos-init.ts` alongside existing ones. Use TTL liberally for transient data (tokens, deliveries, job runs) to keep storage bounded.

---

## 5. New Environment Variables

New components will require additional env vars. All should be added to `.env.example` files in both repos and documented.

| Component            | Variable                      | Example                          | Required                  |
| -------------------- | ----------------------------- | -------------------------------- | ------------------------- |
| **2.1 Jobs**         | `JOB_RUNNER_ENABLED`          | `true`                           | No (default: true)        |
| **2.1 Jobs**         | `JOB_TICK_INTERVAL_MS`        | `60000`                          | No (default: 60s)         |
| **2.2 Email**        | `SENDGRID_API_KEY`            | `SG.xxx`                         | Yes (for email delivery)  |
| **2.2 Email**        | `EMAIL_FROM_ADDRESS`          | `noreply@lysnrai.com`            | Yes                       |
| **2.2 Email**        | `EMAIL_FROM_NAME`             | `LysnrAI`                        | No                        |
| **2.2 Push**         | `APNS_KEY_ID`                 | `ABC123`                         | Yes (for iOS push)        |
| **2.2 Push**         | `APNS_TEAM_ID`                | `748N7QPX7J`                     | Yes                       |
| **2.2 Push**         | `APNS_KEY_PATH`               | `./certs/AuthKey.p8`             | Yes                       |
| **2.2 Push**         | `FCM_SERVICE_ACCOUNT_JSON`    | `{...}`                          | Yes (for Android push)    |
| **2.5 Auth**         | `PASSWORD_RESET_URL_BASE`     | `https://app.lysnrai.com/reset`  | Yes                       |
| **2.5 Auth**         | `EMAIL_VERIFY_URL_BASE`       | `https://app.lysnrai.com/verify` | Yes                       |
| **2.10 Maintenance** | `MAINTENANCE_MODE`            | `off`                            | No (default: off)         |
| **2.10 Maintenance** | `MAINTENANCE_BYPASS_IPS`      | `10.0.0.1,10.0.0.2`              | No                        |
| **2.3 Webhooks**     | `WEBHOOK_DELIVERY_TIMEOUT_MS` | `5000`                           | No (default: 5s)          |
| **2.3 Webhooks**     | `WEBHOOK_MAX_RETRIES`         | `3`                              | No (default: 3)           |
| **2.7 Sessions**     | `SESSION_TTL_DAYS`            | `90`                             | No (default: 90)          |
| **2.7 Sessions**     | `SESSION_CACHE_TTL_MS`        | `30000`                          | No (default: 30s)         |
| **2.19 OpenAPI**     | `SWAGGER_UI_ENABLED`          | `true`                           | No (default: true in dev) |

**Secret management:** `SENDGRID_API_KEY`, `APNS_*`, and `FCM_*` should be added to Azure Key Vault as `lysnr-sendgrid-api-key`, `lysnr-apns-key-id`, etc. Update `LYSNR_SECRETS` in `@bytelyst/config` to include them.

---

## 6. Quick Reference — Where Things Live

| Component                | Repo                                | Path                                                   |
| ------------------------ | ----------------------------------- | ------------------------------------------------------ |
| Platform-service modules | `learning_ai_common_plat`           | `services/platform-service/src/modules/`               |
| Shared packages          | `learning_ai_common_plat`           | `packages/`                                            |
| Admin dashboard          | `learning_voice_ai_agent`           | `admin-dashboard-web/`                                 |
| User dashboard           | `learning_voice_ai_agent`           | `user-dashboard-web/`                                  |
| Tracker dashboard        | `learning_voice_ai_agent`           | `tracker-dashboard-web/`                               |
| Docker Compose           | both repos                          | `docker-compose.yml`                                   |
| Monitoring               | `learning_ai_common_plat`           | `services/monitoring/`                                 |
| Design tokens            | `learning_ai_common_plat`           | `packages/design-tokens/`                              |
| MindLyst native app      | `learning_multimodal_memory_agents` | `mindlyst-native/` (KMP + SwiftUI + Compose + Next.js) |
| MindLyst web             | `learning_multimodal_memory_agents` | `mindlyst-native/web/`                                 |
| Existing webhooks        | `learning_ai_common_plat`           | `services/platform-service/src/lib/webhooks.ts`        |
| Cosmos container defs    | `learning_ai_common_plat`           | `services/platform-service/src/lib/cosmos-init.ts`     |
| Telemetry design doc     | `learning_ai_common_plat`           | `docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md`             |
| Telemetry roadmap        | `learning_ai_common_plat`           | `docs/WINDSURF/TELEMETRY_ROADMAP.md`                   |
| **This document**        | `learning_ai_common_plat`           | `docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md`         |

---

## Appendix A: Risks & Open Questions

| #   | Topic                              | Risk / Question                                                                                                                                     | Mitigation                                                                                                                                                                     |
| --- | ---------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| 1   | **Leader election for jobs**       | In-process tick loop with Cosmos lease — what happens during deploys? Two instances may briefly both hold leases.                                   | Cosmos lease has a built-in TTL. Use 30s lease with 10s renewal. During deploy overlap, the old instance's lease expires before the new one acquires. Jobs must be idempotent. |
| 2   | **Email deliverability**           | SendGrid requires domain verification (SPF/DKIM/DMARC). Without it, emails land in spam.                                                            | Set up `lysnrai.com` domain authentication in SendGrid before shipping §2.2. Budget 1–2 days for DNS propagation.                                                              |
| 3   | **Session validation latency**     | Checking Cosmos on every request for session revocation adds ~5–10ms per request.                                                                   | In-memory cache with 30s TTL (§2.7). Revocation is eventually consistent — acceptable trade-off for most apps. Document the 30s window.                                        |
| 4   | **Cosmos container proliferation** | 27 existing + 20 new = 47 containers. Serverless tier has no per-container cost, but management complexity grows.                                   | Group related containers by module. Document all containers in `cosmos-init.ts`. Consider container-per-module naming convention.                                              |
| 5   | **Event bus ordering guarantees**  | Async handlers on the in-memory `EventEmitter` finish in no guaranteed order. If audit must record before webhook fires, ordering matters.          | Phase 1: Document that handlers run concurrently with no ordering. If ordering is needed, use handler priority weights or sequential mode.                                     |
| 6   | **Push notification certificates** | APNs requires yearly certificate renewal. If it expires, all iOS push silently stops.                                                               | Add `apns-cert-expiry-check` to scheduled jobs (§2.1). Alert admin 30 days before expiry.                                                                                      |
| 7   | **Webhook abuse**                  | External subscribers could register slow endpoints that back up the delivery queue.                                                                 | Per-subscription timeout (5s default), circuit breaker after 10 consecutive failures, auto-disable.                                                                            |
| 8   | **Migration rollback**             | Cosmos is schemaless — some migrations (e.g., partition key changes) are irreversible.                                                              | Mark migrations as `reversible: true/false`. Require manual approval for irreversible migrations. Always back up before running.                                               |
| 9   | **MindLyst parity**                | MindLyst web uses Cosmos directly (in-memory fallback). Shared components (email, sessions, webhooks) must work for MindLyst too, not just LysnrAI. | All new modules use `productId` for multi-product isolation. MindLyst can consume the same platform-service APIs.                                                              |
| 10  | **Priority conflicts**             | Sprint plan assumes available engineering bandwidth. If telemetry or mobile work takes priority, these sprints slip.                                | Treat sprint assignments as relative ordering, not calendar commitments. Re-evaluate after each sprint.                                                                        |

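The lease timing in risk 1 can be sketched as a small pure function. This is a minimal in-memory model, assuming a single lease record with a holder and an expiry time; `Lease`, `tryAcquire`, and the instance names are hypothetical, and a real implementation would persist the record in Cosmos and guard each write with a conditional update (for example an ETag check). It shows that a new instance cannot take over until the old lease has expired. An old instance that is still mid-job at that moment can overlap with the new one, which is why jobs must stay idempotent.

```typescript
// Hypothetical in-memory model of a lease record (not the Cosmos SDK).
interface Lease {
  holder: string;
  expiresAt: number; // epoch milliseconds
}

const LEASE_TTL_MS = 30_000; // 30s lease, per the mitigation for risk 1
const RENEW_EVERY_MS = 10_000; // 10s renewal interval

// Returns the new lease if `instanceId` may hold it at time `now`, else null.
function tryAcquire(current: Lease | null, instanceId: string, now: number): Lease | null {
  const free = current === null || current.expiresAt <= now;
  const mine = current !== null && current.holder === instanceId;
  if (!free && !mine) return null; // someone else holds a live lease
  return { holder: instanceId, expiresAt: now + LEASE_TTL_MS };
}

// Deploy overlap: "old" renews once at t=10s, then is shut down.
let lease = tryAcquire(null, "old", 0);
lease = tryAcquire(lease, "old", RENEW_EVERY_MS); // renewed, now expires at t=40s
const duringOverlap = tryAcquire(lease, "new", 25_000); // null: lease still live
const afterExpiry = tryAcquire(lease, "new", 40_000); // "new" takes over
```
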
---

## Appendix B: Component Dependency Graph

```
                    ┌─────────────────────┐
                    │   Event Bus (2.4)   │
                    └─────────┬───────────┘
                              │ emits to subscribers
      ┌─────────────┬─────────┴───┬─────────────┐
      │             │             │             │
      ▼             ▼             ▼             ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Email/Push│ │ Webhook   │ │ Audit Log │ │ Analytics │
│ (2.2)     │ │ (2.3)     │ │ (existing)│ │ (2.13)    │
└─────┬─────┘ └───────────┘ └───────────┘ └───────────┘
      │
      │ sends
      ▼
┌───────────┐
│ Password  │
│ Reset(2.5)│
└───────────┘

┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Scheduled     │──▶│ Analytics       │   │ Blob Storage    │
│ Jobs (2.1)    │   │ Rollups (2.13)  │   │ (existing)      │
└───────┬───────┘   └─────────────────┘   └────────┬────────┘
        │                                          │
        │ triggers on schedule                     ▲ writes exports
        ▼                                          │
┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Trial Expiry  │   │ Usage Reset     │   │ Data Export     │
│ (2.1 job)     │   │ (2.1 job)       │   │ (2.9)           │
└───────────────┘   └─────────────────┘   └─────────────────┘

┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Billing       │──▶│ Email/Push      │   │ Retention       │
│ Dunning(2.25) │   │ Delivery (2.2)  │   │ Cleanup (2.23)  │
└───────────────┘   └─────────────────┘   └─────────────────┘
```
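
The graph can also be read as data to derive a prerequisite-first ordering. The sketch below is one interpretation, not an authoritative build plan: the `dependsOn` map encodes only the edges drawn above, read as "needs this first" (so Data Export needs Blob Storage, and Billing Dunning needs Email/Push). The two Email/Push boxes are merged into one node, and boxes with no drawn edge are left out. `dependsOn` and `buildOrder` are hypothetical names.

```typescript
// Each key lists the components it needs first (one reading of the diagram).
const dependsOn: Record<string, string[]> = {
  "Event Bus (2.4)": [],
  "Scheduled Jobs (2.1)": [],
  "Blob Storage (existing)": [],
  "Email/Push (2.2)": ["Event Bus (2.4)"],
  "Webhook (2.3)": ["Event Bus (2.4)"],
  "Audit Log (existing)": ["Event Bus (2.4)"],
  "Analytics (2.13)": ["Event Bus (2.4)"],
  "Analytics Rollups (2.13)": ["Scheduled Jobs (2.1)"],
  "Password Reset (2.5)": ["Email/Push (2.2)"],
  "Trial Expiry (2.1 job)": ["Scheduled Jobs (2.1)"],
  "Data Export (2.9)": ["Blob Storage (existing)"],
  "Billing Dunning (2.25)": ["Email/Push (2.2)"],
};

// Depth-first topological sort; throws if the graph contains a cycle.
function buildOrder(graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();
  const visit = (node: string): void => {
    if (state.get(node) === "done") return;
    if (state.get(node) === "visiting") throw new Error(`cycle at ${node}`);
    state.set(node, "visiting");
    for (const dep of graph[node] ?? []) visit(dep);
    state.set(node, "done");
    order.push(node); // emitted only after all prerequisites
  };
  Object.keys(graph).forEach((node) => visit(node));
  return order;
}

const order = buildOrder(dependsOn);
```

Every component appears in `order` after everything it needs, so the list can serve as a sanity check on sprint sequencing; an accidental circular dependency surfaces as a thrown error instead of a silent mis-ordering.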

---

## Appendix C: Review Findings

Systematic review performed 2026-02-17. All issues below have been fixed inline.

| #   | Severity | Section  | Finding                                                                                                                                                      | Fix                                                                                                                     |
| --- | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- |
| 1   | **Bug**  | §1.3     | Test count stale: said "158+ tests" — actual count is **621** (verified via `grep -c 'it(' *.test.ts`).                                                      | Updated to 621.                                                                                                         |
| 2   | **Bug**  | §1.1     | Endpoint column inconsistent: some modules said "CRUD" (vague, could be 4–8 routes), others had exact counts.                                                | Replaced all "CRUD" with actual route counts.                                                                           |
| 3   | **Bug**  | §2.5     | Said "console-logged URLs for dev/testing" — violates project rule: never `console.log` in production code.                                                  | Changed to `req.log.info`.                                                                                              |
| 4   | **Bug**  | §2.12    | `ExperimentDoc.targetingRules: {}` — meaningless empty object type.                                                                                          | Changed to `FlagTargetingRules` (reuse from flags module).                                                              |
| 5   | **Bug**  | §2.3     | Webhook event `user.deleted` source said `auth.delete` — no such endpoint name. Actual route is `DELETE /auth/users/:id` (admin action).                     | Fixed source column.                                                                                                    |
| 6   | **Bug**  | §4       | `email_verifications` container (from §2.5) missing from Cosmos table. Only `password_reset_tokens` was listed.                                              | Added `email_verifications` to §2.5 row.                                                                                |
| 7   | **Bug**  | §4       | Existing container count said "~25+" — actual is **27** (counted from `cosmos-init.ts`; `promos` uses Stripe API directly, no Cosmos container).             | Updated to 27 with full container list.                                                                                 |
| 8   | **Bug**  | §4       | Total new containers said "~17" — after adding `email_verifications` and `ip_rules`, count is **19**.                                                        | Updated.                                                                                                                |
| 9   | **Gap**  | §2.2     | No clarity on email template storage strategy. `renderer.ts` mentioned but not whether templates are Cosmos-stored or file-based.                            | Clarified: `repository.ts` now references `delivery_log + email_templates` containers.                                  |
| 10  | **Gap**  | §2.4     | No migration strategy from existing `lib/webhooks.ts` to new event bus pattern.                                                                              | Added "Migration from existing `lib/webhooks.ts`" subsection with 3-phase plan.                                         |
| 11  | **Gap**  | §2.10    | Maintenance mode proposed extending `settings` module but didn't clarify storage location. Missing from §4 Cosmos table.                                     | Added: stored as single document per product in existing `settings` container (no new container needed).                |
| 12  | **Gap**  | §2.11    | IP rules need persistence but no container was mentioned. Missing from §4 table.                                                                             | Added `ip_rules` container (pk: `/productId`) to both §2.11 and §4 table.                                               |
| 13  | **Gap**  | §2.9     | Data Export didn't mention blob module dependency (exports written to blob storage).                                                                         | Added explicit dependency note on `blob` module and `jobs` module for cleanup.                                          |
| 14  | **Gap**  | §5       | Missing env vars for webhooks (timeout, retries) and sessions (TTL, cache TTL).                                                                              | Added 4 new env vars: `WEBHOOK_DELIVERY_TIMEOUT_MS`, `WEBHOOK_MAX_RETRIES`, `SESSION_TTL_DAYS`, `SESSION_CACHE_TTL_MS`. |
| 15  | **Gap**  | §6       | Quick Reference missing MindLyst repo (`learning_multimodal_memory_agents`). Doc scope says "ByteLyst platform" which includes MindLyst.                     | Added MindLyst native app and web entries. Also added `cosmos-init.ts` path.                                            |
| 16  | **Gap**  | Appendix | Dependency graph incomplete: missing Jobs → Data Export connection, missing Blob → Data Export dependency, downstream jobs not labeled with section numbers. | Rewrote graph with all connections and section labels.                                                                  |
| 17  | **Gap**  | Overall  | No "Risks & Open Questions" section — design docs should call out unknowns.                                                                                  | Added Appendix A with 10 risk items and mitigations.                                                                    |
| 18  | **Gap**  | TOC      | Table of Contents didn't include Appendix sections.                                                                                                          | Added Appendix A, B, C to TOC.                                                                                          |
| 19  | **Gap**  | §2.5     | Password reset cross-referenced "§2.6" for sessions but sessions was renumbered to §2.7 in previous edit pass.                                               | Fixed to §2.7 (caught in prior pass).                                                                                   |
| 20  | **Gap**  | §1.5     | Infrastructure table was missing Swagger/OpenAPI (partially wired) and Prometheus metrics (partially enabled).                                               | Added in prior pass — verified still present.                                                                           |

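The four environment variables from finding 14 could be loaded with validation along these lines. This is a sketch under stated assumptions: `intFromEnv` and `loadConfig` are hypothetical helpers, the 5s and 30s defaults are taken from Appendix A (risks 7 and 3), and the defaults for `WEBHOOK_MAX_RETRIES` and `SESSION_TTL_DAYS` are placeholders, since this document does not state them here. In the service, the `env` argument would typically be `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Parses a non-negative integer, falling back when the variable is unset or blank.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env) {
  return {
    // 5s default per Appendix A, risk 7
    webhookDeliveryTimeoutMs: intFromEnv(env, "WEBHOOK_DELIVERY_TIMEOUT_MS", 5_000),
    // Placeholder default (assumption, not from the source document)
    webhookMaxRetries: intFromEnv(env, "WEBHOOK_MAX_RETRIES", 3),
    // Placeholder default (assumption, not from the source document)
    sessionTtlDays: intFromEnv(env, "SESSION_TTL_DAYS", 30),
    // 30s revocation window per Appendix A, risk 3
    sessionCacheTtlMs: intFromEnv(env, "SESSION_CACHE_TTL_MS", 30_000),
  };
}

// Only one variable is set here; the others fall back to their defaults.
const config = loadConfig({ WEBHOOK_MAX_RETRIES: "5" });
```
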
---

_This document is a living brainstorm. Items will be promoted to dedicated design docs (like `CLIENT_TELEMETRY_DESIGN.md`) as they move into implementation._