From c59049efecd6d091fb5523b9f8b6597ed5ee4c0e Mon Sep 17 00:00:00 2001 From: saravanakumardb1 Date: Tue, 17 Feb 2026 08:47:46 -0800 Subject: [PATCH] docs: add client telemetry & log insights detailed design --- docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md | 945 +++++++++++++++++++++++ 1 file changed, 945 insertions(+) create mode 100644 docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md diff --git a/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md b/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md new file mode 100644 index 00000000..ff4b2e02 --- /dev/null +++ b/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md @@ -0,0 +1,945 @@ +# Client Telemetry & Log Insights — Detailed Design + +> **Audience:** Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos. +> **Scope:** Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails. +> **Status:** Design — not yet implemented. +> **Last updated:** 2026-02-17 + +--- + +## Table of Contents + +1. [Problem Statement](#1-problem-statement) +2. [Goals & Non-Goals](#2-goals--non-goals) +3. [Architecture Overview](#3-architecture-overview) +4. [Telemetry Event Schema (Canonical)](#4-telemetry-event-schema-canonical) +5. [Segment-Based Collection Control](#5-segment-based-collection-control) +6. [Ingestion API Contract](#6-ingestion-api-contract) +7. [Storage & Partitioning](#7-storage--partitioning) +8. [Error Clustering (Derived)](#8-error-clustering-derived) +9. [Admin / DevOps UI](#9-admin--devops-ui) +10. [Client SDK Integration](#10-client-sdk-integration) +11. [Privacy & Security](#11-privacy--security) +12. [Rollout Plan](#12-rollout-plan) +13. [Open Questions](#13-open-questions) + +--- + +## 1. Problem Statement + +When a user reports "keyboard voice dictation doesn't type into Messages on iPhone 17 Pro," we currently have **zero server-side visibility** into what happened on that device. We cannot see: + +- Did recognition start? Which backend (Azure / local)? +- Did recognition produce results? +- Did `insertText` succeed or no-op? +- What error code/domain terminated the session? +- What app version / build / OS / permissions state was active? + +We need a lightweight, always-on (but controllable) telemetry pipeline that: + +1. Collects structured diagnostic events from all client platforms. +2. Correlates events by user, device, platform, version, and session. +3. Surfaces insights in the admin dashboard for debugging and release health. +4. Can be turned on/off per segment (user, platform, region, version, etc.). + +--- + +## 2. Goals & Non-Goals + +### Goals + +- **G1:** Unified event schema across iOS, Android, Desktop, Web. +- **G2:** Per-user, per-platform, per-version, per-region segment targeting for collection. +- **G3:** Admin UI with drill-down from cluster → user → session → event. +- **G4:** Privacy-first: no raw dictation text, no PII in payloads. +- **G5:** Low overhead: async batched sends, client-side sampling for noisy events. +- **G6:** Leverage existing infrastructure (platform-service, Cosmos DB, feature flags). + +### Non-Goals + +- Real-time streaming dashboards (v1 uses polling/refresh). +- Full APM / distributed tracing replacement (use Azure Monitor for that). +- Client-side crash reporting (use native crash reporters — Crashlytics, Sentry). + +--- + +## 3. Architecture Overview + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Client Platforms │ +│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐ │ +│ │ iOS App │ │ iOS Kbd │ │ Android │ │ Desktop │ │ Web Apps │ │ +│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘ │ +│ │ │ │ │ │ │ +│ ▼ ▼ ▼ ▼ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ Client Telemetry SDK (per-platform thin layer) │ │ +│ │ • Collects events → batches → POST /api/telemetry │ │ +│ │ • Checks collection policy via feature flag poll │ │ +│ │ • Samples debug events, never samples error/fatal │ │ +│ └──────────────────────────────────┬───────────────────────┘ │ +└─────────────────────────────────────┼───────────────────────────┘ + │ HTTPS + ▼ +┌──────────────────────────────────────────────────────────────────┐ +│ platform-service (:4003) │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ POST /api/telemetry/events (batch ingest) │ │ +│ │ GET /api/telemetry/query (admin read) │ │ +│ │ GET /api/telemetry/clusters (aggregated error view) │ │ +│ │ GET /api/telemetry/config (collection policy) │ │ +│ └──────────────────────────────────┬───────────────────────┘ │ +│ │ │ +│ ┌──────────────────────────────────▼───────────────────────┐ │ +│ │ Cosmos DB │ │ +│ │ • telemetry_events (raw, TTL 30–60d) │ │ +│ │ • telemetry_error_clusters (derived, TTL 90–180d) │ │ +│ │ • telemetry_collection_policies (segment rules) │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ +│ Existing modules used: │ +│ • feature_flags — segment evaluation (FNV-1a hash) │ +│ • auth — JWT validation for authenticated events │ +│ • rate-limit — per-user/install throttling │ +└──────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────┐ +│ admin-dashboard-web (:3001) │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ Ops → Client Logs │ │ +│ │ • Live event stream (recent errors) │ │ +│ │ • Error cluster view (top failures by platform/build) │ │ +│ │ • User timeline (all events for one user) │ │ +│ │ • Collection policy manager (segment targeting UI) │ │ +│ └──────────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 4. Telemetry Event Schema (Canonical) + +Every client event MUST conform to this schema. Fields marked **REQUIRED** must always be present. + +### 4.1 Identity Fields + +| Field | Type | Required | Description | +| -------------------- | --------------- | ----------- | -------------------------------------------- | +| `id` | `string` (uuid) | REQUIRED | Unique event ID, generated client-side | +| `productId` | `string` | REQUIRED | Product identifier (e.g. `"lysnrai"`) | +| `userId` | `string?` | Conditional | Present when user is authenticated | +| `anonymousInstallId` | `string?` | Conditional | Stable per-install UUID when `userId` absent | +| `sessionId` | `string` | REQUIRED | App/keyboard session correlation ID | +| `requestId` | `string?` | Optional | Cross-service correlation (`x-request-id`) | + +> **Rule:** At least one of `userId` or `anonymousInstallId` MUST be present. + +### 4.2 Source Classification Fields + +| Field | Type | Required | Description | +| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------- | +| `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` | +| `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` | +| `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` | +| `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` | +| `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` | +| `locale` | `string?` | Optional | BCP 47 locale, e.g. `"en-US"`, `"ta-IN"` | +| `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` | +| `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side | +| `regionCode` | `string?` | Optional | e.g. `"WA"`, `"TN"` — derived server-side from IP geo | + +### 4.3 App Release Fields + +| Field | Type | Required | Description | +| ---------------- | -------- | -------- | ---------------------------------------------------------------------------- | +| `appVersion` | `string` | REQUIRED | Semantic version: `CFBundleShortVersionString` / `versionName` / npm version | +| `buildNumber` | `string` | REQUIRED | `CFBundleVersion` / `versionCode` / web release commit hash | +| `releaseChannel` | `enum` | REQUIRED | `"dev"` \| `"beta"` \| `"prod"` | + +### 4.4 Event Semantics Fields + +| Field | Type | Required | Description | +| ----------- | --------- | -------- | ---------------------------------------------------------------------------------------------- | +| `eventType` | `enum` | REQUIRED | `"debug"` \| `"info"` \| `"warn"` \| `"error"` \| `"fatal"` | +| `module` | `string` | REQUIRED | Logical module: `"keyboard_dictation"`, `"auth"`, `"sync"`, `"settings"`, `"onboarding"` | +| `feature` | `string?` | Optional | Sub-feature: `"voice_typing"`, `"settings_deeplink"`, `"azure_recognition"` | +| `eventName` | `string` | REQUIRED | Snake_case event: `"mic_tapped"`, `"recognition_failed"`, `"insert_noop"`, `"session_started"` | + +### 4.5 Error & Diagnostics Fields + +| Field | Type | Required | Description | +| ------------- | --------- | -------- | -------------------------------------------------------------------- | +| `errorDomain` | `string?` | On error | iOS NSError domain, Android exception class, JS Error name | +| `errorCode` | `string?` | On error | Normalized string code | +| `message` | `string?` | On error | Sanitized, max 512 chars — NEVER raw user content | +| `stackTrace` | `string?` | Optional | Redacted/capped at 8KB — only for `fatal` events | +| `fingerprint` | `string?` | Optional | Client-side hash of `(module + eventName + errorCode + errorDomain)` | + +### 4.6 Structured Metadata (Extensible) + +| Field | Type | Required | Description | +| --------- | -------------------------- | -------- | ----------------------------------------------------------- | +| `tags` | `Record?` | Optional | Small indexed key-value pairs (max 20 keys, 128 chars each) | +| `metrics` | `Record?` | Optional | Numeric measurements: durations, counters, sizes | +| `context` | `Record?` | Optional | Schema-validated safe object, max 4KB serialized | + +### 4.7 Module-Specific: Keyboard Dictation + +When `module = "keyboard_dictation"`, clients SHOULD include a structured `dictation` object: + +| Field | Type | Description | +| ------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- | +| `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active | +| `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state | +| `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission | +| `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission | +| `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? | +| `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? | +| `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? | +| `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? | +| `dictation.transcriptLength` | `number` | Character count only — NEVER raw text | +| `dictation.sessionDurationMs` | `number` | Time from mic tap to stop | +| `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. `"com.apple.MobileSMS"`) | + +### 4.8 Timing Fields + +| Field | Type | Required | Description | +| ------------ | -------------------- | ---------- | ------------------------- | +| `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp | +| `receivedAt` | `string` (ISO 8601) | Server-set | Set by ingestion endpoint | +| `ttlAt` | `string?` (ISO 8601) | Server-set | Cosmos TTL expiry marker | + +--- + +## 5. Segment-Based Collection Control + +### 5.1 Motivation + +Telemetry should not be a firehose. We need granular control to: + +- **Debug a specific user:** Turn on verbose logging for user `usr_abc123` only. +- **Target a platform:** Collect keyboard dictation events only from iOS. +- **Target a region:** Enable collection for users in `US:WA` (Seattle area) or `IN:TN` (Chennai area). +- **Target a version:** Collect from users on build < 26 (old builds with known bug). +- **Target an OS:** Only Linux desktop users. +- **Global kill switch:** Disable all collection instantly. + +### 5.2 Collection Policy Document Schema + +Stored in Cosmos container `telemetry_collection_policies`: + +```ts +interface TelemetryCollectionPolicy { + id: string; // uuid + productId: string; // REQUIRED + + // Identity + name: string; // human-readable: "Debug iOS keyboard for user X" + description: string; + enabled: boolean; // master toggle + priority: number; // higher = evaluated first (for conflicts) + + // What to collect + eventTypes: ('debug' | 'info' | 'warn' | 'error' | 'fatal')[]; + modules: string[]; // empty = all modules + samplingRate: number; // 0.0–1.0 (1.0 = collect everything matching) + + // Segment targeting rules (ALL conditions must match = AND logic) + targeting: { + // User targeting + userIds?: string[]; // specific user IDs + anonymousInstallIds?: string[]; // specific install IDs + + // Platform targeting + platforms?: ('ios' | 'android' | 'web' | 'desktop')[]; + channels?: ( + | 'mobile_app' + | 'keyboard_extension' + | 'web_app' + | 'desktop_app' + | 'backend_service' + )[]; + osFamilies?: ('ios' | 'android' | 'macos' | 'windows' | 'linux' | 'chromeos')[]; + + // Version targeting + appVersions?: string[]; // exact match list: ["1.0.0", "1.1.0"] + appVersionRange?: { + // semver range + min?: string; // inclusive + max?: string; // inclusive + }; + buildNumbers?: string[]; // exact match list: ["25", "26"] + buildNumberRange?: { + min?: number; // inclusive + max?: number; // inclusive + }; + + // Region targeting (derived from client locale/timezone or server-side IP geo) + countryCodes?: string[]; // ISO 3166-1 alpha-2: ["US", "IN"] + regionCodes?: string[]; // sub-national: ["US:WA", "IN:TN", "IN:KA"] + + // Release channel targeting + releaseChannels?: ('dev' | 'beta' | 'prod')[]; + + // Percentage rollout (uses existing FNV-1a hash from feature flags) + percentage?: number; // 0–100, deterministic per userId/installId + }; + + // Lifecycle + startsAt?: string; // ISO — policy activates at this time + expiresAt?: string; // ISO — policy auto-deactivates + createdAt: string; + updatedAt: string; + createdBy: string; // admin userId who created it +} +``` + +### 5.3 Policy Evaluation Logic (Client-Side) + +Clients poll `GET /api/telemetry/config` periodically (every 5 min or on app foreground). The server evaluates all active policies against the client's context and returns a **merged collection config**: + +```ts +// Response from GET /api/telemetry/config?platform=ios&channel=keyboard_extension&... +interface TelemetryCollectionConfig { + enabled: boolean; // global kill switch + eventTypes: string[]; // which event types to collect + modules: string[]; // which modules (empty = all) + samplingRates: { + // per event type + debug: number; // 0.0–1.0 + info: number; + warn: number; + error: number; + fatal: number; + }; + batchSize: number; // max events per POST + flushIntervalMs: number; // how often to flush batch + maxQueueSize: number; // drop oldest if exceeded +} +``` + +### 5.4 Evaluation Rules + +1. **Global default:** If no policies match, use a hardcoded default: + - Collect `warn`, `error`, `fatal` only + - Sample `warn` at 50%, `error`/`fatal` at 100% + - Flush every 60s, batch of 20 + +2. **Policy matching:** A policy matches if ALL non-null targeting conditions are met (AND logic). + +3. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are unioned (if any policy enables `debug`, it's enabled). + +4. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module: + + ```ts + hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage; + ``` + +5. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response. + +### 5.5 Example Policies + +#### A) Debug one user's iOS keyboard + +```json +{ + "name": "Debug user sd9235 iOS keyboard", + "enabled": true, + "priority": 100, + "eventTypes": ["debug", "info", "warn", "error", "fatal"], + "modules": ["keyboard_dictation"], + "samplingRate": 1.0, + "targeting": { + "userIds": ["usr_sd9235"], + "platforms": ["ios"], + "channels": ["keyboard_extension"] + }, + "expiresAt": "2026-02-20T00:00:00Z" +} +``` + +#### B) All iOS users on old builds + +```json +{ + "name": "Collect errors from iOS builds < 26", + "enabled": true, + "priority": 50, + "eventTypes": ["warn", "error", "fatal"], + "modules": [], + "samplingRate": 1.0, + "targeting": { + "platforms": ["ios"], + "buildNumberRange": { "min": 1, "max": 25 } + } +} +``` + +#### C) Seattle-area users only + +```json +{ + "name": "Seattle region telemetry", + "enabled": true, + "priority": 60, + "eventTypes": ["info", "warn", "error", "fatal"], + "modules": [], + "samplingRate": 0.5, + "targeting": { + "regionCodes": ["US:WA"] + } +} +``` + +#### D) Only Linux desktop + +```json +{ + "name": "Linux desktop diagnostics", + "enabled": true, + "priority": 50, + "eventTypes": ["warn", "error", "fatal"], + "modules": [], + "samplingRate": 1.0, + "targeting": { + "platforms": ["desktop"], + "osFamilies": ["linux"] + } +} +``` + +#### E) 10% of all web users (canary) + +```json +{ + "name": "Web telemetry canary rollout", + "enabled": true, + "priority": 30, + "eventTypes": ["warn", "error", "fatal"], + "modules": [], + "samplingRate": 1.0, + "targeting": { + "platforms": ["web"], + "percentage": 10 + } +} +``` + +#### F) Chennai, India — mobile app only + +```json +{ + "name": "Chennai mobile diagnostics", + "enabled": true, + "priority": 60, + "eventTypes": ["info", "warn", "error", "fatal"], + "modules": [], + "samplingRate": 1.0, + "targeting": { + "platforms": ["ios", "android"], + "channels": ["mobile_app"], + "regionCodes": ["IN:TN"] + } +} +``` + +#### G) Global kill switch (disable all collection) + +```json +{ + "name": "GLOBAL OFF", + "enabled": true, + "priority": 999, + "eventTypes": [], + "modules": [], + "samplingRate": 0.0, + "targeting": {} +} +``` + +--- + +## 6. Ingestion API Contract + +### 6.1 `POST /api/telemetry/events` — Batch Ingest + +**Auth:** JWT (authenticated users) or API key header (anonymous installs). + +**Request:** + +```ts +// Zod schema +const TelemetryIngestRequest = z.object({ + productId: z.string().min(1), + events: z.array(TelemetryEventSchema).min(1).max(50), + clientClockSkewMs: z.number().optional(), +}); +``` + +**Response (200):** + +```ts +interface TelemetryIngestResponse { + accepted: number; + rejected: number; + errors?: Array<{ index: number; reason: string }>; + serverTime: string; +} +``` + +**Rate limits:** + +- Authenticated: 100 requests/min per userId +- Anonymous: 30 requests/min per anonymousInstallId +- Payload: max 256KB per request + +**Validation rules:** + +1. `productId` must match a known, active product. +2. Each event must have `id`, `productId`, `platform`, `channel`, `osFamily`, `appVersion`, `buildNumber`, `releaseChannel`, `eventType`, `module`, `eventName`, `sessionId`, `occurredAt`. +3. At least one of `userId` or `anonymousInstallId`. +4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys. +5. PII regex rejection: reject events containing patterns matching email, phone, credit card. +6. No raw dictation text allowed in any field. + +### 6.2 `GET /api/telemetry/config` — Collection Config (Client Poll) + +**Auth:** JWT or API key. + +**Query params:** + +| Param | Type | Description | +| ---------------- | ------- | ----------------------- | +| `platform` | string | Client platform | +| `channel` | string | Client channel | +| `osFamily` | string | OS family | +| `appVersion` | string | Current app version | +| `buildNumber` | string | Current build number | +| `releaseChannel` | string | dev/beta/prod | +| `countryCode` | string? | Client-reported country | +| `regionCode` | string? | Client-reported region | + +**Response:** `TelemetryCollectionConfig` (see §5.3). + +**Cache:** Client should cache this for 5 minutes. Server sets `Cache-Control: max-age=300`. + +### 6.3 `GET /api/telemetry/query` — Admin Query (Read) + +**Auth:** Admin JWT only. + +**Query params:** + +| Param | Type | Description | +| -------------------- | ------------ | --------------------------------- | +| `userId` | string? | Filter by user | +| `anonymousInstallId` | string? | Filter by install | +| `platform` | string? | Filter by platform | +| `channel` | string? | Filter by channel | +| `osFamily` | string? | Filter by OS family | +| `appVersion` | string? | Filter by version | +| `buildNumber` | string? | Filter by build | +| `module` | string? | Filter by module | +| `eventName` | string? | Filter by event name | +| `eventType` | string? | Filter by severity | +| `from` | string (ISO) | Start time | +| `to` | string (ISO) | End time | +| `limit` | number | Max results (default 50, max 200) | +| `continuationToken` | string? | Pagination | + +**Response:** + +```ts +interface TelemetryQueryResponse { + events: TelemetryEvent[]; + total: number; + continuationToken?: string; +} +``` + +### 6.4 `GET /api/telemetry/clusters` — Error Clusters (Admin) + +**Auth:** Admin JWT only. + +**Query params:** Same filters as query, plus `groupBy` (default: `fingerprint`). + +**Response:** + +```ts +interface TelemetryClusterResponse { + clusters: TelemetryErrorCluster[]; + total: number; +} +``` + +### 6.5 Collection Policy CRUD (Admin) + +| Method | Path | Description | +| -------- | ----------------------------- | ----------------- | +| `GET` | `/api/telemetry/policies` | List all policies | +| `POST` | `/api/telemetry/policies` | Create policy | +| `PUT` | `/api/telemetry/policies/:id` | Update policy | +| `DELETE` | `/api/telemetry/policies/:id` | Delete policy | + +--- + +## 7. Storage & Partitioning + +### 7.1 Cosmos Containers + +#### `telemetry_events` (raw events) + +| Property | Value | +| ------------- | ------------------------------------------------------------ | +| Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` | +| TTL | 30–60 days (configurable via env `TELEMETRY_EVENT_TTL_DAYS`) | +| RU budget | Start at 400 RU/s autoscale, monitor and adjust | + +**Rationale:** Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load. + +**Composite indexes:** + +```json +[ + { "path": "/eventType", "order": "ascending" }, + { "path": "/occurredAt", "order": "descending" } +] +``` + +**Additional indexed paths:** `/userId`, `/anonymousInstallId`, `/module`, `/eventName`, `/appVersion`, `/buildNumber`, `/channel`, `/osFamily`. + +#### `telemetry_error_clusters` (aggregated) + +| Property | Value | +| ------------- | ----------------------------------------------------- | +| Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` | +| TTL | 90–180 days | +| RU budget | 200 RU/s autoscale | + +#### `telemetry_collection_policies` (segment rules) + +| Property | Value | +| ------------- | ----------------------- | +| Partition key | `/productId` | +| TTL | None (manual lifecycle) | +| RU budget | Minimal (low volume) | + +### 7.2 Container Registration + +Add to `registerContainers()` call in platform-service `src/lib/cosmos.ts`: + +```ts +registerContainers([ + // ... existing containers ... + { id: 'telemetry_events', partitionKeyPath: '/pk' }, + { id: 'telemetry_error_clusters', partitionKeyPath: '/pk' }, + { id: 'telemetry_collection_policies', partitionKeyPath: '/productId' }, +]); +``` + +--- + +## 8. Error Clustering (Derived) + +### 8.1 Fingerprint Generation + +Client-side (optional) and server-side (authoritative): + +```ts +function generateFingerprint(event: TelemetryEvent): string { + const input = [ + event.platform, + event.channel, + event.module, + event.eventName, + event.errorDomain ?? '', + event.errorCode ?? '', + normalizeMessage(event.message ?? ''), + ].join(':'); + return sha256(input).substring(0, 16); // 16-char hex +} + +function normalizeMessage(msg: string): string { + // Strip numbers, UUIDs, paths, timestamps + return msg + .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '') + .replace(/\d+/g, '') + .replace(/\/[\w/.]+/g, '') + .toLowerCase() + .trim(); +} +``` + +### 8.2 Cluster Document + +```ts +interface TelemetryErrorCluster { + id: string; // fingerprint + time window key + pk: string; // ${productId}:${platform}:${module} + productId: string; + fingerprint: string; + + // Dimensions + platform: string; + channel: string; + osFamily: string; + module: string; + eventName: string; + appVersion: string; + buildNumber: string; + + // Aggregates + firstSeenAt: string; + lastSeenAt: string; + totalCount: number; + affectedUserIds: string[]; // capped at 100 + affectedInstallIds: string[]; // capped at 100 + + // Representative sample + sampleErrorDomain?: string; + sampleErrorCode?: string; + sampleMessage?: string; + severity: 'warn' | 'error' | 'fatal'; +} +``` + +### 8.3 Cluster Update Strategy + +On each ingested `error`/`fatal` event: + +1. Compute fingerprint. +2. Upsert cluster doc: increment `totalCount`, update `lastSeenAt`, append to `affectedUserIds` (dedup, cap at 100). +3. Run as a lightweight post-ingest step (same request, not a separate job — keeps it simple for v1). + +--- + +## 9. Admin / DevOps UI + +### 9.1 Page: `Ops → Client Logs` + +Located at `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/page.tsx`. + +#### Filter Bar + +| Filter | Type | Default | +| ------------ | ---------------------------------------- | ------------ | +| User ID | text input | — | +| Platform | multi-select: ios, android, web, desktop | all | +| Channel | multi-select | all | +| OS Family | multi-select | all | +| App Version | text/select | — | +| Build Number | text/select | — | +| Module | select | all | +| Event Type | multi-select | error, fatal | +| Time Range | date range picker | last 24h | + +#### Views + +1. **Error Clusters (default):** Table of top clusters sorted by `totalCount` desc. + - Columns: fingerprint, module, eventName, platform, build, count, affected users, last seen. + - Click → drill into cluster detail (sample events, user list). + +2. **Event Stream:** Chronological list of raw events matching filters. + - Columns: time, user, platform, channel, build, module, eventName, eventType, message. + - Click → full event detail (JSON + dictation struct if present). + +3. **User Timeline:** Enter a userId → see all events chronologically. + - Useful for "what happened to user X's keyboard session." + +### 9.2 Page: `Ops → Telemetry Policies` + +Located at `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.tsx`. + +- CRUD for collection policies. +- Visual segment builder (dropdowns for platform, OS, version range, region, etc.). +- Priority ordering (drag/drop or numeric). +- Enable/disable toggle per policy. +- "Preview" button: show how many matching users/installs (based on recent telemetry). + +--- + +## 10. Client SDK Integration + +### 10.1 iOS (Swift) — App + Keyboard Extension + +```swift +// Shared via App Group (group.com.bytelyst.LysnrAI) +class LysnrTelemetry { + static let shared = LysnrTelemetry() + + func track( + eventType: EventType, + module: String, + eventName: String, + message: String? = nil, + errorCode: String? = nil, + errorDomain: String? = nil, + dictation: DictationContext? = nil, + tags: [String: String]? = nil, + metrics: [String: Double]? = nil + ) + + func flush() // force-send queued events + func refreshConfig() // poll collection policy +} +``` + +- Keyboard extension: posts events to App Group shared `UserDefaults` queue; main app flushes. +- Alternatively, keyboard extension sends directly if Full Access is enabled (network available). + +### 10.2 Android (Kotlin) + +```kotlin +object LysnrTelemetry { + fun track( + eventType: EventType, + module: String, + eventName: String, + message: String? = null, + errorCode: String? = null, + dictation: DictationContext? = null, + ) + fun flush() + fun refreshConfig() +} +``` + +### 10.3 Desktop (Python) + +```python +from lysnrai.telemetry import telemetry + +telemetry.track( + event_type="error", + module="speech_recognition", + event_name="azure_timeout", + message="Recognition timed out after 30s", + tags={"backend": "azure"}, + metrics={"duration_ms": 30000}, +) +``` + +### 10.4 Web (TypeScript) + +```ts +import { telemetry } from '@/lib/telemetry'; + +telemetry.track({ + eventType: 'error', + module: 'auth', + eventName: 'token_refresh_failed', + errorCode: '401', + message: 'JWT expired and refresh failed', +}); +``` + +--- + +## 11. Privacy & Security + +### 11.1 Hard Rules + +1. **NEVER** send raw dictated/transcribed text in any field. +2. **NEVER** send passwords, tokens, API keys, or PII (email, phone, SSN). +3. `message` field: sanitized, max 512 chars, no user content. +4. `stackTrace`: redacted file paths, max 8KB, only on `fatal`. +5. Server-side PII regex scanner rejects events containing detected PII patterns. +6. `countryCode` / `regionCode`: derived from IP geo server-side (never GPS coordinates). + +### 11.2 Data Retention + +| Container | Default TTL | Configurable | +| ------------------------------- | ----------- | ---------------------------- | +| `telemetry_events` | 30 days | `TELEMETRY_EVENT_TTL_DAYS` | +| `telemetry_error_clusters` | 90 days | `TELEMETRY_CLUSTER_TTL_DAYS` | +| `telemetry_collection_policies` | No TTL | Manual delete / `expiresAt` | + +### 11.3 Access Control + +- **Ingest (`POST /api/telemetry/events`):** Any authenticated user or valid install API key. +- **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only. +- **Policy management:** Admin JWT only. +- **No public endpoints.** Telemetry data is internal/operational only. + +### 11.4 Rate Limiting + +| Client Type | Limit | +| ------------------ | ----------- | +| Authenticated user | 100 req/min | +| Anonymous install | 30 req/min | +| Admin query | 60 req/min | + +--- + +## 12. Rollout Plan + +### Phase 1 — MVP (1–2 weeks) + +**Goal:** iOS keyboard dictation debugging visible in admin dashboard. + +| Component | Scope | +| ---------------- | ----------------------------------------------------------------------------- | +| platform-service | `telemetry` module: `types.ts`, `repository.ts`, `routes.ts` (ingest + query) | +| platform-service | Collection policy CRUD + config endpoint | +| iOS keyboard | `LysnrTelemetry` client in KeyboardViewController — keyboard_dictation events | +| admin-dashboard | `Ops → Client Logs` page with basic event stream + filters | +| Cosmos | Register 3 containers | + +**Delivers:** When a user reports "keyboard not typing," admin can look up their userId, see exact error flow, permissions state, backend choice, and insertion outcome. + +### Phase 2 — Full Platform Coverage (2–3 weeks) + +| Component | Scope | +| ---------------------- | ------------------------------------------------------- | +| iOS app | Telemetry for auth, settings, onboarding modules | +| Android app + keyboard | Full telemetry parity with iOS | +| Desktop (Python) | Telemetry for speech recognition, hotkey, paste modules | +| admin-dashboard | Error cluster view, user timeline view | +| platform-service | Cluster aggregation on ingest | + +### Phase 3 — Advanced (3–4 weeks) + +| Component | Scope | +| ---------------- | ---------------------------------------------------- | +| Web dashboards | Telemetry for auth, API errors, page load | +| admin-dashboard | Telemetry policy builder UI, version comparison view | +| platform-service | Alerting rules (error spike → Slack/email) | +| All clients | Region/geo enrichment server-side | + +--- + +## 13. Open Questions + +| # | Question | Status | +| --- | ----------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | +| 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **Recommend:** Direct when Full Access on, App Group queue as fallback | +| 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears | +| 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 | +| 4 | Max retention for raw events? Compliance requirements? | Default 30 days, configurable | +| 5 | Do we need GDPR right-to-erasure support for telemetry? | Yes — add `DELETE /api/telemetry/user/:userId` endpoint | + +--- + +## Appendix A: Env Vars + +| Var | Default | Description | +| ----------------------------- | -------- | ------------------------------ | +| `TELEMETRY_ENABLED` | `true` | Global server-side kill switch | +| `TELEMETRY_EVENT_TTL_DAYS` | `30` | Raw event retention | +| `TELEMETRY_CLUSTER_TTL_DAYS` | `90` | Cluster retention | +| `TELEMETRY_MAX_BATCH_SIZE` | `50` | Max events per ingest request | +| `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body | +| `TELEMETRY_PII_SCAN_ENABLED` | `true` | Server-side PII rejection | + +## Appendix B: Related Files + +| File | Repo | Purpose | +| ----------------------------------------------------------------- | ----------- | -------------------------------------------- | +| `services/platform-service/src/modules/telemetry/` | common-plat | Telemetry module (types, repo, routes) | +| `services/platform-service/src/modules/flags/` | common-plat | Feature flags (reused for segment % rollout) | +| `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/` | lysnrai | Admin log viewer | +| `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/` | lysnrai | Policy manager UI | +| `mobile_app/ios/LysnrKeyboard/KeyboardViewController.swift` | lysnrai | iOS keyboard (first telemetry client) | +| `mobile_app/android/.../LysnrInputMethodService.kt` | lysnrai | Android keyboard (Phase 2) | +| `src/telemetry/` | lysnrai | Python desktop telemetry client (Phase 2) |