# Client Telemetry & Log Insights — Detailed Design > **Audience:** Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos. > **Scope:** Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails. > **Status:** Design — implementing today, keyboard-first. > **Last updated:** 2026-02-17 (rev 2 — 18 gaps fixed) --- ## Table of Contents 1. [Problem Statement](#1-problem-statement) 2. [Goals & Non-Goals](#2-goals--non-goals) 3. [Architecture Overview](#3-architecture-overview) 4. [Telemetry Event Schema (Canonical)](#4-telemetry-event-schema-canonical) 5. [Segment-Based Collection Control](#5-segment-based-collection-control) 6. [Ingestion API Contract](#6-ingestion-api-contract) 7. [Storage & Partitioning](#7-storage--partitioning) 8. [Error Clustering (Derived)](#8-error-clustering-derived) 9. [Admin / DevOps UI](#9-admin--devops-ui) 10. [Client SDK Integration](#10-client-sdk-integration) 11. [Privacy & Security](#11-privacy--security) 12. [Rollout Plan](#12-rollout-plan) 13. [Open Questions](#13-open-questions) --- ## 1. Problem Statement When a user reports "keyboard voice dictation doesn't type into Messages on iPhone 17 Pro," we currently have **zero server-side visibility** into what happened on that device. We cannot see: - Did recognition start? Which backend (Azure / local)? - Did recognition produce results? - Did `insertText` succeed or no-op? - What error code/domain terminated the session? - What app version / build / OS / permissions state was active? We need a lightweight, always-on (but controllable) telemetry pipeline that: 1. Collects structured diagnostic events from all client platforms. 2. Correlates events by user, device, platform, version, and session. 3. Surfaces insights in the admin dashboard for debugging and release health. 4. Can be turned on/off per segment (user, platform, region, version, etc.). --- ## 2. Goals & Non-Goals ### Goals - **G1:** Unified event schema across iOS, Android, Desktop, Web. - **G2:** Per-user, per-platform, per-version, per-region segment targeting for collection. - **G3:** Admin UI with drill-down from cluster → user → session → event. - **G4:** Privacy-first: no raw dictation text, no PII in payloads. - **G5:** Low overhead: async batched sends, client-side sampling for noisy events. - **G6:** Leverage existing infrastructure (platform-service, Cosmos DB, feature flags). ### Non-Goals - Real-time streaming dashboards (v1 uses polling/refresh). - Full APM / distributed tracing replacement (use Azure Monitor for that). - Client-side crash reporting (use native crash reporters — Crashlytics, Sentry). --- ## 3. Architecture Overview ``` ┌──────────────────────────────────────────────────────────────────┐ │ Client Platforms │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐ │ │ │ iOS App │ │ iOS Kbd │ │ Android │ │ Desktop │ │ Web Apps │ │ │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ │ │ ┌──────────────────────────────────────────────────────────┐ │ │ │ Client Telemetry SDK (per-platform thin layer) │ │ │ │ • Collects events → batches → POST /api/telemetry │ │ │ │ • Checks collection policy via feature flag poll │ │ │ │ • Samples debug events, never samples error/fatal │ │ │ └──────────────────────────────────┬───────────────────────┘ │ └─────────────────────────────────────┼───────────────────────────┘ │ HTTPS ▼ ┌──────────────────────────────────────────────────────────────────┐ │ platform-service (:4003) │ │ ┌──────────────────────────────────────────────────────────┐ │ │ │ POST /api/telemetry/events (batch ingest) │ │ │ │ GET /api/telemetry/query (admin read) │ │ │ │ GET /api/telemetry/clusters (aggregated error view) │ │ │ │ GET /api/telemetry/config (collection policy) │ │ │ └──────────────────────────────────┬───────────────────────┘ │ │ │ │ │ ┌──────────────────────────────────▼───────────────────────┐ │ │ │ Cosmos DB │ │ │ │ • telemetry_events (raw, TTL 30–60d) │ │ │ │ • telemetry_error_clusters (derived, TTL 90–180d) │ │ │ │ • telemetry_collection_policies (segment rules) │ │ │ └──────────────────────────────────────────────────────────┘ │ │ │ │ Existing modules used: │ │ • feature_flags — segment evaluation (FNV-1a hash) │ │ • auth — JWT validation for authenticated events │ │ • rate-limit — per-user/install throttling │ └──────────────────────────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────────┐ │ admin-dashboard-web (:3001) │ │ ┌──────────────────────────────────────────────────────────┐ │ │ │ Ops → Client Logs │ │ │ │ • Live event stream (recent errors) │ │ │ │ • Error cluster view (top failures by platform/build) │ │ │ │ • User timeline (all events for one user) │ │ │ │ • Collection policy manager (segment targeting UI) │ │ │ └──────────────────────────────────────────────────────────┘ │ └──────────────────────────────────────────────────────────────────┘ ``` --- ## 4. Telemetry Event Schema (Canonical) Every client event MUST conform to this schema. Fields marked **REQUIRED** must always be present. ### 4.1 Identity Fields | Field | Type | Required | Description | | -------------------- | --------------- | ----------- | -------------------------------------------- | | `id` | `string` (uuid) | REQUIRED | Unique event ID, generated client-side | | `productId` | `string` | REQUIRED | Product identifier (e.g. `"lysnrai"`) | | `userId` | `string?` | Conditional | Present when user is authenticated | | `anonymousInstallId` | `string?` | Conditional | Stable per-install UUID when `userId` absent | | `sessionId` | `string` | REQUIRED | App/keyboard session correlation ID | | `requestId` | `string?` | Optional | Cross-service correlation (`x-request-id`) | > **Rule:** At least one of `userId` or `anonymousInstallId` MUST be present. ### 4.1.1 `anonymousInstallId` Generation Strategy Each platform generates a stable UUID on first launch and persists it: | Platform | Storage | Key | | ---------------- | ----------------------------------------------------- | -------------------------------- | | **iOS app** | Keychain (kSecAttrAccessibleAfterFirstUnlock) | `com.bytelyst.LysnrAI.installId` | | **iOS keyboard** | App Group UserDefaults (`group.com.bytelyst.LysnrAI`) | `telemetry_install_id` | | **Android** | EncryptedSharedPreferences | `telemetry_install_id` | | **Desktop** | `~/.lysnrai/telemetry_install_id` (plain file) | — | | **Web** | `localStorage` | `lysnrai_install_id` | > **iOS keyboard note:** The keyboard extension shares the install ID via App Group so main app and extension use the same identity. ### 4.1.2 Authentication for Telemetry Ingest Not all clients have a JWT (e.g., keyboard extension before user logs in). The ingest endpoint accepts two auth modes: | Mode | Header | When Used | | ----------------- | --------------------------------------- | -------------------------------------------------------- | | **JWT** | `Authorization: Bearer ` | Authenticated users (main app, web, desktop after login) | | **Install Token** | `X-Install-Token: ` | Unauthenticated clients (keyboard extension, pre-login) | **Install token validation:** The server accepts any well-formed UUID in `X-Install-Token`. It does NOT verify against a registry (install IDs are self-issued). Rate limiting is applied per install ID to prevent abuse. **Keyboard extension specifics:** - With Full Access ON: sends events directly via HTTPS using `X-Install-Token`. - With Full Access OFF: queues events to App Group UserDefaults (max 200 events, ~100KB). Main app flushes on next foreground. - If queue is full, oldest events are dropped (FIFO eviction). - **Memory constraint:** iOS keyboard extensions are limited to ~30MB. Telemetry queue MUST stay under 100KB. Events are serialized as compact JSON (no pretty-print). ### 4.2 Source Classification Fields | Field | Type | Required | Description | | ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | | `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` | | `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` | | `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` | | `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` | | `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` | | `locale` | `string?` | Optional | BCP 47 locale, e.g. `"en-US"`, `"ta-IN"` | | `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` | | `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side | | `regionCode` | `string?` | Optional | Prefixed format: `"US:WA"`, `"IN:TN"` — derived server-side from IP geo. Always `{country}:{region}` to avoid ambiguity (TN = Tennessee or Tamil Nadu) | ### 4.3 App Release Fields | Field | Type | Required | Description | | ---------------- | -------- | -------- | ---------------------------------------------------------------------------- | | `appVersion` | `string` | REQUIRED | Semantic version: `CFBundleShortVersionString` / `versionName` / npm version | | `buildNumber` | `string` | REQUIRED | `CFBundleVersion` / `versionCode` / web release commit hash | | `releaseChannel` | `enum` | REQUIRED | `"dev"` \| `"beta"` \| `"prod"` | ### 4.4 Event Semantics Fields | Field | Type | Required | Description | | ----------- | --------- | -------- | ---------------------------------------------------------------------------------------------- | | `eventType` | `enum` | REQUIRED | `"debug"` \| `"info"` \| `"warn"` \| `"error"` \| `"fatal"` | | `module` | `string` | REQUIRED | Logical module: `"keyboard_dictation"`, `"auth"`, `"sync"`, `"settings"`, `"onboarding"` | | `feature` | `string?` | Optional | Sub-feature: `"voice_typing"`, `"settings_deeplink"`, `"azure_recognition"` | | `eventName` | `string` | REQUIRED | Snake_case event: `"mic_tapped"`, `"recognition_failed"`, `"insert_noop"`, `"session_started"` | ### 4.5 Error & Diagnostics Fields | Field | Type | Required | Description | | ------------- | --------- | -------- | -------------------------------------------------------------------- | | `errorDomain` | `string?` | On error | iOS NSError domain, Android exception class, JS Error name | | `errorCode` | `string?` | On error | Normalized string code | | `message` | `string?` | On error | Sanitized, max 512 chars — NEVER raw user content | | `stackTrace` | `string?` | Optional | Redacted/capped at 8KB — only for `fatal` events | | `fingerprint` | `string?` | Optional | Client-side hash of `(module + eventName + errorCode + errorDomain)` | ### 4.6 Structured Metadata (Extensible) | Field | Type | Required | Description | | --------- | -------------------------- | -------- | ----------------------------------------------------------- | | `tags` | `Record?` | Optional | Small indexed key-value pairs (max 20 keys, 128 chars each) | | `metrics` | `Record?` | Optional | Numeric measurements: durations, counters, sizes | | `context` | `Record?` | Optional | Schema-validated safe object, max 4KB serialized | ### 4.7 Module-Specific: Keyboard Dictation When `module = "keyboard_dictation"`, clients SHOULD include a structured `dictation` object: | Field | Type | Description | | ---------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------- | | `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active | | `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state | | `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission | | `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission | | `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? | | `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? | | `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? | | `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? | | `dictation.transcriptLength` | `number` | Character count only — NEVER raw text | | `dictation.sessionDurationMs` | `number` | Time from mic tap to stop | | `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. `"com.apple.MobileSMS"`) | | `dictation.errorRecoveryAttempted` | `boolean` | Was Azure→local (or vice versa) recovery attempted during this session? | | `dictation.errorRecoverySucceeded` | `boolean?` | If recovery was attempted, did the fallback backend produce results? | | `dictation.audioSessionCategory` | `string?` | iOS AVAudioSession category active during dictation (e.g. `"playAndRecord"`) | ### 4.8 Server-Computed Fields These fields are **set by the ingestion endpoint**, never by clients. | Field | Type | Required | Description | | ------------ | ------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------ | | `pk` | `string` | Server-set | Cosmos partition key: `${productId}:${yyyyMM}:${platform}`. Computed from event fields on ingest | | `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp (client provides this) | | `receivedAt` | `string` (ISO 8601) | Server-set | Server receipt timestamp | | `ttl` | `number` | Server-set | Cosmos TTL in **seconds** (not ISO date). Cosmos uses `_ts + ttl` for auto-expiry. Default: `TELEMETRY_EVENT_TTL_DAYS * 86400` | --- ## 5. Segment-Based Collection Control ### 5.1 Motivation Telemetry should not be a firehose. We need granular control to: - **Debug a specific user:** Turn on verbose logging for user `usr_abc123` only. - **Target a platform:** Collect keyboard dictation events only from iOS. - **Target a region:** Enable collection for users in `US:WA` (Seattle area) or `IN:TN` (Chennai area). - **Target a version:** Collect from users on build < 26 (old builds with known bug). - **Target an OS:** Only Linux desktop users. - **Global kill switch:** Disable all collection instantly. ### 5.2 Collection Policy Document Schema Stored in Cosmos container `telemetry_collection_policies`: ```ts interface TelemetryCollectionPolicy { id: string; // uuid productId: string; // REQUIRED // Identity name: string; // human-readable: "Debug iOS keyboard for user X" description: string; enabled: boolean; // master toggle priority: number; // higher = evaluated first (for conflicts) // What to collect eventTypes: ('debug' | 'info' | 'warn' | 'error' | 'fatal')[]; modules: string[]; // empty = all modules samplingRate: number; // 0.0–1.0 (1.0 = collect everything matching) // Segment targeting rules (ALL conditions must match = AND logic) targeting: { // User targeting userIds?: string[]; // specific user IDs anonymousInstallIds?: string[]; // specific install IDs // Platform targeting platforms?: ('ios' | 'android' | 'web' | 'desktop')[]; channels?: ( | 'mobile_app' | 'keyboard_extension' | 'web_app' | 'desktop_app' | 'backend_service' )[]; osFamilies?: ('ios' | 'android' | 'macos' | 'windows' | 'linux' | 'chromeos')[]; // Version targeting appVersions?: string[]; // exact match list: ["1.0.0", "1.1.0"] appVersionRange?: { // semver range min?: string; // inclusive max?: string; // inclusive }; buildNumbers?: string[]; // exact match list: ["25", "26"] buildNumberRange?: { min?: number; // inclusive max?: number; // inclusive }; // Region targeting (derived from client locale/timezone or server-side IP geo) countryCodes?: string[]; // ISO 3166-1 alpha-2: ["US", "IN"] regionCodes?: string[]; // sub-national: ["US:WA", "IN:TN", "IN:KA"] // Release channel targeting releaseChannels?: ('dev' | 'beta' | 'prod')[]; // Percentage rollout (uses existing FNV-1a hash from feature flags) percentage?: number; // 0–100, deterministic per userId/installId }; // Lifecycle startsAt?: string; // ISO — policy activates at this time expiresAt?: string; // ISO — policy auto-deactivates createdAt: string; updatedAt: string; createdBy: string; // admin userId who created it } ``` ### 5.3 Policy Evaluation Logic (Client-Side) Clients poll `GET /api/telemetry/config` periodically (every 5 min or on app foreground). The server evaluates all active policies against the client's context and returns a **merged collection config**: ```ts // Response from GET /api/telemetry/config?platform=ios&channel=keyboard_extension&... interface TelemetryCollectionConfig { enabled: boolean; // global kill switch eventTypes: string[]; // which event types to collect modules: string[]; // which modules (empty = all) samplingRates: { // per event type debug: number; // 0.0–1.0 info: number; warn: number; error: number; fatal: number; }; batchSize: number; // max events per POST flushIntervalMs: number; // how often to flush batch maxQueueSize: number; // drop oldest if exceeded } ``` ### 5.4 Evaluation Rules 1. **Global default:** If no policies match, use a hardcoded default: - Collect `warn`, `error`, `fatal` only - Sample `warn` at 50%, `error`/`fatal` at 100% - Flush every 60s, batch of 20, max queue 200 2. **Empty targeting = matches ALL:** A policy with `targeting: {}` (all fields omitted) matches every client. This is how the global kill switch works (example G). 3. **Policy matching:** A policy matches if ALL **present** (non-null/non-undefined) targeting conditions are met (AND logic). Omitted conditions are ignored (not checked). 4. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are **unioned** (if any matching policy enables `debug`, it’s enabled). 5. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module: ```ts hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage; ``` 6. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response. 7. **`samplingRate` → `samplingRates` mapping:** A policy’s single `samplingRate` applies to ALL its `eventTypes`. When merging multiple policies, the highest-priority policy’s rate wins per event type. If a policy enables `["debug", "info"]` at rate 0.5 and another enables `["error", "fatal"]` at rate 1.0, the merged config is: ```json { "debug": 0.5, "info": 0.5, "warn": 0.0, "error": 1.0, "fatal": 1.0 } ``` 8. **`batchSize`, `flushIntervalMs`, `maxQueueSize` defaults:** These are NOT set per-policy. They come from server-side env vars with these defaults: | Param | Default | Env Var | |-------|---------|--------| | `batchSize` | 20 | `TELEMETRY_CLIENT_BATCH_SIZE` | | `flushIntervalMs` | 60000 (60s) | `TELEMETRY_CLIENT_FLUSH_MS` | | `maxQueueSize` | 200 | `TELEMETRY_CLIENT_MAX_QUEUE` | The config endpoint returns these with the merged policy so clients don’t hardcode them. ### 5.5 Example Policies #### A) Debug one user's iOS keyboard ```json { "name": "Debug user sd9235 iOS keyboard", "enabled": true, "priority": 100, "eventTypes": ["debug", "info", "warn", "error", "fatal"], "modules": ["keyboard_dictation"], "samplingRate": 1.0, "targeting": { "userIds": ["usr_sd9235"], "platforms": ["ios"], "channels": ["keyboard_extension"] }, "expiresAt": "2026-02-20T00:00:00Z" } ``` #### B) All iOS users on old builds ```json { "name": "Collect errors from iOS builds < 26", "enabled": true, "priority": 50, "eventTypes": ["warn", "error", "fatal"], "modules": [], "samplingRate": 1.0, "targeting": { "platforms": ["ios"], "buildNumberRange": { "min": 1, "max": 25 } } } ``` #### C) Seattle-area users only ```json { "name": "Seattle region telemetry", "enabled": true, "priority": 60, "eventTypes": ["info", "warn", "error", "fatal"], "modules": [], "samplingRate": 0.5, "targeting": { "regionCodes": ["US:WA"] } } ``` #### D) Only Linux desktop ```json { "name": "Linux desktop diagnostics", "enabled": true, "priority": 50, "eventTypes": ["warn", "error", "fatal"], "modules": [], "samplingRate": 1.0, "targeting": { "platforms": ["desktop"], "osFamilies": ["linux"] } } ``` #### E) 10% of all web users (canary) ```json { "name": "Web telemetry canary rollout", "enabled": true, "priority": 30, "eventTypes": ["warn", "error", "fatal"], "modules": [], "samplingRate": 1.0, "targeting": { "platforms": ["web"], "percentage": 10 } } ``` #### F) Chennai, India — mobile app only ```json { "name": "Chennai mobile diagnostics", "enabled": true, "priority": 60, "eventTypes": ["info", "warn", "error", "fatal"], "modules": [], "samplingRate": 1.0, "targeting": { "platforms": ["ios", "android"], "channels": ["mobile_app"], "regionCodes": ["IN:TN"] } } ``` #### G) Global kill switch (disable all collection) ```json { "name": "GLOBAL OFF", "enabled": true, "priority": 999, "eventTypes": [], "modules": [], "samplingRate": 0.0, "targeting": {} } ``` --- ## 6. Ingestion API Contract ### 6.1 `POST /api/telemetry/events` — Batch Ingest **Auth:** JWT (`Authorization: Bearer`) or Install Token (`X-Install-Token: `). See §4.1.2. **Request:** ```ts // --- Zod schema for a single telemetry event --- const TelemetryEventSchema = z .object({ // Identity id: z.string().uuid(), productId: z.string().min(1), userId: z.string().optional(), anonymousInstallId: z.string().uuid().optional(), sessionId: z.string().min(1), requestId: z.string().optional(), // Source classification platform: z.enum(['ios', 'android', 'web', 'desktop']), channel: z.enum([ 'mobile_app', 'keyboard_extension', 'web_app', 'desktop_app', 'backend_service', ]), osFamily: z.enum(['ios', 'android', 'macos', 'windows', 'linux', 'chromeos', 'other']), osVersion: z.string().optional(), deviceModel: z.string().optional(), locale: z.string().optional(), timezone: z.string().optional(), // App release appVersion: z.string().min(1), buildNumber: z.string().min(1), releaseChannel: z.enum(['dev', 'beta', 'prod']), // Event semantics eventType: z.enum(['debug', 'info', 'warn', 'error', 'fatal']), module: z.string().min(1), feature: z.string().optional(), eventName: z.string().min(1), // Error & diagnostics errorDomain: z.string().optional(), errorCode: z.string().optional(), message: z.string().max(512).optional(), stackTrace: z.string().max(8192).optional(), fingerprint: z.string().optional(), // Structured metadata tags: z.record(z.string().max(128)).optional(), metrics: z.record(z.number()).optional(), context: z.record(z.unknown()).optional(), // Timing occurredAt: z.string().datetime(), }) .refine(e => e.userId || e.anonymousInstallId, { message: 'At least one of userId or anonymousInstallId is required', }); // --- Batch ingest request --- const TelemetryIngestRequest = z.object({ productId: z.string().min(1), events: z.array(TelemetryEventSchema).min(1).max(50), clientClockSkewMs: z.number().optional(), }); ``` **Response (200):** ```ts interface TelemetryIngestResponse { accepted: number; rejected: number; errors?: Array<{ index: number; reason: string }>; serverTime: string; } ``` **Rate limits:** - Authenticated: 100 requests/min per userId - Anonymous: 30 requests/min per anonymousInstallId - Payload: max 256KB per request **Validation rules:** 1. **`productId` authority:** Request-level `productId` is authoritative. Per-event `productId` MUST match the request-level value; mismatches are rejected. 2. Zod validation enforces all required fields (see schema above). 3. At least one of `userId` or `anonymousInstallId` (Zod refine). 4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys, `context` max 4KB serialized. 5. PII regex rejection: reject events containing patterns matching email, phone, credit card. 6. No raw dictation text allowed in any field. **Idempotency:** Events are upserted by `id`. If a client retries a batch (e.g., network timeout), duplicate event IDs are silently overwritten. This ensures exactly-once semantics without client-side dedup tracking. ### 6.2 `GET /api/telemetry/config` — Collection Config (Client Poll) **Auth:** JWT or API key. **Query params:** | Param | Type | Description | | -------------------- | ------- | --------------------------------------------------------- | | `userId` | string? | Authenticated user ID (for percentage rollout evaluation) | | `anonymousInstallId` | string? | Install ID (fallback for percentage rollout) | | `platform` | string | Client platform | | `channel` | string | Client channel | | `osFamily` | string | OS family | | `appVersion` | string | Current app version | | `buildNumber` | string | Current build number | | `releaseChannel` | string | dev/beta/prod | | `countryCode` | string? | Client-reported country | | `regionCode` | string? | Client-reported region (prefixed: `US:WA`) | **Response:** `TelemetryCollectionConfig` (see §5.3). **Cache:** Client should cache this for 5 minutes. Server sets `Cache-Control: max-age=300`. ### 6.3 `GET /api/telemetry/query` — Admin Query (Read) **Auth:** Admin JWT only. **Query params:** | Param | Type | Description | | -------------------- | ------------ | --------------------------------- | | `userId` | string? | Filter by user | | `anonymousInstallId` | string? | Filter by install | | `platform` | string? | Filter by platform | | `channel` | string? | Filter by channel | | `osFamily` | string? | Filter by OS family | | `appVersion` | string? | Filter by version | | `buildNumber` | string? | Filter by build | | `module` | string? | Filter by module | | `eventName` | string? | Filter by event name | | `eventType` | string? | Filter by severity | | `from` | string (ISO) | Start time | | `to` | string (ISO) | End time | | `limit` | number | Max results (default 50, max 200) | | `continuationToken` | string? | Pagination | **Response:** ```ts interface TelemetryQueryResponse { events: TelemetryEvent[]; total: number; continuationToken?: string; } ``` ### 6.4 `GET /api/telemetry/clusters` — Error Clusters (Admin) **Auth:** Admin JWT only. **Query params:** Same filters as query, plus `groupBy` (default: `fingerprint`). **Response:** ```ts interface TelemetryClusterResponse { clusters: TelemetryErrorCluster[]; total: number; } ``` ### 6.5 Collection Policy CRUD (Admin) | Method | Path | Description | | -------- | ----------------------------- | ----------------- | | `GET` | `/api/telemetry/policies` | List all policies | | `POST` | `/api/telemetry/policies` | Create policy | | `PUT` | `/api/telemetry/policies/:id` | Update policy | | `DELETE` | `/api/telemetry/policies/:id` | Delete policy | ### 6.6 `DELETE /api/telemetry/user/:userId` — GDPR Right-to-Erasure **Auth:** Admin JWT only. Deletes ALL telemetry events and cluster references for the given `userId`. Returns count of deleted documents. Required for GDPR compliance. **Response:** ```ts interface TelemetryErasureResponse { userId: string; eventsDeleted: number; clustersUpdated: number; } ``` --- ## 7. Storage & Partitioning ### 7.1 Cosmos Containers #### `telemetry_events` (raw events) | Property | Value | | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | | Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` | | TTL | `defaultTtl: 30 * 86400` (30 days in seconds, configurable via `TELEMETRY_EVENT_TTL_DAYS`). Cosmos auto-deletes docs when `_ts + ttl` passes | | RU budget | Start at 400 RU/s autoscale, monitor and adjust | **Rationale:** Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load. **Composite indexes:** ```json [ { "path": "/eventType", "order": "ascending" }, { "path": "/occurredAt", "order": "descending" } ] ``` **Additional indexed paths:** `/userId`, `/anonymousInstallId`, `/module`, `/eventName`, `/appVersion`, `/buildNumber`, `/channel`, `/osFamily`. #### `telemetry_error_clusters` (aggregated) | Property | Value | | ------------- | -------------------------------------------------------------------------------------------- | | Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` | | TTL | `defaultTtl: 90 * 86400` (90 days in seconds, configurable via `TELEMETRY_CLUSTER_TTL_DAYS`) | | RU budget | 200 RU/s autoscale | #### `telemetry_collection_policies` (segment rules) | Property | Value | | ------------- | ----------------------- | | Partition key | `/productId` | | TTL | None (manual lifecycle) | | RU budget | Minimal (low volume) | ### 7.2 Container Registration Add to `registerContainers()` call in platform-service `src/lib/cosmos.ts`: ```ts registerContainers([ // ... existing containers ... { id: 'telemetry_events', partitionKeyPath: '/pk' }, { id: 'telemetry_error_clusters', partitionKeyPath: '/pk' }, { id: 'telemetry_collection_policies', partitionKeyPath: '/productId' }, ]); ``` --- ## 8. Error Clustering (Derived) ### 8.1 Fingerprint Generation Client-side (optional) and server-side (authoritative): ```ts function generateFingerprint(event: TelemetryEvent): string { const input = [ event.platform, event.channel, event.module, event.eventName, event.errorDomain ?? '', event.errorCode ?? '', normalizeMessage(event.message ?? ''), ].join(':'); return sha256(input).substring(0, 16); // 16-char hex } function normalizeMessage(msg: string): string { // Strip numbers, UUIDs, paths, timestamps return msg .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '') .replace(/\d+/g, '') .replace(/\/[\w/.]+/g, '') .toLowerCase() .trim(); } ``` ### 8.2 Cluster Document ```ts interface TelemetryErrorCluster { id: string; // fingerprint + time window key (e.g. `${fingerprint}:${yyyyMM}`) pk: string; // ${productId}:${platform}:${module} productId: string; fingerprint: string; // Dimensions (version-agnostic — one cluster spans all versions) platform: string; channel: string; module: string; eventName: string; // Version breakdown — which builds are affected affectedVersions: Array<{ appVersion: string; buildNumber: string; count: number; lastSeenAt: string; }>; // capped at 50 entries // Aggregates firstSeenAt: string; lastSeenAt: string; totalCount: number; affectedUserIds: string[]; // capped at 100 affectedInstallIds: string[]; // capped at 100 affectedOsFamilies: string[]; // e.g. ["ios", "macos"] // Representative sample (from most recent event) sampleErrorDomain?: string; sampleErrorCode?: string; sampleMessage?: string; severity: 'warn' | 'error' | 'fatal'; ttl: number; // Cosmos TTL in seconds } ``` ### 8.3 Cluster Update Strategy On each ingested `warn`, `error`, or `fatal` event: 1. Compute fingerprint. 2. Upsert cluster doc: increment `totalCount`, update `lastSeenAt`, append to `affectedUserIds` (dedup, cap at 100). 3. Run as a lightweight post-ingest step (same request, not a separate job — keeps it simple for v1). --- ## 9. Admin / DevOps UI ### 9.1 Page: `Ops → Client Logs` Located at `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/page.tsx`. #### Filter Bar | Filter | Type | Default | | ------------ | ---------------------------------------- | ------------ | | User ID | text input | — | | Platform | multi-select: ios, android, web, desktop | all | | Channel | multi-select | all | | OS Family | multi-select | all | | App Version | text/select | — | | Build Number | text/select | — | | Module | select | all | | Event Type | multi-select | error, fatal | | Time Range | date range picker | last 24h | #### Views 1. **Error Clusters (default):** Table of top clusters sorted by `totalCount` desc. - Columns: fingerprint, module, eventName, platform, build, count, affected users, last seen. - Click → drill into cluster detail (sample events, user list). 2. **Event Stream:** Chronological list of raw events matching filters. - Columns: time, user, platform, channel, build, module, eventName, eventType, message. - Click → full event detail (JSON + dictation struct if present). 3. **User Timeline:** Enter a userId → see all events chronologically. - Useful for "what happened to user X's keyboard session." ### 9.2 Page: `Ops → Telemetry Policies` Located at `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.tsx`. - CRUD for collection policies. - Visual segment builder (dropdowns for platform, OS, version range, region, etc.). - Priority ordering (drag/drop or numeric). - Enable/disable toggle per policy. - "Preview" button: show how many matching users/installs (based on recent telemetry). --- ## 10. Client SDK Integration ### 10.1 iOS (Swift) — App + Keyboard Extension ```swift // Shared via App Group (group.com.bytelyst.LysnrAI) class LysnrTelemetry { static let shared = LysnrTelemetry() // Core properties (set once at init) let productId = "lysnrai" let platform = "ios" let osFamily = "ios" var channel: String // "mobile_app" or "keyboard_extension" var installId: String // from App Group UserDefaults var userId: String? // from App Group (set after login) func track( eventType: EventType, module: String, eventName: String, message: String? = nil, errorCode: String? = nil, errorDomain: String? = nil, dictation: DictationContext? = nil, tags: [String: String]? = nil, metrics: [String: Double]? = nil ) func flush() // force-send queued events func refreshConfig() // poll collection policy // Keyboard-specific func queueToAppGroup() // write pending events to App Group UserDefaults func flushAppGroupQueue() // called by main app on foreground } ``` **Keyboard extension offline strategy:** - **Full Access ON:** Sends events directly via URLSession. Falls back to App Group queue on network failure. - **Full Access OFF:** Always queues to App Group UserDefaults (`telemetry_event_queue` key). - **Main app responsibility:** On each foreground, calls `LysnrTelemetry.shared.flushAppGroupQueue()` to drain keyboard-queued events. - **Queue limits:** Max 200 events (~100KB). FIFO eviction when full. See §4.1.2 for memory constraints. ### 10.2 Android (Kotlin) ```kotlin object LysnrTelemetry { fun track( eventType: EventType, module: String, eventName: String, message: String? = null, errorCode: String? = null, dictation: DictationContext? = null, ) fun flush() fun refreshConfig() } ``` ### 10.3 Desktop (Python) ```python from lysnrai.telemetry import telemetry telemetry.track( event_type="error", module="speech_recognition", event_name="azure_timeout", message="Recognition timed out after 30s", tags={"backend": "azure"}, metrics={"duration_ms": 30000}, ) ``` ### 10.4 Web (TypeScript) ```ts import { telemetry } from '@/lib/telemetry'; telemetry.track({ eventType: 'error', module: 'auth', eventName: 'token_refresh_failed', errorCode: '401', message: 'JWT expired and refresh failed', }); ``` --- ## 11. Privacy & Security ### 11.1 Hard Rules 1. **NEVER** send raw dictated/transcribed text in any field. 2. **NEVER** send passwords, tokens, API keys, or PII (email, phone, SSN). 3. `message` field: sanitized, max 512 chars, no user content. 4. `stackTrace`: redacted file paths, max 8KB, only on `fatal`. 5. Server-side PII regex scanner rejects events containing detected PII patterns. 6. `countryCode` / `regionCode`: derived from IP geo server-side (never GPS coordinates). ### 11.2 Data Retention | Container | Default TTL | Configurable | | ------------------------------- | ----------- | ---------------------------- | | `telemetry_events` | 30 days | `TELEMETRY_EVENT_TTL_DAYS` | | `telemetry_error_clusters` | 90 days | `TELEMETRY_CLUSTER_TTL_DAYS` | | `telemetry_collection_policies` | No TTL | Manual delete / `expiresAt` | ### 11.3 Access Control - **Ingest (`POST /api/telemetry/events`):** Any authenticated user (JWT) or valid install token (`X-Install-Token`). See §4.1.2. - **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only. Enforced via `req.jwtPayload?.role === 'admin'` check (same pattern as other admin-only modules). - **Policy management:** Admin JWT only (same check). - **GDPR erasure:** Admin JWT only. - **No public endpoints.** Telemetry data is internal/operational only. ### 11.4 Rate Limiting | Client Type | Limit | | ------------------ | ----------- | | Authenticated user | 100 req/min | | Anonymous install | 30 req/min | | Admin query | 60 req/min | --- ## 12. Rollout Plan ### Phase 1 — MVP (1–2 weeks) **Goal:** iOS keyboard dictation debugging visible in admin dashboard. | Component | Scope | | ---------------- | ----------------------------------------------------------------------------- | | platform-service | `telemetry` module: `types.ts`, `repository.ts`, `routes.ts` (ingest + query) | | platform-service | Collection policy CRUD + config endpoint | | iOS keyboard | `LysnrTelemetry` client in KeyboardViewController — keyboard_dictation events | | admin-dashboard | `Ops → Client Logs` page with basic event stream + filters | | Cosmos | Register 3 containers | **Delivers:** When a user reports "keyboard not typing," admin can look up their userId, see exact error flow, permissions state, backend choice, and insertion outcome. ### Phase 2 — Full Platform Coverage (2–3 weeks) | Component | Scope | | ---------------------- | ------------------------------------------------------- | | iOS app | Telemetry for auth, settings, onboarding modules | | Android app + keyboard | Full telemetry parity with iOS | | Desktop (Python) | Telemetry for speech recognition, hotkey, paste modules | | admin-dashboard | Error cluster view, user timeline view | | platform-service | Cluster aggregation on ingest | ### Phase 3 — Advanced (3–4 weeks) | Component | Scope | | ---------------- | ---------------------------------------------------- | | Web dashboards | Telemetry for auth, API errors, page load | | admin-dashboard | Telemetry policy builder UI, version comparison view | | platform-service | Alerting rules (error spike → Slack/email) | | All clients | Region/geo enrichment server-side | --- ## 13. Open Questions | # | Question | Status | | --- | ----------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- | | 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **RESOLVED (rev 2):** Direct when Full Access on, App Group queue as fallback. See §4.1.2. | | 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears | | 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 | | 4 | Max retention for raw events? Compliance requirements? | **RESOLVED (rev 2):** 30 days default, configurable via `TELEMETRY_EVENT_TTL_DAYS`. Cosmos TTL in seconds. | | 5 | Do we need GDPR right-to-erasure support for telemetry? | **RESOLVED (rev 2):** Yes — `DELETE /api/telemetry/user/:userId` added to §6.6. | --- ## Appendix A: Env Vars | Var | Default | Description | | ----------------------------- | -------- | ------------------------------------------------------- | | `TELEMETRY_ENABLED` | `true` | Global server-side kill switch | | `TELEMETRY_EVENT_TTL_DAYS` | `30` | Raw event retention (Cosmos TTL = days × 86400 seconds) | | `TELEMETRY_CLUSTER_TTL_DAYS` | `90` | Cluster retention | | `TELEMETRY_MAX_BATCH_SIZE` | `50` | Max events per ingest request | | `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body | | `TELEMETRY_PII_SCAN_ENABLED` | `true` | Server-side PII rejection | | `TELEMETRY_CLIENT_BATCH_SIZE` | `20` | Returned in config response for client-side batching | | `TELEMETRY_CLIENT_FLUSH_MS` | `60000` | Returned in config response for client flush interval | | `TELEMETRY_CLIENT_MAX_QUEUE` | `200` | Returned in config response for client max queue size | ## Appendix B: Related Files | File | Repo | Purpose | | ------------------------------------------------------------------- | ----------- | ----------------------------------------------------------- | | `services/platform-service/src/modules/telemetry/` | common-plat | Telemetry module (types, repo, routes — 14 endpoints) | | `services/platform-service/src/modules/telemetry/telemetry.test.ts` | common-plat | Telemetry unit tests (624 tests total) | | `services/platform-service/src/modules/flags/` | common-plat | Feature flags (reused for segment % rollout) | | `services/platform-service/src/modules/audit/` | common-plat | Audit log module (telemetry actions logged) | | `scripts/cosmos-telemetry-indexes.sh` | common-plat | Cosmos DB indexing policy for telemetry | | `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/` | lysnrai | Admin log viewer + clusters + geo + metrics | | `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/` | lysnrai | Policy manager UI + live preview | | `admin-dashboard-web/src/app/api/telemetry/` | lysnrai | API proxy routes (events, clusters, metrics, geo, policies) | | `admin-dashboard-web/src/lib/platform-client.ts` | lysnrai | Platform-service client (telemetry functions) | | `mobile_app/ios/LysnrKeyboard/KeyboardViewController.swift` | lysnrai | iOS keyboard (first telemetry client) | | `mobile_app/android/.../LysnrInputMethodService.kt` | lysnrai | Android keyboard (Phase 2) | | `src/telemetry/` | lysnrai | Python desktop telemetry client (Phase 2) |