docs: add client telemetry & log insights detailed design

This commit is contained in:
saravanakumardb1 2026-02-17 08:47:46 -08:00
parent 18a5b342d9
commit c59049efec

View File

@ -0,0 +1,945 @@
# Client Telemetry & Log Insights — Detailed Design
> **Audience:** Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos.
> **Scope:** Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails.
> **Status:** Design — not yet implemented.
> **Last updated:** 2026-02-17
---
## Table of Contents
1. [Problem Statement](#1-problem-statement)
2. [Goals & Non-Goals](#2-goals--non-goals)
3. [Architecture Overview](#3-architecture-overview)
4. [Telemetry Event Schema (Canonical)](#4-telemetry-event-schema-canonical)
5. [Segment-Based Collection Control](#5-segment-based-collection-control)
6. [Ingestion API Contract](#6-ingestion-api-contract)
7. [Storage & Partitioning](#7-storage--partitioning)
8. [Error Clustering (Derived)](#8-error-clustering-derived)
9. [Admin / DevOps UI](#9-admin--devops-ui)
10. [Client SDK Integration](#10-client-sdk-integration)
11. [Privacy & Security](#11-privacy--security)
12. [Rollout Plan](#12-rollout-plan)
13. [Open Questions](#13-open-questions)
---
## 1. Problem Statement
When a user reports "keyboard voice dictation doesn't type into Messages on iPhone 17 Pro," we currently have **zero server-side visibility** into what happened on that device. We cannot see:
- Did recognition start? Which backend (Azure / local)?
- Did recognition produce results?
- Did `insertText` succeed or no-op?
- What error code/domain terminated the session?
- What app version / build / OS / permissions state was active?
We need a lightweight, always-on (but controllable) telemetry pipeline that:
1. Collects structured diagnostic events from all client platforms.
2. Correlates events by user, device, platform, version, and session.
3. Surfaces insights in the admin dashboard for debugging and release health.
4. Can be turned on/off per segment (user, platform, region, version, etc.).
---
## 2. Goals & Non-Goals
### Goals
- **G1:** Unified event schema across iOS, Android, Desktop, Web.
- **G2:** Per-user, per-platform, per-version, per-region segment targeting for collection.
- **G3:** Admin UI with drill-down from cluster → user → session → event.
- **G4:** Privacy-first: no raw dictation text, no PII in payloads.
- **G5:** Low overhead: async batched sends, client-side sampling for noisy events.
- **G6:** Leverage existing infrastructure (platform-service, Cosmos DB, feature flags).
### Non-Goals
- Real-time streaming dashboards (v1 uses polling/refresh).
- Full APM / distributed tracing replacement (use Azure Monitor for that).
- Client-side crash reporting (use native crash reporters — Crashlytics, Sentry).
---
## 3. Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│ Client Platforms │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐ │
│ │ iOS App │ │ iOS Kbd │ │ Android │ │ Desktop │ │ Web Apps │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Client Telemetry SDK (per-platform thin layer) │ │
│ │ • Collects events → batches → POST /api/telemetry │ │
│ │ • Checks collection policy via feature flag poll │ │
│ │ • Samples debug events, never samples error/fatal │ │
│ └──────────────────────────────────┬───────────────────────┘ │
└─────────────────────────────────────┼───────────────────────────┘
│ HTTPS
┌──────────────────────────────────────────────────────────────────┐
│ platform-service (:4003) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ POST /api/telemetry/events (batch ingest) │ │
│ │ GET /api/telemetry/query (admin read) │ │
│ │ GET /api/telemetry/clusters (aggregated error view) │ │
│ │ GET /api/telemetry/config (collection policy) │ │
│ └──────────────────────────────────┬───────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────▼───────────────────────┐ │
│ │ Cosmos DB │ │
│ │ • telemetry_events (raw, TTL 3060d) │ │
│ │ • telemetry_error_clusters (derived, TTL 90180d) │ │
│ │ • telemetry_collection_policies (segment rules) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Existing modules used: │
│ • feature_flags — segment evaluation (FNV-1a hash) │
│ • auth — JWT validation for authenticated events │
│ • rate-limit — per-user/install throttling │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ admin-dashboard-web (:3001) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Ops → Client Logs │ │
│ │ • Live event stream (recent errors) │ │
│ │ • Error cluster view (top failures by platform/build) │ │
│ │ • User timeline (all events for one user) │ │
│ │ • Collection policy manager (segment targeting UI) │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
```
---
## 4. Telemetry Event Schema (Canonical)
Every client event MUST conform to this schema. Fields marked **REQUIRED** must always be present.
### 4.1 Identity Fields
| Field | Type | Required | Description |
| -------------------- | --------------- | ----------- | -------------------------------------------- |
| `id` | `string` (uuid) | REQUIRED | Unique event ID, generated client-side |
| `productId` | `string` | REQUIRED | Product identifier (e.g. `"lysnrai"`) |
| `userId` | `string?` | Conditional | Present when user is authenticated |
| `anonymousInstallId` | `string?` | Conditional | Stable per-install UUID when `userId` absent |
| `sessionId` | `string` | REQUIRED | App/keyboard session correlation ID |
| `requestId` | `string?` | Optional | Cross-service correlation (`x-request-id`) |
> **Rule:** At least one of `userId` or `anonymousInstallId` MUST be present.
### 4.2 Source Classification Fields
| Field | Type | Required | Description |
| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------- |
| `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` |
| `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` |
| `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` |
| `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` |
| `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` |
| `locale` | `string?` | Optional | BCP 47 locale, e.g. `"en-US"`, `"ta-IN"` |
| `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` |
| `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side |
| `regionCode` | `string?` | Optional | e.g. `"WA"`, `"TN"` — derived server-side from IP geo |
### 4.3 App Release Fields
| Field | Type | Required | Description |
| ---------------- | -------- | -------- | ---------------------------------------------------------------------------- |
| `appVersion` | `string` | REQUIRED | Semantic version: `CFBundleShortVersionString` / `versionName` / npm version |
| `buildNumber` | `string` | REQUIRED | `CFBundleVersion` / `versionCode` / web release commit hash |
| `releaseChannel` | `enum` | REQUIRED | `"dev"` \| `"beta"` \| `"prod"` |
### 4.4 Event Semantics Fields
| Field | Type | Required | Description |
| ----------- | --------- | -------- | ---------------------------------------------------------------------------------------------- |
| `eventType` | `enum` | REQUIRED | `"debug"` \| `"info"` \| `"warn"` \| `"error"` \| `"fatal"` |
| `module` | `string` | REQUIRED | Logical module: `"keyboard_dictation"`, `"auth"`, `"sync"`, `"settings"`, `"onboarding"` |
| `feature` | `string?` | Optional | Sub-feature: `"voice_typing"`, `"settings_deeplink"`, `"azure_recognition"` |
| `eventName` | `string` | REQUIRED | Snake_case event: `"mic_tapped"`, `"recognition_failed"`, `"insert_noop"`, `"session_started"` |
### 4.5 Error & Diagnostics Fields
| Field | Type | Required | Description |
| ------------- | --------- | -------- | -------------------------------------------------------------------- |
| `errorDomain` | `string?` | On error | iOS NSError domain, Android exception class, JS Error name |
| `errorCode` | `string?` | On error | Normalized string code |
| `message` | `string?` | On error | Sanitized, max 512 chars — NEVER raw user content |
| `stackTrace` | `string?` | Optional | Redacted/capped at 8KB — only for `fatal` events |
| `fingerprint` | `string?` | Optional | Client-side hash of `(module + eventName + errorCode + errorDomain)` |
### 4.6 Structured Metadata (Extensible)
| Field | Type | Required | Description |
| --------- | -------------------------- | -------- | ----------------------------------------------------------- |
| `tags` | `Record<string, string>?` | Optional | Small indexed key-value pairs (max 20 keys, 128 chars each) |
| `metrics` | `Record<string, number>?` | Optional | Numeric measurements: durations, counters, sizes |
| `context` | `Record<string, unknown>?` | Optional | Schema-validated safe object, max 4KB serialized |
### 4.7 Module-Specific: Keyboard Dictation
When `module = "keyboard_dictation"`, clients SHOULD include a structured `dictation` object:
| Field | Type | Description |
| ------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active |
| `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state |
| `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission |
| `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission |
| `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? |
| `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? |
| `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? |
| `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? |
| `dictation.transcriptLength` | `number` | Character count only — NEVER raw text |
| `dictation.sessionDurationMs` | `number` | Time from mic tap to stop |
| `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. `"com.apple.MobileSMS"`) |
### 4.8 Timing Fields
| Field | Type | Required | Description |
| ------------ | -------------------- | ---------- | ------------------------- |
| `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp |
| `receivedAt` | `string` (ISO 8601) | Server-set | Set by ingestion endpoint |
| `ttlAt` | `string?` (ISO 8601) | Server-set | Cosmos TTL expiry marker |
---
## 5. Segment-Based Collection Control
### 5.1 Motivation
Telemetry should not be a firehose. We need granular control to:
- **Debug a specific user:** Turn on verbose logging for user `usr_abc123` only.
- **Target a platform:** Collect keyboard dictation events only from iOS.
- **Target a region:** Enable collection for users in `US:WA` (Seattle area) or `IN:TN` (Chennai area).
- **Target a version:** Collect from users on build < 26 (old builds with known bug).
- **Target an OS:** Only Linux desktop users.
- **Global kill switch:** Disable all collection instantly.
### 5.2 Collection Policy Document Schema
Stored in Cosmos container `telemetry_collection_policies`:
```ts
interface TelemetryCollectionPolicy {
id: string; // uuid
productId: string; // REQUIRED
// Identity
name: string; // human-readable: "Debug iOS keyboard for user X"
description: string;
enabled: boolean; // master toggle
priority: number; // higher = evaluated first (for conflicts)
// What to collect
eventTypes: ('debug' | 'info' | 'warn' | 'error' | 'fatal')[];
modules: string[]; // empty = all modules
samplingRate: number; // 0.01.0 (1.0 = collect everything matching)
// Segment targeting rules (ALL conditions must match = AND logic)
targeting: {
// User targeting
userIds?: string[]; // specific user IDs
anonymousInstallIds?: string[]; // specific install IDs
// Platform targeting
platforms?: ('ios' | 'android' | 'web' | 'desktop')[];
channels?: (
| 'mobile_app'
| 'keyboard_extension'
| 'web_app'
| 'desktop_app'
| 'backend_service'
)[];
osFamilies?: ('ios' | 'android' | 'macos' | 'windows' | 'linux' | 'chromeos')[];
// Version targeting
appVersions?: string[]; // exact match list: ["1.0.0", "1.1.0"]
appVersionRange?: {
// semver range
min?: string; // inclusive
max?: string; // inclusive
};
buildNumbers?: string[]; // exact match list: ["25", "26"]
buildNumberRange?: {
min?: number; // inclusive
max?: number; // inclusive
};
// Region targeting (derived from client locale/timezone or server-side IP geo)
countryCodes?: string[]; // ISO 3166-1 alpha-2: ["US", "IN"]
regionCodes?: string[]; // sub-national: ["US:WA", "IN:TN", "IN:KA"]
// Release channel targeting
releaseChannels?: ('dev' | 'beta' | 'prod')[];
// Percentage rollout (uses existing FNV-1a hash from feature flags)
percentage?: number; // 0100, deterministic per userId/installId
};
// Lifecycle
startsAt?: string; // ISO — policy activates at this time
expiresAt?: string; // ISO — policy auto-deactivates
createdAt: string;
updatedAt: string;
createdBy: string; // admin userId who created it
}
```
### 5.3 Policy Evaluation Logic (Client-Side)
Clients poll `GET /api/telemetry/config` periodically (every 5 min or on app foreground). The server evaluates all active policies against the client's context and returns a **merged collection config**:
```ts
// Response from GET /api/telemetry/config?platform=ios&channel=keyboard_extension&...
interface TelemetryCollectionConfig {
enabled: boolean; // global kill switch
eventTypes: string[]; // which event types to collect
modules: string[]; // which modules (empty = all)
samplingRates: {
// per event type
debug: number; // 0.01.0
info: number;
warn: number;
error: number;
fatal: number;
};
batchSize: number; // max events per POST
flushIntervalMs: number; // how often to flush batch
maxQueueSize: number; // drop oldest if exceeded
}
```
### 5.4 Evaluation Rules
1. **Global default:** If no policies match, use a hardcoded default:
- Collect `warn`, `error`, `fatal` only
- Sample `warn` at 50%, `error`/`fatal` at 100%
- Flush every 60s, batch of 20
2. **Policy matching:** A policy matches if ALL non-null targeting conditions are met (AND logic).
3. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are unioned (if any policy enables `debug`, it's enabled).
4. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module:
```ts
hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage;
```
5. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response.
### 5.5 Example Policies
#### A) Debug one user's iOS keyboard
```json
{
"name": "Debug user sd9235 iOS keyboard",
"enabled": true,
"priority": 100,
"eventTypes": ["debug", "info", "warn", "error", "fatal"],
"modules": ["keyboard_dictation"],
"samplingRate": 1.0,
"targeting": {
"userIds": ["usr_sd9235"],
"platforms": ["ios"],
"channels": ["keyboard_extension"]
},
"expiresAt": "2026-02-20T00:00:00Z"
}
```
#### B) All iOS users on old builds
```json
{
"name": "Collect errors from iOS builds < 26",
"enabled": true,
"priority": 50,
"eventTypes": ["warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["ios"],
"buildNumberRange": { "min": 1, "max": 25 }
}
}
```
#### C) Seattle-area users only
```json
{
"name": "Seattle region telemetry",
"enabled": true,
"priority": 60,
"eventTypes": ["info", "warn", "error", "fatal"],
"modules": [],
"samplingRate": 0.5,
"targeting": {
"regionCodes": ["US:WA"]
}
}
```
#### D) Only Linux desktop
```json
{
"name": "Linux desktop diagnostics",
"enabled": true,
"priority": 50,
"eventTypes": ["warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["desktop"],
"osFamilies": ["linux"]
}
}
```
#### E) 10% of all web users (canary)
```json
{
"name": "Web telemetry canary rollout",
"enabled": true,
"priority": 30,
"eventTypes": ["warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["web"],
"percentage": 10
}
}
```
#### F) Chennai, India — mobile app only
```json
{
"name": "Chennai mobile diagnostics",
"enabled": true,
"priority": 60,
"eventTypes": ["info", "warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["ios", "android"],
"channels": ["mobile_app"],
"regionCodes": ["IN:TN"]
}
}
```
#### G) Global kill switch (disable all collection)
```json
{
"name": "GLOBAL OFF",
"enabled": true,
"priority": 999,
"eventTypes": [],
"modules": [],
"samplingRate": 0.0,
"targeting": {}
}
```
---
## 6. Ingestion API Contract
### 6.1 `POST /api/telemetry/events` — Batch Ingest
**Auth:** JWT (authenticated users) or API key header (anonymous installs).
**Request:**
```ts
// Zod schema
const TelemetryIngestRequest = z.object({
productId: z.string().min(1),
events: z.array(TelemetryEventSchema).min(1).max(50),
clientClockSkewMs: z.number().optional(),
});
```
**Response (200):**
```ts
interface TelemetryIngestResponse {
accepted: number;
rejected: number;
errors?: Array<{ index: number; reason: string }>;
serverTime: string;
}
```
**Rate limits:**
- Authenticated: 100 requests/min per userId
- Anonymous: 30 requests/min per anonymousInstallId
- Payload: max 256KB per request
**Validation rules:**
1. `productId` must match a known, active product.
2. Each event must have `id`, `productId`, `platform`, `channel`, `osFamily`, `appVersion`, `buildNumber`, `releaseChannel`, `eventType`, `module`, `eventName`, `sessionId`, `occurredAt`.
3. At least one of `userId` or `anonymousInstallId`.
4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys.
5. PII regex rejection: reject events containing patterns matching email, phone, credit card.
6. No raw dictation text allowed in any field.
### 6.2 `GET /api/telemetry/config` — Collection Config (Client Poll)
**Auth:** JWT or API key.
**Query params:**
| Param | Type | Description |
| ---------------- | ------- | ----------------------- |
| `platform` | string | Client platform |
| `channel` | string | Client channel |
| `osFamily` | string | OS family |
| `appVersion` | string | Current app version |
| `buildNumber` | string | Current build number |
| `releaseChannel` | string | dev/beta/prod |
| `countryCode` | string? | Client-reported country |
| `regionCode` | string? | Client-reported region |
**Response:** `TelemetryCollectionConfig` (see §5.3).
**Cache:** Client should cache this for 5 minutes. Server sets `Cache-Control: max-age=300`.
### 6.3 `GET /api/telemetry/query` — Admin Query (Read)
**Auth:** Admin JWT only.
**Query params:**
| Param | Type | Description |
| -------------------- | ------------ | --------------------------------- |
| `userId` | string? | Filter by user |
| `anonymousInstallId` | string? | Filter by install |
| `platform` | string? | Filter by platform |
| `channel` | string? | Filter by channel |
| `osFamily` | string? | Filter by OS family |
| `appVersion` | string? | Filter by version |
| `buildNumber` | string? | Filter by build |
| `module` | string? | Filter by module |
| `eventName` | string? | Filter by event name |
| `eventType` | string? | Filter by severity |
| `from` | string (ISO) | Start time |
| `to` | string (ISO) | End time |
| `limit` | number | Max results (default 50, max 200) |
| `continuationToken` | string? | Pagination |
**Response:**
```ts
interface TelemetryQueryResponse {
events: TelemetryEvent[];
total: number;
continuationToken?: string;
}
```
### 6.4 `GET /api/telemetry/clusters` — Error Clusters (Admin)
**Auth:** Admin JWT only.
**Query params:** Same filters as query, plus `groupBy` (default: `fingerprint`).
**Response:**
```ts
interface TelemetryClusterResponse {
clusters: TelemetryErrorCluster[];
total: number;
}
```
### 6.5 Collection Policy CRUD (Admin)
| Method | Path | Description |
| -------- | ----------------------------- | ----------------- |
| `GET` | `/api/telemetry/policies` | List all policies |
| `POST` | `/api/telemetry/policies` | Create policy |
| `PUT` | `/api/telemetry/policies/:id` | Update policy |
| `DELETE` | `/api/telemetry/policies/:id` | Delete policy |
---
## 7. Storage & Partitioning
### 7.1 Cosmos Containers
#### `telemetry_events` (raw events)
| Property | Value |
| ------------- | ------------------------------------------------------------ |
| Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` |
| TTL | 3060 days (configurable via env `TELEMETRY_EVENT_TTL_DAYS`) |
| RU budget | Start at 400 RU/s autoscale, monitor and adjust |
**Rationale:** Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load.
**Composite indexes:**
```json
[
{ "path": "/eventType", "order": "ascending" },
{ "path": "/occurredAt", "order": "descending" }
]
```
**Additional indexed paths:** `/userId`, `/anonymousInstallId`, `/module`, `/eventName`, `/appVersion`, `/buildNumber`, `/channel`, `/osFamily`.
#### `telemetry_error_clusters` (aggregated)
| Property | Value |
| ------------- | ----------------------------------------------------- |
| Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` |
| TTL | 90180 days |
| RU budget | 200 RU/s autoscale |
#### `telemetry_collection_policies` (segment rules)
| Property | Value |
| ------------- | ----------------------- |
| Partition key | `/productId` |
| TTL | None (manual lifecycle) |
| RU budget | Minimal (low volume) |
### 7.2 Container Registration
Add to `registerContainers()` call in platform-service `src/lib/cosmos.ts`:
```ts
registerContainers([
// ... existing containers ...
{ id: 'telemetry_events', partitionKeyPath: '/pk' },
{ id: 'telemetry_error_clusters', partitionKeyPath: '/pk' },
{ id: 'telemetry_collection_policies', partitionKeyPath: '/productId' },
]);
```
---
## 8. Error Clustering (Derived)
### 8.1 Fingerprint Generation
Client-side (optional) and server-side (authoritative):
```ts
function generateFingerprint(event: TelemetryEvent): string {
const input = [
event.platform,
event.channel,
event.module,
event.eventName,
event.errorDomain ?? '',
event.errorCode ?? '',
normalizeMessage(event.message ?? ''),
].join(':');
return sha256(input).substring(0, 16); // 16-char hex
}
function normalizeMessage(msg: string): string {
// Strip numbers, UUIDs, paths, timestamps
return msg
.replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<UUID>')
.replace(/\d+/g, '<N>')
.replace(/\/[\w/.]+/g, '<PATH>')
.toLowerCase()
.trim();
}
```
### 8.2 Cluster Document
```ts
interface TelemetryErrorCluster {
id: string; // fingerprint + time window key
pk: string; // ${productId}:${platform}:${module}
productId: string;
fingerprint: string;
// Dimensions
platform: string;
channel: string;
osFamily: string;
module: string;
eventName: string;
appVersion: string;
buildNumber: string;
// Aggregates
firstSeenAt: string;
lastSeenAt: string;
totalCount: number;
affectedUserIds: string[]; // capped at 100
affectedInstallIds: string[]; // capped at 100
// Representative sample
sampleErrorDomain?: string;
sampleErrorCode?: string;
sampleMessage?: string;
severity: 'warn' | 'error' | 'fatal';
}
```
### 8.3 Cluster Update Strategy
On each ingested `error`/`fatal` event:
1. Compute fingerprint.
2. Upsert cluster doc: increment `totalCount`, update `lastSeenAt`, append to `affectedUserIds` (dedup, cap at 100).
3. Run as a lightweight post-ingest step (same request, not a separate job — keeps it simple for v1).
---
## 9. Admin / DevOps UI
### 9.1 Page: `Ops → Client Logs`
Located at `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/page.tsx`.
#### Filter Bar
| Filter | Type | Default |
| ------------ | ---------------------------------------- | ------------ |
| User ID | text input | — |
| Platform | multi-select: ios, android, web, desktop | all |
| Channel | multi-select | all |
| OS Family | multi-select | all |
| App Version | text/select | — |
| Build Number | text/select | — |
| Module | select | all |
| Event Type | multi-select | error, fatal |
| Time Range | date range picker | last 24h |
#### Views
1. **Error Clusters (default):** Table of top clusters sorted by `totalCount` desc.
- Columns: fingerprint, module, eventName, platform, build, count, affected users, last seen.
- Click → drill into cluster detail (sample events, user list).
2. **Event Stream:** Chronological list of raw events matching filters.
- Columns: time, user, platform, channel, build, module, eventName, eventType, message.
- Click → full event detail (JSON + dictation struct if present).
3. **User Timeline:** Enter a userId → see all events chronologically.
- Useful for "what happened to user X's keyboard session."
### 9.2 Page: `Ops → Telemetry Policies`
Located at `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.tsx`.
- CRUD for collection policies.
- Visual segment builder (dropdowns for platform, OS, version range, region, etc.).
- Priority ordering (drag/drop or numeric).
- Enable/disable toggle per policy.
- "Preview" button: show how many matching users/installs (based on recent telemetry).
---
## 10. Client SDK Integration
### 10.1 iOS (Swift) — App + Keyboard Extension
```swift
// Shared via App Group (group.com.bytelyst.LysnrAI)
class LysnrTelemetry {
static let shared = LysnrTelemetry()
func track(
eventType: EventType,
module: String,
eventName: String,
message: String? = nil,
errorCode: String? = nil,
errorDomain: String? = nil,
dictation: DictationContext? = nil,
tags: [String: String]? = nil,
metrics: [String: Double]? = nil
)
func flush() // force-send queued events
func refreshConfig() // poll collection policy
}
```
- Keyboard extension: posts events to App Group shared `UserDefaults` queue; main app flushes.
- Alternatively, keyboard extension sends directly if Full Access is enabled (network available).
### 10.2 Android (Kotlin)
```kotlin
object LysnrTelemetry {
fun track(
eventType: EventType,
module: String,
eventName: String,
message: String? = null,
errorCode: String? = null,
dictation: DictationContext? = null,
)
fun flush()
fun refreshConfig()
}
```
### 10.3 Desktop (Python)
```python
from lysnrai.telemetry import telemetry
telemetry.track(
event_type="error",
module="speech_recognition",
event_name="azure_timeout",
message="Recognition timed out after 30s",
tags={"backend": "azure"},
metrics={"duration_ms": 30000},
)
```
### 10.4 Web (TypeScript)
```ts
import { telemetry } from '@/lib/telemetry';
telemetry.track({
eventType: 'error',
module: 'auth',
eventName: 'token_refresh_failed',
errorCode: '401',
message: 'JWT expired and refresh failed',
});
```
---
## 11. Privacy & Security
### 11.1 Hard Rules
1. **NEVER** send raw dictated/transcribed text in any field.
2. **NEVER** send passwords, tokens, API keys, or PII (email, phone, SSN).
3. `message` field: sanitized, max 512 chars, no user content.
4. `stackTrace`: redacted file paths, max 8KB, only on `fatal`.
5. Server-side PII regex scanner rejects events containing detected PII patterns.
6. `countryCode` / `regionCode`: derived from IP geo server-side (never GPS coordinates).
### 11.2 Data Retention
| Container | Default TTL | Configurable |
| ------------------------------- | ----------- | ---------------------------- |
| `telemetry_events` | 30 days | `TELEMETRY_EVENT_TTL_DAYS` |
| `telemetry_error_clusters` | 90 days | `TELEMETRY_CLUSTER_TTL_DAYS` |
| `telemetry_collection_policies` | No TTL | Manual delete / `expiresAt` |
### 11.3 Access Control
- **Ingest (`POST /api/telemetry/events`):** Any authenticated user or valid install API key.
- **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only.
- **Policy management:** Admin JWT only.
- **No public endpoints.** Telemetry data is internal/operational only.
### 11.4 Rate Limiting
| Client Type | Limit |
| ------------------ | ----------- |
| Authenticated user | 100 req/min |
| Anonymous install | 30 req/min |
| Admin query | 60 req/min |
---
## 12. Rollout Plan
### Phase 1 — MVP (12 weeks)
**Goal:** iOS keyboard dictation debugging visible in admin dashboard.
| Component | Scope |
| ---------------- | ----------------------------------------------------------------------------- |
| platform-service | `telemetry` module: `types.ts`, `repository.ts`, `routes.ts` (ingest + query) |
| platform-service | Collection policy CRUD + config endpoint |
| iOS keyboard | `LysnrTelemetry` client in KeyboardViewController — keyboard_dictation events |
| admin-dashboard | `Ops → Client Logs` page with basic event stream + filters |
| Cosmos | Register 3 containers |
**Delivers:** When a user reports "keyboard not typing," admin can look up their userId, see exact error flow, permissions state, backend choice, and insertion outcome.
### Phase 2 — Full Platform Coverage (23 weeks)
| Component | Scope |
| ---------------------- | ------------------------------------------------------- |
| iOS app | Telemetry for auth, settings, onboarding modules |
| Android app + keyboard | Full telemetry parity with iOS |
| Desktop (Python) | Telemetry for speech recognition, hotkey, paste modules |
| admin-dashboard | Error cluster view, user timeline view |
| platform-service | Cluster aggregation on ingest |
### Phase 3 — Advanced (34 weeks)
| Component | Scope |
| ---------------- | ---------------------------------------------------- |
| Web dashboards | Telemetry for auth, API errors, page load |
| admin-dashboard | Telemetry policy builder UI, version comparison view |
| platform-service | Alerting rules (error spike → Slack/email) |
| All clients | Region/geo enrichment server-side |
---
## 13. Open Questions
| # | Question | Status |
| --- | ----------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **Recommend:** Direct when Full Access on, App Group queue as fallback |
| 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears |
| 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 |
| 4 | Max retention for raw events? Compliance requirements? | Default 30 days, configurable |
| 5 | Do we need GDPR right-to-erasure support for telemetry? | Yes — add `DELETE /api/telemetry/user/:userId` endpoint |
---
## Appendix A: Env Vars
| Var | Default | Description |
| ----------------------------- | -------- | ------------------------------ |
| `TELEMETRY_ENABLED` | `true` | Global server-side kill switch |
| `TELEMETRY_EVENT_TTL_DAYS` | `30` | Raw event retention |
| `TELEMETRY_CLUSTER_TTL_DAYS` | `90` | Cluster retention |
| `TELEMETRY_MAX_BATCH_SIZE` | `50` | Max events per ingest request |
| `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body |
| `TELEMETRY_PII_SCAN_ENABLED` | `true` | Server-side PII rejection |
## Appendix B: Related Files
| File | Repo | Purpose |
| ----------------------------------------------------------------- | ----------- | -------------------------------------------- |
| `services/platform-service/src/modules/telemetry/` | common-plat | Telemetry module (types, repo, routes) |
| `services/platform-service/src/modules/flags/` | common-plat | Feature flags (reused for segment % rollout) |
| `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/` | lysnrai | Admin log viewer |
| `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/` | lysnrai | Policy manager UI |
| `mobile_app/ios/LysnrKeyboard/KeyboardViewController.swift` | lysnrai | iOS keyboard (first telemetry client) |
| `mobile_app/android/.../LysnrInputMethodService.kt` | lysnrai | Android keyboard (Phase 2) |
| `src/telemetry/` | lysnrai | Python desktop telemetry client (Phase 2) |