docs: add client telemetry & log insights detailed design
This commit is contained in:
parent
18a5b342d9
commit
c59049efec
945
docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md
Normal file
@ -0,0 +1,945 @@
# Client Telemetry & Log Insights — Detailed Design

> **Audience:** Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos.
> **Scope:** Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails.
> **Status:** Design — not yet implemented.
> **Last updated:** 2026-02-17

---

## Table of Contents

1. [Problem Statement](#1-problem-statement)
2. [Goals & Non-Goals](#2-goals--non-goals)
3. [Architecture Overview](#3-architecture-overview)
4. [Telemetry Event Schema (Canonical)](#4-telemetry-event-schema-canonical)
5. [Segment-Based Collection Control](#5-segment-based-collection-control)
6. [Ingestion API Contract](#6-ingestion-api-contract)
7. [Storage & Partitioning](#7-storage--partitioning)
8. [Error Clustering (Derived)](#8-error-clustering-derived)
9. [Admin / DevOps UI](#9-admin--devops-ui)
10. [Client SDK Integration](#10-client-sdk-integration)
11. [Privacy & Security](#11-privacy--security)
12. [Rollout Plan](#12-rollout-plan)
13. [Open Questions](#13-open-questions)

---

## 1. Problem Statement

When a user reports "keyboard voice dictation doesn't type into Messages on iPhone 17 Pro," we currently have **zero server-side visibility** into what happened on that device. We cannot see:

- Did recognition start? Which backend (Azure / local)?
- Did recognition produce results?
- Did `insertText` succeed or no-op?
- What error code/domain terminated the session?
- What app version / build / OS / permissions state was active?

We need a lightweight, always-on (but controllable) telemetry pipeline that:

1. Collects structured diagnostic events from all client platforms.
2. Correlates events by user, device, platform, version, and session.
3. Surfaces insights in the admin dashboard for debugging and release health.
4. Can be turned on/off per segment (user, platform, region, version, etc.).

---

## 2. Goals & Non-Goals

### Goals

- **G1:** Unified event schema across iOS, Android, Desktop, Web.
- **G2:** Per-user, per-platform, per-version, per-region segment targeting for collection.
- **G3:** Admin UI with drill-down from cluster → user → session → event.
- **G4:** Privacy-first: no raw dictation text, no PII in payloads.
- **G5:** Low overhead: async batched sends, client-side sampling for noisy events.
- **G6:** Leverage existing infrastructure (platform-service, Cosmos DB, feature flags).

### Non-Goals

- Real-time streaming dashboards (v1 uses polling/refresh).
- Full APM / distributed tracing replacement (use Azure Monitor for that).
- Client-side crash reporting (use native crash reporters — Crashlytics, Sentry).

---

## 3. Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                         Client Platforms                         │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐    │
│  │ iOS App │ │ iOS Kbd │ │ Android │ │ Desktop │ │ Web Apps │    │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘    │
│       │           │           │           │           │          │
│       ▼           ▼           ▼           ▼           ▼          │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Client Telemetry SDK (per-platform thin layer)          │    │
│  │  • Collects events → batches → POST /api/telemetry/events│    │
│  │  • Checks collection policy via feature flag poll        │    │
│  │  • Samples debug events, never samples error/fatal       │    │
│  └──────────────────────────────────┬───────────────────────┘    │
└─────────────────────────────────────┼────────────────────────────┘
                                      │ HTTPS
                                      ▼
┌──────────────────────────────────────────────────────────────────┐
│                     platform-service (:4003)                     │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  POST /api/telemetry/events    (batch ingest)            │    │
│  │  GET  /api/telemetry/query     (admin read)              │    │
│  │  GET  /api/telemetry/clusters  (aggregated error view)   │    │
│  │  GET  /api/telemetry/config    (collection policy)       │    │
│  └──────────────────────────────────┬───────────────────────┘    │
│                                     │                            │
│  ┌──────────────────────────────────▼───────────────────────┐    │
│  │  Cosmos DB                                               │    │
│  │  • telemetry_events              (raw, TTL 30–60d)       │    │
│  │  • telemetry_error_clusters      (derived, TTL 90–180d)  │    │
│  │  • telemetry_collection_policies (segment rules)         │    │
│  └──────────────────────────────────────────────────────────┘    │
│                                                                  │
│  Existing modules used:                                          │
│  • feature_flags — segment evaluation (FNV-1a hash)              │
│  • auth          — JWT validation for authenticated events       │
│  • rate-limit    — per-user/install throttling                   │
└──────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────┐
│                   admin-dashboard-web (:3001)                    │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Ops → Client Logs                                       │    │
│  │  • Live event stream (recent errors)                     │    │
│  │  • Error cluster view (top failures by platform/build)   │    │
│  │  • User timeline (all events for one user)               │    │
│  │  • Collection policy manager (segment targeting UI)      │    │
│  └──────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
```

---

## 4. Telemetry Event Schema (Canonical)

Every client event MUST conform to this schema. Fields marked **REQUIRED** must always be present.

### 4.1 Identity Fields

| Field | Type | Required | Description |
| -------------------- | --------------- | ----------- | -------------------------------------------- |
| `id` | `string` (uuid) | REQUIRED | Unique event ID, generated client-side |
| `productId` | `string` | REQUIRED | Product identifier (e.g. `"lysnrai"`) |
| `userId` | `string?` | Conditional | Present when user is authenticated |
| `anonymousInstallId` | `string?` | Conditional | Stable per-install UUID when `userId` absent |
| `sessionId` | `string` | REQUIRED | App/keyboard session correlation ID |
| `requestId` | `string?` | Optional | Cross-service correlation (`x-request-id`) |

> **Rule:** At least one of `userId` or `anonymousInstallId` MUST be present.

### 4.2 Source Classification Fields

| Field | Type | Required | Description |
| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------- |
| `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` |
| `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` |
| `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` |
| `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` |
| `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` |
| `locale` | `string?` | Optional | BCP 47 locale, e.g. `"en-US"`, `"ta-IN"` |
| `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` |
| `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side |
| `regionCode` | `string?` | Optional | e.g. `"WA"`, `"TN"` — derived server-side from IP geo |

### 4.3 App Release Fields

| Field | Type | Required | Description |
| ---------------- | -------- | -------- | ---------------------------------------------------------------------------- |
| `appVersion` | `string` | REQUIRED | Semantic version: `CFBundleShortVersionString` / `versionName` / npm version |
| `buildNumber` | `string` | REQUIRED | `CFBundleVersion` / `versionCode` / web release commit hash |
| `releaseChannel` | `enum` | REQUIRED | `"dev"` \| `"beta"` \| `"prod"` |

### 4.4 Event Semantics Fields

| Field | Type | Required | Description |
| ----------- | --------- | -------- | ---------------------------------------------------------------------------------------------- |
| `eventType` | `enum` | REQUIRED | `"debug"` \| `"info"` \| `"warn"` \| `"error"` \| `"fatal"` |
| `module` | `string` | REQUIRED | Logical module: `"keyboard_dictation"`, `"auth"`, `"sync"`, `"settings"`, `"onboarding"` |
| `feature` | `string?` | Optional | Sub-feature: `"voice_typing"`, `"settings_deeplink"`, `"azure_recognition"` |
| `eventName` | `string` | REQUIRED | Snake_case event: `"mic_tapped"`, `"recognition_failed"`, `"insert_noop"`, `"session_started"` |

### 4.5 Error & Diagnostics Fields

| Field | Type | Required | Description |
| ------------- | --------- | -------- | -------------------------------------------------------------------- |
| `errorDomain` | `string?` | On error | iOS NSError domain, Android exception class, JS Error name |
| `errorCode` | `string?` | On error | Normalized string code |
| `message` | `string?` | On error | Sanitized, max 512 chars — NEVER raw user content |
| `stackTrace` | `string?` | Optional | Redacted/capped at 8KB — only for `fatal` events |
| `fingerprint` | `string?` | Optional | Client-side hash computed as in §8.1; the server recomputes the authoritative value |

### 4.6 Structured Metadata (Extensible)

| Field | Type | Required | Description |
| --------- | -------------------------- | -------- | ----------------------------------------------------------- |
| `tags` | `Record<string, string>?` | Optional | Small indexed key-value pairs (max 20 keys, 128 chars each) |
| `metrics` | `Record<string, number>?` | Optional | Numeric measurements: durations, counters, sizes |
| `context` | `Record<string, unknown>?` | Optional | Schema-validated safe object, max 4KB serialized |

### 4.7 Module-Specific: Keyboard Dictation

When `module = "keyboard_dictation"`, clients SHOULD include a structured `dictation` object:

| Field | Type | Description |
| ------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active |
| `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state |
| `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission |
| `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission |
| `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? |
| `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? |
| `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? |
| `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? |
| `dictation.transcriptLength` | `number` | Character count only — NEVER raw text |
| `dictation.sessionDurationMs` | `number` | Time from mic tap to stop |
| `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. `"com.apple.MobileSMS"`) |

### 4.8 Timing Fields

| Field | Type | Required | Description |
| ------------ | -------------------- | ---------- | ------------------------- |
| `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp |
| `receivedAt` | `string` (ISO 8601) | Server-set | Set by ingestion endpoint |
| `ttlAt` | `string?` (ISO 8601) | Server-set | Cosmos TTL expiry marker |

---

## 5. Segment-Based Collection Control

### 5.1 Motivation

Telemetry should not be a firehose. We need granular control to:

- **Debug a specific user:** Turn on verbose logging for user `usr_abc123` only.
- **Target a platform:** Collect keyboard dictation events only from iOS.
- **Target a region:** Enable collection for users in `US:WA` (Seattle area) or `IN:TN` (Chennai area).
- **Target a version:** Collect from users on build < 26 (old builds with known bug).
- **Target an OS:** Only Linux desktop users.
- **Global kill switch:** Disable all collection instantly.

### 5.2 Collection Policy Document Schema

Stored in Cosmos container `telemetry_collection_policies`:

```ts
interface TelemetryCollectionPolicy {
  id: string; // uuid
  productId: string; // REQUIRED

  // Identity
  name: string; // human-readable: "Debug iOS keyboard for user X"
  description: string;
  enabled: boolean; // master toggle
  priority: number; // higher = evaluated first (for conflicts)

  // What to collect
  eventTypes: ('debug' | 'info' | 'warn' | 'error' | 'fatal')[];
  modules: string[]; // empty = all modules
  samplingRate: number; // 0.0–1.0 (1.0 = collect everything matching)

  // Segment targeting rules (ALL conditions must match = AND logic)
  targeting: {
    // User targeting
    userIds?: string[]; // specific user IDs
    anonymousInstallIds?: string[]; // specific install IDs

    // Platform targeting
    platforms?: ('ios' | 'android' | 'web' | 'desktop')[];
    channels?: (
      | 'mobile_app'
      | 'keyboard_extension'
      | 'web_app'
      | 'desktop_app'
      | 'backend_service'
    )[];
    osFamilies?: ('ios' | 'android' | 'macos' | 'windows' | 'linux' | 'chromeos')[];

    // Version targeting
    appVersions?: string[]; // exact match list: ["1.0.0", "1.1.0"]
    appVersionRange?: {
      // semver range
      min?: string; // inclusive
      max?: string; // inclusive
    };
    buildNumbers?: string[]; // exact match list: ["25", "26"]
    buildNumberRange?: {
      min?: number; // inclusive
      max?: number; // inclusive
    };

    // Region targeting (derived from client locale/timezone or server-side IP geo)
    countryCodes?: string[]; // ISO 3166-1 alpha-2: ["US", "IN"]
    regionCodes?: string[]; // sub-national: ["US:WA", "IN:TN", "IN:KA"]

    // Release channel targeting
    releaseChannels?: ('dev' | 'beta' | 'prod')[];

    // Percentage rollout (uses existing FNV-1a hash from feature flags)
    percentage?: number; // 0–100, deterministic per userId/installId
  };

  // Lifecycle
  startsAt?: string; // ISO — policy activates at this time
  expiresAt?: string; // ISO — policy auto-deactivates
  createdAt: string;
  updatedAt: string;
  createdBy: string; // admin userId who created it
}
```

### 5.3 Policy Evaluation Logic (Server-Side, Client-Polled)

Clients poll `GET /api/telemetry/config` periodically (every 5 min or on app foreground). The server evaluates all active policies against the client's context and returns a **merged collection config**, which the client then applies locally when deciding what to queue:

```ts
// Response from GET /api/telemetry/config?platform=ios&channel=keyboard_extension&...
interface TelemetryCollectionConfig {
  enabled: boolean; // global kill switch
  eventTypes: string[]; // which event types to collect
  modules: string[]; // which modules (empty = all)
  samplingRates: {
    // per event type
    debug: number; // 0.0–1.0
    info: number;
    warn: number;
    error: number;
    fatal: number;
  };
  batchSize: number; // max events per POST
  flushIntervalMs: number; // how often to flush batch
  maxQueueSize: number; // drop oldest if exceeded
}
```
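
A client could apply this config before queueing an event roughly as follows. This is a minimal sketch under stated assumptions, not the SDK implementation: `shouldCollect`, `CollectionConfigSubset`, and the injectable `random` parameter are hypothetical names, and the config type is trimmed to the fields the decision needs.

```ts
type EventType = 'debug' | 'info' | 'warn' | 'error' | 'fatal';

// Subset of TelemetryCollectionConfig used for the collect/drop decision.
interface CollectionConfigSubset {
  enabled: boolean;
  eventTypes: string[];
  modules: string[]; // empty = all modules
  samplingRates: Record<EventType, number>; // 0.0–1.0 per event type
}

// Hypothetical helper: returns true when the event should be queued.
// `random` is injectable so the decision can be tested deterministically.
function shouldCollect(
  config: CollectionConfigSubset,
  eventType: EventType,
  moduleName: string,
  random: () => number = Math.random,
): boolean {
  if (!config.enabled) return false; // global kill switch
  if (!config.eventTypes.includes(eventType)) return false;
  if (config.modules.length > 0 && !config.modules.includes(moduleName)) return false;
  return random() < config.samplingRates[eventType];
}

// Example config: warn sampled at 50%, error/fatal always collected.
const exampleConfig: CollectionConfigSubset = {
  enabled: true,
  eventTypes: ['warn', 'error', 'fatal'],
  modules: [],
  samplingRates: { debug: 0, info: 0, warn: 0.5, error: 1, fatal: 1 },
};
```

Because `random()` returns a value in `[0, 1)`, a rate of `1` always collects and a rate of `0` never does, which keeps `error`/`fatal` unsampled as the architecture diagram requires.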

### 5.4 Evaluation Rules

1. **Global default:** If no policies match, use a hardcoded default:
   - Collect `warn`, `error`, `fatal` only
   - Sample `warn` at 50%, `error`/`fatal` at 100%
   - Flush every 60s, batch of 20

2. **Policy matching:** A policy matches if ALL non-null targeting conditions are met (AND logic).

3. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are unioned (if any policy enables `debug`, it's enabled).

4. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module:

   ```ts
   hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage;
   ```

5. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response.
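
Rules 2 and 3 can be sketched as two small functions. This is an illustration under stated assumptions, not the platform-service implementation: `matchesPolicy`, `mergePolicies`, `ClientContext`, and the trimmed `Policy` type are hypothetical, only three targeting fields are covered, and the sketch does not settle how an empty `eventTypes` list (as in the kill-switch example G in §5.5) should interact with the union rule.

```ts
interface Targeting {
  userIds?: string[];
  platforms?: string[];
  buildNumberRange?: { min?: number; max?: number };
}

interface Policy {
  name: string;
  enabled: boolean;
  priority: number;
  eventTypes: string[];
  samplingRate: number;
  targeting: Targeting;
}

interface ClientContext {
  userId?: string;
  platform: string;
  buildNumber: string;
}

// Rule 2: every targeting condition that is present must match (AND logic).
function matchesPolicy(policy: Policy, ctx: ClientContext): boolean {
  if (!policy.enabled) return false;
  const t = policy.targeting;
  if (t.userIds && (ctx.userId === undefined || !t.userIds.includes(ctx.userId))) return false;
  if (t.platforms && !t.platforms.includes(ctx.platform)) return false;
  if (t.buildNumberRange) {
    const build = Number(ctx.buildNumber);
    // Assumption: a non-numeric build (e.g. a commit hash) never matches a range.
    if (Number.isNaN(build)) return false;
    const { min, max } = t.buildNumberRange;
    if (min !== undefined && build < min) return false;
    if (max !== undefined && build > max) return false;
  }
  return true;
}

// Rule 3: highest priority wins per field, except eventTypes, which are unioned.
function mergePolicies(matching: Policy[]): { eventTypes: string[]; samplingRate: number } | null {
  if (matching.length === 0) return null; // caller applies the global default (rule 1)
  const byPriority = matching.slice().sort((a, b) => b.priority - a.priority);
  const eventTypes: string[] = [];
  for (const policy of byPriority) {
    for (const eventType of policy.eventTypes) {
      if (!eventTypes.includes(eventType)) eventTypes.push(eventType);
    }
  }
  return { eventTypes, samplingRate: byPriority[0].samplingRate };
}

// Loosely based on examples A and B in §5.5 (B's sampling rate is changed
// here so that the effect of the priority rule is visible).
const debugOneUser: Policy = {
  name: 'Debug user sd9235 iOS keyboard', enabled: true, priority: 100,
  eventTypes: ['debug', 'info', 'warn', 'error', 'fatal'], samplingRate: 1.0,
  targeting: { userIds: ['usr_sd9235'], platforms: ['ios'] },
};
const oldIosBuilds: Policy = {
  name: 'Collect errors from iOS builds < 26', enabled: true, priority: 50,
  eventTypes: ['warn', 'error', 'fatal'], samplingRate: 0.5,
  targeting: { platforms: ['ios'], buildNumberRange: { min: 1, max: 25 } },
};
```

For a context of `{ userId: 'usr_sd9235', platform: 'ios', buildNumber: '25' }`, both policies match; the merged result takes `samplingRate` from the priority-100 policy and the union of both `eventTypes` lists.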

### 5.5 Example Policies

#### A) Debug one user's iOS keyboard

```json
{
  "name": "Debug user sd9235 iOS keyboard",
  "enabled": true,
  "priority": 100,
  "eventTypes": ["debug", "info", "warn", "error", "fatal"],
  "modules": ["keyboard_dictation"],
  "samplingRate": 1.0,
  "targeting": {
    "userIds": ["usr_sd9235"],
    "platforms": ["ios"],
    "channels": ["keyboard_extension"]
  },
  "expiresAt": "2026-02-20T00:00:00Z"
}
```

#### B) All iOS users on old builds

```json
{
  "name": "Collect errors from iOS builds < 26",
  "enabled": true,
  "priority": 50,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["ios"],
    "buildNumberRange": { "min": 1, "max": 25 }
  }
}
```

#### C) Seattle-area users only

```json
{
  "name": "Seattle region telemetry",
  "enabled": true,
  "priority": 60,
  "eventTypes": ["info", "warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 0.5,
  "targeting": {
    "regionCodes": ["US:WA"]
  }
}
```

#### D) Only Linux desktop

```json
{
  "name": "Linux desktop diagnostics",
  "enabled": true,
  "priority": 50,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["desktop"],
    "osFamilies": ["linux"]
  }
}
```

#### E) 10% of all web users (canary)

```json
{
  "name": "Web telemetry canary rollout",
  "enabled": true,
  "priority": 30,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["web"],
    "percentage": 10
  }
}
```
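
The `percentage` targeting in example E depends on a deterministic bucket per user or install. A minimal sketch of that idea follows, assuming a 32-bit FNV-1a hash over ASCII identifiers reduced to a 0–99 bucket. The names `fnv1a32`, `rolloutBucket`, and `inRollout`, and the `id:key` input format, are assumptions for illustration; the existing `hashUserFlag` in the feature flags module may combine its inputs differently.

```ts
// 32-bit FNV-1a. Hashes UTF-16 code units, which equals the byte-wise
// algorithm for ASCII input (assumed here for user and policy IDs).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, 32-bit wraparound
  }
  return hash >>> 0; // reinterpret as unsigned
}

// Hypothetical bucket function: stable value in [0, 99] per (id, policy) pair.
function rolloutBucket(id: string, policyId: string): number {
  return fnv1a32(`${id}:telemetry_policy_${policyId}`) % 100;
}

// A client is in the rollout when its bucket is below the percentage.
function inRollout(id: string, policyId: string, percentage: number): boolean {
  return rolloutBucket(id, policyId) < percentage;
}
```

Including the policy ID in the hashed string means the same user lands in different buckets for different policies, so a 10% canary for one policy does not always pick the same 10% of users as another.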

#### F) Chennai, India — mobile app only

```json
{
  "name": "Chennai mobile diagnostics",
  "enabled": true,
  "priority": 60,
  "eventTypes": ["info", "warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["ios", "android"],
    "channels": ["mobile_app"],
    "regionCodes": ["IN:TN"]
  }
}
```

#### G) Global kill switch (disable all collection)

```json
{
  "name": "GLOBAL OFF",
  "enabled": true,
  "priority": 999,
  "eventTypes": [],
  "modules": [],
  "samplingRate": 0.0,
  "targeting": {}
}
```

---

## 6. Ingestion API Contract

### 6.1 `POST /api/telemetry/events` — Batch Ingest

**Auth:** JWT (authenticated users) or API key header (anonymous installs).

**Request:**

```ts
// Zod schema
const TelemetryIngestRequest = z.object({
  productId: z.string().min(1),
  events: z.array(TelemetryEventSchema).min(1).max(50),
  clientClockSkewMs: z.number().optional(),
});
```

**Response (200):**

```ts
interface TelemetryIngestResponse {
  accepted: number;
  rejected: number;
  errors?: Array<{ index: number; reason: string }>;
  serverTime: string;
}
```

**Rate limits:**

- Authenticated: 100 requests/min per userId
- Anonymous: 30 requests/min per anonymousInstallId
- Payload: max 256KB per request

**Validation rules:**

1. `productId` must match a known, active product.
2. Each event must have `id`, `productId`, `platform`, `channel`, `osFamily`, `appVersion`, `buildNumber`, `releaseChannel`, `eventType`, `module`, `eventName`, `sessionId`, `occurredAt`.
3. At least one of `userId` or `anonymousInstallId`.
4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys.
5. PII regex rejection: reject events containing patterns matching email, phone, credit card.
6. No raw dictation text allowed in any field.
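
Validation rules 3 to 5 can be sketched as a single per-event check. This is an illustration under stated assumptions rather than the ingestion code: `validateEvent`, `IncomingEvent`, and `PII_PATTERNS` are hypothetical names, the regular expressions are deliberately simple and would need broader coverage in production, and the `stackTrace` cap is approximated in characters instead of bytes.

```ts
// Trimmed event shape: only the fields these rules inspect.
interface IncomingEvent {
  userId?: string;
  anonymousInstallId?: string;
  message?: string;
  stackTrace?: string;
  tags?: Record<string, string>;
}

// Illustrative patterns only (assumption): email-like, card-like, phone-like.
const PII_PATTERNS: RegExp[] = [
  /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i,
  /\b(?:\d[ -]?){13,16}\b/,
  /\b\d{3}[ .-]?\d{3}[ .-]?\d{4}\b/,
];

// Returns a rejection reason, or null when the event passes these checks.
function validateEvent(event: IncomingEvent): string | null {
  // Rule 3: at least one identity must be present.
  if (!event.userId && !event.anonymousInstallId) {
    return 'missing userId or anonymousInstallId';
  }
  // Rule 4: size caps.
  if ((event.message ?? '').length > 512) return 'message exceeds 512 chars';
  if ((event.stackTrace ?? '').length > 8 * 1024) return 'stackTrace exceeds 8KB';
  if (Object.keys(event.tags ?? {}).length > 20) return 'tags exceed 20 keys';
  // Rule 5: scan free-text fields for PII-like patterns.
  const text = [event.message ?? '', ...Object.values(event.tags ?? {})].join(' ');
  if (PII_PATTERNS.some((pattern) => pattern.test(text))) return 'possible PII detected';
  return null;
}
```

The returned reason string maps naturally onto the `errors[].reason` field of `TelemetryIngestResponse`, with `index` supplied by the caller iterating over the batch.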

### 6.2 `GET /api/telemetry/config` — Collection Config (Client Poll)

**Auth:** JWT or API key.

**Query params:**

| Param | Type | Description |
| ---------------- | ------- | ----------------------- |
| `platform` | string | Client platform |
| `channel` | string | Client channel |
| `osFamily` | string | OS family |
| `appVersion` | string | Current app version |
| `buildNumber` | string | Current build number |
| `releaseChannel` | string | dev/beta/prod |
| `countryCode` | string? | Client-reported country |
| `regionCode` | string? | Client-reported region |

**Response:** `TelemetryCollectionConfig` (see §5.3).

**Cache:** Client should cache this for 5 minutes. Server sets `Cache-Control: max-age=300`.

### 6.3 `GET /api/telemetry/query` — Admin Query (Read)

**Auth:** Admin JWT only.

**Query params:**

| Param | Type | Description |
| -------------------- | ------------ | --------------------------------- |
| `userId` | string? | Filter by user |
| `anonymousInstallId` | string? | Filter by install |
| `platform` | string? | Filter by platform |
| `channel` | string? | Filter by channel |
| `osFamily` | string? | Filter by OS family |
| `appVersion` | string? | Filter by version |
| `buildNumber` | string? | Filter by build |
| `module` | string? | Filter by module |
| `eventName` | string? | Filter by event name |
| `eventType` | string? | Filter by severity |
| `from` | string (ISO) | Start time |
| `to` | string (ISO) | End time |
| `limit` | number | Max results (default 50, max 200) |
| `continuationToken` | string? | Pagination |

**Response:**

```ts
interface TelemetryQueryResponse {
  events: TelemetryEvent[];
  total: number;
  continuationToken?: string;
}
```

### 6.4 `GET /api/telemetry/clusters` — Error Clusters (Admin)

**Auth:** Admin JWT only.

**Query params:** Same filters as query, plus `groupBy` (default: `fingerprint`).

**Response:**

```ts
interface TelemetryClusterResponse {
  clusters: TelemetryErrorCluster[];
  total: number;
}
```

### 6.5 Collection Policy CRUD (Admin)

| Method | Path | Description |
| -------- | ----------------------------- | ----------------- |
| `GET` | `/api/telemetry/policies` | List all policies |
| `POST` | `/api/telemetry/policies` | Create policy |
| `PUT` | `/api/telemetry/policies/:id` | Update policy |
| `DELETE` | `/api/telemetry/policies/:id` | Delete policy |

---

## 7. Storage & Partitioning

### 7.1 Cosmos Containers

#### `telemetry_events` (raw events)

| Property | Value |
| ------------- | ------------------------------------------------------------ |
| Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` |
| TTL | 30–60 days (configurable via env `TELEMETRY_EVENT_TTL_DAYS`) |
| RU budget | Start at 400 RU/s autoscale, monitor and adjust |

**Rationale:** Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load.
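
As a sketch of how the ingestion endpoint might derive these values: `buildPartitionKey` and `ttlSeconds` below are hypothetical helpers, and using the UTC month of the server receive time (rather than the client's `occurredAt`) is an assumption the design does not state. Cosmos DB expresses item-level expiry as a `ttl` property in seconds, so a day-based setting such as `TELEMETRY_EVENT_TTL_DAYS` has to be converted.

```ts
// Builds pk = `${productId}:${yyyyMM}:${platform}` from the UTC month.
function buildPartitionKey(productId: string, platform: string, receivedAt: Date): string {
  const yyyy = receivedAt.getUTCFullYear().toString();
  const mm = String(receivedAt.getUTCMonth() + 1).padStart(2, '0'); // months are 0-based
  return `${productId}:${yyyy}${mm}:${platform}`;
}

// Converts a TTL in days to the seconds value Cosmos DB expects in `ttl`.
function ttlSeconds(ttlDays: number): number {
  return ttlDays * 24 * 60 * 60;
}

// Example: an iOS event received on 2026-02-17 with a 30-day TTL.
const examplePk = buildPartitionKey('lysnrai', 'ios', new Date('2026-02-17T10:00:00Z'));
const exampleTtl = ttlSeconds(30);
```

Using UTC avoids the same event landing in different monthly partitions depending on the server's local timezone.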

**Composite indexes:**

```json
[
  { "path": "/eventType", "order": "ascending" },
  { "path": "/occurredAt", "order": "descending" }
]
```

**Additional indexed paths:** `/userId`, `/anonymousInstallId`, `/module`, `/eventName`, `/appVersion`, `/buildNumber`, `/channel`, `/osFamily`.

#### `telemetry_error_clusters` (aggregated)

| Property | Value |
| ------------- | ----------------------------------------------------- |
| Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` |
| TTL | 90–180 days |
| RU budget | 200 RU/s autoscale |

#### `telemetry_collection_policies` (segment rules)

| Property | Value |
| ------------- | ----------------------- |
| Partition key | `/productId` |
| TTL | None (manual lifecycle) |
| RU budget | Minimal (low volume) |

### 7.2 Container Registration

Add to `registerContainers()` call in platform-service `src/lib/cosmos.ts`:

```ts
registerContainers([
  // ... existing containers ...
  { id: 'telemetry_events', partitionKeyPath: '/pk' },
  { id: 'telemetry_error_clusters', partitionKeyPath: '/pk' },
  { id: 'telemetry_collection_policies', partitionKeyPath: '/productId' },
]);
```

---

## 8. Error Clustering (Derived)

### 8.1 Fingerprint Generation

Client-side (optional) and server-side (authoritative):

```ts
// Server-side (Node.js) shown; clients would use their platform's SHA-256.
import { createHash } from 'node:crypto';

function generateFingerprint(event: TelemetryEvent): string {
  const input = [
    event.platform,
    event.channel,
    event.module,
    event.eventName,
    event.errorDomain ?? '',
    event.errorCode ?? '',
    normalizeMessage(event.message ?? ''),
  ].join(':');
  // SHA-256 hex digest truncated to 16 chars
  return createHash('sha256').update(input).digest('hex').substring(0, 16);
}

function normalizeMessage(msg: string): string {
  // Replace UUIDs, paths, and numbers (including timestamps) with placeholders.
  // Paths are replaced before numbers so digits inside a path do not split it.
  return msg
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<UUID>')
    .replace(/\/[\w/.]+/g, '<PATH>')
    .replace(/\d+/g, '<N>')
    .toLowerCase()
    .trim();
}
```

### 8.2 Cluster Document

```ts
interface TelemetryErrorCluster {
  id: string; // fingerprint + time window key
  pk: string; // ${productId}:${platform}:${module}
  productId: string;
  fingerprint: string;

  // Dimensions
  platform: string;
  channel: string;
  osFamily: string;
  module: string;
  eventName: string;
  appVersion: string;
  buildNumber: string;

  // Aggregates
  firstSeenAt: string;
  lastSeenAt: string;
  totalCount: number;
  affectedUserIds: string[]; // capped at 100
  affectedInstallIds: string[]; // capped at 100

  // Representative sample
  sampleErrorDomain?: string;
  sampleErrorCode?: string;
  sampleMessage?: string;
  severity: 'warn' | 'error' | 'fatal';
}
```

### 8.3 Cluster Update Strategy

On each ingested `error`/`fatal` event:

1. Compute fingerprint.
2. Upsert cluster doc: increment `totalCount`, update `lastSeenAt`, append to `affectedUserIds` or `affectedInstallIds` (dedup, cap at 100 each).
3. Run as a lightweight post-ingest step (same request, not a separate job — keeps it simple for v1).
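
The aggregate update in step 2 can be sketched as a pure function over the previous cluster state. This is a minimal illustration, not the ingestion code: `applyEventToCluster` and the trimmed `ClusterAggregate` type are hypothetical, only `affectedUserIds` is handled, and the Cosmos read-modify-write (including how concurrent updates are reconciled) is outside the sketch.

```ts
// Trimmed cluster shape: only the fields the update touches.
interface ClusterAggregate {
  fingerprint: string;
  firstSeenAt: string;
  lastSeenAt: string;
  totalCount: number;
  affectedUserIds: string[];
}

const MAX_AFFECTED_IDS = 100;

// Returns the next cluster state after one ingested event.
function applyEventToCluster(
  existing: ClusterAggregate | undefined,
  fingerprint: string,
  occurredAt: string,
  userId?: string,
): ClusterAggregate {
  const base: ClusterAggregate = existing ?? {
    fingerprint,
    firstSeenAt: occurredAt,
    lastSeenAt: occurredAt,
    totalCount: 0,
    affectedUserIds: [],
  };
  const ids = base.affectedUserIds.slice();
  // Dedup and cap: new IDs are ignored once the list holds 100 entries.
  if (userId && !ids.includes(userId) && ids.length < MAX_AFFECTED_IDS) ids.push(userId);
  return {
    ...base,
    totalCount: base.totalCount + 1,
    // ISO 8601 UTC strings in the same format compare correctly as strings,
    // so a late-arriving older event does not move lastSeenAt backwards.
    lastSeenAt: occurredAt > base.lastSeenAt ? occurredAt : base.lastSeenAt,
    affectedUserIds: ids,
  };
}
```

Keeping the update pure makes the cap and dedup behaviour easy to unit test independently of the storage layer.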

---

## 9. Admin / DevOps UI

### 9.1 Page: `Ops → Client Logs`

Located at `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/page.tsx`.

#### Filter Bar

| Filter | Type | Default |
| ------------ | ---------------------------------------- | ------------ |
| User ID | text input | — |
| Platform | multi-select: ios, android, web, desktop | all |
| Channel | multi-select | all |
| OS Family | multi-select | all |
| App Version | text/select | — |
| Build Number | text/select | — |
| Module | select | all |
| Event Type | multi-select | error, fatal |
| Time Range | date range picker | last 24h |

#### Views

1. **Error Clusters (default):** Table of top clusters sorted by `totalCount` desc.
   - Columns: fingerprint, module, eventName, platform, build, count, affected users, last seen.
   - Click → drill into cluster detail (sample events, user list).

2. **Event Stream:** Chronological list of raw events matching filters.
   - Columns: time, user, platform, channel, build, module, eventName, eventType, message.
   - Click → full event detail (JSON + dictation struct if present).

3. **User Timeline:** Enter a userId → see all events chronologically.
   - Useful for "what happened to user X's keyboard session."

### 9.2 Page: `Ops → Telemetry Policies`

Located at `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.tsx`.

- CRUD for collection policies.
- Visual segment builder (dropdowns for platform, OS, version range, region, etc.).
- Priority ordering (drag/drop or numeric).
- Enable/disable toggle per policy.
- "Preview" button: show how many matching users/installs (based on recent telemetry).

---
|
||||
|
||||
## 10. Client SDK Integration
|
||||
|
||||
### 10.1 iOS (Swift) — App + Keyboard Extension
|
||||
|
||||
```swift
|
||||
// Shared via App Group (group.com.bytelyst.LysnrAI)
|
||||
class LysnrTelemetry {
|
||||
static let shared = LysnrTelemetry()
|
||||
|
||||
func track(
|
||||
eventType: EventType,
|
||||
module: String,
|
||||
eventName: String,
|
||||
message: String? = nil,
|
||||
errorCode: String? = nil,
|
||||
errorDomain: String? = nil,
|
||||
dictation: DictationContext? = nil,
|
||||
tags: [String: String]? = nil,
|
||||
metrics: [String: Double]? = nil
|
||||
)
|
||||
|
||||
func flush() // force-send queued events
|
||||
func refreshConfig() // poll collection policy
|
||||
}
|
||||
```
|
||||
|
||||
- Keyboard extension: posts events to App Group shared `UserDefaults` queue; main app flushes.
|
||||
- Alternatively, keyboard extension sends directly if Full Access is enabled (network available).
|
||||
|
||||
### 10.2 Android (Kotlin)

```kotlin
// API surface only: method bodies are omitted in this design sketch.
object LysnrTelemetry {
    fun track(
        eventType: EventType,
        module: String,
        eventName: String,
        message: String? = null,
        errorCode: String? = null,
        dictation: DictationContext? = null,
    )

    fun flush()
    fun refreshConfig()
}
```

### 10.3 Desktop (Python)

```python
from lysnrai.telemetry import telemetry

telemetry.track(
    event_type="error",
    module="speech_recognition",
    event_name="azure_timeout",
    message="Recognition timed out after 30s",
    tags={"backend": "azure"},
    metrics={"duration_ms": 30000},
)
```

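
What might sit behind `track()` and `flush()` is sketched below, under stated assumptions: `TelemetryClient`, the `send_batch` callback (a stand-in for the HTTP POST to the ingest endpoint), and the in-memory queue size are hypothetical, and the batch size of 50 simply mirrors the `TELEMETRY_MAX_BATCH_SIZE` default. It is an illustration of queue-then-flush batching, not the planned client.

```python
import threading
from collections import deque
from typing import Callable

MAX_BATCH_SIZE = 50  # mirrors the TELEMETRY_MAX_BATCH_SIZE default


class TelemetryClient:
    """Hypothetical buffering client; `send_batch` stands in for the HTTP POST."""

    def __init__(self, send_batch: Callable[[list], bool], max_queue: int = 1000):
        self._send_batch = send_batch
        self._queue = deque(maxlen=max_queue)  # append() drops the oldest event when full
        self._lock = threading.Lock()

    def track(self, event_type, module, event_name, message=None, tags=None, metrics=None):
        event = {
            "eventType": event_type,
            "module": module,
            "eventName": event_name,
            "message": message,
            "tags": tags or {},
            "metrics": metrics or {},
        }
        with self._lock:
            self._queue.append(event)

    def flush(self) -> int:
        """Send queued events in batches; return how many were delivered."""
        sent = 0
        while True:
            with self._lock:
                size = min(MAX_BATCH_SIZE, len(self._queue))
                batch = [self._queue.popleft() for _ in range(size)]
            if not batch:
                return sent
            if not self._send_batch(batch):
                # Put the failed batch back at the front so ordering is kept,
                # then stop; a later flush() can retry.
                with self._lock:
                    self._queue.extendleft(reversed(batch))
                return sent
            sent += len(batch)
```

Stopping at the first failed batch keeps a flaky network from turning one `flush()` call into a retry loop; the caller decides when to try again.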
### 10.4 Web (TypeScript)

```ts
import { telemetry } from '@/lib/telemetry';

telemetry.track({
  eventType: 'error',
  module: 'auth',
  eventName: 'token_refresh_failed',
  errorCode: '401',
  message: 'JWT expired and refresh failed',
});
```

---

## 11. Privacy & Security

### 11.1 Hard Rules

1. **NEVER** send raw dictated/transcribed text in any field.
2. **NEVER** send passwords, tokens, API keys, or PII (email, phone, SSN).
3. `message` field: sanitized, max 512 chars, no user content.
4. `stackTrace`: file paths redacted, max 8 KB, sent only on `fatal` events.
5. Server-side PII regex scanner rejects events containing detected PII patterns.
6. `countryCode` / `regionCode`: derived from IP geo server-side (never GPS coordinates).

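
Rules 3 to 5 could be enforced along these lines. This is a sketch under stated assumptions rather than the actual scanner: the regex patterns are illustrative only (a real scanner would need a vetted, broader set and will produce false positives on long digit runs), and `sanitize_event`, `redact_paths`, and `contains_pii` are hypothetical names. It is written in Python for brevity even though the server module itself is listed as TypeScript.

```python
import re

MAX_MESSAGE_CHARS = 512
MAX_STACK_TRACE_BYTES = 8 * 1024

# Illustrative patterns only; not an exhaustive PII definition.
PII_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email address
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN format
    re.compile(r"\+?\d[\d\s().-]{8,}\d"),  # phone-like digit run
]
# Path-like strings (Unix or Windows); the final component is captured.
PATH_PATTERN = re.compile(r"(?:[A-Za-z]:\\|/)(?:[\w.-]+[\\/])+([\w.-]+)")


def contains_pii(text: str) -> bool:
    return any(pattern.search(text) for pattern in PII_PATTERNS)


def redact_paths(stack_trace: str) -> str:
    """Keep only the file name of each path-like string."""
    return PATH_PATTERN.sub(r"<redacted>/\1", stack_trace)


def sanitize_event(event: dict) -> dict:
    """Return a cleaned copy, or raise ValueError if PII is detected."""
    clean = dict(event)
    if clean.get("message"):
        clean["message"] = clean["message"][:MAX_MESSAGE_CHARS]
    if clean.get("eventType") == "fatal" and clean.get("stackTrace"):
        redacted = redact_paths(clean["stackTrace"]).encode("utf-8")
        # Truncate by bytes; a partially cut trailing character is dropped.
        clean["stackTrace"] = redacted[:MAX_STACK_TRACE_BYTES].decode("utf-8", errors="ignore")
    else:
        clean.pop("stackTrace", None)  # rule 4: stack traces only on fatal
    for field in ("message", "stackTrace", "errorCode"):
        if contains_pii(str(clean.get(field) or "")):
            raise ValueError(f"PII detected in {field}")
    return clean
```

Rejecting the whole event, rather than silently masking the match, follows rule 5 as written and makes a leaking client visible quickly.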
### 11.2 Data Retention

| Container                       | Default TTL | Configurable                 |
| ------------------------------- | ----------- | ---------------------------- |
| `telemetry_events`              | 30 days     | `TELEMETRY_EVENT_TTL_DAYS`   |
| `telemetry_error_clusters`      | 90 days     | `TELEMETRY_CLUSTER_TTL_DAYS` |
| `telemetry_collection_policies` | No TTL      | Manual delete / `expiresAt`  |

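
Because Cosmos DB expresses TTL in seconds while the env vars above are in days, some conversion step is needed when the containers are registered. A minimal sketch, assuming hypothetical helper names (`ttl_seconds`, `container_ttls`) and that a missing env var falls back to the table's default:

```python
import os

SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(env_var: str, default_days: int, env=os.environ) -> int:
    """Read a retention period in days and convert it to seconds."""
    days = int(env.get(env_var, default_days))
    if days <= 0:
        raise ValueError(f"{env_var} must be a positive number of days")
    return days * SECONDS_PER_DAY


def container_ttls(env=os.environ) -> dict:
    return {
        "telemetry_events": ttl_seconds("TELEMETRY_EVENT_TTL_DAYS", 30, env),
        "telemetry_error_clusters": ttl_seconds("TELEMETRY_CLUSTER_TTL_DAYS", 90, env),
        # Policies have no TTL; they are removed manually or via `expiresAt`.
        "telemetry_collection_policies": None,
    }
```

With the defaults this gives 2,592,000 seconds for raw events and 7,776,000 seconds for clusters.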
### 11.3 Access Control

- **Ingest (`POST /api/telemetry/events`):** Any authenticated user or valid install API key.
- **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only.
- **Policy management:** Admin JWT only.
- **No public endpoints.** Telemetry data is internal/operational only.

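
The rules above amount to a small deny-by-default route table. The sketch below illustrates that idea only: the principal kinds, the `/api/telemetry/policies` prefix, the expansion of `/clusters` to `/api/telemetry/clusters`, and the choice to let admins ingest are all assumptions, and the real service would enforce this in its own auth middleware.

```python
from typing import Optional

# Hypothetical principal kinds; the real auth layer may model these differently.
ADMIN, USER, INSTALL = "admin", "user", "install"

# (method, path prefix, principal kinds allowed to call it)
ROUTE_RULES = [
    ("POST", "/api/telemetry/events", {ADMIN, USER, INSTALL}),
    ("GET", "/api/telemetry/query", {ADMIN}),
    ("GET", "/api/telemetry/clusters", {ADMIN}),
    ("*", "/api/telemetry/policies", {ADMIN}),  # assumed path for policy CRUD
]


def is_allowed(method: str, path: str, principal: Optional[str]) -> bool:
    """Deny by default: unauthenticated or unlisted requests are rejected."""
    if principal is None:
        return False
    for rule_method, prefix, allowed in ROUTE_RULES:
        if rule_method in ("*", method) and path.startswith(prefix):
            return principal in allowed
    return False
```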
### 11.4 Rate Limiting

| Client Type        | Limit       |
| ------------------ | ----------- |
| Authenticated user | 100 req/min |
| Anonymous install  | 30 req/min  |
| Admin query        | 60 req/min  |

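
One simple way to apply these limits is a fixed-window counter keyed by client. The design does not specify the algorithm, so this is only an illustration: `FixedWindowLimiter` and the client-type keys are hypothetical, state lives in process memory, and old windows are never evicted, which a real deployment would need to address (for example with a shared store and expiry).

```python
import time
from collections import defaultdict

LIMITS_PER_MINUTE = {
    "authenticated_user": 100,
    "anonymous_install": 30,
    "admin_query": 60,
}
WINDOW_SECONDS = 60


class FixedWindowLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # (client type, client id, window index) -> requests seen in that window
        self._counts = defaultdict(int)

    def allow(self, client_type: str, client_id: str) -> bool:
        """Return True and count the request if the client is under its limit."""
        window = int(self._clock() // WINDOW_SECONDS)
        key = (client_type, client_id, window)
        if self._counts[key] >= LIMITS_PER_MINUTE[client_type]:
            return False
        self._counts[key] += 1
        return True
```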
---

## 12. Rollout Plan

### Phase 1 — MVP (1–2 weeks)

**Goal:** iOS keyboard dictation debugging visible in admin dashboard.

| Component        | Scope                                                                         |
| ---------------- | ----------------------------------------------------------------------------- |
| platform-service | `telemetry` module: `types.ts`, `repository.ts`, `routes.ts` (ingest + query) |
| platform-service | Collection policy CRUD + config endpoint                                      |
| iOS keyboard     | `LysnrTelemetry` client in KeyboardViewController — keyboard_dictation events |
| admin-dashboard  | `Ops → Client Logs` page with basic event stream + filters                    |
| Cosmos           | Register 3 containers                                                         |

**Delivers:** When a user reports "keyboard not typing," admin can look up their userId, see exact error flow, permissions state, backend choice, and insertion outcome.

### Phase 2 — Full Platform Coverage (2–3 weeks)

| Component              | Scope                                                   |
| ---------------------- | ------------------------------------------------------- |
| iOS app                | Telemetry for auth, settings, onboarding modules        |
| Android app + keyboard | Full telemetry parity with iOS                          |
| Desktop (Python)       | Telemetry for speech recognition, hotkey, paste modules |
| admin-dashboard        | Error cluster view, user timeline view                  |
| platform-service       | Cluster aggregation on ingest                           |

### Phase 3 — Advanced (3–4 weeks)

| Component        | Scope                                                |
| ---------------- | ---------------------------------------------------- |
| Web dashboards   | Telemetry for auth, API errors, page load            |
| admin-dashboard  | Telemetry policy builder UI, version comparison view |
| platform-service | Alerting rules (error spike → Slack/email)           |
| All clients      | Region/geo enrichment server-side                    |

---

## 13. Open Questions

| #   | Question                                                                                                                      | Status                                                                                        |
| --- | ----------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| 1   | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **Recommend:** Direct when Full Access on, App Group queue as fallback                        |
| 2   | Do we need a separate Cosmos database for telemetry to isolate RU costs?                                                      | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears |
| 3   | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting?                                   | Defer to Phase 3                                                                              |
| 4   | Max retention for raw events? Compliance requirements?                                                                        | Default 30 days, configurable                                                                 |
| 5   | Do we need GDPR right-to-erasure support for telemetry?                                                                       | Yes — add `DELETE /api/telemetry/user/:userId` endpoint                                       |

---

## Appendix A: Env Vars

| Var                           | Default  | Description                    |
| ----------------------------- | -------- | ------------------------------ |
| `TELEMETRY_ENABLED`           | `true`   | Global server-side kill switch |
| `TELEMETRY_EVENT_TTL_DAYS`    | `30`     | Raw event retention            |
| `TELEMETRY_CLUSTER_TTL_DAYS`  | `90`     | Cluster retention              |
| `TELEMETRY_MAX_BATCH_SIZE`    | `50`     | Max events per ingest request  |
| `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body         |
| `TELEMETRY_PII_SCAN_ENABLED`  | `true`   | Server-side PII rejection      |

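
To show how the kill switch and the batch and payload limits might interact at ingest time, here is a hedged sketch. The `{"events": [...]}` request shape, the function names, and the HTTP status codes are assumptions made for illustration; the ingestion contract in section 6 is authoritative.

```python
import json
import os


def load_limits(env=os.environ) -> dict:
    """Read ingest limits from the environment, using the documented defaults."""
    return {
        "enabled": env.get("TELEMETRY_ENABLED", "true").lower() == "true",
        "max_batch_size": int(env.get("TELEMETRY_MAX_BATCH_SIZE", 50)),
        "max_payload_bytes": int(env.get("TELEMETRY_MAX_PAYLOAD_BYTES", 262144)),
    }


def check_ingest_request(body: bytes, limits: dict) -> tuple:
    """Return a (status code, reason) pair for a raw ingest request body."""
    if not limits["enabled"]:
        return 503, "telemetry disabled"
    # Check the size before parsing so oversized bodies are never decoded.
    if len(body) > limits["max_payload_bytes"]:
        return 413, "payload too large"
    try:
        events = json.loads(body).get("events", [])
    except (ValueError, AttributeError):
        return 400, "body must be a JSON object"
    if not isinstance(events, list) or not events:
        return 400, "events must be a non-empty list"
    if len(events) > limits["max_batch_size"]:
        return 400, "too many events in batch"
    return 202, "accepted"
```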
## Appendix B: Related Files

| File                                                              | Repo        | Purpose                                      |
| ----------------------------------------------------------------- | ----------- | -------------------------------------------- |
| `services/platform-service/src/modules/telemetry/`                | common-plat | Telemetry module (types, repo, routes)       |
| `services/platform-service/src/modules/flags/`                    | common-plat | Feature flags (reused for segment % rollout) |
| `admin-dashboard-web/src/app/(dashboard)/ops/client-logs/`        | lysnrai     | Admin log viewer                             |
| `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/` | lysnrai     | Policy manager UI                            |
| `mobile_app/ios/LysnrKeyboard/KeyboardViewController.swift`       | lysnrai     | iOS keyboard (first telemetry client)        |
| `mobile_app/android/.../LysnrInputMethodService.kt`               | lysnrai     | Android keyboard (Phase 2)                   |
| `src/telemetry/`                                                  | lysnrai     | Python desktop telemetry client (Phase 2)    |