53 KiB
Client Telemetry & Log Insights — Detailed Design
Audience: Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos. Scope: Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails. Status: Design — implementing today, keyboard-first. Last updated: 2026-02-17 (rev 2 — 18 gaps fixed)
Table of Contents
- Problem Statement
- Goals & Non-Goals
- Architecture Overview
- Telemetry Event Schema (Canonical)
- Segment-Based Collection Control
- Ingestion API Contract
- Storage & Partitioning
- Error Clustering (Derived)
- Admin / DevOps UI
- Client SDK Integration
- Privacy & Security
- Rollout Plan
- Open Questions
1. Problem Statement
When a user reports "keyboard voice dictation doesn't type into Messages on iPhone 17 Pro," we currently have zero server-side visibility into what happened on that device. We cannot see:
- Did recognition start? Which backend (Azure / local)?
- Did recognition produce results?
- Did
insertTextsucceed or no-op? - What error code/domain terminated the session?
- What app version / build / OS / permissions state was active?
We need a lightweight, always-on (but controllable) telemetry pipeline that:
- Collects structured diagnostic events from all client platforms.
- Correlates events by user, device, platform, version, and session.
- Surfaces insights in the admin dashboard for debugging and release health.
- Can be turned on/off per segment (user, platform, region, version, etc.).
2. Goals & Non-Goals
Goals
- G1: Unified event schema across iOS, Android, Desktop, Web.
- G2: Per-user, per-platform, per-version, per-region segment targeting for collection.
- G3: Admin UI with drill-down from cluster → user → session → event.
- G4: Privacy-first: no raw dictation text, no PII in payloads.
- G5: Low overhead: async batched sends, client-side sampling for noisy events.
- G6: Leverage existing infrastructure (platform-service, Cosmos DB, feature flags).
Non-Goals
- Real-time streaming dashboards (v1 uses polling/refresh).
- Full APM / distributed tracing replacement (use Azure Monitor for that).
- Client-side crash reporting (use native crash reporters — Crashlytics, Sentry).
3. Architecture Overview
┌──────────────────────────────────────────────────────────────────┐
│ Client Platforms │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐ │
│ │ iOS App │ │ iOS Kbd │ │ Android │ │ Desktop │ │ Web Apps │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Client Telemetry SDK (per-platform thin layer) │ │
│ │ • Collects events → batches → POST /api/telemetry │ │
│ │ • Checks collection policy via feature flag poll │ │
│ │ • Samples debug events, never samples error/fatal │ │
│ └──────────────────────────────────┬───────────────────────┘ │
└─────────────────────────────────────┼───────────────────────────┘
│ HTTPS
▼
┌──────────────────────────────────────────────────────────────────┐
│ platform-service (:4003) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ POST /api/telemetry/events (batch ingest) │ │
│ │ GET /api/telemetry/query (admin read) │ │
│ │ GET /api/telemetry/clusters (aggregated error view) │ │
│ │ GET /api/telemetry/config (collection policy) │ │
│ └──────────────────────────────────┬───────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────▼───────────────────────┐ │
│ │ Cosmos DB │ │
│ │ • telemetry_events (raw, TTL 30–60d) │ │
│ │ • telemetry_error_clusters (derived, TTL 90–180d) │ │
│ │ • telemetry_collection_policies (segment rules) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Existing modules used: │
│ • feature_flags — segment evaluation (FNV-1a hash) │
│ • auth — JWT validation for authenticated events │
│ • rate-limit — per-user/install throttling │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ admin-dashboard-web (:3001) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Ops → Client Logs │ │
│ │ • Live event stream (recent errors) │ │
│ │ • Error cluster view (top failures by platform/build) │ │
│ │ • User timeline (all events for one user) │ │
│ │ • Collection policy manager (segment targeting UI) │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
4. Telemetry Event Schema (Canonical)
Every client event MUST conform to this schema. Fields marked REQUIRED must always be present.
4.1 Identity Fields
| Field | Type | Required | Description |
|---|---|---|---|
id |
string (uuid) |
REQUIRED | Unique event ID, generated client-side |
productId |
string |
REQUIRED | Product identifier (e.g. "lysnrai") |
userId |
string? |
Conditional | Present when user is authenticated |
anonymousInstallId |
string? |
Conditional | Stable per-install UUID when userId absent |
sessionId |
string |
REQUIRED | App/keyboard session correlation ID |
requestId |
string? |
Optional | Cross-service correlation (x-request-id) |
Rule: At least one of
userIdoranonymousInstallIdMUST be present.
4.1.1 anonymousInstallId Generation Strategy
Each platform generates a stable UUID on first launch and persists it:
| Platform | Storage | Key |
|---|---|---|
| iOS app | Keychain (kSecAttrAccessibleAfterFirstUnlock) | com.bytelyst.LysnrAI.installId |
| iOS keyboard | App Group UserDefaults (group.com.bytelyst.LysnrAI) |
telemetry_install_id |
| Android | EncryptedSharedPreferences | telemetry_install_id |
| Desktop | ~/.lysnrai/telemetry_install_id (plain file) |
— |
| Web | localStorage |
lysnrai_install_id |
iOS keyboard note: The keyboard extension shares the install ID via App Group so main app and extension use the same identity.
4.1.2 Authentication for Telemetry Ingest
Not all clients have a JWT (e.g., keyboard extension before user logs in). The ingest endpoint accepts two auth modes:
| Mode | Header | When Used |
|---|---|---|
| JWT | Authorization: Bearer <token> |
Authenticated users (main app, web, desktop after login) |
| Install Token | X-Install-Token: <anonymousInstallId> |
Unauthenticated clients (keyboard extension, pre-login) |
Install token validation: The server accepts any well-formed UUID in X-Install-Token. It does NOT verify against a registry (install IDs are self-issued). Rate limiting is applied per install ID to prevent abuse.
Keyboard extension specifics:
- With Full Access ON: sends events directly via HTTPS using
X-Install-Token. - With Full Access OFF: queues events to App Group UserDefaults (max 200 events, ~100KB). Main app flushes on next foreground.
- If queue is full, oldest events are dropped (FIFO eviction).
- Memory constraint: iOS keyboard extensions are limited to ~30MB. Telemetry queue MUST stay under 100KB. Events are serialized as compact JSON (no pretty-print).
4.2 Source Classification Fields
| Field | Type | Required | Description |
|---|---|---|---|
platform |
enum |
REQUIRED | "ios" | "android" | "web" | "desktop" |
channel |
enum |
REQUIRED | "mobile_app" | "keyboard_extension" | "web_app" | "desktop_app" | "backend_service" |
osFamily |
enum |
REQUIRED | "ios" | "android" | "macos" | "windows" | "linux" | "chromeos" | "other" |
osVersion |
string? |
Recommended | e.g. "iOS 18.2", "Windows 11 24H2", "macOS 15.3", "Ubuntu 24.04" |
deviceModel |
string? |
Optional | e.g. "iPhone17,3", "Pixel 9", "MacBookPro18,3" |
locale |
string? |
Optional | BCP 47 locale, e.g. "en-US", "ta-IN" |
timezone |
string? |
Optional | IANA timezone, e.g. "America/Los_Angeles", "Asia/Kolkata" |
countryCode |
string? |
Optional | ISO 3166-1 alpha-2, e.g. "US", "IN" — derived from locale or IP server-side |
regionCode |
string? |
Optional | Prefixed format: "US:WA", "IN:TN" — derived server-side from IP geo. Always {country}:{region} to avoid ambiguity (TN = Tennessee or Tamil Nadu) |
4.3 App Release Fields
| Field | Type | Required | Description |
|---|---|---|---|
appVersion |
string |
REQUIRED | Semantic version: CFBundleShortVersionString / versionName / npm version |
buildNumber |
string |
REQUIRED | CFBundleVersion / versionCode / web release commit hash |
releaseChannel |
enum |
REQUIRED | "dev" | "beta" | "prod" |
4.4 Event Semantics Fields
| Field | Type | Required | Description |
|---|---|---|---|
eventType |
enum |
REQUIRED | "debug" | "info" | "warn" | "error" | "fatal" |
module |
string |
REQUIRED | Logical module: "keyboard_dictation", "auth", "sync", "settings", "onboarding" |
feature |
string? |
Optional | Sub-feature: "voice_typing", "settings_deeplink", "azure_recognition" |
eventName |
string |
REQUIRED | Snake_case event: "mic_tapped", "recognition_failed", "insert_noop", "session_started" |
4.5 Error & Diagnostics Fields
| Field | Type | Required | Description |
|---|---|---|---|
errorDomain |
string? |
On error | iOS NSError domain, Android exception class, JS Error name |
errorCode |
string? |
On error | Normalized string code |
message |
string? |
On error | Sanitized, max 512 chars — NEVER raw user content |
stackTrace |
string? |
Optional | Redacted/capped at 8KB — only for fatal events |
fingerprint |
string? |
Optional | Client-side hash of (module + eventName + errorCode + errorDomain) |
4.6 Structured Metadata (Extensible)
| Field | Type | Required | Description |
|---|---|---|---|
tags |
Record<string, string>? |
Optional | Small indexed key-value pairs (max 20 keys, 128 chars each) |
metrics |
Record<string, number>? |
Optional | Numeric measurements: durations, counters, sizes |
context |
Record<string, unknown>? |
Optional | Schema-validated safe object, max 4KB serialized |
4.7 Module-Specific: Keyboard Dictation
When module = "keyboard_dictation", clients SHOULD include a structured dictation object:
| Field | Type | Description |
|---|---|---|
dictation.backend |
"azure" | "local" | "none" |
Which recognition backend was active |
dictation.hasFullAccess |
boolean |
Keyboard Full Access toggle state |
dictation.micPermission |
"granted" | "denied" | "undetermined" |
Microphone permission |
dictation.speechPermission |
"authorized" | "denied" | "restricted" | "notDetermined" |
Speech recognition permission |
dictation.recognitionStarted |
boolean |
Did recognition engine actually start? |
dictation.finalResultReceived |
boolean |
Did at least one final result arrive? |
dictation.insertAttempted |
boolean |
Did insertText / commitText get called? |
dictation.insertNoOpDetected |
boolean |
Did retry logic detect a no-op insert? |
dictation.transcriptLength |
number |
Character count only — NEVER raw text |
dictation.sessionDurationMs |
number |
Time from mic tap to stop |
dictation.hostApp |
string? |
Bundle ID of host app if available (e.g. "com.apple.MobileSMS") |
dictation.errorRecoveryAttempted |
boolean |
Was Azure→local (or vice versa) recovery attempted during this session? |
dictation.errorRecoverySucceeded |
boolean? |
If recovery was attempted, did the fallback backend produce results? |
dictation.audioSessionCategory |
string? |
iOS AVAudioSession category active during dictation (e.g. "playAndRecord") |
4.8 Server-Computed Fields
These fields are set by the ingestion endpoint, never by clients.
| Field | Type | Required | Description |
|---|---|---|---|
pk |
string |
Server-set | Cosmos partition key: ${productId}:${yyyyMM}:${platform}. Computed from event fields on ingest |
occurredAt |
string (ISO 8601) |
REQUIRED | Client-side timestamp (client provides this) |
receivedAt |
string (ISO 8601) |
Server-set | Server receipt timestamp |
ttl |
number |
Server-set | Cosmos TTL in seconds (not ISO date). Cosmos uses _ts + ttl for auto-expiry. Default: TELEMETRY_EVENT_TTL_DAYS * 86400 |
5. Segment-Based Collection Control
5.1 Motivation
Telemetry should not be a firehose. We need granular control to:
- Debug a specific user: Turn on verbose logging for user
usr_abc123only. - Target a platform: Collect keyboard dictation events only from iOS.
- Target a region: Enable collection for users in
US:WA(Seattle area) orIN:TN(Chennai area). - Target a version: Collect from users on build < 26 (old builds with known bug).
- Target an OS: Only Linux desktop users.
- Global kill switch: Disable all collection instantly.
5.2 Collection Policy Document Schema
Stored in Cosmos container telemetry_collection_policies:
interface TelemetryCollectionPolicy {
id: string; // uuid
productId: string; // REQUIRED
// Identity
name: string; // human-readable: "Debug iOS keyboard for user X"
description: string;
enabled: boolean; // master toggle
priority: number; // higher = evaluated first (for conflicts)
// What to collect
eventTypes: ('debug' | 'info' | 'warn' | 'error' | 'fatal')[];
modules: string[]; // empty = all modules
samplingRate: number; // 0.0–1.0 (1.0 = collect everything matching)
// Segment targeting rules (ALL conditions must match = AND logic)
targeting: {
// User targeting
userIds?: string[]; // specific user IDs
anonymousInstallIds?: string[]; // specific install IDs
// Platform targeting
platforms?: ('ios' | 'android' | 'web' | 'desktop')[];
channels?: (
| 'mobile_app'
| 'keyboard_extension'
| 'web_app'
| 'desktop_app'
| 'backend_service'
)[];
osFamilies?: ('ios' | 'android' | 'macos' | 'windows' | 'linux' | 'chromeos')[];
// Version targeting
appVersions?: string[]; // exact match list: ["1.0.0", "1.1.0"]
appVersionRange?: {
// semver range
min?: string; // inclusive
max?: string; // inclusive
};
buildNumbers?: string[]; // exact match list: ["25", "26"]
buildNumberRange?: {
min?: number; // inclusive
max?: number; // inclusive
};
// Region targeting (derived from client locale/timezone or server-side IP geo)
countryCodes?: string[]; // ISO 3166-1 alpha-2: ["US", "IN"]
regionCodes?: string[]; // sub-national: ["US:WA", "IN:TN", "IN:KA"]
// Release channel targeting
releaseChannels?: ('dev' | 'beta' | 'prod')[];
// Percentage rollout (uses existing FNV-1a hash from feature flags)
percentage?: number; // 0–100, deterministic per userId/installId
};
// Lifecycle
startsAt?: string; // ISO — policy activates at this time
expiresAt?: string; // ISO — policy auto-deactivates
createdAt: string;
updatedAt: string;
createdBy: string; // admin userId who created it
}
5.3 Policy Evaluation Logic (Client-Side)
Clients poll GET /api/telemetry/config periodically (every 5 min or on app foreground). The server evaluates all active policies against the client's context and returns a merged collection config:
// Response from GET /api/telemetry/config?platform=ios&channel=keyboard_extension&...
interface TelemetryCollectionConfig {
enabled: boolean; // global kill switch
eventTypes: string[]; // which event types to collect
modules: string[]; // which modules (empty = all)
samplingRates: {
// per event type
debug: number; // 0.0–1.0
info: number;
warn: number;
error: number;
fatal: number;
};
batchSize: number; // max events per POST
flushIntervalMs: number; // how often to flush batch
maxQueueSize: number; // drop oldest if exceeded
}
5.4 Evaluation Rules
-
Global default: If no policies match, use a hardcoded default:
- Collect
warn,error,fatalonly - Sample
warnat 50%,error/fatalat 100% - Flush every 60s, batch of 20, max queue 200
- Collect
-
Empty targeting = matches ALL: A policy with
targeting: {}(all fields omitted) matches every client. This is how the global kill switch works (example G). -
Policy matching: A policy matches if ALL present (non-null/non-undefined) targeting conditions are met (AND logic). Omitted conditions are ignored (not checked).
-
Policy merge (multiple matches): Highest-priority policy wins for each field. Exception:
eventTypesare unioned (if any matching policy enablesdebug, it’s enabled). -
Percentage rollout: Uses the same FNV-1a hash from the existing feature flags module:
hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage; -
Time bounds:
startsAt/expiresAtare checked server-side before including in response. -
samplingRate→samplingRatesmapping: A policy’s singlesamplingRateapplies to ALL itseventTypes. When merging multiple policies, the highest-priority policy’s rate wins per event type. If a policy enables["debug", "info"]at rate 0.5 and another enables["error", "fatal"]at rate 1.0, the merged config is:{ "debug": 0.5, "info": 0.5, "warn": 0.0, "error": 1.0, "fatal": 1.0 } -
batchSize,flushIntervalMs,maxQueueSizedefaults: These are NOT set per-policy. They come from server-side env vars with these defaults:Param Default Env Var batchSize20 TELEMETRY_CLIENT_BATCH_SIZEflushIntervalMs60000 (60s) TELEMETRY_CLIENT_FLUSH_MSmaxQueueSize200 TELEMETRY_CLIENT_MAX_QUEUEThe config endpoint returns these with the merged policy so clients don’t hardcode them.
5.5 Example Policies
A) Debug one user's iOS keyboard
{
"name": "Debug user sd9235 iOS keyboard",
"enabled": true,
"priority": 100,
"eventTypes": ["debug", "info", "warn", "error", "fatal"],
"modules": ["keyboard_dictation"],
"samplingRate": 1.0,
"targeting": {
"userIds": ["usr_sd9235"],
"platforms": ["ios"],
"channels": ["keyboard_extension"]
},
"expiresAt": "2026-02-20T00:00:00Z"
}
B) All iOS users on old builds
{
"name": "Collect errors from iOS builds < 26",
"enabled": true,
"priority": 50,
"eventTypes": ["warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["ios"],
"buildNumberRange": { "min": 1, "max": 25 }
}
}
C) Seattle-area users only
{
"name": "Seattle region telemetry",
"enabled": true,
"priority": 60,
"eventTypes": ["info", "warn", "error", "fatal"],
"modules": [],
"samplingRate": 0.5,
"targeting": {
"regionCodes": ["US:WA"]
}
}
D) Only Linux desktop
{
"name": "Linux desktop diagnostics",
"enabled": true,
"priority": 50,
"eventTypes": ["warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["desktop"],
"osFamilies": ["linux"]
}
}
E) 10% of all web users (canary)
{
"name": "Web telemetry canary rollout",
"enabled": true,
"priority": 30,
"eventTypes": ["warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["web"],
"percentage": 10
}
}
F) Chennai, India — mobile app only
{
"name": "Chennai mobile diagnostics",
"enabled": true,
"priority": 60,
"eventTypes": ["info", "warn", "error", "fatal"],
"modules": [],
"samplingRate": 1.0,
"targeting": {
"platforms": ["ios", "android"],
"channels": ["mobile_app"],
"regionCodes": ["IN:TN"]
}
}
G) Global kill switch (disable all collection)
{
"name": "GLOBAL OFF",
"enabled": true,
"priority": 999,
"eventTypes": [],
"modules": [],
"samplingRate": 0.0,
"targeting": {}
}
6. Ingestion API Contract
6.1 POST /api/telemetry/events — Batch Ingest
Auth: JWT (Authorization: Bearer) or Install Token (X-Install-Token: <uuid>). See §4.1.2.
Request:
// --- Zod schema for a single telemetry event ---
const TelemetryEventSchema = z
.object({
// Identity
id: z.string().uuid(),
productId: z.string().min(1),
userId: z.string().optional(),
anonymousInstallId: z.string().uuid().optional(),
sessionId: z.string().min(1),
requestId: z.string().optional(),
// Source classification
platform: z.enum(['ios', 'android', 'web', 'desktop']),
channel: z.enum([
'mobile_app',
'keyboard_extension',
'web_app',
'desktop_app',
'backend_service',
]),
osFamily: z.enum(['ios', 'android', 'macos', 'windows', 'linux', 'chromeos', 'other']),
osVersion: z.string().optional(),
deviceModel: z.string().optional(),
locale: z.string().optional(),
timezone: z.string().optional(),
// App release
appVersion: z.string().min(1),
buildNumber: z.string().min(1),
releaseChannel: z.enum(['dev', 'beta', 'prod']),
// Event semantics
eventType: z.enum(['debug', 'info', 'warn', 'error', 'fatal']),
module: z.string().min(1),
feature: z.string().optional(),
eventName: z.string().min(1),
// Error & diagnostics
errorDomain: z.string().optional(),
errorCode: z.string().optional(),
message: z.string().max(512).optional(),
stackTrace: z.string().max(8192).optional(),
fingerprint: z.string().optional(),
// Structured metadata
tags: z.record(z.string().max(128)).optional(),
metrics: z.record(z.number()).optional(),
context: z.record(z.unknown()).optional(),
// Timing
occurredAt: z.string().datetime(),
})
.refine(e => e.userId || e.anonymousInstallId, {
message: 'At least one of userId or anonymousInstallId is required',
});
// --- Batch ingest request ---
const TelemetryIngestRequest = z.object({
productId: z.string().min(1),
events: z.array(TelemetryEventSchema).min(1).max(50),
clientClockSkewMs: z.number().optional(),
});
Response (200):
interface TelemetryIngestResponse {
accepted: number;
rejected: number;
errors?: Array<{ index: number; reason: string }>;
serverTime: string;
}
Rate limits:
- Authenticated: 100 requests/min per userId
- Anonymous: 30 requests/min per anonymousInstallId
- Payload: max 256KB per request
Validation rules:
productIdauthority: Request-levelproductIdis authoritative. Per-eventproductIdMUST match the request-level value; mismatches are rejected.- Zod validation enforces all required fields (see schema above).
- At least one of
userIdoranonymousInstallId(Zod refine). messagecapped at 512 chars,stackTraceat 8KB,tagsmax 20 keys,contextmax 4KB serialized.- PII regex rejection: reject events containing patterns matching email, phone, credit card.
- No raw dictation text allowed in any field.
Idempotency: Events are upserted by id. If a client retries a batch (e.g., network timeout), duplicate event IDs are silently overwritten. This ensures exactly-once semantics without client-side dedup tracking.
6.2 GET /api/telemetry/config — Collection Config (Client Poll)
Auth: JWT or API key.
Query params:
| Param | Type | Description |
|---|---|---|
userId |
string? | Authenticated user ID (for percentage rollout evaluation) |
anonymousInstallId |
string? | Install ID (fallback for percentage rollout) |
platform |
string | Client platform |
channel |
string | Client channel |
osFamily |
string | OS family |
appVersion |
string | Current app version |
buildNumber |
string | Current build number |
releaseChannel |
string | dev/beta/prod |
countryCode |
string? | Client-reported country |
regionCode |
string? | Client-reported region (prefixed: US:WA) |
Response: TelemetryCollectionConfig (see §5.3).
Cache: Client should cache this for 5 minutes. Server sets Cache-Control: max-age=300.
6.3 GET /api/telemetry/query — Admin Query (Read)
Auth: Admin JWT only.
Query params:
| Param | Type | Description |
|---|---|---|
userId |
string? | Filter by user |
anonymousInstallId |
string? | Filter by install |
platform |
string? | Filter by platform |
channel |
string? | Filter by channel |
osFamily |
string? | Filter by OS family |
appVersion |
string? | Filter by version |
buildNumber |
string? | Filter by build |
module |
string? | Filter by module |
eventName |
string? | Filter by event name |
eventType |
string? | Filter by severity |
from |
string (ISO) | Start time |
to |
string (ISO) | End time |
limit |
number | Max results (default 50, max 200) |
continuationToken |
string? | Pagination |
Response:
interface TelemetryQueryResponse {
events: TelemetryEvent[];
total: number;
continuationToken?: string;
}
6.4 GET /api/telemetry/clusters — Error Clusters (Admin)
Auth: Admin JWT only.
Query params: Same filters as query, plus groupBy (default: fingerprint).
Response:
interface TelemetryClusterResponse {
clusters: TelemetryErrorCluster[];
total: number;
}
6.5 Collection Policy CRUD (Admin)
| Method | Path | Description |
|---|---|---|
GET |
/api/telemetry/policies |
List all policies |
POST |
/api/telemetry/policies |
Create policy |
PUT |
/api/telemetry/policies/:id |
Update policy |
DELETE |
/api/telemetry/policies/:id |
Delete policy |
6.6 DELETE /api/telemetry/user/:userId — GDPR Right-to-Erasure
Auth: Admin JWT only.
Deletes ALL telemetry events and cluster references for the given userId. Returns count of deleted documents. Required for GDPR compliance.
Response:
interface TelemetryErasureResponse {
userId: string;
eventsDeleted: number;
clustersUpdated: number;
}
7. Storage & Partitioning
7.1 Cosmos Containers
telemetry_events (raw events)
| Property | Value |
|---|---|
| Partition key | /pk where pk = ${productId}:${yyyyMM}:${platform} |
| TTL | defaultTtl: 30 * 86400 (30 days in seconds, configurable via TELEMETRY_EVENT_TTL_DAYS). Cosmos auto-deletes docs when _ts + ttl passes |
| RU budget | Start at 400 RU/s autoscale, monitor and adjust |
Rationale: Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load.
Composite indexes:
[
{ "path": "/eventType", "order": "ascending" },
{ "path": "/occurredAt", "order": "descending" }
]
Additional indexed paths: /userId, /anonymousInstallId, /module, /eventName, /appVersion, /buildNumber, /channel, /osFamily.
telemetry_error_clusters (aggregated)
| Property | Value |
|---|---|
| Partition key | /pk where pk = ${productId}:${platform}:${module} |
| TTL | defaultTtl: 90 * 86400 (90 days in seconds, configurable via TELEMETRY_CLUSTER_TTL_DAYS) |
| RU budget | 200 RU/s autoscale |
telemetry_collection_policies (segment rules)
| Property | Value |
|---|---|
| Partition key | /productId |
| TTL | None (manual lifecycle) |
| RU budget | Minimal (low volume) |
7.2 Container Registration
Add to registerContainers() call in platform-service src/lib/cosmos.ts:
registerContainers([
// ... existing containers ...
{ id: 'telemetry_events', partitionKeyPath: '/pk' },
{ id: 'telemetry_error_clusters', partitionKeyPath: '/pk' },
{ id: 'telemetry_collection_policies', partitionKeyPath: '/productId' },
]);
8. Error Clustering (Derived)
8.1 Fingerprint Generation
Client-side (optional) and server-side (authoritative):
function generateFingerprint(event: TelemetryEvent): string {
const input = [
event.platform,
event.channel,
event.module,
event.eventName,
event.errorDomain ?? '',
event.errorCode ?? '',
normalizeMessage(event.message ?? ''),
].join(':');
return sha256(input).substring(0, 16); // 16-char hex
}
function normalizeMessage(msg: string): string {
// Strip numbers, UUIDs, paths, timestamps
return msg
.replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<UUID>')
.replace(/\d+/g, '<N>')
.replace(/\/[\w/.]+/g, '<PATH>')
.toLowerCase()
.trim();
}
8.2 Cluster Document
interface TelemetryErrorCluster {
id: string; // fingerprint + time window key (e.g. `${fingerprint}:${yyyyMM}`)
pk: string; // ${productId}:${platform}:${module}
productId: string;
fingerprint: string;
// Dimensions (version-agnostic — one cluster spans all versions)
platform: string;
channel: string;
module: string;
eventName: string;
// Version breakdown — which builds are affected
affectedVersions: Array<{
appVersion: string;
buildNumber: string;
count: number;
lastSeenAt: string;
}>; // capped at 50 entries
// Aggregates
firstSeenAt: string;
lastSeenAt: string;
totalCount: number;
affectedUserIds: string[]; // capped at 100
affectedInstallIds: string[]; // capped at 100
affectedOsFamilies: string[]; // e.g. ["ios", "macos"]
// Representative sample (from most recent event)
sampleErrorDomain?: string;
sampleErrorCode?: string;
sampleMessage?: string;
severity: 'warn' | 'error' | 'fatal';
ttl: number; // Cosmos TTL in seconds
}
8.3 Cluster Update Strategy
On each ingested warn, error, or fatal event:
- Compute fingerprint.
- Upsert cluster doc: increment
totalCount, updatelastSeenAt, append toaffectedUserIds(dedup, cap at 100). - Run as a lightweight post-ingest step (same request, not a separate job — keeps it simple for v1).
9. Admin / DevOps UI
9.1 Page: Ops → Client Logs
Located at admin-dashboard-web/src/app/(dashboard)/ops/client-logs/page.tsx.
Filter Bar
| Filter | Type | Default |
|---|---|---|
| User ID | text input | — |
| Platform | multi-select: ios, android, web, desktop | all |
| Channel | multi-select | all |
| OS Family | multi-select | all |
| App Version | text/select | — |
| Build Number | text/select | — |
| Module | select | all |
| Event Type | multi-select | error, fatal |
| Time Range | date range picker | last 24h |
Views
-
Error Clusters (default): Table of top clusters sorted by
totalCountdesc.- Columns: fingerprint, module, eventName, platform, build, count, affected users, last seen.
- Click → drill into cluster detail (sample events, user list).
-
Event Stream: Chronological list of raw events matching filters.
- Columns: time, user, platform, channel, build, module, eventName, eventType, message.
- Click → full event detail (JSON + dictation struct if present).
-
User Timeline: Enter a userId → see all events chronologically.
- Useful for "what happened to user X's keyboard session."
9.2 Page: Ops → Telemetry Policies
Located at admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.tsx.
- CRUD for collection policies.
- Visual segment builder (dropdowns for platform, OS, version range, region, etc.).
- Priority ordering (drag/drop or numeric).
- Enable/disable toggle per policy.
- "Preview" button: show how many matching users/installs (based on recent telemetry).
10. Client SDK Integration
10.1 iOS (Swift) — App + Keyboard Extension
// Shared via App Group (group.com.bytelyst.LysnrAI)
class LysnrTelemetry {
static let shared = LysnrTelemetry()
// Core properties (set once at init)
let productId = "lysnrai"
let platform = "ios"
let osFamily = "ios"
var channel: String // "mobile_app" or "keyboard_extension"
var installId: String // from App Group UserDefaults
var userId: String? // from App Group (set after login)
func track(
eventType: EventType,
module: String,
eventName: String,
message: String? = nil,
errorCode: String? = nil,
errorDomain: String? = nil,
dictation: DictationContext? = nil,
tags: [String: String]? = nil,
metrics: [String: Double]? = nil
)
func flush() // force-send queued events
func refreshConfig() // poll collection policy
// Keyboard-specific
func queueToAppGroup() // write pending events to App Group UserDefaults
func flushAppGroupQueue() // called by main app on foreground
}
Keyboard extension offline strategy:
- Full Access ON: Sends events directly via URLSession. Falls back to App Group queue on network failure.
- Full Access OFF: Always queues to App Group UserDefaults (
telemetry_event_queuekey). - Main app responsibility: On each foreground, calls
LysnrTelemetry.shared.flushAppGroupQueue()to drain keyboard-queued events. - Queue limits: Max 200 events (~100KB). FIFO eviction when full. See §4.1.2 for memory constraints.
10.2 Android (Kotlin)
object LysnrTelemetry {
fun track(
eventType: EventType,
module: String,
eventName: String,
message: String? = null,
errorCode: String? = null,
dictation: DictationContext? = null,
)
fun flush()
fun refreshConfig()
}
10.3 Desktop (Python)
from lysnrai.telemetry import telemetry
telemetry.track(
event_type="error",
module="speech_recognition",
event_name="azure_timeout",
message="Recognition timed out after 30s",
tags={"backend": "azure"},
metrics={"duration_ms": 30000},
)
10.4 Web (TypeScript)
import { telemetry } from '@/lib/telemetry';
telemetry.track({
eventType: 'error',
module: 'auth',
eventName: 'token_refresh_failed',
errorCode: '401',
message: 'JWT expired and refresh failed',
});
11. Privacy & Security
11.1 Hard Rules
- NEVER send raw dictated/transcribed text in any field.
- NEVER send passwords, tokens, API keys, or PII (email, phone, SSN).
messagefield: sanitized, max 512 chars, no user content.stackTrace: redacted file paths, max 8KB, only onfatal.- Server-side PII regex scanner rejects events containing detected PII patterns.
countryCode/regionCode: derived from IP geo server-side (never GPS coordinates).
11.2 Data Retention
| Container | Default TTL | Configurable |
|---|---|---|
telemetry_events |
30 days | TELEMETRY_EVENT_TTL_DAYS |
telemetry_error_clusters |
90 days | TELEMETRY_CLUSTER_TTL_DAYS |
telemetry_collection_policies |
No TTL | Manual delete / expiresAt |
11.3 Access Control
- Ingest (
POST /api/telemetry/events): Any authenticated user (JWT) or valid install token (X-Install-Token). See §4.1.2. - Read (
GET /api/telemetry/query,/clusters): Admin JWT only. Enforced viareq.jwtPayload?.role === 'admin'check (same pattern as other admin-only modules). - Policy management: Admin JWT only (same check).
- GDPR erasure: Admin JWT only.
- No public endpoints. Telemetry data is internal/operational only.
11.4 Rate Limiting
| Client Type | Limit |
|---|---|
| Authenticated user | 100 req/min |
| Anonymous install | 30 req/min |
| Admin query | 60 req/min |
12. Rollout Plan
Phase 1 — MVP (1–2 weeks)
Goal: iOS keyboard dictation debugging visible in admin dashboard.
| Component | Scope |
|---|---|
| platform-service | telemetry module: types.ts, repository.ts, routes.ts (ingest + query) |
| platform-service | Collection policy CRUD + config endpoint |
| iOS keyboard | LysnrTelemetry client in KeyboardViewController — keyboard_dictation events |
| admin-dashboard | Ops → Client Logs page with basic event stream + filters |
| Cosmos | Register 3 containers |
Delivers: When a user reports "keyboard not typing," admin can look up their userId, see exact error flow, permissions state, backend choice, and insertion outcome.
Phase 2 — Full Platform Coverage (2–3 weeks)
| Component | Scope |
|---|---|
| iOS app | Telemetry for auth, settings, onboarding modules |
| Android app + keyboard | Full telemetry parity with iOS |
| Desktop (Python) | Telemetry for speech recognition, hotkey, paste modules |
| admin-dashboard | Error cluster view, user timeline view |
| platform-service | Cluster aggregation on ingest |
Phase 3 — Advanced (3–4 weeks)
| Component | Scope |
|---|---|
| Web dashboards | Telemetry for auth, API errors, page load |
| admin-dashboard | Telemetry policy builder UI, version comparison view |
| platform-service | Alerting rules (error spike → Slack/email) |
| All clients | Region/geo enrichment server-side |
13. Open Questions
| # | Question | Status |
|---|---|---|
| 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | RESOLVED (rev 2): Direct when Full Access on, App Group queue as fallback. See §4.1.2. |
| 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | Recommend: Same database, separate containers (simpler), revisit if RU contention appears |
| 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 |
| 4 | Max retention for raw events? Compliance requirements? | RESOLVED (rev 2): 30 days default, configurable via TELEMETRY_EVENT_TTL_DAYS. Cosmos TTL in seconds. |
| 5 | Do we need GDPR right-to-erasure support for telemetry? | RESOLVED (rev 2): Yes — DELETE /api/telemetry/user/:userId added to §6.6. |
Appendix A: Env Vars
| Var | Default | Description |
|---|---|---|
TELEMETRY_ENABLED |
true |
Global server-side kill switch |
TELEMETRY_EVENT_TTL_DAYS |
30 |
Raw event retention (Cosmos TTL = days × 86400 seconds) |
TELEMETRY_CLUSTER_TTL_DAYS |
90 |
Cluster retention |
TELEMETRY_MAX_BATCH_SIZE |
50 |
Max events per ingest request |
TELEMETRY_MAX_PAYLOAD_BYTES |
262144 |
256KB max request body |
TELEMETRY_PII_SCAN_ENABLED |
true |
Server-side PII rejection |
TELEMETRY_CLIENT_BATCH_SIZE |
20 |
Returned in config response for client-side batching |
TELEMETRY_CLIENT_FLUSH_MS |
60000 |
Returned in config response for client flush interval |
TELEMETRY_CLIENT_MAX_QUEUE |
200 |
Returned in config response for client max queue size |
Appendix B: Related Files
| File | Repo | Purpose |
|---|---|---|
services/platform-service/src/modules/telemetry/ |
common-plat | Telemetry module (types, repo, routes — 14 endpoints) |
services/platform-service/src/modules/telemetry/telemetry.test.ts |
common-plat | Telemetry unit tests (624 tests total) |
services/platform-service/src/modules/flags/ |
common-plat | Feature flags (reused for segment % rollout) |
services/platform-service/src/modules/audit/ |
common-plat | Audit log module (telemetry actions logged) |
scripts/cosmos-telemetry-indexes.sh |
common-plat | Cosmos DB indexing policy for telemetry |
admin-dashboard-web/src/app/(dashboard)/ops/client-logs/ |
lysnrai | Admin log viewer + clusters + geo + metrics |
admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/ |
lysnrai | Policy manager UI + live preview |
admin-dashboard-web/src/app/api/telemetry/ |
lysnrai | API proxy routes (events, clusters, metrics, geo, policies) |
admin-dashboard-web/src/lib/platform-client.ts |
lysnrai | Platform-service client (telemetry functions) |
mobile_app/ios/LysnrKeyboard/KeyboardViewController.swift |
lysnrai | iOS keyboard (first telemetry client) |
mobile_app/android/.../LysnrInputMethodService.kt |
lysnrai | Android keyboard (Phase 2) |
src/telemetry/ |
lysnrai | Python desktop telemetry client (Phase 2) |