learning_ai_common_plat/docs/design/CLIENT_TELEMETRY_DESIGN.md

53 KiB
Raw Permalink Blame History

Client Telemetry & Log Insights — Detailed Design

Audience: Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos. Scope: Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails. Status: Design — implementing today, keyboard-first. Last updated: 2026-02-17 (rev 2 — 18 gaps fixed)


Table of Contents

  1. Problem Statement
  2. Goals & Non-Goals
  3. Architecture Overview
  4. Telemetry Event Schema (Canonical)
  5. Segment-Based Collection Control
  6. Ingestion API Contract
  7. Storage & Partitioning
  8. Error Clustering (Derived)
  9. Admin / DevOps UI
  10. Client SDK Integration
  11. Privacy & Security
  12. Rollout Plan
  13. Open Questions

1. Problem Statement

When a user reports "keyboard voice dictation doesn't type into Messages on iPhone 17 Pro," we currently have zero server-side visibility into what happened on that device. We cannot see:

  • Did recognition start? Which backend (Azure / local)?
  • Did recognition produce results?
  • Did insertText succeed or no-op?
  • What error code/domain terminated the session?
  • What app version / build / OS / permissions state was active?

We need a lightweight, always-on (but controllable) telemetry pipeline that:

  1. Collects structured diagnostic events from all client platforms.
  2. Correlates events by user, device, platform, version, and session.
  3. Surfaces insights in the admin dashboard for debugging and release health.
  4. Can be turned on/off per segment (user, platform, region, version, etc.).

2. Goals & Non-Goals

Goals

  • G1: Unified event schema across iOS, Android, Desktop, Web.
  • G2: Per-user, per-platform, per-version, per-region segment targeting for collection.
  • G3: Admin UI with drill-down from cluster → user → session → event.
  • G4: Privacy-first: no raw dictation text, no PII in payloads.
  • G5: Low overhead: async batched sends, client-side sampling for noisy events.
  • G6: Leverage existing infrastructure (platform-service, Cosmos DB, feature flags).

Non-Goals

  • Real-time streaming dashboards (v1 uses polling/refresh).
  • Full APM / distributed tracing replacement (use Azure Monitor for that).
  • Client-side crash reporting (use native crash reporters — Crashlytics, Sentry).

3. Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                        Client Platforms                          │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐  │
│  │ iOS App │ │ iOS Kbd │ │ Android │ │ Desktop │ │ Web Apps │  │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘  │
│       │           │           │           │           │         │
│       ▼           ▼           ▼           ▼           ▼         │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │        Client Telemetry SDK (per-platform thin layer)    │   │
│  │  • Collects events → batches → POST /api/telemetry      │   │
│  │  • Checks collection policy via feature flag poll        │   │
│  │  • Samples debug events, never samples error/fatal       │   │
│  └──────────────────────────────────┬───────────────────────┘   │
└─────────────────────────────────────┼───────────────────────────┘
                                      │ HTTPS
                                      ▼
┌──────────────────────────────────────────────────────────────────┐
│                     platform-service (:4003)                     │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  POST /api/telemetry/events    (batch ingest)            │   │
│  │  GET  /api/telemetry/query     (admin read)              │   │
│  │  GET  /api/telemetry/clusters  (aggregated error view)   │   │
│  │  GET  /api/telemetry/config    (collection policy)       │   │
│  └──────────────────────────────────┬───────────────────────┘   │
│                                     │                            │
│  ┌──────────────────────────────────▼───────────────────────┐   │
│  │  Cosmos DB                                               │   │
│  │  • telemetry_events (raw, TTL 3060d)                    │   │
│  │  • telemetry_error_clusters (derived, TTL 90180d)       │   │
│  │  • telemetry_collection_policies (segment rules)         │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  Existing modules used:                                          │
│  • feature_flags — segment evaluation (FNV-1a hash)              │
│  • auth — JWT validation for authenticated events                │
│  • rate-limit — per-user/install throttling                      │
└──────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────┐
│              admin-dashboard-web (:3001)                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Ops → Client Logs                                       │   │
│  │  • Live event stream (recent errors)                     │   │
│  │  • Error cluster view (top failures by platform/build)   │   │
│  │  • User timeline (all events for one user)               │   │
│  │  • Collection policy manager (segment targeting UI)      │   │
│  └──────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘

4. Telemetry Event Schema (Canonical)

Every client event MUST conform to this schema. Fields marked REQUIRED must always be present.

4.1 Identity Fields

Field Type Required Description
id string (uuid) REQUIRED Unique event ID, generated client-side
productId string REQUIRED Product identifier (e.g. "lysnrai")
userId string? Conditional Present when user is authenticated
anonymousInstallId string? Conditional Stable per-install UUID when userId absent
sessionId string REQUIRED App/keyboard session correlation ID
requestId string? Optional Cross-service correlation (x-request-id)

Rule: At least one of userId or anonymousInstallId MUST be present.

4.1.1 anonymousInstallId Generation Strategy

Each platform generates a stable UUID on first launch and persists it:

Platform Storage Key
iOS app Keychain (kSecAttrAccessibleAfterFirstUnlock) com.bytelyst.LysnrAI.installId
iOS keyboard App Group UserDefaults (group.com.bytelyst.LysnrAI) telemetry_install_id
Android EncryptedSharedPreferences telemetry_install_id
Desktop ~/.lysnrai/telemetry_install_id (plain file)
Web localStorage lysnrai_install_id

iOS keyboard note: The keyboard extension shares the install ID via App Group so main app and extension use the same identity.

4.1.2 Authentication for Telemetry Ingest

Not all clients have a JWT (e.g., keyboard extension before user logs in). The ingest endpoint accepts two auth modes:

Mode Header When Used
JWT Authorization: Bearer <token> Authenticated users (main app, web, desktop after login)
Install Token X-Install-Token: <anonymousInstallId> Unauthenticated clients (keyboard extension, pre-login)

Install token validation: The server accepts any well-formed UUID in X-Install-Token. It does NOT verify against a registry (install IDs are self-issued). Rate limiting is applied per install ID to prevent abuse.

Keyboard extension specifics:

  • With Full Access ON: sends events directly via HTTPS using X-Install-Token.
  • With Full Access OFF: queues events to App Group UserDefaults (max 200 events, ~100KB). Main app flushes on next foreground.
  • If queue is full, oldest events are dropped (FIFO eviction).
  • Memory constraint: iOS keyboard extensions are limited to ~30MB. Telemetry queue MUST stay under 100KB. Events are serialized as compact JSON (no pretty-print).

4.2 Source Classification Fields

Field Type Required Description
platform enum REQUIRED "ios" | "android" | "web" | "desktop"
channel enum REQUIRED "mobile_app" | "keyboard_extension" | "web_app" | "desktop_app" | "backend_service"
osFamily enum REQUIRED "ios" | "android" | "macos" | "windows" | "linux" | "chromeos" | "other"
osVersion string? Recommended e.g. "iOS 18.2", "Windows 11 24H2", "macOS 15.3", "Ubuntu 24.04"
deviceModel string? Optional e.g. "iPhone17,3", "Pixel 9", "MacBookPro18,3"
locale string? Optional BCP 47 locale, e.g. "en-US", "ta-IN"
timezone string? Optional IANA timezone, e.g. "America/Los_Angeles", "Asia/Kolkata"
countryCode string? Optional ISO 3166-1 alpha-2, e.g. "US", "IN" — derived from locale or IP server-side
regionCode string? Optional Prefixed format: "US:WA", "IN:TN" — derived server-side from IP geo. Always {country}:{region} to avoid ambiguity (TN = Tennessee or Tamil Nadu)

4.3 App Release Fields

Field Type Required Description
appVersion string REQUIRED Semantic version: CFBundleShortVersionString / versionName / npm version
buildNumber string REQUIRED CFBundleVersion / versionCode / web release commit hash
releaseChannel enum REQUIRED "dev" | "beta" | "prod"

4.4 Event Semantics Fields

Field Type Required Description
eventType enum REQUIRED "debug" | "info" | "warn" | "error" | "fatal"
module string REQUIRED Logical module: "keyboard_dictation", "auth", "sync", "settings", "onboarding"
feature string? Optional Sub-feature: "voice_typing", "settings_deeplink", "azure_recognition"
eventName string REQUIRED Snake_case event: "mic_tapped", "recognition_failed", "insert_noop", "session_started"

4.5 Error & Diagnostics Fields

Field Type Required Description
errorDomain string? On error iOS NSError domain, Android exception class, JS Error name
errorCode string? On error Normalized string code
message string? On error Sanitized, max 512 chars — NEVER raw user content
stackTrace string? Optional Redacted/capped at 8KB — only for fatal events
fingerprint string? Optional Client-side hash of (module + eventName + errorCode + errorDomain)

4.6 Structured Metadata (Extensible)

Field Type Required Description
tags Record<string, string>? Optional Small indexed key-value pairs (max 20 keys, 128 chars each)
metrics Record<string, number>? Optional Numeric measurements: durations, counters, sizes
context Record<string, unknown>? Optional Schema-validated safe object, max 4KB serialized

4.7 Module-Specific: Keyboard Dictation

When module = "keyboard_dictation", clients SHOULD include a structured dictation object:

Field Type Description
dictation.backend "azure" | "local" | "none" Which recognition backend was active
dictation.hasFullAccess boolean Keyboard Full Access toggle state
dictation.micPermission "granted" | "denied" | "undetermined" Microphone permission
dictation.speechPermission "authorized" | "denied" | "restricted" | "notDetermined" Speech recognition permission
dictation.recognitionStarted boolean Did recognition engine actually start?
dictation.finalResultReceived boolean Did at least one final result arrive?
dictation.insertAttempted boolean Did insertText / commitText get called?
dictation.insertNoOpDetected boolean Did retry logic detect a no-op insert?
dictation.transcriptLength number Character count only — NEVER raw text
dictation.sessionDurationMs number Time from mic tap to stop
dictation.hostApp string? Bundle ID of host app if available (e.g. "com.apple.MobileSMS")
dictation.errorRecoveryAttempted boolean Was Azure→local (or vice versa) recovery attempted during this session?
dictation.errorRecoverySucceeded boolean? If recovery was attempted, did the fallback backend produce results?
dictation.audioSessionCategory string? iOS AVAudioSession category active during dictation (e.g. "playAndRecord")

4.8 Server-Computed Fields

These fields are set by the ingestion endpoint, never by clients.

Field Type Required Description
pk string Server-set Cosmos partition key: ${productId}:${yyyyMM}:${platform}. Computed from event fields on ingest
occurredAt string (ISO 8601) REQUIRED Client-side timestamp (client provides this)
receivedAt string (ISO 8601) Server-set Server receipt timestamp
ttl number Server-set Cosmos TTL in seconds (not ISO date). Cosmos uses _ts + ttl for auto-expiry. Default: TELEMETRY_EVENT_TTL_DAYS * 86400

5. Segment-Based Collection Control

5.1 Motivation

Telemetry should not be a firehose. We need granular control to:

  • Debug a specific user: Turn on verbose logging for user usr_abc123 only.
  • Target a platform: Collect keyboard dictation events only from iOS.
  • Target a region: Enable collection for users in US:WA (Seattle area) or IN:TN (Chennai area).
  • Target a version: Collect from users on build < 26 (old builds with known bug).
  • Target an OS: Only Linux desktop users.
  • Global kill switch: Disable all collection instantly.

5.2 Collection Policy Document Schema

Stored in Cosmos container telemetry_collection_policies:

interface TelemetryCollectionPolicy {
  id: string; // uuid
  productId: string; // REQUIRED

  // Identity
  name: string; // human-readable: "Debug iOS keyboard for user X"
  description: string;
  enabled: boolean; // master toggle
  priority: number; // higher = evaluated first (for conflicts)

  // What to collect
  eventTypes: ('debug' | 'info' | 'warn' | 'error' | 'fatal')[];
  modules: string[]; // empty = all modules
  samplingRate: number; // 0.01.0 (1.0 = collect everything matching)

  // Segment targeting rules (ALL conditions must match = AND logic)
  targeting: {
    // User targeting
    userIds?: string[]; // specific user IDs
    anonymousInstallIds?: string[]; // specific install IDs

    // Platform targeting
    platforms?: ('ios' | 'android' | 'web' | 'desktop')[];
    channels?: (
      | 'mobile_app'
      | 'keyboard_extension'
      | 'web_app'
      | 'desktop_app'
      | 'backend_service'
    )[];
    osFamilies?: ('ios' | 'android' | 'macos' | 'windows' | 'linux' | 'chromeos')[];

    // Version targeting
    appVersions?: string[]; // exact match list: ["1.0.0", "1.1.0"]
    appVersionRange?: {
      // semver range
      min?: string; // inclusive
      max?: string; // inclusive
    };
    buildNumbers?: string[]; // exact match list: ["25", "26"]
    buildNumberRange?: {
      min?: number; // inclusive
      max?: number; // inclusive
    };

    // Region targeting (derived from client locale/timezone or server-side IP geo)
    countryCodes?: string[]; // ISO 3166-1 alpha-2: ["US", "IN"]
    regionCodes?: string[]; // sub-national: ["US:WA", "IN:TN", "IN:KA"]

    // Release channel targeting
    releaseChannels?: ('dev' | 'beta' | 'prod')[];

    // Percentage rollout (uses existing FNV-1a hash from feature flags)
    percentage?: number; // 0100, deterministic per userId/installId
  };

  // Lifecycle
  startsAt?: string; // ISO — policy activates at this time
  expiresAt?: string; // ISO — policy auto-deactivates
  createdAt: string;
  updatedAt: string;
  createdBy: string; // admin userId who created it
}

5.3 Policy Evaluation Logic (Client-Side)

Clients poll GET /api/telemetry/config periodically (every 5 min or on app foreground). The server evaluates all active policies against the client's context and returns a merged collection config:

// Response from GET /api/telemetry/config?platform=ios&channel=keyboard_extension&...
interface TelemetryCollectionConfig {
  enabled: boolean; // global kill switch
  eventTypes: string[]; // which event types to collect
  modules: string[]; // which modules (empty = all)
  samplingRates: {
    // per event type
    debug: number; // 0.01.0
    info: number;
    warn: number;
    error: number;
    fatal: number;
  };
  batchSize: number; // max events per POST
  flushIntervalMs: number; // how often to flush batch
  maxQueueSize: number; // drop oldest if exceeded
}

5.4 Evaluation Rules

  1. Global default: If no policies match, use a hardcoded default:

    • Collect warn, error, fatal only
    • Sample warn at 50%, error/fatal at 100%
    • Flush every 60s, batch of 20, max queue 200
  2. Empty targeting = matches ALL: A policy with targeting: {} (all fields omitted) matches every client. This is how the global kill switch works (example G).

  3. Policy matching: A policy matches if ALL present (non-null/non-undefined) targeting conditions are met (AND logic). Omitted conditions are ignored (not checked).

  4. Policy merge (multiple matches): Highest-priority policy wins for each field. Exception: eventTypes are unioned (if any matching policy enables debug, its enabled).

  5. Percentage rollout: Uses the same FNV-1a hash from the existing feature flags module:

    hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage;
    
  6. Time bounds: startsAt/expiresAt are checked server-side before including in response.

  7. samplingRatesamplingRates mapping: A policys single samplingRate applies to ALL its eventTypes. When merging multiple policies, the highest-priority policys rate wins per event type. If a policy enables ["debug", "info"] at rate 0.5 and another enables ["error", "fatal"] at rate 1.0, the merged config is:

    { "debug": 0.5, "info": 0.5, "warn": 0.0, "error": 1.0, "fatal": 1.0 }
    
  8. batchSize, flushIntervalMs, maxQueueSize defaults: These are NOT set per-policy. They come from server-side env vars with these defaults:

    Param Default Env Var
    batchSize 20 TELEMETRY_CLIENT_BATCH_SIZE
    flushIntervalMs 60000 (60s) TELEMETRY_CLIENT_FLUSH_MS
    maxQueueSize 200 TELEMETRY_CLIENT_MAX_QUEUE

    The config endpoint returns these with the merged policy so clients dont hardcode them.

5.5 Example Policies

A) Debug one user's iOS keyboard

{
  "name": "Debug user sd9235 iOS keyboard",
  "enabled": true,
  "priority": 100,
  "eventTypes": ["debug", "info", "warn", "error", "fatal"],
  "modules": ["keyboard_dictation"],
  "samplingRate": 1.0,
  "targeting": {
    "userIds": ["usr_sd9235"],
    "platforms": ["ios"],
    "channels": ["keyboard_extension"]
  },
  "expiresAt": "2026-02-20T00:00:00Z"
}

B) All iOS users on old builds

{
  "name": "Collect errors from iOS builds < 26",
  "enabled": true,
  "priority": 50,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["ios"],
    "buildNumberRange": { "min": 1, "max": 25 }
  }
}

C) Seattle-area users only

{
  "name": "Seattle region telemetry",
  "enabled": true,
  "priority": 60,
  "eventTypes": ["info", "warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 0.5,
  "targeting": {
    "regionCodes": ["US:WA"]
  }
}

D) Only Linux desktop

{
  "name": "Linux desktop diagnostics",
  "enabled": true,
  "priority": 50,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["desktop"],
    "osFamilies": ["linux"]
  }
}

E) 10% of all web users (canary)

{
  "name": "Web telemetry canary rollout",
  "enabled": true,
  "priority": 30,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["web"],
    "percentage": 10
  }
}

F) Chennai, India — mobile app only

{
  "name": "Chennai mobile diagnostics",
  "enabled": true,
  "priority": 60,
  "eventTypes": ["info", "warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["ios", "android"],
    "channels": ["mobile_app"],
    "regionCodes": ["IN:TN"]
  }
}

G) Global kill switch (disable all collection)

{
  "name": "GLOBAL OFF",
  "enabled": true,
  "priority": 999,
  "eventTypes": [],
  "modules": [],
  "samplingRate": 0.0,
  "targeting": {}
}

6. Ingestion API Contract

6.1 POST /api/telemetry/events — Batch Ingest

Auth: JWT (Authorization: Bearer) or Install Token (X-Install-Token: <uuid>). See §4.1.2.

Request:

// --- Zod schema for a single telemetry event ---
const TelemetryEventSchema = z
  .object({
    // Identity
    id: z.string().uuid(),
    productId: z.string().min(1),
    userId: z.string().optional(),
    anonymousInstallId: z.string().uuid().optional(),
    sessionId: z.string().min(1),
    requestId: z.string().optional(),

    // Source classification
    platform: z.enum(['ios', 'android', 'web', 'desktop']),
    channel: z.enum([
      'mobile_app',
      'keyboard_extension',
      'web_app',
      'desktop_app',
      'backend_service',
    ]),
    osFamily: z.enum(['ios', 'android', 'macos', 'windows', 'linux', 'chromeos', 'other']),
    osVersion: z.string().optional(),
    deviceModel: z.string().optional(),
    locale: z.string().optional(),
    timezone: z.string().optional(),

    // App release
    appVersion: z.string().min(1),
    buildNumber: z.string().min(1),
    releaseChannel: z.enum(['dev', 'beta', 'prod']),

    // Event semantics
    eventType: z.enum(['debug', 'info', 'warn', 'error', 'fatal']),
    module: z.string().min(1),
    feature: z.string().optional(),
    eventName: z.string().min(1),

    // Error & diagnostics
    errorDomain: z.string().optional(),
    errorCode: z.string().optional(),
    message: z.string().max(512).optional(),
    stackTrace: z.string().max(8192).optional(),
    fingerprint: z.string().optional(),

    // Structured metadata
    tags: z.record(z.string().max(128)).optional(),
    metrics: z.record(z.number()).optional(),
    context: z.record(z.unknown()).optional(),

    // Timing
    occurredAt: z.string().datetime(),
  })
  .refine(e => e.userId || e.anonymousInstallId, {
    message: 'At least one of userId or anonymousInstallId is required',
  });

// --- Batch ingest request ---
const TelemetryIngestRequest = z.object({
  productId: z.string().min(1),
  events: z.array(TelemetryEventSchema).min(1).max(50),
  clientClockSkewMs: z.number().optional(),
});

Response (200):

interface TelemetryIngestResponse {
  accepted: number;
  rejected: number;
  errors?: Array<{ index: number; reason: string }>;
  serverTime: string;
}

Rate limits:

  • Authenticated: 100 requests/min per userId
  • Anonymous: 30 requests/min per anonymousInstallId
  • Payload: max 256KB per request

Validation rules:

  1. productId authority: Request-level productId is authoritative. Per-event productId MUST match the request-level value; mismatches are rejected.
  2. Zod validation enforces all required fields (see schema above).
  3. At least one of userId or anonymousInstallId (Zod refine).
  4. message capped at 512 chars, stackTrace at 8KB, tags max 20 keys, context max 4KB serialized.
  5. PII regex rejection: reject events containing patterns matching email, phone, credit card.
  6. No raw dictation text allowed in any field.

Idempotency: Events are upserted by id. If a client retries a batch (e.g., network timeout), duplicate event IDs are silently overwritten. This ensures exactly-once semantics without client-side dedup tracking.

6.2 GET /api/telemetry/config — Collection Config (Client Poll)

Auth: JWT or API key.

Query params:

Param Type Description
userId string? Authenticated user ID (for percentage rollout evaluation)
anonymousInstallId string? Install ID (fallback for percentage rollout)
platform string Client platform
channel string Client channel
osFamily string OS family
appVersion string Current app version
buildNumber string Current build number
releaseChannel string dev/beta/prod
countryCode string? Client-reported country
regionCode string? Client-reported region (prefixed: US:WA)

Response: TelemetryCollectionConfig (see §5.3).

Cache: Client should cache this for 5 minutes. Server sets Cache-Control: max-age=300.

6.3 GET /api/telemetry/query — Admin Query (Read)

Auth: Admin JWT only.

Query params:

Param Type Description
userId string? Filter by user
anonymousInstallId string? Filter by install
platform string? Filter by platform
channel string? Filter by channel
osFamily string? Filter by OS family
appVersion string? Filter by version
buildNumber string? Filter by build
module string? Filter by module
eventName string? Filter by event name
eventType string? Filter by severity
from string (ISO) Start time
to string (ISO) End time
limit number Max results (default 50, max 200)
continuationToken string? Pagination

Response:

interface TelemetryQueryResponse {
  events: TelemetryEvent[];
  total: number;
  continuationToken?: string;
}

6.4 GET /api/telemetry/clusters — Error Clusters (Admin)

Auth: Admin JWT only.

Query params: Same filters as query, plus groupBy (default: fingerprint).

Response:

interface TelemetryClusterResponse {
  clusters: TelemetryErrorCluster[];
  total: number;
}

6.5 Collection Policy CRUD (Admin)

Method Path Description
GET /api/telemetry/policies List all policies
POST /api/telemetry/policies Create policy
PUT /api/telemetry/policies/:id Update policy
DELETE /api/telemetry/policies/:id Delete policy

6.6 DELETE /api/telemetry/user/:userId — GDPR Right-to-Erasure

Auth: Admin JWT only.

Deletes ALL telemetry events and cluster references for the given userId. Returns count of deleted documents. Required for GDPR compliance.

Response:

interface TelemetryErasureResponse {
  userId: string;
  eventsDeleted: number;
  clustersUpdated: number;
}

7. Storage & Partitioning

7.1 Cosmos Containers

telemetry_events (raw events)

Property Value
Partition key /pk where pk = ${productId}:${yyyyMM}:${platform}
TTL defaultTtl: 30 * 86400 (30 days in seconds, configurable via TELEMETRY_EVENT_TTL_DAYS). Cosmos auto-deletes docs when _ts + ttl passes
RU budget Start at 400 RU/s autoscale, monitor and adjust

Rationale: Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load.

Composite indexes:

[
  { "path": "/eventType", "order": "ascending" },
  { "path": "/occurredAt", "order": "descending" }
]

Additional indexed paths: /userId, /anonymousInstallId, /module, /eventName, /appVersion, /buildNumber, /channel, /osFamily.

telemetry_error_clusters (aggregated)

Property Value
Partition key /pk where pk = ${productId}:${platform}:${module}
TTL defaultTtl: 90 * 86400 (90 days in seconds, configurable via TELEMETRY_CLUSTER_TTL_DAYS)
RU budget 200 RU/s autoscale

telemetry_collection_policies (segment rules)

Property Value
Partition key /productId
TTL None (manual lifecycle)
RU budget Minimal (low volume)

7.2 Container Registration

Add to registerContainers() call in platform-service src/lib/cosmos.ts:

registerContainers([
  // ... existing containers ...
  { id: 'telemetry_events', partitionKeyPath: '/pk' },
  { id: 'telemetry_error_clusters', partitionKeyPath: '/pk' },
  { id: 'telemetry_collection_policies', partitionKeyPath: '/productId' },
]);

8. Error Clustering (Derived)

8.1 Fingerprint Generation

Client-side (optional) and server-side (authoritative):

function generateFingerprint(event: TelemetryEvent): string {
  const input = [
    event.platform,
    event.channel,
    event.module,
    event.eventName,
    event.errorDomain ?? '',
    event.errorCode ?? '',
    normalizeMessage(event.message ?? ''),
  ].join(':');
  return sha256(input).substring(0, 16); // 16-char hex
}

function normalizeMessage(msg: string): string {
  // Strip numbers, UUIDs, paths, timestamps
  return msg
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<UUID>')
    .replace(/\d+/g, '<N>')
    .replace(/\/[\w/.]+/g, '<PATH>')
    .toLowerCase()
    .trim();
}

8.2 Cluster Document

interface TelemetryErrorCluster {
  id: string; // fingerprint + time window key (e.g. `${fingerprint}:${yyyyMM}`)
  pk: string; // ${productId}:${platform}:${module}
  productId: string;
  fingerprint: string;

  // Dimensions (version-agnostic — one cluster spans all versions)
  platform: string;
  channel: string;
  module: string;
  eventName: string;

  // Version breakdown — which builds are affected
  affectedVersions: Array<{
    appVersion: string;
    buildNumber: string;
    count: number;
    lastSeenAt: string;
  }>; // capped at 50 entries

  // Aggregates
  firstSeenAt: string;
  lastSeenAt: string;
  totalCount: number;
  affectedUserIds: string[]; // capped at 100
  affectedInstallIds: string[]; // capped at 100
  affectedOsFamilies: string[]; // e.g. ["ios", "macos"]

  // Representative sample (from most recent event)
  sampleErrorDomain?: string;
  sampleErrorCode?: string;
  sampleMessage?: string;
  severity: 'warn' | 'error' | 'fatal';
  ttl: number; // Cosmos TTL in seconds
}

8.3 Cluster Update Strategy

On each ingested warn, error, or fatal event:

  1. Compute fingerprint.
  2. Upsert cluster doc: increment totalCount, update lastSeenAt, append to affectedUserIds (dedup, cap at 100).
  3. Run as a lightweight post-ingest step (same request, not a separate job — keeps it simple for v1).

9. Admin / DevOps UI

9.1 Page: Ops → Client Logs

Located at admin-dashboard-web/src/app/(dashboard)/ops/client-logs/page.tsx.

Filter Bar

Filter Type Default
User ID text input
Platform multi-select: ios, android, web, desktop all
Channel multi-select all
OS Family multi-select all
App Version text/select
Build Number text/select
Module select all
Event Type multi-select error, fatal
Time Range date range picker last 24h

Views

  1. Error Clusters (default): Table of top clusters sorted by totalCount desc.

    • Columns: fingerprint, module, eventName, platform, build, count, affected users, last seen.
    • Click → drill into cluster detail (sample events, user list).
  2. Event Stream: Chronological list of raw events matching filters.

    • Columns: time, user, platform, channel, build, module, eventName, eventType, message.
    • Click → full event detail (JSON + dictation struct if present).
  3. User Timeline: Enter a userId → see all events chronologically.

    • Useful for "what happened to user X's keyboard session."

9.2 Page: Ops → Telemetry Policies

Located at admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.tsx.

  • CRUD for collection policies.
  • Visual segment builder (dropdowns for platform, OS, version range, region, etc.).
  • Priority ordering (drag/drop or numeric).
  • Enable/disable toggle per policy.
  • "Preview" button: show how many matching users/installs (based on recent telemetry).

10. Client SDK Integration

10.1 iOS (Swift) — App + Keyboard Extension

// Shared via App Group (group.com.bytelyst.LysnrAI)
class LysnrTelemetry {
    static let shared = LysnrTelemetry()

    // Core properties (set once at init)
    let productId = "lysnrai"
    let platform = "ios"
    let osFamily = "ios"
    var channel: String     // "mobile_app" or "keyboard_extension"
    var installId: String   // from App Group UserDefaults
    var userId: String?     // from App Group (set after login)

    func track(
        eventType: EventType,
        module: String,
        eventName: String,
        message: String? = nil,
        errorCode: String? = nil,
        errorDomain: String? = nil,
        dictation: DictationContext? = nil,
        tags: [String: String]? = nil,
        metrics: [String: Double]? = nil
    )

    func flush()            // force-send queued events
    func refreshConfig()    // poll collection policy

    // Keyboard-specific
    func queueToAppGroup()  // write pending events to App Group UserDefaults
    func flushAppGroupQueue() // called by main app on foreground
}

Keyboard extension offline strategy:

  • Full Access ON: Sends events directly via URLSession. Falls back to App Group queue on network failure.
  • Full Access OFF: Always queues to App Group UserDefaults (telemetry_event_queue key).
  • Main app responsibility: On each foreground, calls LysnrTelemetry.shared.flushAppGroupQueue() to drain keyboard-queued events.
  • Queue limits: Max 200 events (~100KB). FIFO eviction when full. See §4.1.2 for memory constraints.

10.2 Android (Kotlin)

object LysnrTelemetry {
    fun track(
        eventType: EventType,
        module: String,
        eventName: String,
        message: String? = null,
        errorCode: String? = null,
        dictation: DictationContext? = null,
    )
    fun flush()
    fun refreshConfig()
}

10.3 Desktop (Python)

from lysnrai.telemetry import telemetry

telemetry.track(
    event_type="error",
    module="speech_recognition",
    event_name="azure_timeout",
    message="Recognition timed out after 30s",
    tags={"backend": "azure"},
    metrics={"duration_ms": 30000},
)

10.4 Web (TypeScript)

import { telemetry } from '@/lib/telemetry';

telemetry.track({
  eventType: 'error',
  module: 'auth',
  eventName: 'token_refresh_failed',
  errorCode: '401',
  message: 'JWT expired and refresh failed',
});

11. Privacy & Security

11.1 Hard Rules

  1. NEVER send raw dictated/transcribed text in any field.
  2. NEVER send passwords, tokens, API keys, or PII (email, phone, SSN).
  3. message field: sanitized, max 512 chars, no user content.
  4. stackTrace: redacted file paths, max 8KB, only on fatal.
  5. Server-side PII regex scanner rejects events containing detected PII patterns.
  6. countryCode / regionCode: derived from IP geo server-side (never GPS coordinates).

11.2 Data Retention

Container Default TTL Configurable
telemetry_events 30 days TELEMETRY_EVENT_TTL_DAYS
telemetry_error_clusters 90 days TELEMETRY_CLUSTER_TTL_DAYS
telemetry_collection_policies No TTL Manual delete / expiresAt

11.3 Access Control

  • Ingest (POST /api/telemetry/events): Any authenticated user (JWT) or valid install token (X-Install-Token). See §4.1.2.
  • Read (GET /api/telemetry/query, /clusters): Admin JWT only. Enforced via req.jwtPayload?.role === 'admin' check (same pattern as other admin-only modules).
  • Policy management: Admin JWT only (same check).
  • GDPR erasure: Admin JWT only.
  • No public endpoints. Telemetry data is internal/operational only.

11.4 Rate Limiting

Client Type Limit
Authenticated user 100 req/min
Anonymous install 30 req/min
Admin query 60 req/min

12. Rollout Plan

Phase 1 — MVP (12 weeks)

Goal: iOS keyboard dictation debugging visible in admin dashboard.

Component Scope
platform-service telemetry module: types.ts, repository.ts, routes.ts (ingest + query)
platform-service Collection policy CRUD + config endpoint
iOS keyboard LysnrTelemetry client in KeyboardViewController — keyboard_dictation events
admin-dashboard Ops → Client Logs page with basic event stream + filters
Cosmos Register 3 containers

Delivers: When a user reports "keyboard not typing," admin can look up their userId, see exact error flow, permissions state, backend choice, and insertion outcome.

Phase 2 — Full Platform Coverage (23 weeks)

Component Scope
iOS app Telemetry for auth, settings, onboarding modules
Android app + keyboard Full telemetry parity with iOS
Desktop (Python) Telemetry for speech recognition, hotkey, paste modules
admin-dashboard Error cluster view, user timeline view
platform-service Cluster aggregation on ingest

Phase 3 — Advanced (34 weeks)

Component Scope
Web dashboards Telemetry for auth, API errors, page load
admin-dashboard Telemetry policy builder UI, version comparison view
platform-service Alerting rules (error spike → Slack/email)
All clients Region/geo enrichment server-side

13. Open Questions

# Question Status
1 Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? RESOLVED (rev 2): Direct when Full Access on, App Group queue as fallback. See §4.1.2.
2 Do we need a separate Cosmos database for telemetry to isolate RU costs? Recommend: Same database, separate containers (simpler), revisit if RU contention appears
3 Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? Defer to Phase 3
4 Max retention for raw events? Compliance requirements? RESOLVED (rev 2): 30 days default, configurable via TELEMETRY_EVENT_TTL_DAYS. Cosmos TTL in seconds.
5 Do we need GDPR right-to-erasure support for telemetry? RESOLVED (rev 2): Yes — DELETE /api/telemetry/user/:userId added to §6.6.

Appendix A: Env Vars

Var Default Description
TELEMETRY_ENABLED true Global server-side kill switch
TELEMETRY_EVENT_TTL_DAYS 30 Raw event retention (Cosmos TTL = days × 86400 seconds)
TELEMETRY_CLUSTER_TTL_DAYS 90 Cluster retention
TELEMETRY_MAX_BATCH_SIZE 50 Max events per ingest request
TELEMETRY_MAX_PAYLOAD_BYTES 262144 256KB max request body
TELEMETRY_PII_SCAN_ENABLED true Server-side PII rejection
TELEMETRY_CLIENT_BATCH_SIZE 20 Returned in config response for client-side batching
TELEMETRY_CLIENT_FLUSH_MS 60000 Returned in config response for client flush interval
TELEMETRY_CLIENT_MAX_QUEUE 200 Returned in config response for client max queue size
File Repo Purpose
services/platform-service/src/modules/telemetry/ common-plat Telemetry module (types, repo, routes — 14 endpoints)
services/platform-service/src/modules/telemetry/telemetry.test.ts common-plat Telemetry unit tests (624 tests total)
services/platform-service/src/modules/flags/ common-plat Feature flags (reused for segment % rollout)
services/platform-service/src/modules/audit/ common-plat Audit log module (telemetry actions logged)
scripts/cosmos-telemetry-indexes.sh common-plat Cosmos DB indexing policy for telemetry
admin-dashboard-web/src/app/(dashboard)/ops/client-logs/ lysnrai Admin log viewer + clusters + geo + metrics
admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/ lysnrai Policy manager UI + live preview
admin-dashboard-web/src/app/api/telemetry/ lysnrai API proxy routes (events, clusters, metrics, geo, policies)
admin-dashboard-web/src/lib/platform-client.ts lysnrai Platform-service client (telemetry functions)
mobile_app/ios/LysnrKeyboard/KeyboardViewController.swift lysnrai iOS keyboard (first telemetry client)
mobile_app/android/.../LysnrInputMethodService.kt lysnrai Android keyboard (Phase 2)
src/telemetry/ lysnrai Python desktop telemetry client (Phase 2)