learning_ai_common_plat/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md

41 KiB
Raw Blame History

Client Telemetry & Log Insights — Detailed Design

Audience: Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos. Scope: Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails. Status: Design — not yet implemented. Last updated: 2026-02-17


Table of Contents

  1. Problem Statement
  2. Goals & Non-Goals
  3. Architecture Overview
  4. Telemetry Event Schema (Canonical)
  5. Segment-Based Collection Control
  6. Ingestion API Contract
  7. Storage & Partitioning
  8. Error Clustering (Derived)
  9. Admin / DevOps UI
  10. Client SDK Integration
  11. Privacy & Security
  12. Rollout Plan
  13. Open Questions

1. Problem Statement

When a user reports "keyboard voice dictation doesn't type into Messages on iPhone 17 Pro," we currently have zero server-side visibility into what happened on that device. We cannot see:

  • Did recognition start? Which backend (Azure / local)?
  • Did recognition produce results?
  • Did insertText succeed or no-op?
  • What error code/domain terminated the session?
  • What app version / build / OS / permissions state was active?

We need a lightweight, always-on (but controllable) telemetry pipeline that:

  1. Collects structured diagnostic events from all client platforms.
  2. Correlates events by user, device, platform, version, and session.
  3. Surfaces insights in the admin dashboard for debugging and release health.
  4. Can be turned on/off per segment (user, platform, region, version, etc.).

2. Goals & Non-Goals

Goals

  • G1: Unified event schema across iOS, Android, Desktop, Web.
  • G2: Per-user, per-platform, per-version, per-region segment targeting for collection.
  • G3: Admin UI with drill-down from cluster → user → session → event.
  • G4: Privacy-first: no raw dictation text, no PII in payloads.
  • G5: Low overhead: async batched sends, client-side sampling for noisy events.
  • G6: Leverage existing infrastructure (platform-service, Cosmos DB, feature flags).

Non-Goals

  • Real-time streaming dashboards (v1 uses polling/refresh).
  • Full APM / distributed tracing replacement (use Azure Monitor for that).
  • Client-side crash reporting (use native crash reporters — Crashlytics, Sentry).

3. Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                        Client Platforms                          │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐  │
│  │ iOS App │ │ iOS Kbd │ │ Android │ │ Desktop │ │ Web Apps │  │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬─────┘  │
│       │           │           │           │           │         │
│       ▼           ▼           ▼           ▼           ▼         │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │        Client Telemetry SDK (per-platform thin layer)    │   │
│  │  • Collects events → batches → POST /api/telemetry      │   │
│  │  • Checks collection policy via feature flag poll        │   │
│  │  • Samples debug events, never samples error/fatal       │   │
│  └──────────────────────────────────┬───────────────────────┘   │
└─────────────────────────────────────┼───────────────────────────┘
                                      │ HTTPS
                                      ▼
┌──────────────────────────────────────────────────────────────────┐
│                     platform-service (:4003)                     │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  POST /api/telemetry/events    (batch ingest)            │   │
│  │  GET  /api/telemetry/query     (admin read)              │   │
│  │  GET  /api/telemetry/clusters  (aggregated error view)   │   │
│  │  GET  /api/telemetry/config    (collection policy)       │   │
│  └──────────────────────────────────┬───────────────────────┘   │
│                                     │                            │
│  ┌──────────────────────────────────▼───────────────────────┐   │
│  │  Cosmos DB                                               │   │
│  │  • telemetry_events (raw, TTL 3060d)                    │   │
│  │  • telemetry_error_clusters (derived, TTL 90180d)       │   │
│  │  • telemetry_collection_policies (segment rules)         │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  Existing modules used:                                          │
│  • feature_flags — segment evaluation (FNV-1a hash)              │
│  • auth — JWT validation for authenticated events                │
│  • rate-limit — per-user/install throttling                      │
└──────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────┐
│              admin-dashboard-web (:3001)                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Ops → Client Logs                                       │   │
│  │  • Live event stream (recent errors)                     │   │
│  │  • Error cluster view (top failures by platform/build)   │   │
│  │  • User timeline (all events for one user)               │   │
│  │  • Collection policy manager (segment targeting UI)      │   │
│  └──────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘

4. Telemetry Event Schema (Canonical)

Every client event MUST conform to this schema. Fields marked REQUIRED must always be present.

4.1 Identity Fields

Field Type Required Description
id string (uuid) REQUIRED Unique event ID, generated client-side
productId string REQUIRED Product identifier (e.g. "lysnrai")
userId string? Conditional Present when user is authenticated
anonymousInstallId string? Conditional Stable per-install UUID when userId absent
sessionId string REQUIRED App/keyboard session correlation ID
requestId string? Optional Cross-service correlation (x-request-id)

Rule: At least one of userId or anonymousInstallId MUST be present.

4.2 Source Classification Fields

Field Type Required Description
platform enum REQUIRED "ios" | "android" | "web" | "desktop"
channel enum REQUIRED "mobile_app" | "keyboard_extension" | "web_app" | "desktop_app" | "backend_service"
osFamily enum REQUIRED "ios" | "android" | "macos" | "windows" | "linux" | "chromeos" | "other"
osVersion string? Recommended e.g. "iOS 18.2", "Windows 11 24H2", "macOS 15.3", "Ubuntu 24.04"
deviceModel string? Optional e.g. "iPhone17,3", "Pixel 9", "MacBookPro18,3"
locale string? Optional BCP 47 locale, e.g. "en-US", "ta-IN"
timezone string? Optional IANA timezone, e.g. "America/Los_Angeles", "Asia/Kolkata"
countryCode string? Optional ISO 3166-1 alpha-2, e.g. "US", "IN" — derived from locale or IP server-side
regionCode string? Optional e.g. "WA", "TN" — derived server-side from IP geo

4.3 App Release Fields

Field Type Required Description
appVersion string REQUIRED Semantic version: CFBundleShortVersionString / versionName / npm version
buildNumber string REQUIRED CFBundleVersion / versionCode / web release commit hash
releaseChannel enum REQUIRED "dev" | "beta" | "prod"

4.4 Event Semantics Fields

Field Type Required Description
eventType enum REQUIRED "debug" | "info" | "warn" | "error" | "fatal"
module string REQUIRED Logical module: "keyboard_dictation", "auth", "sync", "settings", "onboarding"
feature string? Optional Sub-feature: "voice_typing", "settings_deeplink", "azure_recognition"
eventName string REQUIRED Snake_case event: "mic_tapped", "recognition_failed", "insert_noop", "session_started"

4.5 Error & Diagnostics Fields

Field Type Required Description
errorDomain string? On error iOS NSError domain, Android exception class, JS Error name
errorCode string? On error Normalized string code
message string? On error Sanitized, max 512 chars — NEVER raw user content
stackTrace string? Optional Redacted/capped at 8KB — only for fatal events
fingerprint string? Optional Client-side hash of (module + eventName + errorCode + errorDomain)

4.6 Structured Metadata (Extensible)

Field Type Required Description
tags Record<string, string>? Optional Small indexed key-value pairs (max 20 keys, 128 chars each)
metrics Record<string, number>? Optional Numeric measurements: durations, counters, sizes
context Record<string, unknown>? Optional Schema-validated safe object, max 4KB serialized

4.7 Module-Specific: Keyboard Dictation

When module = "keyboard_dictation", clients SHOULD include a structured dictation object:

Field Type Description
dictation.backend "azure" | "local" | "none" Which recognition backend was active
dictation.hasFullAccess boolean Keyboard Full Access toggle state
dictation.micPermission "granted" | "denied" | "undetermined" Microphone permission
dictation.speechPermission "authorized" | "denied" | "restricted" | "notDetermined" Speech recognition permission
dictation.recognitionStarted boolean Did recognition engine actually start?
dictation.finalResultReceived boolean Did at least one final result arrive?
dictation.insertAttempted boolean Did insertText / commitText get called?
dictation.insertNoOpDetected boolean Did retry logic detect a no-op insert?
dictation.transcriptLength number Character count only — NEVER raw text
dictation.sessionDurationMs number Time from mic tap to stop
dictation.hostApp string? Bundle ID of host app if available (e.g. "com.apple.MobileSMS")

4.8 Timing Fields

Field Type Required Description
occurredAt string (ISO 8601) REQUIRED Client-side timestamp
receivedAt string (ISO 8601) Server-set Set by ingestion endpoint
ttlAt string? (ISO 8601) Server-set Cosmos TTL expiry marker

5. Segment-Based Collection Control

5.1 Motivation

Telemetry should not be a firehose. We need granular control to:

  • Debug a specific user: Turn on verbose logging for user usr_abc123 only.
  • Target a platform: Collect keyboard dictation events only from iOS.
  • Target a region: Enable collection for users in US:WA (Seattle area) or IN:TN (Chennai area).
  • Target a version: Collect from users on build < 26 (old builds with known bug).
  • Target an OS: Only Linux desktop users.
  • Global kill switch: Disable all collection instantly.

5.2 Collection Policy Document Schema

Stored in Cosmos container telemetry_collection_policies:

interface TelemetryCollectionPolicy {
  id: string; // uuid
  productId: string; // REQUIRED

  // Identity
  name: string; // human-readable: "Debug iOS keyboard for user X"
  description: string;
  enabled: boolean; // master toggle
  priority: number; // higher = evaluated first (for conflicts)

  // What to collect
  eventTypes: ('debug' | 'info' | 'warn' | 'error' | 'fatal')[];
  modules: string[]; // empty = all modules
  samplingRate: number; // 0.01.0 (1.0 = collect everything matching)

  // Segment targeting rules (ALL conditions must match = AND logic)
  targeting: {
    // User targeting
    userIds?: string[]; // specific user IDs
    anonymousInstallIds?: string[]; // specific install IDs

    // Platform targeting
    platforms?: ('ios' | 'android' | 'web' | 'desktop')[];
    channels?: (
      | 'mobile_app'
      | 'keyboard_extension'
      | 'web_app'
      | 'desktop_app'
      | 'backend_service'
    )[];
    osFamilies?: ('ios' | 'android' | 'macos' | 'windows' | 'linux' | 'chromeos')[];

    // Version targeting
    appVersions?: string[]; // exact match list: ["1.0.0", "1.1.0"]
    appVersionRange?: {
      // semver range
      min?: string; // inclusive
      max?: string; // inclusive
    };
    buildNumbers?: string[]; // exact match list: ["25", "26"]
    buildNumberRange?: {
      min?: number; // inclusive
      max?: number; // inclusive
    };

    // Region targeting (derived from client locale/timezone or server-side IP geo)
    countryCodes?: string[]; // ISO 3166-1 alpha-2: ["US", "IN"]
    regionCodes?: string[]; // sub-national: ["US:WA", "IN:TN", "IN:KA"]

    // Release channel targeting
    releaseChannels?: ('dev' | 'beta' | 'prod')[];

    // Percentage rollout (uses existing FNV-1a hash from feature flags)
    percentage?: number; // 0100, deterministic per userId/installId
  };

  // Lifecycle
  startsAt?: string; // ISO — policy activates at this time
  expiresAt?: string; // ISO — policy auto-deactivates
  createdAt: string;
  updatedAt: string;
  createdBy: string; // admin userId who created it
}

5.3 Policy Evaluation Logic (Client-Side)

Clients poll GET /api/telemetry/config periodically (every 5 min or on app foreground). The server evaluates all active policies against the client's context and returns a merged collection config:

// Response from GET /api/telemetry/config?platform=ios&channel=keyboard_extension&...
interface TelemetryCollectionConfig {
  enabled: boolean; // global kill switch
  eventTypes: string[]; // which event types to collect
  modules: string[]; // which modules (empty = all)
  samplingRates: {
    // per event type
    debug: number; // 0.01.0
    info: number;
    warn: number;
    error: number;
    fatal: number;
  };
  batchSize: number; // max events per POST
  flushIntervalMs: number; // how often to flush batch
  maxQueueSize: number; // drop oldest if exceeded
}

5.4 Evaluation Rules

  1. Global default: If no policies match, use a hardcoded default:

    • Collect warn, error, fatal only
    • Sample warn at 50%, error/fatal at 100%
    • Flush every 60s, batch of 20
  2. Policy matching: A policy matches if ALL non-null targeting conditions are met (AND logic).

  3. Policy merge (multiple matches): Highest-priority policy wins for each field. Exception: eventTypes are unioned (if any policy enables debug, it's enabled).

  4. Percentage rollout: Uses the same FNV-1a hash from the existing feature flags module:

    hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage;
    
  5. Time bounds: startsAt/expiresAt are checked server-side before including in response.

5.5 Example Policies

A) Debug one user's iOS keyboard

{
  "name": "Debug user sd9235 iOS keyboard",
  "enabled": true,
  "priority": 100,
  "eventTypes": ["debug", "info", "warn", "error", "fatal"],
  "modules": ["keyboard_dictation"],
  "samplingRate": 1.0,
  "targeting": {
    "userIds": ["usr_sd9235"],
    "platforms": ["ios"],
    "channels": ["keyboard_extension"]
  },
  "expiresAt": "2026-02-20T00:00:00Z"
}

B) All iOS users on old builds

{
  "name": "Collect errors from iOS builds < 26",
  "enabled": true,
  "priority": 50,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["ios"],
    "buildNumberRange": { "min": 1, "max": 25 }
  }
}

C) Seattle-area users only

{
  "name": "Seattle region telemetry",
  "enabled": true,
  "priority": 60,
  "eventTypes": ["info", "warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 0.5,
  "targeting": {
    "regionCodes": ["US:WA"]
  }
}

D) Only Linux desktop

{
  "name": "Linux desktop diagnostics",
  "enabled": true,
  "priority": 50,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["desktop"],
    "osFamilies": ["linux"]
  }
}

E) 10% of all web users (canary)

{
  "name": "Web telemetry canary rollout",
  "enabled": true,
  "priority": 30,
  "eventTypes": ["warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["web"],
    "percentage": 10
  }
}

F) Chennai, India — mobile app only

{
  "name": "Chennai mobile diagnostics",
  "enabled": true,
  "priority": 60,
  "eventTypes": ["info", "warn", "error", "fatal"],
  "modules": [],
  "samplingRate": 1.0,
  "targeting": {
    "platforms": ["ios", "android"],
    "channels": ["mobile_app"],
    "regionCodes": ["IN:TN"]
  }
}

G) Global kill switch (disable all collection)

{
  "name": "GLOBAL OFF",
  "enabled": true,
  "priority": 999,
  "eventTypes": [],
  "modules": [],
  "samplingRate": 0.0,
  "targeting": {}
}

6. Ingestion API Contract

6.1 POST /api/telemetry/events — Batch Ingest

Auth: JWT (authenticated users) or API key header (anonymous installs).

Request:

// Zod schema
const TelemetryIngestRequest = z.object({
  productId: z.string().min(1),
  events: z.array(TelemetryEventSchema).min(1).max(50),
  clientClockSkewMs: z.number().optional(),
});

Response (200):

interface TelemetryIngestResponse {
  accepted: number;
  rejected: number;
  errors?: Array<{ index: number; reason: string }>;
  serverTime: string;
}

Rate limits:

  • Authenticated: 100 requests/min per userId
  • Anonymous: 30 requests/min per anonymousInstallId
  • Payload: max 256KB per request

Validation rules:

  1. productId must match a known, active product.
  2. Each event must have id, productId, platform, channel, osFamily, appVersion, buildNumber, releaseChannel, eventType, module, eventName, sessionId, occurredAt.
  3. At least one of userId or anonymousInstallId.
  4. message capped at 512 chars, stackTrace at 8KB, tags max 20 keys.
  5. PII regex rejection: reject events containing patterns matching email, phone, credit card.
  6. No raw dictation text allowed in any field.

6.2 GET /api/telemetry/config — Collection Config (Client Poll)

Auth: JWT or API key.

Query params:

Param Type Description
platform string Client platform
channel string Client channel
osFamily string OS family
appVersion string Current app version
buildNumber string Current build number
releaseChannel string dev/beta/prod
countryCode string? Client-reported country
regionCode string? Client-reported region

Response: TelemetryCollectionConfig (see §5.3).

Cache: Client should cache this for 5 minutes. Server sets Cache-Control: max-age=300.

6.3 GET /api/telemetry/query — Admin Query (Read)

Auth: Admin JWT only.

Query params:

Param Type Description
userId string? Filter by user
anonymousInstallId string? Filter by install
platform string? Filter by platform
channel string? Filter by channel
osFamily string? Filter by OS family
appVersion string? Filter by version
buildNumber string? Filter by build
module string? Filter by module
eventName string? Filter by event name
eventType string? Filter by severity
from string (ISO) Start time
to string (ISO) End time
limit number Max results (default 50, max 200)
continuationToken string? Pagination

Response:

interface TelemetryQueryResponse {
  events: TelemetryEvent[];
  total: number;
  continuationToken?: string;
}

6.4 GET /api/telemetry/clusters — Error Clusters (Admin)

Auth: Admin JWT only.

Query params: Same filters as query, plus groupBy (default: fingerprint).

Response:

interface TelemetryClusterResponse {
  clusters: TelemetryErrorCluster[];
  total: number;
}

6.5 Collection Policy CRUD (Admin)

Method Path Description
GET /api/telemetry/policies List all policies
POST /api/telemetry/policies Create policy
PUT /api/telemetry/policies/:id Update policy
DELETE /api/telemetry/policies/:id Delete policy

7. Storage & Partitioning

7.1 Cosmos Containers

telemetry_events (raw events)

Property Value
Partition key /pk where pk = ${productId}:${yyyyMM}:${platform}
TTL 3060 days (configurable via env TELEMETRY_EVENT_TTL_DAYS)
RU budget Start at 400 RU/s autoscale, monitor and adjust

Rationale: Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load.

Composite indexes:

[
  { "path": "/eventType", "order": "ascending" },
  { "path": "/occurredAt", "order": "descending" }
]

Additional indexed paths: /userId, /anonymousInstallId, /module, /eventName, /appVersion, /buildNumber, /channel, /osFamily.

telemetry_error_clusters (aggregated)

Property Value
Partition key /pk where pk = ${productId}:${platform}:${module}
TTL 90180 days
RU budget 200 RU/s autoscale

telemetry_collection_policies (segment rules)

Property Value
Partition key /productId
TTL None (manual lifecycle)
RU budget Minimal (low volume)

7.2 Container Registration

Add to registerContainers() call in platform-service src/lib/cosmos.ts:

registerContainers([
  // ... existing containers ...
  { id: 'telemetry_events', partitionKeyPath: '/pk' },
  { id: 'telemetry_error_clusters', partitionKeyPath: '/pk' },
  { id: 'telemetry_collection_policies', partitionKeyPath: '/productId' },
]);

8. Error Clustering (Derived)

8.1 Fingerprint Generation

Client-side (optional) and server-side (authoritative):

function generateFingerprint(event: TelemetryEvent): string {
  const input = [
    event.platform,
    event.channel,
    event.module,
    event.eventName,
    event.errorDomain ?? '',
    event.errorCode ?? '',
    normalizeMessage(event.message ?? ''),
  ].join(':');
  return sha256(input).substring(0, 16); // 16-char hex
}

function normalizeMessage(msg: string): string {
  // Strip numbers, UUIDs, paths, timestamps
  return msg
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<UUID>')
    .replace(/\d+/g, '<N>')
    .replace(/\/[\w/.]+/g, '<PATH>')
    .toLowerCase()
    .trim();
}

8.2 Cluster Document

interface TelemetryErrorCluster {
  id: string; // fingerprint + time window key
  pk: string; // ${productId}:${platform}:${module}
  productId: string;
  fingerprint: string;

  // Dimensions
  platform: string;
  channel: string;
  osFamily: string;
  module: string;
  eventName: string;
  appVersion: string;
  buildNumber: string;

  // Aggregates
  firstSeenAt: string;
  lastSeenAt: string;
  totalCount: number;
  affectedUserIds: string[]; // capped at 100
  affectedInstallIds: string[]; // capped at 100

  // Representative sample
  sampleErrorDomain?: string;
  sampleErrorCode?: string;
  sampleMessage?: string;
  severity: 'warn' | 'error' | 'fatal';
}

8.3 Cluster Update Strategy

On each ingested error/fatal event:

  1. Compute fingerprint.
  2. Upsert cluster doc: increment totalCount, update lastSeenAt, append to affectedUserIds (dedup, cap at 100).
  3. Run as a lightweight post-ingest step (same request, not a separate job — keeps it simple for v1).

9. Admin / DevOps UI

9.1 Page: Ops → Client Logs

Located at admin-dashboard-web/src/app/(dashboard)/ops/client-logs/page.tsx.

Filter Bar

Filter Type Default
User ID text input
Platform multi-select: ios, android, web, desktop all
Channel multi-select all
OS Family multi-select all
App Version text/select
Build Number text/select
Module select all
Event Type multi-select error, fatal
Time Range date range picker last 24h

Views

  1. Error Clusters (default): Table of top clusters sorted by totalCount desc.

    • Columns: fingerprint, module, eventName, platform, build, count, affected users, last seen.
    • Click → drill into cluster detail (sample events, user list).
  2. Event Stream: Chronological list of raw events matching filters.

    • Columns: time, user, platform, channel, build, module, eventName, eventType, message.
    • Click → full event detail (JSON + dictation struct if present).
  3. User Timeline: Enter a userId → see all events chronologically.

    • Useful for "what happened to user X's keyboard session."

9.2 Page: Ops → Telemetry Policies

Located at admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.tsx.

  • CRUD for collection policies.
  • Visual segment builder (dropdowns for platform, OS, version range, region, etc.).
  • Priority ordering (drag/drop or numeric).
  • Enable/disable toggle per policy.
  • "Preview" button: show how many matching users/installs (based on recent telemetry).

10. Client SDK Integration

10.1 iOS (Swift) — App + Keyboard Extension

// Shared via App Group (group.com.bytelyst.LysnrAI)
class LysnrTelemetry {
    static let shared = LysnrTelemetry()

    func track(
        eventType: EventType,
        module: String,
        eventName: String,
        message: String? = nil,
        errorCode: String? = nil,
        errorDomain: String? = nil,
        dictation: DictationContext? = nil,
        tags: [String: String]? = nil,
        metrics: [String: Double]? = nil
    )

    func flush()          // force-send queued events
    func refreshConfig()  // poll collection policy
}
  • Keyboard extension: posts events to App Group shared UserDefaults queue; main app flushes.
  • Alternatively, keyboard extension sends directly if Full Access is enabled (network available).

10.2 Android (Kotlin)

object LysnrTelemetry {
    fun track(
        eventType: EventType,
        module: String,
        eventName: String,
        message: String? = null,
        errorCode: String? = null,
        dictation: DictationContext? = null,
    )
    fun flush()
    fun refreshConfig()
}

10.3 Desktop (Python)

from lysnrai.telemetry import telemetry

telemetry.track(
    event_type="error",
    module="speech_recognition",
    event_name="azure_timeout",
    message="Recognition timed out after 30s",
    tags={"backend": "azure"},
    metrics={"duration_ms": 30000},
)

10.4 Web (TypeScript)

import { telemetry } from '@/lib/telemetry';

telemetry.track({
  eventType: 'error',
  module: 'auth',
  eventName: 'token_refresh_failed',
  errorCode: '401',
  message: 'JWT expired and refresh failed',
});

11. Privacy & Security

11.1 Hard Rules

  1. NEVER send raw dictated/transcribed text in any field.
  2. NEVER send passwords, tokens, API keys, or PII (email, phone, SSN).
  3. message field: sanitized, max 512 chars, no user content.
  4. stackTrace: redacted file paths, max 8KB, only on fatal.
  5. Server-side PII regex scanner rejects events containing detected PII patterns.
  6. countryCode / regionCode: derived from IP geo server-side (never GPS coordinates).

11.2 Data Retention

Container Default TTL Configurable
telemetry_events 30 days TELEMETRY_EVENT_TTL_DAYS
telemetry_error_clusters 90 days TELEMETRY_CLUSTER_TTL_DAYS
telemetry_collection_policies No TTL Manual delete / expiresAt

11.3 Access Control

  • Ingest (POST /api/telemetry/events): Any authenticated user or valid install API key.
  • Read (GET /api/telemetry/query, /clusters): Admin JWT only.
  • Policy management: Admin JWT only.
  • No public endpoints. Telemetry data is internal/operational only.

11.4 Rate Limiting

Client Type Limit
Authenticated user 100 req/min
Anonymous install 30 req/min
Admin query 60 req/min

12. Rollout Plan

Phase 1 — MVP (12 weeks)

Goal: iOS keyboard dictation debugging visible in admin dashboard.

Component Scope
platform-service telemetry module: types.ts, repository.ts, routes.ts (ingest + query)
platform-service Collection policy CRUD + config endpoint
iOS keyboard LysnrTelemetry client in KeyboardViewController — keyboard_dictation events
admin-dashboard Ops → Client Logs page with basic event stream + filters
Cosmos Register 3 containers

Delivers: When a user reports "keyboard not typing," admin can look up their userId, see exact error flow, permissions state, backend choice, and insertion outcome.

Phase 2 — Full Platform Coverage (23 weeks)

Component Scope
iOS app Telemetry for auth, settings, onboarding modules
Android app + keyboard Full telemetry parity with iOS
Desktop (Python) Telemetry for speech recognition, hotkey, paste modules
admin-dashboard Error cluster view, user timeline view
platform-service Cluster aggregation on ingest

Phase 3 — Advanced (34 weeks)

Component Scope
Web dashboards Telemetry for auth, API errors, page load
admin-dashboard Telemetry policy builder UI, version comparison view
platform-service Alerting rules (error spike → Slack/email)
All clients Region/geo enrichment server-side

13. Open Questions

# Question Status
1 Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? Recommend: Direct when Full Access on, App Group queue as fallback
2 Do we need a separate Cosmos database for telemetry to isolate RU costs? Recommend: Same database, separate containers (simpler), revisit if RU contention appears
3 Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? Defer to Phase 3
4 Max retention for raw events? Compliance requirements? Default 30 days, configurable
5 Do we need GDPR right-to-erasure support for telemetry? Yes — add DELETE /api/telemetry/user/:userId endpoint

Appendix A: Env Vars

Var Default Description
TELEMETRY_ENABLED true Global server-side kill switch
TELEMETRY_EVENT_TTL_DAYS 30 Raw event retention
TELEMETRY_CLUSTER_TTL_DAYS 90 Cluster retention
TELEMETRY_MAX_BATCH_SIZE 50 Max events per ingest request
TELEMETRY_MAX_PAYLOAD_BYTES 262144 256KB max request body
TELEMETRY_PII_SCAN_ENABLED true Server-side PII rejection
File Repo Purpose
services/platform-service/src/modules/telemetry/ common-plat Telemetry module (types, repo, routes)
services/platform-service/src/modules/flags/ common-plat Feature flags (reused for segment % rollout)
admin-dashboard-web/src/app/(dashboard)/ops/client-logs/ lysnrai Admin log viewer
admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/ lysnrai Policy manager UI
mobile_app/ios/LysnrKeyboard/KeyboardViewController.swift lysnrai iOS keyboard (first telemetry client)
mobile_app/android/.../LysnrInputMethodService.kt lysnrai Android keyboard (Phase 2)
src/telemetry/ lysnrai Python desktop telemetry client (Phase 2)