learning_ai_common_plat/docs/roadmaps/completed/telemetry_IMPLEMENTATION_ROADMAP.md

Client Telemetry — Implementation Roadmap

Status: ALL PHASES COMPLETE
Last updated: 2026-03-02
Design doc: CLIENT_TELEMETRY_DESIGN.md
Repos: learning_ai_common_plat (platform-service) · learning_voice_ai_agent (all clients + dashboards)


Phase 0 — Design & Review

  • Write comprehensive telemetry design doc — schema, APIs, admin UX, privacy guardrails (c59049e)
  • Systematic review: identify and fix 18 bugs/gaps in the design doc (083cf02)
    • TTL format (ISO → seconds), regionCode prefix format, missing pk field
    • Auth model for keyboard extension (X-Install-Token)
    • Config endpoint query params (userId/anonymousInstallId)
    • Error clustering made version-agnostic (affectedVersions array)
    • GDPR erasure endpoint added
    • iOS offline queue strategy (App Group UserDefaults, FIFO eviction)
    • Global defaults for batchSize/flushInterval/maxQueueSize
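The TTL fix listed above (ISO → seconds) matches how Cosmos DB expresses per-item time-to-live: an integer number of seconds. The following is a minimal Python sketch of that conversion, not the platform-service code. `ttl_seconds` is a hypothetical helper, and the 90-day fallback mirrors the `TELEMETRY_EVENT_TTL_DAYS` default mentioned later in this roadmap.

```python
import os

SECONDS_PER_DAY = 86_400


def ttl_seconds(env=os.environ) -> int:
    """Return the event TTL in seconds, the unit Cosmos DB uses for `ttl`.

    Hypothetical helper: reads TELEMETRY_EVENT_TTL_DAYS and falls back to
    90 days, the default this roadmap lists for that variable.
    """
    days = int(env.get("TELEMETRY_EVENT_TTL_DAYS", "90"))
    if days <= 0:
        raise ValueError("TELEMETRY_EVENT_TTL_DAYS must be positive")
    return days * SECONDS_PER_DAY
```

With the default, `ttl_seconds({})` evaluates to 7776000 (90 × 86400).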

Phase 1 — MVP (iOS Keyboard + Backend + Admin UI)

Platform-Service Telemetry Module

  • types.ts — Zod schemas for events, policies, clusters, queries (ce4c4ff)
  • repository.ts — Cosmos DB CRUD for events, policies, clusters (ce4c4ff)
  • routes.ts — Fastify routes: ingestion, config, admin query, clusters, policy CRUD, GDPR erasure (ce4c4ff)
  • telemetry.test.ts — 34 Vitest tests for schemas + policy evaluation (ce4c4ff)
  • Register telemetry routes in server.ts (ce4c4ff)
  • Add Cosmos containers (telemetry_events, telemetry_error_clusters, telemetry_collection_policies) to cosmos-init.ts (ce4c4ff)

iOS Keyboard Telemetry Client

  • LysnrTelemetry.swift — Singleton client with App Group offline queue, X-Install-Token auth, 200-event cap (e546475)
  • Instrument KeyboardViewController.swift — 10+ telemetry points (e546475)
    • session_started / session_ended (with full DictationContext)
    • backend_selected (azure / local + reason)
    • recognition_started / recognition_failed
    • mic_permission_denied
    • insert_noop detection
    • error_recovery_attempted (local→azure, azure→local)
    • Session summary metrics (duration, segments, words, transcript length)

Admin Dashboard — Client Logs Page

  • /ops/client-logs/page.tsx — Events table + Error Clusters tab (d202f94)
    • Stat cards (total events, errors, warnings, keyboard events)
    • Filters (platform, channel, level, module, free-text search)
    • Expandable event detail rows (device, tags, metrics, dictation context)
    • Error Clusters tab with severity, affected versions, user count
  • /api/telemetry/route.ts — API route proxying to platform-service (d202f94)
  • platform-client.ts — queryTelemetryEvents + queryTelemetryClusters (d202f94)
  • sidebar-nav.tsx — "Client Logs" nav item with FileText icon (d202f94)

Phase 2 — Full Platform Coverage

iOS Main App

  • TelemetryService.swift — Main app telemetry service with App Group queue drain on foreground (a173baa)
  • LysnrAIApp.swift — scenePhase integration for activate/deactivate lifecycle (a173baa)
    • app_foregrounded / app_backgrounded events
    • Keyboard queue flush on every foreground transition
    • 60-second periodic flush timer

Desktop App (Python)

  • platform_telemetry.py — PlatformTelemetry singleton with urllib.request POST, threaded flush timer, persistent install_id in ~/.LysnrAI/install_id (a173baa)
  • main.py instrumentation (a173baa)
    • app_started / app_stopped lifecycle events
    • dictation_started (with backend tag)
    • dictation_completed (with duration_ms, word_count, transcript_length metrics)
    • mic_permission_denied / recording_start_failed error events
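The persistent install ID described above can be sketched as a read-or-create helper. This is an illustration under assumptions rather than the contents of platform_telemetry.py: the function name `load_or_create_install_id` is hypothetical, and storing the ID as a dashed UUID string is an assumption.

```python
import uuid
from pathlib import Path


def load_or_create_install_id(path: Path) -> str:
    """Return the persisted install ID, creating the file on first run."""
    if path.exists():
        existing = path.read_text(encoding="utf-8").strip()
        if existing:
            return existing
    # First run (or empty file): create the parent directory and a new ID.
    path.parent.mkdir(parents=True, exist_ok=True)
    install_id = str(uuid.uuid4())  # dashed form, 36 characters
    path.write_text(install_id, encoding="utf-8")
    return install_id
```

A caller would pass the path from this roadmap, for example `load_or_create_install_id(Path.home() / ".LysnrAI" / "install_id")`.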

Web User Dashboard

  • telemetry.ts — Browser client with sendBeacon, localStorage install ID, auto-flush on visibility change (130e1d6)
  • /api/telemetry/ingest/route.ts — Server-side proxy to platform-service (130e1d6)
  • providers.tsx — initTelemetry() called on app mount (130e1d6)

Tracker Dashboard

  • telemetry.ts — Browser client (same pattern as user dashboard) (a102609)
  • /api/telemetry/ingest/route.ts — Server-side proxy to platform-service (a102609)
  • providers.tsx — initTelemetry() called on app mount (a102609)

Admin Dashboard Self-Telemetry

  • telemetry.ts — Browser client tracking admin page views, filter usage, policy changes (a102609)
  • /api/telemetry/admin-ingest/route.ts — Separate proxy from admin query route (a102609)
  • providers.tsx — initTelemetry() called on app mount (a102609)

Android

  • TelemetryClient.kt — Kotlin singleton with OkHttp POST, SharedPreferences offline queue, persistent install ID (9196f48)
  • Instrument LysnrInputMethodService.kt — 10 telemetry points (9196f48)
    • session_started / session_ended (with words_inserted metric)
    • dictation_started (with backend + reason tags)
    • dictation_completed (with duration_ms, word_count, segment_count, transcript_length)
    • mic_permission_denied
    • recognition_failed (with errorCode + errorDomain)
    • error_recovery_attempted (azure→local fallback)
  • Offline queue using SharedPreferences with FIFO eviction (9196f48)
  • Flush on app foreground via ProcessLifecycleOwner + 60s periodic flush timer (9196f48)
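The offline queues on iOS and Android both use FIFO eviction: once the queue is full, the oldest event is dropped to make room for the newest. A language-neutral sketch of that behaviour in Python follows. The class name `OfflineQueue` is hypothetical, the 200-event default is the cap stated for the iOS keyboard client (this roadmap does not state the Android cap), and real clients persist the queue to UserDefaults or SharedPreferences rather than holding it in memory.

```python
from collections import deque


class OfflineQueue:
    """Bounded FIFO queue: when full, the oldest event is evicted first."""

    def __init__(self, max_size: int = 200) -> None:
        # deque(maxlen=...) silently drops items from the left when full.
        self._events: deque = deque(maxlen=max_size)

    def enqueue(self, event: dict) -> None:
        self._events.append(event)

    def drain(self) -> list:
        """Return all queued events in arrival order and empty the queue."""
        batch = list(self._events)
        self._events.clear()
        return batch

    def __len__(self) -> int:
        return len(self._events)
```

Draining on foreground then amounts to calling `drain()` and posting the returned batch.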

Phase 3 — Intelligence & Admin Tooling

Error Clustering & Alerting

  • Automated error fingerprinting (hash of platform + channel + module + eventName + errorDomain + errorCode) — Phase 1 (ce4c4ff)
  • Cluster severity escalation (warn → error → fatal based on count + affected users) — Phase 1 (ce4c4ff)
  • Webhook alerting when cluster severity escalates (Slack-compatible, env TELEMETRY_ALERT_WEBHOOK_URL) (056f323)
  • Dashboard: cluster timeline chart (Recharts stacked bar, last 14 days, severity breakdown) (dc49073)
  • Dashboard: "Resolve" / "Ignore" / "Reopen" actions on clusters (6d7b1d3)
  • Cluster status field (open/resolved/ignored) + PATCH /telemetry/clusters/:id endpoint (056f323)
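The fingerprint described above hashes six event fields and deliberately leaves out the app version, which is what lets one cluster span several versions. A minimal Python sketch, assuming SHA-256 and a `|` separator (the roadmap names the fields but not the hash function or the joining scheme):

```python
import hashlib

# Field list taken from the roadmap; order and separator are assumptions.
FINGERPRINT_FIELDS = (
    "platform", "channel", "module", "eventName", "errorDomain", "errorCode",
)


def generate_fingerprint(event: dict) -> str:
    """Hash the identifying fields of an error event into a cluster key."""
    parts = [str(event.get(field, "")) for field in FINGERPRINT_FIELDS]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

Two events that differ only in fields outside this list, such as the app version, produce the same fingerprint and therefore land in the same cluster.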

Geo Enrichment

  • Server-side IP → country/region lookup on ingestion (configurable via TELEMETRY_GEO_API_URL, 24h in-memory cache, 2s timeout) (2f61ea5)
  • Populate countryCode + regionCode fields (e.g., US:WA) on events from server-side IP lookup (2f61ea5)
  • Admin UI: geographic distribution chart (horizontal bar chart + country table, Geo tab on client-logs page) (0bfd4bd, 82a25c0)
  • Policy targeting by regionCode/countryCodes ranges (schema already supports it in TelemetryTargetingSchema)
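The 24-hour in-memory cache in front of the geo lookup can be sketched as below. This is an illustration, not the platform-service implementation: `GeoCache` and `region_code` are hypothetical names, the lookup function is injected (it would be the place to apply the 2-second timeout), and the `country:region` join follows the US:WA example above.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24h, as stated in the roadmap


class GeoCache:
    """Cache IP → geo results so repeated IPs skip the external lookup."""

    def __init__(
        self,
        lookup: Callable[[str], Optional[dict]],
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._now = now
        self._entries: dict = {}  # ip -> (expires_at, geo)

    def resolve(self, ip: str) -> Optional[dict]:
        cached = self._entries.get(ip)
        if cached and cached[0] > self._now():
            return cached[1]
        geo = self._lookup(ip)
        if geo is not None:  # failed lookups are not cached
            self._entries[ip] = (self._now() + CACHE_TTL_SECONDS, geo)
        return geo


def region_code(country: str, region: str) -> str:
    """Build the prefixed region code, e.g. US + WA -> US:WA."""
    return f"{country}:{region}"
```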

Collection Policy Builder UI

  • Admin page: /ops/telemetry-policies (c7732c9)
  • CRUD UI for collection policies (name, enabled, targeting rules, sampling rates) (c7732c9)
  • Targeting builder: platform checkboxes, channel badges, release channel selection, percentage slider (c7732c9)
  • Live preview: "N / M clients would match this policy" — POST /telemetry/policies/preview + UI button (61c919a, da9031b)
  • Policy activation/deactivation toggle (c7732c9)
  • Scheduling: startsAt / expiresAt date pickers (c7732c9)
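How a policy with an activation toggle, a schedule window, platform targeting, and a percentage slider might be evaluated for one client is sketched below. Everything here is an assumption for illustration: the field names `platforms` and `percentage`, the helper names, and the hash-based bucketing are not taken from TelemetryTargetingSchema; only `enabled`, `startsAt`, and `expiresAt` appear in this roadmap.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def rollout_bucket(install_id: str) -> int:
    """Map an install ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(install_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def policy_applies(
    policy: dict, platform: str, install_id: str, now: Optional[datetime] = None
) -> bool:
    """Return True if the policy is active and targets this client."""
    now = now or datetime.now(timezone.utc)
    if not policy.get("enabled", False):
        return False
    starts_at, expires_at = policy.get("startsAt"), policy.get("expiresAt")
    if starts_at and now < starts_at:
        return False
    if expires_at and now >= expires_at:
        return False
    platforms = policy.get("platforms")
    if platforms and platform not in platforms:
        return False
    # Percentage rollout: the same install always falls in the same bucket.
    return rollout_bucket(install_id) < policy.get("percentage", 100)
```

Hashing the install ID keeps a client's sampling decision stable across sessions, which is one common way to implement a percentage slider.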

Privacy & Compliance

  • PII regex scanner on ingestion (email, phone, SSN, credit card patterns → reject before storage) — Phase 1 (ce4c4ff)
  • Admin API: GDPR erasure endpoint DELETE /telemetry/user/:userId — Phase 1 (ce4c4ff)
  • Admin UI: GDPR erasure proxy route /api/telemetry/erasure (c7732c9)
  • Retention policy enforcement (TTL-based auto-expiry, TELEMETRY_EVENT_TTL_DAYS env var) — Phase 1 (ce4c4ff)
  • Audit log entries for policy CRUD + GDPR erasure (telemetry.policy.created/updated/deleted, telemetry.gdpr.erasure, telemetry.cluster.resolved/ignored) (056f323)
  • Admin UI: GDPR erasure tab on Client Logs page (6d7b1d3)
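The PII scanner rejects events whose text matches email, phone, SSN, or credit card patterns before anything is stored. The sketch below shows the general shape in Python. The regular expressions are simplified assumptions, not the patterns used by `containsPII` in platform-service, and the loose card pattern would also flag other long digit runs.

```python
import re

# Simplified, US-centric patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    "credit_card": re.compile(r"(?<!\d)(?:\d[ -]?){13,16}(?!\d)"),
}


def contains_pii(text: str) -> bool:
    """Return True if any PII pattern matches anywhere in the text."""
    return any(pattern.search(text) for pattern in PII_PATTERNS.values())
```

An ingestion handler would run this over the free-text fields of each event and reject the event on a match.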

Performance & Scale

  • ETag caching on GET /telemetry/config (If-None-Match → 304, Cache-Control: private, max-age=60) (2fb3410)
  • Server-side rate limiting per installId (100 events/min, in-memory sliding window) (2fb3410)
  • Cosmos DB indexing policy tuning — scripts/cosmos-telemetry-indexes.sh with composite indexes for all 3 containers (056f323)
  • Batch ingestion deduplication by event.id (2fb3410)
  • In-memory ingestion metrics counters + GET /telemetry/metrics admin endpoint (056f323)
  • Admin UI: Metrics tab on Client Logs page (ingested, rejected, PII blocked, rate limited, duplicates) (6d7b1d3)
  • Prometheus OpenMetrics export endpoint GET /telemetry/metrics/prometheus (2f61ea5)
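The per-installId limit above (100 events per minute, in-memory sliding window) can be sketched as follows. This is a minimal Python illustration, not the `checkRateLimit` implementation: the class name is hypothetical, and rejecting a whole batch that would exceed the limit is an assumption.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowRateLimiter:
    """Allow at most `limit` events per install within the trailing window."""

    def __init__(
        self,
        limit: int = 100,
        window_seconds: float = 60.0,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._now = now
        self._hits = defaultdict(deque)  # install_id -> event timestamps

    def allow(self, install_id: str, event_count: int = 1) -> bool:
        now = self._now()
        hits = self._hits[install_id]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self._window:
            hits.popleft()
        if len(hits) + event_count > self._limit:
            return False
        hits.extend([now] * event_count)
        return True
```

Because the state lives in process memory, the limit applies per service instance; that matches the "in-memory" wording above but would need shared storage to hold across replicas.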

Phase 4 — Operational Wiring

This phase bridges "code exists" → "telemetry actually flows." All code-level wiring is complete. Remaining items are deployment/infra tasks (deploying platform-service, Apple Developer portal config, physical device testing).

4.1 — Platform-Service Deployment

  • Add telemetry env vars to .env.example files (TELEMETRY_ENABLED, TELEMETRY_ALERT_WEBHOOK_URL, TELEMETRY_GEO_API_URL, TELEMETRY_EVENT_TTL_DAYS)
  • POST /api/telemetry/events endpoint verified working locally via smoke test script
  • Deploy platform-service to a publicly reachable URL (Azure Container Apps / App Service) — infra task
  • Configure DNS / reverse proxy so clients can reach https://api.lysnrai.com — infra task
  • Run scripts/cosmos-telemetry-indexes.sh against live Cosmos DB — infra task

4.2 — iOS Keyboard Extension Wiring

  • Fix App Group ID mismatch — Platform/Config.swift used group.com.saravana.LysnrAI, but all other files (TelemetryService, LysnrTelemetry, AuthService, KeyboardLogStore, entitlements) use group.com.bytelyst.LysnrAI. Fixed to match.
  • Write platform_service_url to App Group — TelemetryService.writePlatformURLToAppGroup() writes Config.platformServiceURL to App Group UserDefaults so the keyboard extension's LysnrTelemetry.swift can read it at init (line 80)
  • Early URL write in LysnrAIApp.swift init — calls TelemetryService.writePlatformURLToAppGroup() before lazy TelemetryService access, so keyboard gets the URL even on first install
  • Mic permission pre-request already in LysnrAIApp.swift.requestPermissionsForKeyboardExtension() (both AVAudioSession.requestRecordPermission and SFSpeechRecognizer.requestAuthorization)
  • Register App Groups in Apple Developer portal — portal task
  • Test Full Access ON vs OFF paths on physical device — device testing

4.3 — iOS Main App TelemetryService Integration

  • TelemetryService.swift reads Config.platformServiceURL and writes to App Group
  • LysnrAIApp.swift wires scenePhase → TelemetryService.shared.activate() / .deactivate()
  • activate() calls flushKeyboardQueue() on every foreground transition
  • flushKeyboardQueue() reads App Group telemetry_event_queue and POSTs via platformClient.fireAndForget
  • 60-second periodic flush timer via BLTelemetryClient

4.4 — Desktop App Wiring

  • PLATFORM_SERVICE_URL already in .env.example (line 44) and mobile_app/common/env.dev.example (line 41)
  • platform_telemetry.py reads PLATFORM_SERVICE_URL from env or settings and sends via urllib.request
  • Threaded flush timer (60s) + atexit flush for offline→online drain
  • Persistent install_id in ~/.LysnrAI/install_id
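The 60-second threaded flush with a final flush at exit can be sketched like this. It is an illustration under assumptions, not the code in platform_telemetry.py: `PeriodicFlusher` is a hypothetical name, and a daemon thread waiting on a `threading.Event` is one of several ways to build such a timer.

```python
import atexit
import threading
from typing import Callable


class PeriodicFlusher:
    """Call `flush` every `interval_seconds` and once more at interpreter exit."""

    def __init__(self, flush: Callable[[], None], interval_seconds: float = 60.0) -> None:
        self._flush = flush
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()
        atexit.register(self._flush)  # final drain when the process exits

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._flush()
```

The daemon flag keeps the timer thread from blocking shutdown, while the `atexit` hook gives queued events one last chance to be sent.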

4.5 — Web Dashboard Wiring

  • User dashboard: PLATFORM_SERVICE_URL in .env.example, /api/telemetry/ingest proxy route forwards to platform-service
  • Admin dashboard: PLATFORM_SERVICE_URL in .env.example, /api/telemetry/route.ts queries platform-service, /api/telemetry/admin-ingest for self-telemetry
  • Tracker dashboard: PLATFORM_SERVICE_URL in .env.example, /api/telemetry/ingest proxy route
  • All 3 dashboards use @bytelyst/telemetry-client with sendBeacon transport

4.6 — Android Wiring

  • TelemetryClient.kt reads RuntimeConfig.platformServiceUrl which loads from .env file or BuildConfig.PLATFORM_SERVICE_URL
  • local.properties.example has PLATFORM_SERVICE_URL=http://10.0.2.2:4003
  • build.gradle.kts injects PLATFORM_SERVICE_URL into BuildConfig from local.properties
  • LysnrAIApp.kt initializes TelemetryClient in onCreate() and wires ProcessLifecycleOwner for foreground/background events
  • SharedPreferences offline queue with FIFO eviction + foreground restore

4.7 — Webhook / Alert Configuration

  • TELEMETRY_ALERT_WEBHOOK_URL added to .env.example (both repos)
  • TELEMETRY_GEO_API_URL added to .env.example (default: http://ip-api.com/json)
  • TELEMETRY_EVENT_TTL_DAYS added to .env.example (default: 90)
  • Webhook alerting code already exists in platform-service (cluster severity escalation → webhook POST)
  • Geo enrichment code already exists in platform-service (IP → country/region lookup on ingestion)
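A Slack-compatible webhook accepts a JSON body with a `text` field, so the alert sent on severity escalation could be built roughly as below. The helper names and the cluster field names (`id`, `severity`, `count`, `affectedUsers`) are assumptions for illustration; the roadmap does not specify the message format.

```python
import json
import urllib.request


def build_alert_payload(cluster: dict, previous_severity: str) -> bytes:
    """Build a Slack-style JSON body describing a severity escalation."""
    text = (
        f"Telemetry cluster {cluster['id']} escalated "
        f"{previous_severity} -> {cluster['severity']} "
        f"({cluster['count']} events, {cluster['affectedUsers']} users)"
    )
    return json.dumps({"text": text}).encode("utf-8")


def build_alert_request(webhook_url: str, payload: bytes) -> urllib.request.Request:
    """Prepare the POST; the URL would come from TELEMETRY_ALERT_WEBHOOK_URL."""
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a single `urllib.request.urlopen(request, timeout=...)` call, ideally wrapped so that a failing webhook never blocks ingestion.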

4.8 — End-to-End Smoke Test

  • scripts/telemetry-smoke-test.sh — 9-step curl-based smoke test covering:
    • Health check
    • Event ingestion (info + error events)
    • Event query (admin endpoint)
    • Error cluster query
    • Config endpoint (ETag caching)
    • Metrics endpoint
    • Rate limiting burst test
    • GDPR erasure endpoint
  • Full round-trip on deployed infra (iOS keyboard → platform-service → Cosmos → admin dashboard) — needs deployed infra
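The ETag step of the smoke test exercises a conditional GET: the client repeats the request with `If-None-Match` and expects a 304 when the config is unchanged. A minimal Python sketch of the server-side decision follows. `config_etag` and `respond` are hypothetical names, and deriving the ETag from a SHA-256 of the canonical JSON is an assumption; the `Cache-Control` value is the one stated in this roadmap.

```python
import hashlib
import json
from typing import Optional, Tuple


def config_etag(config: dict) -> str:
    """Derive a quoted ETag from a canonical JSON rendering of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return '"' + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16] + '"'


def respond(
    config: dict, if_none_match: Optional[str]
) -> Tuple[int, dict, Optional[dict]]:
    """Return (status, headers, body) for GET /telemetry/config."""
    etag = config_etag(config)
    headers = {"ETag": etag, "Cache-Control": "private, max-age=60"}
    if if_none_match == etag:
        return 304, headers, None  # client copy is current; send no body
    return 200, headers, config
```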

Remaining Infra Tasks (cannot be done in code)

| Task | Type | Notes |
|---|---|---|
| Deploy platform-service to Azure | Infra | Azure Container Apps or App Service |
| Configure DNS (api.lysnrai.com) | Infra | DNS + TLS cert |
| Run cosmos-telemetry-indexes.sh against prod | Infra | Creates containers + composite indexes |
| Register App Groups in Apple Developer portal | Portal | group.com.bytelyst.LysnrAI for both targets |
| Physical device testing (mic, Full Access) | Device test | Needs TestFlight build with App Group entitlements |

Architecture Summary

┌─────────────────────┐    ┌──────────────────────┐    ┌───────────────────┐
│  iOS Keyboard Ext   │    │   iOS Main App       │    │  Desktop (Python) │
│  LysnrTelemetry     │───▶│  TelemetryService    │    │  PlatformTelemetry│
│  (App Group queue)  │    │  (drains queue)      │    │  (urllib POST)    │
└─────────────────────┘    └──────────┬───────────┘    └────────┬──────────┘
  Full Access ON ──┐                  │                          │
  direct POST      │                  │                          │
                   ▼                  ▼                          ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                     Platform Service (Fastify, port 4003)              │
│  POST   /api/telemetry/events       — batch ingestion                 │
│  GET    /api/telemetry/config       — client collection config        │
│  GET    /api/telemetry/query        — admin event search              │
│  GET    /api/telemetry/clusters     — admin error clusters            │
│  CRUD   /api/telemetry/policies     — collection policy management    │
│  DELETE /api/telemetry/user/:userId — GDPR erasure                    │
└────────────────────────────┬────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        Azure Cosmos DB                                 │
│  telemetry_events              partitionKeyPath: /pk                   │
│    pk value = productId:yyyyMM:platform   (e.g. lysnrai:202602:ios)   │
│  telemetry_error_clusters      partitionKeyPath: /pk                   │
│    pk value = productId:platform:module   (e.g. lysnrai:ios:dictation)│
│  telemetry_collection_policies partitionKeyPath: /productId            │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────┐          ┌──────────────────────┐
│  Admin Dashboard        │  GET     │  User Dashboard      │  POST
│  /ops/client-logs       │─────────▶│  /api/telemetry/     │─────────▶ platform
│  (queries via           │  query/  │  ingest              │  /events  -service
│   platform-service API) │  clusters│  (browser → proxy)   │
└─────────────────────────┘          └──────────────────────┘

┌───────────────────────┐
│  Android              │
│  TelemetryClient.kt   │──▶ POST /api/telemetry/events ──▶ platform-service
│  (SharedPreferences)  │
└───────────────────────┘

Test Coverage

| Component | Test File | Tests | Coverage |
|---|---|---|---|
| Platform-service telemetry | telemetry.test.ts | 89 | Zod schemas (34), containsPII (6), computePk (4), normalizeMessage (7), generateFingerprint (8), policyMatchesContext (13), mergePolicies (5), checkRateLimit (3), plus additional route-logic tests |
| iOS LysnrTelemetry (keyboard) | LysnrAITests/LysnrTelemetryTests.swift | 18 | Identity (5), session management (2), event types (1), DictationContext (3), track (3), flush (2), queue (1), crash-safety (1) |
| Desktop Python client | tests/cloud/test_platform_telemetry.py | 19 | Event format (6), queue behavior (2), session mgmt (2), flush/HTTP (5), install ID (2), singleton (2) |
| Web dashboard client | user-dashboard-web/src/__tests__/telemetry.test.ts | 12 | trackEvent (3), trackPageView (1), flush (4), install ID (2), initTelemetry (2) |
| Tracker dashboard client | tracker-dashboard-web/src/__tests__/telemetry.test.ts | 10 | trackEvent (3), trackPageView (1), flush (4), initTelemetry (2) |
| Admin dashboard client | admin-dashboard-web/src/__tests__/telemetry.test.ts | 10 | trackEvent (3), trackPageView (1), flush (4), initTelemetry (2) |
| Total | | 158 | |

Verification commands

# Run each snippet from the learning_voice_ai_agent repo root
# (assumes learning_ai_common_plat is checked out as a sibling directory).
# Subshells keep the working directory unchanged between snippets.

# Platform-service (89 telemetry tests within 624 total)
(cd ../learning_ai_common_plat && pnpm --filter @lysnrai/platform-service test)

# iOS keyboard telemetry (18 tests)
xcodebuild test-without-building \
  -workspace mobile_app/ios/LysnrAI.xcworkspace \
  -scheme LysnrAITests \
  -destination 'platform=iOS Simulator,name=iPhone 17 Pro' \
  -only-testing:LysnrAITests/LysnrTelemetryTests

# Desktop Python (19 tests)
python -m pytest tests/cloud/test_platform_telemetry.py -v

# Web user-dashboard (12 tests)
(cd user-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts)

# Tracker dashboard (10 tests)
(cd tracker-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts)

# Admin dashboard (10 tests)
(cd admin-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts)

Not yet tested

  • iOS LysnrTelemetry.swift — done: now covered by 18 XCTest unit tests (LysnrTelemetryTests.swift, build 28)
  • iOS TelemetryService.swift (main app) — needs XCTest target for main app
  • Android TelemetryClient.kt — needs Android instrumented tests or Robolectric
  • Admin dashboard /api/telemetry/route.ts — API route integration test
  • Platform-service HTTP integration tests (Fastify inject for telemetry routes)
  • End-to-end: client → platform-service → Cosmos read-back → admin dashboard query

Bugs Found During Review

The following bugs were discovered during a systematic review of the roadmap against the actual code, and all have been fixed:

| # | Severity | Issue | Fix |
|---|---|---|---|
| 1 | High | Desktop Python id used uuid.uuid4().hex (32 hex, no dashes) — fails Zod .uuid() server validation | Changed to str(uuid.uuid4()) |
| 2 | High | Web telemetry osFamily='web' not in Zod OsFamilyEnum — fails server validation | Changed to 'other' |
| 3 | Medium | Status said "Phase 2 complete" but Android is all unchecked | Fixed status line |
| 4 | Medium | Architecture diagram showed wrong pk for telemetry_error_clusters (/productId → actual /pk = productId:platform:module) | Fixed diagram |
| 5 | Medium | Tracker dashboard telemetry missing from roadmap entirely | Added as Phase 2 pending |
| 6 | Medium | Admin dashboard self-telemetry (page views) not mentioned | Added as Phase 2 pending |
| 7 | Low | Architecture diagram missing Android client box | Added with "not yet implemented" note |
| 8 | Low | Architecture diagram implied Admin reads Cosmos directly (it queries Platform Service) | Fixed data flow arrows |
| 9 | Low | Web telemetry.ts JSDoc said "via the admin dashboard proxy" (wrong dashboard) | Fixed to "user dashboard's /api/telemetry/ingest proxy" |
| 10 | Low | Commit log missing roadmap doc commit | Added |
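Bug 1 comes down to the two string forms Python offers for the same UUID. The short sketch below shows the difference. `UUID_PATTERN` and `looks_like_uuid` are hypothetical stand-ins for a dashed-UUID format check and are not the Zod validator itself.

```python
import re
import uuid

# Canonical dashed layout: 8-4-4-4-12 lowercase hex groups.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

generated = uuid.uuid4()
hex_form = generated.hex       # 32 hex characters, no dashes (the buggy form)
dashed_form = str(generated)   # 36 characters with dashes (the fixed form)


def looks_like_uuid(value: str) -> bool:
    """Return True only for the dashed 8-4-4-4-12 representation."""
    return UUID_PATTERN.match(value) is not None
```

Here `looks_like_uuid(hex_form)` is false and `looks_like_uuid(dashed_form)` is true, which mirrors why the server rejected the original desktop event IDs.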

Commit Log

| Date | Repo | Commit | Description |
|---|---|---|---|
| 2026-02-16 | common-plat | c59049e | Design doc: client telemetry & log insights |
| 2026-02-16 | common-plat | 083cf02 | Fix 18 gaps in telemetry design doc (rev 2) |
| 2026-02-16 | common-plat | ce4c4ff | Telemetry module — ingest, config, query, clusters, policies (34 tests) |
| 2026-02-17 | voice-agent | e546475 | iOS keyboard telemetry client + KeyboardViewController instrumentation |
| 2026-02-17 | voice-agent | d202f94 | Admin dashboard Client Logs page + sidebar nav |
| 2026-02-17 | voice-agent | a173baa | iOS main app TelemetryService + Desktop Python platform_telemetry |
| 2026-02-17 | voice-agent | 130e1d6 | Web user-dashboard telemetry client + ingest proxy |
| 2026-02-17 | common-plat | c3d6977 | Telemetry roadmap doc (this file) |
| 2026-02-17 | voice-agent | ae77438 | Fix: desktop uuid format + web osFamily — pass Zod validation |
| 2026-02-17 | common-plat | 20f77d5 | Tests: route-logic tests — PII, pk, fingerprint, policy matching (34→77) |
| 2026-02-17 | voice-agent | 08efdb6 | Tests: Python client (19) + web dashboard (12) telemetry tests |
| 2026-02-17 | voice-agent | a102609 | Tracker + admin self-telemetry clients + tests (20 tests) |
| 2026-02-17 | voice-agent | 9196f48 | Android TelemetryClient + keyboard instrumentation + ProcessLifecycleOwner |
| 2026-02-17 | voice-agent | c7732c9 | Phase 3: Policy Builder UI + GDPR erasure proxy + sidebar nav |
| 2026-02-17 | common-plat | 2fb3410 | Phase 3: Rate limiting, batch dedup, ETag config caching (614 tests) |
| 2026-02-17 | common-plat | 056f323 | Phase 3: Cluster resolve/ignore, audit logging, webhook alerts, metrics, Cosmos indexes |
| 2026-02-17 | voice-agent | 6d7b1d3 | Phase 3: Cluster actions UI, metrics tab, GDPR erasure UI |
| 2026-02-17 | common-plat | 2f61ea5 | Phase 3: Geo enrichment, Prometheus metrics export |
| 2026-02-17 | voice-agent | dc49073 | Phase 3: Cluster timeline chart (Recharts) |
| 2026-02-17 | common-plat | 61c919a | Phase 3: Policy preview endpoint (count matching clients) |
| 2026-02-17 | voice-agent | da9031b | Phase 3: Policy builder live preview UI + API proxy |
| 2026-02-17 | common-plat | 0bfd4bd | Phase 3: Geo distribution endpoint (GET /telemetry/geo, Cosmos GROUP BY) |
| 2026-02-17 | voice-agent | 82a25c0 | Phase 3: Geo distribution UI — bar chart + country table on client-logs Geo tab |