Client Telemetry — Implementation Roadmap
Status: ALL PHASES COMPLETE ✅
Last updated: 2026-03-02
Design doc: `CLIENT_TELEMETRY_DESIGN.md`
Repos: `learning_ai_common_plat` (platform-service) · `learning_voice_ai_agent` (all clients + dashboards)
Phase 0 — Design & Review
- Write comprehensive telemetry design doc: schema, APIs, admin UX, privacy guardrails (c59049e)
- Systematic review: identify and fix 18 bugs/gaps in the design doc (083cf02)
  - TTL format (ISO → seconds), `regionCode` prefix format, missing `pk` field
  - Auth model for keyboard extension (`X-Install-Token`)
  - Config endpoint query params (`userId` / `anonymousInstallId`)
  - Error clustering made version-agnostic (`affectedVersions` array)
  - GDPR erasure endpoint added
  - iOS offline queue strategy (App Group UserDefaults, FIFO eviction)
  - Global defaults for `batchSize` / `flushInterval` / `maxQueueSize`
Phase 1 — MVP (iOS Keyboard + Backend + Admin UI)
Platform-Service Telemetry Module
- `types.ts`: Zod schemas for events, policies, clusters, queries (ce4c4ff)
- `repository.ts`: Cosmos DB CRUD for events, policies, clusters (ce4c4ff)
- `routes.ts`: Fastify routes for ingestion, config, admin query, clusters, policy CRUD, GDPR erasure (ce4c4ff)
- `telemetry.test.ts`: 34 Vitest tests for schemas + policy evaluation (ce4c4ff)
- Register telemetry routes in `server.ts` (ce4c4ff)
- Add Cosmos containers (`telemetry_events`, `telemetry_error_clusters`, `telemetry_collection_policies`) to `cosmos-init.ts` (ce4c4ff)
iOS Keyboard Telemetry Client
- `LysnrTelemetry.swift`: Singleton client with App Group offline queue, `X-Install-Token` auth, 200-event cap (e546475)
- Instrument `KeyboardViewController.swift`: 10+ telemetry points (e546475)
  - `session_started` / `session_ended` (with full `DictationContext`)
  - `backend_selected` (azure / local + reason)
  - `recognition_started` / `recognition_failed`
  - `mic_permission_denied`
  - `insert_noop` detection
  - `error_recovery_attempted` (local→azure, azure→local)
  - Session summary metrics (duration, segments, words, transcript length)
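The capped offline queue described above (oldest events dropped first once the cap is hit) can be illustrated with a small language-neutral sketch. It is written in Python for brevity rather than Swift, and it is not the `LysnrTelemetry.swift` implementation: `BoundedEventQueue` and its method names are hypothetical, and only the 200-event cap and FIFO eviction come from this roadmap.

```python
from collections import deque


class BoundedEventQueue:
    """FIFO queue that evicts the oldest event once the cap is reached."""

    def __init__(self, max_events: int = 200) -> None:
        # deque(maxlen=N) drops items from the left (oldest) when full.
        self._events = deque(maxlen=max_events)

    def enqueue(self, event: dict) -> None:
        self._events.append(event)

    def drain(self) -> list:
        """Return all queued events in insertion order and clear the queue."""
        drained = list(self._events)
        self._events.clear()
        return drained

    def __len__(self) -> int:
        return len(self._events)


queue = BoundedEventQueue(max_events=200)
for seq in range(205):          # 5 more events than the cap allows
    queue.enqueue({"seq": seq})

batch = queue.drain()           # the 5 oldest events were evicted
```

Evicting the oldest entries keeps the most recent session data, which is usually what matters when the extension has been offline for a long time.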
Admin Dashboard — Client Logs Page
- `/ops/client-logs/page.tsx`: Events table + Error Clusters tab (d202f94)
  - Stat cards (total events, errors, warnings, keyboard events)
  - Filters (platform, channel, level, module, free-text search)
  - Expandable event detail rows (device, tags, metrics, dictation context)
  - Error Clusters tab with severity, affected versions, user count
- `/api/telemetry/route.ts`: API route proxying to platform-service (d202f94)
- `platform-client.ts`: `queryTelemetryEvents` + `queryTelemetryClusters` (d202f94)
- `sidebar-nav.tsx`: "Client Logs" nav item with `FileText` icon (d202f94)
Phase 2 — Full Platform Coverage
iOS Main App
- `TelemetryService.swift`: Main app telemetry service with App Group queue drain on foreground (a173baa)
- `LysnrAIApp.swift`: `scenePhase` integration for activate/deactivate lifecycle (a173baa)
  - `app_foregrounded` / `app_backgrounded` events
  - Keyboard queue flush on every foreground transition
  - 60-second periodic flush timer
Desktop App (Python)
- `platform_telemetry.py`: `PlatformTelemetry` singleton with `urllib.request` POST, threaded flush timer, persistent `install_id` in `~/.LysnrAI/install_id` (a173baa)
- `main.py` instrumentation (a173baa)
  - `app_started` / `app_stopped` lifecycle events
  - `dictation_started` (with backend tag)
  - `dictation_completed` (with duration_ms, word_count, transcript_length metrics)
  - `mic_permission_denied` / `recording_start_failed` error events
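A persistent install ID of the kind listed above can be sketched as a read-or-create helper. This is a minimal illustration, not the code in `platform_telemetry.py`: the function name `load_or_create_install_id` is hypothetical, and the use of a dashed UUID as the ID format is an assumption.

```python
import uuid
from pathlib import Path


def load_or_create_install_id(path: Path) -> str:
    """Return the persisted install ID, creating it on first use."""
    if path.exists():
        existing = path.read_text(encoding="utf-8").strip()
        if existing:
            return existing
    # Assumption: the ID is a random UUID in dashed string form.
    install_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(install_id, encoding="utf-8")
    return install_id
```

With the location named in this roadmap, a call would look like `load_or_create_install_id(Path.home() / ".LysnrAI" / "install_id")`; repeated calls return the same value because the file is only written once.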
Web User Dashboard
- `telemetry.ts`: Browser client with `sendBeacon`, `localStorage` install ID, auto-flush on visibility change (130e1d6)
- `/api/telemetry/ingest/route.ts`: Server-side proxy to platform-service (130e1d6)
- `providers.tsx`: `initTelemetry()` called on app mount (130e1d6)
Tracker Dashboard
- `telemetry.ts`: Browser client (same pattern as user dashboard) (a102609)
- `/api/telemetry/ingest/route.ts`: Server-side proxy to platform-service (a102609)
- `providers.tsx`: `initTelemetry()` called on app mount (a102609)
Admin Dashboard Self-Telemetry
- `telemetry.ts`: Browser client tracking admin page views, filter usage, policy changes (a102609)
- `/api/telemetry/admin-ingest/route.ts`: Separate proxy from admin query route (a102609)
- `providers.tsx`: `initTelemetry()` called on app mount (a102609)
Android
- `TelemetryClient.kt`: Kotlin singleton with OkHttp POST, SharedPreferences offline queue, persistent install ID (9196f48)
- Instrument `LysnrInputMethodService.kt`: 10 telemetry points (9196f48)
  - `session_started` / `session_ended` (with words_inserted metric)
  - `dictation_started` (with backend + reason tags)
  - `dictation_completed` (with duration_ms, word_count, segment_count, transcript_length)
  - `mic_permission_denied`
  - `recognition_failed` (with errorCode + errorDomain)
  - `error_recovery_attempted` (azure→local fallback)
- Offline queue using SharedPreferences with FIFO eviction (9196f48)
- Flush on app foreground via `ProcessLifecycleOwner` + 60s periodic flush timer (9196f48)
Phase 3 — Intelligence & Admin Tooling
Error Clustering & Alerting
- Automated error fingerprinting (hash of `platform + channel + module + eventName + errorDomain + errorCode`), Phase 1 (ce4c4ff)
- Cluster severity escalation (`warn` → `error` → `fatal` based on count + affected users), Phase 1 (ce4c4ff)
- Webhook alerting when cluster severity escalates (Slack-compatible, env `TELEMETRY_ALERT_WEBHOOK_URL`) (056f323)
- Dashboard: cluster timeline chart (Recharts stacked bar, last 14 days, severity breakdown) (dc49073)
- Dashboard: "Resolve" / "Ignore" / "Reopen" actions on clusters (6d7b1d3)
- Cluster status field (`open` / `resolved` / `ignored`) + `PATCH /telemetry/clusters/:id` endpoint (056f323)
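The fingerprinting rule above hashes six fields and deliberately leaves out the app version, so the same error across releases lands in one cluster. A minimal sketch of that idea follows; the choice of SHA-256, the `|` separator, and the 16-character truncation are assumptions for illustration, not details confirmed by this roadmap.

```python
import hashlib


def generate_fingerprint(platform: str, channel: str, module: str,
                         event_name: str, error_domain: str,
                         error_code: str) -> str:
    """Hash the six clustering fields into a short, stable fingerprint."""
    # The app version is intentionally not part of the input.
    canonical = "|".join(
        [platform, channel, module, event_name, error_domain, str(error_code)]
    )
    # Assumption: SHA-256 truncated to 16 hex chars; the real hash may differ.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


fp_a = generate_fingerprint("ios", "keyboard", "dictation",
                            "recognition_failed", "SFSpeechError", "203")
fp_b = generate_fingerprint("ios", "keyboard", "dictation",
                            "recognition_failed", "SFSpeechError", "203")
fp_c = generate_fingerprint("ios", "keyboard", "dictation",
                            "recognition_failed", "SFSpeechError", "1101")
```

Here `fp_a` and `fp_b` are identical, while `fp_c` differs because the error code differs. The tag values used in the calls are made up for the example.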
Geo Enrichment
- Server-side IP → country/region lookup on ingestion (configurable via `TELEMETRY_GEO_API_URL`, 24h in-memory cache, 2s timeout) (2f61ea5)
- Populate `countryCode` + `regionCode` fields (e.g., `US:WA`) on events from server-side IP lookup (2f61ea5)
- Admin UI: geographic distribution chart (horizontal bar chart + country table, Geo tab on client-logs page) (0bfd4bd, 82a25c0)
- Policy targeting by `regionCode` / `countryCodes` ranges (schema already supports it in `TelemetryTargetingSchema`)
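The 24-hour in-memory cache in front of the geo lookup can be sketched as follows. This is an illustration only: `GeoCache`, `to_region_code`, and the injected `lookup` callable are hypothetical names, the HTTP call to the geo API (and its 2s timeout) is left to the caller, and the documentation-range IP address is made up.

```python
import time
from typing import Callable, Optional


class GeoCache:
    """Cache geo lookups per IP for a fixed TTL (24h by default)."""

    def __init__(self, lookup: Callable[[str], Optional[dict]],
                 ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # ip -> (expires_at, result)

    def resolve(self, ip: str) -> Optional[dict]:
        now = self._clock()
        cached = self._entries.get(ip)
        if cached is not None and cached[0] > now:
            return cached[1]
        result = self._lookup(ip)
        self._entries[ip] = (now + self._ttl, result)
        return result


def to_region_code(country_code: str, region: str) -> str:
    """Build the prefixed region code, e.g. US + WA -> US:WA."""
    return f"{country_code}:{region}"


calls = []

def fake_lookup(ip: str) -> dict:
    calls.append(ip)
    return {"countryCode": "US", "regionCode": to_region_code("US", "WA")}

now = [0.0]
cache = GeoCache(fake_lookup, clock=lambda: now[0])
first = cache.resolve("203.0.113.7")
second = cache.resolve("203.0.113.7")   # served from cache
now[0] = 24 * 60 * 60 + 1
third = cache.resolve("203.0.113.7")    # TTL elapsed, looked up again
```

Injecting the clock keeps the expiry logic testable without waiting a day.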
Collection Policy Builder UI
- Admin page: `/ops/telemetry-policies` (c7732c9)
- CRUD UI for collection policies (name, enabled, targeting rules, sampling rates) (c7732c9)
- Targeting builder: platform checkboxes, channel badges, release channel selection, percentage slider (c7732c9)
- Live preview: "N / M clients would match this policy" via `POST /telemetry/policies/preview` + UI button (61c919a, da9031b)
- Policy activation/deactivation toggle (c7732c9)
- Scheduling: `startsAt` / `expiresAt` date pickers (c7732c9)
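How a policy with targeting, a percentage rollout, and a schedule might be evaluated, and how a "N / M clients would match" preview follows from that, can be sketched as below. The field names (`platforms`, `channels`, `percentage`, `installId`) and the hash-bucket approach to the percentage are assumptions for illustration; the real `policyMatchesContext` logic may differ.

```python
import hashlib
from datetime import datetime, timezone


def rollout_bucket(install_id: str) -> int:
    """Map an install ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(install_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def policy_matches(policy: dict, client: dict, now: datetime) -> bool:
    if not policy.get("enabled", False):
        return False
    starts_at, expires_at = policy.get("startsAt"), policy.get("expiresAt")
    if starts_at is not None and now < starts_at:
        return False
    if expires_at is not None and now >= expires_at:
        return False
    platforms = policy.get("platforms")  # empty/missing means "all"
    if platforms and client["platform"] not in platforms:
        return False
    channels = policy.get("channels")
    if channels and client["channel"] not in channels:
        return False
    return rollout_bucket(client["installId"]) < policy.get("percentage", 100)


def preview(policy: dict, clients: list, now: datetime) -> tuple:
    """Return (N, M): matching clients out of all known clients."""
    matched = sum(1 for c in clients if policy_matches(policy, c, now))
    return matched, len(clients)


clients = [
    {"installId": "a1", "platform": "ios", "channel": "keyboard"},
    {"installId": "b2", "platform": "android", "channel": "keyboard"},
]
policy = {"enabled": True, "platforms": ["ios"], "percentage": 100}
result = preview(policy, clients, datetime(2026, 3, 2, tzinfo=timezone.utc))
```

Hashing the install ID (instead of drawing a random number per request) keeps a given client consistently inside or outside a partial rollout.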
Privacy & Compliance
- PII regex scanner on ingestion (email, phone, SSN, credit card patterns → reject before storage), Phase 1 (ce4c4ff)
- Admin API: GDPR erasure endpoint `DELETE /telemetry/user/:userId`, Phase 1 (ce4c4ff)
- Admin UI: GDPR erasure proxy route `/api/telemetry/erasure` (c7732c9)
- Retention policy enforcement (TTL-based auto-expiry, `TELEMETRY_EVENT_TTL_DAYS` env var), Phase 1 (ce4c4ff)
- Audit log entries for policy CRUD + GDPR erasure (`telemetry.policy.created/updated/deleted`, `telemetry.gdpr.erasure`, `telemetry.cluster.resolved/ignored`) (056f323)
- Admin UI: GDPR erasure tab on Client Logs page (6d7b1d3)
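A regex-based PII check of the kind listed above can be sketched as a recursive scan over the event payload. The four pattern categories come from this roadmap, but the regular expressions themselves are simplified assumptions (US-style phone and SSN formats, a loose card-number pattern) and are not the patterns used by the platform-service `containsPII` function.

```python
import re

# Simplified, illustrative patterns; real-world PII detection needs more care.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def find_pii(value) -> list:
    """Return the names of PII patterns found anywhere in a nested payload."""
    if isinstance(value, str):
        return [name for name, rx in PII_PATTERNS.items() if rx.search(value)]
    if isinstance(value, dict):
        children = value.values()
    elif isinstance(value, (list, tuple)):
        children = value
    else:
        return []  # numbers, booleans, and None carry no free text
    found = []
    for child in children:
        found.extend(find_pii(child))
    return found


def contains_pii(event: dict) -> bool:
    """True if the event should be rejected before storage."""
    return bool(find_pii(event))
```

Only string values are scanned, so numeric metrics such as `duration_ms` cannot trigger the card-number pattern. Long digit strings inside text fields can still produce false positives with patterns this loose.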
Performance & Scale
- ETag caching on `GET /telemetry/config` (`If-None-Match` → 304, `Cache-Control: private, max-age=60`) (2fb3410)
- Server-side rate limiting per `installId` (100 events/min, in-memory sliding window) (2fb3410)
- Cosmos DB indexing policy tuning: `scripts/cosmos-telemetry-indexes.sh` with composite indexes for all 3 containers (056f323)
- Batch ingestion deduplication by `event.id` (2fb3410)
- In-memory ingestion metrics counters + `GET /telemetry/metrics` admin endpoint (056f323)
- Admin UI: Metrics tab on Client Logs page (ingested, rejected, PII blocked, rate limited, duplicates) (6d7b1d3)
- Prometheus OpenMetrics export endpoint `GET /telemetry/metrics/prometheus` (2f61ea5)
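The per-`installId` limit of 100 events per minute with an in-memory sliding window can be sketched as below. The class name, the all-or-nothing handling of a batch, and the injected clock are assumptions made for the example; the platform-service `checkRateLimit` function may behave differently at the edges.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `limit` events per `window_seconds` for each install ID."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0,
                 clock=time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # install_id -> event timestamps

    def allow(self, install_id: str, count: int = 1) -> bool:
        now = self._clock()
        hits = self._hits[install_id]
        cutoff = now - self._window
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        # Assumption: a batch is accepted or rejected as a whole.
        if len(hits) + count > self._limit:
            return False
        hits.extend([now] * count)
        return True


now = [0.0]
limiter = SlidingWindowRateLimiter(clock=lambda: now[0])
full_batch = limiter.allow("install-a", count=100)   # fills the window
over_limit = limiter.allow("install-a")              # rejected
other_client = limiter.allow("install-b")            # separate budget
now[0] = 61.0
after_window = limiter.allow("install-a")            # window has moved on
```

Because the state lives in process memory, a limiter like this resets on restart and is not shared between service instances.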
Phase 4 — Operational Wiring ✅
This phase bridges "code exists" → "telemetry actually flows." All code-level wiring is complete. Remaining items are deployment/infra tasks (deploying platform-service, Apple Developer portal config, physical device testing).
4.1 — Platform-Service Deployment
- Add telemetry env vars to `.env.example` files (`TELEMETRY_ENABLED`, `TELEMETRY_ALERT_WEBHOOK_URL`, `TELEMETRY_GEO_API_URL`, `TELEMETRY_EVENT_TTL_DAYS`)
- `POST /api/telemetry/events` endpoint verified working locally via smoke test script
- Deploy platform-service to a publicly reachable URL (Azure Container Apps / App Service) (infra task)
- Configure DNS / reverse proxy so clients can reach `https://api.lysnrai.com` (infra task)
- Run `scripts/cosmos-telemetry-indexes.sh` against live Cosmos DB (infra task)
4.2 — iOS Keyboard Extension Wiring
- Fix App Group ID mismatch: `Platform/Config.swift` used `group.com.saravana.LysnrAI` but all other files (TelemetryService, LysnrTelemetry, AuthService, KeyboardLogStore, entitlements) use `group.com.bytelyst.LysnrAI`. Fixed to match.
- Write `platform_service_url` to App Group: `TelemetryService.writePlatformURLToAppGroup()` writes `Config.platformServiceURL` to App Group UserDefaults so the keyboard extension's `LysnrTelemetry.swift` can read it at init (line 80)
- Early URL write in `LysnrAIApp.swift` init: calls `TelemetryService.writePlatformURLToAppGroup()` before lazy TelemetryService access, so the keyboard gets the URL even on first install
- Mic permission pre-request already in `LysnrAIApp.swift` `.requestPermissionsForKeyboardExtension()` (both `AVAudioSession.requestRecordPermission` and `SFSpeechRecognizer.requestAuthorization`)
- Register App Groups in Apple Developer portal (portal task)
- Test Full Access ON vs OFF paths on physical device (device testing)
4.3 — iOS Main App TelemetryService Integration
- `TelemetryService.swift` reads `Config.platformServiceURL` and writes to App Group
- `LysnrAIApp.swift` wires `scenePhase` → `TelemetryService.shared.activate()` / `.deactivate()`
- `activate()` calls `flushKeyboardQueue()` on every foreground transition
- `flushKeyboardQueue()` reads App Group `telemetry_event_queue` and POSTs via `platformClient.fireAndForget`
- 60-second periodic flush timer via `BLTelemetryClient`
4.4 — Desktop App Wiring
- `PLATFORM_SERVICE_URL` already in `.env.example` (line 44) and `mobile_app/common/env.dev.example` (line 41)
- `platform_telemetry.py` reads `PLATFORM_SERVICE_URL` from env or settings and sends via `urllib.request`
- Threaded flush timer (60s) + atexit flush for offline→online drain
- Persistent `install_id` in `~/.LysnrAI/install_id`
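The desktop send path (a `urllib.request` POST that keeps events queued while offline) can be sketched as follows. This is not the code in `platform_telemetry.py`: the function names, the `{"events": [...]}` body shape, and the 5-second timeout are assumptions, and only the endpoint path `/api/telemetry/events` is taken from this roadmap.

```python
import json
import urllib.request


def build_ingest_request(base_url: str, events: list) -> urllib.request.Request:
    """Build a JSON POST for the batch ingestion endpoint."""
    body = json.dumps({"events": events}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/telemetry/events",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def flush(queue: list, base_url: str, send=urllib.request.urlopen) -> int:
    """Send all queued events; leave them queued if the POST fails."""
    if not queue:
        return 0
    batch = list(queue)
    try:
        send(build_ingest_request(base_url, batch), timeout=5)
    except OSError:
        # URLError and HTTPError are OSError subclasses: stay queued and
        # retry on the next timer tick or at exit.
        return 0
    del queue[: len(batch)]
    return len(batch)
```

A periodic flush could then be scheduled with `threading.Timer(60, ...)` and a final attempt registered through `atexit.register(flush, queue, base_url)`, which matches the timer-plus-atexit arrangement listed above.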
4.5 — Web Dashboard Wiring
- User dashboard: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/ingest` proxy route forwards to platform-service
- Admin dashboard: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/route.ts` queries platform-service, `/api/telemetry/admin-ingest` for self-telemetry
- Tracker dashboard: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/ingest` proxy route
- All 3 dashboards use `@bytelyst/telemetry-client` with `sendBeacon` transport
4.6 — Android Wiring
- `TelemetryClient.kt` reads `RuntimeConfig.platformServiceUrl`, which loads from the `.env` file or `BuildConfig.PLATFORM_SERVICE_URL`
- `local.properties.example` has `PLATFORM_SERVICE_URL=http://10.0.2.2:4003`
- `build.gradle.kts` injects `PLATFORM_SERVICE_URL` into `BuildConfig` from `local.properties`
- `LysnrAIApp.kt` initializes `TelemetryClient` in `onCreate()` and wires `ProcessLifecycleOwner` for foreground/background events
- SharedPreferences offline queue with FIFO eviction + foreground restore
4.7 — Webhook / Alert Configuration
- `TELEMETRY_ALERT_WEBHOOK_URL` added to `.env.example` (both repos)
- `TELEMETRY_GEO_API_URL` added to `.env.example` (default: `http://ip-api.com/json`)
- `TELEMETRY_EVENT_TTL_DAYS` added to `.env.example` (default: 90)
- Webhook alerting code already exists in platform-service (cluster severity escalation → webhook POST)
- Geo enrichment code already exists in platform-service (IP → country/region lookup on ingestion)
4.8 — End-to-End Smoke Test
- `scripts/telemetry-smoke-test.sh`: 9-step curl-based smoke test covering:
  - Health check
  - Event ingestion (info + error events)
  - Event query (admin endpoint)
  - Error cluster query
  - Config endpoint (ETag caching)
  - Metrics endpoint
  - Rate limiting burst test
  - GDPR erasure endpoint
- Full round-trip on deployed infra (iOS keyboard → platform-service → Cosmos → admin dashboard) (needs deployed infra)
Remaining Infra Tasks (cannot be done in code)
| Task | Type | Notes |
|---|---|---|
| Deploy platform-service to Azure | Infra | Azure Container Apps or App Service |
| Configure DNS (api.lysnrai.com) | Infra | DNS + TLS cert |
| Run cosmos-telemetry-indexes.sh against prod | Infra | Creates containers + composite indexes |
| Register App Groups in Apple Developer portal | Portal | group.com.bytelyst.LysnrAI for both targets |
| Physical device testing (mic, Full Access) | Device test | Needs TestFlight build with App Group entitlements |
Architecture Summary
┌─────────────────────┐    ┌──────────────────────┐    ┌───────────────────┐
│  iOS Keyboard Ext   │    │  iOS Main App        │    │  Desktop (Python) │
│  LysnrTelemetry     │───▶│  TelemetryService    │    │  PlatformTelemetry│
│  (App Group queue)  │    │  (drains queue)      │    │  (urllib POST)    │
└─────────────────────┘    └──────────┬───────────┘    └────────┬──────────┘
  Full Access ON ──┐                  │                         │
  direct POST      │                  │                         │
                   ▼                  ▼                         ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Platform Service (Fastify, port 4003)                                   │
│  POST   /api/telemetry/events       - batch ingestion                    │
│  GET    /api/telemetry/config       - client collection config           │
│  GET    /api/telemetry/query        - admin event search                 │
│  GET    /api/telemetry/clusters     - admin error clusters               │
│  CRUD   /api/telemetry/policies     - collection policy management       │
│  DELETE /api/telemetry/user/:userId - GDPR erasure                       │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Azure Cosmos DB                                                         │
│  telemetry_events               partitionKeyPath: /pk                    │
│    pk value = productId:yyyyMM:platform (e.g. lysnrai:202602:ios)        │
│  telemetry_error_clusters       partitionKeyPath: /pk                    │
│    pk value = productId:platform:module (e.g. lysnrai:ios:dictation)     │
│  telemetry_collection_policies  partitionKeyPath: /productId             │
└──────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────┐          ┌──────────────────────┐
│  Admin Dashboard        │   GET    │  User Dashboard      │   POST
│  /ops/client-logs       │─────────▶│  /api/telemetry/     │─────────▶ platform
│  (queries via           │  query/  │    ingest            │  /events  -service
│  platform-service API)  │ clusters │  (browser → proxy)   │
└─────────────────────────┘          └──────────────────────┘
┌───────────────────────┐
│  Android              │
│  TelemetryClient.kt   │──▶ POST /api/telemetry/events ──▶ platform-service
│  (SharedPreferences)  │
└───────────────────────┘
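The partition-key values shown in the Cosmos DB box above can be derived with two small helpers. This is a sketch rather than the platform-service `computePk` function: the Python names are hypothetical, and taking the `yyyyMM` bucket from the event timestamp in UTC is an assumption.

```python
from datetime import datetime, timezone


def compute_event_pk(product_id: str, timestamp: datetime, platform: str) -> str:
    """pk for telemetry_events: productId:yyyyMM:platform."""
    # Assumption: the month bucket is computed in UTC.
    month = timestamp.astimezone(timezone.utc).strftime("%Y%m")
    return f"{product_id}:{month}:{platform}"


def compute_cluster_pk(product_id: str, platform: str, module: str) -> str:
    """pk for telemetry_error_clusters: productId:platform:module."""
    return f"{product_id}:{platform}:{module}"


event_pk = compute_event_pk(
    "lysnrai", datetime(2026, 2, 16, 9, 30, tzinfo=timezone.utc), "ios"
)
cluster_pk = compute_cluster_pk("lysnrai", "ios", "dictation")
```

Including the month in the event key spreads writes across partitions over time and lets month-scoped admin queries target a single partition per platform.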
Test Coverage
| Component | Test File | Tests | Coverage |
|---|---|---|---|
| Platform-service telemetry | `telemetry.test.ts` | 89 | Zod schemas (34), containsPII (6), computePk (4), normalizeMessage (7), generateFingerprint (8), policyMatchesContext (13), mergePolicies (5), checkRateLimit (3), plus additional route-logic tests |
| iOS LysnrTelemetry (keyboard) | `LysnrAITests/LysnrTelemetryTests.swift` | 18 | Identity (5), session management (2), event types (1), DictationContext (3), track (3), flush (2), queue (1), crash-safety (1) |
| Desktop Python client | `tests/cloud/test_platform_telemetry.py` | 19 | Event format (6), queue behavior (2), session mgmt (2), flush/HTTP (5), install ID (2), singleton (2) |
| Web dashboard client | `user-dashboard-web/src/__tests__/telemetry.test.ts` | 12 | trackEvent (3), trackPageView (1), flush (4), install ID (2), initTelemetry (2) |
| Tracker dashboard client | `tracker-dashboard-web/src/__tests__/telemetry.test.ts` | 10 | trackEvent (3), trackPageView (1), flush (4), initTelemetry (2) |
| Admin dashboard client | `admin-dashboard-web/src/__tests__/telemetry.test.ts` | 10 | trackEvent (3), trackPageView (1), flush (4), initTelemetry (2) |
| Total | | 158 | |
Verification commands
# Platform-service (89 telemetry tests within 624 total)
cd ../learning_ai_common_plat && pnpm --filter @lysnrai/platform-service test
# iOS keyboard telemetry (18 tests)
cd learning_voice_ai_agent
xcodebuild test-without-building \
-workspace mobile_app/ios/LysnrAI.xcworkspace \
-scheme LysnrAITests \
-destination 'platform=iOS Simulator,name=iPhone 17 Pro' \
-only-testing:LysnrAITests/LysnrTelemetryTests
# Desktop Python (19 tests)
python -m pytest tests/cloud/test_platform_telemetry.py -v
# Web user-dashboard (12 tests)
cd user-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts
# Tracker dashboard (10 tests)
cd tracker-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts
# Admin dashboard (10 tests)
cd admin-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts
Not yet tested
- iOS `LysnrTelemetry.swift`: ✅ now covered by 18 XCTest unit tests (`LysnrTelemetryTests.swift`, build 28)
- iOS `TelemetryService.swift` (main app): needs XCTest target for main app
- Android `TelemetryClient.kt`: needs Android instrumented tests or Robolectric
- Admin dashboard `/api/telemetry/route.ts`: API route integration test
- Platform-service HTTP integration tests (Fastify inject for telemetry routes)
- End-to-end: client → platform-service → Cosmos read-back → admin dashboard query
Bugs Found During Review
The following bugs were discovered during systematic review of the roadmap against actual code and fixed:
| # | Severity | Issue | Fix |
|---|---|---|---|
| 1 | High | Desktop Python `id` used `uuid.uuid4().hex` (32 hex, no dashes), which fails Zod `.uuid()` server validation | Changed to `str(uuid.uuid4())` |
| 2 | High | Web telemetry `osFamily='web'` not in Zod `OsFamilyEnum`, which fails server validation | Changed to `'other'` |
| 3 | Medium | Status said "Phase 2 complete" but Android is all unchecked | Fixed status line |
| 4 | Medium | Architecture diagram showed wrong pk for `telemetry_error_clusters` (`/productId` → actual `/pk` = `productId:platform:module`) | Fixed diagram |
| 5 | Medium | Tracker dashboard telemetry missing from roadmap entirely | Added as Phase 2 pending |
| 6 | Medium | Admin dashboard self-telemetry (page views) not mentioned | Added as Phase 2 pending |
| 7 | Low | Architecture diagram missing Android client box | Added with "not yet implemented" note |
| 8 | Low | Architecture diagram implied Admin reads Cosmos directly (it queries Platform Service) | Fixed data flow arrows |
| 9 | Low | Web `telemetry.ts` JSDoc said "via the admin dashboard proxy" (wrong dashboard) | Fixed to "user dashboard's `/api/telemetry/ingest` proxy" |
| 10 | Low | Commit log missing roadmap doc commit | Added |
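Bug 1 in the table above comes down to the two string forms of a Python UUID. The short check below shows the difference; the regular expression is a simplified stand-in for a dashed-UUID validator and is not Zod's actual pattern.

```python
import re
import uuid

# Simplified dashed 8-4-4-4-12 check (illustrative, not Zod's exact rule).
UUID_DASHED = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

event_uuid = uuid.uuid4()
compact = event_uuid.hex    # 32 hex characters, no dashes (the buggy form)
dashed = str(event_uuid)    # 36 characters with dashes (the fixed form)

compact_ok = bool(UUID_DASHED.match(compact))
dashed_ok = bool(UUID_DASHED.match(dashed))
```

Both strings encode the same 128-bit value, so the fix changes only the serialization the server sees, not the identity of the event.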
Commit Log
| Date | Repo | Commit | Description |
|---|---|---|---|
| 2026-02-16 | common-plat | c59049e | Design doc: client telemetry & log insights |
| 2026-02-16 | common-plat | 083cf02 | Fix 18 gaps in telemetry design doc (rev 2) |
| 2026-02-16 | common-plat | ce4c4ff | Telemetry module: ingest, config, query, clusters, policies (34 tests) |
| 2026-02-17 | voice-agent | e546475 | iOS keyboard telemetry client + KeyboardViewController instrumentation |
| 2026-02-17 | voice-agent | d202f94 | Admin dashboard Client Logs page + sidebar nav |
| 2026-02-17 | voice-agent | a173baa | iOS main app TelemetryService + Desktop Python platform_telemetry |
| 2026-02-17 | voice-agent | 130e1d6 | Web user-dashboard telemetry client + ingest proxy |
| 2026-02-17 | common-plat | c3d6977 | Telemetry roadmap doc (this file) |
| 2026-02-17 | voice-agent | ae77438 | Fix: desktop uuid format + web osFamily (pass Zod validation) |
| 2026-02-17 | common-plat | 20f77d5 | Tests: route-logic tests for PII, pk, fingerprint, policy matching (34→77) |
| 2026-02-17 | voice-agent | 08efdb6 | Tests: Python client (19) + web dashboard (12) telemetry tests |
| 2026-02-17 | voice-agent | a102609 | Tracker + admin self-telemetry clients + tests (20 tests) |
| 2026-02-17 | voice-agent | 9196f48 | Android TelemetryClient + keyboard instrumentation + ProcessLifecycleOwner |
| 2026-02-17 | voice-agent | c7732c9 | Phase 3: Policy Builder UI + GDPR erasure proxy + sidebar nav |
| 2026-02-17 | common-plat | 2fb3410 | Phase 3: Rate limiting, batch dedup, ETag config caching (614 tests) |
| 2026-02-17 | common-plat | 056f323 | Phase 3: Cluster resolve/ignore, audit logging, webhook alerts, metrics, Cosmos indexes |
| 2026-02-17 | voice-agent | 6d7b1d3 | Phase 3: Cluster actions UI, metrics tab, GDPR erasure UI |
| 2026-02-17 | common-plat | 2f61ea5 | Phase 3: Geo enrichment, Prometheus metrics export |
| 2026-02-17 | voice-agent | dc49073 | Phase 3: Cluster timeline chart (Recharts) |
| 2026-02-17 | common-plat | 61c919a | Phase 3: Policy preview endpoint (count matching clients) |
| 2026-02-17 | voice-agent | da9031b | Phase 3: Policy builder live preview UI + API proxy |
| 2026-02-17 | common-plat | 0bfd4bd | Phase 3: Geo distribution endpoint (GET /telemetry/geo, Cosmos GROUP BY) |
| 2026-02-17 | voice-agent | 82a25c0 | Phase 3: Geo distribution UI (bar chart + country table on client-logs Geo tab) |