learning_ai_common_plat/docs/roadmaps/partial/telemetry_IMPLEMENTATION_ROADMAP.md

Client Telemetry — Implementation Roadmap

Status: Phases 0–3 code complete · Phase 4 (Operational Wiring) NOT STARTED 🔴
Last updated: 2026-02-17 (reviewed for accuracy against running code)
Design doc: CLIENT_TELEMETRY_DESIGN.md
Repos: learning_ai_common_plat (platform-service) · learning_voice_ai_agent (all clients + dashboards)


Phase 0 — Design & Review

  • Write comprehensive telemetry design doc — schema, APIs, admin UX, privacy guardrails (c59049e)
  • Systematic review: identify and fix 18 bugs/gaps in the design doc (083cf02)
    • TTL format (ISO → seconds), regionCode prefix format, missing pk field
    • Auth model for keyboard extension (X-Install-Token)
    • Config endpoint query params (userId/anonymousInstallId)
    • Error clustering made version-agnostic (affectedVersions array)
    • GDPR erasure endpoint added
    • iOS offline queue strategy (App Group UserDefaults, FIFO eviction)
    • Global defaults for batchSize/flushInterval/maxQueueSize

Phase 1 — MVP (iOS Keyboard + Backend + Admin UI)

Platform-Service Telemetry Module

  • types.ts — Zod schemas for events, policies, clusters, queries (ce4c4ff)
  • repository.ts — Cosmos DB CRUD for events, policies, clusters (ce4c4ff)
  • routes.ts — Fastify routes: ingestion, config, admin query, clusters, policy CRUD, GDPR erasure (ce4c4ff)
  • telemetry.test.ts — 34 Vitest tests for schemas + policy evaluation (ce4c4ff)
  • Register telemetry routes in server.ts (ce4c4ff)
  • Add Cosmos containers (telemetry_events, telemetry_error_clusters, telemetry_collection_policies) to cosmos-init.ts (ce4c4ff)

iOS Keyboard Telemetry Client

  • LysnrTelemetry.swift — Singleton client with App Group offline queue, X-Install-Token auth, 200-event cap (e546475)
  • Instrument KeyboardViewController.swift — 10+ telemetry points (e546475)
    • session_started / session_ended (with full DictationContext)
    • backend_selected (azure / local + reason)
    • recognition_started / recognition_failed
    • mic_permission_denied
    • insert_noop detection
    • error_recovery_attempted (local→azure, azure→local)
    • Session summary metrics (duration, segments, words, transcript length)
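
The keyboard client's offline queue is capped at 200 events, and the Phase 0 review settled on FIFO eviction. The sketch below illustrates that policy in TypeScript purely for exposition. It is not the Swift implementation, and `BoundedEventQueue`, `enqueue`, and `drain` are hypothetical names.

```typescript
// Hypothetical sketch of a capped offline queue with FIFO eviction.
// The real client is LysnrTelemetry.swift; the names here are illustrative only.
class BoundedEventQueue<T> {
  private items: T[] = [];
  private maxSize: number;

  constructor(maxSize: number) {
    if (maxSize <= 0) throw new Error("maxSize must be positive");
    this.maxSize = maxSize;
  }

  // Append an event; once the cap is exceeded, drop the oldest events first.
  enqueue(event: T): void {
    this.items.push(event);
    while (this.items.length > this.maxSize) {
      this.items.shift();
    }
  }

  // Remove and return up to `batchSize` of the oldest events for flushing.
  drain(batchSize: number): T[] {
    return this.items.splice(0, batchSize);
  }

  get size(): number {
    return this.items.length;
  }
}

// 205 events into a 200-event queue: the five oldest are evicted.
const queue = new BoundedEventQueue<string>(200);
for (let i = 0; i < 205; i++) queue.enqueue(`event-${i}`);
const firstBatch = queue.drain(2); // ["event-5", "event-6"]
```

Evicting the oldest entries keeps the most recent session context, which is usually what matters when diagnosing a failure after a long offline period.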

Admin Dashboard — Client Logs Page

  • /ops/client-logs/page.tsx — Events table + Error Clusters tab (d202f94)
    • Stat cards (total events, errors, warnings, keyboard events)
    • Filters (platform, channel, level, module, free-text search)
    • Expandable event detail rows (device, tags, metrics, dictation context)
    • Error Clusters tab with severity, affected versions, user count
  • /api/telemetry/route.ts — API route proxying to platform-service (d202f94)
  • platform-client.ts — queryTelemetryEvents + queryTelemetryClusters (d202f94)
  • sidebar-nav.tsx — "Client Logs" nav item with FileText icon (d202f94)

Phase 2 — Full Platform Coverage

iOS Main App

  • TelemetryService.swift — Main app telemetry service with App Group queue drain on foreground (a173baa)
  • LysnrAIApp.swift — scenePhase integration for activate/deactivate lifecycle (a173baa)
    • app_foregrounded / app_backgrounded events
    • Keyboard queue flush on every foreground transition
    • 60-second periodic flush timer

Desktop App (Python)

  • platform_telemetry.py — PlatformTelemetry singleton with urllib.request POST, threaded flush timer, persistent install_id in ~/.LysnrAI/install_id (a173baa)
  • main.py instrumentation (a173baa)
    • app_started / app_stopped lifecycle events
    • dictation_started (with backend tag)
    • dictation_completed (with duration_ms, word_count, transcript_length metrics)
    • mic_permission_denied / recording_start_failed error events

Web User Dashboard

  • telemetry.ts — Browser client with sendBeacon, localStorage install ID, auto-flush on visibility change (130e1d6)
  • /api/telemetry/ingest/route.ts — Server-side proxy to platform-service (130e1d6)
  • providers.tsx — initTelemetry() called on app mount (130e1d6)

Tracker Dashboard

  • telemetry.ts — Browser client (same pattern as user dashboard) (a102609)
  • /api/telemetry/ingest/route.ts — Server-side proxy to platform-service (a102609)
  • providers.tsx — initTelemetry() called on app mount (a102609)

Admin Dashboard Self-Telemetry

  • telemetry.ts — Browser client tracking admin page views, filter usage, policy changes (a102609)
  • /api/telemetry/admin-ingest/route.ts — Separate proxy from admin query route (a102609)
  • providers.tsx — initTelemetry() called on app mount (a102609)

Android

  • TelemetryClient.kt — Kotlin singleton with OkHttp POST, SharedPreferences offline queue, persistent install ID (9196f48)
  • Instrument LysnrInputMethodService.kt — 10 telemetry points (9196f48)
    • session_started / session_ended (with words_inserted metric)
    • dictation_started (with backend + reason tags)
    • dictation_completed (with duration_ms, word_count, segment_count, transcript_length)
    • mic_permission_denied
    • recognition_failed (with errorCode + errorDomain)
    • error_recovery_attempted (azure→local fallback)
  • Offline queue using SharedPreferences with FIFO eviction (9196f48)
  • Flush on app foreground via ProcessLifecycleOwner + 60s periodic flush timer (9196f48)

Phase 3 — Intelligence & Admin Tooling

Error Clustering & Alerting

  • Automated error fingerprinting (hash of platform + channel + module + eventName + errorDomain + errorCode) — Phase 1 (ce4c4ff)
  • Cluster severity escalation (warn → error → fatal based on count + affected users) — Phase 1 (ce4c4ff)
  • Webhook alerting when cluster severity escalates (Slack-compatible, env TELEMETRY_ALERT_WEBHOOK_URL) (056f323)
  • Dashboard: cluster timeline chart (Recharts stacked bar, last 14 days, severity breakdown) (dc49073)
  • Dashboard: "Resolve" / "Ignore" / "Reopen" actions on clusters (6d7b1d3)
  • Cluster status field (open/resolved/ignored) + PATCH /telemetry/clusters/:id endpoint (056f323)
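
To make the fingerprinting item concrete, here is a minimal sketch of hashing the six listed fields into a cluster key. The roadmap says only "hash", so SHA-256, the `|` separator, and the function signature are assumptions. The example field values are made up.

```typescript
import { createHash } from "node:crypto";

// Fields the roadmap lists as fingerprint inputs. App version is deliberately
// absent, so one cluster can span releases (tracked via affectedVersions).
interface FingerprintInput {
  platform: string;
  channel: string;
  module: string;
  eventName: string;
  errorDomain?: string;
  errorCode?: string;
}

// Assumption: SHA-256 over a "|"-joined string; the real hash and separator may differ.
function generateFingerprint(input: FingerprintInput): string {
  const parts = [
    input.platform,
    input.channel,
    input.module,
    input.eventName,
    input.errorDomain ?? "",
    input.errorCode ?? "",
  ];
  return createHash("sha256").update(parts.join("|")).digest("hex");
}

// Illustrative values only.
const base: FingerprintInput = {
  platform: "ios",
  channel: "keyboard",
  module: "dictation",
  eventName: "recognition_failed",
  errorDomain: "speech",
  errorCode: "203",
};
const fpA = generateFingerprint(base);
const fpB = generateFingerprint({ ...base });
const fpC = generateFingerprint({ ...base, errorCode: "204" });
```

Identical inputs always map to the same fingerprint (`fpA === fpB`), while a different error code produces a different cluster key (`fpC`).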

Geo Enrichment

  • Server-side IP → country/region lookup on ingestion (configurable via TELEMETRY_GEO_API_URL, 24h in-memory cache, 2s timeout) (2f61ea5)
  • Populate countryCode + regionCode fields (e.g., US:WA) on events from server-side IP lookup (2f61ea5)
  • Admin UI: geographic distribution chart (horizontal bar chart + country table, Geo tab on client-logs page) (0bfd4bd, 82a25c0)
  • Policy targeting by regionCode/countryCodes ranges (schema already supports it in TelemetryTargetingSchema)
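
The 24h in-memory cache for geo lookups can be pictured as a map with per-entry expiry. The sketch below is one possible shape, not the code in commit 2f61ea5: `TtlCache`, `GeoInfo`, and the injectable clock are hypothetical, and the 2s lookup timeout is not modeled.

```typescript
// Hypothetical in-memory cache with per-entry expiry, sketching the 24h geo cache.
interface GeoInfo {
  countryCode: string;
  regionCode: string; // prefixed form, e.g. "US:WA"
}

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
let fakeTime = 0;
const geoCache = new TtlCache<GeoInfo>(DAY_MS, () => fakeTime);

// 203.0.113.7 is a documentation-only address (TEST-NET-3).
geoCache.set("203.0.113.7", { countryCode: "US", regionCode: "US:WA" });
const hitBeforeExpiry = geoCache.get("203.0.113.7"); // cached entry
fakeTime = DAY_MS;
const hitAfterExpiry = geoCache.get("203.0.113.7"); // undefined: entry expired
```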

Collection Policy Builder UI

  • Admin page: /ops/telemetry-policies (c7732c9)
  • CRUD UI for collection policies (name, enabled, targeting rules, sampling rates) (c7732c9)
  • Targeting builder: platform checkboxes, channel badges, release channel selection, percentage slider (c7732c9)
  • Live preview: "N / M clients would match this policy" — POST /telemetry/policies/preview + UI button (61c919a, da9031b)
  • Policy activation/deactivation toggle (c7732c9)
  • Scheduling: startsAt / expiresAt date pickers (c7732c9)
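
The enabled toggle and the startsAt / expiresAt pickers together define when a policy applies. A minimal sketch of that check follows, assuming ISO 8601 timestamps, an inclusive start, and an exclusive end. The `PolicySchedule` type and `isPolicyActive` helper are hypothetical and may not match the real policy schema.

```typescript
// Hypothetical subset of a collection policy: only the scheduling-related fields.
interface PolicySchedule {
  enabled: boolean;
  startsAt?: string; // ISO 8601 timestamp (assumed format)
  expiresAt?: string; // ISO 8601 timestamp (assumed format)
}

// A policy applies when it is enabled and `now` falls in [startsAt, expiresAt).
// A missing bound is treated as open-ended.
function isPolicyActive(policy: PolicySchedule, now: Date): boolean {
  if (!policy.enabled) return false;
  const t = now.getTime();
  if (policy.startsAt !== undefined && t < Date.parse(policy.startsAt)) return false;
  if (policy.expiresAt !== undefined && t >= Date.parse(policy.expiresAt)) return false;
  return true;
}

const marchPolicy: PolicySchedule = {
  enabled: true,
  startsAt: "2026-03-01T00:00:00Z",
  expiresAt: "2026-04-01T00:00:00Z",
};

const activeMidMarch = isPolicyActive(marchPolicy, new Date("2026-03-15T12:00:00Z")); // true
const activeBeforeStart = isPolicyActive(marchPolicy, new Date("2026-02-17T00:00:00Z")); // false
const activeAtExpiry = isPolicyActive(marchPolicy, new Date("2026-04-01T00:00:00Z")); // false
const activeWhenDisabled = isPolicyActive(
  { ...marchPolicy, enabled: false },
  new Date("2026-03-15T12:00:00Z"),
); // false
```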

Privacy & Compliance

  • PII regex scanner on ingestion (email, phone, SSN, credit card patterns → reject before storage) — Phase 1 (ce4c4ff)
  • Admin API: GDPR erasure endpoint DELETE /telemetry/user/:userId — Phase 1 (ce4c4ff)
  • Admin UI: GDPR erasure proxy route /api/telemetry/erasure (c7732c9)
  • Retention policy enforcement (TTL-based auto-expiry, TELEMETRY_EVENT_TTL_DAYS env var) — Phase 1 (ce4c4ff)
  • Audit log entries for policy CRUD + GDPR erasure (telemetry.policy.created/updated/deleted, telemetry.gdpr.erasure, telemetry.cluster.resolved/ignored) (056f323)
  • Admin UI: GDPR erasure tab on Client Logs page (6d7b1d3)
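
The PII scanner item can be illustrated with a small regex pass over a string field. The patterns below are deliberately simple stand-ins for the four categories named above and are not the production patterns. The single-string signature of `containsPII` is also an assumption.

```typescript
// Illustrative patterns only; the real scanner's regexes are not shown in this roadmap.
const PII_PATTERNS: { name: string; regex: RegExp }[] = [
  { name: "email", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  { name: "ssn", regex: /\b\d{3}-\d{2}-\d{4}\b/ },
  { name: "credit_card", regex: /\b(?:\d[ -]?){13,16}\b/ },
  { name: "phone", regex: /(?:\+?\d{1,3}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b/ },
];

// Returns true when any pattern matches, so the caller can reject the event
// before it is written to storage.
function containsPII(text: string): boolean {
  return PII_PATTERNS.some((p) => p.regex.test(text));
}

const emailHit = containsPII("contact me at jane@example.com");
const ssnHit = containsPII("ssn 123-45-6789");
const cardHit = containsPII("card 4111 1111 1111 1111");
const phoneHit = containsPII("call 206-555-0100");
const cleanMessage = containsPII("recognition_failed code=203 duration_ms=1520");
```

Patterns this loose will also flag some harmless digit runs (a 13-digit millisecond timestamp matches the card pattern), so a stricter scanner would likely add checks such as Luhn validation.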

Performance & Scale

  • ETag caching on GET /telemetry/config (If-None-Match → 304, Cache-Control: private, max-age=60) (2fb3410)
  • Server-side rate limiting per installId (100 events/min, in-memory sliding window) (2fb3410)
  • Cosmos DB indexing policy tuning — scripts/cosmos-telemetry-indexes.sh with composite indexes for all 3 containers (056f323)
  • Batch ingestion deduplication by event.id (2fb3410)
  • In-memory ingestion metrics counters + GET /telemetry/metrics admin endpoint (056f323)
  • Admin UI: Metrics tab on Client Logs page (ingested, rejected, PII blocked, rate limited, duplicates) (6d7b1d3)
  • Prometheus OpenMetrics export endpoint GET /telemetry/metrics/prometheus (2f61ea5)
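
Two of the items above, the per-installId sliding window and batch deduplication by event.id, are compact enough to sketch. This is an illustration under assumptions rather than the code in 2fb3410: the class and function names are hypothetical, timestamps are passed in explicitly, and deduplication is shown within a single batch only.

```typescript
// Hypothetical sliding-window limiter: at most `limit` events per install in any
// trailing `windowMs` interval. State lives in memory, as in the roadmap item.
class SlidingWindowRateLimiter {
  private windows = new Map<string, { at: number; count: number }[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Accept or reject a whole batch of `eventCount` events arriving at `nowMs`.
  tryAcquire(installId: string, eventCount: number, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only batches that are still inside the trailing window.
    const recent = (this.windows.get(installId) ?? []).filter((e) => e.at > cutoff);
    const used = recent.reduce((sum, e) => sum + e.count, 0);
    const allowed = used + eventCount <= this.limit;
    if (allowed) recent.push({ at: nowMs, count: eventCount });
    this.windows.set(installId, recent);
    return allowed;
  }
}

// Keep the first occurrence of each event id within one batch.
function dedupeById<T extends { id: string }>(events: T[]): T[] {
  const seen = new Set<string>();
  return events.filter((e) => {
    if (seen.has(e.id)) return false;
    seen.add(e.id);
    return true;
  });
}

const limiter = new SlidingWindowRateLimiter(100, 60_000);
const first = limiter.tryAcquire("install-a", 60, 0); // true: 60 of 100 used
const tooMany = limiter.tryAcquire("install-a", 50, 10_000); // false: would reach 110
const fits = limiter.tryAcquire("install-a", 40, 10_000); // true: exactly 100
const afterSlide = limiter.tryAcquire("install-a", 60, 60_000); // true: first batch aged out
const otherInstall = limiter.tryAcquire("install-b", 100, 10_000); // true: separate window

const unique = dedupeById([{ id: "1" }, { id: "2" }, { id: "1" }]); // ids "1", "2"
```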

Phase 4 — Operational Wiring (NOT STARTED 🔴)

This phase bridges "code exists" → "telemetry actually flows." Phases 0–3 are all code-complete, but no telemetry data has ever reached the server from any real client. The items below are required before the telemetry system can be called "done."

4.1 — Platform-Service Deployment

  • Deploy platform-service to a publicly reachable URL (Azure Container Apps, Azure App Service, or VM)
  • Configure DNS / reverse proxy so clients can reach https://api.lysnrai.com (or similar)
  • Set env vars: COSMOS_ENDPOINT, COSMOS_KEY, TELEMETRY_ENABLED=true
  • Run scripts/cosmos-telemetry-indexes.sh against live Cosmos DB to create containers + indexes
  • Verify POST /api/telemetry/events accepts a test payload from curl

4.2 — iOS Keyboard Extension Wiring

  • Register App Groups capability in Apple Developer portal for both com.bytelyst.LysnrAI and com.bytelyst.LysnrAI.keyboard
  • Restore entitlements in TestFlight builds (currently cleared because provisioning profile lacks App Groups)
    • LysnrAI.entitlements: aps-environment + com.apple.security.application-groups
    • LysnrKeyboard.entitlements: com.apple.security.application-groups
  • Write platform_service_url to App Group UserDefaults — currently LysnrTelemetry.swift reads platform_service_url from App Group (line 80) but nothing writes it
    • Option A: Main app writes URL on launch from env/config
    • Option B: Hardcode URL in LysnrTelemetry.swift init
    • Option C: Bundle in env.dev and read from shared config
  • Verify mic permission flow on physical device — keyboard extensions may not show permission prompts; main app must request mic permission first. Current "Mic error" on device likely caused by this.
  • Test Full Access ON vs OFF paths on physical device

4.3 — iOS Main App TelemetryService Integration

  • Verify TelemetryService.swift reads platform_service_url from config/env and writes to App Group
  • Verify keyboard queue drain works: main app foreground → reads App Group telemetry_event_queue → POSTs to server
  • Test lifecycle: app backgrounded → keyboard generates events → app foregrounded → events flushed

4.4 — Desktop App Wiring

  • Set PLATFORM_SERVICE_URL env var in ~/.LysnrAI/.env pointing to deployed service
  • Verify platform_telemetry.py sends events on dictation start/stop
  • Test offline → online queue drain

4.5 — Web Dashboard Wiring

  • Set PLATFORM_SERVICE_URL in dashboard .env.local files
  • Verify /api/telemetry/ingest proxy routes forward to deployed platform-service
  • Verify admin dashboard /ops/client-logs page loads real data from platform-service

4.6 — Android Wiring

  • Set platform service URL in Android app config
  • Test SharedPreferences offline queue + foreground flush
  • Verify keyboard instrumentation events reach server

4.7 — Webhook / Alert Configuration

  • Set TELEMETRY_ALERT_WEBHOOK_URL env var (Slack webhook or equivalent)
  • Test cluster severity escalation triggers webhook
  • Set TELEMETRY_GEO_API_URL env var (ip-api.com or similar) for geo enrichment

4.8 — End-to-End Smoke Test

  • iOS keyboard → platform-service → Cosmos → admin dashboard query — full round-trip
  • Desktop → platform-service → Cosmos → admin dashboard query
  • Web dashboard → platform-service ingest → admin dashboard query
  • Trigger error cluster creation → verify cluster appears in admin UI
  • Trigger rate limit → verify rejection in metrics tab
  • GDPR erasure → verify events deleted from Cosmos

Summary: What Blocks "100% Done"

| Blocker | Severity | Effort |
| --- | --- | --- |
| Platform-service not deployed | 🔴 Critical | Medium — needs Azure infra |
| App Group entitlements not registered | 🔴 Critical | Low — Apple Developer portal config |
| platform_service_url not written to App Group | 🔴 Critical | Low — one-line code change |
| Cosmos containers not created in prod | 🟡 High | Low — run indexing script |
| Mic permission flow on device | 🟡 High | Medium — needs device testing + possible UX fix |
| Webhook URL not configured | 🟢 Low | Trivial — env var |
| Geo API URL not configured | 🟢 Low | Trivial — env var |
| Remaining test gaps (5 items) | 🟢 Low | Medium — integration/e2e tests |

Architecture Summary

┌─────────────────────┐    ┌──────────────────────┐    ┌───────────────────┐
│  iOS Keyboard Ext   │    │   iOS Main App       │    │  Desktop (Python) │
│  LysnrTelemetry     │───▶│  TelemetryService    │    │  PlatformTelemetry│
│  (App Group queue)  │    │  (drains queue)      │    │  (urllib POST)    │
└─────────────────────┘    └──────────┬───────────┘    └────────┬──────────┘
  Full Access ON ──┐                  │                          │
  direct POST      │                  │                          │
                   ▼                  ▼                          ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                     Platform Service (Fastify, port 4003)              │
│  POST   /api/telemetry/events       — batch ingestion                 │
│  GET    /api/telemetry/config       — client collection config        │
│  GET    /api/telemetry/query        — admin event search              │
│  GET    /api/telemetry/clusters     — admin error clusters            │
│  CRUD   /api/telemetry/policies     — collection policy management    │
│  DELETE /api/telemetry/user/:userId — GDPR erasure                    │
└────────────────────────────┬────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        Azure Cosmos DB                                 │
│  telemetry_events              partitionKeyPath: /pk                   │
│    pk value = productId:yyyyMM:platform   (e.g. lysnrai:202602:ios)   │
│  telemetry_error_clusters      partitionKeyPath: /pk                   │
│    pk value = productId:platform:module   (e.g. lysnrai:ios:dictation)│
│  telemetry_collection_policies partitionKeyPath: /productId            │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────┐          ┌──────────────────────┐
│  Admin Dashboard        │  GET     │  User Dashboard      │  POST
│  /ops/client-logs       │─────────▶│  /api/telemetry/     │─────────▶ platform
│  (queries via           │  query/  │  ingest              │  /events  -service
│   platform-service API) │  clusters│  (browser → proxy)   │
└─────────────────────────┘          └──────────────────────┘

┌───────────────────────┐
│  Android              │
│  TelemetryClient.kt   │──▶ POST /api/telemetry/events ──▶ platform-service
│  (SharedPreferences)  │
└───────────────────────┘
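
The two partition key formats shown in the Cosmos DB box can be derived with a few lines of string formatting. The sketch below is an assumption about how such keys might be built: the function names and the use of UTC for the yyyyMM component are hypothetical, and the real `computePk` may differ.

```typescript
// Event partition key: productId:yyyyMM:platform (month taken in UTC, an assumption).
function computeEventPk(productId: string, timestamp: Date, platform: string): string {
  const yyyy = timestamp.getUTCFullYear();
  const mm = String(timestamp.getUTCMonth() + 1).padStart(2, "0");
  return `${productId}:${yyyy}${mm}:${platform}`;
}

// Error-cluster partition key: productId:platform:module.
function computeClusterPk(productId: string, platform: string, moduleName: string): string {
  return `${productId}:${platform}:${moduleName}`;
}

const eventPk = computeEventPk("lysnrai", new Date("2026-02-17T12:00:00Z"), "ios");
// "lysnrai:202602:ios"
const clusterPk = computeClusterPk("lysnrai", "ios", "dictation");
// "lysnrai:ios:dictation"
```

Bucketing events by month and platform keeps each logical partition bounded in size while still letting a query for one platform over a date range touch only a handful of partitions.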

Test Coverage

| Component | Test File | Tests | Coverage |
| --- | --- | --- | --- |
| Platform-service telemetry | telemetry.test.ts | 89 | Zod schemas (34), containsPII (6), computePk (4), normalizeMessage (7), generateFingerprint (8), policyMatchesContext (13), mergePolicies (5), checkRateLimit (3), plus additional route-logic tests |
| iOS LysnrTelemetry (keyboard) | LysnrAITests/LysnrTelemetryTests.swift | 18 | Identity (5), session management (2), event types (1), DictationContext (3), track (3), flush (2), queue (1), crash-safety (1) |
| Desktop Python client | tests/cloud/test_platform_telemetry.py | 19 | Event format (6), queue behavior (2), session mgmt (2), flush/HTTP (5), install ID (2), singleton (2) |
| Web dashboard client | user-dashboard-web/src/__tests__/telemetry.test.ts | 12 | trackEvent (3), trackPageView (1), flush (4), install ID (2), initTelemetry (2) |
| Tracker dashboard client | tracker-dashboard-web/src/__tests__/telemetry.test.ts | 10 | trackEvent (3), trackPageView (1), flush (4), initTelemetry (2) |
| Admin dashboard client | admin-dashboard-web/src/__tests__/telemetry.test.ts | 10 | trackEvent (3), trackPageView (1), flush (4), initTelemetry (2) |
| Total | | 158 | |

Verification commands

# Platform-service (89 telemetry tests within 624 total)
cd ../learning_ai_common_plat && pnpm --filter @lysnrai/platform-service test

# iOS keyboard telemetry (18 tests)
cd ../learning_voice_ai_agent
xcodebuild test-without-building \
  -workspace mobile_app/ios/LysnrAI.xcworkspace \
  -scheme LysnrAITests \
  -destination 'platform=iOS Simulator,name=iPhone 17 Pro' \
  -only-testing:LysnrAITests/LysnrTelemetryTests

# Desktop Python (19 tests)
python -m pytest tests/cloud/test_platform_telemetry.py -v

# Web user-dashboard (12 tests)
(cd user-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts)

# Tracker dashboard (10 tests)
(cd tracker-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts)

# Admin dashboard (10 tests)
(cd admin-dashboard-web && npx vitest run src/__tests__/telemetry.test.ts)

Not yet tested

  • iOS LysnrTelemetry.swift — done: now covered by 18 XCTest unit tests (LysnrTelemetryTests.swift, build 28)
  • iOS TelemetryService.swift (main app) — needs XCTest target for main app
  • Android TelemetryClient.kt — needs Android instrumented tests or Robolectric
  • Admin dashboard /api/telemetry/route.ts — API route integration test
  • Platform-service HTTP integration tests (Fastify inject for telemetry routes)
  • End-to-end: client → platform-service → Cosmos read-back → admin dashboard query

Bugs Found During Review

The following bugs were discovered during systematic review of the roadmap against actual code and fixed:

| # | Severity | Issue | Fix |
| --- | --- | --- | --- |
| 1 | High | Desktop Python id used uuid.uuid4().hex (32 hex, no dashes) — fails Zod .uuid() server validation | Changed to str(uuid.uuid4()) |
| 2 | High | Web telemetry osFamily='web' not in Zod OsFamilyEnum — fails server validation | Changed to 'other' |
| 3 | Medium | Status said "Phase 2 complete" but Android is all unchecked | Fixed status line |
| 4 | Medium | Architecture diagram showed wrong pk for telemetry_error_clusters (/productId → actual /pk = productId:platform:module) | Fixed diagram |
| 5 | Medium | Tracker dashboard telemetry missing from roadmap entirely | Added as Phase 2 pending |
| 6 | Medium | Admin dashboard self-telemetry (page views) not mentioned | Added as Phase 2 pending |
| 7 | Low | Architecture diagram missing Android client box | Added with "not yet implemented" note |
| 8 | Low | Architecture diagram implied Admin reads Cosmos directly (it queries Platform Service) | Fixed data flow arrows |
| 9 | Low | Web telemetry.ts JSDoc said "via the admin dashboard proxy" (wrong dashboard) | Fixed to "user dashboard's /api/telemetry/ingest proxy" |
| 10 | Low | Commit log missing roadmap doc commit | Added |
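
Bug 1 comes down to string shape: Python's `uuid.uuid4().hex` yields 32 hex characters with no dashes, while a UUID validator expects the dashed 8-4-4-4-12 form. The check below approximates that rule with a plain regex. It is not Zod's exact pattern, `isCanonicalUuid` is a hypothetical helper, and the sample ID is made up.

```typescript
// Canonical 8-4-4-4-12 UUID shape. An approximation of what a `.uuid()`
// validator accepts, not Zod's actual regex.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function isCanonicalUuid(value: string): boolean {
  return UUID_PATTERN.test(value);
}

// Made-up sample id in the dashed form that str(uuid.uuid4()) produces.
const dashedId = "3f2b8c1e-9a4d-4e7b-8c21-5d6f7a8b9c0d";
// The same id without dashes, i.e. the shape of uuid.uuid4().hex.
const hexOnlyId = dashedId.replace(/-/g, "");

const dashedOk = isCanonicalUuid(dashedId); // true
const hexOnlyOk = isCanonicalUuid(hexOnlyId); // false: 32 hex chars, no dashes
```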

Commit Log

| Date | Repo | Commit | Description |
| --- | --- | --- | --- |
| 2026-02-16 | common-plat | c59049e | Design doc: client telemetry & log insights |
| 2026-02-16 | common-plat | 083cf02 | Fix 18 gaps in telemetry design doc (rev 2) |
| 2026-02-16 | common-plat | ce4c4ff | Telemetry module — ingest, config, query, clusters, policies (34 tests) |
| 2026-02-17 | voice-agent | e546475 | iOS keyboard telemetry client + KeyboardViewController instrumentation |
| 2026-02-17 | voice-agent | d202f94 | Admin dashboard Client Logs page + sidebar nav |
| 2026-02-17 | voice-agent | a173baa | iOS main app TelemetryService + Desktop Python platform_telemetry |
| 2026-02-17 | voice-agent | 130e1d6 | Web user-dashboard telemetry client + ingest proxy |
| 2026-02-17 | common-plat | c3d6977 | Telemetry roadmap doc (this file) |
| 2026-02-17 | voice-agent | ae77438 | Fix: desktop uuid format + web osFamily — pass Zod validation |
| 2026-02-17 | common-plat | 20f77d5 | Tests: route-logic tests — PII, pk, fingerprint, policy matching (34→77) |
| 2026-02-17 | voice-agent | 08efdb6 | Tests: Python client (19) + web dashboard (12) telemetry tests |
| 2026-02-17 | voice-agent | a102609 | Tracker + admin self-telemetry clients + tests (20 tests) |
| 2026-02-17 | voice-agent | 9196f48 | Android TelemetryClient + keyboard instrumentation + ProcessLifecycleOwner |
| 2026-02-17 | voice-agent | c7732c9 | Phase 3: Policy Builder UI + GDPR erasure proxy + sidebar nav |
| 2026-02-17 | common-plat | 2fb3410 | Phase 3: Rate limiting, batch dedup, ETag config caching (614 tests) |
| 2026-02-17 | common-plat | 056f323 | Phase 3: Cluster resolve/ignore, audit logging, webhook alerts, metrics, Cosmos indexes |
| 2026-02-17 | voice-agent | 6d7b1d3 | Phase 3: Cluster actions UI, metrics tab, GDPR erasure UI |
| 2026-02-17 | common-plat | 2f61ea5 | Phase 3: Geo enrichment, Prometheus metrics export |
| 2026-02-17 | voice-agent | dc49073 | Phase 3: Cluster timeline chart (Recharts) |
| 2026-02-17 | common-plat | 61c919a | Phase 3: Policy preview endpoint (count matching clients) |
| 2026-02-17 | voice-agent | da9031b | Phase 3: Policy builder live preview UI + API proxy |
| 2026-02-17 | common-plat | 0bfd4bd | Phase 3: Geo distribution endpoint (GET /telemetry/geo, Cosmos GROUP BY) |
| 2026-02-17 | voice-agent | 82a25c0 | Phase 3: Geo distribution UI — bar chart + country table on client-logs Geo tab |