From c3d697711e35906ef7a65e73b79440c5585d5e17 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Tue, 17 Feb 2026 09:26:49 -0800
Subject: [PATCH] docs: add telemetry implementation roadmap with phase checkboxes and commit links

---
 docs/WINDSURF/TELEMETRY_ROADMAP.md | 192 +++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 docs/WINDSURF/TELEMETRY_ROADMAP.md

diff --git a/docs/WINDSURF/TELEMETRY_ROADMAP.md b/docs/WINDSURF/TELEMETRY_ROADMAP.md
new file mode 100644
index 00000000..2934c5c3
--- /dev/null
+++ b/docs/WINDSURF/TELEMETRY_ROADMAP.md
@@ -0,0 +1,192 @@
+# Client Telemetry — Implementation Roadmap
+
+> **Status:** Phase 2 complete, Phase 3 pending
+> **Last updated:** 2026-02-17
+> **Design doc:** [`CLIENT_TELEMETRY_DESIGN.md`](./CLIENT_TELEMETRY_DESIGN.md)
+> **Repos:** `learning_ai_common_plat` (platform-service) · `learning_voice_ai_agent` (all clients + dashboards)
+
+---
+
+## Phase 0 — Design & Review
+
+- [x] Write comprehensive telemetry design doc — schema, APIs, admin UX, privacy guardrails ([`c59049e`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c59049e))
+- [x] Systematic review: identify and fix 18 bugs/gaps in the design doc ([`083cf02`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/083cf02))
+  - TTL format (ISO → seconds), `regionCode` prefix format, missing `pk` field
+  - Auth model for keyboard extension (`X-Install-Token`)
+  - Config endpoint query params (`userId`/`anonymousInstallId`)
+  - Error clustering made version-agnostic (`affectedVersions` array)
+  - GDPR erasure endpoint added
+  - iOS offline queue strategy (App Group UserDefaults, FIFO eviction)
+  - Global defaults for `batchSize`/`flushInterval`/`maxQueueSize`
+
+---
+
+## Phase 1 — MVP (iOS Keyboard + Backend + Admin UI)
+
+### Platform-Service Telemetry Module
+
+- [x] `types.ts` — Zod schemas for events, policies, clusters, queries ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
+- [x] `repository.ts` — Cosmos DB CRUD for events, policies, clusters ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
+- [x] `routes.ts` — Fastify routes: ingestion, config, admin query, clusters, policy CRUD, GDPR erasure ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
+- [x] `telemetry.test.ts` — 34 Vitest tests for schemas + policy evaluation ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
+- [x] Register telemetry routes in `server.ts` ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
+- [x] Add Cosmos containers (`telemetry_events`, `telemetry_error_clusters`, `telemetry_collection_policies`) to `cosmos-init.ts` ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
+
+### iOS Keyboard Telemetry Client
+
+- [x] `LysnrTelemetry.swift` — Singleton client with App Group offline queue, `X-Install-Token` auth, 200-event cap ([`e546475`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/e546475))
+- [x] Instrument `KeyboardViewController.swift` — 10+ telemetry points ([`e546475`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/e546475))
+  - [x] `session_started` / `session_ended` (with full `DictationContext`)
+  - [x] `backend_selected` (azure / local + reason)
+  - [x] `recognition_started` / `recognition_failed`
+  - [x] `mic_permission_denied`
+  - [x] `insert_noop` detection
+  - [x] `error_recovery_attempted` (local→azure, azure→local)
+  - [x] Session summary metrics (duration, segments, words, transcript length)
+
+### Admin Dashboard — Client Logs Page
+
+- [x] `/ops/client-logs/page.tsx` — Events table + Error Clusters tab ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))
+  - [x] Stat cards (total events, errors, warnings, keyboard events)
+  - [x] Filters (platform, channel, level, module, free-text search)
+  - [x] Expandable event detail rows (device, tags, metrics, dictation context)
+  - [x] Error Clusters tab with severity, affected versions, user count
+- [x] `/api/telemetry/route.ts` — API route proxying to platform-service ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))
+- [x] `platform-client.ts` — `queryTelemetryEvents` + `queryTelemetryClusters` ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))
+- [x] `sidebar-nav.tsx` — "Client Logs" nav item with `FileText` icon ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))
+
+---
+
+## Phase 2 — Full Platform Coverage
+
+### iOS Main App
+
+- [x] `TelemetryService.swift` — Main app telemetry service with App Group queue drain on foreground ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
+- [x] `LysnrAIApp.swift` — `scenePhase` integration for activate/deactivate lifecycle ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
+  - [x] `app_foregrounded` / `app_backgrounded` events
+  - [x] Keyboard queue flush on every foreground transition
+  - [x] 60-second periodic flush timer
+
+### Desktop App (Python)
+
+- [x] `platform_telemetry.py` — `PlatformTelemetry` singleton with `urllib.request` POST, threaded flush timer, persistent `install_id` in `~/.LysnrAI/install_id` ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
+- [x] `main.py` instrumentation ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
+  - [x] `app_started` / `app_stopped` lifecycle events
+  - [x] `dictation_started` (with backend tag)
+  - [x] `dictation_completed` (with duration_ms, word_count, transcript_length metrics)
+  - [x] `mic_permission_denied` / `recording_start_failed` error events
+
+### Web User Dashboard
+
+- [x] `telemetry.ts` — Browser client with `sendBeacon`, `localStorage` install ID, auto-flush on visibility change ([`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6))
+- [x] `/api/telemetry/ingest/route.ts` — Server-side proxy to platform-service ([`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6))
+- [x] `providers.tsx` — `initTelemetry()` called on app mount ([`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6))
+
+### Android (Not Started)
+
+- [ ] `TelemetryClient.kt` — Kotlin telemetry client for Android keyboard + main app
+- [ ] Instrument `LysnrInputMethodService.kt` — dictation lifecycle events
+- [ ] Offline queue using SharedPreferences or Room
+- [ ] Flush on app foreground via `ProcessLifecycleOwner`
+
+---
+
+## Phase 3 — Intelligence & Admin Tooling
+
+### Error Clustering & Alerting
+
+- [ ] Automated error fingerprinting (hash of `platform + channel + module + eventName + errorDomain + errorCode`)
+- [ ] Cluster severity escalation (`warn` → `error` → `fatal` based on count + affected users)
+- [ ] Slack/email alerting when cluster severity escalates
+- [ ] Dashboard: cluster timeline chart showing occurrence rate over time
+- [ ] Dashboard: "Resolve" / "Ignore" actions on clusters
+
+### Geo Enrichment
+
+- [ ] Server-side IP → country/region lookup on ingestion (GeoLite2 or Azure Maps)
+- [ ] Populate `regionCode` field (e.g., `US:WA`) for events without client-provided region
+- [ ] Admin UI: geographic heatmap of error distribution
+- [ ] Policy targeting by `regionCode` ranges
+
+### Collection Policy Builder UI
+
+- [ ] Admin page: `/ops/telemetry-policies`
+- [ ] CRUD UI for collection policies (name, enabled, targeting rules, sampling rates)
+- [ ] Targeting builder: platform checkboxes, version range inputs, percentage slider
+- [ ] Live preview: "N clients would match this policy"
+- [ ] Policy activation/deactivation toggle
+- [ ] Scheduling: `startsAt` / `expiresAt` date pickers
+
+### Privacy & Compliance
+
+- [ ] PII regex scanner on ingestion (email, phone, SSN patterns → redact before storage)
+- [ ] Admin UI: GDPR erasure tool (search by userId → delete all events)
+- [ ] Retention policy enforcement (TTL-based auto-expiry per container)
+- [ ] Audit log entries for policy changes and data deletions
+
+### Performance & Scale
+
+- [ ] Client-side config caching (poll `/api/telemetry/config` with `If-None-Match` ETag)
+- [ ] Server-side rate limiting per `installId` (100 events/min default)
+- [ ] Cosmos DB indexing policy tuning for `telemetry_events` (composite indexes on query patterns)
+- [ ] Batch ingestion deduplication by `event.id`
+- [ ] Prometheus metrics for ingestion throughput and error rates
+
+---
+
+## Architecture Summary
+
+```
+┌─────────────────────┐    ┌──────────────────────┐    ┌───────────────────┐
+│  iOS Keyboard Ext   │    │  iOS Main App        │    │  Desktop (Python) │
+│  LysnrTelemetry     │───▶│  TelemetryService    │    │  PlatformTelemetry│
+│  (App Group queue)  │    │  (drains queue)      │    │  (urllib POST)    │
+└─────────┬───────────┘    └──────────┬───────────┘    └────────┬──────────┘
+          │                           │                         │
+          │             POST /api/telemetry/events              │
+          ▼                           ▼                         ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│  Platform Service (Fastify)                                             │
+│  POST   /api/telemetry/events    — batch ingestion                      │
+│  GET    /api/telemetry/config    — client collection config             │
+│  GET    /api/telemetry/query     — admin event search                   │
+│  GET    /api/telemetry/clusters  — error cluster aggregation            │
+│  CRUD   /api/telemetry/policies  — collection policy management         │
+│  DELETE /api/telemetry/gdpr/:id  — GDPR erasure                         │
+└────────────────────────────┬────────────────────────────────────────────┘
+                             │
+                             ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│  Azure Cosmos DB                                                        │
+│  telemetry_events               pk: productId:yyyyMM:platform           │
+│  telemetry_error_clusters       pk: /productId                          │
+│  telemetry_collection_policies  pk: /productId                          │
+└─────────────────────────────────────────────────────────────────────────┘
+                             │
+                             ▼
+┌─────────────────────────────────────────────────────────────────────────┐
+│  Admin Dashboard (Next.js)                                              │
+│  /ops/client-logs                  — event viewer + error clusters      │
+│  /ops/telemetry-policies (Phase 3) — policy builder UI                  │
+└─────────────────────────────────────────────────────────────────────────┘
+
+┌─────────────────────┐
+│  Web User Dashboard │
+│  telemetry.ts       │──▶ POST /api/telemetry/ingest ──▶ platform-service
+│  (sendBeacon)       │
+└─────────────────────┘
+```
+
+---
+
+## Commit Log
+
+| Date       | Repo        | Commit                                                                                  | Description                                                             |
+| ---------- | ----------- | --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
+| 2026-02-16 | common-plat | [`c59049e`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c59049e) | Design doc: client telemetry & log insights                             |
+| 2026-02-16 | common-plat | [`083cf02`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/083cf02) | Fix 18 gaps in telemetry design doc (rev 2)                             |
+| 2026-02-16 | common-plat | [`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff) | Telemetry module — ingest, config, query, clusters, policies (34 tests) |
+| 2026-02-17 | voice-agent | [`e546475`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/e546475) | iOS keyboard telemetry client + KeyboardViewController instrumentation  |
+| 2026-02-17 | voice-agent | [`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94) | Admin dashboard Client Logs page + sidebar nav                          |
+| 2026-02-17 | voice-agent | [`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa) | iOS main app TelemetryService + Desktop Python platform_telemetry       |
+| 2026-02-17 | voice-agent | [`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6) | Web user-dashboard telemetry client + ingest proxy                      |
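
Reviewer note on the Phase 3 fingerprinting item: the roadmap specifies the cluster key only as a hash of `platform + channel + module + eventName + errorDomain + errorCode`. One way this could look is sketched below, in Python for brevity (the platform-service itself is TypeScript). The function name `error_fingerprint`, the SHA-256 choice, the `|` separator, the 16-character truncation, and the sample field values are all assumptions made for illustration; none of them come from the design doc.

```python
import hashlib

# The six fields listed in the roadmap item. appVersion is deliberately left
# out so that clusters stay version-agnostic, matching the Phase 0 review note.
FINGERPRINT_FIELDS = (
    "platform", "channel", "module", "eventName", "errorDomain", "errorCode",
)


def error_fingerprint(event: dict) -> str:
    """Return a stable hex fingerprint for an error event (hypothetical helper)."""
    # Missing fields become empty strings so partially filled events still cluster.
    parts = [str(event.get(field, "")) for field in FINGERPRINT_FIELDS]
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing to the same key.
    canonical = "|".join(parts)
    # Truncating to 16 hex characters is an assumed trade-off for readability.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical sample events: same error reported by two app versions.
v1 = {
    "platform": "ios", "channel": "keyboard", "module": "dictation",
    "eventName": "recognition_failed", "errorDomain": "speech",
    "errorCode": 203, "appVersion": "1.0.0",
}
v2 = dict(v1, appVersion="1.1.0")

# Both versions land in the same cluster because appVersion is not hashed.
print(error_fingerprint(v1) == error_fingerprint(v2))
```

Under this sketch the two sample events share a fingerprint, so a cluster record could accumulate both versions in its `affectedVersions` array instead of splitting per release.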