docs: add telemetry implementation roadmap with phase checkboxes and commit links
Commit c3d697711e (parent ce4c4ff53d)

docs/WINDSURF/TELEMETRY_ROADMAP.md (new file, 192 lines added)

# Client Telemetry — Implementation Roadmap

> **Status:** Phase 2 complete except Android (not started); Phase 3 pending
> **Last updated:** 2026-02-17
> **Design doc:** [`CLIENT_TELEMETRY_DESIGN.md`](./CLIENT_TELEMETRY_DESIGN.md)
> **Repos:** `learning_ai_common_plat` (platform-service) · `learning_voice_ai_agent` (all clients + dashboards)

---

## Phase 0 — Design & Review

- [x] Write comprehensive telemetry design doc — schema, APIs, admin UX, privacy guardrails ([`c59049e`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c59049e))
- [x] Systematic review: identify and fix 18 bugs/gaps in the design doc ([`083cf02`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/083cf02))
  - TTL format (ISO → seconds), `regionCode` prefix format, missing `pk` field
  - Auth model for keyboard extension (`X-Install-Token`)
  - Config endpoint query params (`userId`/`anonymousInstallId`)
  - Error clustering made version-agnostic (`affectedVersions` array)
  - GDPR erasure endpoint added
  - iOS offline queue strategy (App Group UserDefaults, FIFO eviction)
  - Global defaults for `batchSize`/`flushInterval`/`maxQueueSize`

---

## Phase 1 — MVP (iOS Keyboard + Backend + Admin UI)

### Platform-Service Telemetry Module

- [x] `types.ts` — Zod schemas for events, policies, clusters, queries ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
- [x] `repository.ts` — Cosmos DB CRUD for events, policies, clusters ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
- [x] `routes.ts` — Fastify routes: ingestion, config, admin query, clusters, policy CRUD, GDPR erasure ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
- [x] `telemetry.test.ts` — 34 Vitest tests for schemas + policy evaluation ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
- [x] Register telemetry routes in `server.ts` ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))
- [x] Add Cosmos containers (`telemetry_events`, `telemetry_error_clusters`, `telemetry_collection_policies`) to `cosmos-init.ts` ([`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff))

### iOS Keyboard Telemetry Client

- [x] `LysnrTelemetry.swift` — Singleton client with App Group offline queue, `X-Install-Token` auth, 200-event cap ([`e546475`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/e546475))
- [x] Instrument `KeyboardViewController.swift` — 10+ telemetry points ([`e546475`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/e546475))
  - [x] `session_started` / `session_ended` (with full `DictationContext`)
  - [x] `backend_selected` (azure / local + reason)
  - [x] `recognition_started` / `recognition_failed`
  - [x] `mic_permission_denied`
  - [x] `insert_noop` detection
  - [x] `error_recovery_attempted` (local→azure, azure→local)
  - [x] Session summary metrics (duration, segments, words, transcript length)

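
The 200-event cap with FIFO eviction can be pictured with a short, language-neutral sketch. The Python below is illustrative only: the shipped client is Swift and persists its queue through the App Group, and the class name `BoundedEventQueue` and its methods are hypothetical.

```python
from collections import deque


class BoundedEventQueue:
    """Offline queue that keeps only the newest `max_size` events."""

    def __init__(self, max_size: int = 200) -> None:
        # deque(maxlen=N) drops the oldest item once N is exceeded,
        # which gives FIFO eviction without extra bookkeeping.
        self._events = deque(maxlen=max_size)

    def enqueue(self, event: dict) -> None:
        self._events.append(event)

    def drain(self) -> list:
        """Return all queued events, oldest first, and clear the queue."""
        drained = list(self._events)
        self._events.clear()
        return drained

    def __len__(self) -> int:
        return len(self._events)
```

Evicting the oldest events first favors recent context, which is usually what matters when diagnosing the latest failure on a device that has been offline for a while.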
### Admin Dashboard — Client Logs Page

- [x] `/ops/client-logs/page.tsx` — Events table + Error Clusters tab ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))
  - [x] Stat cards (total events, errors, warnings, keyboard events)
  - [x] Filters (platform, channel, level, module, free-text search)
  - [x] Expandable event detail rows (device, tags, metrics, dictation context)
  - [x] Error Clusters tab with severity, affected versions, user count
- [x] `/api/telemetry/route.ts` — API route proxying to platform-service ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))
- [x] `platform-client.ts` — `queryTelemetryEvents` + `queryTelemetryClusters` ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))
- [x] `sidebar-nav.tsx` — "Client Logs" nav item with `FileText` icon ([`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94))

---

## Phase 2 — Full Platform Coverage

### iOS Main App

- [x] `TelemetryService.swift` — Main app telemetry service with App Group queue drain on foreground ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
- [x] `LysnrAIApp.swift` — `scenePhase` integration for activate/deactivate lifecycle ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
  - [x] `app_foregrounded` / `app_backgrounded` events
  - [x] Keyboard queue flush on every foreground transition
  - [x] 60-second periodic flush timer

### Desktop App (Python)

- [x] `platform_telemetry.py` — `PlatformTelemetry` singleton with `urllib.request` POST, threaded flush timer, persistent `install_id` in `~/.LysnrAI/install_id` ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
- [x] `main.py` instrumentation ([`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa))
  - [x] `app_started` / `app_stopped` lifecycle events
  - [x] `dictation_started` (with backend tag)
  - [x] `dictation_completed` (with duration_ms, word_count, transcript_length metrics)
  - [x] `mic_permission_denied` / `recording_start_failed` error events

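
The pieces named above (a buffered client, a `urllib.request` POST, a threaded flush timer, a persisted install ID) fit together roughly as follows. This is a minimal sketch, not the contents of `platform_telemetry.py`: the class name `TelemetrySketch`, the payload shape (`installId` plus an `events` list), the 60-second default interval, and the injectable `send` callable are all assumptions made for illustration.

```python
import json
import threading
import urllib.request
import uuid
from pathlib import Path


def load_install_id(path: Path) -> str:
    """Read the persistent install ID, creating it on first run."""
    if path.exists():
        return path.read_text().strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    install_id = str(uuid.uuid4())
    path.write_text(install_id)
    return install_id


def post_json(url: str, payload: dict, timeout: float = 5.0) -> None:
    """POST a JSON body using only the standard library."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass


class TelemetrySketch:
    def __init__(self, url, install_id, send=post_json, flush_interval=60.0):
        self._url = url
        self._install_id = install_id
        self._send = send  # injectable so the client can be tested offline
        self._flush_interval = flush_interval  # hypothetical default
        self._lock = threading.Lock()
        self._buffer = []

    def track(self, name, **fields):
        with self._lock:
            self._buffer.append({"eventName": name, **fields})

    def flush(self):
        """Send buffered events; put them back if the send fails."""
        with self._lock:
            batch, self._buffer = self._buffer, []
        if not batch:
            return 0
        try:
            self._send(self._url, {"installId": self._install_id, "events": batch})
        except OSError:  # urllib.error.URLError is a subclass of OSError
            with self._lock:
                self._buffer = batch + self._buffer
            return 0
        return len(batch)

    def start_timer(self):
        """Flush now, then re-arm a daemon timer for the next flush."""
        self.flush()
        timer = threading.Timer(self._flush_interval, self.start_timer)
        timer.daemon = True
        timer.start()
        return timer
```

A caller would build the client once at startup, for example with `load_install_id(Path.home() / ".LysnrAI" / "install_id")`, call `track("app_started")` and similar at the instrumentation points, and call `flush()` once more on shutdown so `app_stopped` is not lost.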
### Web User Dashboard

- [x] `telemetry.ts` — Browser client with `sendBeacon`, `localStorage` install ID, auto-flush on visibility change ([`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6))
- [x] `/api/telemetry/ingest/route.ts` — Server-side proxy to platform-service ([`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6))
- [x] `providers.tsx` — `initTelemetry()` called on app mount ([`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6))

### Android (Not Started)

- [ ] `TelemetryClient.kt` — Kotlin telemetry client for Android keyboard + main app
- [ ] Instrument `LysnrInputMethodService.kt` — dictation lifecycle events
- [ ] Offline queue using SharedPreferences or Room
- [ ] Flush on app foreground via `ProcessLifecycleOwner`

---

## Phase 3 — Intelligence & Admin Tooling

### Error Clustering & Alerting

- [ ] Automated error fingerprinting (hash of `platform + channel + module + eventName + errorDomain + errorCode`)
- [ ] Cluster severity escalation (`warn` → `error` → `fatal` based on count + affected users)
- [ ] Slack/email alerting when cluster severity escalates
- [ ] Dashboard: cluster timeline chart showing occurrence rate over time
- [ ] Dashboard: "Resolve" / "Ignore" actions on clusters

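
The fingerprint described in the first item can be sketched as a hash over the six listed fields. The choice of SHA-256, the `\x1f` separator, and the function name `error_fingerprint` are assumptions; the roadmap only fixes which fields participate.

```python
import hashlib

FINGERPRINT_FIELDS = (
    "platform",
    "channel",
    "module",
    "eventName",
    "errorDomain",
    "errorCode",
)


def error_fingerprint(event: dict) -> str:
    """Hash the identifying fields of an error event into a stable cluster key."""
    # Joining with a separator that should not occur in field values keeps
    # ("ab", "c") and ("a", "bc") from producing the same input string.
    parts = [str(event.get(field, "")) for field in FINGERPRINT_FIELDS]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Because the app version is not among the hashed fields, the same error reported by two releases lands in one cluster, which matches the version-agnostic clustering (`affectedVersions` array) decided in Phase 0.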
### Geo Enrichment

- [ ] Server-side IP → country/region lookup on ingestion (GeoLite2 or Azure Maps)
- [ ] Populate `regionCode` field (e.g., `US:WA`) for events without client-provided region
- [ ] Admin UI: geographic heatmap of error distribution
- [ ] Policy targeting by `regionCode` ranges

### Collection Policy Builder UI

- [ ] Admin page: `/ops/telemetry-policies`
- [ ] CRUD UI for collection policies (name, enabled, targeting rules, sampling rates)
- [ ] Targeting builder: platform checkboxes, version range inputs, percentage slider
- [ ] Live preview: "N clients would match this policy"
- [ ] Policy activation/deactivation toggle
- [ ] Scheduling: `startsAt` / `expiresAt` date pickers

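
To make the targeting inputs concrete, here is one way a policy could be evaluated against a client. It is a sketch under assumptions: apart from `startsAt`, `expiresAt`, and the `enabled` flag, the field names (`platforms`, `minVersion`, `maxVersion`, `percentage`, `installId`, `appVersion`) are hypothetical, and the hash-based bucketing is a common rollout technique, not something the roadmap specifies.

```python
import hashlib
from datetime import datetime, timezone


def parse_version(version: str) -> tuple:
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def rollout_bucket(install_id: str, policy_name: str) -> int:
    """Map an install to a stable bucket in [0, 100) for percentage targeting."""
    digest = hashlib.sha256(f"{policy_name}:{install_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def policy_matches(policy: dict, client: dict, now: datetime) -> bool:
    """Return True when the client falls inside every targeting rule."""
    if not policy.get("enabled", False):
        return False
    # Scheduling window: active from startsAt (inclusive) to expiresAt (exclusive).
    if policy.get("startsAt") and now < policy["startsAt"]:
        return False
    if policy.get("expiresAt") and now >= policy["expiresAt"]:
        return False
    platforms = policy.get("platforms")
    if platforms and client["platform"] not in platforms:
        return False
    version = parse_version(client["appVersion"])
    if policy.get("minVersion") and version < parse_version(policy["minVersion"]):
        return False
    if policy.get("maxVersion") and version > parse_version(policy["maxVersion"]):
        return False
    # Percentage slider: the same install always lands in the same bucket.
    percentage = policy.get("percentage", 100)
    return rollout_bucket(client["installId"], policy["name"]) < percentage
```

Running `policy_matches` over the known installs and counting the `True` results is one straightforward way to back the "N clients would match this policy" preview.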
### Privacy & Compliance

- [ ] PII regex scanner on ingestion (email, phone, SSN patterns → redact before storage)
- [ ] Admin UI: GDPR erasure tool (search by userId → delete all events)
- [ ] Retention policy enforcement (TTL-based auto-expiry per container)
- [ ] Audit log entries for policy changes and data deletions

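
A regex-based scanner of the kind described in the first item might look like the sketch below. The patterns are deliberately simple, US-centric assumptions and would need tuning against real traffic; the placeholder format `[REDACTED_…]` and the function names are hypothetical.

```python
import re

# Simplified patterns for illustration; real email and phone formats vary widely.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"(?<!\w)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b")),
]


def redact_text(text: str) -> str:
    """Replace every match of every pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


def redact_event(value):
    """Recursively redact strings inside dicts and lists; leave other types alone."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact_event(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_event(item) for item in value]
    return value
```

Running the scan on ingestion, before the write to Cosmos DB, means unredacted values never reach storage, so a later erasure request has less to clean up.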
### Performance & Scale

- [ ] Client-side config caching (poll `/api/telemetry/config` with `If-None-Match` ETag)
- [ ] Server-side rate limiting per `installId` (100 events/min default)
- [ ] Cosmos DB indexing policy tuning for `telemetry_events` (composite indexes on query patterns)
- [ ] Batch ingestion deduplication by `event.id`
- [ ] Prometheus metrics for ingestion throughput and error rates

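
Two of the items above, deduplication by `event.id` and the per-`installId` limit of 100 events per minute, are simple enough to sketch in memory. The fixed-window algorithm and the names `dedupe_batch` and `FixedWindowRateLimiter` are assumptions; a multi-instance service would presumably keep this state in a shared store with expiry instead of process memory.

```python
from collections import defaultdict


def dedupe_batch(events: list, seen_ids: set) -> list:
    """Drop events whose `id` was already seen, in this batch or earlier ones."""
    fresh = []
    for event in events:
        if event["id"] in seen_ids:
            continue
        seen_ids.add(event["id"])
        fresh.append(event)
    return fresh


class FixedWindowRateLimiter:
    """Allow at most `limit` events per install within each time window."""

    def __init__(self, limit: int = 100, window_seconds: int = 60) -> None:
        self._limit = limit
        self._window = window_seconds
        # (install_id, window index) -> number of events accepted so far
        self._counts = defaultdict(int)

    def allow(self, install_id: str, event_count: int, now: float) -> int:
        """Return how many of `event_count` events fit in the current window."""
        key = (install_id, int(now // self._window))
        remaining = max(0, self._limit - self._counts[key])
        accepted = min(event_count, remaining)
        self._counts[key] += accepted
        return accepted
```

Returning the accepted count instead of a plain yes/no lets the ingestion route store part of a batch and report the remainder as rejected, which is gentler on clients that flush a large offline queue at once.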
---

## Architecture Summary

```
┌─────────────────────┐    ┌──────────────────────┐    ┌───────────────────┐
│  iOS Keyboard Ext   │    │  iOS Main App        │    │  Desktop (Python) │
│  LysnrTelemetry     │───▶│  TelemetryService    │    │  PlatformTelemetry│
│  (App Group queue)  │    │  (drains queue)      │    │  (urllib POST)    │
└─────────┬───────────┘    └──────────┬───────────┘    └────────┬──────────┘
          │                           │                         │
          │             POST /api/telemetry/events              │
          ▼                           ▼                         ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Platform Service (Fastify)                                             │
│  POST   /api/telemetry/events    — batch ingestion                      │
│  GET    /api/telemetry/config    — client collection config             │
│  GET    /api/telemetry/query     — admin event search                   │
│  GET    /api/telemetry/clusters  — error cluster aggregation            │
│  CRUD   /api/telemetry/policies  — collection policy management         │
│  DELETE /api/telemetry/gdpr/:id  — GDPR erasure                         │
└────────────────────────────┬────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Azure Cosmos DB                                                        │
│  telemetry_events               pk: productId:yyyyMM:platform           │
│  telemetry_error_clusters       pk: /productId                          │
│  telemetry_collection_policies  pk: /productId                          │
└─────────────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Admin Dashboard (Next.js)                                              │
│  /ops/client-logs                   — event viewer + error clusters     │
│  /ops/telemetry-policies (Phase 3)  — policy builder UI                 │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────┐
│  Web User Dashboard │
│  telemetry.ts       │──▶ POST /api/telemetry/ingest ──▶ platform-service
│  (sendBeacon)       │
└─────────────────────┘
```
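
The `telemetry_events` partition key shown in the diagram is a composite of product, month, and platform. A small sketch of how such a value could be built follows; bucketing the month in UTC, the function name `events_partition_key`, and the example product ID `lysnr` are assumptions not stated in the roadmap.

```python
from datetime import datetime, timezone


def events_partition_key(product_id: str, platform: str, timestamp: datetime) -> str:
    """Build a `productId:yyyyMM:platform` key for the telemetry_events container."""
    # Normalize to UTC first so an event is bucketed the same way
    # regardless of the time zone the client reported it in.
    # `timestamp` is expected to be timezone-aware.
    month = timestamp.astimezone(timezone.utc).strftime("%Y%m")
    return f"{product_id}:{month}:{platform}"


key = events_partition_key(
    "lysnr", "ios", datetime(2026, 2, 17, 12, 0, tzinfo=timezone.utc)
)
print(key)  # lysnr:202602:ios
```

Including the month in the key spreads writes across a new logical partition each month per platform, and lets admin queries scoped to a product, month, and platform target a single partition.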

---

## Commit Log

| Date | Repo | Commit | Description |
| --- | --- | --- | --- |
| 2026-02-16 | common-plat | [`c59049e`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c59049e) | Design doc: client telemetry & log insights |
| 2026-02-16 | common-plat | [`083cf02`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/083cf02) | Fix 18 gaps in telemetry design doc (rev 2) |
| 2026-02-16 | common-plat | [`ce4c4ff`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/ce4c4ff) | Telemetry module — ingest, config, query, clusters, policies (34 tests) |
| 2026-02-17 | voice-agent | [`e546475`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/e546475) | iOS keyboard telemetry client + KeyboardViewController instrumentation |
| 2026-02-17 | voice-agent | [`d202f94`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/d202f94) | Admin dashboard Client Logs page + sidebar nav |
| 2026-02-17 | voice-agent | [`a173baa`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a173baa) | iOS main app TelemetryService + Desktop Python platform_telemetry |
| 2026-02-17 | voice-agent | [`130e1d6`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/130e1d6) | Web user-dashboard telemetry client + ingest proxy |