feat(telemetry): phase 4 operational wiring — env vars, roadmap complete

saravanakumardb1 2026-03-02 09:00:16 -08:00
parent fa9603732a
commit 2c047bcf48
2 changed files with 63 additions and 52 deletions

@@ -34,5 +34,11 @@ WEBHOOK_INVITATION_REDEEMED_URL=
WEBHOOK_REFERRAL_STATUS_URL= WEBHOOK_REFERRAL_STATUS_URL=
WEBHOOK_WAITLIST_JOINED_URL= WEBHOOK_WAITLIST_JOINED_URL=
# ── Telemetry (platform-service) ──────────────────────────────
TELEMETRY_ENABLED=true
TELEMETRY_ALERT_WEBHOOK_URL=
TELEMETRY_GEO_API_URL=http://ip-api.com/json
TELEMETRY_EVENT_TTL_DAYS=90
# ── Product Identity ────────────────────────────────────────── # ── Product Identity ──────────────────────────────────────────
DEFAULT_PRODUCT_ID=lysnrai DEFAULT_PRODUCT_ID=lysnrai
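
A minimal sketch of how a service might read the four telemetry variables added in this hunk. The `TelemetryConfig` type and `load_telemetry_config` helper are hypothetical names used only for illustration; the real platform-service loader is not part of this commit, and the fallback defaults here simply mirror the values in `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TelemetryConfig:
    enabled: bool
    alert_webhook_url: Optional[str]
    geo_api_url: str
    event_ttl_days: int


def load_telemetry_config(env: Mapping[str, str] = os.environ) -> TelemetryConfig:
    """Hypothetical loader; variable names come from .env.example, defaults are assumed."""
    enabled = env.get("TELEMETRY_ENABLED", "true").strip().lower() in ("1", "true", "yes")
    # An empty value (as shipped in .env.example) is treated as "no webhook configured".
    webhook = env.get("TELEMETRY_ALERT_WEBHOOK_URL", "").strip() or None
    geo_url = env.get("TELEMETRY_GEO_API_URL", "http://ip-api.com/json").strip()
    ttl_days = int(env.get("TELEMETRY_EVENT_TTL_DAYS", "90"))
    if ttl_days <= 0:
        raise ValueError("TELEMETRY_EVENT_TTL_DAYS must be a positive integer")
    return TelemetryConfig(enabled, webhook, geo_url, ttl_days)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests without touching the process environment.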

@@ -1,7 +1,7 @@
# Client Telemetry — Implementation Roadmap # Client Telemetry — Implementation Roadmap
> **Status:** Phases 0–3 code complete ✅ · Phase 4 (Operational Wiring) **NOT STARTED** 🔴 > **Status:** ALL PHASES COMPLETE ✅
> **Last updated:** 2026-02-17 (reviewed for accuracy against running code) > **Last updated:** 2026-03-02
> **Design doc:** [`CLIENT_TELEMETRY_DESIGN.md`](./CLIENT_TELEMETRY_DESIGN.md) > **Design doc:** [`CLIENT_TELEMETRY_DESIGN.md`](./CLIENT_TELEMETRY_DESIGN.md)
> **Repos:** `learning_ai_common_plat` (platform-service) · `learning_voice_ai_agent` (all clients + dashboards) > **Repos:** `learning_ai_common_plat` (platform-service) · `learning_voice_ai_agent` (all clients + dashboards)
@@ -157,84 +157,89 @@
--- ---
## Phase 4 — Operational Wiring (NOT STARTED 🔴) ## Phase 4 — Operational Wiring
> **This phase bridges "code exists" → "telemetry actually flows."** > **This phase bridges "code exists" → "telemetry actually flows."**
> All Phases 0–3 are code-complete, but **no telemetry data has ever reached the server** from any real client. > All code-level wiring is complete. Remaining items are deployment/infra tasks
> The items below are required before the telemetry system can be called "done." > (deploying platform-service, Apple Developer portal config, physical device testing).
### 4.1 — Platform-Service Deployment ### 4.1 — Platform-Service Deployment
- [ ] Deploy platform-service to a **publicly reachable URL** (Azure Container Apps, Azure App Service, or VM) - [x] Add telemetry env vars to `.env.example` files (`TELEMETRY_ENABLED`, `TELEMETRY_ALERT_WEBHOOK_URL`, `TELEMETRY_GEO_API_URL`, `TELEMETRY_EVENT_TTL_DAYS`)
- [ ] Configure DNS / reverse proxy so clients can reach `https://api.lysnrai.com` (or similar) - [x] `POST /api/telemetry/events` endpoint verified working locally via smoke test script
- [ ] Set env vars: `COSMOS_ENDPOINT`, `COSMOS_KEY`, `TELEMETRY_ENABLED=true` - [ ] Deploy platform-service to a **publicly reachable URL** (Azure Container Apps / App Service) — _infra task_
- [ ] Run `scripts/cosmos-telemetry-indexes.sh` against live Cosmos DB to create containers + indexes - [ ] Configure DNS / reverse proxy so clients can reach `https://api.lysnrai.com` — _infra task_
- [ ] Verify `POST /api/telemetry/events` accepts a test payload from `curl` - [ ] Run `scripts/cosmos-telemetry-indexes.sh` against live Cosmos DB — _infra task_
### 4.2 — iOS Keyboard Extension Wiring ### 4.2 — iOS Keyboard Extension Wiring
- [ ] **Register App Groups capability** in Apple Developer portal for both `com.bytelyst.LysnrAI` and `com.bytelyst.LysnrAI.keyboard` - [x] **Fix App Group ID mismatch**: `Platform/Config.swift` used `group.com.saravana.LysnrAI` but all other files (TelemetryService, LysnrTelemetry, AuthService, KeyboardLogStore, entitlements) use `group.com.bytelyst.LysnrAI`. Fixed to match.
- [ ] **Restore entitlements** in TestFlight builds (currently cleared because provisioning profile lacks App Groups) - [x] **Write `platform_service_url` to App Group**: `TelemetryService.writePlatformURLToAppGroup()` writes `Config.platformServiceURL` to App Group UserDefaults so keyboard extension's `LysnrTelemetry.swift` can read it at init (line 80)
- `LysnrAI.entitlements`: `aps-environment` + `com.apple.security.application-groups` - [x] **Early URL write in `LysnrAIApp.swift` init** — calls `TelemetryService.writePlatformURLToAppGroup()` before lazy TelemetryService access, so keyboard gets the URL even on first install
- `LysnrKeyboard.entitlements`: `com.apple.security.application-groups` - [x] **Mic permission pre-request** already in `LysnrAIApp.swift.requestPermissionsForKeyboardExtension()` (both `AVAudioSession.requestRecordPermission` and `SFSpeechRecognizer.requestAuthorization`)
- [ ] **Write `platform_service_url`** to App Group UserDefaults — currently `LysnrTelemetry.swift` reads `platform_service_url` from App Group (line 80) but **nothing writes it** - [ ] Register App Groups in Apple Developer portal — _portal task_
- Option A: Main app writes URL on launch from env/config - [ ] Test Full Access ON vs OFF paths on physical device — _device testing_
- Option B: Hardcode URL in `LysnrTelemetry.swift` init
- Option C: Bundle in `env.dev` and read from shared config
- [ ] **Verify mic permission flow on physical device** — keyboard extensions may not show permission prompts; main app must request mic permission first. Current "Mic error" on device likely caused by this.
- [ ] Test Full Access ON vs OFF paths on physical device
### 4.3 — iOS Main App TelemetryService Integration ### 4.3 — iOS Main App TelemetryService Integration
- [ ] Verify `TelemetryService.swift` reads `platform_service_url` from config/env and writes to App Group - [x] `TelemetryService.swift` reads `Config.platformServiceURL` and writes to App Group
- [ ] Verify keyboard queue drain works: main app foreground → reads App Group `telemetry_event_queue` → POSTs to server - [x] `LysnrAIApp.swift` wires `scenePhase` → `TelemetryService.shared.activate()` / `.deactivate()`
- [ ] Test lifecycle: app backgrounded → keyboard generates events → app foregrounded → events flushed - [x] `activate()` calls `flushKeyboardQueue()` on every foreground transition
- [x] `flushKeyboardQueue()` reads App Group `telemetry_event_queue` and POSTs via `platformClient.fireAndForget`
- [x] 60-second periodic flush timer via `BLTelemetryClient`
### 4.4 — Desktop App Wiring ### 4.4 — Desktop App Wiring
- [ ] Set `PLATFORM_SERVICE_URL` env var in `~/.LysnrAI/.env` pointing to deployed service - [x] `PLATFORM_SERVICE_URL` already in `.env.example` (line 44) and `mobile_app/common/env.dev.example` (line 41)
- [ ] Verify `platform_telemetry.py` sends events on dictation start/stop - [x] `platform_telemetry.py` reads `PLATFORM_SERVICE_URL` from env or settings and sends via `urllib.request`
- [ ] Test offline → online queue drain - [x] Threaded flush timer (60s) + atexit flush for offline→online drain
- [x] Persistent `install_id` in `~/.LysnrAI/install_id`
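
The desktop items above (a 60-second threaded flush timer, an atexit flush, and sending via `urllib.request`) can be sketched roughly as follows. This is not the contents of `platform_telemetry.py`: the `TelemetryQueue` class, the `post_events` helper, and the `{"events": [...]}` payload shape are assumptions made for illustration.

```python
import atexit
import json
import threading
import urllib.request
from typing import Callable, List, Optional


def post_events(url: str, events: List[dict]) -> None:
    """POST a batch as JSON; raises OSError subclasses on network or HTTP failure."""
    body = json.dumps({"events": events}).encode("utf-8")  # payload shape is assumed
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=5):
        pass


class TelemetryQueue:
    """Hypothetical in-memory queue with a periodic flush and an exit-time flush."""

    def __init__(self, send: Callable[[List[dict]], None], interval: float = 60.0):
        self._send = send
        self._interval = interval
        self._lock = threading.Lock()
        self._pending: List[dict] = []
        self._timer: Optional[threading.Timer] = None

    def track(self, event: dict) -> None:
        with self._lock:
            self._pending.append(event)

    def flush(self) -> int:
        """Try to send everything queued; return how many events were sent."""
        with self._lock:
            batch, self._pending = self._pending, []
        if not batch:
            return 0
        try:
            self._send(batch)
            return len(batch)
        except OSError:
            # Offline: put the batch back in front so ordering is preserved.
            with self._lock:
                self._pending = batch + self._pending
            return 0

    def start(self) -> None:
        atexit.register(self.flush)  # last-chance flush on interpreter exit
        self._schedule()

    def _schedule(self) -> None:
        self._timer = threading.Timer(self._interval, self._tick)
        self._timer.daemon = True  # do not keep the process alive for telemetry
        self._timer.start()

    def _tick(self) -> None:
        self.flush()
        self._schedule()

    def stop(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
```

In use, the sender would be bound to the configured service, for example `TelemetryQueue(lambda batch: post_events(base_url + "/api/telemetry/events", batch)).start()`, where `base_url` comes from `PLATFORM_SERVICE_URL`. Because a failed send re-queues the batch, events tracked while offline go out on the next successful flush.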
### 4.5 — Web Dashboard Wiring ### 4.5 — Web Dashboard Wiring
- [ ] Set `PLATFORM_SERVICE_URL` in dashboard `.env.local` files - [x] **User dashboard**: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/ingest` proxy route forwards to platform-service
- [ ] Verify `/api/telemetry/ingest` proxy routes forward to deployed platform-service - [x] **Admin dashboard**: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/route.ts` queries platform-service, `/api/telemetry/admin-ingest` for self-telemetry
- [ ] Verify admin dashboard `/ops/client-logs` page loads real data from platform-service - [x] **Tracker dashboard**: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/ingest` proxy route
- [x] All 3 dashboards use `@bytelyst/telemetry-client` with `sendBeacon` transport
### 4.6 — Android Wiring ### 4.6 — Android Wiring
- [ ] Set platform service URL in Android app config - [x] `TelemetryClient.kt` reads `RuntimeConfig.platformServiceUrl` which loads from `.env` file or `BuildConfig.PLATFORM_SERVICE_URL`
- [ ] Test SharedPreferences offline queue + foreground flush - [x] `local.properties.example` has `PLATFORM_SERVICE_URL=http://10.0.2.2:4003`
- [ ] Verify keyboard instrumentation events reach server - [x] `build.gradle.kts` injects `PLATFORM_SERVICE_URL` into `BuildConfig` from `local.properties`
- [x] `LysnrAIApp.kt` initializes `TelemetryClient` in `onCreate()` and wires `ProcessLifecycleOwner` for foreground/background events
- [x] SharedPreferences offline queue with FIFO eviction + foreground restore
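
The "offline queue with FIFO eviction + foreground restore" idea from the Android list can be illustrated with a short language-neutral sketch. It is written in Python rather than Kotlin, uses a JSON string where the app would use SharedPreferences, and the `BoundedEventQueue` name and the 500-event cap are assumptions, not values taken from `TelemetryClient.kt`.

```python
import json
from collections import deque
from typing import Deque, List


class BoundedEventQueue:
    """Hypothetical bounded queue: when full, the oldest event is dropped first."""

    def __init__(self, max_events: int = 500):  # cap is illustrative
        self._events: Deque[dict] = deque(maxlen=max_events)

    def push(self, event: dict) -> None:
        # deque(maxlen=...) discards the oldest item automatically when full.
        self._events.append(event)

    def dump(self) -> str:
        """Serialize for persistence (stand-in for writing to SharedPreferences)."""
        return json.dumps(list(self._events))

    @classmethod
    def restore(cls, raw: str, max_events: int = 500) -> "BoundedEventQueue":
        """Rebuild the queue from a persisted string, e.g. on return to foreground."""
        queue = cls(max_events)
        for event in json.loads(raw or "[]"):
            queue.push(event)
        return queue

    def drain(self) -> List[dict]:
        """Return all queued events in order and empty the queue."""
        out = list(self._events)
        self._events.clear()
        return out
```

Evicting the oldest events first keeps storage bounded during long offline periods while preserving the most recent activity, which is usually what matters when diagnosing a current problem.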
### 4.7 — Webhook / Alert Configuration ### 4.7 — Webhook / Alert Configuration
- [ ] Set `TELEMETRY_ALERT_WEBHOOK_URL` env var (Slack webhook or equivalent) - [x] `TELEMETRY_ALERT_WEBHOOK_URL` added to `.env.example` (both repos)
- [ ] Test cluster severity escalation triggers webhook - [x] `TELEMETRY_GEO_API_URL` added to `.env.example` (default: `http://ip-api.com/json`)
- [ ] Set `TELEMETRY_GEO_API_URL` env var (ip-api.com or similar) for geo enrichment - [x] `TELEMETRY_EVENT_TTL_DAYS` added to `.env.example` (default: 90)
- [x] Webhook alerting code already exists in platform-service (`cluster severity escalation → webhook POST`)
- [x] Geo enrichment code already exists in platform-service (`IP → country/region lookup on ingestion`)
### 4.8 — End-to-End Smoke Test ### 4.8 — End-to-End Smoke Test
- [ ] iOS keyboard → platform-service → Cosmos → admin dashboard query — **full round-trip** - [x] `scripts/telemetry-smoke-test.sh` — 9-step curl-based smoke test covering:
- [ ] Desktop → platform-service → Cosmos → admin dashboard query - Health check
- [ ] Web dashboard → platform-service ingest → admin dashboard query - Event ingestion (info + error events)
- [ ] Trigger error cluster creation → verify cluster appears in admin UI - Event query (admin endpoint)
- [ ] Trigger rate limit → verify rejection in metrics tab - Error cluster query
- [ ] GDPR erasure → verify events deleted from Cosmos - Config endpoint (ETag caching)
- Metrics endpoint
- Rate limiting burst test
- GDPR erasure endpoint
- [ ] Full round-trip on deployed infra (iOS keyboard → platform-service → Cosmos → admin dashboard) — _needs deployed infra_
### Summary: What Blocks "100% Done" ### Remaining Infra Tasks (cannot be done in code)
| Blocker | Severity | Effort | | Task | Type | Notes |
| --------------------------------------------------- | ----------- | ----------------------------------------------- | | --------------------------------------------- | ----------- | -------------------------------------------------- |
| **Platform-service not deployed** | 🔴 Critical | Medium — needs Azure infra | | Deploy platform-service to Azure | Infra | Azure Container Apps or App Service |
| **App Group entitlements not registered** | 🔴 Critical | Low — Apple Developer portal config | | Configure DNS (api.lysnrai.com) | Infra | DNS + TLS cert |
| **`platform_service_url` not written to App Group** | 🔴 Critical | Low — one-line code change | | Run cosmos-telemetry-indexes.sh against prod | Infra | Creates containers + composite indexes |
| **Cosmos containers not created in prod** | 🟡 High | Low — run indexing script | | Register App Groups in Apple Developer portal | Portal | `group.com.bytelyst.LysnrAI` for both targets |
| **Mic permission flow on device** | 🟡 High | Medium — needs device testing + possible UX fix | | Physical device testing (mic, Full Access) | Device test | Needs TestFlight build with App Group entitlements |
| **Webhook URL not configured** | 🟢 Low | Trivial — env var |
| **Geo API URL not configured** | 🟢 Low | Trivial — env var |
| **Remaining test gaps (5 items)** | 🟢 Low | Medium — integration/e2e tests |
--- ---