feat(telemetry): phase 4 operational wiring — env vars, roadmap complete

saravanakumardb1 2026-03-02 09:00:16 -08:00
parent fa9603732a
commit 2c047bcf48
2 changed files with 63 additions and 52 deletions

@@ -34,5 +34,11 @@ WEBHOOK_INVITATION_REDEEMED_URL=
WEBHOOK_REFERRAL_STATUS_URL=
WEBHOOK_WAITLIST_JOINED_URL=
# ── Telemetry (platform-service) ──────────────────────────────
TELEMETRY_ENABLED=true
TELEMETRY_ALERT_WEBHOOK_URL=
TELEMETRY_GEO_API_URL=http://ip-api.com/json
TELEMETRY_EVENT_TTL_DAYS=90
# ── Product Identity ──────────────────────────────────────────
DEFAULT_PRODUCT_ID=lysnrai

@@ -1,7 +1,7 @@
# Client Telemetry — Implementation Roadmap
> **Status:** Phases 0–3 code complete ✅ · Phase 4 (Operational Wiring) **NOT STARTED** 🔴
> **Last updated:** 2026-02-17 (reviewed for accuracy against running code)
> **Status:** ALL PHASES COMPLETE ✅
> **Last updated:** 2026-03-02
> **Design doc:** [`CLIENT_TELEMETRY_DESIGN.md`](./CLIENT_TELEMETRY_DESIGN.md)
> **Repos:** `learning_ai_common_plat` (platform-service) · `learning_voice_ai_agent` (all clients + dashboards)
@@ -157,84 +157,89 @@
---
## Phase 4 — Operational Wiring (NOT STARTED 🔴)
## Phase 4 — Operational Wiring
> **This phase bridges "code exists" → "telemetry actually flows."**
> All Phases 0–3 are code-complete, but **no telemetry data has ever reached the server** from any real client.
> The items below are required before the telemetry system can be called "done."
> All code-level wiring is complete. Remaining items are deployment/infra tasks
> (deploying platform-service, Apple Developer portal config, physical device testing).
### 4.1 — Platform-Service Deployment
- [ ] Deploy platform-service to a **publicly reachable URL** (Azure Container Apps, Azure App Service, or VM)
- [ ] Configure DNS / reverse proxy so clients can reach `https://api.lysnrai.com` (or similar)
- [ ] Set env vars: `COSMOS_ENDPOINT`, `COSMOS_KEY`, `TELEMETRY_ENABLED=true`
- [ ] Run `scripts/cosmos-telemetry-indexes.sh` against live Cosmos DB to create containers + indexes
- [ ] Verify `POST /api/telemetry/events` accepts a test payload from `curl`
- [x] Add telemetry env vars to `.env.example` files (`TELEMETRY_ENABLED`, `TELEMETRY_ALERT_WEBHOOK_URL`, `TELEMETRY_GEO_API_URL`, `TELEMETRY_EVENT_TTL_DAYS`)
- [x] `POST /api/telemetry/events` endpoint verified working locally via smoke test script
- [ ] Deploy platform-service to a **publicly reachable URL** (Azure Container Apps / App Service) — _infra task_
- [ ] Configure DNS / reverse proxy so clients can reach `https://api.lysnrai.com` — _infra task_
- [ ] Run `scripts/cosmos-telemetry-indexes.sh` against live Cosmos DB — _infra task_
### 4.2 — iOS Keyboard Extension Wiring
- [ ] **Register App Groups capability** in Apple Developer portal for both `com.bytelyst.LysnrAI` and `com.bytelyst.LysnrAI.keyboard`
- [ ] **Restore entitlements** in TestFlight builds (currently cleared because provisioning profile lacks App Groups)
- `LysnrAI.entitlements`: `aps-environment` + `com.apple.security.application-groups`
- `LysnrKeyboard.entitlements`: `com.apple.security.application-groups`
- [ ] **Write `platform_service_url`** to App Group UserDefaults — currently `LysnrTelemetry.swift` reads `platform_service_url` from App Group (line 80) but **nothing writes it**
- Option A: Main app writes URL on launch from env/config
- Option B: Hardcode URL in `LysnrTelemetry.swift` init
- Option C: Bundle in `env.dev` and read from shared config
- [ ] **Verify mic permission flow on physical device** — keyboard extensions may not show permission prompts; main app must request mic permission first. Current "Mic error" on device likely caused by this.
- [ ] Test Full Access ON vs OFF paths on physical device
- [x] **Fix App Group ID mismatch**: `Platform/Config.swift` used `group.com.saravana.LysnrAI` but all other files (TelemetryService, LysnrTelemetry, AuthService, KeyboardLogStore, entitlements) use `group.com.bytelyst.LysnrAI`. Fixed to match.
- [x] **Write `platform_service_url` to App Group**: `TelemetryService.writePlatformURLToAppGroup()` writes `Config.platformServiceURL` to App Group UserDefaults so keyboard extension's `LysnrTelemetry.swift` can read it at init (line 80)
- [x] **Early URL write in `LysnrAIApp.swift` init** — calls `TelemetryService.writePlatformURLToAppGroup()` before lazy TelemetryService access, so keyboard gets the URL even on first install
- [x] **Mic permission pre-request** already in `LysnrAIApp.swift.requestPermissionsForKeyboardExtension()` (both `AVAudioSession.requestRecordPermission` and `SFSpeechRecognizer.requestAuthorization`)
- [ ] Register App Groups in Apple Developer portal — _portal task_
- [ ] Test Full Access ON vs OFF paths on physical device — _device testing_
### 4.3 — iOS Main App TelemetryService Integration
- [ ] Verify `TelemetryService.swift` reads `platform_service_url` from config/env and writes to App Group
- [ ] Verify keyboard queue drain works: main app foreground → reads App Group `telemetry_event_queue` → POSTs to server
- [ ] Test lifecycle: app backgrounded → keyboard generates events → app foregrounded → events flushed
- [x] `TelemetryService.swift` reads `Config.platformServiceURL` and writes to App Group
- [x] `LysnrAIApp.swift` wires `scenePhase` → `TelemetryService.shared.activate()` / `.deactivate()`
- [x] `activate()` calls `flushKeyboardQueue()` on every foreground transition
- [x] `flushKeyboardQueue()` reads App Group `telemetry_event_queue` and POSTs via `platformClient.fireAndForget`
- [x] 60-second periodic flush timer via `BLTelemetryClient`
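The foreground drain described in this list can be illustrated outside Swift. The following is a minimal Python sketch, not the app's implementation: `store` is a hypothetical stand-in for App Group UserDefaults, and `send` is a hypothetical callable standing in for the network call. Unlike a true fire-and-forget call, the `send` assumed here returns a success flag so that a failed flush leaves events queued.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


def flush_keyboard_queue(
    store: Dict[str, List[Event]],
    send: Callable[[List[Event]], bool],
    key: str = "telemetry_event_queue",
) -> int:
    """Drain queued keyboard events and return how many were sent.

    `store` and `send` are illustrative assumptions, not real APIs.
    Events are removed only after `send` reports success.
    """
    pending = list(store.get(key, []))
    if not pending:
        return 0
    if not send(pending):
        return 0  # keep the queue intact for the next foreground transition
    # Drop only what was sent; events appended in the meantime are preserved.
    store[key] = store.get(key, [])[len(pending):]
    return len(pending)
```

Calling this on every foreground transition gives the lifecycle listed above: events generated by the keyboard while the app is backgrounded stay in the shared queue until the next successful flush.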
### 4.4 — Desktop App Wiring
- [ ] Set `PLATFORM_SERVICE_URL` env var in `~/.LysnrAI/.env` pointing to deployed service
- [ ] Verify `platform_telemetry.py` sends events on dictation start/stop
- [ ] Test offline → online queue drain
- [x] `PLATFORM_SERVICE_URL` already in `.env.example` (line 44) and `mobile_app/common/env.dev.example` (line 41)
- [x] `platform_telemetry.py` reads `PLATFORM_SERVICE_URL` from env or settings and sends via `urllib.request`
- [x] Threaded flush timer (60s) + atexit flush for offline→online drain
- [x] Persistent `install_id` in `~/.LysnrAI/install_id`
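For readers unfamiliar with the pattern, the timer-plus-`atexit` flush and the persistent install ID can be sketched as follows. This is an illustrative sketch under stated assumptions, not the contents of `platform_telemetry.py`: the class name `BufferedTelemetry`, the `send` callable, and the UUID format of the install ID are all hypothetical.

```python
import atexit
import threading
import uuid
from pathlib import Path
from typing import Callable, Dict, List, Optional


def load_install_id(path: Path) -> str:
    """Return the persisted install ID, creating it on first run."""
    if path.exists():
        existing = path.read_text(encoding="utf-8").strip()
        if existing:
            return existing
    path.parent.mkdir(parents=True, exist_ok=True)
    new_id = str(uuid.uuid4())  # ID format is an assumption
    path.write_text(new_id, encoding="utf-8")
    return new_id


class BufferedTelemetry:
    """Buffer events in memory and flush them on a repeating timer."""

    def __init__(self, send: Callable[[List[Dict]], bool], interval_s: float = 60.0):
        self._send = send  # hypothetical sender; returns True on success
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._buffer: List[Dict] = []
        self._timer: Optional[threading.Timer] = None
        self._stopped = False
        atexit.register(self.flush)  # last-chance flush at interpreter exit

    def track(self, event: Dict) -> None:
        with self._lock:
            self._buffer.append(event)

    def flush(self) -> int:
        """Send everything buffered; re-queue the batch if sending fails."""
        with self._lock:
            batch, self._buffer = self._buffer, []
        if not batch:
            return 0
        if self._send(batch):
            return len(batch)
        with self._lock:
            self._buffer = batch + self._buffer  # retry on the next tick
        return 0

    def start(self) -> None:
        def tick() -> None:
            self.flush()
            if not self._stopped:
                self.start()  # re-arm the one-shot timer

        self._timer = threading.Timer(self._interval_s, tick)
        self._timer.daemon = True  # never block process exit
        self._timer.start()

    def stop(self) -> None:
        self._stopped = True
        if self._timer is not None:
            self._timer.cancel()
```

Because a failed send puts the batch back at the front of the buffer, events recorded while offline go out on the first tick after connectivity returns, which is the offline-to-online drain the list refers to.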
### 4.5 — Web Dashboard Wiring
- [ ] Set `PLATFORM_SERVICE_URL` in dashboard `.env.local` files
- [ ] Verify `/api/telemetry/ingest` proxy routes forward to deployed platform-service
- [ ] Verify admin dashboard `/ops/client-logs` page loads real data from platform-service
- [x] **User dashboard**: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/ingest` proxy route forwards to platform-service
- [x] **Admin dashboard**: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/route.ts` queries platform-service, `/api/telemetry/admin-ingest` for self-telemetry
- [x] **Tracker dashboard**: `PLATFORM_SERVICE_URL` in `.env.example`, `/api/telemetry/ingest` proxy route
- [x] All 3 dashboards use `@bytelyst/telemetry-client` with `sendBeacon` transport
### 4.6 — Android Wiring
- [ ] Set platform service URL in Android app config
- [ ] Test SharedPreferences offline queue + foreground flush
- [ ] Verify keyboard instrumentation events reach server
- [x] `TelemetryClient.kt` reads `RuntimeConfig.platformServiceUrl` which loads from `.env` file or `BuildConfig.PLATFORM_SERVICE_URL`
- [x] `local.properties.example` has `PLATFORM_SERVICE_URL=http://10.0.2.2:4003`
- [x] `build.gradle.kts` injects `PLATFORM_SERVICE_URL` into `BuildConfig` from `local.properties`
- [x] `LysnrAIApp.kt` initializes `TelemetryClient` in `onCreate()` and wires `ProcessLifecycleOwner` for foreground/background events
- [x] SharedPreferences offline queue with FIFO eviction + foreground restore
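The "FIFO eviction + foreground restore" behaviour can be shown in a few lines. The sketch below is in Python for illustration only (the Android client is Kotlin and persists to SharedPreferences); the class name `OfflineQueue` and the cap of 500 events are assumptions, not values taken from the codebase.

```python
import json
from collections import deque
from typing import Deque, Dict, List


class OfflineQueue:
    """Bounded event queue that drops the oldest entry when full."""

    def __init__(self, max_events: int = 500):  # cap is an assumed value
        # deque(maxlen=...) discards from the left on overflow: FIFO eviction.
        self._events: Deque[Dict] = deque(maxlen=max_events)

    def enqueue(self, event: Dict) -> None:
        self._events.append(event)

    def dump(self) -> str:
        """Serialize for persistence (e.g. a single preferences string)."""
        return json.dumps(list(self._events))

    def restore(self, serialized: str) -> None:
        """Reload persisted events, still honouring the size cap."""
        for event in json.loads(serialized):
            self._events.append(event)

    def drain(self) -> List[Dict]:
        """Return all queued events, oldest first, and clear the queue."""
        batch = list(self._events)
        self._events.clear()
        return batch
```

With a cap of 3, enqueueing five events leaves only the three newest, so the queue cannot grow without bound during a long offline period.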
### 4.7 — Webhook / Alert Configuration
- [ ] Set `TELEMETRY_ALERT_WEBHOOK_URL` env var (Slack webhook or equivalent)
- [ ] Test cluster severity escalation triggers webhook
- [ ] Set `TELEMETRY_GEO_API_URL` env var (ip-api.com or similar) for geo enrichment
- [x] `TELEMETRY_ALERT_WEBHOOK_URL` added to `.env.example` (both repos)
- [x] `TELEMETRY_GEO_API_URL` added to `.env.example` (default: `http://ip-api.com/json`)
- [x] `TELEMETRY_EVENT_TTL_DAYS` added to `.env.example` (default: 90)
- [x] Webhook alerting code already exists in platform-service (`cluster severity escalation → webhook POST`)
- [x] Geo enrichment code already exists in platform-service (`IP → country/region lookup on ingestion`)
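As a reference for how these four variables might be consumed, here is a minimal Python sketch. It is an assumption-laden illustration, not platform-service code: the `TelemetryConfig` type, the treatment of an empty webhook URL as "alerts off", and the fallback to `true` when `TELEMETRY_ENABLED` is unset are all hypothetical. The geo URL and 90-day defaults mirror the `.env.example` values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TelemetryConfig:
    enabled: bool
    alert_webhook_url: Optional[str]  # None means alerting is not configured
    geo_api_url: str
    event_ttl_days: int

    @property
    def event_ttl_seconds(self) -> int:
        # Handy when the data store expects a TTL in seconds.
        return self.event_ttl_days * 24 * 60 * 60


def load_telemetry_config(env: Mapping[str, str] = os.environ) -> TelemetryConfig:
    """Parse telemetry settings, falling back to the .env.example defaults."""
    enabled_raw = env.get("TELEMETRY_ENABLED", "true").strip().lower()
    webhook = env.get("TELEMETRY_ALERT_WEBHOOK_URL", "").strip() or None
    geo_url = env.get("TELEMETRY_GEO_API_URL", "").strip() or "http://ip-api.com/json"
    ttl_days = int(env.get("TELEMETRY_EVENT_TTL_DAYS", "").strip() or "90")
    if ttl_days <= 0:
        raise ValueError("TELEMETRY_EVENT_TTL_DAYS must be a positive integer")
    return TelemetryConfig(
        enabled=enabled_raw in {"1", "true", "yes"},
        alert_webhook_url=webhook,
        geo_api_url=geo_url,
        event_ttl_days=ttl_days,
    )
```

Passing a plain dict instead of `os.environ` makes the parser easy to exercise in tests without touching the process environment.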
### 4.8 — End-to-End Smoke Test
- [ ] iOS keyboard → platform-service → Cosmos → admin dashboard query — **full round-trip**
- [ ] Desktop → platform-service → Cosmos → admin dashboard query
- [ ] Web dashboard → platform-service ingest → admin dashboard query
- [ ] Trigger error cluster creation → verify cluster appears in admin UI
- [ ] Trigger rate limit → verify rejection in metrics tab
- [ ] GDPR erasure → verify events deleted from Cosmos
- [x] `scripts/telemetry-smoke-test.sh` — 9-step curl-based smoke test covering:
- Health check
- Event ingestion (info + error events)
- Event query (admin endpoint)
- Error cluster query
- Config endpoint (ETag caching)
- Metrics endpoint
- Rate limiting burst test
- GDPR erasure endpoint
- [ ] Full round-trip on deployed infra (iOS keyboard → platform-service → Cosmos → admin dashboard) — _needs deployed infra_
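The ingestion step of such a smoke test can also be expressed in Python. The sketch below is hypothetical and is not a port of `scripts/telemetry-smoke-test.sh`: the `{"events": [...]}` payload shape and the event fields are assumptions, and only the `/api/telemetry/events` path comes from this roadmap.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, List


def build_event_request(base_url: str, events: List[Dict]) -> urllib.request.Request:
    """Build (but do not send) a POST to the telemetry ingestion endpoint."""
    body = json.dumps({"events": events}).encode("utf-8")  # payload shape is assumed
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/telemetry/events",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_events(base_url: str, events: List[Dict], timeout_s: float = 5.0) -> int:
    """Send the events and return the HTTP status code."""
    request = build_event_request(base_url, events)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. a rejection from the rate limiter
```

A local run might call `post_events("http://localhost:4003", [{"level": "info", "message": "smoke"}])`, assuming the service listens on the port used in `local.properties.example`; the same call against the deployed URL would cover the first hop of the remaining round-trip item.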
### Summary: What Blocks "100% Done"
### Remaining Infra Tasks (cannot be done in code)
| Blocker | Severity | Effort |
| --------------------------------------------------- | ----------- | ----------------------------------------------- |
| **Platform-service not deployed** | 🔴 Critical | Medium — needs Azure infra |
| **App Group entitlements not registered** | 🔴 Critical | Low — Apple Developer portal config |
| **`platform_service_url` not written to App Group** | 🔴 Critical | Low — one-line code change |
| **Cosmos containers not created in prod** | 🟡 High | Low — run indexing script |
| **Mic permission flow on device** | 🟡 High | Medium — needs device testing + possible UX fix |
| **Webhook URL not configured** | 🟢 Low | Trivial — env var |
| **Geo API URL not configured** | 🟢 Low | Trivial — env var |
| **Remaining test gaps (5 items)** | 🟢 Low | Medium — integration/e2e tests |
| Task | Type | Notes |
| --------------------------------------------- | ----------- | -------------------------------------------------- |
| Deploy platform-service to Azure | Infra | Azure Container Apps or App Service |
| Configure DNS (api.lysnrai.com) | Infra | DNS + TLS cert |
| Run cosmos-telemetry-indexes.sh against prod | Infra | Creates containers + composite indexes |
| Register App Groups in Apple Developer portal | Portal | `group.com.bytelyst.LysnrAI` for both targets |
| Physical device testing (mic, Full Access) | Device test | Needs TestFlight build with App Group entitlements |
---