diff --git a/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md b/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md index ff4b2e02..aac9e15c 100644 --- a/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md +++ b/docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md @@ -2,8 +2,8 @@ > **Audience:** Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos. > **Scope:** Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails. -> **Status:** Design — not yet implemented. -> **Last updated:** 2026-02-17 +> **Status:** Design — implementing today, keyboard-first. +> **Last updated:** 2026-02-17 (rev 2 — 18 gaps fixed) --- @@ -136,19 +136,51 @@ Every client event MUST conform to this schema. Fields marked **REQUIRED** must > **Rule:** At least one of `userId` or `anonymousInstallId` MUST be present. +### 4.1.1 `anonymousInstallId` Generation Strategy + +Each platform generates a stable UUID on first launch and persists it: + +| Platform | Storage | Key | +| ---------------- | ----------------------------------------------------- | -------------------------------- | +| **iOS app** | Keychain (kSecAttrAccessibleAfterFirstUnlock) | `com.bytelyst.LysnrAI.installId` | +| **iOS keyboard** | App Group UserDefaults (`group.com.bytelyst.LysnrAI`) | `telemetry_install_id` | +| **Android** | EncryptedSharedPreferences | `telemetry_install_id` | +| **Desktop** | `~/.lysnrai/telemetry_install_id` (plain file) | — | +| **Web** | `localStorage` | `lysnrai_install_id` | + +> **iOS keyboard note:** The keyboard extension shares the install ID via App Group so main app and extension use the same identity. + +### 4.1.2 Authentication for Telemetry Ingest + +Not all clients have a JWT (e.g., keyboard extension before user logs in). 
The ingest endpoint accepts two auth modes: + +| Mode | Header | When Used | +| ----------------- | --------------------------------------- | -------------------------------------------------------- | +| **JWT** | `Authorization: Bearer ` | Authenticated users (main app, web, desktop after login) | +| **Install Token** | `X-Install-Token: ` | Unauthenticated clients (keyboard extension, pre-login) | + +**Install token validation:** The server accepts any well-formed UUID in `X-Install-Token`. It does NOT verify against a registry (install IDs are self-issued). Rate limiting is applied per install ID to prevent abuse. + +**Keyboard extension specifics:** + +- With Full Access ON: sends events directly via HTTPS using `X-Install-Token`. +- With Full Access OFF: queues events to App Group UserDefaults (max 200 events, ~100KB); main app flushes on next foreground. This assumes the extension can write to the shared container in this mode; Apple's custom keyboard documentation lists no shared container with the containing app when Full Access is off, so verify on-device. +- If queue is full, oldest events are dropped (FIFO eviction). +- **Memory constraint:** iOS keyboard extensions are limited to ~30MB. Telemetry queue MUST stay under 100KB. Events are serialized as compact JSON (no pretty-print). + ### 4.2 Source Classification Fields -| Field | Type | Required | Description | -| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------- | -| `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` | -| `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` | -| `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` | -| `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` | -| `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` | -| `locale` | `string?` | Optional | BCP 47 locale, e.g. 
`"en-US"`, `"ta-IN"` | -| `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` | -| `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side | -| `regionCode` | `string?` | Optional | e.g. `"WA"`, `"TN"` — derived server-side from IP geo | +| Field | Type | Required | Description | +| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` | +| `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` | +| `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` | +| `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` | +| `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` | +| `locale` | `string?` | Optional | BCP 47 locale, e.g. `"en-US"`, `"ta-IN"` | +| `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` | +| `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side | +| `regionCode` | `string?` | Optional | Prefixed format: `"US:WA"`, `"IN:TN"` — derived server-side from IP geo. Always `{country}:{region}` to avoid ambiguity (TN = Tennessee or Tamil Nadu) | ### 4.3 App Release Fields @@ -189,27 +221,33 @@ Every client event MUST conform to this schema. 
Fields marked **REQUIRED** must When `module = "keyboard_dictation"`, clients SHOULD include a structured `dictation` object: -| Field | Type | Description | -| ------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- | -| `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active | -| `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state | -| `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission | -| `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission | -| `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? | -| `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? | -| `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? | -| `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? | -| `dictation.transcriptLength` | `number` | Character count only — NEVER raw text | -| `dictation.sessionDurationMs` | `number` | Time from mic tap to stop | -| `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. 
`"com.apple.MobileSMS"`) | +| Field | Type | Description | +| ---------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------- | +| `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active | +| `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state | +| `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission | +| `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission | +| `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? | +| `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? | +| `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? | +| `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? | +| `dictation.transcriptLength` | `number` | Character count only — NEVER raw text | +| `dictation.sessionDurationMs` | `number` | Time from mic tap to stop | +| `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. `"com.apple.MobileSMS"`) | +| `dictation.errorRecoveryAttempted` | `boolean` | Was Azure→local (or vice versa) recovery attempted during this session? | +| `dictation.errorRecoverySucceeded` | `boolean?` | If recovery was attempted, did the fallback backend produce results? | +| `dictation.audioSessionCategory` | `string?` | iOS AVAudioSession category active during dictation (e.g. 
`"playAndRecord"`) | -### 4.8 Timing Fields +### 4.8 Server-Computed Fields -| Field | Type | Required | Description | -| ------------ | -------------------- | ---------- | ------------------------- | -| `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp | -| `receivedAt` | `string` (ISO 8601) | Server-set | Set by ingestion endpoint | -| `ttlAt` | `string?` (ISO 8601) | Server-set | Cosmos TTL expiry marker | +Except for `occurredAt`, which the client provides, these fields are **set by the ingestion endpoint**, never by clients. + +| Field | Type | Required | Description | +| ------------ | ------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------ | +| `pk` | `string` | Server-set | Cosmos partition key: `${productId}:${yyyyMM}:${platform}`. Computed from event fields on ingest | +| `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp (client provides this) | +| `receivedAt` | `string` (ISO 8601) | Server-set | Server receipt timestamp | +| `ttl` | `number` | Server-set | Cosmos TTL in **seconds** (not ISO date). Cosmos uses `_ts + ttl` for auto-expiry. Default: `TELEMETRY_EVENT_TTL_DAYS * 86400` | --- @@ -325,19 +363,36 @@ interface TelemetryCollectionConfig { 1. **Global default:** If no policies match, use a hardcoded default: - Collect `warn`, `error`, `fatal` only - Sample `warn` at 50%, `error`/`fatal` at 100% - - Flush every 60s, batch of 20 + - Flush every 60s, batch of 20, max queue 200 -2. **Policy matching:** A policy matches if ALL non-null targeting conditions are met (AND logic). +2. **Empty targeting = matches ALL:** A policy with `targeting: {}` (all fields omitted) matches every client. This is how the global kill switch works (example G). -3. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are unioned (if any policy enables `debug`, it's enabled). +3. 
**Policy matching:** A policy matches if ALL **present** (non-null/non-undefined) targeting conditions are met (AND logic). Omitted conditions are ignored (not checked). -4. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module: +4. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are **unioned** (if any matching policy enables `debug`, it’s enabled). + +5. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module: ```ts hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage; ``` -5. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response. +6. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response. + +7. **`samplingRate` → `samplingRates` mapping:** A policy’s single `samplingRate` applies to ALL its `eventTypes`. When merging multiple policies, the highest-priority policy’s rate wins per event type. If a policy enables `["debug", "info"]` at rate 0.5 and another enables `["error", "fatal"]` at rate 1.0, the merged config is: + + ```json + { "debug": 0.5, "info": 0.5, "warn": 0.0, "error": 1.0, "fatal": 1.0 } + ``` + +8. **`batchSize`, `flushIntervalMs`, `maxQueueSize` defaults:** These are NOT set per-policy. They come from server-side env vars with these defaults: + | Param | Default | Env Var | + |-------|---------|--------| + | `batchSize` | 20 | `TELEMETRY_CLIENT_BATCH_SIZE` | + | `flushIntervalMs` | 60000 (60s) | `TELEMETRY_CLIENT_FLUSH_MS` | + | `maxQueueSize` | 200 | `TELEMETRY_CLIENT_MAX_QUEUE` | + + The config endpoint returns these with the merged policy so clients don’t hardcode them. ### 5.5 Example Policies @@ -465,12 +520,68 @@ interface TelemetryCollectionConfig { ### 6.1 `POST /api/telemetry/events` — Batch Ingest -**Auth:** JWT (authenticated users) or API key header (anonymous installs). 
+**Auth:** JWT (`Authorization: Bearer`) or Install Token (`X-Install-Token: `). See §4.1.2. **Request:** ```ts -// Zod schema +// --- Zod schema for a single telemetry event --- +const TelemetryEventSchema = z + .object({ + // Identity + id: z.string().uuid(), + productId: z.string().min(1), + userId: z.string().optional(), + anonymousInstallId: z.string().uuid().optional(), + sessionId: z.string().min(1), + requestId: z.string().optional(), + + // Source classification + platform: z.enum(['ios', 'android', 'web', 'desktop']), + channel: z.enum([ + 'mobile_app', + 'keyboard_extension', + 'web_app', + 'desktop_app', + 'backend_service', + ]), + osFamily: z.enum(['ios', 'android', 'macos', 'windows', 'linux', 'chromeos', 'other']), + osVersion: z.string().optional(), + deviceModel: z.string().optional(), + locale: z.string().optional(), + timezone: z.string().optional(), + + // App release + appVersion: z.string().min(1), + buildNumber: z.string().min(1), + releaseChannel: z.enum(['dev', 'beta', 'prod']), + + // Event semantics + eventType: z.enum(['debug', 'info', 'warn', 'error', 'fatal']), + module: z.string().min(1), + feature: z.string().optional(), + eventName: z.string().min(1), + + // Error & diagnostics + errorDomain: z.string().optional(), + errorCode: z.string().optional(), + message: z.string().max(512).optional(), + stackTrace: z.string().max(8192).optional(), + fingerprint: z.string().optional(), + + // Structured metadata + tags: z.record(z.string().max(128)).optional(), + metrics: z.record(z.number()).optional(), + context: z.record(z.unknown()).optional(), + + // Timing + occurredAt: z.string().datetime(), + }) + .refine(e => e.userId || e.anonymousInstallId, { + message: 'At least one of userId or anonymousInstallId is required', + }); + +// --- Batch ingest request --- const TelemetryIngestRequest = z.object({ productId: z.string().min(1), events: z.array(TelemetryEventSchema).min(1).max(50), @@ -497,29 +608,33 @@ interface TelemetryIngestResponse 
{ **Validation rules:** -1. `productId` must match a known, active product. -2. Each event must have `id`, `productId`, `platform`, `channel`, `osFamily`, `appVersion`, `buildNumber`, `releaseChannel`, `eventType`, `module`, `eventName`, `sessionId`, `occurredAt`. -3. At least one of `userId` or `anonymousInstallId`. -4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys. +1. **`productId` authority:** Request-level `productId` is authoritative. Per-event `productId` MUST match the request-level value; mismatches are rejected. +2. Zod validation enforces all required fields (see schema above). +3. At least one of `userId` or `anonymousInstallId` (Zod refine). +4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys, `context` max 4KB serialized. 5. PII regex rejection: reject events containing patterns matching email, phone, credit card. 6. No raw dictation text allowed in any field. +**Idempotency:** Events are upserted by `id`. If a client retries a batch (e.g., network timeout), duplicate event IDs are silently overwritten. This makes retries idempotent: delivery is at-least-once, but each event `id` is stored only once, so clients need no dedup tracking. + ### 6.2 `GET /api/telemetry/config` — Collection Config (Client Poll) **Auth:** JWT or API key. **Query params:** -| Param | Type | Description | -| ---------------- | ------- | ----------------------- | -| `platform` | string | Client platform | -| `channel` | string | Client channel | -| `osFamily` | string | OS family | -| `appVersion` | string | Current app version | -| `buildNumber` | string | Current build number | -| `releaseChannel` | string | dev/beta/prod | -| `countryCode` | string? | Client-reported country | -| `regionCode` | string? | Client-reported region | +| Param | Type | Description | +| -------------------- | ------- | --------------------------------------------------------- | +| `userId` | string? | Authenticated user ID (for percentage rollout evaluation) | +| `anonymousInstallId` | string? 
| Install ID (fallback for percentage rollout) | +| `platform` | string | Client platform | +| `channel` | string | Client channel | +| `osFamily` | string | OS family | +| `appVersion` | string | Current app version | +| `buildNumber` | string | Current build number | +| `releaseChannel` | string | dev/beta/prod | +| `countryCode` | string? | Client-reported country | +| `regionCode` | string? | Client-reported region (prefixed: `US:WA`) | **Response:** `TelemetryCollectionConfig` (see §5.3). @@ -582,6 +697,22 @@ interface TelemetryClusterResponse { | `PUT` | `/api/telemetry/policies/:id` | Update policy | | `DELETE` | `/api/telemetry/policies/:id` | Delete policy | +### 6.6 `DELETE /api/telemetry/user/:userId` — GDPR Right-to-Erasure + +**Auth:** Admin JWT only. + +Deletes ALL telemetry events and cluster references for the given `userId`. Returns count of deleted documents. Required for GDPR compliance. + +**Response:** + +```ts +interface TelemetryErasureResponse { + userId: string; + eventsDeleted: number; + clustersUpdated: number; +} +``` + --- ## 7. Storage & Partitioning @@ -590,11 +721,11 @@ interface TelemetryClusterResponse { #### `telemetry_events` (raw events) -| Property | Value | -| ------------- | ------------------------------------------------------------ | -| Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` | -| TTL | 30–60 days (configurable via env `TELEMETRY_EVENT_TTL_DAYS`) | -| RU budget | Start at 400 RU/s autoscale, monitor and adjust | +| Property | Value | +| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` | +| TTL | `defaultTtl: 30 * 86400` (30 days in seconds, configurable via `TELEMETRY_EVENT_TTL_DAYS`). 
Cosmos auto-deletes docs when `_ts + ttl` passes | +| RU budget | Start at 400 RU/s autoscale, monitor and adjust | **Rationale:** Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load. @@ -611,11 +742,11 @@ interface TelemetryClusterResponse { #### `telemetry_error_clusters` (aggregated) -| Property | Value | -| ------------- | ----------------------------------------------------- | -| Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` | -| TTL | 90–180 days | -| RU budget | 200 RU/s autoscale | +| Property | Value | +| ------------- | -------------------------------------------------------------------------------------------- | +| Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` | +| TTL | `defaultTtl: 90 * 86400` (90 days in seconds, configurable via `TELEMETRY_CLUSTER_TTL_DAYS`) | +| RU budget | 200 RU/s autoscale | #### `telemetry_collection_policies` (segment rules) @@ -675,19 +806,24 @@ function normalizeMessage(msg: string): string { ```ts interface TelemetryErrorCluster { - id: string; // fingerprint + time window key + id: string; // fingerprint + time window key (e.g. 
`${fingerprint}:${yyyyMM}`) pk: string; // ${productId}:${platform}:${module} productId: string; fingerprint: string; - // Dimensions + // Dimensions (version-agnostic — one cluster spans all versions) platform: string; channel: string; - osFamily: string; module: string; eventName: string; - appVersion: string; - buildNumber: string; + + // Version breakdown — which builds are affected + affectedVersions: Array<{ + appVersion: string; + buildNumber: string; + count: number; + lastSeenAt: string; + }>; // capped at 50 entries // Aggregates firstSeenAt: string; @@ -695,18 +831,20 @@ interface TelemetryErrorCluster { totalCount: number; affectedUserIds: string[]; // capped at 100 affectedInstallIds: string[]; // capped at 100 + affectedOsFamilies: string[]; // e.g. ["ios", "macos"] - // Representative sample + // Representative sample (from most recent event) sampleErrorDomain?: string; sampleErrorCode?: string; sampleMessage?: string; severity: 'warn' | 'error' | 'fatal'; + ttl: number; // Cosmos TTL in seconds } ``` ### 8.3 Cluster Update Strategy -On each ingested `error`/`fatal` event: +On each ingested `warn`, `error`, or `fatal` event: 1. Compute fingerprint. 2. Upsert cluster doc: increment `totalCount`, update `lastSeenAt`, append to `affectedUserIds` (dedup, cap at 100). @@ -768,6 +906,14 @@ Located at `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page. class LysnrTelemetry { static let shared = LysnrTelemetry() + // Core properties (set once at init) + let productId = "lysnrai" + let platform = "ios" + let osFamily = "ios" + var channel: String // "mobile_app" or "keyboard_extension" + var installId: String // from App Group UserDefaults + var userId: String? // from App Group (set after login) + func track( eventType: EventType, module: String, @@ -780,13 +926,21 @@ class LysnrTelemetry { metrics: [String: Double]? 
= nil ) - func flush() // force-send queued events - func refreshConfig() // poll collection policy + func flush() // force-send queued events + func refreshConfig() // poll collection policy + + // Keyboard-specific + func queueToAppGroup() // write pending events to App Group UserDefaults + func flushAppGroupQueue() // called by main app on foreground } ``` -- Keyboard extension: posts events to App Group shared `UserDefaults` queue; main app flushes. -- Alternatively, keyboard extension sends directly if Full Access is enabled (network available). +**Keyboard extension offline strategy:** + +- **Full Access ON:** Sends events directly via URLSession. Falls back to App Group queue on network failure. +- **Full Access OFF:** Always queues to App Group UserDefaults (`telemetry_event_queue` key). +- **Main app responsibility:** On each foreground, calls `LysnrTelemetry.shared.flushAppGroupQueue()` to drain keyboard-queued events. +- **Queue limits:** Max 200 events (~100KB). FIFO eviction when full. See §4.1.2 for memory constraints. ### 10.2 Android (Kotlin) @@ -857,9 +1011,10 @@ telemetry.track({ ### 11.3 Access Control -- **Ingest (`POST /api/telemetry/events`):** Any authenticated user or valid install API key. -- **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only. -- **Policy management:** Admin JWT only. +- **Ingest (`POST /api/telemetry/events`):** Any authenticated user (JWT) or valid install token (`X-Install-Token`). See §4.1.2. +- **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only. Enforced via `req.jwtPayload?.role === 'admin'` check (same pattern as other admin-only modules). +- **Policy management:** Admin JWT only (same check). +- **GDPR erasure:** Admin JWT only. - **No public endpoints.** Telemetry data is internal/operational only. ### 11.4 Rate Limiting @@ -911,26 +1066,29 @@ telemetry.track({ ## 13. 
Open Questions -| # | Question | Status | -| --- | ----------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | -| 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **Recommend:** Direct when Full Access on, App Group queue as fallback | -| 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears | -| 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 | -| 4 | Max retention for raw events? Compliance requirements? | Default 30 days, configurable | -| 5 | Do we need GDPR right-to-erasure support for telemetry? | Yes — add `DELETE /api/telemetry/user/:userId` endpoint | +| # | Question | Status | +| --- | ----------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- | +| 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **RESOLVED (rev 2):** Direct when Full Access on, App Group queue as fallback. See §4.1.2. | +| 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears | +| 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 | +| 4 | Max retention for raw events? Compliance requirements? | **RESOLVED (rev 2):** 30 days default, configurable via `TELEMETRY_EVENT_TTL_DAYS`. Cosmos TTL in seconds. 
| +| 5 | Do we need GDPR right-to-erasure support for telemetry? | **RESOLVED (rev 2):** Yes — `DELETE /api/telemetry/user/:userId` added to §6.6. | --- ## Appendix A: Env Vars -| Var | Default | Description | -| ----------------------------- | -------- | ------------------------------ | -| `TELEMETRY_ENABLED` | `true` | Global server-side kill switch | -| `TELEMETRY_EVENT_TTL_DAYS` | `30` | Raw event retention | -| `TELEMETRY_CLUSTER_TTL_DAYS` | `90` | Cluster retention | -| `TELEMETRY_MAX_BATCH_SIZE` | `50` | Max events per ingest request | -| `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body | -| `TELEMETRY_PII_SCAN_ENABLED` | `true` | Server-side PII rejection | +| Var | Default | Description | +| ----------------------------- | -------- | ------------------------------------------------------- | +| `TELEMETRY_ENABLED` | `true` | Global server-side kill switch | +| `TELEMETRY_EVENT_TTL_DAYS` | `30` | Raw event retention (Cosmos TTL = days × 86400 seconds) | +| `TELEMETRY_CLUSTER_TTL_DAYS` | `90` | Cluster retention | +| `TELEMETRY_MAX_BATCH_SIZE` | `50` | Max events per ingest request | +| `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body | +| `TELEMETRY_PII_SCAN_ENABLED` | `true` | Server-side PII rejection | +| `TELEMETRY_CLIENT_BATCH_SIZE` | `20` | Returned in config response for client-side batching | +| `TELEMETRY_CLIENT_FLUSH_MS` | `60000` | Returned in config response for client flush interval | +| `TELEMETRY_CLIENT_MAX_QUEUE` | `200` | Returned in config response for client max queue size | ## Appendix B: Related Files