docs: fix 18 gaps in telemetry design doc (rev 2)
parent c59049efec
commit 083cf029c1
@ -2,8 +2,8 @@
|
||||
|
||||
> **Audience:** Engineering (AI agents + humans) working on ByteLyst/LysnrAI repos.
|
||||
> **Scope:** Cross-platform client telemetry ingestion, segment-based collection control, storage, admin UI, and privacy guardrails.
|
||||
> **Status:** Design — not yet implemented.
|
||||
> **Last updated:** 2026-02-17
|
||||
> **Status:** Design — implementing today, keyboard-first.
|
||||
> **Last updated:** 2026-02-17 (rev 2 — 18 gaps fixed)
|
||||
|
||||
---
|
||||
|
||||
@ -136,19 +136,51 @@ Every client event MUST conform to this schema. Fields marked **REQUIRED** must
|
||||
|
||||
> **Rule:** At least one of `userId` or `anonymousInstallId` MUST be present.
|
||||
|
||||
### 4.1.1 `anonymousInstallId` Generation Strategy
|
||||
|
||||
Each platform generates a stable UUID on first launch and persists it:
|
||||
|
||||
| Platform | Storage | Key |
| ---------------- | ----------------------------------------------------- | -------------------------------- |
| **iOS app** | Keychain (`kSecAttrAccessibleAfterFirstUnlock`) | `com.bytelyst.LysnrAI.installId` |
| **iOS keyboard** | App Group UserDefaults (`group.com.bytelyst.LysnrAI`) | `telemetry_install_id` |
| **Android** | EncryptedSharedPreferences | `telemetry_install_id` |
| **Desktop** | `~/.lysnrai/telemetry_install_id` (plain file) | n/a |
| **Web** | `localStorage` | `lysnrai_install_id` |
|
||||
|
||||
> **iOS keyboard note:** The keyboard extension shares the install ID via App Group so main app and extension use the same identity.
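
The read-or-generate flow described above can be sketched as follows. This is a minimal illustration, not the actual client code: `InstallIdStore`, `MemoryStore`, `getOrCreateInstallId`, and the injected `generateUuid` function are hypothetical names, and each platform would put its own storage from the table behind the interface.

```ts
// Hypothetical storage abstraction. Keychain, App Group UserDefaults,
// EncryptedSharedPreferences, a plain file, or localStorage would sit behind it.
interface InstallIdStore {
  read(key: string): string | null;
  write(key: string, value: string): void;
}

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns the persisted install ID, generating and storing one on first launch.
// `generateUuid` is an assumed platform UUID source passed in by the caller.
function getOrCreateInstallId(
  store: InstallIdStore,
  key: string,
  generateUuid: () => string
): string {
  const existing = store.read(key);
  if (existing !== null && UUID_PATTERN.test(existing)) {
    return existing; // stable across launches
  }
  const fresh = generateUuid();
  store.write(key, fresh);
  return fresh;
}

// In-memory store, for illustration only.
class MemoryStore implements InstallIdStore {
  private values = new Map<string, string>();

  read(key: string): string | null {
    const value = this.values.get(key);
    return value === undefined ? null : value;
  }

  write(key: string, value: string): void {
    this.values.set(key, value);
  }
}
```

Injecting the generator keeps the sketch free of platform APIs; a malformed stored value is treated like a missing one and replaced.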
|
||||
|
||||
### 4.1.2 Authentication for Telemetry Ingest
|
||||
|
||||
Not all clients have a JWT (e.g., keyboard extension before user logs in). The ingest endpoint accepts two auth modes:
|
||||
|
||||
| Mode | Header | When Used |
| ----------------- | --------------------------------------- | -------------------------------------------------------- |
| **JWT** | `Authorization: Bearer <token>` | Authenticated users (main app, web, desktop after login) |
| **Install Token** | `X-Install-Token: <anonymousInstallId>` | Unauthenticated clients (keyboard extension, pre-login) |
|
||||
|
||||
**Install token validation:** The server accepts any well-formed UUID in `X-Install-Token`. It does NOT verify against a registry (install IDs are self-issued). Rate limiting is applied per install ID to prevent abuse.
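
A minimal sketch of how the two auth modes might be told apart on the server. `IngestAuth` and `resolveIngestAuth` are hypothetical names; preferring the JWT when both headers are present is an assumption, and JWT signature verification and rate limiting are left out.

```ts
type IngestAuth =
  | { mode: "jwt"; token: string }
  | { mode: "install_token"; installId: string }
  | { mode: "rejected"; reason: string };

// Shape check only: any well-formed UUID is accepted, with no registry lookup.
const UUID_SHAPE = /^[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}$/i;

// Assumption: header names arrive lower-cased, as they do in Node's HTTP stack.
function resolveIngestAuth(headers: { [name: string]: string | undefined }): IngestAuth {
  const authorization = headers["authorization"];
  if (authorization !== undefined && authorization.indexOf("Bearer ") === 0) {
    // Signature and expiry checks would happen here in a real handler.
    return { mode: "jwt", token: authorization.slice("Bearer ".length) };
  }
  const installToken = headers["x-install-token"];
  if (installToken !== undefined && UUID_SHAPE.test(installToken)) {
    return { mode: "install_token", installId: installToken };
  }
  return { mode: "rejected", reason: "missing or malformed credentials" };
}
```

The resolved `installId` is also the natural key for the per-install rate limit mentioned above.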
|
||||
|
||||
**Keyboard extension specifics:**

- With Full Access ON: sends events directly via HTTPS using `X-Install-Token`.
- With Full Access OFF: queues events to App Group UserDefaults (max 200 events, ~100KB). Main app flushes on next foreground.
- If queue is full, oldest events are dropped (FIFO eviction).
- **Memory constraint:** iOS keyboard extensions are limited to ~30MB. Telemetry queue MUST stay under 100KB. Events are serialized as compact JSON (no pretty-print).
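
The queue limits above (200 events, 100KB, oldest-first eviction) can be sketched as a small bounded queue. This is an illustrative TypeScript model, not the Swift implementation; `BoundedEventQueue` and `utf8Length` are hypothetical names, and dropping a single event that exceeds the byte cap on its own is an assumption.

```ts
const MAX_EVENTS = 200;
const MAX_BYTES = 100 * 1024;

// UTF-8 size of a string, computed without platform APIs.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

class BoundedEventQueue {
  private items: string[] = []; // each entry is one compact JSON event
  private bytes = 0;
  private maxEvents: number;
  private maxBytes: number;

  constructor(maxEvents: number = MAX_EVENTS, maxBytes: number = MAX_BYTES) {
    this.maxEvents = maxEvents;
    this.maxBytes = maxBytes;
  }

  enqueue(event: object): void {
    const json = JSON.stringify(event); // no indent argument, so output is compact
    const size = utf8Length(json);
    if (size > this.maxBytes) return; // assumption: an oversized event is dropped
    this.items.push(json);
    this.bytes += size;
    // Evict the oldest entries until both limits hold again.
    while (this.items.length > this.maxEvents || this.bytes > this.maxBytes) {
      const dropped = this.items.shift() as string;
      this.bytes -= utf8Length(dropped);
    }
  }

  // Hands all queued events to the caller (e.g. the main app flush) and resets.
  drain(): string[] {
    const out = this.items;
    this.items = [];
    this.bytes = 0;
    return out;
  }

  get length(): number {
    return this.items.length;
  }
}
```

Tracking the byte total incrementally avoids re-serializing the whole queue on every insert, which matters under the extension's memory ceiling.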
|
||||
|
||||
### 4.2 Source Classification Fields
|
||||
|
||||
| Field | Type | Required | Description |
|
||||
| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------- |
|
||||
| `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` |
|
||||
| `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` |
|
||||
| `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` |
|
||||
| `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` |
|
||||
| `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` |
|
||||
| `locale` | `string?` | Optional | BCP 47 locale, e.g. `"en-US"`, `"ta-IN"` |
|
||||
| `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` |
|
||||
| `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side |
|
||||
| `regionCode` | `string?` | Optional | e.g. `"WA"`, `"TN"` — derived server-side from IP geo |
|
||||
| Field | Type | Required | Description |
|
||||
| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `platform` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"web"` \| `"desktop"` |
|
||||
| `channel` | `enum` | REQUIRED | `"mobile_app"` \| `"keyboard_extension"` \| `"web_app"` \| `"desktop_app"` \| `"backend_service"` |
|
||||
| `osFamily` | `enum` | REQUIRED | `"ios"` \| `"android"` \| `"macos"` \| `"windows"` \| `"linux"` \| `"chromeos"` \| `"other"` |
|
||||
| `osVersion` | `string?` | Recommended | e.g. `"iOS 18.2"`, `"Windows 11 24H2"`, `"macOS 15.3"`, `"Ubuntu 24.04"` |
|
||||
| `deviceModel` | `string?` | Optional | e.g. `"iPhone17,3"`, `"Pixel 9"`, `"MacBookPro18,3"` |
|
||||
| `locale` | `string?` | Optional | BCP 47 locale, e.g. `"en-US"`, `"ta-IN"` |
|
||||
| `timezone` | `string?` | Optional | IANA timezone, e.g. `"America/Los_Angeles"`, `"Asia/Kolkata"` |
|
||||
| `countryCode` | `string?` | Optional | ISO 3166-1 alpha-2, e.g. `"US"`, `"IN"` — derived from locale or IP server-side |
|
||||
| `regionCode` | `string?` | Optional | Prefixed format: `"US:WA"`, `"IN:TN"` — derived server-side from IP geo. Always `{country}:{region}` to avoid ambiguity (TN = Tennessee or Tamil Nadu) |
|
||||
|
||||
### 4.3 App Release Fields
|
||||
|
||||
@ -189,27 +221,33 @@ Every client event MUST conform to this schema. Fields marked **REQUIRED** must
|
||||
|
||||
When `module = "keyboard_dictation"`, clients SHOULD include a structured `dictation` object:
|
||||
|
||||
| Field | Type | Description |
|
||||
| ------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
|
||||
| `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active |
|
||||
| `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state |
|
||||
| `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission |
|
||||
| `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission |
|
||||
| `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? |
|
||||
| `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? |
|
||||
| `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? |
|
||||
| `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? |
|
||||
| `dictation.transcriptLength` | `number` | Character count only — NEVER raw text |
|
||||
| `dictation.sessionDurationMs` | `number` | Time from mic tap to stop |
|
||||
| `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. `"com.apple.MobileSMS"`) |
|
||||
| Field | Type | Description |
|
||||
| ---------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
|
||||
| `dictation.backend` | `"azure"` \| `"local"` \| `"none"` | Which recognition backend was active |
|
||||
| `dictation.hasFullAccess` | `boolean` | Keyboard Full Access toggle state |
|
||||
| `dictation.micPermission` | `"granted"` \| `"denied"` \| `"undetermined"` | Microphone permission |
|
||||
| `dictation.speechPermission` | `"authorized"` \| `"denied"` \| `"restricted"` \| `"notDetermined"` | Speech recognition permission |
|
||||
| `dictation.recognitionStarted` | `boolean` | Did recognition engine actually start? |
|
||||
| `dictation.finalResultReceived` | `boolean` | Did at least one final result arrive? |
|
||||
| `dictation.insertAttempted` | `boolean` | Did `insertText` / `commitText` get called? |
|
||||
| `dictation.insertNoOpDetected` | `boolean` | Did retry logic detect a no-op insert? |
|
||||
| `dictation.transcriptLength` | `number` | Character count only — NEVER raw text |
|
||||
| `dictation.sessionDurationMs` | `number` | Time from mic tap to stop |
|
||||
| `dictation.hostApp` | `string?` | Bundle ID of host app if available (e.g. `"com.apple.MobileSMS"`) |
|
||||
| `dictation.errorRecoveryAttempted` | `boolean` | Was Azure→local (or vice versa) recovery attempted during this session? |
|
||||
| `dictation.errorRecoverySucceeded` | `boolean?` | If recovery was attempted, did the fallback backend produce results? |
|
||||
| `dictation.audioSessionCategory` | `string?` | iOS AVAudioSession category active during dictation (e.g. `"playAndRecord"`) |
|
||||
|
||||
### 4.8 Timing Fields
|
||||
### 4.8 Server-Computed Fields
|
||||
|
||||
| Field | Type | Required | Description |
|
||||
| ------------ | -------------------- | ---------- | ------------------------- |
|
||||
| `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp |
|
||||
| `receivedAt` | `string` (ISO 8601) | Server-set | Set by ingestion endpoint |
|
||||
| `ttlAt` | `string?` (ISO 8601) | Server-set | Cosmos TTL expiry marker |
|
||||
These fields are **set by the ingestion endpoint**, never by clients.
|
||||
|
||||
| Field | Type | Required | Description |
| ------------ | ------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `pk` | `string` | Server-set | Cosmos partition key: `${productId}:${yyyyMM}:${platform}`. Computed from event fields on ingest |
| `occurredAt` | `string` (ISO 8601) | REQUIRED | Client-side timestamp (client provides this) |
| `receivedAt` | `string` (ISO 8601) | Server-set | Server receipt timestamp |
| `ttl` | `number` | Server-set | Cosmos TTL in **seconds** (not ISO date). Cosmos uses `_ts + ttl` for auto-expiry. Default: `TELEMETRY_EVENT_TTL_DAYS * 86400` |
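
A minimal sketch of how the ingest endpoint might derive `pk` and `ttl`. The function names are hypothetical, and taking the month from `occurredAt` in UTC is an assumption: the table only says the key is computed from event fields on ingest.

```ts
// Builds `${productId}:${yyyyMM}:${platform}` from an ISO 8601 timestamp.
// Assumption: the month comes from `occurredAt`, interpreted in UTC.
function computePartitionKey(productId: string, occurredAt: string, platform: string): string {
  const date = new Date(occurredAt);
  if (isNaN(date.getTime())) {
    throw new Error("occurredAt is not a valid timestamp");
  }
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1;
  const yyyyMM = String(year) + (month < 10 ? "0" : "") + String(month);
  return productId + ":" + yyyyMM + ":" + platform;
}

// Cosmos TTL is expressed in seconds, so days are multiplied by 86400.
function ttlSeconds(ttlDays: number): number {
  return Math.round(ttlDays * 86400);
}
```

With the default of 30 days, `ttlSeconds(30)` yields 2592000, and an iOS event from 2026-02-17 lands in partition `lysnrai:202602:ios`.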
|
||||
|
||||
---
|
||||
|
||||
@ -325,19 +363,36 @@ interface TelemetryCollectionConfig {
|
||||
1. **Global default:** If no policies match, use a hardcoded default:
|
||||
- Collect `warn`, `error`, `fatal` only
|
||||
- Sample `warn` at 50%, `error`/`fatal` at 100%
|
||||
- Flush every 60s, batch of 20
|
||||
- Flush every 60s, batch of 20, max queue 200
|
||||
|
||||
2. **Policy matching:** A policy matches if ALL non-null targeting conditions are met (AND logic).
|
||||
2. **Empty targeting = matches ALL:** A policy with `targeting: {}` (all fields omitted) matches every client. This is how the global kill switch works (example G).
|
||||
|
||||
3. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are unioned (if any policy enables `debug`, it's enabled).
|
||||
3. **Policy matching:** A policy matches if ALL **present** (non-null/non-undefined) targeting conditions are met (AND logic). Omitted conditions are ignored (not checked).
|
||||
|
||||
4. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module:
|
||||
4. **Policy merge (multiple matches):** Highest-priority policy wins for each field. Exception: `eventTypes` are **unioned** (if any matching policy enables `debug`, it’s enabled).
|
||||
|
||||
5. **Percentage rollout:** Uses the same FNV-1a hash from the existing feature flags module:
|
||||
|
||||
```ts
|
||||
hashUserFlag(userId || anonymousInstallId, `telemetry_policy_${policyId}`) < percentage;
|
||||
```
|
||||
|
||||
5. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response.
|
||||
6. **Time bounds:** `startsAt`/`expiresAt` are checked server-side before including in response.
|
||||
|
||||
7. **`samplingRate` → `samplingRates` mapping:** A policy’s single `samplingRate` applies to ALL its `eventTypes`. When merging multiple policies, the highest-priority policy’s rate wins per event type. If a policy enables `["debug", "info"]` at rate 0.5 and another enables `["error", "fatal"]` at rate 1.0, the merged config is:
|
||||
|
||||
```json
|
||||
{ "debug": 0.5, "info": 0.5, "warn": 0.0, "error": 1.0, "fatal": 1.0 }
|
||||
```
|
||||
|
||||
8. **`batchSize`, `flushIntervalMs`, `maxQueueSize` defaults:** These are NOT set per-policy. They come from server-side env vars with these defaults:
|
||||
| Param | Default | Env Var |
|-------|---------|--------|
| `batchSize` | 20 | `TELEMETRY_CLIENT_BATCH_SIZE` |
| `flushIntervalMs` | 60000 (60s) | `TELEMETRY_CLIENT_FLUSH_MS` |
| `maxQueueSize` | 200 | `TELEMETRY_CLIENT_MAX_QUEUE` |
|
||||
|
||||
The config endpoint returns these with the merged policy so clients don’t hardcode them.
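
The `samplingRate` to `samplingRates` merge can be sketched as below. `MatchedPolicy` is a hypothetical, reduced policy shape, and treating a larger `priority` number as higher priority is an assumption. Event types that no matching policy enables end up at rate 0.

```ts
type EventType = "debug" | "info" | "warn" | "error" | "fatal";

// Hypothetical minimal policy shape, reduced to the fields this merge needs.
interface MatchedPolicy {
  priority: number; // assumption: larger number = higher priority
  eventTypes: EventType[];
  samplingRate: number;
}

type SamplingRates = { [K in EventType]: number };

function mergeSamplingRates(policies: MatchedPolicy[]): SamplingRates {
  const rates: SamplingRates = { debug: 0, info: 0, warn: 0, error: 0, fatal: 0 };
  // Apply lowest priority first so higher-priority policies overwrite per event type.
  const ordered = policies.slice().sort((a, b) => a.priority - b.priority);
  for (const policy of ordered) {
    for (const eventType of policy.eventTypes) {
      rates[eventType] = policy.samplingRate;
    }
  }
  return rates;
}
```

Because every enabled type receives a rate, the union of `eventTypes` falls out of the same loop; no separate union step is needed.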
|
||||
|
||||
### 5.5 Example Policies
|
||||
|
||||
@ -465,12 +520,68 @@ interface TelemetryCollectionConfig {
|
||||
|
||||
### 6.1 `POST /api/telemetry/events` — Batch Ingest
|
||||
|
||||
**Auth:** JWT (authenticated users) or API key header (anonymous installs).
|
||||
**Auth:** JWT (`Authorization: Bearer`) or Install Token (`X-Install-Token: <uuid>`). See §4.1.2.
|
||||
|
||||
**Request:**
|
||||
|
||||
```ts
|
||||
// Zod schema
|
||||
// --- Zod schema for a single telemetry event ---
|
||||
const TelemetryEventSchema = z
|
||||
.object({
|
||||
// Identity
|
||||
id: z.string().uuid(),
|
||||
productId: z.string().min(1),
|
||||
userId: z.string().optional(),
|
||||
anonymousInstallId: z.string().uuid().optional(),
|
||||
sessionId: z.string().min(1),
|
||||
requestId: z.string().optional(),
|
||||
|
||||
// Source classification
|
||||
platform: z.enum(['ios', 'android', 'web', 'desktop']),
|
||||
channel: z.enum([
|
||||
'mobile_app',
|
||||
'keyboard_extension',
|
||||
'web_app',
|
||||
'desktop_app',
|
||||
'backend_service',
|
||||
]),
|
||||
osFamily: z.enum(['ios', 'android', 'macos', 'windows', 'linux', 'chromeos', 'other']),
|
||||
osVersion: z.string().optional(),
|
||||
deviceModel: z.string().optional(),
|
||||
locale: z.string().optional(),
|
||||
timezone: z.string().optional(),
|
||||
|
||||
// App release
|
||||
appVersion: z.string().min(1),
|
||||
buildNumber: z.string().min(1),
|
||||
releaseChannel: z.enum(['dev', 'beta', 'prod']),
|
||||
|
||||
// Event semantics
|
||||
eventType: z.enum(['debug', 'info', 'warn', 'error', 'fatal']),
|
||||
module: z.string().min(1),
|
||||
feature: z.string().optional(),
|
||||
eventName: z.string().min(1),
|
||||
|
||||
// Error & diagnostics
|
||||
errorDomain: z.string().optional(),
|
||||
errorCode: z.string().optional(),
|
||||
message: z.string().max(512).optional(),
|
||||
stackTrace: z.string().max(8192).optional(),
|
||||
fingerprint: z.string().optional(),
|
||||
|
||||
// Structured metadata
|
||||
tags: z.record(z.string().max(128)).optional(),
|
||||
metrics: z.record(z.number()).optional(),
|
||||
context: z.record(z.unknown()).optional(),
|
||||
|
||||
// Timing
|
||||
occurredAt: z.string().datetime(),
|
||||
})
|
||||
.refine(e => e.userId || e.anonymousInstallId, {
|
||||
message: 'At least one of userId or anonymousInstallId is required',
|
||||
});
|
||||
|
||||
// --- Batch ingest request ---
|
||||
const TelemetryIngestRequest = z.object({
|
||||
productId: z.string().min(1),
|
||||
events: z.array(TelemetryEventSchema).min(1).max(50),
|
||||
@ -497,29 +608,33 @@ interface TelemetryIngestResponse {
|
||||
|
||||
**Validation rules:**
|
||||
|
||||
1. `productId` must match a known, active product.
|
||||
2. Each event must have `id`, `productId`, `platform`, `channel`, `osFamily`, `appVersion`, `buildNumber`, `releaseChannel`, `eventType`, `module`, `eventName`, `sessionId`, `occurredAt`.
|
||||
3. At least one of `userId` or `anonymousInstallId`.
|
||||
4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys.
|
||||
1. **`productId` authority:** Request-level `productId` is authoritative. Per-event `productId` MUST match the request-level value; mismatches are rejected.
|
||||
2. Zod validation enforces all required fields (see schema above).
|
||||
3. At least one of `userId` or `anonymousInstallId` (Zod refine).
|
||||
4. `message` capped at 512 chars, `stackTrace` at 8KB, `tags` max 20 keys, `context` max 4KB serialized.
|
||||
5. PII regex rejection: reject events containing patterns matching email, phone, credit card.
|
||||
6. No raw dictation text allowed in any field.
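
A few of the rules above, sketched as a plain per-event check that would run alongside the Zod schema. `IncomingEvent` and `checkEvent` are hypothetical names. The email pattern is illustrative only and is not the document's PII regex; the 4KB `context` limit is approximated by serialized string length rather than bytes.

```ts
interface IncomingEvent {
  id: string;
  productId: string;
  message?: string;
  stackTrace?: string;
  tags?: { [key: string]: string };
  context?: { [key: string]: unknown };
}

// Illustrative pattern only; real PII scanning would also cover phone and card numbers.
const EMAIL_LIKE = /[^\s@"]+@[^\s@"]+\.[^\s@"]+/;

// Returns rejection reasons; an empty list means the event passes these checks.
function checkEvent(requestProductId: string, event: IncomingEvent): string[] {
  const problems: string[] = [];
  if (event.productId !== requestProductId) {
    problems.push("productId does not match request-level productId");
  }
  if (event.message !== undefined && event.message.length > 512) {
    problems.push("message exceeds 512 chars");
  }
  if (event.stackTrace !== undefined && event.stackTrace.length > 8192) {
    problems.push("stackTrace exceeds 8KB");
  }
  if (event.tags !== undefined && Object.keys(event.tags).length > 20) {
    problems.push("tags exceed 20 keys");
  }
  if (event.context !== undefined && JSON.stringify(event.context).length > 4096) {
    problems.push("context exceeds 4KB serialized");
  }
  if (EMAIL_LIKE.test(JSON.stringify(event))) {
    problems.push("event contains an email-like pattern");
  }
  return problems;
}
```

Returning reasons instead of throwing lets the endpoint report per-event rejections in the batch response.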
|
||||
|
||||
**Idempotency:** Events are upserted by `id`. If a client retries a batch (e.g., after a network timeout), events with duplicate IDs silently overwrite the stored copies. Delivery is therefore at-least-once with idempotent writes: each event is stored once, with no client-side dedup tracking required.
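
The effect of upserting by `id` can be illustrated with an in-memory stand-in for the container. `EventStore` and `StoredEvent` are hypothetical; the real store is Cosmos, where the upsert key would also involve the partition key.

```ts
interface StoredEvent {
  id: string;
  eventName: string;
}

// In-memory stand-in for the events container, keyed by event id.
class EventStore {
  private byId = new Map<string, StoredEvent>();

  // Writing an id that already exists replaces the stored copy.
  upsertBatch(events: StoredEvent[]): void {
    for (const event of events) {
      this.byId.set(event.id, event);
    }
  }

  get count(): number {
    return this.byId.size;
  }
}
```

Sending the same two-event batch twice leaves `count` at 2, which is why a client can retry after a timeout without tracking what was already delivered.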
|
||||
|
||||
### 6.2 `GET /api/telemetry/config` — Collection Config (Client Poll)
|
||||
|
||||
**Auth:** JWT (`Authorization: Bearer`) or Install Token (`X-Install-Token: <uuid>`). See §4.1.2.
|
||||
|
||||
**Query params:**
|
||||
|
||||
| Param | Type | Description |
|
||||
| ---------------- | ------- | ----------------------- |
|
||||
| `platform` | string | Client platform |
|
||||
| `channel` | string | Client channel |
|
||||
| `osFamily` | string | OS family |
|
||||
| `appVersion` | string | Current app version |
|
||||
| `buildNumber` | string | Current build number |
|
||||
| `releaseChannel` | string | dev/beta/prod |
|
||||
| `countryCode` | string? | Client-reported country |
|
||||
| `regionCode` | string? | Client-reported region |
|
||||
| Param | Type | Description |
|
||||
| -------------------- | ------- | --------------------------------------------------------- |
|
||||
| `userId` | string? | Authenticated user ID (for percentage rollout evaluation) |
|
||||
| `anonymousInstallId` | string? | Install ID (fallback for percentage rollout) |
|
||||
| `platform` | string | Client platform |
|
||||
| `channel` | string | Client channel |
|
||||
| `osFamily` | string | OS family |
|
||||
| `appVersion` | string | Current app version |
|
||||
| `buildNumber` | string | Current build number |
|
||||
| `releaseChannel` | string | dev/beta/prod |
|
||||
| `countryCode` | string? | Client-reported country |
|
||||
| `regionCode` | string? | Client-reported region (prefixed: `US:WA`) |
|
||||
|
||||
**Response:** `TelemetryCollectionConfig` (see §5.3).
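
The `userId` / `anonymousInstallId` params exist so the server can evaluate percentage rollouts deterministically. A minimal sketch, assuming a standard 32-bit FNV-1a and a modulo-100 bucket: `fnv1a32` and `inRollout` are hypothetical stand-ins for the existing `hashUserFlag` helper, whose actual key format and bucketing are not shown in this document.

```ts
// Standard 32-bit FNV-1a. This sketch hashes UTF-16 code units, which matches
// byte-wise FNV-1a for ASCII input only.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

// Assumption: the subject and flag key are joined with ":" and bucketed into 0..99.
function inRollout(subjectId: string, policyId: string, percentage: number): boolean {
  const bucket = fnv1a32(subjectId + ":telemetry_policy_" + policyId) % 100;
  return bucket < percentage;
}
```

Because the bucket depends only on the subject and policy IDs, a given install stays in or out of a rollout across polls, and raising `percentage` only ever adds clients.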
|
||||
|
||||
@ -582,6 +697,22 @@ interface TelemetryClusterResponse {
|
||||
| `PUT` | `/api/telemetry/policies/:id` | Update policy |
|
||||
| `DELETE` | `/api/telemetry/policies/:id` | Delete policy |
|
||||
|
||||
### 6.6 `DELETE /api/telemetry/user/:userId` — GDPR Right-to-Erasure
|
||||
|
||||
**Auth:** Admin JWT only.
|
||||
|
||||
Deletes ALL telemetry events and cluster references for the given `userId`. Returns count of deleted documents. Required for GDPR compliance.
|
||||
|
||||
**Response:**
|
||||
|
||||
```ts
|
||||
interface TelemetryErasureResponse {
|
||||
userId: string;
|
||||
eventsDeleted: number;
|
||||
clustersUpdated: number;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Storage & Partitioning
|
||||
@ -590,11 +721,11 @@ interface TelemetryClusterResponse {
|
||||
|
||||
#### `telemetry_events` (raw events)
|
||||
|
||||
| Property | Value |
|
||||
| ------------- | ------------------------------------------------------------ |
|
||||
| Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` |
|
||||
| TTL | 30–60 days (configurable via env `TELEMETRY_EVENT_TTL_DAYS`) |
|
||||
| RU budget | Start at 400 RU/s autoscale, monitor and adjust |
|
||||
| Property | Value |
| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| Partition key | `/pk` where `pk = ${productId}:${yyyyMM}:${platform}` |
| TTL | `defaultTtl: 30 * 86400` (30 days in seconds, configurable via `TELEMETRY_EVENT_TTL_DAYS`). Cosmos auto-deletes docs when `_ts + ttl` passes |
| RU budget | Start at 400 RU/s autoscale, monitor and adjust |
|
||||
|
||||
**Rationale:** Partitioning by product + month + platform keeps hot data together for typical queries ("show me iOS errors from this month") while distributing load.
|
||||
|
||||
@ -611,11 +742,11 @@ interface TelemetryClusterResponse {
|
||||
|
||||
#### `telemetry_error_clusters` (aggregated)
|
||||
|
||||
| Property | Value |
|
||||
| ------------- | ----------------------------------------------------- |
|
||||
| Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` |
|
||||
| TTL | 90–180 days |
|
||||
| RU budget | 200 RU/s autoscale |
|
||||
| Property | Value |
| ------------- | -------------------------------------------------------------------------------------------- |
| Partition key | `/pk` where `pk = ${productId}:${platform}:${module}` |
| TTL | `defaultTtl: 90 * 86400` (90 days in seconds, configurable via `TELEMETRY_CLUSTER_TTL_DAYS`) |
| RU budget | 200 RU/s autoscale |
|
||||
|
||||
#### `telemetry_collection_policies` (segment rules)
|
||||
|
||||
@ -675,19 +806,24 @@ function normalizeMessage(msg: string): string {
|
||||
|
||||
```ts
|
||||
interface TelemetryErrorCluster {
|
||||
id: string; // fingerprint + time window key
|
||||
id: string; // fingerprint + time window key (e.g. `${fingerprint}:${yyyyMM}`)
|
||||
pk: string; // ${productId}:${platform}:${module}
|
||||
productId: string;
|
||||
fingerprint: string;
|
||||
|
||||
// Dimensions
|
||||
// Dimensions (version-agnostic — one cluster spans all versions)
|
||||
platform: string;
|
||||
channel: string;
|
||||
osFamily: string;
|
||||
module: string;
|
||||
eventName: string;
|
||||
appVersion: string;
|
||||
buildNumber: string;
|
||||
|
||||
// Version breakdown — which builds are affected
|
||||
affectedVersions: Array<{
|
||||
appVersion: string;
|
||||
buildNumber: string;
|
||||
count: number;
|
||||
lastSeenAt: string;
|
||||
}>; // capped at 50 entries
|
||||
|
||||
// Aggregates
|
||||
firstSeenAt: string;
|
||||
@ -695,18 +831,20 @@ interface TelemetryErrorCluster {
|
||||
totalCount: number;
|
||||
affectedUserIds: string[]; // capped at 100
|
||||
affectedInstallIds: string[]; // capped at 100
|
||||
affectedOsFamilies: string[]; // e.g. ["ios", "macos"]
|
||||
|
||||
// Representative sample
|
||||
// Representative sample (from most recent event)
|
||||
sampleErrorDomain?: string;
|
||||
sampleErrorCode?: string;
|
||||
sampleMessage?: string;
|
||||
severity: 'warn' | 'error' | 'fatal';
|
||||
ttl: number; // Cosmos TTL in seconds
|
||||
}
|
||||
```
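
A sketch of how one ingested event might be folded into the counters of a cluster document with this shape. `ClusterCounters`, `ClusterEvent`, and `applyEvent` are hypothetical and cover only a subset of the fields; dropping new entries once a cap is reached, and comparing ISO 8601 UTC timestamps as strings, are assumptions.

```ts
interface VersionCount {
  appVersion: string;
  buildNumber: string;
  count: number;
  lastSeenAt: string;
}

interface ClusterCounters {
  totalCount: number;
  lastSeenAt: string;
  affectedUserIds: string[]; // capped at 100
  affectedVersions: VersionCount[]; // capped at 50
}

interface ClusterEvent {
  userId?: string;
  appVersion: string;
  buildNumber: string;
  occurredAt: string; // assumed ISO 8601 in UTC, so string order is time order
}

const MAX_USER_IDS = 100;
const MAX_VERSIONS = 50;

// Returns an updated copy; the input cluster is left untouched.
function applyEvent(cluster: ClusterCounters, event: ClusterEvent): ClusterCounters {
  const next: ClusterCounters = {
    totalCount: cluster.totalCount + 1,
    lastSeenAt: event.occurredAt > cluster.lastSeenAt ? event.occurredAt : cluster.lastSeenAt,
    affectedUserIds: cluster.affectedUserIds.slice(),
    affectedVersions: cluster.affectedVersions.map((v) => ({
      appVersion: v.appVersion,
      buildNumber: v.buildNumber,
      count: v.count,
      lastSeenAt: v.lastSeenAt,
    })),
  };
  // Dedup user IDs and stop adding once the cap is reached.
  if (
    event.userId !== undefined &&
    next.affectedUserIds.indexOf(event.userId) === -1 &&
    next.affectedUserIds.length < MAX_USER_IDS
  ) {
    next.affectedUserIds.push(event.userId);
  }
  // Bump the matching version entry, or add one if there is room.
  let found = false;
  for (const v of next.affectedVersions) {
    if (v.appVersion === event.appVersion && v.buildNumber === event.buildNumber) {
      v.count += 1;
      if (event.occurredAt > v.lastSeenAt) v.lastSeenAt = event.occurredAt;
      found = true;
    }
  }
  if (!found && next.affectedVersions.length < MAX_VERSIONS) {
    next.affectedVersions.push({
      appVersion: event.appVersion,
      buildNumber: event.buildNumber,
      count: 1,
      lastSeenAt: event.occurredAt,
    });
  }
  return next;
}
```

`totalCount` keeps growing even when the capped arrays are full, so the aggregate stays accurate while the document size stays bounded.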
|
||||
|
||||
### 8.3 Cluster Update Strategy
|
||||
|
||||
On each ingested `error`/`fatal` event:
|
||||
On each ingested `warn`, `error`, or `fatal` event:
|
||||
|
||||
1. Compute fingerprint.
|
||||
2. Upsert cluster doc: increment `totalCount`, update `lastSeenAt`, append to `affectedUserIds` (dedup, cap at 100).
|
||||
@ -768,6 +906,14 @@ Located at `admin-dashboard-web/src/app/(dashboard)/ops/telemetry-policies/page.
|
||||
class LysnrTelemetry {
|
||||
static let shared = LysnrTelemetry()
|
||||
|
||||
// Core properties (set once at init)
|
||||
let productId = "lysnrai"
|
||||
let platform = "ios"
|
||||
let osFamily = "ios"
|
||||
var channel: String // "mobile_app" or "keyboard_extension"
|
||||
var installId: String // from App Group UserDefaults
|
||||
var userId: String? // from App Group (set after login)
|
||||
|
||||
func track(
|
||||
eventType: EventType,
|
||||
module: String,
|
||||
@ -780,13 +926,21 @@ class LysnrTelemetry {
|
||||
metrics: [String: Double]? = nil
|
||||
)
|
||||
|
||||
func flush() // force-send queued events
|
||||
func refreshConfig() // poll collection policy
|
||||
func flush() // force-send queued events
|
||||
func refreshConfig() // poll collection policy
|
||||
|
||||
// Keyboard-specific
|
||||
func queueToAppGroup() // write pending events to App Group UserDefaults
|
||||
func flushAppGroupQueue() // called by main app on foreground
|
||||
}
|
||||
```
|
||||
|
||||
- Keyboard extension: posts events to App Group shared `UserDefaults` queue; main app flushes.
|
||||
- Alternatively, keyboard extension sends directly if Full Access is enabled (network available).
|
||||
**Keyboard extension offline strategy:**
|
||||
|
||||
- **Full Access ON:** Sends events directly via URLSession. Falls back to App Group queue on network failure.
|
||||
- **Full Access OFF:** Always queues to App Group UserDefaults (`telemetry_event_queue` key).
|
||||
- **Main app responsibility:** On each foreground, calls `LysnrTelemetry.shared.flushAppGroupQueue()` to drain keyboard-queued events.
|
||||
- **Queue limits:** Max 200 events (~100KB). FIFO eviction when full. See §4.1.2 for memory constraints.
|
||||
|
||||
### 10.2 Android (Kotlin)
|
||||
|
||||
@ -857,9 +1011,10 @@ telemetry.track({
|
||||
|
||||
### 11.3 Access Control
|
||||
|
||||
- **Ingest (`POST /api/telemetry/events`):** Any authenticated user or valid install API key.
|
||||
- **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only.
|
||||
- **Policy management:** Admin JWT only.
|
||||
- **Ingest (`POST /api/telemetry/events`):** Any authenticated user (JWT) or valid install token (`X-Install-Token`). See §4.1.2.
|
||||
- **Read (`GET /api/telemetry/query`, `/clusters`):** Admin JWT only. Enforced via `req.jwtPayload?.role === 'admin'` check (same pattern as other admin-only modules).
|
||||
- **Policy management:** Admin JWT only (same check).
|
||||
- **GDPR erasure:** Admin JWT only.
|
||||
- **No public endpoints.** Telemetry data is internal/operational only.
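
The access rules above reduce to a small decision function. This is a sketch only: `TelemetryRequest`, `Operation`, and `isAllowed` are hypothetical names, and `hasValidInstallToken` stands for the outcome of the `X-Install-Token` check from §4.1.2.

```ts
interface JwtPayload {
  sub: string;
  role?: string;
}

interface TelemetryRequest {
  jwtPayload?: JwtPayload; // set by upstream JWT middleware when a valid token is present
}

type Operation = "ingest" | "read" | "policy" | "erasure";

// Same admin check the bullet list describes.
function isAdmin(req: TelemetryRequest): boolean {
  return req.jwtPayload?.role === "admin";
}

function isAllowed(op: Operation, req: TelemetryRequest, hasValidInstallToken: boolean): boolean {
  if (op === "ingest") {
    // Any authenticated user, or an anonymous client with a well-formed install token.
    return req.jwtPayload !== undefined || hasValidInstallToken;
  }
  // Read, policy management, and GDPR erasure are admin-only.
  return isAdmin(req);
}
```

An install token never grants anything beyond ingest, which keeps self-issued IDs from reaching any read path.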
|
||||
|
||||
### 11.4 Rate Limiting
|
||||
@ -911,26 +1066,29 @@ telemetry.track({
|
||||
|
||||
## 13. Open Questions
|
||||
|
||||
| # | Question | Status |
|
||||
| --- | ----------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
|
||||
| 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **Recommend:** Direct when Full Access on, App Group queue as fallback |
|
||||
| 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears |
|
||||
| 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 |
|
||||
| 4 | Max retention for raw events? Compliance requirements? | Default 30 days, configurable |
|
||||
| 5 | Do we need GDPR right-to-erasure support for telemetry? | Yes — add `DELETE /api/telemetry/user/:userId` endpoint |
|
||||
| # | Question | Status |
|
||||
| --- | ----------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
|
||||
| 1 | Should keyboard extension send events directly (requires Full Access + network) or queue via App Group for main app to flush? | **RESOLVED (rev 2):** Direct when Full Access on, App Group queue as fallback. See §4.1.2. |
|
||||
| 2 | Do we need a separate Cosmos database for telemetry to isolate RU costs? | **Recommend:** Same database, separate containers (simpler), revisit if RU contention appears |
|
||||
| 3 | Should we support exporting telemetry to Azure Monitor / Application Insights for alerting? | Defer to Phase 3 |
|
||||
| 4 | Max retention for raw events? Compliance requirements? | **RESOLVED (rev 2):** 30 days default, configurable via `TELEMETRY_EVENT_TTL_DAYS`. Cosmos TTL in seconds. |
|
||||
| 5 | Do we need GDPR right-to-erasure support for telemetry? | **RESOLVED (rev 2):** Yes — `DELETE /api/telemetry/user/:userId` added to §6.6. |
|
||||
|
||||
---
|
||||
|
||||
## Appendix A: Env Vars
|
||||
|
||||
| Var | Default | Description |
|
||||
| ----------------------------- | -------- | ------------------------------ |
|
||||
| `TELEMETRY_ENABLED` | `true` | Global server-side kill switch |
|
||||
| `TELEMETRY_EVENT_TTL_DAYS` | `30` | Raw event retention |
|
||||
| `TELEMETRY_CLUSTER_TTL_DAYS` | `90` | Cluster retention |
|
||||
| `TELEMETRY_MAX_BATCH_SIZE` | `50` | Max events per ingest request |
|
||||
| `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body |
|
||||
| `TELEMETRY_PII_SCAN_ENABLED` | `true` | Server-side PII rejection |
|
||||
| Var | Default | Description |
| ----------------------------- | -------- | ------------------------------------------------------- |
| `TELEMETRY_ENABLED` | `true` | Global server-side kill switch |
| `TELEMETRY_EVENT_TTL_DAYS` | `30` | Raw event retention (Cosmos TTL = days × 86400 seconds) |
| `TELEMETRY_CLUSTER_TTL_DAYS` | `90` | Cluster retention |
| `TELEMETRY_MAX_BATCH_SIZE` | `50` | Max events per ingest request |
| `TELEMETRY_MAX_PAYLOAD_BYTES` | `262144` | 256KB max request body |
| `TELEMETRY_PII_SCAN_ENABLED` | `true` | Server-side PII rejection |
| `TELEMETRY_CLIENT_BATCH_SIZE` | `20` | Returned in config response for client-side batching |
| `TELEMETRY_CLIENT_FLUSH_MS` | `60000` | Returned in config response for client flush interval |
| `TELEMETRY_CLIENT_MAX_QUEUE` | `200` | Returned in config response for client max queue size |
|
||||
|
||||
## Appendix B: Related Files
|
||||
|
||||
|
||||