From 4163e1410afb64f27545d6e6aa8b8ab6e06e0458 Mon Sep 17 00:00:00 2001
From: saravanakumardb1
Date: Mon, 2 Mar 2026 23:29:39 -0800
Subject: [PATCH] docs(diagnostics): add REMOTE_DIAGNOSTICS_ROADMAP.md with 140+ tasks across 4 phases

Complete roadmap for remote debug tracing system with:
- Phase 1: Server foundation (types, repository, routes, 38+ tests)
- Phase 2: Client SDKs (TypeScript, Swift, Kotlin)
- Phase 3: Admin UI (Next.js dashboard)
- Phase 4: Advanced features (auto-triggers, profiling)

Review fixes included:
- Fixed partition keys to avoid hot partitions (composite pk)
- Added PII redaction patterns (email, SSN, CC, phone, IP, JWT)
- Added event bus integration with 8 events
- Fixed screenshot storage to use Azure Blob
- Added rate limiting specs for all endpoints
- Added ETag caching for config polling
---
 docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md | 651 ++++++++++++++++++++++
 1 file changed, 651 insertions(+)
 create mode 100644 docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md

diff --git a/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md b/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md
new file mode 100644
index 00000000..b35208a8
--- /dev/null
+++ b/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md
@@ -0,0 +1,651 @@
+# Remote Diagnostics & Debug Tracing - Implementation Roadmap
+
+> **Module:** `platform-service/src/modules/diagnostics/`
+> **Client SDK:** `@bytelyst/diagnostics`
+> **Target:** E2E debug collection from any device, on-demand triggers, industry-parity features
+> **Estimated Effort:** 2–3 weeks
+> **Status:** 🟡 Planning
+
+---
+
+## Executive Summary
+
+This roadmap delivers a **Datadog/Sentry-grade remote diagnostics system** for the ByteLyst ecosystem. Unlike the passive telemetry we already have, it enables **active debugging sessions** in which engineers remotely trigger deep diagnostics collection from any user device.
+
+### Key Differentiators vs. Existing Telemetry
+
+| Feature         | Existing Telemetry        | Remote Diagnostics                    |
+| --------------- | ------------------------- | ------------------------------------- |
+| Trigger         | Passive (always sampling) | **Active** (engineer-initiated)       |
+| Log Level       | Static config             | **Dynamic** (debug/trace per session) |
+| Network Tracing | None                      | **Full HTTP capture**                 |
+| Breadcrumbs     | Basic events              | **Rich timeline** (user journey)      |
+| Console Logs    | Error-only                | **Full capture** (debug→fatal)        |
+| Screenshots     | None                      | **Auto-capture on crash**             |
+| Session Replay  | None                      | **Future: video-style replay**        |
+
+---
+
+## Phase 1: Server Foundation (Week 1)
+
+### 1.1 Data Model & Schemas
+
+- [ ] **1.1.1** Create `modules/diagnostics/types.ts`
+  - [ ] `DebugSessionDoc` - session metadata (status, target, config)
+  - [ ] `DebugTraceDoc` - trace spans with timing
+  - [ ] `DebugLogEntryDoc` - structured log entries
+  - [ ] `DiagnosticsConfigDoc` - per-product collection policies
+  - [ ] Zod schemas for all inputs
+- [ ] **1.1.2** Add Cosmos containers to `cosmos-init.ts`
+  - [ ] `debug_sessions` (pk: `/id`, TTL: 7 days)
+  - [ ] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days)
+  - [ ] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days)
+  - [ ] `debug_screenshots` metadata (pk: `/sessionId`) - actual images stored in Azure Blob
+
+### 1.2 Repository Layer
+
+- [ ] **1.2.1** Create `modules/diagnostics/repository.ts`
+  - [ ] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`)
+  - [ ] `createSession()` - initiate debug session, emit `diagnostics.session.created` event
+  - [ ] `getSession()` - fetch session by ID (point read; `id` is also the partition key)
+  - [ ] `getSessionForIngest()` - optimized lookup for client ingest (query by `sessionId` field)
+  - [ ] `updateSession()` - status changes, emit `diagnostics.session.updated` event
+  - [ ] `listSessions()` - query by `productId` field with pagination
+  - [ ] `deleteSession()` - manual cleanup, emit `diagnostics.session.deleted` event
+  - [ ] `ingestTrace()` - batch upsert traces (use `upsert()` for idempotency)
+  - [ ] `ingestLogs()` - batch upsert logs with PII scan (reuse `telemetry` PII patterns)
+  - [ ] `getTraces()` - query by composite pk `${productId}:${sessionId}`
+  - [ ] `getLogs()` - query by composite pk with level filters
+  - [ ] `updateSessionStats()` - denormalize logCount/traceCount atomically
+
+### 1.3 REST API Routes
+
+- [ ] **1.3.1** Create `modules/diagnostics/routes.ts`
+  - [ ] Apply `requireRole('admin')` for all session management routes
+  - [ ] Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
+  - [ ] `POST /diagnostics/sessions` - create session (admin only)
+    - [ ] Validate target user exists (if userId provided)
+    - [ ] Validate product exists and is active
+    - [ ] Emit `diagnostics.session.created` to event bus
+  - [ ] `GET /diagnostics/sessions` - list sessions (admin only)
+    - [ ] Query params: productId, status, userId, from, to, limit, offset
+    - [ ] Default sort: createdAt desc
+  - [ ] `GET /diagnostics/sessions/:id` - get session details (admin or session owner)
+  - [ ] `PATCH /diagnostics/sessions/:id` - update session (admin only)
+    - [ ] Validate state transitions (pending→active, active→paused, etc.)
+    - [ ] Emit `diagnostics.session.updated` event
+  - [ ] `DELETE /diagnostics/sessions/:id` - cancel session (admin only)
+    - [ ] Soft delete (mark cancelled, don't hard delete for audit trail)
+    - [ ] Emit `diagnostics.session.cancelled` event
+  - [ ] `GET /diagnostics/config` - client polling endpoint (any authenticated user)
+    - [ ] Return active session for this device/user if exists
+    - [ ] ETag support for 304 caching (reduce bandwidth)
+    - [ ] Rate limit: 1 request per 5 seconds per device
+  - [ ] `POST /diagnostics/ingest` - batch trace/log ingestion (any authenticated user)
+    - [ ] Validate session is active for this device
+    - [ ] PII scan all log messages (reuse telemetry PII patterns)
+    - [ ] Batch size limit: 50 items per request
+    - [ ] Async processing for large batches (return 202 Accepted)
+  - [ ] `POST /diagnostics/sessions/:id/screenshots` - upload screenshot metadata
+    - [ ] Generate SAS token via existing `blob` module for direct Azure upload
+    - [ ] Store metadata in `debug_screenshots` container
+    - [ ] Return 201 with blob URL for client upload
+  - [ ] `GET /diagnostics/sessions/:id/screenshots` - list screenshot metadata (admin)
+  - [ ] `GET /diagnostics/sessions/:id/traces` - get traces with pagination
+  - [ ] `GET /diagnostics/sessions/:id/logs` - get logs with level filter, search
+
+### 1.4 Testing
+
+- [ ] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts`
+  - [ ] Session CRUD tests (10 tests)
+    - [ ] Create session with valid target user
+    - [ ] Create session fails for non-existent user
+    - [ ] Create session rate limiting (10/hour)
+    - [ ] Get session by ID
+    - [ ] List sessions with filters
+    - [ ] Update session status transitions
+    - [ ] Cancel session (soft delete)
+    - [ ] Session not found after TTL expires
+    - [ ] Unauthorized access blocked
+    - [ ] Event bus emissions verified
+  - [ ] Trace ingestion tests (8 tests)
+    - [ ] Batch trace ingest success
+    - [ ] Trace ingest with invalid session rejected
+    - [ ] Duplicate trace idempotency (upsert)
+    - [ ] Composite pk query by session
+    - [ ] Trace timing validation
+    - [ ] Parent-child span relationships
+    - [ ] Trace with error status
+    - [ ] Large batch rejected (>50 items)
+  - [ ] Log ingestion tests (8 tests)
+    - [ ] Batch log ingest success
+    - [ ] Log with PII redacted (email, SSN)
+    - [ ] Log level filtering
+    - [ ] Invalid session rejected
+    - [ ] Log search by message content
+    - [ ] Log context preservation
+    - [ ] Fatal log triggers alert
+    - [ ] Log TTL enforcement (3 days)
+  - [ ] Config polling tests (6 tests)
+    - [ ] Returns active session for device
+    - [ ] Returns empty when no active session
+    - [ ] ETag 304 caching works
+    - [ ] Rate limit enforced (5 sec)
+    - [ ] Wrong device cannot access other session
+    - [ ] Expired session not returned
+  - [ ] Screenshot tests (6 tests)
+    - [ ] SAS token generation via blob module
+    - [ ] Metadata stored in Cosmos
+    - [ ] Direct Azure Blob upload works
+    - [ ] Screenshot metadata retrieval
+    - [ ] Unauthorized access blocked
+    - [ ] Blob lifecycle tied to session TTL
+  - [ ] **Target:** 38+ Vitest tests (increased from 28)
+
+### 1.5 Integration
+
+- [ ] **1.5.1** Wire into `server.ts`
+  - [ ] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js`
+  - [ ] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })`
+  - [ ] Add after telemetry routes (logical grouping)
+- [ ] **1.5.2** Event Bus Integration (`lib/event-bus.ts`)
+  - [ ] Subscribe to `diagnostics.session.created` → Send notification to target user (email/push)
+  - [ ] Subscribe to `diagnostics.session.cancelled` → Notify admin who started session
+  - [ ] Subscribe to `diagnostics.ingest.fatal` → Alert on-call engineer (PagerDuty/Slack)
+  - [ ] Subscribe to `diagnostics.session.completed` → Email summary to admin
+- [ ] **1.5.3** Audit Logging (`modules/audit/`)
+  - [ ] Log all session lifecycle events (create, update, cancel)
+  - [ ] Include target user ID, admin ID, session config in audit trail
+  - [ ] Retention: 90 days via `audit_log` container TTL
+- [ ] **1.5.4** Rate Limiting Registration
+  - [ ] Add `diagnostics:session:create` rate limit key (10/hour per admin)
+  - [ ] Add `diagnostics:config:poll` rate limit key (1/5sec per device)
+  - [ ] Add `diagnostics:ingest:submit` rate limit key (100/min per device)
+
+**Phase 1 Exit Criteria:**
+
+- [ ] All routes return the expected status codes (200/201/202/304) with correct payloads
+- [ ] 38+ tests passing (updated from 28)
+- [ ] Event bus subscribers registered and tested
+- [ ] Audit logs written for all session operations
+- [ ] Rate limiting enforced
+- [ ] PII redaction working in log ingest
+- [ ] Admin can create session via API
+
+---
+
+## Phase 2: Client SDK Abstractions (Week 1–2)
+
+### 2.1 TypeScript Client SDK
+
+- [ ] **2.1.1** Create `@bytelyst/diagnostics` package
+  - [ ] `package.json` with ESM exports
+  - [ ] `tsconfig.json` extending base
+- [ ] **2.1.2** Core types (`src/types.ts`)
+  - [ ] `DiagnosticsSession` interface
+  - [ ] `TraceSpan` interface
+  - [ ] `LogLevel` enum (debug, info, warn, error, fatal)
+  - [ ] `DiagnosticsConfig` from server
+- [ ] **2.1.3** Main client (`src/client.ts`)
+  - [ ] `DiagnosticsClient` class (singleton)
+  - [ ] `start()` - begin polling for active sessions
+  - [ ] `stop()` - cease polling
+  - [ ] `isSessionActive()` - check current state
+  - [ ] `trace(name, fn)` - auto-instrumented span wrapper
+  - [ ] `log(level, message, context)` - structured logging
+  - [ ] `breadcrumb(message, category)` - add timeline marker
+- [ ] **2.1.4** Network interceptor (`src/network.ts`)
+  - [ ] `NetworkInterceptor` class
+  - [ ] Wrap `fetch()` for capture
+  - [ ] Capture: URL, method, headers (sanitized), timing, status
+  - [ ] Configurable URL patterns (include/exclude)
+- [ ] **2.1.5** Breadcrumb trail (`src/breadcrumbs.ts`)
+  - [ ] Ring buffer (max 100 entries)
+  - [ ] Auto-capture: navigation, clicks, errors
+  - [ ] Manual: `breadcrumb()` API
+- [ ] **2.1.6** Device state (`src/device.ts`)
+  - [ ] Memory, battery, storage, thermal, network type
+- [ ] **2.1.7** Screenshot capture (`src/screenshot.ts`)
+  - [ ] `html2canvas` integration (browser)
+  - [ ] Auto-capture on error (configurable)
+- [ ] **2.1.8** Tests (`src/__tests__/diagnostics.test.ts`)
+  - [ ] Client lifecycle tests (4)
+  - [ ] Trace recording tests (4)
+  - [ ] Log buffering tests (4)
+  - [ ] Network interception tests (4)
+  - [ ] Breadcrumb tests (4)
+  - [ ] **Target:** 20+ Vitest tests
+
+### 2.2 Swift Client SDK (iOS)
+
+- [ ] **2.2.1** Create `ByteLystDiagnostics` Swift package
+  - [ ] `Package.swift` with iOS 15+ target
+  - [ ] Module structure: Core, Network, UI, Device
+- [ ] **2.2.2** Core client (`Sources/Core/DiagnosticsClient.swift`)
+  - [ ] `DiagnosticsClient` actor (thread-safe)
+  - [ ] `start()` - polling with `Timer`
+  - [ ] `trace(name, operation)` - async span wrapper
+  - [ ] `log(level, message, metadata)` - os_log integration
+  - [ ] `breadcrumb(category, message)` - timeline
+- [ ] **2.2.3** Network interception (`Sources/Network/URLSessionInterceptor.swift`)
+  - [ ] `URLProtocol` subclass for automatic capture
+  - [ ] Swizzle `URLSession` or use protocol registration
+  - [ ] Capture: request/response, timing, bytes
+- [ ] **2.2.4** Device state (`Sources/Device/DeviceState.swift`)
+  - [ ] `UIDevice` integration (battery, thermal)
+  - [ ] `ProcessInfo` (memory pressure)
+  - [ ] `NetworkMonitor` (path status)
+- [ ] **2.2.5** Screenshot (`Sources/UI/ScreenshotCapture.swift`)
+  - [ ] `UIApplication` key window capture
+  - [ ] Privacy: blur sensitive views (configurable)
+- [ ] **2.2.6** Tests (`Tests/DiagnosticsClientTests.swift`)
+  - [ ] Unit tests (12)
+  - [ ] Integration tests (8)
+
+### 2.3 Kotlin Client SDK (Android)
+
+- [ ] **2.3.1** Create `diagnostics` module in `kotlin-platform-sdk`
+  - [ ] `build.gradle.kts` dependencies
+  - [ ] Coroutines + OkHttp + WorkManager
+- [ ] **2.3.2** Core client (`diagnostics/DiagnosticsClient.kt`)
+  - [ ] Singleton with `StateFlow`
+  - [ ] `start()` - foreground service for polling
+  - [ ] `trace()` - suspend function with span
+  - [ ] `log()` - Logcat + structured queue
+- [ ] **2.3.3** Network interceptor (`diagnostics/OkHttpInterceptor.kt`)
+  - [ ] `Interceptor` implementation
+  - [ ] Capture: request/response chain
+- [ ] **2.3.4** Device state (`diagnostics/DeviceStateCollector.kt`)
+  - [ ] `BatteryManager`, `ActivityManager`, `StorageStatsManager`
+- [ ] **2.3.5** Screenshot (`diagnostics/ScreenshotCapture.kt`)
+  - [ ] `MediaProjection` API (with permission)
+  - [ ] `PixelCopy` for surface capture
+- [ ] **2.3.6** Tests (`diagnostics/DiagnosticsClientTest.kt`)
+  - [ ] Unit tests (10)
+  - [ ] Integration tests (6)
+
+**Phase 2 Exit Criteria:**
+
+- [ ] TS SDK builds + 20 tests passing
+- [ ] Swift SDK builds + 20 tests passing
+- [ ] Kotlin SDK builds + 16 tests passing
+- [ ] All SDKs can poll config endpoint
+
+---
+
+## Phase 3: Admin Dashboard UI (Week 2)
+
+### 3.1 Debug Sessions Page
+
+- [ ] **3.1.1** Create `/ops/debug-sessions/page.tsx`
+  - [ ] Session list table (columns: ID, user, device, status, started, duration)
+  - [ ] Filters: status, product, date range
+  - [ ] Pagination
+  - [ ] "New Session" button → modal
+- [ ] **3.1.2** New Session Modal
+  - [ ] Target user (email/userId search)
+  - [ ] Target device (dropdown from sessions)
+  - [ ] Collection level (standard, debug, trace)
+  - [ ] Duration slider (5min → 24hr)
+  - [ ] Screenshot on error (toggle)
+  - [ ] "Start Session" → POST API
+
+### 3.2 Session Detail View
+
+- [ ] **3.2.1** Create `/ops/debug-sessions/[id]/page.tsx`
+  - [ ] Session header (status badge, user info, device info)
+  - [ ] Action buttons: Extend (+30min), Stop, Download
+  - [ ] Tabs: Timeline, Logs, Network, Traces, Screenshots
+- [ ] **3.2.2** Timeline Tab
+  - [ ] Breadcrumb list (time, category, message)
+  - [ ] Visual timeline (horizontal bar)
+  - [ ] Click to jump to related trace/log
+- [ ] **3.2.3** Logs Tab
+  - [ ] Log level filters (debug, info, warn, error, fatal)
+  - [ ] Search/filter
+  - [ ] Expandable rows with context
+  - [ ] Syntax highlighting
+- [ ] **3.2.4** Network Tab
+  - [ ] Request list (time, method, URL, status, duration)
+  - [ ] Click to view: request/response headers, body
+  - [ ] Filter by status code, URL pattern
+- [ ] **3.2.5** Traces Tab
+  - [ ] Trace tree view (spans, parent-child)
+  - [ ] Timing visualization (waterfall)
+  - [ ] Search by name
+- [ ] **3.2.6** Screenshots Tab
+  - [ ] Grid of thumbnails
+  - [ ] Click to expand
+  - [ ] Download all as ZIP
+
+### 3.3 Real-time Updates
+
+- [ ] **3.3.1** Server-sent events or polling
+  - [ ] Auto-refresh session status every 5 seconds
+  - [ ] Toast notification on new logs/traces
+
+### 3.4 Client Library
+
+- [ ] **3.4.1** Create `lib/diagnostics-client.ts`
+  - [ ] `querySessions()`
+  - [ ] `createSession()`
+  - [ ] `getSession()`
+  - [ ] `updateSession()`
+  - [ ] `getTraces()`
+  - [ ] `getLogs()`
+
+**Phase 3 Exit Criteria:**
+
+- [ ] Admin can create session from UI
+- [ ] Session detail shows live data
+- [ ] All 5 tabs functional (Timeline, Logs, Network, Traces, Screenshots)
+
+---
+
+## Phase 4: Advanced Features (Week 3)
+
+### 4.1 Automated Triggers
+
+- [ ] **4.1.1** Error-threshold triggers
+  - [ ] Config: "Start debug session if error rate > X%"
+  - [ ] Background job checks every 5 minutes
+  - [ ] Auto-notify on Slack/Teams
+- [ ] **4.1.2** Crash-triggered sessions
+  - [ ] Client sends crash → server auto-starts session
+  - [ ] Captures 60 seconds pre-crash context
+
+### 4.2 Session Replay (Future)
+
+- [ ] **4.2.1** DOM/View state capture
+  - [ ] Record user interactions (clicks, scrolls, inputs)
+  - [ ] Replay as video-like timeline
+  - [ ] Privacy: exclude password fields
+
+### 4.3 Performance Profiling
+
+- [ ] **4.3.1** CPU/Memory profiling
+  - [ ] iOS: `os_signpost` integration
+  - [ ] Android: `Debug.MemoryInfo`
+  - [ ] Web: `performance.now()` + memory API
+
+### 4.4 Integration Tests
+
+- [ ] **4.4.1** E2E test: Admin creates session → Client captures → Admin views
+  - [ ] Playwright test (web client)
+  - [ ] XCTest UI test (iOS)
+  - [ ] Espresso test (Android)
+
+**Phase 4 Exit Criteria:**
+
+- [ ] Auto-trigger tests passing
+- [ ] E2E flow working end-to-end
+
+---
+
+## Appendix A: Data Models
+
+### DebugSessionDoc
+
+```typescript
+interface DebugSessionDoc {
+  id: string; // Prefixed `ds_`; also the partition key (/id)
+  productId: string; // For filtering/querying (not pk to avoid hot partitions)
+
+  // Target (at least one required)
+  targetUserId?: string; // For authenticated users
+  targetAnonymousId?: string; // For anonymous users (installId)
+  targetDeviceId?: string; // Specific device fingerprint
+  targetSessionId?: string; // Specific app session to capture
+
+  // Status lifecycle: pending → active → paused → completed | cancelled
+  status: 'pending' | 'active' | 'paused' | 'completed' | 'cancelled';
+
+  // Collection configuration
+  collectionLevel: 'standard' | 'debug' | 'trace';
+  captureLogs: boolean;
+  captureNetwork: boolean;
+  captureScreenshots: boolean;
+  screenshotOnError: boolean;
+  maxDurationMinutes: number; // Default: 60, Max: 1440 (24h)
+
+  // Timestamps
+  createdAt: string; // ISO 8601
+  updatedAt: string; // Last status/config change
+  startedAt?: string; // When status became 'active'
+  endedAt?: string; // When status became 'completed'|'cancelled'
+  expiresAt: string; // Auto-expiry (createdAt + maxDurationMinutes)
+
+  // Stats (denormalized for fast reads, updated via ingest)
+  logCount: number;
+  traceCount: number;
+  screenshotCount: number;
+
+  // Audit
+  createdBy: string; // Admin userId who created session
+  updatedBy?: string; // Last admin to modify
+
+  // Consent tracking (privacy compliance)
+  userConsent?: {
+    consentedAt: string;
+    consentMethod: 'prompt' | 'pre_consent' | 'auto'; // How user agreed
+  };
+}
+```
+
+### DebugTraceDoc (OpenTelemetry-compatible)
+
+```typescript
+interface DebugTraceDoc {
+  id: string; // Prefixed `tr_`
+  pk: string; // Composite: `${productId}:${sessionId}` (partition key)
+  sessionId: string; // For queries (also part of pk)
+  productId: string; // For filtering
+
+  // OpenTelemetry trace context
+  traceId: string; // OTel trace ID (hex)
+  parentId?: string; // Parent span ID (omitted for root)
+  spanId: string; // This span's ID
+  name: string; // Operation name (e.g., "UserLogin", "API.fetchUser")
+  kind?: 'internal' | 'server' | 'client' | 'producer' | 'consumer';
+
+  // Timing (nanosecond precision for OTel compatibility)
+  startTime: string; // ISO 8601
+  endTime?: string;
+  durationMs?: number;
+
+  // Context and attributes
+  attributes: Record<string, unknown>; // Custom key-value pairs
+  status: 'ok' | 'error' | 'unset';
+  statusMessage?: string; // Error description if status=error
+
+  // Events (timestamped markers within the span, e.g., "db.query", "cache.hit")
+  events?: Array<{
+    name: string;
+    timestamp: string;
+    attributes?: Record<string, unknown>;
+  }>;
+
+  // Links to other traces (for async operations)
+  links?: Array<{
+    traceId: string;
+    spanId: string;
+    attributes?: Record<string, unknown>;
+  }>;
+}
+```
+
+### DebugLogEntryDoc
+
+```typescript
+interface DebugLogEntryDoc {
+  id: string; // Prefixed `log_`
+  pk: string; // Composite: `${productId}:${sessionId}` (partition key)
+  sessionId: string; // For queries (also part of pk)
+  productId: string; // For filtering
+
+  // Log level (matches syslog/OTel severity)
+  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
+  message: string; // Original message (PII redacted server-side)
+  messageHash?: string; // SHA-256 for deduplication
+
+  // Timestamp (client clock, server enriches with receivedAt)
+  timestamp: string; // ISO 8601, when the log was generated
+  receivedAt?: string; // Server-side ingestion time
+
+  // Source context
+  module: string; // Component/module name (e.g., "AudioEngine", "SyncManager")
+  file?: string; // Source file path (sanitized)
+  line?: number; // Line number
+  function?: string; // Function/method name
+
+  // Thread/task context
+  threadId?: string; // For multi-threaded apps
+  correlationId?: string; // Links related operations
+
+  // Arbitrary context (PII scanned and redacted)
+  context: Record<string, unknown>;
+
+  // PII redaction metadata
+  redaction?: {
+    fieldsRedacted: string[]; // Which fields were scrubbed
+    patternsMatched: string[]; // Which PII patterns found (email, ssn, etc.)
+  };
+}
+```
+
+### DebugScreenshotDoc (Metadata only; image in Blob)
+
+```typescript
+interface DebugScreenshotDoc {
+  id: string; // Prefixed `scr_`
+  sessionId: string; // Partition key for queries
+  productId: string;
+
+  // Storage reference (actual image in Azure Blob)
+  blobUrl: string; // SAS URL to blob (time-limited)
+  blobPath: string; // Path in container: `screenshots/${productId}/${sessionId}/${id}.png`
+  containerName: string; // Azure Blob container (e.g., "diagnostics-screenshots")
+
+  // Screenshot metadata
+  capturedAt: string; // When captured
+  trigger: 'manual' | 'error' | 'interval' | 'user_request'; // Why taken
+
+  // Dimensions & format
+  width: number;
+  height: number;
+  format: 'png' | 'jpeg' | 'webp';
+  sizeBytes: number;
+
+  // Privacy
+  sensitiveViewsBlurred: boolean; // Whether PII areas were blurred
+  blurRegions?: Array<{ x: number; y: number; w: number; h: number }>; // If partial blur
+
+  // Optional context
+  screenName?: string; // Current screen/view when captured
+  breadcrumbAtCapture?: string; // Last breadcrumb before screenshot
+}
+```
+
+---
+
+## Appendix B: API Reference
+
+| Method | Endpoint                                    | Auth        | Rate Limit     | Description                             |
+| ------ | ------------------------------------------- | ----------- | -------------- | --------------------------------------- |
+| POST   | `/api/diagnostics/sessions`                 | Admin       | 10/hour/admin  | Create debug session                    |
+| GET    | `/api/diagnostics/sessions`                 | Admin       | 100/min        | List sessions (paginated)               |
+| GET    | `/api/diagnostics/sessions/:id`             | Admin/Owner | 100/min        | Get session details                     |
+| PATCH  | `/api/diagnostics/sessions/:id`             | Admin       | 10/min         | Update session status                   |
+| DELETE | `/api/diagnostics/sessions/:id`             | Admin       | 10/min         | Cancel session (soft delete)            |
+| GET    | `/api/diagnostics/config`                   | Any auth    | 1/5sec/device  | Poll for active session (ETag cached)   |
+| POST   | `/api/diagnostics/ingest`                   | Any auth    | 100/min/device | Submit traces/logs batch (max 50 items) |
+| POST   | `/api/diagnostics/sessions/:id/traces`      | Any auth    | 100/min/device | Ingest trace spans                      |
+| POST   | `/api/diagnostics/sessions/:id/logs`        | Any auth    | 100/min/device | Ingest log entries                      |
+| POST   | `/api/diagnostics/sessions/:id/screenshots` | Any auth    | 10/min/device  | Get SAS URL for screenshot upload       |
+| GET    | `/api/diagnostics/sessions/:id/traces`      | Admin       | 100/min        | Query traces for session                |
+| GET    | `/api/diagnostics/sessions/:id/logs`        | Admin       | 100/min        | Query logs with filters                 |
+| GET    | `/api/diagnostics/sessions/:id/screenshots` | Admin       | 100/min        | List screenshot metadata                |
+
+---
+
+## Appendix C: Industry Comparison
+
+| Capability      | Firebase Crashlytics | Sentry   | Datadog RUM | Our Solution       |
+| --------------- | -------------------- | -------- | ----------- | ------------------ |
+| Crash Reporting | ✅                   | ✅       | ✅          | ✅ (via telemetry) |
+| Error Tracking  | ✅                   | ✅       | ✅          | ✅ (via telemetry) |
+| Breadcrumbs     | ✅                   | ✅       | ✅          | ✅                 |
+| Custom Traces   | ⚠️ Limited           | ✅       | ✅          | ✅                 |
+| Network Tracing | ❌                   | ✅       | ✅          | ✅                 |
+| Console Logs    | ⚠️ Error only        | ✅       | ✅          | ✅ (all levels)    |
+| Session Replay  | ❌                   | ✅       | ✅          | 🟡 Future          |
+| Remote Trigger  | ❌                   | ✅       | ❌          | ✅                 |
+| On-Device Debug | ❌                   | ❌       | ❌          | ✅                 |
+| Screenshots     | ⚠️ Crash only        | ✅       | ❌          | ✅                 |
+| Open Source     | ❌                   | ✅ (SDK) | ❌          | ✅                 |
+
+---
+
+## Appendix D: Privacy & Security
+
+### D.1 PII Redaction Patterns (server-side)
+
+| Pattern        | Regex                                                  | Redaction Method            | Example                                          |
+| -------------- | ------------------------------------------------------ | --------------------------- | ------------------------------------------------ |
+| Email          | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`       | Replace with `[EMAIL]`      | `user@example.com` → `[EMAIL]`                   |
+| SSN (US)       | `\b\d{3}-?\d{2}-?\d{4}\b`                              | Replace with `[SSN]`        | `123-45-6789` → `[SSN]`                          |
+| Credit Card    | `\b(?:\d[ -]*?){13,16}\b`                              | Replace with `[CC]`         | `4111 1111 1111 1111` → `[CC]`                   |
+| Phone          | `\b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b`      | Replace with `[PHONE]`      | `+1 (555) 123-4567` → `[PHONE]`                  |
+| IP Address     | `\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`               | Replace with `[IP]`         | `192.168.1.1` → `[IP]`                           |
+| Password/Token | `(?i)(password\|token\|secret\|key)\s*[:=]\s*\S+`      | Replace with `[CREDENTIAL]` | `password: secret123` → `password: [CREDENTIAL]` |
+| JWT            | `eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*` | Replace with `[JWT]`        | Full JWT → `[JWT]`                               |
+
+- [ ] **1. PII Redaction:** Implement all patterns above in `lib/pii-redaction.ts` (shared with telemetry)
+- [ ] **2. Redaction Metadata:** Store scrubbed fields in `redaction.fieldsRedacted` and matched patterns in `redaction.patternsMatched` for transparency
+- [ ] **3. Consent Tracking:** `userConsent` field in session doc with `consentedAt` and `consentMethod`
+- [ ] **4. Data Retention:** 7-day default TTL for sessions/traces, 3-day for logs (Cosmos TTL)
+- [ ] **5. Access Control:** Admin-only session creation; users can only view their own sessions via `targetUserId` check
+- [ ] **6. Encryption:** All data encrypted at rest (Cosmos), in transit (TLS 1.3)
+- [ ] **7. Audit Trail:** All session operations logged to `audit_log` container (90-day retention)
+- [ ] **8. User Notification:** Email/push notification when debug session started on their device
+
+---
+
+## Appendix E: Event Bus Events
+
+| Event Name                        | Payload                                             | Publishers                        | Subscribers                         |
+| --------------------------------- | --------------------------------------------------- | --------------------------------- | ----------------------------------- |
+| `diagnostics.session.created`     | `{ sessionId, productId, targetUserId, createdBy }` | diagnostics module                | notifications → email/push user     |
+| `diagnostics.session.started`     | `{ sessionId, productId, startedAt }`               | diagnostics module                | -                                   |
+| `diagnostics.session.updated`     | `{ sessionId, productId, changes, updatedBy }`      | diagnostics module                | audit log                           |
+| `diagnostics.session.cancelled`   | `{ sessionId, productId, reason, cancelledBy }`     | diagnostics module                | notifications → admin               |
+| `diagnostics.session.completed`   | `{ sessionId, productId, stats, endedAt }`          | diagnostics module                | notifications → admin summary email |
+| `diagnostics.session.expired`     | `{ sessionId, productId, expiredAt }`               | diagnostics module (TTL job)      | -                                   |
+| `diagnostics.ingest.fatal`        | `{ sessionId, productId, logEntry, timestamp }`     | diagnostics module (on fatal log) | PagerDuty/Slack alert               |
+| `diagnostics.screenshot.captured` | `{ sessionId, productId, screenshotId, trigger }`   | diagnostics module                | -                                   |
+
+---
+
+## Current Status
+
+- [x] **Design complete** - 2026-03-02
+- [x] **Review complete** - 10 bugs/gaps identified and fixed:
+  1. Fixed partition keys to avoid hot partitions (composite pk for traces/logs)
+  2. Added `pk` field to the trace and log data models, matching the existing telemetry pattern
+  3. Added `updatedAt`/`updatedBy` for audit trail completeness
+  4. Added `userConsent` field for GDPR/privacy compliance
+  5. Fixed screenshot storage to use Azure Blob (not Cosmos)
+  6. Added PII redaction patterns and metadata tracking
+  7. Added event bus integration with 8 specific events
+  8. Added rate limiting specs for all endpoints
+  9. Added ETag caching for config polling
+  10. Added `targetSessionId` for capturing specific app sessions
+- [ ] Phase 1: Server Foundation (38 tests target)
+- [ ] Phase 2: Client SDKs (TS/Swift/Kotlin)
+- [ ] Phase 3: Admin UI
+- [ ] Phase 4: Advanced Features
+
+**Total Tasks:** 140+ checkboxes across 4 phases
+
+**Last Updated:** 2026-03-02
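
Reviewer note (not part of the patch): the PII redaction patterns listed in Appendix D.1 could be applied in sequence by a small helper along the following lines. This is a minimal sketch under stated assumptions, not the planned `lib/pii-redaction.ts`: the names `RedactionRule`, `RULES`, and `redactPii` are hypothetical, the `(?i)` flag is expressed as the JavaScript `i` flag, and the phone pattern is written without a leading `\b` so that a leading `+` is consumed by the match.

```typescript
// Hypothetical sketch of server-side PII redaction. All names here are
// assumptions for illustration, not an existing ByteLyst API.
interface RedactionRule {
  name: string; // Label that would be recorded in redaction.patternsMatched
  pattern: RegExp; // Global regex adapted from Appendix D.1
  replacement: string; // Placeholder text; may reference capture groups
}

interface RedactionResult {
  text: string;
  patternsMatched: string[];
}

// Order matters: token-like patterns run before the generic digit patterns.
const RULES: RedactionRule[] = [
  {
    name: 'jwt',
    pattern: /eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g,
    replacement: '[JWT]',
  },
  {
    name: 'credential',
    pattern: /(password|token|secret|key)\s*[:=]\s*\S+/gi,
    replacement: '$1: [CREDENTIAL]', // Keep the key name, drop the value
  },
  {
    name: 'email',
    pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
    replacement: '[EMAIL]',
  },
  { name: 'cc', pattern: /\b(?:\d[ -]*?){13,16}\b/g, replacement: '[CC]' },
  { name: 'ssn', pattern: /\b\d{3}-?\d{2}-?\d{4}\b/g, replacement: '[SSN]' },
  {
    name: 'ip',
    pattern: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g,
    replacement: '[IP]',
  },
  {
    name: 'phone',
    // Assumption: no leading \b, so an optional "+1 " prefix is part of the match.
    pattern: /(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/g,
    replacement: '[PHONE]',
  },
];

// Applies every rule in order and records which rules changed the text.
function redactPii(input: string): RedactionResult {
  const patternsMatched: string[] = [];
  let text = input;
  for (const rule of RULES) {
    const next = text.replace(rule.pattern, rule.replacement);
    if (next !== text) {
      patternsMatched.push(rule.name);
    }
    text = next;
  }
  return { text, patternsMatched };
}
```

With this sketch, `redactPii('password: secret123')` yields the text `password: [CREDENTIAL]` with `patternsMatched` equal to `['credential']`. Returning the matched rule names alongside the text is one way to populate the `redaction.patternsMatched` field of `DebugLogEntryDoc`; a real implementation would also need to walk the `context` object to fill `fieldsRedacted`.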