docs(diagnostics): add REMOTE_DIAGNOSTICS_ROADMAP.md with 140+ tasks across 4 phases

Complete roadmap for remote debug tracing system with:
- Phase 1: Server foundation (types, repository, routes, 38+ tests)
- Phase 2: Client SDKs (TypeScript, Swift, Kotlin)
- Phase 3: Admin UI (Next.js dashboard)
- Phase 4: Advanced features (auto-triggers, profiling)

Review fixes included:
- Fixed partition keys to avoid hot partitions (composite pk)
- Added PII redaction patterns (email, SSN, CC, phone, IP, JWT)
- Added event bus integration with 8 events
- Fixed screenshot storage to use Azure Blob
- Added rate limiting specs for all endpoints
- Added ETag caching for config polling
saravanakumardb1 2026-03-02 23:29:39 -08:00
parent 03ad80a615
commit 4163e1410a


@@ -0,0 +1,651 @@
# Remote Diagnostics & Debug Tracing — Implementation Roadmap
> **Module:** `platform-service/src/modules/diagnostics/`
> **Client SDK:** `@bytelyst/diagnostics`
> **Target:** E2E debug collection from any device, on-demand triggers, industry-parity features
> **Estimated Effort:** 2-3 weeks
> **Status:** 🟡 Planning
---
## Executive Summary
This roadmap delivers a **Datadog/Sentry-grade remote diagnostics system** for the ByteLyst ecosystem. Unlike passive telemetry (which we have), this enables **active debugging sessions** where engineers can remotely trigger deep diagnostics collection from any user device.
### Key Differentiators vs. Existing Telemetry
| Feature | Existing Telemetry | Remote Diagnostics |
| --------------- | ------------------------- | ------------------------------------- |
| Trigger | Passive (always sampling) | **Active** (engineer-initiated) |
| Log Level | Static config | **Dynamic** (debug/trace per session) |
| Network Tracing | None | **Full HTTP capture** |
| Breadcrumbs | Basic events | **Rich timeline** (user journey) |
| Console Logs | Error-only | **Full capture** (debug→fatal) |
| Screenshots | None | **Auto-capture on crash** |
| Session Replay | None | **Future: video-style replay** |
---
## Phase 1: Server Foundation (Week 1)
### 1.1 Data Model & Schemas
- [ ] **1.1.1** Create `modules/diagnostics/types.ts`
- [ ] `DebugSessionDoc` — session metadata (status, target, config)
- [ ] `DebugTraceDoc` — trace spans with timing
- [ ] `DebugLogEntryDoc` — structured log entries
- [ ] `DiagnosticsConfigDoc` — per-product collection policies
- [ ] Zod schemas for all inputs
- [ ] **1.1.2** Add Cosmos containers to `cosmos-init.ts`
- [ ] `debug_sessions` (pk: `/id`, TTL: 7 days)
- [ ] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days)
- [ ] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days)
- [ ] `debug_screenshots` metadata (pk: `/sessionId`) — actual images stored in Azure Blob
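
The composite partition key used by `debug_traces` and `debug_logs` could be built and parsed with a pair of small helpers. This is a minimal sketch: the function names and the rule that `productId` must not contain `:` are assumptions for illustration, since the roadmap only specifies the `${productId}:${sessionId}` format.

```typescript
// Hypothetical helpers; only the `${productId}:${sessionId}` format comes from the roadmap.
function buildPartitionKey(productId: string, sessionId: string): string {
  if (productId.length === 0 || sessionId.length === 0) {
    throw new Error("productId and sessionId are required");
  }
  // Assumption: productId never contains ':' so the key can be split unambiguously.
  if (productId.indexOf(":") !== -1) {
    throw new Error("productId must not contain ':'");
  }
  return `${productId}:${sessionId}`;
}

function parsePartitionKey(pk: string): { productId: string; sessionId: string } {
  const idx = pk.indexOf(":");
  if (idx <= 0 || idx === pk.length - 1) {
    throw new Error(`malformed partition key: ${pk}`);
  }
  // Everything after the first ':' is treated as the session ID.
  return { productId: pk.slice(0, idx), sessionId: pk.slice(idx + 1) };
}
```
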
### 1.2 Repository Layer
- [ ] **1.2.1** Create `modules/diagnostics/repository.ts`
- [ ] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`)
- [ ] `createSession()` — initiate debug session, emit `diagnostics.session.created` event
- [ ] `getSession()`: fetch session by ID (point read, since `/id` is the partition key)
- [ ] `getSessionForIngest()` — optimized lookup for client ingest (query by `sessionId` field)
- [ ] `updateSession()` — status changes, emit `diagnostics.session.updated` event
- [ ] `listSessions()` — query by `productId` field with pagination
- [ ] `deleteSession()`: manual cleanup via soft delete (mark cancelled), emit `diagnostics.session.cancelled` event
- [ ] `ingestTrace()` — batch upsert traces (use `upsert()` for idempotency)
- [ ] `ingestLogs()` — batch upsert logs with PII scan (reuse `telemetry` PII patterns)
- [ ] `getTraces()` — query by composite pk prefix `${productId}:${sessionId}`
- [ ] `getLogs()` — query by composite pk with level filters
- [ ] `updateSessionStats()` — denormalize logCount/traceCount atomically
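
The idempotency goal behind the batch upserts can be illustrated with an in-memory stand-in for the container. This is only a sketch under stated assumptions: `InMemoryTraceStore` and `TraceRecord` are hypothetical names, and the real repository would go through `@bytelyst/datastore`, whose API is not shown in this roadmap.

```typescript
// Hypothetical record shape: a small subset of the trace fields, for illustration only.
interface TraceRecord {
  id: string;
  pk: string; // `${productId}:${sessionId}`
  name: string;
  durationMs: number;
}

class InMemoryTraceStore {
  // Keyed by partition key + id, mirroring how a document is addressed.
  private items: Record<string, TraceRecord> = {};

  // Upsert semantics: replaying the same id overwrites instead of duplicating.
  upsertBatch(batch: TraceRecord[]): number {
    for (const item of batch) {
      this.items[`${item.pk}|${item.id}`] = item;
    }
    return Object.keys(this.items).length; // total stored documents
  }

  queryByPartition(pk: string): TraceRecord[] {
    const out: TraceRecord[] = [];
    for (const key of Object.keys(this.items)) {
      const item = this.items[key];
      if (item !== undefined && item.pk === pk) out.push(item);
    }
    return out;
  }
}
```

A client that retries a batch after a timeout would therefore leave the stored count unchanged, which is the property the "Duplicate trace idempotency" test is meant to cover.
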
### 1.3 REST API Routes
- [ ] **1.3.1** Create `modules/diagnostics/routes.ts`
- [ ] Apply `requireRole('admin')` for all session management routes
- [ ] Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
- [ ] `POST /diagnostics/sessions` — create session (admin only)
- [ ] Validate target user exists (if userId provided)
- [ ] Validate product exists and is active
- [ ] Emit `diagnostics.session.created` to event bus
- [ ] `GET /diagnostics/sessions` — list sessions (admin only)
- [ ] Query params: productId, status, userId, from, to, limit, offset
- [ ] Default sort: createdAt desc
- [ ] `GET /diagnostics/sessions/:id` — get session details (admin or session owner)
- [ ] `PATCH /diagnostics/sessions/:id` — update session (admin only)
- [ ] Validate state transitions (pending→active, active→paused, etc.)
- [ ] Emit `diagnostics.session.updated` event
- [ ] `DELETE /diagnostics/sessions/:id` — cancel session (admin only)
- [ ] Soft delete (mark cancelled, don't hard delete for audit trail)
- [ ] Emit `diagnostics.session.cancelled` event
- [ ] `GET /diagnostics/config` — client polling endpoint (any authenticated user)
- [ ] Return active session for this device/user if exists
- [ ] ETag support for 304 caching (reduce bandwidth)
- [ ] Rate limit: 1 request per 5 seconds per device
- [ ] `POST /diagnostics/ingest` — batch trace/log ingestion (any authenticated user)
- [ ] Validate session is active for this device
- [ ] PII scan all log messages (reuse telemetry PII patterns)
- [ ] Batch size limit: 50 items per request
- [ ] Async processing for large batches (return 202 Accepted)
- [ ] `POST /diagnostics/sessions/:id/screenshots`: register screenshot metadata and get an upload URL
- [ ] Generate SAS token via existing `blob` module for direct Azure upload
- [ ] Store metadata in `debug_screenshots` container
- [ ] Return 201 with blob URL for client upload
- [ ] `GET /diagnostics/sessions/:id/screenshots` — list screenshot metadata (admin)
- [ ] `GET /diagnostics/sessions/:id/traces` — get traces with pagination
- [ ] `GET /diagnostics/sessions/:id/logs` — get logs with level filter, search
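
The ETag behaviour planned for `GET /diagnostics/config` might look roughly like the following. It is a sketch, not the module's implementation: `PollConfig`, `respondToPoll`, and the small non-cryptographic string hash are assumptions chosen to keep the example dependency-free, and a real server may derive the tag from a version field or a cryptographic digest instead.

```typescript
// Hypothetical shape of the polled config; the roadmap does not define it.
interface PollConfig {
  sessionId: string | null;
  collectionLevel: "standard" | "debug" | "trace";
}

interface PollResponse {
  status: 200 | 304;
  etag: string;
  body?: PollConfig;
}

// Small FNV-1a style 32-bit hash, used only to derive a stable tag from the payload.
function hashString(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0;
  }
  return hash.toString(16);
}

function respondToPoll(config: PollConfig, ifNoneMatch?: string): PollResponse {
  const etag = `"${hashString(JSON.stringify(config))}"`;
  if (ifNoneMatch === etag) {
    return { status: 304, etag }; // client copy is current, so no body is sent
  }
  return { status: 200, etag, body: config };
}
```

A client would echo the last `etag` value in its `If-None-Match` header on each poll and only re-read the body when it receives a 200.
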
### 1.4 Testing
- [ ] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts`
- [ ] Session CRUD tests (10 tests)
- [ ] Create session with valid target user
- [ ] Create session fails for non-existent user
- [ ] Create session rate limiting (10/hour)
- [ ] Get session by ID
- [ ] List sessions with filters
- [ ] Update session status transitions
- [ ] Cancel session (soft delete)
- [ ] Session not found after TTL expires
- [ ] Unauthorized access blocked
- [ ] Event bus emissions verified
- [ ] Trace ingestion tests (8 tests)
- [ ] Batch trace ingest success
- [ ] Trace ingest with invalid session rejected
- [ ] Duplicate trace idempotency (upsert)
- [ ] Composite pk query by session
- [ ] Trace timing validation
- [ ] Parent-child span relationships
- [ ] Trace with error status
- [ ] Large batch rejected (>50 items)
- [ ] Log ingestion tests (8 tests)
- [ ] Batch log ingest success
- [ ] Log with PII redacted (email, SSN)
- [ ] Log level filtering
- [ ] Invalid session rejected
- [ ] Log search by message content
- [ ] Log context preservation
- [ ] Fatal log triggers alert
- [ ] Log TTL enforcement (3 days)
- [ ] Config polling tests (6 tests)
- [ ] Returns active session for device
- [ ] Returns empty when no active session
- [ ] ETag 304 caching works
- [ ] Rate limit enforced (5 sec)
- [ ] Wrong device cannot access other session
- [ ] Expired session not returned
- [ ] Screenshot tests (6 tests)
- [ ] SAS token generation via blob module
- [ ] Metadata stored in Cosmos
- [ ] Direct Azure Blob upload works
- [ ] Screenshot metadata retrieval
- [ ] Unauthorized access blocked
- [ ] Blob lifecycle tied to session TTL
- [ ] **Target:** 38+ Vitest tests (increased from 28)
### 1.5 Integration
- [ ] **1.5.1** Wire into `server.ts`
- [ ] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js`
- [ ] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })`
- [ ] Add after telemetry routes (logical grouping)
- [ ] **1.5.2** Event Bus Integration (`lib/event-bus.ts`)
- [ ] Subscribe to `diagnostics.session.created` → Send notification to target user (email/push)
- [ ] Subscribe to `diagnostics.session.cancelled` → Notify admin who started session
- [ ] Subscribe to `diagnostics.ingest.fatal` → Alert on-call engineer (PagerDuty/Slack)
- [ ] Subscribe to `diagnostics.session.completed` → Email summary to admin
- [ ] **1.5.3** Audit Logging (`modules/audit/`)
- [ ] Log all session lifecycle events (create, update, cancel)
- [ ] Include target user ID, admin ID, session config in audit trail
- [ ] Retention: 90 days via `audit_log` container TTL
- [ ] **1.5.4** Rate Limiting Registration
- [ ] Add `diagnostics:session:create` rate limit key (10/hour per admin)
- [ ] Add `diagnostics:config:poll` rate limit key (1/5sec per device)
- [ ] Add `diagnostics:ingest:submit` rate limit key (100/min per device)
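
The three rate limit keys above could be enforced with something as simple as a fixed-window counter. The sketch below is an assumption about one possible approach: `FixedWindowLimiter` and the `RULES` table are hypothetical, the platform's existing rate limiting mechanism is not described in this roadmap, and the caller passes the clock in so the behaviour is easy to test.

```typescript
interface RateLimitRule {
  limit: number;
  windowMs: number;
}

// Limits taken from the roadmap; the table itself is a hypothetical structure.
const RULES: Record<string, RateLimitRule> = {
  "diagnostics:session:create": { limit: 10, windowMs: 60 * 60 * 1000 }, // 10/hour per admin
  "diagnostics:config:poll": { limit: 1, windowMs: 5 * 1000 }, // 1 per 5 s per device
  "diagnostics:ingest:submit": { limit: 100, windowMs: 60 * 1000 }, // 100/min per device
};

class FixedWindowLimiter {
  private windows: Record<string, { start: number; count: number }> = {};

  // `subject` is the admin ID or device ID; `nowMs` is supplied by the caller.
  allow(ruleKey: string, subject: string, nowMs: number): boolean {
    const rule = RULES[ruleKey];
    if (!rule) throw new Error(`unknown rate limit key: ${ruleKey}`);
    const bucketKey = `${ruleKey}:${subject}`;
    const current = this.windows[bucketKey];
    if (!current || nowMs - current.start >= rule.windowMs) {
      this.windows[bucketKey] = { start: nowMs, count: 1 }; // start a new window
      return true;
    }
    if (current.count < rule.limit) {
      current.count += 1;
      return true;
    }
    return false; // over the limit; a route would typically answer 429
  }
}
```
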
**Phase 1 Exit Criteria:**
- [ ] All routes return the expected status codes (200, 201, 202, 304) with correct payloads
- [ ] 38+ tests passing (updated from 28)
- [ ] Event bus subscribers registered and tested
- [ ] Audit logs written for all session operations
- [ ] Rate limiting enforced
- [ ] PII redaction working in log ingest
- [ ] Admin can create session via API
---
## Phase 2: Client SDK Abstractions (Week 1-2)
### 2.1 TypeScript Client SDK
- [ ] **2.1.1** Create `@bytelyst/diagnostics` package
- [ ] `package.json` with ESM exports
- [ ] `tsconfig.json` extending base
- [ ] **2.1.2** Core types (`src/types.ts`)
- [ ] `DiagnosticsSession` interface
- [ ] `TraceSpan` interface
- [ ] `LogLevel` enum (debug, info, warn, error, fatal)
- [ ] `DiagnosticsConfig` from server
- [ ] **2.1.3** Main client (`src/client.ts`)
- [ ] `DiagnosticsClient` class (singleton)
- [ ] `start()` — begin polling for active sessions
- [ ] `stop()` — cease polling
- [ ] `isSessionActive()` — check current state
- [ ] `trace(name, fn)` — auto-instrumented span wrapper
- [ ] `log(level, message, context)` — structured logging
- [ ] `breadcrumb(message, category)` — add timeline marker
- [ ] **2.1.4** Network interceptor (`src/network.ts`)
- [ ] `NetworkInterceptor` class
- [ ] Wrap `fetch()` for capture
- [ ] Capture: URL, method, headers (sanitized), timing, status
- [ ] Configurable URL patterns (include/exclude)
- [ ] **2.1.5** Breadcrumb trail (`src/breadcrumbs.ts`)
- [ ] Ring buffer (max 100 entries)
- [ ] Auto-capture: navigation, clicks, errors
- [ ] Manual: `breadcrumb()` API
- [ ] **2.1.6** Device state (`src/device.ts`)
- [ ] Memory, battery, storage, thermal, network type
- [ ] **2.1.7** Screenshot capture (`src/screenshot.ts`)
- [ ] `html2canvas` integration (browser)
- [ ] Auto-capture on error (configurable)
- [ ] **2.1.8** Tests (`src/__tests__/diagnostics.test.ts`)
- [ ] Client lifecycle tests (4)
- [ ] Trace recording tests (4)
- [ ] Log buffering tests (4)
- [ ] Network interception tests (4)
- [ ] Breadcrumb tests (4)
- [ ] **Target:** 20+ Vitest tests
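
The breadcrumb ring buffer from 2.1.5 can be sketched in a few lines. The class and method names here are hypothetical; only the 100-entry cap and the `breadcrumb(message, category)` argument order come from the task list.

```typescript
interface Breadcrumb {
  message: string;
  category: string;
  timestamp: string; // ISO 8601
}

class BreadcrumbBuffer {
  private entries: Breadcrumb[] = [];
  private next = 0; // slot that will be overwritten once the buffer is full
  private capacity: number;

  constructor(capacity: number = 100) {
    if (capacity <= 0) throw new Error("capacity must be positive");
    this.capacity = capacity;
  }

  add(message: string, category: string, timestamp: string = new Date().toISOString()): void {
    const crumb: Breadcrumb = { message, category, timestamp };
    if (this.entries.length < this.capacity) {
      this.entries.push(crumb);
    } else {
      this.entries[this.next] = crumb; // overwrite the oldest entry
      this.next = (this.next + 1) % this.capacity;
    }
  }

  // Returns entries oldest-first without exposing the internal array.
  snapshot(): Breadcrumb[] {
    return this.entries.slice(this.next).concat(this.entries.slice(0, this.next));
  }
}
```

Overwriting in place keeps memory bounded and avoids shifting the array on every insert, which matters when breadcrumbs are captured automatically on clicks and navigation.
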
### 2.2 Swift Client SDK (iOS)
- [ ] **2.2.1** Create `ByteLystDiagnostics` Swift package
- [ ] `Package.swift` with iOS 15+ target
- [ ] Module structure: Core, Network, UI, Device
- [ ] **2.2.2** Core client (`Sources/Core/DiagnosticsClient.swift`)
- [ ] `DiagnosticsClient` actor (thread-safe)
- [ ] `start()` — polling with `Timer`
- [ ] `trace<T>(name, operation)` — async span wrapper
- [ ] `log(level, message, metadata)` — os_log integration
- [ ] `breadcrumb(category, message)` — timeline
- [ ] **2.2.3** Network interception (`Sources/Network/URLSessionInterceptor.swift`)
- [ ] `URLProtocol` subclass for automatic capture
- [ ] Swizzle `URLSession` or use protocol registration
- [ ] Capture: request/response, timing, bytes
- [ ] **2.2.4** Device state (`Sources/Device/DeviceState.swift`)
- [ ] `UIDevice` integration (battery, thermal)
- [ ] `ProcessInfo` (memory pressure)
- [ ] `NetworkMonitor` (path status)
- [ ] **2.2.5** Screenshot (`Sources/UI/ScreenshotCapture.swift`)
- [ ] `UIApplication` key window capture
- [ ] Privacy: blur sensitive views (configurable)
- [ ] **2.2.6** Tests (`Tests/DiagnosticsClientTests.swift`)
- [ ] Unit tests (12)
- [ ] Integration tests (8)
### 2.3 Kotlin Client SDK (Android)
- [ ] **2.3.1** Create `diagnostics` module in `kotlin-platform-sdk`
- [ ] `build.gradle.kts` dependencies
- [ ] Coroutines + OkHttp + WorkManager
- [ ] **2.3.2** Core client (`diagnostics/DiagnosticsClient.kt`)
- [ ] Singleton with `StateFlow<DiagnosticsState>`
- [ ] `start()` — foreground service for polling
- [ ] `trace()` — suspend function with span
- [ ] `log()` — Logcat + structured queue
- [ ] **2.3.3** Network interceptor (`diagnostics/OkHttpInterceptor.kt`)
- [ ] `Interceptor` implementation
- [ ] Capture: request/response chain
- [ ] **2.3.4** Device state (`diagnostics/DeviceStateCollector.kt`)
- [ ] `BatteryManager`, `ActivityManager`, `StorageStatsManager`
- [ ] **2.3.5** Screenshot (`diagnostics/ScreenshotCapture.kt`)
- [ ] `MediaProjection` API (with permission)
- [ ] `PixelCopy` for surface capture
- [ ] **2.3.6** Tests (`diagnostics/DiagnosticsClientTest.kt`)
- [ ] Unit tests (10)
- [ ] Integration tests (6)
**Phase 2 Exit Criteria:**
- [ ] TS SDK builds + 20 tests passing
- [ ] Swift SDK builds + 20 tests passing
- [ ] Kotlin SDK builds + 16 tests passing
- [ ] All SDKs can poll config endpoint
---
## Phase 3: Admin Dashboard UI (Week 2)
### 3.1 Debug Sessions Page
- [ ] **3.1.1** Create `/ops/debug-sessions/page.tsx`
- [ ] Session list table (columns: ID, user, device, status, started, duration)
- [ ] Filters: status, product, date range
- [ ] Pagination
- [ ] "New Session" button → modal
- [ ] **3.1.2** New Session Modal
- [ ] Target user (email/userId search)
- [ ] Target device (dropdown from sessions)
- [ ] Collection level (standard, debug, trace)
- [ ] Duration slider (5min → 24hr)
- [ ] Screenshot on error (toggle)
- [ ] "Start Session" → POST API
### 3.2 Session Detail View
- [ ] **3.2.1** Create `/ops/debug-sessions/[id]/page.tsx`
- [ ] Session header (status badge, user info, device info)
- [ ] Action buttons: Extend (+30min), Stop, Download
- [ ] Tabs: Timeline, Logs, Network, Traces, Screenshots
- [ ] **3.2.2** Timeline Tab
- [ ] Breadcrumb list (time, category, message)
- [ ] Visual timeline (horizontal bar)
- [ ] Click to jump to related trace/log
- [ ] **3.2.3** Logs Tab
- [ ] Log level filters (debug, info, warn, error, fatal)
- [ ] Search/filter
- [ ] Expandable rows with context
- [ ] Syntax highlighting
- [ ] **3.2.4** Network Tab
- [ ] Request list (time, method, URL, status, duration)
- [ ] Click to view: request/response headers, body
- [ ] Filter by status code, URL pattern
- [ ] **3.2.5** Traces Tab
- [ ] Trace tree view (spans, parent-child)
- [ ] Timing visualization (waterfall)
- [ ] Search by name
- [ ] **3.2.6** Screenshots Tab
- [ ] Grid of thumbnails
- [ ] Click to expand
- [ ] Download all as ZIP
### 3.3 Real-time Updates
- [ ] **3.3.1** Server-sent events or polling
- [ ] Auto-refresh session status every 5 seconds
- [ ] Toast notification on new logs/traces
### 3.4 Client Library
- [ ] **3.4.1** Create `lib/diagnostics-client.ts`
- [ ] `querySessions()`
- [ ] `createSession()`
- [ ] `getSession()`
- [ ] `updateSession()`
- [ ] `getTraces()`
- [ ] `getLogs()`
**Phase 3 Exit Criteria:**
- [ ] Admin can create session from UI
- [ ] Session detail shows live data
- [ ] All 5 tabs functional (Timeline, Logs, Network, Traces, Screenshots)
---
## Phase 4: Advanced Features (Week 3)
### 4.1 Automated Triggers
- [ ] **4.1.1** Error-threshold triggers
- [ ] Config: "Start debug session if error rate > X%"
- [ ] Background job checks every 5 minutes
- [ ] Auto-notify on Slack/Teams
- [ ] **4.1.2** Crash-triggered sessions
- [ ] Client sends crash → server auto-starts session
- [ ] Captures 60 seconds pre-crash context
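
The error-threshold check in 4.1.1 reduces to a comparison that the background job could run on each 5-minute window. This is a sketch with assumed names (`ErrorWindow`, `shouldAutoStartSession`); the `minEvents` guard is an addition for illustration and is not part of the roadmap.

```typescript
interface ErrorWindow {
  totalEvents: number;
  errorEvents: number;
}

// Returns true when the error rate in the window is strictly above the threshold.
// Assumption: `minEvents` suppresses triggers on very small samples.
function shouldAutoStartSession(
  window: ErrorWindow,
  thresholdPercent: number,
  minEvents: number = 50,
): boolean {
  if (window.totalEvents < minEvents) return false;
  // Cross-multiplied form of (errors / total) * 100 > threshold, avoiding division.
  return window.errorEvents * 100 > thresholdPercent * window.totalEvents;
}
```
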
### 4.2 Session Replay (Future)
- [ ] **4.2.1** DOM/View state capture
- [ ] Record user interactions (clicks, scrolls, inputs)
- [ ] Replay as video-like timeline
- [ ] Privacy: exclude password fields
### 4.3 Performance Profiling
- [ ] **4.3.1** CPU/Memory profiling
- [ ] iOS: `os_signpost` integration
- [ ] Android: `Debug.MemoryInfo`
- [ ] Web: `performance.now()` + memory API
### 4.4 Integration Tests
- [ ] **4.4.1** E2E test: Admin creates session → Client captures → Admin views
- [ ] Playwright test (web client)
- [ ] XCTest UI test (iOS)
- [ ] Espresso test (Android)
**Phase 4 Exit Criteria:**
- [ ] Auto-trigger tests passing
- [ ] E2E flow working end-to-end
---
## Appendix A: Data Models
### DebugSessionDoc
```typescript
interface DebugSessionDoc {
id: string; // ds_<uuid> — also the partition key (/id)
productId: string; // For filtering/querying (not pk to avoid hot partitions)
// Target (at least one required)
targetUserId?: string; // For authenticated users
targetAnonymousId?: string; // For anonymous users (installId)
targetDeviceId?: string; // Specific device fingerprint
targetSessionId?: string; // Specific app session to capture
// Status lifecycle: pending → active → paused → completed | cancelled
status: 'pending' | 'active' | 'paused' | 'completed' | 'cancelled';
// Collection configuration
collectionLevel: 'standard' | 'debug' | 'trace';
captureLogs: boolean;
captureNetwork: boolean;
captureScreenshots: boolean;
screenshotOnError: boolean;
maxDurationMinutes: number; // Default: 60, Max: 1440 (24h)
// Timestamps
createdAt: string; // ISO 8601
updatedAt: string; // Last status/config change
startedAt?: string; // When status became 'active'
endedAt?: string; // When status became 'completed'|'cancelled'
expiresAt: string; // Auto-expiry (createdAt + maxDurationMinutes)
// Stats (denormalized for fast reads, updated via ingest)
logCount: number;
traceCount: number;
screenshotCount: number;
// Audit
createdBy: string; // Admin userId who created session
updatedBy?: string; // Last admin to modify
// Consent tracking (privacy compliance)
userConsent?: {
consentedAt: string;
consentMethod: 'prompt' | 'pre_consent' | 'auto'; // How user agreed
};
}
```
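
The status lifecycle in the interface above implies a transition check for `PATCH /diagnostics/sessions/:id`. The table below is an assumption: the roadmap names only `pending→active` and `active→paused` explicitly, so the remaining edges (resuming from `paused`, cancelling from any non-terminal state) are a plausible reading rather than a confirmed design.

```typescript
type SessionStatus = "pending" | "active" | "paused" | "completed" | "cancelled";

// Assumed transition table; `completed` and `cancelled` are treated as terminal.
const ALLOWED_TRANSITIONS: Record<SessionStatus, SessionStatus[]> = {
  pending: ["active", "cancelled"],
  active: ["paused", "completed", "cancelled"],
  paused: ["active", "completed", "cancelled"],
  completed: [],
  cancelled: [],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

function assertTransition(from: SessionStatus, to: SessionStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`invalid session transition: ${from} -> ${to}`);
  }
}
```
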
### DebugTraceDoc (OpenTelemetry-compatible)
```typescript
interface DebugTraceDoc {
id: string; // tr_<uuid>
pk: string; // Composite: `${productId}:${sessionId}` — partition key
sessionId: string; // For queries (also part of pk)
productId: string; // For filtering
// OpenTelemetry trace context
traceId: string; // OTel trace ID (hex)
parentId?: string; // Parent span ID (omitted for root spans)
spanId: string; // This span's ID
name: string; // Operation name (e.g., "UserLogin", "API.fetchUser")
kind?: 'internal' | 'server' | 'client' | 'producer' | 'consumer';
// Timing (ISO 8601 strings with millisecond durations; OTel natively uses nanosecond timestamps)
startTime: string; // ISO 8601
endTime?: string;
durationMs?: number;
// Context and attributes
attributes: Record<string, unknown>; // Custom key-value pairs
status: 'ok' | 'error' | 'unset';
statusMessage?: string; // Error description if status=error
// Events (timestamped annotations within a span, e.g., "db.query", "cache.hit")
events?: Array<{
name: string;
timestamp: string;
attributes?: Record<string, unknown>;
}>;
// Links to other traces (for async operations)
links?: Array<{
traceId: string;
spanId: string;
attributes?: Record<string, unknown>;
}>;
}
```
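
A client-side span wrapper that fills in the timing and status fields of the trace model above might look like this. It is a synchronous sketch under stated assumptions: `traceSync`, `SpanRecord`, and the `sink` array are hypothetical, the SDK's planned `trace(name, fn)` would also need to handle async functions, and `Math.random` is used only to keep the example self-contained.

```typescript
// Subset of the trace fields, for illustration only.
interface SpanRecord {
  spanId: string;
  name: string;
  startTime: string; // ISO 8601
  endTime: string;
  durationMs: number;
  status: "ok" | "error";
  statusMessage?: string;
}

// 16 hex characters match the 8-byte span ID size used by OpenTelemetry.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Runs `fn`, records a span in `sink` whether it returns or throws, and rethrows errors.
function traceSync<T>(name: string, fn: () => T, sink: SpanRecord[]): T {
  const startMs = Date.now();
  let status: "ok" | "error" = "ok";
  let statusMessage: string | undefined;
  try {
    return fn();
  } catch (err) {
    status = "error";
    statusMessage = err instanceof Error ? err.message : String(err);
    throw err;
  } finally {
    const endMs = Date.now();
    const span: SpanRecord = {
      spanId: randomHex(16),
      name,
      startTime: new Date(startMs).toISOString(),
      endTime: new Date(endMs).toISOString(),
      durationMs: endMs - startMs,
      status,
    };
    if (statusMessage !== undefined) span.statusMessage = statusMessage;
    sink.push(span);
  }
}
```
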
### DebugLogEntryDoc
```typescript
interface DebugLogEntryDoc {
id: string; // log_<uuid>
pk: string; // Composite: `${productId}:${sessionId}` (partition key)
sessionId: string; // For queries (also part of pk)
productId: string; // For filtering
// Log level (matches syslog/OTel severity)
level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
message: string; // Original message (PII redacted server-side)
messageHash?: string; // SHA-256 for deduplication
// Timestamp (client clock, server enriches with receivedAt)
timestamp: string; // ISO 8601, when the log was generated
receivedAt?: string; // Server-side ingestion time
// Source context
module: string; // Component/module name (e.g., "AudioEngine", "SyncManager")
file?: string; // Source file path (sanitized)
line?: number; // Line number
function?: string; // Function/method name
// Thread/task context
threadId?: string; // For multi-threaded apps
correlationId?: string; // Links related operations
// Arbitrary context (PII scanned and redacted)
context: Record<string, unknown>;
// PII redaction metadata
redaction?: {
fieldsRedacted: string[]; // Which fields were scrubbed
patternsMatched: string[]; // Which PII patterns found (email, ssn, etc.)
};
}
```
### DebugScreenshotDoc (Metadata only — image in Blob)
```typescript
interface DebugScreenshotDoc {
id: string; // scr_<uuid>
sessionId: string; // Partition key for queries
productId: string;
// Storage reference (actual image in Azure Blob)
blobUrl: string; // SAS URL to blob (time-limited)
blobPath: string; // Path in container: `screenshots/${productId}/${sessionId}/${id}.png`
containerName: string; // Azure Blob container (e.g., "diagnostics-screenshots")
// Screenshot metadata
capturedAt: string; // When captured
trigger: 'manual' | 'error' | 'interval' | 'user_request'; // Why taken
// Dimensions & format
width: number;
height: number;
format: 'png' | 'jpeg' | 'webp';
sizeBytes: number;
// Privacy
sensitiveViewsBlurred: boolean; // Whether PII areas were blurred
blurRegions?: Array<{ x: number; y: number; w: number; h: number }>; // If partial blur
// Optional context
screenName?: string; // Current screen/view when captured
breadcrumbAtCapture?: string; // Last breadcrumb before screenshot
}
```
---
## Appendix B: API Reference
| Method | Endpoint | Auth | Rate Limit | Description |
| ------ | ------------------------------------------- | ----------- | -------------- | --------------------------------------- |
| POST | `/api/diagnostics/sessions` | Admin | 10/hour/admin | Create debug session |
| GET | `/api/diagnostics/sessions` | Admin | 100/min | List sessions (paginated) |
| GET | `/api/diagnostics/sessions/:id` | Admin/Owner | 100/min | Get session details |
| PATCH | `/api/diagnostics/sessions/:id` | Admin | 10/min | Update session status |
| DELETE | `/api/diagnostics/sessions/:id` | Admin | 10/min | Cancel session (soft delete) |
| GET | `/api/diagnostics/config` | Any auth | 1/5sec/device | Poll for active session (ETag cached) |
| POST | `/api/diagnostics/ingest` | Any auth | 100/min/device | Submit traces/logs batch (max 50 items) |
| POST | `/api/diagnostics/sessions/:id/traces` | Any auth | 100/min/device | Ingest trace spans |
| POST | `/api/diagnostics/sessions/:id/logs` | Any auth | 100/min/device | Ingest log entries |
| POST | `/api/diagnostics/sessions/:id/screenshots` | Any auth | 10/min/device | Get SAS URL for screenshot upload |
| GET | `/api/diagnostics/sessions/:id/traces` | Admin | 100/min | Query traces for session |
| GET | `/api/diagnostics/sessions/:id/logs` | Admin | 100/min | Query logs with filters |
| GET | `/api/diagnostics/sessions/:id/screenshots` | Admin | 100/min | List screenshot metadata |
---
## Appendix C: Industry Comparison
| Capability | Firebase Crashlytics | Sentry | Datadog RUM | Our Solution |
| --------------- | -------------------- | -------- | ----------- | ------------------ |
| Crash Reporting | ✅ | ✅ | ✅ | ✅ (via telemetry) |
| Error Tracking | ✅ | ✅ | ✅ | ✅ (via telemetry) |
| Breadcrumbs | ✅ | ✅ | ✅ | ✅ |
| Custom Traces | ⚠️ Limited | ✅ | ✅ | ✅ |
| Network Tracing | ❌ | ✅ | ✅ | ✅ |
| Console Logs | ⚠️ Error only | ✅ | ✅ | ✅ (all levels) |
| Session Replay | ❌ | ✅ | ✅ | 🟡 Future |
| Remote Trigger | ❌ | ✅ | ❌ | ✅ |
| On-Device Debug | ❌ | ❌ | ❌ | ✅ |
| Screenshots | ⚠️ Crash only | ✅ | ❌ | ✅ |
| Open Source | ❌ | ✅ (SDK) | ❌ | ✅ |
---
## Appendix D: Privacy & Security
### D.1 PII Redaction Patterns (server-side)
| Pattern        | Regex                                                  | Redaction Method            | Example                                          |
| -------------- | ------------------------------------------------------ | --------------------------- | ------------------------------------------------ |
| Email          | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`       | Replace with `[EMAIL]`      | `user@example.com` → `[EMAIL]`                   |
| SSN (US)       | `\b\d{3}-?\d{2}-?\d{4}\b`                              | Replace with `[SSN]`        | `123-45-6789` → `[SSN]`                          |
| Credit Card    | `\b(?:\d[ -]*?){13,16}\b`                              | Replace with `[CC]`         | `4111 1111 1111 1111` → `[CC]`                   |
| Phone          | `\b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b`      | Replace with `[PHONE]`      | `+1 (555) 123-4567` → `[PHONE]`                  |
| IP Address     | `\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`               | Replace with `[IP]`         | `192.168.1.1` → `[IP]`                           |
| Password/Token | `(?i)(password\|token\|secret\|key)\s*[:=]\s*\S+`      | Replace with `[CREDENTIAL]` | `password: secret123` → `password: [CREDENTIAL]` |
| JWT            | `eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*` | Replace with `[JWT]`        | Full JWT → `[JWT]`                               |
- [ ] **1. PII Redaction:** Implement all patterns above in `lib/pii-redaction.ts` (shared with telemetry)
- [ ] **2. Redaction Metadata:** Store scrubbed field names in `redaction.fieldsRedacted` and matched pattern names in `redaction.patternsMatched` for transparency
- [ ] **3. Consent Tracking:** `userConsent` field in session doc with `consentedAt` and `consentMethod`
- [ ] **4. Data Retention:** 7-day default TTL for sessions/traces, 3-day for logs (Cosmos TTL)
- [ ] **5. Access Control:** Admin-only session creation; users can only view their own sessions via `targetUserId` check
- [ ] **6. Encryption:** All data encrypted at rest (Cosmos), in transit (TLS 1.3)
- [ ] **7. Audit Trail:** All session operations logged to `audit_log` container (90-day retention)
- [ ] **8. User Notification:** Email/push notification when debug session started on their device
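
The redaction pass in D.1 could be applied as an ordered list of replace rules. The following is a sketch, not the contents of `lib/pii-redaction.ts`: the rule names, the `redact` function, and the rule order are assumptions. Two deliberate deviations from the table are worth noting: the inline `(?i)` flag is expressed with the JavaScript `i` flag, and the credential rule captures the separator in a second group so the key name and separator can be kept in the output.

```typescript
interface PiiRule {
  name: string;
  regex: RegExp;
  replacement: string;
}

// Assumed order: longer, more specific tokens first so later rules see already-redacted text.
const PII_RULES: PiiRule[] = [
  { name: "jwt", regex: /eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, replacement: "[JWT]" },
  { name: "credential", regex: /(password|token|secret|key)(\s*[:=]\s*)\S+/gi, replacement: "$1$2[CREDENTIAL]" },
  { name: "email", regex: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, replacement: "[EMAIL]" },
  { name: "credit_card", regex: /\b(?:\d[ -]*?){13,16}\b/g, replacement: "[CC]" },
  { name: "ssn", regex: /\b\d{3}-?\d{2}-?\d{4}\b/g, replacement: "[SSN]" },
  { name: "phone", regex: /\b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/g, replacement: "[PHONE]" },
  { name: "ip", regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, replacement: "[IP]" },
];

interface RedactionResult {
  text: string;
  patternsMatched: string[]; // feeds `redaction.patternsMatched` on the log entry
}

function redact(input: string): RedactionResult {
  let text = input;
  const patternsMatched: string[] = [];
  for (const rule of PII_RULES) {
    const next = text.replace(rule.regex, rule.replacement);
    if (next !== text) patternsMatched.push(rule.name);
    text = next;
  }
  return { text, patternsMatched };
}
```
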
---
## Appendix E: Event Bus Events
| Event Name | Payload | Publishers | Subscribers |
| --------------------------------- | --------------------------------------------------- | --------------------------------- | ----------------------------------- |
| `diagnostics.session.created` | `{ sessionId, productId, targetUserId, createdBy }` | diagnostics module | notifications → email/push user |
| `diagnostics.session.started` | `{ sessionId, productId, startedAt }` | diagnostics module | — |
| `diagnostics.session.updated` | `{ sessionId, productId, changes, updatedBy }` | diagnostics module | audit log |
| `diagnostics.session.cancelled` | `{ sessionId, productId, reason, cancelledBy }` | diagnostics module | notifications → admin |
| `diagnostics.session.completed` | `{ sessionId, productId, stats, endedAt }` | diagnostics module | notifications → admin summary email |
| `diagnostics.session.expired` | `{ sessionId, productId, expiredAt }` | diagnostics module (TTL job) | — |
| `diagnostics.ingest.fatal` | `{ sessionId, productId, logEntry, timestamp }` | diagnostics module (on fatal log) | PagerDuty/Slack alert |
| `diagnostics.screenshot.captured` | `{ sessionId, productId, screenshotId, trigger }` | diagnostics module | — |
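
The payload column above lends itself to a typed event map, so that publishers and subscribers agree on shapes at compile time. This sketch covers two of the eight events and uses a hypothetical `InMemoryEventBus`; the actual `lib/event-bus.ts` API is not shown in this roadmap, and the field types (including the optional `targetUserId`) are assumptions inferred from the data models.

```typescript
// Assumed payload types for two of the events in the table.
interface DiagnosticsEventMap {
  "diagnostics.session.created": {
    sessionId: string;
    productId: string;
    targetUserId?: string;
    createdBy: string;
  };
  "diagnostics.session.cancelled": {
    sessionId: string;
    productId: string;
    reason: string;
    cancelledBy: string;
  };
}

type EventName = keyof DiagnosticsEventMap;
type LooseHandler = (payload: unknown) => void;

class InMemoryEventBus {
  private handlers: Record<string, LooseHandler[]> = {};

  // The generic parameter ties each event name to its payload type.
  subscribe<K extends EventName>(
    event: K,
    handler: (payload: DiagnosticsEventMap[K]) => void,
  ): void {
    const list = this.handlers[event] || [];
    list.push(handler as LooseHandler);
    this.handlers[event] = list;
  }

  // Returns the number of handlers invoked.
  emit<K extends EventName>(event: K, payload: DiagnosticsEventMap[K]): number {
    const list = this.handlers[event] || [];
    list.forEach((handler) => handler(payload));
    return list.length;
  }
}
```
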
---
## Current Status
- [x] **Design complete** — 2026-03-02
- [x] **Review complete** — 10 bugs/gaps identified and fixed:
1. Fixed partition keys to avoid hot partitions (composite pk for traces/logs)
2. Added `pk` field to the trace and log data models, matching the existing telemetry pattern
3. Added `updatedAt`/`updatedBy` for audit trail completeness
4. Added `userConsent` field for GDPR/privacy compliance
5. Fixed screenshot storage to use Azure Blob (not Cosmos)
6. Added PII redaction patterns and metadata tracking
7. Added event bus integration with 8 specific events
8. Added rate limiting specs for all endpoints
9. Added ETag caching for config polling
10. Added `targetSessionId` for capturing specific app sessions
- [ ] Phase 1: Server Foundation (38 tests target)
- [ ] Phase 2: Client SDKs (TS/Swift/Kotlin)
- [ ] Phase 3: Admin UI
- [ ] Phase 4: Advanced Features
**Total Tasks:** 140+ checkboxes across 4 phases
**Last Updated:** 2026-03-02